Use FIPS-safe hashes for program cache keys #2087
Conversation
Signed-off-by: Aryan <aryansputta@gmail.com>
leofang
left a comment
Thanks for tackling the FIPS issue — the core fix (blake2b → sha256, bump _KEY_SCHEMA_VERSION) is the right direction. A few changes requested:
- Remove `test_program_cache_fips.py` — the elaborate module-stubbing doesn't test real behavior (see file-level comment for details).
- Add `usedforsecurity=False` to both `hashlib.sha256()` call sites — documents intent and can unlock faster code paths.
- Add inline FIPS comments at both hash usage sites so future maintainers understand the constraint.
- Benchmark FIPS-compliant alternatives before settling on SHA-256 — on 64-bit systems `sha512` is often faster, and security doesn't matter here, only collision resistance and speed.

To clarify: all FIPS-compliant algorithms should be considered and benchmarked. We are not bound to the 256-bit key size. Either shorter or longer is OK, as long as it is fast and has a low chance to collide.
Force-pushed from d737296 to c08fc21.
Addressed your requested changes.
Could you share your benchmark script?
Sure. I turned the local benchmark into a standalone stdlib-only script that mirrors both hash sites, plus the coupled end-to-end path where the `make_program_cache_key()` digest feeds `_path_for_key()`. On this x86_64 host, the SHA-2 family clearly beat the SHA-3 family for these workloads, including the coupled end-to-end cases. A representative run from my machine is below.
Benchmark script:

```python
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
"""Benchmark FIPS-approved hashlib candidates for cuda.core program-cache use.

This mirrors the two relevant call sites:

* ``FileStreamProgramCache._path_for_key()``: hash a cache key to a stable
  filename component via ``hexdigest()``.
* ``make_program_cache_key()``: incrementally build the digest from labeled
  payload chunks and return ``digest()``.

The benchmark is intentionally stdlib-only so reviewers can run it directly.
"""

from __future__ import annotations

import argparse
import hashlib
import inspect
import statistics
import sys
import time
from dataclasses import dataclass
from typing import Callable

_DEFAULT_ALGORITHMS = (
    "sha224",
    "sha256",
    "sha384",
    "sha512",
    "sha3_224",
    "sha3_256",
    "sha3_384",
    "sha3_512",
)


@dataclass(frozen=True)
class HashCase:
    name: str
    runner: Callable[[Callable[..., object]], None]


def _supports_usedforsecurity(constructor: Callable[..., object]) -> bool:
    try:
        signature = inspect.signature(constructor)
    except (TypeError, ValueError):
        return False
    return "usedforsecurity" in signature.parameters


def _make_constructor(name: str) -> Callable[..., object]:
    constructor = getattr(hashlib, name)
    if _supports_usedforsecurity(constructor):
        return lambda data=b"": constructor(data, usedforsecurity=False)
    return constructor


def _file_stream_case(name: str, key: bytes) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        constructor(key).hexdigest()

    return HashCase(name, _runner)


def _program_cache_case(name: str, payloads: tuple[tuple[str, bytes], ...]) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        hasher = constructor()
        for label, payload in payloads:
            hasher.update(label.encode("ascii"))
            hasher.update(len(payload).to_bytes(8, "big"))
            hasher.update(payload)
        hasher.digest()

    return HashCase(name, _runner)


def _end_to_end_case(name: str, payloads: tuple[tuple[str, bytes], ...]) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        hasher = constructor()
        for label, payload in payloads:
            hasher.update(label.encode("ascii"))
            hasher.update(len(payload).to_bytes(8, "big"))
            hasher.update(payload)
        key = hasher.digest()
        constructor(key).hexdigest()

    return HashCase(name, _runner)


def _sample_cases() -> tuple[HashCase, ...]:
    file_stream_key = bytes.fromhex("ab" * 48)
    long_file_stream_key = (b"cuda-core-cache-key-" * 128)[:4096]
    source = b"""
extern "C" __global__ void saxpy(float a, const float* x, float* y) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}
""".strip()
    ptx = b"""
.version 8.0
.target sm_90
.address_size 64
.visible .entry saxpy() { ret; }
""".strip()
    option_bytes = (
        b"name='saxpy'",
        b"arch='sm_90'",
        b"max_register_count=None",
        b"time=False",
        b"link_time_optimization=False",
        b"debug=False",
        b"lineinfo=False",
        b"ftz=None",
        b"prec_div=None",
        b"prec_sqrt=None",
        b"fma=None",
        b"split_compile=None",
        b"ptxas_options=None",
        b"no_cache=False",
    )
    names = (b"saxpy", b"_Z5saxpyv")
    extra_digest = bytes.fromhex("cd" * 32)
    cpp_payloads = (
        ("schema", b"2"),
        ("nvrtc", b"13.2"),
        ("code_type", b"c++"),
        ("target_type", b"cubin"),
        ("code", source),
        ("option_count", str(len(option_bytes)).encode("ascii")),
        *tuple(("option", item) for item in option_bytes),
        ("names_count", str(len(names)).encode("ascii")),
        *tuple(("name", item) for item in names),
        ("options_name", b"saxpy"),
        ("extra_digest", extra_digest),
    )
    ptx_payloads = (
        ("schema", b"2"),
        ("linker", b"nvJitLink-13.2"),
        ("code_type", b"ptx"),
        ("target_type", b"cubin"),
        ("code", ptx),
        ("option_count", str(len(option_bytes)).encode("ascii")),
        *tuple(("option", item) for item in option_bytes),
        ("names_count", b"0"),
        ("extra_digest", extra_digest),
    )
    return (
        _file_stream_case("file_stream_key_48b", file_stream_key),
        _file_stream_case("file_stream_key_4k", long_file_stream_key),
        _program_cache_case("program_cache_cpp", cpp_payloads),
        _program_cache_case("program_cache_ptx", ptx_payloads),
        _end_to_end_case("end_to_end_cpp", cpp_payloads),
        _end_to_end_case("end_to_end_ptx", ptx_payloads),
    )


def _benchmark_case(
    case: HashCase,
    constructor: Callable[..., object],
    *,
    loops: int,
    repeat: int,
) -> tuple[float, float]:
    samples_ns: list[float] = []
    for _ in range(repeat):
        start = time.perf_counter_ns()
        for _ in range(loops):
            case.runner(constructor)
        elapsed = time.perf_counter_ns() - start
        samples_ns.append(elapsed / loops)
    return statistics.mean(samples_ns), min(samples_ns)


def _format_ns(value: float) -> str:
    return f"{value:,.1f}"


def _write_line(text: str = "") -> None:
    sys.stdout.write(text + "\n")


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--loops",
        type=int,
        default=200_000,
        help="Iterations per repeat for each algorithm/case pair.",
    )
    parser.add_argument(
        "--repeat",
        type=int,
        default=7,
        help="Independent timing repeats for each algorithm/case pair.",
    )
    parser.add_argument(
        "--algorithms",
        nargs="+",
        default=list(_DEFAULT_ALGORITHMS),
        help="hashlib algorithm names to benchmark.",
    )
    args = parser.parse_args()
    cases = _sample_cases()
    widths = {
        "algorithm": max(len("Algorithm"), max(len(name) for name in args.algorithms)),
        "case": max(len(case.name) for case in cases),
    }
    _write_line(
        f"{'Algorithm':<{widths['algorithm']}} "
        f"{'Case':<{widths['case']}} {'mean ns/op':>12} {'best ns/op':>12}"
    )
    _write_line("-" * (widths["algorithm"] + widths["case"] + 28))
    for algorithm in args.algorithms:
        constructor = _make_constructor(algorithm)
        for case in cases:
            mean_ns, best_ns = _benchmark_case(case, constructor, loops=args.loops, repeat=args.repeat)
            _write_line(
                f"{algorithm:<{widths['algorithm']}} "
                f"{case.name:<{widths['case']}} "
                f"{_format_ns(mean_ns):>12} {_format_ns(best_ns):>12}"
            )


if __name__ == "__main__":
    main()
```
Benchmark: FIPS-approved hash algorithms for program cache

I re-ran the benchmark with proper parameters (the script defaults, `--loops 200_000 --repeat 7`).

Results (end-to-end case, best ns/op)

All FIPS-approved algorithms available in `hashlib` were compared.
Analysis
Recommendation

Please switch back from ….
When running the benchmark, please use the default parameters (`--loops 200_000 --repeat 7`).
Follow-up on the hash choice: per the earlier benchmark discussion, I have switched both runtime call sites back to …. For reproducibility, the PR now carries the benchmark script (`scripts/bench_program_cache_hashes.py`) in-tree.
Please kindly push your local changes, thanks!
Closes #2043.
Summary
- Bump `_KEY_SCHEMA_VERSION = 2` so pre-FIPS cache entries remain unreachable after the hash-family change
- Pass `usedforsecurity=False` at both hashlib call sites so the cache remains usable on FIPS-enforcing systems
- Keep `scripts/bench_program_cache_hashes.py` in-tree as a stdlib-only review/support benchmark for reproducing hash comparisons

Testing
- Ran `python3 scripts/bench_program_cache_hashes.py`
- Ran the `cuda_core` program-cache tests in a repo-managed environment
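To illustrate why the schema bump plus the hash-family change make pre-FIPS entries unreachable, here is a minimal sketch. The `cache_key` helper is hypothetical, not the actual cuda.core implementation; it only shows that mixing the schema version into the digest and changing the hash family both shift every key:

```python
import hashlib

# Hypothetical helper: mix a schema version tag into the digest, the way
# the real cache key construction labels its payload chunks.
def cache_key(hasher, schema: bytes, payload: bytes) -> bytes:
    hasher.update(b"schema")
    hasher.update(schema)
    hasher.update(payload)
    return hasher.digest()

old_key = cache_key(hashlib.blake2b(), b"1", b"kernel source")  # pre-FIPS scheme
new_key = cache_key(hashlib.sha256(), b"2", b"kernel source")   # FIPS-safe scheme

# Different digest length and value: entries written under the old scheme
# can never be looked up under the new one.
assert len(old_key) != len(new_key)
assert old_key != new_key
```

Even with the same hash family, bumping the schema byte alone already changes every digest, so the version bump is a belt-and-braces guarantee on top of the hash-family change.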