Use FIPS-safe hashes for program cache keys #2087

Open
aryanputta wants to merge 5 commits into NVIDIA:main from aryanputta:fix-fips-program-cache-hash

Conversation

@aryanputta aryanputta commented May 14, 2026

Closes #2043.

Summary

  • replace the internal program-cache blake2b digests with a FIPS-safe SHA-2 choice in both cache-key derivation and file-path hashing
  • keep _KEY_SCHEMA_VERSION = 2 so pre-FIPS cache entries remain unreachable after the hash-family change
  • use usedforsecurity=False at both hashlib call sites so the cache remains usable on FIPS-enforcing systems
  • keep scripts/bench_program_cache_hashes.py in-tree as a stdlib-only review/support benchmark for reproducing hash comparisons
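A minimal sketch of the pattern the bullets above describe (function and variable names here are illustrative, not the actual cuda.core internals):

```python
import hashlib

_KEY_SCHEMA_VERSION = 2  # pre-FIPS (blake2b) cache entries stay unreachable

def make_cache_key(chunks):
    # usedforsecurity=False: we only need collision resistance and speed,
    # and the flag keeps the call usable on FIPS-enforcing OpenSSL builds.
    hasher = hashlib.sha256(usedforsecurity=False)
    for label, payload in chunks:
        # Length-prefix each labeled chunk so concatenations cannot collide.
        hasher.update(label.encode("ascii"))
        hasher.update(len(payload).to_bytes(8, "big"))
        hasher.update(payload)
    return hasher.digest()

def path_component_for_key(key: bytes) -> str:
    # FIPS note: sha256 is FIPS 180-4 approved; blake2b is not.
    return hashlib.sha256(key, usedforsecurity=False).hexdigest()
```

The `usedforsecurity` keyword requires Python 3.9+; the real call sites live in `_keys.py` and `_file_stream.py`.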

Testing

  • python3 scripts/bench_program_cache_hashes.py
  • targeted cuda_core program-cache tests in a repo-managed environment

Signed-off-by: Aryan <aryansputta@gmail.com>
Contributor

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core label (Everything related to the cuda.core module) May 14, 2026
Signed-off-by: Aryan <aryansputta@gmail.com>
@leofang leofang self-requested a review May 14, 2026 15:12
Member

@leofang leofang left a comment

Thanks for tackling the FIPS issue — the core fix (blake2b → sha256, bump _KEY_SCHEMA_VERSION) is the right direction. A few changes requested:

  1. Remove test_program_cache_fips.py — the elaborate module-stubbing doesn't test real behavior (see file-level comment for details).
  2. Add usedforsecurity=False to both hashlib.sha256() call sites — documents intent and can unlock faster code paths.
  3. Add inline FIPS comments at both hash usage sites so future maintainers understand the constraint.
  4. Benchmark FIPS-compliant alternatives before settling on SHA-256 — on 64-bit systems sha512 is often faster, and security doesn't matter here, only collision resistance and speed.

Comment thread cuda_core/tests/test_program_cache_fips.py Outdated
Comment thread cuda_core/cuda/core/utils/_program_cache/_keys.py Outdated
Comment thread cuda_core/cuda/core/utils/_program_cache/_file_stream.py Outdated
Member

leofang commented May 14, 2026

4. Benchmark FIPS-compliant alternatives before settling on SHA-256 — on 64-bit systems sha512 is often faster, and security doesn't matter here, only collision resistance and speed.

To clarify: All FIPS-compliant algorithms should be considered and benchmarked. We are not bound to the 256-bit key size. Either shorter or longer is OK, as long as it is fast and has a low chance to collide.
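As a rough way to enumerate the stdlib candidates, one can filter hashlib's guaranteed constructor set. The prefix filter below is an approximation: it keeps sha1, which FIPS 180-4 still lists, and excludes the non-FIPS blake2/md5 families; the OpenSSL-only variants (sha512_224, sha512_256) appear in `hashlib.algorithms_available` instead.

```python
import hashlib

# Approximate FIPS candidate set: SHA-1/SHA-2 (FIPS 180-4) and SHA-3/SHAKE
# (FIPS 202) constructors guaranteed to exist in every CPython build.
fips_candidates = sorted(
    name
    for name in hashlib.algorithms_guaranteed
    if name.startswith("sha")  # also matches "shake_*"
)
print(fips_candidates)
```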

Signed-off-by: Aryan <aryansputta@gmail.com>
Signed-off-by: Aryan <aryansputta@gmail.com>
@aryanputta aryanputta force-pushed the fix-fips-program-cache-hash branch from d737296 to c08fc21 Compare May 14, 2026 16:56
@aryanputta
Author

@leofang

Addressed your requested changes.
Changes made:

  • removed cuda_core/tests/test_program_cache_fips.py
  • added usedforsecurity=False at both hash call sites
  • added inline FIPS comments at both hash call sites
  • benchmarked FIPS-approved alternatives locally and switched both sites consistently to sha384
  • cleaned up the remaining stale test wording so it now says 384-bit digest

Member

leofang commented May 14, 2026

  • benchmarked FIPS-approved alternatives locally and switched both sites consistently to sha384

Could you share your benchmark script?

@leofang leofang added the enhancement label (Any code-related improvements) May 14, 2026
@leofang leofang added this to the cuda.core next milestone May 14, 2026
@aryanputta
Author

Sure. I turned the local benchmark into a standalone stdlib-only script that mirrors both hash sites plus the coupled end-to-end path where make_program_cache_key() feeds FileStreamProgramCache._path_for_key().

On this x86_64 host, the SHA-2 family clearly beat the SHA-3 family for these workloads. For the coupled end-to-end cases, sha384 and sha512 were effectively the two leaders, and sha384 edged out or matched the others closely enough that I kept it consistently at both sites while also keeping the key shorter than sha512.

A representative run from python3 scripts/bench_program_cache_hashes.py --loops 5000 --repeat 2:

  • end_to_end_cpp: sha384 10,657.7 ns/op, sha256 11,034.8 ns/op, sha512 11,043.4 ns/op
  • end_to_end_ptx: sha384 9,526.4 ns/op, sha512 9,642.1 ns/op, sha256 9,714.3 ns/op
Benchmark script
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

"""Benchmark FIPS-approved hashlib candidates for cuda.core program-cache use.

This mirrors the two relevant call sites:

* ``FileStreamProgramCache._path_for_key()``: hash a cache key to a stable
  filename component via ``hexdigest()``.
* ``make_program_cache_key()``: incrementally build the digest from labeled
  payload chunks and return ``digest()``.

The benchmark is intentionally stdlib-only so reviewers can run it directly.
"""

from __future__ import annotations

import argparse
import hashlib
import inspect
import statistics
import sys
import time
from dataclasses import dataclass
from typing import Callable

_DEFAULT_ALGORITHMS = (
    "sha224",
    "sha256",
    "sha384",
    "sha512",
    "sha3_224",
    "sha3_256",
    "sha3_384",
    "sha3_512",
)


@dataclass(frozen=True)
class HashCase:
    name: str
    runner: Callable[[Callable[..., object]], None]


def _supports_usedforsecurity(constructor: Callable[..., object]) -> bool:
    try:
        signature = inspect.signature(constructor)
    except (TypeError, ValueError):
        return False
    return "usedforsecurity" in signature.parameters


def _make_constructor(name: str) -> Callable[..., object]:
    constructor = getattr(hashlib, name)
    if _supports_usedforsecurity(constructor):
        return lambda data=b"": constructor(data, usedforsecurity=False)
    return constructor


def _file_stream_case(name: str, key: bytes) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        constructor(key).hexdigest()

    return HashCase(name, _runner)


def _program_cache_case(name: str, payloads: tuple[tuple[str, bytes], ...]) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        hasher = constructor()
        for label, payload in payloads:
            hasher.update(label.encode("ascii"))
            hasher.update(len(payload).to_bytes(8, "big"))
            hasher.update(payload)
        hasher.digest()

    return HashCase(name, _runner)


def _end_to_end_case(name: str, payloads: tuple[tuple[str, bytes], ...]) -> HashCase:
    def _runner(constructor: Callable[..., object]) -> None:
        hasher = constructor()
        for label, payload in payloads:
            hasher.update(label.encode("ascii"))
            hasher.update(len(payload).to_bytes(8, "big"))
            hasher.update(payload)
        key = hasher.digest()
        constructor(key).hexdigest()

    return HashCase(name, _runner)


def _sample_cases() -> tuple[HashCase, ...]:
    file_stream_key = bytes.fromhex("ab" * 48)
    long_file_stream_key = (b"cuda-core-cache-key-" * 128)[:4096]

    source = b"""
extern "C" __global__ void saxpy(float a, const float* x, float* y) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}
""".strip()
    ptx = b"""
.version 8.0
.target sm_90
.address_size 64
.visible .entry saxpy() { ret; }
""".strip()
    option_bytes = (
        b"name='saxpy'",
        b"arch='sm_90'",
        b"max_register_count=None",
        b"time=False",
        b"link_time_optimization=False",
        b"debug=False",
        b"lineinfo=False",
        b"ftz=None",
        b"prec_div=None",
        b"prec_sqrt=None",
        b"fma=None",
        b"split_compile=None",
        b"ptxas_options=None",
        b"no_cache=False",
    )
    names = (b"saxpy", b"_Z5saxpyv")
    extra_digest = bytes.fromhex("cd" * 32)

    cpp_payloads = (
        ("schema", b"2"),
        ("nvrtc", b"13.2"),
        ("code_type", b"c++"),
        ("target_type", b"cubin"),
        ("code", source),
        ("option_count", str(len(option_bytes)).encode("ascii")),
        *tuple(("option", item) for item in option_bytes),
        ("names_count", str(len(names)).encode("ascii")),
        *tuple(("name", item) for item in names),
        ("options_name", b"saxpy"),
        ("extra_digest", extra_digest),
    )
    ptx_payloads = (
        ("schema", b"2"),
        ("linker", b"nvJitLink-13.2"),
        ("code_type", b"ptx"),
        ("target_type", b"cubin"),
        ("code", ptx),
        ("option_count", str(len(option_bytes)).encode("ascii")),
        *tuple(("option", item) for item in option_bytes),
        ("names_count", b"0"),
        ("extra_digest", extra_digest),
    )

    return (
        _file_stream_case("file_stream_key_48b", file_stream_key),
        _file_stream_case("file_stream_key_4k", long_file_stream_key),
        _program_cache_case("program_cache_cpp", cpp_payloads),
        _program_cache_case("program_cache_ptx", ptx_payloads),
        _end_to_end_case("end_to_end_cpp", cpp_payloads),
        _end_to_end_case("end_to_end_ptx", ptx_payloads),
    )


def _benchmark_case(
    case: HashCase,
    constructor: Callable[..., object],
    *,
    loops: int,
    repeat: int,
) -> tuple[float, float]:
    samples_ns: list[float] = []
    for _ in range(repeat):
        start = time.perf_counter_ns()
        for _ in range(loops):
            case.runner(constructor)
        elapsed = time.perf_counter_ns() - start
        samples_ns.append(elapsed / loops)
    return statistics.mean(samples_ns), min(samples_ns)


def _format_ns(value: float) -> str:
    return f"{value:,.1f}"


def _write_line(text: str = "") -> None:
    sys.stdout.write(text + "\n")


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--loops",
        type=int,
        default=200_000,
        help="Iterations per repeat for each algorithm/case pair.",
    )
    parser.add_argument(
        "--repeat",
        type=int,
        default=7,
        help="Independent timing repeats for each algorithm/case pair.",
    )
    parser.add_argument(
        "--algorithms",
        nargs="+",
        default=list(_DEFAULT_ALGORITHMS),
        help="hashlib algorithm names to benchmark.",
    )
    args = parser.parse_args()

    cases = _sample_cases()
    widths = {
        "algorithm": max(len("Algorithm"), max(len(name) for name in args.algorithms)),
        "case": max(len(case.name) for case in cases),
    }

    _write_line(
        f"{'Algorithm':<{widths['algorithm']}}  "
        f"{'Case':<{widths['case']}}  {'mean ns/op':>12}  {'best ns/op':>12}"
    )
    _write_line("-" * (widths["algorithm"] + widths["case"] + 28))

    for algorithm in args.algorithms:
        constructor = _make_constructor(algorithm)
        for case in cases:
            mean_ns, best_ns = _benchmark_case(case, constructor, loops=args.loops, repeat=args.repeat)
            _write_line(
                f"{algorithm:<{widths['algorithm']}}  "
                f"{case.name:<{widths['case']}}  "
                f"{_format_ns(mean_ns):>12}  {_format_ns(best_ns):>12}"
            )


if __name__ == "__main__":
    main()

Member

leofang commented May 15, 2026

Benchmark: FIPS-approved hash algorithms for program cache

I re-ran the benchmark with proper parameters (--loops 200000 --repeat 7) on an AMD Threadripper PRO 3975WX (x86_64 with SHA-NI). The original run used --loops 5000 --repeat 2, which is far too few samples to draw conclusions — the sha384 "win" in the original results was noise.

Results (end-to-end case, best ns/op)

All FIPS-approved algorithms available in hashlib were tested:

| Algorithm  | best ns/op | digest bits | collision resistance |
|------------|-----------:|------------:|----------------------|
| sha256     |      6,443 |         256 | 2^128                |
| sha224     |      6,592 |         224 | 2^112                |
| sha1       |      6,472 |         160 | 2^80                 |
| sha384     |      7,613 |         384 | 2^192                |
| sha512     |      7,629 |         512 | 2^256                |
| sha512_224 |      8,264 |         224 | 2^112                |
| sha512_256 |      8,231 |         256 | 2^128                |
| sha3_256   |      8,198 |         256 | 2^128                |
| sha3_224   |      8,231 |         224 | 2^112                |
| sha3_384   |      8,763 |         384 | 2^192                |
| shake_128  |      9,330 |    variable | variable             |
| shake_256  |      9,590 |    variable | variable             |
| sha3_512   |      9,887 |         512 | 2^256                |

Analysis

  • SHA-NI hardware acceleration makes sha256/sha224/sha1 the fastest group at ~6,500 ns. SHA-NI is present on most modern AMD (Zen+) and Intel (Ice Lake+) CPUs. It only accelerates SHA-256 and SHA-1 — not SHA-512 or SHA-3.
  • sha384 and sha512 are ~15% slower because they use 64-bit software operations with no hardware acceleration on x86_64.
  • SHA-3 family is 25-50% slower — expected, as SHA-3 (Keccak) is designed for hardware, not software on x86_64.
  • sha1 is equally fast but only provides 160-bit / 2^80 collision resistance. No reason to pick it when sha256 is the same speed with 2^128.
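The relative ordering is easy to spot-check without the full script. A minimal timeit sketch (the payload size is illustrative) should reproduce the ranking on SHA-NI hardware, and may invert sha256/sha512 on older x86_64 CPUs without it:

```python
import hashlib
import timeit

payload = b"x" * 512  # illustrative size; real cache-key payloads vary

def ns_per_op(name: str, number: int = 50_000) -> float:
    # One construct-and-hexdigest per iteration, mirroring the
    # file-path hashing call site.
    ctor = getattr(hashlib, name)
    total = timeit.timeit(lambda: ctor(payload).hexdigest(), number=number)
    return total / number * 1e9

for name in ("sha1", "sha256", "sha384", "sha512", "sha3_256"):
    print(f"{name:10s} {ns_per_op(name):8.1f} ns/op")
```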

Recommendation

Please switch back from sha384 to sha256. It is:

  • Fastest (tied with sha224/sha1, thanks to SHA-NI)
  • 256-bit collision resistance (2^128) — more than sufficient
  • Universally available in every Python build
  • FIPS-approved

When running the benchmark, please use the default parameters (--loops 200000 --repeat 7) — --loops 5000 --repeat 2 does not produce statistically meaningful results.

@leofang leofang added the P0 label (High priority - Must do!) May 15, 2026
@aryanputta
Author

Follow-up on the hash choice: the earlier sha384 switch came from an undersampled local run. After the maintainer reran the benchmark with --loops 200000 --repeat 7, the result on SHA-NI x86_64 was that sha256 is the fastest FIPS-approved practical choice for this program-cache workload.

I have switched both runtime call sites back to sha256(..., usedforsecurity=False) consistently, kept _KEY_SCHEMA_VERSION = 2, and updated the cache-key/file-stream wording and digest-width tests back to 32 bytes.

For reproducibility, the PR now carries scripts/bench_program_cache_hashes.py as a stdlib-only review/support artifact with the broader algorithm set and the maintainer-style default benchmark parameters.

@leofang
Member

leofang commented May 15, 2026

Please kindly push your local changes, thanks!



Development

Successfully merging this pull request may close these issues.

make_program_cache_key: blake2b is not a FIPS-compliant hash