fix(upgrade): re-verify ML-DSA signature on every cache hit#100
fix(upgrade): re-verify ML-DSA signature on every cache hit#100grumbach wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the auto-upgrade disk cache against A1 (RCE-class) cache poisoning by ensuring ML-DSA signature verification is performed on every cache hit and by eliminating TOCTOU between verification and extraction.
Changes:
- Cache the signed archive + detached signature (not the extracted binary) and re-verify ML-DSA-65 on every cache hit via a private staged copy.
- Add best-effort Unix permission tightening for the shared upgrade cache directory (0700).
- Update upgrade apply flow to extract from verified cached archives and cache verified artifacts pre-extraction; add regression/TOCTOU unit tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/upgrade/cache_dir.rs |
Best-effort permission tightening of the shared upgrade cache directory on Unix. |
src/upgrade/binary_cache.rs |
Redesign cache to store archive+signature, re-verify ML-DSA on every hit, stage into process-private dir, and add A1/TOCTOU-focused tests. |
src/upgrade/apply.rs |
Switch upgrade resolution to reuse verified cached archives and cache verified artifacts before extraction. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Fast corruption pre-check on the PRIVATE copy (NOT the security | ||
| // decision). A copy error or truncation surfaces here. | ||
| let actual_hash = sha256_file(&private_archive).ok()?; | ||
| if actual_hash != meta.archive_sha256 { | ||
| warn!( | ||
| "Binary cache SHA-256 mismatch for version {version} (expected {}, got {})", | ||
| meta.sha256, actual_hash | ||
| "Binary cache SHA-256 mismatch for version {version} \ | ||
| (expected {}, got {actual_hash}) — ignoring cache entry", | ||
| meta.archive_sha256 | ||
| ); | ||
| return None; |
There was a problem hiding this comment.
Already addressed — the cleanup helper is called on this path (line 287 in the current branch tip). The original review pointed at line numbers from the pre-squash branch.
| if let Err(e) = self.verify_archive(&private_archive, &private_sig) { | ||
| warn!( | ||
| "Cached archive for version {version} FAILED ML-DSA signature \ | ||
| re-verification ({e}); discarding cache entry (possible \ | ||
| on-disk tampering). A fresh verified download will run." | ||
| ); | ||
| return None; | ||
| } |
There was a problem hiding this comment.
Already addressed — same cleanup("signature re-verification failed") path is wired on the signature-verify-fail branch.
Per review-wave findings on PR WithAutonomi#100: 1. The upgrade staging tempdir is now created with 0700 on Unix via `tempfile::Builder::permissions`. Doc comments in binary_cache.rs describe the staging dir as "process-private" / "unreachable to the attacker"; pre-followup the dir inherited the default umask (typically 0755), so a same-host attacker on a different UID could in principle race the verify→extract window. The ML-DSA re-verification on the private copy is the primary defence; this is the small layer that brings the code in line with the doc claim. 2. `get_verified_archive` now cleans up the staged private archive + signature on every rejection path (sha256 read failure, sha256 mismatch, signature re-verification failure). The parent TempDir is already auto-cleaned on drop, so this is hygiene — but it also defends against any future caller that reuses the same private_dir across retries by making each rejection start from a clean slate.
97a1a34 to
ae0038a
Compare
…il cleanup Per second-round review wave (codex xhigh) on PR WithAutonomi#100: 1. [BLOCKER] get_verified_archive had no size policy before copying or reading cached files. signature::verify_from_file* and sha256_file both fs::read() the archive in full. An attacker with cache-dir write access (the A1 threat model) could otherwise drop a multi-GB .archive and force disk exhaustion in the private staging dir, or OOM during re-verification, before the entry was rejected. The download path already enforces MAX_ARCHIVE_SIZE_BYTES = 200 MiB; the cache-hit path now honours the same bound (plus the fixed SIGNATURE_SIZE) BEFORE any copy or read. store_archive itself defence-in-depths the same gate. 2. [SHOULD-FIX] download_verify_extract stored the cache entry BEFORE running extract_binary. A validly-signed but malformed/incompatible release therefore became a shared cache poison pill: every later node would hit the cache, fail extract, and re-download under no lock. Reordered: extract first, cache only on success. 3. [NIT] The cleanup helper was defined AFTER the two fs::copy calls that could leave private_archive (or a partially created private_sig) behind. Moved the helper definition before the copies so every rejection path — including a partial copy failure — runs the same scrub. Three new tests pin the size policy and the new ordering: test_oversize_cached_archive_is_rejected_before_copy test_wrong_size_signature_is_rejected_before_copy test_store_archive_rejects_oversize MAX_ARCHIVE_SIZE_BYTES tightened to pub(super) (one fewer clippy warning). 10/10 binary_cache tests + 485 lib tests pass; cfd clean.
| // `signature::verify_from_file*` and `sha256_file` both load the | ||
| // archive into memory in full. An attacker with cache-dir write | ||
| // access (the A1 threat model) could otherwise drop a multi-GB | ||
| // `.archive` and force disk exhaustion in the staging dir or an OOM | ||
| // during re-verification before the entry is rejected. The download | ||
| // path already enforces `MAX_ARCHIVE_SIZE_BYTES`; cache hits must | ||
| // honour the same bound, plus the fixed `SIGNATURE_SIZE`. |
There was a problem hiding this comment.
Good catch — comment was wrong. Updated to clarify the memory pressure comes from signature::verify_from_file* (FIPS-204 requires the message as a slice); sha256_file streams 8 KiB chunks. Fixed in 7ba9bb6.
| let meta_json = serde_json::to_string(&meta) | ||
| .map_err(|e| Error::Upgrade(format!("Failed to serialize binary cache meta: {e}")))?; |
There was a problem hiding this comment.
Fixed — now reads "Failed to serialize cached archive metadata". 7ba9bb6.
… bypass Per round-3 codex review on PR WithAutonomi#100: The size gate from the prior commit used `fs::metadata` (which follows symlinks). A cache-dir writer could plant `ant-node-X.archive -> /dev/zero` (or a FIFO/device); `.len()` stats as 0 so the size gate passed, but `fs::copy` then read the underlying special file indefinitely into the private staging dir, exhausting disk before any signature check ran. Fix: replace `fs::metadata` with `fs::symlink_metadata` in `get_verified_archive` and `store_archive`, and require `file_type().is_file()` before any size check or copy. Symlinks, FIFOs, sockets, and device files are now rejected with a `warn!` before any read or copy of the suspicious path. This narrows the attack surface to "an attacker with cache-dir write who can also write real files there" — they still cannot bypass the ML-DSA signature re-verification on the private copy (the primary A1 defence) and they cannot trigger disk exhaustion via special-file indirection. New regression test on Unix: `test_symlink_cached_archive_is_rejected_before_copy` plants `cached_archive` as a symlink to `/dev/zero` and asserts `get_verified_archive` returns `None` without staging anything. 486 lib tests pass; cfd clean.
5d8807b to
d2fd74e
Compare
| let cached_archive = self.cached_archive_path(version); | ||
| let cached_sig = self.cached_signature_path(version); | ||
| let meta_path = self.meta_path(version); | ||
|
|
||
| let meta_data = fs::read_to_string(&meta_path).ok()?; |
There was a problem hiding this comment.
Fixed — the metadata path now goes through the same open-handle gate as the archive and signature: regular-file check on the opened File and a 4 KiB MAX_META_BYTES cap. A symlink-to-/dev/zero or a huge meta file is rejected before any read. New test test_meta_symlink_to_special_file_is_rejected covers it. 7ba9bb6.
| let archive_meta = match fs::symlink_metadata(&cached_archive) { | ||
| Ok(m) => m, | ||
| Err(e) => { | ||
| debug!("Cannot stat cached archive for {version}: {e}"); | ||
| return None; |
There was a problem hiding this comment.
Narrowed in 7ba9bb6. New open_regular_capped helper opens archive + signature + meta ONCE each, validates file_type().is_file() and size on the resulting File handle, and the subsequent stream into the private staging dir reads from that open handle (wrapped in Read::take(len)). There is no second path-based stat or open after this point, so a race on the cache path between the validation and the copy cannot redirect what gets staged.
| // Size policy gate — runs BEFORE we copy or read the cached files. | ||
| // | ||
| // `signature::verify_from_file*` and `sha256_file` both load the | ||
| // archive into memory in full. An attacker with cache-dir write | ||
| // access (the A1 threat model) could otherwise drop a multi-GB | ||
| // `.archive` and force disk exhaustion in the staging dir or an OOM |
There was a problem hiding this comment.
Fixed in the same commit as the other comment-fix — the wording now correctly attributes the in-memory load to signature::verify_from_file*.
The shared upgrade binary cache stored the extracted binary and, on a cache hit, returned it after only a SHA-256 check against a sibling .meta.json. SHA-256 is not a security control: anyone able to write to the shared cache directory (a co-located process, a shared container volume, a low-privilege foothold on the host) could drop a malicious binary plus a forged matching metadata hash, and the next ant-node instance to upgrade would execute it with no signature verification at all — persistent RCE on every co-located node. The ML-DSA-65 signature covers the archive and was only checked on the initial download, never on a cache hit. Changes: - Cache the signed *archive + detached signature* instead of the extracted binary. `BinaryCache::get_verified_archive` re-runs ML-DSA-65 verification on every cache hit; the binary is always extracted fresh from the just-verified archive. A tampered archive, tampered or missing signature, or forged metadata fails verification against the pinned release public key, so a poisoned cache entry is rejected and a fresh verified download runs. - Stage cached files into the caller's process-private temp directory and verify that copy, then extract from the same private path. Closes the verify-vs-extract TOCTOU on the shared cache files: an attacker cannot swap the bytes between when the verifier reads them and when the extractor reads them. - Size policy before any copy or read. `fs::symlink_metadata` + `file_type().is_file()` rejects symlinks / FIFOs / devices outright; archive size is bounded by `MAX_ARCHIVE_SIZE_BYTES` and the signature must be exactly `SIGNATURE_SIZE` bytes. Otherwise an attacker could plant `cached.archive -> /dev/zero` (stats as 0 bytes) and force unbounded disk fill in the staging dir or OOM in `signature::verify`. - Cache only after successful extraction. A validly-signed-but-malformed release no longer becomes a shared cache poison pill that every later node downloads, fails to extract, and re-downloads. - `cache_dir.rs` restricts the shared upgrade cache directory to 0700 on Unix as defence in depth; the ML-DSA gate is the primary control. - `store_archive` mirrors the same size / file-type / signature checks before persisting, so a poisoned entry cannot be created through the supported path either. Tests in `src/upgrade/binary_cache.rs` cover the tamper path (SHA-256-forged swap on disk rejected by the signature re-check), the post-hit shared-file swap (private copy unaffected), the symlink-to- `/dev/zero` bypass attempt, oversize archive / wrong-sized signature rejection, and round-trip storage. Production verifies against the pinned `RELEASE_SIGNING_KEY`; tests use a `#[cfg(test)]`-only constructor that injects a generated key without weakening the production trust anchor. Residual: cache entries are not bound to a specific release version (the ML-DSA signing context is constant across versions), so a same-UID attacker who already has any past validly-signed release can plant it under a newer version's cache key and force a downgrade to that old signed binary. Not RCE (still legitimately-signed bytes) and a same-UID attacker has easier paths anyway; closing it cleanly requires coordinated changes in the release-signing pipeline, ant-keygen, ant-node, and ant-client, and is tracked in the binary_cache module docs.
d2fd74e to
de30b06
Compare
dirvine
left a comment
There was a problem hiding this comment.
Code Review: fix(upgrade): re-verify ML-DSA signature on every cache hit
I've reviewed the diff across all 3 files (+668/−124), the PR description, CI status, and consulted the codebase's signature.rs module.
Summary
This is a well-structured, thorough fix for the cache-poisoning RCE (A1). The approach is architecturally sound and the implementation is careful with security-relevant details. Main concerns below are minor — no blockers.
✅ What's done well
-
Core threat model is correctly identified and fixed. Caching the extracted binary with only SHA-256 integrity was the vulnerability; caching the signed archive and re-verifying ML-DSA on every hit is the correct remediation.
-
TOCTOU elimination via private staged copy. Copying the archive+signature into a process-private temp dir before verification, then extracting from that same private copy, definitively defeats a post-verification file swap on the shared cache. The test
test_returned_archive_is_private_copy_immune_to_post_hit_swapconfirms this. -
Defence in depth. Size policy gates (
symlink_metadata+is_file()+ size bounds) run before any copy or read, preventing symlink/FIFO/device-file attacks and OOM/disk-exhaustion vectors. The 0700 cache dir permissions andstore_archivere-verification are additional layered controls. -
#[cfg(test)]-only constructor.new_with_verify_keyis guarded by#[cfg(test)], ensuring the production trust anchor (the pinnedRELEASE_SIGNING_KEY) cannot be weakened. Theverify_key: Option<MlDsaPublicKey>— whereNonemeans "use embedded production key" andSome(key)is test-only — is a clean pattern. -
Archive metadata written last (write-to-temp-then-rename). Readers never observe a partial state. Verified by inspection of
store_archive.
🔍 Issues to address
1. macOS test failure — investigate CI log
Test (macos-latest) failed in the latest run. The run is still in progress so logs aren't available yet, but this needs investigation before merge. Likely candidates: ML-DSA key generation timeout on CI hardware (the OnceLock in test_keypair() should cache it, but first generation may be slow), or a filesystem-specific issue with the symlink tests on macOS.
Action: Investigate the failed test log once the run completes.
2. Extraction safety note (informational)
The extract_from_tar_gz function extracts to dest_dir but should be checked for path traversal (entries with ../ in archive paths). This is outside the scope of this PR's cache fix, but worth verifying that the tar extraction doesn't follow symlinks or escape the temp dir.
3. Potential follow-up: version binding
The downgrade attack documented in the module docs is correctly scoped as a follow-up. One possible approach: include the version string inside the archive metadata (as a signed manifest), and have the cache verifier check archive_version == expected_version in addition to the ML-DSA signature. This wouldn't require changing the signing context or key, just adding a structured manifest to the archive. Worth discussing as the next step.
🔍 Minor observations (no action required)
atomic_copybehaviour: Thefs::remove_file(dest)beforefs::renameis correct for Windows compatibility. On Unix,renameis atomic even when the destination exists.- Concurrent access safety: The exclusive
DownloadLockGuardcorrectly serialises downloads+stores. The read path (get_verified_archive) is lock-free and safe becausemeta.jsonis always written last — a reader can never observe a partially-committed write. - Hard links:
symlink_metadata+is_file()correctly rejects symlinks but not hard links. Hard links require write permission on the target directory + target, which is the same access level needed for direct modification — acceptable residual.
🧾 Overall assessment
Approved with the macOS test failure needing investigation. The fix is correct, the threat model coverage is thorough, and the implementation is careful. The documented residual (version-unbound cache entries) is a known limitation scoped for follow-up work.
Key strengths:
- Cryptographic verification on every cache hit (primary gate)
- Private staging eliminates TOCTOU
- Size/file-type gates prevent pre-verification attacks
- Test coverage includes tamper, TOCTOU, symlink, and overflow scenarios
- Production trust anchor is structurally isolated from test code
Review feedback on the upgrade binary cache: - `meta.json` was read with an unbounded `fs::read_to_string`. An attacker with write access to the shared cache directory could plant the metadata sidecar as a symlink to `/dev/zero` or as a huge file and stall the read into a hang/OOM before the archive/sig hardening ran. The metadata path now goes through the same open-once-and-validate gate as the archive: regular-file check on the opened handle, capped at `MAX_META_BYTES` (4 KiB). - Archive + signature staging previously did `symlink_metadata` (path) followed by `fs::copy` (path), leaving a small TOCTOU window where an attacker could race-swap the path to a symlink/FIFO/device or an oversized file between the check and the copy. Both files are now opened once via `open_regular_capped`, validated on the resulting `File` handle (size + file-type), and copied into the private staging dir from the open handle (wrapped in `Read::take(len)` as belt-and-braces against a post-open extension). All subsequent operations on those files use the staged private bytes, never the shared path. - Comment fix: the prior comment claimed `sha256_file` loads the archive into memory in full. It actually streams in 8 KiB chunks; the memory-pressure concern is `signature::verify_from_file*` (FIPS-204 requires the message as a slice). Wording updated. - Stale error message "Failed to serialize binary cache meta" updated to "Failed to serialize cached archive metadata" — the cache now stores archive metadata, not extracted-binary metadata. Two new tests: test_oversized_meta_is_rejected test_meta_symlink_to_special_file_is_rejected (Unix-only) 488 lib tests pass; cfd clean.
|
Pushed 7ba9bb6 addressing the inline review comments:
Two new tests: The Copilot comments about cleanup-on-rejection paths (lines 277/295 in the original review) were referring to pre-squash line numbers; the cleanup helper is already wired on every rejection path on the current branch tip — replied inline. |
| fn open_regular_capped(path: &Path, max_len: u64) -> io::Result<(File, u64)> { | ||
| let file = OpenOptions::new().read(true).open(path)?; | ||
| let meta = file.metadata()?; | ||
| if !meta.file_type().is_file() { | ||
| return Err(io::Error::new( | ||
| io::ErrorKind::InvalidInput, | ||
| "not a regular file (FIFO/device/socket/dir)", | ||
| )); |
Summary
The shared upgrade binary cache stored the extracted binary and, on a cache hit, returned it after only a SHA-256 check against a sibling
.meta.json. SHA-256 is not a security control: anyone able to write to the shared upgrade cache directory ({data_dir}/upgrades/, shared across all node instances on a host — a co-located process, a shared container volume, a low-privilege foothold on the host) could drop a malicious binary plus a forged matching metadata hash, and the next ant-node instance to upgrade would execute it with no signature verification at all — persistent RCE on every co-located node. The ML-DSA-65 signature covers the archive and was only checked on the initial download, never on a cache hit.Fix
Cache the signed archive + detached signature instead of the extracted binary.
BinaryCache::get_verified_archivere-runs ML-DSA-65 verification on every cache hit; the binary is always extracted fresh from the just-verified archive. A tampered archive, tampered or missing signature, or forged metadata fails verification against the pinned release public key, so a poisoned cache entry is rejected and a fresh verified download runs.Stage into a process-private temp dir and verify that copy, then extract from the same private path. Closes the verify-vs-extract TOCTOU on the shared cache files: an attacker cannot swap the bytes between when the verifier reads them and when the extractor reads them.
Size policy before any copy or read.
fs::symlink_metadata+file_type().is_file()rejects symlinks / FIFOs / devices outright; archive size is bounded byMAX_ARCHIVE_SIZE_BYTESand the signature must be exactlySIGNATURE_SIZEbytes. Otherwise an attacker could plantcached.archive -> /dev/zero(stats as 0 bytes) and force unbounded disk fill in the staging dir or OOM insignature::verify.Cache only after successful extraction. A validly-signed-but-malformed release no longer becomes a shared cache poison pill that every later node downloads, fails to extract, and re-downloads.
Cache dir tightened to 0700 on Unix as defence in depth (
cache_dir.rs); the ML-DSA gate is the primary control.store_archivemirrors the same size / file-type / signature checks before persisting, so a poisoned entry cannot be created through the supported path either.Production verifies against the pinned
RELEASE_SIGNING_KEY; a#[cfg(test)]-only constructor injects a generated key so the cache is unit-testable without weakening the production trust anchor.Tests
src/upgrade/binary_cache.rscovers:cached.archive -> /dev/zerois rejected pre-copy by theis_file()check.Residual (not addressed here)
Cache entries are not bound to a specific release version: the ML-DSA signing context is constant across versions, so a same-UID attacker who already has any past validly-signed release can plant it under a newer version's cache key and force a downgrade to that old signed binary. Not RCE (still legitimately-signed bytes) and a same-UID attacker has easier paths anyway. Closing it cleanly requires coordinated changes in the release-signing pipeline, ant-keygen, ant-node, and ant-client; tracked in the
binary_cachemodule docs.