Is your feature request related to a problem? Please describe.
Yes. The virtualenv cache in Codeanalyzer is keyed on directory
existence, not on dependency-manifest content, and a single
rebuild_analysis flag conflates three independent caches (venv /
CodeQL DB / analysis.json). This causes one correctness bug and one
performance bug:
- Stale-deps correctness bug (core.py:231-235):

      venv_path = self.cache_dir / self.project_dir.name / "virtualenv"
      if not venv_path.exists() or self.rebuild_analysis:
          # python -m venv ...; pip install -r requirements*.txt; pip install -e .

  Edit requirements.txt, pyproject.toml, or a lockfile while the venv
  dir already exists and rebuild_analysis=False → the venv is not
  recreated → analysis silently runs against stale dependency versions
  (wrong Jedi/CodeQL resolution).
- Wasted rebuilds: rebuild_analysis=True (--eager) with
  byte-identical dependencies still tears down and re-pip installs
  the venv (~30s), even though --eager is meant to invalidate the
  analysis, not the environment.
- One flag, three caches: rebuild_analysis (read at
  core.py:62) gates the venv (:235), the CodeQL DB (:326), and
  the symbol-table cache (:370,502,552,582). There is no way to
  rebuild the analysis without also paying for a full venv rebuild.
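
To make the first two failure modes concrete, the predicate quoted above can be restated as a pure function. This is an illustrative restatement, not the code in core.py; the function name and its boolean parameters are hypothetical.

```python
def existence_keyed_rebuild(venv_exists: bool, rebuild_analysis: bool) -> bool:
    """Restates `not venv_path.exists() or self.rebuild_analysis`."""
    return (not venv_exists) or rebuild_analysis


# Manifest content is not an input, so it cannot influence the decision.
# Stale deps: manifest edited, venv dir present, flag off -> venv is reused.
stale_case = existence_keyed_rebuild(venv_exists=True, rebuild_analysis=False)
# Wasted rebuild: manifest unchanged, venv dir present, flag on -> venv is rebuilt.
wasted_case = existence_keyed_rebuild(venv_exists=True, rebuild_analysis=True)
print(stale_case, wasted_case)  # prints: False True
```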
The three artifacts have three different invalidation triggers:
| Artifact | Correct rebuild trigger | Granularity |
|---|---|---|
| virtualenv | dependency-manifest content changes | all-or-nothing |
| CodeQL DB | any *.py content changes | all-or-nothing |
| symbol table | per-file content changes | incremental (already correct) |
| call graph / analysis.json | any *.py content changes | all-or-nothing |
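
As a rough illustration of the "any *.py content changes" trigger in the table, a source key could be derived as below. This is a sketch under assumptions: `source_hash` is a hypothetical helper, not the existing `_compute_checksum`, and hashing relative paths plus length-prefixed contents is one possible framing rather than the upstream one.

```python
import hashlib
from pathlib import Path


def source_hash(project_dir: Path) -> str:
    """Hash the relative path and content of every *.py file under project_dir."""
    digest = hashlib.sha256()
    # Sorting makes the result independent of filesystem traversal order.
    for path in sorted(project_dir.rglob("*.py")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        rel = path.relative_to(project_dir).as_posix()
        digest.update(rel.encode("utf-8") + b"\0")
        # Length prefix keeps adjacent files from blending into each other.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Under this scheme, editing a manifest such as requirements.txt leaves the key unchanged, while any edit to a .py file produces a new key.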
Describe the solution you'd like
Cache each tier independently, keyed by its own content hash:
- venv keyed on a dependency-manifest hash. SHA256 over the
  manifests that exist (requirements.txt, requirements-dev.txt,
  dev-requirements.txt, test-requirements.txt, pyproject.toml,
  setup.py, setup.cfg, Pipfile, Pipfile.lock, poetry.lock,
  uv.lock). Persist as <venv>/.deps_hash; rebuild iff
  venv missing OR stored_hash != current_hash OR rebuild_venv.
  Editing source must not rebuild the venv; changing a dependency
  must, even when rebuild_analysis=False.
- Separate, content-addressed cache roots:
  - cache_dir/venv/<dep_hash>/: virtualenv
  - cache_dir/codeql/<src_hash>/: CodeQL DB. The source-checksum
    invalidation at core.py:313-352 is already correct; the change
    is to key the DB directory by <src_hash> and retain prior
    DBs instead of overwriting them in place with --overwrite, so
    revisiting an earlier source state (git bisect, branch switch,
    CI re-run of a prior SHA) reuses an existing DB.
- Split rebuild_analysis into independent controls
  (rebuild_venv, rebuild_db, rebuild_analysis), each defaulting
  to its own content-key check. Keep --eager (== rebuild all) as
  back-compat sugar.
- Expose the resolved venv/DB/analysis paths via
  AnalysisOptions so embedders (e.g. CLDK) can select the cache
  root without re-deriving keys.
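
The venv rule above (rebuild iff venv missing OR stored_hash != current_hash OR rebuild_venv) could be sketched roughly as follows. The manifest list and the .deps_hash marker name come from this proposal; the function names `dependency_hash`, `venv_needs_rebuild`, and `write_deps_marker` are hypothetical, and the length-prefixed framing is an assumption rather than a required format.

```python
import hashlib
from pathlib import Path

# Manifest names from the proposal; the fixed order keeps the hash deterministic.
MANIFESTS = (
    "requirements.txt", "requirements-dev.txt", "dev-requirements.txt",
    "test-requirements.txt", "pyproject.toml", "setup.py", "setup.cfg",
    "Pipfile", "Pipfile.lock", "poetry.lock", "uv.lock",
)


def dependency_hash(project_dir: Path) -> str:
    """SHA256 over the names and contents of the manifests that exist."""
    digest = hashlib.sha256()
    for name in MANIFESTS:
        path = project_dir / name
        if path.is_file():
            data = path.read_bytes()
            digest.update(name.encode("utf-8") + b"\0")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


def venv_needs_rebuild(venv_path: Path, current_hash: str,
                       rebuild_venv: bool = False) -> bool:
    """True if the venv is missing, unmarked, stale, or explicitly forced."""
    marker = venv_path / ".deps_hash"
    if rebuild_venv or not venv_path.exists() or not marker.is_file():
        return True
    return marker.read_text(encoding="utf-8").strip() != current_hash


def write_deps_marker(venv_path: Path, current_hash: str) -> None:
    """Record the hash; intended to run only after a successful pip install."""
    (venv_path / ".deps_hash").write_text(current_hash + "\n", encoding="utf-8")
```

Treating a missing marker as stale means a venv created by an older version, or one whose install was interrupted before the marker was written, would be rebuilt once rather than trusted.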
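Acceptance criteria:
- Editing a .py file rebuilds the CodeQL DB + analysis.json but
  reuses the existing venv (no pip install).
- Changing a dependency manifest rebuilds the venv, even when
  rebuild_analysis=False.
- --eager with unchanged deps does not rebuild the venv.
- Revisiting an earlier source state reuses the cached CodeQL DB
  (no database create).
- AnalysisOptions exposes the resolved venv/DB/analysis paths.
- Existing cache_dir callers keep working (back-compat
  default for the new roots).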
Implementation sketch:
- Add Codeanalyzer._dependency_hash(project_dir) -> str (mirror of
  _compute_checksum but over the manifest list above).
- In __enter__, replace the not venv_path.exists() or rebuild
  predicate with the .deps_hash marker comparison; write the marker
  after a successful pip install.
- Relocate db_path under cache_dir/codeql/<src_hash>/ keyed by the
  existing _compute_checksum; drop --overwrite in favor of
  per-hash dirs.
- Extend AnalysisOptions with rebuild_venv/rebuild_db fields and
  read-back properties for the resolved paths.
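
The last two steps might look roughly like the dataclass below. This is a sketch, not the real AnalysisOptions: the class name `CacheOptionsSketch` and the `dep_hash`/`src_hash` fields are hypothetical, and only the two cache roots named in this proposal are modelled.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CacheOptionsSketch:
    """Hypothetical stand-in for the proposed AnalysisOptions additions."""
    cache_dir: Path
    dep_hash: str               # e.g. the dependency-manifest hash
    src_hash: str               # e.g. the existing source checksum
    rebuild_venv: bool = False  # force a venv rebuild regardless of dep_hash
    rebuild_db: bool = False    # force a CodeQL DB rebuild regardless of src_hash
    rebuild_analysis: bool = False

    @property
    def venv_path(self) -> Path:
        """Resolved virtualenv root: cache_dir/venv/<dep_hash>/."""
        return self.cache_dir / "venv" / self.dep_hash

    @property
    def db_path(self) -> Path:
        """Resolved CodeQL DB root: cache_dir/codeql/<src_hash>/."""
        return self.cache_dir / "codeql" / self.src_hash
```

Because each path depends on only one key, a source edit changes db_path while venv_path stays put, which is the separation the three independent flags are meant to expose.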
Describe alternatives you've considered
- Downstream workaround (current state in CLDK). python-sdk
  passes a dependency-hash-keyed cache_dir from
  cldk/analysis/python/codeanalyzer/cache.py. This keeps the venv
  stable but cannot prevent the in-place CodeQL DB rebuild,
  because venv and DB share one cache_dir. It also duplicates key
  logic that rightfully belongs upstream. Rejected as a permanent
  fix; it only masks the venv bug for one consumer.
- Hash the whole project tree for everything (single key). Simple,
  but any one-character source edit would then invalidate the venv
  too: the exact bug being fixed, inverted. Rejected.
- Always rebuild the venv (drop caching). Correct but defeats the
  purpose; ~30s pip install on every run. Rejected.
- mtime/size-based venv invalidation instead of content hash.
  Cheaper to compute, but unreliable across clones/checkouts/CI where
  mtimes differ for identical content. Content hash is the robust
  choice; manifest files are small so the cost is negligible.
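
The mtime objection can be reproduced with a small experiment: two byte-identical files given different modification times, as would happen across two clones. The helper names below are hypothetical and the timestamps are arbitrary.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_key(path: Path) -> str:
    """Key derived from file bytes only."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mtime_key(path: Path) -> tuple:
    """Key derived from modification time and size."""
    st = path.stat()
    return (st.st_mtime_ns, st.st_size)


def compare_keys() -> tuple:
    """Return (content keys match, mtime keys match) for two identical files."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "clone_a_requirements.txt"
        second = Path(tmp) / "clone_b_requirements.txt"
        for path in (first, second):
            path.write_text("requests==2.31.0\n", encoding="utf-8")
        # Simulate two checkouts made at different times (values in nanoseconds).
        os.utime(first, ns=(10**9, 10**9))
        os.utime(second, ns=(10**11, 10**11))
        return (content_key(first) == content_key(second),
                mtime_key(first) == mtime_key(second))


print(compare_keys())
```

The content keys agree while the mtime keys do not, so an mtime-keyed cache would rebuild the venv on a fresh checkout even though nothing changed.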
Additional context
- Discovered while building feature-parity Python analysis in CLDK
(python-sdk), where the venv rebuild dominated repeated-run
latency (~30s cold vs ~3s warm once the venv was stabilized).
- Line references are against installed 0.1.14.
- Once this lands, the CLDK-side helper
(cldk/analysis/python/codeanalyzer/cache.py) should collapse to
"pick a root," and its workaround comments referencing this issue
should be removed.