launch regression test #170

Closed
lfengad wants to merge 26 commits into NVIDIA:main from nvidia-cosmos:liangf/launch-regression-test

Conversation

lfengad (Collaborator) commented May 18, 2026

No description provided.

lfengad and others added 26 commits May 11, 2026 20:51
Bring the full Cosmos3 project tree into this repository: the cosmos3
Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config,
Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly,
justfile). Existing root files (README.md, RELEASE.md,
cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes
includes an LFS override for the preserved logo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Combine the Cosmos3 documentation (user guide, overview, setup,
inference, models, modalities, CLI reference) from cosmos3-internal
with the original notice pointing to the archived-ces2025 branch and
the nvidia-cosmos GitHub organization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the existing cosmos3 package directory to cosmos-inference and
introduce a new top-level cosmos/ package skeleton matching the planned
framework layout:

  cosmos/
    model/, inference/, data/, trainer/
    algorithm/{loss,reward,rl}/
    controller/, workers/{simulations,rollout,reference,reward}/
    communicator/, checkpoint/, callbacks/, tools/
    utils/, evaluation/, launcher/

Also add root-level tests/ and tools/ placeholders. Each new Python
subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.

Note: imports inside cosmos-inference/ still reference `cosmos3.*` and
will need updating in a follow-up; pyproject.toml package config is
also unchanged for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move all cosmos3-internal-originated files at the repo root (dotfiles,
config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/,
AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile,
LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock)
into cosmos-inference/, so the inference package is self-contained.

Root now matches the planned framework layout:
  - cosmos/         (new framework skeleton from prior commit)
  - cosmos-inference/  (full Cosmos3 codebase)
  - examples/, docs/, docker/, tests/, tools/  (fresh placeholders)
  - README.md, RELEASE.md, cosmos-logo-thumbnail.png  (originals)

Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation
  (previously merged into root README).
- cosmos-inference/.gitattributes: drop the stale
  `cosmos-logo-thumbnail.png` LFS override (the file lives at root, not
  inside cosmos-inference/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton
package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock
generated by `uv lock` resolving the single editable cosmos package.

Brings the repo root in line with the planned layout:
  pyproject.toml, uv.lock, README.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace cosmos-inference/ contents with the exact snapshot of origin/main
(99450113) from the cosmos3-internal repo to drop the stale flat-layout
files and align with upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into
cosmos-inference/ to pick up new action-policy datasets/configs, LoRA
utility, learning-rate logger, VLM augmentors, and updated docs/configs.
Excludes .git and __pycache__; pre-existing destination-only files
(e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml,
callbacks/vlm) were preserved rather than deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into
cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm
and processors subtrees, eval scripts, and assorted training/inference
updates. Excludes .git and __pycache__.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add 9 docs in docs/ (setup, code_structure, training, dataset,
checkpoints, inference, configs, faq, examples) and READMEs in
examples/, docker/, and tools/. All are section-heading skeletons to
be filled in for the open-source training-infra release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/setup.md: port content from cosmos-inference/docs/setup.md
  adapted for the training-infra repo (new clone URL, train-only
  extras and CUDA groups, Recommended Base Image section, NGC
  base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with
  name "cosmos" and hatch packages ["cosmos"]; strip
  diffusers-cosmos3 from the train extra and tool.uv.sources
  (only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin
  versions match cosmos-inference exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README.md: rewrite as the framework entry point introducing the
  training-infra half (cosmos/, docs/) and the inference-infra half
  (cosmos-inference/), with a docs table connecting each docs/*.md
  to a one-line description and a Training-section Reference block
  for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout —
  repository layout, per-subpackage descriptions, and a "where to
  add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained
  checkpoints to inference, with quickstart, modalities, backends,
  and pointers into cosmos-inference/docs/*.md as the source of
  truth for full inference details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OSS files copied from cosmos-inference (no cosmos3 references in any
of them; copied as-is):
  LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md,
  .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc,
  .github/ (issue templates + pre-commit workflow).

Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream
  but persisted locally because earlier syncs ran without --delete;
  cosmos-inference/ is now an exact mirror of cosmos3-internal main
  (fc7f97d).
- README.md: before the Setup install commands, link System
  Requirements and Recommended Base Image (link-only, no inline
  details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh

Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace cosmos3._src.imaginaire entry with
  cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes
  use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS
  / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and
  markers. Drops Args / init logging / seed fixtures that depend on
  cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as framework-level map (training-side
  cosmos/ and inference-side cosmos-inference/ tables, docs indices,
  and common-task tables for both halves).
- README.md: AGENTS.md reference now points at root ./AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude
  cosmos-inference/ from pre-commit; addlicense now skips binary
  extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt
  non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png
  LFS rule so the existing 25 KB git blob is preserved (rather than
  being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown
  linter disables MD013 / MD033 / MD040 (line length / inline HTML
  / fenced-code language) and exempts MD041 for README.md.
- addlicense: added SPDX headers to all cosmos/**/__init__.py
  skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regen and table-alignment
  fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md,
  docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line
  stub until the first tagged release.
- README.md: replace non-descriptive "here" link text with
  "one-shot quickstart" to satisfy MD059.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Note that the 5 cosmos3-* agent skills live under
cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/
and activate when working inside that subtree; root-side skills can
be added later as cosmos/ grows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml)
into this tree. Adds:
- 446 cosmos/*.py files (up from prior snapshot; covers callbacks/learning_rate_logger,
  data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero
  policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
    launch_mixed_modality_sft_8b.sh
    launch_vlm_llava_ov.sh
    launch_action_libero.sh
    launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential 4-smoke sweep runner
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory
  (auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside
  mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema
  (pretrained_weights nested dict, qk_norm, etc.)

Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier
  snapshot.

Verified that all 4 smoke runs train cleanly under --deterministic, with iter-by-iter loss
matching the cosmos_opensource and imaginaire4 source-side runs to 4 decimal places on
the byte-identical paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
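
The sitecustomize.py dumper described in the commit above is not shown in this conversation. As a rough sketch only, an atexit hook that records which module files were loaded could look like the following. The output file name, the one-path-per-line format, and both function names are assumptions made for illustration; only the LOAD_TRACE_DIR variable and the atexit/sys.modules mechanism come from the commit message.

```python
"""Hypothetical sitecustomize-style module dumper (illustrative sketch only)."""
import atexit
import os
import sys
from pathlib import Path


def loaded_module_files() -> list[str]:
    """Return sorted absolute source paths of modules currently in sys.modules."""
    paths = set()
    for module in list(sys.modules.values()):
        # Built-in modules and namespace packages have no usable __file__.
        path = getattr(module, "__file__", None)
        if path:
            paths.add(os.path.abspath(path))
    return sorted(paths)


def dump_loaded_modules(trace_dir: str) -> Path:
    """Write one path per line to <trace_dir>/modules_<pid>.txt and return that file."""
    out_dir = Path(trace_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"modules_{os.getpid()}.txt"  # assumed naming scheme
    out_file.write_text("\n".join(loaded_module_files()) + "\n")
    return out_file


# Activate only when the environment variable named in the commit is set.
_trace_dir = os.environ.get("LOAD_TRACE_DIR")
if _trace_dir:
    atexit.register(dump_loaded_modules, _trace_dir)
```

Python imports a module named sitecustomize automatically at startup when one is importable, which is consistent with the commit's note about PYTHONPATH=. making the hook load. Writing one file per process ID keeps multi-process torchrun launches from overwriting each other, although the real tool may handle this differently.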
The release tool writes cosmos_training_meta/files.txt on every run; it's
purely a debugging artifact and shouldn't be in the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the `cosmos` package from `cosmos/` at the repo root into
`cosmos_training/cosmos/` so the training package source lives alongside
its driver scripts, configs, and experiments. The package import name
stays `cosmos` — no import rewrites needed.

Also brings in the 5 scaffolding subdirs (algorithm, controller,
evaluation, inference, workers) that were already at the root skeleton,
so the new location has the full intended layout.

Update build/type-check config to follow:
- pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos
- pyrefly.toml: project-includes and search-path point at new location

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge the duplicated utils/vfm/vlm/ tree into utils/vlm/ and promote the
DTensor-aware utils/vfm/fused_adam.py to top-level utils/fused_adam.py.
The two trees had diverged into feature forks (not stale copies); the merged
files preserve the union of features from both:

- optimizer.py: trainable_params/frozen_params regex freeze and the
  tuple(config.betas) bugfix from vfm/vlm, freeze_llm_moe_gates from vlm
- pretrained_models_downloader.py: env-var-aware _load_s3_credentials,
  HF Hub fallback, and INTERNAL gate from vfm/vlm; resolve_hf_model_store
  and the GCS ETag streaming workaround from vlm
- flop_calculator.py: vfm/vlm version, now importing from the canonical
  cosmos.tools.flops.qwen3_vl with is_causal=False to preserve the
  bidirectional FLOP counts the dynamic batcher was calibrated against
- utils/vlm/compute_flops_qwen3vl.py removed (replaced by the canonical
  cosmos.tools.flops.qwen3_vl); numeric output verified bit-identical
  across dense/MoE x text/image/video cases
- constant.py and create_position_ids.py: docstrings and type-hints only,
  behavior unchanged

Redirected import sites in cosmos/model/vfm/ and cosmos/data/vfm/.
utils/vlm/fused_adam.py left in place — it uses a different TE module
path (te.pytorch.optimizers vs transformer_engine_torch) and unifying it
needs runtime kernel-equivalence verification.

Net: 2297 lines removed, 311 added. Adds a project-scoped skill at
.claude/skills/cosmos-utils-vlm-migration/ so future backports targeting
the deleted paths get redirected to the new layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
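
The merged optimizer.py is not included in this conversation, so the sketch below only illustrates the two ideas the commit names: regex-based selection of trainable versus frozen parameters, and coercing betas to a tuple. The function names, the precedence rule (a frozen match wins over a trainable match), and the guess that betas arrives as a list-like config value are all assumptions.

```python
import re
from typing import Iterable, Sequence


def split_by_regex(
    param_names: Iterable[str],
    trainable_params: Sequence[str] = (),
    frozen_params: Sequence[str] = (),
) -> tuple[list[str], list[str]]:
    """Partition parameter names into (trainable, frozen).

    Assumed semantics: when trainable_params is non-empty, a name must match at
    least one of its patterns to be trainable; a name matching any pattern in
    frozen_params is always frozen.
    """
    trainable, frozen = [], []
    for name in param_names:
        allowed = not trainable_params or any(
            re.search(pattern, name) for pattern in trainable_params
        )
        blocked = any(re.search(pattern, name) for pattern in frozen_params)
        (trainable if allowed and not blocked else frozen).append(name)
    return trainable, frozen


def normalize_betas(betas: Sequence[float]) -> tuple[float, float]:
    """Coerce a list-like config value into the 2-tuple form optimizers usually take."""
    beta1, beta2 = betas
    return (float(beta1), float(beta2))
```

In a PyTorch setting the returned name lists would typically be used to set requires_grad on the matching parameters; that step is omitted here to keep the sketch dependency-free.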
The pre-commit check-shebang-scripts-are-executable hook flagged
cosmos_training/cosmos/data/vfm/action/urdf_visualizer/viewer.py: it has
a `#!/usr/bin/env python` shebang but was tracked as 0644. Set the git
mode to 0755 so the shebang is honored when the file is invoked directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-commit check-executables-have-shebangs hook flagged 10 .py files
under cosmos_training/cosmos/model/vfm/tokenizers/audio/avae_utils/ as
marked executable but lacking shebangs. All of them are library modules
(SPDX headers, no `#!` line), so the executable bit was incorrect.
Set git mode to 0644 for the whole subtree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
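
The two pre-commit hooks in the commits above check opposite sides of the same invariant: a file should carry a shebang if and only if it is marked executable. A minimal sketch of that invariant follows. It inspects filesystem permission bits, whereas the real hooks work from the mode git tracks, and the function names and return labels are made up for illustration.

```python
import stat
from pathlib import Path


def has_shebang(path: Path) -> bool:
    """True when the file's first two bytes are '#!'."""
    with path.open("rb") as handle:
        return handle.read(2) == b"#!"


def is_executable(path: Path) -> bool:
    """True when the owner-execute bit is set on the file."""
    return bool(path.stat().st_mode & stat.S_IXUSR)


def classify(path: Path) -> str:
    """Report which side of the shebang/executable invariant a file violates, if any."""
    shebang, executable = has_shebang(path), is_executable(path)
    if shebang and not executable:
        return "shebang-not-executable"  # the viewer.py case: fix by setting 0755
    if executable and not shebang:
        return "executable-no-shebang"  # the avae_utils case: fix by setting 0644
    return "ok"
```

Which side to fix depends on intent: a script meant to be run directly gets the executable bit, while a library module loses it, matching the two resolutions chosen in the commits.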
- Root README: ./cosmos/{trainer,model,workers,algorithm} and bare
  ./cosmos were leftover from the cosmos → cosmos_training/cosmos
  relocation. Updated each URL to ./cosmos_training/cosmos/... so the
  links resolve again; display text left as cosmos/X/ (the Python
  package path).
- cosmos_training/cosmos/utils/one_logger/README.md:
  - packages/launcher/README.md doesn't exist in this repo (was an
    imaginaire4 monorepo path). Removed the link, kept "launcher" as
    plain text.
  - imaginaire/utils/callback.py was the old imaginaire4 location of
    OneLoggerCallback; it now lives at cosmos_training/cosmos/utils/
    callback.py. Updated link to the relative ../callback.py.

Clears the pre-commit MD057 (relative-link-exists) failures. Other stale
content in one_logger/README.md (imaginaire4 references, wandb team
links, etc.) is left for a separate documentation pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the same torchrun invocations as launch_vlm_llava_ov.sh and
launch_mixed_modality_sft_8b.sh in deterministic mode, parses rank-0
loss and global clip_grad_norm from the captured log, and asserts
against goldens inlined at the bottom of the file. VLM check is
limited to the first 2 iters (HF Hub streaming order drifts after that);
mixed-modality is bit-exact deterministic across all 10 iters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
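
The regression test file itself is not shown in this conversation. The sketch below illustrates the general approach the commit describes (parse per-iteration loss and clip_grad_norm from a captured log, then compare against inlined goldens), under loudly stated assumptions: the log line format, the function names, and the golden values in the usage note are all invented for illustration and do not come from the PR.

```python
import math
import re

# Assumed log line shape, e.g. "iter 3 loss=2.1234 clip_grad_norm=0.9876".
# The real trainer's format is not shown in this PR.
_NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
_LINE = re.compile(
    rf"iter\s+(?P<it>\d+).*?loss[=:]\s*(?P<loss>{_NUM})"
    rf".*?clip_grad_norm[=:]\s*(?P<gn>{_NUM})"
)


def parse_metrics(log_text: str) -> dict[int, tuple[float, float]]:
    """Map iteration number to (loss, clip_grad_norm) for every matching log line."""
    metrics = {}
    for line in log_text.splitlines():
        match = _LINE.search(line)
        if match:
            metrics[int(match["it"])] = (float(match["loss"]), float(match["gn"]))
    return metrics


def find_mismatches(metrics, goldens, first_n=None, rel_tol=0.0):
    """Compare parsed metrics to goldens; return a list of human-readable mismatches.

    first_n limits the check to the earliest golden iterations, mirroring the
    commit's choice to check only the first 2 iters of the VLM run.
    rel_tol=0.0 demands exact equality of the parsed floats.
    """
    mismatches = []
    for it in sorted(goldens)[:first_n]:
        got, want = metrics.get(it), goldens[it]
        if got is None or not all(
            math.isclose(g, w, rel_tol=rel_tol, abs_tol=0.0) for g, w in zip(got, want)
        ):
            mismatches.append(f"iter {it}: got {got}, want {want}")
    return mismatches
```

A test would then assert that find_mismatches(parse_metrics(log), goldens, first_n=2) is empty for a run whose later iterations are known to drift, and omit first_n for a run expected to be deterministic throughout. This sketch does not filter by rank; a real log with output from several ranks would need that step first.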
lfengad closed this May 19, 2026
2 participants