launch regression test by lfengad · Pull Request #170 · NVIDIA/Cosmos

lfengad · 2026-05-18T14:48:53Z

No description provided.

Bring the full Cosmos3 project tree into this repository: the cosmos3 Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config, Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly, justfile). Existing root files (README.md, RELEASE.md, cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes includes an LFS override for the preserved logo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Combine the Cosmos3 documentation (user guide, overview, setup, inference, models, modalities, CLI reference) from cosmos3-internal with the original notice pointing to the archived-ces2025 branch and the nvidia-cosmos GitHub organization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rename the existing cosmos3 package directory to cosmos-inference and introduce a new top-level cosmos/ package skeleton matching the planned framework layout:

- cosmos/
  - model/, inference/, data/, trainer/
  - algorithm/{loss,reward,rl}/
  - controller/, workers/{simulations,rollout,reference,reward}/
  - communicator/, checkpoint/, callbacks/, tools/
  - utils/, evaluation/, launcher/

Also add root-level tests/ and tools/ placeholders. Each new Python subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.

Note: imports inside cosmos-inference/ still reference `cosmos3.*` and will need updating in a follow-up; the pyproject.toml package config is also unchanged for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move all cosmos3-internal-originated files at the repo root (dotfiles, config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/, AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile, LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock) into cosmos-inference/, so the inference package is self-contained.

Root now matches the planned framework layout:
- cosmos/ (new framework skeleton from prior commit)
- cosmos-inference/ (full Cosmos3 codebase)
- examples/, docs/, docker/, tests/, tools/ (fresh placeholders)
- README.md, RELEASE.md, cosmos-logo-thumbnail.png (originals)

Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation (previously merged into the root README).
- cosmos-inference/.gitattributes: drop the stale `cosmos-logo-thumbnail.png` LFS override (the file lives at the root, not inside cosmos-inference/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock generated by `uv lock` that resolves the single editable cosmos package. Brings the repo root in line with the planned layout: pyproject.toml, uv.lock, README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace cosmos-inference/ contents with the exact snapshot of origin/main (99450113) from the cosmos3-internal repo to drop the stale flat-layout files and align with upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into cosmos-inference/ to pick up new action-policy datasets/configs, LoRA utility, learning-rate logger, VLM augmentors, and updated docs/configs. Excludes .git and __pycache__; pre-existing destination-only files (e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml, callbacks/vlm) were preserved rather than deleted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm and processors subtrees, eval scripts, and assorted training/inference updates. Excludes .git and __pycache__. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add 9 docs in docs/ (setup, code_structure, training, dataset, checkpoints, inference, configs, faq, examples) and READMEs in examples/, docker/, and tools/. All are section-heading skeletons to be filled in for the open-source training-infra release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- docs/setup.md: port content from cosmos-inference/docs/setup.md, adapted for the training-infra repo (new clone URL, train-only extras and CUDA groups, Recommended Base Image section, NGC base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with name "cosmos" and hatch packages ["cosmos"]; strip diffusers-cosmos3 from the train extra and tool.uv.sources (it is only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin versions match cosmos-inference exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- README.md: rewrite as the framework entry point introducing the training-infra half (cosmos/, docs/) and the inference-infra half (cosmos-inference/), with a docs table connecting each docs/*.md to a one-line description and a Training-section Reference block for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout: repository layout, per-subpackage descriptions, and a "where to add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained checkpoints to inference, with quickstart, modalities, backends, and pointers into cosmos-inference/docs/*.md as the source of truth for full inference details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OSS files copied as-is from cosmos-inference (none of them contain cosmos3 references): LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md, .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc, .github/ (issue templates + pre-commit workflow).

Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream but persisted locally because earlier syncs ran without --delete; cosmos-inference/ is now an exact mirror of cosmos3-internal main (fc7f97d).
- README.md: before the Setup install commands, link System Requirements and Recommended Base Image (link-only, no inline details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
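Syncing without a delete step leaves destination-only files behind. The sketch below is not the tooling used in this PR; it lists files present in the mirror but absent from the source tree, skipping .git and __pycache__ as the syncs above did. `find_stale_files` is a hypothetical helper name.

```python
from pathlib import Path

# Directories the syncs excluded; anything beneath them is ignored on both sides.
EXCLUDED_DIRS = {".git", "__pycache__"}


def _relative_files(root: Path) -> set[Path]:
    """Return every file under root as a path relative to root, skipping excluded dirs."""
    files = set()
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip anything inside an excluded directory at any depth.
        if any(part in EXCLUDED_DIRS for part in rel.parts):
            continue
        if path.is_file():
            files.add(rel)
    return files


def find_stale_files(source: Path, mirror: Path) -> list[Path]:
    """Files that exist in the mirror but not in the source (candidates for deletion)."""
    return sorted(_relative_files(mirror) - _relative_files(source))
```

Running something like `find_stale_files(Path("~/work/cosmos3-internal").expanduser(), Path("cosmos-inference"))` before a sync should surface leftovers of the kind removed above.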

No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh

Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace the cosmos3._src.imaginaire entry with cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and markers. Drops the Args / init logging / seed fixtures that depend on cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as a framework-level map (training-side cosmos/ and inference-side cosmos-inference/ tables, docs indices, and common-task tables for both halves).
- README.md: the AGENTS.md reference now points at root ./AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude cosmos-inference/ from pre-commit; addlicense now skips binary extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png LFS rule so the existing 25 KB git blob is preserved (rather than being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown linter disables MD013 / MD033 / MD040 (line length / inline HTML / fenced-code language) and exempts README.md from MD041.
- addlicense: added SPDX headers to all cosmos/**/__init__.py skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regeneration and table-alignment fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md, docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line stub until the first tagged release.
- README.md: replace the non-descriptive "here" link text with "one-shot quickstart" to satisfy MD059.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Note that the 5 cosmos3-* agent skills live under cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/ and activate when working inside that subtree; root-side skills can be added later as cosmos/ grows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml) into this tree.

Adds:
- 446 cosmos/*.py files (up from the prior snapshot; covers callbacks/learning_rate_logger, data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
  - launch_mixed_modality_sft_8b.sh
  - launch_vlm_llava_ov.sh
  - launch_action_libero.sh
  - launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential sweep runner for the 4 smoke tests
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory (auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema (pretrained_weights nested dict, qk_norm, etc.)

Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier snapshot.

Verified that all 4 smoke tests train cleanly under --deterministic, with iter-by-iter loss matching the cosmos_opensource and imaginaire4 source-side runs to 4 decimals on the byte-identical paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
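The sitecustomize.py dumper can be pictured roughly as follows. This is a sketch, not the file in this PR: only `LOAD_TRACE_DIR` comes from the commit message, while the output file name (`loaded_modules_<pid>.txt`) and the one-path-per-line format are assumptions.

```python
# sitecustomize.py (sketch): Python imports this module automatically at startup
# when it is importable, e.g. when the repo root is on PYTHONPATH.
import atexit
import os
import sys
from pathlib import Path


def dump_loaded_modules(out_dir: str) -> Path:
    """Write the source file of every module in sys.modules to a per-process file."""
    paths = set()
    for module in list(sys.modules.values()):
        file = getattr(module, "__file__", None)
        if file:  # built-in and namespace modules have no usable __file__
            paths.add(os.path.abspath(file))
    out = Path(out_dir) / f"loaded_modules_{os.getpid()}.txt"  # assumed naming scheme
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(sorted(paths)) + "\n")
    return out


_trace_dir = os.environ.get("LOAD_TRACE_DIR")
if _trace_dir:
    # Dump at interpreter exit, after the run has imported everything it needs.
    atexit.register(dump_loaded_modules, _trace_dir)
```

Taking the union of these dumps across the smoke runs and diffing it against the full list of cosmos/*.py files would then give a candidate dead-file inventory.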

The release tool writes cosmos_training_meta/files.txt on every run; it's purely a debugging artifact and shouldn't be in the tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Initial version from I4 code

Move the `cosmos` package from `cosmos/` at the repo root into `cosmos_training/cosmos/` so the training package source lives alongside its driver scripts, configs, and experiments. The package import name stays `cosmos`, so no import rewrites are needed. Also brings in the 5 scaffolding subdirs (algorithm, controller, evaluation, inference, workers) that were already in the root skeleton, so the new location has the full intended layout.

Update build/type-check config to follow:
- pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos
- pyrefly.toml: project-includes and search-path point at the new location

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge the duplicated utils/vfm/vlm/ tree into utils/vlm/ and promote the DTensor-aware utils/vfm/fused_adam.py to top-level utils/fused_adam.py. The two trees were diverged feature forks (not stale copies); the merged files preserve the union of features from both:
- optimizer.py: trainable_params/frozen_params regex freeze and the tuple(config.betas) bugfix from vfm/vlm; freeze_llm_moe_gates from vlm
- pretrained_models_downloader.py: env-var-aware _load_s3_credentials, HF Hub fallback, and INTERNAL gate from vfm/vlm; resolve_hf_model_store and the GCS ETag streaming workaround from vlm
- flop_calculator.py: vfm/vlm version, now importing from the canonical cosmos.tools.flops.qwen3_vl with is_causal=False to preserve the bidirectional FLOP counts the dynamic batcher was calibrated against
- utils/vlm/compute_flops_qwen3vl.py: removed (replaced by the canonical cosmos.tools.flops.qwen3_vl); numeric output verified bit-identical across dense/MoE x text/image/video cases
- constant.py and create_position_ids.py: docstrings and type hints only, behavior unchanged

Redirected import sites in cosmos/model/vfm/ and cosmos/data/vfm/. utils/vlm/fused_adam.py is left in place: it uses a different TE module path (te.pytorch.optimizers vs transformer_engine_torch), and unifying it needs runtime kernel-equivalence verification.

Net: 2297 lines removed, 311 added. Adds a project-scoped skill at .claude/skills/cosmos-utils-vlm-migration/ so future backports targeting the deleted paths get redirected to the new layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
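A regex-based freeze of the kind listed for optimizer.py might look like the sketch below. The semantics are assumptions, not read from the PR: a name matching any frozen pattern is frozen, and when trainable patterns are given, only names matching one of them stay trainable. `apply_param_freeze` and `FakeParam` are hypothetical; `FakeParam` stands in for a torch parameter so the sketch runs without torch.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter: only the requires_grad flag matters here."""
    requires_grad: bool = True


def apply_param_freeze(
    named_params: Iterable[tuple[str, FakeParam]],
    trainable_patterns: Sequence[str] = (),
    frozen_patterns: Sequence[str] = (),
) -> list[str]:
    """Freeze parameters by regex and return the names that remain trainable.

    Assumed semantics: frozen patterns take priority; if trainable patterns are
    given, anything not matching one of them is frozen too. Nothing is unfrozen.
    """
    trainable = [re.compile(p) for p in trainable_patterns]
    frozen = [re.compile(p) for p in frozen_patterns]
    kept = []
    for name, param in named_params:
        if any(rx.search(name) for rx in frozen):
            param.requires_grad = False
        elif trainable and not any(rx.search(name) for rx in trainable):
            param.requires_grad = False
        if param.requires_grad:
            kept.append(name)
    return kept
```

With torch, the same loop would run over `model.named_parameters()`, and the optimizer would then be built only from parameters that still have `requires_grad` set.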

The pre-commit check-shebang-scripts-are-executable hook flagged cosmos_training/cosmos/data/vfm/action/urdf_visualizer/viewer.py: it has a `#!/usr/bin/env python` shebang but was tracked as 0644. Set the git mode to 0755 so the shebang is honored when the file is invoked directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pre-commit check-executables-have-shebangs hook flagged 10 .py files under cosmos_training/cosmos/model/vfm/tokenizers/audio/avae_utils/ as marked executable but lacking shebangs. All of them are library modules (SPDX headers, no `#!` line), so the executable bit was incorrect. Set git mode to 0644 for the whole subtree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
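The two pre-commit hooks above enforce opposite halves of one invariant: a file with a `#!` line should be executable, and an executable file should have one. A rough local equivalent is sketched below; it is not the pre-commit-hooks implementation, it reads modes from the filesystem (POSIX permissions assumed), and `shebang_mode_mismatches` is a hypothetical name.

```python
import stat
from pathlib import Path

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def has_shebang(path: Path) -> bool:
    """True if the file starts with the two bytes '#!'."""
    with path.open("rb") as f:
        return f.read(2) == b"#!"


def is_executable(path: Path) -> bool:
    """True if any of the user/group/other execute bits is set."""
    return bool(path.stat().st_mode & EXEC_BITS)


def shebang_mode_mismatches(paths):
    """Return (path, problem) pairs where the shebang and the executable bit disagree."""
    problems = []
    for p in map(Path, paths):
        shebang, executable = has_shebang(p), is_executable(p)
        if shebang and not executable:
            problems.append((p, "shebang but not executable (fix: mode 0755)"))
        elif executable and not shebang:
            problems.append((p, "executable but no shebang (fix: mode 0644)"))
    return problems
```

In git, the recorded mode can be flipped without touching the working tree via `git update-index --chmod=+x <path>` (or `--chmod=-x`).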

- Root README: the ./cosmos/{trainer,model,workers,algorithm} and bare ./cosmos links were left over from the cosmos → cosmos_training/cosmos relocation. Updated each URL to ./cosmos_training/cosmos/... so the links resolve again; display text left as cosmos/X/ (the Python package path).
- cosmos_training/cosmos/utils/one_logger/README.md:
  - packages/launcher/README.md doesn't exist in this repo (it was an imaginaire4 monorepo path). Removed the link, kept "launcher" as plain text.
  - imaginaire/utils/callback.py was the old imaginaire4 location of OneLoggerCallback; it now lives at cosmos_training/cosmos/utils/callback.py. Updated the link to the relative ../callback.py.

Clears the pre-commit MD057 (relative-link-exists) failures. Other stale content in one_logger/README.md (imaginaire4 references, wandb team links, etc.) is left for a separate documentation pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
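MD057 flags relative links whose targets don't exist. The following simplified sketch of that check is not rumdl's implementation; it assumes inline `[text](target)` links only (no reference-style links or angle-bracket targets), and `missing_relative_links` is a hypothetical name.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Inline markdown links: [text](target), with an optional "title" after the target.
LINK_RE = re.compile(r"\[[^\]]*\]\(\s*([^)\s]+)(?:\s+\"[^\"]*\")?\s*\)")


def missing_relative_links(markdown_file: Path) -> list[str]:
    """Return relative link targets in markdown_file that do not exist on disk."""
    missing = []
    for target in LINK_RE.findall(markdown_file.read_text(encoding="utf-8")):
        parsed = urlparse(target)
        # Skip absolute URLs (https:, mailto:, ...) and in-page anchors like "#usage".
        if parsed.scheme or parsed.netloc or not parsed.path:
            continue
        # parsed.path drops any "#fragment" or "?query" before the existence check.
        if not (markdown_file.parent / parsed.path).exists():
            missing.append(target)
    return missing
```

Run against the one_logger README before this commit, a check like this would be expected to report packages/launcher/README.md, since resolution is relative to the README's own directory.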

Runs the same torchrun invocations as launch_vlm_llava_ov.sh and launch_mixed_modality_sft_8b.sh in deterministic mode, parses the rank-0 loss and global clip_grad_norm from the captured log, and asserts against goldens inlined at the bottom of the file. The VLM check is limited to the first 2 iterations (the HF Hub streaming order drifts after that); mixed-modality is bit-exact deterministic across all 10 iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
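The parse-and-compare step might look like the sketch below. The log line format (`[rank0] iter=… loss=… clip_grad_norm=…`), the helper names, and any golden values are hypothetical; the real test's log format and inlined goldens are not shown on this page. The structure follows the description: keep only rank-0 lines, then compare a limited number of iterations (2 for VLM, all 10 for mixed-modality).

```python
import re

# Hypothetical log format: "[rank0] iter=3 loss=2.1234 clip_grad_norm=0.8765"
METRIC_RE = re.compile(
    r"\[rank(?P<rank>\d+)\].*?iter=(?P<iter>\d+).*?loss=(?P<loss>[-+0-9.eE]+)"
    r".*?clip_grad_norm=(?P<gnorm>[-+0-9.eE]+)"
)


def parse_rank0_metrics(log_text: str) -> dict[int, tuple[float, float]]:
    """Map iteration -> (loss, clip_grad_norm), keeping only rank-0 lines."""
    metrics = {}
    for line in log_text.splitlines():
        m = METRIC_RE.search(line)
        if m and m.group("rank") == "0":
            metrics[int(m.group("iter"))] = (float(m.group("loss")), float(m.group("gnorm")))
    return metrics


def check_against_goldens(metrics, goldens, num_iters, atol=0.0):
    """Assert the first num_iters golden iterations match within atol (0.0 = exact)."""
    for it in sorted(goldens)[:num_iters]:
        assert it in metrics, f"iteration {it} missing from log"
        for name, got, want in zip(("loss", "clip_grad_norm"), metrics[it], goldens[it]):
            assert abs(got - want) <= atol, f"iter {it} {name}: got {got}, want {want}"
```

A VLM check would call `check_against_goldens(metrics, goldens, num_iters=2)`; for comparisons that only need to agree to 4 decimals rather than bit-exactly, a small `atol` is the knob to loosen.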

lfengad and others added 26 commits May 11, 2026 20:51

Add minimal release

c82be1c

gitignore: untrack cosmos_training_meta/

f831402


Merge branch 'main' into yangyangt/minimal_release

ec9841d

Merge pull request #2 from nvidia-cosmos/yangyangt/minimal_release

6479f77

Initial version from I4 code

lfengad closed this May 19, 2026

lfengad wants to merge 26 commits into NVIDIA:main from nvidia-cosmos:liangf/launch-regression-test

lfengad commented May 18, 2026


2 participants
