launch regression test #170
Closed
lfengad wants to merge 26 commits into
Bring the full Cosmos3 project tree into this repository: the cosmos3 Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config, Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly, justfile). Existing root files (README.md, RELEASE.md, cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes includes an LFS override for the preserved logo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Combine the Cosmos3 documentation (user guide, overview, setup, inference, models, modalities, CLI reference) from cosmos3-internal with the original notice pointing to the archived-ces2025 branch and the nvidia-cosmos GitHub organization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the existing cosmos3 package directory to cosmos-inference and
introduce a new top-level cosmos/ package skeleton matching the planned
framework layout:
  cosmos/
    model/, inference/, data/, trainer/
    algorithm/{loss,reward,rl}/
    controller/, workers/{simulations,rollout,reference,reward}/
    communicator/, checkpoint/, callbacks/, tools/
    utils/, evaluation/, launcher/
Also add root-level tests/ and tools/ placeholders. Each new Python
subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.
Note: imports inside cosmos-inference/ still reference `cosmos3.*` and
will need updating in a follow-up; pyproject.toml package config is
also unchanged for now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
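The skeleton described in the commit above could be generated with a short script along these lines. This is a sketch, not the tooling actually used for the commit: `create_skeleton` and `SUBPACKAGES` are hypothetical names, the directory list is transcribed from the commit message, and giving `cosmos/` itself and the intermediate `algorithm/` and `workers/` directories an `__init__.py` is an assumption.

```python
from pathlib import Path

# Subpackage paths transcribed from the commit message (relative to cosmos/).
SUBPACKAGES = [
    "model", "inference", "data", "trainer",
    "algorithm/loss", "algorithm/reward", "algorithm/rl",
    "controller",
    "workers/simulations", "workers/rollout", "workers/reference", "workers/reward",
    "communicator", "checkpoint", "callbacks", "tools",
    "utils", "evaluation", "launcher",
]


def create_skeleton(root: Path) -> list[Path]:
    """Create the cosmos/ tree plus root-level tests/ and tools/ placeholders."""
    created = []
    pkg_root = root / "cosmos"
    for rel in SUBPACKAGES:
        (pkg_root / rel).mkdir(parents=True, exist_ok=True)
    # Assumption: every directory under cosmos/ (including cosmos/ itself and
    # parents such as algorithm/) gets an empty __init__.py.
    package_dirs = [pkg_root, *sorted(p for p in pkg_root.rglob("*") if p.is_dir())]
    for directory in package_dirs:
        init = directory / "__init__.py"
        init.touch(exist_ok=True)
        created.append(init)
    # Root-level placeholders use .gitkeep instead of __init__.py.
    for name in ("tests", "tools"):
        placeholder = root / name / ".gitkeep"
        placeholder.parent.mkdir(parents=True, exist_ok=True)
        placeholder.touch(exist_ok=True)
        created.append(placeholder)
    return created
```

Calling `create_skeleton(Path("."))` from the repository root is idempotent, since existing directories and files are left in place.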
Move all cosmos3-internal-originated files at the repo root (dotfiles, config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/, AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile, LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock) into cosmos-inference/, so the inference package is self-contained.
Root now matches the planned framework layout:
- cosmos/ (new framework skeleton from prior commit)
- cosmos-inference/ (full Cosmos3 codebase)
- examples/, docs/, docker/, tests/, tools/ (fresh placeholders)
- README.md, RELEASE.md, cosmos-logo-thumbnail.png (originals)
Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation (previously merged into root README).
- cosmos-inference/.gitattributes: drop the stale `cosmos-logo-thumbnail.png` LFS override (the file lives at root, not inside cosmos-inference/).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock generated by `uv lock` resolving the single editable cosmos package. Brings the repo root in line with the planned layout: pyproject.toml, uv.lock, README.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace cosmos-inference/ contents with the exact snapshot of origin/main (99450113) from the cosmos3-internal repo to drop the stale flat-layout files and align with upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into cosmos-inference/ to pick up new action-policy datasets/configs, LoRA utility, learning-rate logger, VLM augmentors, and updated docs/configs. Excludes .git and __pycache__; pre-existing destination-only files (e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml, callbacks/vlm) were preserved rather than deleted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm and processors subtrees, eval scripts, and assorted training/inference updates. Excludes .git and __pycache__. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add 9 docs in docs/ (setup, code_structure, training, dataset, checkpoints, inference, configs, faq, examples) and READMEs in examples/, docker/, and tools/. All are section-heading skeletons to be filled in for the open-source training-infra release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/setup.md: port content from cosmos-inference/docs/setup.md adapted for the training-infra repo (new clone URL, train-only extras and CUDA groups, Recommended Base Image section, NGC base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with name "cosmos" and hatch packages ["cosmos"]; strip diffusers-cosmos3 from the train extra and tool.uv.sources (only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin versions match cosmos-inference exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README.md: rewrite as the framework entry point introducing the training-infra half (cosmos/, docs/) and the inference-infra half (cosmos-inference/), with a docs table connecting each docs/*.md to a one-line description and a Training-section Reference block for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout: repository layout, per-subpackage descriptions, and a "where to add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained checkpoints to inference, with quickstart, modalities, backends, and pointers into cosmos-inference/docs/*.md as the source of truth for full inference details.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OSS files copied from cosmos-inference (no cosmos3 references in any of them; copied as-is): LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md, .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc, .github/ (issue templates + pre-commit workflow).
Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream but persisted locally because earlier syncs ran without --delete; cosmos-inference/ is now an exact mirror of cosmos3-internal main (fc7f97d).
- README.md: before the Setup install commands, link System Requirements and Recommended Base Image (link-only, no inline details).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh
Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace cosmos3._src.imaginaire entry with cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and markers. Drops Args / init logging / seed fixtures that depend on cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as framework-level map (training-side cosmos/ and inference-side cosmos-inference/ tables, docs indices, and common-task tables for both halves).
- README.md: AGENTS.md reference now points at root ./AGENTS.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude cosmos-inference/ from pre-commit; addlicense now skips binary extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png LFS rule so the existing 25 KB git blob is preserved (rather than being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown linter disables MD013 / MD033 / MD040 (line length / inline HTML / fenced-code language) and excepts MD041 for README.md.
- addlicense: added SPDX headers to all cosmos/**/__init__.py skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regen and table-alignment fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md, docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line stub until the first tagged release.
- README.md: replace non-descriptive "here" link text with "one-shot quickstart" to satisfy MD059.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Note that the 5 cosmos3-* agent skills live under cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/ and activate when working inside that subtree; root-side skills can be added later as cosmos/ grows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml)
into this tree. Adds:
- 446 cosmos/*.py files (up from prior snapshot; covers callbacks/learning_rate_logger,
  data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero
  policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
    launch_mixed_modality_sft_8b.sh
    launch_vlm_llava_ov.sh
    launch_action_libero.sh
    launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential 4-smoke sweep runner
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory
  (auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside
  mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema
  (pretrained_weights nested dict, qk_norm, etc.)
Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier
  snapshot.
Verified that all 4 smoke runs train cleanly under --deterministic, with
per-iteration loss matching the cosmos_opensource and imaginaire4 source-side
runs to 4 decimal places on the byte-identical paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
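The sitecustomize.py mechanism mentioned in the commit above (an atexit hook that dumps sys.modules when LOAD_TRACE_DIR is set) might look roughly like the following. This is a sketch under assumptions, not the file from the PR: the function name `dump_loaded_modules` and the per-process output filename are hypothetical; only the LOAD_TRACE_DIR variable comes from the commit message.

```python
import atexit
import os
import sys
from pathlib import Path


def dump_loaded_modules(out_dir: str) -> Path:
    """Write the source path of every imported module to a per-process file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = set()
    for module in list(sys.modules.values()):
        # Built-in and namespace modules have no __file__; skip them.
        file = getattr(module, "__file__", None)
        if file:
            paths.add(os.path.abspath(file))
    # Hypothetical naming: one file per process so that parallel ranks
    # do not overwrite each other's output.
    target = out / f"modules_{os.getpid()}.txt"
    target.write_text("\n".join(sorted(paths)) + "\n")
    return target


# Install the hook only when tracing is requested via the environment.
_trace_dir = os.environ.get("LOAD_TRACE_DIR")
if _trace_dir:
    atexit.register(dump_loaded_modules, _trace_dir)
```

Python imports a module named sitecustomize automatically at startup if one is found on the import path, which is why putting the repository root on PYTHONPATH is enough to activate it. Files under the package tree that never appear in any dump are candidates for the dead-file inventory.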
The release tool writes cosmos_training_meta/files.txt on every run; it's purely a debugging artifact and shouldn't be in the tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Initial version from I4 code
Move the `cosmos` package from `cosmos/` at the repo root into `cosmos_training/cosmos/` so the training package source lives alongside its driver scripts, configs, and experiments. The package import name stays `cosmos`, so no import rewrites are needed. Also brings in the 5 scaffolding subdirs (algorithm, controller, evaluation, inference, workers) that were already at the root skeleton, so the new location has the full intended layout.
Update build/type-check config to follow:
- pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos
- pyrefly.toml: project-includes and search-path point at new location
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge the duplicated utils/vfm/vlm/ tree into utils/vlm/ and promote the DTensor-aware utils/vfm/fused_adam.py to top-level utils/fused_adam.py.
The two trees were diverged feature-forks (not stale copies); the merged files preserve the union of features from both:
- optimizer.py: trainable_params/frozen_params regex freeze and the tuple(config.betas) bugfix from vfm/vlm, freeze_llm_moe_gates from vlm
- pretrained_models_downloader.py: env-var-aware _load_s3_credentials, HF Hub fallback, and INTERNAL gate from vfm/vlm; resolve_hf_model_store and the GCS ETag streaming workaround from vlm
- flop_calculator.py: vfm/vlm version, now importing from the canonical cosmos.tools.flops.qwen3_vl with is_causal=False to preserve the bidirectional FLOP counts the dynamic batcher was calibrated against
- utils/vlm/compute_flops_qwen3vl.py removed (replaced by the canonical cosmos.tools.flops.qwen3_vl); numeric output verified bit-identical across dense/MoE x text/image/video cases
- constant.py and create_position_ids.py: docstrings and type-hints only, behavior unchanged
Redirected import sites in cosmos/model/vfm/ and cosmos/data/vfm/.
utils/vlm/fused_adam.py left in place: it uses a different TE module path (te.pytorch.optimizers vs transformer_engine_torch) and unifying it needs runtime kernel-equivalence verification.
Net: 2297 lines removed, 311 added.
Adds a project-scoped skill at .claude/skills/cosmos-utils-vlm-migration/ so future backports targeting the deleted paths get redirected to the new layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-commit check-shebang-scripts-are-executable hook flagged cosmos_training/cosmos/data/vfm/action/urdf_visualizer/viewer.py: it has a `#!/usr/bin/env python` shebang but was tracked as 0644. Set the git mode to 0755 so the shebang is honored when the file is invoked directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-commit check-executables-have-shebangs hook flagged 10 .py files under cosmos_training/cosmos/model/vfm/tokenizers/audio/avae_utils/ as marked executable but lacking shebangs. All of them are library modules (SPDX headers, no `#!` line), so the executable bit was incorrect. Set git mode to 0644 for the whole subtree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
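The two commits above fix opposite mismatches between a file's shebang and its executable bit. The check can be sketched as below; this is an illustration of the idea rather than the pre-commit hooks' implementation, the names `has_shebang`, `is_executable`, and `classify` are hypothetical, and the sketch inspects filesystem permissions whereas the commits change the mode recorded by git.

```python
import stat
from pathlib import Path


def has_shebang(path: Path) -> bool:
    """True if the file starts with the two bytes '#!'."""
    with path.open("rb") as f:
        return f.read(2) == b"#!"


def is_executable(path: Path) -> bool:
    """True if any of the user/group/other execute bits is set."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def classify(path: Path) -> str:
    """Report which of the two mismatches (if any) a file exhibits."""
    shebang, executable = has_shebang(path), is_executable(path)
    if shebang and not executable:
        return "shebang-not-executable"      # the viewer.py case: should be 0755
    if executable and not shebang:
        return "executable-without-shebang"  # the avae_utils case: should be 0644
    return "ok"
```

To change the mode that git tracks without relying on the working-tree permissions, `git update-index --chmod=+x <path>` (or `--chmod=-x`) can be used.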
- Root README: ./cosmos/{trainer,model,workers,algorithm} and bare
  ./cosmos were leftover from the cosmos → cosmos_training/cosmos
  relocation. Updated each URL to ./cosmos_training/cosmos/... so the
  links resolve again; display text left as cosmos/X/ (the Python
  package path).
- cosmos_training/cosmos/utils/one_logger/README.md:
  - packages/launcher/README.md doesn't exist in this repo (was an
    imaginaire4 monorepo path). Removed the link, kept "launcher" as
    plain text.
  - imaginaire/utils/callback.py was the old imaginaire4 location of
    OneLoggerCallback; it now lives at
    cosmos_training/cosmos/utils/callback.py. Updated link to the
    relative ../callback.py.
Clears the pre-commit MD057 (relative-link-exists) failures. Other stale
content in one_logger/README.md (imaginaire4 references, wandb team
links, etc.) is left for a separate documentation pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the same torchrun invocations as launch_vlm_llava_ov.sh and launch_mixed_modality_sft_8b.sh in deterministic mode, parses rank-0 loss and global clip_grad_norm from the captured log, and asserts against goldens inlined at the bottom of the file. VLM check is limited to the first 2 iters (HF Hub streaming order drifts after); mixed-modality is bit-exact deterministic across all 10 iters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
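The parse-and-compare step of the regression test described above can be sketched as follows. The log line format matched by `LINE_RE` is hypothetical (the trainer's real format is not shown in this PR), as are the function names; the `max_iters` argument models the commit's choice to check only the first 2 iterations for the VLM run.

```python
import math
import re

# Hypothetical log line format, e.g.:
#   [rank0] iter 3 loss=2.125 clip_grad_norm=0.75
LINE_RE = re.compile(
    r"\[rank0\] iter (?P<iter>\d+) "
    r"loss=(?P<loss>[-+.\deE]+) clip_grad_norm=(?P<norm>[-+.\deE]+)"
)


def parse_metrics(log_text: str) -> dict[int, tuple[float, float]]:
    """Return {iteration: (loss, clip_grad_norm)} for rank-0 lines only."""
    metrics = {}
    for match in LINE_RE.finditer(log_text):
        metrics[int(match["iter"])] = (float(match["loss"]), float(match["norm"]))
    return metrics


def check_against_goldens(metrics, goldens, max_iters=None, rel_tol=0.0):
    """Compare parsed metrics with goldens; return a list of mismatch messages.

    max_iters limits the check to the first N golden iterations.
    rel_tol=0.0 demands exact equality of the parsed floats.
    """
    failures = []
    for it in sorted(goldens)[:max_iters]:
        if it not in metrics:
            failures.append(f"iter {it}: missing from log")
            continue
        for label, got, want in zip(("loss", "clip_grad_norm"), metrics[it], goldens[it]):
            if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=0.0):
                failures.append(f"iter {it}: {label} {got!r} != golden {want!r}")
    return failures
```

A test would then assert that `check_against_goldens(...)` returns an empty list, passing `max_iters=2` for the VLM run and no limit for the mixed-modality run.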
No description provided.