Pengcuo/toml into main by pengcuo · Pull Request #172 · NVIDIA/Cosmos

pengcuo · 2026-05-19T12:17:31Z

Merge config into the TOML file and remove vae_path.

Bring the full Cosmos3 project tree into this repository: the cosmos3 Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config, Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly, justfile). Existing root files (README.md, RELEASE.md, cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes includes an LFS override for the preserved logo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge Cosmos3 README content with deprecated-repo notice

Combine the Cosmos3 documentation (user guide, overview, setup, inference, models, modalities, CLI reference) from cosmos3-internal with the original notice pointing to the archived-ces2025 branch and the nvidia-cosmos GitHub organization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Restructure: rename cosmos3 to cosmos-inference, add cosmos/ skeleton

Rename the existing cosmos3 package directory to cosmos-inference and introduce a new top-level cosmos/ package skeleton matching the planned framework layout:

  cosmos/
    model/, inference/, data/, trainer/
    algorithm/{loss,reward,rl}/
    controller/, workers/{simulations,rollout,reference,reward}/
    communicator/, checkpoint/, callbacks/, tools/
    utils/, evaluation/, launcher/

Also add root-level tests/ and tools/ placeholders. Each new Python subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.

Note: imports inside cosmos-inference/ still reference `cosmos3.*` and will need updating in a follow-up; pyproject.toml package config is also unchanged for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Consolidate Cosmos3 content under cosmos-inference/, restore root layout

Move all cosmos3-internal-originated files at the repo root (dotfiles, config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/, AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile, LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock) into cosmos-inference/, so the inference package is self-contained.

Root now matches the planned framework layout:
- cosmos/ (new framework skeleton from prior commit)
- cosmos-inference/ (full Cosmos3 codebase)
- examples/, docs/, docker/, tests/, tools/ (fresh placeholders)
- README.md, RELEASE.md, cosmos-logo-thumbnail.png (originals)

Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation (previously merged into root README).
- cosmos-inference/.gitattributes: drop the stale `cosmos-logo-thumbnail.png` LFS override (the file lives at root, not inside cosmos-inference/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add stub pyproject.toml and uv.lock for cosmos/ package at root

Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock generated by `uv lock` resolving the single editable cosmos package. Brings the repo root in line with the planned layout: pyproject.toml, uv.lock, README.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sync cosmos-inference/ with cosmos3-internal origin/main

Replace cosmos-inference/ contents with the exact snapshot of origin/main (99450113) from the cosmos3-internal repo to drop the stale flat-layout files and align with upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cosmos_training: re-release with 4 verified smokes + deterministic mode

Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml) into this tree.

Adds:
- 446 cosmos/*.py files (up from prior snapshot; covers callbacks/learning_rate_logger, data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
    launch_mixed_modality_sft_8b.sh
    launch_vlm_llava_ov.sh
    launch_action_libero.sh
    launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential 4-smoke sweep runner
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory (auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema (pretrained_weights nested dict, qk_norm, etc.)

Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier snapshot.

Verified all 4 smokes train clean under --deterministic, with iter-by-iter loss matching the cosmos_opensource and imaginaire4 source-side runs to 4 decimals on the byte-identical paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gitignore: untrack cosmos_training_meta/

The release tool writes cosmos_training_meta/files.txt on every run; it's purely a debugging artifact and shouldn't be in the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
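
The sitecustomize.py item above describes an atexit hook that dumps sys.modules when LOAD_TRACE_DIR is set. A minimal sketch of that idea follows; the function name `dump_loaded_modules` and the output file name are hypothetical, and the real script may record different fields.

```python
import atexit
import os
import sys
from pathlib import Path


def dump_loaded_modules(out_dir):
    """Write the source path of every imported module, one per line."""
    paths = set()
    for module in list(sys.modules.values()):
        # Built-in and namespace modules have no usable __file__; skip them.
        file_path = getattr(module, "__file__", None)
        if file_path:
            paths.add(file_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per process so ranks do not collide.
    target = out / f"modules_{os.getpid()}.txt"
    target.write_text("\n".join(sorted(paths)) + "\n")
    return target


# Only register the hook when the environment variable is set, so normal
# runs pay no cost.
trace_dir = os.environ.get("LOAD_TRACE_DIR")
if trace_dir:
    atexit.register(dump_loaded_modules, trace_dir)
```

Files that never appear in any dump across the smoke runs are candidates for the dead-file inventory; they still need a manual check, since lazily imported modules can be missed.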

Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into cosmos-inference/ to pick up new action-policy datasets/configs, LoRA utility, learning-rate logger, VLM augmentors, and updated docs/configs. Excludes .git and __pycache__; pre-existing destination-only files (e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml, callbacks/vlm) were preserved rather than deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sync cosmos-inference/ with cosmos3-internal main (fc7f97d)

Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm and processors subtrees, eval scripts, and assorted training/inference updates. Excludes .git and __pycache__.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scaffold OSS documentation skeleton

Add 9 docs in docs/ (setup, code_structure, training, dataset, checkpoints, inference, configs, faq, examples) and READMEs in examples/, docker/, and tools/. All are section-heading skeletons to be filled in for the open-source training-infra release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fill in docs/setup.md and sync root uv env with cosmos-inference

- docs/setup.md: port content from cosmos-inference/docs/setup.md adapted for the training-infra repo (new clone URL, train-only extras and CUDA groups, Recommended Base Image section, NGC base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with name "cosmos" and hatch packages ["cosmos"]; strip diffusers-cosmos3 from the train extra and tool.uv.sources (only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin versions match cosmos-inference exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Frame root as framework, fill code_structure and inference docs

- README.md: rewrite as the framework entry point introducing the training-infra half (cosmos/, docs/) and the inference-infra half (cosmos-inference/), with a docs table connecting each docs/*.md to a one-line description and a Training-section Reference block for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout: repository layout, per-subpackage descriptions, and a "where to add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained checkpoints to inference, with quickstart, modalities, backends, and pointers into cosmos-inference/docs/*.md as the source of truth for full inference details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add OSS hygiene files at root and mirror cosmos-inference with upstream

OSS files copied from cosmos-inference (no cosmos3 references in any of them; copied as-is): LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md, .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc, .github/ (issue templates + pre-commit workflow).

Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream but persisted locally because earlier syncs ran without --delete; cosmos-inference/ is now an exact mirror of cosmos3-internal main (fc7f97d).
- README.md: before the Setup install commands, link System Requirements and Recommended Base Image (link-only, no inline details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port remaining dev configs and Docker assets from cosmos-inference

No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh

Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace cosmos3._src.imaginaire entry with cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and markers. Drops Args / init logging / seed fixtures that depend on cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as framework-level map (training-side cosmos/ and inference-side cosmos-inference/ tables, docs indices, and common-task tables for both halves).
- README.md: AGENTS.md reference now points at root ./AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply pre-commit fixes; add rumdl config and .rumdl_cache gitignore

- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude cosmos-inference/ from pre-commit; addlicense now skips binary extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png LFS rule so the existing 25 KB git blob is preserved (rather than being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown linter disables MD013 / MD033 / MD040 (line length / inline HTML / fenced-code language) and excepts MD041 for README.md.
- addlicense: added SPDX headers to all cosmos/**/__init__.py skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regen and table-alignment fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md, docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line stub until the first tagged release.
- README.md: replace non-descriptive "here" link text with "one-shot quickstart" to satisfy MD059.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AGENTS.md: add pointer to inference-side skills

Note that the 5 cosmos3-* agent skills live under cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/ and activate when working inside that subtree; root-side skills can be added later as cosmos/ grows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the `cosmos` package from `cosmos/` at the repo root into `cosmos_training/cosmos/` so the training package source lives alongside its driver scripts, configs, and experiments. The package import name stays `cosmos` — no import rewrites needed. Also brings in the 5 scaffolding subdirs (algorithm, controller, evaluation, inference, workers) that were already at the root skeleton, so the new location has the full intended layout. Update build/type-check config to follow: - pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos - pyrefly.toml: project-includes and search-path point at new location Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runs the same torchrun invocations as launch_vlm_llava_ov.sh and launch_mixed_modality_sft_8b.sh in deterministic mode, parses rank-0 loss and global clip_grad_norm from the captured log, and asserts against goldens inlined at the bottom of the file. VLM check is limited to the first 2 iters (HF Hub streaming order drifts after); mixed-modality is bit-exact deterministic across all 10 iters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
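
The golden check described above (parse per-iteration loss and grad norm from a captured log, then compare against inlined goldens) can be sketched roughly as follows. The log line shape, the regex, and the function names are assumptions for illustration only; the actual test parses whatever format the trainer emits.

```python
import re

# Assumed log line shape (hypothetical): "iter 3: loss=1.2345 grad_norm=0.6789"
LINE_RE = re.compile(
    r"iter\s+(\d+).*?loss[=:\s]+([-+0-9.eE]+).*?grad_norm[=:\s]+([-+0-9.eE]+)"
)


def parse_metrics(log_text):
    """Return {iteration: (loss, grad_norm)} for every matching log line."""
    metrics = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            metrics[int(match.group(1))] = (
                float(match.group(2)),
                float(match.group(3)),
            )
    return metrics


def find_mismatches(metrics, goldens, first_n=None, tol=0.0):
    """Return the golden iterations whose parsed values are missing or differ.

    first_n limits the check to the earliest N golden iterations, which is
    how a run that drifts later on could still be checked on its stable prefix.
    """
    bad = []
    for it in sorted(goldens)[:first_n]:
        got = metrics.get(it)
        want = goldens[it]
        if got is None or any(abs(g - w) > tol for g, w in zip(got, want)):
            bad.append(it)
    return bad
```

With `tol=0.0` the comparison is exact, matching the bit-exact claim for the mixed-modality run; passing `first_n=2` mirrors the VLM check that only covers the first 2 iterations.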

…stems

Squash of liangf/utils-cleanup branch (originally 13 commits, PR #5):
- consolidate utils/vfm/vlm into utils/vlm and utils/vfm/fused_adam into utils/fused_adam (bfb043d)
- drop dead utils/optim_instantiate.py and utils/configs/ (b66d9f8)
- drop dead utils/one_logger/ subsystem (a4f7ea1)
- drop orphan training_telemetry/context_managers.py (1c55568)
- drop three more orphans: vlm/flop_calculator + env_parsers (customization, inference) (004b14d)
- add SPDX/Apache-2.0 headers to files missing them (0900113)
- docs: fix broken relative links in READMEs (21639f7)
- docs: fix stale references in training_telemetry README (de51b60)
- mark viewer.py executable to match its shebang (0e7b015)
- clear executable bit on avae_utils library modules (63cfbfb)
- skill: broaden cosmos-utils-vlm-migration to cover follow-up deletions (13dc960)
- pre-commit: exclude .claude/ from project hooks (d3be55a)
- lint (52ed386)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a `--toml` flag to scripts/train.py that loads training config from TOML. Two modes: standalone (full structured TOML) or paired with `--config=<py>`, where the TOML is a flat interface-schema file that scripts/interface_toml.py translates to Hydra overrides (variant auto-detected from the --config path: vfm-vlm vs vfm-base).

Files:
  cosmos/utils/serialization.py - from_toml() + Literal type handling
  cosmos/utils/config.py - TOML branch in load_config + _reload_make_config_for_registrations for ...defaults.config -> ...config sibling resolution
  scripts/train.py - --toml argparse + dispatch
  scripts/interface_toml.py - interface TOML -> Hydra override translator with vlm/base mapping dicts
  toml/{vfm,vlm}_example.toml - schema-shaped templates
  toml/mixed_modality_sft_8b.toml + run_mixed_modality_sft_trace_toml.sh - TOML launcher for the trace experiment
  toml/alignment_test.toml + toml/run_alignment_test.sh - --dryrun parity test: byte-identical config.yaml between direct CLI and TOML
  .gitignore - drop tracked __pycache__/*.pyc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Push the wiring that lived in shell launch scripts down into the interface TOMLs and their target experiments, so each launch script is just `torchrun ... --toml=<file>`.

interface_toml.py
- read_config_py_from_toml: top-level `config = "vfm"|"vlm"` field selects the paired configs/base/...config.py, so launch scripts no longer need --config=.
- read_path_overrides: top-level checkpoint_path / wan_vae_path / dataset_jsonl / model_path replace per-script bash variables and CLI tail overrides.
- translate_interface_toml strips these launcher-directive keys before flattening so they never leak into Hydra overrides.

train.py
- When only --toml is given, resolve the paired config.py from the TOML's `config` field; fall back to structured-TOML load otherwise.
- _apply_path_overrides mutates the resolved Config post-load: checkpoint.load_path, model.config.tokenizer.vae_path, and model.config.policy.backbone.model_name are fixed attribute paths; dataset_jsonl is routed by hasattr probe on dataloader_train so t2w (data_source.jsonl_paths) and mixed_modality (dataloader.datasets.video.dataset.jsonl_paths) share one key.

Experiments absorb the "local-only mode" CLI tail
- libero_policy_datapacker_experiment.py: tokenizer.bucket_name / object_store_credential_path_pretrained="", checkpoint.{load,save}_to_object_store=dict(enabled=False), vlm_config.tokenizer=dict(config_variant="hf"), vlm_config.pretrained_weights.enabled=False.
- t2w_sft_8b_local_datapacker.py: same set, minus the two that were already encoded (vlm_config.pretrained_weights.enabled=False and data_source.num_video_frames=61).
- llava_ov_datapacker_experiment.py: checkpoint.{load,save}_to_object_store deep-merge stubs.

5 launch scripts now end with `--toml=<file> [--deterministic]`: no --config=, no bash CHECKPOINT_PATH/WAN_VAE_PATH/DATASET_JSONL/MODEL_PATH declarations, no CLI tail overrides. The MAX_ITER env var on run_mixed_modality_sft_trace_toml.sh is gone; max_iter is driven by the TOML's [train].max_iter via the existing interface mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
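
The two mechanisms described above, stripping launcher-directive keys before flattening and routing dataset_jsonl by a hasattr probe, can be sketched as below. The helper names are hypothetical, `SimpleNamespace` stands in for the real Config object, and treating jsonl_paths as a one-element list is an assumption.

```python
from types import SimpleNamespace

# Top-level keys named in the commit message as launcher directives.
DIRECTIVE_KEYS = (
    "config",
    "checkpoint_path",
    "wan_vae_path",
    "dataset_jsonl",
    "model_path",
)


def split_directives(table):
    """Separate launcher directives from keys that become overrides."""
    directives = {k: table[k] for k in DIRECTIVE_KEYS if k in table}
    rest = {k: v for k, v in table.items() if k not in DIRECTIVE_KEYS}
    return directives, rest


def apply_dataset_jsonl(config, path):
    """Route one dataset_jsonl value to whichever layout the config uses."""
    loader = config.dataloader_train
    if hasattr(loader, "data_source"):
        # t2w-style layout: data_source.jsonl_paths
        loader.data_source.jsonl_paths = [path]
    elif hasattr(loader, "dataloader"):
        # mixed_modality-style layout
        loader.dataloader.datasets.video.dataset.jsonl_paths = [path]
    else:
        raise AttributeError("dataloader_train has no known jsonl_paths slot")


# Usage with a stand-in config object:
t2w_config = SimpleNamespace(
    dataloader_train=SimpleNamespace(
        data_source=SimpleNamespace(jsonl_paths=[])
    )
)
apply_dataset_jsonl(t2w_config, "train.jsonl")
```

Probing the object rather than branching on the experiment name lets one TOML key serve both layouts, at the cost of failing only at load time if a third layout appears.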

lfengad and others added 9 commits May 19, 2026 02:48

Do not set ckpt_type to dummy for VFM.

de21558

pengcuo closed this May 19, 2026

pengcuo wants to merge 9 commits into NVIDIA:main from nvidia-cosmos:pengcuo/toml_into_main

pengcuo commented May 19, 2026


4 participants
