Pengcuo/toml into main #172
Closed
pengcuo wants to merge 9 commits into
Bring the full Cosmos3 project tree into this repository: the cosmos3
Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config,
Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly,
justfile). Existing root files (README.md, RELEASE.md,
cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes
includes an LFS override for the preserved logo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge Cosmos3 README content with deprecated-repo notice
Combine the Cosmos3 documentation (user guide, overview, setup,
inference, models, modalities, CLI reference) from cosmos3-internal
with the original notice pointing to the archived-ces2025 branch and
the nvidia-cosmos GitHub organization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restructure: rename cosmos3 to cosmos-inference, add cosmos/ skeleton
Rename the existing cosmos3 package directory to cosmos-inference and
introduce a new top-level cosmos/ package skeleton matching the planned
framework layout:
cosmos/
model/, inference/, data/, trainer/
algorithm/{loss,reward,rl}/
controller/, workers/{simulations,rollout,reference,reward}/
communicator/, checkpoint/, callbacks/, tools/
utils/, evaluation/, launcher/
Also add root-level tests/ and tools/ placeholders. Each new Python
subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.
Note: imports inside cosmos-inference/ still reference `cosmos3.*` and
will need updating in a follow-up; pyproject.toml package config is
also unchanged for now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate Cosmos3 content under cosmos-inference/, restore root layout
Move all cosmos3-internal-originated files at the repo root (dotfiles,
config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/,
AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile,
LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock)
into cosmos-inference/, so the inference package is self-contained.
Root now matches the planned framework layout:
- cosmos/ (new framework skeleton from prior commit)
- cosmos-inference/ (full Cosmos3 codebase)
- examples/, docs/, docker/, tests/, tools/ (fresh placeholders)
- README.md, RELEASE.md, cosmos-logo-thumbnail.png (originals)
Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation
(previously merged into root README).
- cosmos-inference/.gitattributes: drop the stale
`cosmos-logo-thumbnail.png` LFS override (the file lives at root, not
inside cosmos-inference/).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add stub pyproject.toml and uv.lock for cosmos/ package at root
Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton
package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock
generated by `uv lock` resolving the single editable cosmos package.
Brings the repo root in line with the planned layout:
pyproject.toml, uv.lock, README.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sync cosmos-inference/ with cosmos3-internal origin/main
Replace cosmos-inference/ contents with the exact snapshot of origin/main
(99450113) from the cosmos3-internal repo to drop the stale flat-layout
files and align with upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cosmos_training: re-release with 4 verified smokes + deterministic mode
Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml)
into this tree. Adds:
- 446 cosmos/*.py files (up from prior snapshot; covers callbacks/learning_rate_logger,
data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero
policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
launch_mixed_modality_sft_8b.sh
launch_vlm_llava_ov.sh
launch_action_libero.sh
launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential 4-smoke sweep runner
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory
(auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside
mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema
(pretrained_weights nested dict, qk_norm, etc.)
Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier
snapshot.
Verified all 4 smokes train clean under --deterministic, with iter-by-iter loss
matching the cosmos_opensource and imaginaire4 source-side runs to 4 decimals on
the byte-identical paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
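The sitecustomize.py hook described in the commit above can be pictured roughly as follows. This is a minimal sketch, not the file from this PR: the helper names and the `modules.<pid>.txt` output file name are assumptions, and only the `LOAD_TRACE_DIR` variable and the atexit-based `sys.modules` dump come from the commit message.

```python
"""Sketch of a sitecustomize.py-style module inventory dumper (assumed behavior)."""
import atexit
import os
import sys
from pathlib import Path


def loaded_module_files() -> list[str]:
    """Return the sorted source paths of every module currently in sys.modules."""
    paths = set()
    for module in list(sys.modules.values()):
        # Built-in and namespace modules have no usable __file__; skip them.
        file = getattr(module, "__file__", None)
        if file:
            paths.add(os.path.abspath(file))
    return sorted(paths)


def dump_loaded_modules(trace_dir: str) -> Path:
    """Write one path per line to <trace_dir>/modules.<pid>.txt (hypothetical name)."""
    out_dir = Path(trace_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"modules.{os.getpid()}.txt"
    out_file.write_text("\n".join(loaded_module_files()) + "\n")
    return out_file


# Register the hook only when the env var is set, so normal runs are unaffected.
_trace_dir = os.environ.get("LOAD_TRACE_DIR")
if _trace_dir:
    atexit.register(dump_loaded_modules, _trace_dir)
```

Files that never appear in any dump across the smoke runs are candidates for the dead-file inventory; writing one file per process id keeps the ranks of a torchrun job from overwriting each other.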
gitignore: untrack cosmos_training_meta/
The release tool writes cosmos_training_meta/files.txt on every run; it's
purely a debugging artifact and shouldn't be in the tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into cosmos-inference/ to pick up new action-policy datasets/configs, LoRA utility, learning-rate logger, VLM augmentors, and updated docs/configs. Excludes .git and __pycache__; pre-existing destination-only files (e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml, callbacks/vlm) were preserved rather than deleted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sync cosmos-inference/ with cosmos3-internal main (fc7f97d)
Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm and processors subtrees, eval scripts, and assorted training/inference updates. Excludes .git and __pycache__.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scaffold OSS documentation skeleton
Add 9 docs in docs/ (setup, code_structure, training, dataset, checkpoints, inference, configs, faq, examples) and READMEs in examples/, docker/, and tools/. All are section-heading skeletons to be filled in for the open-source training-infra release.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fill in docs/setup.md and sync root uv env with cosmos-inference
- docs/setup.md: port content from cosmos-inference/docs/setup.md adapted for the training-infra repo (new clone URL, train-only extras and CUDA groups, Recommended Base Image section, NGC base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with name "cosmos" and hatch packages ["cosmos"]; strip diffusers-cosmos3 from the train extra and tool.uv.sources (only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin versions match cosmos-inference exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frame root as framework, fill code_structure and inference docs
- README.md: rewrite as the framework entry point introducing the training-infra half (cosmos/, docs/) and the inference-infra half (cosmos-inference/), with a docs table connecting each docs/*.md to a one-line description and a Training-section Reference block for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout: repository layout, per-subpackage descriptions, and a "where to add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained checkpoints to inference, with quickstart, modalities, backends, and pointers into cosmos-inference/docs/*.md as the source of truth for full inference details.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add OSS hygiene files at root and mirror cosmos-inference with upstream
OSS files copied from cosmos-inference (no cosmos3 references in any of them; copied as-is): LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md, .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc, .github/ (issue templates + pre-commit workflow).
Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream but persisted locally because earlier syncs ran without --delete; cosmos-inference/ is now an exact mirror of cosmos3-internal main (fc7f97d).
- README.md: before the Setup install commands, link System Requirements and Recommended Base Image (link-only, no inline details).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port remaining dev configs and Docker assets from cosmos-inference
No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh
Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace cosmos3._src.imaginaire entry with cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and markers. Drops Args / init logging / seed fixtures that depend on cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as framework-level map (training-side cosmos/ and inference-side cosmos-inference/ tables, docs indices, and common-task tables for both halves).
- README.md: AGENTS.md reference now points at root ./AGENTS.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply pre-commit fixes; add rumdl config and .rumdl_cache gitignore
- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude cosmos-inference/ from pre-commit; addlicense now skips binary extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png LFS rule so the existing 25 KB git blob is preserved (rather than being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown linter disables MD013 / MD033 / MD040 (line length / inline HTML / fenced-code language) and excepts MD041 for README.md.
- addlicense: added SPDX headers to all cosmos/**/__init__.py skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regen and table-alignment fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md, docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line stub until the first tagged release.
- README.md: replace non-descriptive "here" link text with "one-shot quickstart" to satisfy MD059.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md: add pointer to inference-side skills
Note that the 5 cosmos3-* agent skills live under cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/ and activate when working inside that subtree; root-side skills can be added later as cosmos/ grows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the `cosmos` package from `cosmos/` at the repo root into `cosmos_training/cosmos/` so the training package source lives alongside its driver scripts, configs, and experiments. The package import name stays `cosmos`; no import rewrites needed.
Also brings in the 5 scaffolding subdirs (algorithm, controller, evaluation, inference, workers) that were already at the root skeleton, so the new location has the full intended layout.
Update build/type-check config to follow:
- pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos
- pyrefly.toml: project-includes and search-path point at new location
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the same torchrun invocations as launch_vlm_llava_ov.sh and launch_mixed_modality_sft_8b.sh in deterministic mode, parses rank-0 loss and global clip_grad_norm from the captured log, and asserts against goldens inlined at the bottom of the file. VLM check is limited to the first 2 iters (HF Hub streaming order drifts after); mixed-modality is bit-exact deterministic across all 10 iters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
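The parse-and-assert step in the determinism check above might look like the sketch below. The log line shape is an assumption (the real training log format does not appear in this PR), as are the function names and the golden values used in the usage note; only the idea of comparing rank-0 loss and clip_grad_norm against inlined goldens, optionally limited to the first few iterations, comes from the commit message.

```python
import math
import re

# Hypothetical log line shape; the real log format is not shown in this PR.
LINE_RE = re.compile(
    r"\[rank0\]\s+iter=(?P<it>\d+)\s+loss=(?P<loss>[-+0-9.eE]+)"
    r"\s+clip_grad_norm=(?P<gn>[-+0-9.eE]+)"
)


def parse_metrics(log_text: str) -> dict[int, tuple[float, float]]:
    """Map iteration -> (loss, clip_grad_norm) for rank-0 lines only."""
    metrics = {}
    for match in LINE_RE.finditer(log_text):
        metrics[int(match["it"])] = (float(match["loss"]), float(match["gn"]))
    return metrics


def find_mismatches(metrics, goldens, max_iters=None, rel_tol=0.0):
    """Compare the first max_iters golden iterations; rel_tol=0.0 means exact equality."""
    mismatches = []
    for it in sorted(goldens)[:max_iters]:
        if it not in metrics:
            mismatches.append(f"iter {it}: missing from log")
            continue
        names = ("loss", "clip_grad_norm")
        for name, got, want in zip(names, metrics[it], goldens[it]):
            if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=0.0):
                mismatches.append(f"iter {it}: {name} {got!r} != {want!r}")
    return mismatches
```

Under these assumptions, the VLM case would call `find_mismatches(metrics, goldens, max_iters=2)` and the mixed-modality case would omit `max_iters` to check every golden iteration exactly.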
…stems
Squash of liangf/utils-cleanup branch (originally 13 commits, PR #5):
- consolidate utils/vfm/vlm into utils/vlm and utils/vfm/fused_adam
into utils/fused_adam (bfb043d)
- drop dead utils/optim_instantiate.py and utils/configs/ (b66d9f8)
- drop dead utils/one_logger/ subsystem (a4f7ea1)
- drop orphan training_telemetry/context_managers.py (1c55568)
- drop three more orphans: vlm/flop_calculator + env_parsers
(customization, inference) (004b14d)
- add SPDX/Apache-2.0 headers to files missing them (0900113)
- docs: fix broken relative links in READMEs (21639f7)
- docs: fix stale references in training_telemetry README (de51b60)
- mark viewer.py executable to match its shebang (0e7b015)
- clear executable bit on avae_utils library modules (63cfbfb)
- skill: broaden cosmos-utils-vlm-migration to cover follow-up
deletions (13dc960)
- pre-commit: exclude .claude/ from project hooks (d3be55a)
- lint (52ed386)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a `--toml` flag to scripts/train.py that loads training config from
TOML. Two modes: standalone (full structured TOML) or paired with
`--config=<py>` where the TOML is a flat interface-schema file that
scripts/interface_toml.py translates to Hydra overrides (variant
auto-detected from the --config path: vfm-vlm vs vfm-base).
Files:
cosmos/utils/serialization.py - from_toml() + Literal type handling
cosmos/utils/config.py - TOML branch in load_config +
_reload_make_config_for_registrations
for ...defaults.config -> ...config
sibling resolution
scripts/train.py - --toml argparse + dispatch
scripts/interface_toml.py - interface TOML -> Hydra override
translator with vlm/base mapping dicts
toml/{vfm,vlm}_example.toml - schema-shaped templates
toml/mixed_modality_sft_8b.toml + run_mixed_modality_sft_trace_toml.sh
- TOML launcher for the trace experiment
toml/alignment_test.toml + toml/run_alignment_test.sh
- --dryrun parity test: byte-identical
config.yaml between direct CLI and TOML
.gitignore - drop tracked __pycache__/*.pyc
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Push the wiring that lived in shell launch scripts down into the
interface TOMLs and their target experiments, so each launch script is
just `torchrun ... --toml=<file>`.
interface_toml.py
- read_config_py_from_toml: top-level `config = "vfm"|"vlm"` field
selects the paired configs/base/...config.py — launch scripts no
longer need --config=.
- read_path_overrides: top-level checkpoint_path / wan_vae_path /
dataset_jsonl / model_path replace per-script bash variables and
CLI tail overrides.
- translate_interface_toml strips these launcher-directive keys
before flattening so they never leak into Hydra overrides.
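A rough sketch of the stripping step described for interface_toml.py: the launcher-directive keys are separated from the rest of the parsed TOML before anything is flattened. The key names come from the commit message; the function name, the return shape, and the validation of the `config` value are assumptions.

```python
# Top-level keys named in the commit message as launcher directives.
LAUNCHER_KEYS = ("config", "checkpoint_path", "wan_vae_path", "dataset_jsonl", "model_path")
VALID_CONFIGS = ("vfm", "vlm")


def split_launcher_directives(doc: dict) -> tuple[dict, dict]:
    """Return (directives, remainder) without mutating the parsed TOML dict.

    Only top-level keys are treated as directives; a nested key that happens
    to share a name (e.g. [model].model_path) stays in the remainder.
    """
    directives = {key: doc[key] for key in LAUNCHER_KEYS if key in doc}
    remainder = {key: value for key, value in doc.items() if key not in LAUNCHER_KEYS}
    selector = directives.get("config")
    if selector is not None and selector not in VALID_CONFIGS:
        raise ValueError(f"config must be one of {VALID_CONFIGS}, got {selector!r}")
    return directives, remainder
```

Only `remainder` would then be flattened into Hydra overrides, which is what keeps the directive keys from leaking into them.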
train.py
- When only --toml is given, resolve the paired config.py from the
TOML's `config` field; fall back to structured-TOML load otherwise.
- _apply_path_overrides mutates the resolved Config post-load:
checkpoint.load_path, model.config.tokenizer.vae_path, and
model.config.policy.backbone.model_name are fixed attribute paths;
dataset_jsonl is routed by hasattr probe on dataloader_train so
t2w (data_source.jsonl_paths) and mixed_modality
(dataloader.datasets.video.dataset.jsonl_paths) share one key.
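The post-load mutation described for train.py can be sketched as below. The attribute paths are taken from the commit message; which TOML key feeds which path, the use of a one-element list for `jsonl_paths`, and the helper names are assumptions, and `SimpleNamespace` stands in for the real Config object.

```python
from types import SimpleNamespace as NS

# Assumed pairing of TOML keys with the fixed attribute paths from the commit message.
FIXED_PATHS = {
    "checkpoint_path": "checkpoint.load_path",
    "wan_vae_path": "model.config.tokenizer.vae_path",
    "model_path": "model.config.policy.backbone.model_name",
}


def set_by_path(root, dotted: str, value) -> None:
    """Walk a dotted attribute path and set the final attribute."""
    *parents, leaf = dotted.split(".")
    node = root
    for name in parents:
        node = getattr(node, name)
    setattr(node, leaf, value)


def apply_path_overrides(config, overrides: dict):
    """Mutate an already-resolved config in place and return it."""
    for key, dotted in FIXED_PATHS.items():
        if key in overrides:
            set_by_path(config, dotted, overrides[key])
    jsonl = overrides.get("dataset_jsonl")
    if jsonl is not None:
        loader = config.dataloader_train
        # Probe the dataloader layout so both experiment shapes share one key.
        if hasattr(loader, "data_source"):  # t2w layout
            loader.data_source.jsonl_paths = [jsonl]
        elif hasattr(loader, "dataloader"):  # mixed_modality layout
            loader.dataloader.datasets.video.dataset.jsonl_paths = [jsonl]
        else:
            raise AttributeError("dataloader_train has no known jsonl_paths location")
    return config
```

For example, a config whose `dataloader_train` has a `data_source` attribute takes the first branch, while one with a nested `dataloader.datasets.video.dataset` takes the second; anything else raises rather than silently ignoring the key.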
Experiments absorb the "local-only mode" CLI tail
- libero_policy_datapacker_experiment.py:
tokenizer.bucket_name / object_store_credential_path_pretrained="",
checkpoint.{load,save}_to_object_store=dict(enabled=False),
vlm_config.tokenizer=dict(config_variant="hf"),
vlm_config.pretrained_weights.enabled=False.
- t2w_sft_8b_local_datapacker.py: same set, minus the two that were
already encoded (vlm_config.pretrained_weights.enabled=False and
data_source.num_video_frames=61).
- llava_ov_datapacker_experiment.py: checkpoint.{load,save}_to_object_store
deep-merge stubs.
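The "deep-merge stubs" above rely on a partial dict such as `dict(enabled=False)` overriding one field while leaving its siblings alone. A generic sketch of that merge behavior follows; it illustrates the idea on plain dicts and is not the config system's own merge routine, and the `bucket` field in the usage note is a made-up sibling key.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict where patch overrides base key by key, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With a base of `{"load_to_object_store": {"enabled": True, "bucket": "b"}}` and a patch of `{"load_to_object_store": {"enabled": False}}`, the result disables the store and keeps the hypothetical `bucket` entry.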
5 launch scripts now end with `--toml=<file> [--deterministic]` — no
--config=, no bash CHECKPOINT_PATH/WAN_VAE_PATH/DATASET_JSONL/MODEL_PATH
declarations, no CLI tail overrides. The MAX_ITER env var on
run_mixed_modality_sft_trace_toml.sh is gone; max_iter is driven by the
TOML's [train].max_iter via the existing interface mapping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge config into the TOML file and remove vae_path.