Pengcuo/toml into main#172

Closed
pengcuo wants to merge 9 commits into NVIDIA:main from nvidia-cosmos:pengcuo/toml_into_main

Collaborator

@pengcuo pengcuo commented May 19, 2026

Merge config into the TOML file and remove vae_path.

lfengad and others added 9 commits May 19, 2026 02:48
Bring the full Cosmos3 project tree into this repository: the cosmos3
Python package, vllm-cosmos3, docs, examples, inputs, schemas, CI config,
Docker setup, and tooling (pyproject, uv.lock, pre-commit, ruff, pyrefly,
justfile). Existing root files (README.md, RELEASE.md,
cosmos-logo-thumbnail.png) are preserved unchanged; .gitattributes
includes an LFS override for the preserved logo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge Cosmos3 README content with deprecated-repo notice

Combine the Cosmos3 documentation (user guide, overview, setup,
inference, models, modalities, CLI reference) from cosmos3-internal
with the original notice pointing to the archived-ces2025 branch and
the nvidia-cosmos GitHub organization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Restructure: rename cosmos3 to cosmos-inference, add cosmos/ skeleton

Rename the existing cosmos3 package directory to cosmos-inference and
introduce a new top-level cosmos/ package skeleton matching the planned
framework layout:

  cosmos/
    model/, inference/, data/, trainer/
    algorithm/{loss,reward,rl}/
    controller/, workers/{simulations,rollout,reference,reward}/
    communicator/, checkpoint/, callbacks/, tools/
    utils/, evaluation/, launcher/

Also add root-level tests/ and tools/ placeholders. Each new Python
subpackage has an empty __init__.py; tests/ and tools/ use .gitkeep.

Note: imports inside cosmos-inference/ still reference `cosmos3.*` and
will need updating in a follow-up; pyproject.toml package config is
also unchanged for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
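The skeleton described in the commit above could be generated with a short script along these lines. This is a sketch, not the tooling the authors used: the `scaffold` function is hypothetical, and it assumes that intermediate packages such as `algorithm/` and `workers/` also receive an empty `__init__.py`.

```python
import tempfile
from pathlib import Path

# Subpackage layout copied from the commit message.
SUBPACKAGES = [
    "model", "inference", "data", "trainer",
    "algorithm/loss", "algorithm/reward", "algorithm/rl",
    "controller",
    "workers/simulations", "workers/rollout", "workers/reference", "workers/reward",
    "communicator", "checkpoint", "callbacks", "tools",
    "utils", "evaluation", "launcher",
]


def scaffold(root: Path) -> list[Path]:
    """Create the cosmos/ skeleton under `root` and return the files written."""
    created: list[Path] = []
    package_root = root / "cosmos"
    for sub in SUBPACKAGES:
        current = package_root / sub
        current.mkdir(parents=True, exist_ok=True)
        # Walk from the leaf up to cosmos/ so every level is an importable package
        # (assumption: parents like algorithm/ also get an __init__.py).
        while True:
            init_file = current / "__init__.py"
            if not init_file.exists():
                init_file.touch()
                created.append(init_file)
            if current == package_root:
                break
            current = current.parent
    # Root-level placeholders are not Python packages, so they use .gitkeep.
    for placeholder in ("tests", "tools"):
        directory = root / placeholder
        directory.mkdir(parents=True, exist_ok=True)
        keep = directory / ".gitkeep"
        keep.touch()
        created.append(keep)
    return created


# Demo in a throwaway directory so nothing is written into a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    written = scaffold(Path(tmp))
    relative = sorted(p.relative_to(tmp).as_posix() for p in written)
    print(len(relative), "files created")
```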

Consolidate Cosmos3 content under cosmos-inference/, restore root layout

Move all cosmos3-internal-originated files at the repo root (dotfiles,
config, ci/, docker/, docs/, examples/, inputs/, schemas/, vllm-cosmos3/,
AGENTS.md, ATTRIBUTIONS.md, CHANGELOG.md, CONTRIBUTING.md, Dockerfile,
LICENSE, conftest.py, justfile, pyproject.toml, pyrefly.toml, uv.lock)
into cosmos-inference/, so the inference package is self-contained.

Root now matches the planned framework layout:
  - cosmos/         (new framework skeleton from prior commit)
  - cosmos-inference/  (full Cosmos3 codebase)
  - examples/, docs/, docker/, tests/, tools/  (fresh placeholders)
  - README.md, RELEASE.md, cosmos-logo-thumbnail.png  (originals)

Other changes:
- Root README.md restored to the original deprecated-notice version.
- cosmos-inference/README.md added with the Cosmos3 documentation
  (previously merged into root README).
- cosmos-inference/.gitattributes: drop the stale
  `cosmos-logo-thumbnail.png` LFS override (the file lives at root, not
  inside cosmos-inference/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add stub pyproject.toml and uv.lock for cosmos/ package at root

Minimal hatchling-backed pyproject.toml declaring the cosmos skeleton
package (version 0.0.0, no dependencies, python >=3.13), plus a uv.lock
generated by `uv lock` resolving the single editable cosmos package.

Brings the repo root in line with the planned layout:
  pyproject.toml, uv.lock, README.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sync cosmos-inference/ with cosmos3-internal origin/main

Replace cosmos-inference/ contents with the exact snapshot of origin/main
(99450113) from the cosmos3-internal repo to drop the stale flat-layout
files and align with upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cosmos_training: re-release with 4 verified smokes + deterministic mode

Re-runs the cosmos release pipeline (via imaginaire4 cosmos_release_claude/cosmos.toml)
into this tree. Adds:
- 446 cosmos/*.py files (up from prior snapshot; covers callbacks/learning_rate_logger,
  data/vfm/data_packer{,_dataloader}, augmentors, etc.)
- configs/base/experiment/action/{pretrained,posttrain}_config/ subtree (libero
  policy datapacker + supporting helpers/configs)
- configs/base/experiment/posttrain_video/ subtree (t2w_sft_8b_local_datapacker)
- configs/base/vlm/experiment/llava_ov_datapacker_experiment.py
- 4 launch scripts with PYTHONHASHSEED=42 + scripts.train --deterministic:
    launch_mixed_modality_sft_8b.sh
    launch_vlm_llava_ov.sh
    launch_action_libero.sh
    launch_t2w_sft_local_datapacker.sh
- run_4exps_oss.sh: sequential 4-smoke sweep runner
- sitecustomize.py: atexit sys.modules dumper for dead-file inventory
  (auto-loaded via PYTHONPATH=. when LOAD_TRACE_DIR is set)
- configs/base/config.py: register the 2 new experiments alongside
  mixed_modality_sft_8b
- experiments/sft/mixed_modality_sft_8b.py: VLMConfig v12 schema
  (pretrained_weights nested dict, qk_norm, etc.)

Maintenance:
- .gitignore: ignore __pycache__/, *.pyc, training_output/, outputs/.
- Untrack 223 stale __pycache__ entries that were checked in by an earlier
  snapshot.

Verified that all 4 smokes train cleanly under --deterministic, with per-iteration
loss matching the cosmos_opensource and imaginaire4 source-side runs to 4 decimal
places on the byte-identical paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
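The `sitecustomize.py` hook mentioned above (an atexit dumper of `sys.modules`, active only when `LOAD_TRACE_DIR` is set) might look roughly like the following. This is a minimal sketch: the function name `dump_loaded_modules`, the per-PID file name, and the one-name-per-line format are assumptions, not details taken from the commit.

```python
import atexit
import os
import sys
from pathlib import Path


def dump_loaded_modules(out_dir: str) -> Path:
    """Write the sorted names of all currently imported modules to a file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumption: one file per process, so ranks started by torchrun
    # do not overwrite each other's output.
    path = out / f"modules_{os.getpid()}.txt"
    names = sorted(sys.modules)
    path.write_text("\n".join(names) + "\n")
    return path


# Install the hook only when LOAD_TRACE_DIR is set, so normal runs are unaffected.
_trace_dir = os.environ.get("LOAD_TRACE_DIR")
if _trace_dir:
    atexit.register(dump_loaded_modules, _trace_dir)
```

Python imports a module named `sitecustomize` automatically at startup when it is found on the import path, which is why `PYTHONPATH=.` is enough to activate it. Comparing the dumped module names against the tracked `*.py` files would then give a candidate list of files that were never imported.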

gitignore: untrack cosmos_training_meta/

The release tool writes cosmos_training_meta/files.txt on every run; it's
purely a debugging artifact and shouldn't be in the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copy latest changes from ~/work/cosmos3-internal (HEAD bf3b3ac) into
cosmos-inference/ to pick up new action-policy datasets/configs, LoRA
utility, learning-rate logger, VLM augmentors, and updated docs/configs.
Excludes .git and __pycache__; pre-existing destination-only files
(e.g. action_policy_sft_8b.yaml, mixed_modality_sft_8b.yaml,
callbacks/vlm) were preserved rather than deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sync cosmos-inference/ with cosmos3-internal main (fc7f97d)

Copy latest changes from ~/work/cosmos3-internal (HEAD fc7f97d) into
cosmos-inference/, adding the diffusers-cosmos3 package, new algorithm
and processors subtrees, eval scripts, and assorted training/inference
updates. Excludes .git and __pycache__.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scaffold OSS documentation skeleton

Add 9 docs in docs/ (setup, code_structure, training, dataset,
checkpoints, inference, configs, faq, examples) and READMEs in
examples/, docker/, and tools/. All are section-heading skeletons to
be filled in for the open-source training-infra release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fill in docs/setup.md and sync root uv env with cosmos-inference

- docs/setup.md: port content from cosmos-inference/docs/setup.md
  adapted for the training-infra repo (new clone URL, train-only
  extras and CUDA groups, Recommended Base Image section, NGC
  base-image quickstart that runs from the repo root).
- pyproject.toml: mirror cosmos-inference/pyproject.toml with
  name "cosmos" and hatch packages ["cosmos"]; strip
  diffusers-cosmos3 from the train extra and tool.uv.sources
  (only used by the inference-side conversion script).
- .python-version: 3.10 (was 3.13 stub).
- uv.lock: regenerated against the new pyproject; external pin
  versions match cosmos-inference exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Frame root as framework, fill code_structure and inference docs

- README.md: rewrite as the framework entry point introducing the
  training-infra half (cosmos/, docs/) and the inference-infra half
  (cosmos-inference/), with a docs table connecting each docs/*.md
  to a one-line description and a Training-section Reference block
  for code structure / FAQ / AGENTS.md.
- docs/code_structure.md: fill in from the actual cosmos/ layout —
  repository layout, per-subpackage descriptions, and a "where to
  add new code" table.
- docs/inference.md: fill in as a bridge doc tying trained
  checkpoints to inference, with quickstart, modalities, backends,
  and pointers into cosmos-inference/docs/*.md as the source of
  truth for full inference details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add OSS hygiene files at root and mirror cosmos-inference with upstream

OSS files copied from cosmos-inference (no cosmos3 references in any
of them; copied as-is):
  LICENSE, .gitignore, .gitattributes, .dockerignore, CONTRIBUTING.md,
  .pre-commit-config.yaml, .gitleaks.toml, .ruff.toml, .coveragerc,
  .github/ (issue templates + pre-commit workflow).

Other changes:
- .python-version: 3.10 -> 3.13 (mirrors cosmos-inference).
- cosmos-inference/: remove 7 stale files that were deleted upstream
  but persisted locally because earlier syncs ran without --delete;
  cosmos-inference/ is now an exact mirror of cosmos3-internal main
  (fc7f97d).
- README.md: before the Setup install commands, link System
  Requirements and Recommended Base Image (link-only, no inline
  details).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port remaining dev configs and Docker assets from cosmos-inference

No-adaptation copies:
- ci/ (license.txt + uv_lock helpers)
- docker/Dockerfile (was cosmos-inference/Dockerfile)
- docker/ci.Dockerfile, docker/nightly.Dockerfile, docker/entrypoint.sh

Adapted from cosmos3 to cosmos:
- pyrefly.toml: project-includes -> cosmos/
- .pytest.toml: replace cosmos3._src.imaginaire entry with
  cosmos-inference in norecursedirs
- justfile: package/module/short name -> cosmos; docker recipes
  use docker/Dockerfile
- conftest.py: slimmed adaptation; inlines ALL_NUM_GPUS / ALL_LEVELS
  / ALLOWED_GPUS_BY_LEVEL and keeps generic fixtures, options, and
  markers. Drops Args / init logging / seed fixtures that depend on
  cosmos3 modules not yet ported to root cosmos/.
- AGENTS.md: full rewrite as framework-level map (training-side
  cosmos/ and inference-side cosmos-inference/ tables, docs indices,
  and common-task tables for both halves).
- README.md: AGENTS.md reference now points at root ./AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply pre-commit fixes; add rumdl config and .rumdl_cache gitignore

- ci/.pre-commit-config-base.yaml, .pre-commit-config.yaml: exclude
  cosmos-inference/ from pre-commit; addlicense now skips binary
  extensions (png/jpg/pdf/safetensors/etc.) so it can't corrupt
  non-text assets.
- .gitattributes: exempt cosmos-logo-thumbnail.png from the *.png
  LFS rule so the existing 25 KB git blob is preserved (rather than
  being rewritten into a 130-byte LFS pointer at commit time).
- .gitignore: ignore .rumdl_cache/.
- .config/rumdl.toml: copy from cosmos-inference so the markdown
  linter disables MD013 / MD033 / MD040 (line length / inline HTML
  / fenced-code language) and excepts MD041 for README.md.
- addlicense: added SPDX headers to all cosmos/**/__init__.py
  skeleton files.
- markdown-toc-creator / rumdl-fmt: TOC regen and table-alignment
  fixes in AGENTS.md, README.md, CONTRIBUTING.md, RELEASE.md,
  docs/code_structure.md, docs/setup.md.
- RELEASE.md: replace stale broken-link entries with a one-line
  stub until the first tagged release.
- README.md: replace non-descriptive "here" link text with
  "one-shot quickstart" to satisfy MD059.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AGENTS.md: add pointer to inference-side skills

Note that the 5 cosmos3-* agent skills live under
cosmos-inference/.agents/skills/ and cosmos-inference/.claude/skills/
and activate when working inside that subtree; root-side skills can
be added later as cosmos/ grows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the `cosmos` package from `cosmos/` at the repo root into
`cosmos_training/cosmos/` so the training package source lives alongside
its driver scripts, configs, and experiments. The package import name
stays `cosmos` — no import rewrites needed.

Also brings in the 5 scaffolding subdirs (algorithm, controller,
evaluation, inference, workers) that were already at the root skeleton,
so the new location has the full intended layout.

Update build/type-check config to follow:
- pyproject.toml: hatch sdist/wheel packages -> cosmos_training/cosmos
- pyrefly.toml: project-includes and search-path point at new location

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runs the same torchrun invocations as launch_vlm_llava_ov.sh and
launch_mixed_modality_sft_8b.sh in deterministic mode, parses rank-0
loss and global clip_grad_norm from the captured log, and asserts
against goldens inlined at the bottom of the file. The VLM check is
limited to the first 2 iters (HF Hub streaming order drifts after that);
mixed-modality is bit-exact deterministic across all 10 iters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
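The golden check described above can be sketched as follows. The log line format, the function names, and the numbers are all made up for illustration; the real trainer's log format and golden values are not shown in this PR.

```python
import math
import re

# Hypothetical rank-0 log line format (an assumption, not the trainer's real format).
LINE_RE = re.compile(
    r"rank=0 iter=(?P<it>\d+) loss=(?P<loss>\S+) clip_grad_norm=(?P<norm>\S+)"
)


def parse_metrics(log_text: str) -> dict[int, tuple[float, float]]:
    """Return {iteration: (loss, clip_grad_norm)} for rank-0 lines only."""
    metrics = {}
    for match in LINE_RE.finditer(log_text):
        metrics[int(match["it"])] = (float(match["loss"]), float(match["norm"]))
    return metrics


def compare(metrics, goldens, first_n=None, rel_tol=0.0):
    """Return a list of mismatch messages; an empty list means the run matches."""
    failures = []
    # first_n limits the check to the earliest iterations, as in the VLM case.
    for it in sorted(goldens)[:first_n]:
        if it not in metrics:
            failures.append(f"iter {it}: missing from log")
            continue
        for name, got, want in zip(("loss", "clip_grad_norm"), metrics[it], goldens[it]):
            # rel_tol=0.0 with abs_tol=0.0 demands exact equality (bit-exact runs).
            if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=0.0):
                failures.append(f"iter {it}: {name} {got!r} != golden {want!r}")
    return failures


# Illustrative data only.
LOG = """\
rank=0 iter=1 loss=2.5 clip_grad_norm=1.25
rank=1 iter=1 loss=9.0 clip_grad_norm=9.0
rank=0 iter=2 loss=2.25 clip_grad_norm=1.0
rank=0 iter=3 loss=7.0 clip_grad_norm=3.0
"""
GOLDENS = {1: (2.5, 1.25), 2: (2.25, 1.0), 3: (2.0, 0.75)}

first_two = compare(parse_metrics(LOG), GOLDENS, first_n=2)
all_iters = compare(parse_metrics(LOG), GOLDENS)
```

Here `first_two` is empty because only the first two iterations are compared, while `all_iters` reports the drift at iteration 3. That mirrors the split in the commit: a truncated window where data order is not reproducible, a full window where it is.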
…stems

Squash of liangf/utils-cleanup branch (originally 13 commits, PR #5):

  - consolidate utils/vfm/vlm into utils/vlm and utils/vfm/fused_adam
    into utils/fused_adam (bfb043d)
  - drop dead utils/optim_instantiate.py and utils/configs/ (b66d9f8)
  - drop dead utils/one_logger/ subsystem (a4f7ea1)
  - drop orphan training_telemetry/context_managers.py (1c55568)
  - drop three more orphans: vlm/flop_calculator + env_parsers
    (customization, inference) (004b14d)
  - add SPDX/Apache-2.0 headers to files missing them (0900113)
  - docs: fix broken relative links in READMEs (21639f7)
  - docs: fix stale references in training_telemetry README (de51b60)
  - mark viewer.py executable to match its shebang (0e7b015)
  - clear executable bit on avae_utils library modules (63cfbfb)
  - skill: broaden cosmos-utils-vlm-migration to cover follow-up
    deletions (13dc960)
  - pre-commit: exclude .claude/ from project hooks (d3be55a)
  - lint (52ed386)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a `--toml` flag to scripts/train.py that loads training config from
TOML. Two modes: standalone (full structured TOML) or paired with
`--config=<py>` where the TOML is a flat interface-schema file that
scripts/interface_toml.py translates to Hydra overrides (variant
auto-detected from the --config path: vfm-vlm vs vfm-base).

Files:
  cosmos/utils/serialization.py  - from_toml() + Literal type handling
  cosmos/utils/config.py         - TOML branch in load_config +
                                   _reload_make_config_for_registrations
                                   for ...defaults.config -> ...config
                                   sibling resolution
  scripts/train.py               - --toml argparse + dispatch
  scripts/interface_toml.py      - interface TOML -> Hydra override
                                   translator with vlm/base mapping dicts
  toml/{vfm,vlm}_example.toml    - schema-shaped templates
  toml/mixed_modality_sft_8b.toml + run_mixed_modality_sft_trace_toml.sh
                                 - TOML launcher for the trace experiment
  toml/alignment_test.toml + toml/run_alignment_test.sh
                                 - --dryrun parity test: byte-identical
                                   config.yaml between direct CLI and TOML
  .gitignore                     - drop tracked __pycache__/*.pyc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
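The paired mode described above (a flat interface TOML translated to Hydra overrides, with the variant detected from the `--config` path) might work roughly as sketched here. The mapping entries, the target key names, the detection rule (a `vlm` path component), and the example paths are all assumptions; the real `scripts/interface_toml.py` is not shown in this PR. The input is the dict that a TOML parser would return.

```python
from pathlib import Path

# Hypothetical interface-key -> Hydra-key mappings, one per variant.
VLM_MAPPING = {"train.max_iter": "trainer.max_iter", "optimizer.lr": "optimizer.lr"}
BASE_MAPPING = {"train.max_iter": "trainer.max_iter"}


def detect_variant(config_path: str) -> str:
    """Assumption: a 'vlm' directory component in the path marks the vlm variant."""
    return "vlm" if "vlm" in Path(config_path).parts else "base"


def flatten(table: dict, prefix: str = "") -> dict:
    """Flatten nested tables into dotted keys, e.g. {'train': {'max_iter': 1}}."""
    flat = {}
    for key, value in table.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def format_value(value) -> str:
    """Render scalars in a form Hydra accepts on the command line."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "null"
    return str(value)


def to_hydra_overrides(interface: dict, config_path: str) -> list[str]:
    mapping = VLM_MAPPING if detect_variant(config_path) == "vlm" else BASE_MAPPING
    overrides = []
    for key, value in flatten(interface).items():
        if key not in mapping:
            # Failing loudly catches typos in the interface file early.
            raise KeyError(f"no mapping for interface key {key!r}")
        overrides.append(f"{mapping[key]}={format_value(value)}")
    return overrides


interface = {"train": {"max_iter": 10}, "optimizer": {"lr": 0.0001}}
overrides = to_hydra_overrides(interface, "configs/base/vlm/config.py")
```

Rejecting unmapped keys rather than passing them through is a design choice in this sketch: it keeps the interface schema closed, at the cost of requiring a mapping entry for every new field.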
Push the wiring that lived in shell launch scripts down into the
interface TOMLs and their target experiments, so each launch script is
just `torchrun ... --toml=<file>`.

interface_toml.py
  - read_config_py_from_toml: top-level `config = "vfm"|"vlm"` field
    selects the paired configs/base/...config.py — launch scripts no
    longer need --config=.
  - read_path_overrides: top-level checkpoint_path / wan_vae_path /
    dataset_jsonl / model_path replace per-script bash variables and
    CLI tail overrides.
  - translate_interface_toml strips these launcher-directive keys
    before flattening so they never leak into Hydra overrides.

train.py
  - When only --toml is given, resolve the paired config.py from the
    TOML's `config` field; fall back to structured-TOML load otherwise.
  - _apply_path_overrides mutates the resolved Config post-load:
    checkpoint.load_path, model.config.tokenizer.vae_path, and
    model.config.policy.backbone.model_name are fixed attribute paths;
    dataset_jsonl is routed by hasattr probe on dataloader_train so
    t2w (data_source.jsonl_paths) and mixed_modality
    (dataloader.datasets.video.dataset.jsonl_paths) share one key.
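The post-load mutation described above can be illustrated with plain namespaces standing in for the resolved Config. The attribute paths are the ones listed in the commit message; everything else is an assumption, including the choice of `data_source` as the probed attribute, the wrapping of the path in a list, and the stub config shapes.

```python
from types import SimpleNamespace as NS

# Fixed attribute paths, as listed in the commit message.
FIXED_PATHS = {
    "checkpoint_path": "checkpoint.load_path",
    "wan_vae_path": "model.config.tokenizer.vae_path",
    "model_path": "model.config.policy.backbone.model_name",
}


def set_by_path(obj, dotted: str, value) -> None:
    """Set a nested attribute given a dotted path such as 'a.b.c'."""
    *parents, leaf = dotted.split(".")
    for name in parents:
        obj = getattr(obj, name)
    setattr(obj, leaf, value)


def apply_path_overrides(config, overrides: dict) -> None:
    for key, dotted in FIXED_PATHS.items():
        if key in overrides:
            set_by_path(config, dotted, overrides[key])
    if "dataset_jsonl" in overrides:
        loader = config.dataloader_train
        paths = [overrides["dataset_jsonl"]]  # assumption: jsonl_paths is a list
        # hasattr probe: the t2w layout exposes data_source directly,
        # the mixed-modality layout nests the dataset under dataloader.datasets.video.
        if hasattr(loader, "data_source"):
            loader.data_source.jsonl_paths = paths
        else:
            loader.dataloader.datasets.video.dataset.jsonl_paths = paths


# Stub configs that mimic the two layouts.
t2w = NS(
    checkpoint=NS(load_path=None),
    dataloader_train=NS(data_source=NS(jsonl_paths=[])),
)
mixed = NS(
    checkpoint=NS(load_path=None),
    dataloader_train=NS(
        dataloader=NS(datasets=NS(video=NS(dataset=NS(jsonl_paths=[]))))
    ),
)

overrides = {"checkpoint_path": "/ckpt/example", "dataset_jsonl": "/data/example.jsonl"}
apply_path_overrides(t2w, overrides)
apply_path_overrides(mixed, overrides)
```

Both stubs end up with the same checkpoint path and the same JSONL path, even though the dataset lives at different depths, which is the point of sharing one `dataset_jsonl` key across experiments.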

Experiments absorb the "local-only mode" CLI tail
  - libero_policy_datapacker_experiment.py:
    tokenizer.bucket_name / object_store_credential_path_pretrained="",
    checkpoint.{load,save}_to_object_store=dict(enabled=False),
    vlm_config.tokenizer=dict(config_variant="hf"),
    vlm_config.pretrained_weights.enabled=False.
  - t2w_sft_8b_local_datapacker.py: same set, minus the two that were
    already encoded (vlm_config.pretrained_weights.enabled=False and
    data_source.num_video_frames=61).
  - llava_ov_datapacker_experiment.py: checkpoint.{load,save}_to_object_store
    deep-merge stubs.

5 launch scripts now end with `--toml=<file> [--deterministic]` — no
--config=, no bash CHECKPOINT_PATH/WAN_VAE_PATH/DATASET_JSONL/MODEL_PATH
declarations, no CLI tail overrides. The MAX_ITER env var on
run_mixed_modality_sft_trace_toml.sh is gone; max_iter is driven by the
TOML's [train].max_iter via the existing interface mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pengcuo pengcuo closed this May 19, 2026
4 participants