Multi-GPU interactive mode hangs during model initialization on some machines #72

@FVEFWFE

Running Matrix-Game 3.0 interactive mode with torchrun on 4x A100 SXM 80GB.
Stock generate.py with --interactive works on some machines but silently hangs
during model initialization on others.

Setup:

  • torch 2.10.0, FA2, CUDA 12.8, Python 3.12
  • torchrun --nproc_per_node=4 with --ulysses_size 4, --use_int8, --t5_cpu
  • Docker image based on nvidia/cuda:12.8.0-devel-ubuntu22.04

What happens:

  • NCCL init succeeds on all ranks
  • T5 encoder loads successfully on all ranks
  • All 4 ranks log "Initializing Model (DiT)..."
  • Then hangs forever: 0% CPU, and GPU VRAM stays at 633MB (CUDA context only)
  • No error, no timeout, no crash
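
Because the hang prints nothing, one way to see where each rank is stuck is to have Python dump every thread's stack on a timer. A minimal sketch using the standard library's faulthandler module; the helper names `arm_hang_dump` and `disarm_hang_dump` and the 120 second interval are illustrative and not part of Matrix-Game, and the rank is assumed to come from the RANK environment variable that torchrun sets.

```python
import faulthandler
import os
import sys


def arm_hang_dump(interval_s: float = 120.0, file=sys.stderr) -> int:
    """Dump every thread's Python stack to `file` each `interval_s` seconds.

    Hypothetical helper: call once near the top of generate.py, before
    model initialization. Returns the rank read from RANK (0 if unset).
    `file` must be backed by a real file descriptor.
    """
    rank = int(os.environ.get("RANK", "0"))
    print(f"[rank {rank}] arming stack dump every {interval_s}s",
          file=file, flush=True)
    # repeat=True re-arms the timer after each dump;
    # exit=False keeps the process alive so the hang can still be inspected.
    faulthandler.dump_traceback_later(interval_s, repeat=True,
                                      file=file, exit=False)
    return rank


def disarm_hang_dump() -> None:
    """Cancel the periodic dump once initialization has finished."""
    faulthandler.cancel_dump_traceback_later()
```

If the dumps on a hanging machine show all ranks parked in the same Python frame, that frame is the call to look at; if the ranks differ, the odd one out is the likely culprit.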

What we've tried:

  • NCCL_P2P_DISABLE=1
  • NCCL_DEBUG=INFO (init shows COMPLETE, no errors)
  • device_id in init_process_group()
  • 2 GPUs instead of 4
  • Different volume sizes (75GB, 200GB)

Key observation:

  • Works on some RunPod machines (e.g. vqnzttsdu44t, c250s631h6rc)
  • Hangs on others (e.g. cmg18urp1kg5, hleuzkw9vyyn)
  • Same Docker image, same code, same config

The hang appears to be inside WanModel.from_pretrained() or the
dist.barrier() that precedes it. Has anyone else seen this?
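
To tell those two candidates apart, each call can be bracketed with per-rank enter/exit logging. A minimal sketch: the `stage` helper is hypothetical, the rank is assumed to come from torchrun's RANK variable, and the commented usage shows a placeholder argument rather than the real `from_pretrained` signature.

```python
import contextlib
import os
import sys
import time


@contextlib.contextmanager
def stage(name: str, file=sys.stderr):
    """Log entry and exit of a named stage with rank and elapsed time."""
    rank = os.environ.get("RANK", "?")  # set by torchrun for each worker
    start = time.monotonic()
    print(f"[rank {rank}] enter {name}", file=file, flush=True)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f"[rank {rank}] exit  {name} after {elapsed:.1f}s",
              file=file, flush=True)


# Intended use inside the loading code (call names taken from the report;
# checkpoint_dir is a placeholder, not the real argument list):
#
#     with stage("dist.barrier"):
#         dist.barrier()
#     with stage("WanModel.from_pretrained"):
#         model = WanModel.from_pretrained(checkpoint_dir)
```

A rank that logs "enter" for a stage but never the matching "exit" is stuck inside that stage, which separates a barrier hang from a hang in weight loading.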
