Add UCC/NCCL alltoallv vs DeepEP#891
Conversation
📝 WalkthroughWalkthroughThis PR integrates DeepEP benchmark outputs with NCCL and UCC test frameworks by adding discovery utilities, a new MoE throughput reporter, matrix artifact mounting, and updated command generation strategies to enable dependent test execution and cross-workload metric visualization. ChangesDeepEP Integration Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
28-134:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix Ruff formatting for this file before merge.
CI is failing on
ruff-formatfor this file; please run formatting and commit the normalized output to unblock the pipeline.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 28 - 134, The file fails ruff-format checks; run the ruff formatter and commit the normalized file so CI passes. Specifically, run your project's ruff/formatter (e.g., ruff format) on the module containing methods like _append_head_node_detection, _append_sbatch_directives, _container_mounts, _generate_config_yaml and gen_srun_success_check, review the changes for any unintended edits, and commit the formatted file to the branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py`:
- Line 15: The import of Reporter in deepep_moe_throughput_reporter.py currently
uses the private path cloudai._core.base_reporter; replace that with the allowed
public reporter interface used by other workloads (import the same Reporter
symbol from the public reporter module instead) so the file imports Reporter
from the public/allowed package rather than cloudai._core.base_reporter.
- Around line 23-25: The function _deepep_dispatch_combine_bars is too complex
and triggers C901 and has an unused loop variable causing B007; split it into
small helpers (e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.
In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`:
- Around line 51-58: If use_deepep_matrix is true, don’t silently return None
when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast
by raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
- Around line 95-112: The alltoallv_perf_mpi branch only appends the fixed flags
and misses forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 55-64: If tdef.cmd_args.use_deepep_matrix is true, do not silently
return []; instead detect missing matrix by checking
deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and
raise a clear exception (or call fail-fast helper) when either is None and there
is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent
flag). Update the logic in the block guarded by use_deepep_matrix (and the
analogous block later around the other check) to validate deepep_root and
matrix_host and raise an informative error mentioning the missing DeepEP matrix
and required --gen/manual generation flag rather than falling back to running
without a matrix.
---
Outside diff comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 28-134: The file fails ruff-format checks; run the ruff formatter
and commit the normalized file so CI passes. Specifically, run your project's
ruff/formatter (e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0416a4db-8f59-4498-afdb-cfa1d0fa7cce
📒 Files selected for processing (15)
conf/experimental/test/deepep_standard.tomlconf/experimental/test/nccl_test_alltoallv.tomlconf/experimental/test_scenario/deepep_with_nccl_alltoallv.tomlconf/experimental/test_scenario/deepep_with_ucc_alltoallv.tomlsrc/cloudai/registration.pysrc/cloudai/workloads/deepep/__init__.pysrc/cloudai/workloads/deepep/deepep.pysrc/cloudai/workloads/deepep/deepep_combined_report.pysrc/cloudai/workloads/deepep/deepep_moe_throughput_reporter.pysrc/cloudai/workloads/deepep/report_generation_strategy.pysrc/cloudai/workloads/deepep/slurm_command_gen_strategy.pysrc/cloudai/workloads/nccl_test/nccl.pysrc/cloudai/workloads/nccl_test/slurm_command_gen_strategy.pysrc/cloudai/workloads/ucc_test/slurm_command_gen_strategy.pysrc/cloudai/workloads/ucc_test/ucc.py
| import re | ||
| from pathlib import Path | ||
|
|
||
| from cloudai._core.base_reporter import Reporter |
There was a problem hiding this comment.
Use an allowed public import path for Reporter to satisfy layering rules.
This import currently breaks the import-linter contract (cloudai.workloads → cloudai._core). Please switch to the public/allowed reporter interface module used elsewhere in workloads.
🧰 Tools
🪛 GitHub Actions: CI / Linting
[error] Ruff format check failed for this file (hook 'ruff-format' reformatted files, causing the hook to fail).
[error] import-linter contract broken: cloudai.workloads is not allowed to import cloudai._core (offending dependency: cloudai.workloads.deepep.deepep_moe_throughput_reporter -> cloudai._core.base_reporter).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py` at line 15,
The import of Reporter in deepep_moe_throughput_reporter.py currently uses the
private path cloudai._core.base_reporter; replace that with the allowed public
reporter interface used by other workloads (import the same Reporter symbol from
the public reporter module instead) so the file imports Reporter from the
public/allowed package rather than cloudai._core.base_reporter.
| def _deepep_dispatch_combine_bars(test_output: Path) -> list[tuple[str, float, str]]: | ||
| """From latest ``results.json``: one bar per ``dispatch`` / ``combine`` row (``bus_bw_avg``).""" | ||
| paths = deepep_results_json_files(test_output) |
There was a problem hiding this comment.
Resolve Ruff blockers in this new reporter file.
CI is red on C901 (_deepep_dispatch_combine_bars complexity), B007 (unused lab), and formatting. Please split _deepep_dispatch_combine_bars into smaller helpers, rename unused loop vars (e.g., _lab), and run ruff format.
Also applies to: 164-166
🧰 Tools
🪛 GitHub Actions: CI / 2_Linting.txt
[error] 23-25: Ruff error (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).
🪛 GitHub Actions: CI / Linting
[error] 23-25: Ruff check (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).
[error] Ruff format check failed for this file (hook 'ruff-format' reformatted files, causing the hook to fail).
[error] import-linter contract broken: cloudai.workloads is not allowed to import cloudai._core (offending dependency: cloudai.workloads.deepep.deepep_moe_throughput_reporter -> cloudai._core.base_reporter).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py` around lines
23 - 25, The function _deepep_dispatch_combine_bars is too complex and triggers
C901 and has an unused loop variable causing B007; split it into small helpers
(e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.
| def _deepep_nccl_matrix_host_path(self) -> Path | None: | ||
| tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test) | ||
| if not tdef.cmd_args.use_deepep_matrix: | ||
| return None | ||
| root = deepep_benchmark_root(self.test_run) | ||
| if root is None: | ||
| return None | ||
| return _nccl_matrix_path_under_deepep_output(root) |
There was a problem hiding this comment.
Fail fast when use_deepep_matrix is enabled but matrix discovery fails.
Current flow silently skips mounts/env var when the DeepEP dependency or nccl_matrix.txt is missing, so the run proceeds without the requested matrix input and can produce misleading comparison data.
💡 Suggested fix
def _deepep_nccl_matrix_host_path(self) -> Path | None:
tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test)
if not tdef.cmd_args.use_deepep_matrix:
return None
root = deepep_benchmark_root(self.test_run)
- if root is None:
- return None
- return _nccl_matrix_path_under_deepep_output(root)
+ if root is None:
+ raise FileNotFoundError(
+ "use_deepep_matrix=true but no start_post_comp DeepEP output was found"
+ )
+ matrix = _nccl_matrix_path_under_deepep_output(root)
+ if matrix is None:
+ raise FileNotFoundError(
+ f"use_deepep_matrix=true but nccl_matrix.txt was not found under {root}"
+ )
+ return matrixAlso applies to: 62-67, 81-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines
51 - 58, If use_deepep_matrix is true, don’t silently return None when
deepep_benchmark_root or the nccl matrix file is missing; instead fail fast by
raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
| if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi": | ||
| a = tdef.cmd_args | ||
| parts: List[str] = [ | ||
| _NCCL_TESTS_ALLTOALLV_PERF, | ||
| "-b", | ||
| str(_nccl_cmd_scalar(a.minbytes)), | ||
| "-e", | ||
| str(_nccl_cmd_scalar(a.maxbytes)), | ||
| "-g", | ||
| str(_nccl_cmd_scalar(a.ngpus)), | ||
| "-w", | ||
| str(_nccl_cmd_scalar(a.warmup_iters)), | ||
| "-n", | ||
| str(_nccl_cmd_scalar(a.iters)), | ||
| ] | ||
| if self.test_run.test.extra_cmd_args: | ||
| parts.append(self.test_run.test.extra_args_str) | ||
| return parts |
There was a problem hiding this comment.
Dedicated alltoallv_perf_mpi path drops configured NCCL options.
This branch only forwards -b/-e/-g/-w/-n and ignores other configured fields (e.g. stepfactor, nthreads, check, blocking), so TOML values are silently ineffective for this subtest.
💡 Suggested fix
if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
a = tdef.cmd_args
parts: List[str] = [
_NCCL_TESTS_ALLTOALLV_PERF,
"-b",
str(_nccl_cmd_scalar(a.minbytes)),
"-e",
str(_nccl_cmd_scalar(a.maxbytes)),
"-g",
str(_nccl_cmd_scalar(a.ngpus)),
"-w",
str(_nccl_cmd_scalar(a.warmup_iters)),
"-n",
str(_nccl_cmd_scalar(a.iters)),
]
+ optional_flags = [
+ ("-f", a.stepfactor),
+ ("-t", a.nthreads),
+ ("-c", a.check),
+ ("-z", a.blocking),
+ ]
+ for flag, raw_value in optional_flags:
+ value = _nccl_cmd_scalar(raw_value)
+ if value is not None:
+ parts.extend([flag, str(value)])
if self.test_run.test.extra_cmd_args:
parts.append(self.test_run.test.extra_args_str)
return parts📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi": | |
| a = tdef.cmd_args | |
| parts: List[str] = [ | |
| _NCCL_TESTS_ALLTOALLV_PERF, | |
| "-b", | |
| str(_nccl_cmd_scalar(a.minbytes)), | |
| "-e", | |
| str(_nccl_cmd_scalar(a.maxbytes)), | |
| "-g", | |
| str(_nccl_cmd_scalar(a.ngpus)), | |
| "-w", | |
| str(_nccl_cmd_scalar(a.warmup_iters)), | |
| "-n", | |
| str(_nccl_cmd_scalar(a.iters)), | |
| ] | |
| if self.test_run.test.extra_cmd_args: | |
| parts.append(self.test_run.test.extra_args_str) | |
| return parts | |
| if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi": | |
| a = tdef.cmd_args | |
| parts: List[str] = [ | |
| _NCCL_TESTS_ALLTOALLV_PERF, | |
| "-b", | |
| str(_nccl_cmd_scalar(a.minbytes)), | |
| "-e", | |
| str(_nccl_cmd_scalar(a.maxbytes)), | |
| "-g", | |
| str(_nccl_cmd_scalar(a.ngpus)), | |
| "-w", | |
| str(_nccl_cmd_scalar(a.warmup_iters)), | |
| "-n", | |
| str(_nccl_cmd_scalar(a.iters)), | |
| ] | |
| optional_flags = [ | |
| ("-f", a.stepfactor), | |
| ("-t", a.nthreads), | |
| ("-c", a.check), | |
| ("-z", a.blocking), | |
| ] | |
| for flag, raw_value in optional_flags: | |
| value = _nccl_cmd_scalar(raw_value) | |
| if value is not None: | |
| parts.extend([flag, str(value)]) | |
| if self.test_run.test.extra_cmd_args: | |
| parts.append(self.test_run.test.extra_args_str) | |
| return parts |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines
95 - 112, The alltoallv_perf_mpi branch only appends the fixed flags and misses
forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.
| if not tdef.cmd_args.use_deepep_matrix: | ||
| return [] | ||
|
|
||
| deepep_root = deepep_benchmark_root(self.test_run) | ||
| if deepep_root is None: | ||
| return [] | ||
|
|
||
| matrix_host = self._deepep_ucc_matrix_host_path() | ||
| if matrix_host is None: | ||
| return [] |
There was a problem hiding this comment.
Fail fast when use_deepep_matrix is enabled but the matrix is missing.
Current flow silently falls back (no mount + no --gen) and can run a non-matrix UCC test while the scenario claims DeepEP-matrix-driven comparison. Please raise an explicit error when matrix resolution fails and no manual gen is provided.
Suggested fix
def _container_mounts(self) -> List[str]:
tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
if not tdef.cmd_args.use_deepep_matrix:
return []
@@
matrix_host = self._deepep_ucc_matrix_host_path()
- if matrix_host is None:
- return []
+ if matrix_host is None:
+ raise FileNotFoundError(
+ "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs"
+ )
@@
def generate_test_command(self) -> List[str]:
@@
- elif self._deepep_ucc_matrix_host_path() is not None:
- srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}")
+ elif tdef_cmd_args.use_deepep_matrix:
+ matrix_host = self._deepep_ucc_matrix_host_path()
+ if matrix_host is None:
+ raise FileNotFoundError(
+ "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs"
+ )
+ srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}")Also applies to: 85-86
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py` around lines 55
- 64, If tdef.cmd_args.use_deepep_matrix is true, do not silently return [];
instead detect missing matrix by checking deepep_benchmark_root(self.test_run)
and self._deepep_ucc_matrix_host_path() and raise a clear exception (or call
fail-fast helper) when either is None and there is no manual generation option
provided (e.g. no tdef.cmd_args.gen or equivalent flag). Update the logic in the
block guarded by use_deepep_matrix (and the analogous block later around the
other check) to validate deepep_root and matrix_host and raise an informative
error mentioning the missing DeepEP matrix and required --gen/manual generation
flag rather than falling back to running without a matrix.
|
please resolve coderabbit comments and make CI pass before review ping me again directly once it's done please |
Add UCC/NCCL alltoallv vs DeepEP