Refactor(a5): align profiling stack with a2a3 (host CRTP + stable ring)#777
Open
ChaoZheng109 wants to merge 1 commit into
Open
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the profiling framework for the a5 architecture, centralizing host-side infrastructure for PMU, L2Perf, and TensorDump collectors. Key architectural changes include migrating profiling state from the runtime handshake to KernelArgs, introducing stable AICore staging rings to decouple writes from buffer rotations, and implementing a host-shadow transport mechanism to support systems without SVM. Review feedback recommends adding warning logs for null headers during buffer switches and utilizing store barriers when initializing hardware registers to prevent race conditions on weak memory model architectures.
c0630d2 to
0572269
Compare
…ICore ring, profiling off Handshake) Brings a5 to the same shape as a2a3's PR [hw-native-sys#705](hw-native-sys#705) / [hw-native-sys#709](hw-native-sys#709) / [hw-native-sys#714](hw-native-sys#714): host-side collectors share a ProfilerBase<Derived, Module> + BufferPoolManager<Module> framework; AICore writes through a stable per-core L2PerfAicoreRing / PmuAicoreRing decoupled from AICPU buffer rotation; profiling state moves off Handshake onto KernelArgs + AICore platform-owned slots (aicore_profiling_state.h). a5's transport channel deviates only in MemoryOps carrying copy_to_device / copy_from_device, the mgmt loop mirroring the shm region per tick, and release_owned_buffers freeing the paired host shadow. AICore now resolves its own PMU MMIO base at kernel entry directly from KernelArgs::regs[get_physical_core_id()] (the per-physical-core register-base table the host already fills for AICPU), instead of indexing a separate pmu_reg_addrs table that AICPU filled during handshake. The resolved base is valid from Phase 1 onward, so aicore_execute caches it at Phase 3 alongside the rings rather than re-reading per PMU record. Drops KernelArgs::pmu_reg_addrs, set/get_platform_pmu_reg_addrs, and the corresponding host-side table allocation in PmuCollector.
0572269 to
a5a594b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Brings a5's profiling stack to the same shape as a2a3 (PRs #705 / #709 / #714):
L2PerfCollector/PmuCollector/TensorDumpCollectornow derive fromProfilerBase<Derived, Module>over a sharedBufferPoolManager<Module>undersrc/a5/platform/include/host/profiling_common/. Mgmt + poll threads + buffer pool logic deduplicated across three collectors.L2PerfAicoreRing/PmuAicoreRingallocated once by the host; address is published viaKernelArgs::aicore_*_ring_addrs[block_idx]and never reassigned. Decouples AICore writes from AICPU's rotating records buffer, and fixes the latent PMU buffer-flip bug where AICore kept writing to the stale buffer.Handshake: enablement bits + per-core address arrays moved from runtimeHandshaketoKernelArgs+ AICore platform-owned slots (aicore_profiling_state.h). Adding a new profiling field no longer touches the runtime sync protocol. AICore resolves its PMU MMIO base directly from the existingKernelArgs::regstable at kernel entry (regs[get_physical_core_id()]), so there is no separate per-corepmu_reg_addrstable and no AICPU-fill-after-handshake dependency — the reg base is stable from Phase 1, andaicore_executecaches it at Phase 3 alongside the rings.L2PerfBufferStateandPmuBufferStatenow tracktotal / dropped / mismatch; collectors cross-checkcollected + dropped + mismatch == device_totalper pool, matching a2a3.a5's transport channel still differs from a2a3 (no
halHostRegisteron DAV_3510). The framework absorbs that as:MemoryOpscarries 5 callbacks (addscopy_to_device/copy_from_device)profiling_copy.hrelease_owned_buffersfrees the paired host-shadowmalloc()Docs
docs/profiling-framework.md§8 — new a5-specifics sectiondocs/dfx/{pmu-profiling,l2-swimlane-profiling,tensor-dump}.md§5.x — a5 transport channel + lifecycle + reconcile semantics updated to match the code as it standsNaming note
a5's new
KernelArgsfields are named*_addrs(plural) for the per-core address arrays:aicore_l2_perf_ring_addrs,aicore_pmu_ring_addrs. a2a3'saicore_ring_addr(singular but pointing to an array) is logged for a follow-up a2a3 rename PR; not touched here per the arch-independence rule.Test plan
python -m simpler_setup.build_runtimes --platform a5sim— host_build_graph + tensormap_and_ringbuffer build clean./ci.sh -p a5sim)collected + dropped + mismatch == device_totalmatches per pool on a representative case