I tested 4 different kernels on both master v2.3 (@d425a56) and bug_fix_mig_320 (@f67c70396), and the results differ.
Using the same apptainer and settings (NT=8 / NUM_TCU_LANES=8 / LMEM_LOG_SIZE=15 / TCU enabled (XLEN=32)), the number of cycles, the number of instructions, and the IPC differ as follows:
Compared with bug_fix_mig_320, master v2.3 has 10–31% fewer cycles, an IPC of around 1.8 (vs ~0.22), and instruction counts ~7× higher.
I traced the commits between master v2.3 and bug_fix_mig_320 (454 commits) and found two that look relevant: @ad46039cc, with the commit message “fix NT=8 i8 issue”, and @f1702ea4f, “optimize the worker's architecture, a five stage pipeline”. My understanding is that master had a bug in how instruction issue is counted for NT=8 (hence the ~7× higher instruction count and the higher IPC), and that bug_fix_mig_320 both corrects that counter and reworks the worker pipeline into 5 stages, which together account for the difference.
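As a quick sanity check that the three numbers agree with each other, here is a minimal sketch. It assumes IPC is reported as instructions divided by cycles, and that "10–31% lower cycles" means master's cycle count is 0.69× to 0.90× that of bug_fix_mig_320. The variable names are mine, and the inputs are just the rounded figures from above, not raw counter values.

```python
# Rounded figures from the comparison above (master v2.3 relative to bug_fix_mig_320).
INSTR_RATIO = 7.0                  # master reports ~7x more instructions
CYCLE_RATIOS = (1 - 0.31, 1 - 0.10)  # master has 10-31% fewer cycles -> 0.69x..0.90x
IPC_MASTER = 1.8
IPC_BRANCH = 0.22


def implied_ipc_ratio(instr_ratio: float, cycle_ratio: float) -> float:
    """IPC ratio implied by instruction and cycle ratios, assuming IPC = instr / cycles."""
    return instr_ratio / cycle_ratio


# The largest cycle ratio gives the smallest implied IPC ratio, and vice versa.
low = implied_ipc_ratio(INSTR_RATIO, max(CYCLE_RATIOS))
high = implied_ipc_ratio(INSTR_RATIO, min(CYCLE_RATIOS))
observed = IPC_MASTER / IPC_BRANCH
consistent = low <= observed <= high

print(f"implied IPC ratio: {low:.2f}x to {high:.2f}x")
print(f"observed IPC ratio: {observed:.2f}x")
print(f"consistent: {consistent}")
```

With these rounded inputs the observed IPC ratio falls inside the implied range, so the IPC gap is mostly explained by the instruction count difference, with the cycle difference contributing the rest. This does not tell which branch counts instructions correctly; it only shows the reported numbers are self-consistent.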