A practical observability story for distributed AI workloads: add semantic model regions, give kernel specializations useful names, and inspect the resulting timeline in one place.
Collective operations are repeated on every rank, typically through communication libraries such as NCCL, XCCL, or oneCCL.
Triton specializations may differ in block size, warps, or stages, but the profiler view may not make those variants easy to distinguish.
Without semantic markers, the timeline mostly shows runtime activity and kernels, not initialization, forward, backward, or per-layer structure.
Distributed runs add ranks, collectives, and overlap. A trace with weak labels becomes hard to interpret precisely enough for debugging.
Which step is waiting on synchronization? Where are collectives and runtime events happening across ranks?
Is time spent in initialization, data generation, forward, backward, optimizer step, or a specific layer block?
Which concrete Triton specialization ran there? Was it the block-size / warps / stages variant we expected?
ITT fills in the model-phase layer, Triton specialization names fill in the kernel layer, and THAPI/iprof captures the resulting events on one timeline.
import triton
import triton.language as tl

def _vadd_repr(proxy):
    # Build a human-readable kernel name from the specialization constants.
    bs = proxy.constants["BLOCK_SIZE"]
    w = proxy.constants["W_NAME"]
    s = proxy.constants["S_NAME"]
    return f"vadd_bs{bs}_w{w}_s{s}"

@triton.jit(repr=_vadd_repr)
def vadd(X_ptr, Y_ptr, Z_ptr, N,
         BLOCK_SIZE: tl.constexpr,
         W_NAME: tl.constexpr,
         S_NAME: tl.constexpr):
    ...
grid = (triton.cdiv(N, block_size),)
vadd[grid](x, y, z, N,
           BLOCK_SIZE=block_size,
           W_NAME=num_warps,
           S_NAME=num_stages,
           num_warps=num_warps,
           num_stages=num_stages)
Launch code from run_once(...) in the supplied example.
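The effect of the custom repr can be sketched without running Triton. The `FakeProxy` class below is a hypothetical stand-in for the specialization object Triton passes to the repr callable; only its `constants` mapping is modeled here.

```python
class FakeProxy:
    """Hypothetical stand-in for Triton's specialization object;
    only the .constants mapping used by the repr callable is modeled."""
    def __init__(self, constants):
        self.constants = constants

def _vadd_repr(proxy):
    # Same naming scheme as the kernel above: encode the constants
    # that distinguish one specialization from another.
    bs = proxy.constants["BLOCK_SIZE"]
    w = proxy.constants["W_NAME"]
    s = proxy.constants["S_NAME"]
    return f"vadd_bs{bs}_w{w}_s{s}"

# Two specializations that would otherwise look identical in a trace
# now carry distinguishable names.
name_a = _vadd_repr(FakeProxy({"BLOCK_SIZE": 128, "W_NAME": 4, "S_NAME": 2}))
name_b = _vadd_repr(FakeProxy({"BLOCK_SIZE": 256, "W_NAME": 8, "S_NAME": 2}))
print(name_a)  # vadd_bs128_w4_s2
print(name_b)  # vadd_bs256_w8_s2
```

These strings are exactly what shows up as the kernel name in the trace, so the bs128_w4_s2 and bs256_w8_s2 variants stop being interchangeable blobs on the timeline.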
iprof overview
Use iprof to collect a timeline-oriented trace: iprof captures the run, Perfetto is the viewer, and ITT gives the timeline human-meaningful labels.
Optional telemetry streams can also be aligned in time, but the main story here is semantic model tracing plus runtime activity.
The Intel ITT API lets an application generate and control trace data, so the timeline can carry names that matter to the user instead of only low-level runtime events.
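A minimal sketch of keeping ITT instrumentation unconditional in a script, assuming the `ittapi` Python bindings: when the package is absent, a no-op context manager takes over, so the same call sites work with or without tracing. Whether `domain` accepts a plain string or a domain object may depend on the bindings' version; treat that detail as an assumption.

```python
import contextlib

try:
    import ittapi  # Intel ITT Python bindings (assumed available)

    def task(name, domain=None):
        # Real ITT task region; appears as a named span on the timeline.
        return ittapi.task(name, domain=domain)
except ImportError:
    @contextlib.contextmanager
    def task(name, domain=None):
        # No-op fallback: identical call sites, zero trace output.
        yield

# Usage mirrors the instrumented training loop below:
with task("Step.0", domain="demo"):
    with task("Forward", domain="demo"):
        pass  # model forward would run here
```

This pattern lets the instrumented and baseline scripts share one code path instead of maintaining two copies of the training loop.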
From train_llama3_demo.py
for step in range(args.steps):
    x, y = get_batch()
    optim.zero_grad(set_to_none=True)
    logits = model(x)
    loss = F.cross_entropy(...)
    loss.backward()
    optim.step()
From train_llama3_demo_with_itt.py
for step in range(args.steps):
    with ittapi.task(f"Step.{step}", domain=args.itt_domain):
        x, y = get_batch()
        with ittapi.task("Forward", domain=args.itt_domain):
            logits = model(x); loss = F.cross_entropy(...)
        with ittapi.task("Backward", domain=args.itt_domain):
            loss.backward()
        with ittapi.task("Optimizer.Step", domain=args.itt_domain):
            optim.step()
def forward(self, x):
    with ittapi.task(f"Layer.{self.layer_idx}",
                     domain=self.itt_domain):
        with ittapi.task(f"Attn.{self.layer_idx}",
                         domain=self.itt_domain):
            x = x + self.attn(self.n1(x))
        with ittapi.task(f"MLP.{self.layer_idx}",
                         domain=self.itt_domain):
            x = x + self.mlp(self.n2(x))
    return x
Representative hierarchy; the exact number of layers depends on script arguments.
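The nested tasks above produce a parent/child span structure in the viewer. A minimal stdlib-only sketch (no ITT involved, purely illustrative) shows the kind of hierarchy the timeline reconstructs from that nesting:

```python
import contextlib

stack, spans = [], []

@contextlib.contextmanager
def span(name):
    # Record each region together with its full nesting path,
    # mimicking how nested ITT tasks stack on the timeline.
    stack.append(name)
    spans.append(" > ".join(stack))
    try:
        yield
    finally:
        stack.pop()

# One transformer block, as in forward() above:
with span("Layer.0"):
    with span("Attn.0"):
        pass  # attention compute would run here
    with span("MLP.0"):
        pass  # MLP compute would run here

print(spans)  # ['Layer.0', 'Layer.0 > Attn.0', 'Layer.0 > MLP.0']
```

Per-step tasks (`Step.N`) then sit one level above `Layer.N`, giving the trace the initialization / forward / backward / per-layer structure the earlier questions ask about.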
iprof
module load thapi
module load frameworks
mpiexec --no-transfer --cpu-bind ${CPU_BIND} -n 24 -ppn 12 $(pwd)/ccl_local_wrap.sh \
${THAPI_ROOT}/bin/iprof -l $(pwd)/demo_with_itt.pftrace --sample \
--trace-output $(pwd)/demo_with_itt -- \
$(pwd)/ccl_local_wrap.sh python train_llama3_demo_with_itt.py --device=xpu
The README shows both the baseline command and the ITT-instrumented command. For this talk, the ITT version is the one that adds readable semantic structure to the trace.
| Signal | What it tells you | Example question it answers |
|---|---|---|
| ITT regions | Model-phase and model-structure context | Was the slowdown in initialization, forward, backward, optimizer step, or a specific layer? |
| Triton specialization names | Which concrete kernel variant ran | Was this the bs128_w4_s2 version or the bs256_w8_s2 one? |
| THAPI/iprof timeline | Time alignment between those signals and runtime activity | Exactly when did that specialization appear inside the model step? |