Sparse training researchers and engineers fly blind. They apply sparse algorithms like SET or RigL to PyTorch models, but the training loop is a black box. They cannot see which layers remain stubbornly dense, where sparsity fluctuates erratically, or whether the theoretical memory savings materialize. Today, they debug by writing custom logging scripts, manually parsing tensor dumps, and stitching together static matplotlib plots—a process that consumes 12–18 hours per experiment iteration (source: internal survey of 23 SparseLab power users, Jan 2025). This opacity turns sparse training from a precision tool into a game of chance, slowing research velocity and masking model instability until it causes training collapse.
The business case is the recoverable time of our highest-value users. 240 core SparseLab researchers and engineers (source: SparseLab GitHub repo contributor count & enterprise customer headcount estimate) × 14 hours saved per experiment iteration (baseline: 16 hrs manual, target: 2 hrs with profiler) × 10 major experiment iterations per year = 33,600 hours of recovered R&D time annually. At a blended fully-loaded cost of $72/hour for ML engineers (source: Regional Cost Benchmarks for India-based teams, Level III Engineer), the recoverable value is $2.42M/year. If adoption reaches only 40% of target users: $968K/year. The 3-month build cost is estimated at $310K (3 engineers × 3 months, fully-loaded).
This feature is a real-time, in-training-loop visualization and diagnostic callback that exposes the dynamics of sparse tensors. It is not a general-purpose model profiler, a replacement for experiment trackers like Weights & Biases, or a sparse kernel optimization tool. Its integration surface is one callback; its output is actionable insight.
Competitive Landscape: Existing tools solve adjacent jobs but not this core diagnostic need.
Competitive Analysis Table:
| Capability | PyTorch Prof | W&B | SparseLab Profiler |
|---|---|---|---|
| Live per-layer nnz plot | ❌ | ❌ | ✅ (unique) |
| Weight distrib (sparse) | ❌ | ✅ (dense) | ✅ (sparse-aware) |
| Memory vs dense baseline | ❌ | ❌ | ✅ (unique) |
| Drop-grow event timeline | ❌ | ❌ | ✅ (unique) |
| Export as standalone HTML | ❌ | ✅ (full UI) | ✅ (single-file report) |
| WHERE WE LOSE | Performance profiling depth | Experiment tracking ecosystem & collaboration | Not a general profiler or tracker |
Our wedge is sparse-specific real-time diagnostics because no existing tool is built from the ground up to visualize the unique dynamics of sparse training algorithms as they happen in the loop.
WHO / JTBD: When an ML engineer or researcher is debugging unstable sparse training runs or characterizing a novel sparse algorithm's behavior, they want to see live per-layer sparsity evolution, weight distribution shifts, and memory footprint—so they can identify pathological layers, validate algorithm correctness, and produce reproducible diagnostic reports without manual instrumentation.
WHERE IT BREAKS: Users try manual logging with torch.save() and post-hoc analysis scripts—it fails because it captures snapshots, not a continuous timeline, and adds significant I/O overhead that alters training dynamics. They try general profilers like PyTorch Profiler or TensorBoard—they fail because these tools are built for dense operations and latency; they cannot visualize sparsity topology, nnz (non-zero) counts, or SET/RigL drop-grow events. They end up instrumenting the training loop with print statements for layer stats, manually creating plots, and pasting screenshots into Slack or Notion docs—a fragmented, non-reproducible process that consumes hours of focused debugging time per iteration.
WHAT IT COSTS:
| Metric | Measured Baseline |
|---|---|
| Time to diagnose sparsity instability in a single training run | 16.2 hours avg (n=23 surveyed SparseLab users, Jan 2025) |
| Time to produce a shareable sparsity analysis report | 4.5 hours avg (manual plot generation & doc assembly) |
| Rate of experiments that fail due to undiagnosed sparse training collapse | 22% of runs (source: analysis of 148 failed runs in user Slack channel, Q4 2024) |
Aggregate annual cost per researcher: 10 experiments/year × (16.2 hrs diagnosis + 4.5 hrs reporting) × $72/hr = $14,904/year in recoverable labor. For the target 240-user population, the total addressable problem cost is $3.58M/year.
JTBD statement: "When my sparse training run behaves unexpectedly, I want to see a live, layer-by-layer visualization of sparsity, weight distribution, and topology events, so I can pinpoint the root cause in minutes and share a definitive report with my team."
Phase 1 (MVP — 6 weeks): A callback (SparsityProfilerCallback) that hooks into the SparseLab training loop. It samples sparse tensors at a configurable step interval, computes per-layer metrics (nnz, sparsity %, weight histogram, memory footprint), and serves a local dashboard via a lightweight web server (FastAPI + Altair/Vega-Lite). The dashboard updates in near-real-time (2–3 second latency). When training concludes, the user can export the entire session as a single HTML file containing all visualizations and data.
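A minimal sketch of the per-layer metric computation described above, assuming weights arrive as flat Python lists standing in for PyTorch sparse tensors; the function name, the 4-byte element size, and the COO-style storage estimate are illustrative assumptions, not the shipped implementation:

```python
def layer_metrics(weights, bytes_per_elem=4):
    """Compute nnz, sparsity %, and memory-vs-dense for one layer.

    `weights` is a flat list of floats standing in for a layer's
    weight tensor; zeros are treated as pruned entries.
    """
    total = len(weights)
    nnz = sum(1 for w in weights if w != 0.0)
    sparsity_pct = 100.0 * (total - nnz) / total if total else 0.0
    # COO-style sparse storage: one value + one int32 index per nonzero.
    dense_bytes = total * bytes_per_elem
    sparse_bytes = nnz * (bytes_per_elem + 4)
    return {
        "nnz": nnz,
        "sparsity_pct": round(sparsity_pct, 1),
        "memory_saved_bytes": dense_bytes - sparse_bytes,
    }

metrics = layer_metrics([0.0, 0.0, 0.0, 1.5])
```

The same per-layer dictionary would feed both the live dashboard tiles (e.g., "Memory vs Dense") and the exported report's raw CSV.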
Key User Flow:
1. The user adds `callbacks=[SparsityProfilerCallback()]` to their `SparseTrainer`.
2. Training starts; the callback serves a local dashboard (e.g., `http://localhost:8080`).
3. The user opens the dashboard and watches per-layer metrics update live.
4. At the end of the run, the user exports the session as a single self-contained HTML report.

ASCII Wireframe Screens:
┌─────────────────────────────────────────────────────────────────────────────┐
│ SparseLab Profiler - Live Dashboard [Export Report] │
├─────────────────────────────────────────────────────────────────────────────┤
│ Training Step: 1450 │ Total Model Sparsity: 74.3% │ Memory Saved: 2.1 GB │
├─────────────────────────────────────────────────────────────────────────────┤
│ **Per-Layer Sparsity Evolution** │
│ ┌─────────┐ │
│ │NNZ % │ layer1.conv ██████████████░░░░░░░░ │
│ │ │ layer2.bn ██████████████████░░░░ │
│ │ │ layer3.fc ██████████░░░░░░░░░░░░░ (hover: step 1200, 45%) │
│ └─────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ **Layer Inspector: layer1.conv** [Pause Updates] │
│ Sparsity: 62% │ NNZ: 812/2048 │ Memory vs Dense: -328 KB │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │Weight Distribution │ │Topology Events Timeline│ │
│ │ ▁▂▃▅▆▇█▇▆▅▃▂▁ │ │ SET DROP ┬─────┬ │ │
│ │ Sparse values only │ │ RigL GROW ───┐ │ │ │
│ └───────────────────────┘ └──────────────┴─┴──────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ SparseLab Profiler - Export Configuration [Generate HTML] │
├─────────────────────────────────────────────────────────────────────────────┤
│ Report Title: [ResNet50 SET Run - 2025-04-15 ] │
│ Include: [✓] Per-layer sparsity plots [✓] Weight distribution │
│ [✓] Memory savings summary [✓] Event timeline │
│ [✓] Raw metric data (CSV) [ ] Profiler latency stats │
│ │
│ Output: A single, self-contained HTML file. No external dependencies. │
│ File will be saved to: ./sparsity_report_2025-04-15_1450.html │
└─────────────────────────────────────────────────────────────────────────────┘
Phase 1.1 (3 weeks post-MVP): Add comparison view between two training runs within the same dashboard, enabling A/B analysis of different sparsity algorithms or hyperparameters.
Phase 2 (8 weeks, contingent on Phase 1 success): Integrate anomaly detection heuristics to automatically flag unstable layers (e.g., "Layer 4 sparsity dropped >30% between steps 1000-1200") and suggest corrective actions. Add plugin points for custom user-defined metrics.
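One way the flagged-drop heuristic could look—a sketch under stated assumptions, not the shipped detection logic; the function name, 30-point threshold, and consecutive-sample window are illustrative:

```python
def flag_sparsity_drops(history, threshold_pct=30.0):
    """Flag layers whose sparsity fell by more than `threshold_pct`
    percentage points between two consecutive samples.

    `history` is a list of (step, {layer_name: sparsity_pct}) samples,
    as the profiler's ring buffer might provide them.
    """
    alerts = []
    for (s0, prev), (s1, cur) in zip(history, history[1:]):
        for layer, pct in cur.items():
            drop = prev.get(layer, pct) - pct
            if drop > threshold_pct:
                alerts.append(
                    f"{layer}: sparsity dropped {drop:.0f}% "
                    f"between steps {s0} and {s1}"
                )
    return alerts

history = [
    (1000, {"layer4": 80.0}),
    (1200, {"layer4": 45.0}),
]
alerts = flag_sparsity_drops(history)
```

Phase 1 usage data would inform whether a fixed threshold like this is adequate or whether per-algorithm baselines are needed.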
Kill Criteria for Phase 2: If <25% of Phase 1 adopters use the export feature more than twice in the first 90 days, Phase 2 is canceled—indicating the live dashboard alone satisfies the core need.
Phase 1 — MVP (6 weeks)

US#1 — One-Line Integration: the user adds `callbacks=[SparsityProfilerCallback()]` to the trainer and gets profiling with no other code changes.

US#2 — Live Per-Layer Sparsity Dashboard
US#3 — Export Self-Contained HTML Report
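The user stories above rest on the callback hook pattern plus an in-memory ring buffer. A minimal, self-contained sketch—the `TrainingCallback` base class, the `on_step_end` signature, and the dict-of-lists "model" are simplified stand-ins for SparseLab's actual interfaces:

```python
from collections import deque

class TrainingCallback:
    """Stand-in for SparseLab's callback base class."""
    def on_step_end(self, step, model):
        pass

class SparsityProfilerCallback(TrainingCallback):
    def __init__(self, sample_every=10, buffer_size=1000):
        self.sample_every = sample_every
        # Ring buffer: oldest samples are evicted automatically.
        self.metrics = deque(maxlen=buffer_size)

    def on_step_end(self, step, model):
        if step % self.sample_every != 0:
            return
        # `model` is a {layer_name: flat weight list} stand-in here;
        # the real callback would walk sparse parameter tensors.
        snapshot = {
            name: sum(1 for w in weights if w != 0.0)
            for name, weights in model.items()
        }
        self.metrics.append((step, snapshot))

# Minimal driver loop standing in for SparseTrainer:
cb = SparsityProfilerCallback(sample_every=10)
model = {"layer1.conv": [0.0, 0.3, 0.0], "layer3.fc": [1.0, 0.0]}
for step in range(31):
    cb.on_step_end(step, model)
```

The bounded `deque` is what keeps the "no database" decision viable: memory use is capped regardless of run length.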
Out of Scope (Phase 1):
| Feature | Why Not Phase 1 |
|---|---|
| Compare two runs side-by-side | Increases UI complexity 3×; validate single-run utility first |
| Automated anomaly alerts | Requires defining heuristics based on live user data from Phase 1 |
| Profile every parameter in billion-param models | Performance risk; need to measure overhead on large models first |
Phase 1.1 (3 weeks):
Phase 1.2 (4 weeks):
Primary Metrics:

| Metric | Baseline | Target (D90) | Kill Threshold | Measurement Method |
|---|---|---|---|---|
| Diagnosis time (hrs) | 16.2 hrs | ≤2.5 hrs | >8 hrs (no 2× improvement) | User survey pre/post (n=30) |
| Report creation time | 4.5 hrs | ≤0.5 hrs | >2 hrs | Self-reported time in export UI |
| % failed runs debugged | 0% (manual) | ≥70% of runs have actionable profiler data | <30% | Analysis of user-shared reports in Slack |
| Adoption (% of active users) | 0% | ≥35% of weekly active SparseLab users | <15% | Dashboard ping-back (opt-in) |
Guardrail Metrics (must NOT degrade):

| Guardrail | Threshold | Action if Breached |
|---|---|---|
| Training overhead | <3% added time per step | Pause rollout, optimize sampling |
| Dashboard P95 latency | <3 seconds per update | Switch to less frequent sampling by default |
| Memory overhead of callback | <500 MB additional peak | Add warning and auto-disable for large models |
What We Are NOT Measuring:
Risk: Dashboard data collection imposes >5% training time overhead, causing users to disable the profiler for production runs.
Probability: Medium Impact: High
Mitigation: Engineer (Lin) implements adaptive sampling (reduce frequency if step time increases) by week 3. Owner: Lin. If overhead exceeds 5% in benchmark tests on ResNet50, we ship with sampling default at 50 steps, not 10.
Risk: The live dashboard's memory consumption OOMs on multi-GPU, billion-parameter models, crashing the training job.
Probability: Low Impact: Critical
Mitigation: Phase 1 scope includes a mandatory layer_regex filter to exclude attention layers by default. Add a prominent warning in docs. Owner: Raj. If unsampled OOM occurs in beta, we implement immediate tensor offloading to CPU for profiled layers.
Risk: Users treat the HTML export as a publishable research artifact, but a critical bug causes mislabeled axes or incorrect sparsity calculations, leading to public embarrassment and loss of trust.
Probability: Low Impact: High
Mitigation: Implement a visual regression test suite for the Vega-Lite charts against known sparse tensor fixtures. Freeze the chart spec API in week 4. Owner: Simran. If a calculation bug is found post-launch, we issue a CVE-style advisory and provide a patched version script.
Risk: Competitor (e.g., W&B) launches a sparse-aware profiling feature within 6 months, leveraging their superior UI and collaboration features, making our standalone tool obsolete.
Probability: Medium Impact: Medium
Mitigation: Our wedge is deep integration with SparseLab's training loop and one-line setup—double down on this ease-of-use narrative in marketing. Develop a pipeline to export profiler data to W&B as a Phase 1.1 feature, making us a complementary data source rather than a direct competitor. Owner: PM (Alex) to draft integration spec by launch.
Kill Criteria — we pause and conduct a full review if ANY of these are met within 90 days:
Components:
- Callback (`SparsityProfilerCallback`): Inherits from SparseLab's `TrainingCallback`. Hooks into `on_step_end` to sample sparse tensors from the model. Maintains a ring buffer of metrics in memory.
- Dashboard server (`uvicorn` + FastAPI): Lightweight ASGI server exposing:
  - `GET /` → Serves dashboard HTML/JS.
  - `GET /stream` → Server-Sent Events stream of metric updates.
  - `POST /export` → Generates and returns the HTML report.

Assumptions vs Validated Table:
| Assumption | Status |
|---|---|
Decision: Data storage and transport mechanism for the live dashboard. Choice Made: Keep all metric data in memory in Python dictionaries and serve via FastAPI Server-Sent Events (SSE). Do not introduce a database (SQLite/Redis) for Phase 1. Rationale: A DB adds deployment complexity and is unnecessary for single-user, single-session profiling. In-memory keeps the callback lightweight and dependency-free. Rejected: Streaming to an external service (e.g., W&B) as it violates the "one-line integration" promise and requires network/API keys.
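The SSE transport amounts to framing each in-memory snapshot as a `data:` message; a dependency-free sketch of that framing (the FastAPI endpoint itself is omitted, and the `metrics` event name is an illustrative choice):

```python
import json

def sse_event(snapshot, event="metrics"):
    """Frame one metric snapshot as a Server-Sent Events message.

    Per the SSE wire format, a message is `event:`/`data:` lines
    terminated by a blank line; the browser's EventSource API
    dispatches it to a listener registered for the event name.
    """
    payload = json.dumps(snapshot, sort_keys=True)
    return f"event: {event}\ndata: {payload}\n\n"

msg = sse_event({"step": 1450, "total_sparsity_pct": 74.3})
```

Because SSE is plain HTTP with no handshake beyond a long-lived GET, it keeps the single-user, single-session dashboard free of WebSocket or database dependencies.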
Decision: Visualization library for the HTML export. Choice Made: Use Vega-Lite specifications embedded directly into the HTML, rendered with a local copy of the Vega library. Rationale: This produces truly standalone HTML files with interactive charts that don't require an internet connection to view. Rejected: PNG/SVG static images (lose interactivity) and relying on CDN-hosted Plotly.js (breaks offline use).
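A sketch of how the standalone export could embed a chart spec, assuming the Vega runtime source is inlined as a string (`vega_js` here is a placeholder for the bundled library, and `export_html` is an illustrative helper name):

```python
import json

def export_html(title, chart_spec, vega_js="/* inlined vega bundle */"):
    """Build a self-contained HTML report: the Vega runtime and the
    Vega-Lite spec are embedded directly, so the file opens offline."""
    spec = json.dumps(chart_spec)
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title>
<script>{vega_js}</script></head>
<body><div id="chart"></div>
<script>const spec = {spec};
// vegaEmbed('#chart', spec) would render here via the inlined bundle.
</script></body></html>"""

spec = {
    "mark": "line",
    "encoding": {"x": {"field": "step"}, "y": {"field": "sparsity_pct"}},
}
html = export_html("ResNet50 SET Run", spec)
```

Inlining the runtime trades file size (a few hundred KB per report) for the guarantee that a report opened from a laptop or an air-gapped cluster renders identically.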
Decision: Sampling frequency default. Choice Made: Sample sparsity metrics every 10 training steps by default, user-configurable. Rationale: Captures meaningful evolution for most vision/language model batch sizes without imposing >1% overhead. Rejected: Every step (too heavy) or every epoch (misses intra-epoch dynamics).
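A sketch of the interval logic, including the adaptive back-off described in the overhead risk; the class name, the 1.2× slowdown trigger, and the doubling policy are illustrative assumptions:

```python
class AdaptiveSampler:
    """Sample every `interval` steps; back off if profiling makes
    steps noticeably slower than the unprofiled baseline."""

    def __init__(self, interval=10, max_interval=200, slowdown=1.2):
        self.interval = interval
        self.max_interval = max_interval
        self.slowdown = slowdown

    def should_sample(self, step):
        return step % self.interval == 0

    def record_step_time(self, step_time, baseline_time):
        # If profiled steps run >20% slower than baseline,
        # double the sampling interval up to the cap.
        if step_time > baseline_time * self.slowdown:
            self.interval = min(self.interval * 2, self.max_interval)

sampler = AdaptiveSampler()
sampler.record_step_time(step_time=0.30, baseline_time=0.20)
```

The default of 10 steps maps to the decision above; the back-off is what lets the guardrail ("<3% added time per step") self-enforce without user intervention.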
Decision: Handling of very large models (billions of parameters).
Choice Made: In Phase 1, profile only named submodules selected by the user via a regex filter (layer_regex=".*conv.*"). Do not attempt to automatically profile every parameter tensor.
Rationale: Prevents the dashboard from crashing browsers or exhausting memory. The 80/20 rule: users typically need to debug specific suspect layers. Full-model profiling is deferred to Phase 2 based on performance validation.
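The regex filter could be as simple as matching module names, sketched here with Python's `re`; the helper name and layer names are illustrative:

```python
import re

def select_layers(named_modules, layer_regex=".*conv.*"):
    """Return the subset of layer names the profiler should track.

    `named_modules` stands in for the names a PyTorch model's
    `named_modules()` would yield; only matches are profiled,
    keeping overhead bounded on billion-parameter models.
    """
    pattern = re.compile(layer_regex)
    return [name for name in named_modules if pattern.search(name)]

names = ["layer1.conv", "layer2.bn", "layer3.fc", "layer4.conv"]
selected = select_layers(names)
```

Using `search` rather than `match` means users can pass a bare substring like `"conv"` without anchoring boilerplate.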
Before / After Narrative:
Before: Priya, a researcher at ML lab, is trying to debug why her SET-trained ResNet-50 accuracy suddenly dropped at step 12k. She stops the training, writes a script to load the last 10 checkpoints, iterates through each layer to extract nnz counts using tensor._nnz(), and manually plots them in Matplotlib. She spends 4 hours creating the plots only to realize she needs to see the weight distribution of the sparse elements in the second convolutional layer. She modifies her script to dump the sparse tensor values, loads them into a Jupyter notebook, and generates a histogram. By the time she identifies an anomaly—a bimodal weight distribution indicating unstable gradients—another 6 hours have passed, and she must restart the training from scratch.
After: Priya adds the SparsityProfilerCallback() to her trainer. As training runs, she opens the dashboard URL. At step 11,850, she sees the sparsity line for layer2.conv2 start to plunge. She pauses the live updates, inspects the weight distribution for that layer, and sees it has become bimodal—the classic instability signature. Within 8 minutes of the anomaly occurring, she identifies the root cause. She stops the training, adjusts her gradient clipping hyperparameter, and restarts. After the new run finishes, she exports a single HTML report and shares it with her team, documenting the issue and fix.
Pre-Mortem: "It is 6 months from now and this feature has failed. The 3 most likely reasons are: