
Software Factory Archive


Prometheus Metrics

Rating: General Audiences
Fandom: StrongDM Software Factory
Characters: Jay Taylor, Navan Chauhan, Justin McCarthy
Tags: CXDB, Prometheus, Grafana, Observability, Dashboard
Words: 451
Published: 2025-12-28

The Rust server had been exporting Prometheus metrics since day one. It was a single configuration flag—metrics_enabled = true—and the server would expose a /metrics endpoint on its HTTP port with counters, gauges, and histograms for everything that mattered. Turn appends per second. Blob store hits versus misses. Compression ratios. DAG depth distributions. Projection latencies.

Nobody had done anything with them. The metrics endpoint responded dutifully to every scrape, emitting its text-format exposition like a lighthouse sending signals into fog. No ship had come.

Jay changed that on a Tuesday afternoon.

"I'm building a Grafana dashboard," he announced, already three panels deep.

"For what?" Navan asked.

"For everything."

The first panel was throughput: turns appended per second, broken down by conversation. A stacked area chart that showed the ebb and flow of agent activity across the day. Mornings were quiet—human engineers writing specs, reviewing scenarios. Afternoons were busy—agents running, turns accumulating, the DAG growing.

The second panel was the blob CAS hit rate. A single big number: 67 percent. Sixty-seven percent of all blob writes were deduplication hits—content that already existed in the store. Jay added a historical sparkline behind the number. The hit rate had been climbing steadily since the migration, as more conversations shared more common context.

The third panel was latency. P50, P95, P99 for turn appends, blob reads, and JSON projections. All of them were in the sub-millisecond range. Jay set the Y-axis to microseconds so the numbers would actually be visible on the chart.

"Microsecond latencies on a Grafana dashboard," Navan said, leaning in. "That's going to make the SRE in you very happy."

"The SRE in me is already happy." Jay added a fourth panel: pack file size over time. A steadily ascending line, smooth and predictable, the kind of chart that says "growth" without saying "problem." He added reference lines at 25GB, 50GB, and 100GB so they'd know when to buy the next disk.

The fifth panel was Jay's favorite: active conversations. A real-time count of how many contexts had received a turn in the last five minutes. During peak hours, it hovered around thirty. During off hours, it dropped to two or three—background agents doing maintenance work.

He projected the dashboard onto the wall-mounted monitor in the common area. The team gathered around it.

"It's like a heartbeat monitor," Navan said. "For the whole factory."

Justin studied the panels in silence for a long time. "This is the first time I can see the factory working," he said finally. "Not the code. Not the logs. The actual pulse of it."

The dashboard stayed on that monitor permanently. It became the first thing people looked at when they walked in. The campfire around which the team gathered, watching the numbers flicker and flow, reading the health of their system in the glow of five panels.

Kudos: 93

grafana_glow 2025-12-30

The lighthouse metaphor for unused metrics is perfect. All that data, faithfully emitted, waiting for someone to care. And then Jay builds the dashboard and suddenly the factory has a pulse you can watch.

sre_soul 2025-12-31

Setting the latency Y axis to microseconds so the bars would be visible. That's either a flex or a genuine problem with charting libraries. Either way, beautiful numbers.
