Your Agent Works. Can You Prove It?

Plus 70% latency cuts, context graphs, and tmux workflows

Liberté, égalité, ouverture.

I’m doing my part for the cause by hosting OpenXData on April 29 - a free, virtual event with 30+ talks on open data infra, from table formats to query engines to feature serving.

Join me and get sharper than a guillotine at cutting latency.

HOT TAKE

Ctrl+Trust

"I need to understand the code" is the new bottleneck.

What matters more: Shipping or Knowing?

LAST WEEK'S TAKE

A Model Interface

“We upgraded the model and nothing changed” has played out enough times that interface wins comfortably.

PRESENTED BY CLERIC AI

Heading to Google Cloud Next in Vegas next week?

Some of the best conversations happen between sessions.

Cleric AI is bringing those conversations to the table instead - a small dinner for engineering leaders at José Andrés’ renowned restaurant, Zaytinya, on Tuesday, April 21.

A chance to compare notes on how AI is changing the way teams build and ship software, with people running systems in production.

Small group, limited seats.

Request an invite

HIDDEN GEMS

Curated finds to help you stay ahead

Autonomously completing 66 engineering tickets via agents using structured workflows, task decomposition, and iterative feedback for real-world software development.

A standardized method for scaling autonomous agents on the OpenClaw architecture, showing how separating out high-level logic keeps task execution scalable and reproducible.

LLM workflows orchestrated with tmux panes, enabling parallel tool use, inspection, and control with transparent, reproducible, debuggable execution.

Detecting humans vs machines in call audio within milliseconds to route calls efficiently, balancing latency, accuracy, and scale at high volume.

Senior ML platform role focused on building and scaling production systems across Databricks and Spark. Covers full lifecycle from feature pipelines to model serving, with emphasis on reliability, observability, and distributed data performance in production environments.

Responsibilities

Build batch and streaming feature pipelines using PySpark and Spark SQL
Design and operate offline and online feature store patterns
Define MLflow registry standards and model promotion workflows
Deploy, monitor, and scale model serving endpoints on Databricks

Requirements

Strong PySpark and Spark SQL experience with distributed data systems
Hands-on MLflow, feature stores, and production model serving experience
Experience implementing CI/CD pipelines for ML workflows in production
Experience with Databricks, Delta Lake, and Azure-based data platforms

MLOPS COMMUNITY

The Modern Software Engineer

AI coding agents can finish a task before you’ve finished framing it, but that speed hides a harder problem: how much of the work can you trust, verify, or even understand? This discussion looks past the demo magic and into the practical bottlenecks teams are hitting as agents move from autocomplete to semi-autonomous collaborators.

Validation is the real constraint. Agents can generate code fast, but tests, checks, and review harnesses still decide what is safe to ship.
Team structure is starting to shift. Product, engineering, and design roles are bleeding into each other as more people can inspect code, propose changes, and unblock themselves.
The skill gap is changing shape. Clear articulation, planning, and delegation matter more when engineers are effectively managing agents instead of writing every step by hand.

The hard part is no longer getting code written but knowing what to trust, what to verify, and where humans still need to hold the line.

Video || Spotify || Apple

How We Cut LLM Latency 70% With TensorRT in Production

Cut latency 70% or burn cash on idle GPUs - running LLMs in production is a constant trade-off. This breakdown shows what it takes to move from demos to real systems, where cost, throughput, and architecture decisions matter more than model choice.

Cost isn’t fixed - it’s shaped by architecture. Bigger GPUs can be cheaper overall if higher throughput reduces total runtime.
Cold starts and scaling are the hidden bottlenecks. Preloading models, faster storage, and scheduled or dynamic scaling cut minutes off spin-up times.
Optimization compounds. Techniques like TensorRT, batching, and KV cache usage unlock major gains without changing models.

The real advantage comes from tuning the system around your workload, not chasing the next model release.
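To make the "bigger GPUs can be cheaper" point concrete, here is a minimal back-of-the-envelope sketch. The prices and throughput figures are made up for illustration and do not come from the talk; `job_cost` is a hypothetical helper, not part of any library.

```python
def job_cost(hourly_price: float, total_tokens: int, tokens_per_second: float) -> float:
    """Dollar cost of pushing total_tokens through one GPU at a steady throughput."""
    runtime_hours = total_tokens / tokens_per_second / 3600
    return hourly_price * runtime_hours


# Hypothetical workload and hardware numbers (assumptions, not benchmarks).
WORKLOAD_TOKENS = 36_000_000

small_gpu = job_cost(hourly_price=1.0, total_tokens=WORKLOAD_TOKENS, tokens_per_second=1_000)
large_gpu = job_cost(hourly_price=3.0, total_tokens=WORKLOAD_TOKENS, tokens_per_second=5_000)

print(f"small GPU: ${small_gpu:.2f}")  # 10 hours at $1/hour
print(f"large GPU: ${large_gpu:.2f}")  # 2 hours at $3/hour
```

In this toy case the GPU that costs three times as much per hour finishes five times faster, so the total bill is lower. Whether that holds for your workload depends on the throughput you actually measure.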

Video || Spotify || Apple

Context Graphs And Their Implementation: The Missing Layer Between Human Judgment and Machine Agency

If context graphs are meant to become the memory layer for agents and organizations, the hard part is not drawing nodes and edges. It is capturing why a decision happened, who approved it, what constraints shaped it, and whether it later proved right. This piece argues that context graphs only become useful when they can survive real company messiness like review workflows, legal sensitivity, local jargon, and scale.

Decision traces need governance. If humans do not review, correct, and approve them, the graph risks becoming a polished record of bad reasoning.
Reasons need dual encoding. Short natural-language explanations plus structured tags give humans something readable and agents something stable to reason over.
The data layer has to handle reality. Time-aware context, multimodal artifacts, integrations, retention rules, and fast writes are all part of making this work outside a demo.

The real blocker is not the graph itself but whether an organization can turn judgment into something structured, reviewable, and worth trusting later.
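One way to picture "dual encoding" plus human review is a record that carries both a readable reason and stable tags. This is only a sketch under our own assumptions: `DecisionTrace`, its field names, and `reviewed_with_tag` are hypothetical and are not taken from the blog.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DecisionTrace:
    decision: str
    reason_text: str                   # short explanation a human can read
    reason_tags: List[str]             # structured tags an agent can filter on
    approved_by: Optional[str] = None  # stays None until a human signs off
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_reviewed(self) -> bool:
        return self.approved_by is not None


def reviewed_with_tag(traces: List[DecisionTrace], tag: str) -> List[DecisionTrace]:
    """Return only human-approved traces that carry the given tag."""
    return [t for t in traces if t.is_reviewed() and tag in t.reason_tags]


traces = [
    DecisionTrace("Delay launch", "Legal review still open.", ["legal", "risk"], approved_by="dana"),
    DecisionTrace("Skip load test", "Traffic looked low.", ["risk"]),  # never reviewed
]
trusted = reviewed_with_tag(traces, "risk")
```

Filtering on `is_reviewed()` keeps unapproved reasoning out of whatever the agent reads back later, which is the governance point from the first bullet.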

Read the blog

IN PERSON EVENTS

Amsterdam - April 21
Boston - April 27
San Francisco - May 15

ML CONFESSIONS

The Twilight Time Zone

I spent three weeks building what I was convinced was a breakthrough feature for our recommendation model. Pulled in user session duration, did some clever windowing, engineered a rolling average that captured engagement patterns nobody else had tried. Offline metrics jumped. I wrote up the results, put together slides, and booked time with the team lead to talk about promoting it to the next A/B test.

She looked at it for about ten minutes. Asked me what timezone the session timestamps were in. I said UTC. She pulled up the ingestion pipeline docs and showed me they were in the user's local timezone, mixed across regions, with no normalization. My rolling averages were blending Tuesday morning in Tokyo with Monday evening in Chicago. The "signal" was just timezone noise creating artificial variance that happened to correlate with the label in the test set.

She was nice about it. That almost made it worse. I still think about it every time I touch a timestamp column.
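For anyone who wants to avoid the same confession, here is a minimal sketch of the bug using only the Python standard library. It assumes each record carries a UTC offset alongside its local timestamp, which is our assumption and not a detail from the story. The `to_utc` helper is hypothetical. Fixed offsets ignore daylight saving time, so a real pipeline would more likely use IANA zone names via `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, utc_offset_hours: float) -> datetime:
    """Attach the record's UTC offset to a naive local timestamp, then convert to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=local_tz).astimezone(timezone.utc)


tokyo_local = datetime(2024, 1, 9, 9, 0)     # Tuesday 09:00 in Tokyo (UTC+9)
chicago_local = datetime(2024, 1, 8, 18, 0)  # Monday 18:00 in Chicago (UTC-6 in winter)

# Treating both as if they shared a timezone makes them look 15 hours apart.
naive_gap = tokyo_local - chicago_local

# After normalization they turn out to be the same instant.
tokyo_utc = to_utc(tokyo_local, 9)
chicago_utc = to_utc(chicago_local, -6)
true_gap = tokyo_utc - chicago_utc

print(naive_gap, true_gap)
```

Any rolling window computed on the un-normalized column inherits that phantom 15-hour spread.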

Share your confession here.

HOW WE CAN HELP

Making the hard stuff simpler

Working on something tricky or planning ahead? Here’s how we can help - just hit reply:

Custom workshops tailored to your company’s needs
Hiring? I know some quality folks looking for a new adventure
Want to connect with someone tackling similar problems? I can introduce you

Thanks for reading, catch you next time!

Interested in partnering with us? Get in touch:

See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on X and LinkedIn.