
Good LLM Observability is just plain observability

In this post, I aim to demystify the concept of LLM observability. I'll illustrate how the everyday tools used for system monitoring and debugging can be effectively harnessed to improve AI agents. Using OpenTelemetry, we'll delve into creating comprehensive telemetry for intricate agent actions, spanning from question answering to autonomous decision-making.

If you want to learn about my consulting practice, check out my services page. If you're interested in working together, please reach out to me via email.

What is OpenTelemetry?

Essentially, OpenTelemetry comprises a suite of APIs, SDKs, and tools that facilitate the creation, collection, and export of telemetry data (such as metrics, logs, and traces). This data is crucial for analyzing and understanding the performance and behavior of software applications.
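To make that concrete, here's a minimal sketch using the OpenTelemetry Python SDK: it wires up a tracer, wraps one unit of work in a span, and exports it to the console. The span name and attribute are illustrative choices, not a prescribed schema.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-demo")

# Any unit of work becomes a span with attributes you choose.
with tracer.start_as_current_span("answer_question") as span:
    span.set_attribute("question", "How many users signed up last week?")
    # ... call the LLM here ...
```

In production you'd swap the console exporter for an OTLP exporter pointed at your collector, but the instrumentation code stays the same.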

Demystifying Telemetry in AI

The lack of sufficient observability in many AI agents today hinders their evaluation and optimization in real-world scenarios. By integrating OpenTelemetry, we can not only enhance the transparency of these agents through tools like Prometheus, Grafana, and Datadog, but also feed that insight back into refining the agents themselves.
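As a sketch of what that integration can look like (assuming the opentelemetry-exporter-prometheus package is installed), the snippet below counts LLM calls and token usage as OpenTelemetry metrics and serves them on an endpoint that Prometheus can scrape and Grafana can chart. The metric names and labels are my own, not a standard.

```python
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

start_http_server(port=8000)  # endpoint Prometheus scrapes
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

meter = metrics.get_meter("agent")
llm_calls = meter.create_counter("llm_calls", description="LLM API calls")
llm_tokens = meter.create_counter("llm_tokens", description="tokens consumed")

# After each model call (values here are illustrative):
llm_calls.add(1, {"model": "gpt-4"})
llm_tokens.add(1542, {"model": "gpt-4", "kind": "prompt"})
```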

However, it's crucial to recognize that what's often marketed as a specialized LLM telemetry service is frequently just a superficial dashboard wrapped around basic API interactions. These services don't provide the depth required to generate telemetry across the whole stack, or the means to meaningfully feed this data back into the AI agents.

Applying Telemetry to AI Agents

Consider a conversational agent that formulates SQL queries in response to natural language questions, interacting with various data sources through a Router Agent. If issues arise, be it database errors or latency spikes, pinpointing the culprit - whether the LLM, the SQL query, or the database itself - becomes challenging. Current LLM ops tooling rarely instruments these external components, leaving such questions unanswered.

Adopting standards like OpenTelemetry can bridge this gap, offering a holistic view of the agent's actions and their interconnections. This insight is pivotal for enhancing system performance, robustness, and incident detection and resolution.
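Below is a hedged sketch of what that looks like for the text-to-SQL agent above (the helpers route(), generate_sql(), and run_query() are hypothetical stand-ins for your own logic). Nested spans separate the routing decision, the LLM call, and the database call, so a latency spike or error lands on exactly one of them.

```python
from opentelemetry import trace

tracer = trace.get_tracer("sql-agent")

def answer(question: str):
    with tracer.start_as_current_span("router") as root:
        source = route(question)  # placeholder: pick a data source
        root.set_attribute("datasource", source)

        with tracer.start_as_current_span("llm.generate_sql") as llm_span:
            sql = generate_sql(question, source)  # placeholder: LLM call
            llm_span.set_attribute("db.statement", sql)

        with tracer.start_as_current_span("db.execute") as db_span:
            try:
                return run_query(source, sql)  # placeholder: run the SQL
            except Exception as exc:
                db_span.record_exception(exc)
                db_span.set_status(trace.StatusCode.ERROR)
                raise
```

With this in place, a slow query shows up as a long db.execute span carrying the exact SQL statement, while a malformed generation shows up on llm.generate_sql, and you no longer have to guess which layer failed.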

The Potential of Telemetry Data

Envision using telemetry data for model-guided self-evaluation; this could make model evaluation scalable. By analyzing the complete task call graph, we can identify and address inefficiencies - for instance, isolating the events that lead to high-latency database calls or errors.
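Here's a sketch of that kind of analysis, assuming span records have already been exported from a backend as plain dicts. The field names below are my assumption, not a fixed OpenTelemetry schema.

```python
SLOW_MS = 500  # latency threshold, arbitrary for illustration

def slow_db_calls(spans):
    """Yield slow 'db.execute' spans plus the spans that preceded them.

    spans: list of dicts with trace_id, name, start_ms, end_ms keys
    (an assumed export format, not an official schema).
    """
    by_trace = {}
    for s in spans:
        by_trace.setdefault(s["trace_id"], []).append(s)

    for s in spans:
        if s["name"] == "db.execute" and s["end_ms"] - s["start_ms"] > SLOW_MS:
            # Everything in the same trace that finished before this call
            prior = [p for p in by_trace[s["trace_id"]] if p["end_ms"] <= s["start_ms"]]
            yield s, sorted(prior, key=lambda p: p["start_ms"])
```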

This data, once fed back into the LLM, could prompt targeted fine-tuning. The LLM might analyze a series of transactions, identifying and ranking documents based on relevance, or suggest corrections in a cycle of calls, thus refining the data for model improvement.
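One way this feedback loop could work, sketched under the assumption that each trace record carries the user question, the generated SQL, and an outcome status: keep the traces judged correct and serialize them into the JSONL chat format that fine-tuning endpoints such as OpenAI's accept.

```python
import json

def traces_to_finetune_jsonl(traces, path="finetune.jsonl"):
    """Convert successful traces into chat-format fine-tuning examples.

    traces: list of dicts with question, sql, status keys (assumed layout).
    """
    with open(path, "w") as f:
        for t in traces:
            if t["status"] != "OK":  # keep only traces judged correct
                continue
            example = {"messages": [
                {"role": "user", "content": t["question"]},
                {"role": "assistant", "content": t["sql"]},
            ]}
            f.write(json.dumps(example) + "\n")
```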

Redefining Telemetry: The Key to Self-Improvement?

Telemetry in the realm of AGI might well be akin to a detailed diary, instrumental for reflection and advancement. With robust telemetry, we gain unprecedented insight into the actions of AI agents, enabling the creation of systems that not only evaluate but also self-optimize complex actions. This synergy of human and computer intelligence, driven by comprehensive telemetry, holds the key to unlocking the full potential of AI systems.

In essence, observing LLM systems doesn't necessitate new tools; it requires viewing agent systems through the lens of distributed systems. The distinction lies in the potential exportation of this data for the refinement and distillation of other models.

A prime example of this distillation process can be found in the Instructor blog, where a GPT-3.5 model is fine-tuned using GPT-4 outputs, demonstrating the power of leveraging telemetry data for model enhancement.