
LLM observability: tools for monitoring Large Language Models

LLM observability basics & tools

Observability helps monitor the operation of LLM-based systems. The following tools and techniques can be used to track the resources and behavior of Large Language Models (LLMs) used in production.

Implementing observability for accurate LLM monitoring helps assess and maintain the quality of any LLM-based system, reduce operational costs, find security issues, and uncover and fix blunders like the one in our latest DevQualityEval deep dive (where GPT-4 Turbo ended up producing a whole lot of gibberish). In this post, we’ll focus on monitoring various quantitative and qualitative aspects of an LLM-based pipeline.

What is LLM observability?

LLM observability refers to the set of methodologies, practices, and tools that aim to provide a better understanding of LLM performance in a live production system. It is the process that enables the monitoring of LLM applications. Through this, LLM observability helps maintain and improve the performance of large language models in the given use case and environment, while minimizing costs and safeguarding the system against malicious intents.

Monitored metrics may cover the behavior, performance, and/or costs of the LLM. They capture the full context of the execution and provide insights into all the layers of an LLM-powered piece of software including the application itself, the query, the additional context, and the output it gives. Observability can help overcome the main problems that you’re likely to run into when integrating LLMs into your product.

Common problems with LLMs

Common problems include hallucinations, drift, logical errors, jailbreaking attacks, and misunderstood context. LLMs also introduce an additional cost factor that is not easy to assess since it is token or usage-based.

❗️ What are the most frequently encountered problems with LLMs?

  • Hallucinations: Generating false information or inconsistencies when the LLM doesn’t know the answer to a question in the prompt.
  • Drift: Model drift (aka model decay) refers to changes in the distribution of the input data the model encounters compared to the data it was trained on. As real-world facts and usage patterns change over time, this can alter the LLM’s responses and degrade its performance.
  • Logical errors: Misunderstanding the logical requirements in the input prompt, which likely results in incorrect output.
  • Misunderstood context: While the task itself may be understood by the model, it can misinterpret the context of the input query, which can lead to a factually correct output that just doesn’t align with the intent of the prompt.
  • Jailbreaking: Also known as vandalism, this is when a user manipulates the model to behave in a way that wasn’t intended and/or is downright malicious. Often, the intent is to gather information or to exploit a third party (i.e. other users). Prompt injection is an example.

Some quality assessments are easy to conduct via metrics (such as the various types of statistical and model-based scorers covered here, which help quantify these problems), while others involve more subjective evaluation and often require a human evaluator’s input. These qualitative difficulties arise especially with natural text generation (e.g. tone or creativity), although some of them also apply to software code.

For example, in our experience benchmarking LLMs for code generation specifically, the most common problems were:

  • Missing or incorrect package statements
  • Missing or unnecessarily added imports
  • Assigning negative values to uint
  • Wrong object creation for nested classes
  • Going completely bonkers and generating gibberish

A study by researchers from universities in the US, Canada, and Japan identified several further common code generation errors including condition errors, constant value errors, reference errors, operation/calculation errors, incomplete code or missing steps, and memory errors.

Similar challenges exist for other domains where one might plan to integrate an LLM into a product.

🤔 Which LLM to use for code generation?

Want to know which problems we faced the most often while generating Java and Go code with LLMs, and which models proved the most useful? Check out this deep dive into the results of our coding benchmark DevQualityEval v0.5.0:

DeepSeek v2 Coder and Claude 3.5 Sonnet are more cost-effective at code generation than GPT-4o! (Deep dives from the DevQualityEval v0.5.0)

LLMs are fundamentally non-deterministic: they may not return the same, or even similar, responses when the same input is passed to them repeatedly. That variance in the output creates opportunities for errors (especially in our case, i.e. when generating software code).

All those insights can be difficult to identify and track in real time, limiting the potential of LLMs in practical applications. That’s where LLM observability helps.

Why monitor LLMs?

Implementing LLM observability to monitor a range of metrics can be useful in improving results in a range of domains:

  • Performance: By monitoring performance metrics (including but not limited to latency and throughput) in real time, you can implement iterative & continuous improvement and react to any problems promptly. That should result in improved performance in general.
  • Quality: Monitoring can also help increase functional accuracy, enabling you to engineer more effective prompts, design RAG pipelines, or conduct fine-tuning that contributes to improved output.
  • Transparency: Monitoring an LLM can help you better understand how your system works by enabling you to inspect any parts of the LLM-assisted pipeline. This also makes troubleshooting easier.
  • Security: Observability is key to identifying (potential) vulnerabilities, abuse, or external attacks on your system.
  • Cost management: Since most LLMs have a token-based pricing model, tracking token consumption is vital to improving the cost-effectiveness of your LLM usage. If you’re running a model locally or looking to optimize resource utilization, you can also track the usage of CPU/GPU, memory, etc. Assessing the functional performance of your model can reveal optimization potential (e.g. swapping out a costly model for a cheaper alternative with only marginal quality degradation). A simple cost sketch follows this list.
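
To illustrate the cost angle, here’s a minimal sketch (in Python) of estimating per-request cost from token counts. The prices and the estimate_cost helper are hypothetical placeholders rather than any provider’s actual rates:

    # Rough per-request cost estimate from token usage.
    # The prices below are placeholders; substitute your provider's current rates.
    PRICE_PER_1K_PROMPT_TOKENS = 0.005      # USD, hypothetical
    PRICE_PER_1K_COMPLETION_TOKENS = 0.015  # USD, hypothetical

    def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
        """Return the estimated cost in USD for a single LLM request."""
        return (prompt_tokens / 1000) * PRICE_PER_1K_PROMPT_TOKENS \
            + (completion_tokens / 1000) * PRICE_PER_1K_COMPLETION_TOKENS

    # Example: a request with 1,200 prompt tokens and 300 completion tokens
    print(estimate_cost(1200, 300))  # roughly 0.0105 USD

Tracking these estimates per request (or per user) over time is what makes cost regressions visible before the monthly bill arrives.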

How to monitor LLMs?

The theory behind monitoring LLMs is fairly straightforward. In its simplest form, you can manually check the model’s output to see if it’s accurate and note which resources were consumed during computation.

However, using monitoring tools lets you dig deeper. You’ll want to monitor all three key phases of working with LLMs:

  • Track the request (i.e. the user’s query or prompt to the model)
  • Trace what’s going on in the LLM’s application stack to better understand (and be able to optimize) the process. This includes real-time monitoring and, ideally, alerts about anomalies.
  • Track the response to monitor quantitative measures like response time, token consumption, and resource utilization, as well as quality and a range of other metrics (a minimal logging sketch follows this list).
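
As a rough illustration of tracking the request and response phases in code, here’s a minimal sketch that wraps an LLM call and logs latency plus token consumption. It assumes the OpenAI Python SDK (v1+); the logged_completion helper and the chosen model are just examples:

    # Minimal request/response logging around an LLM call (illustrative sketch).
    # The logged fields mirror the three phases above: request, execution, response.
    import logging
    import time

    from openai import OpenAI

    logging.basicConfig(level=logging.INFO)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def logged_completion(prompt: str, model: str = "gpt-4o-mini") -> str:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        latency = time.perf_counter() - start
        logging.info(
            "model=%s latency=%.2fs prompt_tokens=%d completion_tokens=%d",
            model,
            latency,
            response.usage.prompt_tokens,
            response.usage.completion_tokens,
        )
        return response.choices[0].message.content

    print(logged_completion("Write a haiku about observability."))

A dedicated observability tool does the same thing at scale, plus tracing across nested calls, persistent storage, and dashboards.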

What metrics to measure when monitoring LLMs depends on your goals.

Commonly tracked metrics include:

  • Scalability and performance metrics: A few examples are availability, latency, and error rates.
  • External context and functionality: Context provided via RAG (e.g. vector embedding cache hits) as well as any tools/functions that the LLM can utilize.
  • Metrics on model output quality: Measures for accuracy, relevance, confidence, bias, user feedback, etc.
  • Data input: For accurate tracing, it’s vital to log and validate the input (queries to the LLM).
  • Resource consumption: Things like GPU, CPU, memory, and network I/O usage.
  • Data drift: Comparisons to measure drift between training data and live data.
  • Security: Statistics about detected prompt injection attempts or data leakage.

When monitoring LLMs, it’s a good idea to set up alerts that notify you if certain thresholds are crossed for output quality, relevance, etc., or if other anomalies are detected.
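
To make the alerting idea concrete, here’s a small, self-contained sketch of such a threshold check. The record fields and threshold values are hypothetical and would be replaced by whatever metrics your monitoring stack actually collects:

    # Sketch of a simple threshold-based alert on monitored metrics.
    # The record fields and thresholds are illustrative, not a fixed schema.
    from dataclasses import dataclass

    @dataclass
    class LLMCallRecord:
        latency_seconds: float
        total_tokens: int
        relevance_score: float  # e.g. from a model-based or human evaluation

    THRESHOLDS = {
        "latency_seconds": 5.0,   # alert if slower than 5 seconds
        "total_tokens": 4000,     # alert if a single call burns too many tokens
        "relevance_score": 0.7,   # alert if output quality drops below 0.7
    }

    def check_alerts(record: LLMCallRecord) -> list[str]:
        alerts = []
        if record.latency_seconds > THRESHOLDS["latency_seconds"]:
            alerts.append(f"High latency: {record.latency_seconds:.2f}s")
        if record.total_tokens > THRESHOLDS["total_tokens"]:
            alerts.append(f"High token usage: {record.total_tokens}")
        if record.relevance_score < THRESHOLDS["relevance_score"]:
            alerts.append(f"Low relevance: {record.relevance_score:.2f}")
        return alerts

    print(check_alerts(LLMCallRecord(latency_seconds=6.1, total_tokens=1200, relevance_score=0.65)))

In practice, the notification side (email, Slack, PagerDuty, etc.) is handled by the observability platform itself.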

Open-source tools for LLM monitoring & observability

A variety of software tools exist to support LLM monitoring. Below, we’re providing a list of (paid and open-source) tools for LLM observability. Note that the features and characteristics listed below represent the situation at the time of writing (late Aug 2024). Tools are listed in alphabetical order in each group.

Helicone

Helicone (GitHub) is an open-source LLM observability platform that’s backed by Y Combinator.

The tool logs all LLM completions and tracks the metadata of user queries. It can break down measured metrics by user, model, or prompt, and even provides caching and rate limiting to help save on LLM costs.

It also allows conducting experiments with different prompts, as well as collecting user feedback and metrics about LLM output quality, while monitoring all user prompts for malicious intent.

Helicone supports a range of LLM providers. It provides a playground for experimentation and evals to score LLM outputs.
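
For illustration, a minimal sketch of the proxy-style integration with the OpenAI Python SDK is shown below. The gateway URL and Helicone-Auth header follow Helicone’s documentation at the time of writing; treat them as assumptions and verify against the current docs:

    # Illustrative Helicone integration via its proxy/gateway for the OpenAI SDK.
    import os

    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["OPENAI_API_KEY"],
        base_url="https://oai.helicone.ai/v1",  # route requests through Helicone
        default_headers={
            "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        },
    )

    # Requests made through this client should now show up in the Helicone dashboard.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize what LLM observability is."}],
    )
    print(response.choices[0].message.content)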

Langfuse

Langfuse (GitHub) is an open-source LLM engineering platform that provides functionality for prompt management, tracing for all LLM calls (including nested traces), evaluation, as well as metrics to track costs, latency, and quality.

Like Helicone, Langfuse also supports running experiments to test the behavior of your application before deployment. Besides Langfuse’s managed cloud offering, you can run the tool locally or self-host it.

Langfuse is integrated with the most popular frameworks and is model-agnostic, meaning it should work with any LLM application and the model of your choice. The tool also provides dashboards and data export for easy access to insights.
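
As a minimal sketch of what instrumenting an application with Langfuse can look like, the decorator-based tracing below follows the Python SDK’s documented pattern at the time of writing (import paths may differ between SDK versions); the answer_question function is a hypothetical example:

    # Illustrative Langfuse tracing using its Python decorator API.
    from langfuse.decorators import observe

    @observe()  # records a trace (with timings and inputs/outputs) for each call
    def answer_question(question: str) -> str:
        # ... call your LLM of choice here; nested calls can also be @observe()d ...
        return "stubbed answer to: " + question

    print(answer_question("What does LLM observability mean?"))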

Phoenix

Phoenix (GitHub) is an open-source AI observability platform that provides functionality for:

  • Tracing based on OpenTelemetry
  • Evaluations to benchmark performance (response and retrieval)
  • Datasets for evaluation and fine-tuning
  • Experiments to evaluate changes to prompts, the model itself, and retrieval

It can also track down issues and visualize or export data. Phoenix can be run locally, in a container, or in the cloud. Popular frameworks (like LlamaIndex, LangChain, etc.) and LLM providers (OpenAI, Mistral, and more) are supported. Language support includes Python and JavaScript.
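
A minimal local setup sketch, assuming the arize-phoenix Python package; instrumenting a specific framework would additionally require the matching OpenInference instrumentor package:

    # Illustrative local Phoenix setup; launch_app() starts the local UI.
    import phoenix as px

    session = px.launch_app()  # starts the Phoenix UI locally
    print(session.url)         # open this URL to inspect collected traces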

Traceloop OpenLLMetry

OpenLLMetry (GitHub) is an open-source telemetry solution for LLMs based on OpenTelemetry. It lets you monitor, debug, and test the quality of LLM results, tracking a variety of semantic, syntactic, safety, and structural metrics.

OpenLLMetry can provide real-time alerts about LLM output quality, execution tracing for every request, and a way to debug and rerun issues from production in your IDE. The tool offers a range of integrations and can be connected to existing observability solutions (like Dynatrace, Datadog, etc).

The OpenLLMetry SDK supports a wide range of LLM providers (including OpenAI, Anthropic, Cohere, Ollama, MistralAI, HuggingFace, Bedrock, Vertex AI, Gemini, etc.) as well as a number of vector databases and frameworks. It works with Python, JavaScript/TypeScript, and Go.
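
A minimal setup sketch, assuming the Traceloop Python SDK’s documented entry points at the time of writing; the app name and the summarize workflow are hypothetical examples:

    # Illustrative OpenLLMetry setup via the Traceloop SDK.
    from traceloop.sdk import Traceloop
    from traceloop.sdk.decorators import workflow

    Traceloop.init(app_name="llm-observability-demo")  # exports OpenTelemetry traces

    @workflow(name="summarize")
    def summarize(text: str) -> str:
        # ... LLM calls made here are traced automatically via instrumentation ...
        return text[:100]

    print(summarize("LLM observability helps you monitor production models."))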

Paid tools for LLM monitoring & observability

Datadog LLM Observability

Datadog LLM Observability is a paid tool that provides end-to-end tracing for LLM application workflows, offering visibility into things like inputs and outputs, token use (costs), and latency. The tool also includes quality and security checks, and can help troubleshoot errors and failures.

Datadog provides end-to-end traces for user requests and can identify embedding or retrieval errors for RAG. Besides general performance metrics, Datadog can also collect evaluation data about the quality of responses to identify problematic domains. Datadog can track operational metrics to help optimize LLM costs. For security, it has built-in privacy and security scanners and can automatically flag prompt injection attempts.

Dynatrace AI Observability

Dynatrace is a cloud monitoring platform that provides functionality for end-to-end LLM observability. Dynatrace AI Observability collects metrics, logs, and traces, detects problems, and offers dashboards to visualize the data.

The tool can leverage the company’s Davis® AI solution (and a range of other technologies) to help users identify performance bottlenecks, comply with privacy and security regulations, and control (or forecast) costs.

Use Dynatrace to observe metrics that provide insights into multiple layers of the LLM application stack:

  • Infrastructure
  • Models
  • Semantic caches and vector databases
  • Orchestration
  • Application health

Dynatrace can measure stability, latency, load, model drift, data drift, and costs, and can detect anomalies. The tool can also track infrastructure data (such as temperature, memory usage, process usage) to help curb energy costs.

HoneyHive

HoneyHive is an AI evaluation and observability platform that enables developers to trace and evaluate LLM applications. Logs are enriched with user feedback, metadata, and user properties.

Tracing extends to:

  • Model events: model inference calls
  • Tool events: external API calls (e.g. retrieval)
  • Chain events: groups of inference and external API calls
  • Session events: a trace that groups a series of related requests

HoneyHive offers visual dashboards with charts for easy access to LLM monitoring insights. The tool provides functionality to filter, curate, and label datasets to help continuously improve your LLM applications. HoneyHive also lets you run online evaluations on live production data to catch LLM errors.

LangSmith

LangSmith is a paid add-on to the (open-source) LangChain ecosystem. LangChain (GitHub) is an open-source LLM framework for developing LLM-powered applications in Python and JavaScript.

The ecosystem includes LangSmith, LangGraph, and LangServe. For our purposes, LangSmith is the relevant one, as it provides observability functionality to monitor and evaluate LLM applications.

LangSmith works on its own; you don’t even need to use other elements of the LangChain ecosystem. It provides tracing (including distributed tracing, metadata logging, and logging custom LLM traces) of your LLM usage. The tool lets you compare traces and calculate token costs per trace. LangSmith offers auto-evaluation of responses, or you can write your own functional evaluation tests.
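
As a minimal sketch of standalone LangSmith tracing (no LangChain required), the following uses the langsmith Python SDK’s traceable decorator and the environment variables documented at the time of writing; the generate_answer function is a hypothetical example:

    # Illustrative LangSmith tracing with the langsmith Python SDK.
    # Expected environment setup (per the docs at the time of writing):
    #   export LANGCHAIN_TRACING_V2=true
    #   export LANGCHAIN_API_KEY=<your key>
    from langsmith import traceable

    @traceable  # logs a run (inputs, outputs, timing) to LangSmith
    def generate_answer(question: str) -> str:
        # ... call any LLM here; LangChain itself is not required ...
        return "stubbed answer to: " + question

    print(generate_answer("Why monitor LLMs?"))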
