Alex Strick van Linschoten

Trying to instrument an agentic app with Arize Phoenix and litellm

Alex Strick van Linschoten — Tue, 03 Jun 2025 22:00:00 GMT

It’s important to instrument your AI applications! I hope this can more or less be taken as given, just as you’d expect a non-AI-infused app to capture logs. When you’re evaluating your LLM-powered system, you need to capture the inputs and outputs both end to end, as the user experiences them, and at a finer granularity for all the internal workings.

My goal with this blog is to first demonstrate how Phoenix and litellm can work together, and then to make sure that we are able to group all spans together under a single trace.

I’ll write the blog as I work so at this point I’m not sure exactly how this will turn out.

Basic logging with litellm + phoenix

As a reminder, here’s how we make an LLM call with litellm:

import litellm

completion_response = litellm.completion(
    model="openrouter/google/gemma-3n-e4b-it:free",
    messages=[
        {
            "content": "What's the capital of China? Just give me the name.",
            "role": "user",
        }
    ],
)
print(completion_response.choices[0].message.content)
# prints 'Beijing'

The Phoenix docs explain how to set up basic logging for litellm:

- Install the following pip packages:
  - arize-phoenix-otel
  - openinference-instrumentation-litellm
  - (litellm, obviously)
- Set up the necessary environment variables (API key etc.) to ensure that traces get sent to the right account and endpoint.
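For the environment-variable step, here is a minimal sketch of a pre-flight check. The variable names are assumptions: `PHOENIX_API_KEY` and `PHOENIX_COLLECTOR_ENDPOINT` are what I understand Phoenix to read, and `OPENROUTER_API_KEY` is what litellm uses for OpenRouter models, so check the current docs for your versions. The helper `missing_env_vars` is hypothetical and not part of either library.

```python
import os

# Assumed variable names; verify against the Phoenix and litellm docs.
REQUIRED_ENV_VARS = (
    "PHOENIX_API_KEY",
    "PHOENIX_COLLECTOR_ENDPOINT",
    "OPENROUTER_API_KEY",
)


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demonstrate with a plain dict rather than the real environment.
example_env = {"PHOENIX_API_KEY": "dummy-key"}
print(missing_env_vars(example_env))
# ['PHOENIX_COLLECTOR_ENDPOINT', 'OPENROUTER_API_KEY']
```

Calling `missing_env_vars()` with no arguments checks the real process environment, which is a cheap way to fail early before any spans silently go nowhere.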

Let’s assume we’re using the hosted Phoenix Cloud version for now. Then we can rerun our example, with some slight tweaks:


import litellm
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
)

completion_response = litellm.completion(
    model="openrouter/google/gemma-3n-e4b-it:free",
    messages=[
        {
            "content": "What's the capital of China? Just give me the name.",
            "role": "user",
        }
    ],
)
print(completion_response.choices[0].message.content)

So we first register the Phoenix tracer, specify the project (already set up in Phoenix Cloud) and then run our litellm completion as before. In the terminal we see the following logs:

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: hinbox
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  ⚠️ WARNING: It is strongly advised to use a BatchSpanProcessor in production environments.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

Beijing

So immediately there are a lot of things to consider. It seems that we’ll want to use the BatchSpanProcessor that it suggests, and it also seems like I might not want to set this as the global tracer provider.

In Phoenix Cloud, I see this:

Basic Phoenix tracing interface

As you can see, we’ve captured the input and output messages for the completion, and it’s tracked the latency of the call (1.16s, which seems pretty slow actually!). There is also some sort of annotation interface, though I’ll maybe explore that down the line. I immediately notice that I’m missing things like the system attributes for where the call was made, as well as metadata like the temperature and other settings. I’d also like to see things like token counts (which you can get in Phoenix but they’re sort of buried) and the estimated cost of the call(s). We can see about adding some of that down the line.

`BatchSpanProcessor` for production usage

Let’s next move on to adding BatchSpanProcessor as the message suggested, which is as simple as adding batch=True to the tracer provider registration code. What this does is make sure that spans are processed in batches before they’re exported to Arize. This takes away some of the network costs that you incur when sending the spans one by one. I’ve also made sure to turn off the registration of this tracing provider as the global one:


import litellm
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
    set_global_tracer_provider=False,
    batch=True,
)

completion_response = litellm.completion(
    model="openrouter/google/gemma-3n-e4b-it:free",
    messages=[
        {
            "content": "What's the capital of China? Just give me the name.",
            "role": "user",
        }
    ],
)
print(completion_response.choices[0].message.content)

And I get this in the terminal:

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: hinbox
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.

Beijing

It’s actually somehow a bit annoying to still see a message about the fact that I’m using a default SpanProcessor. It’s unclear to me why I need to care that this is a default one. The message is taking up real estate in the logs and it seems important (otherwise why would they have included it?) but it’s also unclear to me what the alternative is and why I’d want to overwrite the default. I think for now I’ll leave it.
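To make the difference between the two processors a bit more concrete, here is a toy model in plain Python. It is not the OpenTelemetry implementation: `CountingExporter`, `ToySimpleProcessor` and `ToyBatchProcessor` are hypothetical stand-ins that only count how many export calls (think: network requests) each strategy makes for the same spans.

```python
class CountingExporter:
    """Stand-in for a network exporter: records each export call."""

    def __init__(self):
        self.calls = []

    def export(self, spans):
        self.calls.append(list(spans))


class ToySimpleProcessor:
    """Exports every span as soon as it ends (one request per span)."""

    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])

    def flush(self):
        pass  # nothing is ever buffered


class ToyBatchProcessor:
    """Buffers spans and exports them in groups of max_batch_size."""

    def __init__(self, exporter, max_batch_size=3):
        self.exporter = exporter
        self.max_batch_size = max_batch_size
        self.buffer = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []


def count_exports(processor_cls, span_names):
    """Feed the spans through a processor and count export calls."""
    exporter = CountingExporter()
    processor = processor_cls(exporter)
    for name in span_names:
        processor.on_end(name)
    processor.flush()  # a real batch processor also flushes on a timer and at shutdown
    return len(exporter.calls)


spans = [f"span-{i}" for i in range(7)]
print(count_exports(ToySimpleProcessor, spans))  # 7 export calls
print(count_exports(ToyBatchProcessor, spans))   # 3 export calls (3 + 3 + 1)
```

As for the log message itself: going only by its wording, the ‘default’ appears to be the processor that `register` creates for you, and `add_span_processor` is how you would swap in one you construct yourself.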

Using the litellm callbacks as an alternative

If we stray away from the officially supported way to handle tracing with Phoenix, there’s also the community-supported option built into litellm:

import litellm

litellm.callbacks = ["arize_phoenix"]

completion_response = litellm.completion(
    model="openrouter/google/gemma-3n-e4b-it:free",
    messages=[
        {
            "content": "What's the capital of China? Just give me the name.",
            "role": "user",
        }
    ],
    metadata={"PROJECT_NAME": "hinbox"},
)
print(completion_response.choices[0].message.content)

This achieves a similar result, though I was unable to get the trace to land anywhere other than the default project. Arize’s docs mention a PHOENIX_PROJECT_NAME environment variable but it seems this isn’t respected or used by the litellm implementation. Indeed when I look at the implementation, I don’t see this being used anywhere, so it seems that the community-driven implementation isn’t really the way forward.

I just wanted to mention it, however, since some of the ‘callback’ integrations for tracing in litellm are really nicely implemented (like the one for Langfuse, e.g.) so I wanted to try it out at least.

One trace, multiple spans

For anything beyond a simple LLM call, which means most real-world LLM applications, we’ll want to be capturing multiple spans as part of a single trace.

LLM Tracing Tools’ Naming Conventions (June 2025)

Side-note: I dug into how some of the major LLM tracing providers name their primitives. I was reassured that we seem to have coalesced around ‘trace -> span’ and that the OpenTelemetry way seems to have been adopted by most.

Tracing nomenclature (June 2025)

Grouping spans under a single trace

I updated the code such that we now have a function that makes two separate LLM calls. I’d want them to both be registered as spans under the same trace:

import litellm
from phoenix.otel import register

tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
    set_global_tracer_provider=False,
    batch=True,
)

def query_llm(prompt: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
    )
    return completion_response.choices[0].message.content

def my_llm_application():
    query1 = query_llm("What's the capital of China? Just give me the name.")
    query2 = query_llm("What's the capital of Japan? Just give me the name.")
    return (query1, query2)

if __name__ == "__main__":
    print(my_llm_application())

But these just get registered as two separate traces/calls. The key bit of the documentation is the ‘Using Phoenix Decorator’ section, it seems. If I add a decorator on top of my function and get the specific tracer, it seems I am able to start to group things together:

import litellm
from phoenix.otel import register

tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
    set_global_tracer_provider=False,
    batch=True,
)
tracer = tracer_provider.get_tracer(__name__)

def query_llm(prompt: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
    )
    return completion_response.choices[0].message.content

@tracer.llm
def my_llm_application():
    query1 = query_llm("What's the capital of China? Just give me the name.")
    query2 = query_llm("What's the capital of Japan? Just give me the name.")
    return (query1, query2)

if __name__ == "__main__":
    print(my_llm_application())

This works and I see this in the Phoenix Cloud dashboard:

Grouped spans under a single trace

See how it’s taken the function name as the name of the span. And it’s grouped those two LLM calls that happen within the function as we wanted. We can also update the decorator to denote different kinds of spans that we want to capture:

The kinds of spans you can choose from

I’m immediately a bit confused by the interface again, because when you click on the ‘Traces’ tab in Phoenix Cloud you actually still just see ‘spans’:

Spans in the Traces tab

In the documentation it isn’t clear to me how to create a trace that includes an llm span and an embedding span, for example. What’s even more frustrating is that the tracer decorator object doesn’t implement all the span types, just agent, chain and llm it seems. I tried something like this but it just ended up producing 3 separate traces in Phoenix Cloud.

I looked at the documentation for using base OTEL instead of the Phoenix decorators, but there was also nothing in there on how to denote the trace instead of just the span.

I was wondering if their ‘Sessions’ primitive was the way forward here, but they’re pretty clear in stating that a Session is a “sequence of traces”.

So I’m at a bit of a dead end with Phoenix for now. I might return to Braintrust or Langfuse since these seem to have better support for what I’m trying to do (i.e. group spans together underneath a trace). I’m really reluctant to try to instrument hinbox with Phoenix when I’m unable even to get this basic grouping working properly with some dummy code.

Update: solution from the Arize team

I posted this blog on the Arize slack and they got back to me with a solution:

import litellm
from phoenix.otel import register

tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
    set_global_tracer_provider=False,
)
tracer = tracer_provider.get_tracer(__name__)

@tracer.llm
def query_llm(prompt: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
    )
    return completion_response.choices[0].message.content

@tracer.agent
def query_agent(prompt: str):
    return "I am an agent."

@tracer.chain
def my_llm_application():
    query1 = query_llm("What's the capital of China? Just give me the name.")
    query2 = query_llm("What's the capital of Japan? Just give me the name.")
    agent1 = query_agent("Who are you?")
    return (query1, query2, agent1)

if __name__ == "__main__":
    print(my_llm_application())

And you can see how this looks in the Phoenix Cloud dashboard:

Grouped spans
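Why does decorating the outer function make the difference? Here is a toy sketch of OpenTelemetry-style context propagation as I understand it in general. It is not Phoenix’s implementation, and `toy_span` and `recorded_spans` are hypothetical names. Each decorated call looks up the currently active span: if there is one, the new span becomes its child and inherits its trace ID; if not, it starts a new trace.

```python
import contextvars
import functools
import itertools
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)
_span_ids = itertools.count(1)
recorded_spans = []  # finished spans, in the order they ended


def toy_span(kind):
    """Decorator factory: wrap a function in a span of the given kind."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "name": func.__name__,
                "kind": kind,
                "span_id": next(_span_ids),
                # A root span starts a new trace; children inherit the parent's.
                "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
                "parent_id": parent["span_id"] if parent else None,
            }
            token = _current_span.set(span)
            try:
                return func(*args, **kwargs)
            finally:
                _current_span.reset(token)
                recorded_spans.append(span)

        return wrapper

    return decorator


@toy_span("llm")
def query_llm(prompt):
    return "Beijing"  # stand-in for the real litellm call


@toy_span("chain")
def my_llm_application():
    return query_llm("capital of China?"), query_llm("capital of Japan?")


my_llm_application()
for span in recorded_spans:
    print(span["kind"], span["name"], "parent:", span["parent_id"])
# llm query_llm parent: 1
# llm query_llm parent: 1
# chain my_llm_application parent: None
```

In this toy model, calling `query_llm` twice without a decorated caller gives each call its own trace ID, which mirrors the ‘two separate traces’ behaviour from earlier.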

Judging from the Arize team’s code, it seems like the way the trace is constructed simply depends on how you assemble the hierarchy of spans. For instance, if I wanted the top-level entity for this ‘trace’ (i.e. a grouping of spans) to be an agent rather than a chain, then I could use this code:

import litellm
from phoenix.otel import register

tracer_provider = register(
    project_name="hinbox",  # Default is 'default'
    auto_instrument=True,  # Auto-instrument your app based on installed OI dependencies
    set_global_tracer_provider=False,
    # batch=True,
)
tracer = tracer_provider.get_tracer(__name__)

@tracer.llm
def query_llm(prompt: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
    )
    return completion_response.choices[0].message.content

@tracer.agent
def query_agent(prompt: str):
    return "I am an agent."

@tracer.tool(name="query_embedding", description="Query embedding")
def query_embedding(prompt: str):
    return [0.1, 0.2, 0.3]

@tracer.agent
def my_llm_application():
    query1 = query_llm("What's the capital of China? Just give me the name.")
    query2 = query_llm("What's the capital of Japan? Just give me the name.")
    agent1 = query_agent("Who are you?")
    embedding1 = query_embedding("What's the capital of China? Just give me the name.")
    return (query1, query2, agent1, embedding1)

if __name__ == "__main__":
    print(my_llm_application())

And now instead of this trace being of kind ‘chain’, it’s of kind ‘agent’, with some internal spans also being of kind ‘agent’. In a conversation in the Arize Slack I got the following clarification:

“Traces as the concept under “signals” is basically a unique identifier of spans (think “span” of time). See https://opentelemetry.io/docs/concepts/signals/traces/ In most cases if you filter spans by “roots” (e.g. spans that don’t have parents) and or look at the collective set of “traces” they will roughly look the same. Most of the time this is the view you want when looking at telemetry. Spans are too noisy to be looking at in isolation. While the two tabs feel largely overlapping, it’s a bit intentional as there’s actually no real object called a trace - it’s just a series of spans. You will see these abstractions in most observability platform.”

The line that:

“there’s actually no real object called a trace - it’s just a series of spans”

Was extremely clarifying, actually. It explains the fuzziness between the spans and traces tab in the Phoenix dashboard.
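A small sketch of what that means in practice, using hypothetical flat span records (the field names are illustrative, not Phoenix’s storage schema). A ‘trace’ is just whatever you get when you group spans by a shared ID, and the root spans are the ones without a parent:

```python
from collections import defaultdict

# Hypothetical flat span records, roughly what a backend might store.
spans = [
    {"span_id": "a1", "trace_id": "t1", "parent_id": None, "name": "my_llm_application"},
    {"span_id": "a2", "trace_id": "t1", "parent_id": "a1", "name": "query_llm"},
    {"span_id": "a3", "trace_id": "t1", "parent_id": "a1", "name": "query_llm"},
    {"span_id": "b1", "trace_id": "t2", "parent_id": None, "name": "query_agent"},
]


def group_into_traces(spans):
    """A 'trace' is just the spans that share a trace_id."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def root_spans(spans):
    """Root spans are the ones without a parent."""
    return [span for span in spans if span["parent_id"] is None]


traces = group_into_traces(spans)
print({trace_id: len(members) for trace_id, members in traces.items()})
# {'t1': 3, 't2': 1}
print([span["name"] for span in root_spans(spans)])
# ['my_llm_application', 'query_agent']
```

There is one root span per trace here, which is why a list of root spans and a list of traces end up looking almost identical.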

I also got some clarification around the missing @tracer.embedding and @tracer.reranker decorators:

“We emit spans for embedding text to vectors (like “adda”), guardrailing via thinks like guardrals or content moderation, and reranking things via things like cohere. However it’s sorta rare for people to manually write these. We will have decorators for them but right now they are typically emitted from autoinstrumentors like langgraph where there are common patterns for these things. We will have decorators for them very soon - but things like reranking are much more complex than things like tool calling so we are codifying these primitives now.”

So there you have it! Some clarity. I’ll have to play around to see whether I go with the Langfuse route or the Phoenix route and which feels most ergonomic in the hinbox codebase. Appreciate the quick feedback from the Phoenix team, though!

Testing out instrumenting LLM tracing for litellm with Braintrust and Langfuse

Alex Strick van Linschoten — Tue, 03 Jun 2025 22:00:00 GMT

I previously tried (and failed) to set up LLM tracing for hinbox using Arize Phoenix and litellm. Since this is sort of a priority for being able to follow along with the Hamel / Shreya evals course with my practical application, I’ll take another stab using a tool with which I’m familiar: Braintrust. Let’s start simple and then, if it works the way we want, we can set things up for hinbox as well.

Simple Braintrust tracing with litellm callbacks

Callbacks are listed in the litellm docs as the way to do tracing with Braintrust. So we can do something like this:

import litellm

litellm.callbacks = ["braintrust"]

completion_response = litellm.completion(
    model="openrouter/google/gemma-3n-e4b-it:free",
    messages=[
        {
            "content": "What's the capital of China? Just give me the name.",
            "role": "user",
        }
    ],
    metadata={
        # "project_id": "1235-a70e-4571-abcd-234235",
        "project_name": "hinbox",
    },
)
print(completion_response.choices[0].message.content)

You can pass in a project_id or a project_name and the traces will be routed there. Here’s what it looks like in the Braintrust dashboard:

Our first trace logged in Braintrust

Note how you can’t see which model was used for the LLM call, nor any cost estimates. The docs mention that you can pass metadata into Braintrust using the metadata property:

“braintrust_* - any metadata field starting with braintrust_ will be passed as metadata to the logging request” (link)
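Going by that sentence alone, a small helper could prefix arbitrary keys before they are handed to litellm as `metadata`. This is a sketch under that assumption: `with_braintrust_prefix` is a hypothetical name, and how (or whether) the fields then show up in the Braintrust dashboard is not something it verifies.

```python
def with_braintrust_prefix(extra, base=None):
    """Return litellm-style metadata with `braintrust_`-prefixed extra keys."""
    metadata = dict(base or {})
    for key, value in extra.items():
        # Leave keys alone if they already carry the prefix.
        prefixed = key if key.startswith("braintrust_") else f"braintrust_{key}"
        metadata[prefixed] = value
    return metadata


metadata = with_braintrust_prefix(
    {"model": "openrouter/google/gemma-3n-e4b-it:free", "temperature": 0.0},
    base={"project_name": "hinbox"},
)
print(sorted(metadata))
# ['braintrust_model', 'braintrust_temperature', 'project_name']
```

The resulting dict would be passed as `metadata=metadata` in the `litellm.completion` call shown earlier.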

This seems a bit rudimentary, however. If we take a look at the full tracing documentation on the Braintrust docs we can see that they seem to recommend wrapping the OpenAI client object instead:

import os

from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="hinbox")
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

# @traced automatically logs the input (args) and output (return value)
# of this function to a span. To ensure the span is named `answer_question`,
# you should name the function `answer_question`.
@traced
def answer_question(body: str) -> str:
    prompt = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": body},
    ]

    result = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=prompt,
    )
    return result.choices[0].message.content

def main():
    input_text = "What's the capital of China? Just give me the name."
    result = answer_question(input_text)
    print(result)

if __name__ == "__main__":
    main()

This does indeed label the span as answer_question but it doesn’t do much else. Even the model name isn’t captured here. Instrumenting a series of calls to handle ‘deeply nested code’ (as their docs put it) didn’t even log the things it was supposed to:


import os
import random

from braintrust import current_span, init_logger, start_span, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="hinbox")
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

@traced
def run_llm(input):
    model = "gpt-4o" if random.random() > 0.5 else "gpt-4o-mini"
    result = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": input}]
    )
    current_span().log(metadata={"randomModel": model})
    return result.choices[0].message.content

@traced
def some_logic(input):
    return run_llm("You are a magical wizard. Answer the following question: " + input)

def simple_handler(input_text: str):
    with start_span() as span:
        output = some_logic(input_text)
        span.log(input=input_text, output=output, metadata=dict(user_id="test_user"))
        print(output)

if __name__ == "__main__":
    question = "What's the capital of China? Just give me the name."
    simple_handler(question)

This is adapted from the example they pasted in their docs as their one isn’t even a functional code example on its own.

It seems increasingly clear that Braintrust isn’t going to be the right choice, at least as long as I want to keep using litellm. I know that Langfuse has a very nice integration with litellm, so I think I’ll pivot over to that now.

Basic tracing with Langfuse and `litellm`

Simple tracing is easy:

import litellm

litellm.callbacks = ["langfuse"]

def query_llm(prompt: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
    )
    return completion_response.choices[0].message.content

def my_llm_application():
    query1 = query_llm("What's the capital of China? Just give me the name.")
    query2 = query_llm("What's the capital of Japan? Just give me the name.")
    return (query1, query2)

print(my_llm_application())

We specify langfuse for the callback and each llm call is logged as a separate trace + span. Here you can see what this looks like in the dashboard:

Basic trace and span in Langfuse dashboard

The litellm docs include information on how to specify custom metadata and grouping instructions for Langfuse. Notably, we can specify (as of June 2025, at least!) things like a session_id, tags, a trace_name and/or trace_id as well as custom trace metadata and so on. So we can get most of what we want to specify in the following way:

import litellm

litellm.callbacks = ["langfuse"]

def query_llm(prompt: str, trace_id: str):
    completion_response = litellm.completion(
        model="openrouter/google/gemma-3n-e4b-it:free",
        messages=[
            {
                "content": prompt,
                "role": "user",
            }
        ],
        metadata={
            "trace_id": trace_id,
            "trace_name": "my_llm_application",
            "project": "hinbox",
        },
    )
    return completion_response.choices[0].message.content

def my_llm_application():
    query1 = query_llm(
        "What's the capital of China? Just give me the name.",
        "my_llm_application_run_789",
    )
    query2 = query_llm(
        "What's the capital of Japan? Just give me the name.",
        "my_llm_application_run_789",
    )
    return (query1, query2)

if __name__ == "__main__":
    print(my_llm_application())

This looks like this in the Langfuse dashboard:

Spans grouped into traces
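The hardcoded `my_llm_application_run_789` would lump every run of the script into one trace. A minimal sketch of generating a fresh ID per run instead: `new_trace_metadata` is a hypothetical helper, and the keys (`trace_id`, `trace_name`, `session_id`, `tags`) are the ones mentioned above from the litellm docs, so double-check them against the version you have installed.

```python
import uuid


def new_trace_metadata(trace_name, session_id=None, tags=None):
    """Build a metadata dict that can be reused for every call in one run."""
    metadata = {
        "trace_id": f"{trace_name}_{uuid.uuid4().hex}",
        "trace_name": trace_name,
    }
    if session_id is not None:
        metadata["session_id"] = session_id
    if tags:
        metadata["tags"] = list(tags)
    return metadata


run_metadata = new_trace_metadata("my_llm_application", tags=["hinbox"])
# Pass the same dict as `metadata=run_metadata` to each litellm.completion
# call in this run so that the calls share a trace_id.
print(run_metadata["trace_name"], run_metadata["tags"])
# my_llm_application ['hinbox']
```

Creating the dict once at the top of `my_llm_application` and threading it through `query_llm` keeps all the calls of one run together while separating different runs.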

This is honestly most of what I’m looking for in terms of my tracing. If I were to use a non-OpenRouter model, moreover, I’d also get full costs in the Langfuse dashboard, e.g.:

LLM costs in Langfuse dashboard

As such, I can monitor costs from within OpenRouter and have the option to keep track of costs in Langfuse by passing custom metadata should I wish.

I’ll make a separate blog where I actually go into how I set up + instrumented hinbox for this kind of tracing while continuing to use litellm.

Building hinbox: An agentic research tool for historical document analysis

Alex Strick van Linschoten — Thu, 29 May 2025 22:00:00 GMT

I’ve been working on a project called hinbox - a flexible entity extraction system designed to help historians and researchers build structured knowledge databases from collections of primary source documents. At its core, hinbox processes historical documents, academic papers, books and news articles to automatically extract and organize information about people, organizations, locations, and events.

The tool works by ingesting batches of documents and intelligently identifying entities across sources. What makes it interesting is the iterative improvement aspect: as you feed more documents into the system, entity profiles become richer and more comprehensive. When hinbox encounters a person or organization it’s seen before, it updates their profile with new information rather than creating duplicates. I’ve been testing it extensively with Guantánamo Bay media sources - a domain where I have deep expertise from my previous career as a historian - which allows me to rigorously evaluate the quality of its extractions.

The organisations view

Right now, hinbox isn’t ready for broader use. The prompt engineering needs significant refinement, and the entity merging logic requires more sophisticated iteration loops. But that’s actually the point - I’ve been participating in Hamel and Shreya’s AI evals course, and I wanted a concrete project where I could apply the systematic evaluation and improvement techniques we’re learning.

This project originally came together over a few intense days about two months ago, then sat dormant while work got busy. I’ve recently resurrected it specifically to serve as a practical laboratory for the evals course exercises. There’s something powerful about having a real application with measurable outputs where you can experiment with different approaches to prompt optimization, model selection, and systematic error analysis.

The broader vision is creating a tool that could genuinely help researchers working with large document collections - transforming the traditional manual process of reading, noting, and cross-referencing into something more systematic and scalable. But first, it needs to work reliably, which is where the evals work comes in.

Why Build This? Personal Research History Meets the Age of Agents

This project connects directly to something I’ve done before - but under very different circumstances. In the mid-2000s, I founded and ran a media monitoring startup in Afghanistan (RIP AfghanWire). We had a team of Afghan translators processing daily newspapers and news sources, translating everything into English. Then came my part: reading these translations and manually building what essentially became a structured knowledge database.

The process was methodical but exhausting. Each article mentioning a person required checking our existing profiles - did we know this individual? If not, I’d create a new entry and research their background. If yes, I’d update their existing profile with new information. Over time, we developed detailed profiles for hundreds of key figures in Afghan politics, civil society, and security. The more articles we processed, the richer and more interconnected our database became. We were building a living encyclopaedia of contemporary Afghanistan, one translated news story at a time.

The startup eventually ran out of funding, but the intellectual framework stuck with me. We’d created something genuinely valuable - contextual intelligence that helped outsiders understand the complex landscape of Afghan media and politics. The manual approach worked, but it was incredibly time-intensive and didn’t scale beyond what a small team could handle.

The Academic Reality Check

Since then, I’ve continued working as a researcher (I have a PhD in War Studies from King’s College London and have written several critically-acclaimed books credentials blah blah sorry). This experience has reinforced how common the core challenge actually is across academic and research contexts. Historical research often involves exactly this pattern: you have access to substantial primary source collections - maybe 20,000 newspaper issues covering a decade, or thousands of diplomatic cables, or extensive archival materials - but limited time and resources to systematically extract insights.

The traditional academic approach involves months of careful reading, taking notes in physical notebooks, slowly building up understanding through manual cross-referencing. It’s thorough but painfully slow. Most researchers don’t have the luxury of unlimited time to spend four hours daily reading through source materials, even though that’s often what the work requires.

Beyond Academic Applications

The potential applications extend well beyond historical research. Intelligence analysis, scientific literature review, market research, legal discovery - anywhere you need to build structured knowledge from unstructured document collections. There’s clearly demand for these capabilities, evidenced by the popularity of “second brain” concepts and personal knowledge management tools like Obsidian and Roam.

But most existing PKM tools require manual curation. They’re great for organising knowledge you’ve already processed, less effective for bootstrapping that initial extraction from raw sources. What interests me is the hybrid approach: automated extraction that creates draft profiles and connections, which humans can then review, edit, and approve. Not pure automation, but intelligent assistance that handles the tedious first pass.

The ‘Agentic’ Moment

We’re entering what feels like a genuinely different phase of AI capability - the emergence of reliable vertical agents that can handle specific, complex workflows end-to-end. hinbox represents my attempt to explore what this might look like in practice for research applications. Rather than building with heavy agentic frameworks (which I haven’t found necessary yet and which fall in and out of favour too often for my tastes), I’m focusing on the core extraction and synthesis challenge.

This feels like the right moment to experiment with these capabilities. The models are sophisticated enough to handle nuanced entity recognition and relationship mapping, but the tooling is still flexible enough that you can build custom solutions for specific domains. It’s an interesting testing ground for understanding both the current state of the art and the practical challenges of deploying AI in knowledge-intensive workflows.

The goal isn’t necessarily to “solve” automated research (though that would be nice), but to build something concrete where I can systematically evaluate different approaches to prompt engineering, model selection, and error correction. Sometimes the best way to understand emerging capabilities is to push them against real problems you actually care about solving.

What can `hinbox` do now?

The system centres around domain-specific configuration - you define the research area you’re interested in through a set of configuration files that specify your extraction targets and prompts. For my testing, I’ve been using Guantánamo Bay historical sources as the test domain since I can rigorously evaluate the quality of extractions in an area where I have deep expertise.

An example organisation profile

Setting up a new research domain is straightforward: the system generates template configuration files with placeholders for all the necessary prompts. You customise these prompts to focus on the entities most relevant to your research - perhaps emphasising military personnel and legal proceedings for Guantánamo sources, or traders and agricultural cooperatives for Palestinian food history research.

Once configured, hinbox processes your document collection article by article, extracting people, organisations, locations, and events according to your specifications. The interesting part is the intelligent merging: rather than creating duplicate entries, the system attempts to recognise when newly extracted entities match existing profiles and updates them accordingly. This iterative enrichment means profiles become more comprehensive as you process additional sources.

Data processing logs

The system supports both cloud-based models (Gemini Flash 2.x has been particularly effective) and local processing through Ollama - crucial for researchers working with sensitive historical materials that can’t be sent to external APIs. Local models like gemma3:27b have proven surprisingly capable for this kind of structured extraction work.

After processing, you get a web-based frontend for exploring the extracted knowledge base. Profiles include source attribution and version history, so you can track how understanding of particular entities evolved as new documents were processed. The entire output can be shared as a self-contained package - useful for collaborative research or creating supplementary materials for publications.

How I built it

This project became a practical testbed for several development tools I’d been wanting to explore seriously. Claude Code and Cursor proved invaluable for rapid iteration - the kind of back-and-forth refinement that complex NLP applications require would have taken significantly longer with traditional development approaches.

FastHTML deserves particular mention for the frontend work. Building research interfaces without wrestling with JavaScript complexity felt genuinely liberating. The ability to create dynamic, interactive visualisations using primarily Python aligns well with how most researchers already think about data manipulation and presentation.

The current data architecture uses Parquet files throughout - a choice that might raise eyebrows but serves the development phase well. Direct file inspection and manipulation proved more valuable than database abstraction during rapid prototyping. Eventually, I’ll likely add SQLite backend options, but the current approach prioritises iteration speed over architectural elegance.

The entity merging logic required the most sophistication. The system combines simple string matching with embedding-based similarity search, then uses an LLM as final arbiter when potential matches are identified. A candidate profile gets compared against existing entities first through name similarity, then through vector comparison of full profile text. If similarity exceeds certain thresholds, both profiles are sent to the model with instructions to determine whether they represent the same entity and how to merge them if so.

This multi-stage approach handles the nuanced judgment calls that pure algorithmic matching struggles with - distinguishing between John Smith the journalist and John Smith the military contractor, or recognising that “Captain Rodriguez” from one article is the same person as “Maria Rodriguez” from another. The complexity here suggests this merging pipeline will be a primary focus for systematic evaluation and improvement as the project matures.
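To make the staged matching described above concrete, here is a minimal sketch rather than the actual hinbox code. Everything in it is an assumption for illustration: the threshold values, the function names, the profile shape (a dict with `name` and `text`), and the word-count cosine that stands in for a real embedding model. The LLM arbiter is passed in as a plain callable so that the control flow can be read (and tested) without any API calls.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt
from typing import Callable, Optional

# Hypothetical thresholds; the real values are not given in this post.
NAME_THRESHOLD = 0.6
TEXT_THRESHOLD = 0.5


def name_similarity(a: str, b: str) -> float:
    """Stage 1: cheap string similarity on entity names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def text_similarity(a: str, b: str) -> float:
    """Stage 2 stand-in: cosine similarity over word counts.

    A real pipeline would compare embedding vectors of the profile text here.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def find_match(
    candidate: dict,
    existing: list,
    llm_says_same: Callable[[dict, dict], bool],
) -> Optional[dict]:
    """Return the first existing profile that passes all three stages, else None."""
    for profile in existing:
        if name_similarity(candidate["name"], profile["name"]) < NAME_THRESHOLD:
            continue  # names too different, skip the more expensive checks
        if text_similarity(candidate["text"], profile["text"]) < TEXT_THRESHOLD:
            continue  # same name, but the profiles describe different things
        if llm_says_same(candidate, profile):  # stage 3: the model is the final arbiter
            return profile
    return None
```

The ordering is the point of the design: the cheap checks filter out most pairs, so the (slow, costly) model call only happens for plausible matches. In this sketch two "John Smith" profiles with unrelated profile text never reach the arbiter at all.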

What’s up next?

This blog post represents the softest possible launch - really more of a “here’s what I’m working on” update than any kind of formal announcement. hinbox isn’t ready for broad adoption yet, though I’d certainly welcome contributions and feedback from anyone interested in the problem space.

The immediate technical improvements are fairly straightforward. Right now, everything runs synchronously - each article gets processed sequentially to avoid the complexity of concurrent profile updates. Adding parallel processing would require implementing proper queuing or database locking mechanisms. Similarly, moving from Parquet files to a SQLite backend would provide better data management and enable more sophisticated querying patterns. Both changes would improve performance but add architectural complexity I haven’t needed while focusing on core functionality.
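As a rough illustration of what the locking option could look like (this is a sketch of one possible shape, not something hinbox does today), the slow extraction step can run in parallel while profile updates are serialised behind a single lock. `ProfileStore`, `extract_entities` and the article dict shape are all hypothetical names invented for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProfileStore:
    """In-memory stand-in for the profile store, guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.profiles = {}  # entity name -> list of source article ids

    def update(self, entity, article_id):
        with self._lock:  # only one worker may modify profiles at a time
            self.profiles.setdefault(entity, []).append(article_id)


def extract_entities(article):
    """Placeholder for the slow LLM extraction step."""
    return article["entities"]


def process_articles(articles, store, workers=4):
    def handle(article):
        for entity in extract_entities(article):  # extraction runs in parallel
            store.update(entity, article["id"])  # updates are serialised by the lock

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(handle, articles))  # list() surfaces any worker exceptions
```

A single coarse lock is the simplest thing that could work; a real version would also need to decide what happens when the merge step itself involves a model call, which is exactly the complexity I have been avoiding so far.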

I’m also eager to expand beyond newspaper articles to different document types - academic papers, book chapters, research reports, archival materials. Each format likely requires prompt refinements and possibly different extraction strategies. If this is going to be genuinely useful across research domains, it needs to handle the full spectrum of source materials historians and researchers actually work with.

The Real Work: Systematic Evaluation and Improvement

But the most interesting next phase involves applying systematic evaluation techniques from the AI evals course I mentioned earlier. This is where the project becomes genuinely educational rather than just another NLP application. I’ll be implementing structured approaches to:

Error analysis: Understanding exactly where and why entity extraction fails
Prompt optimization: Systematic testing rather than intuitive iteration
Model comparison: Rigorous evaluation across different architectures and providers
Merging accuracy: Quantifying the quality of entity deduplication decisions

The goal is documenting this improvement process in detail through subsequent blog posts. Rather than abstract discussions of evaluation methodology, I want to show concrete examples of how these techniques apply to a real system with measurable outputs. What does systematic prompt engineering actually look like in practice? How do you design effective test suites for complex agentic pipelines? When do local models outperform cloud APIs for specific tasks?

Context for Future Technical Content

Honestly, the main reason for writing this overview wasn’t to launch anything - it was to establish context. I wanted a reference point for future technical posts that dive deep into evaluation methodology and iterative improvement without needing to repeatedly explain what hinbox is or why I’m working on it. The interesting content will be showing how systematic AI development practices apply to concrete research problems.

This feels like the right kind of project for exploring these questions: complex enough to surface real challenges, focused enough to enable rigorous evaluation, and personally meaningful enough to sustain the extended iteration cycles that proper system improvement requires. Plus, having worked extensively in the domain I’m testing makes it much easier to distinguish between genuine improvements and superficial metrics optimisation.

More technical deep-dives coming soon as the evals work progresses. The real learning happens in the systematic refinement process, not the initial build.

The hinbox repository is available on GitHub for anyone interested in following along or contributing. All feedback welcome as this evolves from prototype to something genuinely useful for research applications.

Error analysis to find failure modes

Alex Strick van Linschoten — Thu, 22 May 2025 22:00:00 GMT

I came across this quote in a happy coincidence after attending the second session of the evals course:

It’s obviously a bit abstract, but I thought it was a nice oblique reflection on the topic being discussed. Both the main session and the office hours were mostly focused on the first part of the analyse-measure-improve loop that was introduced earlier in the week.

Focus on the ‘analyse’ part of the LLM application improvement loop

It was a very practical session in which we even took time to do some live ‘coding’ (i.e. analysis + clustering) of real data. I’ll try to summarise the points I jotted down in my notebook and end with some reflection on how I will be applying this for an application I’ve been working on.

A quick reminder of the context: we have an application of some kind, and we want to improve it. LLMs have lots of quirks that make them hard to narrow down exactly how they’re failing, so we’re working through a process that allows you to do just that. This was framed as a five-step process by Hamel + Shreya:

The five parts of the ‘analyse’ loop

First up, we need to look at some data to better understand the failure modes that our application might suffer from. If this application’s been in production for a while, you might well just have production data. If not, we’ll want to create a synthetic(-ish) dataset that allows us to get over the cold-start hump.

1. Create your initial dataset

This process is fairly technical, but as it was introduced to us, the aim is to end up with 100 inputs that span the different dimensions of use that your application / system might be exposed to.

Why 100? No reason. As Hamel explained, it’s just a magic number to get you started. We’re encouraged not to get too focused on the details of the process but rather to trust that we would get to where we wanted if only we had a little faith.

The idea is that we pass these 100 datapoints into our LLM-driven system to see what we get out at the other end, and then analyse the results iteratively until we’re no longer learning anything new.

The process is something like the following:

you want to sample among dimensions or facets of the use that your application could expect to experience, so come up with at least three of these. As a rule of thumb, perhaps think through the lens of features people might use, persona, query complexities or scenarios. It will differ per application, most likely.
Then generate a number of combinations of these three dimensions. (So as an example: people who want to use a chatbot to buy a product, and these are all non-technical users who actually are non-native English speakers, and who don’t necessarily formulate their queries with full sentences because they’re being passed in by a voice transcription module). Generate 50 of these. (Then filter out the ones that don’t make sense.)
Then either hand write or use an LLM to help you generate the full 100 realistic queries that would come from any of the particular tuple-combos that we created earlier. (Again, filter out the ones that don’t make sense.)
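The first two steps above can be sketched in a few lines of Python. This is only an illustration: the dimension names and values are made up for a hypothetical shopping chatbot, and `sample_tuples` and `describe` are names I have invented here. The final step (turning each tuple into a realistic query, and filtering out the ones that don't make sense) is still done by hand or with an LLM.

```python
import itertools
import random

# Hypothetical dimensions for a shopping chatbot; replace with your own.
DIMENSIONS = {
    "feature": ["buy a product", "track an order", "request a refund", "compare items"],
    "persona": ["non-technical user", "power user", "non-native English speaker", "first-time visitor"],
    "query_style": ["full sentences", "terse keywords", "voice transcription", "long rambling message"],
}


def sample_tuples(dimensions, k, seed=0):
    """Return up to k distinct combinations, one value per dimension."""
    names = list(dimensions)
    combos = [dict(zip(names, values)) for values in itertools.product(*dimensions.values())]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(combos, min(k, len(combos)))


def describe(combo):
    """Render one tuple as a brief for hand-writing (or LLM-generating) a query."""
    return "; ".join(f"{name}: {value}" for name, value in combo.items())


tuples = sample_tuples(DIMENSIONS, k=50)
for combo in tuples[:3]:
    print(describe(combo))
```

With four values per dimension there are 64 possible combinations, so sampling 50 leaves some headroom for throwing away the ones that make no sense.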

2. Look at your data (‘open coding’)

At this point you’ll pass all these queries into your system and then you’ll have a pair of the initial query, together with the ‘full trace’ (which encompasses the final response along with all internal tool calls, retrieval and any other context or metadata).

Here you assemble your traces and you write notes on each one. Basically you are looking at each of the 100 items of data and making observations on what failure modes you observe in the data. In the lesson we did this live through the Braintrust interface, but it was emphasised that custom vibe-coded interfaces were also recommended, especially when you have a lot of metadata and tool calling that you might want to present in a certain way to foreground certain elements etc.

This is where you’ll spend 80% of your time and for 100 traces could take something on the order of an hour. Read each trace. Write some brief descriptive notes about the observed problems or actions where things are going wrong or are unexpected.

Importantly, you let the categories emerge from the data rather than coming in with pre-conceived ideas of what the categories already are.

For long traces, or ones with complex intermediary steps, focus on either the first upstream failure or the first independent failure that you come across. In the end, this process is an iterative one, so you’ll have a chance to repeat this a few times.

Note also that we don’t really care about the root cause analysis (i.e. ‘why’ things are happening). We’re doing error analysis so what we care about is just the behaviour and patterns that we observe.

3. Cluster your data (‘axial coding’)

At this point you have a dataset of inputs, outputs and your notes on these 100 items. Now you switch to a clustering effort where you structure the failure modes and merge them. You bring structure into your unstructured data by grouping similar failure modes into a sort of emergent failure taxonomy.

The process: you read the notes and then you cluster similar notes.

It’s possible to get some help from an LLM with this, for suggestions on how to group items, but there’s no way to automate yourself out of this process. You still need to make the final judgement and call, based on your understanding of the context of the application. “Always manually review, refine and define these failure modes yourself.”

One useful piece of guidance was to try to have failure modes that are binary (i.e. observably yes or no), since this will help later on in the process and it’s much easier to write clear definitions for yes and no. (The alternative, where you have grades from 1 to 5, for example, makes it too easy to be unclear.)
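One practical payoff of binary failure modes is that they are trivial to count. A minimal sketch, assuming you have recorded a yes/no per failure mode for each trace (the trace ids and failure mode names below are invented for illustration):

```python
from collections import Counter

# Hypothetical annotations: trace id -> {failure mode: observed?}
annotations = {
    "trace-001": {"ignored user constraint": True, "invented a fact": False},
    "trace-002": {"ignored user constraint": True, "invented a fact": True},
    "trace-003": {"ignored user constraint": False, "invented a fact": False},
    "trace-004": {"ignored user constraint": True, "invented a fact": False},
}


def failure_rates(annotations):
    """Share of traces in which each binary failure mode was observed, most common first."""
    counts = Counter()
    for labels in annotations.values():
        counts.update(mode for mode, observed in labels.items() if observed)
    total = len(annotations)
    return {mode: count / total for mode, count in counts.most_common()}


for mode, rate in failure_rates(annotations).items():
    print(f"{mode}: {rate:.0%}")
```

Sorting by frequency makes it easy to spot the one error that dominates the others, which is usually where you want to start.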

4. Label more traces & iterate

And then you’re repeating and iterating! During this process don’t be concerned that your failure mode naming or definitions might start to evolve. This is a known thing that happens when you annotate data, i.e. the criteria drift as you review new outputs, and it’s actually something you should welcome because it is a reflection of you better understanding your data.

You’ll want to keep looping between open coding + axial coding stages until you are ‘saturated’ in terms of what you’re learning about the failure modes. You’ll be refining the definitions, merging similar categories, splitting ones that are different.

Pitfalls to watch out for

We skipped over this section fairly quickly, but there are a bunch of ways in which you can short-change yourself in this process and that are worth being aware of:

you might have underspecified or been too narrow in how you defined the tuple-combos at the beginning. i.e. your data that you generated didn’t end up covering wide dimensions of usage patterns.
you might skimp on the work, either only coding a few examples, or half-passing the effort to actually think through what an example or trace really represents
you might try to automate things too early, delegating your (expert) judgement to a machine that can’t represent your interests, at least not at this stage
you might skip the iteration loop of going back to the open coding after doing some axial coding
for complex domains, you might skip including experts as part of this process of annotation

Office hours discussions

There were a few really interesting questions that were asked during the office hours.

One was about how to handle ‘complex’ pipelines (i.e. ones with many intermediary stages, possibly with lots of tool calling and iteration / reflection loops). Hamel suggested two ways of approaching this complexity:

building your own data viewer or annotator was one option since it allows you to customise exactly which bits of the complexity you’re exposed to. It’ll differ per application, but really you should focus on whatever is important to you based on the behaviour of the application, and an off-the-shelf tool — however good — can never be everything to everyone.
look at the final output instead of getting lost in all the intermediary details. You can see the errors in the output / final behaviour. Since this is an iterative process, if you observe errors in the output, that’s actually good enough. You don’t need to do a root cause analysis. Just code and cluster based on the failure modes you observe. You could also focus on the error type / pattern that seems most important or burning to you.

In general the emphasis was on finding ways to simplify things and not get lost in all the complexity of your system. This isn’t or won’t be the last time you see your system’s behaviour, so you don’t have to catch everything. Either picking the most glaring errors or sticking with upstream failures can be good ways of achieving this. “Find the one error that’s swamping out other errors.”

Another really interesting prompt from Hamel was to take on the mentality of a detective while working on this analysis stage. Think: “I’m going to find the failure modes”, and this mentality could carry you forward beyond all your doubts, hesitations or uncertainty about the process.

And in the end, as both Hamel and Shreya said, it might feel like taking a leap of faith to trust in the process, since it ultimately is quite an open-ended process. Sort of like the well-worn metaphor of driving at night through fog, where you can’t see more than ten metres in front of you, but still you are able to make forward progress.

There was also a question about how to generate synthetic inputs when the LLM-driven process to turn the inputs into outputs also involved some human intervention (perhaps human-in-the-loop responses etc). Two suggestions for this: possibly you could have a synthetic persona who could play the role that a human might have played in those cases, but alternatively you could just find five real humans and ask them to run through the scenarios or workflows a dozen times each in order to generate enough data to get past the cold-start problem.

Reflections & what I’ll be working on

I was struck during today’s session by how much overlap there is between this work of evaluation and the work of a professional historian. The things I did when I wrote books, or my PhD, or just research reports, are really similar to this process. It actually made me a bit sad that there aren’t more ways for people with a humanities background to be involved in the work of LLM application development. Not only are people with humanities backgrounds often trained to be good writers (important in the domain of prompting, as we learned on Tuesday), but they have spent their whole careers trying to find ways to get their heads around unwieldy unstructured data.

I have a project which is an agentic workflow / pipeline to ingest primary source or raw data from newspapers or books and iteratively improve and populate a sort of wikipedia based on what gets learned from each source. It’s a sister project to my source translation repo, tinbox (‘translator in a box’) and so this one’s called hinbox (i.e. ‘historian in a box’). I have a working prototype but it still needs a bit of work before I’m happy going into more detail about how it works. I’ll make the repo public soon I hope. Needless to say, I am using this course as a way of developing evals to improve it and iterate on its failure modes.

I might only get round to doing some deep practical work on that next week or the week after, but I’ll be sure to keep up the notes and reflections on the course sessions here as we go.

How to think about evals

Alex Strick van Linschoten — Mon, 19 May 2025 22:00:00 GMT

Today was the first session of Hamel + Shreya’s course, “AI Evals for Engineers and PMs”. The first session was all about mental models for thinking about the topic as a whole, mixed in with some teasers of practical examples and advice.

I’ll try to keep up with blogging about what I learn as we go. Most of the actual content will go up online at some point in the future, I’m assuming, so not much point writing up super detailed notes. (There is also a book coming, which I assume will be great, and about which you can learn more here.) So in general I’ll try to be doing the following as I blog along:

highlight things I found interesting or inspiring based on the formal ‘lectures’
anything that comes up while doing the practical ‘homework’ (there are some optional exercises assigned to ground everything)
contextualise or situate things that come up in my own experience having worked on a few LLM-driven projects

Today, fresh out of the first class, I wanted to write about the mental model of the ‘three gulfs’ that they propose, the improvement loop that they suggest is how to measurably improve your applications, and also prompting through the lens of evals. Finally I’ll round off with a bit about what I’ll be exploring this week.

The Three Gulfs: Specification, Generalization and Comprehension

So there’s this image that they shared in the book chapter preview discussion that came up again during the lesson today:

The three gulfs of LLM application development

(They’ve shared it already in the YouTube discussion + I see it on Twitter being shared so I think I’m not sharing something I ought not to!)

The course is very practically focused, especially so for application developers, so this diagram is in that context. The diagram offers up a way of thinking about LLM application development that pinpoints the places where you might do your work, and it’s also a way of thinking through things systematically, too.

I was especially interested in the differentiation between the gulf of specification and the gulf of generalisation, since these can often feel similar, but the way to get out of each is slightly different. I’ll go into a bit more detail below, but basically with the gulf of specification you might want to be working on your prompts + how specific you are, whereas with the generalisation gulf you might need things like splitting up your tasks or making sure your system is outputting things in a structured way, etc etc.

Note also that the world of tooling doesn’t help you in a specific or targeted way to focus on one aspect of this diagram. Too often the tools try to cover the whole picture and probably also muddy the water by eliding the differences between the different tasks and challenges of each island or the gulf in between. All this is pretty abstract, so let’s go through them one by one.

The Gulf of Comprehension in Practice

This was seen as sort of the starting point for thinking through LLM Application improvement. At this point your big problem is that you’re trying to understand the data that comes your way from your users. You’re trying to understand the inputs to your application (what your users are typing, assuming that text is the medium of communication / input) and you’re trying to understand what the application or LLM is outputting.

The challenge comes because you can’t read every single log or morsel of data. You have to filter things down somehow! If this were something more like traditional ML you’d have statistics to help boil down your data, but mostly we’re talking about unstructured text data so it’s much more unwieldy.

This challenge means that people often get stuck at this point. This is where POC applications live, breathe and eventually die. You have enough sense that things are ‘kinda’ working, but you don’t really know what the failure modes are, so you don’t know how to improve it. You’ve tried out one or two things in a halfway systematic way, but really you have no idea what’s working well and what’s not.

On Tools vs Process

Hamel made the good point that it’s probably not so useful to think too much about tools at this stage. Generally speaking what’s going on is most often actually a process problem and trying to go straight to ‘what tool do I need’ is probably avoiding the real issue.

The Gulf of Specification in Practice

This is the place where you are trying to translate intent into precise instructions that the LLM will follow. You’re trying to be explicit and specific in the hope that the LLM will do what you want it to do, and not do the things that you don’t want it to do.

The obvious manifestation of this is people writing bad prompts. It might seem that it’s also present when you try to have an LLM solve one problem when it’s either unsuited for that task or the task needs to be broken up and so on, but that’s the sister gulf of generalisation. Here, we’re focused on how to improve the specificity of your prompts.

When you split things up and highlight the fact that prompts are something that you’ll need to work on and to improve, it becomes clear that it’s something you wouldn’t want to outsource or to skimp time on. Really the prompt writing is the thing that you (at certain moments, and where it’s identified as the thing needing focus / improvement) want to be working on in partnership with domain experts.

For small applications, you might be the same person as the domain expert! For bigger projects, you might be working with domain experts. Just be aware that often the domain expert might not necessarily be detached enough to be able to figure out what needs the focus, or where the weaknesses of a prompt are. That’s what the iterative process / error analysis and everything else that’ll be taught in the course is for (see below and see future posts).

Another point Hamel made was about why prompts are actually so important: “you have to express your project taste somewhere”. Given that your application might be fully / mostly driven by LLMs, the prompt is actually a really crucial place to express this taste and as such might be thought of as your ‘moat’.

I know just from having experienced a variety of LLM-driven applications, it’s quite easy to tell the ones where the product team gave their prompts and their specification some real love. It’s the difference between POC junk that will die a slow and lonely death and something that delights and solves real user problems.

Gulf of Generalisation

Shreya didn’t really get into the details around the generalisation gulf in practical terms in this lesson, but I think this one can be a sort of place of comfort for the technically-minded to take refuge in. It’s one where there’s a ton of tools and technologies and techniques to play with, and vendors also live in this space and try to claim that their particular product or special sauce is the thing to help you and so on.

The Improvement Loop for LLM Applications

We also got a high-level overview of the loop that allows you to iteratively improve an LLM application:

The analyse, measure and improve loop; adapted from an image used in the course

There’s a lot to unpack in all these different stages, and we didn’t really get into the details in the session today but you can see how this offers a really powerful way of thinking through what it means to iteratively improve an LLM application.

Learning how to implement this in a practical way will be the main thing I want to get good at by the end of this course. The process is made up of a bunch of techniques, but in my experience companies or use cases that struggle with improving what they built also lack the scaffold of this loop to orient themselves.

Prompting through the lens of evals

As we explored above, prompting is sort of the table stakes of improving your LLM application. In order to get good at prompting, it can help to appreciate what LLMs are good at and what they struggle with. So, as Shreya put it, “leverage their strengths and anticipate their weaknesses” (when prompting).

At this point Shreya got into some points around what kinds of things went into a good prompt but I think I’ll write a separate blog on that and I don’t want to just regurgitate what we listened to. Today was more of a high-level introduction, and in any case it was much more about the outer-loop process instead of the inner loop (where tooling + specific techniques play more of a role.)

A slide from a talk I gave about the inner loop vs the outer loop of GenAI development

So it’s great that the course gets into the weeds (esp in the course materials, which include the draft of the book Hamel & Shreya are writing) but I think the really useful thing they’re doing is situating the tactical improvements and techniques within the strategic patterns and workflows that teams and individuals should be doing to work on these LLM applications.

At a high level, what are we talking about:

how to tease out failure scenarios for these applications and their behaviours
conversely, how to understand exactly which domains it does well for

Things I want to think about more

There was a ton of really rich discussion around prompting in the Discord. I’m interested in exploring more:

cross-provider prompting decisions (i.e. how prompting an OpenAI model differs from what you do with a Llama model or whatever)
prompts that work with reasoning models vs non-reasoning models
the tradeoffs of whether you put your instructions in system prompts vs user instructions

In general there’s been a bunch of noise recently about so-called ‘leaked’ system prompts from a bunch of LLM API providers and I’ve mainly been struck by just how detailed they are. I consider myself pretty good at improving and iterating on prompts, but I’ll admit I’m not writing these multi-thousand word tomes. I’d like to explore which scenarios it makes sense to do so, and how to calculate at what point it makes sense from a cost or latency perspective to do so.

As I’m sure you can detect, I’m really enthusiastic about the lesson to come and will work in the meanwhile on some of the readings that have been set as well as the homework task of writing a system prompt for an LLM-powered recipe recommendation application!

First impressions of the new Gemini Deep Research (with 2.5 Pro)

Alex Strick van Linschoten — Tue, 08 Apr 2025 22:00:00 GMT

Google released an updated iteration of their Deep Research tool that uses the new 2.5 Pro model. This was taken from a post originally made on Twitter, so please excuse the terseness.

First impressions:

a bit too eager to jump into a deep research task even when I just ask a clarifying question
quite verbose, just like the OpenAI version. Not sure why both play this up a lot. It looks impressive but in practice I think we need more entry points into this. The ‘Executive Summary’ and other concluding headers are nice touches but I feel maybe there should be some more adherence to user requests for short reports. (I get that as UI it’s maybe weird to think for 10 mins and then spit out a very concise version, but it might actually be more useful.)
I continue to be annoyed about how these Gemini DR reports handle footnotes (i.e. as endnotes whacked on at the end of the report). Almost a deal-breaker IMO.
It’s almost like GDR tries to show how scholarly and serious it is by giving you these walls of prose (vs OpenAI DR which throws in a lot more bullet points). Not sure one is better than the other but would appreciate a bit more flexibility!
The portability of these reports has always been not great. Yes you can export them to Google Docs but markdown (+ other options) would have been much better. In practice, this means that whenever I use GDR the report stays stuck there and I’m far less likely to share it with anyone, whereas the OpenAI DR reports I drop parts/all into a GitHub Gist etc.
These reports have been getting better and better, all things considered. I’ve been following along and using GDR from the early days (even pre-OpenAI DR) and this latest version is the best version of it so far (as you’d hope!)
(It’s also a little bit annoying that GDR has removed any way to use the older versions of GDR with Gemini Pro 2.0 and 1.5 etc. Makes it harder to actually compare these things.)
Please let’s get an iPad version of the Gemini app soon, too? Feels a bit regressive to have to use GDR on the web interface always.
For serious research (as opposed to simply generating a nice report on some area you don’t know much about already), all these tools remain hamstrung by the quality of the sources. In areas where I am (or very recently used to be) a leading scholar / researcher, the gap between the sources I’d expect (in terms of taste / discernment for picking them out) and what these tools actually select is especially egregious. Make the models better, yes, but have better filters + retrieval.

So yeah, these tools are getting good! Kudos to the teams who are implementing this stuff. Hard to make it perform reproducibly well on so many open-ended uses. But more work to be done!

IMO the really great implementations of this ‘deep research’ pattern will all be in-house where you can have control over:

source selection (i.e. high-quality inputs only, not just some random things on the internet)
how long it spends thinking about a particular area / loop of the research (or decides to backtrack and dig deeper etc)
output types / templates / length
different modalities of Q&A (sometimes you want reports, other times you want a quick question answered, other times you want visual guides etc etc.)
different models for different kinds of tasks
possibly you have little sub-research agents / processes which will go off and work on some hypothesis, possibly involving actual datasets / analysis of tabular data etc, something clearly missing from the current versions we have

A few other things:

GDR’s ‘clarification step’ (which I’ve heard them discuss on podcasts etc) is not as good or useful as the OpenAI DR clarification questions. In practice, because it’s buried under a concealment button that you have to click etc, and where the entire UI seems to be screaming at you to ‘Start Research’, you basically never update or amend the research plan. And when you do, it’s really not clear what’s changed because you don’t get some feedback or diff that your comments were understood; you just get an entire new research plan (again buried under the concealment button)
Going forward we’re probably going to want / need ways of navigating the layers to this research. A global overview report will have subsections that (should you wish) can be expanded into their own more detailed or granular reports. This is how research works, after all. Not just endless new reports all trailed one after another pointed in the same direction.

The other thing that I think we’re really going to need to work on is research taste. Like the LLMs that power them, GDR and OpenAI DR offer a level of research taste developed to the mean. (I know people are thinking about this since it came up on Dwarkesh’s podcast with the AI 2027 guys, but they were focused on scientific research.)

I think there’s not a single answer for this which is, again, why I see the end result as people bringing these things in-house where they get to develop and refine what makes their particular flavour of research unique. (In the human-generated research world this is very much the case, where certain institutions (or even particular authors) are known for how deep they go, or what kinds of sources they prefer, or how they choose to feature or highlight the primary sources they access, and so on.) There are many possible variations of how this manifests, and I hope that we’re headed into a world where all the AI ‘deep researchers’ will be unique and quirky in all the best senses of that word.

Learnings from a week of building with local LLMs

Alex Strick van Linschoten — Sat, 15 Mar 2025 23:00:00 GMT

I took the past week off to work on a little side project. More on that at some point, but at its heart it’s an extension of what I worked on with my translation package tinbox. (The new project uses translated sources to bootstrap a knowledge database.) Building in an environment which has less pressure / deadlines gives you space to experiment, so I both tried out a bunch of new tools and also experimented with different ways of using my tried-and-tested development tools/processes.

Along the way, there were a bunch of small insights which occurred to me so I thought I’d write them down. As usual with this blog, I’m mainly writing for my future self but I think there might be parts that are useful for others! Apologies for the somewhat rushed nature of these observations; better I get the blog finished and published than not at all!

🤖 Local Models

During this project, I experimented with several local models, which continue to impress me with their evolving capabilities. The recent launch of gemma3 was particularly timely - I found myself regularly using the 27B version, which performed admirably across various tasks.

There are three or four models I keep returning to. mistral-small stands out as an exceptional model that’s been relatively recently updated and seems a bit underrated / underappreciated. The original mistral model continues to hold up remarkably well, particularly for structured extraction tasks and general writing needs like summarization.

One important realization when working with real-world use cases: benchmarks can be deceptive. While helpful as general indicators, each model has its own strengths and quirks. Many newer models are heavily optimized for structured data extraction, but their performance ultimately depends on whether their training documents align with your specific use case. It’s crucial to test models against your actual requirements rather than relying solely on published benchmarks.

For robust results with local models, I’ve found that implementing a “reflection, iterate and improve” pattern significantly enhances performance. When you need a model to summarize or analyze content in a particular format, having a secondary model (or even the same model!) review the output against the original prompt requirements is incredibly valuable. This reviewer model can suggest improvements to better fulfill the original request. Running this loop for 2-5 iterations (depending on complexity) can yield results approaching those of proprietary models like Claude or GPT-4, which might achieve similar quality in a single pass. For local deployments, this iterative improvement pattern is essentially non-negotiable.
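To make the loop concrete, here is a minimal sketch of the pattern. It is not the code from my project: the function name, the `APPROVED` marker and the prompt wording are all assumptions for illustration, and a "model" is just any function that maps a prompt string to a response string (in practice that would wrap a call to a local model).

```python
from typing import Callable

# A "model" here is any function from prompt text to response text.
# This is an illustrative stand-in for a real local model call.
Model = Callable[[str], str]


def reflect_and_improve(
    task_prompt: str,
    generator: Model,
    reviewer: Model,
    max_iterations: int = 3,
    approval_marker: str = "APPROVED",  # hypothetical convention, not a standard
) -> str:
    """Draft a response, then repeatedly critique and revise it."""
    draft = generator(task_prompt)
    for _ in range(max_iterations):
        critique = reviewer(
            "Review the draft against the original request.\n"
            f"Reply with {approval_marker} if it fully meets the request, "
            "otherwise list concrete improvements.\n\n"
            f"Request:\n{task_prompt}\n\nDraft:\n{draft}"
        )
        if approval_marker in critique:
            break  # the reviewer is satisfied, so stop early
        # Feed the critique back to the generator and ask for a revision.
        draft = generator(
            f"{task_prompt}\n\nPrevious draft:\n{draft}\n\n"
            f"Reviewer feedback:\n{critique}\n\n"
            "Rewrite the draft so that it addresses the feedback."
        )
    return draft
```

The generator and reviewer can be the same model. The substring check for approval is deliberately naive and would need something sturdier (structured output, for example) in real use.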

I also explored vision models, particularly llava and llama-3.2-vision. These were my primary tools for extracting context from images, generating captions, and analyzing visual content. Their effectiveness varies based on content type and language, but they represent impressive capabilities that can run entirely on local systems.

A significant portion of my work involved non-English languages, including some relatively rare ones. This is another area where benchmark claims about supporting “hundreds of languages” often don’t align with real-world performance. Models might list impressive language coverage in their specifications, but actual proficiency varies dramatically. It reinforces my earlier point - always verify benchmark claims against your specific use case before committing to a particular model.

💬 Prompting & Instruction Following

Working extensively with various models during this project reinforced some fundamental insights about prompting that might seem basic, but prove critical in practical applications. These observations are particularly relevant when working with local models, though they apply to cloud-based systems as well.

Context matters significantly more than we might assume. While we’ve grown accustomed to proprietary models like Claude or GPT-4o performing admirably with minimal guidance, local models require more deliberate direction. The more relevant context you can provide (within reasonable token limits), the better your results will be. If you would naturally provide certain background information to a human performing the task, make sure to include it in your prompt to the model as well.

Another key insight: every model has its unique characteristics. Techniques that work brilliantly with one model might fall flat with another, especially in the local model ecosystem. They each require slightly different prompting approaches, specific phrasing patterns, and tailored guidance. This necessitates running small experiments to understand how different models respond to various prompting styles. It’s still more art than science, but this experimentation phase is crucial when implementing local models effectively.

Perhaps the most valuable lesson I rediscovered is that breaking complex tasks into smaller components yields superior results compared to using a single comprehensive prompt. This is particularly true with local models. When performing extensive data extraction or when dealing with structured data where the extraction targets differ significantly from each other, don’t expect the model to handle everything in one pass – even a human might struggle with such an approach.

Instead, break down the task into logical components, create targeted mini-prompts for each aspect, and then recombine the results once all the separate LLM calls are completed. Yes, this approach adds processing time and complexity, but the quality improvement is well worth the trade-off. When accuracy matters more than speed, this decomposition strategy consistently delivers better outcomes.
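As a rough sketch of that decomposition (the field names and prompt wording below are made up for illustration, and the model is again just a prompt-to-text function), each extraction target gets its own focused call and the results are merged at the end:

```python
from typing import Callable

Model = Callable[[str], str]

# Hypothetical extraction targets; each one gets its own focused mini-prompt.
MINI_PROMPTS = {
    "people": "List every person named in the text, one per line.",
    "places": "List every place named in the text, one per line.",
    "dates": "List every date mentioned in the text, one per line.",
}


def extract_in_parts(
    text: str,
    model: Model,
    prompts: dict[str, str] = MINI_PROMPTS,
) -> dict[str, list[str]]:
    """Run one small LLM call per field, then recombine into a single result."""
    results: dict[str, list[str]] = {}
    for field, instruction in prompts.items():
        response = model(f"{instruction}\n\nText:\n{text}")
        # Keep non-empty lines only, so blank output yields an empty list.
        results[field] = [line.strip() for line in response.splitlines() if line.strip()]
    return results
```

The calls are independent of each other, so they could also be run concurrently to claw back some of the extra processing time.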

🧰 Process & Tools

My development environment during this project provided plenty of opportunities to evaluate various tools and workflows. As context, I primarily work on a Mac while maintaining access to a separate (local) machine with GPU capabilities for more intensive tasks. This setup allows me to flexibly experiment with both local and cloud-based models.

For managing local models, Ollama continues to be my go-to solution for downloading, running, and interfacing with these models. A recent discovery that significantly improved my workflow is Bolt AI, an excellent Mac interface that provides seamless switching between local Ollama models and cloud-based alternatives. If you’re working in a hybrid model environment, Bolt AI is definitely worth exploring.

I’ve also recently integrated OpenRouter into my toolkit, which solves the problem of managing countless API keys across different inference providers. OpenRouter not only offers native connections to many cloud providers but also allows you to incorporate your own API keys, streamlining access to a diverse model ecosystem through a unified interface. It also helps with setting spend limits on various models or projects.

In terms of development insights, I was impressed by how rapidly front-end development can progress with the assistance of models like Claude 3.7 and OpenAI’s O1-Pro. These models perform exceptionally well when supplemented with documentation (such as an llms.txt file) alongside your prompts. While I can’t speak to their effectiveness with extremely complex applications or massive frontend codebases, they demonstrate remarkable proficiency with small to medium-sized projects.

A significant portion of my experimentation involved RepoPrompt, a tool that recently transitioned from free beta to a paid license model. RepoPrompt addresses the challenge of getting your codebase into an LLM-friendly format. Unlike standard CLI tools that simply export code to clipboard or text files, RepoPrompt generates a structured XML representation that, when modified by an LLM and pasted back, creates a reviewable diff of the proposed changes. At least, that’s one of the things it allows you to do! It’s actually a bit more powerful / flexible than that and here’s a video so you can see it in action:

RepoPrompt Demo Video

While tools like Cursor and Windsurf offer similar functionality, they tend to become less reliable as project complexity increases. RepoPrompt shines when paired with an OpenAI Pro subscription, enabling effective integration of models like O1 Pro and o3-mini-high into your development lifecycle. In my testing, the RepoPrompt + O1 Pro/O3 Mini High combination consistently delivered superior results compared to using Cursor with Claude 3.7 (even with ‘Thinking Mode’ enabled). Despite the occasional pauses while these models process complex problems, the quality improvement justifies the wait.

Additionally, I continued working with Claude Code and CodeBuff, both CLI-driven tools focused on code improvement. Of the two, CodeBuff has become my preferred option. Both tools require careful supervision—I typically keep Cursor open to monitor changes in real-time, occasionally needing to revert modifications or redirect the approach. These tools excel when you clearly articulate your objectives and maintain oversight of the implementation process. CodeBuff particularly impresses with larger codebases and demonstrates superior stability overall.

An interesting pattern emerged during development: whenever files approached 800-900 lines, it signaled the need to refactor into smaller submodules to maintain LLM comprehension, especially when using agent mode in Cursor. The modular approach significantly improved model performance.
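A check like this is easy to automate. The following is a small sketch (the function name and the default threshold of 800 lines are my own choices here, taken from the range above) that walks a project and lists the files that have grown past the point where a refactor is probably due:

```python
from pathlib import Path


def find_long_files(
    root: str, max_lines: int = 800, pattern: str = "*.py"
) -> list[tuple[str, int]]:
    """Return (path, line_count) for files at or above max_lines, longest first."""
    long_files = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count >= max_lines:
            long_files.append((str(path), line_count))
    return sorted(long_files, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for file_path, count in find_long_files("."):
        print(f"{count:>6}  {file_path}")
```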

I was genuinely surprised by the effectiveness of the RepoPrompt and O1 Pro combination. For smaller, targeted modifications, CodeBuff continues to demonstrate remarkable capability. While I didn’t evaluate these tools in conjunction with local models, I suspect such combinations would require more iterative refinement to achieve comparable results.

🧑‍🔬 Software Engineering Patterns

Throughout this experimental project, several software engineering principles proved particularly valuable when working with LLM-assisted development. These patterns aren’t revolutionary, but their importance amplifies in the context of AI-augmented workflows.

The principle of simplicity served as a cornerstone approach. Breaking development into the smallest logical next task repeatedly demonstrated its value, especially during the exploratory phases when project architecture was still taking shape. While some engineers might possess the cognitive bandwidth to fully conceptualize complex systems with perfect abstractions from the outset, I’ve found incremental development leads to more robust outcomes. This approach aligns naturally with how most developers actually think through problems and provides clear checkpoints for evaluating progress.

Data visibility emerged as another critical factor. When leveraging LLM-assisted coding, comprehensive logging becomes even more essential than in traditional development. Strategically placed log outputs create a diagnostic trail that proves invaluable when troubleshooting unexpected behaviors. This practice creates a feedback loop that strengthens both your understanding of the system and the LLM’s ability to assist effectively.

A particularly underappreciated practice I haven’t seen widely discussed is dead code detection. When working with LLM-assisted development, code cruft tends to accumulate more rapidly than in conventional programming. Tools like deadcode and vulture provide static analysis of Python projects to identify unused functions and variables. Running these tools periodically helps maintain codebase clarity by flagging remnants that might otherwise cause confusion during review. I’m not certain whether newer tools like ruff from Astral include this functionality (particularly for function calls), but the capability is invaluable for maintaining a clean, navigable codebase.

Taking time to think offline—away from the keyboard—often yields surprising clarity. This deliberate pause creates space to articulate precisely what you need for the next development increment. When you can express your requirements with precision, the LLM’s output improves proportionally. Ambiguous instructions inevitably produce suboptimal results, whereas clarity fosters efficiency.

A final observation worth emphasizing: having experience as an engineer in the pre-LLM era remains tremendously advantageous. When confronting complex workflows involving chained LLM calls with interdependencies and reflection patterns, traditional debugging skills become indispensable. Knowing when to step away from AI assistance and dive into manual debugging with tools like pdb, stepping through code execution and inspecting variables directly, represents a crucial judgment call.

LLMs and coding agents often demonstrate a bias toward generating new code rather than methodically analyzing existing problems. Recognizing the moment when direct human intervention becomes more efficient than continually prompting an AI is a skill that comes with experience. Once you’ve manually identified the underlying issue, you can return to the LLM with precisely targeted prompts that yield superior results.

🌐 Appendix 1: FastHTML

As a practical addition to my experimentation, I implemented FastHTML for the first time to build a frontend for my knowledge base extraction assistant. The experience was remarkably frictionless, particularly when leveraging their llms.txt file—a markdown-formatted documentation set that integrates seamlessly with your frontend codebase when provided alongside prompts.

This approach works exceptionally well with models like O1 Pro or O3 Mini High, creating a development workflow that feels intuitive and responsive. Despite having substantial JavaScript experience from previous roles, I found FastHTML significantly more manageable than complex JavaScript frameworks that dominate the ecosystem today.

The reduced cognitive overhead and natural integration with Python-based workflows make FastHTML a compelling choice for ML practitioners who prefer to minimize context-switching between languages and paradigms. The framework strikes an excellent balance between capability and simplicity that aligns perfectly with rapid prototyping and iterative development cycles common in ML projects. For those building interfaces to ML systems, it’s definitely worth considering as your frontend solution.

📃 Appendix 2: OCR + Translation

Another interesting challenge I tackled involved OCR and translation of handwritten documents in non-English languages—a task that proved impossible to accomplish in a single pass with local models, particularly for less common languages.

The solution emerged through methodical problem decomposition:

Breaking down PDFs into individual page images
Segmenting each page into overlapping image chunks (critical for handwriting where text may slant across traditional line boundaries)
Applying OCR to extract text in the original source language from each image segment
Using translation models to convert the extracted text to English

This multi-stage pipeline allowed me to overcome the limitations of local models when confronted with the combined complexity of handwriting recognition and translation. Both gemma3 and llama-3.3 performed admirably within this decomposed workflow, demonstrating that even resource-constrained local deployments can achieve impressive results when problems are thoughtfully restructured.
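The segmentation step is the least obvious part of that pipeline, so here is a minimal sketch of it. The function name and the choice of full-width horizontal strips are assumptions for illustration; the boxes use the `(left, upper, right, lower)` convention, so they could be handed to something like Pillow's `Image.crop`.

```python
def overlapping_strips(
    page_width: int, page_height: int, strip_height: int, overlap: int
) -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes for overlapping horizontal strips."""
    if not 0 <= overlap < strip_height:
        raise ValueError("overlap must be non-negative and smaller than strip_height")
    step = strip_height - overlap  # how far each new strip moves down the page
    boxes = []
    top = 0
    while True:
        bottom = min(top + strip_height, page_height)
        boxes.append((0, top, page_width, bottom))
        if bottom >= page_height:
            break  # the last strip reached the bottom of the page
        top += step
    return boxes


if __name__ == "__main__":
    for box in overlapping_strips(1000, 1000, strip_height=400, overlap=100):
        print(box)
```

Because neighbouring strips share `overlap` pixels, a slanted line of handwriting that is cut off at the bottom of one strip should appear whole in the next one. The cost is that the same text can be recognised twice, so the OCR output needs some de-duplication afterwards.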

This case exemplifies a core principle of effective ML implementation: when dealing with complex, multi-faceted challenges, breaking them into targeted sub-problems often yields better outcomes than attempting end-to-end solutions—especially when working with constrained computational resources. While this approach may increase processing time, the quality improvement justifies the trade-off for many practical applications.

Building an MCP Server for Beeminder: Connecting AI Assistants to Personal Data

Alex Strick van Linschoten — Thu, 20 Feb 2025 23:00:00 GMT

I spent the morning building an MCP server for Beeminder, bridging the gap between AI assistants and my personal goal tracking data. This project emerged from a practical need — ok, desire :) — to interact more effectively with my Beeminder data through AI interfaces like Claude Desktop and Cursor.

The MCP-Beeminder mashup in action!

Understanding Beeminder

For those unfamiliar with Beeminder, it’s a tool that combines self-tracking with commitment devices to help users achieve their goals. The platform draws what they call a “Bright Red Line” – a visual commitment path that shows exactly where you need to be to stay on track. What makes Beeminder unique is its approach to accountability: users pledge real money to stay on their path, and there’s a seven-day “akrasia horizon” that prevents immediate goal changes, helping to overcome moments of impulsivity.

I’ve written a lot about Beeminder over on my personal blog in the past so do go check that out if you’re interested to learn more about how I use it. I can attest that if it clicks with you, you’ll find it incredibly valuable. I have used it in the past to write books, learn languages, finish my PhD and many many other things.

The Role of MCP

The Model Context Protocol (MCP) serves as a standardised way for AI assistants to interact with various data sources and tools. Think of it as a universal adapter that allows AI systems to directly access and manipulate data in your applications. Instead of copying and pasting information between your AI assistant and Beeminder, MCP creates a secure, direct connection.

This standardisation is particularly valuable because it means you can build one interface that works across multiple AI platforms. Whether you’re using Claude Desktop, Cursor, or other MCP-compatible tools, the same server provides consistent access to your Beeminder data.

Building the Server

The development process was surprisingly straightforward, largely due to two factors: the well-documented MCP specification from Anthropic and an existing Python client for Beeminder’s API by @ianm118. Most of the implementation work involved mapping Beeminder’s API endpoints to MCP’s expected interfaces and ensuring proper error handling.

And obviously, much of the code was actually written by Claude itself. After providing the initial structure, writing a couple of tools the way I wanted them and providing documentation, I found that Claude could generate the remainder of the code, requiring only minor adjustments and debugging from me.

Using the Beeminder MCP Server

Having an MCP server for Beeminder opens up several practical possibilities. You can have natural conversations with AI assistants about your goals, analyse patterns in your data, and even update your tracking information – all while the AI has direct access to your actual Beeminder account. This direct connection means the AI can provide more contextual and accurate assistance, whether you’re adjusting goal parameters or analysing your progress trends.

I’ve found that sometimes Claude needs a bit of coaxing to display the information it’s getting back from the Beeminder API in appropriate formats, which is to say, in table format. I will probably update my Claude settings so that it knows it should use tables (either Markdown or React components) to display Beeminder results that would benefit from such a presentation.

Looking Forward

Now that I have my Beeminder MCP server, I also want one for OmniFocus, my task management app of choice. That’ll probably have to wait since it doesn’t appear that they offer a REST API, but it’ll be great when I can mash up the results of those two tool queries as that’s what I currently do manually as part of my process.

The ease of building this MCP server suggests an interesting future where more of our tools and services become directly accessible to AI assistants. The real value isn’t in any single connection, but in the potential for creating a network of interconnected tools that AI can help us manage more effectively.

If you’re interested in trying this out yourself, you can find the code and setup instructions in the GitHub repository. While this implementation focuses on Beeminder, the same principles could be applied to create MCP servers for other services and tools.

Tinbox: an LLM-based document translation tool

Alex Strick van Linschoten — Sat, 15 Feb 2025 23:00:00 GMT

Large Language Models have transformed how we interact with text, offering capabilities that seemed like science fiction just a few years ago. They can write poetry, generate code, and engage in sophisticated reasoning. Yet surprisingly, one seemingly straightforward task – document translation – remains a significant challenge. This is a challenge I understand intimately, both as a developer and as a historian who has spent years working with multilingual primary sources.

Before the era of LLMs, I spent years conducting historical research in Afghanistan, working extensively with documents in Dari, Pashto, and Arabic. This wasn’t just casual reading – it was deep archival work that resulted in publications like “Poetry of the Taliban” and “The Taliban Reader”, projects that required painstaking translation work with teams of skilled translators. The process was time-consuming and resource-intensive, but it was the only way to make these primary sources accessible to a broader audience.

As someone who has dedicated significant time to making historical sources more accessible, I’ve watched the rise of LLMs with great interest. These models promise to democratise access to multilingual content, potentially transforming how historians and researchers work with primary sources. However, the reality has proven more complex. Current models, while powerful, often struggle with or outright refuse to translate certain content. This is particularly problematic when working with historical documents about Afghanistan – for instance, a 1984 document discussing the Soviet-Afghan conflict might be flagged or refused translation simply because it contains the word “jihad”, even in a purely historical context. The models’ aggressive content filtering, while well-intentioned, can make them unreliable for serious academic work.

After repeatedly bumping into these limitations in my own work, I built tinbox (shortened from ‘translation in a box’), a tool that approaches document translation through a different lens. What if we had a tool that could handle these sensitive historical texts without balking at their content? What if researchers could quickly get working translations of primary sources, even if they’re not perfect, to accelerate their research process? As a historian, having access to even rough translations of primary source materials would have dramatically accelerated my research process. As a developer, I knew we could build something better than the current solutions.

The name “tinbox” is a nod to the simple yet effective nature of the tool – it’s about taking the powerful capabilities of LLMs and packaging them in a way that actually works for real-world document translation needs. Whether you’re a researcher working with historical documents, an academic handling multilingual sources, or anyone needing to translate documents at scale, tinbox aims to provide a more reliable and practical solution.

The Hidden Complexity of Document Translation

The problem of document translation sits at an interesting intersection of challenges. On the surface, it might seem straightforward – after all, if an LLM can engage in complex dialogue, surely it can translate a document? It can, but there are some edge cases and limitations.

When working with real-world documents, particularly PDFs, we encounter a cascade of complications. First, there’s the issue of model refusal. LLMs frequently decline to translate documents, citing copyright concerns or content sensitivity. This isn’t just an occasional hiccup – it’s a systematic limitation, occurring regularly enough to make these models unreliable for production use out of the box.

Then there’s the scale problem. Most documents aren’t just a few paragraphs; they’re often dozens or hundreds of pages long. This runs headlong into the context window limitations of current models. Breaking documents into smaller chunks might seem like an obvious solution, but this introduces its own set of challenges. How do you maintain coherence across chunks? What happens when a sentence spans two pages? How do you handle formatting and structure?

The PDF format adds another layer of complexity. Most existing tools rely on Optical Character Recognition (OCR), which introduces its own set of problems. OCR can mangle formatting, struggle with complex layouts, and introduce errors that propagate through to the translation. Even when OCR works perfectly, you’re still left with the challenge of maintaining the document’s original structure and presentation.

A Word About Translations, Fidelity and Accuracy

Having worked professionally as a translator and as an editor for teams of translators, I’m acutely aware of the challenges and limitations of LLM-provided translations. While these models have made remarkable strides, they face several significant hurdles that are worth examining in detail.

One of the most prominent issues is consistency. LLMs often struggle to maintain consistent terminology across multiple API calls, which becomes particularly evident in longer documents. Technical terms, product names, and industry-specific jargon might be translated differently each time they appear, creating confusion and reducing the professional quality of the output. This problem extends beyond mere terminology – the writing style and tone can drift significantly between chunks of text, especially when using the chunking approach necessary for longer documents. You might find yourself with a document that switches unexpectedly between formal and informal registers, or that handles technical depth inconsistently across sections.

Even formatting poses challenges. The way LLMs handle structural elements like bullet points, numbered lists, or text emphasis can vary dramatically across sections. What starts as a consistently formatted document can end up with a patchwork of different styling approaches, requiring additional cleanup work.

Perhaps more fundamentally, LLMs struggle to find the right balance between literal and fluent translation. Sometimes they produce awkwardly literal translations that technically convey the meaning but lose the natural flow of the target language. Other times, they swing too far in the opposite direction, producing fluid but unfaithful translations that lose important nuances from the source text. This challenge becomes particularly acute when dealing with idioms and cultural references, where literal translation would be meaningless but too free a translation risks losing the author’s intent.

Cultural nuances present another significant challenge. LLMs often miss or mishandle culture-specific references, humour, and wordplay. They struggle with regional variations in language and historical context, potentially stripping away layers of meaning that a human translator would carefully preserve. This limitation becomes even more apparent in specialised fields – medical texts, legal documents, technical manuals, and academic writing all require domain expertise that LLMs don’t consistently demonstrate.

The technical limitations of these models add another layer of complexity. The necessity of breaking longer texts into chunks means that broader document context can be lost, making it difficult to maintain coherence across section boundaries. While tools like tinbox attempt to address this through seam repair and sliding window approaches, it remains a significant challenge. Cross-references between different parts of the document might be missed, and maintaining a consistent voice across a long text can prove difficult.

Format-specific problems abound as well. Tables and figures might be misinterpreted, special characters can be mangled, and the connections between footnotes or endnotes and their references might be lost. Page layout elements can be corrupted in the translation process, requiring additional post-processing work.

Reliability and trust present another set of concerns. LLMs are prone to hallucination, sometimes adding content that wasn’t present in the original text or filling in perceived gaps with invented information. They might create plausible but incorrect translations or embellish technical details. Moreover, they provide no indication of their confidence in different parts of the translation, no flags for potentially problematic passages, and no highlighting of ambiguous terms or phrases that might benefit from human review.

When it comes to handling source texts, LLMs show particular weakness with poor quality inputs. They struggle with grammatically incorrect text, informal or colloquial language, and dialectal variations. Their handling of abbreviations and acronyms can be inconsistent, potentially introducing errors into technical or specialised documents.

The ethical and professional implications of these limitations are significant. There’s often a lack of transparency about the translation process, no clear audit trail for translation decisions, and limited ability to explain why particular choices were made. This raises concerns about professional displacement – not just in terms of jobs, but in terms of the valuable human judgment that professional translators bring to sensitive translations, the opportunity for cultural consultation, and the role of specialist translators in maintaining high standards in their fields.

These various limitations underscore an important point: while LLMs are powerful tools for translation, they should be seen as aids to human translators rather than replacements, especially in contexts requiring high accuracy, cultural sensitivity, technical precision, legal compliance, or creative fidelity. The future of translation likely lies in finding ways to combine the efficiency and broad capabilities of LLMs with the nuanced understanding and expertise of human translators.

So why build a tool like this given all these problems? I think there’s still a use for something like this in fields where there are few translators and a huge backlog of materials where there’s a benefit to reading them in your own mother tongue, even in a ‘bad’ translation. (That said, having done a decent amount of comparison of outputs for languages like Arabic, Dari and Pashto, I actually don’t find the translations to be terrible, especially for domains like the news or political commentary.) For myself, I am working on a separate tool or system which takes in primary sources and incrementally populates a knowledge database. Having ways to ingest materials written in foreign languages is incredibly important for this, and having a way to do it that doesn’t break the bank (i.e. by using local models) is similarly important.

Engineering a Solution

tinbox takes a simple approach to solving these issues through two core algorithmic features. The first is what I call “page-by-page with seam repair.” Instead of treating a document as one continuous piece of text, we acknowledge its natural segmentation into pages. Each page is translated independently, but – and this is crucial – we then apply a repair process to the seams between pages.

This seam repair is where things get interesting. When a sentence spans a page boundary, we identify the overlap and re-translate that specific section with full context from both pages. This ensures that the translation flows naturally, even across page boundaries. It’s a bit like being a careful tailor, making sure the stitches between pieces of fabric are invisible in the final garment.

For continuous text documents (read: a .txt file containing multiple tens of thousands of words), we take a different approach using a sliding window algorithm. Think of it like moving a magnifying glass across the text, where the edges of the glass overlap with the previous and next positions. This overlap is crucial – it provides the context necessary for coherent translation across chunk boundaries.

The implementation details matter here. We need to carefully manage memory, handle errors gracefully, and provide progress tracking for long-running translations. The codebase is structured around clear separation of concerns, making it easy to add support for new document types or translation models.

Moreover, we need to ensure that in the case of failure we’re able to resume without wasting what we spent translating earlier parts of the document.

The Engineering Details

The architecture reflects these needs. At its core, tinbox uses a modular design that separates document processing from translation logic. This allows us to handle different document types (PDFs, Word documents, plain text) with specialised processors while maintaining a consistent interface for translation.

Error handling is particularly crucial. Translation is inherently error-prone, and when you’re dealing with large documents, you need robust recovery mechanisms. We implement comprehensive retry logic with exponential backoff, ensuring that temporary failures (like rate limits) don’t derail entire translation jobs.
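
To make the retry idea concrete, here is a minimal sketch of exponential backoff. It is not the tinbox implementation: `TransientError`, `retry_with_backoff` and all of the parameters are names I made up for illustration, and the injectable `sleep` argument is only there so the behaviour can be tested without waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a rate limit."""


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on TransientError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay (capped), plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter is a design choice rather than a requirement: it spreads out retries so that several workers hitting the same rate limit don't all come back at the same instant.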

For large documents, we provide checkpointing and progress tracking. This means you can resume interrupted translations and get detailed insights into the translation process. The progress tracking isn’t just about displaying a percentage – it provides granular information about token usage, costs, and potential issues.
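
A rough sketch of what checkpointing can look like, assuming the document has already been split into chunks and that progress is stored as a JSON list on disk. The function name, the file format and the `translate` callable are all assumptions for the sake of the example, not the actual tinbox code.

```python
import json
from pathlib import Path


def translate_with_checkpoint(chunks, translate, checkpoint_path):
    """Translate chunks in order, saving progress after each one."""
    path = Path(checkpoint_path)
    # Load whatever was finished in a previous (possibly interrupted) run
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

    # Only translate the chunks that are not in the checkpoint yet
    for chunk in chunks[len(done):]:
        done.append(translate(chunk))
        path.write_text(json.dumps(done, ensure_ascii=False), encoding="utf-8")

    return done
```

If a run dies halfway through, calling the function again with the same checkpoint path picks up at the first untranslated chunk, so nothing already paid for gets translated twice.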

Page-by-Page with Seam Repair

The page-by-page algorithm handles PDFs by treating each page as a separate unit while ensuring smooth transitions between pages. Pseudocode that can help you understand how this works goes something like this:

def translate_with_seam_repair(document, overlap_size=200):
    translated_pages = []
    
    for page_num, page in enumerate(document.pages):
        # Translate current page
        current_translation = translate_page(page)
        
        if page_num > 0:
            # Extract and repair the seam between pages
            previous_end = translated_pages[-1][-overlap_size:]
            current_start = current_translation[:overlap_size]
            
            # Re-translate the overlapping section with full context
            repaired_seam = translate_with_context(
                text=current_start,
                previous_context=previous_end
            )
            
            # Replace only the start of the current page with the repaired seam;
            # the previous page is left intact so nothing is lost or duplicated
            current_translation = repaired_seam + current_translation[overlap_size:]
        
        translated_pages.append(current_translation)
    
    return "\n\n".join(translated_pages)

Sliding Window for Text Documents

For continuous text documents, we use a sliding window approach. Again, pseudocode to help understand how this works goes something like this, though the actual implementation is different:

def translate_with_sliding_window(text, window_size=2000, overlap=200):
    chunks = []
    position = 0
    
    while position < len(text):
        # Create window with overlap
        end = min(len(text), position + window_size)
        window = text[position:end]
        
        # Translate window
        translation = translate_window(window)
        chunks.append(translation)
        
        # Stop once the window has reached the end of the text
        if end == len(text):
            break

        # Slide window forward, accounting for overlap
        position = end - overlap
    
    return merge_chunks(chunks, overlap)
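
To see where the windows actually fall, here is a small runnable helper that only computes the character offsets and does no translation at all. `window_spans` is a name I've introduced for this illustration. Note the explicit stop once a window reaches the end of the text: without it, stepping back by the overlap would keep producing the same final window.

```python
def window_spans(text_length, window_size=2000, overlap=200):
    """Return (start, end) character offsets for each overlapping window."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be smaller than window_size")
    spans = []
    position = 0
    while position < text_length:
        end = min(text_length, position + window_size)
        spans.append((position, end))
        if end == text_length:
            break  # the last window reached the end of the text
        position = end - overlap  # step back so neighbouring windows share context
    return spans


print(window_spans(5000))
# [(0, 2000), (1800, 3800), (3600, 5000)]
```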

CLI Usage Examples

The tool provides a simple command-line interface:

# Basic translation of a PDF to Spanish
tinbox --to es document.pdf

# Specify source language and model
tinbox --from zh --to en --model anthropic:claude-3-5-sonnet-latest chinese_doc.pdf

# Use local model via Ollama for sensitive content
tinbox --model ollama:mistral-small --to en sensitive_doc.pdf

# Advanced options for large documents
tinbox --to fr --algorithm sliding-window \
       --window-size 3000 --overlap 300 \
       large_document.txt

Other notable features

The tinbox CLI is currently built on top of litellm, so it technically supports most models you might want to use with it, though I’ve only enabled OpenAI, Anthropic, Google/Gemini and Ollama as base providers for now.

The Ollama support was one I was keen to offer since translation is such a token-heavy task. I also really worry about the level of sensitivity / monitoring on the cloud APIs and have run into that in the past (particularly with regard to my previous work as a historian working on issues relating to Afghanistan). Ollama-provided local models should solve that issue, perhaps at the expense of access to the very latest and greatest models.

Things still to be done

There are lots of improvements still to be made. I’m particularly interested in exploring semantic section detection, which could make the chunking process more intelligent. There’s also work to be done on preserving more complex document formatting and supporting additional output formats.

Currently the tool is driven by whatever you tell it to do. Most decisions are in your hands. You have to choose the model to use for translation, notably. I am most interested in using this tool for some other side-projects and for low-resource languages so one of the important things I’ll be doing is to pick sensible defaults depending on the language and input document type you choose.

For example, some vision language models like GPT-4o are able to handle translating directly from an image in Urdu to English, but the open-source versions (like llama3.2-vision) struggle much more with these kinds of tasks, so it’s possible I might even need to insert an intermediary step: transcribe first, then translate the transcribed text into English. In fact, for the highest-fidelity translation I’ll almost certainly want to enable that option.

The code is available at GitHub, and I welcome contributions and feedback.

Starting the Hugging Face Agents course

Alex Strick van Linschoten — Mon, 10 Feb 2025 23:00:00 GMT

I finished the first unit of the Hugging Face Agents course, at least the reading part. I still want to play around with the code a bit more, since I imagine we’ll be doing that more going forward. In the meantime I wanted to write up some reflections on the course materials from unit one, in no particular order…

Code agents’ prominence

The course materials, and smolagents in general, place special emphasis on code agents, citing multiple research papers. They seem to make some solid arguments for it, but it also seems pretty risky at the same time. Having code agents instead of pre-defined tool use is good because:

Composability: could you nest JSON actions within each other, or define a set of JSON actions to re-use later, the same way you could just define a python function?

Object management: how do you store the output of an action like generate_image in JSON?

Generality: code is built to express simply anything you can have a computer do.

Representation in LLM training data: plenty of quality code actions are already included in LLMs’ training data which means they’re already trained for this!

The thing that gives me pause is that it seems like we moved through the spectrum from highly structured and known workflows (a chain, perhaps, or even something like a DAG) to tool use in a loop (which had some arbitrary or dynamic parts but ultimately was at least a little defined), and all the way out then to code agents where basically anything is possible.

If I think about this as an engineer tasked with building a robust, dependable and reliable system, then the last thing I think I want to add into the system is an agent that can basically do anything under the sun (i.e. code agents). Perhaps I’m misrepresenting the position here of code agents, so I’m looking forward to reading the papers cited above as well as understanding it more from the course authors’ perspective.

Evals & testing

Following on from my confusion around code agents, I’m very curious how the course will recommend one tests and evaluates these arbitrary code agents. Things I could imagine:

testing out the specific scenarios that your application or use case requires (i.e. end to end)
testing out each component of the system, insofar as you can break it down into smaller sub-components
including things like linting / unit tests maybe once code is generated by the agent (?) i.e. real-time evaluation of the robustness of the system?
probably LLM as a judge somewhere in the mix, though that opens up its own can of worms…

I do hope they talk about that in the later units of the course.

General patterns

The core loop that came up in unit 1 was:

plan -> act -> feedback/reflection

And all of that gets packaged up in a loop and repeated in various forms depending on exactly how you’re using it. This pattern is related to the ReAct loop which lots of people cite, though ReAct seems to be a specific version of the general idea mentioned above.
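
As a way of pinning the loop down, here is a toy version of it. This is a sketch under heavy assumptions, not how smolagents implements things: `run_agent`, the step dictionary format and the `final_answer` convention are invented for the example, and a hard-coded planner stands in for the LLM so that the code actually runs.

```python
def run_agent(task, plan, tools, max_steps=5):
    """Repeat plan -> act -> feedback until the planner says it is done."""
    history = []  # (tool, argument, observation) tuples fed back to the planner
    for _ in range(max_steps):
        step = plan(task, history)  # plan: decide the next action
        if step["tool"] == "final_answer":
            return step["arg"]
        observation = tools[step["tool"]](step["arg"])  # act: run the chosen tool
        history.append((step["tool"], step["arg"], observation))  # feedback
    raise RuntimeError("agent did not finish within max_steps")


def toy_plan(task, history):
    """Hard-coded planner standing in for the LLM."""
    if not history:
        return {"tool": "add", "arg": (2, 3)}
    return {"tool": "final_answer", "arg": history[-1][2]}


print(run_agent("What is 2 + 3?", toy_plan, {"add": lambda pair: pair[0] + pair[1]}))
# 5
```

The `max_steps` cap is the one piece I'd want in any real version too: it's the simplest guard against a loop that never decides it's finished.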

And the fact that all of this works is somehow all powered by the very useful enablement of tool use, which is itself powered by the fact that the model providers finetuned this ability into the models. Crazy, brittle, impressive and many other words for the fact that this ‘hack’ has such power.

Chat templates

I liked how the unit really impresses on you the impact and importance of chat templates as the real way that LLMs are implemented. You may pass in your requests through a handy Python SDK, passing your tools as a list of function definitions, but in the end this is all being parsed down and out into very precise syntax with many tokens not intended for human consumption.
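
To illustrate what that flattening looks like, here is a hand-rolled renderer for the ChatML-style layout that some models use. It's a simplification on my part: real chat templates differ from model to model and ship with the tokenizer, so treat `render_chatml` as an illustrative helper rather than what any particular SDK does.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten role/content messages into a single ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to answer next
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

The tidy list of dictionaries is for our benefit; what the model sees is the single string with its special tokens.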

Points of leverage

At the end of the unit, I was thinking about all the places where an engineer has leverage over agents. What I could initially think of was:

the variety and usefulness of tools that you provide to your agent (or perhaps the extent to which you allow your code agent to ‘write’ things out into the world)
how discriminating you are about the number and combination of tools or APIs you expose
how you chain everything together
(how robustly you handle failure)

Beyond that there are quite a few things that are somewhat out of your hands unless you decide to custom finetune your own models for a specific use case.

Overall it was a good start to the course: it made me think and also got my hands dirty working on a very simple agent with tools using smolagents and a Gradio demo app in the Hugging Face Hub. I’ll write more after unit two next week.

AI Engineering Architecture and User Feedback

Alex Strick van Linschoten — Sat, 08 Feb 2025 23:00:00 GMT

Chapter 10 of Chip Huyen’s “AI Engineering” focuses on two fundamental aspects: architectural patterns in AI engineering and methods for gathering and using user feedback. The chapter presents a progressive architectural framework that evolves from simple API calls to complex agent-based systems, while also diving deep into the crucial aspect of user feedback collection and analysis.

1. Progressive Architecture Patterns

The evolution of AI engineering architecture typically follows a pattern of increasing complexity and capability. Each stage builds upon the previous one, adding new functionality while managing increased complexity.

Base Layer: Direct Model Integration

The simplest architectural pattern begins with direct queries to model APIs. While straightforward, this approach lacks the sophistication needed for most production applications.

Enhancement Layer: Context Augmentation

The first major enhancement comes through Retrieval-Augmented Generation (RAG). This layer enriches model responses by incorporating custom data and sources into LLM queries, significantly improving response quality and relevance.

Protection Layer: Guardrails Implementation

Guardrails: Protective mechanisms that filter both inputs and outputs to ensure system safety and reliability.

The protection layer implements two types of guardrails:

Input Guardrails: Filter sensitive information before it reaches the LLM, such as:
- Personal customer information
- API keys
- Other confidential data
Output Guardrails: Monitor and manage model outputs for:
- Format compliance (e.g., valid JSON)
- Factual consistency
- Hallucination detection
- Toxic content filtering
- Privacy protection
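
A minimal sketch of one guardrail of each kind, assuming regex-based redaction on the way in and a simple JSON validity check on the way out. The patterns are deliberately naive and the `sk-` key format is just an example I picked; a production system would need far more careful detection.

```python
import json
import re

# Illustrative patterns only; real PII and secret detection is much harder
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def redact(text):
    """Input guardrail: mask sensitive spans before the text reaches the LLM."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def is_valid_json(output):
    """Output guardrail: check that the model's reply parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


print(redact("Mail jane@example.com, key sk-abcdefghijklmnop1234"))
# Mail [EMAIL], key [API_KEY]
```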

Routing Layer: Gateway and Model Selection

This layer introduces two key components:

AI Gateway: A centralized access point for LLM interactions that manages costs, usage tracking, and API key abstraction.

Model Router: An intent classifier that directs queries to appropriate models based on complexity and requirements.

The routing layer enables cost optimization by directing simpler queries (like FAQ responses) to less expensive models while routing complex tasks to more sophisticated systems.
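
Here is a toy router to show the shape of the idea. The keyword match is a stand-in for a real intent classifier, and the model names are hypothetical placeholders rather than anything from the book.

```python
# Hypothetical model names; swap in whatever your gateway exposes
ROUTES = {"faq": "small-cheap-model", "complex": "large-capable-model"}
FAQ_KEYWORDS = ("opening hours", "reset my password", "refund policy")


def classify_intent(query):
    """Stand-in intent classifier: a keyword match instead of a trained model."""
    lowered = query.lower()
    return "faq" if any(keyword in lowered for keyword in FAQ_KEYWORDS) else "complex"


def route(query):
    """Pick the model that should handle this query."""
    return ROUTES[classify_intent(query)]


print(route("How do I reset my password?"))
# small-cheap-model
```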

Performance Layer: Caching Strategies

The architecture implements two distinct caching approaches:

Exact Caching:
- Stores identical queries and their responses
- Particularly valuable for multi-step operations
- Requires careful consideration of cache eviction policies:
  - Least Recently Used (LRU)
  - Least Frequently Used (LFU)
  - First In, First Out (FIFO)
Semantic Caching:
- Uses embedding-based search to identify similar queries
- Depends on high-quality embeddings and reliable similarity metrics
- More prone to failure due to component complexity

Security Note: Cache implementations must carefully consider potential data leaks between users accessing similar queries.
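
A small sketch of an exact cache with LRU eviction. Keying entries on the user as well as the query is one way (my assumption, not necessarily the book's recommendation) to avoid serving one user's cached response to another; `ExactCache` and its methods are illustrative names.

```python
from collections import OrderedDict


class ExactCache:
    """Exact-match response cache with least-recently-used eviction."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, user_id, query):
        key = (user_id, query)  # scoping by user avoids cross-user leaks
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, user_id, query, response):
        key = (user_id, query)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
```

Per-user keys cost you hit rate on genuinely shareable answers (like FAQs), so in practice you'd probably decide per query type which scope is safe.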

Agent Layer: Advanced Functionality

The final architectural layer introduces agent patterns, enabling:

Retry loops for reliability
Tool usage capabilities
Action execution (email sending, file operations)
Complex workflow orchestration

Monitoring and Observability

The complete architecture requires robust monitoring systems tracking key metrics:

Mean Time to Detection (MTTD): Time to identify issues
Mean Time to Response (MTTR): Time to resolve detected issues
Change Failure Rate (CFR): Percentage of deployments requiring fixes

The monitoring system should track:

Factual consistency
Generation relevancy
Safety metrics (toxicity, PII detection)
Model quality through conversational signals
Component-specific metrics (RAG, generation, vector database performance)

AI Pipeline Orchestration

The chapter closes with a discussion of AI pipeline orchestration, addressing the trade-offs between using existing frameworks (LangChain, Haystack, LlamaIndex) versus custom implementations. This decision should be based on specific project requirements, team expertise, and maintenance considerations.

2. User Feedback Systems

The second major focus of the chapter explores comprehensive user feedback collection and utilization strategies.

Feedback Collection Methods

Direct Feedback:
- Explicit mechanisms (thumbs up/down)
- Rating systems
- Free-form comments
Implicit Feedback:
- Early termination patterns
- Error corrections
- Sentiment analysis
- Response regeneration requests
- Dialogue diversity metrics

Feedback Collection Timing

Feedback can be gathered at various stages:

Initial user preference specification
During negative experiences
When model confidence is low
Through comparative choice interfaces (e.g., ChatGPT’s response preference selection)

Feedback Limitations

Feedback Bias: User feedback systems inherently contain various biases that must be considered when making system improvements.

Key limitations include:

Negative experience bias (users more likely to report negative experiences)
Self-selection bias in respondent demographics
Preference and position biases
Potential feedback loops affecting system evolution

Implementation Considerations

The implementation of feedback systems requires careful attention to:

UI/UX design for feedback collection
Balance between different user needs
Monitoring feedback impact on system performance
Regular inspection of production data
Detection of system drift (prompts, user behavior, model changes)

Notes on ‘AI Engineering’ chapter 9: Inference Optimisation

Alex Strick van Linschoten — Thu, 06 Feb 2025 23:00:00 GMT

What follows are my notes on chapter 9 of Chip Huyen’s ‘AI Engineering’ book. This chapter was on optimising your inference and I learned a lot while reading it! There are interesting techniques like prompt caching and architectural considerations that I was vaguely aware of but hadn’t fully appreciated how they might work in real inference systems.

Chapter 9: Overview

Machine learning inference optimization operates across three fundamental domains: model optimization, hardware optimization, and service optimization. While hardware optimization often requires significant investment and may offer limited individual leverage, model and service optimizations provide substantial opportunities for AI engineers to improve performance.

Critical Cost Insight: A 2023 survey revealed that inference can account for up to 90% of machine learning costs in deployed AI systems, often exceeding training costs. This emphasizes why inference optimization isn’t just an engineering challenge - it’s a critical business necessity.

Core Concepts and Bottlenecks

Understanding inference bottlenecks is essential for effective optimization. Two primary types of computational bottlenecks impact inference performance:

Compute-Bound Bottlenecks: Tasks that are limited by raw computational capacity, typically involving complex mathematical operations that take significant time to complete. These bottlenecks are particularly evident in computationally intensive operations within neural networks.

Memory Bandwidth-Bound Bottlenecks: Limitations arising from data transfer requirements between system components, particularly between memory and processors. This becomes especially relevant in Large Language Models where significant amounts of data need to be moved between different memory hierarchies.

In Large Language Models (LLMs), different operations exhibit varying profiles of these bottlenecks. This understanding has led to architectural decisions such as decoupling the prefilling step from the decode step in production environments - a practice that has become increasingly common as organizations optimize their inference pipelines.

Inference APIs and Service Patterns

Two fundamental approaches to inference deployment exist:

Online Inference APIs
- Optimized for minimal latency
- Designed for real-time responses
- Typically more expensive per inference
- Critical for interactive applications
Batch Inference APIs
- Optimized for cost efficiency
- Can tolerate longer processing times (potentially hours)
- Allows providers to optimize resource utilization
- Ideal for bulk processing tasks

Inference Performance Metrics

Several key metrics help quantify inference performance:

Latency Components

Time to First Token
- Measures duration between query submission and initial response
- Critical for user experience in interactive applications
- Often a key optimization target for real-time systems
Time per Output Token
- Generation speed after the first token
- Impacts overall completion time
- Can vary based on model architecture and optimization
Inter-token Latency
- Time intervals between consecutive tokens
- Affects perceived smoothness of generation
- Important for streaming applications

Total latency can be expressed as: time_to_first_token + (time_per_token × number_of_tokens)
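
As a quick worked example of that formula, with made-up numbers (a 500 ms time to first token and 30 ms per output token):

```python
def total_latency_ms(ttft_ms, tpot_ms, num_output_tokens):
    """Total latency = time to first token + time per token * number of tokens."""
    return ttft_ms + tpot_ms * num_output_tokens


print(total_latency_ms(ttft_ms=500, tpot_ms=30, num_output_tokens=200))
# 6500
```

So for a 200-token answer the per-token time dominates: 6 of the 6.5 seconds are spent on generation after the first token has arrived.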

Throughput and Goodput Metrics

Throughput: The number of output tokens per second an inference service can generate across all users and requests. This raw metric provides insight into system capacity.

Goodput: The number of requests per second that successfully meet the Service Level Objective (SLO). This metric offers a more realistic view of useful system capacity.
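
To make the difference between the two concrete, here is a sketch that computes both from a list of finished requests. The record fields and the choice of SLO (limits on time to first token and time per output token) are my assumptions for the example.

```python
def throughput_and_goodput(requests, window_s, ttft_slo_s, tpot_slo_s):
    """Return (output tokens per second, SLO-satisfying requests per second)."""
    total_tokens = sum(r["output_tokens"] for r in requests)
    within_slo = [
        r for r in requests
        if r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s
    ]
    return total_tokens / window_s, len(within_slo) / window_s


requests = [
    {"output_tokens": 100, "ttft_s": 0.2, "tpot_s": 0.03},
    {"output_tokens": 200, "ttft_s": 0.3, "tpot_s": 0.04},
    {"output_tokens": 300, "ttft_s": 1.5, "tpot_s": 0.04},  # misses the TTFT SLO
]
print(throughput_and_goodput(requests, window_s=10, ttft_slo_s=0.5, tpot_slo_s=0.05))
# (60.0, 0.2)
```

The slow third request still counts towards throughput (it produced the most tokens of the three), but it does nothing for goodput.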

Resource Utilization Metrics

Model FLOPS Utilization (MFU)
- Ratio of actual to theoretical FLOPS
- Indicates computational efficiency
- Key metric for hardware optimization
Model Bandwidth Utilization (MBU)
- Percentage of achievable memory bandwidth utilized
- Critical for memory-intensive operations
- Helps identify memory bottlenecks
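
Both metrics are ratios of achieved to peak performance. The sketch below uses one common approximation for MBU during decoding (every generated token requires reading all the weights once, ignoring the KV cache); the 7B-parameter model and the 2 TB/s accelerator are hypothetical numbers chosen to keep the arithmetic easy.

```python
def model_flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """MFU: fraction of the hardware's theoretical FLOP/s actually achieved."""
    return achieved_flops_per_s / peak_flops_per_s


def model_bandwidth_utilization(num_params, bytes_per_param, tokens_per_s,
                                peak_bandwidth_bytes_per_s):
    """MBU, approximated as weight bytes read per second over peak bandwidth."""
    used_bandwidth = num_params * bytes_per_param * tokens_per_s
    return used_bandwidth / peak_bandwidth_bytes_per_s


# Hypothetical: 7B parameters in 16-bit (2 bytes), 100 tokens/s, 2 TB/s peak
print(model_bandwidth_utilization(7e9, 2, 100, 2e12))
# 0.7
```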

Hardware Considerations and AI Accelerators

While NVIDIA GPUs dominate the market, various specialized chips exist for inference:

Popular AI Accelerators

NVIDIA GPUs (market leader)
AMD accelerators
Google TPUs
Various emerging specialized chips

Inference vs Training Hardware: Inference-optimized chips prioritize lower precision and faster memory access over large memory capacity, contrasting with training-focused hardware that requires substantial memory capacity.

Key hardware optimization considerations include:

Memory size and bandwidth requirements
Chip architecture specifics
Power consumption profiles
Physical chip architecture variations
Cost-performance ratios

Model Optimization Techniques

Core Approaches

Quantization
- Reduces numerical precision (e.g., 32-bit to 16-bit)
- Decreases memory footprint
- Weight-only quantization is particularly common
- Can halve model size with minimal performance impact
Pruning
- Removes non-essential parameters
- Preserves core model behavior
- Multiple techniques available
- Requires careful validation
Distillation
- Creates smaller, more efficient models
- Maintains key capabilities
- Covered extensively in Chapter 8
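
The "halve model size" claim for quantization is easy to check with back-of-the-envelope arithmetic. This only counts the weights (no activations, KV cache or framework overhead), and the 7B parameter count is just an example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8):
    print(bits, weight_memory_gb(7e9, bits))
# 32 28.0
# 16 14.0
# 8 7.0
```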

Advanced Decoding Strategies

Speculative Decoding

This approach combines a large model with a smaller, faster model:

Small model generates rapid initial outputs
Large model verifies and corrects as needed
Provides faster token generation
Easy to implement
Integrated into frameworks like vLLM and llama.cpp
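
Here is a toy sketch of the draft-then-verify idea, using the greedy variant where a drafted token is accepted only if it matches what the large model would have picked. Real implementations verify all drafted positions in a single batched forward pass and often use a probabilistic acceptance rule; the toy "models" below are plain functions I made up so the example runs.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The small draft model cheaply proposes k tokens
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The large model checks each proposed position
        #    (one batched forward pass in a real system)
        verify_passes += 1
        accepted = []
        for guess in proposal:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if guess != correct:
                break  # first mismatch: keep the target's token and stop
        else:
            # Every guess matched, so the same pass yields one bonus token
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_passes


def toy_target(seq):
    """Stand-in for the large model: always continues 'abcabc...'."""
    return "abc"[len(seq) % 3]


def toy_draft(seq):
    """Stand-in for the small model: right most of the time, wrong now and then."""
    return "x" if len(seq) % 4 == 2 else toy_target(seq)


out, passes = speculative_decode(toy_target, toy_draft, [], max_new_tokens=9)
print("".join(out), passes)
# abcabcabc 3
```

The output is identical to what the large model would produce on its own, but it took 3 verification passes instead of 9 sequential steps, which is where the speed-up comes from.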

Inference with Reference

Performs mini-RAG operations during decoding
Retrieves relevant context from input query
Requires additional memory overhead
Useful for maintaining context accuracy

Parallel Decoding

Rather than strictly sequential token generation, this method:

Generates multiple tokens simultaneously
Uses resolution mechanisms to maintain coherence
Implements look-ahead techniques
Algorithmically complex but offers significant speed benefits
Demonstrated success with look-ahead decoding method

Attention Optimization

Several strategies exist for optimizing attention mechanisms:

Key-Value Cache Optimization
- Critical for large context windows
- Requires substantial memory
- Various techniques for size reduction
Specialized Attention Kernels
- Flash Attention as leading example
- Hardware-specific implementations
- Flash Attention 3 for H100 GPUs

Service-Level Optimization

Batching Strategies

Static Batching
- Processes fixed-size batches
- Waits for complete batch (e.g., 100 requests)
- Simple but potentially inefficient
Dynamic Batching
- Uses time windows for batch formation
- Processes incomplete batches after timeout
- Balances latency and throughput
Continuous Batching
- Returns completed responses immediately
- Dynamically manages resource utilization
- Similar to a bus route that continuously picks up new passengers
- Optimizes occupation rate
- Based on Orca paper’s findings
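
To make the dynamic batching rule concrete, here is a small simulation that groups request arrival times into batches, closing a batch either when it is full or when the oldest request in it has waited too long. It only models the grouping decision and checks the timeout when the next request arrives, which a real server would do with a timer; the function name and parameters are my own.

```python
def dynamic_batches(arrival_times, max_batch_size=4, max_wait_s=0.5):
    """Group sorted arrival times into batches closed by size or timeout."""
    batches, current = [], []
    for t in arrival_times:
        if current and t - current[0] > max_wait_s:
            batches.append(current)  # timeout: ship the incomplete batch
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full: ship it immediately
            current = []
    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


print(dynamic_batches([0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 1.2, 1.3]))
# [[0.0, 0.1, 0.2], [0.9, 1.0, 1.1, 1.2], [1.3]]
```

Static batching would have held the first three requests until a fourth turned up at 0.9 seconds; the timeout is what bounds their wait.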

Prefill-Decode Decoupling

Separates prefill and decode operations
Essential for large-scale inference providers
Allows optimal resource allocation
Improves overall system efficiency

Prompt Caching

Stores computations for overlapping text segments
Offered by providers like Gemini and Anthropic
May incur storage costs
Requires careful cost-benefit analysis
Must be explicitly enabled

Parallelism Strategies

Replica Parallelism
- Creates multiple copies of the model
- Distributes requests across replicas
- Simplest form of parallelism
Tensor Parallelism
- Splits individual tensors across devices
- Enables processing of larger models
- Requires careful coordination
Pipeline Parallelism
- Divides model computation into stages
- Assigns stages to different devices
- Optimizes resource utilization
- Reduces memory requirements
Context Parallelism
- Processes different parts of input context in parallel
- Particularly useful for long sequences
- Can significantly reduce latency
Sequence Parallelism
- Processes multiple sequences simultaneously
- Leverages hardware-specific features
- Requires careful implementation

Implementation Considerations

When implementing inference optimizations:

Multiple optimization techniques are typically combined in production
Hardware-specific optimizations require careful testing
Service-level optimizations often provide significant gains with minimal model modifications
Optimization choices depend heavily on specific use cases and requirements

Dataset Engineering: The Art and Science of Data Preparation

Alex Strick van Linschoten — Tue, 04 Feb 2025 23:00:00 GMT

Finally back on track and reading the next chapter of Chip Huyen’s book, ‘AI Engineering’. Here are my notes on the chapter.

Overview and Core Philosophy

“Data will be mostly just toil, tears and sweat.”

This is how we start the chapter :) This candid assessment frames dataset engineering as a discipline that requires both technical sophistication and pragmatic persistence. While the chapter’s placement might have been suitable earlier in the book, its position allows it to build effectively on previously established concepts.

Data Curation: The Foundation

Data curation addresses various use cases including fine-tuning, pre-training, and training from scratch, with specific considerations for chain of thought reasoning and tool use. The process addresses three fundamental aspects:

Data Quality: The equivalent of ingredient quality in cooking

Data Coverage: Analogous to having the right mix of ingredients

Data Quantity: Determining the optimal volume of ingredients

Quality Criteria

Data quality encompasses multiple dimensions:

Relevance to task requirements
Consistency in format and structure
Sufficient uniqueness
Regulatory compliance (especially critical in regulated industries)

Coverage Considerations

Coverage involves strategic decisions about data proportions:

Large language models often utilize significant code data (up to 50%) in training, which appears to enhance logical reasoning capabilities beyond just coding
Language distribution can be surprisingly efficient (even 1% representation of a language can enable meaningful capabilities)
Training proportions may vary across different stages of the training process

Quantity and Optimization

A key phenomenon discussed is ossification, where extensive pre-training can effectively freeze model weights, potentially hampering fine-tuning adaptability. This effect is particularly pronounced in smaller models.

Key quantity considerations include:

Task complexity correlation with data requirements
Base model performance implications
Model size considerations (OpenAI notes that with ~100 examples, more advanced models show superior fine-tuning performance)
Potential for using lower quality or less relevant data for initial fine-tuning to reduce high-quality data requirements
Recognition of performance plateaus where additional data yields diminishing returns

Data Acquisition Process

The chapter provides a detailed example workflow for creating an instruction-response dataset:

Initial dataset identification (~10,000 examples)
Low-quality instruction removal (reducing to ~9,000)
Low-quality response filtering (removing 3,000)
Manual response writing for remaining high-quality instructions
Topic gap identification and template creation (100 templates)
AI synthesis of 2,000 new instructions
Manual annotation of synthetic instructions

Final result: 11,000 high-quality examples
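
The arithmetic behind that final figure, as I read the steps above (assuming the 3,000 instructions whose responses were filtered out are the ones that get manually written responses):

```python
initial = 10_000
after_instruction_filter = 9_000             # low-quality instructions removed
low_quality_responses = 3_000                # responses filtered out
kept_responses = after_instruction_filter - low_quality_responses
manually_written = low_quality_responses     # new responses for those instructions
synthesized = 2_000                          # AI-generated, manually annotated

final = kept_responses + manually_written + synthesized
print(kept_responses, final)
# 6000 11000
```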

Data Augmentation and Synthesis

Synthesis Objectives

Increasing data quantity
Expanding coverage
Enhancing quality
Addressing privacy concerns
Enabling model distillation

Notable Research: An Anthropic paper (2022) found that language model-generated datasets can match or exceed human-written ones in quality for certain tasks.

Note that some teams actually prefer AI-generated preference data due to human fatigue and inconsistency factors.

Synthesis Applications

The chapter distinguishes between pre-training and post-training synthesis:

Synthetic data appears more frequently in post-training
Pre-training limitation: AI can reshape existing knowledge but struggles to synthesize new knowledge

LLaMA 3 Synthesis Pipeline

A comprehensive workflow example:

AI generation of problem descriptions
Solution generation in multiple programming languages
Unit test generation
Error correction
Cross-language translation with test verification
Conversation and documentation generation with back-translation verification

This pipeline generated 2.7 million synthetic coding examples for LLaMA 3.1’s supervised fine-tuning.

Model Collapse Considerations

The chapter addresses the risk of model collapse in synthetic data usage:

Potential loss of training signal through repeated synthetic data use
Current research suggests proper implementation can avoid collapse
Importance of quality control in synthetic data generation

Model Distillation

Notable example: BuzzFeed’s fine-tuning of Flan-T5 using LoRA and examples generated by OpenAI’s text-davinci-003, achieving 80% inference cost reduction.

Data Processing Best Practices

Expert Tip: “Manual inspection of data has probably the highest value to prestige ratio of any activity in machine learning.” - Greg Brockman, OpenAI co-founder

Processing Guidelines

The chapter emphasizes efficiency optimization:

Order optimization (e.g., deduplication before cleaning if computationally advantageous)
Trial run validation before full dataset processing
Data preservation (avoid in-place modifications)
Original data retention for:
- Alternative processing needs
- Team requirements
- Error recovery

Technical Processing Approaches

Deduplication strategies include:

Pairwise comparison
Hashing methods
Dimensionality reduction techniques

Multiple libraries are referenced (page 400) for implementation.
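
As a minimal illustration of the hashing approach, here is exact-duplicate removal after light normalisation. This is my own sketch rather than one of the libraries the book references, and it won't catch near-duplicates, which need fuzzier techniques.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate(examples):
    """Keep the first occurrence of each example, comparing by content hash."""
    seen, unique = set(), []
    for text in examples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(deduplicate(["Hello world", "hello   WORLD", "Goodbye"]))
# ['Hello world', 'Goodbye']
```

Storing hashes instead of the texts themselves keeps memory flat, and it avoids the quadratic cost of comparing every pair.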

Data Cleaning and Formatting

HTML tag removal for signal enhancement
Careful prompt template formatting, crucial for:
- Fine-tuning operations
- Instruction tuning
- Model performance optimization
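
A small sketch of both steps, assuming a regex is good enough for tag stripping (it isn't for messy real-world HTML, where a proper parser is safer). The template is a hypothetical one; whatever format you pick, the point is that it has to match exactly between fine-tuning and inference.

```python
import html
import re

# Hypothetical template; use whatever format your base model expects
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def strip_html(raw):
    """Remove tags, decode entities and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop anything that looks like a tag
    text = html.unescape(text)           # e.g. &amp; becomes &
    return " ".join(text.split())


def format_example(instruction, response):
    """Render one training example with the fixed prompt template."""
    return TEMPLATE.format(
        instruction=strip_html(instruction), response=strip_html(response)
    )


print(strip_html("<p>Fish &amp; <b>chips</b></p>"))
# Fish & chips
```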

Data Inspection

The chapter emphasizes the importance of manual data inspection:

Utilize various data exploration tools
Dedicate time to direct data examination (recommended: 15 minutes of direct observation)
Consider this step non-optional in the process

Notes on ‘AI Engineering’ (Chip Huyen) chapter 7: Finetuning

Alex Strick van Linschoten — Sat, 25 Jan 2025 23:00:00 GMT

I enjoyed chapter 7 on finetuning. It jams a lot of detail into the 50 pages she takes to explain things. Some areas had more detail than you’d expect, and others less, but overall this was a solid summary / review.

Core Narrative: Fine-tuning represents a significant technical and organisational investment that should be approached as a last resort, not a first solution.

The chapter’s essential message can be distilled into three key points:

The decision to fine-tune should follow exhausting simpler approaches like prompt engineering and RAG. At the end she sums it up: fine-tuning is for form, while RAG is for facts.
Memory considerations dominate the technical landscape of fine-tuning, leading to the emergence of techniques like PEFT (particularly LoRA) that make fine-tuning more accessible. The chapter emphasises that while the actual process of fine-tuning isn’t necessarily complex, the surrounding infrastructure and maintenance requirements are substantial.
A clear progression pathway emerges: start with prompt engineering, move to examples (up to ~50), implement RAG if needed, and only then consider fine-tuning. Even then, breaking down complex tasks into simpler components might be preferable to full fine-tuning.

So fine-tuning can be incredibly powerful when applied judiciously, but it requires careful consideration of both technical capabilities and organisational readiness.

Chapter Overview and Context

This long chapter (approximately 50 pages, much like the others) was notably one of the most challenging for Chip to write. It presents fine-tuning as an advanced approach that moves beyond basic prompt engineering, covering everything from fundamental concepts to practical implementation strategies.

The depth and breadth of the chapter reflect the complexity of fine-tuning as both a technical and organisational challenge, though what she writes doesn’t really cover the reality of what it’s like to work on these kinds of initiatives within a team.

Core Decision: When to Fine-tune

The decision to fine-tune should never be taken lightly. While the potential benefits are significant, including improved model quality and task-specific capabilities, the chapter emphasises that fine-tuning should be considered a last resort rather than a default approach.

Notable Case Study: Grammarly achieved remarkable results with their fine-tuned T5 models, which outperformed GPT-3 variants despite being 60 times smaller. This example illustrates how targeted fine-tuning can sometimes achieve better results than using larger, more general models.

Reasons to Avoid Fine-tuning

The chapter presents several compelling reasons why organisations might want to exhaust other options before pursuing fine-tuning:

Performance Degradation: Fine-tuning can actually degrade model performance on tasks outside the specific target domain
Engineering Complexity: The process introduces significant technical overhead
Specialised Knowledge Requirements: Teams need expertise in model training
Infrastructure Demands: You need infrastructure to host and serve the model yourself
Ongoing Maintenance: Requires dedicated policies and budgets for monitoring and updates

Fine-tuning vs. RAG: A Critical Distinction

One of the most important conceptual frameworks presented is the distinction between fine-tuning and RAG:

Fine-tuning focuses on form - how the model expresses information
RAG specialises in facts - what information the model can access and use

This separation provides a clear decision framework, though the chapter acknowledges there are exceptions to this general rule.

Progressive Implementation Workflow

The chapter outlines a thoughtful progression of implementation strategies, suggesting organisations should:

Begin with prompt engineering optimisation
Expand to include more examples (up to approximately 50)
Implement dynamic data source connections through RAG
Consider advanced RAG methodologies
Explore fine-tuning only after exhausting other options
Consider task decomposition if still unsuccessful

Memory Bottlenecks and Technical Considerations

Critical Memory Factors

The chapter emphasises three key contributors to a model’s memory footprint during fine-tuning:

Parameter count
Trainable parameter count
Numeric representations

Technical Note: The relationship between trainable parameters and memory requirements becomes a key motivator for PEFT (Parameter Efficient Fine Tuning) approaches.
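A back-of-the-envelope calculation shows why the trainable parameter count matters so much. This is my own simplified estimate, not a formula quoted from the chapter: it assumes an Adam-style optimiser that keeps two extra values per trainable parameter, uses the same precision for everything, and ignores activations entirely.

```python
def estimate_training_memory_gb(
    total_params: float,
    trainable_params: float,
    bytes_per_value: int = 2,            # assumption: 16-bit values throughout
    optimizer_states_per_param: int = 2,  # assumption: Adam-style optimiser
) -> float:
    """Very rough memory estimate in GB, ignoring activations."""
    weights = total_params * bytes_per_value
    gradients = trainable_params * bytes_per_value
    optimizer = trainable_params * optimizer_states_per_param * bytes_per_value
    return (weights + gradients + optimizer) / 1e9


# Full fine-tuning of a 7B model: every parameter is trainable.
full = estimate_training_memory_gb(7e9, 7e9)  # 14 + 14 + 28 = 56 GB

# PEFT-style run where only 1% of that many parameters are trainable.
peft = estimate_training_memory_gb(7e9, 7e7)  # 14 + 0.14 + 0.28 = 14.42 GB
```

Under these assumptions, the weights themselves stay the same size, but the gradient and optimiser overhead almost disappears once the trainable parameter count is small.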

Quantisation Strategies

The chapter provides a detailed examination of quantisation approaches, particularly noting the distinction between:

Post-Training Quantisation (PTQ)
- Most common approach
- Particularly relevant for AI application developers
- Supported by major frameworks with minimal code requirements
Training Quantisation
- Emerging approach gaining traction
- Aims to optimise both inference performance and training costs
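To give a feel for what post-training quantisation does to a weight, here is a toy sketch of symmetric "absmax" int8 quantisation in plain Python. It is one simple scheme chosen for illustration, not the specific method the chapter or any particular framework uses.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats onto integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantised]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantise_int8(weights)   # q == [50, -127, 0, 100]
restored = dequantise(q, scale)     # close to, but not exactly, the originals
```

Each value now fits in one byte instead of two or four, at the cost of a rounding error of at most half the scale factor per weight.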

Advanced Fine-tuning Techniques

PEFT Methodologies

The chapter identifies two primary PEFT approaches:

Adapter-based methods (Additive):
- LoRA emerges as the most popular implementation
- Includes variants like DoRA and QDoRA (the latter from Answer.AI)
- Involves adding new modules to existing model weights
Soft prompt-based methods:
- Less common but growing in popularity
- Introduces trainable tokens for input processing modification
- Offers a middle ground between full fine-tuning and basic prompting, so maybe interesting for teams who don’t really want to go too deep into finetuning (?)
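The core LoRA idea can be sketched in a few lines: the original weight matrix W stays frozen, and a low-rank product B @ A (scaled by alpha / rank) is learned and added on top. The snippet below is a pure-Python illustration with tiny hand-picked matrices, not how any library implements it.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * (B @ A) without modifying W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
a = [[1.0, 2.0]]              # rank x d_in  = 1 x 2
b = [[0.5], [0.0]]            # d_out x rank = 2 x 1
merged = lora_merge(w, a, b, alpha=2, rank=1)  # [[2.0, 2.0], [0.0, 1.0]]

# For a 4096 x 4096 layer at rank 8: 65,536 trainable values vs 16,777,216.
adapter_size = lora_trainable_params(4096, 4096, rank=8)
```

The parameter count is what ties this back to the memory discussion: the adapter for that layer is well under 1% of the size of the full matrix.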

Model Merging and Multitask Considerations

The chapter presents model merging as an evolving science, requiring significant expertise. Three primary approaches are discussed:

Summing
Layer stacking
Concatenation (generally not recommended due to memory implications)

There’s a lot of detail in this section (much more than I’d expected) but it was interesting to read about something that I haven’t much practical expertise with.

Core Approaches to Model Merging

The chapter outlines three fundamental approaches to model merging, each with its own technical considerations and trade-offs:

Technical Architecture: The three primary merging strategies

Summing: Direct weight combination

Layer stacking: Vertical integration of model components

Concatenation: Horizontal expansion (though notably discouraged due to memory implications)

The relative simplicity of these approaches belies their potential impact on model architecture and performance. Particularly interesting is how these techniques interface with the broader challenge of multitask learning.
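Of the three, summing is the easiest to picture. A minimal sketch, assuming each model is just a dictionary mapping parameter names to flat lists of floats (a stand-in for real tensors) and that both models share the same architecture:

```python
Params = dict[str, list[float]]


def merge_by_weighted_sum(models: list[Params], weights: list[float]) -> Params:
    """Combine matching parameters as a weighted sum across models."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    names = set(models[0])
    if any(set(m) != names for m in models):
        raise ValueError("all models must share the same parameter names")
    merged: Params = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(wt * m[name][i] for m, wt in zip(models, weights))
            for i in range(size)
        ]
    return merged


# Two hypothetical task-specific fine-tunes of the same base model.
summarisation_model = {"layer.weight": [1.0, 2.0]}
translation_model = {"layer.weight": [3.0, 4.0]}

merged = merge_by_weighted_sum(
    [summarisation_model, translation_model], weights=[0.5, 0.5]
)  # {"layer.weight": [2.0, 3.0]}
```

Equal weights give a plain average. Choosing the weights (and deciding which parameters to merge at all) is where the "evolving science" part comes in.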

Multitask Learning: A New Paradigm

Traditional approaches to multitask learning have typically forced practitioners into one of two suboptimal paths:

Simultaneous Training
- Requires creation of a comprehensive dataset containing examples for all tasks
- Necessitates careful balancing of task representation
- Often leads to compromise in per-task performance
Sequential Training
- Fine-tunes the model on each task in sequence
- Risks catastrophic forgetting as new tasks overwrite previous learning
- Requires careful orchestration of task order and learning rates

Key Innovation: Model merging introduces a third path - parallel fine-tuning followed by strategic combination. This approach fundamentally alters the landscape of multitask learning optimisation.

The Parallel Processing Advantage

Model merging enables a particularly elegant solution to the multitask learning challenge through parallel processing:

Individual models can be fine-tuned for specific tasks independently
Training can occur in parallel, optimising computational resource usage
Models can be merged post-training, preserving task-specific optimisations

This approach brings several compelling advantages:

Strategic Benefits:
- Parallel training efficiency
- Independent task optimisation
- Flexible deployment options
- Reduced risk of inter-task interference

Practical Implications

While the implementation details remain somewhat experimental, the potential applications are significant. Organisations can:

Develop specialised models in parallel
Optimise individual task performance without compromise
Maintain flexibility in deployment architecture
Scale their multitask capabilities more efficiently

Implementation Pathways

The chapter concludes with two distinct development approaches:

Progression Path

Begin with the most economical and fastest model
Validate with a mid-tier model
Push boundaries with the optimal model
Map the price-performance frontier
Select the most appropriate model based on requirements

Distillation Path

Start with a small dataset and the strongest affordable model
Generate additional training data using the fine-tuned model
Train a more cost-effective model using the expanded dataset

Final Observations

The chapter emphasises that while the technical process of fine-tuning isn’t necessarily complex, the surrounding context and implications are highly nuanced. Success requires careful consideration of business priorities, resource availability, and long-term maintenance capabilities. This holistic perspective is crucial for organisations considering fine-tuning as part of their AI strategy.

Notes on ‘AI Engineering’ (Chip Huyen) chapter 6

Alex Strick van Linschoten — Thu, 23 Jan 2025 23:00:00 GMT

This chapter was all about RAG and agents. It’s only 50 pages, so clearly there’s only so much of the details she can get into, but it was pretty good nonetheless and there were a few things in here I’d never really read. Also Chip does a good job bringing the RAG story into the story about agents, particularly in terms of how she defines agents. (Note that the second half of this chapter, on agents, is available on Chip’s blog as a free excerpt!)

As always, what follows is just my notes on the things that seemed interesting to me (and a high-level overview of the main points of the chapter just for future reference). YMMV!

Chapter Structure and Framing

This chapter undertakes the ambitious task of unifying two major paradigms in AI engineering: Retrieval-Augmented Generation (RAG) and Agents. At first glance, combining these topics might seem surprising given their scope and complexity. However, Chip creates a compelling framework that positions both as sophisticated approaches to context construction.

The unifying thesis presents RAG as a specialised case of the agent pattern, where the retriever functions as a tool at the model’s disposal. Both patterns serve to transcend context limitations and maintain current information, though agents ultimately offer broader capabilities. This framing provides an elegant theoretical bridge between these technologies while acknowledging their distinct characteristics.

Retrieval-Augmented Generation (RAG)

Core Concepts and Context Windows

The discussion begins with a fundamental examination of RAG’s purpose: enhancing model outputs with query-specific context to produce more grounded and useful results. Chip introduces a fascinating variation on Parkinson’s Law:

Context Expansion Law: Application context tends to expand to fill the context limits supported by the model.

This observation challenges the common assumption that RAG might become obsolete with infinite context models. Chip argues that larger context windows don’t necessarily solve the fundamental challenges RAG addresses, particularly noting that models often struggle with information buried in the middle of large context windows.

Retrieval Architecture and Algorithms

The retrieval architecture discussion introduces two primary paradigms:

Sparse Retrieval: Term-based approaches that rely on explicit matching of terms between queries and documents. The primary example is TF-IDF (Term Frequency-Inverse Document Frequency), which scores a term as more important the more often it appears in a document and the less often it appears across the whole collection.

Dense Retrieval: Embedding-based approaches that transform text into vector representations, requiring specialised vector databases for storage and sophisticated nearest-neighbour search algorithms for retrieval.
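To show the term-based side in miniature, here is a bare-bones TF-IDF scorer. It uses whitespace tokenisation and the simplest textbook weighting (term frequency times log(N / document frequency)); real systems use more careful variants, so treat this as a sketch of the idea only.

```python
import math
from collections import Counter


def tokenise(text: str) -> list[str]:
    return text.lower().split()


def tfidf_scores(query: str, documents: list[str]) -> list[float]:
    """Score each document against the query with a basic TF-IDF sum."""
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenise(query):
            if term in term_freq:
                idf = math.log(n_docs / doc_freq[term])
                score += (term_freq[term] / len(doc)) * idf
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
scores = tfidf_scores("cat mat", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top document
```

The first document wins because it matches both query terms, and "mat" is rarer across the collection than "cat". The third document shares no terms with the query and scores zero, which is exactly the blind spot that embedding-based retrieval is meant to cover.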

Cost Considerations and Trade-offs

A striking revelation emerges regarding the cost structure of RAG systems: vector database expenses often amount to between one-fifth and half of a company’s total model API spending. This cost burden becomes particularly acute for systems requiring frequent embedding updates due to changing data. Chip notes that both vector storage and vector search queries can be surprisingly expensive operations.

Retrieval Optimisation Techniques

The chapter presents several sophisticated approaches to optimisation:

Chunking Strategies: While the section is brief, it addresses the critical trade-offs in how documents are segmented for retrieval.

Query Rewriting: A powerful but potentially complex technique that enhances initial queries with contextual information. For example, transforming a query like “how about her?” into “how about Aunt Mabel from the previous question?” Chip notes this can introduce latency issues and suggests careful consideration before implementation.

Contextual Retrieval: Introduces the innovative “chunks-for-chunks” approach, where each retrieved chunk triggers additional retrievals for supplementary context. This might include retrieving related tags or associated metadata to enrich the initial results.

Hybrid Search: Combines term-based and embedding-based retrieval, typically implementing a re-ranking process. A common pattern involves using term-based retrieval (like Elasticsearch) to obtain an initial set of ~50 (or however many!) documents, followed by embedding-based re-ranking to identify the most relevant subset.
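The two-stage hybrid pattern can be sketched as follows. Everything here is a toy stand-in: term overlap plays the role of the term-based retriever, and the two-dimensional vectors are made-up "embeddings" passed in by hand rather than produced by a real model.

```python
import math


def term_overlap(query: str, doc: str) -> int:
    """Stage-one score: number of distinct query terms found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_search(
    query: str,
    query_vec: list[float],
    docs: list[str],
    doc_vecs: list[list[float]],
    first_stage_k: int = 50,
    final_k: int = 5,
) -> list[int]:
    """Cheap term-based shortlist first, then embedding-based re-ranking."""
    shortlist = sorted(
        range(len(docs)), key=lambda i: term_overlap(query, docs[i]), reverse=True
    )[:first_stage_k]
    reranked = sorted(
        shortlist, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )
    return reranked[:final_k]


docs = ["python snake care guide", "python programming tutorial", "gardening tips"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up embeddings
ranking = hybrid_search(
    "python guide", [0.0, 1.0], docs, doc_vecs, first_stage_k=2, final_k=2
)  # [1, 0]
```

In this toy case the term-based stage prefers the snake guide (two matching words), but the re-ranker promotes the programming tutorial because its vector is closer to the query's.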

Evaluation Framework

The evaluation framework centres on two primary metrics:

Context Precision: The percentage of retrieved documents that are relevant to the query. Generally easier to measure and optimise.

Context Recall: The percentage of all relevant documents that are successfully retrieved. More challenging to measure as it requires comprehensive dataset annotation.
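Both metrics are simple to compute once you have relevance labels, which is the hard part. A small sketch with hypothetical document IDs:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}  # requires knowing every relevant doc in the corpus

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved -> 0.5
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant  -> 0.666...
```

Precision only needs judgments on the retrieved documents, whereas recall needs the full `relevant` set, which is why it is the more expensive metric to measure.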

Agents

Foundational Definition

Chip provides a clear definition of an agent:

Agent Definition: An entity capable of perceiving its environment and acting upon it, characterised by:
- The environment it operates in (defined by use case)
- The set of actions it can perform (augmented by tools)

Tool Types and Capabilities

The chapter delineates three primary categories of tools:

Knowledge Augmentation Tools:
- RAG systems
- Web search capabilities
- API calls for information retrieval

Capability Extension Tools:
- Code interpreters
- Terminal access
- Function execution capabilities

These have been shown to significantly boost model performance compared to prompting or fine-tuning alone.

Write Actions:
- Data manipulation capabilities
- Storage and deletion operations

Planning Architecture

The planning process emerges as a four-stage cycle:

Plan Generation: Task decomposition and strategy development
Initial Reflection: Plan evaluation and potential revision
Execution: Implementation of planned actions, often involving specific function calls
Final Reflection: Outcome evaluation and error correction

Chip includes an interesting debate about foundation models as planners, noting Yann LeCun’s assertion that autoregressive models cannot truly plan, though this remains a point of discussion in the field.

Plan Execution Patterns

The execution of agent plans reveals a fascinating interplay between computational patterns and practical implementation. Chip identifies several fundamental execution patterns that form the backbone of agent behaviour, each offering distinct advantages and trade-offs in different scenarios.

Execution Paradigms: The core patterns through which agents transform plans into actions, ranging from simple sequential execution to complex conditional logic.

The primary execution patterns include:

Sequential Execution: The most straightforward pattern, where actions are performed one after another in a predetermined order. This approach offers predictability and simplicity but may not maximise efficiency when actions could be performed concurrently.

Parallel Execution: Enables multiple actions to be performed simultaneously when dependencies permit. While this pattern can significantly improve performance, it introduces complexity in managing concurrent operations and handling potential conflicts.

Conditional Execution: Implements decision points through if statements, allowing agents to adapt their execution path based on intermediate results or environmental conditions. This pattern introduces crucial flexibility but requires careful handling of branch logic and state management.

Iterative Execution: Utilises for loops to handle repetitive tasks or process collections of items. This pattern is particularly powerful when dealing with datasets or when similar actions need to be performed multiple times with variations.

Pattern Selection: The choice of execution pattern often emerges from the intersection of task requirements, system constraints, and performance goals.

The effectiveness of these patterns depends heavily on the underlying system architecture and the specific requirements of the task at hand. For instance, parallel execution might offer theoretical performance benefits but could introduce unnecessary complexity for simple, linear tasks. Similarly, conditional execution provides valuable flexibility but requires robust error handling and state management to maintain system reliability.

Chip emphasises that these patterns aren’t mutually exclusive - sophisticated agent systems often combine multiple patterns to create more complex and capable execution strategies. This hybrid approach allows for the development of highly adaptable agents that can handle a wide range of tasks while maintaining system stability and performance.
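The four patterns map onto very ordinary control flow. A compact sketch, using two hypothetical stand-in tools (`fetch` and `summarise`) in place of real API or model calls:

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical tools: a real agent would call APIs or models here.
def fetch(topic: str) -> str:
    return f"notes on {topic}"


def summarise(text: str) -> str:
    return text.upper()


# Sequential: each step waits for the previous one.
sequential = summarise(fetch("rag"))

# Parallel: independent calls run concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(fetch, ["rag", "agents"]))

# Conditional: branch on an intermediate result.
conditional = summarise(parallel[0]) if "rag" in parallel[0] else parallel[0]

# Iterative: repeat the same action over a collection.
iterative = [summarise(item) for item in parallel]
```

In a real agent the plan itself would come from the model. The point here is only that executing such a plan combines these familiar building blocks.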

Planning Optimisation

The chapter provides several practical tips for improving agent planning:

Enhance system prompts with more examples
Provide better tool descriptions and parameter documentation
Simplify complex functions through refactoring
Consider using stronger models or fine-tuning for plan generation

Function Calling Implementation

The function calling architecture requires:

Tool inventory creation, including:
- Function names and entry points
- Parameter specifications
- Comprehensive documentation
Tool usage specification (required vs. optional)
Version control for function names, parameters, and documentation
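One way to picture a tool inventory is as a small registry that keeps the name, entry point, documentation and version together, and validates arguments before calling anything. The `Tool` dataclass, `INVENTORY` dictionary and `call_tool` helper below are all hypothetical names I made up for this sketch. They do not correspond to any particular framework's API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    entry_point: Callable[..., Any]
    description: str
    parameters: dict[str, str]              # parameter name -> description
    required_parameters: tuple[str, ...] = ()
    version: str = "1.0.0"


def get_weather(city: str, units: str = "celsius") -> str:
    # Placeholder implementation; a real tool would call a weather service.
    return f"Weather for {city} in {units}"


INVENTORY = {
    "get_weather": Tool(
        name="get_weather",
        entry_point=get_weather,
        description="Look up the current weather for a city.",
        parameters={"city": "City name", "units": "celsius or fahrenheit"},
        required_parameters=("city",),
    )
}


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Validate model-supplied arguments against the inventory, then call."""
    tool = INVENTORY[name]
    unknown = set(arguments) - set(tool.parameters)
    missing = set(tool.required_parameters) - set(arguments)
    if unknown or missing:
        raise ValueError(
            f"bad arguments for {name}: "
            f"unknown={sorted(unknown)}, missing={sorted(missing)}"
        )
    return tool.entry_point(**arguments)


result = call_tool("get_weather", {"city": "Delft"})
```

Keeping the description and version alongside the function makes it easier to notice when a prompt was written against an older signature.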

Planning Granularity

Chip introduces an important discussion of planning levels, analogous to temporal planning horizons (yearly plans vs. daily tasks). This presents a fundamental trade-off:

Planning Trade-off: Higher-level plans are easier to generate but harder to execute, while detailed plans are harder to generate but easier to execute.

Tool Selection and Evaluation

The chapter provides a systematic approach to tool selection:

Conduct ablation studies to measure performance impact
Monitor tool usage patterns and error rates
Analyze tool call distribution
Consider model-specific tool preferences (noting that GPT-4 tends to use a wider tool set than ChatGPT)
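Monitoring usage patterns and error rates does not need much machinery to get started. A sketch over a hypothetical call log, where each entry is a tool name plus a success flag:

```python
from collections import Counter

# Hypothetical log of (tool_name, succeeded) pairs pulled from traces.
calls = [
    ("search", True),
    ("search", False),
    ("calculator", True),
    ("search", True),
]


def tool_stats(log: list[tuple[str, bool]]) -> dict[str, dict[str, float]]:
    """Per-tool share of all calls and per-tool error rate."""
    totals = Counter(name for name, _ in log)
    errors = Counter(name for name, ok in log if not ok)
    return {
        name: {
            "share": totals[name] / len(log),
            "error_rate": errors[name] / totals[name],
        }
        for name in totals
    }


stats = tool_stats(calls)
# search: 75% of calls, one failure in three; calculator: 25% of calls, no failures
```

A tool that is rarely called, or that fails most of the time it is called, is a candidate for better documentation, refactoring or removal.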

Memory Systems

The memory architecture comprises two core functions:

Memory Functions:
- Memory management
- Memory retrieval

The system supports three types of memory:

Internal knowledge
Short-term memory
Long-term memory

These systems prove crucial for:

Managing information overflow
Maintaining session persistence
Ensuring model consistency
Preserving data structural integrity

Evaluation and Failure Modes

The comprehensive evaluation framework considers:

Planning effectiveness
Tool execution accuracy
System latency
Overall efficiency
Memory system performance

Conclusion

The unifying thread of context construction provides a compelling framework for understanding these technologies not as separate entities, but as complementary approaches to extending model capabilities.

Notes on ‘AI Engineering’ (Chip Huyen) chapter 4

Alex Strick van Linschoten — Tue, 21 Jan 2025 23:00:00 GMT

This chapter represents a crucial bridge between academic research and production engineering practice in AI system evaluation. What sets it apart is Chip’s very balanced perspective - neither succumbing to the prevalent hype in the field nor becoming overly academic. Instead, she melds together practical insights with theoretical foundations, creating a useful framework for evaluation that acknowledges both technical and ethical considerations.

Introduction and Context

Key Insight: The author’s approach demonstrates that effective AI system evaluation requires a synthesis of academic rigour and practical engineering concerns, much like how traditional software engineering evolved to balance theoretical computer science with practical development methodologies.

The chapter is structured in three main parts, each building upon the previous to create a complete picture of AI system evaluation:

Evaluation criteria fundamentals
Model selection and benchmark navigation
Practical pipeline implementation

Part 1: Evaluation Criteria - A Deep Dive

The Evolution of Evaluation-Driven Development

The author introduces evaluation-driven development (EDD), a methodological evolution that adapts the principles of test-driven development to the unique challenges of AI systems.

Evaluation-Driven Development: A methodology where AI application development begins with explicit evaluation criteria, similar to how test-driven development starts with test cases. However, EDD encompasses a broader range of metrics and considerations specific to AI systems.

The fundamental principle here is that AI applications require a more nuanced and multifaceted approach to evaluation than traditional software. Where traditional software might have binary pass/fail criteria, AI systems often operate in a spectrum of performance across multiple dimensions.

The Four Pillars of Evaluation

1. Domain-Specific Capability

The author presents domain-specific capability evaluation as the foundational layer of AI system assessment. This approach is particularly innovative in its use of multiple choice evaluation techniques - a method that bridges the gap between human-interpretable results and machine performance metrics.

For example, when evaluating code generation capabilities, presenting a model with multiple implementations where only one is functionally correct serves as both a test and a teaching tool. This mimics how human experts often evaluate junior developers’ understanding of coding patterns and best practices.

2. Generation Capability

The section on generation capability draws parallels with the historical development of Natural Language Generation (NLG) in computational linguistics. This historical context provides valuable insights into how we can approach modern language model evaluation.

The author breaks down factual consistency into two crucial dimensions:

Local Factual Consistency: The internal coherence of generated content and its alignment with the immediate context of the prompt. This is analogous to maintaining logical consistency within a single conversation or document.

Global Factual Consistency: The accuracy of generated content when compared against established knowledge and facts. This represents the model’s ability to maintain truthfulness in a broader context.

The discussion of hallucination detection is particularly noteworthy, presenting three complementary approaches:

Basic Prompting: Direct detection through carefully crafted prompts
Self-Verification: A novel approach using internal consistency checks across multiple generations
Knowledge-Augmented Verification: Advanced techniques like SAFE (Search-Augmented Factuality Evaluator) from Google DeepMind

The knowledge-augmented verification system represents a fascinating approach to fact-checking that mirrors how human experts verify information:

It breaks down complex statements into atomic claims
Each claim is independently verified through search
The results are synthesised into a final accuracy assessment

Seems pricey, though :)

3. Instruction Following Capability

The author makes a crucial observation about the bidirectional nature of instruction following evaluation. Poor performance might indicate either model limitations or instruction ambiguity - a distinction that’s often overlooked in practice.

Instruction-Performance Paradox: The quality of instruction following cannot be evaluated in isolation from the quality of the instructions themselves, creating a circular dependency that must be carefully managed in evaluation design.

The solution proposed is the development of custom benchmarks that specifically target your application’s requirements. This approach ensures that your evaluation criteria align perfectly with your practical needs rather than relying solely on generic benchmarks.

4. Cost and Latency Considerations

The author introduces the concept of Pareto optimization in the context of AI system evaluation, demonstrating how different performance metrics often involve trade-offs that must be carefully balanced.

Pareto Optimization: A multi-objective optimization approach where improvements in one metric cannot be achieved without degrading another, leading to a set of optimal trade-off solutions rather than a single optimal point.
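A small sketch makes the definition concrete. The candidate models and their numbers below are invented purely for illustration. Each has a cost (lower is better) and a quality score (higher is better), and a model is on the frontier if no other model is at least as good on both and strictly better on one.

```python
# name -> (cost per 1k requests, quality score); illustrative numbers only
candidates = {
    "small": (1.0, 0.70),
    "medium": (5.0, 0.80),
    "large": (20.0, 0.90),
    "overpriced": (25.0, 0.78),
}


def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one."""
    cost_a, quality_a = a
    cost_b, quality_b = b
    no_worse = cost_a <= cost_b and quality_a >= quality_b
    strictly_better = cost_a < cost_b or quality_a > quality_b
    return no_worse and strictly_better


def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Names of models that no other model dominates."""
    return [
        name
        for name, point in models.items()
        if not any(dominates(other, point) for other in models.values())
    ]


frontier = pareto_frontier(candidates)  # ["small", "medium", "large"]
```

"overpriced" drops out because "medium" is both cheaper and better. Choosing among the remaining three is a business decision rather than a technical one.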

Part 2: Model Selection - A Strategic Approach

The Four-Step Evaluation Workflow

The author presents a sophisticated workflow that combines both quantitative and qualitative factors in model selection. This approach is particularly valuable because it acknowledges the complexity of real-world deployment while providing a structured path forward.

Initial Filtering: The first step involves filtering based on hard constraints, which might include:
- Deployment requirements (on-premise vs. cloud)
- Security and privacy considerations
- Licensing restrictions
- Resource constraints
Public Information Assessment: This stage involves a systematic review of:
- Benchmark performances across relevant tasks
- Leaderboard rankings with context
- Published latency and cost metrics
The author emphasises the importance of looking beyond raw numbers to understand the context and limitations of public benchmarks.
Experimental Evaluation: This phase involves hands-on testing with your specific use case, considering:
- Custom evaluation metrics
- Integration requirements
- Real-world performance characteristics
Continuous Monitoring: The final step acknowledges that evaluation is an ongoing process, not a one-time event. This involves:
- Regular performance monitoring
- Failure detection and analysis
- Feedback collection and incorporation
- Continuous improvement cycles

The Build vs. Buy Decision Matrix

The author provides an analysis of the build vs. buy decision, going beyond simple cost comparisons to consider factors like:

Total Cost of Ownership (TCO): The complete cost picture including:
- Direct costs (API fees, computing resources)
- Indirect costs (engineering time, maintenance)
- Opportunity costs (time to market, feature development)
- Risk costs (security, reliability, vendor lock-in)

This section particularly shines in its discussion of the often-overlooked aspects of model deployment, such as the hidden costs of maintaining self-hosted models and the true value of vendor-provided updates and improvements.

Part 3: Building Evaluation Pipelines - Practical Implementation

System Component Evaluation

The author advocates for a dual-track evaluation approach:

End-to-end system evaluation
Component-level assessment

This approach allows organisations to:

Identify bottlenecks and failure points
Understand component interactions
Make targeted improvements
Maintain system reliability during updates

Creating Effective Evaluation Guidelines

The author emphasises the importance of creating clear, actionable evaluation guidelines that bridge technical and business metrics. This section introduces the concept of metric alignment - ensuring that technical evaluation metrics directly correspond to business value.

Metric Alignment: The process of mapping technical performance metrics to business outcomes, creating a clear connection between model improvements and business value.

Data Management and Sampling

Chip provides valuable insights into data management for evaluation, including:

Data Slicing: The strategic separation of evaluation data into meaningful subsets to:
- Identify performance variations across different use cases
- Detect potential biases
- Enable targeted improvement efforts
- Avoid Simpson’s paradox in performance analysis
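Simpson’s paradox is easier to believe once you see the arithmetic. The counts below are invented for illustration (they are not from the book): model A is more accurate on both slices, yet model B looks better overall because the two models were evaluated on very different mixes of easy and hard queries.

```python
# model -> slice -> (correct, total); illustrative counts only
results = {
    "model_a": {"short_queries": (81, 87), "long_queries": (192, 263)},
    "model_b": {"short_queries": (234, 270), "long_queries": (55, 80)},
}


def per_slice_accuracy(slices: dict[str, tuple[int, int]]) -> dict[str, float]:
    return {name: correct / total for name, (correct, total) in slices.items()}


def overall_accuracy(slices: dict[str, tuple[int, int]]) -> float:
    correct = sum(c for c, _ in slices.values())
    total = sum(t for _, t in slices.values())
    return correct / total


a_slices = per_slice_accuracy(results["model_a"])  # ~0.93 short, ~0.73 long
b_slices = per_slice_accuracy(results["model_b"])  # ~0.87 short, ~0.69 long
a_overall = overall_accuracy(results["model_a"])   # 273 / 350 = 0.78
b_overall = overall_accuracy(results["model_b"])   # 289 / 350 ~ 0.83
```

Looking only at the aggregate number would lead you to pick the model that is worse on every slice, which is the argument for slicing evaluation data in the first place.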

The discussion of sample size is particularly practical, providing concrete guidelines based on statistical confidence levels and desired detection thresholds. The author cites OpenAI’s research suggesting that sample sizes between 100 and 1,000 are typically sufficient for most evaluation needs, depending on the required confidence level.

Meta-Evaluation: Evaluating Your Evaluation

The chapter concludes with a crucial discussion of meta-evaluation - the process of assessing and improving your evaluation pipeline itself. This includes considerations of:

Signal quality and reliability
Metric correlation and redundancy
Resource utilisation and efficiency
Integration with development workflows

Conclusion

The author concludes by acknowledging the inherent limitations of AI system evaluation: no single metric or method can fully capture the complexity of these systems. However, this acknowledgment leads to a constructive approach: combining multiple evaluation methods, maintaining awareness of their limitations, and continuously iterating based on real-world feedback.

This chapter ultimately provides a solid framework for AI system evaluation that is both theoretically sound and practically applicable. It serves as a valuable resource for organisations working to implement effective evaluation strategies for their AI systems, while maintaining a clear-eyed view of both the possibilities and limitations of current evaluation methods.

Notes on ‘AI Engineering’ (Chip Huyen) chapter 3

Alex Strick van Linschoten — Mon, 20 Jan 2025 23:00:00 GMT

Really enjoyed this chapter. My tidied notes from my readings follow below. 150 pages in and we’re starting to get to the good stuff :)

Overview and Context

This chapter serves as the first of two chapters (Chapters 3 and 4) dealing with evaluation in AI Engineering. While Chapter 4 will delve into evaluation within systems, Chapter 3 addresses the fundamental question of how to evaluate open-ended responses from foundation models and LLMs at a high level. The importance of evaluation cannot be overstated, though the author perhaps takes this somewhat for granted. The chapter provides a comprehensive framework for understanding various evaluation methodologies and their applications.

Challenges in Evaluating Foundation Models

The evaluation of foundation models presents several unique and complex challenges that make systematic assessment difficult:

Existing benchmarks become increasingly inadequate as models improve in their capabilities
As models become better at writing and mimicking human-like responses, evaluation becomes more complex and nuanced
Many foundation models are API-driven black boxes, limiting access to internal workings
Models continuously develop new capabilities, requiring constant adaptation of evaluation methods
There has been notably limited investment in evaluation studies and technologies compared to the extensive resources devoted to enhancing model capabilities
The improvement in model performance necessitates the continuous development of new benchmarks
Without a systematic approach to evaluation, progress can be hindered by various headwinds

Language Model Metrics

The chapter includes a technically detailed section on language model metrics which, while math-heavy, provides fundamental insights into model capabilities:

Entropy
Cross-entropy
Perplexity

These metrics serve as underlying measures to understand what’s happening within the models and assess their power and conversational abilities. While this section spans 4-5 pages of technical content, it provides some useful foundational understanding of how we can measure a language model’s intrinsic capabilities.
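For reference, the three metrics fit in a few lines of Python. This sketch uses base-2 logarithms (so the unit is bits) and toy next-token distributions. `p` stands for the true distribution and `q` for the model's prediction.

```python
import math


def entropy(p: list[float]) -> float:
    """H(p) = -sum p_i * log2(p_i): the irreducible uncertainty in the data."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p: list[float], q: list[float]) -> float:
    """H(p, q) = -sum p_i * log2(q_i): cost of predicting p using q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(p: list[float], q: list[float]) -> float:
    """2 ** H(p, q), since the cross-entropy is measured in bits."""
    return 2 ** cross_entropy(p, q)


true_dist = [0.5, 0.5]
good_model = [0.5, 0.5]
worse_model = [0.25, 0.75]

h = entropy(true_dist)                          # 1 bit
ce_good = cross_entropy(true_dist, good_model)  # equals the entropy
ce_worse = cross_entropy(true_dist, worse_model)  # strictly larger
ppl_good = perplexity(true_dist, good_model)    # 2: a coin flip between two tokens
```

Cross-entropy can never drop below the entropy of the data, and perplexity reads as "how many equally likely options the model is effectively choosing between".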

Downstream Task Performance Measurement

The chapter transitions from intrinsic metrics to evaluating actual capabilities, dividing evaluation into exact and subjective approaches.

Exact Evaluation

There are two principal approaches to exact evaluation:

Functional Correctness Assessment
- Evaluates whether the LLM can successfully complete its assigned tasks
- Focuses on practical capability rather than theoretical metrics
- Example: In coding tasks, checking if generated code passes all unit tests
- Provides clear, objective measures of success
Similarity Measurements Against Reference Data
Four distinct methods are identified:
1. Human Evaluator Judgment
  - Requires manual comparison of texts by human evaluators
  - Highly accurate but time and resource-intensive
  - Limited scalability due to human involvement
  - Often considered the gold standard despite limitations
2. Exact Match Checking
  - Compares generated response against reference responses for exact matches
  - Most effective with shorter, specific outputs
  - Less useful for verbose or creative outputs
  - Provides binary yes/no results
3. Lexical Similarity
  - Employs established metrics like BLEU, ROUGE, and METEOR
  - Focuses on word overlap and structural similarities
  - Known to be somewhat crude in their assessment
  - Widely used despite limitations due to ease of implementation
4. Semantic Similarity
  - Utilizes embeddings for comparing textual meaning
  - Less sensitive to specific word choices than lexical approaches
  - Quality depends entirely on the underlying embeddings algorithm
  - May require significant computational resources
  - Generally provides more nuanced comparison than lexical methods
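
The three automated methods above can be sketched in a few lines of plain Python. This is only an illustration under some loud assumptions: the token-overlap F1 below is a deliberately crude stand-in for lexical metrics like BLEU or ROUGE (it is not either of them), and the "embeddings" are hand-written toy vectors rather than the output of a real embedding model.

```python
import math
from collections import Counter


def exact_match(generated, reference):
    """Binary check after trimming whitespace and lowercasing."""
    return generated.strip().lower() == reference.strip().lower()


def token_f1(generated, reference):
    """Crude lexical similarity: F1 over overlapping whitespace tokens."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a, b):
    """Semantic similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(exact_match("Beijing", " beijing "))            # True
print(token_f1("the cat sat", "the cat slept"))       # 2 of 3 tokens overlap
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))      # 0.0 (toy vectors)
```

In practice the vectors passed to `cosine_similarity` would come from whichever embedding model you trust, which is exactly where the dependence on embedding quality comes in.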

The chapter includes a brief but relevant sidebar on embeddings and their significance in evaluation, though this digression seemed a bit out of place in the overall flow.

AI as Judge

This section explores the increasingly popular approach of using AI systems to evaluate other AI systems.

Benefits

Significantly faster than human evaluation processes
Generally more cost-effective than human evaluation at scale
Studies have shown strong correlation with human evaluations in many cases
AI judges can provide detailed explanations for their decisions
Offers greater flexibility in evaluation approaches
Enables systematic and consistent evaluation at scale

Three Main Approaches

Individual Response Evaluation
- Assesses response quality based solely on the original question
- Often implements numerical scoring systems (e.g., 1-5 scale)
- Evaluates responses in isolation without comparison
Reference Response Comparison
- Evaluates generated response against established reference responses
- Usually produces binary (true/false) outcomes
- Helps ensure responses meet specific criteria
Generated Response Comparison
- Compares two generated responses to determine relative quality
- Predicts likely user preferences between options
- Particularly useful for:
  - Post-training alignment
  - Test-time compute optimization
  - Model ranking through comparative evaluation
  - Generating preference data
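
As a rough sketch, the three approaches differ mostly in what goes into the judge's prompt. The wording of these prompts and all the function names are my own assumptions, not templates from the book, and the actual call to a judge model is left out.

```python
def individual_prompt(question, response):
    """Score a single response in isolation on a 1-5 scale."""
    return (
        "Rate the response to the question on a 1-5 scale. "
        "Reply with the number only.\n\n"
        f"Question: {question}\nResponse: {response}\nScore:"
    )


def reference_prompt(question, response, reference):
    """Binary check of a response against a reference answer."""
    return (
        "Does the response convey the same answer as the reference? "
        "Reply with true or false only.\n\n"
        f"Question: {question}\nReference: {reference}\n"
        f"Response: {response}\nVerdict:"
    )


def pairwise_prompt(question, response_a, response_b):
    """Ask which of two generated responses a user would prefer."""
    return (
        "Which response would a user prefer? Reply with A or B only.\n\n"
        f"Question: {question}\nResponse A: {response_a}\n"
        f"Response B: {response_b}\nPreferred:"
    )


def parse_score(judge_output):
    """Return the first digit 1-5 found in the judge's reply, or None."""
    for ch in judge_output:
        if ch in "12345":
            return int(ch)
    return None


print(individual_prompt("What's the capital of China?", "Beijing"))
print(parse_score("Score: 4"))  # 4
```

Each prompt string would then be sent to whichever judge model you're using, and the reply parsed into a score, a boolean or a preference.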

Implementation Considerations

Table 3-3 (page 139) provides an overview of different AI judge criteria used by various AI tools
Notable lack of standardization across different platforms and approaches (see above)
Various scoring systems available, each with their own trade-offs
Adding examples to prompts can improve accuracy but increases token count and costs
Careful balance needed between evaluation quality and resource consumption

Limitations and Challenges

AI judges can show inconsistency in their judgments
Costs can escalate quickly, especially when using stronger models or including more context
Evaluation criteria often remain ambiguous and difficult to standardize
Several inherent biases identified:
- Self-bias: Models tend to favor responses generated by themselves
- Verbosity bias: Tendency to favor longer, more detailed answers
- Other biases common to AI applications in general

Specialized Judges

This section presents an innovative challenge to the conventional wisdom of using the strongest available model as a judge. The author introduces a compelling alternative approach:

Small, specialized judges can be as effective as larger models for specific evaluation tasks
More cost-effective and efficient than using large language models
Can be trained for highly specific evaluation criteria
Demonstrates comparable performance to larger models like GPT-4 in specific domains

Three types of specialized judges are identified:
1. Reward models (evaluating prompt-response pairs)
2. Reference-based judges
3. Preference models

This represents a novel approach that could significantly impact evaluation methodology in the field.

Comparative Evaluation for Model Ranking

Methodology

Focuses on binary choices between two samples
Simpler for both humans and AI to make comparative judgments
Used in major leaderboards like LMSYS Chatbot Arena
Requires evaluation of multiple combinations to establish rankings
Various algorithms available for efficient comparison

Advantages

More intuitive evaluation process
Often more reliable than absolute scoring
Reduces cognitive load on evaluators
Provides clear preference data

Challenges

Highly data-intensive nature affects scalability
Lacks standardization across implementations
Difficulty in converting comparative measures to absolute metrics
Quality control remains a significant concern
Number of required comparisons can grow rapidly with model count
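
To give a feel for how pairwise judgments turn into a ranking, here is a minimal sketch using Elo ratings. Elo is just one well-known rating algorithm that I've picked for illustration; the model names, match results and the `k=32` step size are all hypothetical.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Shift both ratings in place after one pairwise comparison."""
    surprise = 1 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


models = ["model_a", "model_b", "model_c"]
ratings = {name: 1000.0 for name in models}

# Hypothetical outcomes as (winner, loser) pairs from human or AI judges.
matches = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
for winner, loser in matches:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['model_a', 'model_b', 'model_c']

# Every model against every other model is n * (n - 1) / 2 pairs,
# which is why the number of comparisons grows quickly with model count.
pairs = list(combinations(models, 2))
print(len(pairs))  # 3
```

Note that the ratings are only meaningful relative to each other, which is one way of seeing why converting comparative measures into absolute metrics is hard.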

Key Takeaways and Future Implications

The emergence of smaller, specialized judge models represents a significant shift from the traditional approach of using the largest available models
Comparative evaluation offers promising approaches but requires careful consideration of scalability and implementation
The field continues to evolve rapidly, requiring flexible and adaptable evaluation strategies
Sets up crucial discussion for system-level evaluation in Chapter 4
Highlights the ongoing tension between evaluation quality and resource efficiency

The chapter effectively establishes the foundational understanding necessary for the more practical, system-focused evaluation discussions to follow in Chapter 4.

Notes on ‘AI Engineering’ (Chip Huyen) chapter 1

Alex Strick van Linschoten — Sat, 18 Jan 2025 23:00:00 GMT

Had the first of a series of meet-ups I’m organising in which we discuss Chip Huyen’s new book. My notes from reading the chapter follow this, and then I’ll try to summarise what we discussed in the group.

At a high level, I really enjoyed the final part of the chapter where she got into how she was thinking about the practice of ‘AI Engineering’ and how it differs from ML engineering. The term ‘model adaptation’ was also an interesting way of encompassing all the different things that engineers are doing to get the LLM to better follow their instructions.

Chapter 1 Notes

The chapter begins by establishing AI Engineering as the preferred term over alternatives like GenAI Ops or LLM Ops. This preference stems from a fundamental shift in the field, where application development has become increasingly central to working with AI models. The “ops” suffix inadequately captures the breadth and nature of work involved in modern AI applications.

Foundation Models and Language Models

The text provides important technical context about different types of language models. A notable comparison shows that while Mistral 7B has a vocabulary of 32,000 tokens, GPT-4 possesses a much larger vocabulary of 100,256 tokens, highlighting the significant variation in model capabilities and design choices.

Two primary categories of language models are discussed:

Masked Language Models (like BERT and modern BERT variants)
Autoregressive Language Models (like those used in ChatGPT)

The term “foundation model” carries dual significance, referring both to these models’ fundamental importance and their adaptability for various applications. This terminology also marks an important transition from task-specific models to general-purpose ones, especially relevant in the era of multimodal capabilities.

AI Engineering vs Traditional Approaches

AI Engineering differs substantially from ML Engineering, warranting its distinct terminology. The key distinction lies in its focus on adapting and evaluating models rather than building them from scratch. Model adaptation techniques fall into two main categories:

Prompt-based techniques (prompt engineering) - These methods adapt models without updating weights
Fine-tuning techniques - These approaches require weight updates

The shift from ML Engineering to AI Engineering brings new challenges, particularly in handling open-ended outputs. While this flexibility enables a broader range of applications, it also introduces significant complexity in evaluation and implementation of guardrails.

The AI Engineering Stack

The framework consists of three distinct layers:

1. Application Development Layer

Focuses on prompt crafting and context provision
Requires rigorous evaluation methods
Emphasizes interface design and user experience
Primary responsibilities include evaluation, prompt engineering, and AI interface development

2. Model Development Layer

Provides tooling for model development
Includes frameworks for training, finetuning, and inference optimisation
Requires systematic evaluation approaches

3. Infrastructure Layer

Handles model serving
Manages underlying technical requirements

Planning AI Applications

The chapter outlines a modern approach to AI application development that differs significantly from traditional ML projects. Rather than beginning with data collection and model training, AI engineering often starts with product development, leveraging existing models. This approach allows teams to validate product concepts before investing heavily in data and model development.

Key planning considerations include:

Setting appropriate expectations
Determining user exposure levels
Deciding between internal and external deployment
Understanding maintenance requirements

A notable insight is the “80/20” development pattern: while reaching 80% functionality can be relatively quick, achieving the final 20% often requires equal or greater effort than the initial development phase.

Evaluation and Implementation Challenges

The chapter emphasises that working with AI models presents unique evaluation challenges compared to traditional ML systems. This complexity stems from:

The open-ended nature of outputs
Difficulty in implementing strict guardrails
Challenges in type enforcement
The need for comprehensive evaluation strategies

Data and Model Adaptation

The text discusses how dataset engineering and inference optimisation, while still relevant, take on different forms in AI engineering compared to traditional ML engineering. The focus shifts from raw data collection and processing to effective model adaptation and deployment strategies.

Modern Development Paradigm

A significant paradigm shift is highlighted in the development approach: unlike traditional ML engineering, which typically begins with data collection and model training, AI engineering enables a product-first approach. This allows teams to validate concepts using existing models before committing to extensive data collection or model development efforts.

Discussion summary

The conversation started with a bit on how AI Engineering represents an interesting shift in the software engineering landscape, potentially opening new career paths for traditional software engineers. While developers may not need deep mathematical knowledge of derivatives and linear algebra upfront, there’s a growing recognition that understanding how AI systems behave - their constraints and opportunities - is becoming increasingly valuable.

A key tension emerged in the discussion around enterprise adoption. While there’s significant enthusiasm around AI applications, particularly on social media where developers showcase apps with substantial user bases, enterprise companies often maintain their traditional team structures. This creates an interesting dynamic where companies might maintain their existing ML engineering teams while simultaneously forming new “tiger teams” focused on generative AI initiatives, leading to organisational friction.

The group discussed how while it’s now possible for software engineers to quickly build AI applications by calling APIs, they often hit limitations that require deeper understanding. This raises questions about whether the “shallow” approach of purely application-level development is sustainable, or whether engineers will inevitably need to develop deeper technical knowledge around model behaviour, evaluation, and fine-tuning.

A particularly notable challenge discussed was handling the non-deterministic nature of AI systems. Traditional software engineering practices, like unit testing, don’t translate cleanly to systems where outputs can vary even with temperature set to zero. This highlights how AI Engineering requires new patterns and practices beyond traditional software engineering approaches.

The discussion also touched on evaluation techniques, including the use of log probabilities to understand model confidence and improve prompts. This represents an emerging area where traditional ML evaluation meets new challenges in assessing large language model outputs.
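
As a small illustration of the log-probability idea, here is a sketch that turns the logprobs of a few candidate answer tokens into a normalised confidence. The input dictionary is hypothetical: in practice the values would come from whatever logprobs your model provider returns for the first answer token.

```python
import math


def label_confidence(logprobs):
    """Convert per-label log probabilities into probabilities that sum to 1."""
    probs = {label: math.exp(lp) for label, lp in logprobs.items()}
    total = sum(probs.values())
    return {label: p / total for label, p in probs.items()}


# Hypothetical logprobs for the first token of a yes/no classification.
first_token_logprobs = {"yes": math.log(0.6), "no": math.log(0.2)}

confidence = label_confidence(first_token_logprobs)
print(confidence)  # 'yes' is roughly 0.75, 'no' roughly 0.25
```

A prompt tweak that moves the confidence on known-correct examples from 0.55 to 0.9 is a useful signal even when the final answer doesn't change.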

Final notes on ‘Prompt Engineering for LLMs’

Alex Strick van Linschoten — Thu, 16 Jan 2025 23:00:00 GMT

Here are the final notes from ‘Prompt Engineering for LLMs’, a book I’ve been reading over the past few days (and enjoying!).

Chapter 10: Evaluating LLM Applications

The chapter begins with an interesting anecdote about GitHub Copilot - the first code written in their repository was the evaluation harness, highlighting the importance of testing in LLM applications. The authors, who worked on the project from its inception, emphasise this as a best practice.

Evaluation Framework

When evaluating LLM applications, three main aspects can be assessed:

The model itself - its capabilities and limitations
Individual interactions with the model (prompts and responses)
The integration of multiple interactions within the broader application

As a general rule of thumb, you should always track and record:

Latency
Token consumption statistics
Overall system approach metrics

Offline Evaluation

Example Suites

The foundation of offline evaluation is creating example suites - collections of 10-20 (minimum) input-output pairs that serve as test cases. These should be accompanied by scripts that apply your application’s logic to each example and compare the results.
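
A minimal sketch of what such a script might look like is below. The `toy_app` function stands in for your application's real logic (which would normally call an LLM), and the examples and exact-match comparison are made up for illustration.

```python
def run_suite(app, examples, compare):
    """Apply `app` to every example input and compare against the expected output."""
    results = []
    for example in examples:
        output = app(example["input"])
        results.append(
            {
                "input": example["input"],
                "expected": example["expected"],
                "output": output,
                "passed": compare(output, example["expected"]),
            }
        )
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


def toy_app(text):
    """Hypothetical stand-in for the application logic under test."""
    return text.strip().upper()


examples = [
    {"input": " beijing ", "expected": "BEIJING"},
    {"input": "paris", "expected": "PARIS"},
    {"input": "rome", "expected": "Roma"},  # deliberately failing case
]

pass_rate, results = run_suite(toy_app, examples, lambda out, exp: out == exp)
print(f"pass rate: {pass_rate:.2f}")
for r in results:
    if not r["passed"]:
        print("FAILED:", r)
```

Keeping the comparison function as a parameter means the same harness can be reused with partial matching or an LLM judge later on.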

Example sources come from three main areas:

Existing examples from your project
Real-time user data collection
Synthetic creation

When using synthetic data, it’s crucial to use different LLMs for creation versus application/judging to avoid potential biases.

Evaluation Approaches

Gold Standard Matching

Can be exact or partial matching
Particularly effective for binary decisions or multi-label classification
Can leverage the logprobs tricks from Chapter 7 to assess model confidence
Free-form text requires more creative evaluation approaches
Tool-use scenarios may be easier to evaluate, especially in agent-driven applications

Functional Testing

A step up from unit tests but not full end-to-end testing
Focuses on testing specific system components

LLM as Judge

Currently trendy but requires careful implementation
Should include human verification loop, preferably multiple humans
Key insight: Always frame the evaluation as if the LLM is grading someone else’s work, never its own
Recommendations for quantitative measures:
- Use gradient and multi-aspect coverage (MA)
- Implement 1-5 scales with specific criteria
- Place all instructions and criteria before the content to be evaluated
- Break down “Goldilocks” questions (was it just right?) into separate questions about whether it was enough and whether it was too much

Online Evaluation

The chapter transitions to discussing why we need online testing despite having offline evaluation capabilities. While offline testing is safer and more scalable, real human interactions are unpredictable and require live testing.

Key points about online evaluation:

A/B testing is the standard approach
Existing solutions include Optimizely, VWO Consulting, and AB Tasty
Applications need to support running in two modes (A and B)
Consider rollout timing and users on older versions
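
If you're not using one of those tools, the "two modes" requirement can be sketched with a deterministic hash of the user ID, so that a given user always lands in the same variant. This is my own minimal illustration rather than anything from the book; the experiment name and the 50/50 split are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to variant 'A' or 'B' for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash onto the interval [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


def answer(user_id, question):
    """Hypothetical application entry point that supports both modes."""
    variant = assign_variant(user_id, "new-prompt-v2")
    prompt_style = "new prompt" if variant == "B" else "old prompt"
    return f"[{variant}] would answer {question!r} using the {prompt_style}"


print(answer("user-123", "What's the capital of China?"))
```

Including the experiment name in the hash means the same user can fall into different buckets for different experiments.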

Five main metrics for online evaluation (from most to least straightforward):

Direct feedback (user responses to suggestions)
Functional correctness
User acceptance (following suggestions)
Achieved impact (user benefit)
Incidental metrics (surrounding measurements)

Direct feedback data is particularly valuable as it can later be used for model fine-tuning. It’s recommended to track more incidental metrics rather than fewer, both for quality indicators and investigating unexpected changes.

Chapter 11: Looking Ahead

The final chapter covers several forward-looking topics:

Multimodality in LLMs
User experience and interface considerations
Published artifacts from Anthropic
Risks and rewards of custom interfaces
Trends in model intelligence, cost, and speed

Book-Level Conclusions

Two main lessons emerge from the book:

LLMs as Text Completion Engines
- They fundamentally mimic training data
- Success comes from aligning prompts with training data patterns
- Particularly relevant for completion models
Empathy with LLMs
- Think of them as mechanical friends with internet knowledge
- Five key insights:
  - LLMs are easily distracted; keep prompts focused
  - If humans can’t understand the prompt, LLMs will struggle
  - Provide clear instructions and examples
  - Include all necessary information (LLMs aren’t psychic)
  - Give space for “thinking out loud” (chain of thought)

Personal Reflections

The book, while not revolutionary, provides valuable insights and is a recommended read at 250 pages. It can be completed in about 10-11 days. While some points were novel, none were completely mind-blowing. The heavy focus on completion models versus chat models is both intriguing and occasionally confusing, though this perspective is understandable given the authors’ background with GitHub Copilot.

Assembling the Prompt: Notes on ‘Prompt Engineering for LLMs’ ch 6

Alex Strick van Linschoten — Sun, 12 Jan 2025 23:00:00 GMT

Chapter 6 of “Prompt Engineering for LLMs” is devoted to how to structure the prompt and compose its various elements. We first learn about the different kinds of ‘documents’ that we can mimic with our prompts, then think about how to pick which pieces of context to include, and then think through how we might compose all of this together.

There’s a great figure to give you an idea of ‘the anatomy of a well-constructed prompt’ early on. The introduction is where you introduce the task, then you have the ‘valley of meh’ (which the LLM can struggle to recall or obey) and finally you have the refocusing and restatement of the task.

There are two key tips at this point:

the closer a piece of information is to the end of the prompt, the more impact it has on the model
the model often struggles with the information stuffed in the middle of the prompt

So craft your prompts accordingly!

A prompt plus the resulting completion is defined as a ‘document’ in this book, and there are various templates that you can follow: an ‘advice conversation’, an ‘analytic report’ (often formatted with Markdown headers), and a ‘structured document’.

We learn that analytic report-type documents seem to offer a lighter ‘cognitive load’ for an LLM since it doesn’t have to handle the intricacies of social interaction that it would in the case of an advice conversation. 🤔

Two other tips or possible things to include in the analytic report-style document:

a table of contents at the beginning to set the scene
a scratchpad or notebook section for the model to ‘think’ in

I haven’t had much use of either of these myself but I can see why they’d be powerful.

Structured documents can be really powerful, especially when the model has been trained to expect certain kinds of structure (be it JSON or XML or YAML etc). Also TIL that apparently OpenAI’s models are very strong when dealing with JSON as inputs.

The context to be inserted into the prompt (usually dynamically depending on use case or needs) can be large or small depending on what is available in terms of context window or latency requirements. There are different strategies to how to select what goes in.

I was curious about the idea of what they call ‘elastic snippets’, i.e. dynamic decisions that get taken as to what makes its way into the prompt depending on how much space is available etc.

And even then you have to decide about the:

position (in which order do all the elements appear in the prompt)
importance (how much will dropping this element from the prompt affect the response)
dependency (if you include one element, can you drop another and vice versa…)

In the end, you have a kind of optimisation problem: given a theoretically unlimited potential prompt length, how do you combine all the elements together to get the most value given the space limitations that the LLM dictates?

And then what strategy do you use to get rid of elements that your prompt budget cannot afford; we learn about the ‘additive greedy approach’ and the ‘subtractive greedy approach’, all the while bearing in mind that these are all just basic prototypes to play around with.
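
Here's my rough reading of the additive greedy approach as a sketch: add elements in order of importance until the budget runs out, then put whatever made the cut back into its intended position. The element fields, the example texts and the whitespace-based token count are all my own assumptions, and dependencies between elements are ignored entirely.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def assemble_prompt(elements, budget):
    """Additive greedy assembly: most important first, then restore position order."""
    chosen = []
    used = 0
    for element in sorted(elements, key=lambda e: e["importance"], reverse=True):
        cost = count_tokens(element["text"])
        if used + cost <= budget:
            chosen.append(element)
            used += cost
    chosen.sort(key=lambda e: e["position"])
    return "\n".join(e["text"] for e in chosen)


# Hypothetical prompt elements with a position and an importance score.
elements = [
    {"position": 0, "importance": 10, "text": "Summarise the document below."},
    {"position": 1, "importance": 5, "text": "Document: the quick brown fox jumps."},
    {
        "position": 2,
        "importance": 1,
        "text": "Background: some loosely related notes that may not fit.",
    },
    {"position": 3, "importance": 9, "text": "Now write the summary."},
]

print(assemble_prompt(elements, budget=15))
# The low-importance background element is dropped; the rest keep their order.
```

The subtractive variant would go the other way: start with everything included and drop the least important elements until the prompt fits.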

The next chapter is all about the completion and how to make sure we receive meaningful and accurate responses from our LLM!

