<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Alex Strick van Linschoten</title>
<link>https://alexstrick.com/technical.html</link>
<atom:link href="https://alexstrick.com/technical.xml" rel="self" type="application/rss+xml"/>
<description>Personal and technical writings from Alex Strick van Linschoten</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Tue, 03 Jun 2025 22:00:00 GMT</lastBuildDate>
<item>
  <title>Trying to instrument an agentic app with Arize Phoenix and litellm</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm.html</link>
  <description><![CDATA[ 




<p>It’s important to instrument your AI applications! I hope this can more or less be taken as given just as you’d expect a non-AI-infused app to capture logs. When you’re evaluating your LLM-powered system, you need to have capture the inputs and outputs both at an end-to-end level in terms of the way the user experiences things as well as with more fine-grained granularity for all the internal workings.</p>
<p>My goal with this blog is to first demonstrate how Phoenix and <code>litellm</code> can work together, and then to make sure that we are able to group all spans together under a single trace.</p>
<p>I’ll write the blog as I work so at this point I’m not sure exactly how this will turn out.</p>
<section id="basic-logging-with-litellm-phoenix" class="level2">
<h2 class="anchored" data-anchor-id="basic-logging-with-litellm-phoenix">Basic logging with litellm + phoenix</h2>
<p>As a reminder, here’s how we make an LLM call with <code>litellm</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb1-2"></span>
<span id="cb1-3">completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb1-4">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb1-5">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb1-6">        {</span>
<span id="cb1-7">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb1-8">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb1-9">        }</span>
<span id="cb1-10">    ],</span>
<span id="cb1-11">)</span>
<span id="cb1-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content)</span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># prints 'Beijing'</span></span></code></pre></div></div>
<p>The Phoenix docs explain how to set up basic logging for litellm:</p>
<ul>
<li>install the following pip packages:
<ul>
<li><code>arize-phoenix-otel</code></li>
<li><code>openinference-instrumentation-litellm</code></li>
<li>(<code>litellm</code>, obviously)</li>
</ul></li>
<li>set up the necessary environment variables with API key etc to ensure that traces get sent to the right account and endpoint</li>
</ul>
<p>Let’s assume we’re using the hosted Phoenix Cloud version for now. Then we can rerun our example, with some slight tweaks:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># configure the Phoenix tracer</span></span>
<span id="cb2-6">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb2-7">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb2-8">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb2-9">)</span>
<span id="cb2-10"></span>
<span id="cb2-11">completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb2-12">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb2-13">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb2-14">        {</span>
<span id="cb2-15">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb2-16">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb2-17">        }</span>
<span id="cb2-18">    ],</span>
<span id="cb2-19">)</span>
<span id="cb2-20"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content)</span></code></pre></div></div>
<p>So we first register the Phoenix tracer, specify the project (already set up in Phoenix Cloud) and then run our litellm <code>completion</code> as previously. In the terminal we see he following logs:</p>
<pre class="shell"><code>🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: hinbox
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  ⚠️ WARNING: It is strongly advised to use a BatchSpanProcessor in production environments.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

Beijing</code></pre>
<p>So immediately there are a lot of things to consider. It seems that we’ll want to use the <code>BatchSpanProcessor</code> that it suggests, and also it seems like I might not want to set this as the global tracing provider, too.</p>
<p>In Phoenix Cloud, I see this:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/first-phoenix.png" class="img-fluid figure-img"></p>
<figcaption>Basic Phoenix tracing interface</figcaption>
</figure>
</div>
<p>As you can see, we’ve captured the input and output messages for the completion, it’s tracked the latency of the call (1.16s, which seems pretty slow actually!). There is also some sort of an annotation interface though I’ll explore that down the line maybe. I immediately notice that I’m missing things like the system attributes for where the call was made, also metadata like the temperature and other settings. I’d also like to see things like token counts (which you <em>can</em> get in Phoenix but they’re sort of buried) as well as the estimated cost of the call(s) and so on. We can see about adding some of that down the line.</p>
</section>
<section id="batchspanprocessor-for-production-usage" class="level2">
<h2 class="anchored" data-anchor-id="batchspanprocessor-for-production-usage"><code>BatchSpanProcessor</code> for production usage</h2>
<p>Let’s next move on to adding <code>BatchSpanProcessor</code> as the message suggested, which is as simple as adding <code>batch=True</code> to the tracer provider registration code. What this does is make sure that spans are processed in batches before they’re exported to Arize. This takes away some of the network costs that you incur when sending the spans one by one. I’ve also made sure to turn off the registration of this tracing provider as the global one:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"></span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># configure the Phoenix tracer</span></span>
<span id="cb4-6">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb4-7">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb4-8">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb4-9">    set_global_tracer_provider<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb4-10">    batch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-11">)</span>
<span id="cb4-12"></span>
<span id="cb4-13">completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb4-14">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb4-15">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb4-16">        {</span>
<span id="cb4-17">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb4-18">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb4-19">        }</span>
<span id="cb4-20">    ],</span>
<span id="cb4-21">)</span>
<span id="cb4-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content)</span></code></pre></div></div>
<p>And I get this in the terminal:</p>
<pre class="shell"><code>🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: hinbox
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.

Beijing</code></pre>
<p>It’s actually somehow a bit annoying to still see a message about the fact that I’m using a default <code>SpanProcessor</code>. It’s unclear to me why I need to care that this is a default one. The message is taking up real estate in the logs and it seems important (otherwise why would they have included it?) but it’s also unclear to me what the alternative is and why I’d want to overwrite the default. I think for now I’ll leave it.</p>
</section>
<section id="using-the-litellm-callbacks-as-an-alternative" class="level2">
<h2 class="anchored" data-anchor-id="using-the-litellm-callbacks-as-an-alternative">Using the litellm callbacks as an alternative</h2>
<p>If we stray away from the official supported way to handle tracing with Phoenix, there’s also <a href="https://docs.litellm.ai/docs/observability/phoenix_integration">the community-supported in-built litellm option</a>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb6-2"></span>
<span id="cb6-3">litellm.callbacks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arize_phoenix"</span>]</span>
<span id="cb6-4"></span>
<span id="cb6-5">completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb6-6">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb6-7">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb6-8">        {</span>
<span id="cb6-9">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb6-10">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb6-11">        }</span>
<span id="cb6-12">    ],</span>
<span id="cb6-13">    metadata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PROJECT_NAME"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>},</span>
<span id="cb6-14">)</span>
<span id="cb6-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content)</span></code></pre></div></div>
<p>This achieves a similar result, though I was unable to get the trace to land anywhere other than the <code>default</code> project. <a href="https://arize.com/docs/phoenix/sdk-api-reference/python-pacakges/arize-phoenix-otel">Arize’s docs</a> mention a <code>PHOENIX_PROJECT_NAME</code> environment variable but it seems this isn’t respected or used by the <code>litellm</code> implementation. Indeed when I look at <a href="https://github.com/BerriAI/litellm/blob/main/litellm/integrations/arize/arize_phoenix.py">the implementation</a>, I don’t see this being used anywhere, so it seems that the community-driven implementation isn’t really the way forward.</p>
<p>I just wanted to mention it, however, since some of the ‘callback’ integrations for tracing in litellm are really nicely implemented (like the one for Langfuse, e.g.) so I wanted to try it out at least.</p>
</section>
<section id="one-trace-multiple-spans" class="level2">
<h2 class="anchored" data-anchor-id="one-trace-multiple-spans">One trace, multiple spans</h2>
<p>For anything beyond a simple LLM call, which means most real-world LLM applications, we’ll want to be capturing multiple spans as part of a single trace.</p>
<section id="llm-tracing-tools-naming-conventions-june-2025" class="level3">
<h3 class="anchored" data-anchor-id="llm-tracing-tools-naming-conventions-june-2025">LLM Tracing Tools’ Naming Conventions (June 2025)</h3>
<p>Side-note: I dug into how some of the major LLM tracing providers name their primitives. I was reassured that we seem to have coalesced around ‘trace -&gt; span’ and that the OpenTelemetry way seems to have been adopted by most.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/tracing_nomenclature.png" class="img-fluid figure-img"></p>
<figcaption>Tracing nomenclature (June 2025)</figcaption>
</figure>
</div>
</section>
<section id="grouping-spans-under-a-single-trace" class="level3">
<h3 class="anchored" data-anchor-id="grouping-spans-under-a-single-trace">Grouping spans under a single trace</h3>
<p>I updated the code such that we now have a function that makes two separate LLM calls. I’d want them to both be registered as spans under the same trace:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb7-3"></span>
<span id="cb7-4">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb7-5">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb7-6">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb7-7">    set_global_tracer_provider<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb7-8">    batch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb7-9">)</span>
<span id="cb7-10"></span>
<span id="cb7-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb7-12">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb7-13">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb7-14">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb7-15">            {</span>
<span id="cb7-16">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: prompt,</span>
<span id="cb7-17">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb7-18">            }</span>
<span id="cb7-19">        ],</span>
<span id="cb7-20">    )</span>
<span id="cb7-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb7-22"></span>
<span id="cb7-23"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb7-24">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb7-25">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>)</span>
<span id="cb7-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2)</span>
<span id="cb7-27"></span>
<span id="cb7-28"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb7-29">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>But these just get registered as two separate traces/calls. The key bit of the documentation is <a href="https://arize.com/docs/phoenix/tracing/how-to-tracing/setup-tracing/instrument-python">the ‘Using Phoenix Decorator’ section</a>, it seems. If I add a decorator on top of my function and get the specific tracer, it seems I <em>am</em> able to start to group things together:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb8-3"></span>
<span id="cb8-4">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb8-5">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb8-6">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb8-7">    set_global_tracer_provider<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb8-8">    batch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb8-9">)</span>
<span id="cb8-10">tracer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tracer_provider.get_tracer(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb8-11"></span>
<span id="cb8-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb8-13">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb8-14">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb8-15">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb8-16">            {</span>
<span id="cb8-17">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: prompt,</span>
<span id="cb8-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb8-19">            }</span>
<span id="cb8-20">        ],</span>
<span id="cb8-21">    )</span>
<span id="cb8-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb8-23"></span>
<span id="cb8-24"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.llm</span></span>
<span id="cb8-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb8-26">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb8-27">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>)</span>
<span id="cb8-28">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2)</span>
<span id="cb8-29"></span>
<span id="cb8-30"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb8-31">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>This works and I see this in the Phoenix Cloud dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/grouped_traces.png" class="img-fluid figure-img"></p>
<figcaption>Grouped traces under a single span</figcaption>
</figure>
</div>
<p>See how it’s taken the function name as the name of the span. And it’s grouped those two LLM calls that happen within the function as we wanted. We can also update the decorator to denote different kinds of spans that we want to capture:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/open_inference_span_kinds.png" class="img-fluid figure-img"></p>
<figcaption>The kinds of spans you can choose from</figcaption>
</figure>
</div>
<p>I’m immediately a bit confused by the interface again, because when you click on the ‘Traces’ tab in Phoenix Cloud you actually still just see ‘spans’:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/traces_spans_phoenix.png" class="img-fluid figure-img"></p>
<figcaption>Spans in the Traces tab</figcaption>
</figure>
</div>
<p>In the documentation it isn’t clear to me how to create a trace that includes an <code>llm</code> span and an <code>embedding</code> span, for example. What’s even more frustrating is that the <code>tracer</code> decorator object doesn’t implement <em>all</em> the span types, just <code>agent</code>, <code>chain</code> and <code>llm</code> it seems. I tried something <a href="https://gist.github.com/strickvl/3e682c28278eeb850e9bf195a2b2cb44">like this</a> but it just ended up producing 3 separate traces in Phoenix Cloud.</p>
<p>I looked at <a href="https://arize.com/docs/phoenix/tracing/how-to-tracing/setup-tracing/custom-spans">the documentation for using base OTEL</a> instead of the Phoenix decorators, but there was also nothing in there on how to denote the trace instead of just the span.</p>
<p>I was wondering if <a href="https://arize.com/docs/phoenix/tracing/how-to-tracing/setup-tracing/setup-sessions">their ‘Sessions’ primitive</a> was the way forward here, but they’re pretty clear in stating that a <code>Session</code> is a “sequence of traces”.</p>
<p>So I’m at a bit of a dead end with Phoenix for now. I might return to Braintrust or Langfuse since these seem to have better support for what I’m trying to do (i.e.&nbsp;group spans together underneath a trace). I’m really reluctant to try to instrument <code>hinbox</code> with Phoenix when I’m unable even to get this basic grouping working properly with some dummy code.</p>
</section>
</section>
<section id="update-solution-from-the-arize-team" class="level2">
<h2 class="anchored" data-anchor-id="update-solution-from-the-arize-team">Update: solution from the Arize team</h2>
<p>I posted this blog on the Arize slack and they got back to me with a solution:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb9-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb9-3"></span>
<span id="cb9-4">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb9-5">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb9-6">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb9-7">    set_global_tracer_provider<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb9-8">)</span>
<span id="cb9-9">tracer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tracer_provider.get_tracer(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb9-10"></span>
<span id="cb9-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.llm</span></span>
<span id="cb9-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb9-13">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb9-14">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb9-15">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb9-16">            {</span>
<span id="cb9-17">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: prompt,</span>
<span id="cb9-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb9-19">            }</span>
<span id="cb9-20">        ],</span>
<span id="cb9-21">    )</span>
<span id="cb9-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb9-23"></span>
<span id="cb9-24"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.agent</span></span>
<span id="cb9-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_agent(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb9-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I am an agent."</span></span>
<span id="cb9-27"></span>
<span id="cb9-28"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.chain</span></span>
<span id="cb9-29"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb9-30">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb9-31">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>)</span>
<span id="cb9-32">    agent1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_agent(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Who are you?"</span>)</span>
<span id="cb9-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2, agent1)</span>
<span id="cb9-34"></span>
<span id="cb9-35"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb9-36">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>And you can see how this looks in the Phoenix Cloud dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/arize-updated-solution-grouped.png" class="img-fluid figure-img"></p>
<figcaption>Grouped spans</figcaption>
</figure>
</div>
<p>Judging from the code it seems like the way the span is constructed simply depends on how you assemble the hierarchy of spans. For instance, if I wanted to consider the top-level entity for this ‘trace’ (i.e.&nbsp;a grouping of spans) then I could use this code:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb10-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> phoenix.otel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> register</span>
<span id="cb10-3"></span>
<span id="cb10-4">tracer_provider <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> register(</span>
<span id="cb10-5">    project_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Default is 'default'</span></span>
<span id="cb10-6">    auto_instrument<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Auto-instrument your app based on installed OI dependencies</span></span>
<span id="cb10-7">    set_global_tracer_provider<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb10-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># batch=True,</span></span>
<span id="cb10-9">)</span>
<span id="cb10-10">tracer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tracer_provider.get_tracer(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>)</span>
<span id="cb10-11"></span>
<span id="cb10-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.llm</span></span>
<span id="cb10-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb10-14">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb10-15">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb10-16">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb10-17">            {</span>
<span id="cb10-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: prompt,</span>
<span id="cb10-19">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb10-20">            }</span>
<span id="cb10-21">        ],</span>
<span id="cb10-22">    )</span>
<span id="cb10-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb10-24"></span>
<span id="cb10-25"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.agent</span></span>
<span id="cb10-26"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_agent(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb10-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I am an agent."</span></span>
<span id="cb10-28"></span>
<span id="cb10-29"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.tool</span>(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query_embedding"</span>, description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Query embedding"</span>)</span>
<span id="cb10-30"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_embedding(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb10-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>]</span>
<span id="cb10-32"></span>
<span id="cb10-33"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@tracer.agent</span></span>
<span id="cb10-34"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb10-35">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb10-36">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>)</span>
<span id="cb10-37">    agent1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_agent(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Who are you?"</span>)</span>
<span id="cb10-38">    embedding1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_embedding(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb10-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2, agent1, embedding1)</span>
<span id="cb10-40"></span>
<span id="cb10-41"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb10-42">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>And now instead of this trace being of kind ‘chain’, it’s now of kind ‘agent’, which some internal spans also being of kind ‘agent’. In a conversation in the Arize Slack I got the following clarification:</p>
<blockquote class="blockquote">
<p>“Traces as the concept under”signals” is basically a unique identifier of spans (think “span” of time). See https://opentelemetry.io/docs/concepts/signals/traces/ In most cases if you filter spans by “roots” (e.g.&nbsp;spans that don’t have parents) and or look at the collective set of “traces” they will roughly look the same. Most of the time this is the view you want when looking at telemetry. Spans are too noisy to be looking at in isolation. While the two tabs feel largely overlapping, it’s a bit intentional as there’s actually no real object called a trace - it’s just a series of spans. You will see these abstractions in most observability platform.”</p>
</blockquote>
<p>The line that:</p>
<blockquote class="blockquote">
<p>“there’s actually no real object called a trace - it’s just a series of spans”</p>
</blockquote>
<p>Was extremely clarifying, actually. It explains the fuzziness between the spans and traces tab in the Phoenix dashboard.</p>
<p>I also got some clarification around the missing <code>@tracer.embbeding</code> and <code>@tracer.reranker</code> decorators:</p>
<blockquote class="blockquote">
<p>“We emit spans for embedding text to vectors (like”adda”), guardrailing via thinks like guardrals or content moderation, and reranking things via things like cohere. However it’s sorta rare for people to manually write these. We will have decorators for them but right now they are typically emitted from autoinstrumentors like langgraph where there are common patterns for these things. We will have decorators for them very soon - but things like reranking are much more complex than things like tool calling so we are codifying these primitives now.”</p>
</blockquote>
<p>So there you have it! Some clarity. I’ll have to play around to see whether I go with the Langfuse route or the Phoenix route and which feels most ergonomic in the <code>hinbox</code> codebase. Appreciate the quick feedback from the Phoenix team, though!</p>


</section>

 ]]></description>
  <category>llms</category>
  <category>agents</category>
  <category>evals-course</category>
  <category>evaluation</category>
  <category>miniproject</category>
  <category>hinbox</category>
  <guid>https://alexstrick.com/posts/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm.html</guid>
  <pubDate>Tue, 03 Jun 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm/grouped_traces.png" medium="image" type="image/png" height="85" width="144"/>
</item>
<item>
  <title>Testing out instrumenting LLM tracing for litellm with Braintrust and Langfuse</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-06-04-instrumenting-an-agentic-app-with-braintrust-and-litellm.html</link>
  <description><![CDATA[ 




<p>I <a href="https://mlops.systems/posts/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm.html">previously tried (and failed)</a> to setup LLM tracing for hinbox using Arize Phoenix and litellm. Since this is sort of a priority for being able to follow along with the <a href="https://maven.com/parlance-labs/evals">Hamel / Shreya evals course</a> with my practical application, I’ll take another stab using a tool with which I’m familiar: <a href="https://www.braintrust.dev/">Braintrust</a>. Let’s start simple and then if it works the way we want we can set things up for <code>hinbox</code> as well.</p>
<section id="simple-braintrust-tracing-with-litellm-callbacks" class="level2">
<h2 class="anchored" data-anchor-id="simple-braintrust-tracing-with-litellm-callbacks">Simple Braintrust tracing with litellm callbacks</h2>
<p>Callbacks are listed <a href="https://docs.litellm.ai/docs/observability/braintrust">in the litellm docs</a> as the way to do tracing with Braintrust. So we can do something like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb1-2"></span>
<span id="cb1-3">litellm.callbacks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"braintrust"</span>]</span>
<span id="cb1-4"></span>
<span id="cb1-5">completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb1-6">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb1-7">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb1-8">        {</span>
<span id="cb1-9">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb1-10">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb1-11">        }</span>
<span id="cb1-12">    ],</span>
<span id="cb1-13">    metadata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb1-14">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "project_id": "1235-a70e-4571-abcd-234235",</span></span>
<span id="cb1-15">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"project_name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,</span>
<span id="cb1-16">    },</span>
<span id="cb1-17">)</span>
<span id="cb1-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content)</span></code></pre></div></div>
<p>You can pass in a <code>project_id</code> or a <code>project_name</code> and the traces will be routed there. Here’s what it looks like in the Braintrust dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/braintrust-langfuse-litellm/braintrust-logging.png" class="img-fluid figure-img"></p>
<figcaption>Our first trace logged in Braintrust</figcaption>
</figure>
</div>
<p>Note how you can’t see which model was used for the LLM call, nor any cost estimates. The docs mention that you <em>can</em> pass metadata into Braintrust using the <code>metadata</code> property:</p>
<blockquote class="blockquote">
<p>“<code>braintrust_*</code> - any metadata field starting with <code>braintrust_</code> will be passed as metadata to the logging request” (<a href="https://docs.litellm.ai/docs/observability/braintrust#full-api-spec">link</a>)</p>
</blockquote>
<p>This seems a bit rudimentary, however. If we take a look at the full tracing documentation on the Braintrust docs we can see that they seem to recommend wrapping the <code>OpenAI</code> client object instead:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> braintrust <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> init_logger, traced, wrap_openai</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> OpenAI</span>
<span id="cb2-5"></span>
<span id="cb2-6">logger <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_logger(project<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>)</span>
<span id="cb2-7">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> wrap_openai(OpenAI(api_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"OPENAI_API_KEY"</span>]))</span>
<span id="cb2-8"></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># @traced automatically logs the input (args) and output (return value)</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># of this function to a span. To ensure the span is named `answer_question`,</span></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># you should name the function `answer_question`.</span></span>
<span id="cb2-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@traced</span></span>
<span id="cb2-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> answer_question(body: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>:</span>
<span id="cb2-14">    prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb2-15">        {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"system"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span>},</span>
<span id="cb2-16">        {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: body},</span>
<span id="cb2-17">    ]</span>
<span id="cb2-18"></span>
<span id="cb2-19">    result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb2-20">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>,</span>
<span id="cb2-21">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>prompt,</span>
<span id="cb2-22">    )</span>
<span id="cb2-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> result.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb2-24"></span>
<span id="cb2-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> main():</span>
<span id="cb2-26">    input_text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span></span>
<span id="cb2-27">    result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> answer_question(input_text)</span>
<span id="cb2-28">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(result)</span>
<span id="cb2-29"></span>
<span id="cb2-30"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb2-31">    main()</span></code></pre></div></div>
<p>This indeed does label the span as <code>answer_question</code> but it doesn’t do much else. Even the model name isn’t captured here. Instrumenting a series of calls to handle ‘deeply nested code’ (as <a href="https://www.braintrust.dev/docs/guides/traces/customize#deeply-nested-code">their docs</a> puts it) even didn’t log the things it was supposed to:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"></span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> braintrust <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> current_span, init_logger, start_span, traced, wrap_openai</span>
<span id="cb3-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> OpenAI</span>
<span id="cb3-7"></span>
<span id="cb3-8">logger <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init_logger(project<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>)</span>
<span id="cb3-9">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> wrap_openai(OpenAI(api_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"OPENAI_API_KEY"</span>]))</span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@traced</span></span>
<span id="cb3-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> run_llm(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span>):</span>
<span id="cb3-13">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> random.random() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o-mini"</span></span>
<span id="cb3-14">    result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb3-15">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>model, messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span>}]</span>
<span id="cb3-16">    )</span>
<span id="cb3-17">    current_span().log(metadata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"randomModel"</span>: model})</span>
<span id="cb3-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> result.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb3-19"></span>
<span id="cb3-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@traced</span></span>
<span id="cb3-21"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> some_logic(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span>):</span>
<span id="cb3-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> run_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a magical wizard. Answer the following question: "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span>)</span>
<span id="cb3-23"></span>
<span id="cb3-24"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> simple_handler(input_text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb3-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> start_span() <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> span:</span>
<span id="cb3-26">        output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> some_logic(input_text)</span>
<span id="cb3-27">        span.log(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">input</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>input_text, output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>output, metadata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(user_id<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"test_user"</span>))</span>
<span id="cb3-28">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(output)</span>
<span id="cb3-29"></span>
<span id="cb3-30"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb3-31">    question <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span></span>
<span id="cb3-32">    simple_handler(question)</span></code></pre></div></div>
<p>This is adapted from the example they pasted in their docs as their one isn’t even a functional code example on its own.</p>
<p>It is seeming increasingly clear that Braintrust isn’t going to be the right choice, at least as long as I want to keep using <code>litellm</code>. I know that Langfuse has a very nice integration with <code>litellm</code>, so I think I’ll pivot over to that now.</p>
</section>
<section id="basic-tracing-with-langfuse-and-litellm" class="level2">
<h2 class="anchored" data-anchor-id="basic-tracing-with-langfuse-and-litellm">Basic tracing with Langfuse and <code>litellm</code></h2>
<p>Simple tracing is easy:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb4-2"></span>
<span id="cb4-3">litellm.callbacks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"langfuse"</span>]</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb4-6">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb4-7">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb4-8">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb4-9">            {</span>
<span id="cb4-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb4-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb4-12">            }</span>
<span id="cb4-13">        ],</span>
<span id="cb4-14">    )</span>
<span id="cb4-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb4-16"></span>
<span id="cb4-17"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb4-18">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>)</span>
<span id="cb4-19">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>)</span>
<span id="cb4-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2)</span>
<span id="cb4-21"></span>
<span id="cb4-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>We specify <code>langfuse</code> for the callback and each llm call is logged as a separate trace + span. Here you can see what this looks like in the dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/braintrust-langfuse-litellm/langfuse-dashboard.png" class="img-fluid figure-img"></p>
<figcaption>Basic trace and span in Langfuse dashboard</figcaption>
</figure>
</div>
<p>The <a href="https://docs.litellm.ai/docs/observability/langfuse_integration">litellm docs</a> include information on how to specify custom metadata and grouping instructions for Langfuse. Notably, we can specify (as of June 2025, at least!) things like a <code>session_id</code>, tags, a <code>trace_name</code> and/or <code>trace_id</code> as well as custom trace metadata and so on. So we can get most of what we want to specify in the following way:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> litellm</span>
<span id="cb5-2"></span>
<span id="cb5-3">litellm.callbacks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"langfuse"</span>]</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> query_llm(prompt: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, trace_id: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb5-6">    completion_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> litellm.completion(</span>
<span id="cb5-7">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openrouter/google/gemma-3n-e4b-it:free"</span>,</span>
<span id="cb5-8">        messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb5-9">            {</span>
<span id="cb5-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb5-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>,</span>
<span id="cb5-12">            }</span>
<span id="cb5-13">        ],</span>
<span id="cb5-14">        metadata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb5-15">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"trace_id"</span>: trace_id,</span>
<span id="cb5-16">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"trace_name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"my_llm_application"</span>,</span>
<span id="cb5-17">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"project"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hinbox"</span>,</span>
<span id="cb5-18">        },</span>
<span id="cb5-19">    )</span>
<span id="cb5-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> completion_response.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].message.content</span>
<span id="cb5-21"></span>
<span id="cb5-22"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> my_llm_application():</span>
<span id="cb5-23">    query1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(</span>
<span id="cb5-24">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of China? Just give me the name."</span>,</span>
<span id="cb5-25">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"my_llm_application_run_789"</span>,</span>
<span id="cb5-26">    )</span>
<span id="cb5-27">    query2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> query_llm(</span>
<span id="cb5-28">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What's the capital of Japan? Just give me the name."</span>,</span>
<span id="cb5-29">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"my_llm_application_run_789"</span>,</span>
<span id="cb5-30">    )</span>
<span id="cb5-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (query1, query2)</span>
<span id="cb5-32"></span>
<span id="cb5-33"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb5-34">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(my_llm_application())</span></code></pre></div></div>
<p>This looks like this in the Langfuse dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/braintrust-langfuse-litellm/langfuse_grouped_spans.png" class="img-fluid figure-img"></p>
<figcaption>Spans grouped into traces</figcaption>
</figure>
</div>
<p>This is honestly most of what I’m looking for in terms of my tracing. If I were to use a non-OpenRouter model, moreover, I’d also get full costs in the Langfuse dashboard, e.g.:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/braintrust-langfuse-litellm/costs-langfuse.png" class="img-fluid figure-img"></p>
<figcaption>LLM costs in Langfuse dashboard</figcaption>
</figure>
</div>
<p>As such, I can monitor costs from within OpenRouter and have the option to keep track of costs in Langfuse by passing custom metadata should I wish.</p>
<p>I’ll make a separate blog where I actually go into how I set up + instrumented <code>hinbox</code> for this kind of tracing while continuing to use <code>litellm</code>.</p>


</section>

 ]]></description>
  <category>llms</category>
  <category>agents</category>
  <category>evals-course</category>
  <category>evaluation</category>
  <category>miniproject</category>
  <category>hinbox</category>
  <guid>https://alexstrick.com/posts/2025-06-04-instrumenting-an-agentic-app-with-braintrust-and-litellm.html</guid>
  <pubDate>Tue, 03 Jun 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/braintrust-langfuse-litellm/braintrust-logging.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Building hinbox: An agentic research tool for historical document analysis</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-05-29-hinbox-a-first-draft-of-an-agentic-research-system.html</link>
  <description><![CDATA[ 




<p>I’ve been working on a project called <a href="https://github.com/strickvl/hinbox"><code>hinbox</code></a> - a flexible entity extraction system designed to help historians and researchers build structured knowledge databases from collections of primary source documents. At its core, <code>hinbox</code> processes historical documents, academic papers, books and news articles to automatically extract and organize information about people, organizations, locations, and events.</p>
<p>The tool works by ingesting batches of documents and intelligently identifying entities across sources. What makes it interesting is the iterative improvement aspect: as you feed more documents into the system, entity profiles become richer and more comprehensive. When <code>hinbox</code> encounters a person or organization it’s seen before, it updates their profile with new information rather than creating duplicates. I’ve been testing it extensively with Guantánamo Bay media sources - a domain where I have deep expertise from my previous career as a historian - which allows me to rigorously evaluate the quality of its extractions.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/hinbox/organizations-view.png" class="img-fluid figure-img"></p>
<figcaption>The organisations view</figcaption>
</figure>
</div>
<p>Right now, <code>hinbox</code> isn’t ready for broader use. The prompt engineering needs significant refinement, and the entity merging logic requires more sophisticated iteration loops. But that’s actually the point - I’ve been participating in Hamel and Shreya’s <a href="https://mlops.systems/#category=evals-course">AI evals course</a>, and I wanted a concrete project where I could apply the systematic evaluation and improvement techniques we’re learning.</p>
<p>This project originally came together over a few intense days about two months ago, then sat dormant while work <a href="https://www.linkedin.com/feed/update/urn:li:activity:7333405837433999360/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7333405837433999360%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29">got</a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7332696687515205650/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7332696687515205650%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29">busy</a>. I’ve recently resurrected it specifically to serve as a practical laboratory for the evals course exercises. There’s something powerful about having a real application with measurable outputs where you can experiment with different approaches to prompt optimization, model selection, and systematic error analysis.</p>
<p>The broader vision is creating a tool that could genuinely help researchers working with large document collections - transforming the traditional manual process of reading, noting, and cross-referencing into something more systematic and scalable. But first, it needs to work reliably, which is where the evals work comes in.</p>
<section id="why-build-this-personal-research-history-meets-the-age-of-agents" class="level2">
<h2 class="anchored" data-anchor-id="why-build-this-personal-research-history-meets-the-age-of-agents">Why Build This? Personal Research History Meets the Age of Agents</h2>
<p>This project connects directly to something I’ve done before - but under very different circumstances. In the mid-2000s, I founded and ran a media monitoring startup in Afghanistan (RIP <a href="https://mlops.systems/posts/2024-04-01-publishing-afghanwire-dataset.html">AfghanWire</a>). We had a team of Afghan translators processing daily newspapers and news sources, translating everything into English. Then came my part: reading these translations and manually building what essentially became a structured knowledge database.</p>
<p>The process was methodical but exhausting. Each article mentioning a person required checking our existing profiles - did we know this individual? If not, I’d create a new entry and research their background. If yes, I’d update their existing profile with new information. Over time, we developed detailed profiles for hundreds of key figures in Afghan politics, civil society, and security. The more articles we processed, the richer and more interconnected our database became. We were building a living encyclopaedia of contemporary Afghanistan, one translated news story at a time.</p>
<p>The startup eventually ran out of funding, but the intellectual framework stuck with me. We’d created something genuinely valuable - contextual intelligence that helped outsiders understand the complex landscape of Afghan media and politics. The manual approach worked, but it was incredibly time-intensive and didn’t scale beyond what a small team could handle.</p>
<section id="the-academic-reality-check" class="level3">
<h3 class="anchored" data-anchor-id="the-academic-reality-check">The Academic Reality Check</h3>
<p>Since then, I’ve continued working as a researcher (I have a PhD in War Studies from King’s College London and have written several critically-acclaimed books credentials blah blah sorry). This experience has reinforced how common the core challenge actually is across academic and research contexts. Historical research often involves exactly this pattern: you have access to substantial primary source collections - maybe 20,000 newspaper issues covering a decade, or thousands of diplomatic cables, or extensive archival materials - but limited time and resources to systematically extract insights.</p>
<p>The traditional academic approach involves months of careful reading, taking notes in physical notebooks, slowly building up understanding through manual cross-referencing. It’s thorough but painfully slow. Most researchers don’t have the luxury of unlimited time to spend four hours daily reading through source materials, even though that’s often what the work requires.</p>
</section>
</section>
<section id="beyond-academic-applications" class="level2">
<h2 class="anchored" data-anchor-id="beyond-academic-applications">Beyond Academic Applications</h2>
<p>The potential applications extend well beyond historical research. Intelligence analysis, scientific literature review, market research, legal discovery - anywhere you need to build structured knowledge from unstructured document collections. There’s clearly demand for these capabilities, evidenced by the popularity of “second brain” concepts and personal knowledge management tools like Obsidian and Roam.</p>
<p>But most existing PKM tools require manual curation. They’re great for organising knowledge you’ve already processed, less effective for bootstrapping that initial extraction from raw sources. What interests me is the hybrid approach: automated extraction that creates draft profiles and connections, which humans can then review, edit, and approve. Not pure automation, but intelligent assistance that handles the tedious first pass.</p>
<section id="the-agentic-moment" class="level3">
<h3 class="anchored" data-anchor-id="the-agentic-moment">The ‘Agentic’ Moment</h3>
<p>We’re entering what feels like a genuinely different phase of AI capability - the emergence of reliable vertical agents that can handle specific, complex workflows end-to-end. <code>hinbox</code> represents my attempt to explore what this might look like in practice for research applications. Rather than building with heavy agentic frameworks (which I haven’t found necessary yet and which fall in and out of favour too often for my tastes), I’m focusing on the core extraction and synthesis challenge.</p>
<p>This feels like the right moment to experiment with these capabilities. The models are sophisticated enough to handle nuanced entity recognition and relationship mapping, but the tooling is still flexible enough that you can build custom solutions for specific domains. It’s an interesting testing ground for understanding both the current state of the art and the practical challenges of deploying AI in knowledge-intensive workflows.</p>
<p>The goal isn’t necessarily to “solve” automated research (though that would be nice), but to build something concrete where I can systematically evaluate different approaches to prompt engineering, model selection, and error correction. Sometimes the best way to understand emerging capabilities is to push them against real problems you actually care about solving.</p>
</section>
</section>
<section id="what-can-hinbox-do-now" class="level2">
<h2 class="anchored" data-anchor-id="what-can-hinbox-do-now">What can <code>hinbox</code> do now?</h2>
<p>The system centres around domain-specific configuration - you define the research area you’re interested in through a set of configuration files that specify your extraction targets and prompts. For my testing, I’ve been using Guantánamo Bay historical sources as the test domain since I can rigorously evaluate the quality of extractions in an area where I have deep expertise.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/hinbox/organization-profile.png" class="img-fluid figure-img"></p>
<figcaption>An example organisation profile</figcaption>
</figure>
</div>
<p>Setting up a new research domain is straightforward: the system generates template configuration files with placeholders for all the necessary prompts. You customise these prompts to focus on the entities most relevant to your research - perhaps emphasising military personnel and legal proceedings for Guantánamo sources, or traders and agricultural cooperatives for Palestinian food history research.</p>
<p>Once configured, <code>hinbox</code> processes your document collection article by article, extracting people, organisations, locations, and events according to your specifications. The interesting part is the intelligent merging: rather than creating duplicate entries, the system attempts to recognise when newly extracted entities match existing profiles and updates them accordingly. This iterative enrichment means profiles become more comprehensive as you process additional sources.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/hinbox/processing-logs.png" class="img-fluid figure-img"></p>
<figcaption>Data processing logs</figcaption>
</figure>
</div>
<p>The system supports both cloud-based models (Gemini Flash 2.x has been particularly effective) and local processing through Ollama - crucial for researchers working with sensitive historical materials that can’t be sent to external APIs. Local models like <code>gemma3:27b</code> have proven surprisingly capable for this kind of structured extraction work.</p>
<p>After processing, you get a web-based frontend for exploring the extracted knowledge base. Profiles include source attribution and version history, so you can track how understanding of particular entities evolved as new documents were processed. The entire output can be shared as a self-contained package - useful for collaborative research or creating supplementary materials for publications.</p>
</section>
<section id="how-i-built-it" class="level2">
<h2 class="anchored" data-anchor-id="how-i-built-it">How I built it</h2>
<p>This project became a practical testbed for several development tools I’d been wanting to explore seriously. Claude Code and Cursor proved invaluable for rapid iteration - the kind of back-and-forth refinement that complex NLP applications require would have taken significantly longer with traditional development approaches.</p>
<p><a href="https://fastht.ml/">FastHTML</a> deserves particular mention for the frontend work. Building research interfaces without wrestling with JavaScript complexity felt genuinely liberating. The ability to create dynamic, interactive visualisations using primarily Python aligns well with how most researchers already think about data manipulation and presentation.</p>
<p>The current data architecture uses Parquet files throughout - a choice that might raise eyebrows but serves the development phase well. Direct file inspection and manipulation proved more valuable than database abstraction during rapid prototyping. Eventually, I’ll likely add SQLite backend options, but the current approach prioritises iteration speed over architectural elegance.</p>
<p>The entity merging logic required the most sophistication. The system combines simple string matching with embedding-based similarity search, then uses an LLM as final arbiter when potential matches are identified. A candidate profile gets compared against existing entities first through name similarity, then through vector comparison of full profile text. If similarity exceeds certain thresholds, both profiles are sent to the model with instructions to determine whether they represent the same entity and how to merge them if so.</p>
<p>This multi-stage approach handles the nuanced judgment calls that pure algorithmic matching struggles with - distinguishing between John Smith the journalist and John Smith the military contractor, or recognising that “Captain Rodriguez” from one article is the same person as “Maria Rodriguez” from another. The complexity here suggests this merging pipeline will be a primary focus for systematic evaluation and improvement as the project matures.</p>
</section>
<section id="whats-up-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-up-next">What’s up next?</h2>
<p>This blog post represents the softest possible launch - really more of a “here’s what I’m working on” update than any kind of formal announcement. <code>hinbox</code> isn’t ready for broad adoption yet, though I’d certainly welcome contributions and feedback from anyone interested in the problem space.</p>
<p>The immediate technical improvements are fairly straightforward. Right now, everything runs synchronously - each article gets processed sequentially to avoid the complexity of concurrent profile updates. Adding parallel processing would require implementing proper queuing or database locking mechanisms. Similarly, moving from Parquet files to a SQLite backend would provide better data management and enable more sophisticated querying patterns. Both changes would improve performance but add architectural complexity I haven’t needed while focusing on core functionality.</p>
<p>I’m also eager to expand beyond newspaper articles to different document types - academic papers, book chapters, research reports, archival materials. Each format likely requires prompt refinements and possibly different extraction strategies. If this is going to be genuinely useful across research domains, it needs to handle the full spectrum of source materials historians and researchers actually work with.</p>
<section id="the-real-work-systematic-evaluation-and-improvement" class="level3">
<h3 class="anchored" data-anchor-id="the-real-work-systematic-evaluation-and-improvement">The Real Work: Systematic Evaluation and Improvement</h3>
<p>But the most interesting next phase involves applying systematic evaluation techniques from the AI evals course I mentioned earlier. This is where the project becomes genuinely educational rather than just another NLP application. I’ll be implementing structured approaches to:</p>
<ul>
<li><strong>Error analysis</strong>: Understanding exactly where and why entity extraction fails</li>
<li><strong>Prompt optimization</strong>: Systematic testing rather than intuitive iteration<br>
</li>
<li><strong>Model comparison</strong>: Rigorous evaluation across different architectures and providers</li>
<li><strong>Merging accuracy</strong>: Quantifying the quality of entity deduplication decisions</li>
</ul>
<p>The goal is documenting this improvement process in detail through subsequent blog posts. Rather than abstract discussions of evaluation methodology, I want to show concrete examples of how these techniques apply to a real system with measurable outputs. What does systematic prompt engineering actually look like in practice? How do you design effective test suites for complex agentic pipelines? When do local models outperform cloud APIs for specific tasks?</p>
</section>
<section id="context-for-future-technical-content" class="level3">
<h3 class="anchored" data-anchor-id="context-for-future-technical-content">Context for Future Technical Content</h3>
<p>Honestly, the main reason for writing this overview wasn’t to launch anything - it was to establish context. I wanted a reference point for future technical posts that dive deep into evaluation methodology and iterative improvement without needing to repeatedly explain what hinbox is or why I’m working on it. The interesting content will be showing how systematic AI development practices apply to concrete research problems.</p>
<p>This feels like the right kind of project for exploring these questions: complex enough to surface real challenges, focused enough to enable rigorous evaluation, and personally meaningful enough to sustain the extended iteration cycles that proper system improvement requires. Plus, having worked extensively in the domain I’m testing makes it much easier to distinguish between genuine improvements and superficial metrics optimisation.</p>
<p>More technical deep-dives coming soon as the evals work progresses. The real learning happens in the systematic refinement process, not the initial build.</p>
<hr>
<p><em>The hinbox repository is available on <a href="https://github.com/strickvl/hinbox">GitHub</a> for anyone interested in following along or contributing. All feedback welcome as this evolves from prototype to something genuinely useful for research applications.</em></p>


</section>
</section>

 ]]></description>
  <category>llms</category>
  <category>agents</category>
  <category>evals-course</category>
  <category>evaluation</category>
  <category>miniproject</category>
  <category>hinbox</category>
  <category>research</category>
  <guid>https://alexstrick.com/posts/2025-05-29-hinbox-a-first-draft-of-an-agentic-research-system.html</guid>
  <pubDate>Thu, 29 May 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/hinbox/frontpage.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>Error analysis to find failure modes</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-05-23-error-analysis-to-find-failure-modes.html</link>
  <description><![CDATA[ 




<p>I came across this quote in a happy coincidence after attending the second session of the evals course:</p>
<p><img src="https://alexstrick.com/posts/images/2025-05-23-error-analysis-to-find-failure-modes/patchett-quote.png" class="img-fluid"></p>
<p>It’s obviously a bit abstract, but I thought it was a nice oblique reflection on the topic being discussed. Both the main session and the office hours were mostly focused on the first part of the analyse-measure-improve loop that was introduced <a href="https://mlops.systems/posts/2025-05-20-how-to-think-about-evals.html">earlier in the week</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-05-23-error-analysis-to-find-failure-modes/analyze_focus.png" class="img-fluid figure-img"></p>
<figcaption>Focus on the ‘analyse’ part of the LLM application improvement loop</figcaption>
</figure>
</div>
<p>It was a very practical session in which we even took time to do some live ‘coding’ (i.e.&nbsp;analysis + clustering) of real data. I’ll try to summarise the points I jotted down in my notebook and end with some reflection on how I will be applying this for an application I’ve been working on.</p>
<p>A quick reminder of the context: we have an application of some kind, and we want to improve it. LLMs have lots of quirks that make them hard to narrow down exactly how they’re failing, so we’re working through a process that allows you to do just that. This was framed as a five-step process by Hamel + Shreya:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-05-23-error-analysis-to-find-failure-modes/analyse-cover.png" class="img-fluid figure-img"></p>
<figcaption>The five parts of the ‘analyse’ loop</figcaption>
</figure>
</div>
<p>First up, we need to look at some data to better understand the failure modes that our application might suffer from. If this application’s been in production for a while, you might well just have production data. If not, we’ll want to create a synthetic(-ish) dataset that allows us to get over the cold-start hump.</p>
<section id="create-your-initial-dataset" class="level2">
<h2 class="anchored" data-anchor-id="create-your-initial-dataset">1. Create your initial dataset</h2>
<p>This process is fairly technical, but as we were introduced to this process, the aim is to end up with 100 inputs that span across different dimensions of use that your application / system might be exposed to.</p>
<p>Why 100? No reason. As Hamel explained, it’s just a magic number to get you started. We’re encouraged not to get too focused on the details of the process but rather to trust that we would get to where we wanted if only we had a little faith.</p>
<p>The idea is that we pass these 100 datapoints into our LLM-driven system in order to see what we get out at the other end, we analyse them iteratively until we’re not learning anything new by doing the iterative process.</p>
<p>The process is something like the following:</p>
<ul>
<li>you want to sample among dimensions or facets of the use that your application could expect to experience, so come up with at least three of these. As a rule of thumb, perhaps think through the lens of features people might use, persona, query complexities or scenarios. It will differ per application, most likely.</li>
<li>Then generate a number of combinations of these three dimensions. (So as an example: people who want to use a chatbot to buy a product, and these are all non-technical users who actually are non-native English speakers, and who don’t necessarily formulate their queries with full sentences because they’re being passed in by a voice transcription module). Generate 50 of these. (Then filter out the ones that don’t make sense.)</li>
<li>Then either hand write or use an LLM to help you generate the full 100 realistic queries that would come from any of the particular tuple-combos that we created earlier. (Again, filter out the ones that don’t make sense.)</li>
</ul>
</section>
<section id="look-at-your-data-open-coding" class="level2">
<h2 class="anchored" data-anchor-id="look-at-your-data-open-coding">2. Look at your data (‘open coding’)</h2>
<p>At this point you’ll pass all these queries into your system and then you’ll have a pair of the initial query, together with the ‘full trace’ (which encompasses the final response along with all internal tool calls, retrieval and any other context or metadata).</p>
<p>Here you assemble your traces and you write notes on each one. Basically you are looking at each of the 100 items of data and making observations on what failure modes you observe in the data. In the lesson we did this live through the <a href="https://www.braintrust.dev/">Braintrust</a> interface, but it was emphasised that custom vibe-coded interfaces were also recommended, especially when you have a lot of metadata and tool calling that you might want to present in a certain way to foreground certain elements etc.</p>
<p>This is where you’ll spend 80% of your time and for 100 traces could take something on the order of an hour. Read each trace. Write some brief descriptive notes about the observed problems or actions where things are going wrong or are unexpected.</p>
<p>Importantly, you <strong>let the categories emerge from the data</strong> rather than coming in with pre-conceived ideas of what the categories already are.</p>
<p>For long traces, or ones with complex intermediary steps, focus on either the first upstream failure or the first independent failure that you come across. In the end, this process is an iterative one, so you’ll have a chance to repeat this a few times.</p>
<p>Note also that we don’t really care about the root cause analysis (i.e.&nbsp;‘why’ things are happening). We’re doing error analysis so what we care about is just the behaviour and patterns that we observe.</p>
</section>
<section id="cluster-your-data-axial-coding" class="level2">
<h2 class="anchored" data-anchor-id="cluster-your-data-axial-coding">3. Cluster your data (‘axial coding’)</h2>
<p>At this point you have a dataset of inputs, outputs and your notes on these 100 items. At this point you switch to a clustering effort where you are structuring the failure modes + merging them. You bring structure into your unstructured data by grouping similar failure modes into a sort of emergent failure taxonomy.</p>
<p>The process: you read the notes and then you cluster similar notes.</p>
<p>It’s possible to get some help from an LLM with this, for suggestions on how to group items, but there’s no way to automate yourself out of this process. You still need to make the final judgement and call, based on your understanding of the context of the application. “Always manually review, refine and define these failure modes yourself.”</p>
<p>One useful guidance was to try to have failure modes that are binary (i.e.&nbsp;observably yes or no) since this will help later on in the process but also it’s much easier to have clear definitions for yes and no. (The alternative, where you have grades between 1-5, for example, is too easy to be unclear.)</p>
</section>
<section id="label-more-traces-iterate" class="level2">
<h2 class="anchored" data-anchor-id="label-more-traces-iterate">4. Label more traces &amp; iterate</h2>
<p>And then you’re repeating and iterating! During this process don’t be concerned that your failure mode naming or definitions might start to evolve. This is a known thing that happens when you annotate data, i.e.&nbsp;the criteria drifts as you review new outputs, and it’s actually something you should welcome because it is a reflection of you better understanding your data.</p>
<p>You’ll want to keep looping between open coding + axial coding stages until you are ‘saturated’ in terms of what you’re learning about the failure modes. You’ll be refining the definitions, merging similar categories, splitting ones that are different.</p>
</section>
<section id="pitfalls-to-watch-out-for" class="level2">
<h2 class="anchored" data-anchor-id="pitfalls-to-watch-out-for">Pitfalls to watch out for</h2>
<p>We skipped over this section fairly quickly, but there are a bunch of ways in which you can short-change yourself in this process and that are worth being aware of:</p>
<ul>
<li>you might have underspecified or been too narrow in how you defined the tuple-combos at the beginning. i.e.&nbsp;your data that you generated didn’t end up covering wide dimensions of usage patterns.</li>
<li>you might skimp on the work, either only coding a few examples, or half-passing the effort to actually think through what an example or trace really represents</li>
<li>you might try to automate things too early, delegating your (expert) judgement to a machine that can’t represent your interests, at least not at this stage</li>
<li>you might skip the iteration loop of going back to the open coding after doing some axial coding</li>
<li>for complex domains, you might skip including experts as part of this process of annotation</li>
</ul>
</section>
<section id="office-hours-discussions" class="level2">
<h2 class="anchored" data-anchor-id="office-hours-discussions">Office hours discussions</h2>
<p>There were a few really interesting questions that were asked during the office hours.</p>
<p>One was about how to handle ‘complex’ pipelines (i.e.&nbsp;ones with many intermediary stages, possibly with lots of tool calling and iteration / reflection loops). Hamel suggested two ways of approaching this complexity:</p>
<ul>
<li>building your own data viewer or annotator was one option since it allows you to customise exactly which bits of the complexity you’re exposed to. It’ll differ per application, but really you should focus on whatever is important to you based on the behaviour of the application, and an off-the-shelf tool — however good — can never be everything to everyone.</li>
<li>look at the final output instead of getting lost in all the intermediary details. You can see the errors in the output / final behaviour. Since this is an iterative process, if you observe errors in the output, that’s actually good enough. You don’t need to do a root cause analysis. Just code and cluster based on the failure modes you observe. You could also focus on the error type / pattern that seems most important or burning to you.</li>
</ul>
<p>In general the emphasis was on finding ways to simplify things and not get lost in all the complexity of your system. This isn’t or won’t be the last time you see your system’s behaviour, so you don’t have to catch everything. Either picking the most glaring errors or sticking with upstream failures can be good ways of achieving this. “Find the one error that’s swamping out other errors.”</p>
<p>Another really interesting prompt from Hamel was to take on the mentality of a detective while working on this analysis stage. Think: “I’m going to find the failure nodes” and this mentality could carry you forward beyond all your doubts or hesitations or unsureness about the process.</p>
<p>And in the end, as both Hamel and Shreya said, it might feel like taking a leap of faith to trust in the process, since it ultimately is quite an open-ended process. Sort of like the well-worn metaphor of driving at night through fog, where you can’t see more than ten metres in front of you, but still you are able to make forward progress.</p>
<p>There was also a question about how to generate synthetic inputs when the LLM-driven process to turn the inputs into outputs also involved some human intervention (perhaps human-in-the-loop responses etc). Two suggestions for this: possibly you could have a synthetic persona who could play the role that a human might have played in those cases, but alternative you could just find five real humans and ask them to run through the scenarios or workflows a dozen times each in order to get you enough data generated that you get past the cold-start problem.</p>
</section>
<section id="reflections-what-ill-be-working-on" class="level2">
<h2 class="anchored" data-anchor-id="reflections-what-ill-be-working-on">Reflections &amp; what I’ll be working on</h2>
<p>I was so struck during today’s session how much overlap there is in this work of evaluation with the work of a professional historian. The things I did when I wrote books, or my PhD, or just research reports, is really similar to this process. It actually made me a bit sad that there are aren’t more ways for people with a humanities background to be involved in the work of LLM application development. Not only are people with humanities backgrounds often trained to be good writers — important in the domain of prompting as we learned on Tuesday — but they have spent their whole career trying to find ways to get their heads around unwieldy unstructured data.</p>
<p>I have a project which is an agentic workflow / pipeline to ingest primary source or raw data from newspapers or books and iteratively improve and populate a sort of wikipedia based on what gets learned from each source. It’s a sister project to my source translation repo, <a href="https://github.com/strickvl/tinbox"><code>tinbox</code> (‘translator in a box’)</a> and so this one’s called <code>hinbox</code> (i.e.&nbsp;‘historian in a box’). I have a working prototype but it still needs a bit of work before I’m happy going into more detail about it works. I’ll make the repo public soon I hope. Needless to say, I am using this course as a way of developing evals as a way of improving it and iterating on its failure modes.</p>
<p>I might only get round to doing some deep practical work on that next week or the week after, but I’ll be sure to keep up the notes and reflections on the course sessions here as we go.</p>


</section>

 ]]></description>
  <category>evals-course</category>
  <category>llms</category>
  <category>llmops</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-05-23-error-analysis-to-find-failure-modes.html</guid>
  <pubDate>Thu, 22 May 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-05-23-error-analysis-to-find-failure-modes/analyse-cover.png" medium="image" type="image/png" height="119" width="144"/>
</item>
<item>
  <title>How to think about evals</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-05-20-how-to-think-about-evals.html</link>
  <description><![CDATA[ 




<p>Today was the first session of Hamel + Shreya’s course, “<a href="maven.com/parlance-labs/evals/">AI Evals for Engineers and PMs</a>”. The first session was all about mental models for thinking about the topic as a whole, mixed in with some teasers of practical examples and advice.</p>
<p>I’ll try to keep up with blogging about what I learn as we go. Most of the actual content will go up online at some point in the future, I’m assuming, so not much point writing up super detailed notes. (There is also a book coming, which I assume will be great, and about which you can <a href="https://www.youtube.com/watch?v=OJItZndMUII">learn more here</a>.) So in general I’ll try to be doing the following as I blog along:</p>
<ul>
<li>highlight things I found interesting or inspiring based on the formal ‘lectures’</li>
<li>anything that comes up while doing the practical ‘homework’ (there are some optional exercises assigned to ground everything)</li>
<li>contextualise or situate things that come up in my own experience having worked on a few LLM-driven projects</li>
</ul>
<p>Today, fresh out of the first class, I wanted to write about the mental model of the ‘three gulfs’ that they propose, the improvement loop that they suggest is how to measurably improve your applications, and also prompting through the lens of evals. Finally I’ll round off with a bit about what I’ll be exploring this week.</p>
<section id="the-three-gulfs-specification-generalization-and-comprehension" class="level2">
<h2 class="anchored" data-anchor-id="the-three-gulfs-specification-generalization-and-comprehension">The Three Gulfs: Specification, Generalization and Comprehension</h2>
<p>So there’s this image that they shared in the book chapter preview discussion that came up again during the lesson today:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-05-20-how-to-think-about-evals/three-gulfs.png" class="img-fluid figure-img"></p>
<figcaption>The three gulfs of LLM application development</figcaption>
</figure>
</div>
<p>(They’ve shared it already in the YouTube discussion + I see it on Twitter being shared so I think I’m not sharing something I ought not to!)</p>
<p>The course is very practically focused, especially so for application developers, so this diagram is in that context. The diagram offers up a way of thinking about LLM application development that pinpoints the places where you might do your work, and it’s also a way of thinking through things systematically, too.</p>
<p>I was especially interested in the differentiation between the gulf of specification and the gulf of generalisation, since these can often feel similar, but actually the way to get out of them is actually slightly different. I’ll go into a bit more detail below, but basically with the gulf of specification you might want to be working on your prompts + how specific you are, whereas with the generalisation gulf you might need things like splitting up your tasks or making sure your system is outputting things in a structured way, etc etc.</p>
<p>Note also that the world of tooling also doesn’t help you in a specific or targeted way to focus on one aspect of this diagram. Too often the tools try to cover the whole picture and probably also muddy the water by eliding the differences between the different tasks and challenges of each island or the gulf in between. All this is pretty abstract, so let’s go through them one by one.</p>
<section id="the-gulf-of-comprehension-in-practice" class="level3">
<h3 class="anchored" data-anchor-id="the-gulf-of-comprehension-in-practice">The Gulf of Comprehension in Practice</h3>
<p>This was seen as sort of the starting point for thinking through LLM Application improvement. At this point your big problem is that you’re trying to understand the data that comes your way from your users. You’re trying to understand the inputs to your application (what your users are typing, assuming that text is the medium of communication / input) and you’re trying to understand what the application or LLM is outputting.</p>
<p>The challenge comes because you can’t read every single log or morsel of data. You have to filter things down somehow! If this were something more like traditional ML you’d have statistics to help boil down your data, but mostly we’re talking about unstructured text data so it’s much more unwieldy.</p>
<p>This challenge means that people often get stuck at this point. This is where POC applications live, breathe and eventually die. You have enough sense that things are ‘kinda’ working, but you don’t really know what the failure modes are, so you don’t know how to improve it. You’ve tried out one or two things in a halfway systematic way, but really you have no idea what’s working well and what’s not.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>On Tools vs Process
</div>
</div>
<div class="callout-body-container callout-body">
<p>Hamel made the good point that it’s probably not so useful to think about tools too much when thinking at this stage. Generally speaking what’s going on is most often actually a process problem and trying to go straight to ‘what tool do I need’ is probably avoiding the real issue.</p>
</div>
</div>
</section>
<section id="the-gulf-of-specification-in-practice" class="level3">
<h3 class="anchored" data-anchor-id="the-gulf-of-specification-in-practice">The Gulf of Specification in Practice</h3>
<p>This is the place where you are trying to translate intent into precise instructions that the LLM will follow. You’re trying to be explicit and specific in the hope that the LLM will do what you want it to do, and not do the things that you don’t want it to do.</p>
<p>The obvious manifestation of this is people writing bad prompts. It might seem that it’s also present when you try to have an LLM solve one problem when it’s either unsuited for that task or the task needs to be broken up and so on, but that’s the sister gulf of generalisation. Here, we’re focused on how to improve the specificity of your prompts.</p>
<p>When you split things up and highlight the fact that prompts are something that you’ll need to work on and to improve, it becomes clear that it’s something you wouldn’t want to outsource or to skimp time on. Really the prompt writing is the thing that you (at certain moments, and where it’s identified as the thing needing focus / improvement) want to be working on in partnership with domain experts.</p>
<p>For small applications, you might be the same person as the domain expert! For bigger projects, you might be working <em>with</em> domain experts. Just be aware that often the domain expert might not necessarily be detached enough to be able to figure out what needs the focus, or where the weaknesses of a prompt are. That’s what the iterative process / error analysis and everything else that’ll be taught in the course is for (see below and see future posts).</p>
<p>Another point Hamel made was about why prompts are actually so important: “you have to express your project taste somewhere”. Given that your application might be fully / mostly driven by LLMs, the prompt is actually a really crucial place to express this taste and as such might be thought of as your ‘moat’.</p>
<p>I know just from having experienced a variety of LLM-driven applications, it’s quite easy to tell the ones where the product team gave their prompts and their specification some real love. It’s the difference between POC junk that will die a slow and lonely death and something that delights and solves real user problems.</p>
</section>
<section id="gulf-of-generalisation" class="level3">
<h3 class="anchored" data-anchor-id="gulf-of-generalisation">Gulf of Generalisation</h3>
<p>Shreya didn’t really get into the details around the generalisation gulf in practical terms in this lesson, but I think this one can be a sort of place of comfort for the technically-minded to make refuge in. It’s one where there’s a ton of tools and technologies and techniques to play with, and vendors also live in this space and try to claim that their particular product or special sauce is <em>the</em> thing to help you and so on.</p>
</section>
</section>
<section id="the-improvement-loop-for-llm-applications" class="level2">
<h2 class="anchored" data-anchor-id="the-improvement-loop-for-llm-applications">The Improvement Loop for LLM Applications</h2>
<p>We also got a high-level overview of the loop that allows you to iteratively improve an LLM application:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-05-20-how-to-think-about-evals/analyze-measure-improve-loop.png" class="img-fluid figure-img"></p>
<figcaption>The analyse, measure and improve loop; adapted from an image used in the course</figcaption>
</figure>
</div>
<p>There’s a lot to unpack in all these different stages, and we didn’t really get into the details in the session today but you can see how this offers a really powerful way of thinking through what it means to iteratively improve an LLM application.</p>
<p>Learning how to implement this in a practical way will be the main thing I want to get good at by the end of this course. The process is made up of a bunch of techniques, but in my experience companies or use cases that struggle with improving what they built also lack the scaffold of this loop to orient themselves.</p>
</section>
<section id="prompting-through-the-lens-of-evals" class="level2">
<h2 class="anchored" data-anchor-id="prompting-through-the-lens-of-evals">Prompting through the lens of evals</h2>
<p>As we explored above, prompting is sort of the table stakes of improving your LLM application. In order to get good at prompting, it can help to appreciate what they are good at and what they struggle with. So, as Shreya put it, “leverage their strengths and anticipate their weaknesses” (when prompting).</p>
<p>At this point Shreya got into some points around what kinds of things went into a good prompt but I think I’ll write a separate blog on that and I don’t want to just regurgitate what we listened to. Today was more of a high-level introduction, and in any case it was much more about the outer-loop process instead of the inner loop (where tooling + specific techniques play more of a role.)</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/2025-05-20-how-to-think-about-evals/inner-vs-outer-loop.png" class="img-fluid figure-img"></p>
<figcaption>A slide from a talk I gave about the inner loop vs the outer loop of GenAI development</figcaption>
</figure>
</div>
<p>So it’s great that the course gets into the weeds (esp in the course materials, which include the draft of the book Hamel &amp; Shreya are writing) but I think the really useful thing they’re doing is situating the tactical improvements and techniques within the strategic patterns and workflows that teams and individuals should be doing to work on these LLM applications.</p>
<p>At a high level, what are we talking about:</p>
<ul>
<li>how to tease out failure scenarios for these applications and their behaviours</li>
<li>conversely, how to understand exactly which domains it does well for</li>
</ul>
</section>
<section id="things-i-want-to-think-about-more" class="level2">
<h2 class="anchored" data-anchor-id="things-i-want-to-think-about-more">Things I want to think about more</h2>
<p>There was a ton of really rich discussion around prompting in the Discord. I’m interested in exploring more:</p>
<ul>
<li>cross-provider prompting decisions (i.e.&nbsp;how prompting an OpenAI model differs from what you do with a Llama model or whatever)</li>
<li>prompts that work with reasoning models vs non-reasoning models</li>
<li>the tradeoffs of whether you put your instructions in system prompts vs user instructions</li>
</ul>
<p>In general there’s been a bunch of noise recently about so-called ‘leaked’ system prompts from a bunch of LLM API providers and I’ve mainly been struck by just how detailed they are. I consider myself pretty good at improving and iterating on prompts, but I’ll admit I’m not writing these multi-thousand word tomes. I’d like to explore which scenarios it makes sense to do so, and how to calculate at what point it makes sense from a cost or latency perspective to do so.</p>
<p>As I’m sure you can detect, I’m really enthusiastic about the lesson to come and will work in the meanwhile on some of the readings that have been set as well as the homework task of writing a system prompt for a LLM-powered recipe recommendation application!</p>


</section>

 ]]></description>
  <category>evals-course</category>
  <category>llms</category>
  <category>llmops</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-05-20-how-to-think-about-evals.html</guid>
  <pubDate>Mon, 19 May 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-05-20-how-to-think-about-evals/maven-course.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>First impressions of the new Gemini Deep Research (with 2.5 Pro)</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-04-09-first-impressions-of-the-new-gemini-deep-research-with-2-5-pro.html</link>
  <description><![CDATA[ 




<p>Google released an updated iteration of their Deep Research tool that uses the new 2.5 Pro model. This was taken from a post originally made on Twitter, so please excuse the terseness.</p>
<p>First impressions:</p>
<ul>
<li>a bit too eager to jump into a deep research task even when I just ask a clarifying question</li>
<li>quite verbose, just like the OpenAI version. Not sure why both play this up a lot. It looks impressive but in practice I think we need more entry points into this. The ‘Executive Summary’ and other concluding headers are nice touches but I feel maybe there should be some more adherence to user requests for short reports. (I get that as UI it’s maybe weird to think for 10 mins and then spit out a very concise version, but it might actually be more useful.)</li>
<li>I continue to be annoyed about how these Gemini DR reports handle footnotes (i.e.&nbsp;as endnotes whacked on at the end of the report). Almost a deal-breaker IMO.</li>
<li>It’s almost like GDR tries to show how scholarly and serious it is by giving you these walls of prose (vs OpenAI DR which throws in a lot more bullet points). Not sure one is better than the other but would appreciate a bit more flexibility!</li>
<li>The portability of these reports has always been <em>not great</em>. Yes you can export them to Google Docs but markdown (+ other options) would have been much better. In practice, this means that whenever I use GDR the report stays stuck there and I’m far less likely to share it with anyone, whereas the OpenAI DR reports I drop parts/all into a Github Gist etc.</li>
<li>These reports <em>have</em> been getting better and better, all things considered. I’ve been following along and using GDR from the early days (even pre-OpenAI DR) and this latest version is the best version of it so far (as you’d hope!)</li>
<li>(It’s also a little bit annoying that GDR has removed any way to use the older versions of GDR with Gemini Pro 2.0 and 1.5 etc. Makes it harder to actually compare these things.)</li>
<li>Please let’s get an ipad version of the Gemini iPad app soon, too? Feels a bit regressive to have to use GDR on the web interface always.</li>
<li>For serious research (as opposed to simply generating a nice report on some area where you don’t know much about already), all these tools remain hamstrung by the quality of the sources. In areas where I am (or very recently used to be) a leading scholar / researcher, the difference between what I’d expect (in terms of taste / discernment for picking out these sources) is especially egregious. Make the models better, yes, but have better filters + retrieval.</li>
</ul>
<p>So yeah, these tools are getting good! Kudos to the teams who are implementing this stuff. Hard to make it perform reproducibly well on so many open-ended uses. But more work to be done!</p>
<p>IMO the really great implementations of this ‘deep research’ pattern will all be in-house where you can have control over:</p>
<ul>
<li>source selection (i.e.&nbsp;high-quality inputs only, not just some random things on the internet)</li>
<li>how long it spends thinking about a particular area / loop of the research (or decides to backtrack and dig deeper etc)</li>
<li>output types / templates / length</li>
<li>different modalities of Q&amp;A (sometimes you want reports, other times you want a quick question answered, other times you want visual guides etc etc.)</li>
<li>different models for different kinds of tasks</li>
<li>possibly you have little sub-research agents / processes which will go off and work on some hypothesis, possibly involving actual datasets / analysis of tabular data etc, something clearly missing from the current versions we have</li>
</ul>
<p>A few other things:</p>
<ul>
<li>GDR’s ‘clarification step’ (which I’ve heard them discuss on podcasts etc) is not as good or useful as the OpenAI DR clarification questions. In practice, because it’s buried under a concealment button that you have to click etc, and where the entire UI seems to be screaming at you to ‘Start Research’, you basically never update or amend the research plan. And when you do, it’s really not clear what’s changed because you don’t get some feedback or diff that your comments were understood; you just get an entire new research plan (again buried under the concealment button)</li>
<li>Going forward we’re probably going to want / need ways of navigating the layers to this research. A global overview report will have subsections that (should you wish) can be expanded into their own more detailed or granular reports. This is how research works, after all. Not just endless new reports all trailed one after another pointed in the same direction.</li>
</ul>
<p>The other thing that I think we’re <em>really</em> going to need to work on is research taste. Like the LLMs that power them, GDR and OpenAI DR offer a level of research taste developed to the mean. (I know people are thinking about this since it came up on Dwarkesh’s podcast with the AI 2027 guys, but they were focused on scientific research.)</p>
<p>I think there’s not a single answer for this which is, again, why I see the end result as people bringing these things in-house where they get to develop and refine what makes their particular flavour of research unique. (In the human-generated research world this is very much the case, where certain institutions (or even particular authors) are known for how deep they go, or what kinds of sources they prefer, or how they choose to feature or highlight the primary sources they access, and so on.) There are many possible variations of how this manifest, and I hope that we’re headed into a world where all the AI ‘deep researchers’ will be unique and quirky in all the best senses of that word.</p>



 ]]></description>
  <category>agents</category>
  <category>google</category>
  <category>tools</category>
  <category>openai</category>
  <category>research</category>
  <guid>https://alexstrick.com/posts/2025-04-09-first-impressions-of-the-new-gemini-deep-research-with-2-5-pro.html</guid>
  <pubDate>Tue, 08 Apr 2025 22:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/cover-gdr.png" medium="image" type="image/png" height="89" width="144"/>
</item>
<item>
  <title>Learnings from a week of building with local LLMs</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-03-16-learnings-building-llms.html</link>
  <description><![CDATA[ 




<p>I took the past week off to work on a little side project. More on that at some point, but at its heart it’s an extension of what I worked on with my translation package <a href="https://mlops.systems/posts/2025-02-16-tinbox:-an-llm-based-document-translation-tool.html"><code>tinbox</code></a>. (The new project uses translated sources to bootstrap a knowledge database.) Building in an environment which has less pressure / deadlines gives you space to experiment, so I both tried out a bunch of new tools and also experimented with different ways of using my tried-and-tested development tools/processes.</p>
<p>Along the way, there were a bunch of small insights which occurred to me so I thought I’d write them down. As usual with this blog, I’m mainly writing for my future self but I think there might be parts that are useful for others! Apologies for the somewhat rushed nature of these observations; better I get the blog finished and published than not at all!</p>
<section id="local-models" class="level2">
<h2 class="anchored" data-anchor-id="local-models">🤖 Local Models</h2>
<p>During this project, I experimented with several local models, which continue to impress me with their evolving capabilities. The recent launch of <code>gemma3</code> was particularly timely - I found myself regularly using the 27B version, which performed admirably across various tasks.</p>
<p>There are three or four models I keep returning to. <code>mistral-small</code> stands out as an exceptional model that’s been relatively recently updated and seems a bit underrated / underappreciated. The original <code>mistral</code> model continues to hold up remarkably well, particularly for structured extraction tasks and general writing needs like summarization.</p>
<p>One important realization when working with real-world use cases: benchmarks can be deceptive. While helpful as general indicators, each model has its own strengths and quirks. Many newer models are heavily optimized for structured data extraction, but their performance ultimately depends on whether their training documents align with your specific use case. It’s crucial to test models against your actual requirements rather than relying solely on published benchmarks.</p>
<p>For robust results with local models, I’ve found that implementing a “reflection, iterate and improve” pattern significantly enhances performance. When you need a model to summarize or analyze content in a particular format, having a secondary model (or even the same model!) review the output against the original prompt requirements is incredibly valuable. This reviewer model can suggest improvements to better fulfill the original request. Running this loop for 2-5 iterations (depending on complexity) can yield results approaching those of proprietary models like Claude or GPT-4, which might achieve similar quality in a single pass. For local deployments, this iterative improvement pattern is essentially non-negotiable.</p>
<p>I also explored vision models, particularly <code>llava</code> and <code>llama-3.2-vision</code>. These were my primary tools for extracting context from images, generating captions, and analyzing visual content. Their effectiveness varies based on content type and language, but they represent impressive capabilities that can run entirely on local systems.</p>
<p>A significant portion of my work involved non-English languages, including some relatively rare ones. This is another area where benchmark claims about supporting “hundreds of languages” often don’t align with real-world performance. Models might list impressive language coverage in their specifications, but actual proficiency varies dramatically. It reinforces my earlier point - always verify benchmark claims against your specific use case before committing to a particular model.</p>
</section>
<section id="prompting-instruction-following" class="level2">
<h2 class="anchored" data-anchor-id="prompting-instruction-following">💬 Prompting &amp; Instruction Following</h2>
<p>Working extensively with various models during this project reinforced some fundamental insights about prompting that might seem basic, but prove critical in practical applications. These observations are particularly relevant when working with local models, though they apply to cloud-based systems as well.</p>
<p>Context matters significantly more than we might assume. While we’ve grown accustomed to proprietary models like Claude or GPT-4o performing admirably with minimal guidance, local models require more deliberate direction. The more relevant context you can provide (within reasonable token limits), the better your results will be. If you would naturally provide certain background information to a human performing the task, make sure to include it in your prompt to the model as well.</p>
<p>Another key insight: every model has its unique characteristics. Techniques that work brilliantly with one model might fall flat with another, especially in the local model ecosystem. They each require slightly different prompting approaches, specific phrasing patterns, and tailored guidance. This necessitates running small experiments to understand how different models respond to various prompting styles. It’s still more art than science, but this experimentation phase is crucial when implementing local models effectively.</p>
<p>Perhaps the most valuable lesson I rediscovered is that breaking complex tasks into smaller components yields superior results compared to using a single comprehensive prompt. This is particularly true with local models. When performing extensive data extraction or when dealing with structured data where the extraction targets differ significantly from each other, don’t expect the model to handle everything in one pass – even a human might struggle with such an approach.</p>
<p>Instead, break down the task into logical components, create targeted mini-prompts for each aspect, and then recombine the results once all the separate LLM calls are completed. Yes, this approach adds processing time and complexity, but the quality improvement is well worth the trade-off. When accuracy matters more than speed, this decomposition strategy consistently delivers better outcomes.</p>
</section>
<section id="process-tools" class="level2">
<h2 class="anchored" data-anchor-id="process-tools">🧰 Process &amp; Tools</h2>
<p>My development environment during this project provided plenty of opportunities to evaluate various tools and workflows. As context, I primarily work on a Mac while maintaining access to a separate (local) machine with GPU capabilities for more intensive tasks. This setup allows me to flexibly experiment with both local and cloud-based models.</p>
<p>For managing local models, <a href="https://ollama.ai/">Ollama</a> continues to be my go-to solution for downloading, running, and interfacing with these models. A recent discovery that significantly improved my workflow is <a href="https://boltai.com/">Bolt AI</a>, an excellent Mac interface that provides seamless switching between local Ollama models and cloud-based alternatives. If you’re working in a hybrid model environment, Bolt AI is definitely worth exploring.</p>
<p>I’ve also recently integrated <a href="https://openrouter.ai/">OpenRouter</a> into my toolkit, which solves the problem of managing countless API keys across different inference providers. OpenRouter not only offers native connections to many cloud providers but also allows you to incorporate your own API keys, streamlining access to a diverse model ecosystem through a unified interface. It also helps with setting spend limits on various models or projects.</p>
<p>In terms of development insights, I was impressed by how rapidly front-end development can progress with the assistance of models like Claude 3.7 and OpenAI’s O1-Pro. These models perform exceptionally well when supplemented with documentation (such as an <a href="https://llmstxt.org"><code>llms.txt</code> file</a>) alongside your prompts. While I can’t speak to their effectiveness with extremely complex applications or massive frontend codebases, they demonstrate remarkable proficiency with small to medium-sized projects.</p>
<p>A significant portion of my experimentation involved <a href="https://repoprompt.com">RepoPrompt</a>, a tool that recently transitioned from free beta to a paid license model. RepoPrompt addresses the challenge of getting your codebase into an LLM-friendly format. Unlike standard CLI tools that simply export code to clipboard or text files, RepoPrompt generates a structured XML representation that, when modified by an LLM and pasted back, creates a reviewable diff of the proposed changes. At least, that’s one of the things it allows you to do! It’s actually a bit more powerful / flexible than that and here’s a video so you can see it in action:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.youtube.com/watch?v=8zIY0zxcafE%20%22RepoPrompt%20Demo%20Video%20using%20O1%20Pro%22"><img src="https://img.youtube.com/vi/8zIY0zxcafE/0.jpg" class="img-fluid figure-img"></a></p>
<figcaption>RepoPrompt Demo Video</figcaption>
</figure>
</div>
<p>While tools like Cursor and Windsurf offer similar functionality, they tend to become less reliable as project complexity increases. RepoPrompt shines when paired with an OpenAI Pro subscription, enabling effective integration of models like O1 Pro and <code>o3-mini-high</code> into your development lifecycle. In my testing, the RepoPrompt + O1 Pro/O3 Mini High combination consistently delivered superior results compared to using Cursor with Claude 3.7 (even with ‘Thinking Mode’ enabled). Despite the occasional pauses while these models process complex problems, the quality improvement justifies the wait.</p>
<p>Additionally, I continued working with <a href="https://claude.ai/code">Claude Code</a> and <a href="https://codebuff.com/">CodeBuff</a>, both CLI-driven tools focused on code improvement. Of the two, CodeBuff has become my preferred option. Both tools require careful supervision—I typically keep Cursor open to monitor changes in real-time, occasionally needing to revert modifications or redirect the approach. These tools excel when you clearly articulate your objectives and maintain oversight of the implementation process. CodeBuff particularly impresses with larger codebases and demonstrates superior stability overall.</p>
<p>An interesting pattern emerged during development: whenever files approached 800-900 lines, it signaled the need to refactor into smaller submodules to maintain LLM comprehension, especially when using agent mode in Cursor. The modular approach significantly improved model performance.</p>
<p>I was genuinely surprised by the effectiveness of the RepoPrompt and O1 Pro combination. For smaller, targeted modifications, CodeBuff continues to demonstrate remarkable capability. While I didn’t evaluate these tools in conjunction with local models, I suspect such combinations would require more iterative refinement to achieve comparable results.</p>
</section>
<section id="software-engineering-patterns" class="level2">
<h2 class="anchored" data-anchor-id="software-engineering-patterns">🧑‍🔬 Software Engineering Patterns</h2>
<p>Throughout this experimental project, several software engineering principles proved particularly valuable when working with LLM-assisted development. These patterns aren’t revolutionary, but their importance amplifies in the context of AI-augmented workflows.</p>
<p>The principle of simplicity served as a cornerstone approach. Breaking development into the smallest logical next task repeatedly demonstrated its value, especially during the exploratory phases when project architecture was still taking shape. While some engineers might possess the cognitive bandwidth to fully conceptualize complex systems with perfect abstractions from the outset, I’ve found incremental development leads to more robust outcomes. This approach aligns naturally with how most developers actually think through problems and provides clear checkpoints for evaluating progress.</p>
<p>Data visibility emerged as another critical factor. When leveraging LLM-assisted coding, comprehensive logging becomes even more essential than in traditional development. Strategically placed log outputs create a diagnostic trail that proves invaluable when troubleshooting unexpected behaviors. This practice creates a feedback loop that strengthens both your understanding of the system and the LLM’s ability to assist effectively.</p>
<p>A particularly underappreciated practice I haven’t seen widely discussed is the importance of dead code detection. When working with LLM-assisted development, code cruft tends to accumulate more rapidly than in conventional programming. Tools like <a href="https://github.com/albertas/deadcode"><code>deadcode</code></a> and <a href="https://github.com/jendrikseipp/vulture"><code>vulture</code></a> provide static analysis of Python projects to identify unused functions and variables. Running these tools periodically helps maintain codebase clarity by flagging remnants that might otherwise cause confusion during review. I’m not certain whether newer tools like <code>ruff</code> from Astral include this functionality (particularly for function calls), but the capability is invaluable for maintaining a clean, navigable codebase.</p>
<p>Taking time to think offline—away from the keyboard—often yields surprising clarity. This deliberate pause creates space to articulate precisely what you need for the next development increment. When you can express your requirements with precision, the LLM’s output improves proportionally. Ambiguous instructions inevitably produce suboptimal results, whereas clarity fosters efficiency.</p>
<p>A final observation worth emphasizing: having experience as an engineer in the pre-LLM era remains tremendously advantageous. When confronting complex workflows involving chained LLM calls with interdependencies and reflection patterns, traditional debugging skills become indispensable. Knowing when to step away from AI assistance and dive into manual debugging with tools like <code>pdb</code>, stepping through code execution and inspecting variables directly, represents a crucial judgment call.</p>
<p>LLMs and coding agents often demonstrate a bias toward generating new code rather than methodically analyzing existing problems. Recognizing the moment when direct human intervention becomes more efficient than continually prompting an AI is a skill that comes with experience. Once you’ve manually identified the underlying issue, you can return to the LLM with precisely targeted prompts that yield superior results.</p>
</section>
<section id="appendix-1-fasthtml" class="level2">
<h2 class="anchored" data-anchor-id="appendix-1-fasthtml">🌐 Appendix 1: FastHTML</h2>
<p>As a practical addition to my experimentation, I implemented FastHTML for the first time to build a frontend for my knowledge base extraction assistant. The experience was remarkably frictionless, particularly when leveraging their <code>llms.txt</code> file—a markdown-formatted documentation set that integrates seamlessly with your frontend codebase when provided alongside prompts.</p>
<p>This approach works exceptionally well with models like O1 Pro or O3 Mini High, creating a development workflow that feels intuitive and responsive. Despite having substantial JavaScript experience from previous roles, I found FastHTML significantly more manageable than complex JavaScript frameworks that dominate the ecosystem today.</p>
<p>The reduced cognitive overhead and natural integration with Python-based workflows makes FastHTML a compelling choice for ML practitioners who prefer to minimize context-switching between languages and paradigms. The framework strikes an excellent balance between capability and simplicity that aligns perfectly with rapid prototyping and iterative development cycles common in ML projects. For those building interfaces to ML systems, it’s definitely worth considering as your frontend solution.</p>
</section>
<section id="appendix-2-ocr-translation" class="level2">
<h2 class="anchored" data-anchor-id="appendix-2-ocr-translation">📃 Appendix 2: OCR + Translation</h2>
<p>Another interesting challenge I tackled involved OCR and translation of handwritten documents in non-English languages—a task that proved impossible to accomplish in a single pass with local models, particularly for less common languages.</p>
<p>The solution emerged through methodical problem decomposition:</p>
<ol type="1">
<li>Breaking down PDFs into individual page images</li>
<li>Segmenting each page into overlapping image chunks (critical for handwriting where text may slant across traditional line boundaries)</li>
<li>Applying OCR to extract text in the original source language from each image segment</li>
<li>Using translation models to convert the extracted text to English</li>
</ol>
<p>This multi-stage pipeline allowed me to overcome the limitations of local models when confronted with the combined complexity of handwriting recognition and translation. Both <code>gemma3</code> and <code>llama-3.3</code> performed admirably within this decomposed workflow, demonstrating that even resource-constrained local deployments can achieve impressive results when problems are thoughtfully restructured.</p>
<p>This case exemplifies a core principle of effective ML implementation: when dealing with complex, multi-faceted challenges, breaking them into targeted sub-problems often yields better outcomes than attempting end-to-end solutions—especially when working with constrained computational resources. While this approach may increase processing time, the quality improvement justifies the trade-off for many practical applications.</p>


</section>

 ]]></description>
  <category>claude</category>
  <category>llm</category>
  <category>llms</category>
  <category>miniproject</category>
  <category>openai</category>
  <category>prompt-engineering</category>
  <category>softwareengineering</category>
  <category>tools</category>
  <guid>https://alexstrick.com/posts/2025-03-16-learnings-building-llms.html</guid>
  <pubDate>Sat, 15 Mar 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/learnings-building-llms/cover.png" medium="image" type="image/png" height="114" width="144"/>
</item>
<item>
  <title>Building an MCP Server for Beeminder: Connecting AI Assistants to Personal Data</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-21-beeminder-mcp.html</link>
  <description><![CDATA[ 




<p>I spent the morning <a href="https://github.com/strickvl/mcp-beeminder">building an MCP server</a> for <a href="https://www.beeminder.com">Beeminder</a>, bridging the gap between AI assistants and my personal goal tracking data. This project emerged from a practical need — ok, desire :) — to interact more effectively with my Beeminder data through AI interfaces like Claude Desktop and Cursor.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://alexstrick.com/posts/images/mcp-bm.png" class="img-fluid figure-img"></p>
<figcaption>The MCP-Beeminder mashup in action!</figcaption>
</figure>
</div>
<section id="understanding-beeminder" class="level2">
<h2 class="anchored" data-anchor-id="understanding-beeminder">Understanding Beeminder</h2>
<p>For those unfamiliar with <a href="https://www.beeminder.com">Beeminder</a>, it’s a tool that combines self-tracking with commitment devices to help users achieve their goals. The platform draws what they call a “Bright Red Line” – a visual commitment path that shows exactly where you need to be to stay on track. What makes Beeminder unique is its approach to accountability: users pledge real money to stay on their path, and there’s a seven-day “akrasia horizon” that prevents immediate goal changes, helping to overcome moments of impulsivity.</p>
<p>I’ve <a href="https://www.google.com/search?q=site%3Aalexstrick.com+beeminder">written a <em>lot</em> about Beeminder</a> over on my personal blog in the past so do go check that out if you’re interested to learn more about how I use it. I can attest that if it clicks with you, you’ll find it incredibly valuable. I have used in the past to write books, learn languages, <a href="https://www.alexstrick.com/blog/2016/8/phd-tools-beeminder">finish my PhD</a> and many many other things.</p>
</section>
<section id="the-role-of-mcp" class="level2">
<h2 class="anchored" data-anchor-id="the-role-of-mcp">The Role of MCP</h2>
<p>The <a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP)</a> serves as a standardised way for AI assistants to interact with various data sources and tools. Think of it as a universal adapter that allows AI systems to directly access and manipulate data in your applications. Instead of copying and pasting information between your AI assistant and Beeminder, MCP creates a secure, direct connection.</p>
<p>This standardisation is particularly valuable because it means you can build one interface that works across multiple AI platforms. Whether you’re using Claude Desktop, Cursor, or <a href="https://modelcontextprotocol.io/clients">other MCP-compatible tools</a>, the same server provides consistent access to your Beeminder data.</p>
</section>
<section id="building-the-server" class="level2">
<h2 class="anchored" data-anchor-id="building-the-server">Building the Server</h2>
<p>The development process was surprisingly straightforward, largely due to two factors: the well-documented MCP specification from Anthropic and an <a href="https://github.com/ianm199/beeminder_api_client">existing Python client</a> for Beeminder’s API by <a href="https://github.com/ianm199"><span class="citation" data-cites="ianm118">@ianm118</span></a>. Most of the implementation work involved mapping Beeminder’s API endpoints to MCP’s expected interfaces and ensuring proper error handling.</p>
<p>And obviously, much of the code was actually written by Claude itself. After providing the initial structure, writing a couple of tools the way I wanted them and providing documentation, I found that Claude could generate the remainder of the code, requiring only minor adjustments and debugging from me.</p>
</section>
<section id="using-the-beeminder-mcp-server" class="level2">
<h2 class="anchored" data-anchor-id="using-the-beeminder-mcp-server">Using the Beeminder MCP Server</h2>
<p>Having an MCP server for Beeminder opens up several practical possibilities. You can have natural conversations with AI assistants about your goals, analyse patterns in your data, and even update your tracking information – all while the AI has direct access to your actual Beeminder account. This direct connection means the AI can provide more contextual and accurate assistance, whether you’re adjusting goal parameters or analysing your progress trends.</p>
<p>I’ve found that sometimes Claude needs a bit of coaxing to display the information it’s getting back from the Beeminder API in appropriate formats, which is to say, in table format. I will probably update my Claude settings so that it knows it should use tables (either Markdown or React components) to display Beeminder results that would benefit from such a presentation.</p>
</section>
<section id="looking-forward" class="level2">
<h2 class="anchored" data-anchor-id="looking-forward">Looking Forward</h2>
<p>Now that I have my Beeminder MCP server, I also want one for <a href="https://www.omnigroup.com/omnifocus">Omnifocus</a>, my task management app of choice. That’ll probably have to wait since it doesn’t appear that they offer a REST API, but it’ll be great when I can mash up the results of those two tool queries as that’s what I currently do manually as part of my process.</p>
<p>The ease of building this MCP server suggests an interesting future where more of our tools and services become directly accessible to AI assistants. The real value isn’t in any single connection, but in the potential for creating a network of interconnected tools that AI can help us manage more effectively.</p>
<p>If you’re interested in trying this out yourself, you can find the code and setup instructions in the <a href="https://github.com/strickvl/mcp-beeminder">GitHub repository</a>. While this implementation focuses on Beeminder, the same principles could be applied to create MCP servers for other services and tools.</p>


</section>

 ]]></description>
  <category>tools</category>
  <category>anthropic</category>
  <category>claude</category>
  <category>miniproject</category>
  <guid>https://alexstrick.com/posts/2025-02-21-beeminder-mcp.html</guid>
  <pubDate>Thu, 20 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/mcp-bm.png" medium="image" type="image/png" height="172" width="144"/>
</item>
<item>
  <title>Tinbox: an LLM-based document translation tool</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-16-tinbox-an-llm-based-document-translation-tool.html</link>
  <description><![CDATA[ 




<p>Large Language Models have transformed how we interact with text, offering capabilities that seemed like science fiction just a few years ago. They can write poetry, generate code, and engage in sophisticated reasoning. Yet surprisingly, one seemingly straightforward task – document translation – remains a significant challenge. This is a challenge I understand intimately, both as a developer and as a historian who has spent years working with multilingual primary sources.</p>
<p>Before the era of LLMs, I spent years conducting historical research in Afghanistan, working extensively with documents in Dari, Pashto, and Arabic. This wasn’t just casual reading – it was deep archival work that resulted in publications like <a href="https://www.amazon.com/Poetry-Taliban-Columbia-Strick-Linschoten/dp/0231704046?dib_tag=AUTHOR&amp;ref_=ast_author_dp_rw&amp;dib=eyJ2IjoiMSJ9.1mVaySVqbTMQoyHNw9jr729HdyTrJqF63q_dK--vp8ZIMJilxU2L9GFro0SDlAHwkLCNvm6uzLyUFyfhIMgnqHsy6OcH29oAJydmPZBO_nk.1ryRpL7upcac5fthUBd3TPVG15nuXehSARppPP0GLz8&amp;tag=soumet-20">“Poetry of the Taliban”</a> and <a href="https://www.amazon.com/Taliban-Reader-Islam-Politics-their-ebook/dp/B07F37SB8S?sr=1-1&amp;qid=1739727492&amp;keywords=taliban%2Breader&amp;sprefix=taliban%2Breader%252Cdigital-text%252C182&amp;tag=soumet-20&amp;dib_tag=se&amp;crid=1YY9Y6SJN9YLO&amp;dib=eyJ2IjoiMSJ9.rJDPRcIzPe3NY83zHYXFevNXoERWWiJ6BTyj9SXHYfv_jixdQKUV27Qwh1NJYX0qMhAV02Z4r75o_tYUCysq_Obf_9wo3qk9AzLlZWyTWO-dSa88xUIQP3MCe9dgIUf2Okhj6DAyqjgHQDdgrivgTmN0eJNQ_IOp2MKhnWLbOpEcdZtxrI7VisVAITML4b4dwYyjKbfbKnsk1IHWtk_P0NU-XcV2ChEHBcbqlx3jh4OvKXYZ1h39-RJ7W5Tm1eo1R0T673keyXEstEi4j6msDfDu99000EMSNsvWmLIQAxPQKBsbMjEeDneokEA0-dM3.3Yy2jETrAfRcOZLuWM8Cc5f_sfS_RuVMsh0bpSCp6TA&amp;s=digital-text">“The Taliban Reader”</a>, projects that required painstaking translation work with teams of skilled translators. The process was time-consuming and resource-intensive, but it was the only way to make these primary sources accessible to a broader audience.</p>
<p>As someone who has dedicated significant time to making historical sources more accessible, I’ve watched the rise of LLMs with great interest. These models promise to democratise access to multilingual content, potentially transforming how historians and researchers work with primary sources. However, the reality has proven more complex. Current models, while powerful, often struggle with or outright refuse to translate certain content. This is particularly problematic when working with historical documents about Afghanistan – for instance, a 1984 document discussing the Soviet-Afghan conflict might be flagged or refused translation simply because it contains the word “jihad”, even in a purely historical context. The models’ aggressive content filtering, while well-intentioned, can make them unreliable for serious academic work.</p>
<p>After repeatedly bumping into these limitations in my own work, I built <a href="https://github.com/strickvl/tinbox"><code>tinbox</code></a> (shortened from ‘translation in a box’), a tool that approaches document translation through a different lens. What if we had a tool that could handle these sensitive historical texts without balking at their content? What if researchers could quickly get working translations of primary sources, even if they’re not perfect, to accelerate their research process? As a historian, having access to even rough translations of primary source materials would have dramatically accelerated my research process. As a developer, I knew we could build something better than the current solutions.</p>
<p>The name “tinbox” is a nod to the simple yet effective nature of the tool – it’s about taking the powerful capabilities of LLMs and packaging them in a way that actually works for real-world document translation needs. Whether you’re a researcher working with historical documents, an academic handling multilingual sources, or anyone needing to translate documents at scale, <a href="https://github.com/strickvl/tinbox"><code>tinbox</code></a> aims to provide a more reliable and practical solution.</p>
<section id="the-hidden-complexity-of-document-translation" class="level2">
<h2 class="anchored" data-anchor-id="the-hidden-complexity-of-document-translation">The Hidden Complexity of Document Translation</h2>
<p>The problem of document translation sits at an interesting intersection of challenges. On the surface, it might seem straightforward – after all, if an LLM can engage in complex dialogue, surely it can translate a document? It can, but there are some edge cases and limitations.</p>
<p>When working with real-world documents, particularly PDFs, we encounter a cascade of complications. First, there’s the issue of model refusal. LLMs frequently decline to translate documents, citing copyright concerns or content sensitivity. This isn’t just an occasional hiccup – it’s a systematic limitation occurring regularly that makes these models unreliable for production use out of the box.</p>
<p>Then there’s the scale problem. Most documents aren’t just a few paragraphs; they’re often dozens or hundreds of pages long. This runs headlong into the context window limitations of current models. Breaking documents into smaller chunks might seem like an obvious solution, but this introduces its own set of challenges. How do you maintain coherence across chunks? What happens when a sentence spans two pages? How do you handle formatting and structure?</p>
<p>The PDF format adds another layer of complexity. Most existing tools rely on Optical Character Recognition (OCR), which introduces its own set of problems. OCR can mangle formatting, struggle with complex layouts, and introduce errors that propagate through to the translation. Even when OCR works perfectly, you’re still left with the challenge of maintaining the document’s original structure and presentation.</p>
</section>
<section id="a-word-about-translations-fidelity-and-accuracy" class="level2">
<h2 class="anchored" data-anchor-id="a-word-about-translations-fidelity-and-accuracy">A Word About Translations, Fidelity and Accuracy</h2>
<p>Having worked professionally as a translator and worked as an editor for teams of translators, I’m acutely aware of the challenges and limitations of LLM-provided translations. While these models have made remarkable strides, they face several significant hurdles that are worth examining in detail.</p>
<p>One of the most prominent issues is consistency. LLMs often struggle to maintain consistent terminology across multiple API calls, which becomes particularly evident in longer documents. Technical terms, product names, and industry-specific jargon might be translated differently each time they appear, creating confusion and reducing the professional quality of the output. This problem extends beyond mere terminology – the writing style and tone can drift significantly between chunks of text, especially when using the chunking approach necessary for longer documents. You might find yourself with a document that switches unexpectedly between formal and informal registers, or that handles technical depth inconsistently across sections.</p>
<p>Even formatting poses challenges. The way LLMs handle structural elements like bullet points, numbered lists, or text emphasis can vary dramatically across sections. What starts as a consistently formatted document can end up with a patchwork of different styling approaches, requiring additional cleanup work.</p>
<p>Perhaps more fundamentally, LLMs struggle to find the right balance between literal and fluent translation. Sometimes they produce awkwardly literal translations that technically convey the meaning but lose the natural flow of the target language. Other times, they swing too far in the opposite direction, producing fluid but unfaithful translations that lose important nuances from the source text. This challenge becomes particularly acute when dealing with idioms and cultural references, where literal translation would be meaningless but too free a translation risks losing the author’s intent.</p>
<p>Cultural nuances present another significant challenge. LLMs often miss or mishandle culture-specific references, humour, and wordplay. They struggle with regional variations in language and historical context, potentially stripping away layers of meaning that a human translator would carefully preserve. This limitation becomes even more apparent in specialised fields – medical texts, legal documents, technical manuals, and academic writing all require domain expertise that LLMs don’t consistently demonstrate.</p>
<p>The technical limitations of these models add another layer of complexity. The necessity of breaking longer texts into chunks means that broader document context can be lost, making it difficult to maintain coherence across section boundaries. While tools like <code>tinbox</code> attempt to address this through seam repair and sliding window approaches, it remains a significant challenge. Cross-references between different parts of the document might be missed, and maintaining a consistent voice across a long text can prove difficult.</p>
<p>Format-specific problems abound as well. Tables and figures might be misinterpreted, special characters can be mangled, and the connections between footnotes or endnotes and their references might be lost. Page layout elements can be corrupted in the translation process, requiring additional post-processing work.</p>
<p>Reliability and trust present another set of concerns. LLMs are prone to hallucination, sometimes adding content that wasn’t present in the original text or filling in perceived gaps with invented information. They might create plausible but incorrect translations or embellish technical details. Moreover, they provide no indication of their confidence in different parts of the translation, no flags for potentially problematic passages, and no highlighting of ambiguous terms or phrases that might benefit from human review.</p>
<p>When it comes to handling source texts, LLMs show particular weakness with poor quality inputs. They struggle with grammatically incorrect text, informal or colloquial language, and dialectal variations. Their handling of abbreviations and acronyms can be inconsistent, potentially introducing errors into technical or specialised documents.</p>
<p>The ethical and professional implications of these limitations are significant. There’s often a lack of transparency about the translation process, no clear audit trail for translation decisions, and limited ability to explain why particular choices were made. This raises concerns about professional displacement – not just in terms of jobs, but in terms of the valuable human judgment that professional translators bring to sensitive translations, the opportunity for cultural consultation, and the role of specialist translators in maintaining high standards in their fields.</p>
<p>These various limitations underscore an important point: while LLMs are powerful tools for translation, they should be seen as aids to human translators rather than replacements, especially in contexts requiring high accuracy, cultural sensitivity, technical precision, legal compliance, or creative fidelity. The future of translation likely lies in finding ways to combine the efficiency and broad capabilities of LLMs with the nuanced understanding and expertise of human translators.</p>
<p>So why build a tool like this given all these problems? I think there’s still a use for something like this in fields where there are few translators and a huge backlog of materials where there’s a benefit to reading them in your own mother tongue, even in a ‘bad’ translation. (That said, having done a decent amount of comparison of outputs for languages like Arabic, Dari and Pashto, I actually don’t find the translations to be terrible, especially for domains like the news or political commentary.) For myself, I am working on a separate tool or system which takes in primary sources and incrementally populates a knowledge database. Having ways to ingest materials written in foreign languages is incredibly important for this, and having a way to do it that doesn’t break the bang (i.e.&nbsp;by using local models) is similarly important.</p>
</section>
<section id="engineering-a-solution" class="level2">
<h2 class="anchored" data-anchor-id="engineering-a-solution">Engineering a Solution</h2>
<p><code>tinbox</code> takes a simple approach to solving these issues through two core algorithmic features. The first is what I call “page-by-page with seam repair.” Instead of treating a document as one continuous piece of text, we acknowledge its natural segmentation into pages. Each page is translated independently, but – and this is crucial – we then apply a repair process to the seams between pages.</p>
<p>This seam repair is where things get interesting. When a sentence spans a page boundary, we identify the overlap and re-translate that specific section with full context from both pages. This ensures that the translation flows naturally, even across page boundaries. It’s a bit like being a careful tailor, making sure the stitches between pieces of fabric are invisible in the final garment.</p>
<p>For continuous text documents (read: a <code>.txt</code> file containing multiple tens of thousands of words), we take a different approach using a sliding window algorithm. Think of it like moving a magnifying glass across the text, where the edges of the glass overlap with the previous and next positions. This overlap is crucial – it provides the context necessary for coherent translation across chunk boundaries.</p>
<p>The implementation details matter here. We need to carefully manage memory, handle errors gracefully, and provide progress tracking for long-running translations. The codebase is structured around clear separation of concerns, making it easy to add support for new document types or translation models.</p>
<p>Moreover, we need to ensure that in the case of failure we’re able to resume without wasting what we spent translating earlier parts of the document.</p>
</section>
<section id="the-engineering-details" class="level2">
<h2 class="anchored" data-anchor-id="the-engineering-details">The Engineering Details</h2>
<p>The architecture reflects these needs. At its core, <code>tinbox</code> uses a modular design that separates document processing from translation logic. This allows us to handle different document types (PDFs, Word documents, plain text) with specialised processors while maintaining a consistent interface for translation.</p>
<p>Error handling is particularly crucial. Translation is inherently error-prone, and when you’re dealing with large documents, you need robust recovery mechanisms. We implement comprehensive retry logic with exponential backoff, ensuring that temporary failures (like rate limits) don’t derail entire translation jobs.</p>
<p>For large documents, we provide checkpointing and progress tracking. This means you can resume interrupted translations and get detailed insights into the translation process. The progress tracking isn’t just about displaying a percentage – it provides granular information about token usage, costs, and potential issues.</p>
<section id="page-by-page-with-seam-repair" class="level3">
<h3 class="anchored" data-anchor-id="page-by-page-with-seam-repair">Page-by-Page with Seam Repair</h3>
<p>The page-by-page algorithm handles PDFs by treating each page as a separate unit while ensuring smooth transitions between pages. Pseudocode that can help you understand how this works goes something like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> translate_with_seam_repair(document, overlap_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>):</span>
<span id="cb1-2">    translated_pages <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-3">    </span>
<span id="cb1-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> page_num, page <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(document.pages):</span>
<span id="cb1-5">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Translate current page</span></span>
<span id="cb1-6">        current_translation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translate_page(page)</span>
<span id="cb1-7">        </span>
<span id="cb1-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> page_num <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb1-9">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract and repair the seam between pages</span></span>
<span id="cb1-10">            previous_end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translated_pages[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>overlap_size:]</span>
<span id="cb1-11">            current_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> current_translation[:overlap_size]</span>
<span id="cb1-12">            </span>
<span id="cb1-13">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Re-translate the overlapping section with full context</span></span>
<span id="cb1-14">            repaired_seam <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translate_with_context(</span>
<span id="cb1-15">                text<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>current_start,</span>
<span id="cb1-16">                previous_context<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>previous_end</span>
<span id="cb1-17">            )</span>
<span id="cb1-18">            </span>
<span id="cb1-19">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Update translations with repaired seam</span></span>
<span id="cb1-20">            translated_pages[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translated_pages[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>overlap_size] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> repaired_seam</span>
<span id="cb1-21">            current_translation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> repaired_seam <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> current_translation[overlap_size:]</span>
<span id="cb1-22">        </span>
<span id="cb1-23">        translated_pages.append(current_translation)</span>
<span id="cb1-24">    </span>
<span id="cb1-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.join(translated_pages)</span></code></pre></div></div>
</section>
<section id="sliding-window-for-text-documents" class="level3">
<h3 class="anchored" data-anchor-id="sliding-window-for-text-documents">Sliding Window for Text Documents</h3>
<p>For continuous text documents, we use a sliding window approach. Again, pseudocode to help understand how this works goes something like this, though the actual implementation is different:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> translate_with_sliding_window(text, window_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>, overlap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>):</span>
<span id="cb2-2">    chunks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb2-3">    position <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-4">    </span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> position <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(text):</span>
<span id="cb2-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create window with overlap</span></span>
<span id="cb2-7">        end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(text), position <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> window_size)</span>
<span id="cb2-8">        window <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text[position:end]</span>
<span id="cb2-9">        </span>
<span id="cb2-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Translate window</span></span>
<span id="cb2-11">        translation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translate_window(window)</span>
<span id="cb2-12">        chunks.append(translation)</span>
<span id="cb2-13">        </span>
<span id="cb2-14">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Slide window forward, accounting for overlap</span></span>
<span id="cb2-15">        position <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> overlap</span>
<span id="cb2-16">    </span>
<span id="cb2-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> merge_chunks(chunks, overlap)</span></code></pre></div></div>
</section>
<section id="cli-usage-examples" class="level3">
<h3 class="anchored" data-anchor-id="cli-usage-examples">CLI Usage Examples</h3>
<p>The tool provides a simple command-line interface:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Basic translation of a PDF to Spanish</span></span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tinbox</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--to</span> es document.pdf</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Specify source language and model</span></span>
<span id="cb3-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tinbox</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--from</span> zh <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--to</span> en <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> anthropic:claude-3-5-sonnet-latest chinese_doc.pdf</span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use local model via Ollama for sensitive content</span></span>
<span id="cb3-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tinbox</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> ollama:mistral-small <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--to</span> en sensitive_doc.pdf</span>
<span id="cb3-9"></span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Advanced options for large documents</span></span>
<span id="cb3-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tinbox</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--to</span> fr <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--algorithm</span> sliding-window <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb3-12">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--window-size</span> 3000 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--overlap</span> 300 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb3-13">       large_document.txt</span></code></pre></div></div>
</section>
</section>
<section id="other-notable-features" class="level2">
<h2 class="anchored" data-anchor-id="other-notable-features">Other notable features</h2>
<p>The CLI interface for <code>tinbox</code> currently is built on top of <code>litellm</code> so it technically supports most models you might want to use with it, though I’ve only enabled OpenAI, Anthropic, Google/Gemini and Ollama as base providers for now.</p>
<p>The Ollama support was one I was keen to offer since translation is such a token-heavy task. I also really worry about the level of sensitivity / monitoring on the cloud APIs and have run into that in the past (particularly with regard to my previous work as a historian working on issues relating to Afghanistan). Ollama-provided local models should solve that issue, perhaps at the expense of access to the very latest and greatest models.</p>
</section>
<section id="things-still-to-be-done" class="level2">
<h2 class="anchored" data-anchor-id="things-still-to-be-done">Things still to be done</h2>
<p>There’s lots of improvements still to be made. I’m particularly interested in exploring semantic section detection, which could make the chunking process more intelligent. There’s also work to be done on preserving more complex document formatting and supporting additional output formats.</p>
<p>Currently the tool is driven by whatever you tell it to do. Most decisions are in your hands. You have to choose the model to use for translation, notably. I am most interested in using this tool for some other side-projects and for low-resource languages so one of the important things I’ll be doing is to pick sensible defaults depending on the language and input document type you choose.</p>
<p>For example, some vision language models like GPT-4o are able to handle translating directly from an image in Urdu to English, the open-source versions (like <code>llama3.2-vision</code>) struggle much more with these kinds of tasks so it’s possible I might even need to insert an intermediary step of transcribe, then translate the transcribed text into English etc. In fact, for highest-fidelity of translation I almost certainly might want to enable that option.</p>
<p>The code is available at <a href="https://github.com/strickvl/tinbox">GitHub</a>, and I welcome contributions and feedback.</p>


</section>

 ]]></description>
  <category>translation</category>
  <category>llm</category>
  <category>llms</category>
  <category>languages</category>
  <category>research</category>
  <category>miniproject</category>
  <category>python</category>
  <category>tools</category>
  <guid>https://alexstrick.com/posts/2025-02-16-tinbox-an-llm-based-document-translation-tool.html</guid>
  <pubDate>Sat, 15 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/tinbox-gh-small.png" medium="image" type="image/png" height="99" width="144"/>
</item>
<item>
  <title>Starting the Hugging Face Agents course</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-11-starting-the-hugging-face-agents-course.html</link>
  <description><![CDATA[ 




<p>I finished the first unit of the <a href="https://huggingface.co/learn/agents-course/">Hugging Face Agents course</a>, at least the reading part. I still want to play around with the code a bit more, since I imagine we’ll be doing that more going forward. In the meanwhile I wanted to write up some reflections on the course materials from unit one, in no particular order…</p>
<section id="code-agents-prominence" class="level2">
<h2 class="anchored" data-anchor-id="code-agents-prominence">Code agents’ prominence</h2>
<p>The course materials and <code>smolagents</code> in general places special emphasis on code agents, citing <a href="https://huggingface.co/papers/2402.01030">multiple</a> <a href="https://huggingface.co/papers/2411.01747">research</a> <a href="https://huggingface.co/papers/2401.00812">papers</a> and they <em>seem</em> to make some solid arguments for it but it also seems pretty risk at the same time. Having code agents instead of pre-defined tool use is good because:</p>
<blockquote class="blockquote">
<p><strong>Composability</strong>: could you nest JSON actions within each other, or define a set of JSON actions to re-use later, the same way you could just define a python function?</p>
<p><strong>Object management</strong>: how do you store the output of an action like generate_image in JSON?</p>
<p><strong>Generality</strong>: code is built to express simply anything you can have a computer do.</p>
<p><strong>Representation in LLM training data</strong>: plenty of quality code actions is already included in LLMs’ training data which means they’re already trained for this!</p>
</blockquote>
<p>The thing that gives me pause is that it seems like we moved through the spectrum from highly structured and known workflows (a chain, perhaps, or even something like a DAG) to tool use in a loop (which had some arbitrary or dynamic parts but ultimately was at least a little defined), and all the way out then to code agents where basically anything is possible.</p>
<p>If I think about this as an engineer tasked with building a robust, dependable and reliable system, then the <em>last</em> thing I think I want to add into the system is an agent that can basically do any thing under the sun (i.e.&nbsp;code agents). Perhaps I’m misrepresenting the position here of code agents, so I’m looking forward to reading the papers cited above as well as understanding it more from the course authors’ perspective.</p>
</section>
<section id="evals-testing" class="level2">
<h2 class="anchored" data-anchor-id="evals-testing">Evals &amp; testing</h2>
<p>Following on to my confusion around code agents, I’m very curious how the course will recommend one tests and evaluates these arbitrary code agents. Things I could imagine:</p>
<ul>
<li>testing out the specific scenarios that your application or use case requires (i.e.&nbsp;end to end)</li>
<li>testing out each component of the system, such as you can break it down into smaller sub-components</li>
<li>including things like linting / unit tests maybe once code is generated by the agent (?) i.e.&nbsp;real-time evaluation of the robustness of the system?</li>
<li>probably LLM as a judge somewhere in the mix, though that opens up its own can of worms…</li>
</ul>
<p>I do hope they talk about that in the later units of the course.</p>
</section>
<section id="general-patterns" class="level2">
<h2 class="anchored" data-anchor-id="general-patterns">General patterns</h2>
<p>The core loop that came up in unit 1 was:</p>
<blockquote class="blockquote">
<p>plan -&gt; act -&gt; feedback/reflection</p>
</blockquote>
<p>And all of that gets packaged up in a loop and repeated in various forms depending on exactly how you’re using it. And this pattern is related to the ReACT loop which lots of people cite but seems to be a specific version of the general idea mentioned above.</p>
<p>And the fact that all of this works is somehow all powered by the very useful enablement of tool use, which is itself powered by the fact that the model providers finetuned this ability into the models. Crazy, brittle, impressive and many other words for the fact that this ‘hack’ has such power.</p>
</section>
<section id="chat-templates" class="level2">
<h2 class="anchored" data-anchor-id="chat-templates">Chat templates</h2>
<p>I liked how the unit really impresses on you the impact and importance of chat templates as the <em>real</em> way that LLMs are implemented. You may pass in your requests through a handy Python SDK, passing your tools as a list of function definitions, but in the end this is all being parsed down and out into very precise syntax with many tokens not intended for human consumption.</p>
</section>
<section id="points-of-leverage" class="level2">
<h2 class="anchored" data-anchor-id="points-of-leverage">Points of leverage</h2>
<p>At the end of the unit, I was thinking about all the places where an engineer has leverage over agents. What I could initially think of was:</p>
<ul>
<li>the variety and usefulness of tools that you provide to your agent (or perhaps the extent to which you allow your code agent to ‘write’ things out into the world)</li>
<li>the discrimination in the volume or choice of a combination of tools or APIs</li>
<li>how you chain everything together</li>
<li>(how robustly you handle failure)</li>
</ul>
<p>Beyond that there are quite a few things that are somewhat out of your hands unless you decide to custom finetune your own models for a specific use case.</p>
<p>Overall it was a good start to the course: made me think and also got my hands dirty working on a very simple agent with tools using <code>smolagent</code> and a Gradio demo app in the Hugging Face Hub. I’ll write more after unit two next week.</p>


</section>

 ]]></description>
  <category>agents</category>
  <category>huggingface</category>
  <category>skillbuilding</category>
  <category>llmops</category>
  <category>llms</category>
  <guid>https://alexstrick.com/posts/2025-02-11-starting-the-hugging-face-agents-course.html</guid>
  <pubDate>Mon, 10 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/agents-certificate.png" medium="image" type="image/png" height="101" width="144"/>
</item>
<item>
  <title>AI Engineering Architecture and User Feedback</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-09-ai-eg-chapter-10.html</link>
  <description><![CDATA[ 




<p>Chapter 10 of Chip Huyen’s “AI Engineering,” focuses on two fundamental aspects: architectural patterns in AI engineering and methods for gathering and using user feedback. The chapter presents a progressive architectural framework that evolves from simple API calls to complex agent-based systems, while also diving deep into the crucial aspect of user feedback collection and analysis.</p>
<section id="progressive-architecture-patterns" class="level2">
<h2 class="anchored" data-anchor-id="progressive-architecture-patterns">1. Progressive Architecture Patterns</h2>
<p>The evolution of AI engineering architecture typically follows a pattern of increasing complexity and capability. Each stage builds upon the previous one, adding new functionality while managing increased complexity.</p>
<section id="base-layer-direct-model-integration" class="level3">
<h3 class="anchored" data-anchor-id="base-layer-direct-model-integration">Base Layer: Direct Model Integration</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/simple.png" class="img-fluid"></p>
<p>The simplest architectural pattern begins with direct queries to model APIs. While straightforward, this approach lacks the sophistication needed for most production applications.</p>
</section>
<section id="enhancement-layer-context-augmentation" class="level3">
<h3 class="anchored" data-anchor-id="enhancement-layer-context-augmentation">Enhancement Layer: Context Augmentation</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/context.png" class="img-fluid"></p>
<p>The first major enhancement comes through <strong>Retrieval-Augmented Generation (RAG)</strong>. This layer enriches model responses by incorporating custom data and sources into LLM queries, significantly improving response quality and relevance.</p>
</section>
<section id="protection-layer-guardrails-implementation" class="level3">
<h3 class="anchored" data-anchor-id="protection-layer-guardrails-implementation">Protection Layer: Guardrails Implementation</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/guardrails.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p><strong>Guardrails</strong>: Protective mechanisms that filter both inputs and outputs to ensure system safety and reliability.</p>
</blockquote>
<p>The protection layer implements two types of guardrails:</p>
<ol type="1">
<li><p><strong>Input Guardrails</strong>: Filter sensitive information before it reaches the LLM, such as:</p>
<ul>
<li>Personal customer information</li>
<li>API keys</li>
<li>Other confidential data</li>
</ul></li>
<li><p><strong>Output Guardrails</strong>: Monitor and manage model outputs for:</p>
<ul>
<li>Format compliance (e.g., valid JSON)</li>
<li>Factual consistency</li>
<li>Hallucination detection</li>
<li>Toxic content filtering</li>
<li>Privacy protection</li>
</ul></li>
</ol>
</section>
<section id="routing-layer-gateway-and-model-selection" class="level3">
<h3 class="anchored" data-anchor-id="routing-layer-gateway-and-model-selection">Routing Layer: Gateway and Model Selection</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/extra_modules.png" class="img-fluid"></p>
<p>This layer introduces two key components:</p>
<blockquote class="blockquote">
<p><strong>AI Gateway</strong>: A centralized access point for LLM interactions that manages costs, usage tracking, and API key abstraction.</p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Model Router</strong>: An intent classifier that directs queries to appropriate models based on complexity and requirements.</p>
</blockquote>
<p>The routing layer enables cost optimization by directing simpler queries (like FAQ responses) to less expensive models while routing complex tasks to more sophisticated systems.</p>
</section>
<section id="performance-layer-caching-strategies" class="level3">
<h3 class="anchored" data-anchor-id="performance-layer-caching-strategies">Performance Layer: Caching Strategies</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/cache.png" class="img-fluid"></p>
<p>The architecture implements two distinct caching approaches:</p>
<ol type="1">
<li><p><strong>Exact Caching</strong>:</p>
<ul>
<li>Stores identical queries and their responses</li>
<li>Particularly valuable for multi-step operations</li>
<li>Requires careful consideration of cache eviction policies:
<ul>
<li>Least Recently Used (LRU)</li>
<li>Least Frequently Used (LFU)</li>
<li>First In, First Out (FIFO)</li>
</ul></li>
</ul></li>
<li><p><strong>Semantic Caching</strong>:</p>
<ul>
<li>Uses embedding-based search to identify similar queries</li>
<li>Depends on high-quality embeddings and reliable similarity metrics</li>
<li>More prone to failure due to component complexity</li>
</ul></li>
</ol>
<blockquote class="blockquote">
<p><strong>Security Note</strong>: Cache implementations must carefully consider potential data leaks between users accessing similar queries.</p>
</blockquote>
</section>
<section id="agent-layer-advanced-functionality" class="level3">
<h3 class="anchored" data-anchor-id="agent-layer-advanced-functionality">Agent Layer: Advanced Functionality</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/write.png" class="img-fluid"></p>
<p>The final architectural layer introduces agent patterns, enabling:</p>
<ul>
<li>Retry loops for reliability</li>
<li>Tool usage capabilities</li>
<li>Action execution (email sending, file operations)</li>
<li>Complex workflow orchestration</li>
</ul>
</section>
</section>
<section id="monitoring-and-observability" class="level2">
<h2 class="anchored" data-anchor-id="monitoring-and-observability">Monitoring and Observability</h2>
<p>The complete architecture requires robust monitoring systems tracking key metrics:</p>
<ul>
<li><strong>Mean Time to Detection (MTTD)</strong>: Time to identify issues</li>
<li><strong>Mean Time to Response (MTTR)</strong>: Time to resolve detected issues</li>
<li><strong>Change Failure Rate (CFR)</strong>: Percentage of deployments requiring fixes</li>
</ul>
<p>The monitoring system should track:</p>
<ul>
<li>Factual consistency</li>
<li>Generation relevancy</li>
<li>Safety metrics (toxicity, PII detection)</li>
<li>Model quality through conversational signals</li>
<li>Component-specific metrics (RAG, generation, vector database performance)</li>
</ul>
<section id="ai-pipeline-orchestration" class="level3">
<h3 class="anchored" data-anchor-id="ai-pipeline-orchestration">AI Pipeline Orchestration</h3>
<p>a discussion of AI pipeline orchestration, addressing the trade-offs between using existing frameworks (Langchain, Haystack, Llama Index) versus custom implementations. This decision should be based on specific project requirements, team expertise, and maintenance considerations.</p>
</section>
</section>
<section id="user-feedback-systems" class="level2">
<h2 class="anchored" data-anchor-id="user-feedback-systems">2. User Feedback Systems</h2>
<p>The second major focus of the chapter explores comprehensive user feedback collection and utilization strategies.</p>
<section id="feedback-collection-methods" class="level3">
<h3 class="anchored" data-anchor-id="feedback-collection-methods">Feedback Collection Methods</h3>
<ol type="1">
<li><p><strong>Direct Feedback</strong>:</p>
<ul>
<li>Explicit mechanisms (thumbs up/down)</li>
<li>Rating systems</li>
<li>Free-form comments</li>
</ul></li>
<li><p><strong>Implicit Feedback</strong>:</p>
<ul>
<li>Early termination patterns</li>
<li>Error corrections</li>
<li>Sentiment analysis</li>
<li>Response regeneration requests</li>
<li>Dialogue diversity metrics</li>
</ul></li>
</ol>
</section>
<section id="feedback-collection-timing" class="level3">
<h3 class="anchored" data-anchor-id="feedback-collection-timing">Feedback Collection Timing</h3>
<p>Feedback can be gathered at various stages:</p>
<ul>
<li>Initial user preference specification</li>
<li>During negative experiences</li>
<li>When model confidence is low</li>
<li>Through comparative choice interfaces (e.g., ChatGPT’s response preference selection)</li>
</ul>
</section>
<section id="feedback-limitations" class="level3">
<h3 class="anchored" data-anchor-id="feedback-limitations">Feedback Limitations</h3>
<blockquote class="blockquote">
<p><strong>Feedback Bias</strong>: User feedback systems inherently contain various biases that must be considered when making system improvements.</p>
</blockquote>
<p>Key limitations include:</p>
<ul>
<li>Negative experience bias (users more likely to report negative experiences)</li>
<li>Self-selection bias in respondent demographics</li>
<li>Preference and position biases</li>
<li>Potential feedback loops affecting system evolution</li>
</ul>
</section>
<section id="implementation-considerations" class="level3">
<h3 class="anchored" data-anchor-id="implementation-considerations">Implementation Considerations</h3>
<p>The implementation of feedback systems requires careful attention to:</p>
<ul>
<li>UI/UX design for feedback collection</li>
<li>Balance between different user needs</li>
<li>Monitoring feedback impact on system performance</li>
<li>Regular inspection of production data</li>
<li>Detection of system drift (prompts, user behavior, model changes)</li>
</ul>


</section>
</section>

 ]]></description>
  <category>books-i-read</category>
  <category>llm</category>
  <category>llms</category>
  <category>llmops</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-02-09-ai-eg-chapter-10.html</guid>
  <pubDate>Sat, 08 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-02-09-ai-eg-chapter-10/write.png" medium="image" type="image/png" height="103" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ chapter 9: Inference Optimisation</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-07-ai-engineering-chapter-9.html</link>
  <description><![CDATA[ 




<p>What follows are my notes on chapter 9 of Chip Huyen’s ‘AI Engineering’ book. This chapter was on optimising your inference and I learned a lot while reading it! There are interesting techniques like prompt caching and architectural considerations that I was vaguely aware of but hadn’t fully appreciated how they might work in real inference systems.</p>
<section id="chapter-9-overview" class="level2">
<h2 class="anchored" data-anchor-id="chapter-9-overview">Chapter 9: Overview</h2>
<p>Machine learning inference optimization operates across three fundamental domains: model optimization, hardware optimization, and service optimization. While hardware optimization often requires significant investment and may offer limited individual leverage, model and service optimizations provide substantial opportunities for AI engineers to improve performance.</p>
<blockquote class="blockquote">
<p><strong>Critical Cost Insight</strong>: A 2023 survey revealed that inference can account for up to 90% of machine learning costs in deployed AI systems, often exceeding training costs. This emphasizes why inference optimization isn’t just an engineering challenge - it’s a critical business necessity.</p>
</blockquote>
</section>
<section id="core-concepts-and-bottlenecks" class="level2">
<h2 class="anchored" data-anchor-id="core-concepts-and-bottlenecks">Core Concepts and Bottlenecks</h2>
<p>Understanding inference bottlenecks is essential for effective optimization. Two primary types of computational bottlenecks impact inference performance:</p>
<blockquote class="blockquote">
<p><strong>Compute-Bound Bottlenecks</strong>: Tasks that are limited by raw computational capacity, typically involving complex mathematical operations that take significant time to complete. These bottlenecks are particularly evident in computationally intensive operations within neural networks.</p>
<p><strong>Memory Bandwidth-Bound Bottlenecks</strong>: Limitations arising from data transfer requirements between system components, particularly between memory and processors. This becomes especially relevant in Large Language Models where significant amounts of data need to be moved between different memory hierarchies.</p>
</blockquote>
<p>In Large Language Models (LLMs), different operations exhibit varying profiles of these bottlenecks. This understanding has led to architectural decisions such as decoupling the prefilling step from the decode step in production environments - a practice that has become increasingly common as organizations optimize their inference pipelines.</p>
</section>
<section id="inference-apis-and-service-patterns" class="level2">
<h2 class="anchored" data-anchor-id="inference-apis-and-service-patterns">Inference APIs and Service Patterns</h2>
<p>Two fundamental approaches to inference deployment exist:</p>
<ol type="1">
<li><strong>Online Inference APIs</strong>
<ul>
<li>Optimized for minimal latency</li>
<li>Designed for real-time responses</li>
<li>Typically more expensive per inference</li>
<li>Critical for interactive applications</li>
</ul></li>
<li><strong>Batch Inference APIs</strong>
<ul>
<li>Optimized for cost efficiency</li>
<li>Can tolerate longer processing times (potentially hours)</li>
<li>Allows providers to optimize resource utilization</li>
<li>Ideal for bulk processing tasks</li>
</ul></li>
</ol>
</section>
<section id="inference-performance-metrics" class="level2">
<h2 class="anchored" data-anchor-id="inference-performance-metrics">Inference Performance Metrics</h2>
<p>Several key metrics help quantify inference performance:</p>
<section id="latency-components" class="level3">
<h3 class="anchored" data-anchor-id="latency-components">Latency Components</h3>
<ol type="1">
<li><strong>Time to First Token</strong>
<ul>
<li>Measures duration between query submission and initial response</li>
<li>Critical for user experience in interactive applications</li>
<li>Often a key optimization target for real-time systems</li>
</ul></li>
<li><strong>Time per Output Token</strong>
<ul>
<li>Generation speed after the first token</li>
<li>Impacts overall completion time</li>
<li>Can vary based on model architecture and optimization</li>
</ul></li>
<li><strong>Inter-token Latency</strong>
<ul>
<li>Time intervals between consecutive tokens</li>
<li>Affects perceived smoothness of generation</li>
<li>Important for streaming applications</li>
</ul></li>
</ol>
<p>Total latency can be expressed as: <code>time_to_first_token + (time_per_token × number_of_tokens)</code></p>
</section>
<section id="throughput-and-goodput-metrics" class="level3">
<h3 class="anchored" data-anchor-id="throughput-and-goodput-metrics">Throughput and Goodput Metrics</h3>
<blockquote class="blockquote">
<p><strong>Throughput</strong>: The number of output tokens per second an inference service can generate across all users and requests. This raw metric provides insight into system capacity.</p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Goodput</strong>: The number of requests per second that successfully meet the Service Level Objective (SLO). This metric offers a more realistic view of useful system capacity.</p>
</blockquote>
</section>
<section id="resource-utilization-metrics" class="level3">
<h3 class="anchored" data-anchor-id="resource-utilization-metrics">Resource Utilization Metrics</h3>
<ol type="1">
<li><strong>Model FLOPS Utilization (MFU)</strong>
<ul>
<li>Ratio of actual to theoretical FLOPS</li>
<li>Indicates computational efficiency</li>
<li>Key metric for hardware optimization</li>
</ul></li>
<li><strong>Model Bandwidth Utilization (MBU)</strong>
<ul>
<li>Percentage of achievable memory bandwidth utilized</li>
<li>Critical for memory-intensive operations</li>
<li>Helps identify memory bottlenecks</li>
</ul></li>
</ol>
</section>
</section>
<section id="hardware-considerations-and-ai-accelerators" class="level2">
<h2 class="anchored" data-anchor-id="hardware-considerations-and-ai-accelerators">Hardware Considerations and AI Accelerators</h2>
<p>While NVIDIA GPUs dominate the market, various specialized chips exist for inference:</p>
<section id="popular-ai-accelerators" class="level3">
<h3 class="anchored" data-anchor-id="popular-ai-accelerators">Popular AI Accelerators</h3>
<ul>
<li>NVIDIA GPUs (market leader)</li>
<li>AMD accelerators</li>
<li>Google TPUs</li>
<li>Various emerging specialized chips</li>
</ul>
<blockquote class="blockquote">
<p><strong>Inference vs Training Hardware</strong>: Inference-optimized chips prioritize lower precision and faster memory access over large memory capacity, contrasting with training-focused hardware that requires substantial memory capacity.</p>
</blockquote>
<p>Key hardware optimization considerations include:</p>
<ul>
<li>Memory size and bandwidth requirements</li>
<li>Chip architecture specifics</li>
<li>Power consumption profiles</li>
<li>Physical chip architecture variations</li>
<li>Cost-performance ratios</li>
</ul>
</section>
</section>
<section id="model-optimization-techniques" class="level2">
<h2 class="anchored" data-anchor-id="model-optimization-techniques">Model Optimization Techniques</h2>
<p><img src="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/inference-optimization-differences.png" class="img-fluid"></p>
<section id="core-approaches" class="level3">
<h3 class="anchored" data-anchor-id="core-approaches">Core Approaches</h3>
<ol type="1">
<li><strong>Quantization</strong>
<ul>
<li>Reduces numerical precision (e.g., 32-bit to 16-bit)</li>
<li>Decreases memory footprint</li>
<li>Weight-only quantization is particularly common</li>
<li>Can halve model size with minimal performance impact</li>
</ul></li>
<li><strong>Pruning</strong>
<ul>
<li>Removes non-essential parameters</li>
<li>Preserves core model behavior</li>
<li>Multiple techniques available</li>
<li>Requires careful validation</li>
</ul></li>
<li><strong>Distillation</strong>
<ul>
<li>Creates smaller, more efficient models</li>
<li>Maintains key capabilities</li>
<li>Covered extensively in Chapter 8</li>
</ul></li>
</ol>
</section>
<section id="advanced-decoding-strategies" class="level3">
<h3 class="anchored" data-anchor-id="advanced-decoding-strategies">Advanced Decoding Strategies</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/pytorch-llama3-optimization.png" class="img-fluid"></p>
<section id="speculative-decoding" class="level4">
<h4 class="anchored" data-anchor-id="speculative-decoding">Speculative Decoding</h4>
<p>This approach combines a large model with a smaller, faster model:</p>
<ul>
<li>Small model generates rapid initial outputs</li>
<li>Large model verifies and corrects as needed</li>
<li>Provides faster token generation</li>
<li>Easy to implement</li>
<li>Integrated into frameworks like VLLM and LamaCPU</li>
</ul>
</section>
<section id="inference-with-reference" class="level4">
<h4 class="anchored" data-anchor-id="inference-with-reference">Inference with Reference</h4>
<p><img src="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/inference_with_reference.png" class="img-fluid"></p>
<ul>
<li>Performs mini-RAG operations during decoding</li>
<li>Retrieves relevant context from input query</li>
<li>Requires additional memory overhead</li>
<li>Useful for maintaining context accuracy</li>
</ul>
</section>
<section id="parallel-decoding" class="level4">
<h4 class="anchored" data-anchor-id="parallel-decoding">Parallel Decoding</h4>
<p>Rather than strictly sequential token generation, this method:</p>
<ul>
<li>Generates multiple tokens simultaneously</li>
<li>Uses resolution mechanisms to maintain coherence</li>
<li>Implements look-ahead techniques</li>
<li>Algorithmically complex but offers significant speed benefits</li>
<li>Demonstrated success with look-ahead decoding method</li>
</ul>
</section>
<section id="attention-optimization" class="level4">
<h4 class="anchored" data-anchor-id="attention-optimization">Attention Optimization</h4>
<p>Several strategies exist for optimizing attention mechanisms:</p>
<ol type="1">
<li><strong>Key-Value Cache Optimization</strong>
<ul>
<li>Critical for large context windows</li>
<li>Requires substantial memory</li>
<li>Various techniques for size reduction</li>
</ul></li>
<li><strong>Specialized Attention Kernels</strong>
<ul>
<li>Flash Attention as leading example</li>
<li>Hardware-specific implementations</li>
<li>Flash Attention 3 for H100 GPUs</li>
</ul></li>
</ol>
<p><img src="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/flash-attention.png" class="img-fluid"></p>
</section>
</section>
</section>
<section id="service-level-optimization" class="level2">
<h2 class="anchored" data-anchor-id="service-level-optimization">Service-Level Optimization</h2>
<section id="batching-strategies" class="level3">
<h3 class="anchored" data-anchor-id="batching-strategies">Batching Strategies</h3>
<ol type="1">
<li><strong>Static Batching</strong>
<ul>
<li>Processes fixed-size batches</li>
<li>Waits for complete batch (e.g., 100 requests)</li>
<li>Simple but potentially inefficient</li>
</ul></li>
<li><strong>Dynamic Batching</strong>
<ul>
<li>Uses time windows for batch formation</li>
<li>Processes incomplete batches after timeout</li>
<li>Balances latency and throughput</li>
</ul></li>
<li><strong>Continuous Batching</strong>
<ul>
<li>Returns completed responses immediately</li>
<li>Dynamically manages resource utilization</li>
<li>Similar to a bus route that continuously picks up new passengers</li>
<li>Optimizes occupation rate</li>
<li>Based on Orca paper’s findings</li>
</ul></li>
</ol>
</section>
<section id="prefill-decode-decoupling" class="level3">
<h3 class="anchored" data-anchor-id="prefill-decode-decoupling">Prefill-Decode Decoupling</h3>
<ul>
<li>Separates prefill and decode operations</li>
<li>Essential for large-scale inference providers</li>
<li>Allows optimal resource allocation</li>
<li>Improves overall system efficiency</li>
</ul>
</section>
<section id="prompt-caching" class="level3">
<h3 class="anchored" data-anchor-id="prompt-caching">Prompt Caching</h3>
<p><img src="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/prompt_caching.png" class="img-fluid"></p>
<ul>
<li>Stores computations for overlapping text segments</li>
<li>Offered by providers like Gemini and Anthropic</li>
<li>May incur storage costs</li>
<li>Requires careful cost-benefit analysis</li>
<li>Must be explicitly enabled</li>
</ul>
</section>
<section id="parallelism-strategies" class="level3">
<h3 class="anchored" data-anchor-id="parallelism-strategies">Parallelism Strategies</h3>
<ol type="1">
<li><strong>Replica Parallelism</strong>
<ul>
<li>Creates multiple copies of the model</li>
<li>Distributes requests across replicas</li>
<li>Simplest form of parallelism</li>
</ul></li>
<li><strong>Tensor Parallelism</strong>
<ul>
<li>Splits individual tensors across devices</li>
<li>Enables processing of larger models</li>
<li>Requires careful coordination</li>
</ul></li>
<li><strong>Pipeline Parallelism</strong>
<ul>
<li>Divides model computation into stages</li>
<li>Assigns stages to different devices</li>
<li>Optimizes resource utilization</li>
<li>Reduces memory requirements</li>
</ul></li>
<li><strong>Context Parallelism</strong>
<ul>
<li>Processes different parts of input context in parallel</li>
<li>Particularly useful for long sequences</li>
<li>Can significantly reduce latency</li>
</ul></li>
<li><strong>Sequence Parallelism</strong>
<ul>
<li>Processes multiple sequences simultaneously</li>
<li>Leverages hardware-specific features</li>
<li>Requires careful implementation</li>
</ul></li>
</ol>
</section>
</section>
<section id="implementation-considerations" class="level2">
<h2 class="anchored" data-anchor-id="implementation-considerations">Implementation Considerations</h2>
<p>When implementing inference optimizations:</p>
<ul>
<li>Multiple optimization techniques are typically combined in production</li>
<li>Hardware-specific optimizations require careful testing</li>
<li>Service-level optimizations often provide significant gains with minimal model modifications</li>
<li>Optimization choices depend heavily on specific use cases and requirements</li>
</ul>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>inference</category>
  <category>llm</category>
  <category>llms</category>
  <category>hardware</category>
  <guid>https://alexstrick.com/posts/2025-02-07-ai-engineering-chapter-9.html</guid>
  <pubDate>Thu, 06 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-02-07-ai-engineering-chapter-9/flash-attention.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>Dataset Engineering: The Art and Science of Data Preparation</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-02-05-notes-on-ai-engineering-chip-huyen-chapter-8-dataset-engineering.html</link>
  <description><![CDATA[ 




<p>Finally back on track and reading the next chapter of Chip Huyen’s book, ‘AI Engineering’. Here are my notes on the chapter.</p>
<section id="overview-and-core-philosophy" class="level2">
<h2 class="anchored" data-anchor-id="overview-and-core-philosophy">Overview and Core Philosophy</h2>
<blockquote class="blockquote">
<p>“Data will be mostly just toil, tears and sweat.”</p>
</blockquote>
<p>This is how we start the chapter :) This candid assessment frames dataset engineering as a discipline that requires both technical sophistication and pragmatic persistence. While the chapter’s placement might have been suitable earlier in the book, its position allows it to build effectively on previously established concepts.</p>
</section>
<section id="data-curation-the-foundation" class="level2">
<h2 class="anchored" data-anchor-id="data-curation-the-foundation">Data Curation: The Foundation</h2>
<p>Data curation addresses various use cases including fine-tuning, pre-training, and training from scratch, with specific considerations for chain of thought reasoning and tool use. The process addresses three fundamental aspects:</p>
<blockquote class="blockquote">
<p><strong>Data Quality</strong>: The equivalent of ingredient quality in cooking</p>
<p><strong>Data Coverage</strong>: Analogous to having the right mix of ingredients</p>
<p><strong>Data Quantity</strong>: Determining the optimal volume of ingredients</p>
</blockquote>
<section id="quality-criteria" class="level3">
<h3 class="anchored" data-anchor-id="quality-criteria">Quality Criteria</h3>
<p>Data quality encompasses multiple dimensions:</p>
<ul>
<li>Relevance to task requirements</li>
<li>Consistency in format and structure</li>
<li>Sufficient uniqueness</li>
<li>Regulatory compliance (especially critical in regulated industries)</li>
</ul>
</section>
<section id="coverage-considerations" class="level3">
<h3 class="anchored" data-anchor-id="coverage-considerations">Coverage Considerations</h3>
<p>Coverage involves strategic decisions about data proportions:</p>
<ul>
<li>Large language models often utilize significant code data (up to 50%) in training, which appears to enhance logical reasoning capabilities beyond just coding</li>
<li>Language distribution can be surprisingly efficient (even 1% representation of a language can enable meaningful capabilities)</li>
<li>Training proportions may vary across different stages of the training process</li>
</ul>
</section>
<section id="quantity-and-optimization" class="level3">
<h3 class="anchored" data-anchor-id="quantity-and-optimization">Quantity and Optimization</h3>
<p>A key phenomenon discussed is <strong>ossification</strong>, where extensive pre-training can effectively freeze model weights, potentially hampering fine-tuning adaptability. This effect is particularly pronounced in smaller models.</p>
<p>Key quantity considerations include:</p>
<ul>
<li>Task complexity correlation with data requirements</li>
<li>Base model performance implications</li>
<li>Model size considerations (OpenAI notes that with ~100 examples, more advanced models show superior fine-tuning performance)</li>
<li>Potential for using lower quality or less relevant data for initial fine-tuning to reduce high-quality data requirements</li>
<li>Recognition of performance plateaus where additional data yields diminishing returns</li>
</ul>
</section>
<section id="data-acquisition-process" class="level3">
<h3 class="anchored" data-anchor-id="data-acquisition-process">Data Acquisition Process</h3>
<p>The chapter provides a detailed example workflow for creating an instruction-response dataset:</p>
<ol type="1">
<li>Initial dataset identification (~10,000 examples)</li>
<li>Low-quality instruction removal (reducing to ~9,000)</li>
<li>Low-quality response filtering (removing 3,000)</li>
<li>Manual response writing for remaining high-quality instructions</li>
<li>Topic gap identification and template creation (100 templates)</li>
<li>AI synthesis of 2,000 new instructions</li>
<li>Manual annotation of synthetic instructions</li>
</ol>
<p>Final result: 11,000 high-quality examples</p>
</section>
</section>
<section id="data-augmentation-and-synthesis" class="level2">
<h2 class="anchored" data-anchor-id="data-augmentation-and-synthesis">Data Augmentation and Synthesis</h2>
<section id="synthesis-objectives" class="level3">
<h3 class="anchored" data-anchor-id="synthesis-objectives">Synthesis Objectives</h3>
<ol type="1">
<li>Increasing data quantity</li>
<li>Expanding coverage</li>
<li>Enhancing quality</li>
<li>Addressing privacy concerns</li>
<li>Enabling model distillation</li>
</ol>
<blockquote class="blockquote">
<p><strong>Notable Research</strong>: An Anthropic paper (2022) found that language model-generated datasets can match or exceed human-written ones in quality for certain tasks.</p>
</blockquote>
<p>Note that some teams actually prefer AI-generated preference data due to human fatigue and inconsistency factors.</p>
</section>
<section id="synthesis-applications" class="level3">
<h3 class="anchored" data-anchor-id="synthesis-applications">Synthesis Applications</h3>
<p>The chapter distinguishes between pre-training and post-training synthesis:</p>
<ul>
<li>Synthetic data appears more frequently in post-training</li>
<li>Pre-training limitation: AI can reshape existing knowledge but struggles to synthesize new knowledge</li>
</ul>
</section>
<section id="llama-3-synthesis-pipeline" class="level3">
<h3 class="anchored" data-anchor-id="llama-3-synthesis-pipeline">LLaMA 3 Synthesis Pipeline</h3>
<p>A comprehensive workflow example:</p>
<ol type="1">
<li>AI generation of problem descriptions</li>
<li>Solution generation in multiple programming languages</li>
<li>Unit test generation</li>
<li>Error correction</li>
<li>Cross-language translation with test verification</li>
<li>Conversation and documentation generation with back-translation verification</li>
</ol>
<p>This pipeline generated 2.7 million synthetic coding examples for LLaMA 3.1’s supervised fine-tuning.</p>
</section>
<section id="model-collapse-considerations" class="level3">
<h3 class="anchored" data-anchor-id="model-collapse-considerations">Model Collapse Considerations</h3>
<p>The chapter addresses the risk of <strong>model collapse</strong> in synthetic data usage:</p>
<ul>
<li>Potential loss of training signal through repeated synthetic data use</li>
<li>Current research suggests proper implementation can avoid collapse</li>
<li>Importance of quality control in synthetic data generation</li>
</ul>
</section>
<section id="model-distillation" class="level3">
<h3 class="anchored" data-anchor-id="model-distillation">Model Distillation</h3>
<p>Notable example: BuzzFeed’s fine-tuning of Flan T5 using LoRa and OpenAI’s <code>text-davinci-003</code> generated examples, achieving 80% inference cost reduction.</p>
</section>
</section>
<section id="data-processing-best-practices" class="level2">
<h2 class="anchored" data-anchor-id="data-processing-best-practices">Data Processing Best Practices</h2>
<blockquote class="blockquote">
<p><strong>Expert Tip</strong>: “Manual inspection of data has probably the highest value to prestige ratio of any activity in machine learning.” - Greg Brockman, OpenAI co-founder</p>
</blockquote>
<section id="processing-guidelines" class="level3">
<h3 class="anchored" data-anchor-id="processing-guidelines">Processing Guidelines</h3>
<p>The chapter emphasizes efficiency optimization:</p>
<ol type="1">
<li><p>Order optimization (e.g., deduplication before cleaning if computationally advantageous)</p></li>
<li><p>Trial run validation before full dataset processing</p></li>
<li><p>Data preservation (avoid in-place modifications)</p></li>
<li><p>Original data retention for:</p>
<ul>
<li>Alternative processing needs</li>
<li>Team requirements</li>
<li>Error recovery</li>
</ul></li>
</ol>
</section>
<section id="technical-processing-approaches" class="level3">
<h3 class="anchored" data-anchor-id="technical-processing-approaches">Technical Processing Approaches</h3>
<p>Deduplication strategies include:</p>
<ul>
<li>Pairwise comparison</li>
<li>Hashing methods</li>
<li>Dimensionality reduction techniques</li>
</ul>
<p>Multiple libraries are referenced (page 400) for implementation.</p>
</section>
<section id="data-cleaning-and-formatting" class="level3">
<h3 class="anchored" data-anchor-id="data-cleaning-and-formatting">Data Cleaning and Formatting</h3>
<ul>
<li>HTML tag removal for signal enhancement</li>
<li>Careful prompt template formatting, crucial for:
<ul>
<li>Fine-tuning operations</li>
<li>Instruction tuning</li>
<li>Model performance optimization</li>
</ul></li>
</ul>
</section>
<section id="data-inspection" class="level3">
<h3 class="anchored" data-anchor-id="data-inspection">Data Inspection</h3>
<p>The chapter emphasizes the importance of manual data inspection:</p>
<ul>
<li>Utilize various data exploration tools</li>
<li>Dedicate time to direct data examination (recommended: 15 minutes of direct observation)</li>
<li>Consider this step non-optional in the process</li>
</ul>


</section>
</section>

 ]]></description>
  <category>books-i-read</category>
  <category>datasets</category>
  <category>datalabelling</category>
  <category>llm</category>
  <category>llms</category>
  <category>finetuning</category>
  <guid>https://alexstrick.com/posts/2025-02-05-notes-on-ai-engineering-chip-huyen-chapter-8-dataset-engineering.html</guid>
  <pubDate>Tue, 04 Feb 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/ch8-proportions.png" medium="image" type="image/png" height="50" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ (Chip Huyen) chapter 7: Finetuning</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7-finetuning.html</link>
  <description><![CDATA[ 




<p>I enjoyed chapter 7 on finetuning. It jams a lot of detail into the 50 pages she takes to explain things. Some areas had more detail than you’d expect, and others less, but overall this was a solid summary / review.</p>
<blockquote class="blockquote">
<p><strong>Core Narrative</strong>: Fine-tuning represents a significant technical and organisational investment that should be approached as a last resort, not a first solution.</p>
</blockquote>
<p>The chapter’s essential message can be distilled into three key points:</p>
<ol type="1">
<li>The decision to fine-tune should follow exhausting simpler approaches like prompt engineering and RAG. At the end she sums it up: <em>fine-tuning is for form, while RAG is for facts</em>.</li>
<li>Memory considerations dominate the technical landscape of fine-tuning, leading to the emergence of techniques like PEFT (particularly LoRA) that make fine-tuning more accessible. The chapter emphasises that while the actual process of fine-tuning isn’t necessarily complex, the surrounding infrastructure and maintenance requirements are substantial.</li>
<li>A clear progression pathway emerges: start with prompt engineering, move to examples (up to ~50), implement RAG if needed, and only then consider fine-tuning. Even then, breaking down complex tasks into simpler components might be preferable to full fine-tuning.</li>
</ol>
<p>So fine-tuning can be incredibly powerful when applied judiciously, but it requires careful consideration of both technical capabilities and organisational readiness.</p>
<section id="chapter-overview-and-context" class="level2">
<h2 class="anchored" data-anchor-id="chapter-overview-and-context">Chapter Overview and Context</h2>
<p>This long chapter (approximately 50 pages, much like the others) was notably one of the most challenging for Chip to write. It presents fine-tuning as an advanced approach that moves beyond basic prompt engineering, covering everything from fundamental concepts to practical implementation strategies.</p>
<p>The depth and breadth of the chapter reflect the complexity of fine-tuning as both a technical and organisational challenge, though the things she writes about doesn’t really cover the reality of what it’s like to work on these kinds of initiatives within a team.</p>
</section>
<section id="core-decision-when-to-fine-tune" class="level2">
<h2 class="anchored" data-anchor-id="core-decision-when-to-fine-tune">Core Decision: When to Fine-tune</h2>
<p>The decision to fine-tune should never be taken lightly. While the potential benefits are significant, including improved model quality and task-specific capabilities, the chapter emphasises that fine-tuning should be considered a last resort rather than a default approach.</p>
<blockquote class="blockquote">
<p><strong>Notable Case Study</strong>: Grammarly achieved remarkable results with their fine-tuned T5 models, which outperformed GPT-3 variants despite being 60 times smaller. This example illustrates how targeted fine-tuning can sometimes achieve better results than using larger, more general models.</p>
</blockquote>
<section id="reasons-to-avoid-fine-tuning" class="level3">
<h3 class="anchored" data-anchor-id="reasons-to-avoid-fine-tuning">Reasons to Avoid Fine-tuning</h3>
<p>The chapter presents several compelling reasons why organisations might want to exhaust other options before pursuing fine-tuning:</p>
<ol type="1">
<li>Performance Degradation: Fine-tuning can actually degrade model performance on tasks outside the specific target domain</li>
<li>Engineering Complexity: The process introduces significant technical overhead</li>
<li>Specialised Knowledge Requirements: Teams need expertise in model training</li>
<li>Infrastructure Demands: Self-serving infrastructure becomes necessary</li>
<li>Ongoing Maintenance: Requires dedicated policies and budgets for monitoring and updates</li>
</ol>
</section>
<section id="fine-tuning-vs.-rag-a-critical-distinction" class="level3">
<h3 class="anchored" data-anchor-id="fine-tuning-vs.-rag-a-critical-distinction">Fine-tuning vs.&nbsp;RAG: A Critical Distinction</h3>
<p>One of the most important conceptual frameworks presented is the distinction between fine-tuning and RAG:</p>
<ul>
<li>Fine-tuning focuses on form - how the model expresses information</li>
<li>RAG specialises in facts - what information the model can access and use</li>
</ul>
<p>This separation provides a clear decision framework, though the chapter acknowledges there are exceptions to this general rule.</p>
</section>
</section>
<section id="progressive-implementation-workflow" class="level2">
<h2 class="anchored" data-anchor-id="progressive-implementation-workflow">Progressive Implementation Workflow</h2>
<p><img src="https://alexstrick.com/posts/images/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7-finetuning/prompting_to_rag_to_finetuning.png" class="img-fluid"></p>
<p>The chapter outlines a thoughtful progression of implementation strategies, suggesting organisations should:</p>
<ol type="1">
<li>Begin with prompt engineering optimisation</li>
<li>Expand to include more examples (up to approximately 50)</li>
<li>Implement dynamic data source connections through RAG</li>
<li>Consider advanced RAG methodologies</li>
<li>Explore fine-tuning only after exhausting other options</li>
<li>Consider task decomposition if still unsuccessful</li>
</ol>
</section>
<section id="memory-bottlenecks-and-technical-considerations" class="level2">
<h2 class="anchored" data-anchor-id="memory-bottlenecks-and-technical-considerations">Memory Bottlenecks and Technical Considerations</h2>
<section id="critical-memory-factors" class="level3">
<h3 class="anchored" data-anchor-id="critical-memory-factors">Critical Memory Factors</h3>
<p>The chapter emphasises three key contributors to a model’s memory footprint during fine-tuning:</p>
<ul>
<li>Parameter count</li>
<li>Trainable parameter count</li>
<li>Numeric representations</li>
</ul>
<blockquote class="blockquote">
<p><strong>Technical Note</strong>: The relationship between trainable parameters and memory requirements becomes a key motivator for PEFT (Parameter Efficient Fine Tuning) approaches.</p>
</blockquote>
</section>
<section id="quantisation-strategies" class="level3">
<h3 class="anchored" data-anchor-id="quantisation-strategies">Quantisation Strategies</h3>
<p>The chapter provides a detailed examination of quantisation approaches, particularly noting the distinction between:</p>
<ol type="1">
<li>Post-Training Quantisation (PTQ)
<ul>
<li>Most common approach</li>
<li>Particularly relevant for AI application developers</li>
<li>Supported by major frameworks with minimal code requirements</li>
</ul></li>
<li>Training Quantisation
<ul>
<li>Emerging approach gaining traction</li>
<li>Aims to optimise both inference performance and training costs</li>
</ul></li>
</ol>
</section>
</section>
<section id="advanced-fine-tuning-techniques" class="level2">
<h2 class="anchored" data-anchor-id="advanced-fine-tuning-techniques">Advanced Fine-tuning Techniques</h2>
<section id="peft-methodologies" class="level3">
<h3 class="anchored" data-anchor-id="peft-methodologies">PEFT Methodologies</h3>
<p>The chapter identifies two primary PEFT approaches:</p>
<ol type="1">
<li>Adapter-based methods (Additive):
<ul>
<li>LoRA emerges as the most popular implementation</li>
<li>Includes variants like Dora and qDora from Anthropic</li>
<li>Involves adding new modules to existing model weights</li>
</ul></li>
<li>Soft prompt-based methods:
<ul>
<li>Less common but growing in popularity</li>
<li>Introduces trainable tokens for input processing modification</li>
<li>Offers a middle ground between full fine-tuning and basic prompting, so maybe interesting for teams who don’t <em>really</em> want to go too deep into finetuning (?)</li>
</ul></li>
</ol>
</section>
<section id="model-merging-and-multitask-considerations" class="level3">
<h3 class="anchored" data-anchor-id="model-merging-and-multitask-considerations">Model Merging and Multitask Considerations</h3>
<p>The chapter presents model merging as an evolving science, requiring significant expertise. Three primary approaches are discussed:</p>
<ul>
<li>Summing</li>
<li>Layer stacking</li>
<li>Concatenation (generally not recommended due to memory implications)</li>
</ul>
<p><img src="https://alexstrick.com/posts/images/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7-finetuning/model_merging.png" class="img-fluid"></p>
<p>There’s a lot of detail in this section (much more than I’d expected) but it was interesting to read about something that I haven’t much practical expertise with.</p>
</section>
</section>
<section id="core-approaches-to-model-merging" class="level2">
<h2 class="anchored" data-anchor-id="core-approaches-to-model-merging">Core Approaches to Model Merging</h2>
<p>The chapter outlines three fundamental approaches to model merging, each with its own technical considerations and trade-offs:</p>
<blockquote class="blockquote">
<p><strong>Technical Architecture</strong>: The three primary merging strategies</p>
<ol type="1">
<li><strong>Summing</strong>: Direct weight combination</li>
<li><strong>Layer stacking</strong>: Vertical integration of model components</li>
<li><strong>Concatenation</strong>: Horizontal expansion (though notably discouraged due to memory implications)</li>
</ol>
</blockquote>
<p>The relative simplicity of these approaches belies their potential impact on model architecture and performance. Particularly interesting is how these techniques interface with the broader challenge of multitask learning.</p>
</section>
<section id="multitask-learning-a-new-paradigm" class="level2">
<h2 class="anchored" data-anchor-id="multitask-learning-a-new-paradigm">Multitask Learning: A New Paradigm</h2>
<p>Traditional approaches to multitask learning have typically forced practitioners into one of two suboptimal paths:</p>
<ol type="1">
<li><strong>Simultaneous Training</strong>
<ul>
<li>Requires creation of a comprehensive dataset containing examples for all tasks</li>
<li>Necessitates careful balancing of task representation</li>
<li>Often leads to compromise in per-task performance</li>
</ul></li>
<li><strong>Sequential Training</strong>
<ul>
<li>Fine-tunes the model on each task in sequence</li>
<li>Risks catastrophic forgetting as new tasks overwrite previous learning</li>
<li>Requires careful orchestration of task order and learning rates</li>
</ul></li>
</ol>
<blockquote class="blockquote">
<p><strong>Key Innovation</strong>: Model merging introduces a third path - parallel fine-tuning followed by strategic combination. This approach fundamentally alters the landscape of multitask learning optimisation.</p>
</blockquote>
</section>
<section id="the-parallel-processing-advantage" class="level2">
<h2 class="anchored" data-anchor-id="the-parallel-processing-advantage">The Parallel Processing Advantage</h2>
<p>Model merging enables a particularly elegant solution to the multitask learning challenge through parallel processing:</p>
<ol type="1">
<li>Individual models can be fine-tuned for specific tasks independently</li>
<li>Training can occur in parallel, optimising computational resource usage</li>
<li>Models can be merged post-training, preserving task-specific optimisations</li>
</ol>
<p>This approach brings several compelling advantages:</p>
<blockquote class="blockquote">
<p><strong>Strategic Benefits</strong>: - Parallel training efficiency - Independent task optimisation - Flexible deployment options - Reduced risk of inter-task interference</p>
</blockquote>
</section>
<section id="practical-implications" class="level2">
<h2 class="anchored" data-anchor-id="practical-implications">Practical Implications</h2>
<p>While the implementation details remain somewhat experimental, the potential applications are significant. Organisations can:</p>
<ul>
<li>Develop specialised models in parallel</li>
<li>Optimise individual task performance without compromise</li>
<li>Maintain flexibility in deployment architecture</li>
<li>Scale their multitask capabilities more efficiently</li>
</ul>
</section>
<section id="implementation-pathways" class="level2">
<h2 class="anchored" data-anchor-id="implementation-pathways">Implementation Pathways</h2>
<p>The chapter concludes with two distinct development approaches:</p>
<section id="progression-path" class="level3">
<h3 class="anchored" data-anchor-id="progression-path">Progression Path</h3>
<ol type="1">
<li>Begin with the most economical and fastest model</li>
<li>Validate with a mid-tier model</li>
<li>Push boundaries with the optimal model</li>
<li>Map the price-performance frontier</li>
<li>Select the most appropriate model based on requirements</li>
</ol>
</section>
<section id="distillation-path" class="level3">
<h3 class="anchored" data-anchor-id="distillation-path">Distillation Path</h3>
<ol type="1">
<li>Start with a small dataset and the strongest affordable model</li>
<li>Generate additional training data using the fine-tuned model</li>
<li>Train a more cost-effective model using the expanded dataset</li>
</ol>
</section>
</section>
<section id="final-observations" class="level2">
<h2 class="anchored" data-anchor-id="final-observations">Final Observations</h2>
<p>The chapter emphasises that while the technical process of fine-tuning isn’t necessarily complex, the surrounding context and implications are highly nuanced. Success requires careful consideration of business priorities, resource availability, and long-term maintenance capabilities. This holistic perspective is crucial for organisations considering fine-tuning as part of their AI strategy.</p>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>finetuning</category>
  <category>llm</category>
  <category>llms</category>
  <guid>https://alexstrick.com/posts/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7-finetuning.html</guid>
  <pubDate>Sat, 25 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7-finetuning/finetuning.png" medium="image" type="image/png" height="99" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ (Chip Huyen) chapter 6</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6.html</link>
  <description><![CDATA[ 




<p>This chapter was all about RAG and agents. It’s only 50 pages, so clearly there’s only so much of the details she can get into, but it was pretty good nonetheless and there were a few things in here I’d never really read. Also Chip does a good job bringing the RAG story into the story about agents, particularly in terms of how she defines agents. (Note that <a href="https://huyenchip.com/2025/01/07/agents.html">the second half of this chapter</a>, on agents, is available <a href="https://huyenchip.com/2025/01/07/agents.html">on Chip’s blog</a> as a free excerpt!)</p>
<p>As always, what follows is just my notes on the things that seemed interesting to me (and a high-level overview of the main points of the chapter just for future reference). YMMV!</p>
<section id="chapter-structure-and-framing" class="level2">
<h2 class="anchored" data-anchor-id="chapter-structure-and-framing">Chapter Structure and Framing</h2>
<p>This chapter undertakes the ambitious task of unifying two major paradigms in AI engineering: Retrieval-Augmented Generation (RAG) and Agents. At first glance, combining these topics might seem surprising given their scope and complexity. However, Chip creates a compelling framework that positions both as sophisticated approaches to <em>context construction</em>.</p>
<p>The unifying thesis presents RAG as a specialised case of the agent pattern, where the retriever functions as a tool at the model’s disposal. Both patterns serve to transcend context limitations and maintain current information, though agents ultimately offer broader capabilities. This framing provides an elegant theoretical bridge between these technologies while acknowledging their distinct characteristics.</p>
</section>
<section id="retrieval-augmented-generation-rag" class="level2">
<h2 class="anchored" data-anchor-id="retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</h2>
<section id="core-concepts-and-context-windows" class="level3">
<h3 class="anchored" data-anchor-id="core-concepts-and-context-windows">Core Concepts and Context Windows</h3>
<p>The discussion begins with a fundamental examination of RAG’s purpose: enhancing model outputs with query-specific context to produce more grounded and useful results. Chip introduces a fascinating variation on Parkinson’s Law:</p>
<blockquote class="blockquote">
<p><strong>Context Expansion Law</strong>: Application context tends to expand to fill the context limits supported by the model.</p>
</blockquote>
<p>This observation challenges the common assumption that RAG might become obsolete with infinite context models. Chip argues that larger context windows don’t necessarily solve the fundamental challenges RAG addresses, particularly noting that models often struggle with information buried in the middle of large context windows.</p>
</section>
<section id="retrieval-architecture-and-algorithms" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-architecture-and-algorithms">Retrieval Architecture and Algorithms</h3>
<p>The retrieval architecture discussion introduces two primary paradigms:</p>
<blockquote class="blockquote">
<p><strong>Sparse Retrieval</strong>: Term-based approaches that rely on explicit matching of terms between queries and documents. The primary example is the <strong>TFIDF</strong> (Term Frequency-Inverse Document Frequency) algorithm, which evaluates term importance based on frequency patterns.</p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Dense Retrieval</strong>: Embedding-based approaches that transform text into vector representations, requiring specialised vector databases for storage and sophisticated nearest-neighbour search algorithms for retrieval.</p>
</blockquote>
</section>
<section id="cost-considerations-and-trade-offs" class="level3">
<h3 class="anchored" data-anchor-id="cost-considerations-and-trade-offs">Cost Considerations and Trade-offs</h3>
<p>A striking revelation emerges regarding the cost structure of RAG systems: vector database expenses often consume between one-fifth to half of a company’s total model API spending. This cost burden becomes particularly acute for systems requiring frequent embedding updates due to changing data. Chip notes that both vector storage and vector search queries can be surprisingly expensive operations.</p>
</section>
<section id="retrieval-optimisation-techniques" class="level3">
<h3 class="anchored" data-anchor-id="retrieval-optimisation-techniques">Retrieval Optimisation Techniques</h3>
<p><img src="https://alexstrick.com/posts/images/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6/chunks-for-chunks.png" class="img-fluid"></p>
<p>The chapter presents several sophisticated approaches to optimisation:</p>
<p><strong>Chunking Strategies</strong>: While the section is brief, it addresses the critical trade-offs in how documents are segmented for retrieval.</p>
<p><strong>Query Rewriting</strong>: A powerful but potentially complex technique that enhances initial queries with contextual information. For example, transforming a query like “how about her?” into “how about Aunt Mabel from the previous question?” Chip notes this can introduce latency issues and suggests careful consideration before implementation.</p>
<p><strong>Contextual Retrieval</strong>: Introduces the innovative “chunks-for-chunks” approach, where each retrieved chunk triggers additional retrievals for supplementary context. This might include retrieving related tags or associated metadata to enrich the initial results.</p>
<p><strong>Hybrid Search</strong>: Combines term-based and embedding-based retrieval, typically implementing a re-ranking process. A common pattern involves using term-based retrieval (like Elasticsearch) to obtain an initial set of ~50 (or however many!) documents, followed by embedding-based re-ranking to identify the most relevant subset.</p>
</section>
<section id="evaluation-framework" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-framework">Evaluation Framework</h3>
<p>The evaluation framework centres on two primary metrics:</p>
<blockquote class="blockquote">
<p><strong>Context Precision</strong>: The percentage of retrieved documents that are relevant to the query. Generally easier to measure and optimise.</p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Context Recall</strong>: The percentage of all relevant documents that are successfully retrieved. More challenging to measure as it requires comprehensive dataset annotation.</p>
</blockquote>
</section>
</section>
<section id="agents" class="level2">
<h2 class="anchored" data-anchor-id="agents">Agents</h2>
<section id="foundational-definition" class="level3">
<h3 class="anchored" data-anchor-id="foundational-definition">Foundational Definition</h3>
<p>Chip provides a clear definition of an agent:</p>
<blockquote class="blockquote">
<p><strong>Agent Definition</strong>: An entity capable of perceiving its environment and acting upon it, characterised by: - The environment it operates in (defined by use case) - The set of actions it can perform (augmented by tools)</p>
</blockquote>
</section>
<section id="tool-types-and-capabilities" class="level3">
<h3 class="anchored" data-anchor-id="tool-types-and-capabilities">Tool Types and Capabilities</h3>
<p>The chapter delineates three primary categories of tools:</p>
<p><strong>Knowledge Augmentation Tools</strong>: - RAG systems - Web search capabilities - API calls for information retrieval</p>
<p><strong>Capability Extension Tools</strong>: - Code interpreters - Terminal access - Function execution capabilities These have been shown to significantly boost model performance compared to prompting or fine-tuning alone.</p>
<p><strong>Write Actions</strong>: - Data manipulation capabilities - Storage and deletion operations</p>
</section>
<section id="planning-architecture" class="level3">
<h3 class="anchored" data-anchor-id="planning-architecture">Planning Architecture</h3>
<p>The planning process emerges as a four-stage cycle:</p>
<ol type="1">
<li><strong>Plan Generation</strong>: Task decomposition and strategy development</li>
<li><strong>Initial Reflection</strong>: Plan evaluation and potential revision</li>
<li><strong>Execution</strong>: Implementation of planned actions, often involving specific function calls</li>
<li><strong>Final Reflection</strong>: Outcome evaluation and error correction</li>
</ol>
<p>Chip includes an interesting debate about foundation models as planners, noting Yan LeCun’s assertion that autoregressive models cannot truly plan, though this remains a point of discussion in the field.</p>
</section>
<section id="plan-execution-patterns" class="level3">
<h3 class="anchored" data-anchor-id="plan-execution-patterns">Plan Execution Patterns</h3>
<p><img src="https://alexstrick.com/posts/images/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6/agent-execution-order.png" class="img-fluid"></p>
<p>The execution of agent plans reveals a fascinating interplay between computational patterns and practical implementation. Chip identifies several fundamental execution patterns that form the backbone of agent behaviour, each offering distinct advantages and trade-offs in different scenarios.</p>
<blockquote class="blockquote">
<p><strong>Execution Paradigms</strong>: The core patterns through which agents transform plans into actions, ranging from simple sequential execution to complex conditional logic.</p>
</blockquote>
<p>The primary execution patterns include:</p>
<p><strong>Sequential Execution</strong>: The most straightforward pattern, where actions are performed one after another in a predetermined order. This approach offers predictability and simplicity but may not maximise efficiency when actions could be performed concurrently.</p>
<p><strong>Parallel Execution</strong>: Enables multiple actions to be performed simultaneously when dependencies permit. While this pattern can significantly improve performance, it introduces complexity in managing concurrent operations and handling potential conflicts.</p>
<p><strong>Conditional Execution</strong>: Implements decision points through <code>if</code> statements, allowing agents to adapt their execution path based on intermediate results or environmental conditions. This pattern introduces crucial flexibility but requires careful handling of branch logic and state management.</p>
<p><strong>Iterative Execution</strong>: Utilises <code>for</code> loops to handle repetitive tasks or process collections of items. This pattern is particularly powerful when dealing with datasets or when similar actions need to be performed multiple times with variations.</p>
<blockquote class="blockquote">
<p><strong>Pattern Selection</strong>: The choice of execution pattern often emerges from the intersection of task requirements, system constraints, and performance goals.</p>
</blockquote>
<p>The effectiveness of these patterns depends heavily on the underlying system architecture and the specific requirements of the task at hand. For instance, parallel execution might offer theoretical performance benefits but could introduce unnecessary complexity for simple, linear tasks. Similarly, conditional execution provides valuable flexibility but requires robust error handling and state management to maintain system reliability.</p>
<p>Chip emphasises that these patterns aren’t mutually exclusive - sophisticated agent systems often combine multiple patterns to create more complex and capable execution strategies. This hybrid approach allows for the development of highly adaptable agents that can handle a wide range of tasks while maintaining system stability and performance.</p>
</section>
<section id="planning-optimisation" class="level3">
<h3 class="anchored" data-anchor-id="planning-optimisation">Planning Optimisation</h3>
<p>The chapter provides several practical tips for improving agent planning:</p>
<ol type="1">
<li>Enhance system prompts with more examples</li>
<li>Provide better tool descriptions and parameter documentation</li>
<li>Simplify complex functions through refactoring</li>
<li>Consider using stronger models or fine-tuning for plan generation</li>
</ol>
</section>
<section id="function-calling-implementation" class="level3">
<h3 class="anchored" data-anchor-id="function-calling-implementation">Function Calling Implementation</h3>
<p>The function calling architecture requires:</p>
<ol type="1">
<li>Tool inventory creation, including:
<ul>
<li>Function names and entry points</li>
<li>Parameter specifications</li>
<li>Comprehensive documentation</li>
</ul></li>
<li>Tool usage specification (required vs.&nbsp;optional)</li>
<li>Version control for function names, parameters, and documentation</li>
</ol>
</section>
<section id="planning-granularity" class="level3">
<h3 class="anchored" data-anchor-id="planning-granularity">Planning Granularity</h3>
<p>Chip introduces an important discussion of planning levels, analogous to temporal planning horizons (yearly plans vs.&nbsp;daily tasks). This presents a fundamental trade-off:</p>
<blockquote class="blockquote">
<p><strong>Planning Trade-off</strong>: Higher-level plans are easier to generate but harder to execute, while detailed plans are harder to generate but easier to execute.</p>
</blockquote>
</section>
<section id="tool-selection-and-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="tool-selection-and-evaluation">Tool Selection and Evaluation</h3>
<p>The chapter provides a systematic approach to tool selection:</p>
<ol type="1">
<li>Conduct ablation studies to measure performance impact</li>
<li>Monitor tool usage patterns and error rates</li>
<li>Analyze tool call distribution</li>
<li>Consider model-specific tool preferences (noting that GPT-4 tends to use a wider tool set than ChatGPT)</li>
</ol>
</section>
<section id="memory-systems" class="level3">
<h3 class="anchored" data-anchor-id="memory-systems">Memory Systems</h3>
<p>The memory architecture comprises two core functions:</p>
<blockquote class="blockquote">
<p><strong>Memory Functions</strong>: - Memory management - Memory retrieval</p>
</blockquote>
<p>The system supports three types of memory:</p>
<ul>
<li>Internal knowledge</li>
<li>Short-term memory</li>
<li>Long-term memory</li>
</ul>
<p>These systems prove crucial for:</p>
<ul>
<li>Managing information overflow</li>
<li>Maintaining session persistence</li>
<li>Ensuring model consistency</li>
<li>Preserving data structural integrity</li>
</ul>
</section>
<section id="evaluation-and-failure-modes" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-and-failure-modes">Evaluation and Failure Modes</h3>
<p>The comprehensive evaluation framework considers:</p>
<ul>
<li>Planning effectiveness</li>
<li>Tool execution accuracy</li>
<li>System latency</li>
<li>Overall efficiency</li>
<li>Memory system performance</li>
</ul>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The unifying thread of context construction provides a compelling framework for understanding these technologies not as separate entities, but as complementary approaches to extending model capabilities.</p>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>llm</category>
  <category>llms</category>
  <category>agents</category>
  <category>rag</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6.html</guid>
  <pubDate>Thu, 23 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6/agent-execution-order.png" medium="image" type="image/png" height="91" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ (Chip Huyen) chapter 4</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-22-notes-on-ai-engineering-chip-huyen-chapter-4.html</link>
  <description><![CDATA[ 




<p>This chapter represents a crucial bridge between academic research and production engineering practice in AI system evaluation. What sets it apart is the Chip’s very balanced perspective - neither succumbing to the prevalent hype in the field nor becoming overly academic. Instead, she melds together practical insights with theoretical foundations, creating a useful framework for evaluation that acknowledges both technical and ethical considerations.</p>
<section id="introduction-and-context" class="level2">
<h2 class="anchored" data-anchor-id="introduction-and-context">Introduction and Context</h2>
<blockquote class="blockquote">
<p><strong>Key Insight</strong>: The author’s approach demonstrates that effective AI system evaluation requires a synthesis of academic rigour and practical engineering concerns, much like how traditional software engineering evolved to balance theoretical computer science with practical development methodologies.</p>
</blockquote>
<p>The chapter is structured in three main parts, each building upon the previous to create a complete picture of AI system evaluation:</p>
<ol type="1">
<li>Evaluation criteria fundamentals</li>
<li>Model selection and benchmark navigation</li>
<li>Practical pipeline implementation</li>
</ol>
</section>
<section id="part-1-evaluation-criteria---a-deep-dive" class="level2">
<h2 class="anchored" data-anchor-id="part-1-evaluation-criteria---a-deep-dive">Part 1: Evaluation Criteria - A Deep Dive</h2>
<section id="the-evolution-of-evaluation-driven-development" class="level3">
<h3 class="anchored" data-anchor-id="the-evolution-of-evaluation-driven-development">The Evolution of Evaluation-Driven Development</h3>
<p>The author introduces <strong>evaluation-driven development</strong> (EDD), a methodological evolution that adapts the principles of test-driven development to the unique challenges of AI systems.</p>
<blockquote class="blockquote">
<p><strong>Evaluation-Driven Development</strong>: A methodology where AI application development begins with explicit evaluation criteria, similar to how test-driven development starts with test cases. However, EDD encompasses a broader range of metrics and considerations specific to AI systems.</p>
</blockquote>
<p>The fundamental principle here is that AI applications require a more nuanced and multifaceted approach to evaluation than traditional software. Where traditional software might have binary pass/fail criteria, AI systems often operate in a spectrum of performance across multiple dimensions.</p>
</section>
<section id="the-four-pillars-of-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="the-four-pillars-of-evaluation">The Four Pillars of Evaluation</h3>
<section id="domain-specific-capability" class="level4">
<h4 class="anchored" data-anchor-id="domain-specific-capability">1. Domain-Specific Capability</h4>
<p>The author presents domain-specific capability evaluation as the foundational layer of AI system assessment. This approach is particularly innovative in its use of <strong>multiple choice evaluation techniques</strong> - a method that bridges the gap between human-interpretable results and machine performance metrics.</p>
<p><em>For example</em>, when evaluating code generation capabilities, presenting a model with multiple implementations where only one is functionally correct serves as both a test and a teaching tool. This mimics how human experts often evaluate junior developers’ understanding of coding patterns and best practices.</p>
</section>
<section id="generation-capability" class="level4">
<h4 class="anchored" data-anchor-id="generation-capability">2. Generation Capability</h4>
<p>The section on generation capability draws parallels with the historical development of Natural Language Generation (NLG) in computational linguistics. This historical context provides valuable insights into how we can approach modern language model evaluation.</p>
<p>The author breaks down factual consistency into two crucial dimensions:</p>
<blockquote class="blockquote">
<p><strong>Local Factual Consistency</strong>: The internal coherence of generated content and its alignment with the immediate context of the prompt. This is analogous to maintaining logical consistency within a single conversation or document.</p>
<p><strong>Global Factual Consistency</strong>: The accuracy of generated content when compared against established knowledge and facts. This represents the model’s ability to maintain truthfulness in a broader context.</p>
</blockquote>
<p>The discussion of hallucination detection is particularly noteworthy, presenting three complementary approaches:</p>
<ol type="1">
<li><strong>Basic Prompting</strong>: Direct detection through carefully crafted prompts</li>
<li><strong>Self-Verification</strong>: A novel approach using internal consistency checks across multiple generations</li>
<li><strong>Knowledge-Augmented Verification</strong>: Advanced techniques like Google DeepMind’s SAFE paper (search augmented factuality evaluator)</li>
</ol>
<p>The knowledge-augmented verification system represents a fascinating approach to fact-checking that mirrors how human experts verify information:</p>
<ul>
<li>It breaks down complex statements into atomic claims</li>
<li>Each claim is independently verified through search</li>
<li>The results are synthesised into a final accuracy assessment</li>
</ul>
<p>Seems pricey, though :)</p>
</section>
<section id="instruction-following-capability" class="level4">
<h4 class="anchored" data-anchor-id="instruction-following-capability">3. Instruction Following Capability</h4>
<p>The author makes a crucial observation about the bidirectional nature of instruction following evaluation. Poor performance might indicate either model limitations or instruction ambiguity - a distinction that’s often overlooked in practice.</p>
<blockquote class="blockquote">
<p><strong>Instruction-Performance Paradox</strong>: The quality of instruction following cannot be evaluated in isolation from the quality of the instructions themselves, creating a circular dependency that must be carefully managed in evaluation design.</p>
</blockquote>
<p>The solution proposed is the development of custom benchmarks that specifically target your application’s requirements. This approach ensures that your evaluation criteria align perfectly with your practical needs rather than relying solely on generic benchmarks.</p>
</section>
<section id="cost-and-latency-considerations" class="level4">
<h4 class="anchored" data-anchor-id="cost-and-latency-considerations">4. Cost and Latency Considerations</h4>
<p>The author introduces the concept of <strong>Pareto optimization</strong> in the context of AI system evaluation, demonstrating how different performance metrics often involve trade-offs that must be carefully balanced.</p>
<blockquote class="blockquote">
<p><strong>Pareto Optimization</strong>: A multi-objective optimization approach where improvements in one metric cannot be achieved without degrading another, leading to a set of optimal trade-off solutions rather than a single optimal point.</p>
</blockquote>
</section>
</section>
</section>
<section id="part-2-model-selection---a-strategic-approach" class="level2">
<h2 class="anchored" data-anchor-id="part-2-model-selection---a-strategic-approach">Part 2: Model Selection - A Strategic Approach</h2>
<section id="the-four-step-evaluation-workflow" class="level3">
<h3 class="anchored" data-anchor-id="the-four-step-evaluation-workflow">The Four-Step Evaluation Workflow</h3>
<p>The author presents a sophisticated workflow that combines both quantitative and qualitative factors in model selection. This approach is particularly valuable because it acknowledges the complexity of real-world deployment while providing a structured path forward.</p>
<ol type="1">
<li><p><strong>Initial Filtering</strong> The first step involves filtering based on hard constraints, which might include:</p>
<ul>
<li>Deployment requirements (on-premise vs.&nbsp;cloud)</li>
<li>Security and privacy considerations</li>
<li>Licensing restrictions</li>
<li>Resource constraints</li>
</ul></li>
<li><p><strong>Public Information Assessment</strong> This stage involves a systematic review of:</p>
<ul>
<li>Benchmark performances across relevant tasks</li>
<li>Leaderboard rankings with context</li>
<li>Published latency and cost metrics</li>
</ul>
<p>The author emphasises the importance of looking beyond raw numbers to understand the context and limitations of public benchmarks.</p></li>
<li><p><strong>Experimental Evaluation</strong> This phase involves hands-on testing with your specific use case, considering:</p>
<ul>
<li>Custom evaluation metrics</li>
<li>Integration requirements</li>
<li>Real-world performance characteristics</li>
</ul></li>
<li><p><strong>Continuous Monitoring</strong> The final step acknowledges that evaluation is an ongoing process, not a one-time event. This involves:</p>
<ul>
<li>Regular performance monitoring</li>
<li>Failure detection and analysis</li>
<li>Feedback collection and incorporation</li>
<li>Continuous improvement cycles</li>
</ul></li>
</ol>
</section>
<section id="the-build-vs.-buy-decision-matrix" class="level3">
<h3 class="anchored" data-anchor-id="the-build-vs.-buy-decision-matrix">The Build vs.&nbsp;Buy Decision Matrix</h3>
<p>The author provides an analysis of the build vs.&nbsp;buy decision, going beyond simple cost comparisons to consider factors like:</p>
<blockquote class="blockquote">
<p><strong>Total Cost of Ownership (TCO)</strong>: The complete cost picture including: - Direct costs (API fees, computing resources) - Indirect costs (engineering time, maintenance) - Opportunity costs (time to market, feature development) - Risk costs (security, reliability, vendor lock-in)</p>
</blockquote>
<p>This section particularly shines in its discussion of the often-overlooked aspects of model deployment, such as the hidden costs of maintaining self-hosted models and the true value of vendor-provided updates and improvements.</p>
</section>
</section>
<section id="part-3-building-evaluation-pipelines---practical-implementation" class="level2">
<h2 class="anchored" data-anchor-id="part-3-building-evaluation-pipelines---practical-implementation">Part 3: Building Evaluation Pipelines - Practical Implementation</h2>
<section id="system-component-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="system-component-evaluation">System Component Evaluation</h3>
<p>The author advocates for a <strong>dual-track evaluation approach</strong>:</p>
<ol type="1">
<li>End-to-end system evaluation</li>
<li>Component-level assessment</li>
</ol>
<p>This approach allows organisations to:</p>
<ul>
<li>Identify bottlenecks and failure points</li>
<li>Understand component interactions</li>
<li>Make targeted improvements</li>
<li>Maintain system reliability during updates</li>
</ul>
</section>
<section id="creating-effective-evaluation-guidelines" class="level3">
<h3 class="anchored" data-anchor-id="creating-effective-evaluation-guidelines">Creating Effective Evaluation Guidelines</h3>
<p>The author emphasises the importance of creating clear, actionable evaluation guidelines that bridge technical and business metrics. This section introduces the concept of <strong>metric alignment</strong> - ensuring that technical evaluation metrics directly correspond to business value.</p>
<blockquote class="blockquote">
<p><strong>Metric Alignment</strong>: The process of mapping technical performance metrics to business outcomes, creating a clear connection between model improvements and business value.</p>
</blockquote>
</section>
<section id="data-management-and-sampling" class="level3">
<h3 class="anchored" data-anchor-id="data-management-and-sampling">Data Management and Sampling</h3>
<p>Chip provides valuable insights into data management for evaluation, including:</p>
<blockquote class="blockquote">
<p><strong>Data Slicing</strong>: The strategic separation of evaluation data into meaningful subsets to: - Identify performance variations across different use cases - Detect potential biases - Enable targeted improvement efforts - Avoid Simpson’s paradox in performance analysis</p>
</blockquote>
<p>The discussion of sample size is particularly practical, providing concrete guidelines based on statistical confidence levels and desired detection thresholds. The author cites OpenAI’s research suggesting that sample sizes between 100 and 1,000 are typically sufficient for most evaluation needs, depending on the required confidence level.</p>
<p><img src="https://alexstrick.com/posts/images/chip4-sample-number.png" class="img-fluid"></p>
</section>
<section id="meta-evaluation-evaluating-your-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="meta-evaluation-evaluating-your-evaluation">Meta-Evaluation: Evaluating Your Evaluation</h3>
<p>The chapter concludes with a crucial discussion of meta-evaluation - the process of assessing and improving your evaluation pipeline itself. This includes considerations of:</p>
<ul>
<li>Signal quality and reliability</li>
<li>Metric correlation and redundancy</li>
<li>Resource utilisation and efficiency</li>
<li>Integration with development workflows</li>
</ul>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The author concludes around the inherent limitations of AI system evaluation: no single metric or method can fully capture the complexity of these systems. However, this acknowledgment leads to a constructive approach: combining multiple evaluation methods, maintaining awareness of their limitations, and continuously iterating based on real-world feedback.</p>
<p>This chapter ultimately provides a solid framework for AI system evaluation that is both theoretically sound and practically applicable. It serves as a valuable resource for organisations working to implement effective evaluation strategies for their AI systems, while maintaining a clear-eyed view of both the possibilities and limitations of current evaluation methods.</p>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>llm</category>
  <category>llms</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-01-22-notes-on-ai-engineering-chip-huyen-chapter-4.html</guid>
  <pubDate>Tue, 21 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/chip-ch4.png" medium="image" type="image/png" height="85" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ (Chip Huyen) chapter 3</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-21-notes-on-ai-engineering-chip-huyen-chapter-3.html</link>
  <description><![CDATA[ 




<p>Really enjoyed this chapter. My tidied notes from my readings follow below. 150 pages in and we’re starting to get to the good stuff :)</p>
<section id="overview-and-context" class="level2">
<h2 class="anchored" data-anchor-id="overview-and-context">Overview and Context</h2>
<p>This chapter serves as the first of two chapters (Chapters 3 and 4) dealing with evaluation in AI Engineering. While Chapter 4 will delve into evaluation within systems, Chapter 3 addresses the fundamental question of how to evaluate open-ended responses from foundation models and LLMs at a high level. The importance of evaluation cannot be overstated, though the author perhaps takes this somewhat for granted. The chapter provides a comprehensive framework for understanding various evaluation methodologies and their applications.</p>
</section>
<section id="challenges-in-evaluating-foundation-models" class="level2">
<h2 class="anchored" data-anchor-id="challenges-in-evaluating-foundation-models">Challenges in Evaluating Foundation Models</h2>
<p>The evaluation of foundation models presents several unique and complex challenges that make systematic assessment difficult:</p>
<ul>
<li>Existing benchmarks become increasingly inadequate as models improve in their capabilities</li>
<li>As models become better at writing and mimicking human-like responses, evaluation becomes more complex and nuanced</li>
<li>Many foundation models are API-driven black boxes, limiting access to internal workings</li>
<li>Models continuously develop new capabilities, requiring constant adaptation of evaluation methods</li>
<li>There has been notably limited investment in evaluation studies and technologies compared to the extensive resources devoted to enhancing model capabilities</li>
<li>The improvement in model performance necessitates the continuous development of new benchmarks</li>
<li>Without a systematic approach to evaluation, progress can be hindered by various headwinds</li>
</ul>
</section>
<section id="language-model-metrics" class="level2">
<h2 class="anchored" data-anchor-id="language-model-metrics">Language Model Metrics</h2>
<p>The chapter includes a technically detailed section on understanding language model metrics, which while math-heavy, provides fundamental insights into model capabilities:</p>
<ul>
<li>Entropy</li>
<li>Cross-entropy</li>
<li>Perplexity</li>
</ul>
<p>These metrics serve as underlying measures to understand what’s happening within the models and assess their power and conversational abilities. While this section spans 4-5 pages of technical content, it provides some useful foundational understanding of how we can measure a language model’s intrinsic capabilities.</p>
</section>
<section id="downstream-task-performance-measurement" class="level2">
<h2 class="anchored" data-anchor-id="downstream-task-performance-measurement">Downstream Task Performance Measurement</h2>
<p>The chapter transitions from intrinsic metrics to evaluating actual capabilities, dividing evaluation into exact and subjective approaches.</p>
<section id="exact-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="exact-evaluation">Exact Evaluation</h3>
<p>There are two principal approaches to exact evaluation:</p>
<ol type="1">
<li><p><strong>Functional Correctness Assessment</strong></p>
<ul>
<li>Evaluates whether the LLM can successfully complete its assigned tasks</li>
<li>Focuses on practical capability rather than theoretical metrics</li>
<li>Example: In coding tasks, checking if generated code passes all unit tests</li>
<li>Provides clear, objective measures of success</li>
</ul></li>
<li><p><strong>Similarity Measurements Against Reference Data</strong> Four distinct methods are identified:</p>
<ol type="a">
<li><strong>Human Evaluator Judgment</strong>
<ul>
<li>Requires manual comparison of texts by human evaluators</li>
<li>Highly accurate but time and resource-intensive</li>
<li>Limited scalability due to human involvement</li>
<li>Often considered the gold standard despite limitations</li>
</ul></li>
<li><strong>Exact Match Checking</strong>
<ul>
<li>Compares generated response against reference responses for exact matches</li>
<li>Most effective with shorter, specific outputs</li>
<li>Less useful for verbose or creative outputs</li>
<li>Provides binary yes/no results</li>
</ul></li>
<li><strong>Lexical Similarity</strong>
<ul>
<li>Employs established metrics like BLEU, ROUGE, and METEOR</li>
<li>Focuses on word overlap and structural similarities</li>
<li>Known to be somewhat crude in their assessment</li>
<li>Widely used despite limitations due to ease of implementation</li>
</ul></li>
<li><strong>Semantic Similarity</strong>
<ul>
<li>Utilizes embeddings for comparing textual meaning</li>
<li>Less sensitive to specific word choices than lexical approaches</li>
<li>Quality depends entirely on the underlying embeddings algorithm</li>
<li>May require significant computational resources</li>
<li>Generally provides more nuanced comparison than lexical methods</li>
</ul></li>
</ol></li>
</ol>
<p>The chapter includes a brief but relevant sidebar on embeddings and their significance in evaluation, though this digression seemed a bit out of place in the overall flow.</p>
</section>
</section>
<section id="ai-as-judge" class="level2">
<h2 class="anchored" data-anchor-id="ai-as-judge">AI as Judge</h2>
<p>This section explores the increasingly popular approach of using AI systems to evaluate other AI systems.</p>
<section id="benefits" class="level3">
<h3 class="anchored" data-anchor-id="benefits">Benefits</h3>
<ul>
<li>Significantly faster than human evaluation processes</li>
<li>Generally more cost-effective than human evaluation at scale</li>
<li>Studies have shown strong correlation with human evaluations in many cases</li>
<li>AI judges can provide detailed explanations for their decisions</li>
<li>Offers greater flexibility in evaluation approaches</li>
<li>Enables systematic and consistent evaluation at scale</li>
</ul>
</section>
<section id="three-main-approaches" class="level3">
<h3 class="anchored" data-anchor-id="three-main-approaches">Three Main Approaches</h3>
<ol type="1">
<li><strong>Individual Response Evaluation</strong>
<ul>
<li>Assesses response quality based solely on the original question</li>
<li>Often implements numerical scoring systems (e.g., 1-5 scale)</li>
<li>Evaluates responses in isolation without comparison</li>
</ul></li>
<li><strong>Reference Response Comparison</strong>
<ul>
<li>Evaluates generated response against established reference responses</li>
<li>Usually produces binary (true/false) outcomes</li>
<li>Helps ensure responses meet specific criteria</li>
</ul></li>
<li><strong>Generated Response Comparison</strong>
<ul>
<li>Compares two generated responses to determine relative quality</li>
<li>Predicts likely user preferences between options</li>
<li>Particularly useful for:
<ul>
<li>Post-training alignment</li>
<li>Test-time compute optimization</li>
<li>Model ranking through comparative evaluation</li>
<li>Generating preference data</li>
</ul></li>
</ul></li>
</ol>
</section>
<section id="implementation-considerations" class="level3">
<h3 class="anchored" data-anchor-id="implementation-considerations">Implementation Considerations</h3>
<p><img src="https://alexstrick.com/posts/images/2025-01-21-notes-on-ai-engineering-chip-huyen-chapter-3/ch3.png" class="img-fluid"></p>
<ul>
<li>Table 3-3 (page 139) provides an overview of different AI judge criteria used by various AI tools</li>
<li>Notable lack of standardization across different platforms and approaches (see above)</li>
<li>Various scoring systems available, each with their own trade-offs</li>
<li>Adding examples to prompts can improve accuracy but increases token count and costs</li>
<li>Careful balance needed between evaluation quality and resource consumption</li>
</ul>
</section>
<section id="limitations-and-challenges" class="level3">
<h3 class="anchored" data-anchor-id="limitations-and-challenges">Limitations and Challenges</h3>
<ul>
<li>AI judges can show inconsistency in their judgments</li>
<li>Costs can escalate quickly, especially when using stronger models or including more context</li>
<li>Evaluation criteria often remain ambiguous and difficult to standardize</li>
<li>Several inherent biases identified:
<ul>
<li>Self-bias: Models tend to favor responses generated by themselves</li>
<li>Verbosity bias: Tendency to favor longer, more detailed answers</li>
<li>Other biases common to AI applications in general</li>
</ul></li>
</ul>
</section>
</section>
<section id="specialized-judges" class="level2">
<h2 class="anchored" data-anchor-id="specialized-judges">Specialized Judges</h2>
<p>This section presents an innovative challenge to the conventional wisdom of using the strongest available model as a judge. The author introduces a compelling alternative approach:</p>
<ul>
<li>Small, specialized judges can be as effective as larger models for specific evaluation tasks</li>
<li>More cost-effective and efficient than using large language models</li>
<li>Can be trained for highly specific evaluation criteria</li>
<li>Demonstrates comparable performance to larger models like GPT-4 in specific domains</li>
</ul>
<p>Three types of specialized judges are identified: 1. Reward models (evaluating prompt-response pairs) 2. Reference-based judges 3. Preference models</p>
<p>This represents a novel approach that could significantly impact evaluation methodology in the field.</p>
</section>
<section id="comparative-evaluation-for-model-ranking" class="level2">
<h2 class="anchored" data-anchor-id="comparative-evaluation-for-model-ranking">Comparative Evaluation for Model Ranking</h2>
<section id="methodology" class="level3">
<h3 class="anchored" data-anchor-id="methodology">Methodology</h3>
<ul>
<li>Focuses on binary choices between two samples</li>
<li>Simpler for both humans and AI to make comparative judgments</li>
<li>Used in major leaderboards like LMSIS</li>
<li>Requires evaluation of multiple combinations to establish rankings</li>
<li>Various algorithms available for efficient comparison</li>
</ul>
</section>
<section id="advantages" class="level3">
<h3 class="anchored" data-anchor-id="advantages">Advantages</h3>
<ul>
<li>More intuitive evaluation process</li>
<li>Often more reliable than absolute scoring</li>
<li>Reduces cognitive load on evaluators</li>
<li>Provides clear preference data</li>
</ul>
</section>
<section id="challenges" class="level3">
<h3 class="anchored" data-anchor-id="challenges">Challenges</h3>
<ul>
<li>Highly data-intensive nature affects scalability</li>
<li>Lacks standardization across implementations</li>
<li>Difficulty in converting comparative measures to absolute metrics</li>
<li>Quality control remains a significant concern</li>
<li>Number of required comparisons can grow rapidly with model count</li>
</ul>
</section>
</section>
<section id="key-takeaways-and-future-implications" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways-and-future-implications">Key Takeaways and Future Implications</h2>
<ol type="1">
<li>The emergence of smaller, specialized judge models represents a significant shift from the traditional approach of using the largest available models</li>
<li>Comparative evaluation offers promising approaches but requires careful consideration of scalability and implementation</li>
<li>The field continues to evolve rapidly, requiring flexible and adaptable evaluation strategies</li>
<li>Sets up crucial discussion for system-level evaluation in Chapter 4</li>
<li>Highlights the ongoing tension between evaluation quality and resource efficiency</li>
</ol>
<p>The chapter effectively establishes the foundational understanding necessary for the more practical, system-focused evaluation discussions to follow in Chapter 4.</p>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>llm</category>
  <category>llms</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-01-21-notes-on-ai-engineering-chip-huyen-chapter-3.html</guid>
  <pubDate>Mon, 20 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-01-21-notes-on-ai-engineering-chip-huyen-chapter-3/ch3.png" medium="image" type="image/png" height="66" width="144"/>
</item>
<item>
  <title>Notes on ‘AI Engineering’ (Chip Huyen) chapter 1</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-19-notes-on-ai-engineering-chapter-1.html</link>
  <description><![CDATA[ 




<p>Had the first of <a href="https://www.meetup.com/delft-fast-ai-study-group/">a series of meet-ups</a> I’m organising in which we discuss Chip Huyen’s new book. My notes from reading the chapter follow this, and then I’ll try to summarise what we discussed in the group.</p>
<p>At a high-level, I <em>really</em> enjoyed the final part of the chapter where she got into how she was thinking about the practice of ‘AI Engineering’ and how it differs from ML engineering. Also the use of the term ‘model adaptation’ was an interesting way of encompassing all the different things that engineers are doing to get the LLM to better follow their instructions.</p>
<section id="chapter-1-notes" class="level2">
<h2 class="anchored" data-anchor-id="chapter-1-notes">Chapter 1 Notes</h2>
<p>The chapter begins by establishing AI Engineering as the preferred term over alternatives like GenAI Ops or LLM Ops. This preference stems from a fundamental shift in the field, where application development has become increasingly central to working with AI models. The “ops” suffix inadequately captures the breadth and nature of work involved in modern AI applications.</p>
<section id="foundation-models-and-language-models" class="level3">
<h3 class="anchored" data-anchor-id="foundation-models-and-language-models">Foundation Models and Language Models</h3>
<p>The text provides important technical context about different types of language models. A notable comparison shows that while Mistral 7B has a vocabulary of 32,000 tokens, GPT-4 possesses a much larger vocabulary of 100,256 tokens, highlighting the significant variation in model capabilities and design choices.</p>
<p>Two primary categories of language models are discussed:</p>
<ol type="1">
<li>Masked Language Models (like BERT and modern BERT variants)</li>
<li>Autoregressive Language Models (like those used in ChatGPT)</li>
</ol>
<p>The term “foundation model” carries dual significance, referring both to these models’ fundamental importance and their adaptability for various applications. This terminology also marks an important transition from task-specific models to general-purpose ones, especially relevant in the era of multimodal capabilities.</p>
</section>
</section>
<section id="ai-engineering-vs-traditional-approaches" class="level2">
<h2 class="anchored" data-anchor-id="ai-engineering-vs-traditional-approaches">AI Engineering vs Traditional Approaches</h2>
<p>AI Engineering differs substantially from ML Engineering, warranting its distinct terminology. The key distinction lies in its focus on adapting and evaluating models rather than building them from scratch. Model adaptation techniques fall into two main categories:</p>
<ol type="1">
<li>Prompt-based techniques (prompt engineering) - These methods adapt models without updating weights</li>
<li>Fine-tuning techniques - These approaches require weight updates</li>
</ol>
<p>The shift from ML Engineering to AI Engineering brings new challenges, particularly in handling open-ended outputs. While this flexibility enables a broader range of applications, it also introduces significant complexity in evaluation and implementation of guardrails.</p>
</section>
<section id="the-ai-engineering-stack" class="level2">
<h2 class="anchored" data-anchor-id="the-ai-engineering-stack">The AI Engineering Stack</h2>
<p>The framework consists of three distinct layers:</p>
<section id="application-development-layer" class="level3">
<h3 class="anchored" data-anchor-id="application-development-layer">1. Application Development Layer</h3>
<ul>
<li>Focuses on prompt crafting and context provision</li>
<li>Requires rigorous evaluation methods</li>
<li>Emphasizes interface design and user experience</li>
<li>Primary responsibilities include evaluation, prompt engineering, and AI interface development</li>
</ul>
</section>
<section id="model-development-layer" class="level3">
<h3 class="anchored" data-anchor-id="model-development-layer">2. Model Development Layer</h3>
<ul>
<li>Provides tooling for model development</li>
<li>Includes frameworks for training, functioning, and inference optimisation</li>
<li>Requires systematic evaluation approaches</li>
</ul>
</section>
<section id="infrastructure-layer" class="level3">
<h3 class="anchored" data-anchor-id="infrastructure-layer">3. Infrastructure Layer</h3>
<ul>
<li>Handles model serving</li>
<li>Manages underlying technical requirements</li>
</ul>
</section>
</section>
<section id="planning-ai-applications" class="level2">
<h2 class="anchored" data-anchor-id="planning-ai-applications">Planning AI Applications</h2>
<p>The chapter outlines a modern approach to AI application development that differs significantly from traditional ML projects. Rather than beginning with data collection and model training, AI engineering often starts with product development, leveraging existing models. This approach allows teams to validate product concepts before investing heavily in data and model development.</p>
<p>Key planning considerations include:</p>
<ul>
<li>Setting appropriate expectations</li>
<li>Determining user exposure levels</li>
<li>Deciding between internal and external deployment</li>
<li>Understanding maintenance requirements</li>
</ul>
<p>A notable insight is the “80/20” development pattern: while reaching 80% functionality can be relatively quick, achieving the final 20% often requires equal or greater effort than the initial development phase.</p>
</section>
<section id="evaluation-and-implementation-challenges" class="level2">
<h2 class="anchored" data-anchor-id="evaluation-and-implementation-challenges">Evaluation and Implementation Challenges</h2>
<p>The chapter emphasises that working with AI models presents unique evaluation challenges compared to traditional ML systems. This complexity stems from:</p>
<ul>
<li>The open-ended nature of outputs</li>
<li>Difficulty in implementing strict guardrails</li>
<li>Challenges in type enforcement</li>
<li>The need for comprehensive evaluation strategies</li>
</ul>
</section>
<section id="data-and-model-adaptation" class="level2">
<h2 class="anchored" data-anchor-id="data-and-model-adaptation">Data and Model Adaptation</h2>
<p>The text discusses how data set engineering and inference optimisation, while still relevant, take on different forms in AI engineering compared to traditional ML engineering. The focus shifts from raw data collection and processing to effective model adaptation and deployment strategies.</p>
</section>
<section id="modern-development-paradigm" class="level2">
<h2 class="anchored" data-anchor-id="modern-development-paradigm">Modern Development Paradigm</h2>
<p>A significant paradigm shift is highlighted in the development approach: unlike traditional ML engineering, which typically begins with data collection and model training, AI engineering enables a product-first approach. This allows teams to validate concepts using existing models before committing to extensive data collection or model development efforts.</p>
</section>
<section id="discussion-summary" class="level2">
<h2 class="anchored" data-anchor-id="discussion-summary">Discussion summary</h2>
<p>The conversation started with a bit on how AI Engineering represents an interesting shift in the software engineering landscape, potentially opening new career paths for traditional software engineers. While developers may not need deep mathematical knowledge of derivatives and linear algebra upfront, there’s a growing recognition that understanding how AI systems behave - their constraints and opportunities - is becoming increasingly valuable.</p>
<p>A key tension emerged in the discussion around enterprise adoption. While there’s significant enthusiasm around AI applications, particularly on social media where developers showcase apps with substantial user bases, enterprise companies often maintain their traditional team structures. This creates an interesting dynamic where companies might maintain their existing ML engineering teams while simultaneously forming new “tiger teams” focused on generative AI initiatives, leading to organisational friction.</p>
<p>The group discussed how while it’s now possible for software engineers to quickly build AI applications by calling APIs, they often hit limitations that require deeper understanding. This raises questions about whether the “shallow” approach of purely application-level development is sustainable, or whether engineers will inevitably need to develop deeper technical knowledge around model behaviour, evaluation, and fine-tuning.</p>
<p>A particularly notable challenge discussed was handling the non-deterministic nature of AI systems. Traditional software engineering practices, like unit testing, don’t translate cleanly to systems where outputs can vary even with temperature set to zero. This highlights how AI Engineering requires new patterns and practices beyond traditional software engineering approaches.</p>
<p>The discussion also touched on evaluation techniques, including the use of log probabilities to understand model confidence and improve prompts. This represents an emerging area where traditional ML evaluation meets new challenges in assessing large language model outputs.</p>


</section>

 ]]></description>
  <category>books-i-read</category>
  <category>llm</category>
  <category>llms</category>
  <category>finetuning</category>
  <category>prompt-engineering</category>
  <guid>https://alexstrick.com/posts/2025-01-19-notes-on-ai-engineering-chapter-1.html</guid>
  <pubDate>Sat, 18 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/aieng1-small.png" medium="image" type="image/png" height="101" width="144"/>
</item>
<item>
  <title>Final notes on ‘Prompt Engineering for LLMs’</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-17-final-notes-on-prompt-engineering-for-llms.html</link>
  <description><![CDATA[ 




<p>Here are the final notes from ‘<a href="https://app.thestorygraph.com/books/8535f61d-1dcd-4610-9cd9-6bcaf774f392">Prompt Engineering for LLMs</a>’, a book I’ve been reading over the past few days (and enjoying!).</p>
<section id="chapter-10-evaluating-llm-applications" class="level2">
<h2 class="anchored" data-anchor-id="chapter-10-evaluating-llm-applications">Chapter 10: Evaluating LLM Applications</h2>
<p>The chapter begins with an interesting anecdote about GitHub Copilot - the first code written in their repository was the evaluation harness, highlighting the importance of testing in LLM applications. The authors, who worked on the project from its inception, emphasise this as a best practice.</p>
<section id="evaluation-framework" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-framework">Evaluation Framework</h3>
<p>When evaluating LLM applications, three main aspects can be assessed:</p>
<ul>
<li>The model itself - its capabilities and limitations</li>
<li>Individual interactions with the model (prompts and responses)</li>
<li>The integration of multiple interactions within the broader application</li>
</ul>
<p>As a general rule of thumb, you should always track and record:</p>
<ul>
<li>Latency</li>
<li>Token consumption statistics</li>
<li>Overall system approach metrics</li>
</ul>
</section>
<section id="offline-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="offline-evaluation">Offline Evaluation</h3>
<section id="example-suites" class="level4">
<h4 class="anchored" data-anchor-id="example-suites">Example Suites</h4>
<p>The foundation of offline evaluation is creating example suites - collections of 10-20 (minimum) input-output pairs that serve as test cases. These should be accompanied by scripts that apply your application’s logic to each example and compare the results.</p>
<p>Example sources come from three main areas:</p>
<ul>
<li>Existing examples from your project</li>
<li>Real-time user data collection</li>
<li>Synthetic creation</li>
</ul>
<p>When using synthetic data, it’s crucial to use different LLMs for creation versus application/judging to avoid potential biases.</p>
</section>
<section id="evaluation-approaches" class="level4">
<h4 class="anchored" data-anchor-id="evaluation-approaches">Evaluation Approaches</h4>
<ol type="1">
<li><strong>Gold Standard Matching</strong></li>
</ol>
<ul>
<li>Can be exact or partial matching</li>
<li>Particularly effective for binary decisions or multi-label classification</li>
<li>Can leverage “logical frogs” tricks from Chapter 7 to assess model confidence</li>
<li>Free-form text requires more creative evaluation approaches</li>
<li>Tool-use scenarios may be easier to evaluate, especially in agent-driven applications</li>
</ul>
<ol start="2" type="1">
<li><strong>Functional Testing</strong></li>
</ol>
<ul>
<li>A step up from unit tests but not full end-to-end testing</li>
<li>Focuses on testing specific system components</li>
</ul>
<ol start="3" type="1">
<li><strong>LLM as Judge</strong></li>
</ol>
<ul>
<li>Currently trendy but requires careful implementation</li>
<li>Should include human verification loop, preferably multiple humans</li>
<li>Key insight: Always frame the evaluation as if the LLM is grading someone else’s work, never its own</li>
<li>Recommendations for quantitative measures:
<ul>
<li>Use gradient and multi-aspect coverage (MA)</li>
<li>Implement 1-5 scales with specific criteria</li>
<li>Place all instructions and criteria before the content to be evaluated</li>
<li>Break down “Goldilocks” questions (was it just right?) into separate questions about whether it was enough and whether it was too much</li>
</ul></li>
</ul>
</section>
</section>
<section id="online-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="online-evaluation">Online Evaluation</h3>
<p>The chapter transitions to discussing why we need online testing despite having offline evaluation capabilities. While offline testing is safer and more scalable, real human interactions are unpredictable and require live testing.</p>
<p>Key points about online evaluation:</p>
<ul>
<li>AB testing is the standard approach</li>
<li>Existing solutions include Optimizely, VWO Consulting, and AB Tasty</li>
<li>Applications need to support running in two modes (A and B)</li>
<li>Consider rollout timing and users on older versions</li>
</ul>
<p>Five main metrics for online evaluation (from most to least straightforward):</p>
<ol type="1">
<li>Direct feedback (user responses to suggestions)</li>
<li>Functional correctness</li>
<li>User acceptance (following suggestions)</li>
<li>Achieved impact (user benefit)</li>
<li>Incidental metrics (surrounding measurements)</li>
</ol>
<p>Direct feedback data is particularly valuable as it can later be used for model fine-tuning. It’s recommended to track more incidental metrics rather than fewer, both for quality indicators and investigating unexpected changes.</p>
</section>
</section>
<section id="chapter-11-looking-ahead" class="level2">
<h2 class="anchored" data-anchor-id="chapter-11-looking-ahead">Chapter 11: Looking Ahead</h2>
<p>The final chapter covers several forward-looking topics:</p>
<ul>
<li>Multimodality in LLMs</li>
<li>User experience and interface considerations</li>
<li>Published artifacts from Anthropic</li>
<li>Risks and rewards of custom interfaces</li>
<li>Trends in model intelligence, cost, and speed</li>
</ul>
<section id="book-level-conclusions" class="level3">
<h3 class="anchored" data-anchor-id="book-level-conclusions">Book-Level Conclusions</h3>
<p>Two main lessons emerge from the book:</p>
<ol type="1">
<li><strong>LLMs as Text Completion Engines</strong>
<ul>
<li>They fundamentally mimic training data</li>
<li>Success comes from aligning prompts with training data patterns</li>
<li>Particularly relevant for completion models</li>
</ul></li>
<li><strong>Empathy with LLMs</strong></li>
</ol>
<ul>
<li>Think of them as mechanical friends with internet knowledge</li>
<li>Five key insights:
<ul>
<li>LLMs are easily distracted; keep prompts focused</li>
<li>If humans can’t understand the prompt, LLMs will struggle</li>
<li>Provide clear instructions and examples</li>
<li>Include all necessary information (LLMs aren’t psychic)</li>
<li>Give space for “thinking out loud” (chain of thought)</li>
</ul></li>
</ul>
</section>
</section>
<section id="personal-reflections" class="level2">
<h2 class="anchored" data-anchor-id="personal-reflections">Personal Reflections</h2>
<p>The book, while not revolutionary, provides valuable insights and is a recommended read at 250 pages. It can be completed in about 10-11 days. The heavy focus on completion models versus chat models is interesting, likely due to the authors’ experience with GitHub Copilot. While some points were novel, none were completely mind-blowing. The book’s emphasis on completion models versus chat models is both intriguing and occasionally confusing, though this perspective is understandable given the authors’ background with GitHub Copilot.</p>


</section>

 ]]></description>
  <category>llm</category>
  <category>prompt-engineering</category>
  <category>books-i-read</category>
  <category>evaluation</category>
  <guid>https://alexstrick.com/posts/2025-01-17-final-notes-on-prompt-engineering-for-llms.html</guid>
  <pubDate>Thu, 16 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/chapter-10-prompt-eng.png" medium="image" type="image/png" height="98" width="144"/>
</item>
<item>
  <title>Assembling the Prompt: Notes on ‘Prompt Engineering for LLMs’ ch 6</title>
  <dc:creator>Alex Strick van Linschoten</dc:creator>
  <link>https://alexstrick.com/posts/2025-01-13-assembling-the-prompt-notes-on-prompt-engineering-for-llms-ch-6.html</link>
  <description><![CDATA[ 




<p>Chapter 6 of “Prompt Engineering for LLMs” is devoted to how to structure the prompt and compose its various elements. We first learn about the different kinds of ‘documents’ that we can mimic with our prompts, then think about how to pick which pieces of context to include, and then think through how we might compose all of this together.</p>
<p><img src="https://alexstrick.com/posts/images/2025-01-13-assembling-the-prompt-ch-6/well-constructed-prompt.png" class="img-fluid"></p>
<p>There’s a great figure to give you an idea of ‘the anatomy of a well-constructed prompt’ early on. The introduction is where you introduce the task, then you have the ‘valley of meh’ (which the LLM can struggle to recall or obey) and finally you have the refocusing and restatement of the task.</p>
<p>There are two key tips at this point:</p>
<ul>
<li>the closer a piece of information is to the end of the prompt, the more impact it has on the model</li>
<li>the model often struggles with the information stuffed in the middle of the prompt</li>
</ul>
<p>So craft your prompts accordingly!</p>
<p>A prompt plus the resulting completion is defined as a ‘document’ in this book, and there are various templates that you can follow: an ‘advice conversation’, an ‘analytic report’ (often formatted with Markdown headers), and a ‘structured document’.</p>
<p>We learn that analytic report-type documents seem to offer a lighter ‘cognitive load’ for an LLM since it doesn’t have to handle the intricacies of social interaction that it would in the case of an advice conversation. 🤔</p>
<p>Two other tips or possible things to include in the analytic report-style document:</p>
<ul>
<li>a table of contents at the beginning to set the scene</li>
<li>a scratchpad or notebook section for the model to ‘think’ in</li>
</ul>
<p>I haven’t had much use of either of these myself but I can see why they’d be powerful.</p>
<p>Structured documents can be really powerful, especially when the model has been trained to expect certain kinds of structure (be it JSON or XML or YAML etc). Also TIL that apparently OpenAI’s models are very strong when dealing with JSON as inputs.</p>
<p>The context to be inserted into the prompt (usually dynamically depending on use case or needs) can be large or small depending on what is available in terms of context window or latency requirements. There are different strategies to how to select what goes in.</p>
<p>I was curious about the idea of what they call ‘elastic snippets’, i.e.&nbsp;dynamic decisions that get taken as to what makes it way into the prompt depending on how much space is available etc.</p>
<p>And even then you have to decide about the:</p>
<ul>
<li>position (which order do all the elements appear in the prompt)</li>
<li>importance (how much will dropping this element from the prompt effect the response)</li>
<li>dependency (if you include one element, can you drop another and vice versa…)</li>
</ul>
<p>In the end, you have a kind of optimisation problem: given a theoretical unlimited potential prompt length, how to combine all the elements together to get the most value given the space limitations that the LLM dictates.</p>
<p><img src="https://alexstrick.com/posts/images/2025-01-13-assembling-the-prompt-ch-6/additive-greedy.png" class="img-fluid"></p>
<p>And then what strategy do you use to get rid of elements that your prompt budget cannot afford; we learn about the ‘additive greedy approach’ and the ‘subtractive greedy approach’, all the while bearing in mind that these are all just basic prototypes to play around with.</p>
<p><img src="https://alexstrick.com/posts/images/2025-01-13-assembling-the-prompt-ch-6/subtr-greedy.png" class="img-fluid"></p>
<p>The next chapter is all about the completion and how to make sure we receive meaningful and accurate responses from our LLM!</p>



 ]]></description>
  <category>llm</category>
  <category>prompt-engineering</category>
  <category>books-i-read</category>
  <guid>https://alexstrick.com/posts/2025-01-13-assembling-the-prompt-notes-on-prompt-engineering-for-llms-ch-6.html</guid>
  <pubDate>Sun, 12 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://alexstrick.com/posts/images/2025-01-13-assembling-the-prompt-ch-6/well-constructed-prompt.png" medium="image" type="image/png" height="119" width="144"/>
</item>
</channel>
</rss>
