<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://aider.chat/feed.xml" rel="self" type="application/atom+xml" /><link href="https://aider.chat/" rel="alternate" type="text/html" /><updated>2026-04-25T16:44:59+00:00</updated><id>https://aider.chat/feed.xml</id><title type="html">aider</title><subtitle>aider is AI pair programming in your terminal</subtitle><entry><title type="html">Qwen3 benchmark results</title><link href="https://aider.chat/2025/05/08/qwen3.html" rel="alternate" type="text/html" title="Qwen3 benchmark results" /><published>2025-05-08T00:00:00+00:00</published><updated>2025-05-08T00:00:00+00:00</updated><id>https://aider.chat/2025/05/08/qwen3</id><content type="html" xml:base="https://aider.chat/2025/05/08/qwen3.html"><![CDATA[<h1 id="qwen3-results-on-the-aider-polyglot-benchmark">Qwen3 results on the aider polyglot benchmark</h1>

<p>As <a href="/2024/11/21/quantization.html">previously discussed when Qwen2.5 was released</a>,
details matter when working with open source models for AI coding.
Proprietary models are served by their creators or trusted providers with stable inference settings.
Open source models are wonderful because anyone can serve them,
but API providers can use very different inference settings, quantizations, etc.</p>

<p>Below are collection of aider polyglot benchmark results for the new Qwen3 models.
Results are presented using both “diff” and “whole” 
<a href="https://aider.chat/docs/more/edit-formats.html">edit formats</a>,
with various models settings, against various API providers.</p>

<p>See details on the 
<a href="https://aider.chat/docs/config/adv-model-settings.html#model-settings">model settings</a> 
used after the results table.</p>

<p class="note">This article is being updated as new results become available.
Also, some results were submitted by aider users and have not been verified.</p>

<h2 id="leaderboard-title">Qwen3 results on the aider polyglot benchmark</h2>

<div id="controls-container" style="display: flex; align-items: center; width: 100%; max-width: 800px; margin: 10px auto; gap: 10px; box-sizing: border-box; padding: 0 5px; position: relative;">
  <input type="text" id="editSearchInput" placeholder="Search..." style="flex-grow: 1; padding: 8px; border: 1px solid #ddd; border-radius: 4px;" />
  <div id="view-mode-toggle" style="display: inline-flex; border: 1px solid #ccc; border-radius: 4px;">
    <button id="mode-view-btn" class="mode-button active" data-mode="view" style="padding: 8px 8px; border: none; border-radius: 3px 0 0 3px; cursor: pointer; font-size: 14px; line-height: 1.5; min-width: 50px;">View</button>
    <button id="mode-select-btn" class="mode-button" data-mode="select" style="padding: 8px 8px; border: none; background-color: #f8f9fa; border-radius: 0; cursor: pointer; border-left: 1px solid #ccc; font-size: 14px; line-height: 1.5; min-width: 50px;">Select</button>
    <button id="mode-detail-btn" class="mode-button" data-mode="detail" style="padding: 8px 8px; border: none; background-color: #f8f9fa; border-radius: 0 3px 3px 0; cursor: pointer; border-left: 1px solid #ccc; font-size: 14px; line-height: 1.5; min-width: 50px;">Detail</button>
  </div>
<button id="close-controls-btn" style="width: 18px; height: 18px; padding: 0; border: 1px solid #ddd; border-radius: 50%; background-color: transparent; cursor: pointer; display: flex; align-items: center; justify-content: center; font-size: 12px; margin-left: 4px; color: #999;">×</button>
</div>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; width: 40px; text-align: center; vertical-align: middle;">
        <input type="checkbox" id="select-all-checkbox" style="display: none; cursor: pointer; vertical-align: middle;" />
      </th> <!-- Header checkbox added here -->
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center; width: 25%">Percent correct</th>
      <th style="padding: 8px; text-align: center; width: 25%">Cost</th>
      <th style="padding: 8px; text-align: left;" class="col-command">Command</th>
      <th style="padding: 8px; text-align: center; width: 10%" class="col-conform">Correct edit format</th>
      <th style="padding: 8px; text-align: left; width: 10%" class="col-edit-format">Edit Format</th>
    </tr>
  </thead>
  <tbody>
    
    
    
    
     
      
      <tr id="main-row-0">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-0" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="0" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3-235B-A22B whole with VLLM, bfloat16, recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 65.3%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>65.3%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/Qwen3-235B-A22B</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>100.0%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>whole</span></td>
      </tr>
      <tr class="details-row" id="details-0" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-04-30-04-49-37--Qwen3-235B-A22B-whole-nothink
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3-235B-A22B whole with VLLM, bfloat16, recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  whole
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  0c383df-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  28.0
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  65.3
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  63
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  147
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  100.0
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  3
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  166
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  3
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/Qwen3-235B-A22B</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-04-30
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.81.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  166.0
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-1">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-1" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="1" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3 235B A22B whole, no think, via official Alibaba API</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 61.8%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>61.8%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/qwen3-235b-a22b</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>100.0%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>whole</span></td>
      </tr>
      <tr class="details-row" id="details-1" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-09-23-01-22--qwen3-235b-a22b.unthink_16k_whole
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3 235B A22B whole, no think, via official Alibaba API
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  whole
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  425fb6d
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  26.7
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  61.8
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  60
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  139
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  100.0
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  175
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Prompt tokens
                  
                  :</strong>
                  2768173
                </li>
              
            
              
                <li><strong>
                  
                    Completion tokens
                  
                  :</strong>
                  384000
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  1
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/qwen3-235b-a22b</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-09
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  50.8
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-2">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-2" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="2" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3-235B-A22B diff with VLLM, bfloat16, recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 61.3%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>61.3%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/Qwen3-235B-A22B</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>94.7%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-2" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-04-30-04-49-50--Qwen3-235B-A22B-diff-nothink
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3-235B-A22B diff with VLLM, bfloat16, recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  0c383df-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  29.8
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  61.3
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  67
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  138
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  94.7
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  25
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  25
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  12
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  97
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  2
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/Qwen3-235B-A22B</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-04-30
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.81.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  158.2
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-3">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-3" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="3" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3 235B A22B diff, no think, via official Alibaba API</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 59.6%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>59.6%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/qwen3-235b-a22b</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>92.9%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-3" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-09-17-02-02--qwen3-235b-a22b.unthink_16k_diff
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3 235B A22B diff, no think, via official Alibaba API
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  91d7fbd-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  28.9
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  59.6
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  65
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  134
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  92.9
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  22
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  22
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  16
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  111
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Prompt tokens
                  
                  :</strong>
                  2816192
                </li>
              
            
              
                <li><strong>
                  
                    Completion tokens
                  
                  :</strong>
                  342062
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  1
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/qwen3-235b-a22b</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-09
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  45.4
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-4">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-4" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="4" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3-235B-A22B whole with llama.cpp, Q5_K_M (unsloth), recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 59.1%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>59.1%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/Qwen3-235B-A22B-Q5_K_M</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>100.0%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>whole</span></td>
      </tr>
      <tr class="details-row" id="details-4" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-07-03-15-59--Qwen3-235B-A22B-Q5_K_M-whole-nothink
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3-235B-A22B whole with llama.cpp, Q5_K_M (unsloth), recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  whole
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  8159cbf
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  27.1
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  59.1
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  61
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  133
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  100.0
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  1
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  169
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  1
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/Qwen3-235B-A22B-Q5_K_M</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-07
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  635.2
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-5">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-5" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="5" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3 235B A22B diff on OpenRouter only TogetherAI, recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 54.7%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>54.7%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          <div class="bar-viz cost-bar" data-cost="0.6399" data-max-cost="1.8037" style="width: 0%; background-color: rgba(13, 110, 253, 0.3); border-right: 1px solid rgba(13, 110, 253, 0.5);"></div>
          
          
          <span>$0.64</span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openrouter/qwen/qwen3-235b-a22b</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>90.7%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-5" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-08-17-39-14--qwen3-235b-or-together-only
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3 235B A22B diff on OpenRouter only TogetherAI, recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  328584e
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  28.0
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  54.7
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  63
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  123
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  90.7
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  39
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  32
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  21
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  106
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Prompt tokens
                  
                  :</strong>
                  2816606
                </li>
              
            
              
                <li><strong>
                  
                    Completion tokens
                  
                  :</strong>
                  362346
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  2
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openrouter/qwen/qwen3-235b-a22b</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-08
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  77.2
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.6399
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-6">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-6" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="6" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3 235B A22B diff on OpenRouter, all providers, default settings (thinking)</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 49.8%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>49.8%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          <div class="bar-viz cost-bar" data-cost="1.8037" data-max-cost="1.8037" style="width: 0%; background-color: rgba(13, 110, 253, 0.3); border-right: 1px solid rgba(13, 110, 253, 0.5);"></div>
          
          
          <span>$1.8</span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openrouter/qwen/qwen3-235b-a22b</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>91.6%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-6" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-08-03-22-37--qwen3-235b-defaults
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3 235B A22B diff on OpenRouter, all providers, default settings (thinking)
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  aaacee5-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  17.3
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  49.8
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  39
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  112
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  91.6
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  58
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  29
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  19
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  102
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Prompt tokens
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Completion tokens
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  1
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openrouter/qwen/qwen3-235b-a22b</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-08
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  428.1
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  1.8037
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-7">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-7" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="7" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3-32B whole with VLLM, bfloat16, recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 45.8%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>45.8%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/Qwen3-32B</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>100.0%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>whole</span></td>
      </tr>
      <tr class="details-row" id="details-7" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-04-30-04-08-41--Qwen3-32B-whole-nothink
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3-32B whole with VLLM, bfloat16, recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  whole
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  0c383df-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  20.4
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  45.8
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  46
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  103
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  100.0
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  3
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  94
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  3
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  5
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/Qwen3-32B</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-04-30
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.81.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  48.1
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-8">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-8" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="8" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3-32B diff with VLLM, bfloat16, recommended /no_think settings</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 41.3%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>41.3%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          
          <span></span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openai/Qwen3-32B</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>94.2%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-8" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-04-30-04-08-51--Qwen3-32B-diff-nothink
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3-32B diff with VLLM, bfloat16, recommended /no_think settings
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  0c383df-dirty
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  20.4
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  41.3
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  46
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  93
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  94.2
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  17
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  14
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  13
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  83
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  3
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  4
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openai/Qwen3-32B</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-04-30
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.81.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  59.4
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.0
                </li>
              
            
          </ul>
        </td>
      </tr>
     
      
      <tr id="main-row-9">
        <td style="padding: 8px; text-align: center; vertical-align: middle;">
          <button class="toggle-details" data-target="details-9" style="background: none; border: none; cursor: pointer; font-size: 16px; padding: 0; vertical-align: middle;">▶</button>
          <input type="checkbox" class="row-selector" data-row-index="9" style="display: none; cursor: pointer; vertical-align: middle;" />
        </td>
        <td style="padding: 8px;"><span>Qwen3 32B diff on OpenRouter, all providers, default settings (thinking)</span></td>
        <td class="bar-cell">
          <div class="bar-viz" style="width: 40.0%; background-color: rgba(40, 167, 69, 0.3); border-right: 1px solid rgba(40, 167, 69, 0.5);"></div>
          <span>40.0%</span>
        </td>
        <td class="bar-cell cost-bar-cell">
          
          <div class="bar-viz cost-bar" data-cost="0.7603" data-max-cost="1.8037" style="width: 0%; background-color: rgba(13, 110, 253, 0.3); border-right: 1px solid rgba(13, 110, 253, 0.5);"></div>
          
          
          <span>$0.76</span>
        </td>
        <td style="padding: 8px;" class="col-command"><span><code>aider --model openrouter/qwen/qwen3-32b</code></span></td>
        <td style="padding: 8px; text-align: center;" class="col-conform"><span>83.6%</span></td>
        <td style="padding: 8px;" class="col-edit-format"><span>diff</span></td>
      </tr>
      <tr class="details-row" id="details-9" style="display: none; background-color: #f9f9f9;">
        <td colspan="7" style="padding: 15px; border-bottom: 1px solid #ddd;">
          <ul style="margin: 0; padding-left: 20px; list-style: none; border-bottom: 1px solid #ddd;">
            
              
                <li><strong>
                  
                    Dirname
                  
                  :</strong>
                  2025-05-08-03-20-24--qwen3-32b-default
                </li>
              
            
              
                <li><strong>
                  
                    Test cases
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Model
                  
                  :</strong>
                  Qwen3 32B diff on OpenRouter, all providers, default settings (thinking)
                </li>
              
            
              
                <li><strong>
                  
                    Edit format
                  
                  :</strong>
                  diff
                </li>
              
            
              
                <li><strong>
                  
                    Commit hash
                  
                  :</strong>
                  aaacee5-dirty, aeaf259
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 1
                  
                  :</strong>
                  14.2
                </li>
              
            
              
                <li><strong>
                  
                    Pass rate 2
                  
                  :</strong>
                  40.0
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 1
                  
                  :</strong>
                  32
                </li>
              
            
              
                <li><strong>
                  
                    Pass num 2
                  
                  :</strong>
                  90
                </li>
              
            
              
                <li><strong>
                  
                    Percent cases well formed
                  
                  :</strong>
                  83.6
                </li>
              
            
              
                <li><strong>
                  
                    Error outputs
                  
                  :</strong>
                  119
                </li>
              
            
              
                <li><strong>
                  
                    Num malformed responses
                  
                  :</strong>
                  50
                </li>
              
            
              
                <li><strong>
                  
                    Num with malformed responses
                  
                  :</strong>
                  37
                </li>
              
            
              
                <li><strong>
                  
                    User asks
                  
                  :</strong>
                  97
                </li>
              
            
              
                <li><strong>
                  
                    Lazy comments
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Syntax errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Indentation errors
                  
                  :</strong>
                  0
                </li>
              
            
              
                <li><strong>
                  
                    Exhausted context windows
                  
                  :</strong>
                  12
                </li>
              
            
              
                <li><strong>
                  
                    Prompt tokens
                  
                  :</strong>
                  317591
                </li>
              
            
              
                <li><strong>
                  
                    Completion tokens
                  
                  :</strong>
                  120418
                </li>
              
            
              
                <li><strong>
                  
                    Test timeouts
                  
                  :</strong>
                  5
                </li>
              
            
              
                <li><strong>
                  
                    Total tests
                  
                  :</strong>
                  225
                </li>
              
            
              
                <li><strong>
                  
                    Command
                  
                  :</strong>
                  <code>aider --model openrouter/qwen/qwen3-32b</code>
                </li>
              
            
              
                <li><strong>
                  
                    Date
                  
                  :</strong>
                  2025-05-08
                </li>
              
            
              
                <li><strong>
                  
                    Versions
                  
                  :</strong>
                  0.82.4.dev
                </li>
              
            
              
                <li><strong>
                  
                    Seconds per case
                  
                  :</strong>
                  372.2
                </li>
              
            
              
                <li><strong>
                  
                    Total cost
                  
                  :</strong>
                  0.7603
                </li>
              
            
          </ul>
        </td>
      </tr>
    
  </tbody>
</table>

<style>
  #leaderboard-title {
    margin-bottom: 20px; /* Add space below the title */
  }
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  thead {
    border-top: 1px solid #ddd; /* Add top border to header */
  }
  td, th {
    border: none; /* Remove internal cell borders */
    word-wrap: break-word;
    overflow-wrap: break-word;
    vertical-align: middle; /* Ensure consistent vertical alignment */
  }
  tbody tr {
    height: 50px; /* Set a minimum height for all data rows */
  }
  td.col-command { /* Command column */
    font-size: 12px; /* Keep font size adjustment for command column if desired, or remove */
  }

  /* Hide new columns first on smaller screens */
  @media screen and (max-width: 991px) {
    th.col-conform, td.col-conform,
    th.col-edit-format, td.col-edit-format {
      display: none;
    }
    /* Increase width of Percent correct and Cost columns when others are hidden */
    th:nth-child(3), td:nth-child(3), /* Percent correct */
    th:nth-child(4), td:nth-child(4) { /* Cost */
      width: 33% !important; /* Override inline style */
    }
  }

  /* Hide command column on even smaller screens */
  @media screen and (max-width: 767px) {
    th.col-command, td.col-command { /* Command column */
      display: none;
    }
  }

  /* --- Control Styles --- */
  #controls-container {
    margin-bottom: 20px; /* Add some space below controls */
  }

  #editSearchInput, #view-mode-select {
    padding: 8px 12px; /* Consistent padding */
    border: 1px solid #ccc; /* Slightly softer border */
    border-radius: 4px;
    font-size: 14px; /* Match table font size */
    height: 38px; /* Match height */
    box-sizing: border-box; /* Include padding/border in height */
  }


  .bar-cell {
    position: relative; /* Positioning context for the bar */
    padding: 8px;
    /* text-align: center; Removed */
    overflow: hidden; /* Prevent bar from overflowing cell boundaries if needed */
  }
  .cost-bar-cell {
    background-image: none; /* Remove default gradient for cost cells */
  }
  .percent-tick, .cost-tick {
    position: absolute;
    top: 50%;
    transform: translateY(10px);
    height: 8px; /* Short tick */
    width: 1px;
    background-color: rgba(170, 170, 170, 0.5); 
    z-index: 2; /* Above the bar but below the text */
  }
  .bar-viz {
    position: absolute;
    left: 0;
    top: 50%; /* Position at the middle of the cell */
    transform: translateY(-50%); /* Center the bar vertically */
    z-index: 1; /* Above background, below ticks and text */
    height: 36px;
    border-radius: 0 2px 2px 0; /* Slightly rounded end corners */
    /* Width and colors are set inline via style attribute */
  }
  /* Add a tooltip class for showing cost information on hover */
  .cost-bar-cell:hover .bar-viz[style*="background-image"] {
    animation: stripe-animation 2s linear infinite;
  }
  @keyframes stripe-animation {
    0% { background-position: 0 0; }
    100% { background-position: 20px 0; }
  }
  .bar-cell span {
     position: absolute; /* Position relative to the cell */
     left: 5px; /* Position slightly inside the left edge */
     top: 50%; /* Center vertically */
     transform: translateY(-50%); /* Adjust vertical centering */
     z-index: 3; /* Ensure text is above everything else */
     background-color: rgba(255, 255, 255, 0.7); /* Semi-transparent white background */
     padding: 0 4px; /* Add padding around the text */
     border-radius: 3px; /* Rounded corners for the text background */
     font-size: 14px; /* Adjust font size for the numbers */
  }
  .toggle-details {
    color: #888; /* Make toggle symbol more subtle */
    transition: color 0.2s; /* Smooth transition on hover */
  }


  /* Style for selected rows */
  tr.row-selected > td {
    background-color: #e7f3ff; /* Example light blue highlight */
  }

  /* Ensure checkbox is vertically aligned if needed */
  .row-selector {
    vertical-align: middle;
  }

  /* Hide rows not matching the filter */
  tr.hidden-by-mode {
      display: none !important; /* Use important to override other display styles if necessary */
  }
  tr.hidden-by-search {
      display: none !important;
  }

  /* --- Mode Toggle Button Styles --- */
  #view-mode-toggle {
    height: 38px; /* Match input height */
    box-sizing: border-box;
    flex-shrink: 0; /* Prevent toggle from shrinking on small screens */
  }
  .mode-button {
    transition: background-color 0.2s ease-in-out, color 0.2s ease-in-out;
    white-space: nowrap; /* Prevent text wrapping */
  }
  .mode-button:not(.active) {
    background-color: #f8f9fa; /* Light grey background */
    color: #495057; /* Dark grey text */
  }
  .mode-button:not(.active):hover {
    background-color: #e2e6ea; /* Slightly darker grey on hover */
  }

  /* Style for highlighted rows in view mode */
  tr.view-highlighted > td {
    background-color: #fffef5; /* Very light yellow/cream */
    /* Border moved to specific cell below */
  }
  /* Apply border and adjust padding ONLY for the first *visible* cell (Model name) in view mode */
  tr.view-highlighted > td:nth-child(2) {
     border-left: 4px solid #ffc107; /* Warning yellow border */
     /* Original padding is 8px. Subtract border width. */
     padding-left: 4px;
  }
</style>

<script>
const LEADERBOARD_CUSTOM_TITLE = "Qwen3 results on the aider polyglot benchmark";
document.addEventListener('DOMContentLoaded', function() {
  let currentMode = 'view'; // 'view', 'select', 'detail'
  let selectedRows = new Set(); // Store indices of selected rows
  const MAX_DISPLAY_COST_CAP = 200; // Define the constant here

  const allMainRows = document.querySelectorAll('tr[id^="main-row-"]');
  const allDetailsRows = document.querySelectorAll('tr[id^="details-"]');
  const searchInput = document.getElementById('editSearchInput');
  const modeViewButton = document.getElementById('mode-view-btn');
  const modeDetailButton = document.getElementById('mode-detail-btn');
  const modeSelectButton = document.getElementById('mode-select-btn');
  const modeButtons = [modeViewButton, modeSelectButton, modeDetailButton];
  const selectAllCheckbox = document.getElementById('select-all-checkbox');
  const leaderboardTitle = document.getElementById('leaderboard-title'); // Get title element
  const defaultTitle = "Aider polyglot coding leaderboard";
  const filteredTitle = "Aider polyglot coding benchmark results (selected)";

  function applySearchFilter() {
    const searchTerm = searchInput.value.toLowerCase();
    allMainRows.forEach(row => {
      const textContent = row.textContent.toLowerCase();
      const detailsRow = document.getElementById(row.id.replace('main-row-', 'details-'));
      const matchesSearch = textContent.includes(searchTerm);

      if (matchesSearch) {
        row.classList.remove('hidden-by-search');
        if (detailsRow) detailsRow.classList.remove('hidden-by-search');
      } else {
        row.classList.add('hidden-by-search');
        if (detailsRow) detailsRow.classList.add('hidden-by-search');
      }
    });
    // After applying search filter, re-apply view mode filter and update select-all state
    updateTableView(currentMode);
    if (currentMode === 'select') {
        updateSelectAllCheckboxState();
    }
    
    // Update cost bars and ticks since visible rows may have changed
    updateCostBars();
    updateCostTicks();
  }

  function getVisibleMainRows() {
      // Helper to get rows currently visible (not hidden by search or mode)
      return Array.from(allMainRows).filter(row =>
          !row.classList.contains('hidden-by-search') && !row.classList.contains('hidden-by-mode')
      );
  }

  function updateSelectAllCheckboxState() {
      // Update the header checkbox based on the selection state of *visible* rows
      if (currentMode !== 'select') return; // Only relevant in select mode

      const visibleRows = getVisibleMainRows();
      const visibleRowCount = visibleRows.length;
      const selectedVisibleRowCount = visibleRows.filter(row => selectedRows.has(row.querySelector('.row-selector')?.dataset.rowIndex)).length;

      if (visibleRowCount === 0) {
          selectAllCheckbox.checked = false;
          selectAllCheckbox.indeterminate = false;
      } else if (selectedVisibleRowCount === visibleRowCount) {
          selectAllCheckbox.checked = true;
          selectAllCheckbox.indeterminate = false;
      } else if (selectedVisibleRowCount > 0) {
          selectAllCheckbox.checked = false;
          selectAllCheckbox.indeterminate = true;
      } else {
          selectAllCheckbox.checked = false;
          selectAllCheckbox.indeterminate = false;
      }
  }


  function updateTableView(mode) {
    currentMode = mode; // Update global state ('view', 'select', 'detail')

    // Update button styles first
    modeButtons.forEach(btn => {
        btn.classList.remove('active');
        // Reset specific styles potentially added by .active
        btn.style.backgroundColor = '';
        btn.style.color = '';
    });
    let activeButton;
    if (mode === 'view') activeButton = modeViewButton;
    else if (mode === 'select') activeButton = modeSelectButton;
    else if (mode === 'detail') activeButton = modeDetailButton;

    activeButton.classList.add('active');
    activeButton.style.backgroundColor = '#e7f3ff'; // Use selected row highlight blue
    activeButton.style.color = '#495057'; // Use dark text for contrast on light blue

    // Get the first header cell (for the toggle/checkbox column)
    const firstHeaderCell = document.querySelector('table thead th:first-child');

    // Show/hide header checkbox based on mode
    selectAllCheckbox.style.display = mode === 'select' ? 'inline-block' : 'none';

    allMainRows.forEach(row => {
      const rowIndex = row.querySelector('.row-selector')?.dataset.rowIndex;
      const toggleButton = row.querySelector('.toggle-details');
      const selectorCheckbox = row.querySelector('.row-selector');
      const firstCell = row.querySelector('td:first-child'); // Get the first cell of the main row
      const detailsRow = document.getElementById(`details-${rowIndex}`);
      const isSelected = selectedRows.has(rowIndex);

      // Reset visibility classes before applying mode logic
      row.classList.remove('hidden-by-mode');
      if (detailsRow) detailsRow.classList.remove('hidden-by-mode');

      // Show/hide the first column (header and data cells) based on mode
      if (firstHeaderCell) {
          firstHeaderCell.style.display = mode === 'view' ? 'none' : '';
      }
      if (firstCell) {
          firstCell.style.display = mode === 'view' ? 'none' : '';
      }

      // Apply mode-specific logic
      if (mode === 'view') { // --- VIEW MODE ---
          toggleButton.style.display = 'none'; // Hide toggle in view mode
          selectorCheckbox.style.display = 'none';
          row.classList.remove('row-selected'); // Ensure no selection highlight
          // view-highlighted is handled by row click listener

          // In 'view' mode, hide row if selections exist AND this row is NOT selected
          if (selectedRows.size > 0 && !isSelected) {
              row.classList.add('hidden-by-mode');
              if (detailsRow) detailsRow.classList.add('hidden-by-mode');
          } else {
              // Ensure row is not hidden by mode if it's selected or no selections exist
              // This is handled by the reset at the start of the loop:
              // row.classList.remove('hidden-by-mode');
              // if (detailsRow) detailsRow.classList.remove('hidden-by-mode');
          }
          // Always hide details row content in view mode regardless of visibility class
          if (detailsRow) {
              detailsRow.style.display = 'none';
          }

      } else if (mode === 'select') { // --- SELECT MODE ---
          toggleButton.style.display = 'none';
          selectorCheckbox.style.display = 'inline-block';
          selectorCheckbox.checked = isSelected;
          row.classList.toggle('row-selected', isSelected);
          row.classList.remove('view-highlighted'); // Clear view highlight when switching to select
          // Always hide details row in select mode
          if (detailsRow) detailsRow.style.display = 'none';

          // In 'select' mode, no rows should be hidden based on selection status
          row.classList.remove('hidden-by-mode');
          if (detailsRow) detailsRow.classList.remove('hidden-by-mode');

      } else { // --- DETAIL MODE --- (mode === 'detail')
          toggleButton.style.display = 'inline-block'; // Show toggle
          selectorCheckbox.style.display = 'none';
          row.classList.remove('row-selected'); // Clear selection highlight
          row.classList.remove('view-highlighted'); // Clear view highlight when switching to detail
          // Details row visibility is controlled by the toggle button state, don't force hide/show here
          // Ensure main row is visible if not hidden by search
          row.classList.remove('hidden-by-mode');
          if (detailsRow) {
              detailsRow.classList.remove('hidden-by-mode');
              // Preserve existing display state (controlled by toggle) unless hidden by search
              if (detailsRow.classList.contains('hidden-by-search')) {
                  detailsRow.style.display = 'none';
              }
          }
      }


      // Ensure rows hidden by search remain hidden regardless of mode
      if (row.classList.contains('hidden-by-search')) {
          row.style.display = 'none';
          if (detailsRow) detailsRow.style.display = 'none';
      } else if (!row.classList.contains('hidden-by-mode')) {
          // Make row visible if not hidden by search or mode
          row.style.display = ''; // Or 'table-row' if needed, but '' usually works
      } else {
          // Row is hidden by mode, ensure it's hidden
          row.style.display = 'none';
          if (detailsRow) detailsRow.style.display = 'none';
      }


    });

    // Update the leaderboard title based on mode and selection
    if (leaderboardTitle) {
      // Check if a custom title is provided globally
      if (typeof LEADERBOARD_CUSTOM_TITLE !== 'undefined' && LEADERBOARD_CUSTOM_TITLE) {
        leaderboardTitle.textContent = LEADERBOARD_CUSTOM_TITLE;
      } else {
        if (currentMode === 'view' && selectedRows.size > 0) {
          leaderboardTitle.textContent = filteredTitle;
        } else {
          leaderboardTitle.textContent = defaultTitle;
        }
      }
    }

    // Update the select-all checkbox state after updating the view
    updateSelectAllCheckboxState();
    
    // Update cost bars and ticks since visible/selected rows may have changed
    updateCostBars();
    updateCostTicks();
  }


  // --- Existing Initializations ---
  // Add percentage ticks
  const percentCells = document.querySelectorAll('.bar-cell:not(.cost-bar-cell)');
  percentCells.forEach(cell => {
    // Add ticks at 0%, 10%, 20%, ..., 100%
    for (let i = 0; i <= 100; i += 10) {
      const tick = document.createElement('div');
      tick.className = 'percent-tick';
      tick.style.left = `${i}%`;
      cell.appendChild(tick);
    }
  });

  // Function to calculate the appropriate max display cost based on visible/selected entries
  function calculateDisplayMaxCost() {
    // Get the appropriate set of rows based on the current mode and selection state
    let rowsToConsider;    
    
    if (currentMode === 'view' && selectedRows.size > 0) {
      // In view mode with selections, only consider selected rows
      rowsToConsider = Array.from(allMainRows).filter(row => {
        const rowIndex = row.querySelector('.row-selector')?.dataset.rowIndex;
        return rowIndex && selectedRows.has(rowIndex) && !row.classList.contains('hidden-by-search');
      });
    } else {
      // In other modes or without selections, consider all visible rows
      rowsToConsider = getVisibleMainRows();
    }
    
    // Find the maximum cost among the rows to consider
    let maxCost = 0;
    rowsToConsider.forEach(row => {
      const costBar = row.querySelector('.cost-bar');
      if (costBar) {
        const cost = parseFloat(costBar.dataset.cost || '0');
        if (cost > maxCost) maxCost = cost;
      }
    });
    
    // Cap at MAX_DISPLAY_COST_CAP if any entries exceed that amount, otherwise use actual max
    return maxCost > MAX_DISPLAY_COST_CAP ? MAX_DISPLAY_COST_CAP : Math.max(1, maxCost); // Ensure at least 1 to avoid division by zero
  }
  
  // Process cost bars with dynamic scale
  function updateCostBars() {
    const costBars = document.querySelectorAll('.cost-bar');
    const currentMaxDisplayCost = calculateDisplayMaxCost();
    
    // Remove existing special indicators first
    document.querySelectorAll('.dark-section, .tear-line').forEach(el => el.remove());
    
    costBars.forEach(bar => {
      const cost = parseFloat(bar.dataset.cost);
      
      if (cost > 0) {
        // Calculate percentage based on the dynamic display max
        const percent = Math.min(cost, currentMaxDisplayCost) / currentMaxDisplayCost * 100;
        // Clamp percentage between 0 and 100
        bar.style.width = Math.max(0, Math.min(100, percent)) + '%';
        
        // Mark bars that exceed the limit (only if our display max is capped at 50)
        if (currentMaxDisplayCost === MAX_DISPLAY_COST_CAP && cost > MAX_DISPLAY_COST_CAP) {
          // Create a darker section at the end with diagonal stripes
          const darkSection = document.createElement('div');
          darkSection.className = 'bar-viz dark-section';
          darkSection.style.width = '15%'; // From 85% to 100%
          darkSection.style.left = '85%';
          darkSection.style.backgroundColor = 'rgba(13, 110, 253, 0.6)'; // Darker blue
          darkSection.style.borderRight = '1px solid rgba(13, 110, 253, 0.8)';
          darkSection.style.zIndex = '1';
          // Add diagonal stripes with CSS background
          darkSection.style.backgroundImage = 'repeating-linear-gradient(45deg, rgba(255,255,255,0.3), rgba(255,255,255,0.3) 5px, transparent 5px, transparent 10px)';
          bar.parentNode.appendChild(darkSection);
          
          // Add a dashed "tear line" at the transition point
          const tearLine = document.createElement('div');
          tearLine.className = 'tear-line';
          tearLine.style.position = 'absolute';
          tearLine.style.left = '85%';
          // Center the tear line vertically and make it 1.5x as tall as the bar
          tearLine.style.top = '50%';
          tearLine.style.transform = 'translateY(-50%)';
          tearLine.style.height = '54px'; // 1.5x the bar height (36px)
          tearLine.style.width = '2px';
          tearLine.style.backgroundColor = 'white';
          tearLine.style.borderLeft = '2px dashed rgba(0, 0, 0, 0.3)';
          tearLine.style.zIndex = '2'; // Above the bar
          bar.parentNode.appendChild(tearLine);
        }
      } else {
        // Set width to 0 if cost is 0 or negative
        bar.style.width = '0%';
      }
    });
  }
  
  // Call this initially to set up the bars
  updateCostBars();

  // Update cost ticks dynamically based on current max display cost
  function updateCostTicks() {
    const costCells = document.querySelectorAll('.cost-bar-cell');
    if (costCells.length === 0) return;
    
    const currentMaxDisplayCost = calculateDisplayMaxCost();
    
    // Remove existing ticks first
    document.querySelectorAll('.cost-tick').forEach(tick => tick.remove());
    
    // Generate appropriate tick values based on current max
    let tickValues = [];
    
    // Always use $10 increments, regardless of the max
    const maxTickValue = Math.ceil(currentMaxDisplayCost / 10) * 10; // Round up to nearest $10
    
    for (let i = 0; i <= maxTickValue; i += 10) {
      tickValues.push(i);
    }
    
    // Calculate percentage positions for each tick
    const tickPercentages = tickValues.map(tickCost => {
      return (tickCost / currentMaxDisplayCost) * 100;
    });
    
    // Add tick divs to each cost cell
    costCells.forEach(cell => {
      const costBar = cell.querySelector('.cost-bar');
      // Use optional chaining and provide '0' as fallback if costBar or dataset.cost is missing
      const cost = parseFloat(costBar?.dataset?.cost || '0');
      
      // Only add ticks if the cost is actually greater than 0
      if (cost > 0) {
        tickPercentages.forEach((percent, index) => {
          // Ensure percentage is within valid range
          if (percent >= 0 && percent <= 100) {
            const tick = document.createElement('div');
            tick.className = 'cost-tick';
            tick.style.left = `${percent}%`;
            cell.appendChild(tick);
          }
        });
      }
    });
  }
  
  // Call this initially to set up the ticks
  updateCostTicks();


  // --- New Event Listeners ---

  // Listener for mode toggle buttons
  modeButtons.forEach(button => {
    button.addEventListener('click', function(event) {
      const newMode = this.dataset.mode;
      if (newMode !== currentMode) {
        // Update active button style
        modeButtons.forEach(btn => {
            btn.classList.remove('active');
            // Reset specific styles potentially added by .active
            btn.style.backgroundColor = '';
            btn.style.color = '';
        });
        this.classList.add('active');
        // Apply active styles directly as inline styles might interfere
        this.style.backgroundColor = '#e7f3ff'; // Use selected row highlight blue
        this.style.color = '#495057'; // Use dark text for contrast on light blue

        // Update table view and apply filters
        updateTableView(newMode);
        applySearchFilter(); // Re-apply search filter when mode changes
      }
    });
  });

  // Listener for row selector checkboxes (using event delegation on table body)
  const tableBody = document.querySelector('table tbody');
  tableBody.addEventListener('change', function(event) {
    if (event.target.classList.contains('row-selector') && currentMode === 'select') {
      const checkbox = event.target;
      const rowIndex = checkbox.dataset.rowIndex;
      const mainRow = checkbox.closest('tr');

      if (checkbox.checked) {
        selectedRows.add(rowIndex);
        mainRow.classList.add('row-selected');
      } else {
        selectedRows.delete(rowIndex);
        mainRow.classList.remove('row-selected');
      }
      // Update select-all checkbox state
      updateSelectAllCheckboxState();
      
      // Update cost bars and ticks if in view mode, as selection affects what's shown
      if (currentMode === 'view') {
        updateCostBars();
        updateCostTicks();
      }
    }
  }); // End of tableBody listener

  // Listener for Select All checkbox
  selectAllCheckbox.addEventListener('change', function() {
      if (currentMode !== 'select') return;

      const isChecked = selectAllCheckbox.checked;
      // Select/deselect only the rows that are currently visible
      const visibleRows = getVisibleMainRows();

      visibleRows.forEach(row => {
          const checkbox = row.querySelector('.row-selector');
          const rowIndex = checkbox?.dataset.rowIndex;
          if (!checkbox || !rowIndex) return; // Skip if no checkbox/index found

          // Only change state if it differs from target state
          if (checkbox.checked !== isChecked) {
              checkbox.checked = isChecked;
              row.classList.toggle('row-selected', isChecked);
              if (isChecked) {
                  selectedRows.add(rowIndex);
              } else {
                  selectedRows.delete(rowIndex);
              }
          }
      });
      // After bulk change, ensure the selectAll checkbox state is correct (not indeterminate)
      updateSelectAllCheckboxState();
      
      // Update cost bars and ticks after selection changes
      updateCostBars();
      updateCostTicks();
  });

  // Listener for search input
  searchInput.addEventListener('input', applySearchFilter);

  // Add toggle functionality for details (Modified to respect modes)
  const toggleButtons = document.querySelectorAll('.toggle-details');
  toggleButtons.forEach(button => {
    button.addEventListener('click', function() {
      // Only allow toggling in 'detail' mode
      if (currentMode !== 'detail') return;

      const targetId = this.getAttribute('data-target');
      const targetRow = document.getElementById(targetId);
      const mainRow = this.closest('tr'); // Get the main row associated with this button

      if (targetRow && !mainRow.classList.contains('hidden-by-mode') && !mainRow.classList.contains('hidden-by-search')) {
        const isVisible = targetRow.style.display !== 'none';
        targetRow.style.display = isVisible ? 'none' : 'table-row';
        this.textContent = isVisible ? '▶' : '▼';
      }
    });
  });

  // Listener for clicking anywhere on a row
  tableBody.addEventListener('click', function(event) {
    const clickedRow = event.target.closest('tr');

    // Ensure it's a main row and not a details row or header/footer
    if (!clickedRow || !clickedRow.id.startsWith('main-row-')) return;

    // --- START conditional logic ---
    if (currentMode === 'select') {
        // --- SELECT MODE LOGIC (Existing) ---
        // Find the checkbox within this row
        const checkbox = clickedRow.querySelector('.row-selector');
        if (!checkbox) return; // No checkbox found in this row

        // If the click was directly on the checkbox or its label (if any),
        // let the default behavior and the 'change' event listener handle it.
        // Otherwise, toggle the checkbox state programmatically.
        if (event.target !== checkbox && event.target.tagName !== 'LABEL' /* Add if you use labels */) {
            checkbox.checked = !checkbox.checked;
            // Manually trigger the change event to update state and UI
            checkbox.dispatchEvent(new Event('change', { bubbles: true }));
        }
        // --- END SELECT MODE LOGIC ---

    } else if (currentMode === 'view') {
        // --- VIEW MODE LOGIC (New) ---
        // Don't highlight if the click was on the details toggle button
        if (event.target.classList.contains('toggle-details')) {
            return;
        }
        // Toggle the highlight class on the clicked row
        clickedRow.classList.toggle('view-highlighted');
        // --- END VIEW MODE LOGIC ---
    }
    // --- END conditional logic ---
  });


  // --- Initial Setup ---
  updateTableView('view'); // Initialize view to 'view' mode
  applySearchFilter(); // Apply initial search filter (if any text is pre-filled or just to set initial state)

// Close button functionality
const closeControlsBtn = document.getElementById('close-controls-btn');
if (closeControlsBtn) {
  closeControlsBtn.addEventListener('click', function() {
    const controlsContainer = document.getElementById('controls-container');
    if (controlsContainer) {
      controlsContainer.style.display = 'none';
    }
  });
}

});

</script>

<h2 id="no-think-via-official-alibaba-api">No think, via official Alibaba API</h2>

<p>These results were obtained running against <code class="language-plaintext highlighter-rouge">https://dashscope.aliyuncs.com/compatible-mode/v1</code>
with no thinking.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">OPENAI_API_BASE</span><span class="o">=</span>https://dashscope.aliyuncs.com/compatible-mode/v1
<span class="nb">export </span><span class="nv">OPENAI_API_KEY</span><span class="o">=</span>&lt;key&gt;
</code></pre></div></div>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">openai/qwen3-235b-a22b</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="m">0.7</span>
  <span class="na">streaming</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">stream</span><span class="pi">:</span> <span class="no">false</span>
    <span class="na">max_tokens</span><span class="pi">:</span> <span class="m">16384</span>
    <span class="na">top_p</span><span class="pi">:</span> <span class="m">0.8</span>
    <span class="na">top_k</span><span class="pi">:</span> <span class="m">20</span>
    <span class="na">temperature</span><span class="pi">:</span> <span class="m">0.7</span>
    <span class="na">enable_thinking</span><span class="pi">:</span> <span class="no">false</span>
    <span class="na">extra_body</span><span class="pi">:</span>
      <span class="na">enable_thinking</span><span class="pi">:</span> <span class="no">false</span>
</code></pre></div></div>

<h2 id="openrouter-only-togetherai-recommended-no_think-settings">OpenRouter only TogetherAI, recommended /no_think settings</h2>

<p>These results were obtained with the 
<a href="https://huggingface.co/Qwen/Qwen3-235B-A22B#best-practices">recommended</a>
non-thinking model settings in <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">openrouter/qwen/qwen3-235b-a22b</span>
  <span class="na">system_prompt_prefix</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/no_think"</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="m">0.7</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">max_tokens</span><span class="pi">:</span> <span class="m">24000</span>
    <span class="na">top_p</span><span class="pi">:</span> <span class="m">0.8</span>
    <span class="na">top_k</span><span class="pi">:</span> <span class="m">20</span>
    <span class="na">min_p</span><span class="pi">:</span> <span class="m">0.0</span>
    <span class="na">temperature</span><span class="pi">:</span> <span class="m">0.7</span>
    <span class="na">extra_body</span><span class="pi">:</span>
      <span class="na">provider</span><span class="pi">:</span>
        <span class="na">order</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">Together"</span><span class="pi">]</span>
</code></pre></div></div>

<p>And then running aider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aider <span class="nt">--model</span> openrouter/qwen/qwen3-235b-a22b
</code></pre></div></div>

<h2 id="openrouter-all-providers-default-settings-thinking">OpenRouter, all providers, default settings (thinking)</h2>

<p>These results were obtained by simply running aider as shown below, without any model specific settings.
This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aider <span class="nt">--model</span> openrouter/qwen/qwen3-xxx
</code></pre></div></div>

<h2 id="vllm-bfloat16-recommended-no_think">VLLM, bfloat16, recommended /no_think</h2>

<p>These <a href="https://github.com/Aider-AI/aider/pull/3908">benchmarks results were obtained by GitHub user AlongWY</a>
with the 
<a href="https://huggingface.co/Qwen/Qwen3-235B-A22B#best-practices">recommended</a>
non-thinking model settings in <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">openai/&lt;model-name&gt;</span>
  <span class="na">system_prompt_prefix</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/no_think"</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="m">0.7</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">max_tokens</span><span class="pi">:</span> <span class="m">24000</span>
    <span class="na">top_p</span><span class="pi">:</span> <span class="m">0.8</span>
    <span class="na">top_k</span><span class="pi">:</span> <span class="m">20</span>
    <span class="na">min_p</span><span class="pi">:</span> <span class="m">0.0</span>
    <span class="na">temperature</span><span class="pi">:</span> <span class="s">0.7</span>        
</code></pre></div></div>

<p>And then running aider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aider <span class="nt">--model</span> openai/&lt;model-name&gt; <span class="nt">--openai-api-base</span> &lt;url&gt;
</code></pre></div></div>]]></content><author><name></name></author><summary type="html"><![CDATA[Benchmark results for Qwen3 models using the Aider polyglot coding benchmark.]]></summary></entry><entry><title type="html">Gemini 2.5 Pro Preview 03-25 benchmark cost</title><link href="https://aider.chat/2025/05/07/gemini-cost.html" rel="alternate" type="text/html" title="Gemini 2.5 Pro Preview 03-25 benchmark cost" /><published>2025-05-07T00:00:00+00:00</published><updated>2025-05-07T00:00:00+00:00</updated><id>https://aider.chat/2025/05/07/gemini-cost</id><content type="html" xml:base="https://aider.chat/2025/05/07/gemini-cost.html"><![CDATA[<p class="post-date">May 07, 2025</p>

<h1 id="gemini-25-pro-preview-03-25-benchmark-cost">Gemini 2.5 Pro Preview 03-25 benchmark cost</h1>

<h2 id="summary">Summary</h2>
<p>The $6.32 cost reported to run the aider polyglot benchmark on
Gemini 2.5 Pro Preview 03-25 was incorrect.
The true cost was higher, possibly significantly so.
The incorrect cost has been removed from the leaderboard.</p>

<p>An investigation determined the primary cause was that the litellm
package (used by aider for LLM API connections) was not properly including reasoning tokens in 
the token counts it reported.
While an incorrect price-per-token entry for the model also existed in litellm’s cost
database at that time, this was found not to be a contributing factor.
Aider’s own internal, correct pricing data was utilized during the benchmark.</p>

<h2 id="resolution">Resolution</h2>

<p>Litellm began correctly including reasoning tokens in the reported counts
on April 21, 2025 in 
commit <a href="https://github.com/BerriAI/litellm/commit/a7db0df0434bfbac2b68ebe1c343b77955becb4b">a7db0df</a>.
This change was released in litellm v1.67.1.
Aider picked up this change April 28, 2025 when it upgraded its litellm dependency 
from v1.65.7 to v1.67.4.post1
in commit <a href="https://github.com/Aider-AI/aider/commit/9351f37">9351f37</a>.
That dependency change shipped on May 5, 2025 in aider v0.82.3.</p>

<p>Unfortunately the 03-25 version of Gemini 2.5 Pro Preview is no longer available,
so it is not possible to re-run the benchmark to obtain an accurate cost.
As a possibly relevant comparison, the newer 05-06 version of Gemini 2.5 Pro Preview
completed the benchmark at a cost of about $37.</p>

<h2 id="investigation-detail">Investigation detail</h2>

<p>The version of litellm available at that time of the benchmark appears to have been
excluding reasoning tokens from the token counts it reported.
So even though aider had correct per-token pricing, it did not have the correct token counts
used during the benchmark.
This resulted in an underestimate of the benchmark costs.</p>

<p>The incorrect litellm database entry does not appear to have affected the aider benchmark costs.
Aider maintains and uses its own database of costs for some models, and it contained
the correct pricing at the time of the benchmark.
Aider appears to have
loaded the correct cost data from its database and made use of it during the benchmark.</p>

<p>Every aider benchmark report contains the git commit hash of the aider repository state used to
run the benchmark.
The 
<a href="https://github.com/Aider-AI/aider/blob/edbfec0ce4e1fe86735c915cb425b0d8636edc32/aider/website/_data/polyglot_leaderboard.yml#L814">benchmark run in question</a>
was built from 
commit <a href="https://github.com/Aider-AI/aider/commit/0282574">0282574</a>.</p>

<p>Additional runs of the benchmark from that build verified that the error in litellm’s
model cost database appears not to have been a factor:</p>

<ul>
  <li>Aider’s internal model database correctly overrides the litellm database, which contained an incorrect token cost at the time.</li>
  <li>The correct pricing is loaded from aider’s internal model database and produces similar (incorrect) costs as the original run.</li>
  <li>Updating aider’s internal model database with an absurdly high token cost resulted in an appropriately high benchmark cost report, demonstrating that the internal database costs were in effect.</li>
</ul>

<p>This specific build of aider was then updated with various versions of litellm using <code class="language-plaintext highlighter-rouge">git biset</code>
to identify the first litellm commit where reasoning tokens counts were correctly reported.</p>

<h2 id="timeline">Timeline</h2>

<p>Below is the full timeline of git commits related to this issue in the aider and litellm repositories.
Each entry has a UTC timestamp, followed by the original literal timestamp obtained from the
relevant source.</p>

<ul>
  <li>2025-04-04 19:54:45 UTC (Sat Apr 5 08:54:45 2025 +1300)
    <ul>
      <li>Correct value <code class="language-plaintext highlighter-rouge">"output_cost_per_token": 0.000010</code> for  <code class="language-plaintext highlighter-rouge">gemini/gemini-2.5-pro-preview-03-25</code> added to <code class="language-plaintext highlighter-rouge">aider/resources/model-metadata.json</code></li>
      <li>Commit <a href="https://github.com/Aider-AI/aider/commit/eda796d">eda796d</a> in aider.</li>
    </ul>
  </li>
  <li>2025-04-05 16:20:01 UTC (Sun Apr 6 00:20:01 2025 +0800)
    <ul>
      <li>First litellm commit of <code class="language-plaintext highlighter-rouge">gemini/gemini-2.5-pro-preview-03-25</code> metadata, with incorrect price <code class="language-plaintext highlighter-rouge">"output_cost_per_token": 0.0000010</code></li>
      <li>Commit <a href="https://github.com/BerriAI/litellm/commit/cd0a1e6">cd0a1e6</a> in litellm.</li>
    </ul>
  </li>
  <li>2025-04-10 01:48:43 UTC (Wed Apr 9 18:48:43 2025 -0700)
    <ul>
      <li>litellm commit updates <code class="language-plaintext highlighter-rouge">gemini/gemini-2.5-pro-preview-03-25</code> metadata, but not price</li>
      <li>Commit <a href="https://github.com/BerriAI/litellm/commit/ac4f32f">ac4f32f</a> in litellm.</li>
    </ul>
  </li>
  <li>2025-04-12 04:55:50 UTC (2025-04-12-04-55-50 UTC)
    <ul>
      <li>Benchmark performed.</li>
      <li>Aider repo hash <a href="https://github.com/Aider-AI/aider/blob/7fbeafa1cfd4ad83f7499417837cdfa6b16fe7a1/aider/website/_data/polyglot_leaderboard.yml#L814">0282574 recorded in benchmark results</a>, without a “dirty” annotation, indicating that the benchmark was run on a clean checkout of the aider repo at commit <a href="https://github.com/Aider-AI/aider/commit/0282574">0282574</a>.</li>
      <li>Correct value <code class="language-plaintext highlighter-rouge">"output_cost_per_token": 0.000010</code> is in <code class="language-plaintext highlighter-rouge">aider/resources/model-metadata.json</code> at this commit <a href="https://github.com/Aider-AI/aider/blob/0282574/aider/resources/model-metadata.json#L357">0282574</a>.</li>
    </ul>
  </li>
  <li>2025-04-12 15:06:39 UTC (Apr 12 08:06:39 2025 -0700)
    <ul>
      <li>Benchmark results added to aider repo.</li>
      <li>Commit <a href="https://github.com/Aider-AI/aider/commit/7fbeafa">7fbeafa</a> in aider.</li>
    </ul>
  </li>
  <li>2025-04-12 15:20:04 UTC (Sat Apr 12 19:20:04 2025 +0400)
    <ul>
      <li>litellm commit fixes <code class="language-plaintext highlighter-rouge">gemini/gemini-2.5-pro-preview-03-25</code> price metadata to <code class="language-plaintext highlighter-rouge">"output_cost_per_token": 0.00001</code></li>
      <li>Commit <a href="https://github.com/BerriAI/litellm/commit/93037ea">93037ea</a> in litellm.</li>
    </ul>
  </li>
  <li>2025-04-22 05:48:00 UTC (Mon Apr 21 22:48:00 2025 -0700)
    <ul>
      <li>Litellm started including reasoning tokens in token count reporting.</li>
      <li>Commit <a href="https://github.com/BerriAI/litellm/commit/a7db0df0434bfbac2b68ebe1c343b77955becb4b">a7db0df</a> in litellm.</li>
      <li>This fix was released in litellm v1.67.1.</li>
    </ul>
  </li>
  <li>2025-04-28 14:53:20 UTC (Mon Apr 28 07:53:20 2025 -0700)
    <ul>
      <li>Aider upgraded its litellm dependency from v1.65.7 to v1.67.4.post1, which included the reasoning token count fix.</li>
      <li>Commit <a href="https://github.com/Aider-AI/aider/commit/9351f37">9351f37</a> in aider.</li>
      <li>This dependency change shipped on May 5, 2025 in aider v0.82.3.</li>
    </ul>
  </li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[The $6.32 benchmark cost reported for Gemini 2.5 Pro Preview 03-25 was incorrect.]]></summary></entry><entry><title type="html">Alternative DeepSeek V3 providers</title><link href="https://aider.chat/2025/01/28/deepseek-down.html" rel="alternate" type="text/html" title="Alternative DeepSeek V3 providers" /><published>2025-01-28T00:00:00+00:00</published><updated>2025-01-28T00:00:00+00:00</updated><id>https://aider.chat/2025/01/28/deepseek-down</id><content type="html" xml:base="https://aider.chat/2025/01/28/deepseek-down.html"><![CDATA[<p class="post-date">January 28, 2025</p>

<h1 class="no_toc" id="alternative-deepseek-v3-providers">Alternative DeepSeek V3 providers</h1>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

<p>DeepSeek’s API has been experiencing significant reliability issues for the past 24-48+ hours, with many users reporting downtime and overload problems.
Their <a href="https://status.deepseek.com">status page</a> notes an ongoing incident.</p>

<p>If you’re affected by these issues, several alternative providers offer access to DeepSeek V3. This article compares their performance on aider’s polyglot benchmark to help you choose a reliable alternative.</p>

<h2 class="no_toc" id="providers">Providers</h2>

<ul id="markdown-toc">
  <li><a href="#openrouter" id="markdown-toc-openrouter">OpenRouter</a></li>
  <li><a href="#fireworks" id="markdown-toc-fireworks">Fireworks</a></li>
  <li><a href="#hyperbolic" id="markdown-toc-hyperbolic">Hyperbolic</a></li>
  <li><a href="#ollama" id="markdown-toc-ollama">Ollama</a></li>
  <li><a href="#other-providers" id="markdown-toc-other-providers">Other providers</a></li>
  <li><a href="#results" id="markdown-toc-results">Results</a></li>
</ul>

<h2 id="openrouter">OpenRouter</h2>

<p><a href="https://openrouter.ai/deepseek/deepseek-chat/providers">OpenRouter offers many DeepSeek providers</a>
through their unified API.
You can use aider with OpenRouter like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Set your API key using environment variables</span>
<span class="nb">export </span><span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>&lt;your-key&gt;
aider <span class="nt">--model</span> openrouter/deepseek/deepseek-chat

<span class="c"># Or use the --api-key command line option</span>
aider <span class="nt">--model</span> openrouter/deepseek/deepseek-chat <span class="nt">--api-key</span> <span class="nv">openrouter</span><span class="o">=</span>&lt;your-key&gt;

<span class="c"># Or add it to .aider.conf.yml in your home directory or project root:</span>
api-key:
  - <span class="nv">openrouter</span><span class="o">=</span>&lt;your-key&gt;
</code></pre></div></div>

<p>OpenRouter automatically monitors their providers and routes requests to stable
APIs and away from those experiencing unreliable performance.</p>

<p>But not all providers serve the same version of open source models, and not
all have the same privacy guarantees.
You can control which OpenRouter providers are used to serve the model via
<a href="https://aider.chat/docs/config/adv-model-settings.html#model-settings">aider’s model settings</a>.
Create a <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file in your home directory or git project root with settings like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">openrouter/deepseek/deepseek-chat</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">extra_body</span><span class="pi">:</span>
      <span class="na">provider</span><span class="pi">:</span>
        <span class="c1"># Only use these providers, in this order</span>
        <span class="na">order</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">Novita"</span><span class="pi">]</span>
        <span class="c1"># Don't fall back to other providers</span>
        <span class="na">allow_fallbacks</span><span class="pi">:</span> <span class="no">false</span>
</code></pre></div></div>

<p>See <a href="https://openrouter.ai/docs/provider-routing">OpenRouter’s provider routing docs</a> for more details.</p>

<h2 id="fireworks">Fireworks</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Set your API key using environment variables</span>
<span class="nb">export </span><span class="nv">FIREWORKS_API_KEY</span><span class="o">=</span>&lt;your-key&gt;
aider <span class="nt">--model</span> fireworks_ai/accounts/fireworks/models/deepseek-chat

<span class="c"># Or use the --api-key command line option</span>
aider <span class="nt">--model</span> fireworks_ai/accounts/fireworks/models/deepseek-chat <span class="nt">--api-key</span> <span class="nv">fireworks</span><span class="o">=</span>&lt;your-key&gt;

<span class="c"># Or add it to .aider.conf.yml in your home directory or project root:</span>
api-key:
  - <span class="nv">fireworks</span><span class="o">=</span>&lt;your-key&gt;
</code></pre></div></div>

<p>Create a <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file in your home directory or git project root with settings like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">fireworks_ai/accounts/fireworks/models/deepseek-chat</span>
  <span class="na">edit_format</span><span class="pi">:</span> <span class="s">diff</span>
  <span class="na">weak_model_name</span><span class="pi">:</span> <span class="no">null</span>
  <span class="na">use_repo_map</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">send_undo_reply</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">lazy</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">reminder</span><span class="pi">:</span> <span class="s">sys</span>
  <span class="na">examples_as_sys_msg</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">max_tokens</span><span class="pi">:</span> <span class="m">8192</span>
  <span class="na">cache_control</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">caches_by_default</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_system_prompt</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">streaming</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<h2 id="hyperbolic">Hyperbolic</h2>

<p>You can use <a href="https://hyperbolic.xyz">Hyperbolic’s API</a> as an OpenAI-compatible provider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Set your API key using environment variables</span>
<span class="nb">export </span><span class="nv">OPENAI_API_BASE</span><span class="o">=</span>https://api.hyperbolic.xyz/v1/
<span class="nb">export </span><span class="nv">OPENAI_API_KEY</span><span class="o">=</span>&lt;your-key&gt;
aider <span class="nt">--model</span> openai/deepseek-ai/DeepSeek-V3

<span class="c"># Or use the --api-key command line option</span>
aider <span class="nt">--model</span> openai/deepseek-ai/DeepSeek-V3 <span class="nt">--api-key</span> <span class="nv">openai</span><span class="o">=</span>&lt;your-key&gt;

<span class="c"># Or add it to .aider.conf.yml in your home directory or project root:</span>
api-key:
  - <span class="nv">openai</span><span class="o">=</span>&lt;your-key&gt;
</code></pre></div></div>

<p>Create a <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file in your home directory or git project root with settings like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">openai/deepseek-ai/DeepSeek-V3</span>
  <span class="na">edit_format</span><span class="pi">:</span> <span class="s">diff</span>
  <span class="na">weak_model_name</span><span class="pi">:</span> <span class="no">null</span>
  <span class="na">use_repo_map</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">send_undo_reply</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">lazy</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">reminder</span><span class="pi">:</span> <span class="s">sys</span>
  <span class="na">examples_as_sys_msg</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">cache_control</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">caches_by_default</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_system_prompt</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">streaming</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">editor_model_name</span><span class="pi">:</span> <span class="no">null</span>
  <span class="na">editor_edit_format</span><span class="pi">:</span> <span class="no">null</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">max_tokens</span><span class="pi">:</span> <span class="m">65536</span>
</code></pre></div></div>

<h2 id="ollama">Ollama</h2>

<p>You can run <a href="https://ollama.com/library/deepseek-v3">DeepSeek V3 via Ollama</a>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Pull the model</span>
ollama pull deepseek-v3

<span class="c"># Start your ollama server</span>
ollama serve

<span class="c"># In another terminal window...</span>
<span class="nb">export </span><span class="nv">OLLAMA_API_BASE</span><span class="o">=</span>http://127.0.0.1:11434 <span class="c"># Mac/Linux</span>
setx   OLLAMA_API_BASE http://127.0.0.1:11434 <span class="c"># Windows, restart shell after setx</span>

aider <span class="nt">--model</span> ollama/deepseek-v3
</code></pre></div></div>

<p>It’s important to provide model settings, especially the <code class="language-plaintext highlighter-rouge">num_ctx</code> parameter to
set the context window.
Ollama uses a 2k context window by default, which is very small for working with aider.
Larger context windows will allow you to work with larger amounts of code,
but will use memory and increase latency.</p>

<p>Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the “oldest” messages in the chat to make it fit within the context window.</p>

<p>So if your context window is too small, you won’t get an explicit error. The biggest symptom will be that aider says it can’t see (some of) the files you added to the chat. That’s because ollama is silently discarding them because they exceed the context window.</p>

<p>Create a <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file in your home directory or git project root with settings like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">ollama/deepseek-v3</span>
  <span class="na">edit_format</span><span class="pi">:</span> <span class="s">diff</span>
  <span class="na">weak_model_name</span><span class="pi">:</span> <span class="no">null</span>
  <span class="na">use_repo_map</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">send_undo_reply</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">lazy</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">reminder</span><span class="pi">:</span> <span class="s">sys</span>
  <span class="na">examples_as_sys_msg</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">cache_control</span><span class="pi">:</span> <span class="no">false</span>
  <span class="na">caches_by_default</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_system_prompt</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">use_temperature</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">streaming</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">extra_params</span><span class="pi">:</span>
    <span class="na">num_ctx</span><span class="pi">:</span> <span class="m">8192</span> <span class="c1"># How large a context window?</span>
</code></pre></div></div>

<h2 id="other-providers">Other providers</h2>

<p>You will need to properly configure aider to work with DeepSeek V3 when served
via other providers:</p>

<ul>
  <li>Determine the <code class="language-plaintext highlighter-rouge">--model</code> name to use.</li>
  <li>Provide your API key to aider.</li>
  <li>Add model settings to <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code>.</li>
</ul>

<p>Adapt the <code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> shown above for Fireworks. You will need to change the <code class="language-plaintext highlighter-rouge">name</code> field to match you chosen provider’s model naming scheme.</p>

<p>See <a href="https://aider.chat/docs/config/adv-model-settings.html#model-settings">Advanced model settings</a> for details about all aider model settings</p>

<h2 id="results">Results</h2>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Hyperbolic</td>
        <td style="padding: 8px; text-align: center;">48.4%</td>
        <td style="padding: 8px; text-align: center;">97.3%</td>
        <td style="padding: 8px;"><code>OPENAI_API_BASE=https://api.hyperbolic.xyz/v1/ aider --model openai/deepseek-ai/DeepSeek-V3</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Fireworks</td>
        <td style="padding: 8px; text-align: center;">48.4%</td>
        <td style="padding: 8px; text-align: center;">96.9%</td>
        <td style="padding: 8px;"><code>aider --model fireworks_ai/accounts/fireworks/models/deepseek-v3</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">DeepSeek</td>
        <td style="padding: 8px; text-align: center;">48.4%</td>
        <td style="padding: 8px; text-align: center;">98.7%</td>
        <td style="padding: 8px;"><code>aider --model deepseek/deepseek-chat</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">OpenRouter: DeepInfra</td>
        <td style="padding: 8px; text-align: center;">48.0%</td>
        <td style="padding: 8px; text-align: center;">99.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/deepseek/deepseek-chat</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">OpenRouter: Novita</td>
        <td style="padding: 8px; text-align: center;">42.7%</td>
        <td style="padding: 8px; text-align: center;">84.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/deepseek/deepseek-chat</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>



document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('editChart').getContext('2d');
  const blueDiagonalPattern = pattern.draw('diagonal', 'rgba(54, 162, 235, 0.2)');
  const redDiagonalPattern = pattern.draw('diagonal', 'rgba(255, 99, 132, 0.2)');
  let displayedData = [];

  // Get highlight model from query string or Jekyll variable
  const urlParams = new URLSearchParams(window.location.search);
  const queryHighlight = urlParams.get('highlight');
  const HIGHLIGHT_MODEL = queryHighlight || 'DeepSeek';

  var leaderboardData = {
    labels: [],
    datasets: [{
      label: 'Percent completed correctly',
      data: [],
      backgroundColor: function(context) {
        const row = allData[context.dataIndex];
        if (row && row.edit_format === 'whole') {
          return redDiagonalPattern; // Use red pattern for highlighted whole format
        }
        const label = leaderboardData.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 0.2)' : 'rgba(54, 162, 235, 0.2)';
      },
      borderColor: function(context) {
        const label = context.chart.data.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 1)' : 'rgba(54, 162, 235, 1)';
      },
      borderWidth: 1
    }, {
      label: 'Total Cost ($)',
      data: [],
      type: 'scatter',
      yAxisID: 'y1',
      backgroundColor: 'rgba(153, 102, 255, 1)',
      borderColor: '#fff',
      borderWidth: 1,
      pointRadius: 5,
      pointHoverRadius: 7
    }]
  };

  var allData = [];
  
    allData.push({
      model: 'Hyperbolic',
      pass_rate: 48.4,
      percent_cases_well_formed: 97.3,
      edit_format: 'diff',
      total_cost: 0.0
    });
  
    allData.push({
      model: 'Fireworks',
      pass_rate: 48.4,
      percent_cases_well_formed: 96.9,
      edit_format: 'diff',
      total_cost: 2.1177
    });
  
    allData.push({
      model: 'DeepSeek',
      pass_rate: 48.4,
      percent_cases_well_formed: 98.7,
      edit_format: 'diff',
      total_cost: 0.3369
    });
  
    allData.push({
      model: 'OpenRouter: DeepInfra',
      pass_rate: 48.0,
      percent_cases_well_formed: 99.5,
      edit_format: 'diff',
      total_cost: 0.2733
    });
  
    allData.push({
      model: 'OpenRouter: Novita',
      pass_rate: 42.7,
      percent_cases_well_formed: 84.0,
      edit_format: 'diff',
      total_cost: 0.0
    });
  

  function updateChart() {
    var selectedRows = document.querySelectorAll('tr.selected');
    var showAll = selectedRows.length === 0;

    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];

    allData.forEach(function(row, index) {
      var rowElement = document.getElementById('edit-row-' + index);
      if (showAll) {
        rowElement.classList.remove('selected');
      }
      if (showAll || rowElement.classList.contains('selected')) {
        displayedData.push(row);
        leaderboardData.labels.push(row.model);
        leaderboardData.datasets[0].data.push(row.pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(row.total_cost > 0 ? row.total_cost : null);
      }
    });

    leaderboardChart.update();
    leaderboardChart.render();
  }

  // Update backgroundColor and borderColor for the main dataset based on displayedData
  leaderboardData.datasets[0].backgroundColor = function(context) {
    const row = displayedData[context.dataIndex];
    const label = leaderboardData.labels[context.dataIndex] || '';
    const isHighlighted = label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase());

    if (isHighlighted) {
      if (row && row.edit_format === 'whole') return redDiagonalPattern;
      else return 'rgba(255, 99, 132, 0.2)';
    } else if (row && row.edit_format === 'whole') {
      return blueDiagonalPattern;
    } else {
      return 'rgba(54, 162, 235, 0.2)';
    }
  };

  var tableBody = document.querySelector('table tbody');
  allData.forEach(function(row, index) {
    var tr = tableBody.children[index];
    if (!tr) {
      // If the row doesn't exist, create it
      tr = document.createElement('tr');
      tableBody.appendChild(tr);
    }
    tr.id = 'edit-row-' + index;
    tr.style.cursor = 'pointer';
    tr.onclick = function() {
      this.classList.toggle('selected');
      updateChart();
    };
  });

  var leaderboardChart = new Chart(ctx, {
    type: 'bar',
    data: leaderboardData,
    options: {
      plugins: {
        legend: {
          display: true,
          labels: {
            generateLabels: function(chart) {
              return [
                {
                  text: 'Diff-like format',
                  fillStyle: 'rgba(54, 162, 235, 0.2)',
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Whole format',
                  fillStyle: blueDiagonalPattern,
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Total Cost ($)',
                  fillStyle: 'rgba(153, 102, 255, 1)',
                  strokeStyle: '#fff',
                  lineWidth: 1,
                  pointStyle: 'circle'
                }
              ];
            }
          }
        },
        tooltip: {
          callbacks: {
            label: function(context) {
              const datasetLabel = context.dataset.label || '';
              const value = context.parsed.y;
              if (datasetLabel === 'Total Cost ($)') {
                return datasetLabel + ': $' + value.toFixed(2);
              }
              return datasetLabel + ': ' + value.toFixed(1) + '%';
            }
          }
        }
      },
      scales: {
        y: {
          beginAtZero: true,
          title: {
            display: true,
            text: 'Percent completed correctly'
          }
        },
        y1: {
          beginAtZero: true,
          position: 'right',
          grid: {
            drawOnChartArea: false
          },
          title: {
            display: true,
            text: 'Total Cost ($)'
          }
        },
        x: {
          ticks: {
            autoSkip: false, // Prevent labels from being automatically skipped
            maxRotation: 90, // Allow labels to rotate up to 90 degrees
            minRotation: 0, 
            callback: function(value, index) {
              const label = this.getLabelForValue(value);
              if (label.length <= "claude-3-5-sonnet".length) {
                return label;
              }
              
              // Find all possible split positions
              const splitPositions = [];
              for (let i = 0; i < label.length; i++) {
                if (label[i] === '-' || label[i] === ' ') {
                  splitPositions.push(i);
                }
              }
              
              if (splitPositions.length === 0) {
                return label;
              }
              
              // Find split position closest to middle
              const middle = label.length / 2;
              const splitIndex = splitPositions.reduce((closest, current) => {
                return Math.abs(current - middle) < Math.abs(closest - middle) ? current : closest;
              });
              
              return [
                label.slice(0, splitIndex),
                label.slice(splitIndex + 1)
              ];
            }
          }
        }
      }
    }
  });

  updateChart();
  
  // Add search functionality for edit table
  document.getElementById('editSearchInput').addEventListener('keyup', function() {
    var searchWords = this.value.toLowerCase().split(' ').filter(word => word.length > 0);
    var tableBody = document.querySelector('table:first-of-type tbody');
    var rows = tableBody.getElementsByTagName('tr');
    
    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];
    
    for (var i = 0; i < rows.length; i++) {
      var rowText = rows[i].textContent;
      if (searchWords.every(word => rowText.toLowerCase().includes(word))) {
        rows[i].style.display = '';
        displayedData.push(allData[i]);
        leaderboardData.labels.push(allData[i].model);
        leaderboardData.datasets[0].data.push(allData[i].pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(allData[i].total_cost > 0 ? allData[i].total_cost : null);
      } else {
        rows[i].style.display = 'none';
      }
    }
    leaderboardChart.update();
  });
});

</script>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[DeepSeek's API has been experiencing reliability issues. Here are alternative providers you can use.]]></summary></entry><entry><title type="html">R1+Sonnet set SOTA on aider’s polyglot benchmark</title><link href="https://aider.chat/2025/01/24/r1-sonnet.html" rel="alternate" type="text/html" title="R1+Sonnet set SOTA on aider’s polyglot benchmark" /><published>2025-01-24T00:00:00+00:00</published><updated>2025-01-24T00:00:00+00:00</updated><id>https://aider.chat/2025/01/24/r1-sonnet</id><content type="html" xml:base="https://aider.chat/2025/01/24/r1-sonnet.html"><![CDATA[<p class="post-date">January 24, 2025</p>

<h1 class="no_toc" id="r1sonnet-set-sota-on-aiders-polyglot-benchmark">R1+Sonnet set SOTA on aider’s polyglot benchmark</h1>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

<p>Aider supports <a href="https://aider.chat/2024/09/26/architect.html">using a pair of models for coding</a>:</p>

<ul>
  <li>An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.</li>
  <li>An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.</li>
</ul>

<p><strong>R1 as architect with Sonnet as editor has set a new SOTA of 64.0%</strong> on the 
<a href="/2024/12/21/polyglot.html">aider polyglot benchmark</a>.
They achieve this at <strong>14X less cost</strong> compared to the previous o1 SOTA result.</p>

<p>o1 paired with Sonnet didn’t produce better results than just using o1 alone.
Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores.
This is in contrast to the first wave of thinking models like o1-preview and o1-mini,
which improved when paired with many different editor models.</p>

<p>o1 was set with reasoning effort high for these tests.</p>

<h2 id="try-it">Try it</h2>

<p>Once you <a href="https://aider.chat/docs/install.html">install aider</a>,
you can use aider, R1 and Sonnet like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">DEEPSEEK_API_KEY</span><span class="o">=</span>&lt;your-key&gt;
<span class="nb">export </span><span class="nv">ANTHROPIC_API_KEY</span><span class="o">=</span>&lt;your-key&gt;

aider <span class="nt">--architect</span> <span class="nt">--model</span> r1 <span class="nt">--editor-model</span> sonnet
</code></pre></div></div>

<p>Or if you have an <a href="https://openrouter.ai">OpenRouter</a> account:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>&lt;your-key&gt;

aider <span class="nt">--architect</span> <span class="nt">--model</span> openrouter/deepseek/deepseek-r1 <span class="nt">--editor-model</span> openrouter/anthropic/claude-3.5-sonnet
</code></pre></div></div>

<h2 id="thinking-output">Thinking output</h2>

<p>There has been 
<a href="https://github.com/Aider-AI/aider/pull/2973">some recent discussion</a>
about extracting the <code class="language-plaintext highlighter-rouge">&lt;think&gt;</code> tokens from R1’s responses
and feeding them to Sonnet.
That was an interesting experiment, for sure.</p>

<p>To be clear, the results above are <em>not</em> using R1’s thinking tokens, just the normal
final output. 
R1 is configured in aider’s standard architect role with Sonnet as editor.
The benchmark results that used the thinking tokens appear to be worse than
the architect/editor results shared here.</p>

<h2 id="results">Results</h2>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
      <th style="padding: 8px; text-align: center;">Total Cost</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">R1+Sonnet</td>
        <td style="padding: 8px; text-align: center;">64.0%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --architect --model r1 --editor-model sonnet</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
        <td style="padding: 8px; text-align: center;">$13.29</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1</td>
        <td style="padding: 8px; text-align: center;">61.7%</td>
        <td style="padding: 8px; text-align: center;">91.5%</td>
        <td style="padding: 8px;"><code>aider --model o1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
        <td style="padding: 8px; text-align: center;">$186.5</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">R1</td>
        <td style="padding: 8px; text-align: center;">56.9%</td>
        <td style="padding: 8px; text-align: center;">96.9%</td>
        <td style="padding: 8px;"><code>aider --model r1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
        <td style="padding: 8px; text-align: center;">$5.42</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Sonnet</td>
        <td style="padding: 8px; text-align: center;">51.6%</td>
        <td style="padding: 8px; text-align: center;">99.6%</td>
        <td style="padding: 8px;"><code>aider --model sonnet</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
        <td style="padding: 8px; text-align: center;">$14.41</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">DeepSeek V3</td>
        <td style="padding: 8px; text-align: center;">48.4%</td>
        <td style="padding: 8px; text-align: center;">98.7%</td>
        <td style="padding: 8px;"><code>aider --model deepseek</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
        <td style="padding: 8px; text-align: center;">$0.34</td>
      </tr>
    
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>




document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('editChart').getContext('2d');
  const blueDiagonalPattern = pattern.draw('diagonal', 'rgba(54, 162, 235, 0.2)');
  const redDiagonalPattern = pattern.draw('diagonal', 'rgba(255, 99, 132, 0.2)');
  let displayedData = [];

  // Get highlight model from query string or Jekyll variable
  const urlParams = new URLSearchParams(window.location.search);
  const queryHighlight = urlParams.get('highlight');
  const HIGHLIGHT_MODEL = queryHighlight || '+';

  var leaderboardData = {
    labels: [],
    datasets: [{
      label: 'Percent completed correctly',
      data: [],
      backgroundColor: function(context) {
        const row = allData[context.dataIndex];
        if (row && row.edit_format === 'whole') {
          return redDiagonalPattern; // Use red pattern for highlighted whole format
        }
        const label = leaderboardData.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 0.2)' : 'rgba(54, 162, 235, 0.2)';
      },
      borderColor: function(context) {
        const label = context.chart.data.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 1)' : 'rgba(54, 162, 235, 1)';
      },
      borderWidth: 1
    }, {
      label: 'Total Cost ($)',
      data: [],
      type: 'scatter',
      yAxisID: 'y1',
      backgroundColor: 'rgba(153, 102, 255, 1)',
      borderColor: '#fff',
      borderWidth: 1,
      pointRadius: 5,
      pointHoverRadius: 7
    }]
  };

  var allData = [];
  
    allData.push({
      model: 'R1+Sonnet',
      pass_rate: 64.0,
      percent_cases_well_formed: 100.0,
      edit_format: 'architect',
      total_cost: 13.2933
    });
  
    allData.push({
      model: 'o1',
      pass_rate: 61.7,
      percent_cases_well_formed: 91.5,
      edit_format: 'diff',
      total_cost: 186.4958
    });
  
    allData.push({
      model: 'R1',
      pass_rate: 56.9,
      percent_cases_well_formed: 96.9,
      edit_format: 'diff',
      total_cost: 5.4193
    });
  
    allData.push({
      model: 'Sonnet',
      pass_rate: 51.6,
      percent_cases_well_formed: 99.6,
      edit_format: 'diff',
      total_cost: 14.4063
    });
  
    allData.push({
      model: 'DeepSeek V3',
      pass_rate: 48.4,
      percent_cases_well_formed: 98.7,
      edit_format: 'diff',
      total_cost: 0.3369
    });
  

  function updateChart() {
    var selectedRows = document.querySelectorAll('tr.selected');
    var showAll = selectedRows.length === 0;

    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];

    allData.forEach(function(row, index) {
      var rowElement = document.getElementById('edit-row-' + index);
      if (showAll) {
        rowElement.classList.remove('selected');
      }
      if (showAll || rowElement.classList.contains('selected')) {
        displayedData.push(row);
        leaderboardData.labels.push(row.model);
        leaderboardData.datasets[0].data.push(row.pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(row.total_cost > 0 ? row.total_cost : null);
      }
    });

    leaderboardChart.update();
    leaderboardChart.render();
  }

  // Update backgroundColor and borderColor for the main dataset based on displayedData
  leaderboardData.datasets[0].backgroundColor = function(context) {
    const row = displayedData[context.dataIndex];
    const label = leaderboardData.labels[context.dataIndex] || '';
    const isHighlighted = label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase());

    if (isHighlighted) {
      if (row && row.edit_format === 'whole') return redDiagonalPattern;
      else return 'rgba(255, 99, 132, 0.2)';
    } else if (row && row.edit_format === 'whole') {
      return blueDiagonalPattern;
    } else {
      return 'rgba(54, 162, 235, 0.2)';
    }
  };

  var tableBody = document.querySelector('table tbody');
  allData.forEach(function(row, index) {
    var tr = tableBody.children[index];
    if (!tr) {
      // If the row doesn't exist, create it
      tr = document.createElement('tr');
      tableBody.appendChild(tr);
    }
    tr.id = 'edit-row-' + index;
    tr.style.cursor = 'pointer';
    tr.onclick = function() {
      this.classList.toggle('selected');
      updateChart();
    };
  });

  var leaderboardChart = new Chart(ctx, {
    type: 'bar',
    data: leaderboardData,
    options: {
      plugins: {
        legend: {
          display: false,
          labels: {
            generateLabels: function(chart) {
              return [
                {
                  text: 'Diff-like format',
                  fillStyle: 'rgba(54, 162, 235, 0.2)',
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Whole format',
                  fillStyle: blueDiagonalPattern,
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Total Cost ($)',
                  fillStyle: 'rgba(153, 102, 255, 1)',
                  strokeStyle: '#fff',
                  lineWidth: 1,
                  pointStyle: 'circle'
                }
              ];
            }
          }
        },
        tooltip: {
          callbacks: {
            label: function(context) {
              const datasetLabel = context.dataset.label || '';
              const value = context.parsed.y;
              if (datasetLabel === 'Total Cost ($)') {
                return datasetLabel + ': $' + value.toFixed(2);
              }
              return datasetLabel + ': ' + value.toFixed(1) + '%';
            }
          }
        }
      },
      scales: {
        y: {
          beginAtZero: true,
          title: {
            display: true,
            text: 'Percent completed correctly'
          }
        },
        y1: {
          beginAtZero: true,
          position: 'right',
          grid: {
            drawOnChartArea: false
          },
          title: {
            display: true,
            text: 'Total Cost ($)'
          }
        },
        x: {
          ticks: {
            autoSkip: false, // Prevent labels from being automatically skipped
            maxRotation: 90, // Allow labels to rotate up to 90 degrees
            minRotation: 0, 
            callback: function(value, index) {
              const label = this.getLabelForValue(value);
              if (label.length <= "claude-3-5-sonnet".length) {
                return label;
              }
              
              // Find all possible split positions
              const splitPositions = [];
              for (let i = 0; i < label.length; i++) {
                if (label[i] === '-' || label[i] === ' ') {
                  splitPositions.push(i);
                }
              }
              
              if (splitPositions.length === 0) {
                return label;
              }
              
              // Find split position closest to middle
              const middle = label.length / 2;
              const splitIndex = splitPositions.reduce((closest, current) => {
                return Math.abs(current - middle) < Math.abs(closest - middle) ? current : closest;
              });
              
              return [
                label.slice(0, splitIndex),
                label.slice(splitIndex + 1)
              ];
            }
          }
        }
      }
    }
  });

  updateChart();
  
  // Add search functionality for edit table
  document.getElementById('editSearchInput').addEventListener('keyup', function() {
    var searchWords = this.value.toLowerCase().split(' ').filter(word => word.length > 0);
    var tableBody = document.querySelector('table:first-of-type tbody');
    var rows = tableBody.getElementsByTagName('tr');
    
    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];
    
    for (var i = 0; i < rows.length; i++) {
      var rowText = rows[i].textContent;
      if (searchWords.every(word => rowText.toLowerCase().includes(word))) {
        rows[i].style.display = '';
        displayedData.push(allData[i]);
        leaderboardData.labels.push(allData[i].model);
        leaderboardData.datasets[0].data.push(allData[i].pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(allData[i].total_cost > 0 ? allData[i].total_cost : null);
      } else {
        rows[i].style.display = 'none';
      }
    }
    leaderboardChart.update();
  });
});

</script>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[R1+Sonnet has set a new SOTA on the aider polyglot benchmark. At 14X less cost compared to o1.]]></summary></entry><entry><title type="html">Using uv as an installer</title><link href="https://aider.chat/2025/01/15/uv.html" rel="alternate" type="text/html" title="Using uv as an installer" /><published>2025-01-15T00:00:00+00:00</published><updated>2025-01-15T00:00:00+00:00</updated><id>https://aider.chat/2025/01/15/uv</id><content type="html" xml:base="https://aider.chat/2025/01/15/uv.html"><![CDATA[<p class="post-date">January 15, 2025</p>

<h1 class="no_toc" id="using-uv-as-an-installer">Using uv as an installer</h1>

<p>It’s hard to reliably
package and distribute python command line tools
to end users.
Users frequently encounter challenges:
dependency version conflicts, virtual environment management,
needing to install python or a specific version of python, etc.</p>

<p>Aider employs <a href="https://github.com/astral-sh/uv">uv</a> 
in a couple of novel ways to streamline the installation process:</p>

<ol>
  <li>
    <p>Install aider with
<code class="language-plaintext highlighter-rouge">curl https://aider.chat/install.sh | sh</code> even if python isn’t already installed.</p>
  </li>
  <li>
    <p>Users who have python 3.8+ installed can <code class="language-plaintext highlighter-rouge">pip install aider-install &amp;&amp; aider-install</code>.</p>
  </li>
</ol>

<p>Both methods use uv to <strong>globally</strong> install the <code class="language-plaintext highlighter-rouge">aider</code> command line program,
with all of its dependencies in an <strong>isolated environment</strong>.
They ensure that aider will run with <strong>python 3.12</strong>, and install that version
if it is not already available.</p>

<p>These uv install methods are especially helpful for aider, because it 
has a large set of very specific dependencies.
Since not all of aider’s dependencies are available on all python versions,
it requires python 3.9-3.12.</p>

<p>Most users don’t want to worry about these details –
they just want a quick way to install and run aider.</p>

<h2 id="one-liners">One-liners</h2>

<p>Users can install aider with a shell one-liner, without even having python previously installed:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LsSf</span> https://aider.chat/install.sh | sh
</code></pre></div></div>

<p>This installs uv, then uses it to install python 3.12, 
install the <code class="language-plaintext highlighter-rouge">aider</code> command line tool
and update the user’s shell path.
Under the hood, it is simply a copy of 
uv’s own install script <code class="language-plaintext highlighter-rouge">https://astral.sh/uv/install.sh</code>
with <a href="https://github.com/Aider-AI/aider/blob/4251e976b3aa52c2a3af08da4b203d4d524c8e92/aider/website/install.sh#L1181">one line added</a>, to install aider as a tool:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ensure "${_install_dir}/uv" tool install --force --python python3.12 aider-chat@latest
</code></pre></div></div>

<h2 id="aider-install">aider-install</h2>

<p>The aider-install python package allows quick global installation of aider
for users who already have python 3.8+ installed.
It simply provides the <code class="language-plaintext highlighter-rouge">aider-install</code> command line program,
which users just need to run once.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>aider-install
aider-install
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pip install aider-install</code> installs only two packages: 
aider-install and the <a href="https://pypi.org/project/uv/">uv python package</a>.
This ensures that uv is available
in the user’s environment.
Everything else is installed in a stand-alone environment created by uv.</p>

<p>When the user runs <code class="language-plaintext highlighter-rouge">aider-install</code>, it runs uv
to install aider as a tool and update the user’s shell path if needed:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv tool <span class="nb">install</span> <span class="nt">--force</span> <span class="nt">--python</span> python3.12 aider-chat
uv tool update-shell
</code></pre></div></div>

<h2 id="benefits">Benefits</h2>

<p>These uv install methods have been popular with users,
providing a hassle free way to install aider and quickly get started.
Installs are also extremely fast, much faster than pip or pipx installs
even when uv is also installing python 3.12!</p>

<p>There are also a number of benefits from the perspective of the tool developer/publisher.
Since providing these install methods, far fewer users report dependency problems and 
version conflicts as compared to users who <code class="language-plaintext highlighter-rouge">pip install aider-chat</code>.
There is also less pressure to rapidly support the newest python versions, 
since aider always installs with python 3.12.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Reliably packaging & distributing python CLI tools is hard. Aider uses uv in novel ways to make it easy to install the aider CLI, its dependencies and python 3.12. All in an isolated env.]]></summary></entry><entry><title type="html">o1 tops aider’s new polyglot leaderboard</title><link href="https://aider.chat/2024/12/21/polyglot.html" rel="alternate" type="text/html" title="o1 tops aider’s new polyglot leaderboard" /><published>2024-12-21T00:00:00+00:00</published><updated>2024-12-21T00:00:00+00:00</updated><id>https://aider.chat/2024/12/21/polyglot</id><content type="html" xml:base="https://aider.chat/2024/12/21/polyglot.html"><![CDATA[<p class="post-date">December 21, 2024</p>

<h1 class="no_toc" id="o1-tops-aiders-new-polyglot-leaderboard">o1 tops aider’s new polyglot leaderboard</h1>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

<p>OpenAI’s new o1 model with “high” reasoning effort
gets the top score on the
new 
<a href="/docs/leaderboards/">aider polyglot leaderboard</a>, significantly ahead of
other top LLMs.
The new polyglot benchmark uses many popular coding languages
and was designed to be 
<em>much more challenging</em> than aider’s original
<a href="/docs/leaderboards/edit.html">code editing benchmark</a>.
This more clearly distinguishes 
the performance of
today’s strongest coding models and
leaves headroom for future LLMs.</p>

<p class="note">See the main 
<a href="https://aider.chat/docs/leaderboards/">aider leaderboard</a>
for benchmark results from more models.
This article only contains a snapshot
of results at the time of publication.</p>

<h2 id="the-polyglot-benchmark">The polyglot benchmark</h2>

<p>Like aider’s original code editing benchmark,
the new polyglot benchmark is based on Exercism
coding exercises.</p>

<p>The new polyglot benchmark:</p>

<ul>
  <li>Contains coding problems in C++, Go, Java, JavaScript, Python and Rust. 
The old benchmark was solely based on Python exercises.</li>
  <li>Focuses on the <em>most difficult</em> 225 exercises out of the 697 that
Exercism provides for those languages.
The old benchmark simply included all 133 Python exercises,
regardless of difficulty.</li>
</ul>

<h2 id="motivation-and-goals">Motivation and goals</h2>

<p>Aider’s original code editing benchmark was 
saturating as the top scores approached and then surpassed 80%.
Sonnet’s score of 84.2% was based on solving 112 of the 133
exercises, leaving only 21 unsolved exercises.
New champions were advancing the top score by
solving just 1-2 more problems than the previous record.
This made it hard to clearly 
measure the
difference in code editing skill between these top models.</p>

<p>Part of the problem is that many of the original
133 Python problems are very easy 
and provide
little challenge to today’s frontier LLMs.
Models as old as GPT 3.5 Turbo were able to solve half of the
133 problems.
Such easy problems simply inflate the benchmark scores 
of modern LLMs without
providing any data about which models are better or worse.</p>

<p>The main goal for a new benchmark 
was to re-calibrate the scale so that
today’s top coding LLMs 
would occupy a wide range of scores between about 5% and 50%.
This should leave headroom for future LLMs and
make it possible to
more clearly compare the relative performance of top models.</p>

<h2 id="designing-the-polyglot-benchmark">Designing the polyglot benchmark</h2>

<p>The new benchmark:</p>

<ul>
  <li>Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.</li>
  <li>Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today’s top coding LLMs.</li>
  <li>Includes more total coding problems, to enable more granularity of comparison.</li>
</ul>

<p>The new benchmark is based on Exercism coding problems
from 6 of the most popular programming languages:</p>

<ul>
  <li>C++</li>
  <li>Go</li>
  <li>Java</li>
  <li>JavaScript</li>
  <li>Python</li>
  <li>Rust</li>
</ul>

<p>Exercism provides a total of 697 coding problems in those 6 languages.
A set of 7 of today’s top coding models each attempted all 697 of
the Exercism problems:</p>

<ul>
  <li>Sonnet</li>
  <li>Haiku</li>
  <li>o1 Mini</li>
  <li>DeepSeek</li>
  <li>GPT-4o</li>
  <li>Qwen 32B Coder Instruct</li>
  <li>GPT-4o Mini</li>
</ul>

<p>Depending on the difficulty of the problems,
a different number of solutions were found by the collection of
7 models:</p>

<table>
  <thead>
    <tr>
      <th>Solutions<br />found</th>
      <th>Number of<br />problems</th>
      <th>Cumulative number<br />of problems</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>66</td>
      <td>66</td>
    </tr>
    <tr>
      <td>1</td>
      <td>61</td>
      <td>127</td>
    </tr>
    <tr>
      <td>2</td>
      <td>50</td>
      <td>177</td>
    </tr>
    <tr>
      <td>3</td>
      <td>48</td>
      <td>225</td>
    </tr>
    <tr>
      <td>4</td>
      <td>53</td>
      <td>278</td>
    </tr>
    <tr>
      <td>5</td>
      <td>71</td>
      <td>349</td>
    </tr>
    <tr>
      <td>6</td>
      <td>90</td>
      <td>439</td>
    </tr>
    <tr>
      <td>7</td>
      <td>258</td>
      <td>697</td>
    </tr>
  </tbody>
</table>

<p>In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These problems are far too easy, and wouldn’t be good choices for the new benchmark.
Instead, we need hard problems like the
66 that none of the 7 models were able to solve.</p>

<p>The new benchmark uses 
the 225 problems that were solved by 3 or fewer models.
This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Problems</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C++</td>
      <td>26</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>39</td>
    </tr>
    <tr>
      <td>Java</td>
      <td>47</td>
    </tr>
    <tr>
      <td>JavaScript</td>
      <td>49</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>34</td>
    </tr>
    <tr>
      <td>Rust</td>
      <td>30</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>225</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="o1">o1</h2>

<p>OpenAI’s new o1 model established a very strong
top score of 62% on the new benchmark.
This still leaves 86 problems of headroom for future models
to solve.
Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.</p>

<h2 id="benchmark-problems">Benchmark problems</h2>

<p>The 225 coding problems are available in the
<a href="https://github.com/Aider-AI/polyglot-benchmark">aider polyglot benchmark repo</a>
on GitHub.</p>

<h2 id="results">Results</h2>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-2024-12-17 (high)</td>
        <td style="padding: 8px; text-align: center;">61.7%</td>
        <td style="padding: 8px; text-align: center;">91.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/openai/o1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3-5-sonnet-20241022</td>
        <td style="padding: 8px; text-align: center;">45.3%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model claude-3-5-sonnet-20241022</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gemini-exp-1206</td>
        <td style="padding: 8px; text-align: center;">38.2%</td>
        <td style="padding: 8px; text-align: center;">98.2%</td>
        <td style="padding: 8px;"><code>aider --model gemini/gemini-exp-1206</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini-2024-09-12</td>
        <td style="padding: 8px; text-align: center;">32.9%</td>
        <td style="padding: 8px; text-align: center;">96.9%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3-5-haiku-20241022</td>
        <td style="padding: 8px; text-align: center;">28.0%</td>
        <td style="padding: 8px; text-align: center;">91.1%</td>
        <td style="padding: 8px;"><code>aider --model claude-3-5-haiku-20241022</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gemini-2.0-flash-exp</td>
        <td style="padding: 8px; text-align: center;">22.2%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gemini/gemini-2.0-flash-exp</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">DeepSeek Chat V2.5</td>
        <td style="padding: 8px; text-align: center;">17.8%</td>
        <td style="padding: 8px; text-align: center;">92.9%</td>
        <td style="padding: 8px;"><code>aider --model deepseek/deepseek-chat</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-2024-11-20</td>
        <td style="padding: 8px; text-align: center;">15.1%</td>
        <td style="padding: 8px; text-align: center;">96.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-2024-11-20</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Qwen2.5-Coder-32B-Instruct</td>
        <td style="padding: 8px; text-align: center;">8.0%</td>
        <td style="padding: 8px; text-align: center;">71.6%</td>
        <td style="padding: 8px;"><code>aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct # via hyperbolic</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-mini-2024-07-18</td>
        <td style="padding: 8px; text-align: center;">3.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-mini-2024-07-18</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>



document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('editChart').getContext('2d');
  const blueDiagonalPattern = pattern.draw('diagonal', 'rgba(54, 162, 235, 0.2)');
  const redDiagonalPattern = pattern.draw('diagonal', 'rgba(255, 99, 132, 0.2)');
  let displayedData = [];

  // Get highlight model from query string or Jekyll variable
  const urlParams = new URLSearchParams(window.location.search);
  const queryHighlight = urlParams.get('highlight');
  const HIGHLIGHT_MODEL = queryHighlight || 'o1-2024';

  var leaderboardData = {
    labels: [],
    datasets: [{
      label: 'Percent completed correctly',
      data: [],
      backgroundColor: function(context) {
        const row = allData[context.dataIndex];
        if (row && row.edit_format === 'whole') {
          return redDiagonalPattern; // Use red pattern for highlighted whole format
        }
        const label = leaderboardData.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 0.2)' : 'rgba(54, 162, 235, 0.2)';
      },
      borderColor: function(context) {
        const label = context.chart.data.labels[context.dataIndex] || '';
        return (label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase())) ? 'rgba(255, 99, 132, 1)' : 'rgba(54, 162, 235, 1)';
      },
      borderWidth: 1
    }, {
      label: 'Total Cost ($)',
      data: [],
      type: 'scatter',
      yAxisID: 'y1',
      backgroundColor: 'rgba(153, 102, 255, 1)',
      borderColor: '#fff',
      borderWidth: 1,
      pointRadius: 5,
      pointHoverRadius: 7
    }]
  };

  var allData = [];
  
    allData.push({
      model: 'o1-2024-12-17 (high)',
      pass_rate: 61.7,
      percent_cases_well_formed: 91.5,
      edit_format: 'diff',
      total_cost: 0.0
    });
  
    allData.push({
      model: 'claude-3-5-sonnet-20241022',
      pass_rate: 45.3,
      percent_cases_well_formed: 100.0,
      edit_format: 'diff',
      total_cost: 13.4847
    });
  
    allData.push({
      model: 'gemini-exp-1206',
      pass_rate: 38.2,
      percent_cases_well_formed: 98.2,
      edit_format: 'whole',
      total_cost: 0.0
    });
  
    allData.push({
      model: 'o1-mini-2024-09-12',
      pass_rate: 32.9,
      percent_cases_well_formed: 96.9,
      edit_format: 'whole',
      total_cost: 18.577
    });
  
    allData.push({
      model: 'claude-3-5-haiku-20241022',
      pass_rate: 28.0,
      percent_cases_well_formed: 91.1,
      edit_format: 'diff',
      total_cost: 6.0583
    });
  
    allData.push({
      model: 'gemini-2.0-flash-exp',
      pass_rate: 22.2,
      percent_cases_well_formed: 100.0,
      edit_format: 'whole',
      total_cost: 0.0
    });
  
    allData.push({
      model: 'DeepSeek Chat V2.5',
      pass_rate: 17.8,
      percent_cases_well_formed: 92.9,
      edit_format: 'diff',
      total_cost: 0.5101
    });
  
    allData.push({
      model: 'gpt-4o-2024-11-20',
      pass_rate: 15.1,
      percent_cases_well_formed: 96.0,
      edit_format: 'diff',
      total_cost: 7.1835
    });
  
    allData.push({
      model: 'Qwen2.5-Coder-32B-Instruct',
      pass_rate: 8.0,
      percent_cases_well_formed: 71.6,
      edit_format: 'diff',
      total_cost: 0.0
    });
  
    allData.push({
      model: 'gpt-4o-mini-2024-07-18',
      pass_rate: 3.6,
      percent_cases_well_formed: 100.0,
      edit_format: 'whole',
      total_cost: 0.3236
    });
  

  function updateChart() {
    var selectedRows = document.querySelectorAll('tr.selected');
    var showAll = selectedRows.length === 0;

    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];

    allData.forEach(function(row, index) {
      var rowElement = document.getElementById('edit-row-' + index);
      if (showAll) {
        rowElement.classList.remove('selected');
      }
      if (showAll || rowElement.classList.contains('selected')) {
        displayedData.push(row);
        leaderboardData.labels.push(row.model);
        leaderboardData.datasets[0].data.push(row.pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(row.total_cost > 0 ? row.total_cost : null);
      }
    });

    leaderboardChart.update();
    leaderboardChart.render();
  }

  // Update backgroundColor and borderColor for the main dataset based on displayedData
  leaderboardData.datasets[0].backgroundColor = function(context) {
    const row = displayedData[context.dataIndex];
    const label = leaderboardData.labels[context.dataIndex] || '';
    const isHighlighted = label && HIGHLIGHT_MODEL && label.toLowerCase().includes(HIGHLIGHT_MODEL.toLowerCase());

    if (isHighlighted) {
      if (row && row.edit_format === 'whole') return redDiagonalPattern;
      else return 'rgba(255, 99, 132, 0.2)';
    } else if (row && row.edit_format === 'whole') {
      return blueDiagonalPattern;
    } else {
      return 'rgba(54, 162, 235, 0.2)';
    }
  };

  var tableBody = document.querySelector('table tbody');
  allData.forEach(function(row, index) {
    var tr = tableBody.children[index];
    if (!tr) {
      // If the row doesn't exist, create it
      tr = document.createElement('tr');
      tableBody.appendChild(tr);
    }
    tr.id = 'edit-row-' + index;
    tr.style.cursor = 'pointer';
    tr.onclick = function() {
      this.classList.toggle('selected');
      updateChart();
    };
  });

  var leaderboardChart = new Chart(ctx, {
    type: 'bar',
    data: leaderboardData,
    options: {
      plugins: {
        legend: {
          display: true,
          labels: {
            generateLabels: function(chart) {
              return [
                {
                  text: 'Diff-like format',
                  fillStyle: 'rgba(54, 162, 235, 0.2)',
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Whole format',
                  fillStyle: blueDiagonalPattern,
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Total Cost ($)',
                  fillStyle: 'rgba(153, 102, 255, 1)',
                  strokeStyle: '#fff',
                  lineWidth: 1,
                  pointStyle: 'circle'
                }
              ];
            }
          }
        },
        tooltip: {
          callbacks: {
            label: function(context) {
              const datasetLabel = context.dataset.label || '';
              const value = context.parsed.y;
              if (datasetLabel === 'Total Cost ($)') {
                return datasetLabel + ': $' + value.toFixed(2);
              }
              return datasetLabel + ': ' + value.toFixed(1) + '%';
            }
          }
        }
      },
      scales: {
        y: {
          beginAtZero: true,
          title: {
            display: true,
            text: 'Percent completed correctly'
          }
        },
        y1: {
          beginAtZero: true,
          position: 'right',
          grid: {
            drawOnChartArea: false
          },
          title: {
            display: true,
            text: 'Total Cost ($)'
          }
        },
        x: {
          ticks: {
            autoSkip: false, // Prevent labels from being automatically skipped
            maxRotation: 90, // Allow labels to rotate up to 90 degrees
            minRotation: 0, 
            callback: function(value, index) {
              const label = this.getLabelForValue(value);
              if (label.length <= "claude-3-5-sonnet".length) {
                return label;
              }
              
              // Find all possible split positions
              const splitPositions = [];
              for (let i = 0; i < label.length; i++) {
                if (label[i] === '-' || label[i] === ' ') {
                  splitPositions.push(i);
                }
              }
              
              if (splitPositions.length === 0) {
                return label;
              }
              
              // Find split position closest to middle
              const middle = label.length / 2;
              const splitIndex = splitPositions.reduce((closest, current) => {
                return Math.abs(current - middle) < Math.abs(closest - middle) ? current : closest;
              });
              
              return [
                label.slice(0, splitIndex),
                label.slice(splitIndex + 1)
              ];
            }
          }
        }
      }
    }
  });

  updateChart();
  
  // Add search functionality for edit table
  document.getElementById('editSearchInput').addEventListener('keyup', function() {
    var searchWords = this.value.toLowerCase().split(' ').filter(word => word.length > 0);
    var tableBody = document.querySelector('table:first-of-type tbody');
    var rows = tableBody.getElementsByTagName('tr');
    
    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    leaderboardData.datasets[1].data = [];
    
    for (var i = 0; i < rows.length; i++) {
      var rowText = rows[i].textContent;
      if (searchWords.every(word => rowText.toLowerCase().includes(word))) {
        rows[i].style.display = '';
        displayedData.push(allData[i]);
        leaderboardData.labels.push(allData[i].model);
        leaderboardData.datasets[0].data.push(allData[i].pass_rate);
        // Only include cost if it's not zero (placeholder for unknown)
        leaderboardData.datasets[1].data.push(allData[i].total_cost > 0 ? allData[i].total_cost : null);
      } else {
        rows[i].style.display = 'none';
      }
    }
    leaderboardChart.update();
  });
});

</script>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[o1 scores the top result on aider's new multi-language, more challenging coding benchmark.]]></summary></entry><entry><title type="html">QwQ is a code architect, not an editor</title><link href="https://aider.chat/2024/12/03/qwq.html" rel="alternate" type="text/html" title="QwQ is a code architect, not an editor" /><published>2024-12-03T00:00:00+00:00</published><updated>2024-12-03T00:00:00+00:00</updated><id>https://aider.chat/2024/12/03/qwq</id><content type="html" xml:base="https://aider.chat/2024/12/03/qwq.html"><![CDATA[<p class="post-date">December 03, 2024</p>

<h1 class="no_toc" id="qwq-is-a-code-architect-not-an-editor">QwQ is a code architect, not an editor</h1>

<canvas id="qwqChart" width="800" height="500" style="margin: 20px 0"></canvas>

<p>QwQ 32B Preview is a “reasoning” model, which spends a lot of tokens thinking before
rendering a final response.
This is similar to OpenAI’s o1 models, which are most effective with aider
<a href="https://aider.chat/2024/09/26/architect.html">when paired as an architect with a traditional LLM as an editor</a>.
In this mode, the reasoning model acts as an “architect” to propose a solution to the
coding problem without regard for how to actually make edits to the source files.
The “editor” model receives that proposal, and focuses solely on how to
edit the existing source code to implement it.</p>

<p>Used alone without being paired with an editor, 
QwQ was unable to comply with even the simplest 
<a href="https://aider.chat/docs/more/edit-formats.html">editing format</a>.
It was not able to reliably edit source code files.
As a result, QwQ’s solo score on the benchmark was quite underwhelming
(and far worse than the o1 models performing solo).</p>

<p>QwQ is based on
Qwen 2.5 Coder 32B Instruct,
and does better when paired with it as an architect + editor combo.
Though this provided only a modest benchmark improvement over just using Qwen alone,
and comes with a fairly high cost in terms of latency.
Each request must wait for QwQ to return all its thinking text
and the final solution proposal.
And then one must wait for Qwen to turn that large
response into actual file edits.</p>

<p>Pairing QwQ with other sensible editor models performed the same or worse than
just using Qwen 2.5 Coder 32B Instruct alone.</p>

<p>QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
That is well below the
SOTA results for this benchmark: Sonnet alone scores 84%, and
o1-preview + o1-mini as architect + editor scores 85%.</p>

<h2 id="qwq-specific-editing-formats">QwQ specific editing formats</h2>

<p>I spent some time experimenting with a variety of custom editing formats
for QwQ.
In particular, I tried to parse the QwQ response and discard the long
sections of “thinking” and retain only the “final” solution.
None of this custom work seemed to translate 
into any significant improvement in the benchmark results.</p>

<h2 id="results">Results</h2>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('qwqChart').getContext('2d');
  var allData = [];
  
    allData.push({
      model: 'QwQ + Haiku',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'QwQ + DeepSeek V2.5',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Qwen2.5 Coder 32B-I',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'QwQ + Qwen2.5 Coder 32B-I',
      pass_rate_2: 73.6
    });
  
    allData.push({
      model: 'QwQ',
      pass_rate_2: 42.1
    });
  
    allData.push({
      model: 'o1-mini',
      pass_rate_2: 70.7
    });
  
    allData.push({
      model: 'o1-preview',
      pass_rate_2: 79.7
    });
  

  // Sort data by pass_rate_2 in descending order
  allData.sort((a, b) => b.pass_rate_2 - a.pass_rate_2);

  var chart;
  
  function updateChart(filterText) {
    var filteredData = allData.filter(row => 
      row.model.toLowerCase().includes(filterText.toLowerCase())
    );
    
    var chartData = {
      labels: filteredData.map(row => row.model),
      datasets: [{
        data: filteredData.map(row => row.pass_rate_2),
        backgroundColor: filteredData.map(row => 
          (row.model === 'Qwen2.5 Coder 32B-I' || row.model === 'Sonnet (SOTA)' || row.model === 'o1-mini' || row.model === 'o1-preview' || row.model === 'QwQ') 
            ? 'rgba(75, 192, 192, 0.2)'   // Green for solo models
            : 'rgba(54, 162, 235, 0.2)'   // Blue for architect+editor
        ),
        borderColor: filteredData.map(row => 
          (row.model === 'Qwen2.5 Coder 32B-I' || row.model === 'Sonnet (SOTA)' || row.model === 'o1-mini' || row.model === 'o1-preview' || row.model === 'QwQ')
            ? 'rgba(75, 192, 192, 1)'     // Green border for solo models
            : 'rgba(54, 162, 235, 1)'     // Blue border for architect+editor
        ),
        borderWidth: 1
      }]
    };

    if (chart) {
      chart.data = chartData;
      chart.update();
    } else {
      chart = new Chart(ctx, {
        type: 'bar',
        data: chartData,
        options: {
          plugins: {
            legend: {
              display: true,
              position: 'top',
              labels: {
                font: {
                  size: 14
                },
                generateLabels: function(chart) {
                  return [
                    {
                      text: 'Solo model',
                      fillStyle: 'rgba(75, 192, 192, 0.2)',
                      strokeStyle: 'rgba(75, 192, 192, 1)',
                      lineWidth: 1,
                      fontColor: '#666'
                    },
                    {
                      text: 'Architect + Editor',
                      fillStyle: 'rgba(54, 162, 235, 0.2)',
                      strokeStyle: 'rgba(54, 162, 235, 1)',
                      lineWidth: 1,
                      fontColor: '#666'
                    }
                  ];
                }
              }
            }
          },
          scales: {
            y: {
              beginAtZero: true,
              title: {
                display: true,
                text: 'Aider code editing benchmark (%)',
                font: {
                  size: 18
                }
              },
              ticks: {
                font: {
                  size: 16
                }
              }
            },
            x: {
              ticks: {
                font: {
                  size: 16
                },
                callback: function(value, index) {
                  const label = this.getLabelForValue(value);
                  if (label.includes(" + ")) {
                    const parts = label.split(" + ");
                    return [parts[0] + " +", parts[1]];
                  }
                  return label;
                }
              }
            }
          }
        }
      });
    }
  }

  // Initial chart render
  updateChart('');

  // Connect search input to chart filtering
  document.getElementById('qwqSearchInput').addEventListener('keyup', function() {
    updateChart(this.value);
  });
});

</script>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview</td>
        <td style="padding: 8px; text-align: center;">79.7%</td>
        <td style="padding: 8px; text-align: center;">93.2%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + Qwen2.5 Coder 32B-I</td>
        <td style="padding: 8px; text-align: center;">73.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Qwen2.5 Coder 32B-I</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 (via GLHF)</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + Haiku</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini</td>
        <td style="padding: 8px; text-align: center;">70.7%</td>
        <td style="padding: 8px; text-align: center;">90.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + DeepSeek V2.5</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ</td>
        <td style="padding: 8px; text-align: center;">42.1%</td>
        <td style="padding: 8px; text-align: center;">91.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>

<script>
document.getElementById('qwqSearchInput').addEventListener('keyup', function() {
    var input = this.value.toLowerCase();
    var rows = document.querySelectorAll('tbody tr');
    
    rows.forEach(function(row) {
        var text = row.textContent.toLowerCase();
        if(text.includes(input)) {
            row.style.display = '';
            row.classList.add('selected');
        } else {
            row.style.display = 'none';
            row.classList.remove('selected');
        }
    });
});
</script>

<h2 id="open-source-model-caveats">Open source model caveats</h2>

<p>As discussed in a recent blog post,
<a href="https://aider.chat/2024/11/21/quantization.html">details matter with open source models</a>.
For clarity, new benchmark runs for this article were
performed against OpenRouter’s endpoints for
QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct.
For the other models, the benchmark was direct to their providers’ APIs.</p>

<p>Having recently done extensive testing of OpenRouter’s Qwen 2.5 Coder 32B Instruct endpoint,
it seems reliable.
The provider Mancer was blocked due to the small context window it provides.</p>

<p>For QwQ 32B Preview, Fireworks was blocked because of its small context window.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[QwQ is reasoning model like o1, and needs to be used as an architect with another model as editor.]]></summary></entry><entry><title type="html">Details matter with open source models</title><link href="https://aider.chat/2024/11/21/quantization.html" rel="alternate" type="text/html" title="Details matter with open source models" /><published>2024-11-21T00:00:00+00:00</published><updated>2024-11-21T00:00:00+00:00</updated><id>https://aider.chat/2024/11/21/quantization</id><content type="html" xml:base="https://aider.chat/2024/11/21/quantization.html"><![CDATA[<p class="post-date">November 21, 2024</p>

<h1 class="no_toc" id="details-matter-with-open-source-models">Details matter with open source models</h1>

<canvas id="quantChart" width="800" height="600" style="margin: 20px 0"></canvas>

<p>Open source models like Qwen 2.5 32B Instruct are performing very well on
aider’s code editing benchmark, rivaling closed source frontier models.</p>

<p>But pay attention to how your model is being served and quantized, 
as it can impact code editing skill.
Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.</p>

<p>The graph above and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.</p>

<ul>
  <li>The <a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct">HuggingFace BF16 weights</a> served via <a href="https://glhf.chat">glhf.chat</a>.</li>
  <li><a href="https://t.co/cwX3DYX35D">4bit and 8bit quants for mlx</a>.</li>
  <li>The results from <a href="https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers">OpenRouter’s mix of providers</a> which serve the model with different levels of quantization.</li>
  <li>Results from OpenRouter’s providers, both served via OpenRouter and directly to their own APIs.</li>
  <li>Ollama locally serving different quantizations from the <a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M">Ollama model library</a> with 8k+
context windows.</li>
  <li>An Ollama fp16 quantization served with Ollama’s default 2k context window.</li>
</ul>

<h3 id="pitfalls-and-details">Pitfalls and details</h3>

<p>This benchmarking effort highlighted a number of pitfalls and details specific to open source
models which
can have a significant impact on their ability to correctly edit code:</p>

<ul>
  <li><strong>Quantization</strong> – Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but stronger quantizations
do have a real impact.</li>
  <li><strong>Context window</strong> – Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama’s local API server
defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
catastrophic effects on performance, without throwing obvious hard errors.</li>
  <li><strong>Output token limits</strong> – Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.</li>
  <li><strong>Buggy cloud providers</strong> – While benchmarking Qwen 2.5 Coder 32B Instruct
and DeepSeek V2.5, I discovered
multiple cloud providers with broken or buggy API endpoints.
They seemed
to be returning results different from expected based on the advertised
quantization and context sizes.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.</li>
</ul>

<p>Closed source, proprietary models don’t typically have these issues.
They are owned and operated by the organization that created them,
and typically served with specific, predictable context window and output token limits.
Their quantization level is usually unknown, but fixed and unchanging for all users.</p>

<h3 id="conclusions">Conclusions</h3>

<p>The best versions of the Qwen model rival GPT-4o, while the worst performing
quantization is more like the older GPT-4 Turbo when served competently.
Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
if run with Ollama’s default 2k context window.</p>

<h3 class="no_toc" id="sections">Sections</h3>

<ul id="markdown-toc">
  <li><a href="#pitfalls-and-details" id="markdown-toc-pitfalls-and-details">Pitfalls and details</a></li>
  <li><a href="#conclusions" id="markdown-toc-conclusions">Conclusions</a></li>
  <li><a href="#benchmark-results" id="markdown-toc-benchmark-results">Benchmark results</a></li>
  <li><a href="#setting-ollamas-context-window-size" id="markdown-toc-setting-ollamas-context-window-size">Setting Ollama’s context window size</a></li>
  <li><a href="#choosing-providers-with-openrouter" id="markdown-toc-choosing-providers-with-openrouter">Choosing providers with OpenRouter</a></li>
  <li><a href="#notes" id="markdown-toc-notes">Notes</a></li>
</ul>

<h2 id="benchmark-results">Benchmark results</h2>

<p class="note">These are results from single benchmark runs, so expect normal variance of +/- 1-2%.</p>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('quantChart').getContext('2d');
  var allData = [];
  
    allData.push({
      model: 'HuggingFace via GLHF: BF16',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'Ollama: fp16',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'Hyperbolic: BF16',
      pass_rate_2: 69.2
    });
  
    allData.push({
      model: 'mlx-community: 4bit',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'mlx-community: 8bit',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'OpenRouter: multiple',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Ollama: q4_K_M',
      pass_rate_2: 66.9
    });
  
    allData.push({
      model: 'Deepinfra: BF16',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'Fireworks: unknown',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'Ollama: q2_K',
      pass_rate_2: 61.7
    });
  
    allData.push({
      model: 'Fireworks via OpenRouter: unknown',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Hyperbolic via OpenRouter: BF16',
      pass_rate_2: 68.4
    });
  
    allData.push({
      model: 'Deepinfra via OpenRouter: BF16',
      pass_rate_2: 69.9
    });
  
    allData.push({
      model: 'Ollama: fp16, 2k ctx',
      pass_rate_2: 51.9
    });
  

  // Sort data by pass_rate_2 in descending order
  allData.sort((a, b) => b.pass_rate_2 - a.pass_rate_2);

  var chart;
  
  function updateChart(filterText) {
    var filteredData = allData.filter(row => 
      row.model.toLowerCase().includes(filterText.toLowerCase())
    );
    
    var chartData = {
      labels: filteredData.map(row => row.model),
      datasets: [{
        label: 'Percent completed correctly',
        data: filteredData.map(row => row.pass_rate_2),
        backgroundColor: 'rgba(54, 162, 235, 0.2)',
        borderColor: 'rgba(54, 162, 235, 1)',
        borderWidth: 1
      }]
    };

    if (chart) {
      chart.data = chartData;
      chart.update();
    } else {
      chart = new Chart(ctx, {
        type: 'bar',
        data: chartData,
        options: {
          plugins: {
            legend: {
              display: false
            },
            title: {
              display: true,
              text: 'Aider code editing benchmark',
              font: {
                size: 16
              }
            }
          },
          scales: {
            y: {
              beginAtZero: true,
              title: {
                display: true,
                text: 'Percent completed correctly',
                font: {
                  size: 14
                }
              },
              ticks: {
                font: {
                  size: 16
                }
              }
            },
            x: {
              ticks: {
                font: {
                  size: 16
                }
              },
              title: {
                display: true,
                text: 'Provider: quantization',
                font: {
                  size: 14
                }
              }
            }
          }
        }
      });
    }
  }

  // Initial chart render
  updateChart('');

  // Connect search input to chart filtering
  document.getElementById('quantSearchInput').addEventListener('keyup', function() {
    updateChart(this.value);
  });
});

</script>

<p><input type="text" id="quantSearchInput" placeholder="Search..." style="width: 100%; max-width: 800px; margin: 10px auto; padding: 8px; display: block; border: 1px solid #ddd; border-radius: 4px;" /></p>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Fireworks: unknown</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Deepinfra: BF16</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model deepinfra/Qwen/Qwen2.5-Coder-32B-Instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">mlx-community: 8bit</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">92.5%</td>
        <td style="padding: 8px;"><code>aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-8bit</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">mlx-community: 4bit</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">88.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-4bit</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: fp16</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">90.2%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-fp16</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">HuggingFace via GLHF: BF16</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Deepinfra via OpenRouter: BF16</td>
        <td style="padding: 8px; text-align: center;">69.9%</td>
        <td style="padding: 8px; text-align: center;">89.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Hyperbolic: BF16</td>
        <td style="padding: 8px; text-align: center;">69.2%</td>
        <td style="padding: 8px; text-align: center;">91.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://api.hyperbolic.xyz/v1/</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Hyperbolic via OpenRouter: BF16</td>
        <td style="padding: 8px; text-align: center;">68.4%</td>
        <td style="padding: 8px; text-align: center;">89.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Fireworks via OpenRouter: unknown</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">OpenRouter: multiple</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">95.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: q4_K_M</td>
        <td style="padding: 8px; text-align: center;">66.9%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: q2_K</td>
        <td style="padding: 8px; text-align: center;">61.7%</td>
        <td style="padding: 8px; text-align: center;">91.7%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-q2_K</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: fp16, 2k ctx</td>
        <td style="padding: 8px; text-align: center;">51.9%</td>
        <td style="padding: 8px; text-align: center;">46.2%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>

<script>
document.getElementById('quantSearchInput').addEventListener('keyup', function() {
    var input = this.value.toLowerCase();
    var rows = document.querySelectorAll('tbody tr');
    
    rows.forEach(function(row) {
        var text = row.textContent.toLowerCase();
        if(text.includes(input)) {
            row.style.display = '';
            row.classList.add('selected');
        } else {
            row.style.display = 'none';
            row.classList.remove('selected');
        }
    });
});
</script>

<h2 id="setting-ollamas-context-window-size">Setting Ollama’s context window size</h2>

<p><a href="https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size">Ollama uses a 2k context window by default</a>,
which is very small for working with aider.
Unlike most other LLM servers, Ollama does not throw an error if you submit
a request that exceeds the context window.
Instead, it just silently truncates the request by discarding the “oldest” messages
in the chat to make it fit within the context window.</p>

<p>Except for the single 2k context result,
all of the Ollama results above were collected with at least an 8k context window.
An 8k window is large enough to attempt all the coding problems in the benchmark.
Aider sets Ollama’s context window to 8k by default, starting in aider v0.65.0.</p>

<p>You can change the Ollama server’s context window with a 
<a href="https://aider.chat/docs/config/adv-model-settings.html#model-settings"><code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file</a>
like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192
</code></pre></div></div>

<h2 id="choosing-providers-with-openrouter">Choosing providers with OpenRouter</h2>

<p>OpenRouter allows you to ignore specific providers in your
<a href="https://openrouter.ai/settings/preferences">preferences</a>.
This can be used to limit your OpenRouter requests to be
served by only your preferred providers.</p>

<h2 id="notes">Notes</h2>

<p>This article went through many revisions as I received feedback from
numerous members of the community.
Here are some of the noteworthy learnings and changes:</p>

<ul>
  <li>The first version of this article included incorrect Ollama models.</li>
  <li>Earlier Ollama results used the too small default 2k context window,
artificially harming the benchmark results.</li>
  <li>The benchmark results appear to have uncovered a problem in the way
OpenRouter was communicating with Hyperbolic.
They fixed the issue 11/24/24, shortly after it was pointed out.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.]]></summary></entry><entry><title type="html">Separating code reasoning and editing</title><link href="https://aider.chat/2024/09/26/architect.html" rel="alternate" type="text/html" title="Separating code reasoning and editing" /><published>2024-09-26T00:00:00+00:00</published><updated>2024-09-26T00:00:00+00:00</updated><id>https://aider.chat/2024/09/26/architect</id><content type="html" xml:base="https://aider.chat/2024/09/26/architect.html"><![CDATA[<p class="post-date">September 26, 2024</p>

<h1 id="separating-code-reasoning-and-editing">Separating code reasoning and editing</h1>

<p>Aider now has experimental support for using two models to complete each coding task:</p>

<ul>
  <li>An Architect model is asked to describe how to solve the coding problem.</li>
  <li>An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.</li>
</ul>

<p>Splitting up “code reasoning” and “code editing” in this manner
has produced SOTA results on
<a href="/docs/benchmarks.html#the-benchmark">aider’s code editing benchmark</a>.
Using o1-preview as the Architect with either DeepSeek or o1-mini as the
Editor produced the SOTA score of 85%.
Using the Architect/Editor approach
also significantly improved the benchmark scores of many
models, compared to their previous “solo” baseline scores (striped bars).</p>

<style>
  .shaded td {
    background-color: #f2f2f2;
    border-top: 1px solid #ccc;
  }
  .table-container {
    max-width: 100%;
    overflow-x: auto;
  }
  .responsive-table {
    border-collapse: separate;
    border-spacing: 0;
    width: 100%;
    font-size: 16px;
    border: 1px solid #ddd;
  }
  .responsive-table th, .responsive-table td {
    padding: 8px;
    text-align: left;
    border-bottom: 1px solid #ddd;
    word-break: break-word;
  }
  .responsive-table th {
    background-color: #e2e2e2;
  }
  .responsive-table th:first-child,
  .responsive-table td:first-child {
    border-left: 1px solid #ddd;
  }
  .responsive-table th:last-child,
  .responsive-table td:last-child {
    border-right: 1px solid #ddd;
  }
  
  @media screen and (max-width: 600px) {
    .responsive-table {
      font-size: 12px;
    }
    .responsive-table th, .responsive-table td {
      padding: 4px;
    }
  }
</style>

<style>
  #passRateChart {
    max-width: 100%;
    height: auto !important;
  }
</style>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chartjs-plugin-annotation@1.0.2"></script>

<canvas id="passRateChart" width="400" height="250"></canvas>
<script>
  document.addEventListener("DOMContentLoaded", function() {
    var ctx = document.getElementById('passRateChart').getContext('2d');
    
    // Function to determine aspect ratio and base font size based on screen width
    function getChartSettings() {
      if (window.innerWidth < 600) {
        return { aspectRatio: 1, baseFontSize: 8 }; // Slightly taller for small screens
      } else if (window.innerWidth < 800) {
        return { aspectRatio: 1.2, baseFontSize: 10 }; // Slightly taller for small screens
      } else {
        return { aspectRatio: 1.4, baseFontSize: 12 }; // Slightly taller for larger screens
      }
    }

    var chartSettings = getChartSettings();
    var baseFontSize = chartSettings.baseFontSize;

    var labels = [];
    var data = [];
    var colorMapping = {
      "claude-3.5-sonnet": "rgba(75, 192, 192, 0.2)",
      "gpt-4o": "rgba(255, 99, 132, 0.2)",
      "o1-preview": "rgba(54, 162, 235, 0.2)",
      "o1-mini": "rgba(255, 206, 86, 0.2)",
      "gpt-4o-mini": "rgba(153, 102, 255, 0.2)"
    };
    var borderColorMapping = {
      "claude-3.5-sonnet": "rgba(75, 192, 192, 1)",
      "gpt-4o": "rgba(255, 99, 132, 1)",
      "o1-preview": "rgba(54, 162, 235, 1)",
      "o1-mini": "rgba(255, 206, 86, 1)",
      "gpt-4o-mini": "rgba(153, 102, 255, 1)"
    };
    var backgroundColors = [];
    var borderColors = [];
    var patterns = {};
    for (var key in colorMapping) {
      patterns[key] = ctx.createPattern(createStripePattern(colorMapping[key]), 'repeat');
    }
    
    
      
        if ("o1-mini" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("o1-mini/whole");
        }
        data.push(85.0);
        if ("o1-mini" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(85.0);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("claude-3-5-sonnet" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("claude-3-5-sonnet/diff");
        }
        data.push(82.7);
        if ("claude-3-5-sonnet" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(80.5);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(80.5);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(79.7);
        if ("" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
    
      
        if ("claude-3.5-sonnet" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("claude-3.5-sonnet/diff");
        }
        data.push(80.5);
        if ("claude-3.5-sonnet" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(78.9);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(78.9);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(77.4);
        if ("" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
    
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(75.2);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(74.4);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(73.7);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(71.4);
        if ("" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
    
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(71.4);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(70.7);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(69.2);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(61.1);
        if ("" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
    
      
        if ("gpt-4o-mini" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o-mini/whole");
        }
        data.push(60.2);
        if ("gpt-4o-mini" == "") {
          backgroundColors.push(patterns["gpt-4o-mini"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o-mini"]);
        }
        borderColors.push(borderColorMapping["gpt-4o-mini"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/whole");
        }
        data.push(55.6);
        if ("" == "") {
          backgroundColors.push(patterns["gpt-4o-mini"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o-mini"]);
        }
        borderColors.push(borderColorMapping["gpt-4o-mini"]);
      
    
    labels.reverse();
    data.reverse();
    backgroundColors.reverse();
    borderColors.reverse();
    var chart = new Chart(ctx, {
      type: 'bar',
      data: {
        labels: labels,
        datasets: [{
          label: 'Pass Rate',
          data: data,
          backgroundColor: backgroundColors,
          borderColor: borderColors,
          borderWidth: 1
        }]
      },
      options: {
        responsive: true,
        maintainAspectRatio: true,
        aspectRatio: chartSettings.aspectRatio,
        scales: {
          y: { 
            beginAtZero: true,
            title: {
              display: true,
              text: 'Pass Rate (%)',
              font: {
                size: baseFontSize + 6
              }
            },
            ticks: {
              font: {
                size: baseFontSize
              }
            }
          },
          x: {
            title: {
              display: true,
              text: 'Editor model and edit format',
              font: {
                size: baseFontSize + 6
              }
            },
            ticks: {
              font: {
                size: baseFontSize + 4
              },
              maxRotation: 90, // Allow full rotation if needed
              minRotation: 45  // Start rotating at 45 degrees to fit more labels
            }
          }
        },
        plugins: {
          annotation: {
            annotations: {
              line1: {
                type: 'line',
                yMin: 79.7,
                yMax: 79.7,
                borderColor: 'rgba(255, 99, 132, 0.8)',
                borderWidth: 2,
                borderDash: [6, 6],
                label: {
                  content: 'Previous SOTA',
                  enabled: true,
                  position: 'start',
                  xAdjust: 10,
                  font: {
                    size: baseFontSize
                  }
                }
              }
            }
          },
          legend: {
            display: true,
            title: {
              display: true,
              text: 'Architect model',
              font: {
                size: baseFontSize + 2,
                weight: 'bold'
              }
            },
            labels: {
              font: {
                size: baseFontSize + 4
              },
              generateLabels: function(chart) {
                var colorMapping = {
                  "o1-preview": "rgba(54, 162, 235, 0.2)",
                  "claude-3.5-sonnet": "rgba(75, 192, 192, 0.2)",
                  "gpt-4o": "rgba(255, 99, 132, 0.2)",
                  "o1-mini": "rgba(255, 206, 86, 0.2)",
                  "gpt-4o-mini": "rgba(153, 102, 255, 0.2)"
                };
                return Object.keys(colorMapping).reverse().map(function(key) {
                  return {
                    text: key,
                    fillStyle: colorMapping[key],
                    strokeStyle: colorMapping[key].replace('0.2', '1'),
                    lineWidth: 1
                  };
                });
              }
            }
          }
        }
      }
    });

    // Update aspect ratio and font sizes on window resize
    window.addEventListener('resize', function() {
      var newSettings = getChartSettings();
      chart.options.aspectRatio = newSettings.aspectRatio;
      baseFontSize = newSettings.baseFontSize;
      
      // Update font sizes
      chart.options.scales.y.title.font.size = baseFontSize + 6;
      chart.options.scales.y.ticks.font.size = baseFontSize;
      chart.options.scales.x.title.font.size = baseFontSize + 6;
      chart.options.scales.x.ticks.font.size = baseFontSize + 4;
      chart.options.plugins.annotation.annotations.line1.label.font.size = baseFontSize;
      chart.options.plugins.legend.title.font.size = baseFontSize + 4;
      chart.options.plugins.legend.labels.font.size = baseFontSize + 4;
      
      chart.update();
    });
  });

  function createStripePattern(baseColor) {
    var canvas = document.createElement('canvas');
    canvas.width = 10;
    canvas.height = 10;
    var ctx = canvas.getContext('2d');

    ctx.fillStyle = baseColor;
    ctx.fillRect(0, 0, canvas.width, canvas.height);
    ctx.strokeStyle = 'rgba(0, 0, 0, 0.1)';
    ctx.lineWidth = 2;
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(10, 10);
    ctx.stroke();

    return canvas;
  }
</script>

<h2 id="motivation">Motivation</h2>

<p>This approach was motivated by the release of OpenAI’s o1 models.
They are strong at reasoning, but often fail to output properly formatted
code editing instructions.
It helps to instead let them describe the solution
however they prefer and then pass that output to a more traditional LLM.
This second Editor LLM can then interpret the solution description and
produce the code editing instructions needed to update
the existing source code.</p>

<p>This approach has recently become attractive for aider due to 
rapid improvements in the speed and costs of frontier models.
In particular, chaining older LLMs would have been quite slow and
incompatible with aider’s goal of providing an interactive,
pair programming AI coding experience.</p>

<h2 id="code-reasoning-and-code-editing">Code reasoning and code editing</h2>

<p>Normally aider asks the model to solve a coding problem in one prompt,
asking the LLM to explain the solution and return 
a well formatted series of file edits.
All of <a href="/docs/more/edit-formats.html">aider’s editing formats</a>
require the LLM to return source code edits in a specific text
format, so that aider can process the edits and apply them to the local source files.</p>

<p>Because this all happens in a single prompt/response round trip to the LLM,
the model has to split its attention between 
solving the coding problem and conforming to the edit format.</p>

<p>The Architect/Editor approach splits this into two inference steps, possibly
using two different LLMs:</p>

<ol>
  <li>Solve the coding problem (Architect).</li>
  <li>Turn the proposed solution into a series of well formed code edits (Editor).</li>
</ol>

<p>The Architect/Editor approach allows the Architect to focus on solving the coding problem
and <em>describe the solution however comes naturally to it</em>.
Similarly, the Editor can focus all of its attention on properly formatting the edits
without needing to reason much about how to solve the coding problem.</p>

<p>We can assign the Architect and Editor roles to LLMs which are well suited to their needs.
Strong reasoning model like o1-preview make excellent Architects, while
the Editor role can be assigned to an appropriate model based on cost, speed
and code editing skill.</p>

<h2 id="results">Results</h2>

<p>The graph above and the table below show the
<a href="/docs/benchmarks.html#the-benchmark">aider’s code editing benchmark</a>
score for various combinations of Architect and Editor models.</p>

<p>Some noteworthy observations:</p>

<ul>
  <li>Pairing o1-preview as Architect with either Deepseek or o1-mini as Editor sets a SOTA significantly above the previous best score. This result is obtained with the “whole” editing format, requiring the Editor to output a full update copy of each edited source file. Both of these steps are therefore quite slow, so probably not practical for interactive use with aider.</li>
  <li>Pairing OpenAI’s o1-preview with Anthropic’s Sonnet as the Editor produces the second best result. This is an entirely practical configuration for users able to work with both providers.</li>
  <li>Pairing many models with themselves in the Architect/Editor configuration can provide
significant benefits. 
Sonnet, GPT-4o and GPT-4o-mini all scored higher when used as an Architect/Editor pair.</li>
  <li>Deepseek is surprisingly effective as an Editor model. It seems remarkably capable at turning proposed coding solutions into new, updated versions of the source files. Using the efficient “diff” editing format, Deepseek helps all the Architect models except for Sonnet.</li>
</ul>

<h2 id="try-it">Try it!</h2>

<p>The development version of aider 
has built in defaults to support Architect/Editor coding with
o1-preview, o1-mini, GPT-4o and Claude 3.5 Sonnet.
Run aider with <code class="language-plaintext highlighter-rouge">--architect</code> or get started quickly like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install -U aider-chat

# Change directory into a git repo
cd /to/your/git/repo

# Work with Claude 3.5 Sonnet as the Architect and Editor
export ANTHROPIC_API_KEY=your-key-goes-here
aider --sonnet --architect

# Work with OpenAI models, using gpt-4o as the Editor
export OPENAI_API_KEY=your-key-goes-here
aider --4o --architect
aider --o1-mini --architect
aider --o1-preview --architect
</code></pre></div></div>

<h2 id="more-info">More info</h2>

<p>Aider has a number of “chat modes”, and “architect” is available as a new chat mode.
The <code class="language-plaintext highlighter-rouge">--architect</code> switch is a shortcut for <code class="language-plaintext highlighter-rouge">--chat-mode architect</code>.
For more details, see documentation on 
<a href="/docs/usage/modes.html">aider’s chat modes</a>.</p>

<h2 id="full-results">Full results</h2>

<p>Below are the benchmark results using various models as the Architect, paired with
various models as the Editor.
Each section includes a “baseline” result,
where the model works
by itself in aider’s normal “code” editing mode
(not as part of an Architect/Editor configuration).
This “solo” baseline represents the performance previously available when using
this model with aider.</p>

<div class="table-container">
  <table class="responsive-table">
    <thead>
      <tr>
        <th>Architect</th>
        <th>Editor</th>
        <th>Edit Format</th>
        <th>Pass Rate</th>
      </tr>
    </thead>
    <tbody>
      
        
        
          <tr class="">
            <td>o1-preview</td>
            <td>o1-mini</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">85.0%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">85.0%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>claude-3-5-sonnet</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">82.7%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">79.7%</td>
          </tr>
        
      
        
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>claude-3.5-sonnet</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">78.9%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">78.9%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">77.4%</td>
          </tr>
        
      
        
        
          <tr class="">
            <td>gpt-4o</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">75.2%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">74.4%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">73.7%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">71.4%</td>
          </tr>
        
      
        
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">71.4%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">70.7%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">69.2%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">61.1%</td>
          </tr>
        
      
        
        
          <tr class="">
            <td>gpt-4o-mini</td>
            <td>gpt-4o-mini</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">60.2%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o-mini</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">55.6%</td>
          </tr>
        
      
    </tbody>
  </table>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[An Architect model describes how to solve the coding problem, and an Editor model translates that into file edits. This Architect/Editor approach produces SOTA benchmark results.]]></summary></entry><entry><title type="html">o1-preview is SOTA on the aider leaderboard</title><link href="https://aider.chat/2024/09/12/o1.html" rel="alternate" type="text/html" title="o1-preview is SOTA on the aider leaderboard" /><published>2024-09-12T00:00:00+00:00</published><updated>2024-09-12T00:00:00+00:00</updated><id>https://aider.chat/2024/09/12/o1</id><content type="html" xml:base="https://aider.chat/2024/09/12/o1.html"><![CDATA[<p class="post-date">September 12, 2024</p>

<h1 id="openai-o1-preview-is-sota-on-the-aider-leaderboard">OpenAI o1-preview is SOTA on the aider leaderboard</h1>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>
<script>
  document.addEventListener('DOMContentLoaded', function () {
    var ctx = document.getElementById('editChart').getContext('2d');
    var leaderboardData = {
      labels: [],
      datasets: [{
        label: 'Percent completed correctly',
        data: [],
        backgroundColor: [],
        borderColor: [],
        borderWidth: 1
      }]
    };

    var allData = [];
    
      allData.push({
        model: 'o1-preview (whole)',
        pass_rate: 79.7,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'claude-3.5-sonnet (diff)',
        pass_rate: 77.4,
        percent_cases_well_formed: 99.2,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'o1-preview (diff)',
        pass_rate: 75.2,
        percent_cases_well_formed: 84.2,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'claude-3.5-sonnet (whole)',
        pass_rate: 75.2,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'gpt-4o-2024-08-06 (diff)',
        pass_rate: 71.4,
        percent_cases_well_formed: 98.5,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'o1-mini (whole)',
        pass_rate: 70.7,
        percent_cases_well_formed: 90.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'o1-mini (diff)',
        pass_rate: 62.4,
        percent_cases_well_formed: 85.7,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'gpt-4o-mini (whole)',
        pass_rate: 55.6,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    

    function updateChart() {
      var selectedRows = document.querySelectorAll('tr.selected');
      var showAll = selectedRows.length === 0;

      leaderboardData.labels = [];
      leaderboardData.datasets[0].data = [];
      leaderboardData.datasets[0].backgroundColor = [];
      leaderboardData.datasets[0].borderColor = [];

      allData.forEach(function(row, index) {
        var rowElement = document.getElementById('edit-row-' + index);
        if (showAll) {
          rowElement.classList.remove('selected');
        }
        if (showAll || rowElement.classList.contains('selected')) {
          leaderboardData.labels.push(row.model);
          leaderboardData.datasets[0].data.push(row.pass_rate);
          
          switch (row.edit_format) {
            case 'whole':
              leaderboardData.datasets[0].backgroundColor.push('rgba(255, 99, 132, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(255, 99, 132, 1)');
              break;
            case 'diff':
              leaderboardData.datasets[0].backgroundColor.push('rgba(54, 162, 235, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(54, 162, 235, 1)');
              break;
            case 'udiff':
              leaderboardData.datasets[0].backgroundColor.push('rgba(75, 192, 192, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(75, 192, 192, 1)');
              break;
            case 'diff-fenced':
              leaderboardData.datasets[0].backgroundColor.push('rgba(153, 102, 255, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(153, 102, 255, 1)');
              break;
            default:
              leaderboardData.datasets[0].backgroundColor.push('rgba(201, 203, 207, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(201, 203, 207, 1)');
          }
        }
      });

      // Apply legend filtering
      var meta = leaderboardChart.getDatasetMeta(0);
      meta.data.forEach(function(bar, index) {
        if (leaderboardData.labels.includes(allData[index].model)) {
          bar.hidden = (allData[index].edit_format === 'whole' && meta.data[0].hidden) ||
                       (allData[index].edit_format !== 'whole' && meta.data[1].hidden);
        } else {
          bar.hidden = true;
        }
      });

      leaderboardChart.update();
    }

    var tableBody = document.querySelector('table tbody');
    allData.forEach(function(row, index) {
      var tr = tableBody.children[index];
      tr.id = 'edit-row-' + index;
      tr.style.cursor = 'pointer';
      tr.onclick = function() {
        this.classList.toggle('selected');
        updateChart();
      };
    });

    var leaderboardChart = new Chart(ctx, {
      type: 'bar',
      data: leaderboardData,
      options: {
        scales: {
          y: {
            beginAtZero: true,
            title: {
              display: true,
              text: 'Correct Exercises (%)'
            }
          },
          x: {
            ticks: {
              autoSkip: false,
              maxRotation: 90,
              minRotation: 0
            }
          }
        },
        plugins: {
          legend: {
            display: true,
            position: 'top',
            labels: {
              generateLabels: function(chart) {
                var uniqueFormats = [...new Set(allData.map(item => item.edit_format))];
                return uniqueFormats.map(format => {
                  var color;
                  switch (format) {
                    case 'whole':
                      color = { fill: 'rgba(255, 99, 132, 0.2)', stroke: 'rgba(255, 99, 132, 1)' };
                      break;
                    case 'diff':
                      color = { fill: 'rgba(54, 162, 235, 0.2)', stroke: 'rgba(54, 162, 235, 1)' };
                      break;
                    case 'udiff':
                      color = { fill: 'rgba(75, 192, 192, 0.2)', stroke: 'rgba(75, 192, 192, 1)' };
                      break;
                    case 'diff-fenced':
                      color = { fill: 'rgba(153, 102, 255, 0.2)', stroke: 'rgba(153, 102, 255, 1)' };
                      break;
                    default:
                      color = { fill: 'rgba(201, 203, 207, 0.2)', stroke: 'rgba(201, 203, 207, 1)' };
                  }
                  return {
                    text: format,
                    fillStyle: color.fill,
                    strokeStyle: color.stroke,
                    lineWidth: 1,
                    hidden: false
                  };
                });
              }
            },
            onClick: function(e, legendItem, legend) {
              var ci = legend.chart;
              var clickedFormat = legendItem.text;
              
              legendItem.hidden = !legendItem.hidden;
              
              ci.data.datasets[0].data.forEach(function(dataPoint, i) {
                var meta = ci.getDatasetMeta(0);
                if (allData[i].edit_format === clickedFormat) {
                  meta.data[i].hidden = legendItem.hidden;
                }
              });
              
              ci.update();
            }
          }
        }
      }
    });

    updateChart();
  });
</script>

<h2 id="o1-preview">o1-preview</h2>

<p>OpenAI o1-preview scored 79.7% on aider’s code editing benchmark,
a state of the art result.
It achieved this result with the 
<a href="/docs/leaderboards/#notes-on-the-edit-format">“whole” edit format</a>,
where the LLM returns a full copy of the source code file with changes.</p>

<p>It is much more practical to use aider’s
<a href="/docs/leaderboards/#notes-on-the-edit-format">“diff” edit format</a>,
which allows the LLM to return search/replace blocks to 
efficiently edit the source code.
This saves significant time and token costs.</p>

<p>Using the diff edit format the o1-preview model had a strong
benchmark score of 75.2%.
This likely places o1-preview between Sonnet and GPT-4o for practical use,
but at significantly higher cost.</p>

<h2 id="o1-mini">o1-mini</h2>

<p>OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
but scored below those models.
It also works best with the whole edit format.</p>

<h2 id="future-work">Future work</h2>

<p>The o1-preview model had trouble conforming to aider’s diff edit format.
The o1-mini model had trouble conforming to both the whole and diff edit formats.
Aider is extremely permissive and tries hard to accept anything close
to the correct formats.</p>

<p>It is surprising that such strong models had trouble with
the syntactic requirements of simple text output formats.
It seems likely that aider could optimize its prompts and edit formats to
better harness the o1 models.</p>

<h2 id="using-aider-with-o1">Using aider with o1</h2>

<p>OpenAI’s new o1 models are supported in v0.57.0 of aider:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aider --model o1-mini
aider --model o1-preview
</code></pre></div></div>

<blockquote class="note">
  <p>These are initial benchmark results for the o1 models,
based on aider v0.56.1-dev.
See the <a href="/docs/leaderboards/">aider leaderboards</a> for up-to-date results
based on the latest aider releases.</p>
</blockquote>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview (whole)</td>
        <td style="padding: 8px; text-align: center;">79.7%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3.5-sonnet (diff)</td>
        <td style="padding: 8px; text-align: center;">77.4%</td>
        <td style="padding: 8px; text-align: center;">99.2%</td>
        <td style="padding: 8px;"><code>aider --sonnet</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview (diff)</td>
        <td style="padding: 8px; text-align: center;">75.2%</td>
        <td style="padding: 8px; text-align: center;">84.2%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3.5-sonnet (whole)</td>
        <td style="padding: 8px; text-align: center;">75.2%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-2024-08-06 (diff)</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">98.5%</td>
        <td style="padding: 8px;"><code>aider --model openai/gpt-4o-2024-08-06</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini (whole)</td>
        <td style="padding: 8px; text-align: center;">70.7%</td>
        <td style="padding: 8px; text-align: center;">90.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini (diff)</td>
        <td style="padding: 8px; text-align: center;">62.4%</td>
        <td style="padding: 8px; text-align: center;">85.7%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini --edit-format diff</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-mini (whole)</td>
        <td style="padding: 8px; text-align: center;">55.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[Preliminary benchmark results for the new OpenAI o1 models.]]></summary></entry></feed>