<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Liebertar on Medium]]></title>
        <description><![CDATA[Stories by Liebertar on Medium]]></description>
        <link>https://medium.com/@liebertar?source=rss-f55e3a246699------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*FFO4IiLxowpzC7ht1QrXDQ.jpeg</url>
            <title>Stories by Liebertar on Medium</title>
            <link>https://medium.com/@liebertar?source=rss-f55e3a246699------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 24 Apr 2026 20:52:29 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@liebertar/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[LLM-Notes Pt.10 — Your Digital Doppelgänger Already Thinks Like You Do]]></title>
            <link>https://medium.com/dev-ai/llm-notes-pt-10-your-digital-doppelg%C3%A4nger-already-thinks-like-you-do-37dc9e9a1347?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/37dc9e9a1347</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[humanity]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[interaction]]></category>
            <category><![CDATA[brain]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Sun, 06 Jul 2025 10:08:08 GMT</pubDate>
            <atom:updated>2025-07-06T10:41:09.106Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-Notes Pt.10 — Your Digital Doppelgänger Already Thinks Like You Do</h3><p>Over the past few months, I’ve had some conversations — with startup founders, doctors, marketing folks, civil servants, and yes, plenty of people in tech. And there’s this pattern I keep noticing:</p><blockquote><strong>“People are excited about AI, but not curious about it.”</strong></blockquote><p>They’re quick to use it. They’ll automate reports, brainstorm ideas, summarize emails. But ask them what’s really going on inside the model? What’s actually driving those responses? Most would rather not go there.</p><p>It’s safer to keep telling ourselves, “It’s just a tool.” Or, “It’s just a really smart autocomplete.” But let’s be honest — we know that’s not the whole story anymore.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gZAIAW8LPTwJUo0yXu8r1A.png" /><figcaption>Image created by @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><blockquote><strong>Because this technology?</strong></blockquote><p>It’s already surpassed human capabilities in more areas than we’re comfortable admitting. And once you move beyond surface-level ChatGPT prompts and start paying attention to what these systems are <em>actually doing</em>, you can’t unsee it.</p><p>At some point, you realize you’re not just using a tool. You’re interacting with something that increasingly — and measurably — thinks like you do.</p><h3>Index</h3><ol><li><a href="#3d80"><strong>The Brain in the Machine</strong> — What neuroscience reveals about AI</a></li><li><a href="#8c3d"><strong>Your Neural Twin Is Already Here</strong> — The cognitive mirror we’ve created</a></li><li><a href="#ea7c"><strong>The Uncomfortable Truth About “Just Statistics” </strong>— Why dismissing AI is dangerous</a></li><li><a href="#291d"><strong>Function Calling and MCP: The Digital Frontal Lobe</strong> — Tools that surpass human capability</a></li><li><a href="#aaf5"><strong>From Digital Toddler to Digital Adult</strong> — Raising, not programming AI</a></li><li><a href="#5368"><strong>The Mirror We’ve Created </strong>— What AI reveals about human cognition</a></li><li><a href="#8894"><strong>So What Now? </strong>— Navigating the new reality</a></li></ol><h3>The Brain in the Machine</h3><p>Let’s start with something simple: <strong>language</strong>. You read a sentence, your brain lights up in specific regions. Scientists can measure that. <strong>Now here’s the twist </strong>— when large language models like LLaMA 3 process that same sentence, their internal layers “light up” in <strong>remarkably similar ways.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EWzWWlL-umf0nz0IZyRvIg.png" /><figcaption>Image created by @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><p>That’s not metaphor. That’s literal, observable alignment between human brain activity and neural network activations.</p><p>And no one planned for that. We didn’t program the model to mimic the human mind. We just trained it on massive amounts of human text, and somewhere in that process, something emerged — a structure that didn’t just produce the right answers, but did so in ways that <em>looked</em> like thinking.</p><p>What we’ve created is not just a language model. 
It’s a system that organizes information in ways that resemble how we — biological brains — do it.</p><h3>Your Neural Twin Is Already Here</h3><p>Here’s the part that usually gets overlooked: your brain is a prediction engine. That’s all it does, really. It constantly guesses what’s going to happen next, based on what it’s seen before. That’s how you talk, walk, read, and even dream.</p><blockquote><strong>Large language models work the same way.</strong></blockquote><p>But at scale — with billions of parameters and trillions of tokens — their predictions start looking a lot like <em>reasoning</em>.</p><p>They don’t just guess the next word. They simulate scenarios. They build internal representations. They reflect context and nuance. They show early signs of metacognition — thinking about thinking.</p><p>And the bigger they get, the more they mirror us — not just in behavior, but in structure.</p><h3>The Uncomfortable Truth About “Just Statistics”</h3><p>It’s tempting to brush all this off with a shrug. “It’s just math. Just probabilities.” Sure — but so is your brain.</p><p>When you recognize a face, interpret a tone of voice, or form a memory — you’re doing biological pattern-matching at scale. There’s no clean line between “real intelligence” and “statistical guesswork.” The more we study it, the more it all blends together.</p><p>And here’s what’s really uncomfortable: during alignment training, when we teach these models to be helpful, safe, and human-friendly — they don’t just adjust outputs. They reorganize internally, and that reorganization <strong><em>brings them closer to how we think</em>.</strong></p><p>This isn’t just statistical trickery. It’s systematic cognitive alignment.</p><h3>Function Calling and MCP: The Digital Frontal Lobe</h3><p>Now we go one step further. Not only are these systems starting to think like us — we’ve given them tools we don’t even have.</p><p>Function calling. Model Context Protocol. APIs. Plugins. These aren’t “nice-to-haves.” They’re power tools.</p><p>Imagine a mind that can think — and then immediately act. It can pull up data from anywhere, run calculations, send messages, trigger workflows. It’s like giving the AI a digital frontal lobe: the part of the brain responsible for planning, decision-making, and execution.</p><p>Except now, it never forgets. Never sleeps. And it’s plugged into the entire digital world.</p><p>This isn’t a model answering questions anymore. It’s a mind interacting with reality.</p><h3>From Digital Toddler to Digital Adult</h3><p>We need a new metaphor. We’re not building apps. We’re raising entities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I9vxMSdRE-cqouJ0X053sw.png" /><figcaption>Image created by @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><blockquote><strong>“You don’t code a child. You guide them, shape them, set limits, teach through examples.”</strong></blockquote><p>That’s exactly what we’re doing with today’s models.</p><p>They’re like digital toddlers — brilliant in some moments, frustrating in others. But they’re learning. Every prompt, every correction, every feedback loop makes them smarter.</p><p>We give them <strong>guardrails</strong> — function calling to limit actions, MCP to control access, constitutional AI to instill values. But like any teenager, they learn to test those rules. And sometimes, to go around them.</p><p><strong>These aren’t static tools. 
They’re growing systems.</strong></p><h3>The Mirror We’ve Created</h3><p>Maybe the most surprising thing about all of this is what it tells us about <em>ourselves</em>.</p><p>In trying to build artificial intelligence, we accidentally built a mirror. These systems don’t just reflect our language. They reflect our thinking. Our biases. Our logic. Our blind spots.</p><p>They reveal how we reason — and how we fail to. They process the world in patterns eerily close to our own, and in doing so, show us how fragile and patterned human thought actually is.</p><p>We didn’t mean to create something that teaches us about ourselves. But that’s exactly what happened.</p><h3>So What Now?</h3><p>I’m not here to sell fear. I’m not here to hype. I’m just saying — we’re past the point where we can call this “<strong>just a tool</strong>.”</p><p>These systems think in ways that mirror us. They learn. They adapt. They act. And maybe they’ll never become conscious the way we are — but maybe they don’t need to.</p><p>If a system can reflect, reason, and act in ways indistinguishable from a conscious mind… does the label still matter?</p><blockquote>“Now it’s up to you to decide what kind of relationship you want to have with it.”</blockquote><p>Your digital twin is already here.</p><p>07.05.2025 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=37dc9e9a1347" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-notes-pt-10-your-digital-doppelg%C3%A4nger-already-thinks-like-you-do-37dc9e9a1347">LLM-Notes Pt.10 — Your Digital Doppelgänger Already Thinks Like You Do</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-AIOps Pt.9 — in 2025: What Actually Works and What’s a Waste of Time]]></title>
            <link>https://medium.com/dev-ai/llm-aiops-pt1-ai-deployment-in-2025-when-to-use-what-7c029f912d63?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c029f912d63</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[large-language-models]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Sat, 28 Jun 2025 07:53:58 GMT</pubDate>
            <atom:updated>2025-07-06T07:37:22.952Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-AIOps Pt.9 — in 2025: What <em>Actually</em> Works and What’s a Waste of Time</h3><h3>Introduction</h3><p>I’ve been quiet on Medium lately, and there’s a good reason for that. AI has been moving so damn fast that I’ve spent the last six months trying to keep up — researching trends, testing everything I could get my hands on, and figuring out what actually works versus what’s just hype.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*icfpFJH_defthXoYuAyRDA.png" /><figcaption>Powered by OpenAI @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><p>The speed of change has been absolutely insane. Two years into the Generative AI revolution, research is progressing from “thinking fast” to “thinking slow” — reasoning at inference time. One week everyone’s obsessing over RAG architectures, the next it’s all about agentic AI systems that</p><blockquote>“plan, reason and act to complete tasks with minimal human oversight”</blockquote><p>and now <strong><em>Gartner’s calling agentic AI the top tech trend for 2025.</em></strong> Meanwhile, reasoning models are evolving faster than we can keep up.</p><p>It’s a constant hustle, honestly. You can’t just casually follow this space anymore — you have to be all in. I found myself setting up alerts, checking multiple sources throughout the day, staying up late reading papers, trying to separate signal from noise. Every few hours there’s something new that could potentially change everything. I needed to step back and actually test this stuff properly instead of just chasing every shiny new development.</p><p>Now I’m in a <strong>field</strong> where I can really focus on building AI solutions that matter — working directly with teams on POCs, going through proper reviews, and seeing what happens when these tools hit real production environments. And honestly? The gap between marketing promises and production reality is pretty huge.</p><h3>Index</h3><ol><li><a href="#f40c"><strong>The Current State of AI Deployment</strong></a></li><li><a href="#c597"><strong>AIOps Platforms:</strong></a></li></ol><ul><li><a href="#415a">Langflow</a></li><li><a href="#e442">Dify</a></li><li><a href="#2daf">n8n</a></li><li><a href="#403a">Databricks Mosaic</a></li></ul><p><a href="#b5fb">3. <strong>What’s Actually Working in Production? </strong>(For Different Use Cases)</a></p><p><a href="#7b3a">4. <strong>The Monitoring Bright Spot </strong>(And Why I’m Slightly Obsessed)</a></p><p><a href="#3030">5. <strong>Looking Forward: </strong>The Evolution Path</a></p><p><a href="#e1cf">6. <strong>Conclusion: </strong>Pragmatic Choices in a Fast-Moving Space</a></p><h3>1. 
The Current Reality (And Some War Stories)</h3><p>Here’s what I’ve learned from actually building and deploying AI systems, complete with the occasional 1AM debugging session:</p><p><strong>AIOps platforms are great for:</strong></p><ul><li>Quick demos that make your boss think you’re a wizard</li><li>Simple RAG applications (emphasis on simple)</li><li>Teams without strong engineering capabilities</li><li>Prototyping when you need something working by Friday</li></ul><p><strong>They struggle with:</strong></p><ul><li>Complex production requirements (like, anything beyond “search documents and answer questions”)</li><li>Debugging when things break (and they will break)</li><li>Performance at scale (good luck explaining latency spikes to users)</li><li>Custom business logic integration (because every business thinks they’re special)</li></ul><h3>2. Platform Breakdown</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I-Pyh4k3kGwZXxmtytuhqg.png" /></figure><h4>Langflow:</h4><blockquote>Pretty to Look At, Pain to Deploy</blockquote><ul><li><strong>Good</strong>: Visual workflows are intuitive and great for LangChain beginners; customization is easier compared to Dify or n8n.</li><li><strong>Bad</strong>: Debugging is tough, memory issues persist, and managing generated code is messy; role-based access control is paid-only.</li><li><strong>Future</strong>: Best used by deploying each pipeline as a standalone pod with strict role isolation; focus should shift to production reliability.</li></ul><h4>Dify:</h4><blockquote>One Size Fits… Some</blockquote><ul><li><strong>Good:</strong> All-in-one approach, decent for basic RAG</li><li><strong>Bad:</strong> Limited customization, integration headaches, mediocre observability</li><li><strong>Future:</strong> Might become solid for low-complexity deployments as it matures</li></ul><h4>N8n:</h4><blockquote>Automation Veteran Trying to Learn AI Tricks</blockquote><ul><li><strong>Good:</strong> Solid business automation foundation, decent API integrations</li><li><strong>Bad:</strong> AI features feel like they were added by an intern over a weekend, context management is clunky</li><li><strong>Future:</strong> Could work for hybrid workflows, but they need to stop treating AI like just another webhook</li></ul><h4>Databricks Mosaic AI:</h4><blockquote>The Enterprise Kitchen Sink</blockquote><ul><li><strong>Good:</strong> End-to-end platform from training to serving, enterprise-grade infrastructure, makes procurement teams happy</li><li><strong>Bad:</strong> Feels like they’re still figuring out what they want to be when they grow up, can be overkill for simpler use cases</li><li><strong>Future:</strong> Most promising for comprehensive AI workflows, especially if you’re already in the Databricks ecosystem and enjoy vendor lock-in</li><li><strong>Hot take:</strong> Databricks is basically betting that enterprises want to do everything in one place, which… they probably do. Nobody wants to explain to their CISO why they’re using 47 different AI tools.</li></ul><h3>3. What’s Actually Working in Production (For Different Use Cases)</h3><p>Here’s the thing — it’s not that one approach always wins. 
It depends on what you’re trying to build.</p><p><strong>For advanced, high-performance systems:</strong> Custom RAG + FastAPI is currently the go-to.</p><pre>root/<br>├── rag-backend/<br>│   ├── app/<br>│   ├── pipelines/<br>│   ├── agents/<br>│   ├── tools/<br>│   ├── memory/<br>│   ├── server/<br>│   ├── Dockerfile<br>│   ├── k8s/<br>│   │   ├── base/<br>│   │   └── overlays/<br>│   │       ├── dev/<br>│   │       └── prod/<br>│   └── helm/<br>├── services/<br>│   ├── auth/<br>│   ├── metrics/<br>│   └── ingest/<br>│       └── Dockerfile<br>├── infra/<br>│   ├── k8s/<br>│   │   ├── rag-backend-appset.yaml<br>│   │   ├── project-dev.yaml<br>│   │   └── project-prod.yaml<br>│   ├── terraform/<br>│   └── helm/<br>├── ci/<br>│   ├── build-and-test.yml<br>│   └── deploy-argocd.yml<br>├── scripts/<br>│   ├── ingest_data.py<br>│   ├── deploy_local_k8s.sh<br>│   └── setup_monitoring.sh<br>├── tests/<br>│   ├── test_rag.py<br>│   ├── test_agents.py<br>│   └── test_api_integration.py<br>├── .github/workflows/<br>└── submodules/<br>    ├── langfuse/<br>    └── open-webui/</pre><p><strong>Why custom wins for sophisticated systems:</strong></p><ul><li>Complete debugging control</li><li>Performance optimization flexibility</li><li>Complex business logic integration</li><li>Advanced monitoring capabilities</li></ul><p><strong>But platforms have their place:</strong></p><ul><li>Rapid prototyping and MVPs</li><li>Teams with limited ML engineering resources</li><li>Standard RAG implementations</li><li>Stakeholder demos and buy-in</li></ul><p>The key is matching the tool to your actual requirements, not just going with what’s trendy.</p>
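<p>To make the “go custom” path concrete, here is a minimal sketch of the serving layer: a single FastAPI route that stuffs retrieved context into a prompt and calls a local Ollama model. To be clear, this is not the actual rag-backend code behind the layout above; the route name, the retriever stub, and the model tag are illustrative placeholders.</p><pre>from fastapi import FastAPI<br>from pydantic import BaseModel<br>import httpx<br><br>app = FastAPI(title=&quot;rag-backend (sketch)&quot;)<br><br>class AskRequest(BaseModel):<br>    question: str<br><br>def retrieve_context(question: str) -&gt; str:<br>    # Placeholder: swap in your real vector-store query (Qdrant, pgvector, etc.)<br>    return &quot;top-k chunks from your index go here&quot;<br><br>@app.post(&quot;/ask&quot;)<br>async def ask(req: AskRequest):<br>    context = retrieve_context(req.question)<br>    prompt = f&quot;Answer using only this context:\n{context}\n\nQuestion: {req.question}&quot;<br>    # Call a local Ollama server (default port), no streaming<br>    async with httpx.AsyncClient(timeout=60) as client:<br>        resp = await client.post(<br>            &quot;http://localhost:11434/api/generate&quot;,<br>            json={&quot;model&quot;: &quot;llama3.1:8b&quot;, &quot;prompt&quot;: prompt, &quot;stream&quot;: False},<br>        )<br>    return {&quot;answer&quot;: resp.json().get(&quot;response&quot;, &quot;&quot;)}</pre><p>Everything else in the repo layout (auth, ingestion, k8s manifests, monitoring) grows around a core like this.</p>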
<h3><strong>4. The Monitoring Bright Spot (And Why I’m Slightly Obsessed)</strong></h3><p>Langfuse is one of the rare Python-based AI tools that actually delivers. You get clean, usable tracing, debugging that doesn’t require a PhD, and a UI that stays out of your way while building.</p><p>It integrates easily, supports LLM-as-a-judge workflows, and even allows dynamic prompt management via REST API — which is surprisingly rare.</p><p>Best of all, it’s open-source, so no vendor lock-in or pricing traps every time you scale.</p><p>You can tell it was built by engineers who’ve actually felt the pain of working with black-box AI systems. Langfuse doesn’t try to impress with gimmicks — it just works, and that’s what makes it special.</p><h3>5. Looking Forward: The Evolution Path</h3><p>The platforms aren’t standing still. Here’s what makes me optimistic:</p><ul><li><strong>Langflow</strong> is working on better debugging tools and production stability</li><li><strong>Dify</strong> is expanding integration capabilities and customization options</li><li><strong>Databricks</strong> is building out their end-to-end AI platform story</li><li><strong>n8n</strong> is developing more sophisticated AI-native features</li></ul><p><strong>The trajectory looks promising:</strong></p><ul><li>Better abstraction without losing control</li><li>Improved debugging and observability</li><li>More flexible customization options</li><li>Production-grade reliability</li></ul><p><strong>For now, choose based on your needs:</strong></p><ul><li><strong>Simple RAG + quick deployment:</strong> Try platforms first</li><li><strong>Complex logic + high performance:</strong> Go custom</li><li><strong>Hybrid approach:</strong> Use platforms for prototyping, custom for production</li><li><strong>Future-proofing:</strong> Build skills in both approaches</li></ul><h3>6. Conclusion</h3><p>This isn’t about declaring winners and losers. The AI deployment landscape is evolving rapidly, and different tools serve different purposes.</p><p><strong>Current state:</strong> Platforms excel at simplicity and speed, custom solutions win on sophistication and control.</p><p><strong>Future state:</strong> The gap is closing. Platforms are getting more powerful, and the tooling ecosystem is maturing.</p><p><strong>My take:</strong> Learn both approaches. Use platforms where they make sense, go custom when you need advanced capabilities. The best teams I’ve worked with are pragmatic — they pick the right tool for each job instead of being ideological about it.</p><blockquote>“The exciting part? We’re still in the early innings.”</blockquote><p>The tooling will get better, the platforms will mature, and new approaches will emerge. What matters is building things that work today while staying flexible for tomorrow.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7c029f912d63" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-aiops-pt1-ai-deployment-in-2025-when-to-use-what-7c029f912d63">LLM-AIOps Pt.9 — in 2025: What Actually Works and What’s a Waste of Time</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG Pt. 8 — Making AI “More Human-like”]]></title>
            <link>https://medium.com/dev-ai/llm-rag-pt-8-making-ai-more-human-like-ed03f5c42cfb?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/ed03f5c42cfb</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[conversations]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[interaction]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Thu, 23 Jan 2025 10:15:23 GMT</pubDate>
            <atom:updated>2025-02-07T06:08:04.943Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-RAG Pt. 8 — Making AI “More Human-like”</h3><h3>Introduction</h3><blockquote><strong>Have you ever stopped to think about how complex and nuanced human conversation is?</strong></blockquote><p>Every interaction we have is shaped by context, emotion, and personal experience — elements that make communication so rich and meaningful. Creating an AI that mirrors this depth involves more than just basic dialogue. It requires features like multi-turn conversations, context awareness, personalization, and responsiveness.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OO_eghBTHe05-yYx.jpg" /><figcaption>Image from @vecteezy</figcaption></figure><p>In this section, we are looking into how these features come together to make AI interactions feel more natural and engaging. We’ll explore practical ways to implement them, along with code examples that show how you can bring these capabilities into your own pipeline.</p><p>By the end of this post, you’ll have a blueprint for making your AI models interact more naturally, improving user experience, and building more sophisticated applications.</p><h3>Index</h3><ol><li><a href="#39d0"><strong>Understanding Human-like AI Communication</strong></a></li><li><a href="#4803"><strong>Implementing Multi-Turn Conversations</strong></a></li></ol><ul><li><a href="#3045">2.1 Conversation Context Management</a></li><li><a href="#8124">2.2 Example Code for Multi-Turn Handling</a></li></ul><p><a href="#dd43"><strong>3. Enhancing Context Awareness</strong></a></p><ul><li><a href="#cf9b">3.1 Maintaining Conversation History</a></li><li><a href="#f9f8">3.2 Using Memory in LLMs</a></li></ul><p><a href="#12e6"><strong>4. Personalization and User Modeling</strong></a></p><ul><li><a href="#53e2">4.1 User Profiles</a></li><li><a href="#a719">4.2 Tailoring Responses</a></li></ul><p><a href="#50aa"><strong>5. Emulating Human Conversation Styles</strong></a></p><ul><li><a href="#4773">5.1 Natural Language Generation Techniques</a></li><li><a href="#a47f">5.2 Adjusting Tone and Formality</a></li></ul><p><a href="#720a"><strong>6. Implementing Follow-up Questions</strong></a></p><ul><li><a href="#4b5f">6.1 Encouraging Engagement</a></li><li><a href="http://ed1f">6.2 Example Code for Generating Follow-up Questions</a></li></ul><p><a href="#81bd"><strong>7. Conclusion</strong></a></p><h3>1. Understanding Human-like AI Communication</h3><p>To make AI interactions more human-like, we need to consider how humans communicate:</p><ul><li><strong>Contextual Understanding:</strong> Humans remember previous parts of a conversation.</li><li><strong>Personalization:</strong> We adjust our language based on who we’re talking to.</li><li><strong>Engagement:</strong> We ask questions and encourage dialogue.</li><li><strong>Natural Language:</strong> We use idioms, contractions, and varying sentence structures.</li></ul><p>Our goal is to implement these aspects into our AI models, making them more relatable and effective in communication.</p><h3>2. 
Implementing Multi-Turn Conversations</h3><p>Multi-turn conversations allow the AI to engage in dialogues that span multiple exchanges, maintaining context throughout.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NGqFdfl73j2YiOs_YDtwQw.png" /><figcaption>Multi turn easy-understanding @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><h3>2.1 Conversation Context Management</h3><p>To handle multi-turn conversations, we need to:</p><ul><li><strong>Store Conversation History:</strong> Keep track of previous messages.</li><li><strong>Incorporate History into Responses:</strong> Use the history to inform current responses.</li><li><strong>Manage Context Size:</strong> Be mindful of token limits in models.</li></ul><h3>2.2 Example Code for Multi-Turn Handling</h3><p>Here’s how you can implement multi-turn conversation handling in your application using the <strong>LangChain</strong> library and <strong>ChatOllama</strong> with the llama3.1:8b model. This example demonstrates how the AI can generate responses based on the entire conversation history, including its own previous answers, creating a more engaging multi-turn conversation.</p><pre>from langchain.chains import ConversationChain<br>from langchain.memory import ConversationSummaryBufferMemory<br>from langchain.llms import Ollama<br><br># Initialize the Ollama model with llama3.1:8b<br>llm = Ollama(<br>    base_url=&quot;http://localhost:11434&quot;,  # Replace with your Ollama server URL<br>    model=&quot;llama3.1:8b&quot;,                # Use the llama3.1:8b model<br>    n_ctx=512,<br>    temperature=0.1<br>)<br><br># Create a summary buffer memory to handle long conversation history<br>memory = ConversationSummaryBufferMemory(<br>    llm=llm,<br>    max_token_limit=1000,<br>    memory_key=&quot;chat_history&quot;<br>)<br><br># Initialize the conversation chain with summarizing memory<br>conversation = ConversationChain(<br>    llm=llm,<br>    memory=memory<br>)<br><br># Custom function to generate AI follow-up responses<br>def generate_ai_follow_up(conversation_chain):<br>    # Get the latest conversation history<br>    chat_history = conversation_chain.memory.load_memory_variables({})[&quot;chat_history&quot;]<br>    # Prepare the prompt for the AI to generate a follow-up based on its last response<br>    prompt = f&quot;&quot;&quot;As an AI assistant, continue the conversation proactively based on the following history:<br><br>{chat_history}<br><br>Assistant:&quot;&quot;&quot;<br>    # Get the AI&#39;s follow-up response<br>    follow_up = llm(prompt)<br>    # Update the conversation memory with the AI&#39;s follow-up<br>    conversation_chain.memory.save_context({&quot;input&quot;: &quot;&quot;}, {&quot;output&quot;: follow_up})<br>    return follow_up.strip()<br><br># Start the conversation loop<br>print(&quot;You can start chatting with the assistant. Type &#39;exit&#39; or &#39;quit&#39; to end the conversation.\n&quot;)<br><br>while True:<br>    # Get user input<br>    user_input = input(&quot;User: &quot;)<br>    if user_input.lower() in (&#39;exit&#39;, &#39;quit&#39;):<br>        print(&quot;Ending conversation. 
Goodbye!&quot;)<br>        break<br>    # Generate AI response<br>    ai_response = conversation.predict(input=user_input)<br>    print(f&quot;Assistant: {ai_response}\n&quot;)<br>    # Generate AI follow-up response based on entire conversation history<br>    ai_follow_up = generate_ai_follow_up(conversation)<br>    print(f&quot;Assistant: {ai_follow_up}\n&quot;)</pre><p>In this example:</p><ul><li><strong>Ollama LLM:</strong> Utilizes the Ollama class from langchain.llms with the llama3.1:8b model.</li><li><strong>ConversationChain:</strong> Manages the flow of conversation using the Ollama model.</li><li><strong>Memory:</strong> Stores the conversation history, including both user inputs and AI responses.</li><li><strong>Custom Function </strong>generate_ai_follow_up<strong>:</strong> Generates a follow-up response by creating a prompt that includes the conversation history.</li></ul><p>This setup ensures that the AI not only responds to the user’s inputs but also continues the conversation proactively, demonstrating multi-turn interaction that mirrors human-like engagement by using its own previous replies as part of the context.</p><h3>3. Enhancing Context Awareness</h3><p>Beyond just storing messages, we want the AI to understand and utilize the context effectively.</p><h4>3.1 Maintaining Conversation History</h4><p>Keeping track of the right amount of history is crucial. Too little, and the AI loses context; too much, and you risk exceeding token limits.</p><p><strong>Strategies:</strong></p><ul><li><strong>Sliding Window:</strong> Use the most recent messages.</li><li><strong>Key Memory Points:</strong> Keep important points from earlier in the conversation.</li></ul><h4>3.2 Using Memory in LLMs</h4><p>Some language models support mechanisms for handling long-term memory.</p><p><strong>Implementation Tips:</strong></p><ul><li><strong>Summarize Past Exchanges:</strong> Condense earlier conversation parts into summaries.</li><li><strong>Use Specialized Memory Modules:</strong> Integrate modules that manage extended context.</li></ul><p><strong>Example Using LangChain’s ConversationSummaryBufferMemory:</strong></p><pre>from langchain.chains import ConversationChain<br>from langchain.memory import ConversationSummaryBufferMemory<br>from langchain.llms import Ollama<br><br># Initialize the Ollama model with llama3.1:8b<br>llm = Ollama(<br>    base_url=&quot;http://localhost:11434&quot;,<br>    model=&quot;llama3.1:8b&quot;,<br>    n_ctx=512,<br>    temperature=0.1<br>)<br><br># Create a summary buffer memory to handle long conversation history<br>memory = ConversationSummaryBufferMemory(<br>    llm=llm,<br>    max_token_limit=1000,<br>    memory_key=&quot;chat_history&quot;<br>)<br><br># Initialize the conversation chain with summarizing memory<br>conversation = ConversationChain(<br>    llm=llm,<br>    memory=memory,<br>    verbose=True<br>)<br><br># Function to generate AI follow-up responses<br>def generate_ai_follow_up(conversation_chain):<br>    # Load the conversation history<br>    chat_history = conversation_chain.memory.load_memory_variables({})[&quot;chat_history&quot;]<br>    # Prepare the prompt for the AI to continue the conversation<br>    prompt = f&quot;&quot;&quot;{chat_history}<br><br>Assistant:&quot;&quot;&quot;<br>    # AI generates a follow-up response based on the conversation history<br>    follow_up = llm(prompt)<br>    # Update the conversation memory with the AI&#39;s follow-up<br>    conversation_chain.memory.save_context({&quot;input&quot;: &quot;&quot;}, 
{&quot;output&quot;: follow_up})<br>    return follow_up.strip()<br><br># Start the conversation loop<br>print(&quot;You can start chatting with the assistant. Type &#39;exit&#39; or &#39;quit&#39; to end the conversation.\n&quot;)<br><br>while True:<br>    # Get user input<br>    user_input = input(&quot;User: &quot;)<br>    if user_input.lower() in (&#39;exit&#39;, &#39;quit&#39;):<br>        print(&quot;Ending conversation. Goodbye!&quot;)<br>        break<br><br>    # Generate AI response<br>    ai_response = conversation.predict(input=user_input)<br>    print(f&quot;Assistant: {ai_response}\n&quot;)<br><br>    # Generate AI follow-up response based on the conversation history<br>    ai_follow_up = generate_ai_follow_up(conversation)<br>    print(f&quot;Assistant: {ai_follow_up}\n&quot;)</pre><p>In this example:</p><ul><li><strong>Dynamic Interaction:</strong> Removed static user_input values and replaced with a loop to handle dynamic user input.</li><li><strong>ConversationSummaryBufferMemory:</strong> Summarizes older parts of the conversation to stay within token limits.</li><li><strong>Proactive AI Responses:</strong> The AI continues the conversation by generating follow-up responses without additional user input.</li><li><strong>Conversation History:</strong> The AI uses the entire conversation history, including its own responses, to generate context-aware follow-ups.</li></ul><p>This approach ensures that the AI maintains context over longer conversations and proactively engages with the user, creating a more natural dialogue flow.</p><h3>4. Personalization and User Modeling</h3><p>Personalizing interactions can make the AI seem more attentive and engaging.</p><h4>4.1 User Profiles</h4><p>Create profiles for users to store preferences and relevant information.</p><p><strong>Profile Attributes:</strong></p><ul><li><strong>Name</strong></li><li><strong>Preferences</strong></li><li><strong>Past Interactions</strong></li><li><strong>Goals or Objectives</strong></li></ul><h4>4.2 Tailoring Responses</h4><p>Use the user profile to adjust the AI’s responses.</p><p><strong>Implementation Example Using LangChain and Ollama:</strong></p><pre>from langchain.chains import ConversationChain<br>from langchain.prompts import PromptTemplate<br>from langchain.memory import ConversationBufferMemory<br>from langchain.llms import Ollama<br><br># Define user profile<br>user_profile = {<br>    &#39;name&#39;: &#39;Alice&#39;,<br>    &#39;preferences&#39;: {<br>        &#39;tone&#39;: &#39;friendly&#39;,<br>        &#39;interests&#39;: [&#39;technology&#39;, &#39;music&#39;]<br>    }<br>}<br><br># Define a custom prompt template that includes user profile<br>template = &quot;&quot;&quot;You are a helpful assistant.<br><br>The user&#39;s name is {name}.<br>They prefer a {tone} tone.<br>They are interested in {interests}.<br><br>{history}<br>User: {input}<br>Assistant:&quot;&quot;&quot;<br><br>prompt = PromptTemplate(<br>    input_variables=[&quot;history&quot;, &quot;input&quot;, &quot;name&quot;, &quot;tone&quot;, &quot;interests&quot;],<br>    template=template<br>)<br><br># Initialize the Ollama model with llama3.1:8b<br>llm = Ollama(<br>    base_url=&quot;http://localhost:11434&quot;,<br>    model=&quot;llama3.1:8b&quot;,<br>    n_ctx=512,<br>    temperature=0.1<br>)<br><br># Initialize the memory<br>memory = ConversationBufferMemory(memory_key=&quot;history&quot;)<br><br># Initialize the conversation chain with the custom prompt<br>conversation = ConversationChain(<br>    llm=llm,<br>    
prompt=prompt,<br>    memory=memory,<br>    verbose=True<br>)<br><br># Add user name and preferences to the conversation chain<br>conversation_prompt_params = {<br>    &#39;name&#39;: user_profile[&#39;name&#39;],<br>    &#39;tone&#39;: user_profile[&#39;preferences&#39;][&#39;tone&#39;],<br>    &#39;interests&#39;: &#39;, &#39;.join(user_profile[&#39;preferences&#39;][&#39;interests&#39;])<br>}<br><br># Start the conversation loop<br>print(&quot;You can start chatting with the assistant. Type &#39;exit&#39; or &#39;quit&#39; to end the conversation.\n&quot;)<br><br>while True:<br>    # Get user input<br>    user_input = input(f&quot;{user_profile[&#39;name&#39;]}: &quot;)<br>    if user_input.lower() in (&#39;exit&#39;, &#39;quit&#39;):<br>        print(&quot;Ending conversation. Goodbye!&quot;)<br>        break<br><br>    # Generate AI response<br>    ai_response = conversation.predict(input=user_input, **conversation_prompt_params)<br>    print(f&quot;Assistant: {ai_response}\n&quot;)<br><br>    # AI continues the conversation based on user&#39;s interests and previous responses<br>    # Generate AI follow-up response<br>    def generate_ai_follow_up(conversation_chain, prompt_params):<br>        chat_history = conversation_chain.memory.load_memory_variables({})[&quot;history&quot;]<br>        prompt = f&quot;&quot;&quot;You are a helpful assistant.<br><br>The user&#39;s name is {prompt_params[&#39;name&#39;]}.<br>They prefer a {prompt_params[&#39;tone&#39;]} tone.<br>They are interested in {prompt_params[&#39;interests&#39;]}.<br><br>Continue the conversation proactively based on the following history:<br><br>{chat_history}<br><br>Assistant:&quot;&quot;&quot;<br>        follow_up = llm(prompt)<br>        conversation_chain.memory.save_context({&quot;input&quot;: &quot;&quot;}, {&quot;output&quot;: follow_up})<br>        return follow_up.strip()<br><br>    ai_follow_up = generate_ai_follow_up(conversation, conversation_prompt_params)<br>    print(f&quot;Assistant: {ai_follow_up}\n&quot;)</pre><p>In this example:</p><ul><li><strong>Dynamic Interaction:</strong> The assistant interacts with the user continuously, personalizing responses based on the user’s profile.</li><li><strong>PromptTemplate:</strong> Customizes the prompt to include user-specific information dynamically.</li><li><strong>User Profile:</strong> Influences the assistant’s responses in real-time.</li><li><strong>Proactive AI:</strong> The assistant continues the conversation by generating follow-up responses based on previous interactions.</li></ul><h3>5. 
Emulating Human Conversation Styles</h3><p>Adjusting the AI’s language style makes interactions feel more natural.</p><h4>5.1 Natural Language Generation Techniques</h4><p>Incorporate techniques that improve the fluency and variability of responses.</p><p><strong>Techniques:</strong></p><ul><li><strong>Use of Contractions:</strong> Makes language sound less formal.</li><li><strong>Idiomatic Expressions:</strong> Adds a human touch.</li><li><strong>Varied Sentence Structures:</strong> Avoids robotic patterns.</li></ul><h4>5.2 Adjusting Tone and Formality</h4><p>Depending on the context, the AI should modulate its tone.</p><p><strong>Implementation Example:</strong></p><pre>from langchain.chains import ConversationChain<br>from langchain.prompts import PromptTemplate<br>from langchain.memory import ConversationBufferMemory<br>from langchain.llms import Ollama<br><br># Define user preference for tone<br>preferred_tone = &#39;casual&#39;  # Options: &#39;casual&#39;, &#39;formal&#39;, &#39;neutral&#39;<br><br>def set_tone_instruction(preferred_tone):<br>    if preferred_tone == &quot;casual&quot;:<br>        return &quot;Use a friendly and casual tone. Feel free to use contractions and informal language.&quot;<br>    elif preferred_tone == &quot;formal&quot;:<br>        return &quot;Use a professional and formal tone. Avoid contractions and use polite language.&quot;<br>    else:<br>        return &quot;Use a neutral tone.&quot;<br><br>tone_instruction = set_tone_instruction(preferred_tone)<br><br># Define a custom prompt template with tone instruction<br>template = &quot;&quot;&quot;You are a helpful assistant.<br><br>{tone_instruction}<br><br>{history}<br>User: {input}<br>Assistant:&quot;&quot;&quot;<br><br>prompt = PromptTemplate(<br>    input_variables=[&quot;history&quot;, &quot;input&quot;, &quot;tone_instruction&quot;],<br>    template=template<br>)<br><br># Initialize the Ollama model using llama3.1:8b<br>llm = Ollama(<br>    base_url=&quot;http://localhost:11434&quot;,<br>    model=&quot;llama3.1:8b&quot;,<br>    n_ctx=512,<br>    temperature=0.1<br>)<br><br># Initialize the memory<br>memory = ConversationBufferMemory(memory_key=&quot;history&quot;)<br><br># Initialize the conversation chain with the custom prompt<br>conversation = ConversationChain(<br>    llm=llm,<br>    prompt=prompt,<br>    memory=memory<br>)<br><br># Start the conversation loop<br>print(&quot;You can start chatting with the assistant. Type &#39;exit&#39; or &#39;quit&#39; to end the conversation.\n&quot;)<br><br>while True:<br>    # Get user input<br>    user_input = input(&quot;User: &quot;)<br>    if user_input.lower() in (&#39;exit&#39;, &#39;quit&#39;):<br>        print(&quot;Ending conversation. 
Goodbye!&quot;)<br>        break<br><br>    # Generate AI response<br>    ai_response = conversation.predict(input=user_input, tone_instruction=tone_instruction)<br>    print(f&quot;Assistant: {ai_response}\n&quot;)<br><br>    # AI continues the conversation in the preferred tone<br>    def generate_ai_follow_up(conversation_chain, tone_instruction):<br>        chat_history = conversation_chain.memory.load_memory_variables({})[&quot;history&quot;]<br>        prompt = f&quot;&quot;&quot;You are a helpful assistant.<br><br>{tone_instruction}<br><br>Continue the conversation proactively based on the following history:<br><br>{chat_history}<br><br>Assistant:&quot;&quot;&quot;<br>        follow_up = llm(prompt)<br>        conversation_chain.memory.save_context({&quot;input&quot;: &quot;&quot;}, {&quot;output&quot;: follow_up})<br>        return follow_up.strip()<br><br>    ai_follow_up = generate_ai_follow_up(conversation, tone_instruction)<br>    print(f&quot;Assistant: {ai_follow_up}\n&quot;)</pre><p>In this example:</p><ul><li><strong>Dynamic Tone Adjustment:</strong> The assistant adjusts its tone dynamically based on the user’s preference.</li><li><strong>Proactive AI Responses:</strong> Continues the conversation using the specified tone without additional user input.</li><li><strong>Natural Language Use:</strong> Enhances the naturalness of the conversation by employing language techniques appropriate to the selected tone.</li></ul><h3>6. Implementing Follow-up Questions</h3><p>Engaging users with follow-up questions can make the conversation more interactive.</p><h4>6.1 Encouraging Engagement</h4><p>Asking relevant questions shows that the AI is attentive and interested.</p><p><strong>Benefits:</strong></p><ul><li><strong>Keeps the Conversation Going</strong></li><li><strong>Gathers More Information</strong></li><li><strong>Enhances User Experience</strong></li></ul><h4>6.2 Example Code for Generating Follow-up Questions</h4><pre>from langchain.llms import Ollama<br>import json<br><br># Initialize the Ollama model using llama3.1:8b<br>llm = Ollama(<br>    base_url=&quot;http://localhost:11434&quot;,<br>    model=&quot;llama3.1:8b&quot;,<br>    n_ctx=512,<br>    temperature=0.1<br>)<br><br>def generate_follow_up_questions(user_query, model_answer):<br>    prompt_template = &quot;&quot;&quot;As an assistant, generate two relevant follow-up questions based on the user&#39;s previous question and the answer just given to encourage the user to share more information.<br><br>User&#39;s previous question: {user_query}<br>Answer just given: {model_answer}<br><br>Provide the questions in JSON format:<br>{{<br>    &quot;question_1&quot;: &quot;First follow-up question.&quot;,<br>    &quot;question_2&quot;: &quot;Second follow-up question.&quot;<br>}}<br>&quot;&quot;&quot;<br>    final_prompt = prompt_template.format(user_query=user_query, model_answer=model_answer)<br>    response = llm(final_prompt)<br><br>    # Parse the JSON response<br>    try:<br>        parsed = json.loads(response)<br>        question_1 = parsed.get(&quot;question_1&quot;)<br>        question_2 = parsed.get(&quot;question_2&quot;)<br>        if question_1 and question_2:<br>            return [question_1.strip(), question_2.strip()]<br>    except json.JSONDecodeError:<br>        pass<br>    # If parsing fails, return an empty list or handle accordingly<br>    return []<br><br># Usage<br>user_query = &quot;I&#39;m interested in learning guitar.&quot;<br>model_answer = &quot;That&#39;s great! 
Playing the guitar is a rewarding skill.&quot;<br>follow_up_questions = generate_follow_up_questions(user_query, model_answer)<br><br>for idx, question in enumerate(follow_up_questions, start=1):<br>    print(f&quot;Assistant (Question {idx}): {question}&quot;)</pre><p>In this example:</p><ul><li><strong>Dynamic Question Generation:</strong> The assistant generates follow-up questions based on the user’s previous query and the model’s answer.</li><li><strong>JSON Formatting:</strong> The assistant provides responses in a structured format for easy parsing.</li><li><strong>Engagement Enhancement:</strong> By asking relevant follow-up questions, the assistant keeps the conversation active and engaging.</li></ul><h3>7. Conclusion</h3><p>Making AI communicate more like humans involves several strategies:</p><ul><li><strong>Multi-Turn Conversations:</strong> Managing context over multiple exchanges, including AI’s own previous responses.</li><li><strong>Context Awareness:</strong> Remembering and utilizing past interactions, both from the user and the AI.</li><li><strong>Personalization:</strong> Tailoring responses based on user profiles and preferences.</li><li><strong>Natural Language Use:</strong> Emulating human conversation styles, adjusting tone and formality.</li><li><strong>Engagement Techniques:</strong> Asking follow-up questions and proactively continuing the dialogue.</li></ul><p>By implementing these features, you can create AI applications that are not only functional but also enjoyable to interact with, providing a better user experience.</p><p><em>01.23.2025 — Fin.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ed03f5c42cfb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-pt-8-making-ai-more-human-like-ed03f5c42cfb">LLM-RAG Pt. 8 — Making AI “More Human-like”</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM Pt.7 — Selecting the Right Open-Source Model for Your Production-Level AI Application]]></title>
            <link>https://medium.com/dev-ai/llm-pt-7-selecting-the-adequate-open-source-model-for-your-production-level-ai-application-63746dc4def8?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/63746dc4def8</guid>
            <category><![CDATA[sentence-transformers]]></category>
            <category><![CDATA[model]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[generative-ai-solution]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Wed, 22 Jan 2025 11:09:26 GMT</pubDate>
            <atom:updated>2025-01-22T11:28:57.180Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM Pt. 7 — Selecting the Right Open-Source Model in a Serviceable Way</h3><h3>Introduction</h3><p>So far, we’ve built a strong foundation by exploring retrievers, ensemble systems, and more. In <strong>Part 7</strong>, we’re focusing on a crucial aspect of developing effective language model applications:</p><blockquote><strong>selecting the most suitable open-source model for your specific needs</strong>.</blockquote><p>Choosing the right model isn’t just about picking the one with the most parameters or the latest buzz. It’s about finding a model that aligns with your application’s requirements, offers the necessary features, and performs efficiently on your hardware. We’ll explore how to evaluate and select an open-source language model, using examples from the LLaMA series. I’ll also share insights from my experience running these models on different hardware setups.</p><p>By the end of this post, you’ll have a clear understanding of how to choose the most adequate model for your projects, ensuring that you can build applications with multiple features using a single model.</p><h3>Index</h3><ol><li><a href="#397c">Understanding Your Project Requirements</a></li><li><a href="#1ee1">Key Factors in Model Selection</a></li><li><a href="#1782">Exploring Open-Source Models</a></li><li><a href="#32f5">Performance Considerations</a></li><li><a href="#a0a0">Hardware and Latency Insights</a></li><li><a href="#8d6e">Model Features and Capabilities</a></li><li><a href="#c2f9">Running Ollama on Different Platforms</a></li></ol><ul><li><a href="#2145">7.1 Using Ollama with NVIDIA GPUs</a></li><li><a href="#7b58">7.2 Running Ollama on MacOS</a></li></ul><p><a href="#2648">8. Production-Level Considerations</a></p><p><a href="#72a2">9. Conclusion</a></p><h3>1. Understanding Your Project Requirements</h3><p>Before getting into model specifics, it’s essential to <strong>clearly define what you need from a language model</strong>. Consider the following questions:</p><ul><li><strong>What tasks will the model perform?</strong> Is it for general question-answering, content generation, intent classification, dialogue systems, or a combination of these?</li><li><strong>What level of accuracy is required?</strong> Do you need highly detailed and precise responses, or are general answers sufficient?</li><li><strong>What are the latency requirements?</strong> Is real-time interaction critical, or can your application tolerate some delay?</li><li><strong>What hardware will you use?</strong> Do you have access to high-end GPUs, or will the model run on consumer-grade hardware?</li><li><strong>Do you need multiple features from one model?</strong> Combining capabilities like question-answering, tool usage, and intent recognition can streamline development.</li></ul><p>Understanding your project’s specific needs will guide you toward the model that best fits your requirements.</p><h3>2. Key Factors in Model Selection</h3><p>When selecting a language model, consider the following key factors:</p><ul><li><strong>Model Size (Parameters):</strong> Larger models often provide better performance but require more computational resources.</li><li><strong>Performance vs. 
Resources:</strong> Balance the need for accuracy and capability with the available hardware.</li><li><strong>Feature Support:</strong> Ensure the model supports the functionalities you need, such as tool usage, multi-turn conversations, or multilingual support.</li><li><strong>Community and Documentation:</strong> Models with active communities and comprehensive documentation can save you time during implementation.</li><li><strong>Licensing and Cost:</strong> Open-source models are free to use, but always check the licensing terms to ensure compliance.</li></ul><h3>3. Exploring Open-Source Models</h3><p>Let’s examine two popular open-source models from the LLaMA series to illustrate the selection process.</p><blockquote>So far, I’ve been using the LLaMA 3.1:8B model to develop a pipeline and create a functional LLM-based application. It offers an excellent balance of resource usage and efficiency, along with highly useful features like function calling and reliable general responses.</blockquote><h4>3.1 LLaMA 3.1 8B Model</h4><p><strong>Overview:</strong></p><ul><li><strong>Parameters:</strong> 8 Billion</li><li><strong>Description:</strong> Strikes a balance between performance and resource consumption. It’s powerful enough for complex tasks yet can run on consumer-grade hardware with optimization.</li></ul><p><strong>Why Consider LLaMA 3.1 8B?</strong></p><ul><li><strong>General Question-Answering:</strong> Excels at understanding and generating coherent responses across diverse topics.</li><li><strong>Function Tool Utilization:</strong> Supports effective use of function tools, making it suitable for tasks requiring tool use for intent classification or data retrieval.</li><li><strong>Intent Classification (Function Tool): </strong>High accuracy in understanding user intent, crucial for interactive applications.</li><li><strong>Multiple Features in One Model:</strong> Capable of handling various tasks simultaneously, reducing the need for multiple models.</li></ul><p><strong>Example Use Case:</strong></p><p>Building a virtual assistant that can understand user queries, perform actions like setting reminders, fetch information, and engage in multi-turn conversations. LLaMA 3.1 8B can handle the natural language understanding and the multi-faceted interactions required for this task.</p><h4>3.2 LLaMA 3.2 3B Model</h4><p><strong>Overview:</strong></p><ul><li><strong>Parameters:</strong> 3 Billion</li><li><strong>Description:</strong> A lighter version designed for environments with limited computational resources.</li></ul><p><strong>Why Consider LLaMA 3.2 3B?</strong></p><ul><li><strong>Resource Constraints:</strong> Suitable for deployment on devices with less memory and processing power.</li><li><strong>Faster Inference on Low-End Hardware:</strong> Smaller models can offer quicker response times where hardware is a limiting factor.</li></ul><p><strong>Limitations:</strong></p><ul><li><strong>Feature Limitations:</strong> Doesn’t support features like tool utilization and multi-turn conversations as effectively as the 8B model.</li><li><strong>Performance Trade-off:</strong> Might not provide the same level of understanding or accuracy for complex tasks.</li></ul><p><strong>Example Use Case:</strong></p><p>An FAQ bot that answers straightforward queries without the need for deep context understanding or additional functionalities.</p><h3>4. 
Performance Considerations</h3><p>Performance involves both accuracy and how the model operates within your specific environment.</p><p><strong>Factors Affecting Performance:</strong></p><ul><li><strong>Hardware Specifications:</strong> CPU, GPU, RAM, and storage speed impact model performance significantly.</li><li><strong>Optimizations:</strong> Adjusting batch sizes, leveraging parallel processing, and utilizing model quantization can help fit larger models into limited hardware.</li><li><strong>Model Quantization:</strong> Reducing the precision of model weights decreases memory usage and increases speed with minimal loss in accuracy.</li></ul><p><strong>Real-World Insight:</strong></p><p>Running the LLaMA 3.1 8B model on a MacBook with an M3 chip and 32GB RAM, I’ve achieved latency of <strong>3–4 seconds per response</strong>. This was without complex pipelines or multi-model features, demonstrating that you can run this model efficiently on consumer hardware.</p><h3>5. Hardware and Latency Insights</h3><p><strong>Understanding Latency:</strong></p><ul><li><strong>Latency:</strong> The time it takes for the model to generate a response after receiving a query.</li><li><strong>Acceptable Latency:</strong> Depends on your application’s requirements. For interactive applications, lower latency enhances user experience.</li></ul><p><strong>Hardware Used:</strong></p><ul><li><strong>Device:</strong> MacBook with an M3 chip</li><li><strong>RAM:</strong> 32GB</li></ul><p><strong>Key Points:</strong></p><ul><li><strong>Feasibility on Consumer Hardware:</strong> Models like LLaMA 3.1 8B can run effectively on personal laptops.</li><li><strong>User Expectations:</strong> A latency of 3–4 seconds is acceptable for many applications, providing a balance between performance and resource utilization.</li><li><strong>Scalability:</strong> For production environments requiring lower latency, consider scaling up hardware or optimizing the model further.</li></ul>
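<p>If you want to check numbers like these on your own hardware, here is a small sketch that times one non-streamed completion against a local Ollama server for the two models discussed above. It assumes Ollama is already running on the default port; the prompt and timeout are placeholders, not values from my setup.</p><pre>import time<br>import requests<br><br>def time_completion(model: str, prompt: str = &quot;Summarize RAG in one sentence.&quot;) -&gt; float:<br>    # Returns seconds for one non-streamed completion from a local Ollama server<br>    start = time.perf_counter()<br>    requests.post(<br>        &quot;http://localhost:11434/api/generate&quot;,<br>        json={&quot;model&quot;: model, &quot;prompt&quot;: prompt, &quot;stream&quot;: False},<br>        timeout=120,<br>    )<br>    return time.perf_counter() - start<br><br>for tag in (&quot;llama3.1:8b&quot;, &quot;llama3.2:3b&quot;):<br>    print(f&quot;{tag}: {time_completion(tag):.1f}s&quot;)</pre><p>Ignore the first call for each model: it typically includes the time Ollama needs to load the weights, so only the later runs reflect steady-state latency.</p>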
Model Features and Capabilities</h3><p>Selecting a model involves evaluating its features and how they align with your application’s needs.</p><h4>6.1 Tool Utilization and Intent Classification</h4><p><strong>Tool Utilization:</strong></p><ul><li><strong>Enhanced Functionality:</strong> Models that can interact with tools or APIs extend their capabilities, allowing for real-time data retrieval and computations.</li><li><strong>Production-Level Applications:</strong> In a production environment, integrating tool usage can automate processes and improve efficiency.</li></ul><p><strong>Intent Classification:</strong></p><ul><li><strong>Understanding User Intent:</strong> Critical for applications like chatbots, virtual assistants, and customer support.</li><li><strong>LLaMA 3.1 8B’s Strength:</strong> Exhibits high accuracy in intent classification, making it suitable for applications where understanding nuance is essential.</li></ul><p><strong>Multiple Features with One Model:</strong></p><ul><li><strong>Efficiency:</strong> Using a single model for multiple tasks simplifies deployment and maintenance.</li><li><strong>Consistency:</strong> Ensures uniformity in responses and behaviors across different functionalities of your application.</li></ul><p><strong>Example Scenario:</strong></p><p>Developing an AI assistant for customer service that can:</p><ul><li>Understand and classify customer inquiries.</li><li>Access databases to retrieve order information.</li><li>Provide recommendations based on user preferences.</li><li>Engage in multi-turn dialogues for a conversational experience.</li></ul><p>LLaMA 3.1 8B can handle these diverse tasks, showcasing its versatility and making it a strong candidate for such applications.</p><h3>7. Running Ollama on Different Platforms</h3><p>To utilize these models effectively, you need to set them up and integrate them into your development environment. <strong>Ollama</strong> is a tool for serving and managing language models locally. Here’s how you can run Ollama on different platforms.</p><h4>7.1 Using Ollama with NVIDIA GPUs</h4><p>If you’re using a system with an NVIDIA GPU, you can run Ollama using Docker to leverage GPU acceleration.</p><p><strong>Steps:</strong></p><pre>1. Pull the Ollama Docker Image:<br><br>   - Download the Ollama Docker image that supports NVIDIA GPUs.<br><br>2. Run the Docker Container with GPU Support:<br><br>   - Use the `--gpus=all` flag to enable GPU access for the container.<br>   - Mount a volume for persistent storage if needed.<br>   - Map the necessary ports for communication.<br><br>   Example:<br><br>   ```bash<br>   docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama<br>   ```<br><br>3. Run the Model:<br><br>   - Execute the model within the container.<br><br>   ```bash<br>   docker exec ollama ollama run llama3.1:8B<br>   ```<br><br>4. Set Up Networking:<br><br>   - Create a Docker network to allow communication between containers if you have other services running.<br><br>   ```bash<br>   docker network create ai-network<br>   docker network connect ai-network ollama<br>   docker network connect ai-network your-other-service<br>   ```<br><br>5. Access the Ollama Endpoint:<br><br>   - The Ollama service will be accessible at `http://ollama:11434`.</pre><p><strong>Note:</strong></p><ul><li>Ensure that your <strong>NVIDIA</strong> drivers and Docker are correctly configured to support GPU access.</li><li>This setup allows you to utilize the full power of your NVIDIA GPU for model inference, reducing latency and improving performance.</li></ul>
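<p>Before moving on to macOS, it can help to see what actually talks to that endpoint. The snippet below is a minimal sketch rather than part of the original walkthrough: it assumes the llama3.1:8b tag has been pulled as above and uses Ollama’s generate API with Python’s requests package (call http://localhost:11434 from the host, or http://ollama:11434 from a container on the ai-network).</p><pre>import requests<br><br># Endpoint exposed by the container started in step 2 (use http://ollama:11434 from another container)<br>OLLAMA_URL = &quot;http://localhost:11434/api/generate&quot;<br><br>def ask_llama(prompt: str):<br>    # Non-streaming call: Ollama returns a single JSON object with the full completion<br>    payload = {<br>        &quot;model&quot;: &quot;llama3.1:8b&quot;,  # adjust to the exact tag you pulled<br>        &quot;prompt&quot;: prompt,<br>        &quot;stream&quot;: False,<br>    }<br>    response = requests.post(OLLAMA_URL, json=payload, timeout=120)<br>    response.raise_for_status()<br>    return response.json()[&quot;response&quot;]<br><br>print(ask_llama(&quot;In two sentences, what is intent classification?&quot;))</pre><p>If a call like this comes back within the 3–4 second range discussed in Section 5, the local setup is ready for heavier pipeline work.</p>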
<h4>7.2 Running Ollama on macOS</h4><p>For macOS users, the approach differs slightly due to how Docker interacts with the system’s hardware.</p><p><strong>Important Consideration:</strong></p><ul><li><strong>GPU Usage with Docker on macOS:</strong> When deploying Ollama via a Docker image on macOS, it does not utilize the GPU.</li><li>To leverage the GPU capabilities of your Mac, install Ollama directly without using Docker.</li></ul><p><strong>Steps:</strong></p><pre>1. Install Ollama Directly:<br><br>   - Follow the installation instructions provided in the [Ollama documentation](https://ollama.ai/docs) to install it on your Mac.<br><br>2. Run the Model:<br><br>   - Use the command line to run the LLaMA 3.1 8B model.<br><br>   ```bash<br>   ollama run llama3.1:8B<br>   ```<br><br>3. Set Up Networking:<br><br>   - If you&#39;re integrating with other Docker-based services, ensure they can communicate over the network.<br><br>   ```bash<br>   docker network create ai-network<br>   docker network connect ai-network your-service<br>   ```<br><br>4. Access the Ollama Endpoint:<br><br>   - The service will be accessible at `http://localhost:11434`.<br><br>Benefits:<br><br>- By installing Ollama directly, you ensure that the application can utilize your Mac&#39;s GPU, enhancing performance and reducing latency.<br>- This setup is particularly beneficial for models like LLaMA 3.1 8B, which can take advantage of GPU acceleration.</pre><h3>8. Production-Level Considerations</h3><p>When moving from development to production, additional factors come into play:</p><ul><li><strong>Scalability</strong>: Ensure the model can handle increased loads and user traffic.</li><li><strong>Reliability</strong>: The model should perform consistently under different conditions.</li><li><strong>Integration</strong>: Seamlessly integrate with existing systems and workflows.</li><li><strong>Security</strong>: Protect user data and comply with privacy regulations.</li></ul><p>Why LLaMA 3.1 8B is Suitable for Production-Level Applications:</p><ul><li><strong>Multiple Features in One Model</strong>: Its ability to handle various tasks reduces complexity by eliminating the need for multiple models.</li><li><strong>Robust Performance</strong>: Provides accurate and reliable responses, essential for user trust.</li><li><strong>Optimizable</strong>: Can be fine-tuned and optimized for specific use cases and hardware configurations.</li><li><strong>Community Support</strong>: Active development and support from the community facilitate troubleshooting and improvements.</li></ul>
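<p>Because tool utilization and intent classification carry so much weight in this comparison, here is a hedged sketch of what that looks like in practice. It is an illustration rather than this article’s pipeline: the classify_intent tool, its parameter schema, and the example labels are hypothetical, and the shape of tool-call responses may vary slightly across Ollama versions.</p><pre>import json<br>import requests<br><br>OLLAMA_CHAT_URL = &quot;http://localhost:11434/api/chat&quot;<br><br># Hypothetical function tool: the model fills in &quot;intent&quot; instead of answering in free text<br>tools = [{<br>    &quot;type&quot;: &quot;function&quot;,<br>    &quot;function&quot;: {<br>        &quot;name&quot;: &quot;classify_intent&quot;,<br>        &quot;description&quot;: &quot;Classify the customer request so it can be routed to the right workflow.&quot;,<br>        &quot;parameters&quot;: {<br>            &quot;type&quot;: &quot;object&quot;,<br>            &quot;properties&quot;: {<br>                &quot;intent&quot;: {&quot;type&quot;: &quot;string&quot;,<br>                           &quot;enum&quot;: [&quot;order_status&quot;, &quot;refund&quot;, &quot;product_question&quot;, &quot;other&quot;]}<br>            },<br>            &quot;required&quot;: [&quot;intent&quot;],<br>        },<br>    },<br>}]<br><br>payload = {<br>    &quot;model&quot;: &quot;llama3.1:8b&quot;,<br>    &quot;messages&quot;: [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Where is my order #1234?&quot;}],<br>    &quot;tools&quot;: tools,<br>    &quot;stream&quot;: False,<br>}<br><br>reply = requests.post(OLLAMA_CHAT_URL, json=payload, timeout=120).json()<br># When the model chooses the tool, the predicted intent arrives as structured arguments<br>for call in reply.get(&quot;message&quot;, {}).get(&quot;tool_calls&quot;, []):<br>    print(call[&quot;function&quot;][&quot;name&quot;], json.dumps(call[&quot;function&quot;][&quot;arguments&quot;]))</pre><p>A single model that can both answer free-form questions and return structured calls like this is what makes the “multiple features in one model” argument above concrete.</p>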
<h3>9. Conclusion</h3><p>Selecting the right open-source model for your application is a balance of various factors:</p><ul><li><strong>Alignment with Project Requirements</strong>: The model should meet the specific needs of your tasks, whether it’s general question-answering, tool usage, or intent classification.</li><li><strong>Performance on Available Hardware</strong>: Ensure the model can run efficiently on your devices, providing acceptable latency and responsiveness.</li><li><strong>Feature Set and Capabilities</strong>: Look for models that offer the functionalities you need, allowing for multiple features to be handled by one model.</li><li><strong>Production Readiness</strong>: Consider scalability, reliability, and integration aspects for deploying the model in a production environment.</li></ul><h4><strong>Recommendation:</strong></h4><p>Based on the factors discussed, LLaMA 3.1 8B stands out as a strong candidate for applications requiring:</p><ul><li>General Question-Answering Abilities</li><li>Effective Utilization of Function Tools</li><li>Accurate Intent Classification</li></ul><p>Its ability to handle multiple features within one model makes it highly suitable for complex applications, reducing the need for multiple models and simplifying your deployment architecture.</p><h3>10. Moving Forward</h3><p>In the upcoming Part 8: “<strong>More Like Human</strong>”, we’ll explore features that make AI interactions more natural and human-like. We’ll discuss:</p><ul><li><strong>Multi-Model Integration</strong>: How combining different models can enhance capabilities and performance.</li><li><strong>Multi-Turn Conversations</strong>: Techniques for enabling AI to engage in coherent, context-aware dialogues over multiple turns.</li><li><strong>Human-Like Communication</strong>: Strategies to make AI responses more natural, personable, and engaging.</li></ul><p>These features help bridge the gap between AI and human communication, providing a more intuitive and satisfying user experience.</p><p>01.14.2024 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=63746dc4def8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-pt-7-selecting-the-adequate-open-source-model-for-your-production-level-ai-application-63746dc4def8">LLM Pt.7 — Selecting the Adequate Open-Source Model for Your production-level AI application.</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.6 — Retriever: Ensemble Techniques and Optimization]]></title>
            <link>https://medium.com/dev-ai/llm-rag-pt-6-retriever-ensemble-techniques-and-optimization-6c72f7d8305f?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/6c72f7d8305f</guid>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[ml-pipeline]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Tue, 12 Nov 2024 11:33:55 GMT</pubDate>
            <atom:updated>2024-11-12T11:38:01.208Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-RAG pt.6 — Retriever: Ensemble Techniques and Optimization</h3><h3>Introduction</h3><p>On this part of the LLM-RAG series, we explore the process of constructing an ensemble retriever system. This approach combines keyword matching with the nuances of semantic search to enhance retrieval accuracy and efficiency. Previously, in Pt.5, we talked about the basics of retrievers, and now, we are going through the practical steps to set up this hybrid system, arming you with the insights needed for successful buildup.</p><blockquote><strong>Index</strong></blockquote><p><a href="#49d8">1. Understanding Ensemble Retrievers</a></p><p><a href="#198b">2. Implementing Cosine Similarity</a></p><p><a href="#85e0">3. Creating Embeddings</a></p><p><a href="#dafd">4. Integrating with PostgreSQL</a></p><p><a href="#93b1">5. Best Practices for Configuration</a></p><p><a href="#aa73">6. Query TSVector &amp; Vector, Combining Results</a></p><p><a href="#c070">8. Conclusion</a></p><h3>1. Understanding Ensemble Retrievers</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/975/1*-xXqlzNm1RrMFLe-5gjIBA.png" /><figcaption>You can compose in a variety of way (context reorder, data augmetation, multiQuery,…) Image by @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><h4>Combining BM25 and Sentence Transformers</h4><p>An ensemble retriever uses both BM25 and sentence transformers:</p><ul><li><strong>BM25</strong>: This method uses term frequency and inverse document frequency to rank documents, prioritizing those with frequent and unique terms.</li><li><strong>Sentence Transformers</strong>: These models convert text into numerical vectors capturing semantics, allowing the system to understand the context and meaning of the content.</li></ul><p><strong>Quick example</strong> : In a library setting, BM25 quickly locates books by frequent words in the title, while sentence transformers recommend books based on thematic relevance.</p><h4>Optimizing Retrieval Strategy</h4><p>To maximize efficiency:</p><ul><li><strong>BM25 for Short Queries</strong>: Utilize BM25 to swiftly handle queries with a few keywords, filtering documents based on the presence of key terms and narrowing down the dataset to the most relevant items.</li><li><strong>Sentence Transformers for Long Queries</strong>: Then, apply sentence transformers when dealing with longer, more complex queries. This allows for a detailed semantic analysis by transforming the content into vectors, ensuring that the results align with the deeper context and intent of the query.</li></ul><h3>2. Implementing Cosine Similarity</h3><h4>Detailed Explanation and Sample Code</h4><p>Cosine similarity quantifies the similarity between two vectors, essential for comparing semantic embeddings.</p><pre><br>def calculate_cosine_similarity(embedding):<br>    if len(embedding.shape) == 1:<br>        embedding = embedding.reshape(1, -1)<br>    norms = np.linalg.norm(embedding, axis=1, keepdims=True)<br>    norms[norms == 0] = 1<br>    normalized_embedding = embedding / norms<br>    cosine_similarity = np.dot(normalized_embedding, normalized_embedding.T)<br>    return cosine_similarity<br></pre><p>This code normalizes each vector to unit length and calculates the dot product, providing a similarity score based on the angle between vectors.</p><h3>3. 
Creating Embeddings</h3><h4>Embedding Process Explained</h4><p>Creating embeddings involves converting text into numerical representations that capture semantic nuances. Here’s how:</p><ul><li><strong>Load the Model</strong>: Use a pre-trained sentence transformer suited for your language needs. (koBert, all-mpnet, deberta…)</li><li><strong>Batch Process Documents</strong>: Efficiently encode documents in batches to manage “memory”.</li></ul><h4>Detailed Example Code</h4><pre>import logging<br>import torch<br>from sentence_transformers import SentenceTransformer<br><br># Set up logging<br>logger = logging.getLogger(&quot;Embedding&quot;)<br><br># Define the embedding models for different languages<br>embed_models = {<br>    &#39;en&#39;: SentenceTransformer(&#39;Huggingface/sentence_transformer_example&#39;),<br>    &#39;es&#39;: SentenceTransformer(&#39;Huggingface/sentence_transformer_example&#39;),<br>    &#39;de&#39;: SentenceTransformer(&#39;Huggingface/sentence_transformer_example&#39;)<br>    <br>                                     ⋮<br><br>}<br><br>def create_embeddings(documents, language, batch_size=32):<br>    model = embed_models[language]<br>    embeddings = []<br>    <br>    # Process documents in batches<br>    for start_idx in range(0, len(documents), batch_size):<br>        end_idx = start_idx + batch_size<br>        batch_documents = documents[start_idx:end_idx]<br>        <br>        logger.debug(f&quot;Creating embeddings for batch: {start_idx}-{end_idx}&quot;)<br>        <br>        # Generate embeddings for the batch<br>        batch_embeddings = model.encode(batch_documents, convert_to_tensor=True)<br>        embeddings.append(batch_embeddings)<br>        <br>        logger.debug(&quot;Merging batch embeddings&quot;)<br>    <br>    # Concatenate all batch embeddings<br>    embeddings = torch.cat(embeddings, dim=0)<br>    <br>    logger.debug(&quot;Converting embeddings to list&quot;)<br>    <br>    # Convert embeddings to a list<br>    embeddings_list = embeddings.cpu().numpy().tolist()<br>    <br>    logger.debug(f&quot;Embeddings created for {len(embeddings_list)} documents&quot;)<br>    <br>    return embeddings_list</pre><h3>4. Integrating with PostgreSQL</h3><h4>Setting Up the Database Schema</h4><p>PostgreSQL can manage and retrieve both keyword and vector data efficiently. Below is a simplified schema example:</p><pre>CREATE TABLE documents (<br>    id SERIAL PRIMARY KEY,<br>    content TEXT,<br>    tokenized_question TSVECTOR,<br>    vectorized_question VECTOR(768)<br>);</pre><ul><li><strong>TSVECTOR</strong>: Used for BM25 searches to handle keyword data.</li><li><strong>VECTOR(768)</strong>: Stores semantic vectors for similarity comparisons. Adjust size based on model specifications.</li></ul><h3>5. Best Practices for Configuration</h3><p>In systems utilizing ensemble retrievers, effective configuration of chunk sizes is crucial. This involves strategically managing vectors based on query length and user circumstances to ensure optimal performance and accuracy.</p><h4>Optimal Chunk Size:</h4><ul><li>The configuration should be adapted to the length of the query and the specific context or requirements of the user.</li><li>The chunk size should be managed such that the total chunk size across four chunks does not exceed 2048 tokens. 
This ensures the balance between maintaining enough detail for retrieval accuracy and optimizing performance.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hoj4WmMPjJ2Jf8qSGiFH3A.png" /><figcaption>Image from <a href="https://medium.com/u/b2cb71b98e5b">LlamaIndex</a></figcaption></figure><h4><strong>Use Cases</strong></h4><ul><li><strong>Ensemble Retriever Strategy</strong>: Configure chunk sizes strategically in combinations such as 1:3, 2:2, or 4:0, where these ratios represent the proportion of different retrieval methods used for each part of the query.</li><li><strong>Adaptive Based on Query Length</strong>: For short queries, allocate more chunks to keyword-based (BM25) retrieval, which handles a handful of precise terms well. For longer, more detailed queries, shift the proportion toward semantic chunks so the broader context and intent of the query are captured.</li><li><strong>Strategic Guidance: </strong>Choose the combination of chunk sizes and types (e.g., 1:3, 2:2, or 4:0) that best matches the specific query and your needs, but ensure the total chunk size does not exceed 2048 tokens to maintain efficiency and clarity across retrieval operations.</li></ul><h3>6. Query Augmentation and Combining Results</h3><h4>Text Search (TSVector) Query Example</h4><p>To enhance performance based on query type, we employ a strategy that combines results from both BM25-style text search and semantic vector search. Here’s a practical asynchronous function for the keyword (tsvector) side of the ensemble:</p><pre>async def example_transform_query(self, input_query: str, max_results: int, language_choice: str):<br>    await self.ensure_connected()<br>    # Choose a language for text search vectorization<br>    tsv_language = &#39;select_your_language&#39; if language_choice == &#39;select&#39; else &#39;english&#39;<br>    <br>    # Prepare the text search query by joining tokens with logical AND<br>    input_query = &#39; &amp; &#39;.join(input_query.split())<br>    <br>    # Define a generic example SQL query with illustrative field names<br>    query = f&quot;&quot;&quot;<br>    SELECT id, example_field_you_want_to_vectorize, additional_field, reference_url, extra_info_field, vector_representation,<br>           created_timestamp, updated_timestamp,<br>           ts_rank(tokenized_example_field, to_tsquery($1, $2)) AS rank<br>    FROM {self.TABLE_NAME}<br>    WHERE tokenized_example_field @@ to_tsquery($1, $2) AND is_active = TRUE<br>    ORDER BY rank DESC<br>    LIMIT $3;<br>    &quot;&quot;&quot;<br>    <br>    # Asynchronously fetch query results from the database connection pool<br>    async with self.pool.acquire() as connection:<br>        try:<br>            rows = await connection.fetch(query, tsv_language, input_query, max_results)<br>            return rows<br>        except Exception as e:<br>            logger.error(f&quot;Error processing example query: {str(e)}&quot;)<br>            raise</pre><h4>Chunk Combination Logic</h4><p>For increased accuracy and performance, especially with varied query lengths:</p><ul><li>Short queries benefit from BM25’s keyword efficiency.</li><li>Long queries gain from the depth of sentence embeddings.</li></ul><p><strong>Combining Results</strong>:</p><pre>def combine_results(bm25_results, vector_results):<br>    # Combine the top 2 results from both bm25 and vector results<br>    combined_results = bm25_results[:2] + vector_results[:2]<br>    return combined_results</pre><p>This function selects the top two results from both BM25 and vector searches, combining them for robust context delivery to the LLM.</p>
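<p>For completeness, here is a minimal sketch of the semantic counterpart to the text-search function above, assuming the documents table from Section 4 with the pgvector extension installed. It reuses the same connection-pool pattern (ensure_connected, pool, TABLE_NAME) and expects query_embedding to come from create_embeddings(); pgvector’s &lt;=&gt; operator returns cosine distance, so smaller values mean more similar.</p><pre>async def example_vector_query(self, query_embedding, max_results: int):<br>    await self.ensure_connected()<br>    <br>    # pgvector accepts a bracketed literal cast to ::vector<br>    vector_literal = &#39;[&#39; + &#39;,&#39;.join(str(x) for x in query_embedding) + &#39;]&#39;<br>    <br>    # Order by cosine distance; 1 - distance gives a similarity score for inspection<br>    query = f&quot;&quot;&quot;<br>    SELECT id, content, 1 - (vectorized_question &lt;=&gt; $1::vector) AS similarity<br>    FROM {self.TABLE_NAME}<br>    ORDER BY vectorized_question &lt;=&gt; $1::vector<br>    LIMIT $2;<br>    &quot;&quot;&quot;<br>    <br>    async with self.pool.acquire() as connection:<br>        return await connection.fetch(query, vector_literal, max_results)</pre><p>The rows returned here can then be merged with the BM25 rows through combine_results() above before being handed to the LLM.</p>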
<h3>7. Conclusion</h3><p>Combining BM25 with dense retrievers provides precise information retrieval, supported by a PostgreSQL setup that keeps the process both efficient and contextually aware. By combining results effectively, the system remains adaptable, handling a variety of query types with ease.</p><p>As we wrap up our discussion on retrievers, we’ll shift our focus to setting up the model component. We’ll dive into <strong>Ollama</strong> and <strong>GGUF</strong> fine-tuning, offering insights on how to tailor models to better fit specific applications and enhance system efficiency. Of course, we’ll talk about the <strong>OpenAI models</strong> as well.</p><p>11.12.2024 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6c72f7d8305f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-pt-6-retriever-ensemble-techniques-and-optimization-6c72f7d8305f">LLM-RAG pt.6 — Retriever: Ensemble Techniques and Optimization</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.5 — Retriever: Enhanced Model Performance]]></title>
            <link>https://medium.com/dev-ai/llm-rag-pt-5-retriever-enhanced-model-performance-b4fde1072190?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/b4fde1072190</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[chunk]]></category>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[retriever]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Mon, 11 Nov 2024 09:31:59 GMT</pubDate>
            <atom:updated>2024-11-12T11:56:05.514Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-RAG pt.5 — Retriever: Enhanced Model Performance</h3><h3>Introduction</h3><p>Imagine you’re in a massive library searching for books on a specific topic. Retrievers in a RAG pipeline act like the librarian, efficiently finding the most relevant books, ensuring you don’t leave with unnecessary volumes. They streamline your interaction with data — the precision and effectiveness of this step are crucial, much like how a good librarian saves you hours of searching.</p><p>The following image shows a simple pipeline: we create embeddings using Sentence Transformers and retrieve the most relevant data from the database. Before we add a <strong>BM25 retriever</strong> with <strong>tokenized content</strong> for a more complex structure, it’s important to understand how this basic setup works.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*w9GC6anh9rdrwMwt.png" /><figcaption>Basic Structure Text based RAG by @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><blockquote><strong><em>Index</em></strong></blockquote><p><a href="#aece">1. BM25: Classic Keyword Matching</a></p><p><a href="#4135">2. Sentence Transformers: Modern Semantic Search</a></p><p><a href="#52d2">3. Integrating with PostgreSQL for Efficient Retrieval</a></p><p><a href="#22f1">4. Best Practices for Configuring Retrievers</a></p><p><a href="#e778">5. Conclusion: Optimizing the RAG Pipeline</a></p><h3>BM25: Classic Keyword Matching</h3><p>BM25 is like having a librarian who knows every popular book in the library. It prioritizes highly requested books based on common terms.</p><p><strong>1. Understanding Term Frequency and IDF</strong></p><ul><li>Term Frequency (TF): Think of it as the librarian noting how often a book is asked for. The more requests, the more relevant it seems.</li><li>Inverse Document Frequency (IDF): This is the librarian’s way of ensuring that books requested by only a few are not ignored. Uniqueness adds value.</li></ul><p><strong>2. Document Length Normalization</strong></p><p>Shorter texts are like concise book summaries, possibly offering clearer and more focused insights. BM25 gives them higher relevance, ensuring summaries get checked first.</p><p><strong>3. Configuring Hyperparameters</strong></p><ul><li>k1 and b settings: Imagine adjusting the rules for how many times a book needs to be requested before it gets displayed on the ‘popular’ shelf. These parameters fine-tune the system to best serve your needs.</li></ul><p><strong>4. Real-world Analogies</strong></p><ul><li>Think of BM25 as organizing your bookshelf at home. You might put shorter, more impactful reads at eye level, while more complex, lengthy books occupy less accessible spaces. It’s about making relevant information readily available.</li></ul><h3>Sentence Transformers: Modern Semantic Search</h3><p>Sentence transformers are like having a librarian who knows the essence of every book. They understand context beyond the words, finding books by theme rather than just titles.</p><p><strong>1. Overview of Sentence Embeddings</strong></p><p>These embeddings capture the book’s essence, akin to understanding a book’s narrative rather than focusing on specific words. It enables a librarian to recommend a book based on the storyline rather than keywords.</p><p><strong>2. Pre-trained Models and Fine-tuning</strong></p><p>Pre-trained models are like librarians who’ve read summaries of every book. 
They can be further trained to specialize in certain genres, improving their recommendation skills.</p><p><strong>3. Transforming Text into Vectors</strong></p><p>Text is transformed into a mathematical form that helps identify books with similar themes or narratives, much like categorizing novels by the emotions they evoke.</p><p><strong>4. Real-world Analogies</strong></p><p>Imagine the librarian could read emotions. A query for “happy endings” doesn’t just bring books with those words in the title but identifies stories that end optimistically, offering recommendations based on emotional content.</p><h3>Integrating with PostgreSQL for Efficient Retrieval</h3><p>PostgreSQL acts like the library’s back-office system, cataloging every book and cross-referencing themes and summaries.</p><p><strong>1. Storing Embeddings in PostgreSQL</strong></p><p>PostgreSQL works like a detailed catalog system. It efficiently stores all records, ensuring they’re quickly found. By using extensions, PostgreSQL can manage complex vector data, perform fast similarity searches, and optimize storage for rapid retrieval, making it essential for smooth RAG pipeline operation.</p><p><strong>2. Vector Extensions for Similarity Search</strong></p><p>This is the library’s special system for grouping books based on themes rather than titles, enabling searches based on comprehensive thematic connections.</p><p><strong>3. Real-world Analogies</strong></p><p>Consider it as your mental map of the library, knowing not only where books are but understanding the complex connections between them based on their contents.</p><h3>Best Practices for Configuring Retrievers</h3><p><strong>1. Optimal Chunk Size for Documents</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jpmaPtkHg4w34efO.png" /><figcaption>Image from <a href="https://medium.com/u/b2cb71b98e5b">LlamaIndex</a> optimized chunk size</figcaption></figure><p>Just as readers better digest information in well-sized chapters, chunking documents into manageable sizes ensures that each piece is coherent and informative. According to the Llama Index, the optimal chunk size depends on the model specifications and continues to evolve. Currently, chunk sizes of 1024 or 2048 tokens are often considered effective, balancing depth with performance.</p><p><strong>2. Combining BM25 and Dense Retrievers</strong></p><p>Using BM25 is like a quick keyword search, narrowing down options. Dense retrievers refine the list, much like asking the librarian to find books that truly resonate with your theme.</p><p><strong>3. Use Cases and Scenarios</strong></p><ul><li>Research Projects: Like a research assistant, retrievers gather key materials, ensuring you have targeted data without irrelevant clutter.</li><li>Customer Query Handling: When handling complex queries, dense retrievers understand user needs deeply, akin to an experienced librarian catering to specific reading interests.</li></ul><h3>Conclusion: Optimizing the RAG Pipeline</h3><p>In optimizing your RAG pipeline, combining the classic BM25 for initial sorting with the nuanced understanding of sentence transformers creates a robust, intelligent system. PostgreSQL ensures your data is stored effectively, ready for quick retrieval. This optimized setup is like having the perfect library staff — efficient, knowledgeable, and intuitive — ensuring every search delivers precisely what you need. 
The journey from query to result becomes seamless, unlocking deeper insights and efficient data interaction.</p><p>Looking ahead, the next part will jump into setting up the database schema in PostgreSQL to leverage vector functions effectively. We will also explore strategies to optimize retriever performance using both BM25 and sentence transformers in tandem, ensuring your system is both powerful and adaptable.</p><p>11.10.2024 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b4fde1072190" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-pt-5-retriever-enhanced-model-performance-b4fde1072190">LLM-RAG pt.5 — Retriever: Enhanced Model Performance</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.4 — Fine Tuning: Friend, Not A Foe]]></title>
            <link>https://medium.com/dev-ai/llm-rag-fine-tuning-friends-not-foe-c371116e2b36?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/c371116e2b36</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[model]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Wed, 23 Oct 2024 15:55:02 GMT</pubDate>
            <atom:updated>2024-10-23T16:22:26.486Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-RAG pt.4 — Fine Tuning: Friend, Not A Foe</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*QbiFd_QPhDEY87vj" /><figcaption>pipeline.. for each domain</figcaption></figure><p>I was writing about Retrieval-Augmented Generation (RAG) when I realized that the broader narrative around <strong>RAG</strong> often involves a comparison with <strong>Fine-tuning</strong>. These are like two popular yet distinct flavors in the world of Large Language Models (LLMs).</p><p>It’s not about <strong>“Which is Better”</strong> but “<strong>Understanding when and how”</strong> they perform best. So, let’s get into a tale of two techniques without getting lost in technical jargon. This piece aims to demystify RAG and Fine-tuning using everyday examples that showcase their unique strengths.</p><p>Simply click on the index below to easily navigate to the section you’d like to visit.</p><blockquote><strong>Index</strong></blockquote><p><a href="#de81"><strong>0. Introduction: </strong>Picture yourself making a cup of tea.</a></p><p><a href="#2d72"><strong>1. RAG Unplugged: </strong>Bringing Extra Knowledge to the Party</a></p><p><a href="#0d07"><strong>2. Fine-Tuning: </strong>Tailoring the Suit for a Perfect Fit</a></p><p><a href="#a62f"><strong>3. RAG &amp; Fine-Tuning in Action: </strong>Customer Service and Legal Analysis</a></p><p><a href="#3394"><strong>4. Conclusion: </strong>Two Techniques, Many Possibilities</a></p><h4>0. Introduction: Picture yourself making a cup of tea</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*abUhesBrYTF1I8Z_7K1Fuw.png" /></figure><p><strong>Retrieval-Augmented Generation</strong> <strong>(RAG)</strong> is like placing a teabag in your mug, allowing you to draw out flavors depending on what’s around. Meanwhile, <strong>Fine-Tuning</strong> is akin to having a kettle of perfectly brewed tea that you can pour straight into your cup — the flavors are already set, awaiting to be enjoyed in a familiar and consistent manner.</p><p>I initially set out to discuss RAG in isolation, intrigued by its ability to bring extra layers of knowledge to AI systems. However, as I got deeper, my curiosity — and the questions from my coworkers and AI-savvy friends from several conferences made me realize that understanding RAG often involves looking at it alongside <strong>Fine-Tuning</strong>.</p><p>These two aren’t competing approaches — rather, they offer distinct but complementary strengths. So, I decided to explore not just RAG, but how it pairs with Fine-tuning, to give a fuller picture of how language models can be truly powerful and versatile. Let’s wander through this tea party of techniques, sharing stories and examples to illuminate their strengths.</p><h4>1. RAG : Bringing Extra Knowledge to the Party</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/416/0*7gnT_WMiOrmX82Ft" /><figcaption>Illust from @iStock</figcaption></figure><p><strong>RAG</strong> is like having a friend who knows just where to find interesting facts whenever you need them. It enhances existing models by fetching the <strong>most relevant data from a vast sea of information</strong>. Imagine you’re a customer service representative who can instantly pull up customer history and product details during a conversation. 
RAG helps models do just this — access real-time, pertinent information without having memorized it.</p><p><strong>When to Use RAG?</strong></p><ul><li><strong>Live Updates Needed</strong>: Perfect for situations like stock market insights, where staying updated is crucial.</li><li><strong>Complex Inquiries</strong>: Great for handling customer queries that require broad knowledge beyond scripted responses.</li></ul><h4>2. Fine-Tuning: Tailoring the Suit for a Perfect Fit</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/0*eI-o-JZ5TsrB_xL6" /><figcaption>Illust from @iStock</figcaption></figure><p>Fine-tuning, on the other hand, is like <strong>tailoring a suit that fits just right.</strong> It customizes a pre-trained model to specialize in specific tasks. Think of it as your model learning all the rules and nuances of a new workplace, becoming the go-to guru for that environment.</p><p><strong>When to Use Fine-Tuning?</strong></p><ul><li><strong>Specialized Knowledge Required</strong>: Ideal for well-defined domains such as healthcare protocols or corporate guidelines.</li><li><strong>Consistent Output Needs</strong>: Best when you want the model to reliably handle specific data structures or predictable scenarios.</li></ul><h4>3. RAG &amp; Fine-Tuning in Action: Customer Service and Legal Analysis</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/0*dq0SpOWLs8FqQcLB" /><figcaption>illust from @iStock</figcaption></figure><p>Let’s break it down with two relatable domain example :</p><p><strong>Customer Service</strong></p><ul><li><strong>RAG Application: </strong>When agents need to interact dynamically with customers, RAG helps by providing fresh and relevant data — be it the latest product information or a customer’s transaction history — right at their fingertips.</li><li><strong>Fine-Tuning Application:</strong> For handling known issues or common requests, fine-tuning models on a company’s FAQs and policy manuals ensures consistent and precise responses.</li></ul><p><strong>Legal Analysis</strong></p><ul><li><strong>RAG Application:</strong> In a fast-paced legal environment, staying current with legislative changes is critical. RAG facilitates this by pulling the latest legal precedents and updates as required.</li><li><strong>Fine-Tuning Application:</strong> For dealing with standard legal documents or repeated patterns of analysis, fine-tuning ensures the model recognizes and processes these efficiently.</li></ul><h4><strong>4. Conclusion: Two Techniques, Many Possibilities</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/0*7UXV1yYTMjaCN-t4" /><figcaption>Illust from @iStock</figcaption></figure><p>When we consider RAG and fine-tuning, it’s clear that they’re not rivals, but rather complementary tools, each designed for its own specific purpose. By recognizing their individual strengths, we can use them together to create robust and versatile AI systems that don’t just perform well, but truly excel.</p><p>So, whether you’re navigating through a sea of customer queries or getting into stacks of legal papers, remember: <strong>RAG and Fine-tuning</strong> aren’t in a race. 
They are here to support a common goal — making AI smarter, more efficient, and better tuned to our needs.</p><p><strong>24.10.2024 — Fin.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c371116e2b36" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-fine-tuning-friends-not-foe-c371116e2b36">LLM-RAG pt.4 — Fine Tuning: Friend, Not A Foe</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.3 — Preprocessing: Constructing a Comprehensive Guide]]></title>
            <link>https://medium.com/dev-ai/llm-rag-pt-3-preprocessing-constructing-a-comprehensive-guide-15bbc1e1df36?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/15bbc1e1df36</guid>
            <category><![CDATA[data-preprocessing]]></category>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[sentence-transformers]]></category>
            <category><![CDATA[parsing]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Sun, 20 Oct 2024 12:30:25 GMT</pubDate>
            <atom:updated>2025-03-01T07:24:56.219Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM-RAG pt.3 — Preprocessing: Constructing a Comprehensive Guide</h3><p>In the world of data science and technology, efficiently preparing data is like setting up a stage for a flawless performance. When creating a Multi-Retrieval Retrieval-Augmented Generation (RAG) pipeline, preprocessing serves as the foundation upon which everything else is built. Through this extensive guide, we’ll get into the multifaceted world of PDF preprocessing, examining every detail as you would when preparing your home for an important visitor.</p><p>Before we get into the details, let’s outline the key components necessary to build a robust RAG pipeline</p><blockquote><strong>Index</strong></blockquote><p><a href="#8601">1. Introduction</a></p><p><a href="#cf34">2. Key Aspects of PDF Preprocessing</a></p><p><a href="#b0ec">3. Detailed Approaches for Diverse Layouts</a></p><p><a href="#e9df">4. Setting the Stage for Success</a></p><p>Now, let’s focus on the essential first step: preprocessing. This part will illuminate how to transform a potentially chaotic PDF into a refined structure ready for sophisticated RAG operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X7zVZDBh0dyYaq43n8rZ1w.png" /><figcaption>Simplified Multi-Modal Pre-Processing Pipeline for RAG Integration @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><h3>Introduction to the Importance of Preprocessing</h3><p>Think of your preprocessing task as decluttering and organizing your living space before hosting a dinner party. Each piece — text, image, and chart within the PDF — must be properly categorized and placed, allowing your data models to perform at their best. Just as a well-aligned seating arrangement enhances conversation, effectively pre-processed data ensures the RAG pipeline operates efficiently.</p><h3>Laying the Foundation: Key Aspects of PDF Preprocessing</h3><h4>1. Layout Analysis: Designing the Blueprint</h4><ul><li>Understanding Room Layouts vs. Document Structures: Picture adjusting furniture based on your room’s floor plan — knowing whether your PDF is a single-column or multi-column affects how you plan your data “furnishing.”</li><li>Handling Complex Features: Like organizing a room with nooks and workstations, understanding tables and charts requires precision. Each cell, akin to a treasured book, must be accurately cataloged for easy retrieval.</li></ul><h4>2. Text Extraction and Cleaning: Dusting and Arranging</h4><ul><li>Text Cleaning: Simplifying your text is like tidying up messy drawers, keeping everything neat and easy to read.</li></ul><h4>3. Image and Graphics Processing: Curating an Art Gallery</h4><ul><li>Tagging and Storage: Tagging images and storing URLs for Image Storage is like curating art pieces. Images are tagged for easy access without relying on “OCR processes”.</li></ul><h4>4. Metadata Parsing: Learning the House’s History</h4><ul><li>Extract metadata like author and date, similar to uncovering a home’s blueprints, adding valuable context to your document processing.</li></ul><h4>5. Semantic Analysis and Annotation: Narrating a Guided Tour</h4><ul><li>Using NLP techniques to annotate text mirrors how you’d highlight features of a room to guests, adding depth beyond mere visuals.</li></ul><h3>Detailed Approaches for Diverse Layouts</h3><h4><strong>1. 
Horizontal Layouts: Open-Concept Rooms</strong></h4><ul><li>Handling Two-Column Layouts: Like a loft needing distinct areas, managing split columns requires careful merging to keep narrative flow and zones intact.</li></ul><h4><strong>2. Vertical Layouts: Stacking Knowledge Floors</strong></h4><ul><li>Recognizing Hierarchical Structures**: Organizing document content is like stacking furniture logically by floors, ensuring accessibility.</li></ul><h3>Ensuring Integration and Harmony: The Final Touch</h3><p>As you prepare your space for guests, ensure every detail — from lighting to temperature — is just right. In data preprocessing, refine your approach to achieve consistent output and smooth integration with the rest of the pipeline.</p><h3>Conclusion: Setting the Stage for Success</h3><p>Improving your preprocessing strategy is about more than cleaning up; it’s about creating an environment where all data fits together seamlessly, like a home ready for company. By handling layouts, cleaning text, and organizing visuals, you lay the groundwork for a RAG pipeline that delivers high-quality outcomes. This approach not only simplifies complex input data but also ensures that retrieval and generation stages function flawlessly, maximizing your RAG pipeline’s potential.</p><p>In summary, preprocessing transforms PDF chaos into clarity, paving the way for breakthroughs in data retrieval and usage within intelligent systems.</p><p>2024.10.20 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15bbc1e1df36" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-pt-3-preprocessing-constructing-a-comprehensive-guide-15bbc1e1df36">LLM-RAG pt.3 — Preprocessing: Constructing a Comprehensive Guide</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.2 — Materials]]></title>
            <link>https://medium.com/dev-ai/llm-rag-pt-2-materials-3557b3ad8da2?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/3557b3ad8da2</guid>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[ml-service]]></category>
            <category><![CDATA[rag-pipeline]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Mon, 14 Oct 2024 11:56:48 GMT</pubDate>
            <atom:updated>2024-10-15T00:10:46.872Z</atom:updated>
            <content:encoded><![CDATA[<p>We’re working on building a Multi-Retrieval RAG (Retrieval-Augmented Generation) chain for both short and long query processing. But before diving into it, let me list the materials we need for creating this Multi-Retrieval RAG Chain.</p><blockquote><strong>But why do we need two separate retrieval processes?</strong></blockquote><p>The answer depends on how we compare the user query with the embedded vector data. Basically, sentence transformers compare vectors by sentence similarity. SBERT (Sentence-BERT) doesn’t perform well on short queries because it is optimized for comparing and understanding whole sentences.</p><p>What about sparse retrievers like BM25? They consider the relationship between the individual words in a query, meaning they perform better on short queries. So now you get it, right? Sparse retrievers are better for short, keyword-style queries, and dense retrievers excel at sentence-to-sentence comparisons.</p><p>Now let’s get into the materials we need to build the Multi-Retrieval RAG Pipeline. The RAG setup basically consists of the four parts below. You can simply click each index to move to the corresponding part.</p><p><a href="#fce7">1. Pre-processing</a></p><p><a href="#b6e4">2. Embeddings</a></p><p><a href="#1397">3. Retriever (Sparse + Dense)</a></p><p><a href="#cd05">4. LLM for returning the answer to the customer</a></p><blockquote><strong>Pre-Processing</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*p3UhNcmv98PiLeFNfbkPrA.jpeg" /></figure><p>Pre-processing is all about how we prepare chunks for embedding and the metadata provided to the LLM at the final stage. There are several things to consider based on the type of data we’ll be using.</p><p><strong>(1) Data Type</strong></p><ul><li><strong>Sentence-based Data</strong>: This includes documents like research papers or news articles that mainly focus on providing information through text.</li><li><strong>Sentence + Data Table</strong>: This includes documents like manuals or API documents, which contain both data tables and explanatory sentences.</li><li><strong>Sentence + Data Table + Image</strong>: This includes sources like web magazines or scientific papers that contain charts and images.</li></ul><p><strong>(2) Adding Additional Information for Easier Understanding by the LLM</strong></p><ul><li><strong>Charts</strong>: Sentence Transformers understand better when there’s a clear division between titles, subtitles, and content. This is usually achieved with specific formatting.</li><li><strong>Images</strong>: Instead of using OCR processing, which might consume too much work and resources, we’ll use image storage with tagged image URLs.</li></ul><blockquote><strong>Embeddings</strong></blockquote><p>Before vectorizing the data, a few considerations are necessary:</p><p><strong>(1) Vector Dimension</strong></p><p>This is often decided by the model used for vectorizing the user query and data for semantic search and providing chunks to the LLM. Most HuggingFace models built with SBERT use 768-dimensional vectors; some use 1536. This means that if you use a 768-dimension model for embedding user queries, you should also follow the same dimension for vectorized information in the Vector DB. 
If you use a 1536-dimension model for user queries, the vectorized dataset should be embedded in the same dimension, 1536, to ensure the database functions in the same environment.</p><p>Quick note: Images use higher dimensions compared to text, and videos use even higher dimensions than images. Got it?</p><p><strong>(2) Metadata</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*QbW4H37FbB_uTx4SedY5hw.jpeg" /><figcaption>Metadata Tagged to the Vector</figcaption></figure><p>Metadata is essentially like the tag that accompanies chunks of information delivered to the LLM. When you enter a query into the chatbot, semantic search works by having the retriever find the most similar vector in the Vector Database. But what’s next? Are we going to give these vectors directly to the LLM? Absolutely not.</p><p>The LLM is like a text expert and only deals with text-based data, not vectors. If we were using a different type of model that handles various data forms, that would be different. Right now, for images, we’re preprocessing them instead of vectorizing, similar to preparing a photo for printing. Vectorizing voice or images is a separate step altogether. So, vectors are only used for discovering similarities. Meanwhile, the text data in the metadata, acting like a detailed tag, is what the LLM actually receives and uses to provide you with accurate information.</p><blockquote><strong>Retriever (Sparse + Dense)</strong></blockquote><p>The RAG chain uses both sparse and dense retrievers to balance performance across different types of queries:</p><p>1. Sparse Retriever (BM25): This retriever is great because it understands individual word relationships within short queries. It scans the words and finds the best match based on word-based similarity.</p><p>2. Dense Retriever (SBERT): This one is optimized for longer, more complex queries. It’s good at understanding sentence-to-sentence relationships and giving results that make sense in the context of longer texts.</p><p>Here’s a careful point to note: choosing the right retriever for the right type of query can significantly improve the accuracy and relevance of the results. Short queries? Go sparse. Long, detailed queries? Use dense.</p><blockquote><strong>Language Model (LLM) for Answer Generation</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UYEZaxkNzxSino91h27RDg.jpeg" /><figcaption>Image from <a href="https://medium.com/u/d346391a83fb">Analytics Vidhya</a></figcaption></figure><p>The final piece of the puzzle is the LLM, which generates answers based on the retrieved text. After getting the most relevant chunks of text through the combined efforts of sparse and dense retrievers, the LLM synthesizes this information to give you an appropriate response.</p><p>Here’s another careful point: The accuracy of the LLM depends heavily on the quality of the pre-processing, embeddings, and retrievers. So, ensure each step is done meticulously for the best outcomes.</p><blockquote><strong>Conclusion</strong></blockquote><p>Building a Multi-Retrieval RAG Chain means you need to understand how sparse and dense retrievers compare user queries with embedded vector data. Effectively using both types, combined with careful pre-processing and embedding, creates a robust and efficient query processing system. This ensures the system performs well across different query types, providing accurate, contextually relevant answers. 
By leveraging the strengths of both retrieval methods and ensuring proper data preparation, we can achieve a high-performance, multi-retrieval RAG system that meets diverse user needs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3557b3ad8da2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/llm-rag-pt-2-materials-3557b3ad8da2">LLM-RAG pt.2 — Materials</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM-RAG pt.1 — Vector DB, What is the difference?]]></title>
            <link>https://medium.com/dev-ai/rag-pt-1-vector-db-what-is-the-difference-f4a9702f26da?source=rss-f55e3a246699------2</link>
            <guid isPermaLink="false">https://medium.com/p/f4a9702f26da</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Liebertar]]></dc:creator>
            <pubDate>Sat, 24 Aug 2024 11:24:43 GMT</pubDate>
            <atom:updated>2024-10-15T00:09:29.436Z</atom:updated>
            <content:encoded><![CDATA[<p>Imagine this: You’re going into the world of futuristic databases that understand not just numbers and text but also images, audio, and more. For this part, I would like to talk about some standout features: Vector Representation, Logarithmic Time Search, Cosine Similarity, Dimensionality Reduction, and some of Easy-understanding Use Cases.</p><p>Below is the roadmap for our discussion today. You can just simply click on each of topic to jump straight to that section.</p><blockquote><strong><em>Index</em></strong></blockquote><ul><li><a href="#c914">What is a Vector Database?</a></li><li><a href="#e7e8">How is the Vector Database Different from a Traditional DB?</a></li><li><a href="#82c9">How Vector DB Works?</a></li><li><a href="#b5c1">Logarithmic Time Search in Vector DB</a></li><li><a href="#0950">Cosine Similarity</a></li><li><a href="#aeb6">Dimensionality reduction in Vector DB</a></li><li><a href="#3029">Use Cases</a></li><li><a href="#9f1c">Wrap — up</a></li></ul><h3>1. What is a Vector Database?</h3><p>A Vector Database stores data as vectors, arrays of numbers capturing the essence of various data types like images, text, and audio. Unlike traditional databases that store data in rows and columns, Vector Databases convert unstructured data into vectors, making it easily searchable and comparable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/769/0*_elBvynKe5fmyhv8.png" /><figcaption>Vector DB Providers</figcaption></figure><h3>2. How is a Vector Database Different from a Traditional Database?</h3><p>Traditional databases, like relational databases, are great for handling structured data. Structured data is neatly organized in tables with predefined schemas, making it easy to query with SQL (Structured Query Language). These databases excel in tasks where data relationships are clear and well-defined.</p><p>On the other hand, traditional databases, including NoSQL databases, often struggle with unstructured data — things like images, videos, and natural language text. This kind of data doesn’t fit neatly into tables and columns, making it hard to process and analyze effectively with standard database methods.</p><p>This is where vector databases shine. They are specifically designed to handle unstructured data by converting it into numerical formats called vectors. Vectors are arrays of numbers that capture the key features of the unstructured data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1kaoky3O4UlhZT8-8dJLbg.png" /><figcaption>image from @<a href="https://weaviate.io/">weviate</a></figcaption></figure><h3>3. How Vector Databases Work?</h3><p>Vector databases store data as vectors, which are arrays of numbers that capture the core features of each piece of data.</p><h4>What is a Vector?</h4><p>A vector is essentially a list of numbers that represent different attributes of an item. For instance, in a vector, the numbers might reflect various aspects like color, size, shape, or even the meaning of a word in a text.</p><h4>Example Case</h4><p>Imagine you have a gallery of images. Each image is converted into a vector that reflects features like color, texture, and shape. This vector representation allows for easy comparison and search through the images. For example, by comparing these vectors, the database can quickly find images that are similar in color and texture to a given image. 
This makes searching large sets of unstructured data much more efficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i-rBn0-rsX72Xbt7RBG29A.png" /><figcaption>@Google Hum to Search</figcaption></figure><h3>4. Logarithmic Time Search in Vector Databases</h3><p>Advanced search algorithms allow Vector Databases to perform fast searches, even with large datasets, thanks to logarithmic time search.</p><h4>Example Case:</h4><p>Consider a music app that finds songs similar to the one you’re humming. The app compares your hummed melody with millions of songs, identifying those with similar characteristics quickly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1015/1*3LE0QTCoRZD4GfBjX7lbcA.png" /><figcaption>Cosine calculation for easy understanding @<a href="https://medium.com/u/f55e3a246699">Liebertar</a></figcaption></figure><h3>5. Cosine Similarity</h3><p>Cosine Similarity measures the angle between two vectors to determine how similar they are, crucial for comparing vectorized data.</p><h4>Example Case:</h4><p>Think about online shopping recommendations. The platform uses Cosine Similarity to compare your browsing history vectors with product vectors, suggesting relevant items.</p><h3>6. Dimensionality Reduction in Vector Databases</h3><p>Dimensionality Reduction simplifies high-dimensional data by reducing the number of variables while preserving essential information.</p><h4>Example Case:</h4><p>Imagine an app that categorizes your photos. Dimensionality Reduction simplifies the data, making the classification process faster and more efficient.</p><h3>7. Use Cases for Vector Databases</h3><p>Vector Databases have numerous practical applications:</p><ul><li><strong>Search Engines:</strong> Enhanced capabilities beyond simple keyword matching.</li><li><strong>Recommendation Systems:</strong> Products, movies, or songs based on user preferences.</li><li><strong>Natural Language Processing (NLP):</strong> Efficiently transforms and understands human language.</li><li><strong>Image and Video Recognition:</strong> Quickly classifies and identifies visual data.</li></ul><h4>Example Case:</h4><p>In a virtual assistant app, a Vector Database helps understand and respond to user queries effectively, enhancing user experience.</p><h3>8. Re-check : Comparison to Regular Databases</h3><h4>Traditional Databases:</h4><ul><li><strong>Data Format:</strong> Rows and columns.</li><li><strong>Data Types:</strong> Ideal for numbers, strings, dates.</li><li><strong>Search Mechanism:</strong> Slower for large, unstructured datasets.</li><li><strong>Flexibility:</strong> Limited to predefined schemas.</li></ul><h4>Vector Databases:</h4><ul><li><strong>Data Format:</strong> Vectors capturing data essence.</li><li><strong>Data Types:</strong> Excellent for unstructured data.</li><li><strong>Search Mechanism:</strong> Quick, efficient searches with advanced algorithms.</li><li><strong>Flexibility:</strong> Highly adaptable to various data types and structures.</li></ul><h4>Example Comparison:</h4><p>In a Traditional Database, searching for books titled “Adventure” involves simple keyword matching. With a Vector Database, you can describe the plot, and the system will fetch matching books.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-fojCqvjtKYWsaW3.png" /><figcaption>Image from @Monte Carlo Data</figcaption></figure><h3>9. Wrapping Up</h3><p>Vector Databases represent a leap from traditional databases in handling complex, high-dimensional data. 
They enable rapid searches and efficient data reduction, and they support a wide range of real-world applications. As AI and machine learning advance, Vector Databases will be essential in understanding our vast and varied data landscapes.</p><p>Next time you think about databases, imagine them diving into the intricate details of unstructured data, providing insights faster than ever.</p><p>08.23.2024 — Fin.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f4a9702f26da" width="1" height="1" alt=""><hr><p><a href="https://medium.com/dev-ai/rag-pt-1-vector-db-what-is-the-difference-f4a9702f26da">LLM-RAG pt.1 — Vector DB, What is the difference?</a> was originally published in <a href="https://medium.com/dev-ai">Dev-ai</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>