Welcome back!
In Chapter 4: Memory, we taught our AI to remember the conversation history. It can now remember your name or what you asked five minutes ago.
However, there is still a massive limitation: The Knowledge Cutoff. If you ask the model about a news event from yesterday, or the contents of your private company PDF, it will hallucinate or say "I don't know."
Imagine the LLM is a brilliant student taking a closed-book exam: it can only answer from what it memorized during training. The fix is to make the exam open-book, letting the model look up relevant passages in a library of your documents before it answers. This technique is called RAG (Retrieval-Augmented Generation).
To build this "Library," we need four components:
- Documents: a standard container for raw text.
- Text Splitters: to cut big files into searchable chunks.
- A Vector Store (plus Embeddings): to find chunks by meaning.
- A Retriever: the interface that connects the store to a chain.
LangChain doesn't read PDF or Word files directly in the chain. It converts everything into a standard format called a Document.
A Document is a simple container with two fields:
- page_content: The text itself.
- metadata: Info about the text (source, page number, author).
from langchain_core.documents import Document
# Create a "book" manually
doc = Document(
page_content="LangChain was released in late 2022.",
metadata={"source": "history_book.txt", "author": "Harrison"}
)
print(doc.page_content)
# Output: "LangChain was released in late 2022."
In a real app, you would use a Loader (like PyPDFLoader) to create these automatically, but they all result in this exact object.
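What does a loader actually do? Here is a minimal stand-in: `SimpleTextLoader` is an illustrative sketch, not a real LangChain class, and the `Document` below is a plain dataclass mimicking the real one. It just reads a file and wraps the text plus its source path into a Document.

```python
import tempfile
from dataclasses import dataclass, field

# Stand-in for langchain_core.documents.Document (illustrative only)
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class SimpleTextLoader:
    """Toy loader: reads one text file into a single Document."""
    def __init__(self, path: str):
        self.path = path

    def load(self) -> list[Document]:
        with open(self.path, encoding="utf-8") as f:
            text = f.read()
        # Record where the text came from in the metadata
        return [Document(page_content=text, metadata={"source": self.path})]

# Demo: write a temp file, then "load" it
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("LangChain was released in late 2022.")
    path = f.name

loaded_docs = SimpleTextLoader(path).load()
print(loaded_docs[0].page_content)        # LangChain was released in late 2022.
print(loaded_docs[0].metadata["source"])  # the temp file's path
```

Real loaders add format-specific parsing (PDF pages, HTML stripping), but the end product is always the same Document object.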
You cannot feed an entire 300-page book into an LLM prompt. It's too expensive and exceeds the context window. We need to cut the document into smaller chunks.
We use a TextSplitter. The most common one is RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Create the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,   # Max characters per chunk
    chunk_overlap=10 # Overlap to keep context between chunks
)
# Example text (81 characters, so it exceeds chunk_size and will be split)
text = "LangChain is a framework for developing applications powered by language models."
# Split it
docs = splitter.create_documents([text])
for doc in docs:
    print(doc.page_content)
# The first chunk is "LangChain is a framework for developing"
Explanation: A text shorter than chunk_size comes back as a single chunk, which is why we use a 50-character limit for this short example. chunk_overlap allows the end of one chunk to be repeated at the start of the next, so a sentence that straddles a boundary keeps its context. The exact cut points depend on the splitter's separator hierarchy: it tries paragraph breaks first, then line breaks, then spaces, and only then raw characters.
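The overlap mechanic is easiest to see in a stripped-down splitter. This toy `split_with_overlap` function (an invented helper, not the real RecursiveCharacterTextSplitter logic, which also respects paragraph and word boundaries) slides a fixed-size window that steps back by the overlap:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy splitter: fixed-size windows whose start advances by (size - overlap)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Every neighbouring pair shares two characters; that shared sliver is exactly what chunk_overlap buys you in the real splitter.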
Now we have hundreds of small chunks. How do we find the right one?
We can't just use Ctrl+F (keyword search): a query for "pet" should also surface chunks about "dog" or "cat," even though the exact word never appears in them.
We need Semantic Search.
Let's use Chroma, a lightweight vector store that runs in-memory by default.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# 1. Create the database from our documents
# Note: You need an OpenAI API Key for the embeddings to work
db = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings()
)
# 2. Search for relevant info
results = db.similarity_search("What is LangChain?")
print(results[0].page_content)
# Prints the chunk closest in meaning to the question:
# the one describing what LangChain is.
Explanation: We didn't search for exact words. We searched for the meaning. The VectorStore found the chunk most similar to our question.
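What does "most similar" mean mathematically? Each text is turned into a vector by the embedding model, and the store ranks chunks by cosine similarity to the query's vector. A toy demo with invented 2-D vectors (real embeddings have hundreds or thousands of dimensions; the numbers here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Pretend embeddings: dimension 0 = "animal-ness", dimension 1 = "vehicle-ness"
embeddings = {
    "dog": (0.9, 0.1),
    "cat": (0.8, 0.2),
    "car": (0.1, 0.9),
}
query = (0.85, 0.15)  # pretend embedding for "pet"

# Rank words by similarity to the query, most similar first
ranked = sorted(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]),
                reverse=True)
print(ranked)  # ['dog', 'cat', 'car']
```

Even though "pet" shares no letters with "dog," their vectors point in almost the same direction, so they rank highest. That is all a similarity search does, just at much higher dimension and scale.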
Now we connect this to what we learned in Chapter 3: Runnables & Chains.
We convert the VectorStore into a Retriever. A Retriever is a standard interface that takes a string (query) and returns a list of Documents.
# Create the interface
retriever = db.as_retriever()
# Create a prompt that expects context
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
    "Answer based on this context: {context}. Question: {question}"
)
# ... (Assume we have a 'model' defined from Chapter 1) ...
To run this, we fetch the documents first, then pass them to the chain.
# 1. Retrieve data manually
docs = retriever.invoke("What is LangChain?")
# 2. Combine the documents into one context string
context = "\n\n".join(doc.page_content for doc in docs)
# 3. Run the chain
response = prompt.pipe(model).invoke({
    "context": context,
    "question": "What is LangChain?"
})
print(response.content)
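Conceptually, this whole RAG loop is just three steps composed: retrieve, format, prompt. Here is a dependency-free sketch; the keyword-overlap `retrieve` is a crude stand-in for real embedding search, and all function names are invented for illustration:

```python
def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many words they share with the query."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def format_docs(docs: list[str]) -> str:
    # Join retrieved chunks into one context string
    return "\n\n".join(docs)

def build_prompt(context: str, question: str) -> str:
    # Same template the chapter uses
    return f"Answer based on this context: {context}. Question: {question}"

chunks = [
    "LangChain is a framework for developing applications",
    "applications powered by language models.",
]
question = "What is LangChain?"
prompt_text = build_prompt(format_docs(retrieve(question, chunks)), question)
print(prompt_text)
```

In the real version, `retriever.invoke` replaces `retrieve`, and the final prompt string goes to the model instead of being printed.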
Two questions remain under the surface: how does LangChain manage splitting text and wrapping it in objects, and what mathematical comparison happens when you search for documents?
As we saw in libs/core/langchain_core/documents/base.py, the Document class is intentionally simple. It inherits from BaseMedia (which handles IDs) and is serializable (can be saved to JSON).
class Document(BaseMedia):
    page_content: str
    metadata: dict

    def __init__(self, page_content, **kwargs):
        # Validation happens here to ensure content is a string
        super().__init__(page_content=page_content, **kwargs)
The splitting logic is fascinating. It doesn't just hack text apart; it tries to keep it meaningful.
In libs/text-splitters/langchain_text_splitters/base.py, the create_documents method orchestrates the process:
import copy

# Simplified logic from TextSplitter.create_documents
def create_documents(self, texts, metadatas=None):
    # Default to one empty metadata dict per text
    _metadatas = metadatas or [{}] * len(texts)
    documents = []
    # Loop through every original text (e.g., every file)
    for i, text in enumerate(texts):
        # 1. Call the specific splitting logic (abstract method)
        chunks = self.split_text(text)
        # 2. Wrap each chunk into a Document object
        for chunk in chunks:
            new_doc = Document(
                page_content=chunk,
                metadata=copy.deepcopy(_metadatas[i])
            )
            documents.append(new_doc)
    return documents
The specific logic for how to split (by character, by token, or by newlines) is defined in the subclass method split_text.
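That base-class/subclass contract is easy to mimic in a few lines. This sketch echoes LangChain's names but is not the real implementation: the base class owns the chunk-to-Document wrapping, while the subclass supplies the actual splitting strategy via split_text.

```python
import copy
from dataclasses import dataclass, field

# Stand-in for the real Document (illustrative only)
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyTextSplitter:
    """Base class: owns the chunk -> Document wrapping."""
    def split_text(self, text: str) -> list[str]:
        raise NotImplementedError  # subclasses decide HOW to split

    def create_documents(self, texts, metadatas=None):
        metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            for chunk in self.split_text(text):
                # deepcopy so chunks don't share one mutable metadata dict
                documents.append(Document(chunk, copy.deepcopy(metadatas[i])))
        return documents

class LineSplitter(ToyTextSplitter):
    """Subclass: one chunk per non-empty line."""
    def split_text(self, text: str) -> list[str]:
        return [line for line in text.splitlines() if line.strip()]

line_docs = LineSplitter().create_documents(["a\nb"], metadatas=[{"source": "t.txt"}])
print([d.page_content for d in line_docs])  # ['a', 'b']
```

Swapping in a different subclass (by character, by token, by markdown header) changes only split_text; the Document-wrapping machinery never moves.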
In this chapter, we learned:
- Documents: the standard container (page_content + metadata).
- Text Splitters: cutting big texts into small, overlapping chunks.
- Vector Stores: finding chunks by meaning instead of keywords.
- Retrievers: the standard interface that feeds those chunks into a chain.
Why this matters: We now have a "Brain" (Model), "Memory" (History), and a "Library" (VectorStore).
But our AI is still passive. It waits for us to ask questions. What if we want the AI to do things? What if we want it to check the weather, calculate numbers, or book a flight?
For that, we need to give it hands. We need Tools.