In the Extraction Orchestrator chapter, we saw how to magically turn text into data using extract(). But have you ever wondered how the computer understands the text generated by an AI?
Large Language Models (LLMs) are designed to chat. They tend to wrap their answers in phrases like "Here is the data you asked for" or "I found the following information."
Imagine you ask a model for JSON data. You expect this:
```json
[{"name": "Alice", "age": 30}]
```
But the model actually returns this:
````text
```json
[ { "name": "Alice", "age": 30 } ]
```
<think>I should probably double-check that age...</think>
Hope this helps!
````
If you feed that messy string directly into Python's json.loads(), it raises a json.JSONDecodeError, and unless you catch it, your program crashes.
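Here is a quick way to see that failure using only the standard library. The `messy` string is a simplified stand-in for a chatty response, not real model output:

```python
import json

# Illustrative stand-in for a chatty model response (not real model output).
messy = 'Here is the data you asked for:\n[{"name": "Alice", "age": 30}]\nHope this helps!'

try:
    json.loads(messy)
    parsed_ok = True
except json.JSONDecodeError as err:
    # json.loads() rejects any text that is not a single valid JSON document.
    parsed_ok = False
    print(f"Parsing failed: {err}")
```

The JSON itself is valid; it is the surrounding chatter that makes the whole string unparseable.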
In langextract, the Format Handler is the cleanup crew. It acts as a bridge between the messy text output of an LLM and the strict structured objects Python needs.
It performs two main jobs: finding the data inside the model's response, and parsing that data as JSON or YAML.
To make extraction reliable, we usually ask models to wrap data in Markdown Code Fences. These look like triple backticks:
````text
```json
... data ...
```
````
The Format Handler looks for these specific markers. It ignores everything outside of them.
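To make the idea concrete, here is a minimal sketch of fence extraction with a regular expression. `FENCE_PATTERN` and the sample `response` are assumptions made up for this illustration; the pattern langextract actually uses may differ.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Hypothetical pattern for illustration; langextract's own regex may differ.
FENCE_PATTERN = re.compile(
    r"`{3}(?P<lang>[A-Za-z]*)[ \t]*\n(?P<body>.*?)`{3}",
    re.DOTALL,
)

response = (
    "Here is the data you asked for:\n"
    f"{FENCE}json\n"
    '[{"name": "Alice", "age": 30}]\n'
    f"{FENCE}\n"
    "Hope this helps!"
)

match = FENCE_PATTERN.search(response)
# Everything outside the fence (the greeting and the sign-off) is ignored.
fenced_body = match.group("body").strip() if match else None
print(fenced_body)
```

Only the text between the opening and closing fence survives, which is exactly what a JSON parser needs.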
While JSON is the most common format, langextract also supports YAML.
By default, the Orchestrator sets up a standard JSON handler for you. However, you can customize it.
You don't usually instantiate the handler yourself. You pass configuration parameters to extract.
Here is how you might tell langextract to use YAML instead of JSON (YAML can use fewer tokens for the same data, which may make responses faster and cheaper):
```python
import langextract as lx
from langextract.core import data

# We want YAML output instead of JSON
result = lx.extract(
    text_or_documents="Buy milk and eggs",
    prompt_description="Extract shopping list",
    examples=examples,  # Assume examples are defined
    model_id="gemini-2.0-flash",
    resolver_params={"format_type": "yaml"},
)
```
Explanation: We added resolver_params. This tells the Orchestrator to configure the internal FormatHandler to look for YAML syntax.
Some modern "Reasoning Models" (like DeepSeek-R1 or QwQ) "think" out loud before answering. They output tags like <think>... reasoning ...</think> followed by the JSON.
The FormatHandler in langextract automatically detects these tags and strips them out before parsing, so you don't have to write complex Regex patterns yourself.
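If you are curious what that stripping step might look like, here is a small sketch. `THINK_RE` and `strip_reasoning` are hypothetical names for this illustration, not langextract's actual implementation.

```python
import re

# Hypothetical helper for illustration; not langextract's actual implementation.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> spans and trim surrounding whitespace."""
    return THINK_RE.sub("", text).strip()


raw = (
    "<think>The user wants a shopping list.\nMilk and eggs.</think>\n"
    '[{"item": "milk"}, {"item": "eggs"}]'
)
cleaned = strip_reasoning(raw)
print(cleaned)
```

The `re.DOTALL` flag matters here: reasoning text usually spans several lines, and without it `.` would stop at the first newline.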
When the LLM returns a string, the FormatHandler goes through a specific pipeline to clean it up.
### 1. The Extraction Logic (_extract_content)
Let's look at langextract/core/format_handler.py. The handler uses Regular Expressions (Regex) to find the data.
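```python
# Simplified logic from _extract_content
# Find fenced blocks (triple backticks with an optional language tag)
matches = list(_FENCE_RE.finditer(text))
if len(matches) == 1:
    # Found exactly one block? Great! Return the body.
    return matches[0].group("body").strip()
```
*Explanation: The code looks for a block that starts with triple backticks and ends with triple backticks. If it finds exactly one such block, it grabs the content inside.*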
### 2. The Parsing Logic (parse_output)
Once we have the clean string (e.g., [{"item": "milk"}]), we need to turn it into a Python object.
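As a rough sketch of this step for the JSON case, the parsing can be as simple as a guarded call to `json.loads()`. The function name `parse_clean_string` is an assumption for this example and is not the library's parse_output.

```python
import json
from typing import Any


def parse_clean_string(content: str) -> Any:
    """Parse an already-cleaned JSON string, adding context to any failure."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as err:
        # Re-raise with a message that points at the model output, not the parser.
        raise ValueError(f"Model output is not valid JSON: {err}") from err


items = parse_clean_string('[{"item": "milk"}]')
print(items)
```

A YAML branch would look the same, with a YAML parser (for example `yaml.safe_load` from PyYAML) in place of `json.loads()`.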
### 3. Handling Wrappers
Sometimes, models return a list: [...].
Other times, they return an object: {"extractions": [...]}.
The handler standardizes this.
*Explanation: This ensures that no matter how the model wraps the data, your code always gets a clean list of items back.*
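A minimal sketch of that normalization, assuming the wrapper key is "extractions" as in the example above. `normalize_items` is a hypothetical helper, not the handler's real method.

```python
from typing import Any

WRAPPER_KEY = "extractions"  # key taken from the example above


def normalize_items(parsed: Any) -> list:
    """Return the list of items whether or not it is wrapped in a dict."""
    if isinstance(parsed, list):
        return parsed
    if isinstance(parsed, dict) and isinstance(parsed.get(WRAPPER_KEY), list):
        return parsed[WRAPPER_KEY]
    raise ValueError(f"Expected a list or a dict with a {WRAPPER_KEY!r} list")


bare = normalize_items([{"item": "milk"}])
wrapped = normalize_items({"extractions": [{"item": "milk"}]})
print(bare == wrapped)
```

Raising on anything else (a string, a number, a dict without the key) keeps unexpected shapes from slipping through silently.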
## Validation and Strictness
You can control how strict the parser is.
* **strict_fences=True**: The output *must* contain markdown code blocks. If the model forgets them, the extraction fails.
* **strict_fences=False** (Default): If the model forgets the code blocks but returns valid JSON, langextract will still accept it.
You can set this in your resolver_params:
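```python
result = lx.extract(
    # ... other arguments as in the earlier example ...
    resolver_params={
        "strict_fences": True,  # Fail if no fenced json block found
        "format_type": "json"
    }
)
```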
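To see the difference between the two modes, here is a small standalone sketch. `extract_json` and `FENCE_PATTERN` are hypothetical and only mimic the behavior described above; they are not langextract code.

```python
import json
import re

# Hypothetical pattern and function for illustration only.
FENCE_PATTERN = re.compile(r"`{3}[A-Za-z]*[ \t]*\n(?P<body>.*?)`{3}", re.DOTALL)


def extract_json(text: str, strict_fences: bool = False):
    """Parse fenced JSON; fall back to the raw text only when not strict."""
    match = FENCE_PATTERN.search(text)
    if match:
        return json.loads(match.group("body"))
    if strict_fences:
        raise ValueError("No fenced code block found in model output")
    # Lenient mode: accept unfenced output if it is valid JSON on its own.
    return json.loads(text)


unfenced = '[{"item": "milk"}]'
lenient_result = extract_json(unfenced)  # accepted in lenient mode

try:
    extract_json(unfenced, strict_fences=True)
    strict_failed = False
except ValueError:
    strict_failed = True
```

Strict mode trades convenience for predictability: it rejects output that happens to parse but did not follow the requested format.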
## Conclusion
The **Format Handling** system makes langextract robust against the unpredictability of LLMs. It handles the "boring" work of:
1. Stripping out conversational filler.
2. Removing reasoning tags (`<think>`).
3. Normalizing lists vs. dictionaries.
4. Switching between JSON and YAML.
Now that we know *how* we parse the data, we need to understand *who* generates the data. How does langextract switch between OpenAI, Gemini, or Ollama?
Next Chapter: Provider Routing & Factory
Generated by Code IQ