Tutorial: Structured Data, Pydantic & LangChain Structured Outputs
In this tutorial, you’ll learn:
- What JSON Schema is and how it relates to models
- How to define data models using Pydantic BaseModel
- How to use TypedDict for typed dictionaries
- How to get structured JSON output from LLMs using:
  - Raw JSON schema (Python dict)
  - Pydantic BaseModel
  - TypedDict + Annotated
- How to use structured output with:
  - ChatOpenAI
  - ChatHuggingFace (TinyLlama endpoint)
1. JSON Schema Basics (Concept Level)
You started with:
{
  "title": "student",
  "description": "schema about students",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"}
  },
  "required": ["name"]
}
What is this?
This is a JSON Schema:
- type: "object" → it describes an object
- properties:
  - name: string
  - age: integer
- required: ["name"] → name must be present; age is optional.
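To make the roles of required and type concrete, here is a minimal sketch of a hand-written check against the student schema, with each property written as a {"type": ...} object (the form JSON Schema expects). The helper check_against_schema and the JSON_TO_PY mapping are hypothetical names introduced for illustration. The sketch handles only these two keywords and is not a full JSON Schema validator.

```python
# Minimal sketch: checks only "required" and simple "type" keywords.
# check_against_schema and JSON_TO_PY are illustrative names, not a library API.
student_schema = {
    "title": "student",
    "description": "schema about students",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
}

# Mapping from JSON Schema type names to Python types (small subset only).
JSON_TO_PY = {"string": str, "integer": int, "object": dict}


def check_against_schema(payload: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in payload:
            problems.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in payload:
            expected = JSON_TO_PY[rule["type"]]
            value = payload[key]
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"{key} should be of type {rule['type']}")
    return problems


print(check_against_schema({"name": "nitish"}, student_schema))
# []
print(check_against_schema({"age": 32}, student_schema))
# ['missing required field: name']
print(check_against_schema({"name": "nitish", "age": "32"}, student_schema))
# ['age should be of type integer']
```

Note that this check rejects "32" for age, while Pydantic (next section) would by default convert it to 32.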
When is JSON Schema used?
- To validate JSON payloads (APIs, configurations).
- To describe the structure of data for tools and LLMs.
- As a contract between systems (backend ↔ frontend, services).
❓ Why do we care here?
- LLMs can be guided to output JSON matching a schema.
- Libraries like LangChain and Pydantic internally map to/from JSON schema.
2. Pydantic BaseModel – Strongly Typed Student Model
Code
from typing import Optional

# EmailStr needs the optional email-validator package
from pydantic import BaseModel, EmailStr, Field


class Student(BaseModel):
    name: str = 'nitish'
    age: Optional[int] = None
    email: EmailStr
    cgpa: float = Field(
        gt=0,
        lt=10,
        default=5.0,
        description='A decimal value representing the cgpa of the student'
    )


# 'age' arrives as a string; Pydantic converts it to int
new_student = {'age': '32', 'email': 'abc@gmail.com'}
student = Student(**new_student)

student_dict = dict(student)
print(student_dict['age'])  # 32

student_json = student.model_dump_json()
What is happening?
- Student is a Pydantic model with:
  - name: str → default "nitish"
  - age: Optional[int] → may be None, but here "32" will be converted to 32
  - email: EmailStr → validates proper email format
  - cgpa: float with:
    - gt=0, lt=10 → must be strictly between 0 and 10
    - default 5.0
    - helpful description
- Student(**new_student):
  - age is "32" (string) → Pydantic converts to 32 (int)
  - email is validated
  - name uses default "nitish"
  - cgpa uses default 5.0
- model_dump_json() creates a JSON string like: {"name": "nitish", "age": 32, "email": "abc@gmail.com", "cgpa": 5.0}
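The example above shows only the happy path. As a rough sketch of what happens when input breaks the rules, the snippet below uses a hypothetical reduced model, StudentLite (the email field is left out so the sketch does not need the email-validator package), and passes it an unparseable age and an out-of-range cgpa.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class StudentLite(BaseModel):
    # Hypothetical reduced version of Student, for illustration only.
    name: str = 'nitish'
    age: Optional[int] = None
    cgpa: float = Field(gt=0, lt=10, default=5.0)


# A numeric string is converted to int.
ok = StudentLite(age='32')
print(ok.age, type(ok.age).__name__)  # 32 int

# A non-numeric age and a cgpa outside (0, 10) are both rejected.
bad_fields = []
try:
    StudentLite(age='thirty-two', cgpa=12)
except ValidationError as exc:
    # Each error records the field it belongs to in 'loc'.
    bad_fields = sorted(err['loc'][0] for err in exc.errors())
print(bad_fields)  # ['age', 'cgpa']
```

Pydantic reports all failing fields in a single ValidationError, which is convenient when the data comes from an LLM and several fields may be wrong at once.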
When to use BaseModel?
- Whenever you want:
  - Validation of input data
  - Type safety inside Python
  - JSON <-> Python object conversion
❓ Why is this important for LLM work?
- LLMs output free-form text by default.
- With Pydantic + LangChain, you can ask them to output structured, valid data that matches your model.
3. TypedDict – Typed Dictionaries Without Validation
Code
from typing import TypedDict


class Person(TypedDict):
    name: str
    age: int


new_person: Person = {'name': 'nitish', 'age': 35}
print(new_person)
What is TypedDict?
- A way to define the expected shape of a dictionary for type checkers.
- No runtime validation like Pydantic; just static typing help.
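A small sketch of what "no runtime validation" means in practice: the dictionary below has the wrong type for age, and Python still runs it. Only a static type checker such as mypy would report the mismatch. The variable name wrong_person is made up for this illustration.

```python
from typing import TypedDict


class Person(TypedDict):
    name: str
    age: int


# A static type checker would flag this assignment,
# but at runtime nothing checks the value of 'age'.
wrong_person: Person = {'name': 'nitish', 'age': 'thirty-five'}

print(type(wrong_person).__name__)  # dict
print(wrong_person['age'])          # thirty-five
```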
When to use TypedDict?
- When you:
  - Want type hints
  - Don’t need heavy validation
  - Prefer lightweight types (no Pydantic overhead)
❓ Why is this relevant?
- LangChain’s with_structured_output can work with:
  - JSON Schema dict
  - Pydantic BaseModel
  - TypedDict (+ Annotated descriptions)
- This gives you flexibility depending on your style.
4. Structured Output from ChatOpenAI Using Raw JSON Schema
Here you passed a JSON schema dict to with_structured_output.
Code
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load environment variables (e.g. OPENAI_API_KEY) from a .env file
load_dotenv()

model = ChatOpenAI()

# schema
json_schema = {
    "title": "Review",
    "type": "object",
    "properties": {
        "key_themes": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Write down all the key themes discussed in the review in a list"
        },
        "summary": {
            "type": "string",
            "description": "A brief summary of the review"
        },
        "sentiment": {
            "type": "string",
            "enum": ["pos", "neg"],
            "description": "Return sentiment of the review either negative or positive"
        },
        "pros": {
            "type": ["array", "null"],
            "items": {"type": "string"},
            "description": "Write down all the pros inside a list"
        },
        "cons": {
            "type": ["array", "null"],
            "items": {"type": "string"},
            "description": "Write down all the cons inside a list"
        },
        "name": {
            "type": ["string", "null"],
            "description": "Write the name of the reviewer"
        }
    },
    "required": ["key_themes", "summary", "sentiment"]
}

structured_model = model.with_structured_output(json_schema)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)
print(result)
What is happening?
- with_structured_output(json_schema):
  - Tells the model: “Your output must follow this JSON schema.”
- invoke(review_text):
  - LLM reads your review.
  - Extracts:
    - key_themes: list[str]
    - summary: string
    - sentiment: "pos" or "neg"
    - pros: list[str] or null
    - cons: list[str] or null
    - name: string or null
- result will be a Python dict matching the schema.
When to use raw JSON schema?
- When:
  - You’re comfortable with JSON Schema
  - You want full schema control (for tools, OpenAPI, etc.)
  - You’re not tied to Python type system only
❓ Why is this great?
- You get machine-usable structured data from an LLM in one step.
- No need to write brittle regex or JSON-parsing hacks.
Example result (approx):
{
  "key_themes": [
    "Powerful performance",
    "High-quality camera",
    "Long battery life",
    "S-Pen usefulness",
    "Heavy device and size",
    "Bloatware in One UI",
    "High price"
  ],
  "summary": "The reviewer is very impressed with the Galaxy S24 Ultra's performance, camera, and battery life, but dislikes the bulky design, Samsung bloatware, and high price.",
  "sentiment": "pos",
  "pros": [
    "Fast Snapdragon 8 Gen 3 processor",
    "Excellent 200MP camera with great zoom",
    "Strong battery life with fast charging",
    "S-Pen support for notes and sketches"
  ],
  "cons": [
    "Heavy and uncomfortable for one-handed use",
    "Bloatware in One UI",
    "Very expensive price tag"
  ],
  "name": "Nitish Singh"
}
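Because the raw-schema route returns a plain dict that nothing has validated, one option is to check it afterwards. The sketch below assumes Pydantic v2 and uses a hypothetical model, ReviewCheck, that mirrors the JSON schema; the reply dict is a hand-written stand-in, not real model output.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class ReviewCheck(BaseModel):
    # Hypothetical mirror of the Review JSON schema, used only for checking.
    key_themes: list[str]
    summary: str
    sentiment: Literal["pos", "neg"]
    pros: Optional[list[str]] = None
    cons: Optional[list[str]] = None
    name: Optional[str] = None


# Hand-written stand-in for a model reply with an out-of-enum sentiment.
reply = {
    "key_themes": ["High price"],
    "summary": "Good but expensive.",
    "sentiment": "neutral",
}

try:
    ReviewCheck.model_validate(reply)
    is_valid = True
except ValidationError:
    is_valid = False
print(is_valid)  # False

# The same dict passes once sentiment is one of the allowed values.
fixed = ReviewCheck.model_validate({**reply, "sentiment": "pos"})
print(fixed.sentiment)  # pos
```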
5. Structured Output with Pydantic BaseModel + HuggingFace (TinyLlama)
Now the same idea, but using a Pydantic model and a HuggingFace LLM.
Code
from typing import Optional, Literal

from dotenv import load_dotenv
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from pydantic import BaseModel, Field

# Load environment variables (e.g. the Hugging Face API token) from a .env file
load_dotenv()

llm = HuggingFaceEndpoint(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    task="text-generation"
)
model = ChatHuggingFace(llm=llm)


# schema
class Review(BaseModel):
    key_themes: list[str] = Field(
        description="Write down all the key themes discussed in the review in a list"
    )
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(
        description="Return sentiment of the review either negative or positive"
    )
    pros: Optional[list[str]] = Field(
        default=None,
        description="Write down all the pros inside a list"
    )
    cons: Optional[list[str]] = Field(
        default=None,
        description="Write down all the cons inside a list"
    )
    name: Optional[str] = Field(
        default=None,
        description="Write the name of the reviewer"
    )


structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)
print(result)
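What is happening?
- Review(BaseModel) defines the Python data model.
- with_structured_output(Review):
  - LangChain internally converts this to JSON schema.
  - Asks the LLM to return data that can be parsed as Review. Whether this works depends on the provider and model: a small model such as TinyLlama may not follow the schema reliably, and some integrations support only part of this feature.
- When parsing succeeds, the result is a Review instance (Pydantic object), not just a dict.
So you can do:
print(result.summary)
print(result.sentiment)
print(result.name)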
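To get a feel for the JSON schema that a Pydantic model maps to, the sketch below calls Pydantic v2's model_json_schema() on a shortened copy of Review. This shows Pydantic's own output only; the schema LangChain actually sends to a provider may differ in detail.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Review(BaseModel):
    # Shortened copy of the Review model, for illustration.
    key_themes: list[str] = Field(
        description="Write down all the key themes discussed in the review in a list"
    )
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(
        description="Return sentiment of the review either negative or positive"
    )
    name: Optional[str] = Field(default=None, description="Write the name of the reviewer")


schema = Review.model_json_schema()

# Fields without a default end up in "required".
print(schema["required"])                              # ['key_themes', 'summary', 'sentiment']
# Literal values become an enum.
print(schema["properties"]["sentiment"]["enum"])       # ['pos', 'neg']
# Field descriptions are carried over.
print(schema["properties"]["summary"]["description"])  # A brief summary of the review
```

The result has the same shape as the hand-written json_schema dict from section 4, which is why the two approaches are interchangeable from the model's point of view.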
When to use Pydantic + structured output?
- When you:
  - Want validation & type hints
  - Work in a Python backend
  - Need to plug result straight into your code
❓ Why is this powerful?
- End-to-end pipeline:
  - Raw text → LLM → Pydantic model → direct use in database / APIs
6. Structured Output with Pydantic + ChatOpenAI
Same Pydantic Review model, but using OpenAI instead of HuggingFace.
Code
from typing import Optional, Literal

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

load_dotenv()

model = ChatOpenAI()


# schema
class Review(BaseModel):
    key_themes: list[str] = Field(description="Write down all the key themes discussed in the review in a list")
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(description="Return sentiment of the review either negative or positive")
    pros: Optional[list[str]] = Field(default=None, description="Write down all the pros inside a list")
    cons: Optional[list[str]] = Field(default=None, description="Write down all the cons inside a list")
    name: Optional[str] = Field(default=None, description="Write the name of the reviewer")


structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)
print(result)
What’s different?
- Same idea, different provider.
- You still get a Review object.
Example usage:
print(result.key_themes)
print(result.pros)
print(result.cons)
print(result.name)
7. Structured Output with TypedDict + Annotated (ChatOpenAI)
Here’s a more lightweight type approach.
Code
from typing import TypedDict, Annotated, Optional, Literal

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()

model = ChatOpenAI()


# schema
class Review(TypedDict):
    key_themes: Annotated[list[str], "Write down all the key themes discussed in the review in a list"]
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Return sentiment of the review either negative or positive"]
    pros: Annotated[Optional[list[str]], "Write down all the pros inside a list"]
    cons: Annotated[Optional[list[str]], "Write down all the cons inside a list"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]


structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)
print(result['name'])
What is happening?
- Review is a TypedDict with Annotated descriptions.
- with_structured_output(Review):
  - Uses typing + annotations to build the schema.
- result is a plain dict, but type checkers know its structure.
Example result:
{
  "key_themes": [...],
  "summary": "...",
  "sentiment": "pos",
  "pros": [...],
  "cons": [...],
  "name": "Nitish Singh"
}
print(result['name']) → "Nitish Singh"
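To see where the Annotated descriptions live, the sketch below reads them back with the standard library's get_type_hints on a shortened copy of Review. It only illustrates that the types and descriptions are available for a schema builder to read; it is not LangChain's actual conversion code.

```python
from typing import Annotated, Literal, Optional, TypedDict, get_type_hints


class Review(TypedDict):
    # Shortened copy of the Review TypedDict, for illustration.
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Return sentiment of the review either negative or positive"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]


# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(Review, include_extras=True)
descriptions = {field: hint.__metadata__[0] for field, hint in hints.items()}
print(descriptions["summary"])  # A brief summary of the review

# Optional[...] allows the value None; the key itself is still required.
print(sorted(Review.__required_keys__))  # ['name', 'sentiment', 'summary']
```

One consequence worth noting: in a TypedDict, Optional[str] does not make the key optional for type checkers. If a key may be absent altogether, typing.NotRequired (Python 3.11+) expresses that.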
When to use this?
- When you:
  - Want static typing but don’t need Pydantic
  - Prefer minimal dependencies
  - Still want structured outputs from LLM
❓ Why use TypedDict + Annotated?
- Lightweight
- Works nicely with mypy / type-checkers
- You still get descriptions for the LLM to follow
8. Big Picture: Which Structured Output Style to Use?
| Approach | Type | Runtime Validation | Best For |
|---|---|---|---|
| Raw JSON Schema (dict) | Dict | No (LLM constrained only) | Multi-language / tool-level schema |
| BaseModel (Pydantic) | Class | ✅ Yes | Python backends, APIs, DB integration |
| TypedDict + Annotated | Dict type | ❌ No | Lightweight typing, fast, simple |
9. Why Structured Output Matters for LLM Apps
Without structured output:
- You get free text → must parse manually
- More chances of errors (missing fields, invalid JSON, etc.)
With structured output:
- LLM output → parsed object/dict (also validated when you use a Pydantic model)
- You can directly:
  - Save to DB
  - Return in API
  - Feed into next processing step
- This is critical for production-grade AI features where you need reliable data, not just pretty text.
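A minimal sketch of the difference, using only the standard library. Both replies are hand-written stand-ins rather than real model output: the first imitates a chatty free-text answer that wraps JSON in prose, the second imitates the dict a structured-output call hands back.

```python
import json

# Hypothetical free-text reply: JSON wrapped in conversational prose.
free_text_reply = (
    'Sure! Here is the review summary:\n'
    '{"summary": "Great phone", "sentiment": "pos"}'
)

# Parsing the whole reply fails because it does not start with JSON.
try:
    json.loads(free_text_reply)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)  # False

# Hypothetical structured reply: already a dict, no parsing step needed.
structured_reply = {"summary": "Great phone", "sentiment": "pos"}
print(structured_reply["sentiment"])  # pos
```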
The table below compares three approaches when used with LangChain structured outputs: JSON Schema (dict), Pydantic BaseModel, and TypedDict + Annotated.
Comparison Table: Structured Output Methods in LangChain
| Feature / Aspect | JSON Schema (dict) | Pydantic BaseModel | TypedDict + Annotated |
|---|---|---|---|
| Definition Type | Python dictionary describing JSON schema | Python class extending BaseModel | Python TypedDict with Annotated descriptions |
| Runtime Validation | ❌ No validation (LLM must comply) | ✅ Yes (validated by Pydantic) | ❌ No runtime validation |
| Output Type | dict | Pydantic model instance | dict |
| Error Handling if Output Invalid | ❌ You must manually check | ✅ Pydantic raises validation errors | ❌ No built-in guarantees |
| Best Use Case | Tooling, API schema, cross-language systems | Backend apps needing clean, validated objects | Lightweight typing with minimal overhead |
| Ease of Use | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Flexibility / Customization | ⭐⭐⭐⭐⭐ (full JSON schema control) | ⭐⭐⭐⭐ (rich field types) | ⭐⭐⭐ (simple types only) |
| Type Safety | ❌ No | ✅ Strong | ⚠️ Static only (type checkers) |
| Performance | Fast (no validation) | Slightly slower (validation overhead) | Fast (no validation) |
| Works With | Chat models that support with_structured_output | Chat models that support with_structured_output | Chat models that support with_structured_output |
| Ideal For | Multi-language systems, OpenAPI, strict schema control | Python apps, APIs, DB pipelines | Quick typing, simple extraction tasks |
| Description Support | Medium (via description fields) | Strong (via Field(description=...)) | Strong (via Annotated) |
| Nested Complex Structures | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ (less flexible) |
| Strictness | Low | High | Medium |
| Use in Production | ⚠️ Only if model output reliable | ✅ Yes, recommended | ⚠️ For simple use cases |
| Requires External Library | ❌ No | ✅ Yes → Pydantic | ❌ No |
| Automatic JSON Serialization | Manual | Built-in (model_dump_json()) | Manual |
Summary in Simple Words
1. JSON Schema → Specification
- Best when you need a standard schema
- Great for cross-language use
- No validation → LLM must obey
2. Pydantic BaseModel → Strict Validation
- Validates the structured output and raises an error when it does not match the model
- Perfect for backends, APIs, databases
- Most reliable for production
3. TypedDict + Annotated → Lightweight
- No validation, faster
- Good for simple tasks
- Best when you want type hints but don’t want heavy models
Which One Should YOU Use?
| Need | Choose |
|---|---|
| Production app, strict typing | Pydantic BaseModel |
| Tool integration / OpenAPI / external systems | JSON Schema |
| Lightweight & fast | TypedDict + Annotated |
| Most predictable results | Pydantic BaseModel |