🧩 Tutorial: Structured Data, Pydantic & LangChain Structured Outputs

In this tutorial, you’ll learn:

What JSON Schema is and how it relates to models
How to define data models using Pydantic BaseModel
How to use TypedDict for typed dictionaries
How to get structured JSON output from LLMs using:
- Raw JSON schema (Python dict)
- Pydantic BaseModel
- TypedDict + Annotated
How to use structured output with:
- ChatOpenAI
- ChatHuggingFace (TinyLlama endpoint)

1. JSON Schema Basics (Concept Level)

Consider this schema:

{
    "title": "student",
    "description": "schema about students",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name"]
}

🔍 What is this?

This is a JSON Schema:

type: "object" → it describes an object
properties:
- name: string
- age: integer
required: ["name"] → name must be present; age is optional.
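To make the `required` and `type` rules concrete, here is a minimal hand-rolled check in plain Python. It is only a sketch: `check_student` and `TYPE_MAP` are hypothetical names, each property is written as a `{"type": ...}` object as JSON Schema expects, and only the two keywords discussed above are covered. A real project would typically rely on a dedicated validator library instead.

```python
# Simplified illustration of "required" and "type"; not a full JSON Schema validator.
student_schema = {
    "title": "student",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
}

# Map JSON Schema type names to Python types (only the two used here).
TYPE_MAP = {"string": str, "integer": int}


def check_student(payload: dict, schema: dict = student_schema) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    for key in schema["required"]:
        if key not in payload:
            problems.append(f"missing required field: {key}")
    for key, rule in schema["properties"].items():
        if key in payload and not isinstance(payload[key], TYPE_MAP[rule["type"]]):
            problems.append(f"{key} should be of type {rule['type']}")
    return problems


print(check_student({"name": "nitish"}))               # age is optional
print(check_student({"age": 32}))                      # name is missing
print(check_student({"name": "nitish", "age": "32"}))  # age has the wrong type
```

Unlike Pydantic (covered next), this check does not coerce `"32"` into `32`; it only reports the mismatch.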

🕒 When is JSON Schema used?

To validate JSON payloads (APIs, configurations).
To describe the structure of data for tools and LLMs.
As a contract between systems (backend ↔ frontend, services).

❓ Why do we care here?

LLMs can be guided to output JSON matching a schema.
Libraries like LangChain and Pydantic can convert their models to and from JSON Schema.

2. Pydantic `BaseModel` – Strongly Typed Student Model

💻 Code

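# EmailStr needs the optional email-validator package: pip install "pydantic[email]"
from typing import Optional

from pydantic import BaseModel, EmailStr, Field

class Student(BaseModel):
    name: str = 'nitish'            # default used when name is not supplied
    age: Optional[int] = None
    email: EmailStr                 # required; validated as an email address
    cgpa: float = Field(
        default=5.0,
        gt=0,
        lt=10,
        description='A decimal value representing the cgpa of the student'
    )

# age arrives as a string; Pydantic coerces it to an int
new_student = {'age': '32', 'email': 'abc@gmail.com'}

student = Student(**new_student)

# model_dump() returns the fields as a plain dict
student_dict = student.model_dump()

print(student_dict['age'])

# model_dump_json() returns the same data as a JSON string
student_json = student.model_dump_json()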

🔍 What is happening?

Student is a Pydantic model with:
- name: str → default "nitish"
- age: Optional[int] → may be None, but here "32" will be converted to 32
- email: EmailStr → validates proper email format
- cgpa: float with:
  - gt=0, lt=10 → must be strictly greater than 0 and strictly less than 10
  - default 5
  - helpful description
Student(**new_student):
- age is "32" (string) → Pydantic converts to 32 (int)
- email is validated
- name uses default "nitish"
- cgpa uses default 5.0

model_dump_json() creates a compact JSON string like:

{"name":"nitish","age":32,"email":"abc@gmail.com","cgpa":5.0}
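The constraints are easiest to appreciate when one of them fails. The sketch below assumes Pydantic v2 is installed and uses a trimmed-down model (`StudentLite` is a hypothetical name; the email field is left out so the example does not need the extra email-validator package). It shows a `ValidationError` being raised for an out-of-range cgpa.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class StudentLite(BaseModel):
    name: str = "nitish"
    age: Optional[int] = None
    cgpa: float = Field(default=5.0, gt=0, lt=10)


# Valid input: the numeric string "32" is coerced to the int 32.
ok = StudentLite(age="32", cgpa=8.5)
print(ok.age, ok.cgpa)

# Invalid input: cgpa must be strictly less than 10.
try:
    StudentLite(cgpa=12)
    failed_fields = []
except ValidationError as exc:
    # exc.errors() returns one dict per failing field; "loc" names the field.
    failed_fields = [err["loc"][0] for err in exc.errors()]

print(failed_fields)
```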

🕒 When to use `BaseModel`?

Whenever you want:
- Validation of input data
- Type safety inside Python
- JSON <-> Python object conversion

❓ Why is this important for LLM work?

LLMs output free-form text by default.
With Pydantic + LangChain, you can ask them to output structured, valid data that matches your model.
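As a rough picture of what "structured, valid data" means, the sketch below parses a JSON string into a model with Pydantic v2's `model_validate_json`. The strings are hand-written to stand in for model replies (no LLM is called), and `StudentOut` is a hypothetical name.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class StudentOut(BaseModel):
    name: str
    age: Optional[int] = None


# Hand-written stand-ins for LLM replies; no model is called here.
good_reply = '{"name": "nitish", "age": 32}'
bad_reply = '{"age": "unknown"}'

student = StudentOut.model_validate_json(good_reply)
print(student.name, student.age)

try:
    StudentOut.model_validate_json(bad_reply)
    rejected = False
except ValidationError:
    # A missing name and a non-numeric age are both reported.
    rejected = True

print(rejected)
```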

3. `TypedDict` – Typed Dictionaries Without Validation

💻 Code

from typing import TypedDict

class Person(TypedDict):
    name: str
    age: int

new_person: Person = {'name': 'nitish', 'age': 35}

print(new_person)

🔍 What is `TypedDict`?

A way to define the expected shape of a dictionary for type checkers.
No runtime validation like Pydantic; just static typing help.
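A quick way to see the "no runtime validation" point: the snippet below runs without error even though `age` holds a string. A static checker such as mypy would flag the assignment, but the Python interpreter does not.

```python
from typing import TypedDict


class Person(TypedDict):
    name: str
    age: int


# Wrong type for age: a type checker reports this, the interpreter does not.
bad_person: Person = {"name": "nitish", "age": "thirty-five"}

print(type(bad_person))   # at runtime a TypedDict value is an ordinary dict
print(bad_person["age"])
```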

🕒 When to use TypedDict?

When you:
- Want type hints
- Don’t need heavy validation
- Prefer lightweight types (no Pydantic overhead)

❓ Why is this relevant?

LangChain’s with_structured_output can work with:
- JSON Schema dict
- Pydantic BaseModel
- TypedDict (+ Annotated descriptions)

This gives you flexibility depending on your style.

4. Structured Output from `ChatOpenAI` Using Raw JSON Schema

Here we pass a JSON Schema dict to with_structured_output.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

# schema
json_schema = {
  "title": "Review",
  "type": "object",
  "properties": {
    "key_themes": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Write down all the key themes discussed in the review in a list"
    },
    "summary": {
      "type": "string",
      "description": "A brief summary of the review"
    },
    "sentiment": {
      "type": "string",
      "enum": ["pos", "neg"],
      "description": "Return sentiment of the review, either positive (pos) or negative (neg)"
    },
    "pros": {
      "type": ["array", "null"],
      "items": {"type": "string"},
      "description": "Write down all the pros inside a list"
    },
    "cons": {
      "type": ["array", "null"],
      "items": {"type": "string"},
      "description": "Write down all the cons inside a list"
    },
    "name": {
      "type": ["string", "null"],
      "description": "Write the name of the reviewer"
    }
  },
  "required": ["key_themes", "summary", "sentiment"]
}

structured_model = model.with_structured_output(json_schema)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What is happening?

with_structured_output(json_schema):
- Tells the model: “Your output must follow this JSON schema.”
invoke(review_text):
- LLM reads your review.
- Extracts:
  - key_themes: list[str]
  - summary: string
  - sentiment: "pos" or "neg"
  - pros: list[str] or null
  - cons: list[str] or null
  - name: string or null
result will be a Python dict matching the schema.
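Because `pros`, `cons`, and `name` are not listed in `required`, code that consumes the dict should tolerate missing or null values. A minimal sketch, using a hand-written dict in place of a real model reply (`format_review` is a hypothetical helper):

```python
def format_review(result: dict) -> str:
    """Build a one-line report from a review dict, tolerating absent optional keys."""
    # .get() covers a missing key; "or" covers an explicit None (JSON null).
    name = result.get("name") or "anonymous"
    pros = result.get("pros") or []
    cons = result.get("cons") or []
    return (
        f"{name}: {result['sentiment']} "
        f"({len(pros)} pros, {len(cons)} cons) - {result['summary']}"
    )


# Hand-written sample shaped like the schema; not actual model output.
sample = {
    "key_themes": ["performance", "price"],
    "summary": "Fast phone, high price.",
    "sentiment": "pos",
    "pros": ["Fast processor"],
    "cons": None,
}

print(format_review(sample))
```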

🕒 When to use raw JSON schema?

When:
- You’re comfortable with JSON Schema
- You want full schema control (for tools, OpenAPI, etc.)
- You don’t want the schema tied to Python’s type system

❓ Why is this great?

You get machine-usable structured data from an LLM in one step.
No need to write brittle regex or JSON-parsing hacks.

🧾 Example result (approx):

{
  "key_themes": [
    "Powerful performance",
    "High-quality camera",
    "Long battery life",
    "S-Pen usefulness",
    "Heavy device and size",
    "Bloatware in One UI",
    "High price"
  ],
  "summary": "The reviewer is very impressed with the Galaxy S24 Ultra's performance, camera, and battery life, but dislikes the bulky design, Samsung bloatware, and high price.",
  "sentiment": "pos",
  "pros": [
    "Fast Snapdragon 8 Gen 3 processor",
    "Excellent 200MP camera with great zoom",
    "Strong battery life with fast charging",
    "S-Pen support for notes and sketches"
  ],
  "cons": [
    "Heavy and uncomfortable for one-handed use",
    "Bloatware in One UI",
    "Very expensive price tag"
  ],
  "name": "Nitish Singh"
}

5. Structured Output with Pydantic `BaseModel` + HuggingFace (TinyLlama)

Now the same idea, but using a Pydantic model and a HuggingFace-hosted LLM.

💻 Code

from dotenv import load_dotenv
from typing import Optional, Literal
from pydantic import BaseModel, Field
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

load_dotenv()

llm = HuggingFaceEndpoint(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    task="text-generation"
)

model = ChatHuggingFace(llm=llm)

# schema
class Review(BaseModel):
    key_themes: list[str] = Field(
        description="Write down all the key themes discussed in the review in a list"
    )
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(
        description="Return sentiment of the review, either positive (pos) or negative (neg)"
    )
    pros: Optional[list[str]] = Field(
        default=None,
        description="Write down all the pros inside a list"
    )
    cons: Optional[list[str]] = Field(
        default=None,
        description="Write down all the cons inside a list"
    )
    name: Optional[str] = Field(
        default=None,
        description="Write the name of the reviewer"
    )

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What is happening?

Review(BaseModel) defines the Python data model.
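with_structured_output(Review):
- LangChain converts the model to a JSON schema and passes it to the LLM.
- The reply is parsed and validated as a Review. A small model such as TinyLlama may not always produce output that parses, in which case the call can fail.
Result is a Review instance (Pydantic object), not just a dict.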

So you can do:

print(result.summary)
print(result.sentiment)
print(result.name)
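To get a feel for the schema derived from the class, Pydantic v2 can print its own JSON Schema via `model_json_schema()`. This shows Pydantic's output only; the exact schema LangChain sends to a given provider may differ. A shortened `Review` with three fields (and shortened descriptions) is used here.

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Review(BaseModel):
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(description="Sentiment of the review")
    name: Optional[str] = Field(default=None, description="Name of the reviewer")


schema = Review.model_json_schema()
print(json.dumps(schema, indent=2))

# Fields without a default end up in "required"; name does not.
print(schema["required"])
```

The `Literal` annotation becomes an `enum` in the schema, which is the same constraint the raw JSON Schema version in section 4 spelled out by hand.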

🕒 When to use Pydantic + structured output?

When you:
- Want validation & type hints
- Work in a Python backend
- Need to plug result straight into your code

❓ Why is this powerful?

End-to-end pipeline:
- Raw text → LLM → Pydantic model → direct use in database / APIs
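The last hop of that pipeline can be sketched as follows, assuming Pydantic v2 and Python's built-in `sqlite3`. The `Review` instance is constructed by hand to stand in for the object returned by the structured model, and the table name and columns are made up for illustration.

```python
import json
import sqlite3
from typing import Literal, Optional

from pydantic import BaseModel


class Review(BaseModel):
    key_themes: list[str]
    summary: str
    sentiment: Literal["pos", "neg"]
    name: Optional[str] = None


# Stand-in for the object returned by structured_model.invoke(...).
result = Review(
    key_themes=["performance", "price"],
    summary="Fast phone, high price.",
    sentiment="pos",
    name="Nitish Singh",
)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE reviews (name TEXT, sentiment TEXT, summary TEXT, key_themes TEXT)"
)

row = result.model_dump()  # plain dict of field values
conn.execute(
    "INSERT INTO reviews VALUES (:name, :sentiment, :summary, :key_themes)",
    # SQLite has no list type, so the themes are stored as a JSON string.
    {**row, "key_themes": json.dumps(row["key_themes"])},
)

stored = conn.execute("SELECT name, sentiment FROM reviews").fetchone()
print(stored)
conn.close()
```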

6. Structured Output with Pydantic + `ChatOpenAI`

Same Pydantic Review model, but using OpenAI instead of HuggingFace.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
from typing import Optional, Literal
from pydantic import BaseModel, Field

load_dotenv()

model = ChatOpenAI()

# schema
class Review(BaseModel):
    key_themes: list[str] = Field(description="Write down all the key themes discussed in the review in a list")
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(description="Return sentiment of the review, either positive (pos) or negative (neg)")
    pros: Optional[list[str]] = Field(default=None, description="Write down all the pros inside a list")
    cons: Optional[list[str]] = Field(default=None, description="Write down all the cons inside a list")
    name: Optional[str] = Field(default=None, description="Write the name of the reviewer")

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What’s different?

Same idea, different provider.
You still get a Review object.

Example usage:

print(result.key_themes)
print(result.pros)
print(result.cons)
print(result.name)

7. Structured Output with `TypedDict` + `Annotated` (ChatOpenAI)

Here’s a more lightweight type approach.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
from typing import TypedDict, Annotated, Optional, Literal

load_dotenv()

model = ChatOpenAI()

# schema
class Review(TypedDict):
    key_themes: Annotated[list[str], "Write down all the key themes discussed in the review in a list"]
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Return sentiment of the review, either positive (pos) or negative (neg)"]
    pros: Annotated[Optional[list[str]], "Write down all the pros inside a list"]
    cons: Annotated[Optional[list[str]], "Write down all the cons inside a list"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result['name'])

🔍 What is happening?

Review is a TypedDict with Annotated descriptions.
with_structured_output(Review):
- Uses typing + annotations to build the schema.
result is a plain dict, but type checkers know its structure.

Example result:

{
  "key_themes": [...],
  "summary": "...",
  "sentiment": "pos",
  "pros": [...],
  "cons": [...],
  "name": "Nitish Singh"
}

print(result['name']) → "Nitish Singh"
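The descriptions attached with `Annotated` are ordinary metadata that any library can read back at runtime. The sketch below uses only the standard library to recover them from a shortened `Review`; it illustrates the mechanism and is not a reproduction of LangChain's internal conversion.

```python
from typing import Annotated, Literal, Optional, TypedDict, get_type_hints


class Review(TypedDict):
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Sentiment of the review"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]


# include_extras=True keeps the Annotated wrapper so the metadata stays visible.
hints = get_type_hints(Review, include_extras=True)

# The first metadata item of each field is the description string.
descriptions = {field: hint.__metadata__[0] for field, hint in hints.items()}

for field, text in descriptions.items():
    print(f"{field}: {text}")
```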

🕒 When to use this?

When you:
- Want static typing but don’t need Pydantic
- Prefer minimal dependencies
- Still want structured outputs from LLM

❓ Why use `TypedDict` + `Annotated`?

Lightweight
Works nicely with mypy / type-checkers
You still get descriptions for the LLM to follow

8. Big Picture: Which Structured Output Style to Use?

Approach	Type	Runtime Validation	Best For
Raw JSON Schema (dict)	Dict	No (LLM constrained only)	Multi-language / tool-level schema
`BaseModel` (Pydantic)	Class	✅ Yes	Python backends, APIs, DB integration
`TypedDict` + `Annotated`	Dict type	❌ No	Lightweight typing, fast, simple

9. Why Structured Output Matters for LLM Apps

Without structured output:

You get free text → must parse manually
More chances of errors (missing fields, invalid JSON, etc.)
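To illustrate why manual parsing is fragile, the sketch below feeds a chatty reply (hand-written, not actual model output) to `json.loads`, then applies a common brace-slicing workaround.

```python
import json

# Hand-written example of a chatty reply; not actual model output.
free_text_reply = 'Sure! Here is the result: {"sentiment": "pos"} Hope that helps.'

try:
    json.loads(free_text_reply)
    parsed_directly = True
except json.JSONDecodeError:
    # The surrounding prose makes the whole string invalid JSON.
    parsed_directly = False

# A common workaround: slice from the first "{" to the last "}".
# It breaks as soon as the prose itself contains a brace.
start, end = free_text_reply.find("{"), free_text_reply.rfind("}")
extracted = json.loads(free_text_reply[start : end + 1])

print(parsed_directly)
print(extracted)
```

The workaround succeeds here, but it depends on the reply's exact wording, which is the kind of hack structured output is meant to replace.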

With structured output:

LLM output → parsed object/dict (validated at runtime when you use Pydantic)
You can directly:
- Save to DB
- Return in API
- Feed into next processing step

This is critical for production-grade AI features where you need reliable data, not just pretty text.

The table below compares JSON Schema (dict), Pydantic BaseModel, and TypedDict + Annotated when used with LangChain structured outputs.

📊 Comparison Table — Structured Output Methods in LangChain

Feature / Aspect	JSON Schema (dict)	Pydantic BaseModel	TypedDict + Annotated
Definition Type	Python dictionary describing JSON schema	Python class extending `BaseModel`	Python `TypedDict` with `Annotated` descriptions
Runtime Validation	❌ No validation (LLM must comply)	✅ Yes (strict validation by Pydantic)	❌ No runtime validation
Output Type	`dict`	Pydantic model instance	`dict`
Error Handling if Output Invalid	❌ You must manually check	✅ Pydantic raises validation errors	❌ No built-in guarantees
Best Use Case	Tooling, API schema, cross-language systems	Backend apps needing clean, validated objects	Lightweight typing with minimal overhead
Ease of Use	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Flexibility / Customization	⭐⭐⭐⭐⭐ (full JSON schema control)	⭐⭐⭐⭐ (rich field types)	⭐⭐⭐ (simple types only)
Type Safety	❌ No	✅ Strong	⚠️ Static only (type checkers)
Performance	Fast (no validation)	Slightly slower (validation overhead)	Fast (no validation)
Works With	Chat models that support tool calling or JSON mode	Chat models that support tool calling or JSON mode	Chat models that support tool calling or JSON mode
Ideal For	Multi-language systems, OpenAPI, strict schema control	Python apps, APIs, DB pipelines	Quick typing, simple extraction tasks
Description Support	Medium (via `description` fields)	Strong (via `Field(description=...)`)	Strong (via `Annotated`)
Nested Complex Structures	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐ (less flexible)
Strictness	Low	High	Medium
Use in Production	⚠️ Only if model output reliable	✅ Yes, recommended	⚠️ For simple use cases
Requires External Library	❌ No	✅ Yes → Pydantic	❌ No
Automatic JSON Serialization	Manual	Built-in (`model_dump_json()`)	Manual

🧭 Summary in Simple Words

1. JSON Schema → Specification

Best when you need a standard schema
Great for cross-language use
No validation → LLM must obey

2. Pydantic BaseModel → Strict Validation

Ensures correct & clean structured output
Perfect for backends, APIs, databases
Most reliable for production

3. TypedDict + Annotated → Lightweight

No validation, faster
Good for simple tasks
Best when you want type hints but don’t want heavy models

🏅 Which One Should YOU Use?

Need	Choose
Production app, strict typing	Pydantic BaseModel
Tool integration / OpenAPI / external systems	JSON Schema
Lightweight & fast	TypedDict + Annotated
Most predictable results	Pydantic BaseModel

Friday, November 28, 2025
