Friday, November 28, 2025

Structured Data, Pydantic & LangChain Structured Outputs

🧩 Tutorial: Structured Data, Pydantic & LangChain Structured Outputs

In this tutorial, you’ll learn:

  • What JSON Schema is and how it relates to models

  • How to define data models using Pydantic BaseModel

  • How to use TypedDict for typed dictionaries

  • How to get structured JSON output from LLMs using:

    • Raw JSON schema (Python dict)

    • Pydantic BaseModel

    • TypedDict + Annotated

  • How to use structured output with:

    • ChatOpenAI

    • ChatHuggingFace (TinyLlama endpoint)


1. JSON Schema Basics (Concept Level)

Start with a simple schema:

{
    "title": "student",
    "description": "schema about students",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name"]
}

🔍 What is this?

This is a JSON Schema:

  • type: "object" → it describes an object

  • properties:

    • name: string

    • age: integer

  • required: ["name"] → name must be present; age is optional.

🕒 When is JSON Schema used?

  • To validate JSON payloads (APIs, configurations).

  • To describe the structure of data for tools and LLMs.

  • As a contract between systems (backend ↔ frontend, services).

❓ Why do we care here?

  • LLMs can be guided to output JSON matching a schema.

  • Libraries like LangChain and Pydantic internally map to/from JSON schema.
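
To make the idea of validating a payload concrete, here is a minimal sketch that checks a dict against the student schema using only the standard library. It assumes the properties are written in the full form {"type": ...}, and the helper name check_against_schema is made up for illustration. It handles only the "required" and "type" keywords, so treat it as a teaching aid rather than a real validator.

```python
# Minimal sketch: checks only "required" and simple "type" keywords.
# A real project would use a full JSON Schema validator library instead.

student_schema = {
    "title": "student",
    "description": "schema about students",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name"],
}

# Map JSON Schema type names to Python types (only the subset used here).
JSON_TO_PY = {"string": str, "integer": int, "object": dict}


def check_against_schema(payload, schema):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(payload, dict):
        return ["payload is not an object"]
    problems = []
    for key in schema.get("required", []):
        if key not in payload:
            problems.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in payload:
            expected = JSON_TO_PY[rule["type"]]
            value = payload[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"{key} should be of type {rule['type']}")
    return problems


print(check_against_schema({"name": "nitish"}, student_schema))
print(check_against_schema({"age": "32"}, student_schema))
```

The first call reports no problems because only name is required. The second reports both the missing name and the wrongly typed age.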


2. Pydantic BaseModel – Strongly Typed Student Model

💻 Code

from typing import Optional

# EmailStr needs the optional email-validator package: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr, Field

class Student(BaseModel):
    name: str = 'nitish'
    age: Optional[int] = None
    email: EmailStr
    cgpa: float = Field(
        gt=0,
        lt=10,
        default=5.0,
        description='A decimal value representing the cgpa of the student'
    )

# 'age' is passed as a string on purpose to show type coercion
new_student = {'age': '32', 'email': 'abc@gmail.com'}

student = Student(**new_student)

# Convert the model to a plain dict
student_dict = student.model_dump()

print(student_dict['age'])

# Serialize the model to a JSON string
student_json = student.model_dump_json()

🔍 What is happening?

  • Student is a Pydantic model with:

    • name: str → default "nitish"

    • age: Optional[int] → may be None, but here "32" will be converted to 32

    • email: EmailStr → validates proper email format

    • cgpa: float with:

      • gt=0, lt=10 → must be strictly greater than 0 and strictly less than 10

      • default 5

      • helpful description

  • Student(**new_student):

    • age is "32" (string) → Pydantic converts to 32 (int)

    • email is validated

    • name uses default "nitish"

    • cgpa uses default 5.0

  • model_dump_json() creates JSON string like:

    {"name": "nitish", "age": 32, "email": "abc@gmail.com", "cgpa": 5.0}
    

🕒 When to use BaseModel?

  • Whenever you want:

    • Validation of input data

    • Type safety inside Python

    • JSON <-> Python object conversion

❓ Why is this important for LLM work?

  • LLMs output free-form text by default.

  • With Pydantic + LangChain, you can ask them to output structured, valid data that matches your model.
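
The coercion and range checks described in this section can be tried with a small sketch. It uses a simplified copy of the Student model (the email field is left out so the optional email-validator package is not needed) and assumes Pydantic is installed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Student(BaseModel):
    # Simplified copy of the model above; the email field is omitted so the
    # sketch does not depend on the optional email-validator package.
    name: str = "nitish"
    age: Optional[int] = None
    cgpa: float = Field(default=5.0, gt=0, lt=10)


# A numeric string is coerced to int in Pydantic's default (lax) mode.
ok = Student(age="32")
print(ok.age, type(ok.age).__name__)

# An out-of-range cgpa is rejected with a ValidationError.
try:
    Student(cgpa=12)
except ValidationError as exc:
    first_error = exc.errors()[0]
    print("rejected field:", first_error["loc"])
```

Catching ValidationError like this is the usual place to log the problem or to ask the model to try again.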


3. TypedDict – Typed Dictionaries Without Validation

💻 Code

from typing import TypedDict

class Person(TypedDict):
    name: str
    age: int

new_person: Person = {'name': 'nitish', 'age': 35}

print(new_person)

🔍 What is TypedDict?

  • A way to define the expected shape of a dictionary for type checkers.

  • No runtime validation like Pydantic; just static typing help.
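
The missing runtime validation is easy to see with a short sketch. The wrongly typed value below is deliberate; Python accepts it, and only a static type checker would complain.

```python
from typing import TypedDict, get_type_hints


class Person(TypedDict):
    name: str
    age: int


# Wrong on purpose: age is a string. Python runs this without complaint;
# only a static type checker such as mypy would flag the assignment.
bad_person: Person = {"name": "nitish", "age": "thirty-five"}  # type: ignore

# At runtime a TypedDict value is just a plain dict.
print(type(bad_person).__name__)

# The declared field types are still available for tools to inspect.
print(get_type_hints(Person))
print(sorted(Person.__required_keys__))
```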

🕒 When to use TypedDict?

  • When you:

    • Want type hints

    • Don’t need heavy validation

    • Prefer lightweight types (no Pydantic overhead)

❓ Why is this relevant?

  • LangChain’s with_structured_output can work with:

    • JSON Schema dict

    • Pydantic BaseModel

    • TypedDict (+ Annotated descriptions)

This gives you flexibility depending on your style.


4. Structured Output from ChatOpenAI Using Raw JSON Schema

Here, a JSON schema dict is passed to with_structured_output.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

# schema
json_schema = {
  "title": "Review",
  "type": "object",
  "properties": {
    "key_themes": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Write down all the key themes discussed in the review in a list"
    },
    "summary": {
      "type": "string",
      "description": "A brief summary of the review"
    },
    "sentiment": {
      "type": "string",
      "enum": ["pos", "neg"],
      "description": "Return the sentiment of the review, either positive (pos) or negative (neg)"
    },
    "pros": {
      "type": ["array", "null"],
      "items": {"type": "string"},
      "description": "Write down all the pros inside a list"
    },
    "cons": {
      "type": ["array", "null"],
      "items": {"type": "string"},
      "description": "Write down all the cons inside a list"
    },
    "name": {
      "type": ["string", "null"],
      "description": "Write the name of the reviewer"
    }
  },
  "required": ["key_themes", "summary", "sentiment"]
}

structured_model = model.with_structured_output(json_schema)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What is happening?

  • with_structured_output(json_schema):

    • Tells the model: “Your output must follow this JSON schema.”

  • invoke(review_text):

    • LLM reads your review.

    • Extracts:

      • key_themes: list[str]

      • summary: string

      • sentiment: "pos" or "neg"

      • pros: list[str] or null

      • cons: list[str] or null

      • name: string or null

  • result will be a Python dict matching the schema.

🕒 When to use raw JSON schema?

  • When:

    • You’re comfortable with JSON Schema

    • You want full schema control (for tools, OpenAPI, etc.)

    • You don’t want to be tied to Python’s type system

❓ Why is this great?

  • You get machine-usable structured data from an LLM in one step.

  • No need to write brittle regex or JSON-parsing hacks.

🧾 Example result (approx):

{
  "key_themes": [
    "Powerful performance",
    "High-quality camera",
    "Long battery life",
    "S-Pen usefulness",
    "Heavy device and size",
    "Bloatware in One UI",
    "High price"
  ],
  "summary": "The reviewer is very impressed with the Galaxy S24 Ultra's performance, camera, and battery life, but dislikes the bulky design, Samsung bloatware, and high price.",
  "sentiment": "pos",
  "pros": [
    "Fast Snapdragon 8 Gen 3 processor",
    "Excellent 200MP camera with great zoom",
    "Strong battery life with fast charging",
    "S-Pen support for notes and sketches"
  ],
  "cons": [
    "Heavy and uncomfortable for one-handed use",
    "Bloatware in One UI",
    "Very expensive price tag"
  ],
  "name": "Nitish Singh"
}
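
Because the raw-schema route returns a plain dict with no validation on the Python side, it can be worth checking the result before using it. The sketch below uses stand-in data loosely based on the example result above, so it runs without calling a model; the variable names are illustrative only.

```python
# Stand-in data loosely based on the example result above, so the sketch
# runs without calling a model.
result = {
    "key_themes": ["Powerful performance", "High price"],
    "summary": "Impressed with performance, but dislikes the high price.",
    "sentiment": "pos",
    "pros": ["Strong battery life with fast charging"],
    "cons": None,  # JSON null arrives as Python None
    "name": "Nitish Singh",
}

REQUIRED = ["key_themes", "summary", "sentiment"]  # from the schema's "required"
ALLOWED_SENTIMENTS = {"pos", "neg"}                # from the schema's "enum"

missing = [key for key in REQUIRED if key not in result]
if missing:
    raise ValueError(f"model output is missing fields: {missing}")
if result["sentiment"] not in ALLOWED_SENTIMENTS:
    raise ValueError(f"unexpected sentiment: {result['sentiment']!r}")

# Optional fields may be absent or None, so fall back to safe defaults.
pros = result.get("pros") or []
cons = result.get("cons") or []
reviewer = result.get("name") or "unknown"

print(f"{reviewer}: {result['sentiment']} ({len(pros)} pros, {len(cons)} cons)")
```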

5. Structured Output with Pydantic BaseModel + HuggingFace (TinyLlama)

Now the same idea, but using a Pydantic model and a HuggingFace LLM.

💻 Code

from dotenv import load_dotenv
from typing import Optional, Literal
from pydantic import BaseModel, Field
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

load_dotenv()

llm = HuggingFaceEndpoint(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    task="text-generation"
)

model = ChatHuggingFace(llm=llm)

# schema
class Review(BaseModel):
    key_themes: list[str] = Field(
        description="Write down all the key themes discussed in the review in a list"
    )
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(
        description="Return the sentiment of the review, either positive (pos) or negative (neg)"
    )
    pros: Optional[list[str]] = Field(
        default=None,
        description="Write down all the pros inside a list"
    )
    cons: Optional[list[str]] = Field(
        default=None,
        description="Write down all the cons inside a list"
    )
    name: Optional[str] = Field(
        default=None,
        description="Write the name of the reviewer"
    )

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What is happening?

  • Review(BaseModel) defines the Python data model.

  • with_structured_output(Review):

    • LangChain internally converts this to JSON schema.

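    • Asks the LLM to return data that can be parsed as Review. Small models such as TinyLlama may not follow the schema reliably, in which case parsing fails.

  • The result is a Review instance (a Pydantic object), not just a dict.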

So you can do:

print(result.summary)
print(result.sentiment)
print(result.name)
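
To get a feel for the schema conversion mentioned above, you can ask Pydantic itself for the JSON schema of a model. This sketch uses a trimmed copy of Review and the Pydantic v2 method model_json_schema(); what LangChain actually sends to a provider may differ in detail, so take it as an approximation.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Review(BaseModel):
    # Trimmed copy of the Review model above (fewer fields, same style).
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(description="Sentiment of the review")
    name: Optional[str] = Field(default=None, description="Name of the reviewer")


# Pydantic v2 API; on Pydantic v1 the equivalent call is Review.schema().
schema = Review.model_json_schema()

# Fields without a default end up in "required".
print(schema["required"])

# The Literal type becomes an "enum" in the generated schema.
print(schema["properties"]["sentiment"])
```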

🕒 When to use Pydantic + structured output?

  • When you:

    • Want validation & type hints

    • Work in a Python backend

    • Need to plug the result straight into your code

❓ Why is this powerful?

  • End-to-end pipeline:

    • Raw text → LLM → Pydantic model → direct use in database / APIs
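
The validation step of that pipeline can be sketched without a model call. The two JSON strings below are hypothetical model outputs, and parse_review is an illustrative helper name; the sketch assumes Pydantic v2.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class Review(BaseModel):
    # Trimmed copy of the Review model above.
    summary: str
    sentiment: Literal["pos", "neg"]
    name: Optional[str] = None


def parse_review(raw_json: str) -> Optional[Review]:
    """Return a validated Review, or None if the text does not fit the model."""
    try:
        return Review.model_validate_json(raw_json)  # Pydantic v2 API
    except ValidationError:
        return None


# Hypothetical model outputs, hard-coded so no LLM call is needed.
good = parse_review('{"summary": "Great phone", "sentiment": "pos"}')
bad = parse_review('{"summary": "Great phone", "sentiment": "neutral"}')

print(good)
print(bad)
```

The second output is rejected because "neutral" is not one of the allowed Literal values, which is the kind of mistake that would otherwise slip into a database unnoticed.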


6. Structured Output with Pydantic + ChatOpenAI

Same Pydantic Review model, but using OpenAI instead of HuggingFace.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
from typing import Optional, Literal
from pydantic import BaseModel, Field

load_dotenv()

model = ChatOpenAI()

# schema
class Review(BaseModel):
    key_themes: list[str] = Field(description="Write down all the key themes discussed in the review in a list")
    summary: str = Field(description="A brief summary of the review")
    sentiment: Literal["pos", "neg"] = Field(description="Return the sentiment of the review, either positive (pos) or negative (neg)")
    pros: Optional[list[str]] = Field(default=None, description="Write down all the pros inside a list")
    cons: Optional[list[str]] = Field(default=None, description="Write down all the cons inside a list")
    name: Optional[str] = Field(default=None, description="Write the name of the reviewer")

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result)

🔍 What’s different?

  • Same idea, different provider.

  • You still get a Review object.

Example usage:

print(result.key_themes)
print(result.pros)
print(result.cons)
print(result.name)

7. Structured Output with TypedDict + Annotated (ChatOpenAI)

Here’s a more lightweight type approach.

💻 Code

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
from typing import TypedDict, Annotated, Optional, Literal

load_dotenv()

model = ChatOpenAI()

# schema
class Review(TypedDict):
    key_themes: Annotated[list[str], "Write down all the key themes discussed in the review in a list"]
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Return the sentiment of the review, either positive (pos) or negative (neg)"]
    pros: Annotated[Optional[list[str]], "Write down all the pros inside a list"]
    cons: Annotated[Optional[list[str]], "Write down all the cons inside a list"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]

structured_model = model.with_structured_output(Review)

result = structured_model.invoke(
    """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! ... Review by Nitish Singh"""
)

print(result['name'])

🔍 What is happening?

  • Review is a TypedDict with Annotated descriptions.

  • with_structured_output(Review):

    • Uses typing + annotations to build the schema.

  • result is a plain dict, but type checkers know its structure.

Example result:

{
  "key_themes": [...],
  "summary": "...",
  "sentiment": "pos",
  "pros": [...],
  "cons": [...],
  "name": "Nitish Singh"
}

print(result['name']) → "Nitish Singh"
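
To see how the descriptions travel with the types, the sketch below reads the Annotated metadata back using only the standard library. It is not LangChain's actual implementation, just an illustration of where the information lives; the Review here is a trimmed copy.

```python
from typing import Annotated, Literal, Optional, TypedDict, get_type_hints


class Review(TypedDict):
    # Trimmed copy of the Review TypedDict above.
    summary: Annotated[str, "A brief summary of the review"]
    sentiment: Annotated[Literal["pos", "neg"], "Sentiment of the review"]
    name: Annotated[Optional[str], "Write the name of the reviewer"]


# include_extras=True keeps the Annotated wrapper so the metadata stays visible.
hints = get_type_hints(Review, include_extras=True)

# Each Annotated type stores its extra arguments in __metadata__.
descriptions = {field: hint.__metadata__[0] for field, hint in hints.items()}

for field, text in descriptions.items():
    print(f"{field}: {text}")
```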

🕒 When to use this?

  • When you:

    • Want static typing but don’t need Pydantic

    • Prefer minimal dependencies

    • Still want structured outputs from LLM

❓ Why use TypedDict + Annotated?

  • Lightweight

  • Works nicely with mypy / type-checkers

  • You still get descriptions for the LLM to follow


8. Big Picture: Which Structured Output Style to Use?

| Approach | Type | Runtime Validation | Best For |
|---|---|---|---|
| Raw JSON Schema (dict) | Dict | No (LLM constrained only) | Multi-language / tool-level schema |
| BaseModel (Pydantic) | Class | ✅ Yes | Python backends, APIs, DB integration |
| TypedDict + Annotated | Dict type | ❌ No | Lightweight typing, fast, simple |

9. Why Structured Output Matters for LLM Apps

Without structured output:

  • You get free text → must parse manually

  • More chances of errors (missing fields, invalid JSON, etc.)

With structured output:

  • LLM output → typed object or dict (validated automatically when the schema is a Pydantic model)

  • You can directly:

    • Save to DB

    • Return in API

    • Feed into next processing step

This is critical for production-grade AI features where you need reliable data, not just pretty text.
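
The manual-parsing problem is easy to reproduce. In this sketch, free_text is a hypothetical free-form reply of the kind a chat model may produce when it is not constrained to structured output.

```python
import json

# Hypothetical free-form reply: valid JSON wrapped in conversational prose.
free_text = (
    "Sure! Here is the result:\n"
    '{"sentiment": "pos", "summary": "Great phone"}\n'
    "Hope this helps."
)

# Parsing the whole reply fails because of the surrounding prose.
try:
    json.loads(free_text)
    parsed_directly = True
except json.JSONDecodeError:
    parsed_directly = False

# Manual workaround: cut out the span between the first "{" and the last "}".
# This breaks easily, for example when the prose itself contains braces.
start, end = free_text.find("{"), free_text.rfind("}")
extracted = json.loads(free_text[start : end + 1])

print(parsed_directly)
print(extracted["sentiment"])
```

Structured output removes the need for this kind of string slicing, because the library hands back a dict or model instance directly.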




The table below compares JSON Schema (dict), Pydantic BaseModel, and TypedDict + Annotated when used with LangChain structured outputs.


📊 Comparison Table: Structured Output Methods in LangChain

| Feature / Aspect | JSON Schema (dict) | Pydantic BaseModel | TypedDict + Annotated |
|---|---|---|---|
| Definition Type | Python dictionary describing JSON schema | Python class extending BaseModel | Python TypedDict with Annotated descriptions |
| Runtime Validation | ❌ No validation (LLM must comply) | ✅ Yes (strict validation by Pydantic) | ❌ No runtime validation |
| Output Type | dict | Pydantic model instance | dict |
| Error Handling if Output Invalid | ❌ You must check manually | ✅ Pydantic raises validation errors | ❌ No built-in guarantees |
| Best Use Case | Tooling, API schema, cross-language systems | Backend apps needing clean, validated objects | Lightweight typing with minimal overhead |
| Ease of Use | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Flexibility / Customization | ⭐⭐⭐⭐⭐ (full JSON schema control) | ⭐⭐⭐⭐ (rich field types) | ⭐⭐⭐ (simple types only) |
| Type Safety | ❌ No | ✅ Strong | ⚠️ Static only (type checkers) |
| Performance | Fast (no validation) | Slightly slower (validation overhead) | Fast (no validation) |
| Works With | Chat models that support structured output | Chat models that support structured output | Chat models that support structured output |
| Ideal For | Multi-language systems, OpenAPI, strict schema control | Python apps, APIs, DB pipelines | Quick typing, simple extraction tasks |
| Description Support | Medium (via description fields) | Strong (via Field(description=...)) | Strong (via Annotated) |
| Nested Complex Structures | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ (less flexible) |
| Strictness | Low | High | Medium |
| Use in Production | ⚠️ Only if model output is reliable | ✅ Yes, recommended | ⚠️ For simple use cases |
| Requires External Library | ❌ No | ✅ Yes → Pydantic | ❌ No |
| Automatic JSON Serialization | Manual | Built-in (model_dump_json()) | Manual |

🧭 Summary in Simple Words

1. JSON Schema → Specification

  • Best when you need a standard schema

  • Great for cross-language use

  • No validation → LLM must obey

2. Pydantic BaseModel → Strict Validation

  • Ensures correct & clean structured output

  • Perfect for backends, APIs, databases

  • Most reliable for production

3. TypedDict + Annotated → Lightweight

  • No validation, faster

  • Good for simple tasks

  • Best when you want type hints but don’t want heavy models


๐Ÿ… Which One Should YOU Use?

Need Choose
Production app, strict typing Pydantic BaseModel
Tool integration / OpenAPI / external systems JSON Schema
Lightweight & fast TypedDict + Annotated
Most predictable results Pydantic BaseModel

