Saturday, November 22, 2025

Text Chunking in LangChain

✂️ Tutorial: Text Chunking in LangChain

(Recursive Splitter, Character Splitter, Language-Aware Splitter, Semantic Chunker)

Chunking is one of the most important steps in building a RAG system: how you split documents directly determines what the retriever can find.

In this tutorial you will learn:

  • How to chunk PDFs using CharacterTextSplitter

  • How to chunk Markdown using language-aware splitters

  • How to chunk Python code safely

  • How to chunk text semantically using embeddings

  • When to use which splitter and why


🔥 1. Splitting PDFs — CharacterTextSplitter

💻 Code

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF → each page becomes one Document
loader = PyPDFLoader('dl-curriculum.pdf')
docs = loader.load()

# separator='' disables separator-based splitting → pure character counting
splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator=''
)

result = splitter.split_documents(docs)

print(result[1].page_content)

๐Ÿ” WHAT is happening?

  • Load a PDF → each page is a Document

  • Break each page into 200-character chunks

  • No overlap between chunks

⏰ WHEN to use this?

  • For simple text (plain text, PDFs)

  • When structure is not important

  • When you want fast and simple chunking

❓ WHY useful?

  • Many LLM pipelines need small chunks for:

    • embeddings

    • vector databases

    • retrieval

🧾 Example Output (approx)

"Deep Learning has become one of the most exciting areas... (partial text)"
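The fixed-size behavior can be sketched in a few lines of plain Python. `chunk_chars` is a hypothetical helper (not part of LangChain) that mimics what `CharacterTextSplitter` does with `separator=''`:

```python
def chunk_chars(text, chunk_size, chunk_overlap=0):
    """Fixed-size character chunking: slide a window of chunk_size
    characters, stepping forward by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Deep Learning has become one of the most exciting areas of AI."
chunks = chunk_chars(sample, chunk_size=20)
print(len(chunks))   # → 4 chunks of up to 20 characters each
print(chunks[0])     # → 'Deep Learning has be'
```

Notice the first chunk ends mid-word — that is exactly the "fast but dumb" trade-off of pure character splitting.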

📘 2. RecursiveCharacterTextSplitter — Smart Chunking

This one tries to split intelligently:

  • First by paragraphs

  • Then by sentences

  • Then by words

  • And falls back safely
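The fallback chain can be sketched in plain Python. This is a toy illustration of the recursive idea (the real splitter also merges small pieces back up to `chunk_size`, which this sketch skips); the default separators are roughly paragraph, line, word, character:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try the largest separator first; any piece still over
    chunk_size is re-split with the next, smaller separator."""
    sep, *rest = separators
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        elif rest:
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
        else:
            chunks.append(piece[:chunk_size])  # last-resort hard cut
    return chunks

print(recursive_split("aaa bbb ccc", chunk_size=5))  # → ['aaa', 'bbb', 'ccc']
```

Paragraph boundaries are tried first, so natural structure survives whenever the pieces fit.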


2A. Markdown Chunking — Language-aware

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
# Project Name: Smart Student Tracker

A simple Python-based project to manage and track student data...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=200,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks[0])

๐Ÿ” WHAT?

  • Understands Markdown structure

  • Keeps headings + sections together

  • Avoids breaking code blocks incorrectly

⏰ WHEN?

  • GitHub READMEs

  • Notes in Markdown

  • Documentation

❓ WHY?

  • Better accuracy for RAG because chunk boundaries follow logical sections.

🧾 Example Output

1
# Project Name: Smart Student Tracker

A simple Python-based project...
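Under the hood, markdown awareness mostly means putting heading markers early in the separator list, so sections split at `#` lines. A toy illustration in pure Python (hypothetical helper, not LangChain's internals):

```python
import re

def split_markdown_sections(md):
    """Split markdown into sections at heading lines (#, ##, ...),
    keeping each heading attached to the text that follows it."""
    # The lookahead keeps the '#' marker at the start of its own section.
    parts = re.split(r"\n(?=#{1,6} )", md.strip())
    return [p.strip() for p in parts if p.strip()]

md = """# Project Name: Smart Student Tracker

A simple Python-based project.

## Features

- add students
- track grades
"""
sections = split_markdown_sections(md)
print(len(sections))    # → 2
print(sections[0])      # heading + its body stay together
```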

2B. Splitting Python Code — Language.PYTHON

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        ...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks[1])

๐Ÿ” WHAT?

  • Safely splits Python code

  • Keeps functions, classes, and expressions together

⏰ WHEN?

  • Code RAG

  • AI assistants for programming

  • LLM-based debugging

❓ WHY?

  • Code must not be broken mid-line or mid-block

  • Helps LLM understand context better

🧾 Example Output

    def is_passing(self):
        return self.grade >= 6.0

🧠 3. Semantic Chunking — Using Embeddings

(The smartest chunker)

💻 Code

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

# Break wherever the embedding distance between consecutive sentences
# exceeds 3 standard deviations above the mean distance
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3
)

sample = """
Farmers were working hard...
The Indian Premier League (IPL) is the biggest cricket league...
Terrorism is a big danger...
"""

docs = text_splitter.create_documents([sample])
print(len(docs))
print(docs)

๐Ÿ” WHAT?

  • Looks at meaning, not characters

  • Uses embeddings → finds topic shifts

  • Creates chunks where semantic changes occur

Example:

  • Farming paragraph → Chunk 1

  • IPL paragraph → Chunk 2

  • Terrorism paragraph → Chunk 3
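The standard-deviation breakpoint logic can be sketched without any API calls. This toy version (hypothetical helper, hand-made 2-D "embeddings") shows how a topic shift produces an unusually large distance between consecutive sentences:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def semantic_breakpoints(embeddings, num_std=1.0):
    """Indices where the distance between consecutive sentence
    embeddings exceeds mean + num_std * std — i.e. topic shifts."""
    dists = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i + 1 for i, d in enumerate(dists) if d > mean + num_std * std]

# Two "farming" sentences, then two "cricket" sentences
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_breakpoints(embeddings))  # → [2]: new chunk starts at sentence 2
```

Chunks are then cut at those indices, so each chunk stays on one topic.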

⏰ WHEN?

  • Long articles

  • Mixed-topic documents

  • Web scraping

  • RAG applications needing high accuracy

❓ WHY?

  • Avoids mixing unrelated topics

  • Helps retrieval return the best possible chunk

🧾 Example Output

3
[Document(page_content='Farmers were working...'), Document(...), Document(...)]

📗 4. Simple Text Splitting with RecursiveCharacterTextSplitter

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Space exploration has led to incredible scientific discoveries...
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks)

๐Ÿ” WHAT?

  • General-purpose text splitter

  • Tries large separators (paragraphs) first → then smaller ones (lines, words) → falls back to characters

⏰ WHEN?

  • Normal articles

  • Blogs

  • Wikipedia text

  • Anything non-code, non-Markdown

❓ WHY?

  • Best balance between simplicity and intelligence

  • Most commonly used splitter in RAG systems

🧾 Example Output

1
['Space exploration has led...']

⭐ Which Text Splitter Should You Use?

| Splitter | Best For | Why |
|---|---|---|
| CharacterTextSplitter | PDFs, raw text | Fast but dumb splitting |
| RecursiveCharacterTextSplitter | General-purpose chunking | Most reliable and balanced |
| Language-aware splitter | Markdown, Python, HTML | Understands syntax & structure |
| SemanticChunker | Mixed-topic large docs | Best RAG retrieval accuracy |

🎯 Summary

After this tutorial you can:

  • Split PDFs into pages and chunks

  • Split Markdown and Python safely

  • Use semantic chunking with embeddings

  • Decide which splitter is ideal for your project

Chunking is the backbone of RAG, and now you understand it properly.

