✂️ Tutorial: Text Chunking in LangChain
(Recursive Splitter, Character Splitter, Language-Aware Splitter, Semantic Chunker)
Chunking is one of the most important steps in building a RAG (Retrieval-Augmented Generation) system.
In this tutorial you will learn:
- How to chunk PDFs using `CharacterTextSplitter`
- How to chunk Markdown using language-aware splitters
- How to chunk Python code safely
- How to chunk text semantically using embeddings
- When to use which splitter and why
🔥 1. Splitting PDFs — CharacterTextSplitter
💻 Code

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')
docs = loader.load()

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator=''
)

result = splitter.split_documents(docs)
print(result[1].page_content)
```
🔍 WHAT is happening?
- Load a PDF → each page becomes a `Document`
- Break each page into 200-character chunks
- No overlap between chunks
⏰ WHEN to use this?
- For simple text (plain text, PDFs)
- When structure is not important
- When you want fast and simple chunking
❓ WHY useful?
- Many LLM pipelines need small chunks for:
  - embeddings
  - vector databases
  - retrieval
🧾 Example Output (approx)

```
"Deep Learning has become one of the most exciting areas... (partial text)"
```
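The core idea of fixed-size splitting (what `CharacterTextSplitter` does with an empty separator) can be sketched in plain Python. This is a simplified illustration of the windowing logic, not LangChain's actual implementation:

```python
def fixed_size_chunks(text, chunk_size=200, chunk_overlap=0):
    """Cut text into fixed-size windows, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Deep Learning has become one of the most exciting areas of AI. " * 10
chunks = fixed_size_chunks(sample, chunk_size=200, chunk_overlap=0)
print(len(chunks))
print(chunks[0][:40])
```

With `chunk_overlap > 0`, consecutive windows share characters, which helps retrieval when an answer straddles a chunk boundary.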
🔁 2. RecursiveCharacterTextSplitter — Smart Chunking
This one tries to split intelligently:
- First by paragraphs
- Then by sentences
- Then by words
- And falls back safely to characters
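That fallback behaviour can be sketched in plain Python (a simplified illustration of the idea; the real `RecursiveCharacterTextSplitter` also merges adjacent small pieces back up toward `chunk_size`):

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Try the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = [p for p in text.split(sep) if p]
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, tuple(rest), chunk_size))
    return chunks

doc = "First paragraph about space.\n\nSecond paragraph about oceans."
print(recursive_split(doc, chunk_size=40))
```

Because the paragraph separator succeeds first, each paragraph stays intact instead of being cut mid-sentence.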
2A. Markdown Chunking — Language-aware
💻 Code

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
# Project Name: Smart Student Tracker
A simple Python-based project to manage and track student data...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=200,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)
print(len(chunks))
print(chunks[0])
```
🔍 WHAT?
- Understands Markdown structure
- Keeps headings + sections together
- Avoids breaking code blocks incorrectly
⏰ WHEN?
- GitHub READMEs
- Notes in Markdown
- Documentation
❓ WHY?
- Better accuracy for RAG, because chunk boundaries follow logical sections.
🧾 Example Output

```
1
# Project Name: Smart Student Tracker
A simple Python-based project...
```
2B. Splitting Python Code — Language.PYTHON
💻 Code

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        ...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)
print(len(chunks))
print(chunks[1])
```
🔍 WHAT?
- Safely splits Python code
- Keeps functions, classes, and expressions together
⏰ WHEN?
- Code RAG
- AI assistants for programming
- LLM-based debugging
❓ WHY?
- Code must not be broken mid-line or mid-block
- Helps the LLM understand context better
🧾 Example Output

```
def is_passing(self):
    return self.grade >= 6.0
```
🧠 3. Semantic Chunking — Using Embeddings
(The smartest chunker)
💻 Code

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3
)

sample = """
Farmers were working hard...
The Indian Premier League (IPL) is the biggest cricket league...
Terrorism is a big danger...
"""

docs = text_splitter.create_documents([sample])
print(len(docs))
print(docs)
```
🔍 WHAT?
- Looks at meaning, not character counts
- Uses embeddings → finds topic shifts
- Creates chunk boundaries where the semantics change
Example:
- Farming paragraph → Chunk 1
- IPL paragraph → Chunk 2
- Terrorism paragraph → Chunk 3
⏰ WHEN?
- Long articles
- Mixed-topic documents
- Web scraping
- RAG applications needing high accuracy
❓ WHY?
- Avoids mixing unrelated topics in one chunk
- Helps retrieval return the best possible chunk
🧾 Example Output

```
3
[Document(page_content='Farmers were working...'), Document(...), Document(...)]
```
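The underlying idea can be sketched with toy data: embed each sentence, measure the distance between consecutive embeddings, and start a new chunk where the distance spikes. The two-dimensional "embeddings" and the mean-plus-one-standard-deviation threshold below are illustrative assumptions, not LangChain's exact algorithm:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def semantic_chunks(sentences, embeddings, num_std=1.0):
    """Split where the distance between consecutive sentence embeddings spikes."""
    dists = [cosine_distance(embeddings[i], embeddings[i + 1])
             for i in range(len(embeddings) - 1)]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    threshold = mean + num_std * std
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], dists):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Toy vectors: two "farming" sentences, then two "cricket" sentences.
sentences = ["Farmers sow wheat.", "The harvest was good.",
             "The IPL final was thrilling.", "The batsman scored a century."]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(semantic_chunks(sentences, embeddings))
```

The distance between the second and third sentences is the only one above the threshold, so the text splits into a farming chunk and a cricket chunk.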
📌 4. Simple Text Splitting with RecursiveCharacterTextSplitter
💻 Code

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Space exploration has led to incredible scientific discoveries...
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)
print(len(chunks))
print(chunks)
```
🔍 WHAT?
- General-purpose text splitter
- Tries large separators first (paragraphs) → then smaller ones (lines, words) → falls back to raw characters
⏰ WHEN?
- Normal articles
- Blogs
- Wikipedia text
- Anything non-code, non-Markdown
❓ WHY?
- Best balance between simplicity and intelligence
- The most commonly used splitter in RAG systems
🧾 Example Output

```
1
['Space exploration has led...']
```
⭐ Which Text Splitter Should You Use?
| Splitter | Best For | Why |
|---|---|---|
| CharacterTextSplitter | PDFs, raw text | Fast but dumb splitting |
| RecursiveCharacterTextSplitter | General-purpose chunking | Most reliable and balanced |
| Language-aware Splitter | Markdown, Python, HTML | Understands syntax & structure |
| SemanticChunker | Mixed-topic large docs | Best RAG retrieval accuracy |
🎯 Summary
After this tutorial you can:
- Split PDFs into pages and chunks
- Split Markdown and Python safely
- Use semantic chunking with embeddings
- Decide which splitter is ideal for your project
Chunking is the backbone of RAG, and now you understand it properly.