Saturday, November 22, 2025

Text Chunking in LangChain

✂️ Tutorial: Text Chunking in LangChain

(Recursive Splitter, Character Splitter, Language-Aware Splitter, Semantic Chunker)

Chunking is one of the most important steps in building a RAG system: how you split documents directly determines what the retriever can find.

In this tutorial you will learn:

  • How to chunk PDFs using CharacterTextSplitter

  • How to chunk Markdown using language-aware splitters

  • How to chunk Python code safely

  • How to chunk text semantically using embeddings

  • When to use which splitter and why


🔥 1. Splitting PDFs — CharacterTextSplitter

💻 Code

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF → each page becomes one Document
loader = PyPDFLoader('dl-curriculum.pdf')
docs = loader.load()

# separator='' disables separator-based splitting → pure character counting
splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator=''
)

result = splitter.split_documents(docs)

print(result[1].page_content)

๐Ÿ” WHAT is happening?

  • Load a PDF → each page is a Document

  • Break each page into 200-character chunks

  • No overlap between chunks

⏰ WHEN to use this?

  • For simple text (plain text, PDFs)

  • When structure is not important

  • When you want fast and simple chunking

❓ WHY useful?

  • Many LLM pipelines need small chunks for:

    • embeddings

    • vector databases

    • retrieval

🧾 Example Output (approx)

"Deep Learning has become one of the most exciting areas... (partial text)"
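The fixed-size behavior can be sketched in a few lines of plain Python. `chunk_chars` is a hypothetical helper (not part of LangChain) that mimics what `CharacterTextSplitter` does with `separator=''`:

```python
def chunk_chars(text, chunk_size, chunk_overlap=0):
    """Fixed-size character chunking: slide a window of chunk_size
    characters, stepping forward by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Deep Learning has become one of the most exciting areas of AI."
chunks = chunk_chars(sample, chunk_size=20)
print(len(chunks))   # → 4 chunks of up to 20 characters each
print(chunks[0])     # → 'Deep Learning has be'
```

Notice the first chunk ends mid-word — that is exactly the "fast but dumb" trade-off of pure character splitting.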

📘 2. RecursiveCharacterTextSplitter — Smart Chunking

This one tries to split intelligently:

  • First by paragraphs

  • Then by sentences

  • Then by words

  • And falls back safely
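The fallback chain can be sketched in plain Python. This is a toy illustration of the recursive idea (the real splitter also merges small pieces back up to `chunk_size`, which this sketch skips); the default separators are roughly paragraph, line, word, character:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try the largest separator first; any piece still over
    chunk_size is re-split with the next, smaller separator."""
    sep, *rest = separators
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        elif rest:
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
        else:
            chunks.append(piece[:chunk_size])  # last-resort hard cut
    return chunks

print(recursive_split("aaa bbb ccc", chunk_size=5))  # → ['aaa', 'bbb', 'ccc']
```

Paragraph boundaries are tried first, so natural structure survives whenever the pieces fit.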


2A. Markdown Chunking — Language-aware

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
# Project Name: Smart Student Tracker

A simple Python-based project to manage and track student data...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=200,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks[0])

๐Ÿ” WHAT?

  • Understands Markdown structure

  • Keeps headings + sections together

  • Avoids breaking code blocks incorrectly

⏰ WHEN?

  • GitHub READMEs

  • Notes in Markdown

  • Documentation

❓ WHY?

  • Better accuracy for RAG because chunk boundaries follow logical sections.

🧾 Example Output

1
# Project Name: Smart Student Tracker

A simple Python-based project...
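Under the hood, markdown awareness mostly means putting heading markers early in the separator list, so sections split at `#` lines. A toy illustration in pure Python (hypothetical helper, not LangChain's internals):

```python
import re

def split_markdown_sections(md):
    """Split markdown into sections at heading lines (#, ##, ...),
    keeping each heading attached to the text that follows it."""
    # The lookahead keeps the '#' marker at the start of its own section.
    parts = re.split(r"\n(?=#{1,6} )", md.strip())
    return [p.strip() for p in parts if p.strip()]

md = """# Project Name: Smart Student Tracker

A simple Python-based project.

## Features

- add students
- track grades
"""
sections = split_markdown_sections(md)
print(len(sections))    # → 2
print(sections[0])      # heading + its body stay together
```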

2B. Splitting Python Code — Language.PYTHON

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

text = """
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        ...
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks[1])

๐Ÿ” WHAT?

  • Safely splits Python code

  • Keeps functions, classes, and expressions together

⏰ WHEN?

  • Code RAG

  • AI assistants for programming

  • LLM-based debugging

❓ WHY?

  • Code must not be broken mid-line or mid-block

  • Helps LLM understand context better

🧾 Example Output

    def is_passing(self):
        return self.grade >= 6.0

🧠 3. Semantic Chunking — Using Embeddings

(The smartest chunker)

💻 Code

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

# Break wherever the embedding distance between consecutive sentences
# exceeds 3 standard deviations above the mean distance
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3
)

sample = """
Farmers were working hard...
The Indian Premier League (IPL) is the biggest cricket league...
Terrorism is a big danger...
"""

docs = text_splitter.create_documents([sample])
print(len(docs))
print(docs)

๐Ÿ” WHAT?

  • Looks at meaning, not characters

  • Uses embeddings → finds topic shifts

  • Creates chunks where semantic changes occur

Example:

  • Farming paragraph → Chunk 1

  • IPL paragraph → Chunk 2

  • Terrorism paragraph → Chunk 3
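The standard-deviation breakpoint logic can be sketched without any API calls. This toy version (hypothetical helper, hand-made 2-D "embeddings") shows how a topic shift produces an unusually large distance between consecutive sentences:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def semantic_breakpoints(embeddings, num_std=1.0):
    """Indices where the distance between consecutive sentence
    embeddings exceeds mean + num_std * std — i.e. topic shifts."""
    dists = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i + 1 for i, d in enumerate(dists) if d > mean + num_std * std]

# Two "farming" sentences, then two "cricket" sentences
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_breakpoints(embeddings))  # → [2]: new chunk starts at sentence 2
```

Chunks are then cut at those indices, so each chunk stays on one topic.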

⏰ WHEN?

  • Long articles

  • Mixed-topic documents

  • Web scraping

  • RAG applications needing high accuracy

❓ WHY?

  • Avoids mixing unrelated topics

  • Helps retrieval return the best possible chunk

🧾 Example Output

3
[Document(page_content='Farmers were working...'), Document(...), Document(...)]

📗 4. Simple Text Splitting with RecursiveCharacterTextSplitter

💻 Code

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Space exploration has led to incredible scientific discoveries...
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)

chunks = splitter.split_text(text)

print(len(chunks))
print(chunks)

๐Ÿ” WHAT?

  • General-purpose text splitter

  • Tries large separators (paragraphs) first → then smaller ones (lines, words) → falls back to characters

⏰ WHEN?

  • Normal articles

  • Blogs

  • Wikipedia text

  • Anything non-code, non-Markdown

❓ WHY?

  • Best balance between simplicity and intelligence

  • Most commonly used splitter in RAG systems

🧾 Example Output

1
['Space exploration has led...']

⭐ Which Text Splitter Should You Use?

| Splitter | Best For | Why |
|---|---|---|
| CharacterTextSplitter | PDFs, raw text | Fast but dumb splitting |
| RecursiveCharacterTextSplitter | General-purpose chunking | Most reliable and balanced |
| Language-aware splitter | Markdown, Python, HTML | Understands syntax & structure |
| SemanticChunker | Mixed-topic large docs | Best RAG retrieval accuracy |

🎯 Summary

After this tutorial you can:

  • Split PDFs into pages and chunks

  • Split Markdown and Python safely

  • Use semantic chunking with embeddings

  • Decide which splitter is ideal for your project

Chunking is the backbone of RAG, and now you understand it properly.

