Saturday, November 22, 2025

LangChain Document Loaders (CSV, PDF, Text, Web) + LLM Processing



📘 Tutorial: LangChain Document Loaders (CSV, PDF, Text, Web) + LLM Processing

In this tutorial, you will learn:

  • How to load CSV files

  • How to load PDF files

  • How to load all PDFs from a directory

  • How to load plain-text (TXT) files

  • How to load webpages (HTML)

  • How to use LLMs to summarize, answer questions, inspect metadata, etc.

You’ll also understand:

  • WHAT each loader does

  • WHEN to use it

  • WHY it’s important


1️⃣ Loading CSV Files — CSVLoader

💻 Code

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path='Social_Network_Ads.csv')

docs = loader.load()

print(len(docs))
print(docs[1])

๐Ÿ” WHAT is happening?

  • CSVLoader loads CSV rows as individual Documents

  • Each document has:

    • page_content → the row content

    • metadata → row index & file info

⏰ WHEN to use this?

  • When your data is in tabular form:

    • Sales CSV

    • Ads CSV

    • Training dataset

    • Any spreadsheet exported as CSV

❓ WHY use CSVLoader?

  • Converts structured data into LangChain Document objects

  • Easy to send rows into LLMs for:

    • summarization

    • quality checks

    • classification

    • insights

🧾 Example Output (approx.)

403
Document(
  page_content="Age: 40, Salary: 59000, Purchased: No",
  metadata={'source': 'Social_Network_Ads.csv', 'row': 1}
)
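
If you want to see the idea without LangChain installed, the row-to-Document conversion can be sketched in plain Python. This is a conceptual mock, not the real CSVLoader internals — the Document namedtuple just mirrors the page_content/metadata shape shown above:

```python
import csv
import io
from collections import namedtuple

# Minimal stand-in for langchain_core.documents.Document
Document = namedtuple('Document', ['page_content', 'metadata'])

def load_csv_as_documents(text, source):
    """Mimic CSVLoader: one Document per CSV row."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Render each row as "column: value" pairs, like the output above
        content = ", ".join(f"{k}: {v}" for k, v in row.items())
        docs.append(Document(content, {'source': source, 'row': i}))
    return docs

sample = "Age,Salary,Purchased\n19,19000,No\n40,59000,No\n"
docs = load_csv_as_documents(sample, 'Social_Network_Ads.csv')
print(len(docs))             # 2
print(docs[1].page_content)  # Age: 40, Salary: 59000, Purchased: No
```

Once rows are Documents, any of them can be piped into an LLM chain exactly like the text examples later in this tutorial.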

2️⃣ Loading Multiple PDFs from a Folder — DirectoryLoader

💻 Code

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    path='books',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

docs = loader.lazy_load()

for document in docs:
    print(document.metadata)

๐Ÿ” WHAT is happening?

  • DirectoryLoader scans the folder:

    /books
      ├─ book1.pdf
      ├─ book2.pdf
      ├─ book3.pdf
    
  • Loads every matching PDF using PyPDFLoader

  • lazy_load() returns a generator, so Documents are produced one at a time instead of all at once — useful for large folders

⏰ WHEN to use?

  • When processing:

    • E-book collections

    • Research papers

    • PDF-based knowledge bases

    • Multiple invoices

❓ WHY useful?

  • Automates reading entire directories

  • Perfect for large-scale document ingestion pipelines

🧾 Example Metadata Output

{'source': 'books/book1.pdf', 'page': 0}
{'source': 'books/book1.pdf', 'page': 1}
...
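
The directory scan itself is just a glob pattern match. Here is a plain-Python sketch of that discovery step, using a throwaway temp folder in place of /books (the real DirectoryLoader then hands each matched file to loader_cls):

```python
import pathlib
import tempfile

# Create a throwaway folder that mimics the /books layout above
books = pathlib.Path(tempfile.mkdtemp())
for name in ['book1.pdf', 'book2.pdf', 'book3.pdf', 'notes.txt']:
    (books / name).touch()

# DirectoryLoader(path='books', glob='*.pdf') matches files like this:
pdfs = sorted(p.name for p in books.glob('*.pdf'))
print(pdfs)  # ['book1.pdf', 'book2.pdf', 'book3.pdf'] — notes.txt is filtered out
```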

3️⃣ Loading a Single PDF — PyPDFLoader

💻 Code

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')

docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[1].metadata)

๐Ÿ” WHAT happens?

  • Each page becomes a Document

  • page_content = text from that page

  • metadata = page number, source file

⏰ WHEN to use this?

  • When you want page-level analysis:

    • Summaries per page

    • Extracting answers

    • Finding chapters

❓ WHY use PyPDFLoader?

  • LLMs work on plain text, while PDFs store content in a binary page layout

  • PyPDFLoader extracts that text reliably for most text-based PDFs (scanned/image-only PDFs need OCR instead)

🧾 Example Output

36
"Deep Learning Curriculum...(full text)"
{'source': 'dl-curriculum.pdf', 'page': 1}
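
Conceptually, the loader splits the file by page and wraps each page's text in a Document carrying page-level metadata. A toy sketch with fake page text (the real extraction is done by the pypdf library underneath):

```python
from collections import namedtuple

# Minimal stand-in for langchain_core.documents.Document
Document = namedtuple('Document', ['page_content', 'metadata'])

# Pretend these strings came out of per-page PDF text extraction
pages = ["Deep Learning Curriculum...", "Module 1: Neural Networks..."]

docs = [Document(text, {'source': 'dl-curriculum.pdf', 'page': i})
        for i, text in enumerate(pages)]

print(len(docs))         # 2
print(docs[1].metadata)  # {'source': 'dl-curriculum.pdf', 'page': 1}
```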

4️⃣ Loading TXT Files — TextLoader + Summarization

💻 Code

from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Write a summary for the following poem - \n {poem}',
    input_variables=['poem']
)

parser = StrOutputParser()

loader = TextLoader('cricket.txt', encoding='utf-8')

docs = loader.load()

print(type(docs))
print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

chain = prompt | model | parser

print(chain.invoke({'poem': docs[0].page_content}))

๐Ÿ” WHAT happens?

  • TextLoader loads the file into a single Document

  • Then we feed the content into an LLM chain for summarization
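
The prompt | model | parser pipeline is just sequential data flow: prompt fills the template, the model answers, the parser pulls out the string. A mock sketch of that flow with a fake model (the real chain calls OpenAI, which is skipped here):

```python
# Fake stand-ins showing the data flow of: prompt | model | parser
def prompt(inputs):
    # Fills the template, like PromptTemplate does
    return f"Write a summary for the following poem - \n {inputs['poem']}"

def fake_model(prompt_text):
    # Returns a message-like dict, standing in for ChatOpenAI
    return {'content': 'A short summary of the poem.'}

def parser(message):
    # Extracts the plain string, like StrOutputParser
    return message['content']

def chain(inputs):
    return parser(fake_model(prompt(inputs)))

print(chain({'poem': 'Cricket, oh cricket...'}))  # A short summary of the poem.
```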

⏰ WHEN to use?

  • When processing plain-text:

    • poems

    • articles

    • scripts

    • notes

❓ WHY useful?

  • Text files are very common for:

    • datasets

    • chat logs

    • scraped info

🧾 Example Output (summary)

The poem celebrates the thrill, passion, and joy of cricket...

5️⃣ Loading Website Content — WebBaseLoader

💻 Code

from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Answer the following question \n {question} from the following text - \n {text}',
    input_variables=['question','text']
)

parser = StrOutputParser()

url = 'https://www.flipkart.com/apple-macbook-air-m2-16-gb-256-gb-ssd-macos-sequoia-mc7x4hn-a/p/itmdc5308fa78421'
loader = WebBaseLoader(url)

docs = loader.load()

chain = prompt | model | parser

print(chain.invoke({'question': 'What is the product we are talking about?', 'text': docs[0].page_content}))

๐Ÿ” WHAT happens?

  • WebBaseLoader scrapes HTML, removes tags, extracts readable text

  • You now have product details as a Document

⏰ WHEN to use?

  • For pulling data from:

    • product pages

    • blogs

    • documentation

    • news articles

❓ WHY useful?

  • You can automatically create:

    • summaries

    • Q&A bots

    • research assistants

    • scraping + LLM analysis pipelines

🧾 Example Answer Output

The product discussed is an Apple MacBook Air M2 (16GB | 256GB SSD).
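
The tag-stripping step can be sketched with the standard library's html.parser. Note that WebBaseLoader actually fetches the page over HTTP and parses it with BeautifulSoup; this mock only shows the "HTML in, readable text out" idea on a hard-coded snippet:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>MacBook Air M2</h1><p>16 GB RAM, 256 GB SSD</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.chunks)
print(text)  # MacBook Air M2 16 GB RAM, 256 GB SSD
```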

📌 Summary — Which Loader to Use?

Loader           Best For      Why?
CSVLoader        CSV files     Converts rows → Documents
TextLoader       TXT files     Simple & reliable text extraction
PyPDFLoader      Single PDFs   Page-by-page documents
DirectoryLoader  Many PDFs     Automated ingestion
WebBaseLoader    Websites      Scrapes HTML → Text

