Saturday, November 22, 2025

LangChain Document Loaders (CSV, PDF, Text, Web) + LLM Processing



📘 Tutorial: LangChain Document Loaders (CSV, PDF, Text, Web) + LLM Processing

In this tutorial, you will learn:

  • How to load CSV files

  • How to load PDF files

  • How to load all PDFs from a directory

  • How to load plain-text (TXT) files

  • How to load webpages (HTML)

  • How to use LLMs to summarize, answer questions, inspect metadata, etc.

You’ll also understand:

  • WHAT each loader does

  • WHEN to use it

  • WHY it’s important


1️⃣ Loading CSV Files — CSVLoader

💻 Code

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path='Social_Network_Ads.csv')

docs = loader.load()

print(len(docs))
print(docs[1])

๐Ÿ” WHAT is happening?

  • CSVLoader loads CSV rows as individual Documents

  • Each document has:

    • page_content → the row content

    • metadata → row index & file info

⏰ WHEN to use this?

  • When your data is in tabular form:

    • Sales CSV

    • Ads CSV

    • Training dataset

    • Any spreadsheet exported as CSV

❓ WHY use CSVLoader?

  • Converts structured data into LangChain Document objects

  • Easy to send rows into LLMs for:

    • summarization

    • quality checks

    • classification

    • insights

🧾 Example Output (approx.)

403
Document(
  page_content="Age: 40, Salary: 59000, Purchased: No",
  metadata={'source': 'Social_Network_Ads.csv', 'row': 1}
)
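
If you want to see the idea without LangChain installed, the row-to-Document conversion can be sketched in plain Python. This is a conceptual mock, not the real CSVLoader internals — the Document namedtuple just mirrors the page_content/metadata shape shown above:

```python
import csv
import io
from collections import namedtuple

# Minimal stand-in for langchain_core.documents.Document
Document = namedtuple('Document', ['page_content', 'metadata'])

def load_csv_as_documents(text, source):
    """Mimic CSVLoader: one Document per CSV row."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Render each row as "column: value" pairs, like the output above
        content = ", ".join(f"{k}: {v}" for k, v in row.items())
        docs.append(Document(content, {'source': source, 'row': i}))
    return docs

sample = "Age,Salary,Purchased\n19,19000,No\n40,59000,No\n"
docs = load_csv_as_documents(sample, 'Social_Network_Ads.csv')
print(len(docs))             # 2
print(docs[1].page_content)  # Age: 40, Salary: 59000, Purchased: No
```

Once rows are Documents, any of them can be piped into an LLM chain exactly like the text examples later in this tutorial.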

2️⃣ Loading Multiple PDFs from a Folder — DirectoryLoader

💻 Code

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    path='books',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

docs = loader.lazy_load()

for document in docs:
    print(document.metadata)

๐Ÿ” WHAT is happening?

  • DirectoryLoader scans the folder:

    /books
      ├─ book1.pdf
      ├─ book2.pdf
      ├─ book3.pdf
    
  • Loads every matching PDF using PyPDFLoader

  • lazy_load() returns a generator, so Documents are produced one at a time instead of all at once — useful for large folders

⏰ WHEN to use?

  • When processing:

    • E-book collections

    • Research papers

    • PDF-based knowledge bases

    • Multiple invoices

❓ WHY useful?

  • Automates reading entire directories

  • Perfect for large-scale document ingestion pipelines

🧾 Example Metadata Output

{'source': 'books/book1.pdf', 'page': 0}
{'source': 'books/book1.pdf', 'page': 1}
...
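
The directory scan itself is just a glob pattern match. Here is a plain-Python sketch of that discovery step, using a throwaway temp folder in place of /books (the real DirectoryLoader then hands each matched file to loader_cls):

```python
import pathlib
import tempfile

# Create a throwaway folder that mimics the /books layout above
books = pathlib.Path(tempfile.mkdtemp())
for name in ['book1.pdf', 'book2.pdf', 'book3.pdf', 'notes.txt']:
    (books / name).touch()

# DirectoryLoader(path='books', glob='*.pdf') matches files like this:
pdfs = sorted(p.name for p in books.glob('*.pdf'))
print(pdfs)  # ['book1.pdf', 'book2.pdf', 'book3.pdf'] — notes.txt is filtered out
```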

3️⃣ Loading a Single PDF — PyPDFLoader

💻 Code

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')

docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[1].metadata)

๐Ÿ” WHAT happens?

  • Each page becomes a Document

  • page_content = text from that page

  • metadata = page number, source file

⏰ WHEN to use this?

  • When you want page-level analysis:

    • Summaries per page

    • Extracting answers

    • Finding chapters

❓ WHY use PyPDFLoader?

  • LLMs work on plain text, while PDFs store content in a binary page layout

  • PyPDFLoader extracts that text reliably for most text-based PDFs (scanned/image-only PDFs need OCR instead)

🧾 Example Output

36
"Deep Learning Curriculum...(full text)"
{'source': 'dl-curriculum.pdf', 'page': 1}
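
Conceptually, the loader splits the file by page and wraps each page's text in a Document carrying page-level metadata. A toy sketch with fake page text (the real extraction is done by the pypdf library underneath):

```python
from collections import namedtuple

# Minimal stand-in for langchain_core.documents.Document
Document = namedtuple('Document', ['page_content', 'metadata'])

# Pretend these strings came out of per-page PDF text extraction
pages = ["Deep Learning Curriculum...", "Module 1: Neural Networks..."]

docs = [Document(text, {'source': 'dl-curriculum.pdf', 'page': i})
        for i, text in enumerate(pages)]

print(len(docs))         # 2
print(docs[1].metadata)  # {'source': 'dl-curriculum.pdf', 'page': 1}
```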

4️⃣ Loading TXT Files — TextLoader + Summarization

💻 Code

from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Write a summary for the following poem - \n {poem}',
    input_variables=['poem']
)

parser = StrOutputParser()

loader = TextLoader('cricket.txt', encoding='utf-8')

docs = loader.load()

print(type(docs))
print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

chain = prompt | model | parser

print(chain.invoke({'poem': docs[0].page_content}))

๐Ÿ” WHAT happens?

  • TextLoader loads the file into a single Document

  • Then we feed the content into an LLM chain for summarization
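
The prompt | model | parser pipeline is just sequential data flow: prompt fills the template, the model answers, the parser pulls out the string. A mock sketch of that flow with a fake model (the real chain calls OpenAI, which is skipped here):

```python
# Fake stand-ins showing the data flow of: prompt | model | parser
def prompt(inputs):
    # Fills the template, like PromptTemplate does
    return f"Write a summary for the following poem - \n {inputs['poem']}"

def fake_model(prompt_text):
    # Returns a message-like dict, standing in for ChatOpenAI
    return {'content': 'A short summary of the poem.'}

def parser(message):
    # Extracts the plain string, like StrOutputParser
    return message['content']

def chain(inputs):
    return parser(fake_model(prompt(inputs)))

print(chain({'poem': 'Cricket, oh cricket...'}))  # A short summary of the poem.
```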

⏰ WHEN to use?

  • When processing plain-text:

    • poems

    • articles

    • scripts

    • notes

❓ WHY useful?

  • Text files are very common for:

    • datasets

    • chat logs

    • scraped info

🧾 Example Output (summary)

The poem celebrates the thrill, passion, and joy of cricket...

5️⃣ Loading Website Content — WebBaseLoader

💻 Code

from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Answer the following question \n {question} from the following text - \n {text}',
    input_variables=['question','text']
)

parser = StrOutputParser()

url = 'https://www.flipkart.com/apple-macbook-air-m2-16-gb-256-gb-ssd-macos-sequoia-mc7x4hn-a/p/itmdc5308fa78421'
loader = WebBaseLoader(url)

docs = loader.load()

chain = prompt | model | parser

print(chain.invoke({'question': 'What is the product we are talking about?', 'text': docs[0].page_content}))

๐Ÿ” WHAT happens?

  • WebBaseLoader scrapes HTML, removes tags, extracts readable text

  • You now have product details as a Document

⏰ WHEN to use?

  • For pulling data from:

    • product pages

    • blogs

    • documentation

    • news articles

❓ WHY useful?

  • You can automatically create:

    • summaries

    • Q&A bots

    • research assistants

    • scraping + LLM analysis pipelines

🧾 Example Answer Output

The product discussed is an Apple MacBook Air M2 (16GB | 256GB SSD).
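
The tag-stripping step can be sketched with the standard library's html.parser. Note that WebBaseLoader actually fetches the page over HTTP and parses it with BeautifulSoup; this mock only shows the "HTML in, readable text out" idea on a hard-coded snippet:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>MacBook Air M2</h1><p>16 GB RAM, 256 GB SSD</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.chunks)
print(text)  # MacBook Air M2 16 GB RAM, 256 GB SSD
```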

📌 Summary — Which Loader to Use?

Loader           Best For      Why?
CSVLoader        CSV files     Converts rows → Documents
TextLoader       TXT files     Simple & reliable text extraction
PyPDFLoader      Single PDFs   Page-by-page documents
DirectoryLoader  Many PDFs     Automated ingestion
WebBaseLoader    Websites      Scrapes HTML → Text

