📘 Tutorial: LangChain Document Loaders (CSV, PDF, Text, Web) + LLM Processing
In this tutorial, you will learn:
- How to load CSV files
- How to load PDF files
- How to load all PDFs from a directory
- How to load TXT files
- How to load webpages (HTML)
- How to use LLMs to summarize, answer questions, inspect metadata, and more
You’ll also understand:
- WHAT each loader does
- WHEN to use it
- WHY it’s important
1️⃣ Loading CSV Files — CSVLoader
💻 Code

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path='Social_Network_Ads.csv')
docs = loader.load()

print(len(docs))
print(docs[1])
```
🔍 WHAT is happening?
- CSVLoader loads each CSV row as an individual Document
- Each document has:
  - page_content → the row’s content
  - metadata → row index & file info
⏰ WHEN to use this?
When your data is in tabular form:
- Sales CSV
- Ads CSV
- Training datasets
- Any spreadsheet exported as CSV
❓ WHY use CSVLoader?
- Converts structured data into LangChain Document objects
- Makes it easy to send rows into LLMs for:
  - summarization
  - quality checks
  - classification
  - insights
🧾 Example Output (approx.)

```text
403
Document(
    page_content="Age: 40, Salary: 59000, Purchased: No",
    metadata={'source': 'Social_Network_Ads.csv', 'row': 1}
)
```
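To make the row-to-Document mapping concrete, here is a minimal standard-library sketch of what CSVLoader does conceptually. The sample CSV text and the `load_csv_rows` helper are made up for illustration; they are not part of the LangChain API:

```python
import csv
import io

# Hypothetical sample standing in for Social_Network_Ads.csv
sample = "Age,Salary,Purchased\n35,40000,Yes\n40,59000,No\n"

def load_csv_rows(text, source):
    """Mimic the loader's behavior: one document per row,
    with the row index recorded in metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        page_content = ", ".join(f"{k}: {v}" for k, v in row.items())
        docs.append({"page_content": page_content,
                     "metadata": {"source": source, "row": i}})
    return docs

docs = load_csv_rows(sample, "Social_Network_Ads.csv")
print(len(docs))   # 2
print(docs[1])
```

The real loader yields proper `Document` objects, but the shape of the output (one row, one document, row index in metadata) is the same.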
2️⃣ Loading Multiple PDFs from a Folder — DirectoryLoader
💻 Code

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    path='books',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

docs = loader.lazy_load()

for document in docs:
    print(document.metadata)
```
🔍 WHAT is happening?
- DirectoryLoader scans the folder:

```text
/books
├─ book1.pdf
├─ book2.pdf
├─ book3.pdf
```

- Each matching PDF is loaded using PyPDFLoader
- lazy_load() returns a generator, so files are read one at a time instead of all at once
⏰ WHEN to use?
When processing:
- E-book collections
- Research papers
- PDF-based knowledge bases
- Multiple invoices
❓ WHY useful?
- Automates reading entire directories
- Perfect for large-scale document ingestion pipelines
🧾 Example Metadata Output

```text
{'source': 'books/book1.pdf', 'page': 0}
{'source': 'books/book1.pdf', 'page': 1}
...
```
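The glob step itself is just file matching, which you can see with the standard library alone. This sketch uses a hypothetical temporary folder in place of `books/`:

```python
from pathlib import Path
import tempfile

# Create a throwaway folder with a mix of file types,
# then collect only the PDFs the way a '*.pdf' glob would.
with tempfile.TemporaryDirectory() as books:
    for name in ["book1.pdf", "book2.pdf", "notes.txt"]:
        (Path(books) / name).touch()
    pdfs = sorted(Path(books).glob("*.pdf"))
    print([p.name for p in pdfs])   # ['book1.pdf', 'book2.pdf']
```

DirectoryLoader layers the chosen `loader_cls` on top of exactly this kind of match list, calling it once per file.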
3️⃣ Loading a Single PDF — PyPDFLoader
💻 Code

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')
docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[1].metadata)
```
🔍 WHAT happens?
- Each page becomes a Document
- page_content → text from that page
- metadata → page number & source file
⏰ WHEN to use this?
When you want page-level analysis:
- Summaries per page
- Extracting answers
- Finding chapters
❓ WHY use PyPDFLoader?
- LLMs can’t read PDF binaries directly; they need plain text
- This loader extracts the text page by page so you can feed it to a model
🧾 Example Output

```text
36
"Deep Learning Curriculum...(full text)"
{'source': 'dl-curriculum.pdf', 'page': 1}
```
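Conceptually, the result is a list with one Document per page. This sketch shows that shape using a plain dataclass; the `Document` class and page texts here are illustrative stand-ins, not the real `langchain_core` types:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical extracted page texts standing in for a real PDF
pages = ["Deep Learning Curriculum ...", "Module 1: Neural Networks ..."]

docs = [Document(text, {"source": "dl-curriculum.pdf", "page": i})
        for i, text in enumerate(pages)]

print(len(docs))
print(docs[1].metadata)   # {'source': 'dl-curriculum.pdf', 'page': 1}
```

This is why `docs[1].metadata` in the example output carries `'page': 1`: pages are zero-indexed, one document each.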
4️⃣ Loading TXT Files — TextLoader + Summarization
💻 Code

```python
from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Write a summary for the following poem - \n {poem}',
    input_variables=['poem']
)

parser = StrOutputParser()

loader = TextLoader('cricket.txt', encoding='utf-8')
docs = loader.load()

print(type(docs))
print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

chain = prompt | model | parser
print(chain.invoke({'poem': docs[0].page_content}))
```
🔍 WHAT happens?
- TextLoader loads the file into a single Document
- We then feed its content into an LLM chain for summarization
⏰ WHEN to use?
When processing plain text:
- poems
- articles
- scripts
- notes
❓ WHY useful?
Text files are very common for:
- datasets
- chat logs
- scraped content
🧾 Example Output (summary)

```text
The poem celebrates the thrill, passion, and joy of cricket...
```
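Under the hood, the prompt step of the chain is essentially string templating. A standard-library sketch of what filling the `{poem}` variable looks like (the sample poem line is made up and stands in for cricket.txt):

```python
# The template string, exactly as passed to PromptTemplate above
template = 'Write a summary for the following poem - \n {poem}'

# Hypothetical content standing in for docs[0].page_content
poem = "Bat meets ball, the crowd roars on..."

# Conceptually what the prompt step does before the text
# reaches the model: substitute the variable into the template.
filled = template.format(poem=poem)
print(filled)
```

PromptTemplate adds validation of `input_variables` on top of this, but the filled string the model receives is built in the same way.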
5️⃣ Loading Website Content — WebBaseLoader
💻 Code

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

load_dotenv()

model = ChatOpenAI()

prompt = PromptTemplate(
    template='Answer the following question \n {question} from the following text - \n {text}',
    input_variables=['question', 'text']
)

parser = StrOutputParser()

url = 'https://www.flipkart.com/apple-macbook-air-m2-16-gb-256-gb-ssd-macos-sequoia-mc7x4hn-a/p/itmdc5308fa78421'
loader = WebBaseLoader(url)
docs = loader.load()

chain = prompt | model | parser
print(chain.invoke({
    'question': 'What is the product we are talking about?',
    'text': docs[0].page_content
}))
```
🔍 WHAT happens?
- WebBaseLoader fetches the page, removes HTML tags, and extracts readable text
- You now have the product details as a Document
⏰ WHEN to use?
For pulling data from:
- product pages
- blogs
- documentation
- news articles
❓ WHY useful?
You can automatically build:
- summaries
- Q&A bots
- research assistants
- scraping + LLM analysis pipelines
🧾 Example Answer Output

```text
The product discussed is an Apple MacBook Air M2 (16GB | 256GB SSD).
```
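The tag-stripping idea can be sketched with the standard library's `html.parser`. This is a toy version for a hardcoded snippet; the real loader does considerably more to handle full, messy pages:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes while skipping the markup itself."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-blank text between tags
        if data.strip():
            self.parts.append(data.strip())

# Hypothetical mini page standing in for the real product page
page = "<html><body><h1>Apple MacBook Air M2</h1><p>16 GB RAM, 256 GB SSD</p></body></html>"

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.parts)
print(text)   # Apple MacBook Air M2 16 GB RAM, 256 GB SSD
```

The resulting plain text is what ends up in `docs[0].page_content` and gets handed to the LLM as the `{text}` variable.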
📌 Summary — Which Loader to Use?
| Loader | Best For | Why? |
|---|---|---|
| CSVLoader | CSV files | Converts rows → Documents |
| TextLoader | TXT files | Simple & reliable text extraction |
| PyPDFLoader | Single PDFs | Page-by-page documents |
| DirectoryLoader | Many PDFs | Automated ingestion |
| WebBaseLoader | Websites | Scrapes HTML → text |