Extract Data from PDFs Using Claude

Invoices, reports, contracts — pull structured data from PDF files using AI.

Extracting Data from PDFs

PDFs are everywhere. Invoices, bank statements, contracts, reports, government forms. They're designed to look good on screen and in print. They are not designed to give you their data.

What works and what doesn't

Claude Code does well with PDFs that have selectable text: reports, invoices, statements, forms. It can handle tables inside PDFs, even messy ones with merged cells, and it reads through multi-page documents without issues.

It's less reliable with scanned documents (images of paper), where accuracy drops. Handwritten text is a coin flip. And if your PDF is mostly charts or graphics, Claude can describe what it sees but can't pull precise numbers from a bar chart.

If you're working with scanned documents, you'll get better results running them through a dedicated OCR tool first and then giving Claude the text output. For text-based PDFs, Claude handles them directly.

Reading a PDF

Start by having Claude tell you what's in there:

Read invoice_march.pdf.
Tell me:
- What kind of document this is
- How many pages it has
- What data is in it (line items, totals, dates, etc.)

Same idea as the CSV lesson: understand what you have before you start transforming it. With PDFs this matters more because you can't just open them in a text editor to peek at the structure.

Extracting tables from a PDF

The most common PDF task. You have a report or statement with a table, and you want it as a CSV:

Read quarterly_report.pdf.
Find the table on pages 3-4 titled "Revenue by Region".
Extract it into a CSV with columns: region, q1_revenue, q2_revenue, q3_revenue, q4_revenue.
Save as revenue_by_region.csv

Be specific about which table you want. A long PDF might have multiple tables, and "extract the table" is ambiguous.

Pulling data from invoices

Invoices follow a predictable structure, which helps:

Read invoice_4521.pdf.
Extract these fields:
- Invoice number
- Invoice date
- Due date
- Vendor name
- Line items (description, quantity, unit price, total)
- Subtotal, tax, and grand total
Format the output as a CSV with one row per line item.
Include the invoice number, date, and vendor on every row.
Save as invoice_4521.csv

The output format instructions matter here. "One row per line item with header info repeated" gives you a flat CSV you can actually work with. Without that detail, Claude might nest the data or group it differently than you'd expect.
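
To make the "flat CSV" shape concrete, here is a minimal Python sketch of the layout that prompt asks for. The invoice values, the field names, and the helper functions `flatten_invoice` and `rows_to_csv` are all hypothetical, invented for illustration; this shows the target structure, not how Claude produces it.

```python
import csv
import io


def flatten_invoice(invoice: dict) -> list[dict]:
    """Return one flat row per line item, repeating the header fields on each."""
    header = {
        "invoice_number": invoice["invoice_number"],
        "invoice_date": invoice["invoice_date"],
        "vendor": invoice["vendor"],
    }
    return [{**header, **item} for item in invoice["line_items"]]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the flat rows as CSV text, using the first row's keys as columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical invoice data, made up for this example.
example_invoice = {
    "invoice_number": "4521",
    "invoice_date": "2024-03-14",
    "vendor": "Acme Supplies",
    "line_items": [
        {"description": "Paper", "quantity": 10, "unit_price": "4.50", "total": "45.00"},
        {"description": "Toner", "quantity": 2, "unit_price": "30.00", "total": "60.00"},
    ],
}

flat_rows = flatten_invoice(example_invoice)
print(rows_to_csv(flat_rows))
```

Every row carries the invoice number, date, and vendor, so the file can be filtered or pivoted in a spreadsheet without any lookups.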

Processing multiple invoices

If you have a folder full of invoices, you can process them all at once:

Read all PDF files in the invoices/ folder.
For each invoice, extract:
- Invoice number
- Vendor name
- Invoice date
- Grand total
Create a single CSV with one row per invoice.
Sort by date, most recent first.
Save as invoice_summary.csv

A folder full of PDFs becomes a single structured spreadsheet.
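
If you want to confirm the "most recent first" ordering yourself, a short Python sketch like the one below does it. It assumes the dates were extracted in YYYY-MM-DD form; the records and the helper `sort_most_recent_first` are hypothetical, made up for illustration.

```python
from datetime import date

# Hypothetical per-invoice records, as they might appear in invoice_summary.csv.
records = [
    {"invoice_number": "4519", "vendor": "Acme Supplies", "invoice_date": "2024-02-03", "grand_total": "120.00"},
    {"invoice_number": "4521", "vendor": "Acme Supplies", "invoice_date": "2024-03-14", "grand_total": "105.00"},
    {"invoice_number": "0087", "vendor": "Northwind Freight", "invoice_date": "2024-01-22", "grand_total": "310.40"},
]


def sort_most_recent_first(rows: list[dict]) -> list[dict]:
    """Sort by real dates rather than raw strings, newest invoice first."""
    return sorted(
        rows,
        key=lambda row: date.fromisoformat(row["invoice_date"]),
        reverse=True,
    )


sorted_records = sort_most_recent_first(records)
for row in sorted_records:
    print(row["invoice_date"], row["invoice_number"], row["grand_total"])
```

Parsing into `date` objects matters when invoices use mixed date formats: ask Claude to normalize them to one format in the prompt, otherwise a sort on the raw strings can come out wrong.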

Extracting from bank and financial statements

Bank statements, credit card statements, and financial reports usually have transaction tables:

Read bank_statement_march.pdf.
Extract all transactions into a CSV with columns:
- date
- description
- amount
- type (debit or credit)
Ignore the summary section at the top — only extract individual transactions.
Save as march_transactions.csv

That "ignore the summary section" line prevents Claude from mixing up the totals with the individual transactions.
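
One way to catch summary lines that slipped through anyway is to check that every row looks like a transaction. The Python sketch below is a rough check under stated assumptions: the CSV uses the four columns named in the prompt, dates are in YYYY-MM-DD form, and amounts are plain numbers. The sample data and the helper `find_suspect_rows` are hypothetical.

```python
import csv
import io
from datetime import date
from decimal import Decimal, InvalidOperation


def find_suspect_rows(csv_text: str) -> list[tuple[int, str]]:
    """Return (row number, reason) for rows that do not look like transactions."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        try:
            date.fromisoformat(row["date"] or "")
        except ValueError:
            problems.append((row_number, "date is not YYYY-MM-DD"))
            continue
        try:
            Decimal(row["amount"] or "")
        except InvalidOperation:
            problems.append((row_number, "amount is not a plain number"))
            continue
        if row["type"] not in {"debit", "credit"}:
            problems.append((row_number, "type is not debit or credit"))
    return problems


# Hypothetical extraction output: row 4 is a leaked summary line, row 5 has a bad type.
sample = """date,description,amount,type
2024-03-01,Coffee Shop,4.75,debit
2024-03-02,Payroll,2500.00,credit
,Total withdrawals,1204.75,debit
2024-03-05,Grocery Store,82.10,withdrawal
"""

suspect_rows = find_suspect_rows(sample)
for row_number, reason in suspect_rows:
    print(f"row {row_number}: {reason}")
```

A leaked totals line usually has no transaction date, which is why the date check tends to flag it first.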

Pulling specific information from contracts

Sometimes you don't need a table — you need specific facts scattered throughout a document:

Read service_agreement.pdf.
Extract these details:
- Parties involved (company names)
- Effective date
- Term length
- Total contract value
- Payment schedule
- Termination clause (summarize in one sentence)
- Auto-renewal terms (yes/no, and conditions)
Format as a markdown document. Save as contract_summary.md

This is different from table extraction. Instead of copying rows, Claude has to locate each fact wherever it appears in the document and organize the results for you.

Comparing two PDFs

You can also use Claude to compare documents:

Read old_contract.pdf and new_contract.pdf.
Compare them and list:
- Any sections that were added
- Any sections that were removed
- Any changes to dollar amounts, dates, or named parties
- Any changes to the termination or renewal clauses
Format as a markdown document. Save as contract_changes.md
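
For an independent cross-check of Claude's comparison, a plain line diff can help. The sketch below uses Python's standard `difflib` module and assumes you already have both contracts as plain text (for example, by asking Claude to save the text of each PDF first). The contract snippets and the helper `changed_lines` are hypothetical.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return removed ('-') and added ('+') lines between two plain-text versions."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old_contract",
            tofile="new_contract",
            lineterm="",
            n=0,  # no surrounding context lines, only the changes
        )
    )
    # The first two lines of a unified diff are the file headers; skip them,
    # then keep only added/removed lines (dropping the '@@' hunk markers).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical contract text, made up for this example.
old_text = (
    "Term: 12 months\n"
    "Total contract value: $48,000\n"
    "Either party may terminate with 30 days notice."
)
new_text = (
    "Term: 24 months\n"
    "Total contract value: $48,000\n"
    "Either party may terminate with 60 days notice."
)

contract_changes = changed_lines(old_text, new_text)
for line in contract_changes:
    print(line)
```

A line diff only shows that wording changed, not what the change means, so it complements Claude's summary rather than replacing it.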

Tips for better results

Tell Claude what to expect. "This is a 12-page invoice" or "the table starts on page 3" helps it focus on the right part of the document.

Name the columns you want. Don't say "extract the table." Say "extract the table with columns: date, description, amount, category." This avoids Claude guessing at column names or splitting columns in ways you don't expect.

If a table spans multiple pages, say so:

The transaction table spans pages 4-7. Treat it as one continuous table —
don't restart the row count at each page.
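
"One continuous table" means the header appears once and the rows from each page follow in order. The Python sketch below illustrates that result, assuming each page's rows are already available as lists of strings and that the header is repeated verbatim at the top of every page. The sample rows and the helper `stitch_pages` are hypothetical.

```python
def stitch_pages(pages: list[list[list[str]]]) -> list[list[str]]:
    """Merge per-page row lists into one table, keeping the header only once."""
    if not pages or not pages[0]:
        return []
    header = pages[0][0]
    merged = [header]
    for page_rows in pages:
        for row in page_rows:
            if row == header:
                continue  # skip the header wherever it repeats
            merged.append(row)
    return merged


# Hypothetical rows from two consecutive pages of the same transaction table.
page_4 = [
    ["date", "description", "amount"],
    ["2024-03-01", "Coffee Shop", "4.75"],
    ["2024-03-02", "Payroll", "2500.00"],
]
page_5 = [
    ["date", "description", "amount"],
    ["2024-03-05", "Grocery Store", "82.10"],
]

table = stitch_pages([page_4, page_5])
for row in table:
    print(row)
```

If the extracted CSV contains the header more than once, or the row count does not match the original, the page boundaries were probably not handled as one table.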

And check the output. PDF extraction is where Claude Code is most likely to make small mistakes: a number misread, a column shifted. For anything financial, spot-check a few rows against the original.
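
Some of that spot-checking can be automated, because invoices contain their own arithmetic. The Python sketch below checks that quantity times unit price matches each line total and that the line totals add up to the subtotal. It assumes the column names from the invoice prompt earlier and plain numeric values; the sample rows and the helper `check_line_items` are hypothetical.

```python
from decimal import Decimal


def check_line_items(rows: list[dict], expected_subtotal: str) -> list[str]:
    """Return a description of every arithmetic mismatch in the extracted rows."""
    mismatches = []
    running_total = Decimal("0")
    for index, row in enumerate(rows, start=1):
        expected = Decimal(row["quantity"]) * Decimal(row["unit_price"])
        actual = Decimal(row["total"])
        if expected != actual:
            mismatches.append(
                f"row {index}: {row['quantity']} x {row['unit_price']} = {expected}, "
                f"but total says {actual}"
            )
        running_total += actual
    if running_total != Decimal(expected_subtotal):
        mismatches.append(
            f"line totals sum to {running_total}, but subtotal says {expected_subtotal}"
        )
    return mismatches


# Hypothetical extracted rows; the second total was misread as 66.00 instead of 60.00.
extracted_rows = [
    {"description": "Paper", "quantity": "10", "unit_price": "4.50", "total": "45.00"},
    {"description": "Toner", "quantity": "2", "unit_price": "30.00", "total": "66.00"},
]

issues = check_line_items(extracted_rows, expected_subtotal="105.00")
for issue in issues:
    print(issue)
```

`Decimal` is used instead of `float` so that currency amounts compare exactly. A check like this catches misread digits, but not a wrong vendor name or date, so it does not replace looking at the original.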

When to use a different approach

If you're processing dozens of scanned PDFs regularly, a dedicated OCR service will be more reliable at that volume. For the occasional scanned document, Claude Code is usually good enough as long as you check the output carefully. For any text-based PDF (which covers most invoices, reports, statements, and contracts), it does the job well.

Recap

PDFs aren't designed to give up their data, but that's what Claude Code is for. Start by asking Claude what's in the document before you extract anything. Be specific about which table, which pages, which fields. Name your output columns so you're not leaving the structure to guesswork. And spot-check the results, especially for numbers.

Next up

Now that you can work with CSVs, spreadsheets, and PDFs individually, it's time to bring them together. In the next lesson, you'll combine data from multiple files — even different formats — into a single unified output.
