Why is Markdown better than raw PDF for ChatGPT or Claude?

LLMs are trained on billions of pages of plain text, HTML, and Markdown. They handle structured Markdown headings, lists, and tables natively. PDF is a layout format designed for printing — when you feed it to a chat-with-PDF tool, the extraction layer flattens headings, breaks tables, and merges sidebars into the main flow. Converting first means the model spends its attention budget on understanding content rather than disambiguating layout.

Which PDF-to-Markdown tool is best for academic papers with tables?

LlamaParse from LlamaIndex is the strongest option for table-heavy documents. The free tier covers 1,000 pages per day. For purely textual papers without complex tables, MarkItDown from Microsoft is faster, free, fully local, and good enough.

Can I convert a scanned PDF to Markdown?

Not directly — scanned PDFs are images of text without a text layer. You need to OCR first. Tesseract is the free option (open-source CLI, runs locally). For scientific or financial documents with complex layout, paid OCR like Mistral OCR or AWS Textract gives substantially better output. After OCR, run any of the Markdown converters on the new text-layer PDF.

Does Pickrack's PDF to Markdown tool work offline?

No. Pickrack uses pdftotext (Poppler) on the server for extraction, so the conversion itself happens on the server. The file is written to /tmp, processed, and deleted immediately after the response. If you need a fully local workflow, install MarkItDown on your own machine. For non-PDF sources such as DOCX, Pandoc is the local option, since it does not read PDF. Both are open source.

How long should chunks be when feeding Markdown to Claude?

Claude's 200K-token context window handles long documents in one pass, but smaller chunks of 30K-50K tokens give more focused responses. For tasks like 'summarize section 3' or 'extract all the dates and amounts', split by section header (H1 or H2) and pass one section at a time.

Will MarkItDown preserve tables correctly?

MarkItDown handles simple tables well. It struggles with complex tables that span pages, contain merged cells, or use unusual layouts. For financial reports and scientific papers with that kind of structure, LlamaParse Premium is meaningfully better. For everyday business PDFs, MarkItDown is fine.

PDF to Markdown for AI Workflows: 5 Tools Tested with Real Documents (2026)

If you've ever pasted a 200-page PDF into ChatGPT and gotten back a vague summary that misses half the document, this guide is for you.

The fix isn't a better prompt. It's a better input format.

I tested five PDF-to-Markdown tools in May 2026 against three real documents — a financial annual report (heavy tables), a scientific paper (figures and equations), and a product manual (lots of nested lists). Below are results, the workflow I personally use, and when each tool is worth the setup time.

Why Markdown beats PDF for LLMs

Large language models are trained on text — billions of pages of plain text, HTML, and Markdown. They process structured Markdown headings, lists, and tables natively. PDF, by contrast, is a layout format designed for printers, not parsers.

When you feed a PDF to Claude or ChatGPT through a "chat with PDF" feature:

The PDF goes through an extraction layer (often imperfect)
Headings get flattened into plain paragraphs
Tables turn into space-aligned rows the model has to re-interpret
Footnotes and sidebars merge into the main flow
Multi-column layouts get read in the wrong order

The result: a model that's technically read your document but lost most of its structure. I've seen Claude misattribute quotes from a footnote to the main author, summarize chapter 7 when asked about chapter 3 (because page numbers don't survive extraction), and skip entire tables that should have been the heart of the answer.

Markdown skips all of that. Headings are headings. Lists are lists. Tables are pipe-separated and clean. The model spends its attention budget on understanding content, not on disambiguating layout.

The 5 tools tested

I ran each tool on the same three documents and graded structure preservation, table fidelity, OCR for scanned content, and total time including any setup.

1. MarkItDown (Microsoft, open source) — best free general-purpose

github.com/microsoft/markitdown — Python CLI from Microsoft. Supports PDF, DOCX, XLSX, PPTX, and HTML, plus images (via OCR) and audio (via transcription). Output is clean GitHub-Flavored Markdown.

pip install 'markitdown[all]'
markitdown input.pdf > output.md

On the financial report: tables came through but column alignment was off in 3 of 12 tables, and merged-cell footers turned into duplicated rows.

On the scientific paper: equations were rendered as garbled inline text (LaTeX wasn't preserved). Headings were correct. Figure captions were associated with the right figures. Footnotes appeared as numbered list items at the end of each section, which is fine but not ideal.

On the product manual: excellent. Nested lists came through cleanly, hyperlinks were preserved, table of contents was usable for navigation.

Best for: power users comfortable with the command line, anyone needing audio/image transcription in the same pipeline. Free, MIT license, runs locally — your file never leaves your machine.

2. Pandoc — best for non-PDF formats

pandoc.org — the Swiss Army knife of document conversion. Pandoc can write PDF (via a LaTeX engine) but cannot read it, so it is no help for PDF→MD directly. What it does nail is DOCX→MD, EPUB→MD, HTML→MD, and dozens of others.

pandoc input.docx -o output.md

Best for: any non-PDF document → Markdown. Best DOCX-to-Markdown converter I've used, period. For PDF specifically, use MarkItDown, since Pandoc has no PDF reader; for DOCX, EPUB, ODT, RTF, HTML, Pandoc wins.

3. LlamaParse — best for complex tables (paid)

llamaindex.ai/llamaparse — cloud-based PDF parser by LlamaIndex. Free tier 1,000 pages per day. Specialized in complex tables — financial reports, scientific papers — where MarkItDown and Pandoc fall apart.

On the financial report: nailed every one of the 12 tables. Merged-cell footers preserved as multi-row Markdown. Even the tiny "in millions" annotation under each numeric column came through correctly.

On the scientific paper: equations rendered as LaTeX ($E = mc^2$ style) — usable by Claude, much better than MarkItDown.

On the product manual: fine, but no advantage over MarkItDown.

Best for: documents heavy on data tables. The free tier covers most solo workflows. Premium tier (which I tested briefly on a 200-page annual report) hit 99%+ fidelity, but at $0.003/page it adds up if you're processing daily.

4. Pickrack PDF → Markdown — best for casual one-off use

If you don't want to install anything, Pickrack's PDF → Markdown tool runs server-side using pdftotext (Poppler) plus a custom post-processing pass that detects headings, lists, and code blocks. Browser upload, server processes in /tmp, file deleted immediately, Markdown returned in the response.

On all three documents: structure preservation was decent (better than pdftotext alone) but worse than MarkItDown on tables. Use case: you have a single PDF, you don't want to install Python, and you trust the server-side temp-file pipeline (which is open source on GitHub).

Best for: one-off conversions when installing CLI tools is overkill.

5. Mistral OCR — best for scanned PDFs

mistral.ai — Mistral's OCR model is the most accurate I've tested for scanned documents, including ones with hand-drawn elements, marginalia, and mixed-language content. Pricing varies; check current docs.

On a scanned 1970s-era technical manual (a stress test I added): Mistral OCR pulled out 99% of the body text correctly. Tesseract pulled out about 85%, with several columns merged together. AWS Textract was between them, around 95%, but with worse table preservation than Mistral.

Best for: scanned PDFs, archival documents, mixed-language scans, or anything where OCR quality is the bottleneck.

A step-by-step workflow

Here's how I prep a long PDF for Claude:

1. Identify whether it's a text-layer PDF or a scan. Open it and try to select text. If selection works, proceed to step 2. If not, OCR first using Tesseract (free) or Mistral OCR (paid, much better quality).
2. Convert the PDF to Markdown using MarkItDown for everyday docs, LlamaParse for table-heavy ones, Pickrack for one-offs.
3. Open the .md file in any text editor and skim — does the structure look right? Fix obvious extraction errors (merged headings, broken tables).
4. Trim anything irrelevant: copyright pages, indexes, bibliographies (unless they're the point of the analysis).
5. Split into sections if it's over ~50K tokens. Claude's 200K context handles a lot, but smaller chunks = more focused responses. Split by H1 or H2 boundaries.
6. Paste into Claude/ChatGPT with a clear task: "Summarize section 3", "Extract all dates and amounts", "Compare this paper's methodology to [other paper]".

That's it. The unsung 20 minutes that turns a frustrating 1-hour AI session into a 5-minute one.
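The splitting step is easy to script. Below is a minimal sketch, not a definitive implementation: the function names are made up for illustration, and the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer, so treat the 50K budget as approximate.

```python
import re

# Assumption: roughly 4 characters per token. This is a crude heuristic,
# not a real tokenizer, so the token budget is only approximate.
CHARS_PER_TOKEN = 4

# Matches H1 ("# ") and H2 ("## ") lines only; "### " and deeper do not match.
HEADING = re.compile(r"^#{1,2} ")


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into sections, each starting at an H1 or H2 line."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        if not in_fence and HEADING.match(line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def pack_sections(sections: list[str], max_tokens: int = 50_000) -> list[str]:
    """Greedily merge consecutive sections into chunks of at most max_tokens.

    A single section larger than the budget is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for section in sections:
        if current and estimate_tokens(current + section) > max_tokens:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Typical use would be `pack_sections(split_by_heading(open("output.md", encoding="utf-8").read()))`, then pasting one chunk per conversation. Splitting on headings rather than on a fixed character count keeps each chunk a self-contained section, which is what makes a prompt like "summarize section 3" work.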

Real benchmark: same document, three approaches

To make this concrete, I ran the same 80-page annual report through three pipelines and asked Claude the same question: "What was R&D spending in fiscal 2024 vs 2023, and what does management attribute the change to?"

Approach	Time to answer	Answer accuracy
Raw PDF via Claude's "Chat with PDF"	45 seconds	Found the totals but missed the management commentary, which was in a footnote
MarkItDown → paste full Markdown	30 seconds	Found the totals; commentary was correct but Claude paraphrased a number incorrectly ($240M vs $246M)
LlamaParse → paste full Markdown	30 seconds	Both totals exact, commentary verbatim, attribution to specific R&D programs preserved

On this report, free MarkItDown came close to paid LlamaParse: the only miss was one misquoted figure ($240M vs $246M). For a high-stakes use case (legal, financial decision-making), the LlamaParse upgrade was worth the $0.24 it cost for those 80 pages.
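The cost arithmetic is simple enough to check in a few lines. This sketch uses the $0.003/page figure quoted in this article (current pricing may differ), and `parse_cost` is a hypothetical helper name, not part of any LlamaParse client.

```python
from decimal import Decimal

# Per-page price as quoted in this article; an assumption to re-check
# against current pricing before relying on it.
PRICE_PER_PAGE = Decimal("0.003")


def parse_cost(pages: int, price_per_page: Decimal = PRICE_PER_PAGE) -> Decimal:
    """Return the cost in dollars of parsing `pages` pages at a flat per-page rate."""
    return pages * price_per_page
```

`parse_cost(80)` comes to $0.24, matching the 80-page report. A hypothetical volume of 500 pages a day for 30 days, `parse_cost(500 * 30)`, comes to $45, which is where "it adds up" starts to bite. Decimal is used instead of float so the cents stay exact.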

What about scanned PDFs?

If your PDF is a scanned image (no text layer), Markdown extraction will give you nothing. You need OCR first.

Options:

Free: Tesseract (CLI, open source) or browser-based wrappers built on tesseract.js
Paid but excellent: Mistral OCR, LlamaParse Premium, AWS Textract — all worth it for academic papers and complex tables

After OCR, run the Markdown conversion again on the new text-layer PDF.
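If you have a batch of PDFs, the "try to select text" check can be approximated in code. This is a rough sketch under one assumption: you already have the extracted text for each page as a list of strings from whatever extractor you use. `pages_needing_ocr` and the 20-character cutoff are illustrative choices, not a standard.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is nearly empty.

    A page with fewer than `min_chars` non-whitespace characters is treated
    as a likely scan (an image with no text layer).
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join(text.split())) < min_chars
    ]
```

If most pages come back flagged, the document is probably a scan and should go through OCR before conversion. A few flagged pages in an otherwise clean document are usually full-page figures or blank pages.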

When to skip Markdown

Three cases where raw PDF is fine:

Single-page documents with mostly prose
Image-heavy content (Markdown loses visual context anyway — use Claude's vision feature instead)
One-off questions where setup time isn't worth it

For everything else — research papers, annual reports, books, multi-section technical docs — convert first.

Tips that aren't obvious

A few things I learned the slow way:

Headers and footers leak into every page extraction. Most tools include the page header and footer in the body text. Use a regex pass after conversion to strip repeating lines: anything that appears on more than 80% of pages is almost always boilerplate.
TOC pages confuse Claude more than they help. Strip them.
Bibliographies and reference lists are noise unless your task involves citations. Strip them too.
Watermarks come through as plain text. "DRAFT" or "CONFIDENTIAL" stamped on every page becomes "DRAFT DRAFT DRAFT" sprinkled through your Markdown. Same regex strip approach handles them.
Tables that span pages get split by every tool I tested. Merge them manually or accept that complex multi-page tables are still a hard problem.
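The header/footer cleanup is mostly a frequency count with a little regex for normalization. Here is a minimal sketch, assuming you have the text split per page (for example, pdftotext output split on the form-feed character it emits between pages). The function names and the digit-masking trick are my own illustration, not something any of the tools above provide.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and mask digit runs so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split()))


def strip_repeating_lines(pages: list[str], threshold: float = 0.8) -> list[str]:
    """Remove lines whose normalized form appears on more than `threshold` of pages."""
    page_count = len(pages)
    seen_on: Counter[str] = Counter()
    for page in pages:
        # Use a set so each distinct line counts once per page.
        seen_on.update({normalize(line) for line in page.splitlines() if line.strip()})
    boilerplate = {key for key, n in seen_on.items() if n / page_count > threshold}
    return [
        "\n".join(line for line in page.splitlines() if normalize(line) not in boilerplate)
        for page in pages
    ]
```

With pdftotext output, `strip_repeating_lines(raw_text.split("\f"))` is the typical call. Two caveats: masking digits means a line consisting only of a number is treated like a page number, so check the result on number-heavy documents, and watermarks that land in the middle of a body line are not caught by a whole-line filter.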

Bottom line

Better input beats better prompts.

Convert your PDF to Markdown before pasting. Use a tool that respects structure: MarkItDown for free general use, LlamaParse for tables, Pickrack for one-offs. Spend 5 minutes cleaning the output. The model will thank you with sharper, more accurate responses.

Try it on your next long PDF. The difference is unmistakable.