From PDF to Markdown With Dolphin: Local, Fast, and Actually Honest
How to convert complex PDFs to structured Markdown in minutes, what works, what breaks, and how to fix it.
Why We’re Here
Let’s not sugarcoat it: PDFs are where information goes to get lost, mangled, and locked up.
Every data scientist and developer has tried — and hated — the usual suspects:
- Copy-paste (your wrist weeps)
- Online converters (privacy? what privacy?)
- “PDF-to-Anything” libraries that barf at anything more complex than a receipt
You want Markdown. You want structure — lists, headers, tables, not a random stew of line breaks.
You want speed and privacy.
Dolphin is ByteDance’s open-source vision-language model (VLM) for documents.
It runs locally, supports Markdown/JSON/HTML, “gets” layout, and — if you’re lucky — Just Works.
But getting there, especially on a Mac, means dodging some dragons.
Today, you’ll see the true story, Mac quirks and all.
I’ll even run my own CV through it — and show you the output, warts and all.
Quick-Glance: Mac vs Linux Setup
| | Mac (M1/M2/M3/M4 ARM64 or Intel) | Linux (x86/AMD64) |
|---|---|---|
| Python | 3.11 (not 3.12+), best via Miniconda/Conda | 3.9–3.11, system Python usually OK |
| pip | Use pip, NOT uv/poetry/pdm for now | pip works |
| Dependencies | See notes below for numpy/torch hacks | Usually no hacks needed |
| Model Download | `git lfs install` & `git clone` the Hugging Face model | Same |
| Patching | Must fix `.float()`/`.half()` in code, and NOT cast indices | No patching unless you hit errors |
| Speed | MPS (Apple Silicon) is supported but may be slower | CUDA/CPU both supported |
| Known Pain | Dependency conflicts, dtype issues, aggressive pip resolver | Generally smoother |
The Step-By-Step Install (with Survival Tips)
1. Create an Isolated Python 3.11 Environment
(Do this even if you think your global Python is “clean” — trust me.)
Mac:
brew install miniforge # or Miniconda/Anaconda
conda create --name dolphin python=3.11
conda activate dolphin
Linux:
You can use system Python 3.9–3.11, but a venv or Conda environment is still safest.
2. Clone the Dolphin Repo
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
3. Install Dependencies
Mac (`pip`, not `uv`):
- Do NOT use `poetry`, `pdm`, or `uv`. They will break on `numpy`/`torch`/`opencv` conflicts.
pip install -r requirements.txt
- If you get errors about numpy version or “cannot build wheel for numpy”, try:
pip install numpy==1.26.4
pip install -r requirements.txt --no-deps
Linux:
pip install -r requirements.txt
# Or: poetry install (usually works)
4. Download the Pretrained Model
All platforms:
brew install git-lfs # Mac only; Linux: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
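If you'd rather skip git-lfs, the `huggingface_hub` Python package can pull the same snapshot into `./hf_model`. A minimal sketch (run `pip install huggingface_hub` first):

```python
from huggingface_hub import snapshot_download

# Downloads the ByteDance/Dolphin repo into ./hf_model,
# the same layout the git-lfs clone above would produce.
snapshot_download(repo_id="ByteDance/Dolphin", local_dir="./hf_model")
```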
5. Patch the Code (Mac Only / If You Get Errors)
The Issue:
- The Dolphin code sometimes tries to `.float()` or `.half()` everything, including processors and token indices.
- This causes:
AttributeError: 'DonutProcessor' object has no attribute 'float'
Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead
The Fix:
Open any files with `.half()` or `.float()` and make these changes:
- Only cast models and image tensors to `float32`, never processors or input ids (a concrete sketch follows at the end of this step).
If you see lines like:
batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device, dtype=torch.float32)
Change to:
batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device)
If you see lines like:
self.processor = AutoProcessor.from_pretrained(model_path).float()
Change to:
self.processor = AutoProcessor.from_pretrained(model_path)
You can also use the `sed` commands in the appendix if you want to patch all files at once.
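To make the rule concrete, here's a minimal sketch of the dtype pattern the patched code should follow on Apple Silicon. The model class, file names, and prompt text are illustrative, not Dolphin's exact code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Processor: load it plainly and never call .float()/.half() on it.
processor = AutoProcessor.from_pretrained("./hf_model")

# Model weights: float32 avoids the half-precision problems described above.
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").to(device, dtype=torch.float32)

# Image tensors: also float32.
image = Image.open("page_1.png")  # a rendered PDF page (illustrative)
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, dtype=torch.float32)

# Token indices: move to the device, but keep their integer (Long) dtype.
prompt_ids = processor.tokenizer("Parse this page.", return_tensors="pt").input_ids.to(device)
```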
How I Actually Patched Dolphin (What You’ll Probably Have to Do, Too, When Using a Mac)
Let’s be honest: fixing the Dolphin code on a Mac isn’t just a matter of one clever sed.
Here’s exactly how I handled it, start to finish:
First, I asked an LLM to generate an automated patch script I could run right inside my cloned Dolphin repository.
Here's the (surprisingly handy) result. Save it as `patch_dolphin_float32.sh`:
#!/bin/bash
# Save as patch_dolphin_float32.sh, then run: bash patch_dolphin_float32.sh
set -e
# 1. Replace all '.half()' with '.float()'
echo "Patching .half() → .float() ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.half()/\.float()/g' {} +
# 2. Patch model loading lines to use .float() immediately after from_pretrained
echo "Adding .float() after from_pretrained ..."
find . -type f -name "*.py" -exec sed -i '' 's/\(from_pretrained([^)]*)\)/\1.float()/g' {} +
# 3. Patch .to(self.device) to .to(self.device, dtype=torch.float32)
echo "Patching .to(self.device) ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.to(self.device)/\.to(self.device, dtype=torch.float32)/g' {} +
# 4. Patch .to("cuda")/.to('cuda')/.to("cpu") to include dtype
echo "Patching .to(\"cuda\") and .to(\"cpu\") ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.to("cuda")/\.to("cuda", dtype=torch.float32)/g' {} +
find . -type f -name "*.py" -exec sed -i '' "s/\.to('cuda')/\.to('cuda', dtype=torch.float32)/g" {} +
find . -type f -name "*.py" -exec sed -i '' 's/\.to("cpu")/\.to("cpu", dtype=torch.float32)/g' {} +
find . -type f -name "*.py" -exec sed -i '' "s/\.to('cpu')/\.to('cpu', dtype=torch.float32)/g" {} +
echo "Patch complete! Dolphin should now use float32 everywhere."
But—predictably—this introduced a fresh set of errors. (If you’re thinking “too much of a good thing,” you’re absolutely right.)
Specifically, you don't want to `.float()` your processor objects, only the model tensors themselves.
So I added a second script for surgical cleanup:
Save this as `fix_processors.sh`:
#!/bin/bash
# This script fixes `.float()`/`.to()` calls on processor objects (AutoProcessor, DonutProcessor, etc.)
set -e
# 1. Remove .float() and .to(...) from processor assignments
echo "Patching processor assignments..."
# Regex: For lines assigning to self.processor or processor = ... from_pretrained ... .float()
find . -type f -name "*.py" -exec sed -i '' \
-E '/processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
# Also catch self.processor = ... .float()
find . -type f -name "*.py" -exec sed -i '' \
-E '/self\.processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
# 2. Print lines that were changed for your review
echo "Done! Here are lines with .from_pretrained() that may need review:"
grep -r --color=always 'from_pretrained' .
echo "You should manually verify that .float()/.to(...) is only used on model objects, not on processors."
echo "All done!"
With these two scripts, just run (in your Dolphin repo folder):
bash patch_dolphin_float32.sh
bash fix_processors.sh
But wait, there’s one last Mac-specific “gotcha”:
For Apple Silicon (M-series) users, you must ensure that index tensors like `input_ids`, `attention_mask`, and `prompt_ids` are never cast to float32.
Run this final one-liner to be safe:
find . -type f -name "*.py" -exec sed -i '' -E 's/(input_ids|attention_mask|prompt_ids)\.to\((self\.device|"cpu"|"cuda"|device), *dtype *= *torch\.float32 *\)/\1.to(\2)/g' {} +
Why? This prevents those infamous "Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor" errors that only seem to haunt Mac users.
Linux users: You can almost always skip these patching steps.
The install and run will likely Just Work, but if you hit odd `dtype` errors, now you know where to look.
On a Mac, patch the code, double-check processors, and never cast indices to float. On Linux, chances are you’ll barely break a sweat.
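If you're curious why the indices rule matters, this tiny standalone PyTorch snippet reproduces the same class of error: embedding lookups require integer indices, so a float `input_ids` tensor fails with the message quoted above (exact wording varies by PyTorch version):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

good_ids = torch.tensor([1, 2, 3])          # integer (Long) indices: works
print(emb(good_ids).shape)                  # torch.Size([3, 4])

bad_ids = good_ids.to(dtype=torch.float32)  # what an over-eager float32 patch produces
try:
    emb(bad_ids)
except RuntimeError as err:
    print(err)  # "Expected tensor for argument #1 'indices' ... got torch.FloatTensor"
```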
First Run: Dolphin on Your PDF
Let’s put it to the test — here’s how to run Dolphin and convert a PDF (say, your CV):
python demo_page_hf.py \
--model_path ./hf_model \
--input_path ~/Downloads/MyCV.pdf \
--save_dir ./results
This produces two files in your `results` directory:
- `MyCV.md` (Markdown)
- `MyCV.json` (all detected elements, coordinates, types; a quick way to inspect it follows below)
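The JSON output is the easier one to script against. Here's a short sketch that loads it and prints one line per detected element; the key names (`label`, `bbox`, `text`) are assumptions about the element schema, so peek into your own `MyCV.json` once and adjust:

```python
import json
from pathlib import Path

data = json.loads(Path("./results/MyCV.json").read_text())

# The top-level structure and key names below are assumptions; print the raw
# object once (print(data)) and adapt the field names to what you actually see.
elements = data if isinstance(data, list) else data.get("elements", [])

for el in elements:
    label = el.get("label", "?")                        # element type, e.g. header/paragraph/table
    bbox = el.get("bbox", [])                           # coordinates on the page
    text = (el.get("text") or "").replace("\n", " ")[:60]
    print(f"{label:<12} {bbox} {text}")
```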
What You Get (and What Breaks): Honest Output Review
Let’s not just wave hands — here’s what Dolphin did with my real CV.
What Worked
- Section headers (EDUCATION, EXPERIENCE, RELEVANT SKILLS, LANGUAGES) are mostly intact.
- Contact info (email, phone, LinkedIn) pulled out as text.
- Most bullet points preserved, job roles identified.
- Order of sections is roughly correct.
What Didn’t Work
- Headers sometimes truncated or misspelled: “Data Scientist / Consultant” → “Data Scientist / Consultan”
- Bullets and lists split mid-sentence or merged: Long points may be chopped, especially if a line wraps or is near a margin.
- Section mixing: “LANGUAGES” ended up inside the experience block.
- Markdown is “ragged” if your layout is two-column or tightly packed.
- Legacy warning: you'll see "Legacy behavior is being used. The current behavior will be deprecated in version 5.0.0..." This is a heads-up that future versions will need you to set `legacy=False` or adjust code for tokenizer changes.
In short:
Dolphin gets you 80–90% of the way for well-structured, one-column, text-rich PDFs.
If your document is formatted for print, or has columns/tables, expect to spend a few minutes post-processing.
Fixes, Post-Processing, and Warnings
- Manual editing is your friend: open the `.md` output and fix split lines, bullet merges, and out-of-order sections.
- For batch jobs: write a quick script to merge broken bullet points or fix section headers (see the sketch below). Or let an LLM handle post-processing: I once fed error-ridden OCR output to ChatGPT and had it restore clean structure and labeling automatically, as I describe in a separate article.
- Watch for future Dolphin updates: if you see "legacy" warnings, keep an eye on the GitHub repo for breaking changes.
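Here's a minimal sketch of the kind of batch fix I mean: it re-joins bullet points that were split mid-sentence by merging continuation lines into the preceding bullet. It's a heuristic, so review the result before trusting it:

```python
import re
from pathlib import Path

def merge_split_bullets(md: str) -> str:
    """Re-join lines that look like continuations of the previous bullet point."""
    out = []
    for line in md.splitlines():
        stripped = line.strip()
        is_continuation = (
            bool(out)
            and out[-1].lstrip().startswith(("-", "*"))      # previous line is a bullet
            and bool(stripped)                               # current line is not blank
            and not stripped.startswith(("-", "*", "#"))     # and not a new bullet/header
            and not re.match(r"^\d+\.", stripped)            # and not a numbered item
        )
        if is_continuation:
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
    return "\n".join(out)

md_path = Path("./results/MyCV.md")
Path("./results/MyCV_fixed.md").write_text(merge_split_bullets(md_path.read_text()))
print("Wrote ./results/MyCV_fixed.md")
```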

Pro Tips: How to Get the Best Results
- Stick to one-column layouts for best extraction.
- Bold, large, or clearly separated section headers are most reliably detected.
- Scanned PDFs: Dolphin has OCR, but fuzzy scans may still confuse it.
- Tables: simple tables work, complex ones can get flattened to text.
- Try both Markdown and JSON outputs; sometimes the JSON gives you structure that's easier to post-process.
Bonus: Simple Alternatives for Simple Jobs
If your PDFs are boring (no images, no crazy tables), Marker is lightning fast:
pip install marker-pdf
# CLI names and flags vary between marker-pdf releases; in recent versions:
marker_single MyCV.pdf --output_dir ./out
Marker is great for “easy” PDFs. Dolphin shines when your document has layout, mixed content, or you need structured output.
Final Thoughts
Dolphin is a massive step forward for local, privacy-friendly PDF conversion to Markdown or JSON — but it’s not magic.
For scientific papers, documentation, or CVs, it’s usually “almost there,” and you’ll save hours over manual copy-paste.
But for the last mile — merging split lists, fixing out-of-order sections, and tweaking Markdown — you’ll still want to keep your editor handy.
If you’re a Mac user, expect to do a little patching. If you’re on Linux, count your blessings.
Still, this is the best local solution out there for turning PDFs into structured, useful Markdown.
Appendix: Fast Patch Script (for Mac errors)
# Remove .float()/.to(...) from processor assignments, and keep index tensors integer
find . -type f -name "*.py" -exec sed -i '' -E '/processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
find . -type f -name "*.py" -exec sed -i '' -E 's/(input_ids|attention_mask|prompt_ids)\.to\((self\.device|"cpu"|"cuda"|device), *dtype *= *torch\.float32 *\)/\1.to(\2)/g' {} +
How is Dolphin working for you? What’s the weirdest thing you’ve seen in its output? Share your findings or ask for troubleshooting below.
Want even more automation, batch conversion, or Markdown cleaning scripts? Let me know — I’ve probably already hacked it!
Save your wrists. Use Dolphin, but keep your wits about you.
Do you like this kind of thinking?
- Follow me on Medium: @gwangjinkim for deep dives on Python, Lisp, system design, and developer thinking, and much more
- Subscribe on Substack: gwangjinkim.substack.com — coming soon with early essays, experiments & newsletters (just getting started).
- Visit my Ghost blog: everyhub.org, with hands-on tech tutorials and tools (most of which I cross-post here on Medium, but not all).
Follow anywhere that fits your style — or all three if you want front-row seats to what’s coming next.