From PDF to Markdown With Dolphin: Local, Fast, and Actually Honest
How to convert complex PDFs to structured Markdown in minutes, what works, what breaks, and how to fix it.
Why We’re Here
Let’s not sugarcoat it: PDFs are where information goes to get lost, mangled, and locked up.
Every data scientist and developer has tried — and hated — the usual suspects:
- Copy-paste (your wrist weeps)
- Online converters (privacy? what privacy?)
- “PDF-to-Anything” libraries that barf at anything more complex than a receipt
You want Markdown. You want structure — lists, headers, tables, not a random stew of line breaks.
You want speed and privacy.
Dolphin is ByteDance’s open-source vision-language model (VLM) for documents.
It runs locally, supports Markdown/JSON/HTML, “gets” layout, and — if you’re lucky — Just Works.
But getting there, especially on a Mac, means dodging some dragons.
Today, you’ll see the true story, Mac quirks and all.
I’ll even run my own CV through it — and show you the output, warts and all.
Quick-Glance: Mac vs Linux Setup
| | Mac (M1/M2/M3/M4 ARM64 or Intel) | Linux (x86/AMD64) |
|---|---|---|
| Python | 3.11 (not 3.12+), best via Miniconda/Conda | 3.9–3.11, system Python usually OK |
| pip | Use pip, NOT uv/poetry/pdm for now | pip works |
| Dependencies | See notes below for numpy/torch hacks | Usually no hacks needed |
| Model Download | `git lfs install` & `git clone` the Hugging Face model | Same |
| Patching | Must fix `.float()`/`.half()` in code, and NOT cast indices | No patching unless you hit errors |
| Speed | MPS (Apple Silicon) is supported but may be slower | CUDA/CPU both supported |
| Known Pain | Dependency conflicts, dtype issues, aggressive pip resolver | Generally smoother |
The Step-By-Step Install (with Survival Tips)
1. Create an Isolated Python 3.11 Environment
(Do this even if you think your global Python is “clean” — trust me.)
Mac:
brew install miniforge # or Miniconda/Anaconda
conda create --name dolphin python=3.11
conda activate dolphin
Linux:
You can use system Python 3.9–3.11, but a venv or Conda environment is still safest.
2. Clone the Dolphin Repo
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
3. Install Dependencies
Mac (`pip`, not `uv`):
- Do NOT use `poetry`, `pdm`, or `uv`. They will break on `numpy`/`torch`/`opencv` conflicts.
pip install -r requirements.txt
- If you get errors about numpy version or “cannot build wheel for numpy”, try:
pip install numpy==1.26.4
pip install -r requirements.txt --no-deps
Linux:
pip install -r requirements.txt
# Or: poetry install (usually works)
4. Download the Pretrained Model
All platforms:
brew install git-lfs # Mac only; Linux: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
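If you'd rather skip git-lfs, the `huggingface_hub` Python package can pull the same snapshot into `./hf_model`. A minimal sketch (run `pip install huggingface_hub` first):

```python
from huggingface_hub import snapshot_download

# Downloads the ByteDance/Dolphin repo into ./hf_model,
# the same layout the git-lfs clone above would produce.
snapshot_download(repo_id="ByteDance/Dolphin", local_dir="./hf_model")
```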
5. Patch the Code (Mac Only / If You Get Errors)
The Issue:
- The Dolphin code sometimes tries to `.float()` or `.half()` everything, including processors and token indices.
- This causes:
AttributeError: 'DonutProcessor' object has no attribute 'float'
Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead
The Fix:
Open any files with `.half()` or `.float()` and make these changes:
- Only cast models and image tensors to `float32`, never processors or input ids (a concrete sketch follows at the end of this step).
If you see lines like:
batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device, dtype=torch.float32)
Change to:
batch_prompt_ids = batch_prompt_inputs.input_ids.to(self.device)
If you see lines like:
self.processor = AutoProcessor.from_pretrained(model_path).float()
Change to:
self.processor = AutoProcessor.from_pretrained(model_path)
You can also use the `sed` commands in the appendix if you want to patch all files at once.
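To make the rule concrete, here's a minimal sketch of the dtype pattern the patched code should follow on Apple Silicon. The model class, file names, and prompt text are illustrative, not Dolphin's exact code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Processor: load it plainly and never call .float()/.half() on it.
processor = AutoProcessor.from_pretrained("./hf_model")

# Model weights: float32 avoids the half-precision problems described above.
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").to(device, dtype=torch.float32)

# Image tensors: also float32.
image = Image.open("page_1.png")  # a rendered PDF page (illustrative)
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, dtype=torch.float32)

# Token indices: move to the device, but keep their integer (Long) dtype.
prompt_ids = processor.tokenizer("Parse this page.", return_tensors="pt").input_ids.to(device)
```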
How I Actually Patched Dolphin (What You’ll Probably Have to Do, Too, When Using a Mac)
Let’s be honest: fixing the Dolphin code on a Mac isn’t just a matter of one clever sed.
Here’s exactly how I handled it, start to finish:
First, I asked an LLM to generate an automated patch script I could run right inside my cloned Dolphin repository.
Here's the (surprisingly handy) result. Save it as `patch_dolphin_float32.sh`:
#!/bin/bash
# Save as patch_dolphin_float32.sh, then run: bash patch_dolphin_float32.sh
set -e
# 1. Replace all '.half()' with '.float()'
echo "Patching .half() → .float() ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.half()/\.float()/g' {} +
# 2. Patch model loading lines to use .float() immediately after from_pretrained
echo "Adding .float() after from_pretrained ..."
find . -type f -name "*.py" -exec sed -i '' 's/\(from_pretrained([^)]*)\)/\1.float()/g' {} +
# 3. Patch .to(self.device) to .to(self.device, dtype=torch.float32)
echo "Patching .to(self.device) ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.to(self.device)/\.to(self.device, dtype=torch.float32)/g' {} +
# 4. Patch .to("cuda")/.to('cuda')/.to("cpu") to include dtype
echo "Patching .to(\"cuda\") and .to(\"cpu\") ..."
find . -type f -name "*.py" -exec sed -i '' 's/\.to("cuda")/\.to("cuda", dtype=torch.float32)/g' {} +
find . -type f -name "*.py" -exec sed -i '' "s/\.to('cuda')/\.to('cuda', dtype=torch.float32)/g" {} +
find . -type f -name "*.py" -exec sed -i '' 's/\.to("cpu")/\.to("cpu", dtype=torch.float32)/g' {} +
find . -type f -name "*.py" -exec sed -i '' "s/\.to('cpu')/\.to('cpu', dtype=torch.float32)/g" {} +
echo "Patch complete! Dolphin should now use float32 everywhere."
But—predictably—this introduced a fresh set of errors. (If you’re thinking “too much of a good thing,” you’re absolutely right.)
Specifically, you don't want to `.float()` your processor objects, only the model tensors themselves.
So I added a second script for surgical cleanup:
Save this as `fix_processors.sh`:
#!/bin/bash
# This script fixes `.float()`/`.to()` calls on processor objects (AutoProcessor, DonutProcessor, etc.)
set -e
# 1. Remove .float() and .to(...) from processor assignments
echo "Patching processor assignments..."
# Regex: For lines assigning to self.processor or processor = ... from_pretrained ... .float()
find . -type f -name "*.py" -exec sed -i '' \
-E '/processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
# Also catch self.processor = ... .float()
find . -type f -name "*.py" -exec sed -i '' \
-E '/self\.processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
# 2. Print lines that were changed for your review
echo "Done! Here are lines with .from_pretrained() that may need review:"
grep -r --color=always 'from_pretrained' .
echo "You should manually verify that .float()/.to(...) is only used on model objects, not on processors."
echo "All done!"
With these two scripts, just run (in your Dolphin repo folder):
bash patch_dolphin_float32.sh
bash fix_processors.sh
But wait, there’s one last Mac-specific “gotcha”:
For Apple Silicon (M-series) users, you must ensure that index tensors like `input_ids`, `attention_mask`, and `prompt_ids` are never cast to float32.
Run this final one-liner to be safe:
find . -type f -name "*.py" -exec sed -i '' -E 's/(input_ids|attention_mask|prompt_ids)\.to\((self\.device|"cpu"|"cuda"|device), *dtype *= *torch\.float32 *\)/\1.to(\2)/g' {} +
Why? This prevents those infamous "Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor" errors that only seem to haunt Mac users.
Linux users: You can almost always skip these patching steps.
The install and run will likely Just Work, but if you hit odd `dtype` errors, now you know where to look.
On a Mac, patch the code, double-check processors, and never cast indices to float. On Linux, chances are you’ll barely break a sweat.
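If you're curious why the indices rule matters, this tiny standalone PyTorch snippet reproduces the same class of error: embedding lookups require integer indices, so a float `input_ids` tensor fails with the message quoted above (exact wording varies by PyTorch version):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

good_ids = torch.tensor([1, 2, 3])          # integer (Long) indices: works
print(emb(good_ids).shape)                  # torch.Size([3, 4])

bad_ids = good_ids.to(dtype=torch.float32)  # what an over-eager float32 patch produces
try:
    emb(bad_ids)
except RuntimeError as err:
    print(err)  # "Expected tensor for argument #1 'indices' ... got torch.FloatTensor"
```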
First Run: Dolphin on Your PDF
Let’s put it to the test — here’s how to run Dolphin and convert a PDF (say, your CV):
python demo_page_hf.py \
--model_path ./hf_model \
--input_path ~/Downloads/MyCV.pdf \
--save_dir ./results
This produces two files in your `results` directory:
- `MyCV.md` (Markdown)
- `MyCV.json` (all detected elements, coordinates, types; a quick way to inspect it follows below)
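The JSON output is the easier one to script against. Here's a short sketch that loads it and prints one line per detected element; the key names (`label`, `bbox`, `text`) are assumptions about the element schema, so peek into your own `MyCV.json` once and adjust:

```python
import json
from pathlib import Path

data = json.loads(Path("./results/MyCV.json").read_text())

# The top-level structure and key names below are assumptions; print the raw
# object once (print(data)) and adapt the field names to what you actually see.
elements = data if isinstance(data, list) else data.get("elements", [])

for el in elements:
    label = el.get("label", "?")                        # element type, e.g. header/paragraph/table
    bbox = el.get("bbox", [])                           # coordinates on the page
    text = (el.get("text") or "").replace("\n", " ")[:60]
    print(f"{label:<12} {bbox} {text}")
```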
What You Get (and What Breaks): Honest Output Review
Let’s not just wave hands — here’s what Dolphin did with my real CV.
What Worked
- Section headers (EDUCATION, EXPERIENCE, RELEVANT SKILLS, LANGUAGES) are mostly intact.
- Contact info (email, phone, LinkedIn) pulled out as text.
- Most bullet points preserved, job roles identified.
- Order of sections is roughly correct.
What Didn’t Work
- Headers sometimes truncated or misspelled: “Data Scientist / Consultant” → “Data Scientist / Consultan”
- Bullets and lists split mid-sentence or merged: Long points may be chopped, especially if a line wraps or is near a margin.
- Section mixing: “LANGUAGES” ended up inside the experience block.
- Markdown is “ragged” if your layout is two-column or tightly packed.
- Legacy warning: you'll see "Legacy behavior is being used. The current behavior will be deprecated in version 5.0.0..." This is a heads-up that future versions will need you to set `legacy=False` or adjust code for tokenizer changes.
In short:
Dolphin gets you 80–90% of the way for well-structured, one-column, text-rich PDFs.
If your document is formatted for print, or has columns/tables, expect to spend a few minutes post-processing.
Fixes, Post-Processing, and Warnings
- Manual editing is your friend: open the `.md` output and fix split lines, bullet merges, and out-of-order sections.
- For batch jobs: write a quick script to merge broken bullet points or fix section headers (see the sketch below). Or let an LLM handle post-processing: I once fed error-ridden OCR output to ChatGPT and had it restore clean structure and labeling automatically, as I describe in a separate article.
- Watch for future Dolphin updates: if you see "legacy" warnings, keep an eye on the GitHub repo for breaking changes.
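Here's a minimal sketch of the kind of batch fix I mean: it re-joins bullet points that were split mid-sentence by merging continuation lines into the preceding bullet. It's a heuristic, so review the result before trusting it:

```python
import re
from pathlib import Path

def merge_split_bullets(md: str) -> str:
    """Re-join lines that look like continuations of the previous bullet point."""
    out = []
    for line in md.splitlines():
        stripped = line.strip()
        is_continuation = (
            bool(out)
            and out[-1].lstrip().startswith(("-", "*"))      # previous line is a bullet
            and bool(stripped)                               # current line is not blank
            and not stripped.startswith(("-", "*", "#"))     # and not a new bullet/header
            and not re.match(r"^\d+\.", stripped)            # and not a numbered item
        )
        if is_continuation:
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
    return "\n".join(out)

md_path = Path("./results/MyCV.md")
Path("./results/MyCV_fixed.md").write_text(merge_split_bullets(md_path.read_text()))
print("Wrote ./results/MyCV_fixed.md")
```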

Pro Tips: How to Get the Best Results
- Stick to one-column layouts for best extraction.
- Bold, large, or clearly separated section headers are most reliably detected.
- Scanned PDFs: Dolphin has OCR, but fuzzy scans may still confuse it.
- Tables: simple tables work, complex ones can get flattened to text.
- Try both Markdown and JSON outputs; sometimes the JSON gives you structure that's easier to post-process.
Bonus: Simple Alternatives for Simple Jobs
If your PDFs are boring (no images, no crazy tables), Marker is lightning fast:
pip install marker-pdf
# CLI names and flags vary between marker-pdf releases; in recent versions:
marker_single MyCV.pdf --output_dir ./out
Marker is great for “easy” PDFs. Dolphin shines when your document has layout, mixed content, or you need structured output.
Final Thoughts
Dolphin is a massive step forward for local, privacy-friendly PDF conversion to Markdown or JSON — but it’s not magic.
For scientific papers, documentation, or CVs, it’s usually “almost there,” and you’ll save hours over manual copy-paste.
But for the last mile — merging split lists, fixing out-of-order sections, and tweaking Markdown — you’ll still want to keep your editor handy.
If you’re a Mac user, expect to do a little patching. If you’re on Linux, count your blessings.
Still, this is the best local solution out there for turning PDFs into structured, useful Markdown.
Appendix: Fast Patch Script (for Mac errors)
# Remove .float()/.to(...) from processor assignments, and keep index tensors integer
find . -type f -name "*.py" -exec sed -i '' -E '/processor *= *.*from_pretrained/ s/(\.float\(\)|\.to\([^)]+\))//g' {} +
find . -type f -name "*.py" -exec sed -i '' -E 's/(input_ids|attention_mask|prompt_ids)\.to\((self\.device|"cpu"|"cuda"|device), *dtype *= *torch\.float32 *\)/\1.to(\2)/g' {} +
How is Dolphin working for you? What’s the weirdest thing you’ve seen in its output? Share your findings or ask for troubleshooting below.
Want even more automation, batch conversion, or Markdown cleaning scripts? Let me know — I’ve probably already hacked it!
Save your wrists. Use Dolphin, but keep your wits about you.
Do you like this kind of thinking?
- Follow me on Medium: @gwangjinkim for deep dives on Python, Lisp, system design, and developer thinking, and much more
- Subscribe on Substack: gwangjinkim.substack.com — coming soon with early essays, experiments & newsletters (just getting started).
- Visit my Ghost blog: everyhub.org, with hands-on tech tutorials and tools (most of which I cross-post here on Medium, but not all).
Follow anywhere that fits your style — or all three if you want front-row seats to what’s coming next.