Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026
Most enterprise data still sits inside PDFs, scans, and slide decks. Large language models and agents cannot use that data until it becomes structured JSON. Open-source document extraction has become the standard way to do that conversion on your own hardware.
Two different problems hide under the phrase ‘PDF to JSON.’ The first is schema-driven extraction: you define fields, and a model fills them with values. The second is document parsing: a model reconstructs the page into structured JSON or Markdown. Most teams need one, sometimes both. Choosing the wrong category costs real time.
Open weights matter here for cost and privacy. Proprietary APIs can cost thousands of dollars per million pages, and they require sending documents off-premise. Local models remove both constraints. Below are the models and toolkits worth evaluating, grouped by what they actually do.
Two categories, one phrase
Schema-driven extraction takes a document and a JSON schema, then returns values for your fields. Use it for invoices, forms, contracts, and receipts, where you know the fields in advance.
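As a rough illustration of "a JSON schema in, values out," the sketch below defines a small invoice schema and checks a model's reply against it. The schema, its field names, and the `check_extraction` helper are hypothetical and written only for this example; they are not part of any tool's API. The helper covers required keys and basic types, a small subset of JSON Schema.

```python
import json

# Hypothetical invoice schema (illustrative field names, not from any tool).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array"},
    },
    "required": ["invoice_number", "total"],
}

# Map JSON Schema type names to Python types (subset only).
_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
    "boolean": bool,
}


def check_extraction(data, schema):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data or data[key] is None:
            continue
        value = data[key]
        type_ok = isinstance(value, _PY_TYPES[spec["type"]])
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            type_ok = False
        if not type_ok:
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


# A reply where the total came back as a string instead of a number.
model_output = '{"invoice_number": "INV-001", "total": "42.50"}'
print(check_extraction(json.loads(model_output), INVOICE_SCHEMA))
```

Constrained decoding can guarantee that the reply is valid JSON, but a check like this is still useful downstream, because valid JSON is not the same as correct values.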
Document parsing reconstructs the document itself. It detects layout, reading order, tables, formulas, and code, then exports JSON or Markdown. Use it to prepare clean corpora for retrieval-augmented generation (RAG) and agents.
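To make the RAG use case concrete, here is a minimal sketch that groups parsed elements into one chunk per heading. The element list is a hypothetical, simplified structure; each parser discussed here has its own JSON layout, so treat the `type` and `text` keys as assumptions.

```python
# Hypothetical parser output: a flat list of elements in reading order.
PARSED_PAGE = [
    {"type": "heading", "text": "Results"},
    {"type": "paragraph", "text": "Accuracy improved on all suites."},
    {"type": "table", "text": "| Model | Score |\n|---|---|\n| A | 0.9 |"},
    {"type": "heading", "text": "Limitations"},
    {"type": "paragraph", "text": "Handwriting remains difficult."},
]


def chunk_by_heading(elements):
    """Group elements into one chunk per heading, preserving reading order."""
    chunks = []
    current = None
    for element in elements:
        # Start a new chunk at every heading, or at the first element
        # when the document begins without a heading.
        if element["type"] == "heading" or current is None:
            current = {"title": "", "body": []}
            chunks.append(current)
        if element["type"] == "heading":
            current["title"] = element["text"]
        else:
            current["body"].append(element["text"])
    return [
        {"title": c["title"], "text": "\n\n".join(c["body"])} for c in chunks
    ]


for chunk in chunk_by_heading(PARSED_PAGE):
    print(chunk["title"], "->", len(chunk["text"]), "characters")
```

Chunking on headings only works if the parser recovered reading order and heading levels correctly, which is why parsing quality matters for retrieval.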
Category 1: Schema-driven structured extraction
Datalab lift
lift is a 9B vision model from Datalab, the team behind Marker and Surya. You pass a JSON schema, and lift returns JSON that matches it. Schema-constrained decoding guarantees the output is valid JSON. The model is built on Qwen 3.5 and runs locally through Hugging Face or remotely through a vLLM server.
It handles multi-page documents in a single pass, including values that span pages. It ships a CLI, a Python API, and a Streamlit ‘Schema Studio’ for building and testing schemas.
pip install lift-pdf
# Start the vLLM server, then extract to your schema
lift_vllm
lift_extract input.pdf ./output --schema schema.json

from lift import extract

result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching your schema

On Datalab’s 225-document benchmark, lift reaches 90.2% field accuracy at 9.5s median latency. It leads NuExtract3 (81.5%) and Qwen3.5-9B (76.3%) on field accuracy. It trails Gemini Flash 3.5 (91.3%) and the hosted Datalab API (95.9%). Note that full-document accuracy stays low for all local models, with lift at 20.9%. Getting every field right in one document remains hard.
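The gap between field accuracy and full-document accuracy in the figures above is easier to see with a toy calculation. The sketch below uses plain exact-match scoring on made-up data; the article does not describe Datalab's actual scoring rules, so this is an assumption about how such metrics are commonly computed, not a reproduction of the benchmark.

```python
def field_accuracy(predictions, references):
    """Fraction of individual fields that match, pooled over all documents."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            correct += pred.get(key) == expected
    return correct / total


def document_accuracy(predictions, references):
    """Fraction of documents in which every reference field matches."""
    perfect = sum(
        all(pred.get(key) == expected for key, expected in ref.items())
        for pred, ref in zip(predictions, references)
    )
    return perfect / len(references)


# Made-up data: two documents with four fields each, one wrong field overall.
references = [
    {"vendor": "Acme", "date": "2026-01-05", "total": 120.0, "currency": "USD"},
    {"vendor": "Globex", "date": "2026-02-11", "total": 75.5, "currency": "EUR"},
]
predictions = [
    {"vendor": "Acme", "date": "2026-01-05", "total": 120.0, "currency": "USD"},
    {"vendor": "Globex", "date": "2026-02-11", "total": 75.5, "currency": "USD"},
]

print(field_accuracy(predictions, references))     # 0.875
print(document_accuracy(predictions, references))  # 0.5
```

One wrong field out of eight leaves field accuracy at 0.875 but halves document accuracy. As a purely illustrative calculation, if field errors were independent and a document had 10 fields, a per-field accuracy of 0.902 would give `0.902 ** 10`, roughly 0.36, for the whole document; longer schemas push that number down further.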
The code is Apache-2.0. The weights use a modified OpenRAIL-M license, free for research, personal use, and startups under $5M in funding or revenue. Commercial self-hosting needs a license, and the weights cannot be used competitively with the Datalab API.
NuMind NuExtract 3
NuExtract 3 is a 4B vision-language model from NuMind. It unifies two tasks in one model: structured extraction (document to JSON) and content extraction (OCR to Markdown). You provide an input and a JSON template describing the fields you need. The model is trained with reinforcement learning to add extraction-specific reasoning, which you can switch on or off per request.
NuExtract 3 is multimodal, multilingual, and based on a Qwen backbone. It serves through vLLM with an OpenAI-compatible API, and a Python SDK is available via pip install numind. NuMind positions it as a reference open model for both structured and content extraction at its size. Check the model card for exact license terms before commercial use.
Category 2: Document parsing to structured JSON and Markdown
IBM Docling
Docling started at IBM Research and is now hosted by the LF AI & Data Foundation. It parses PDF, DOCX, PPTX, XLSX, HTML, images, and more. Output formats include Markdown, HTML, lossless JSON, and DocTags. Its core is the DoclingDocument representation, which preserves layout, reading order, tables, and formulas as LaTeX.
Docling runs locally for air-gapped environments. It integrates with LangChain, LlamaIndex, CrewAI, and Haystack, and ships an MCP server and a Docling Serve mode. The project carries a permissive MIT license. IBM also offers a managed version through watsonx.
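A minimal sketch of a local Docling conversion, assuming the `docling` package is installed. `DocumentConverter`, `export_to_markdown`, and `export_to_dict` follow Docling's documented Python API, but check them against the version you install. The `count_items` helper is hypothetical, and the key names it reads (`texts`, `tables`, `pictures`) are an assumption about the exported JSON layout.

```python
def convert_with_docling(source):
    """Convert one file or URL; return (markdown, lossless dict).

    Requires `pip install docling`. The import is done lazily so this
    module can be loaded on machines where Docling is not installed.
    """
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    document = result.document
    return document.export_to_markdown(), document.export_to_dict()


def count_items(doc_dict):
    """Count entries in a few top-level lists of an exported document dict.

    The key names are assumptions; missing keys are counted as zero.
    """
    return {
        key: len(doc_dict.get(key, []))
        for key in ("texts", "tables", "pictures")
    }


if __name__ == "__main__":
    # "report.pdf" is a placeholder path for illustration.
    markdown, doc_dict = convert_with_docling("report.pdf")
    print(markdown[:500])
    print(count_items(doc_dict))
```

Markdown is usually enough for feeding a retriever, while the dict export is the better choice when you need layout details such as table structure.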
IBM Granite-Docling-258M
Granite-Docling-258M is a compact 258M vision-language model from IBM. It performs one-shot document conversion inside Docling pipelines. Despite its size, it handles OCR, layout, tables, code, and equations, and outputs DocTags. On an A100 GPU, it averages roughly 0.35 seconds per page.
The model builds on the Idefics3 architecture, with a SigLIP2 encoder and a Granite 165M language backbone. It is released under Apache 2.0. IBM states it is built for document conversion, not general image understanding.
OpenDataLab MinerU
MinerU, from OpenDataLab and Shanghai AI Laboratory, converts PDF, image, DOCX, PPTX, and XLSX inputs into Markdown and JSON. It pairs a processing pipeline with a vision-language model. The current model, MinerU2.5-Pro, targets high-resolution parsing of complex layouts, including cross-page tables and charts.
MinerU recently changed its license. It moved from AGPL-3.0 to the “MinerU Open Source License,” a custom license based on Apache 2.0 with additional conditions. That change lowers friction for commercial deployment.
Datalab Marker
Marker is Datalab’s pipeline for converting documents into Markdown, JSON, chunks, and HTML. It supports PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB. It formats tables, forms, equations, inline math, links, and code. An optional --use_llm flag adds a language model to improve tables and forms.
On the third-party olmOCR-Bench suite, Marker scores around 76.1. Its code is GPL-3.0, and its model weights use a modified AI Pubs OpenRAIL-M license. That weight license is free for research, personal use, and startups under $2M in funding or revenue. Datalab’s managed platform now runs a newer OCR model, Chandra, which is Apache-2.0 and outputs HTML, Markdown, and JSON.
Ai2 olmOCR 2
olmOCR 2 is a 7B OCR-specialized vision-language model from the Allen Institute for AI (Ai2). It converts PDFs into clean text and Markdown while preserving reading order. It handles tables, equations, and handwriting across complex multi-column layouts. The model is trained with reinforcement learning from verifiable rewards, using synthetic unit tests as the reward signal.
olmOCR 2 scores 82.4 on its own olmOCR-Bench, among the higher published results on that suite. Ai2 estimates a cost of roughly $178 per million pages on your own GPUs. The toolkit and the allenai/olmOCR-2-7B-1025 weights are Apache-2.0. The current model is English-focused.
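For budgeting, the per-million-page figure converts directly into a cost for your own corpus. The sketch below uses Ai2's quoted estimate of $178 per million pages; the API rate is a hypothetical placeholder chosen only to show the comparison, not a quoted price from any vendor.

```python
SELF_HOSTED_USD_PER_MILLION = 178.0      # Ai2's estimate for olmOCR 2, quoted above
HYPOTHETICAL_API_USD_PER_MILLION = 3000.0  # placeholder, not a real vendor price


def cost_usd(pages, usd_per_million_pages):
    """Linear cost estimate; ignores setup, idle GPU time, and retries."""
    return pages * usd_per_million_pages / 1_000_000


corpus_pages = 250_000
print(cost_usd(corpus_pages, SELF_HOSTED_USD_PER_MILLION))       # 44.5
print(cost_usd(corpus_pages, HYPOTHETICAL_API_USD_PER_MILLION))  # 750.0
```

The self-hosted figure assumes your GPUs stay busy; at low or bursty volumes the effective cost per page will be higher than the linear estimate suggests.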
DeepSeek DeepSeek-OCR
DeepSeek-OCR is an open OCR model from DeepSeek, released in October 2025. It introduces “contexts optical compression,” which represents text-rich pages as compact vision tokens, then decodes them back to text. This lets it process long documents with far fewer tokens than typical vision-language models.
It uses a DeepEncoder plus a 3B Mixture-of-Experts decoder that activates about 570M parameters per token. Depending on the prompt, it outputs plain text, Markdown, HTML tables, or structured JSON, and it supports 100+ languages. The code is released under the MIT license. A follow-up, DeepSeek-OCR2, arrived in January 2026.
The general-purpose option: Qwen3-VL
Qwen3-VL from Alibaba is not a document-specific model. It is a general multimodal series that many extraction models use as a base. You can prompt it to return Markdown, JSON, or code from a page. Most sizes ship under Apache 2.0. It is a flexible fallback when a specialized model does not fit, though it needs more prompt engineering and offers fewer output guarantees.
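"Fewer output guarantees" in practice means a general model may wrap its JSON in a code fence or surround it with commentary. Below is a minimal, best-effort sketch for recovering a JSON object from such a reply. The `parse_model_json` helper is hypothetical and not tied to Qwen3-VL or any serving stack.

```python
import json
import re

# Build the fence marker at runtime so this example contains no literal fence.
TICKS = "`" * 3
_FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_model_json(reply):
    """Return the first JSON object found in a model reply, or None."""
    # Prefer the contents of fenced blocks, if any.
    candidates = [match.strip() for match in _FENCE.findall(reply)]
    # Fall back to the span from the first "{" to the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])
    for candidate in candidates:
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None


fenced = "Here is the result:\n" + TICKS + 'json\n{"total": 42}\n' + TICKS
print(parse_model_json(fenced))                          # {'total': 42}
print(parse_model_json('Sure! {"a": 1} Hope it helps.'))  # {'a': 1}
print(parse_model_json("I could not read the page."))     # None
```

A `None` result is a signal to retry or route the page to a specialized model. Schema-constrained decoding, where the serving stack supports it, avoids this cleanup step entirely.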
How the options compare
| Model | Org | Size | What it does | Primary output | License |
|---|---|---|---|---|---|
| lift | Datalab | 9B | Schema-driven extraction | JSON to your schema | Apache-2.0 code / OpenRAIL-M weights |
| NuExtract 3 | NuMind | 4B | Schema extraction + OCR | JSON + Markdown | Open weights (see card) |
| Docling | IBM / LF AI & Data | Pipeline | Layout parsing | Markdown, JSON, DocTags | MIT |
| Granite-Docling | IBM | 258M | One-shot conversion | DocTags, Markdown | Apache-2.0 |
| MinerU | OpenDataLab | ~1.2B VLM | Layout parsing | Markdown, JSON | MinerU Open Source License |
| Marker | Datalab | Pipeline | Layout parsing | Markdown, JSON, HTML | GPL-3.0 code / OpenRAIL-M weights |
| olmOCR 2 | Ai2 | 7B | OCR to text | Plain text, Markdown | Apache-2.0 |
| DeepSeek-OCR | DeepSeek | 3B MoE (~570M active) | OCR with token compression | Text, Markdown, JSON | MIT (code) |
| Qwen3-VL | Alibaba | 2B–235B | General VLM | Markdown, JSON, code | Apache-2.0 (most sizes) |
A note on benchmarks: these numbers come from different suites and are not directly comparable. lift’s 90.2% is field accuracy on Datalab’s schema-extraction benchmark. The olmOCR-Bench scores for olmOCR 2 (82.4) and Marker (76.1) measure content extraction with unit-test scoring. Run your own documents through each candidate before deciding.
Key Takeaways
- Schema-driven extraction (fields to values) and document parsing (layout to JSON) are different jobs.
- lift and NuExtract 3 target schema-driven JSON; NuExtract 3 also converts content to Markdown.
- Docling, Granite-Docling, MinerU, Marker, olmOCR 2, and DeepSeek-OCR parse documents into structured Markdown, JSON, or DocTags; Qwen3-VL is a general-purpose fallback.
- Licenses vary widely; MinerU moved off AGPL-3.0 in 2026, and lift and Marker split code and model-weight licenses.
- Published benchmarks come from different suites, so treat cross-model scores as indicative, not comparable.
The post Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026 appeared first on MarkTechPost.