News · MarkTechPost · 07-05 · 03:02

Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026

Most enterprise data still sits inside PDFs, scans, and slide decks. Large language models and agents cannot use that data until it becomes structured JSON. Open-source document extraction has become the standard way to do that conversion on your own hardware.

Two different problems hide under the phrase ‘PDF to JSON.’ The first is schema-driven extraction: you define fields, and a model fills them with values. The second is document parsing: a model reconstructs the page into structured JSON or Markdown. Most teams need one, sometimes both. Choosing the wrong category costs real time.

Open weights matter here for cost and privacy. Proprietary APIs can cost thousands of dollars per million pages, and they require sending documents off-premise. Local models remove both constraints. Below are the models and toolkits worth evaluating, grouped by what they actually do.

Two categories, one phrase

Schema-driven extraction takes a document and a JSON schema, then returns values for your fields. Use it for invoices, forms, contracts, and receipts, where you know the fields in advance.
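To make the schema-driven case concrete, here is a minimal sketch of an invoice schema and a shallow check on a returned record. The field names (`invoice_number`, `total`, `currency`) and the helper `schema_problems` are hypothetical and are not taken from any tool covered in this guide. The check uses only the standard library and is not a full JSON Schema validator.

```python
import json

# Hypothetical invoice schema in JSON Schema style; field names are illustrative.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
}

# Maps JSON Schema type names to Python types for a shallow check.
_PY_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def schema_problems(record: dict, schema: dict) -> list[str]:
    """Return missing required fields and top-level fields with the wrong type."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name in record and record[name] is not None:
            expected = _PY_TYPES.get(spec.get("type"))
            if expected and not isinstance(record[name], expected):
                problems.append(f"wrong type: {name}")
    return problems


# Example model output as it might arrive: the total came back as a string.
raw = '{"invoice_number": "INV-001", "total": "129.50"}'
print(schema_problems(json.loads(raw), INVOICE_SCHEMA))
```

A check like this is worth running even when a tool promises valid JSON, because well-formed JSON can still carry a missing or mistyped value.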

Document parsing reconstructs the document itself. It detects layout, reading order, tables, formulas, and code, then exports JSON or Markdown. Use it to prepare clean corpora for retrieval-augmented generation (RAG) and agents.
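As a rough illustration of why parsed output suits RAG, the sketch below groups parsed elements into heading-scoped chunks. The element format (a flat list of dicts with `type`, `text`, and `page`, already in reading order) is an assumption made for this example; each parser in this guide has its own output structure, and `chunk_by_heading` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    heading: str
    pages: set = field(default_factory=set)
    parts: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.parts)


def chunk_by_heading(elements: list[dict]) -> list[Chunk]:
    """Group elements (already in reading order) under the most recent heading."""
    chunks: list[Chunk] = []
    current = Chunk(heading="")  # holds content that appears before any heading
    for el in elements:
        if el["type"] == "heading":
            if current.parts:
                chunks.append(current)
            current = Chunk(heading=el["text"])
        else:
            current.parts.append(el["text"])
            current.pages.add(el["page"])
    if current.parts:
        chunks.append(current)
    return chunks


# Hypothetical parser output for a two-page document.
parsed = [
    {"type": "heading", "text": "Terms", "page": 1},
    {"type": "paragraph", "text": "Payment is due in 30 days.", "page": 1},
    {"type": "table", "text": "| Item | Price |", "page": 2},
    {"type": "heading", "text": "Contacts", "page": 2},
    {"type": "paragraph", "text": "billing@example.com", "page": 2},
]
for chunk in chunk_by_heading(parsed):
    print(chunk.heading, sorted(chunk.pages), len(chunk.parts))
```

Keeping the page numbers on each chunk lets a retrieval system cite where an answer came from.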

Category 1: Schema-driven structured extraction

Datalab lift

lift is a 9B vision model from Datalab, the team behind Marker and Surya. You pass a JSON schema, and lift returns JSON that matches it. Schema-constrained decoding guarantees the output is valid JSON. The model is built on Qwen3.5 and runs locally through Hugging Face or remotely through a vLLM server.

It handles multi-page documents in a single pass, including values that span pages. It ships a CLI, a Python API, and a Streamlit ‘Schema Studio’ for building and testing schemas.

pip install lift-pdf

# Start the vLLM server, then extract to your schema
lift_vllm
lift_extract input.pdf ./output --schema schema.json

from lift import extract

result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching your schema

On Datalab’s 225-document benchmark, lift reaches 90.2% field accuracy at 9.5s median latency. It leads NuExtract3 (81.5%) and Qwen3.5-9B (76.3%) on field accuracy. It trails Gemini Flash 3.5 (91.3%) and the hosted Datalab API (95.9%). Note that full-document accuracy stays low for all local models, with lift at 20.9%. Getting every field right in one document remains hard.
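The gap between field accuracy and full-document accuracy is easier to see with a small calculation. The sketch below assumes exact-match scoring, which may differ from how Datalab's benchmark scores fields, and the helper names are hypothetical. It also adds a back-of-envelope estimate that is not from the source: if every field were correct independently with probability p, a document with n fields would be fully correct with probability p^n. Under that assumption, 90.2% field accuracy reaches about 20.9% at roughly 15 fields per document.

```python
import math


def field_accuracy(docs: list[tuple[dict, dict]]) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    correct = total = 0
    for predicted, expected in docs:
        for key, value in expected.items():
            total += 1
            correct += predicted.get(key) == value
    return correct / total


def document_accuracy(docs: list[tuple[dict, dict]]) -> float:
    """Fraction of documents where every expected field matches."""
    hits = sum(
        all(predicted.get(k) == v for k, v in expected.items())
        for predicted, expected in docs
    )
    return hits / len(docs)


# Two toy documents as (predicted, expected) pairs; one field is wrong in each.
docs = [
    ({"a": 1, "b": 2, "c": 3, "d": 0}, {"a": 1, "b": 2, "c": 3, "d": 4}),
    ({"a": 1, "b": 0, "c": 3, "d": 4}, {"a": 1, "b": 2, "c": 3, "d": 4}),
]
print(field_accuracy(docs))     # 6 of 8 fields correct
print(document_accuracy(docs))  # no document is fully correct

# Under the independence assumption, p ** n is the chance all n fields are right.
p = 0.902
n = math.log(0.209) / math.log(p)  # field count at which p ** n equals 0.209
print(round(n, 1))
```

Real errors are rarely independent, so treat the field count as an intuition aid, not a property of the benchmark.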

The code is Apache-2.0. The weights use a modified OpenRAIL-M license, free for research, personal use, and startups under $5M in funding or revenue. Commercial self-hosting needs a license, and the weights cannot be used competitively with the Datalab API.

NuMind NuExtract 3

NuExtract 3 is a 4B vision-language model from NuMind. It unifies two tasks in one model: structured extraction (document to JSON) and content extraction (OCR to Markdown). You provide an input and a JSON template describing the fields you need. The model is trained with reinforcement learning to add extraction-specific reasoning, which you can switch on or off per request.

NuExtract 3 is multimodal, multilingual, and based on a Qwen backbone. It serves through vLLM with an OpenAI-compatible API, and a Python SDK is available via pip install numind. NuMind positions it as a reference open model for both structured and content extraction at its size. Check the model card for exact license terms before commercial use.

Category 2: Document parsing to structured JSON and Markdown

IBM Docling

Docling started at IBM Research and is now hosted by the LF AI & Data Foundation. It parses PDF, DOCX, PPTX, XLSX, HTML, images, and more. Output formats include Markdown, HTML, lossless JSON, and DocTags. Its core is the DoclingDocument representation, which preserves layout, reading order, tables, and formulas as LaTeX.

Docling runs locally for air-gapped environments. It integrates with LangChain, LlamaIndex, Crew AI, and Haystack, and ships an MCP server and a Docling Serve mode. The project carries a permissive MIT license. IBM also offers a managed version through watsonx.

IBM Granite-Docling-258M

Granite-Docling-258M is a compact 258M vision-language model from IBM. It performs one-shot document conversion inside Docling pipelines. Despite its size, it handles OCR, layout, tables, code, and equations, and outputs DocTags. On an A100 GPU, it averages roughly 0.35 seconds per page.
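For capacity planning, the per-page figure converts directly into GPU-hours. This is a minimal sketch that assumes the quoted 0.35 seconds per page holds for your documents and that work splits perfectly across GPUs; neither assumption comes from IBM, and `gpu_hours` is a hypothetical helper.

```python
SECONDS_PER_PAGE = 0.35  # figure quoted above for an A100


def gpu_hours(pages: int, seconds_per_page: float = SECONDS_PER_PAGE, gpus: int = 1) -> float:
    """Wall-clock hours to process `pages`, assuming perfect scaling across GPUs."""
    return pages * seconds_per_page / gpus / 3600


print(round(gpu_hours(1_000_000), 1))          # single GPU
print(round(gpu_hours(1_000_000, gpus=8), 1))  # eight GPUs, ideal scaling
```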

The model builds on the Idefics3 architecture, with a SigLIP2 encoder and a Granite 165M language backbone. It is released under Apache 2.0. IBM states it is built for document conversion, not general image understanding.

OpenDataLab MinerU

MinerU, from OpenDataLab and Shanghai AI Laboratory, converts PDF, image, DOCX, PPTX, and XLSX inputs into Markdown and JSON. It pairs a processing pipeline with a vision-language model. The current model, MinerU2.5-Pro, targets high-resolution parsing of complex layouts, including cross-page tables and charts.

MinerU recently changed its license. It moved from AGPL-3.0 to the “MinerU Open Source License,” a custom license based on Apache 2.0 with additional conditions. That change lowers friction for commercial deployment.

Datalab Marker

Marker is Datalab’s pipeline for converting documents into Markdown, JSON, chunks, and HTML. It supports PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB. It formats tables, forms, equations, inline math, links, and code. An optional --use_llm flag adds a language model to improve tables and forms.

On the third-party olmOCR-Bench suite, Marker scores around 76.1. Its code is GPL-3.0, and its model weights use a modified AI Pubs OpenRAIL-M license. That weight license is free for research, personal use, and startups under $2M in funding or revenue. Datalab’s managed platform now runs a newer OCR model, Chandra, which is Apache-2.0 and outputs HTML, Markdown, and JSON.

Ai2 olmOCR 2

olmOCR 2 is a 7B OCR-specialized vision-language model from the Allen Institute for AI (Ai2). It converts PDFs into clean text and Markdown while preserving reading order. It handles tables, equations, and handwriting across complex multi-column layouts. The model is trained with reinforcement learning from verifiable rewards, using synthetic unit tests as the reward signal.

olmOCR 2 scores 82.4 on its own olmOCR-Bench, among the higher published results on that suite. Ai2 estimates a cost of roughly $178 per million pages on your own GPUs. The toolkit and the allenai/olmOCR-2-7B-1025 weights are Apache-2.0. The current model is English-focused.
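A per-million-page rate scales linearly to any corpus size, which makes budget comparisons simple. In the sketch below, the $178 rate is Ai2's estimate quoted above, while `HYPOTHETICAL_API_RATE` is a placeholder chosen only to represent "thousands of dollars per million pages" and is not a quoted price from any vendor.

```python
def cost_usd(pages: int, usd_per_million_pages: float) -> float:
    """Linear cost estimate for a corpus of `pages` pages."""
    return pages / 1_000_000 * usd_per_million_pages


SELF_HOSTED_RATE = 178.0        # Ai2's estimate quoted above
HYPOTHETICAL_API_RATE = 2000.0  # placeholder, not a quoted price

pages = 250_000
print(cost_usd(pages, SELF_HOSTED_RATE))
print(cost_usd(pages, HYPOTHETICAL_API_RATE))
```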

DeepSeek DeepSeek-OCR

DeepSeek-OCR is an open OCR model from DeepSeek, released in October 2025. It introduces “contexts optical compression,” which represents text-rich pages as compact vision tokens, then decodes them back to text. This lets it process long documents with far fewer tokens than typical vision-language models.

It uses a DeepEncoder plus a 3B Mixture-of-Experts decoder that activates about 570M parameters per token. Depending on the prompt, it outputs plain text, Markdown, HTML tables, or structured JSON, and it supports 100+ languages. The code is released under the MIT license. A follow-up, DeepSeek-OCR2, arrived in January 2026.

The general-purpose option: Qwen3-VL

Qwen3-VL from Alibaba is not a document-specific model. It is a general multimodal series that many extraction models use as a base. You can prompt it to return Markdown, JSON, or code from a page. Most sizes ship under Apache 2.0. It is a flexible fallback when a specialized model does not fit, though it needs more prompt engineering and offers fewer output guarantees.

How the options compare

Model	Org	Size	What it does	Primary output	License
lift	Datalab	9B	Schema-driven extraction	JSON to your schema	Apache-2.0 code / OpenRAIL-M weights
NuExtract 3	NuMind	4B	Schema extraction + OCR	JSON + Markdown	Open weights (see card)
Docling	IBM / LF AI & Data	Pipeline	Layout parsing	Markdown, JSON, DocTags	MIT
Granite-Docling	IBM	258M	One-shot conversion	DocTags, Markdown	Apache-2.0
MinerU	OpenDataLab	~1.2B VLM	Layout parsing	Markdown, JSON	MinerU Open Source License
Marker	Datalab	Pipeline	Layout parsing	Markdown, JSON, HTML	GPL-3.0 code / OpenRAIL-M weights
olmOCR 2	Ai2	7B	OCR to text	Plain text, Markdown	Apache-2.0
DeepSeek-OCR	DeepSeek	3B MoE (~570M active)	OCR with token compression	Text, Markdown, JSON	MIT (code)
Qwen3-VL	Alibaba	2B–235B	General VLM	Markdown, JSON, code	Apache-2.0 (most sizes)

A note on benchmarks: these numbers come from different suites and are not directly comparable. lift’s 90.2% is field accuracy on Datalab’s schema-extraction benchmark. The olmOCR-Bench scores for olmOCR 2 (82.4) and Marker (76.1) measure content extraction with unit-test scoring. Run your own documents through each candidate before deciding.

Key Takeaways

Schema-driven extraction (fields to values) and document parsing (layout to JSON) are different jobs.
lift and NuExtract 3 target schema-driven JSON; apart from the general-purpose Qwen3-VL, the rest target document parsing.
Docling, MinerU, Marker, olmOCR 2, and DeepSeek-OCR parse documents into structured Markdown or JSON.
Licenses vary widely; MinerU moved off AGPL-3.0 in 2026, and lift and Marker split code and model-weight licenses.
Published benchmarks come from different suites, so treat cross-model scores as indicative, not comparable.

The post Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026 appeared first on MarkTechPost.
