English · 简体中文 · 繁體中文 · Español
Auditable LLM extraction for Java. DocTruth turns PDFs, DOCX, XLSX, and CSV files into schema-bound structured output with field-level source citations, optional PDF bounding boxes, confidence scores, provenance, and PROV-O audit JSON.
DocTruth is for teams that need to answer one question reliably:
Where did this extracted value come from?
The core boundary is simple: source document in, validated structured output plus evidence trail out.
It is framework-agnostic and fits into plain Java, Spring Boot, LangChain4j, Spring AI, Quarkus, Micronaut, or any Java service that already calls OpenAI, Anthropic, Gemini, DeepSeek, or an OpenAI-compatible model endpoint.
contract.pdf
→ Contract record
→ result.requireCitation("totalValue")
→ source quote + page/line + optional bbox + match score
→ audit JSON
Requires Java 25+. Use in a Maven project:
<dependency>
<groupId>ai.doctruth</groupId>
<artifactId>doctruth-java</artifactId>
<version>0.2.0-alpha</version>
</dependency>Gradle uses the same coordinate: ai.doctruth:doctruth-java:0.2.0-alpha.
Upgrade to the latest release:
mvn versions:use-latest-releases -Dincludes=ai.doctruth:doctruth-java -DgenerateBackupPoms=falseIf java on your shell still points to the macOS Java stub or an older runtime,
set JAVA_HOME to a Java 25 installation before running the CLI or examples:
export JAVA_HOME=/path/to/jdk-25
export PATH="$JAVA_HOME/bin:$PATH"
java -versionimport ai.doctruth.DocTruth;
import java.math.BigDecimal;
import java.nio.file.Path;
import java.time.LocalDate;
record Contract(String partyA, String partyB, LocalDate effectiveDate, BigDecimal totalValue) {}
var result = DocTruth.withOpenAi(System.getenv("OPENAI_API_KEY"))
.fromPdf(Path.of("contract.pdf"))
.extract("Extract the contract terms", Contract.class)
.withEvidence()
.run();
Contract contract = result.value();
var partyACitation = result.requireCitation("partyA");
System.out.println(partyACitation.exactQuote());
System.out.println(partyACitation.location());
partyACitation.boundingBox().ifPresent(System.out::println);
result.writeAudit(Path.of("audit.json"));See examples/quickstart for a runnable example.
withEvidence() is the opinionated default for auditable extraction. It enables
field citations, confidence scores, bitemporal provenance, and audit metadata in
one call. Use result.requireCitation("field") for required evidence and
result.findCitation("field") when missing evidence should be handled manually.
The CLI is for first-run inspection, parser debugging, schema checks, and CI smoke tests. Parser and schema inspection do not require an LLM key.
mvn package -DskipTests
java -jar target/doctruth-java-0.2.0-alpha-all.jar parse contract.pdf --bboxes
java -jar target/doctruth-java-0.2.0-alpha-all.jar parse contract.pdf --json -o parsed.json
java -jar target/doctruth-java-0.2.0-alpha-all.jar schema contract.schema.jsonSee Install DocTruth CLI and CLI.
Tagged releases publish doctruth-<version>.tar.gz,
doctruth-java-<version>-all.jar, checksums, and a generated Homebrew formula.
Homebrew install is the intended default once the tap is updated:
brew tap doctruthhq/tap
brew install doctruth
doctruth version- Parses PDF, DOCX, XLSX, and CSV into sections with source locations; PDF text sections include page-normalized bounding boxes when layout data is available.
- Extracts Java records or JSON Schema-bound objects through LLM providers.
- Validates structured output locally and retries repairable failures.
- Matches extracted fields back to exact source quotes.
- Returns per-field
Citation, including source location and optional PDF bounding box, plusConfidenceandProvenance. - Exports W3C PROV-O JSON-LD audit files with
toAuditJson(...).
Java records and simple POJOs are the native path. DocTruth turns the target Java type into the same JSON Schema contract it sends to providers and validates locally before deserializing the response.
Supported Java-native schema shapes include nested records/classes, List<T>,
Map<String, T>, enums, String, booleans, integer and decimal numbers,
BigDecimal, LocalDate, and Jackson property annotations such as
@JsonProperty and @JsonIgnore. Optional<T> is treated as an optional field:
it is omitted from required, while the wrapped value type is still reflected in
the generated schema. Raw Object and unbounded shapes fail fast instead of
becoming unauditable catch-all objects.
JSON Schema remains the interoperability path for external schema producers and template packs.
var schema = JsonSchema.from(Path.of("contract.schema.json"));
var result = DocTruth.withProvider(provider)
.fromPdf(Path.of("contract.pdf"))
.extractJson("Extract contract terms", schema)
.requireCitation("partyA")
.requireCitation("totalValue")
.withEvidence()
.withMaxRetries(2)
.runJson();If a team already owns Pydantic v2 models, export them to JSON Schema at build time and treat the output as a normal schema file. DocTruth does not import Python in Java production.
OpenAI-compatible chat completions are the primary path because many hosted, gateway, and local models expose that API shape.
| Provider | Structured output mode |
|---|---|
| OpenAI / OpenAI-compatible | response_format: json_schema |
| Anthropic | tool-use forcing |
| Gemini | responseMimeType + responseSchema |
| DeepSeek | OpenAI-compatible JSON mode plus local validation |
Provider clients use JDK java.net.http.HttpClient; no vendor SDKs are on the classpath.
Common provider setup:
var client = DocTruth.withProvider(LlmProviders.openAi(System.getenv("OPENAI_API_KEY")));
var anthropic = DocTruth.withProvider(LlmProviders.anthropic("sk-ant-..."));
var local = DocTruth.withProvider(LlmProviders.openAiCompatible(
"local-key",
URI.create("http://localhost:11434/v1/chat/completions"),
"qwen2.5"));doctruth init
doctruth parse contract.pdf --bboxes
doctruth schema contract.schema.json
doctruth doctor
doctruth extract contract.pdf -s contract.schema.json
doctruth audit .doctruth/runs/<run-id>/audit.json- Start here:
- Integrate:
- Java integration guide
- Spring Boot
- LangChain4j
- JSON Schema
- Pydantic interop example for existing Python schema owners
- Understand:
- Use cases:
- Contributing
- Changelog
0.2.0-alpha is an early public alpha. The API is usable, tested, and published for feedback, but may still change before 1.0.
Current verification baseline: mvn verify passes with 703 unit tests and the
tracked integration suite; optional local corpus tests run when fixtures/ is
present. Coverage gates are 90% line / 79% branch.
Code is licensed under Apache License 2.0.
DocTruth, doctruth.ai, and the DocTruth logo are trademarks of doctruthhq. See NOTICE.

