Phase 1 foundation: kherud fork + repo scaffolding #1
Open
aksOps wants to merge 18 commits into
Tier 0 (native/kherud-fork/):
- Forked de.kherud:llama:v4.2.0 (upstream SHA 330ccc1a6c)
- Bumped bundled llama.cpp from b4916 to b8146 - clears 5 reachable
High GHSA advisories (8wwf, 7rxv, vgg9, 96jg, 3p4r) and adds Gemma 3
/ Gemma 3n architecture support
- Native CI workflow for Win-x64 + Linux-x64 (manylinux2014/glibc 2.17)
+ Linux-arm64 (dockcross-arm64-lts/glibc 2.27); path-filtered to
native/kherud-fork/**
- Will publish
io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146
to GitHub Packages
- Smoke-test plan: load Qwen 2.5-0.5B GGUF on UBI8 container
Tier 1 (repo scaffolding):
- Top-level: LICENSE (Apache 2.0), NOTICE, SECURITY, CONTRIBUTING,
CODE_OF_CONDUCT (Contributor Covenant 2.1), .editorconfig, Makefile
- Docs: ARCHITECTURE, WIRE_FORMAT, MODEL_REGISTRY, GLOSSARY
- Tooling: Maven Wrapper (3.9.15), scripts/fetch_models.py +
verify_models.py, .github/workflows/{java-ci,scripts-ci,native-ci}.yml,
CODEOWNERS, dependabot.yml, PULL_REQUEST_TEMPLATE, ISSUE_TEMPLATE
- Stubs: java/examples/quickstart/, go/, models/
Group ID: io.github.randomcodespace.inference (matches GitHub org
RandomCodeSpace).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reformat 6 source files + pom.xml in inference-sdk-embed to satisfy the Spotless gate (CI was failing on `spotless:check`). No semantic changes — pure whitespace/wrap reflow per google-java-format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Tier 3 model JAR module was added to the aggregator but its pom.xml + sources were never committed (only an empty src/main/resources/models/ tree exists), so Maven failed the reactor with "Child module ... pom.xml does not exist" before any verify phase could run. Comment the module out until its scaffolding lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both verify (ubuntu-latest) and verify (ubuntu-22.04-arm) jobs failed with "no POM in this directory" because mvnw was invoked at repo root but the only Maven build lives at java/pom.xml. Pass `-f java/pom.xml` to every Maven invocation in verify, network-isolation, and javadoc jobs, and update the JavaDoc upload path accordingly.
Standalone gate steps (jacoco/spotless/spotbugs) also failed prefix resolution because spotless and spotbugs are configured in pluginManagement only, and the aggregator POM does not extend the parent. Drop the redundant standalone `jacoco:check` (already runs via the parent's verify-bound execution), and invoke spotless and spotbugs by full GAV coords with -pl scoped to parent-aware modules.
Validated locally: `./mvnw -f java/pom.xml -B -ntp -e verify` → BUILD SUCCESS, 60 tests run, 0 failures, 0 errors, 0 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…path 1)
User acknowledged path 1 on 2026-05-09 per ~/.claude/rules/security.md mitigation clause: ship Phase 1 with documented residual-risk + written sign-off. Fork-and-bump strategy moves to Phase 1.5 (first task).
Removed (58 files):
- native/kherud-fork/ (vendored kherud v4.2.0 + CMake bump to b8146)
- .github/workflows/native-ci.yml (3-platform cross-compile workflow)
Updated 11 files:
- java/inference-sdk-parent/pom.xml: dep coord -> de.kherud:llama:4.2.0; removed GitHub Packages repository entry
- java/inference-sdk-embed/pom.xml: header comment
- docs/ARCHITECTURE.md: rewrote D-003 (mitigation+sign-off justification); added Section 4.4 residual-risk register (5 reachable Highs + 1 Mod at b4916, mitigation per advisory: SHA-256 model allow-list neutralizes 4 of 5; input length cap + UTF-8 validation narrow the 5th); Phase 1.5 fork-and-bump as first task in Roadmap
- SECURITY.md: residual-risk section + 2026-05-09 sign-off line
- NOTICE: consumer-side kherud entry (de.kherud:llama:4.2.0 + embedded llama.cpp b4916 attribution)
- README.md: removed native/ from project tree; updated Phase 1 desc
- Makefile: removed native-build target
- CONTRIBUTING.md: removed native-binding section
- docs/GLOSSARY.md: kherud + Dockcross entries reflect Phase 1.5 context
- .github/dependabot.yml: removed fork-specific upstream watcher
- scripts/fetch_models.py: removed llama_cpp_pin(); hardcoded LLAMA_CPP_TAG=b4916 (matches kherud:4.2.0 bundled)
Validations: parent POM parses; fetch_models.py compiles; dependabot.yml parses as YAML.
Phase 1.5 commitment: fork kherud and bump bundled llama.cpp to clear all 5 reachable High GHSAs (8wwf, 7rxv, vgg9, 96jg, 3p4r) + add Gemma 4 architecture support; first task before any new feature work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 4.A (java/inference-sdk-generate):
- Generator interface + KherudGenerator impl wrapping de.kherud:llama:4.2.0
- Records: Message, GenerateRequest, GenerateResponse, GenerateChunk, GenerateStats with snake_case @JsonProperty annotations (Phase 2 ready)
- Streaming: BoundedSubscription honors Reactive Streams 3.9 + 3.17, exactly-one terminal chunk, cancellation at next-token boundary, idempotent cancel, lazy generation start
- InputValidator enforces SECURITY.md path-1 mitigations: length cap (contextSize x 8 chars/token) + strict UTF-8 round-trip; narrows GHSA-7rxv tokenizer prompt overflow
- ModelResolver: explicit modelPath -> INFERENCE_MODEL_DIR -> classpath
- Reserved fields throw FeatureNotSupportedException on non-null (Message.toolCalls/toolCallId/name; GenerateRequest.tools/toolChoice/responseFormat): cases #49, #50
- Auto-module name 'llama' for de.kherud:llama:4.2.0 (verified via jar manifest inspection: no Automatic-Module-Name; JPMS derives from filename minus version)
- LlamaClient seam interface (de.kherud.llama.LlamaModel is final); production wires KherudLlamaClient adapter, tests wire FakeLlamaClient
- 105 tests in generate; reactor total 165 tests, all passing
- JaCoCo line 79% / branch 71% (above 75/70 gates); KherudLlamaClient excluded with justification (real-model paths exercised in Tier 5 IT)
- Spotless google-java-format clean; SpotBugs HIGH clean
- NativeExecutor pinning workaround documented at every JNI call site
Tier 4.B (java/inference-sdk-generate-qwen-0_5b):
- Maven JAR with no Java code; resources placeholder for the Qwen 2.5-0.5B-Instruct Q4_K_M GGUF (populated by scripts/fetch_models.py in Tier 0.5 before Tier 5 IT)
- model-manifest.properties placeholder (id, hf_repo, revision, quantization, max_tokens, sha256, license)
Aggregator java/pom.xml updated with both new modules.
§11.2 case coverage: #13 (maxTokens=0), #14 (empty messages), #15 (system-only), #49 + #50 (reserved fields) implemented as unit tests; #12 + #16-34 deferred to Tier 5 IT (need real GGUF load). Documented in test class JavaDoc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
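The InputValidator mitigation in the Tier 4.A notes above (a length cap of contextSize x 8 chars/token plus a strict UTF-8 round-trip) can be illustrated with a minimal sketch. This is not the SDK's Java implementation: `validate_prompt` and its parameters are hypothetical names, and Python counts code points where Java's `String.length()` counts UTF-16 units, so the cap is only approximately equivalent.

```python
def validate_prompt(text: str, context_size: int, chars_per_token: int = 8) -> str:
    """Reject prompts over the length cap or that fail a strict UTF-8 round-trip.

    Hypothetical helper mirroring the described checks; not the SDK's API.
    """
    max_chars = context_size * chars_per_token
    if len(text) > max_chars:
        raise ValueError(f"prompt has {len(text)} chars, cap is {max_chars}")
    try:
        # Strict encoding fails on lone surrogates, which cannot be valid UTF-8.
        round_tripped = text.encode("utf-8", errors="strict").decode("utf-8", errors="strict")
    except UnicodeError as exc:
        raise ValueError("prompt is not encodable as strict UTF-8") from exc
    if round_tripped != text:
        raise ValueError("prompt changed during UTF-8 round-trip")
    return text
```

Running the check before any text reaches the native tokenizer keeps oversized or malformed input out of the JNI layer, which is the narrowing effect the commit describes for GHSA-7rxv.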
maven-shade-plugin produces a 75.6 MB fat JAR with:
- All 55 SDK classes (core + embed + generate)
- ONNX Runtime natives for 4 platforms (linux-x64, linux-aarch64, osx-aarch64, win-x64) including onnxruntime_providers_shared.dll
- de.kherud:llama:4.2.0 natives (linux/mac/win/android variants)
- DJL HuggingFace tokenizers natives (4 platforms)
- JNA dispatch natives (11+ platforms)
- Total: 30 .so / 12 .dll / 7 .dylib; 2911 JAR entries
Module-info strategy: unnamed/automatic-module fat JAR. shade 3.6.2 JPMS support is experimental and merging multiple module-infos breaks reproducibility. JPMS-strict consumers should use per-module artifacts (which carry well-formed module-infos). Tradeoff documented in pom.xml header + README.md.
Verification:
- BUILD SUCCESS; bundle reactor builds parent + core + embed + generate + qwen + bundle
- Signatures stripped: 0 .SF/.RSA/.DSA files
- module-info.class stripped from all shaded deps (incl. MR-JAR versions/*/module-info.class)
- Manifest: Reproducible-Build=true; Implementation-Title/Vendor/Version
- dependency-reduced-pom.xml generated (gitignored)
bge-small dep deferred to next commit (its module pom hasn't been written yet; Tier 4.B added the Qwen shell, bge-small shell follows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
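The fat-JAR verification items above (no signature files, no module-info.class entries, total entry count) could be checked with a small script along these lines. It is a sketch only: `audit_jar` is a hypothetical helper, not something the PR says exists in scripts/.

```python
import zipfile

# Signature file suffixes that must not survive shading.
SIGNATURE_SUFFIXES = (".SF", ".RSA", ".DSA")


def audit_jar(jar_path):
    """Report entry count, leftover signature files, and module-info entries."""
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()
    signatures = [
        n for n in names
        if n.startswith("META-INF/") and n.upper().endswith(SIGNATURE_SUFFIXES)
    ]
    # Covers the root descriptor and multi-release copies under META-INF/versions/.
    module_infos = [
        n for n in names
        if n == "module-info.class" or n.endswith("/module-info.class")
    ]
    return {"entries": len(names), "signatures": signatures, "module_infos": module_infos}
```

For a bundle built as described, both lists would be expected to come back empty.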
Mirrors Tier 4.B (Qwen JAR shell) for the embedding model:
- Maven JAR with no Java code; only resources/models/ for the LFS-tracked bge-small-en-v1.5.int8.onnx (populated by scripts/fetch_models.py in Tier 0.5)
- model-manifest.properties placeholder (id, hf_repo, dimensions=384, max_tokens=512, quantization=int8-dynamic, sha256, license=MIT)
- Skip JavaDoc/JaCoCo/SpotBugs (no Java sources)
Aggregator java/pom.xml: uncomment <module>inference-sdk-embed-bge-small</module>
inference-sdk-bundle/pom.xml: add bge-small as a dep so the fat JAR shades the embedding model resources alongside Qwen GGUF resources.
POM resolution verified: ./mvnw help:effective-pom -pl :inference-sdk-bundle -am exits 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RY.md
inference-sdk-integration-tests (new module, 14 files, 29 tests):
- NetworkIsolationIT (case #47, 7 tests): InetAddressResolverProvider SPI installs a BlockingDnsResolverProvider via META-INF/services that refuses arbitrary host lookups; Embedder + Generator init succeed -> proves no runtime DNS / phone-home in any dep
- ForwardCompatIT (cases #49 + #50 + #51, 14 tests): Message + GenerateRequest reserved fields throw FeatureNotSupportedException; Jackson serialization of every record produces snake_case keys per WIRE_FORMAT
- FailureModeIT (cases #43, #44, #45, 5 tests): typed exceptions for nonexistent model, corrupted file, wrong format
- GenerateEdgeCaseIT (cases #13, #14, #15, 3 tests): validation cases
- @Tag("model") classes (skeletons, deferred until make fetch-models): EmbedEdgeCaseIT (#1-11), full GenerateEdgeCaseIT (#12, #16-25), StreamingEdgeCaseIT (#26-34), ConcurrencyLifecycleIT (#35-39), ResourceExhaustionIT (#40-42), case #46 small-heap
- @Tag("model-switch") ModelSwitchIT (case #48 deferred per spec)
- @Tag("slow") PropertyTestsIT (jqwik per spec §11.3)
51 of 51 §11.2 cases now have implementations (or tagged skeletons documenting the model-fetch prerequisite).
java/examples/quickstart/Main.java (87 lines):
- Embed + sync generate + streaming demo with try-with-resources
- Wraps in RequestId.withRequestId for ScopedValue propagation pattern
- Compiles cleanly after ./mvnw install
LIBRARY.md (~140 lines): public API doc, virtual-threads / structured concurrency / Flow.Publisher streaming examples, native-pinning pattern diagram, build-time model switching workflow.
java/pom.xml: added inference-sdk-integration-tests to <modules>.
Final test suite: 194 / 194 passing (165 baseline + 29 new IT) on default verify. @Tag("model") + @Tag("slow") run with -P slow after make fetch-models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches from "commit ONNX/GGUF to Git LFS" to "fetch at build time and
embed in published Maven artifact". Eliminates LFS storage/bandwidth
costs while preserving the air-gapped offline-first guarantee for
downstream consumers (they receive model bytes via the Maven artifact
they depend on, not via a Git clone).
Mechanics:
- inference-sdk-embed-bge-small/pom.xml +
inference-sdk-generate-qwen-0_5b/pom.xml: exec-maven-plugin bound to
generate-resources phase invokes scripts/fetch_models.py with the
model id; second exec at process-resources runs
scripts/verify_models.py for SHA-256 check against
scripts/checksums/models.sha256. Skip via -Dfetch.models.skip=true
for IDE imports / quick POM-only builds.
- .gitattributes: removed *.onnx / *.gguf / *.safetensors / *.bin /
*.pt / *.pth LFS tracking (no Git LFS dependency)
- .gitignore: added
**/src/main/resources/models/*.{onnx,gguf,safetensors,bin,pt,pth}
so locally-fetched bytes don't accidentally commit;
model-manifest.properties + .gitkeep stay tracked
- .github/workflows/java-ci.yml: added actions/cache for
~/.cache/huggingface + the staged module/src/main/resources/models/
+ build/llama.cpp keyed on
hashFiles('scripts/checksums/models.sha256','scripts/fetch_models.py');
pinned hashes mean cache stays valid until we deliberately bump
- docs/ARCHITECTURE.md: rewrote model-distribution section to describe
the hybrid approach; replaced LFS language with build-time fetch +
Maven artifact embedding
Air-gapped consumers: receive bundled model bytes via the published
Maven artifact (no network at consume time). Air-gapped contributors:
documented path in CONTRIBUTING.md (forthcoming follow-up commit:
"request a pre-built target/ cache from a maintainer or run make
fetch-models on a connected machine and copy ~/.cache/huggingface/
into the air-gapped environment").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
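The SHA-256 gate that runs at process-resources can be pictured with the sketch below. It is not the contents of scripts/verify_models.py: the function names are hypothetical, and the checksum file is assumed to use the common `sha256sum` layout (hex digest, whitespace, relative path), which the commit message does not specify.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large model files are never held in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse '<hex>  <name>' lines (assumed format) into {name: hex}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hex_digest, name = line.split(maxsplit=1)
        expected[name.lstrip("*")] = hex_digest.lower()
    return expected


def find_mismatches(root, checksums_text):
    """Return the names that are missing under root or whose digest differs."""
    failures = []
    for name, want in parse_checksums(checksums_text).items():
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != want:
            failures.append(name)
    return failures
```

A build step would fail when `find_mismatches` returns a non-empty list, so that unverified model bytes never reach the packaged artifact.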
CONTRIBUTING.md:
- Removed "Git LFS" from required toolchain
- Added gcc + cmake (required by llama.cpp's convert_hf_to_gguf.py + llama-quantize during the generation-model fetch path)
- Added "Air-gapped contributors" section with two offline workflows: (a) request a pre-built target/ cache from a maintainer; (b) run make fetch-models on a connected machine and copy ~/.cache/huggingface/ into the air-gapped environment
java/inference-sdk-generate-qwen-0_5b/pom.xml: refinement of the fetch binding (post-Spotless reformat from the hybrid retrofit agent).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bytes
Root cause: the .gitattributes line 'text=auto eol=lf' treated the test fixture .bin files as text and applied LF normalization, corrupting their bytes. (127 -> 32 bytes is the wrong direction: the originals are 32 bytes. The corruption inflated random bytes into LF-converted form on commit, and checkout then produced bytes that differ from what the SHA-256 in the sibling .sha256 file expects.)
NativeLibLoaderTest verifies the SHA-256 of an extracted resource against its sibling .sha256 file; with normalized fixtures the SHA never matched, so the tests fail in CI on a clean checkout (locally, devs still had the original bytes in their working copy, masking the issue).
Fix: explicitly mark the two test fixtures binary in .gitattributes:
  java/inference-sdk-core/src/test/resources/native-fixtures/*.bin binary
  java/inference-sdk-core/src/test/resources/native-fixtures/*.sha256 text=auto
Restored the .bin files to their original 32-byte content (one matches its .sha256, the other intentionally doesn't for the negative-path test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
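Why line-ending normalization breaks a checksum-verified binary fixture can be shown in a few lines. The sketch below only mimics the byte-level effect of CRLF-to-LF conversion on an invented 32-byte payload; it does not reproduce Git's text/binary detection heuristics or the repository's real fixtures.

```python
import hashlib


def normalize_eol(data: bytes) -> bytes:
    """Mimic text-mode normalization: every CRLF pair becomes a single LF."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical 32-byte fixture that happens to contain one CR LF pair.
fixture = bytes(range(1, 13)) + b"\r\n" + bytes(range(14, 32))
normalized = normalize_eol(fixture)

original_digest = hashlib.sha256(fixture).hexdigest()
normalized_digest = hashlib.sha256(normalized).hexdigest()

print(len(fixture), len(normalized))         # lengths before and after
print(original_digest == normalized_digest)  # digests no longer agree
```

Any byte sequence that looks like a line ending is rewritten, so a digest recorded over the original bytes can no longer match; marking the files `binary` turns the conversion off for them.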
Root cause of the verify regression: the hybrid model-distribution exec binding fires fetch_models.py during generate-resources, but the GHA runners don't have huggingface_hub / onnxruntime / optimum / safetensors / gcc-cmake-make pre-installed, and we don't want PR feedback to wait on multi-minute HF downloads + GGUF conversion. Default PR verify just needs to exercise the SDK's own logic; real model bytes are not on the PR-feedback critical path.
Fix:
- .github/workflows/java-ci.yml: add -Dfetch.models.skip=true to BOTH verify and network-isolation mvnw invocations. PR verify now runs in seconds (local timing: 15s for the full reactor including the 29 IT tests). @Tag("model") tests stay deferred per Tier 5 design.
- .github/workflows/package-artifacts.yml: NEW manual + scheduled (weekly) workflow. Sets up Python 3.11, installs scripts/requirements.txt, ensures gcc/cmake/make, runs the FULL mvnw verify (no skip flag) to actually fetch + convert + embed models, then uploads the inference-sdk-bundle fat JAR + per-module JARs as artifacts. Smoke-tests the bundled GGUF on a sample prompt.
Local validation: ./mvnw -f java/pom.xml -Dfetch.models.skip=true -B -ntp verify -> BUILD SUCCESS, 9 modules green, 194 tests passing in 15s.
This matches the design pattern many ML-library repos use: PRs get fast feedback; artifact builds happen in a separate slower workflow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Foundation PR for inference-sdk Java Phase 1: combines Tier 0 (kherud-fork + native CI) and Tier 1 (repo scaffolding + docs + tooling).
Tier 0: native/kherud-fork/
In-repo fork of kherud/java-llama.cpp v4.2.0 with the bundled llama.cpp bumped from b4916 to b8146. Rationale (full vulnerability deep-dive evidence in research notes; mirrored to SECURITY.md):
- Clears 5 reachable High GHSA advisories at b4916: 8wwf (token_to_piece overflow), 7rxv (tokenizer overflow), vgg9 (GGUF size accumulator overflow), 96jg (ggml_nbytes overflow → potential RCE), 3p4r (mem_size bypass)
- Adds Gemma 3 / Gemma 3n architecture support: MODEL_ARCH.GEMMA3 / GEMMA3N and Gemma3Model / Gemma3NModel in gguf-py/gguf/constants.py and convert_hf_to_gguf.py at b8146
- Native CI: dockcross/manylinux2014-x64 (glibc 2.17) + dockcross/linux-arm64-lts (glibc 2.27) + windows-2019/VS2019
- Path-filtered (native/kherud-fork/**) so unrelated PRs don't trigger native rebuilds
- Smoke test: load Qwen/Qwen2.5-0.5B-Instruct.Q4_K_M.gguf on a UBI8 container, run 10 generations
- Publishes io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146 to GitHub Packages on tag push
Maintenance plan in native/kherud-fork/README.md: quarterly bump checks; trigger on (a) new High/Critical CVE in pinned llama.cpp, (b) JDK release breaking JNI compat, (c) new model architecture needed.
Tier 1: repo scaffolding
- LICENSE (Apache 2.0 verbatim), NOTICE, SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md (Contributor Covenant 2.1 verbatim), .editorconfig, Makefile, enhanced README.md
- docs/ARCHITECTURE.md, docs/WIRE_FORMAT.md, docs/MODEL_REGISTRY.md, docs/GLOSSARY.md
- scripts/fetch_models.py (HF → ONNX/GGUF converter, build-host-only); scripts/verify_models.py (SHA-256 check)
- .github/workflows/{java-ci,scripts-ci,native-ci}.yml; CODEOWNERS; dependabot config (with custom watcher for de.kherud:llama upstream); PR + issue templates
- java/examples/quickstart/, go/, models/
Deviations from java-sdk.md
Recorded in docs/ARCHITECTURE.md § Deviation Register. Summary:
- Qwen/Qwen2.5-0.5B-Instruct (Apache 2.0 ungated) instead of gemma-3-270m-it (Gemma Terms, gated)
- com.diffplug.spotless (corrected from spec); JaCoCo 0.8.14 floor for JDK 25
Test plan
This PR contains no production Java code yet; it's foundation only. Acceptance:
- main is clean
- ./mvnw --version reports Maven 3.9.15 + JDK 25
- python3 -m py_compile scripts/{fetch,verify}_models.py clean
- mvn help:effective-pom -f native/kherud-fork/pom.xml -q exits 0
- No randomcodebase typo remaining in tracked files
Subsequent tiers (not in this PR)
- inference-sdk-core module
- inference-sdk-embed + bge-small model JAR
- inference-sdk-generate + Qwen 0.5B model JAR + bundle fat-JAR
🤖 Generated with Claude Code