Phase 1 foundation: kherud fork + repo scaffolding by aksOps · Pull Request #1 · RandomCodeSpace/inference-sdk

aksOps · 2026-05-08T19:10:37Z

Summary

Foundation PR for inference-sdk Java Phase 1 — combines Tier 0 (kherud-fork + native CI) and Tier 1 (repo scaffolding + docs + tooling).

Tier 0 — `native/kherud-fork/`

In-repo fork of kherud/java-llama.cpp v4.2.0 with the bundled llama.cpp bumped from b4916 to b8146. Rationale (full vulnerability deep-dive evidence in research notes; mirrored to SECURITY.md):

Clears 5 reachable High GHSA advisories in upstream b4916: 8wwf (token_to_piece overflow), 7rxv (tokenizer overflow), vgg9 (GGUF size accumulator overflow), 96jg (ggml_nbytes overflow → potential RCE), 3p4r (mem_size bypass)
Adds Gemma 3 / Gemma 3n architecture support — confirmed via MODEL_ARCH.GEMMA3 / GEMMA3N and Gemma3Model / Gemma3NModel in gguf-py/gguf/constants.py and convert_hf_to_gguf.py at b8146
Native CI matrix: dockcross/manylinux2014-x64 (glibc 2.17) + dockcross/linux-arm64-lts (glibc 2.27) + windows-2019/VS2019
Path-filtered (native/kherud-fork/**) so unrelated PRs don't trigger native rebuilds
Smoke test plan: load Qwen/Qwen2.5-0.5B-Instruct.Q4_K_M.gguf on a UBI8 container, run 10 generations
Will publish io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146 to GitHub Packages on tag push

Maintenance plan in native/kherud-fork/README.md: quarterly bump checks; trigger on (a) new High/Critical CVE in pinned llama.cpp, (b) JDK release breaking JNI compat, (c) new model architecture needed.

Tier 1 — repo scaffolding

Top-level: LICENSE (Apache 2.0 verbatim), NOTICE, SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md (Contributor Covenant 2.1 verbatim), .editorconfig, Makefile, enhanced README.md
Docs: docs/ARCHITECTURE.md, docs/WIRE_FORMAT.md, docs/MODEL_REGISTRY.md, docs/GLOSSARY.md
Tooling: Maven Wrapper pinned to 3.9.15; scripts/fetch_models.py (HF → ONNX/GGUF converter, build-host-only); scripts/verify_models.py (SHA-256 check)
CI: .github/workflows/{java-ci,scripts-ci,native-ci}.yml; CODEOWNERS; dependabot config (with custom watcher for de.kherud:llama upstream); PR + issue templates
Stubs: java/examples/quickstart/, go/, models/

Deviations from `java-sdk.md`

Recorded in docs/ARCHITECTURE.md § Deviation Register. Summary:

D-001: Generation default = Qwen/Qwen2.5-0.5B-Instruct (Apache 2.0 ungated) instead of gemma-3-270m-it (Gemma Terms, gated)
D-002: Platform matrix = Win-x64 + Linux UBI8/9/10 + Linux-arm64 (UBI8+); Win-arm64 explicitly out of scope (broken upstream)
D-003: llama.cpp Java binding maintenance criterion relaxed via fork-and-bump (this PR)
D-004: Spotless GAV is com.diffplug.spotless (corrected from spec); JaCoCo 0.8.14 floor for JDK 25
D-005: jlama evaluated and rejected (Critical CVEs in jinjava 2.7.2 transitive; missing Gemma 4 architecture support; pure-Java perf would not scale to E4B)

Test plan

This PR contains no production Java code yet — it's foundation only. Acceptance:

Bootstrap commit on main is clean
./mvnw --version reports Maven 3.9.15 + JDK 25
python3 -m py_compile scripts/{fetch,verify}_models.py clean
All YAML files parse cleanly
mvn help:effective-pom -f native/kherud-fork/pom.xml -q exits 0
No randomcodebase typo remaining in tracked files
Native CI workflow runs successfully on first push (validates the dockcross + Windows build matrix; will surface any JNI-vs-b8146 incompatibility)
Tier 2 (Java parent + core) opens against this branch's merge

Subsequent tiers (not in this PR)

Tier 2: Java parent POM + inference-sdk-core module
Tier 3: inference-sdk-embed + bge-small model JAR
Tier 4: inference-sdk-generate + Qwen 0.5B model JAR + bundle fat-JAR
Tier 5: integration tests (51 numbered §11.2 cases) + quickstart example + final quality gates

🤖 Generated with Claude Code

Tier 0 (native/kherud-fork/):
- Forked de.kherud:llama:v4.2.0 (upstream SHA 330ccc1a6c)
- Bumped bundled llama.cpp from b4916 to b8146 - clears 5 reachable High GHSA advisories (8wwf, 7rxv, vgg9, 96jg, 3p4r) and adds Gemma 3 / Gemma 3n architecture support
- Native CI workflow for Win-x64 + Linux-x64 (manylinux2014/glibc 2.17) + Linux-arm64 (dockcross-arm64-lts/glibc 2.27); path-filtered to native/kherud-fork/**
- Will publish io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146 to GitHub Packages
- Smoke-test plan: load Qwen 2.5-0.5B GGUF on UBI8 container

Tier 1 (repo scaffolding):
- Top-level: LICENSE (Apache 2.0), NOTICE, SECURITY, CONTRIBUTING, CODE_OF_CONDUCT (Contributor Covenant 2.1), .editorconfig, Makefile
- Docs: ARCHITECTURE, WIRE_FORMAT, MODEL_REGISTRY, GLOSSARY
- Tooling: Maven Wrapper (3.9.15), scripts/fetch_models.py + verify_models.py, .github/workflows/{java-ci,scripts-ci,native-ci}.yml, CODEOWNERS, dependabot.yml, PULL_REQUEST_TEMPLATE, ISSUE_TEMPLATE
- Stubs: java/examples/quickstart/, go/, models/

Group ID: io.github.randomcodespace.inference (matches GitHub org RandomCodeSpace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

socket-security · 2026-05-08T19:25:17Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	maven/org.assertj/assertj-core@3.27.7
	maven/ch.qos.logback/logback-classic@1.5.32
	maven/com.fasterxml.jackson.core/jackson-annotations@2.21
	maven/com.fasterxml.jackson.core/jackson-databind@2.21.3
	maven/nl.jqno.equalsverifier/equalsverifier@4.5
	maven/org.awaitility/awaitility@4.3.0
	maven/org.junit.jupiter/junit-jupiter-params@6.0.3
	maven/com.microsoft.onnxruntime/onnxruntime@1.25.1
	maven/org.junit.jupiter/junit-jupiter@6.0.3
	pypi/onnxruntime@1.26.0
	maven/org.slf4j/slf4j-api@2.0.17
	maven/ai.djl.huggingface/tokenizers@0.36.0
	pypi/huggingface-hub@0.36.2
	maven/de.kherud/llama@4.2.0
	maven/net.jqwik/jqwik@1.9.3
	pypi/pyright@1.1.409
	pypi/safetensors@0.7.0
	pypi/ruff@0.15.12

View full report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reformat 6 source files + pom.xml in inference-sdk-embed to satisfy the Spotless gate (CI was failing on `spotless:check`). No semantic changes — pure whitespace/wrap reflow per google-java-format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Tier 3 model JAR module was added to the aggregator but its pom.xml + sources were never committed (only an empty src/main/resources/models/ tree exists), so Maven failed the reactor with "Child module ... pom.xml does not exist" before any verify phase could run. Comment the module out until its scaffolding lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both verify (ubuntu-latest) and verify (ubuntu-22.04-arm) jobs failed with "no POM in this directory" because mvnw was invoked at repo root but the only Maven build lives at java/pom.xml. Pass `-f java/pom.xml` to every Maven invocation in verify, network-isolation, and javadoc jobs, and update the JavaDoc upload path accordingly.

Standalone gate steps (jacoco/spotless/spotbugs) also failed prefix resolution because spotless and spotbugs are configured in pluginManagement only, and the aggregator POM does not extend the parent. Drop the redundant standalone `jacoco:check` (already runs via the parent's verify-bound execution), and invoke spotless and spotbugs by full GAV coords with -pl scoped to parent-aware modules.

Validated locally: `./mvnw -f java/pom.xml -B -ntp -e verify` → BUILD SUCCESS, 60 tests run, 0 failures, 0 errors, 0 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…path 1)

User acknowledged path 1 on 2026-05-09 per ~/.claude/rules/security.md mitigation clause: ship Phase 1 with documented residual-risk + written sign-off. Fork-and-bump strategy moves to Phase 1.5 (first task).

Removed (58 files):
- native/kherud-fork/ (vendored kherud v4.2.0 + CMake bump to b8146)
- .github/workflows/native-ci.yml (3-platform cross-compile workflow)

Updated 11 files:
- java/inference-sdk-parent/pom.xml: dep coord -> de.kherud:llama:4.2.0; removed GitHub Packages repository entry
- java/inference-sdk-embed/pom.xml: header comment
- docs/ARCHITECTURE.md: rewrote D-003 (mitigation+sign-off justification); added Section 4.4 residual-risk register (5 reachable Highs + 1 Mod at b4916, mitigation per advisory: SHA-256 model allow-list neutralizes 4 of 5; input length cap + UTF-8 validation narrow the 5th); Phase 1.5 fork-and-bump as first task in Roadmap
- SECURITY.md: residual-risk section + 2026-05-09 sign-off line
- NOTICE: consumer-side kherud entry (de.kherud:llama:4.2.0 + embedded llama.cpp b4916 attribution)
- README.md: removed native/ from project tree; updated Phase 1 desc
- Makefile: removed native-build target
- CONTRIBUTING.md: removed native-binding section
- docs/GLOSSARY.md: kherud + Dockcross entries reflect Phase 1.5 context
- .github/dependabot.yml: removed fork-specific upstream watcher
- scripts/fetch_models.py: removed llama_cpp_pin(); hardcoded LLAMA_CPP_TAG=b4916 (matches kherud:4.2.0 bundled)

Validations: parent POM parses; fetch_models.py compiles; dependabot.yml parses as YAML.

Phase 1.5 commitment: fork kherud and bump bundled llama.cpp to clear all 5 reachable High GHSAs (8wwf, 7rxv, vgg9, 96jg, 3p4r) + add Gemma 4 architecture support; first task before any new feature work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tier 4.A (java/inference-sdk-generate):
- Generator interface + KherudGenerator impl wrapping de.kherud:llama:4.2.0
- Records: Message, GenerateRequest, GenerateResponse, GenerateChunk, GenerateStats with snake_case @JsonProperty annotations (Phase 2 ready)
- Streaming: BoundedSubscription honors Reactive Streams 3.9 + 3.17, exactly-one terminal chunk, cancellation at next-token boundary, idempotent cancel, lazy generation start
- InputValidator enforces SECURITY.md path-1 mitigations: length cap (contextSize x 8 chars/token) + strict UTF-8 round-trip - narrows GHSA-7rxv tokenizer prompt overflow
- ModelResolver: explicit modelPath -> INFERENCE_MODEL_DIR -> classpath
- Reserved fields throw FeatureNotSupportedException on non-null (Message.toolCalls/toolCallId/name; GenerateRequest.tools/toolChoice/responseFormat) - cases #49, #50
- Auto-module name 'llama' for de.kherud:llama:4.2.0 (verified via jar manifest inspection: no Automatic-Module-Name; JPMS derives from filename minus version)
- LlamaClient seam interface (de.kherud.llama.LlamaModel is final); production wires KherudLlamaClient adapter, tests wire FakeLlamaClient
- 105 tests in generate; reactor total 165 tests, all passing
- JaCoCo line 79% / branch 71% (above 75/70 gates); KherudLlamaClient excluded with justification (real-model paths exercised in Tier 5 IT)
- Spotless google-java-format clean; SpotBugs HIGH clean
- NativeExecutor pinning workaround documented at every JNI call site

Tier 4.B (java/inference-sdk-generate-qwen-0_5b):
- Maven JAR with no Java code; resources placeholder for the Qwen 2.5-0.5B-Instruct Q4_K_M GGUF (populated by scripts/fetch_models.py in Tier 0.5 before Tier 5 IT)
- model-manifest.properties placeholder (id, hf_repo, revision, quantization, max_tokens, sha256, license)

Aggregator java/pom.xml updated with both new modules.

§11.2 case coverage: #13 (maxTokens=0), #14 (empty messages), #15 (system-only), #49 + #50 (reserved fields) implemented as unit tests; #12 + #16-34 deferred to Tier 5 IT (need real GGUF load). Documented in test class JavaDoc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
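The two InputValidator checks named above (a length cap of contextSize x 8 characters and a strict UTF-8 round-trip) can be illustrated with a short sketch. The real validator is Java and is not shown in this PR page, so everything below is an assumption made for illustration: the function names `max_prompt_chars` and `validate_prompt` are hypothetical, and Python counts code points where Java counts UTF-16 units, so the lengths are only roughly comparable.

```python
# Illustrative sketch only; the SDK's InputValidator is Java and may differ.
# Function names here are hypothetical, not part of the SDK.


def max_prompt_chars(context_size: int, chars_per_token: int = 8) -> int:
    """Length cap from the commit message: contextSize x 8 chars/token."""
    return context_size * chars_per_token


def validate_prompt(prompt: str, context_size: int) -> None:
    """Raise ValueError if the prompt is too long or not well-formed Unicode."""
    limit = max_prompt_chars(context_size)
    if len(prompt) > limit:
        raise ValueError(f"prompt has {len(prompt)} chars, limit is {limit}")

    try:
        # Strict encoding rejects lone surrogates such as "\ud800".
        encoded = prompt.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError("prompt is not well-formed Unicode") from exc

    # In Python a successful strict encode always round-trips; the explicit
    # comparison only mirrors the round-trip check the commit describes.
    if encoded.decode("utf-8") != prompt:
        raise ValueError("prompt does not survive a UTF-8 round-trip")
```

With a context size of 2048 the cap works out to 16384 characters. Rejecting the input before it reaches the native tokenizer is the point of the mitigation: oversized or malformed prompts never cross the JNI boundary.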

maven-shade-plugin produces a 75.6 MB fat JAR with:
- All 55 SDK classes (core + embed + generate)
- ONNX Runtime natives for 4 platforms (linux-x64, linux-aarch64, osx-aarch64, win-x64) including onnxruntime_providers_shared.dll
- de.kherud:llama:4.2.0 natives (linux/mac/win/android variants)
- DJL HuggingFace tokenizers natives (4 platforms)
- JNA dispatch natives (11+ platforms)
- Total: 30 .so / 12 .dll / 7 .dylib; 2911 JAR entries

Module-info strategy: unnamed/automatic-module fat JAR. shade 3.6.2 JPMS support is experimental and merging multiple module-infos breaks reproducibility. JPMS-strict consumers should use per-module artifacts (which carry well-formed module-infos). Tradeoff documented in pom.xml header + README.md.

Verification:
- BUILD SUCCESS; bundle reactor builds parent + core + embed + generate + qwen + bundle
- Signatures stripped: 0 .SF/.RSA/.DSA files
- module-info.class stripped from all shaded deps (incl. MR-JAR versions/*/module-info.class)
- Manifest: Reproducible-Build=true; Implementation-Title/Vendor/Version
- dependency-reduced-pom.xml generated (gitignored)

bge-small dep deferred to next commit (its module pom hasn't been written yet; Tier 4.B added the Qwen shell, bge-small shell follows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors Tier 4.B (Qwen JAR shell) for the embedding model:
- Maven JAR with no Java code; only resources/models/ for the LFS-tracked bge-small-en-v1.5.int8.onnx (populated by scripts/fetch_models.py in Tier 0.5)
- model-manifest.properties placeholder (id, hf_repo, dimensions=384, max_tokens=512, quantization=int8-dynamic, sha256, license=MIT)
- Skip JavaDoc/JaCoCo/SpotBugs (no Java sources)

Aggregator java/pom.xml: uncomment <module>inference-sdk-embed-bge-small</module>

inference-sdk-bundle/pom.xml: add bge-small as a dep so the fat JAR shades the embedding model resources alongside Qwen GGUF resources.

POM resolution verified: ./mvnw help:effective-pom -pl :inference-sdk-bundle -am exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…RY.md

inference-sdk-integration-tests (new module, 14 files, 29 tests):
- NetworkIsolationIT (case #47, 7 tests): InetAddressResolverProvider SPI installs a BlockingDnsResolverProvider via META-INF/services that refuses arbitrary host lookups; Embedder + Generator init succeed -> proves no runtime DNS / phone-home in any dep
- ForwardCompatIT (cases #49 + #50 + #51, 14 tests): Message + GenerateRequest reserved fields throw FeatureNotSupportedException; Jackson serialization of every record produces snake_case keys per WIRE_FORMAT
- FailureModeIT (cases #43, #44, #45, 5 tests): typed exceptions for nonexistent model, corrupted file, wrong format
- GenerateEdgeCaseIT (cases #13, #14, #15, 3 tests): validation cases
- @tag("model") classes (skeletons, deferred until make fetch-models): EmbedEdgeCaseIT (#1-11), full GenerateEdgeCaseIT (#12, #16-25), StreamingEdgeCaseIT (#26-34), ConcurrencyLifecycleIT (#35-39), ResourceExhaustionIT (#40-42), case #46 small-heap
- @tag("model-switch") ModelSwitchIT (case #48 deferred per spec)
- @tag("slow") PropertyTestsIT (jqwik per spec §11.3)

51 of 51 §11.2 cases now have implementations (or tagged skeletons documenting the model-fetch prerequisite).

java/examples/quickstart/Main.java (87 lines):
- Embed + sync generate + streaming demo with try-with-resources
- Wraps in RequestId.withRequestId for ScopedValue propagation pattern
- Compiles cleanly after ./mvnw install

LIBRARY.md (~140 lines): public API doc, virtual-threads / structured concurrency / Flow.Publisher streaming examples, native-pinning pattern diagram, build-time model switching workflow.

java/pom.xml: added inference-sdk-integration-tests to <modules>.

Final test suite: 194 / 194 passing (165 baseline + 29 new IT) on default verify. @tag("model") + @tag("slow") run with -P slow after make fetch-models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switches from "commit ONNX/GGUF to Git LFS" to "fetch at build time and embed in published Maven artifact". Eliminates LFS storage/bandwidth costs while preserving the air-gapped offline-first guarantee for downstream consumers (they receive model bytes via the Maven artifact they depend on, not via a Git clone).

Mechanics:
- inference-sdk-embed-bge-small/pom.xml + inference-sdk-generate-qwen-0_5b/pom.xml: exec-maven-plugin bound to generate-resources phase invokes scripts/fetch_models.py with the model id; second exec at process-resources runs scripts/verify_models.py for SHA-256 check against scripts/checksums/models.sha256. Skip via -Dfetch.models.skip=true for IDE imports / quick POM-only builds.
- .gitattributes: removed *.onnx / *.gguf / *.safetensors / *.bin / *.pt / *.pth LFS tracking (no Git LFS dependency)
- .gitignore: added **/src/main/resources/models/*.{onnx,gguf,safetensors,bin,pt,pth} so locally-fetched bytes don't accidentally commit; model-manifest.properties + .gitkeep stay tracked
- .github/workflows/java-ci.yml: added actions/cache for ~/.cache/huggingface + the staged module/src/main/resources/models/ + build/llama.cpp keyed on hashFiles('scripts/checksums/models.sha256','scripts/fetch_models.py'); pinned hashes mean cache stays valid until we deliberately bump
- docs/ARCHITECTURE.md: rewrote model-distribution section to describe the hybrid approach; replaced LFS language with build-time fetch + Maven artifact embedding

Air-gapped consumers: receive bundled model bytes via the published Maven artifact (no network at consume time).

Air-gapped contributors: documented path in CONTRIBUTING.md (forthcoming follow-up commit: "request a pre-built target/ cache from a maintainer or run make fetch-models on a connected machine and copy ~/.cache/huggingface/ into the air-gapped environment").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
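The SHA-256 gate that runs at process-resources can be sketched as follows. This is not the contents of scripts/verify_models.py, which this page does not show. It is a minimal sketch that assumes scripts/checksums/models.sha256 uses the common `sha256sum` layout (`<hex digest>  <relative path>` per line); the function names are hypothetical.

```python
# Minimal sketch of a checksum gate; not the repo's verify_models.py.
# Assumption: the checksum file uses the `sha256sum` line layout.
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text: str) -> dict[str, str]:
    """Map relative file name -> expected lowercase hex digest."""
    entries: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        expected, name = line.split(maxsplit=1)
        entries[name.lstrip("*")] = expected.lower()  # "*" marks binary mode
    return entries


def verify(root: Path, checksums: dict[str, str]) -> list[str]:
    """Return the names that are missing or whose digest does not match."""
    failures = []
    for name, expected in checksums.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

A build step following this shape would exit non-zero whenever `verify` returns a non-empty list, so a tampered or truncated download fails the build before the bytes are embedded in the Maven artifact.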

CONTRIBUTING.md:
- Removed "Git LFS" from required toolchain
- Added gcc + cmake (required by llama.cpp's convert_hf_to_gguf.py + llama-quantize during the generation-model fetch path)
- Added "Air-gapped contributors" section with two offline workflows: (a) request a pre-built target/ cache from a maintainer; (b) run make fetch-models on a connected machine and copy ~/.cache/huggingface/ into the air-gapped environment

java/inference-sdk-generate-qwen-0_5b/pom.xml: refinement of the fetch binding (post-Spotless reformat from the hybrid retrofit agent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…bytes

Root cause: .gitattributes line 'text=auto eol=lf' was treating the test fixture .bin files as text and applying LF normalization on checkout, corrupting their bytes. (127 -> 32 bytes is the wrong direction: the original fixtures are 32 bytes. The corruption inflated random bytes into LF-converted form on commit, and the subsequent checkout produced bytes that no longer matched the SHA-256 recorded in the sibling .sha256 file.)

NativeLibLoaderTest verifies SHA-256 of an extracted resource against its sibling .sha256 file; with normalized fixtures the SHA never matched -> tests fail in CI on a clean checkout (locally, devs still had the original bytes in their working copy, masking the issue).

Fix: explicitly mark the two test fixtures binary in .gitattributes:

    java/inference-sdk-core/src/test/resources/native-fixtures/*.bin binary
    java/inference-sdk-core/src/test/resources/native-fixtures/*.sha256 text=auto

Restored the .bin files to their original 32-byte content (one matches its .sha256, the other intentionally doesn't for the negative-path test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
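Why end-of-line normalization breaks a checksum-verified binary fixture can be shown at the byte level. The sketch below is a simplified stand-in for the conversion and not Git's actual implementation; the fixture contents and the `normalize_eol` helper are made up for illustration.

```python
# Simplified stand-in for text normalization; not Git's real algorithm.
import hashlib


def normalize_eol(data: bytes) -> bytes:
    """Rewrite CRLF pairs to LF, as a text-mode conversion would."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical 32-byte fixture that happens to contain a CRLF pair.
fixture = b"A" * 15 + b"\r\n" + b"B" * 15
normalized = normalize_eol(fixture)

recorded = hashlib.sha256(fixture).hexdigest()      # what a .sha256 file would pin
observed = hashlib.sha256(normalized).hexdigest()   # what a test would compute

print(len(fixture), len(normalized))  # 32 31
print(recorded == observed)           # False
```

Any single rewritten byte changes the digest, so a fixture that passes through a text conversion can never match the hash pinned next to it. Marking such files `binary` in .gitattributes keeps them out of that conversion path.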

Root cause of the verify regression: hybrid model-distribution exec binding fires fetch_models.py during generate-resources, but the GHA runners don't have huggingface_hub / onnxruntime / optimum / safetensors / gcc-cmake-make pre-installed AND we don't want PR feedback to wait on multi-minute HF downloads + GGUF conversion. Default PR verify just needs to exercise the SDK's own logic; real model bytes are not on the PR-feedback critical path.

Fix:
- .github/workflows/java-ci.yml: add -Dfetch.models.skip=true to BOTH verify and network-isolation mvnw invocations. PR verify now runs in seconds (local timing: 15s for the full reactor including the 29 IT tests). @tag("model") tests stay deferred per Tier 5 design.
- .github/workflows/package-artifacts.yml: NEW manual + scheduled (weekly) workflow. Sets up Python 3.11, installs scripts/requirements.txt, ensures gcc/cmake/make, runs the FULL mvnw verify (no skip flag) to actually fetch + convert + embed models, then uploads the inference-sdk-bundle fat JAR + per-module JARs as artifacts. Smoke-tests the bundled GGUF on a sample prompt.

Local validation: ./mvnw -f java/pom.xml -Dfetch.models.skip=true -B -ntp verify -> BUILD SUCCESS, 9 modules green, 194 tests passing in 15s.

This matches the design pattern many ML-library repos use: PRs get fast feedback; artifact builds happen in a separate slower workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps and others added 17 commits May 9, 2026 00:23

checkpoint: pre-yolo 2026-05-09T00:23:00

61afd38

checkpoint: pre-yolo 2026-05-09T00:34:13

2025bdf

checkpoint: pre-yolo 2026-05-09T00:41:21

9ef8088

fix(scripts): drop extraneous f prefix on manifest header (ruff F541)

a9e5304

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

checkpoint: pre-yolo 2026-05-09T01:58:26

65eed55

aksOps wants to merge 18 commits into main from feat/phase-1-foundation
