Phase 1 foundation: kherud fork + repo scaffolding #1

Open
aksOps wants to merge 18 commits into main from feat/phase-1-foundation

aksOps commented May 8, 2026

Summary

Foundation PR for inference-sdk Java Phase 1 — combines Tier 0 (kherud-fork + native CI) and Tier 1 (repo scaffolding + docs + tooling).

Tier 0 — native/kherud-fork/

In-repo fork of kherud/java-llama.cpp v4.2.0 with the bundled llama.cpp bumped from b4916 to b8146. Rationale (full vulnerability deep-dive evidence in research notes; mirrored to SECURITY.md):

  • Clears 5 reachable High GHSA advisories in upstream b4916: 8wwf (token_to_piece overflow), 7rxv (tokenizer overflow), vgg9 (GGUF size accumulator overflow), 96jg (ggml_nbytes overflow → potential RCE), 3p4r (mem_size bypass)
  • Adds Gemma 3 / Gemma 3n architecture support — confirmed via MODEL_ARCH.GEMMA3 / GEMMA3N and Gemma3Model / Gemma3NModel in gguf-py/gguf/constants.py and convert_hf_to_gguf.py at b8146
  • Native CI matrix: dockcross/manylinux2014-x64 (glibc 2.17) + dockcross/linux-arm64-lts (glibc 2.27) + windows-2019/VS2019
  • Path-filtered (native/kherud-fork/**) so unrelated PRs don't trigger native rebuilds
  • Smoke test plan: load Qwen/Qwen2.5-0.5B-Instruct.Q4_K_M.gguf on a UBI8 container, run 10 generations
  • Will publish io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146 to GitHub Packages on tag push

Maintenance plan in native/kherud-fork/README.md: quarterly bump checks; trigger on (a) new High/Critical CVE in pinned llama.cpp, (b) JDK release breaking JNI compat, (c) new model architecture needed.

Tier 1 — repo scaffolding

  • Top-level: LICENSE (Apache 2.0 verbatim), NOTICE, SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md (Contributor Covenant 2.1 verbatim), .editorconfig, Makefile, enhanced README.md
  • Docs: docs/ARCHITECTURE.md, docs/WIRE_FORMAT.md, docs/MODEL_REGISTRY.md, docs/GLOSSARY.md
  • Tooling: Maven Wrapper pinned to 3.9.15; scripts/fetch_models.py (HF → ONNX/GGUF converter, build-host-only); scripts/verify_models.py (SHA-256 check)
  • CI: .github/workflows/{java-ci,scripts-ci,native-ci}.yml; CODEOWNERS; dependabot config (with custom watcher for de.kherud:llama upstream); PR + issue templates
  • Stubs: java/examples/quickstart/, go/, models/

Deviations from java-sdk.md

Recorded in docs/ARCHITECTURE.md § Deviation Register. Summary:

  • D-001: Generation default = Qwen/Qwen2.5-0.5B-Instruct (Apache 2.0 ungated) instead of gemma-3-270m-it (Gemma Terms, gated)
  • D-002: Platform matrix = Win-x64 + Linux UBI8/9/10 + Linux-arm64 (UBI8+); Win-arm64 explicitly out of scope (broken upstream)
  • D-003: llama.cpp Java binding maintenance criterion relaxed via fork-and-bump (this PR)
  • D-004: Spotless GAV is com.diffplug.spotless (corrected from spec); JaCoCo 0.8.14 floor for JDK 25
  • D-005: jlama evaluated and rejected (Critical CVEs in jinjava 2.7.2 transitive; missing Gemma 4 architecture support; pure-Java perf would not scale to E4B)

Test plan

This PR contains no production Java code yet — it's foundation only. Acceptance:

  • Bootstrap commit on main is clean
  • ./mvnw --version reports Maven 3.9.15 + JDK 25
  • python3 -m py_compile scripts/{fetch,verify}_models.py clean
  • All YAML files parse cleanly
  • mvn help:effective-pom -f native/kherud-fork/pom.xml -q exits 0
  • No randomcodebase typo remaining in tracked files
  • Native CI workflow runs successfully on first push (validates the dockcross + Windows build matrix; will surface any JNI-vs-b8146 incompatibility)
  • Tier 2 (Java parent + core) opens against this branch's merge

Subsequent tiers (not in this PR)

  • Tier 2: Java parent POM + inference-sdk-core module
  • Tier 3: inference-sdk-embed + bge-small model JAR
  • Tier 4: inference-sdk-generate + Qwen 0.5B model JAR + bundle fat-JAR
  • Tier 5: integration tests (51 numbered §11.2 cases) + quickstart example + final quality gates

🤖 Generated with Claude Code

Tier 0 (native/kherud-fork/):
- Forked de.kherud:llama:v4.2.0 (upstream SHA 330ccc1a6c)
- Bumped bundled llama.cpp from b4916 to b8146 - clears 5 reachable
  High GHSA advisories (8wwf, 7rxv, vgg9, 96jg, 3p4r) and adds Gemma 3
  / Gemma 3n architecture support
- Native CI workflow for Win-x64 + Linux-x64 (manylinux2014/glibc 2.17)
  + Linux-arm64 (dockcross/linux-arm64-lts, glibc 2.27); path-filtered
  to native/kherud-fork/**
- Will publish
  io.github.randomcodespace.inference:kherud-fork-llama:4.2.1-llama-b8146
  to GitHub Packages
- Smoke-test plan: load Qwen 2.5-0.5B GGUF on UBI8 container

Tier 1 (repo scaffolding):
- Top-level: LICENSE (Apache 2.0), NOTICE, SECURITY, CONTRIBUTING,
  CODE_OF_CONDUCT (Contributor Covenant 2.1), .editorconfig, Makefile
- Docs: ARCHITECTURE, WIRE_FORMAT, MODEL_REGISTRY, GLOSSARY
- Tooling: Maven Wrapper (3.9.15), scripts/fetch_models.py +
  verify_models.py, .github/workflows/{java-ci,scripts-ci,native-ci}.yml,
  CODEOWNERS, dependabot.yml, PULL_REQUEST_TEMPLATE, ISSUE_TEMPLATE
- Stubs: java/examples/quickstart/, go/, models/

Group ID: io.github.randomcodespace.inference (matches GitHub org
RandomCodeSpace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aksOps and others added 17 commits May 9, 2026 00:23
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reformat 6 source files + pom.xml in inference-sdk-embed to satisfy
the Spotless gate (CI was failing on `spotless:check`). No semantic
changes — pure whitespace/wrap reflow per google-java-format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Tier 3 model JAR module was added to the aggregator but its
pom.xml + sources were never committed (only an empty
src/main/resources/models/ tree exists), so Maven failed the reactor
with "Child module ... pom.xml does not exist" before any verify
phase could run. Comment the module out until its scaffolding lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both verify (ubuntu-latest) and verify (ubuntu-22.04-arm) jobs failed
with "no POM in this directory" because mvnw was invoked at repo root
but the only Maven build lives at java/pom.xml. Pass `-f java/pom.xml`
to every Maven invocation in verify, network-isolation, and javadoc
jobs, and update the JavaDoc upload path accordingly.

Standalone gate steps (jacoco/spotless/spotbugs) also failed prefix
resolution because spotless and spotbugs are configured in
pluginManagement only, and the aggregator POM does not extend the
parent. Drop the redundant standalone `jacoco:check` (already runs
via the parent's verify-bound execution), and invoke spotless and
spotbugs by full GAV coords with -pl scoped to parent-aware modules.

Validated locally: `./mvnw -f java/pom.xml -B -ntp -e verify` →
BUILD SUCCESS, 60 tests run, 0 failures, 0 errors, 0 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…path 1)

User acknowledged path 1 on 2026-05-09 per ~/.claude/rules/security.md
mitigation clause: ship Phase 1 with documented residual-risk + written
sign-off. Fork-and-bump strategy moves to Phase 1.5 (first task).

Removed (58 files):
- native/kherud-fork/ (vendored kherud v4.2.0 + CMake bump to b8146)
- .github/workflows/native-ci.yml (3-platform cross-compile workflow)

Updated 11 files:
- java/inference-sdk-parent/pom.xml: dep coord -> de.kherud:llama:4.2.0;
  removed GitHub Packages repository entry
- java/inference-sdk-embed/pom.xml: header comment
- docs/ARCHITECTURE.md: rewrote D-003 (mitigation+sign-off justification);
  added Section 4.4 residual-risk register (5 reachable Highs + 1 Mod
  at b4916, mitigation per advisory: SHA-256 model allow-list neutralizes
  4 of 5; input length cap + UTF-8 validation narrow the 5th); Phase 1.5
  fork-and-bump as first task in Roadmap
- SECURITY.md: residual-risk section + 2026-05-09 sign-off line
- NOTICE: consumer-side kherud entry (de.kherud:llama:4.2.0 + embedded
  llama.cpp b4916 attribution)
- README.md: removed native/ from project tree; updated Phase 1 desc
- Makefile: removed native-build target
- CONTRIBUTING.md: removed native-binding section
- docs/GLOSSARY.md: kherud + Dockcross entries reflect Phase 1.5 context
- .github/dependabot.yml: removed fork-specific upstream watcher
- scripts/fetch_models.py: removed llama_cpp_pin(); hardcoded
  LLAMA_CPP_TAG=b4916 (matches kherud:4.2.0 bundled)

Validations: parent POM parses; fetch_models.py compiles; dependabot.yml
parses as YAML.

Phase 1.5 commitment: fork kherud and bump bundled llama.cpp to clear
all 5 reachable High GHSAs (8wwf, 7rxv, vgg9, 96jg, 3p4r) + add Gemma 4
architecture support; first task before any new feature work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 4.A (java/inference-sdk-generate):
- Generator interface + KherudGenerator impl wrapping de.kherud:llama:4.2.0
- Records: Message, GenerateRequest, GenerateResponse, GenerateChunk,
  GenerateStats with snake_case @JsonProperty annotations (Phase 2 ready)
- Streaming: BoundedSubscription honors Reactive Streams rules 3.9 + 3.17,
  exactly-one terminal chunk, cancellation at next-token boundary,
  idempotent cancel, lazy generation start
- InputValidator enforces SECURITY.md path-1 mitigations: length cap
  (contextSize x 8 chars/token) + strict UTF-8 round-trip - narrows
  GHSA-7rxv tokenizer prompt overflow
- ModelResolver: explicit modelPath -> INFERENCE_MODEL_DIR -> classpath
- Reserved fields throw FeatureNotSupportedException on non-null
  (Message.toolCalls/toolCallId/name; GenerateRequest.tools/
  toolChoice/responseFormat) - cases #49, #50
- Auto-module name 'llama' for de.kherud:llama:4.2.0 (verified via
  jar manifest inspection: no Automatic-Module-Name; JPMS derives
  from filename minus version)
- LlamaClient seam interface (de.kherud.llama.LlamaModel is final);
  production wires KherudLlamaClient adapter, tests wire FakeLlamaClient
- 105 tests in generate; reactor total 165 tests, all passing
- JaCoCo line 79% / branch 71% (above 75/70 gates); KherudLlamaClient
  excluded with justification (real-model paths exercised in Tier 5 IT)
- Spotless google-java-format clean; SpotBugs HIGH clean
- NativeExecutor pinning workaround documented at every JNI call site
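
The InputValidator mitigation described above (a length cap of contextSize x 8 characters plus a strict UTF-8 round trip) could look roughly like the following. This is a minimal sketch, not the SDK's actual class: the name `InputValidatorSketch`, the use of `String.length()` as the character count, and `IllegalArgumentException` as the failure type are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a prompt validator; not the SDK's actual InputValidator. */
final class InputValidatorSketch {
    /** Ratio taken from the commit text: 8 characters per context token. */
    static final int CHARS_PER_TOKEN = 8;

    /** Upper bound on prompt length (UTF-16 code units, an assumption) for a context size. */
    static long maxChars(int contextSize) {
        return (long) contextSize * CHARS_PER_TOKEN;
    }

    /** Throws IllegalArgumentException if the prompt is too long or not well-formed. */
    static void validate(String prompt, int contextSize) {
        if (prompt.length() > maxChars(contextSize)) {
            throw new IllegalArgumentException(
                    "prompt exceeds " + maxChars(contextSize) + " chars");
        }
        try {
            // Strict encode: unpaired surrogates are reported instead of replaced with '?'.
            ByteBuffer utf8 = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(prompt));
            // Strict decode, then compare, to complete the round trip.
            String back = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(utf8)
                    .toString();
            if (!back.equals(prompt)) {
                throw new IllegalArgumentException("prompt does not survive a UTF-8 round trip");
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("prompt is not well-formed Unicode", e);
        }
    }
}
```

With a context size of 2048 the cap works out to 2048 x 8 = 16384 characters. A string containing a lone surrogate is rejected before it can reach the native tokenizer.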

Tier 4.B (java/inference-sdk-generate-qwen-0_5b):
- Maven JAR with no Java code; resources placeholder for the
  Qwen 2.5-0.5B-Instruct Q4_K_M GGUF (populated by
  scripts/fetch_models.py in Tier 0.5 before Tier 5 IT)
- model-manifest.properties placeholder (id, hf_repo, revision,
  quantization, max_tokens, sha256, license)

Aggregator java/pom.xml updated with both new modules.

§11.2 case coverage: #13 (maxTokens=0), #14 (empty messages), #15
(system-only), #49 + #50 (reserved fields) implemented as unit tests;
#12 + #16-34 deferred to Tier 5 IT (need real GGUF load). Documented
in test class JavaDoc.
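
The reserved-field behaviour covered by cases #49 and #50 can be illustrated with a record whose compact constructor rejects any non-null reserved field. This is a sketch under assumptions: `MessageSketch`, the `role`/`content` components, the `of` factory, and the exception's constructor are hypothetical, and only the reserved field names and the exception name come from the commit text.

```java
import java.util.List;

/** Stand-in for the SDK exception named in the commit; its real shape is an assumption. */
final class FeatureNotSupportedException extends RuntimeException {
    FeatureNotSupportedException(String field) {
        super("'" + field + "' is reserved and not supported yet");
    }
}

/** Hypothetical message record: the three reserved components must stay null. */
record MessageSketch(
        String role, String content, List<Object> toolCalls, String toolCallId, String name) {

    MessageSketch {
        // The compact constructor runs on every construction path.
        rejectIfSet("toolCalls", toolCalls);
        rejectIfSet("toolCallId", toolCallId);
        rejectIfSet("name", name);
    }

    /** Convenience factory for the fields that are supported today. */
    static MessageSketch of(String role, String content) {
        return new MessageSketch(role, content, null, null, null);
    }

    private static void rejectIfSet(String field, Object value) {
        if (value != null) {
            throw new FeatureNotSupportedException(field);
        }
    }
}
```

Keeping the reserved components in the record signature means the type can stay source-compatible when the fields are eventually supported. Callers who set them early get a typed failure rather than silently dropped data.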

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maven-shade-plugin produces a 75.6 MB fat JAR with:
- All 55 SDK classes (core + embed + generate)
- ONNX Runtime natives for 4 platforms (linux-x64, linux-aarch64,
  osx-aarch64, win-x64) including onnxruntime_providers_shared.dll
- de.kherud:llama:4.2.0 natives (linux/mac/win/android variants)
- DJL HuggingFace tokenizers natives (4 platforms)
- JNA dispatch natives (11+ platforms)
- Total: 30 .so / 12 .dll / 7 .dylib; 2911 JAR entries

Module-info strategy: unnamed/automatic-module fat JAR. shade 3.6.2
JPMS support is experimental and merging multiple module-infos breaks
reproducibility. JPMS-strict consumers should use per-module artifacts
(which carry well-formed module-infos). Tradeoff documented in pom.xml
header + README.md.

Verification:
- BUILD SUCCESS; bundle reactor builds parent + core + embed + generate
  + qwen + bundle
- Signatures stripped: 0 .SF/.RSA/.DSA files
- module-info.class stripped from all shaded deps (incl. MR-JAR
  versions/*/module-info.class)
- Manifest: Reproducible-Build=true; Implementation-Title/Vendor/Version
- dependency-reduced-pom.xml generated (gitignored)
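
The two "stripped" checks above (no signature files, no module-info.class from shaded dependencies) can be reproduced against any JAR with a few lines of `java.util.zip`. The sketch below is illustrative only; `ShadedJarCheckSketch` and its method names are hypothetical and are not part of the build.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that lists entries a shaded fat JAR should not contain. */
final class ShadedJarCheckSketch {

    /** Returns the sorted names of signature files and module-info.class entries. */
    static List<String> leftovers(Path jar) throws IOException {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            return zip.stream()
                    .map(ZipEntry::getName)
                    .filter(ShadedJarCheckSketch::isLeftover)
                    .sorted()
                    .toList();
        }
    }

    /** True for META-INF/*.SF|.RSA|.DSA and for module-info.class at any depth. */
    static boolean isLeftover(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        boolean signature = upper.startsWith("META-INF/")
                && (upper.endsWith(".SF") || upper.endsWith(".RSA") || upper.endsWith(".DSA"));
        // Covers multi-release copies such as META-INF/versions/9/module-info.class.
        boolean moduleInfo =
                name.equals("module-info.class") || name.endsWith("/module-info.class");
        return signature || moduleInfo;
    }
}
```

An empty list from `leftovers` corresponds to the "0 .SF/.RSA/.DSA files" and "module-info.class stripped" results reported in the commit.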

bge-small dep deferred to next commit (its module pom hasn't been
written yet; Tier 4.B added the Qwen shell, bge-small shell follows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors Tier 4.B (Qwen JAR shell) for the embedding model:
- Maven JAR with no Java code; only resources/models/ for the LFS-tracked
  bge-small-en-v1.5.int8.onnx (populated by scripts/fetch_models.py in
  Tier 0.5)
- model-manifest.properties placeholder (id, hf_repo, dimensions=384,
  max_tokens=512, quantization=int8-dynamic, sha256, license=MIT)
- Skip JavaDoc/JaCoCo/SpotBugs (no Java sources)

Aggregator java/pom.xml: uncomment <module>inference-sdk-embed-bge-small</module>
inference-sdk-bundle/pom.xml: add bge-small as a dep so the fat JAR
shades the embedding model resources alongside Qwen GGUF resources.

POM resolution verified: ./mvnw help:effective-pom -pl :inference-sdk-bundle -am exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RY.md

inference-sdk-integration-tests (new module, 14 files, 29 tests):
- NetworkIsolationIT (case #47, 7 tests): InetAddressResolverProvider SPI
  installs a BlockingDnsResolverProvider via META-INF/services that
  refuses arbitrary host lookups; Embedder + Generator init succeed
  -> proves no runtime DNS / phone-home in any dep
- ForwardCompatIT (cases #49 + #50 + #51, 14 tests): Message +
  GenerateRequest reserved fields throw FeatureNotSupportedException;
  Jackson serialization of every record produces snake_case keys per
  WIRE_FORMAT
- FailureModeIT (cases #43, #44, #45, 5 tests): typed exceptions for
  nonexistent model, corrupted file, wrong format
- GenerateEdgeCaseIT (cases #13, #14, #15, 3 tests): validation cases
- @Tag("model") classes (skeletons, deferred until make fetch-models):
  EmbedEdgeCaseIT (#1-11), full GenerateEdgeCaseIT (#12, #16-25),
  StreamingEdgeCaseIT (#26-34), ConcurrencyLifecycleIT (#35-39),
  ResourceExhaustionIT (#40-42), case #46 small-heap
- @Tag("model-switch") ModelSwitchIT (case #48 deferred per spec)
- @Tag("slow") PropertyTestsIT (jqwik per spec §11.3)
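
The NetworkIsolationIT approach relies on the `java.net.spi.InetAddressResolverProvider` SPI (available since Java 18). A resolver along these lines would refuse every lookup except loopback. This is a sketch rather than the module's actual BlockingDnsResolverProvider: the class name, the localhost exception, and the error messages are assumptions. To be picked up by the JDK, a real provider has to be a public class listed in `META-INF/services/java.net.spi.InetAddressResolverProvider`; it is package-private here only so the snippet fits in one file.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.net.spi.InetAddressResolver;
import java.net.spi.InetAddressResolverProvider;
import java.util.stream.Stream;

/** Hypothetical DNS-blocking provider; a real one must be public for ServiceLoader. */
final class BlockingDnsResolverProviderSketch extends InetAddressResolverProvider {

    @Override
    public InetAddressResolver get(Configuration configuration) {
        return new InetAddressResolver() {
            @Override
            public Stream<InetAddress> lookupByName(
                    String host, InetAddressResolver.LookupPolicy lookupPolicy)
                    throws UnknownHostException {
                // Assumption: loopback stays resolvable so local-only code keeps working.
                if ("localhost".equalsIgnoreCase(host)) {
                    return Stream.of(InetAddress.getLoopbackAddress());
                }
                throw new UnknownHostException("DNS lookup blocked in test: " + host);
            }

            @Override
            public String lookupByAddress(byte[] addr) throws UnknownHostException {
                throw new UnknownHostException("reverse DNS lookup blocked in test");
            }
        };
    }

    @Override
    public String name() {
        return "blocking-dns-sketch";
    }
}
```

With such a provider installed, any dependency that tries to resolve a remote host during Embedder or Generator initialisation fails with `UnknownHostException`, which is what lets a successful init count as evidence of no phone-home.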

51 of 51 §11.2 cases now have implementations (or tagged skeletons
documenting the model-fetch prerequisite).

java/examples/quickstart/Main.java (87 lines):
- Embed + sync generate + streaming demo with try-with-resources
- Wraps in RequestId.withRequestId for ScopedValue propagation pattern
- Compiles cleanly after ./mvnw install

LIBRARY.md (~140 lines): public API doc, virtual-threads / structured
concurrency / Flow.Publisher streaming examples, native-pinning pattern
diagram, build-time model switching workflow.
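
For readers unfamiliar with the streaming contract behind the Flow.Publisher examples, the following sketch shows one way a demand-bounded `Flow.Subscription` can honor Reactive Streams rules 3.9 and 3.17, emit exactly one terminal signal, and treat cancel as idempotent. It is not the SDK's BoundedSubscription: the class name is hypothetical, and a plain `Iterator<String>` stands in for the native token source.

```java
import java.util.Iterator;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical demand-bounded subscription over a token iterator. */
final class BoundedSubscriptionSketch implements Flow.Subscription {
    private final Iterator<String> tokens;
    private final Flow.Subscriber<? super String> subscriber;
    private final AtomicLong demand = new AtomicLong();
    private final AtomicBoolean done = new AtomicBoolean();
    private final AtomicInteger wip = new AtomicInteger();

    BoundedSubscriptionSketch(Iterator<String> tokens, Flow.Subscriber<? super String> subscriber) {
        this.tokens = tokens;
        this.subscriber = subscriber;
    }

    @Override
    public void request(long n) {
        if (n <= 0) {
            // Rule 3.9: non-positive demand is signalled via onError(IllegalArgumentException).
            if (done.compareAndSet(false, true)) {
                subscriber.onError(new IllegalArgumentException("request(n) requires n > 0"));
            }
            return;
        }
        // Rule 3.17: demand saturates at Long.MAX_VALUE instead of overflowing.
        demand.accumulateAndGet(n, (cur, add) -> cur + add < 0 ? Long.MAX_VALUE : cur + add);
        drain(); // nothing is pulled from the iterator before the first request (lazy start)
    }

    @Override
    public void cancel() {
        done.set(true); // idempotent: repeated calls change nothing
    }

    private void drain() {
        if (wip.getAndIncrement() != 0) {
            return; // a drain loop is already running and will observe the new demand
        }
        int missed = 1;
        do {
            while (!done.get()) { // cancellation is observed at each token boundary
                if (!tokens.hasNext()) {
                    if (done.compareAndSet(false, true)) {
                        subscriber.onComplete(); // the single terminal signal
                    }
                    break;
                }
                if (demand.get() == 0) {
                    break;
                }
                demand.decrementAndGet();
                subscriber.onNext(tokens.next());
            }
            missed = wip.addAndGet(-missed);
        } while (missed != 0);
    }
}
```

The work-in-progress counter keeps a re-entrant `request` call made from inside `onNext` from starting a second emission loop. This is a common pattern in Reactive Streams implementations, but whether the SDK uses it is not stated in the source.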

java/pom.xml: added inference-sdk-integration-tests to <modules>.

Final test suite: 194 / 194 passing (165 baseline + 29 new IT) on
default verify. @Tag("model") + @Tag("slow") run with -P slow after
make fetch-models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches from "commit ONNX/GGUF to Git LFS" to "fetch at build time and
embed in published Maven artifact". Eliminates LFS storage/bandwidth
costs while preserving the air-gapped offline-first guarantee for
downstream consumers (they receive model bytes via the Maven artifact
they depend on, not via a Git clone).

Mechanics:
- inference-sdk-embed-bge-small/pom.xml +
  inference-sdk-generate-qwen-0_5b/pom.xml: exec-maven-plugin bound to
  generate-resources phase invokes scripts/fetch_models.py with the
  model id; second exec at process-resources runs
  scripts/verify_models.py for SHA-256 check against
  scripts/checksums/models.sha256. Skip via -Dfetch.models.skip=true
  for IDE imports / quick POM-only builds.
- .gitattributes: removed *.onnx / *.gguf / *.safetensors / *.bin /
  *.pt / *.pth LFS tracking (no Git LFS dependency)
- .gitignore: added
  **/src/main/resources/models/*.{onnx,gguf,safetensors,bin,pt,pth}
  so locally-fetched bytes don't accidentally commit;
  model-manifest.properties + .gitkeep stay tracked
- .github/workflows/java-ci.yml: added actions/cache for
  ~/.cache/huggingface + the staged module/src/main/resources/models/
  + build/llama.cpp keyed on
  hashFiles('scripts/checksums/models.sha256','scripts/fetch_models.py');
  pinned hashes mean cache stays valid until we deliberately bump
- docs/ARCHITECTURE.md: rewrote model-distribution section to describe
  the hybrid approach; replaced LFS language with build-time fetch +
  Maven artifact embedding

Air-gapped consumers: receive bundled model bytes via the published
Maven artifact (no network at consume time). Air-gapped contributors:
documented path in CONTRIBUTING.md (forthcoming follow-up commit:
"request a pre-built target/ cache from a maintainer or run make
fetch-models on a connected machine and copy ~/.cache/huggingface/
into the air-gapped environment").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CONTRIBUTING.md:
- Removed "Git LFS" from required toolchain
- Added gcc + cmake (required by llama.cpp's convert_hf_to_gguf.py +
  llama-quantize during the generation-model fetch path)
- Added "Air-gapped contributors" section with two offline workflows:
  (a) request a pre-built target/ cache from a maintainer; (b) run
  make fetch-models on a connected machine and copy
  ~/.cache/huggingface/ into the air-gapped environment

java/inference-sdk-generate-qwen-0_5b/pom.xml: refinement of the
fetch binding (post-Spotless reformat from the hybrid retrofit agent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bytes

Root cause: the .gitattributes rule 'text=auto eol=lf' treated the
test fixture .bin files as text and applied line-ending normalization,
corrupting their bytes. The original fixtures are 32 bytes; the
127-byte copies were the corrupted form. Normalization rewrote the
random bytes on commit, so a clean checkout produced content that no
longer matched the SHA-256 in the sibling .sha256 file.

NativeLibLoaderTest verifies SHA-256 of an extracted resource against
its sibling .sha256 file; with normalized fixtures the SHA never matched
-> tests fail in CI on a clean checkout (locally devs had the original
bytes still in their working copy, masking the issue).
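
The kind of check NativeLibLoaderTest performs can be sketched in a few lines, which also shows why a single rewritten line ending is enough to break it. This is an illustration under assumptions: `Sha256CheckSketch` is a hypothetical name, and the `.sha256` sibling is assumed to hold either a bare hex digest or `sha256sum`-style output, since the actual file layout is not shown in the source.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Locale;

/** Hypothetical SHA-256 fixture check; not the SDK's NativeLibLoader code. */
final class Sha256CheckSketch {

    /** Lower-case hex SHA-256 of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Accepts "<hex>" or "<hex>  filename" (assumed formats) and compares digests. */
    static boolean matches(byte[] data, String sha256FileContent) {
        String expected = sha256FileContent.strip().split("\\s+")[0].toLowerCase(Locale.ROOT);
        return expected.equals(sha256Hex(data));
    }

    public static void main(String[] args) {
        byte[] original = {0x0D, 0x0A, 0x7F}; // random-looking bytes that contain CR LF
        byte[] normalized = {0x0A, 0x7F};     // the same content after eol=lf normalization
        String recorded = sha256Hex(original);
        System.out.println(matches(original, recorded));   // true
        System.out.println(matches(normalized, recorded)); // false
    }
}
```

Because the digest is computed over raw bytes, any text filter applied by Git to a binary fixture invalidates the recorded hash. Marking the files `binary` in .gitattributes avoids that.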

Fix: explicitly mark the two test fixtures binary in .gitattributes:
  java/inference-sdk-core/src/test/resources/native-fixtures/*.bin binary
  java/inference-sdk-core/src/test/resources/native-fixtures/*.sha256 text=auto

Restored the .bin files to their original 32-byte content (one matches
its .sha256, the other intentionally doesn't for the negative-path test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the verify regression: hybrid model-distribution exec
binding fires fetch_models.py during generate-resources, but the GHA
runners don't have huggingface_hub / onnxruntime / optimum /
safetensors / gcc-cmake-make pre-installed AND we don't want PR
feedback to wait on multi-minute HF downloads + GGUF conversion.
Default PR verify just needs to exercise the SDK's own logic; real
model bytes are not on the PR-feedback critical path.

Fix:
- .github/workflows/java-ci.yml: add -Dfetch.models.skip=true to BOTH
  verify and network-isolation mvnw invocations. PR verify now runs
  in seconds (local timing: 15s for the full reactor including the
  29 IT tests). @Tag("model") tests stay deferred per Tier 5 design.
- .github/workflows/package-artifacts.yml: NEW manual + scheduled
  (weekly) workflow. Sets up Python 3.11, installs
  scripts/requirements.txt, ensures gcc/cmake/make, runs the FULL mvnw
  verify (no skip flag) to actually fetch + convert + embed models,
  then uploads the inference-sdk-bundle fat JAR + per-module JARs as
  artifacts. Smoke-tests the bundled GGUF on a sample prompt.

Local validation: ./mvnw -f java/pom.xml -Dfetch.models.skip=true
-B -ntp verify -> BUILD SUCCESS, 9 modules green, 194 tests passing
in 15s.

This matches the design pattern many ML-library repos use: PRs get
fast feedback; artifact builds happen in a separate slower workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>