Servable framework refactor - phase 1 by mzegla · Pull Request #4219 · openvinotoolkit/model_server

mzegla · 2026-05-18T15:35:33Z

No description provided.


Copilot

Pull request overview

Phase 1 of a servable-framework refactor that moves streaming from “raw text chunks” to “pre-parsed delta documents”, introducing a shared parsed-delta channel and a new OVMSTextStreamer to unify streaming behavior across continuous-batching and legacy pipelines.

Changes:

- Introduces `OVMSTextStreamer` to detokenize and invoke `OutputParser::parseChunk()` with both text and the corresponding token slice, producing `rapidjson::Document` deltas.
- Replaces the legacy mutex/CV “last text chunk” flow with a shared `DeltaChannel` and updates servables/executors to signal completion via the channel.
- Refactors OpenAI streaming serialization APIs to accept already-parsed delta `Document`s and updates tests/parsers for the new `parseChunk()` signature.
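
To make the channel idea above concrete, here is a minimal sketch of how a shared parsed-delta channel could work. It is not the PR's `DeltaChannel`: the class name `DeltaChannelSketch`, the methods `push()` and `waitAndPop()`, and the helper `drain()` are hypothetical, and deltas are plain `std::string`s rather than `rapidjson::Document`s so the example compiles on its own. Only `signalComplete()` is a name taken from the PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's DeltaChannel. Deltas are plain strings
// here; the PR carries rapidjson::Document objects instead.
class DeltaChannelSketch {
public:
    // Producer side: enqueue one parsed delta and wake a waiting consumer.
    void push(std::string delta) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            // Assumption of this sketch: pushing after completion is a bug.
            assert(!complete_ && "push() called after signalComplete()");
            queue_.push_back(std::move(delta));
        }
        cv_.notify_one();
    }

    // Producer side: mark the end of generation and wake all consumers.
    void signalComplete() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            complete_ = true;
        }
        cv_.notify_all();
    }

    // Consumer side: block until a delta is available or the channel is
    // complete. Returns std::nullopt only when complete and fully drained.
    std::optional<std::string> waitAndPop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty() || complete_; });
        if (queue_.empty()) {
            return std::nullopt;
        }
        std::string delta = std::move(queue_.front());
        queue_.pop_front();
        return delta;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool complete_ = false;
};

// Consumer loop: collect every delta, in order, until completion.
std::vector<std::string> drain(DeltaChannelSketch& channel) {
    std::vector<std::string> received;
    while (auto delta = channel.waitAndPop()) {
        received.push_back(std::move(*delta));
    }
    return received;
}
```

In a servable, the generation thread would call `push()` and `signalComplete()` while the request handler runs a loop like `drain()` and serializes each delta as it arrives. Deltas queued before completion are still delivered, so the consumer does not lose a final chunk that races with the completion signal.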

Reviewed changes

Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.


File	Description
src/test/llm/output_parsers/qwen3coder_output_parser_test.cpp	Updates tests to call `OutputParser::parseChunk()` with the new `tokens` argument.
src/test/llm/output_parsers/qwen3_output_parser_test.cpp	Updates streaming parser tests for the new `parseChunk()` signature and error paths.
src/test/llm/output_parsers/phi4_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/mistral_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/llama3_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/lfm2_output_parser_test.cpp	Updates helper/assert flows to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/hermes3_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/gptoss_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/gemma4_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/llm/output_parsers/devstral_output_parser_test.cpp	Updates streaming tests to pass `tokens` to `parseChunk()`.
src/test/http_openai_handler_test.cpp	Adds migration helper to build parsed deltas and adapts tests to new `serializeStreamingChunk(Document, reason)` API.
src/llm/visual_language_model/legacy/servable.hpp	Updates legacy VLM execution context for the new delta-channel streaming approach and unary accumulation.
src/llm/visual_language_model/legacy/servable.cpp	Migrates VLM legacy streaming/unary flows to `OVMSTextStreamer` + `DeltaChannel`.
src/llm/visual_language_model/legacy/legacy_executor.cpp	Signals streaming completion via `deltaChannel.signalComplete()`.
src/llm/servable.hpp	Introduces `DeltaChannel` and updates execution context fields for parsed-delta streaming.
src/llm/servable.cpp	Switches continuous-batching streaming to drain parsed deltas from `DeltaChannel` and serialize them.
src/llm/ovms_text_streamer.hpp	Adds new `OVMSTextStreamer` (TextStreamer-derived) that produces parsed delta `Document`s.
src/llm/ovms_text_streamer.cpp	Implements flush heuristics, token-slice computation, and callback emission for parsed deltas.
src/llm/language_model/legacy/servable.hpp	Removes legacy mutex/CV fields and switches disconnect handling to `DeltaChannel`.
src/llm/language_model/legacy/servable.cpp	Migrates legacy LM streaming flow to `OVMSTextStreamer` + `DeltaChannel`.
src/llm/language_model/legacy/legacy_executor.cpp	Signals streaming completion via `deltaChannel.signalComplete()`.
src/llm/io_processing/qwen3coder/qwen3coder_tool_parser.hpp	Updates tool parser streaming API to accept `tokens`.
src/llm/io_processing/qwen3coder/qwen3coder_tool_parser.cpp	Adapts streaming tool parsing implementation to new signature.
src/llm/io_processing/qwen3/reasoning_parser.hpp	Updates reasoning parser streaming API to accept `tokens`.
src/llm/io_processing/qwen3/reasoning_parser.cpp	Adapts reasoning parser to new signature.
src/llm/io_processing/phi4/tool_parser.hpp	Updates Phi4 tool parser streaming API to accept `tokens`.
src/llm/io_processing/phi4/tool_parser.cpp	Adapts Phi4 tool parser streaming implementation and recursive calls.
src/llm/io_processing/output_parser.hpp	Adds `tokens` parameter threading through tool/reasoning chunk parsing and top-level `parseChunk()`.
src/llm/io_processing/output_parser.cpp	Passes `tokens` through phase transitions and parser calls.
src/llm/io_processing/mistral/tool_parser.hpp	Updates Mistral tool parser streaming API to accept `tokens`.
src/llm/io_processing/mistral/tool_parser.cpp	Adapts Mistral streaming parsing implementation and recursive calls.
src/llm/io_processing/llama3/tool_parser.hpp	Updates Llama3 tool parser streaming API to accept `tokens`.
src/llm/io_processing/llama3/tool_parser.cpp	Adapts Llama3 streaming parsing implementation to new signature.
src/llm/io_processing/lfm2/lfm2_tool_parser.hpp	Updates LFM2 tool parser streaming API to accept `tokens`.
src/llm/io_processing/lfm2/lfm2_tool_parser.cpp	Adapts LFM2 streaming parsing implementation to new signature.
src/llm/io_processing/hermes3/tool_parser.hpp	Updates Hermes3 tool parser streaming API to accept `tokens`.
src/llm/io_processing/hermes3/tool_parser.cpp	Adapts Hermes3 streaming parsing signature to accept `tokens`.
src/llm/io_processing/gptoss/tool_parser.hpp	Updates GPT-OSS tool parser streaming API to accept `tokens`.
src/llm/io_processing/gptoss/tool_parser.cpp	Adapts GPT-OSS tool parsing signature to accept `tokens`.
src/llm/io_processing/gptoss/reasoning_parser.hpp	Updates GPT-OSS reasoning parser streaming API to accept `tokens`.
src/llm/io_processing/gptoss/reasoning_parser.cpp	Adapts GPT-OSS reasoning parsing signature to accept `tokens`.
src/llm/io_processing/gemma4/gemma4_tool_parser.hpp	Updates Gemma4 tool parser streaming API to accept `tokens`.
src/llm/io_processing/gemma4/gemma4_tool_parser.cpp	Adapts Gemma4 tool parsing signature to accept `tokens`.
src/llm/io_processing/gemma4/gemma4_reasoning_parser.hpp	Updates Gemma4 reasoning parser streaming API to accept `tokens`.
src/llm/io_processing/gemma4/gemma4_reasoning_parser.cpp	Adapts Gemma4 reasoning parsing signature to accept `tokens`.
src/llm/io_processing/devstral/tool_parser.hpp	Updates Devstral tool parser streaming API to accept `tokens`.
src/llm/io_processing/devstral/tool_parser.cpp	Adapts Devstral tool parsing signature to accept `tokens`.
src/llm/io_processing/base_output_parser.hpp	Updates base streaming parser interface to include token IDs for each chunk.
src/llm/BUILD	Adds `ovms_text_streamer` sources/headers to the `genai_servables` target.
src/llm/apis/openai_responses.hpp	Updates streaming serialization signature to accept parsed deltas.
src/llm/apis/openai_responses.cpp	Refactors streaming serialization to consume parsed delta `Document`s (no parsing in handler).
src/llm/apis/openai_completions.hpp	Updates streaming serialization signature to accept parsed deltas.
src/llm/apis/openai_completions.cpp	Refactors chat/completions streaming serialization to consume parsed delta `Document`s.
src/llm/apis/openai_api_handler.hpp	Updates abstract streaming serialization API to accept parsed delta `Document`s.
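
Most rows above describe the same change: `parseChunk()` now receives the token IDs behind each text chunk. The sketch below shows one way that threading could look, under stated assumptions. `FinishReason` is a local stand-in for `ov::genai::GenerationFinishReason`, and `DeltaSketch`, `StreamingParserSketch`, `TokenSliceTracker` and `PassThroughParser` are hypothetical names that do not appear in the PR; the real parsers return `rapidjson::Document` deltas.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Local stand-in for ov::genai::GenerationFinishReason so the sketch compiles alone.
enum class FinishReason { NONE, STOP, LENGTH };

// Hypothetical delta type; the PR uses rapidjson::Document.
struct DeltaSketch {
    std::string text;
    std::size_t tokenCount;
};

// Hypothetical base interface mirroring the three-argument parseChunk() shape.
class StreamingParserSketch {
public:
    virtual ~StreamingParserSketch() = default;
    virtual std::optional<DeltaSketch> parseChunk(const std::string& chunk,
                                                  const std::vector<std::int64_t>& tokens,
                                                  FinishReason finishReason) = 0;
};

// Remembers how many tokens were already handed out, so each flush yields
// only the tokens generated since the previous flush.
class TokenSliceTracker {
public:
    void append(std::int64_t token) { tokens_.push_back(token); }

    // Returns the tokens accumulated since the last call and advances the mark.
    std::vector<std::int64_t> takeSlice() {
        assert(flushed_ <= tokens_.size());
        std::vector<std::int64_t> slice(
            tokens_.begin() + static_cast<std::ptrdiff_t>(flushed_), tokens_.end());
        flushed_ = tokens_.size();
        return slice;
    }

private:
    std::vector<std::int64_t> tokens_;
    std::size_t flushed_ = 0;
};

// Toy parser: forwards the text and records how many tokens backed it.
class PassThroughParser : public StreamingParserSketch {
public:
    std::optional<DeltaSketch> parseChunk(const std::string& chunk,
                                          const std::vector<std::int64_t>& tokens,
                                          FinishReason /*finishReason*/) override {
        if (chunk.empty()) {
            return std::nullopt;
        }
        return DeltaSketch{chunk, tokens.size()};
    }
};
```

A streamer built this way would call `append()` for every generated token and pass `takeSlice()` alongside the detokenized text whenever it flushes. Parsers that do not need token IDs can ignore the argument, which matches the `/*tokens*/` parameter in the Hermes3 excerpt further down.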

Comments suppressed due to low confidence (1)

src/llm/io_processing/hermes3/tool_parser.cpp:209

parseChunk() returns nullopt immediately on empty chunk, even when finishReason != NONE. With the new OVMSTextStreamer, a final STOP flush can legitimately call parseChunk("", STOP) to force end-of-tool-call cleanup (delay window / argument string closing). Consider handling empty chunk specially when finishReason != NONE (finalize/close pending arguments and emit any delayed delta as needed) instead of returning early.

std::optional<rapidjson::Document> Hermes3ToolParser::parseChunk(const std::string& chunk, const std::vector<int64_t>& /*tokens*/, ov::genai::GenerationFinishReason finishReason) {
    /* 
    We first collect data until we have full function name - that's when we return the first delta.
    Every next delta contains next parts of the arguments. Hermes3 generates arguments as JSON, but OpenAI API expects them in a string format.
    That's why once we reach 'arguments' key, we add double quote to force string type and escape all double quotes that come in next parts.
    To know when we reach the end of the arguments string, we return delta with a one-chunk delay. This way, when we reach end of tool call, we modify previous chunk to close
    arguments string properly and return such modified chunk.
    */

    /*
    PHASE 0: Prepare data and state for processing
    - If previous call finished tool call (received </tool_call> tag), we clear state.
    - If current call finishes tool call (finishReason != NONE), we set flag to clear state in the next call.
    - If chunk is empty, we return std::nullopt.
    - We prepend unprocessedBuffer to the chunk and clear unprocessedBuffer.
    */

    // Check if previous call finished tool call (received </tool_call> tag)
    if (toolCallCompleted) {
        clearState();
        toolCallCompleted = false;
    }

    toolCallCompleted = (finishReason != ov::genai::GenerationFinishReason::NONE);

    if (chunk.empty()) {
        SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Received empty chunk for Hermes3ToolParser");
        return std::nullopt;
    }
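
The reviewer's suggestion can be illustrated with a toy parser that uses the same one-chunk delay. This is a sketch of the idea only, not the Hermes3 implementation: `DelayedArgumentsParser`, its two-argument `parseChunk()`, and the local `FinishReason` enum are hypothetical, and the real parser tracks far more state. The point is the order of checks: an empty chunk is ignored only while `finishReason` is `NONE`, so a final empty flush can still release the held fragment and close the arguments string.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Local stand-in for ov::genai::GenerationFinishReason.
enum class FinishReason { NONE, STOP };

// Toy parser: holds each fragment back by one call so the last one can be
// closed with a double quote when generation finishes.
class DelayedArgumentsParser {
public:
    std::optional<std::string> parseChunk(const std::string& chunk, FinishReason finishReason) {
        const bool finishing = finishReason != FinishReason::NONE;
        if (!finishing) {
            if (chunk.empty()) {
                return std::nullopt;  // nothing new and nothing to finalize
            }
            // Emit the previously held fragment (if any) and hold the new one.
            return std::exchange(pending_, chunk);
        }
        // Final call: the chunk may be empty, but held data must still go out.
        if (!pending_ && chunk.empty()) {
            return std::nullopt;
        }
        std::string closed = pending_.value_or("") + chunk + "\"";
        pending_.reset();
        return closed;
    }

private:
    std::optional<std::string> pending_;
};

// Usage: the last fragment is released by an empty STOP flush.
void demoFinalFlush() {
    DelayedArgumentsParser parser;
    assert(!parser.parseChunk("abc", FinishReason::NONE).has_value());
    assert(*parser.parseChunk("def", FinishReason::NONE) == "abc");
    assert(*parser.parseChunk("", FinishReason::STOP) == "def\"");
}
```

With the early return placed before the finish check, as in the excerpt above, the third call in `demoFinalFlush()` would return `std::nullopt` and the fragment `def` would never be emitted.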

mzegla added the 2026.3 label May 18, 2026

mzegla added 2 commits May 19, 2026 13:51

init

de8399a

fix style windows compiler issue fixes

post rebase fixes

6b07f7f

mzegla force-pushed the servables_refactor_phase1 branch from 4a40996 to 6b07f7f May 19, 2026 12:01

mzegla requested a review from Copilot May 19, 2026 12:04

Copilot started reviewing on behalf of mzegla May 19, 2026 12:05

Copilot AI reviewed May 19, 2026


Comment thread src/llm/servable.hpp

Comment thread src/llm/language_model/legacy/servable.cpp

Comment thread src/llm/visual_language_model/legacy/servable.cpp

Comment thread src/llm/visual_language_model/legacy/servable.cpp

mzegla added 2 commits May 20, 2026 13:29

fixes

6d68816

fix empty content emission

db83db3

mzegla wants to merge 4 commits into main from servables_refactor_phase1

2 participants
