world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera)#268

Open
salmanmkc wants to merge 44 commits into
google:mainfrom
salmanmkc:feat/world-stream-scene

Conversation

@salmanmkc
Contributor

@salmanmkc salmanmkc commented May 10, 2026

been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.

needed two sdk bits to make it work so they're in here too:

world.streamScene(prompt, opts) - opens a gemini live session w/ the camera streaming at N fps, tool calls come back via callbacks
world.lookingAt() - whatever the reticle's on, or null

demo's in demos/world_companion. there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight - arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired. it will do this on the desktop simulator though. small spatial panel for start/stop/clear so it's actually usable once you're in immersive.

you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.

detector labels and what you say don't always match — television vs tv, pendant light vs floor lamp, picture vs painting. there's a small synonym table for the obvious cases, then a fallback to gemini's embedContent api with cosine similarity to match by meaning and dedupe markers across rephrasings. the embed cache is per-page so it's basically free after the first call.

tests in World.test.ts cover the tool wiring + start/stop paths.

Try launching the demo and saying, for example, "place an arrow on my water bottle".

I have some thoughts on updating the state of objects it has already seen, which I'll add later; for now this seems ok.

afaik Gemini will be able to talk and see screens in Android XR; this adds interaction with the real world on top of Gemini Live.

I will see if I can get a demo recorded for this. This is open to lots of feedback though, since this is just a very rough version.

Edit, here's a demo! https://youtu.be/-5s_aV6eV_A

I may have to just add key input in again but will double check later when I'm home

salmanmkc added 11 commits May 9, 2026 13:33
world.streamScene(prompt, opts) opens a Gemini Live session and runs a
periodic camera-frame loop into it, with auto-dispatch of agentic tools and
auto-playback of model audio via CoreSound. Returns a {stop, isActive}
handle. Throws cleanly when AI / Live capability / device camera are
missing instead of failing deep in the SDK.
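
The periodic loop and the {stop, isActive} handle described above can be sketched in isolation. This is not the SDK's implementation: `startFrameLoop`, `sendFrame` and `FrameLoopHandle` are hypothetical names, and the commit message does not say whether `isActive` is a method or a property, so the sketch assumes a method. The real streamScene also manages the Live session, tools and audio, which are left out here.

```typescript
// Hypothetical sketch: a periodic frame loop that returns a {stop, isActive}
// handle. `sendFrame` stands in for "capture a camera frame and send it".
interface FrameLoopHandle {
  stop(): void;
  isActive(): boolean;
}

function startFrameLoop(sendFrame: () => void, fps: number): FrameLoopHandle {
  if (!(fps > 0)) {
    throw new Error('fps must be a positive number');
  }
  let active = true;
  const timer = setInterval(() => {
    if (active) sendFrame();
  }, 1000 / fps);
  return {
    stop() {
      if (!active) return; // stopping twice is harmless
      active = false;
      clearInterval(timer);
    },
    isActive: () => active,
  };
}
```

Failing fast on a bad argument mirrors the "throws cleanly" behaviour the commit describes, and guarding `stop()` keeps a second call from doing anything.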

world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos
can stay on the world.* namespace.

World now takes registry as a Script dependency so the new primitives can
resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it
through every method.

11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop,
text+audio routing, onAudio override, tool dispatch, unknown tool, and
onToolCall intercept.
A small single-file demo that wires xb.core.world.streamScene to a Live
session with two demo-local tools: placeLabel drops a marker in front of
the camera, and lookCloser reports what the user's reticle is aimed at via
xb.core.world.lookingAt.

Mirrors the world_ask UI pattern (floating bottom panel, transcript,
start/stop) so users have a complete reference for the new primitive
without leaving the demos directory.
Switch placeLabel from live reticle sampling to world.objects.runDetection
so labels anchor to actual detected objects in world space, not wherever
the user was looking when the tool fired. Also render a Troika text label
above the marker, not just a bare sphere.

Add a SpatialPanel with start/stop/clear controls so the demo is usable
in immersive mode, not just from the flat web overlay.
placeLabel now takes a style param so the model can pick how to highlight
something: dot for casual noting, arrow for 'point this out for me',
pulse for small or hard-to-spot things. Arrow gently bobs, pulse
expands and fades on a 1.5s loop.
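
As a rough illustration of the two animations, the per-frame values could be computed from elapsed time like this. Only the 1.5s pulse period comes from the commit message; the function names, the 1x to 2x scale range, and the bob amplitude and period are assumptions made for the sketch.

```typescript
// Pulse period quoted in the commit message.
const PULSE_PERIOD_S = 1.5;

// Hypothetical helper: the pulse expands and fades over a repeating cycle.
function pulseState(elapsedS: number): {scale: number; opacity: number} {
  // phase runs 0 -> 1 within each cycle, then wraps back to 0.
  const phase = (elapsedS % PULSE_PERIOD_S) / PULSE_PERIOD_S;
  return {
    scale: 1 + phase, // assumed range: grows from 1x to 2x
    opacity: 1 - phase, // fades from opaque to transparent
  };
}

// Hypothetical helper: vertical offset in metres for a gently bobbing arrow.
// Amplitude and period defaults are assumptions, not values from the demo.
function arrowBobOffset(elapsedS: number, amplitudeM = 0.02, periodS = 2): number {
  return amplitudeM * Math.sin((2 * Math.PI * elapsedS) / periodS);
}
```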
Default enableDepth() leaves updateFullResolutionGeometry off, so the
depth mesh snapshot used by object detection is too sparse to raycast
against. Markers were landing near the camera instead of on the actual
detected object. Copy the depth flags the gemini_xrobject demo uses.
@salmanmkc salmanmkc marked this pull request as draft May 10, 2026 11:39
@salmanmkc
Contributor Author

Turns out I hit the rate limit of 20 object detections per day when I checked the logs; for some reason I thought it was broken

@salmanmkc salmanmkc marked this pull request as ready for review May 11, 2026 06:44
salmanmkc added 6 commits May 11, 2026 07:47
ObjectDetector now switches targetDevice to 'quest' when the Oculus
browser is detected, instead of always falling back to galaxyxr params.
Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics
(fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream)
and an offset for the RGB camera relative to the right XR eye. These are
estimates - WebXR doesn't expose the real values - and may need
per-device tweaks.

Also swaps the detection debug image dump from auto-downloading PNGs
(unusable on Quest browser) to a console-log preview that shows the
image inline, and adds a few extra logs in world_companion to help see
what placeLabel is actually receiving from the detector.
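
The quoted numbers are consistent with a simple pinhole model: with fx = 800 at a width of 1280 px, the horizontal field of view is 2·atan(640/800), about 77.3°. The sketch below shows that arithmetic plus a pixel-to-ray unprojection. It assumes the principal point sits at the image centre and uses a y-up, -z-forward camera space; neither detail is stated in the commit, and the function names are illustrative rather than taken from QuestCameraParams.ts.

```typescript
// Approximate values from the commit message.
const IMAGE_WIDTH = 1280;
const IMAGE_HEIGHT = 720;
const FOCAL_PX = 800; // fx = fy ~ 800

// Horizontal field of view of a pinhole camera: 2 * atan((width / 2) / fx).
function horizontalFovDeg(widthPx: number, fxPx: number): number {
  return (2 * Math.atan(widthPx / 2 / fxPx) * 180) / Math.PI;
}

// Unproject a pixel to a unit ray in camera space (x right, y up, -z forward).
// Assumes the principal point is the image centre.
function pixelToRay(u: number, v: number): [number, number, number] {
  const x = (u - IMAGE_WIDTH / 2) / FOCAL_PX;
  const y = -(v - IMAGE_HEIGHT / 2) / FOCAL_PX; // image v grows downward
  const len = Math.hypot(x, y, 1);
  return [x / len, y / len, -1 / len];
}
```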
Quest 3 passthrough cameras are physically angled downward; labels were
landing too high above table-surface objects. Apply a -0.26 rad pitch in
the right-camera pose so unprojected detections line up with what the
user actually sees.
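
For a sense of scale, -0.26 rad is roughly -14.9°. A minimal sketch of what such a pitch does to a camera-space ray follows, assuming x right, y up, -z forward and a rotation about the +X axis (the convention three.js uses for `rotation.x`). How the SDK actually composes this into the right-camera pose is not shown in the thread, and `pitchRay` is a hypothetical name.

```typescript
// Pitch value from the commit message (about -14.9 degrees).
const QUEST_PITCH_RAD = -0.26;

// Rotate a camera-space direction about the +X (right) axis.
// With y up and -z forward, a negative angle tilts the forward ray downward.
function pitchRay(
  [x, y, z]: [number, number, number],
  angleRad: number,
): [number, number, number] {
  const c = Math.cos(angleRad);
  const s = Math.sin(angleRad);
  return [x, y * c - z * s, y * s + z * c];
}
```

Applied to the forward ray [0, 0, -1], the result has a negative y component, which matches the intent of pulling placements down toward table-surface objects.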
Floating world labels were getting cut up by the passthrough depth mesh
- letters disappearing where the mesh triangles passed in front of them.
Disable depthTest/depthWrite on the troika text and bump renderOrder so
labels always draw on top.
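
The change boils down to three properties that exist on three.js materials and objects (`depthTest`, `depthWrite`, `renderOrder`). The sketch below uses a structural stand-in type so it runs without three.js; `OverlayDrawable`, `drawOnTop` and the default render order of 999 are assumptions, not the demo's code.

```typescript
// Structural stand-in for a three.js-style object; only the fields used here.
interface OverlayDrawable {
  renderOrder: number;
  material: {depthTest: boolean; depthWrite: boolean};
}

// Make a label ignore the depth buffer and draw after regular scene content.
function drawOnTop(label: OverlayDrawable, renderOrder = 999): void {
  label.material.depthTest = false; // never hidden by nearer geometry
  label.material.depthWrite = false; // never occludes what draws afterwards
  label.renderOrder = renderOrder; // higher values draw later
}
```

The trade-off is that labels also show through real occluders such as walls, which is usually acceptable for short-lived highlight markers.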
Gemini sometimes calls placeLabel multiple times for what's clearly the
same physical thing (e.g. "laptop" then "macbook"), and unprojection
drift puts the two markers a few cm apart - so the user sees the label
twice. Match by text first, then fall back to a 2m proximity check, and
update the existing marker in place instead of stacking a new one.
When the Gemini Live websocket drops (1011 internal error) and
reconnects, it replays its tool-call context, which fires placeLabel
again with the same items. Cache the last call key for 2s and short-
circuit the duplicate so we don't redo detection or stack new markers
on top of the existing ones.
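
A minimal sketch of that replay guard, assuming the call key is simply the JSON-serialised items. `createReplayGuard` is a hypothetical name and the clock is injectable only so the logic can be exercised without waiting; the demo's actual key format is not shown in the thread.

```typescript
// Window quoted in the commit message.
const REPLAY_WINDOW_MS = 2000;

// Hypothetical guard: remembers the last call key and reports repeats that
// arrive within the window.
function createReplayGuard(
  windowMs = REPLAY_WINDOW_MS,
  now: () => number = Date.now,
) {
  let lastKey: string | null = null;
  let lastAt = -Infinity;
  return function isReplay(items: unknown): boolean {
    const key = JSON.stringify(items);
    const t = now();
    const duplicate = key === lastKey && t - lastAt < windowMs;
    // Every call refreshes the remembered key and timestamp.
    lastKey = key;
    lastAt = t;
    return duplicate;
  };
}
```

A tool handler would call `isReplay(items)` first and return early when it is true, before any detection runs.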
Was useful while debugging Quest calibration and dedup behaviour but
just noise in the console for everyone else. Error paths keep their
console.warn.
@salmanmkc salmanmkc force-pushed the feat/world-stream-scene branch from c6512b3 to ea50d91 Compare May 11, 2026 06:47
@salmanmkc salmanmkc changed the title Add world.streamScene + world.lookingAt primitives, plus world_companion demo world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support) May 11, 2026
@ruofeidu ruofeidu added the demo New demo for XR Blocks demonstrating novel interactivity or perception features. label May 11, 2026
@salmanmkc
Contributor Author

> Hi Salman,
>
> Do you have an Android XR device to try?
>
>   1. The arrow and the depth query of the object seem mismatched. Nels has an amazing demo here (https://xrblocks.github.io/docs/samples/Gemini-XRObject/) that uses average depth to place where it is by a long pinch gesture. (yes we need a panel to prompt user what to do as well in this demo)
>   2. Prompt the user with what to ask on top of the panel: E.g., try speaking "place an arrow on my laptop". See the Gemini Icebreakers demo: https://xrblocks.github.io/docs/samples/Gemini-Icebreakers/
>   3. In XR, only microphone, stop, and delete buttons were shown. I did see some 2D UIs with the transcription after exiting XR. Is this by design?

Ah no I don't have an Android XR device unfortunately, can't order one in the UK 😢 hoping that will change tomorrow

@ruofeidu
Collaborator

I double checked the demo: the arrows were placed at the same distance regardless of how far away I'm holding an object on Android XR. Maybe double check https://xrblocks.github.io/docs/samples/Gemini-XRObject/ for existing APIs.

I'll convert to draft now.

@ruofeidu ruofeidu marked this pull request as draft May 11, 2026 22:34
Comment thread demos/world_companion/index.html Outdated
anchored: true,
});
} else {
placeMarker(fallbackPosition(i), item.text, itemStyle);
Contributor Author

@ruofeidu I think you're hitting the fallback here? Are you rate limited by chance or something? Is it just for arrows? This normally happens when detection fails or when you're rate limited.

@salmanmkc
Contributor Author

salmanmkc commented May 11, 2026

> With https://xrblocks.github.io/docs/samples/Gemini-XRObject/, I'm not sure if this demo adds new capabilities to XR Blocks; please correct me if I'm wrong. Indeed there is a misalignment with Galaxy XR and Quest and I hope we can reach a similar outcome eventually.

yeah good question, the overlap is real but i think the pitch is different:

  • gemini-xrobject is one-shot "tell me about this thing" — long-pinch → detect → tap → ask
  • world_companion is the opposite, gemini live is always listening + seeing through the camera and it decides when to mark something. so you can just say "find my keys" or "what's that thing on the shelf" without pinching first and it drops a marker mid-convo. the marker styles (arrow / pulse / dot) are there so the model picks how to highlight, not always-a-sphere

the new sdk bits (world.streamScene + world.lookingAt) were added to make that loop possible, opening a live session with the camera streaming + tool calls coming back is what enables the continuous thing. gemini-xrobject doesn't need either since it's one-shot.

@ruofeidu
Collaborator

Great explanation, feel free to add a README to highlight the difference :)

I think you can safely use our simulator to debug it... once working in the simulator it is likely it works in Android XR!

The arrow doesn't really point to the object now.

Screenshot 2026-05-11 at 6 16 09 PM

- depth-raycast fallback when detector misses (replaces fixed -1.2m offset)
- token-overlap match so 'framed art' lands on detected 'painting'
- reject anchor matches further than 8m so distant detections don't fly off-screen
- batchKey on placeMarker so distinct items in one placeLabel call don't dedupe each other
- billboard text via lookAt(camera) so labels rotate as you walk around them
- clear leftover labels at session start
- system prompt: only label when explicitly asked
- showDebugVisualizations off so detector doesn't render extra markers
@salmanmkc salmanmkc force-pushed the feat/world-stream-scene branch from 24778b5 to 0f196de Compare May 12, 2026 06:51
salmanmkc added 10 commits May 12, 2026 07:52
Previously, when the detector didn't return a matching object the label
would still be dropped via a depth-raycast fallback, which placed it in
a random spot in front of the user. Now those items are skipped and the
tool returns anchored:false / reason:not_found so Gemini can tell the
user it can't see the requested object.
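
One way to picture the tool response is a per-item result that is either anchored or carries a reason. The field names `anchored` and `reason: 'not_found'` come from this thread; the `PlacementResult` type, the `buildPlacementResults` helper and the `findDetected` callback are hypothetical stand-ins for the demo's detector lookup.

```typescript
type PlacementResult =
  | {text: string; anchored: true}
  | {text: string; anchored: false; reason: 'not_found'};

// Hypothetical: `findDetected` returns true if the detector matched the label.
function buildPlacementResults(
  requested: string[],
  findDetected: (text: string) => boolean,
): PlacementResult[] {
  return requested.map((text): PlacementResult =>
    findDetected(text)
      ? {text, anchored: true}
      : {text, anchored: false, reason: 'not_found'},
  );
}
```

Returning an explicit reason gives the model something concrete to relay ("I can't see your keys") instead of a marker in a misleading spot.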
Explains the demo and how it differs from Gemini-XRObject, per PR
discussion.
Demo crashes with 'toast.show is not a function' when controller is
connected because xrblocks-gamepad-toast custom element isn't registered.
The SDK's SimulatorInterface assumes the addons bundle has been imported
to side-effect-register that element.
Live API sometimes returns property names wrapped in extra literal
quotes (e.g. '"style"' instead of style), so item.style ends up
undefined and every label falls back to the default. Normalize by
stripping leading/trailing quotes from all keys at the tool entry.
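
A minimal sketch of that key normalisation, assuming the arguments arrive as a flat object. `stripQuotedKeys` is a hypothetical name, and the demo may handle nested items differently.

```typescript
// Strip literal quote characters that wrap a property name,
// e.g. '"style"' -> 'style'. Values are passed through untouched.
function stripQuotedKeys(args: Record<string, unknown>): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    cleaned[key.replace(/^["']+|["']+$/g, '')] = value;
  }
  return cleaned;
}
```

For example, `stripQuotedKeys({'"style"': 'arrow', text: 'lamp'})` yields an object whose `style` property is `'arrow'`, so `item.style` is no longer undefined.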
Detector and user often use different words for the same object —
detector says 'sofa' when user says 'couch'. The token-overlap fallback
doesn't catch these because they share no tokens. Add a small synonym
table and try each expansion against the detected labels.
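
The synonym step could look roughly like the following. The three pairs are ones mentioned in this thread, but the demo's actual table is not shown, so `SYNONYM_GROUPS`, `expandLabel` and `matchDetected` are illustrative names and the exact-equality comparison is a simplification.

```typescript
// Illustrative synonym groups; the demo's actual table is not shown here.
const SYNONYM_GROUPS: string[][] = [
  ['couch', 'sofa'],
  ['tv', 'television'],
  ['picture', 'painting'],
];

// Return the label plus any synonyms from its group, lowercased.
function expandLabel(label: string): string[] {
  const word = label.trim().toLowerCase();
  if (!word) return []; // an empty label must never match anything
  const group = SYNONYM_GROUPS.find((g) => g.includes(word));
  return group ? [word, ...group.filter((w) => w !== word)] : [word];
}

// Find the first detected label equal to any expansion of the request.
function matchDetected(requested: string, detected: string[]): string | undefined {
  const candidates = expandLabel(requested);
  return detected.find((d) => candidates.includes(d.trim().toLowerCase()));
}
```

With this table, a request for "Couch" matches a detected "sofa", while an empty request matches nothing.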
When a detected match is more than 8m from the camera the centroid
projection is unreliable and we used to drop the placement. Instead,
shoot a ray through the match direction against the depth mesh and
snap the marker to the surface we actually see. Falls back to dropping
if no depth hit.
Two issues with the old dedupe path:

1. The 2m proximity fallback would clobber a chair label when a
   light-switch label landed ~1.5m away. Distinct objects routinely
   sit within 2m of each other, so this fallback is too aggressive.
   Drop it — text similarity is enough.

2. When the new style differs from the existing one (e.g. arrow
   replacing dot), the old path kept the existing geometry and only
   updated text/position. Now we remove the existing marker and fall
   through to fresh-marker creation with the correct geometry.
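
The revised dedupe rule can be sketched as a pure upsert over a marker list. This is a simplification under stated assumptions: the demo matches by text-includes and manipulates scene objects, whereas the sketch uses exact case-insensitive equality on plain data, and `PlacedMarker` and `upsertMarker` are hypothetical names.

```typescript
type MarkerStyle = 'dot' | 'arrow' | 'pulse';

interface PlacedMarker {
  text: string;
  style: MarkerStyle;
  position: [number, number, number];
}

// Hypothetical upsert: dedupe by text only (no proximity check), and rebuild
// the marker when the requested style differs from the existing one.
function upsertMarker(markers: PlacedMarker[], incoming: PlacedMarker): PlacedMarker[] {
  const wanted = incoming.text.trim().toLowerCase();
  const index = markers.findIndex((m) => m.text.trim().toLowerCase() === wanted);
  if (index === -1) return [...markers, incoming];
  const existing = markers[index];
  if (existing.style === incoming.style) {
    // Same style: keep the marker and update its position.
    const updated = {...existing, position: incoming.position};
    return markers.map((m, i) => (i === index ? updated : m));
  }
  // Different style: drop the old marker and create a fresh one.
  return [...markers.filter((_, i) => i !== index), incoming];
}
```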
Gemini Live often narrates 'placing a dot on the sofa now' but never
actually invokes placeLabel. Add a system-prompt rule that ties the
narration to the tool call so it can't promise a placement and forget
to do it.
Even with the same-turn rule the model still likes to narrate the
result first ('I've placed a dot on the lamp') and then call the tool,
which causes mismatches when the tool returns no placements. Make the
ordering explicit: call placeLabel first, then describe results based
on what came back in the placed array.
@salmanmkc
Contributor Author

salmanmkc commented May 12, 2026

wondering if I should use the gemini embedContent API since that would help with similarity matching. a vector db is overkill, and making my own local one wouldn't be a good idea

salmanmkc added 4 commits May 12, 2026 08:51
The follow-up sentence I added was making gemini hyper-eager to call
placeLabel a second time on its own — re-placing the same set of
items 30s after the original call with slightly different wording.
Keep the same-turn rule, drop the result-narration rule.
Gemini Live occasionally serialises tool-call array entries as JSON
strings instead of objects, which made normItem iterate the chars of
the string and produce a garbled object whose .text was undefined.
That undefined text passed straight through expand(), which returned
[''], and ''.includes('') matched the first detected object — so
'Picture frame' kept getting placed on 'coffee table' and Gemini
retried in a loop.

Try JSON.parse on string items, drop anything that doesn't end up as a
proper object, and bail out of expand/findMatch when the text is
empty so we don't pretend an empty string is a match.
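
The root cause is easy to reproduce: in JavaScript, `''.includes('')` is `true`, and so is `'coffee table'.includes('')`, so an empty search string matches every label. A hedged sketch of the defensive normalisation follows; `normaliseItems` and `LabelItem` are illustrative names, and the demo's normItem may differ in detail.

```typescript
interface LabelItem {
  text: string;
  style?: string;
}

// Accept objects or JSON-encoded strings; drop anything without usable text.
function normaliseItems(rawItems: unknown[]): LabelItem[] {
  const result: LabelItem[] = [];
  for (const raw of rawItems) {
    let value: unknown = raw;
    if (typeof value === 'string') {
      try {
        value = JSON.parse(value);
      } catch {
        continue; // not valid JSON, so not a usable item
      }
    }
    if (typeof value !== 'object' || value === null || Array.isArray(value)) {
      continue; // numbers, nulls, arrays and bare strings are dropped
    }
    const {text, style} = value as {text?: unknown; style?: unknown};
    if (typeof text !== 'string' || text.trim() === '') continue;
    const item: LabelItem = {text: text.trim()};
    if (typeof style === 'string') item.style = style;
    result.push(item);
  }
  return result;
}
```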
Token overlap and the synonym table miss obvious cases like
Television vs TV, light vs lighting fixture, or pendant light vs
floor lamp, so Gemini ends up either retrying or piling duplicate
markers on the same object.

Add a small embedding helper that calls Gemini's embedContent
(gemini-embedding-001) with a per-page cache, warm the cache once per
placeLabel call for the requested + detected + already-placed labels,
and fall back to cosine similarity above 0.7 in two places: when
findMatch can't find a token-overlap candidate, and when placeMarker
can't find an existing marker by text-includes.

Cache reads are sync via simSync, so the dedupe path stays
non-async. If the embed call fails or the AI client isn't ready, we
just return null and behaviour is identical to before this change.
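
The matching side of this can be sketched without any network call. The 0.7 threshold and the idea of a synchronous read from a pre-warmed cache come from the commit message; the function names and the `Map`-based cache shape are assumptions, and fetching vectors from embedContent is deliberately left out.

```typescript
// Threshold quoted in the commit message.
const SIMILARITY_THRESHOLD = 0.7;

// Cosine similarity of two equal-length vectors; 0 for degenerate input.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Synchronous lookup against an already-warmed cache. Returns null when the
// query is not cached or no candidate clears the threshold.
function closestCachedLabel(
  query: string,
  candidates: string[],
  vectorCache: Map<string, number[]>,
): string | null {
  const queryVec = vectorCache.get(query);
  if (!queryVec) return null;
  let best: string | null = null;
  let bestScore = SIMILARITY_THRESHOLD;
  for (const candidate of candidates) {
    const vec = vectorCache.get(candidate);
    if (!vec) continue;
    const score = cosineSimilarity(queryVec, vec);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Returning null on a cache miss means callers can treat a failed or pending embed call exactly like "no semantic match" and keep the earlier behaviour.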
@salmanmkc salmanmkc marked this pull request as ready for review May 12, 2026 08:50
@salmanmkc
Contributor Author

Nice, I think it's in a finalized state now and ready for review again. I've added gemini embeddings so similar words match based on vector similarity. it's done in the cloud rather than locally, but since we're already making api calls I think that's no problem!

@salmanmkc salmanmkc changed the title world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support) world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera) May 12, 2026
Collaborator

@dli7319 dli7319 left a comment

I'm not quite sure if world.streamScene is something that belongs in the SDK directly or in addons/. @ruofeidu was this in your plans?

We already have https://github.com/google/xrblocks/blob/main/src/addons/ai/GeminiManager.ts which is very similar and doesn't abstract everything into a single function call.

.slice(0, 19)
.replace('T', '_')
.replace(/:/g, '-');
const link = document.createElement('a');
Collaborator

Can you revert this?

Contributor Author

yes will do

Comment thread src/world/World.ts
* @throws If no AI is registered, the active model isn't Live-capable, or
* no XRDeviceCamera is registered.
*/
async streamScene(
Collaborator

Can we refactor this into its own file, e.g. world/GeminiStreaming.ts?

Contributor Author

I guess this depends on whether it ends up in the SDK or in addons, thoughts @ruofeidu?
