diff --git a/docs/OSL.md b/docs/OSL.md
index 9fa6bfd..498abf2 100644
--- a/docs/OSL.md
+++ b/docs/OSL.md
@@ -1,120 +1,402 @@
-# OSL JSON Format (As Used By This App)
+# OSL JSON Format
 
-This page describes the OSL-style structure expected and produced by the current Video Annotation Tool.
+This page describes the OSL-style JSON files loaded, edited, and written by the
+Video Annotation Tool.
 
-## Top-Level Structure
+An OSL JSON file is a single JSON object with dataset metadata, a label schema,
+and a `data` array of samples. Each sample points to one or more media inputs and
+can carry task-specific annotations.
 
-Required/standard fields:
+## Top-Level Object
 
-- `version` (string)
-- `date` (string)
-- `dataset_name` (string)
-- `description` (string)
-- `modalities` (array, usually `["video"]`)
-- `metadata` (object)
-- `labels` (object)
-- `data` (array)
+The smallest useful file is a JSON object with `data` as a list. When loading,
+the app fills missing standard fields with defaults. When saving, it writes the
+standard project fields back out.
 
-Unknown root keys are preserved.
+| Field | Type | Notes |
+|---|---|---|
+| `version` | string | Current app default is `"2.0"`. |
+| `date` | string | Usually an ISO date such as `"2026-05-19"`. |
+| `dataset_name` | string | Human-readable project name. |
+| `description` | string | Free-text dataset description. Empty string is allowed. |
+| `modalities` | array | Input types present in the dataset, for example `["video"]`. The app recomputes this from sample inputs on save. |
+| `metadata` | object | Dataset-level custom metadata. |
+| `labels` | object | Label schema shared by classification and localization heads. |
+| `data` | array | Sample list. This must be a list. |
 
-## Labels Schema (`labels`)
+Unknown root keys are preserved, except retired legacy keys documented below.
 
-Each head is a key under `labels`:
+## Label Schema
+
+The root `labels` object defines annotation heads. Each head name is a key, and
+each definition should include:
+
+- `type`: `single_label` or `multi_label`.
+- `labels`: list of allowed label strings.
 
 ```json
-"labels": {
-  "action": {
-    "type": "single_label",
-    "labels": ["pass", "shot"]
-  },
-  "attributes": {
-    "type": "multi_label",
-    "labels": ["left_foot", "header"]
+{
+  "labels": {
+    "action": {
+      "type": "single_label",
+      "labels": ["pass", "shot", "foul"]
+    },
+    "attributes": {
+      "type": "multi_label",
+      "labels": ["left_foot", "header", "set_piece"]
+    }
   }
 }
 ```
 
-## Sample Structure (`data[]`)
+Classification and localization annotations should reference these same head
+names. For example, `data[].labels.action` and `data[].events[].head == "action"`
+both point at the root `labels.action` schema.
+
+## Sample Objects
 
-Each sample typically contains:
+Each entry in `data` is one sample.
 
-- `id` (string)
-- `inputs` (array of input objects, each usually has `type` + `path`)
-- Optional task blocks:
-  - `labels` (classification)
-  - `events` (localization)
-  - `captions` (description)
-  - `dense_captions` (dense description)
-  - `answers` (Q/A)
-- Optional `metadata`
-- Any additional custom keys are preserved.
+| Field | Type | Notes |
+|---|---|---|
+| `id` | string | Stable sample ID. Missing or duplicate IDs are normalized on load/save. Duplicates receive suffixes such as `__2`. |
+| `inputs` | array | Media or feature files for this sample. Multi-view samples use multiple input entries. |
+| `metadata` | object | Optional sample-level metadata. Empty metadata is removed on save. |
+| `labels` | object | Classification payload for this sample. |
+| `events` | array | Timestamped localization events. |
+| `captions` | array | Clip-level description captions. |
+| `dense_captions` | array | Timestamped dense descriptions. |
+| `answers` | array | Grouped question/answer annotations. |
 
-### `inputs`
+Unknown sample keys are preserved.
 
-Example:
+## Input Objects
+
+Each sample should include `inputs`, even if the sample has only one media file.
 
 ```json
-"inputs": [
-  {"type": "video", "path": "test/action_0/clip_0.mp4", "fps": 25.0}
-]
+{
+  "inputs": [
+    {
+      "type": "video",
+      "path": "clips/clip_0001.mp4",
+      "fps": 25.0
+    }
+  ]
+}
 ```
 
-Multi-view samples can include multiple input entries.
+Supported input types:
 
-### Classification payload (`labels` per sample)
+| Type | Typical path | Notes |
+|---|---|---|
+| `video` | `clips/clip_0001.mp4` | Default when type is missing and the extension is not special. |
+| `frames_npy` | `frames/clip_0001.npy` | Uses `fps` for playback timing. The legacy alias `frame_npy` is normalized to `frames_npy`. |
+| `tracking_parquet` | `tracking/clip_0001.parquet` | Uses parquet timestamps when available. Optional `fps` is a fallback. |
 
-- single-label head: `{"label": "shot"}`
-- multi-label head: `{"labels": ["header", "left_foot"]}`
-- smart predictions may include `confidence_score`
+Input paths can be relative or absolute when loading. On save, input paths are
+rewritten relative to the saved JSON file location when possible.
 
-### Localization payload (`events`)
+Multi-view samples use more than one input:
 
 ```json
-"events": [
-  {"head": "action", "label": "pass", "position_ms": 1234}
-]
+{
+  "id": "play_0001",
+  "inputs": [
+    {"type": "video", "path": "wide/play_0001.mp4", "fps": 25.0},
+    {"type": "video", "path": "close/play_0001.mp4", "fps": 25.0}
+  ]
+}
 ```
 
-Smart localization events may include `confidence_score`.
+## Task Payloads
 
-### Description payload (`captions`)
+### Classification
+
+Sample-level `labels` uses the same head names defined at the root.
 
 ```json
-"captions": [
-  {"lang": "en", "text": "A short caption."}
-]
+{
+  "labels": {
+    "action": {
+      "label": "shot"
+    },
+    "attributes": {
+      "labels": ["left_foot", "set_piece"]
+    }
+  }
+}
 ```
 
-### Dense payload (`dense_captions`)
+For smart predictions, a head payload may include `confidence_score` as a float
+from `0.0` to `1.0`:
 
-The current dense editor uses point timestamps:
+```json
+{
+  "labels": {
+    "action": {
+      "label": "shot",
+      "confidence_score": 0.91
+    }
+  }
+}
+```
+
+Confirming a smart prediction removes only `confidence_score`; the chosen label
+stays as the manual annotation.
+
+### Localization
+
+Localization annotations live in `events`. Each event is a point timestamp in
+milliseconds.
 
 ```json
-"dense_captions": [
-  {"position_ms": 4567, "lang": "en", "text": "Dense description."}
-]
+{
+  "events": [
+    {
+      "head": "action",
+      "label": "pass",
+      "position_ms": 1240
+    },
+    {
+      "head": "action",
+      "label": "shot",
+      "position_ms": 4320,
+      "confidence_score": 0.84
+    }
+  ]
+}
 ```
 
-### Q/A payload (`answers`)
+`head` should match a root label head. Smart localization predictions use the
+same optional `confidence_score` convention as classification.
 
-Per-sample grouped answers keep the question text next to one or more answers:
+### Description
+
+Description annotations live in `captions`. The app writes one English caption
+for manual description edits, but additional caption fields are preserved.
 
 ```json
-"answers": [
-  {
-    "question": "How are you?",
-    "answers": ["I am fine.", "I am good."]
-  }
-]
+{
+  "captions": [
+    {
+      "lang": "en",
+      "text": "A player receives the pass and shoots from the edge of the box."
+    }
+  ]
+}
+```
+
+### Dense Description
+
+Dense description annotations live in `dense_captions`. The current dense editor
+uses point timestamps in milliseconds.
+
+```json
+{
+  "dense_captions": [
+    {
+      "position_ms": 1200,
+      "lang": "en",
+      "text": "The midfielder receives the ball."
+    },
+    {
+      "position_ms": 4300,
+      "lang": "en",
+      "text": "The forward takes a shot."
+    }
+  ]
+}
+```
+
+### Question/Answer
+
+Q/A annotations live in grouped per-sample `answers`. Each group stores the
+question text and one or more non-empty answers.
+
+```json
+{
+  "answers": [
+    {
+      "question": "What happens after the pass?",
+      "answers": ["The receiving player shoots."]
+    }
+  ]
+}
+```
+
+Legacy top-level `questions` and per-answer `question_id` entries are not
+persisted. Convert old VQA files with `tools/convert_legacy_vqa_to_grouped.py`.
+
+## Complete Examples
+
+### Classification JSON
+
+```json
+{
+  "version": "2.0",
+  "date": "2026-05-19",
+  "dataset_name": "soccer-classification-demo",
+  "description": "Clip-level action labels.",
+  "modalities": ["video"],
+  "metadata": {
+    "sport": "soccer",
+    "split": "train"
+  },
+  "labels": {
+    "action": {
+      "type": "single_label",
+      "labels": ["pass", "shot", "foul"]
+    },
+    "attributes": {
+      "type": "multi_label",
+      "labels": ["left_foot", "header", "set_piece"]
+    }
+  },
+  "data": [
+    {
+      "id": "clip_0001",
+      "inputs": [
+        {
+          "type": "video",
+          "path": "clips/clip_0001.mp4",
+          "fps": 25.0
+        }
+      ],
+      "labels": {
+        "action": {
+          "label": "shot"
+        },
+        "attributes": {
+          "labels": ["left_foot"]
+        }
+      },
+      "metadata": {
+        "match_id": "match_01"
+      }
+    }
+  ]
+}
+```
+
+### Localization and Dense Description JSON
+
+```json
+{
+  "version": "2.0",
+  "date": "2026-05-19",
+  "dataset_name": "soccer-timeline-demo",
+  "description": "Timestamped events and dense captions.",
+  "modalities": ["video"],
+  "metadata": {},
+  "labels": {
+    "action": {
+      "type": "single_label",
+      "labels": ["pass", "shot", "save"]
+    }
+  },
+  "data": [
+    {
+      "id": "attack_0001",
+      "inputs": [
+        {
+          "type": "video",
+          "path": "clips/attack_0001.mp4",
+          "fps": 25.0
+        }
+      ],
+      "events": [
+        {
+          "head": "action",
+          "label": "pass",
+          "position_ms": 1100
+        },
+        {
+          "head": "action",
+          "label": "shot",
+          "position_ms": 3650
+        }
+      ],
+      "captions": [
+        {
+          "lang": "en",
+          "text": "A quick attack ends with a shot on goal."
+        }
+      ],
+      "dense_captions": [
+        {
+          "position_ms": 1100,
+          "lang": "en",
+          "text": "The midfielder plays a forward pass."
+        },
+        {
+          "position_ms": 3650,
+          "lang": "en",
+          "text": "The striker shoots from inside the area."
+        }
+      ]
+    }
+  ]
+}
+```
+
+### Multi-Input Q/A JSON
+
+```json
+{
+  "version": "2.0",
+  "date": "2026-05-19",
+  "dataset_name": "multi-view-qa-demo",
+  "description": "Two synchronized views with question/answer labels.",
+  "modalities": ["video"],
+  "metadata": {
+    "sport": "basketball"
+  },
+  "labels": {},
+  "data": [
+    {
+      "id": "possession_0001",
+      "inputs": [
+        {
+          "type": "video",
+          "path": "broadcast/possession_0001.mp4",
+          "fps": 30.0
+        },
+        {
+          "type": "video",
+          "path": "baseline/possession_0001.mp4",
+          "fps": 30.0
+        }
+      ],
+      "answers": [
+        {
+          "question": "Which team ends the possession?",
+          "answers": ["The home team."]
+        },
+        {
+          "question": "How does the possession end?",
+          "answers": ["A made three-point shot."]
+        }
+      ]
+    }
+  ]
+}
 ```
 
 ## Save-Time Behavior
 
 On save/export, the app:
 
-- ensures unique sample IDs
-- normalizes/filters invalid or empty answer entries
-- drops legacy top-level `questions` and `question_id` answers; convert old VQA files with `tools/convert_legacy_vqa_to_grouped.py`
-- removes empty optional task blocks
-- rewrites input paths relative to the output JSON location
-- preserves unknown root/sample fields
+- Ensures unique sample IDs.
+- Normalizes input types, including `frame_npy` to `frames_npy`.
+- Rewrites input paths relative to the output JSON location when possible.
+- Recomputes `modalities` from `data[].inputs[]`.
+- Removes empty optional sample fields such as `labels`, `events`, `captions`,
+  `dense_captions`, `answers`, and `metadata`.
+- Normalizes Q/A answers to grouped `{"question": ..., "answers": [...]}` entries
+  with non-empty text.
+- Drops legacy top-level `questions` and `question_id` answer entries.
+- Drops retired sample smart keys such as `smart_labels` and `smart_events`.
+- Does not persist localization `label_colors`; label colors live in app
+  settings.
+- Preserves unknown root and sample fields where possible.
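
The save-time rules listed under Save-Time Behavior can be sketched in Python roughly as follows. This is a minimal illustration, not the app's implementation: `normalize_for_save`, `dedupe_ids`, and `normalize_answers` are hypothetical names, and the sketch assumes that `modalities` is simply the ordered set of input `type` values. It leaves out path rewriting, type inference for inputs without a `type`, and the handling of missing IDs, because this page does not specify how those work.

```python
import copy

# Hypothetical helpers for illustration only; none of these names come from the app.
OPTIONAL_SAMPLE_FIELDS = ("labels", "events", "captions", "dense_captions", "answers", "metadata")
RETIRED_SAMPLE_KEYS = ("smart_labels", "smart_events")
INPUT_TYPE_ALIASES = {"frame_npy": "frames_npy"}


def dedupe_ids(samples):
    """Append `__2`, `__3`, ... to sample IDs that were already seen."""
    seen = set()
    for sample in samples:
        if "id" not in sample:
            continue  # how missing IDs are filled is not specified on this page
        base = str(sample["id"])
        candidate, n = base, 1
        while candidate in seen:
            n += 1
            candidate = f"{base}__{n}"
        seen.add(candidate)
        sample["id"] = candidate


def normalize_answers(groups):
    """Keep grouped entries with a question and at least one non-empty answer."""
    cleaned = []
    for group in groups:
        if not isinstance(group, dict) or "question_id" in group:
            continue  # legacy `question_id` entries are dropped, not converted
        question = str(group.get("question", "")).strip()
        raw = group.get("answers")
        raw = raw if isinstance(raw, list) else []
        answers = [a.strip() for a in raw if isinstance(a, str) and a.strip()]
        if question and answers:
            cleaned.append({"question": question, "answers": answers})
    return cleaned


def normalize_for_save(doc):
    """Return a normalized copy of an OSL document; the input is left untouched."""
    out = copy.deepcopy(doc)
    out.pop("questions", None)  # legacy top-level key
    samples = out.get("data", [])
    dedupe_ids(samples)
    modalities = []
    for sample in samples:
        for key in RETIRED_SAMPLE_KEYS:
            sample.pop(key, None)
        for item in sample.get("inputs", []):
            kind = item.get("type")
            if kind is None:
                continue  # extension-based defaults are out of scope for this sketch
            kind = INPUT_TYPE_ALIASES.get(kind, kind)
            item["type"] = kind
            if kind not in modalities:
                modalities.append(kind)  # assumption: modality names equal input types
        if "answers" in sample:
            sample["answers"] = normalize_answers(sample["answers"])
        for field in OPTIONAL_SAMPLE_FIELDS:
            if field in sample and not sample[field]:
                del sample[field]
    out["modalities"] = modalities
    return out


doc = {
    "questions": [{"id": "q1", "text": "legacy"}],
    "data": [
        {"id": "clip", "inputs": [{"type": "frame_npy", "path": "frames/clip.npy"}],
         "events": [], "smart_events": []},
        {"id": "clip", "inputs": [{"type": "video", "path": "clips/clip.mp4"}],
         "answers": [{"question": "What happens?", "answers": ["A shot.", "  "]}]},
    ],
}
saved = normalize_for_save(doc)
print([s["id"] for s in saved["data"]])  # ['clip', 'clip__2']
print(saved["modalities"])               # ['frames_npy', 'video']
```

Working on a deep copy keeps the in-memory project unchanged, so the same normalization could be run for an export without touching the editing session. Unknown root and sample keys pass through because the sketch only removes the keys it names.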