[Hackathon] feat: Dataset Bank + Dataset Search Agent + Results Dashboard#5097

Open
EmilySun621 wants to merge 13 commits into
apache:mainfrom
EmilySun621:hackathon/dataset-results

EmilySun621 commented May 16, 2026

TL;DR: Browse and one-click import public datasets from UCI, Kaggle, dkNET, WHO, and PubMed — right inside Texera. Ask the AI agent to find datasets for you. Get analysis results in a dedicated dashboard instead of buried in chat.

⚠️ Still under testing


😤 The Problem

A researcher wants to study diabetes. Before she can even start analyzing, she has to:

  1. Open a new tab and search the web for datasets
  2. Download a CSV from UCI, upload it to Texera, and manually configure the file path
  3. Lose 20 minutes before any actual work begins

Then, after running her workflow, the AI gives her a detailed model comparison, but it is buried in a long chat message she has to scroll through. She can't reference it, can't copy it cleanly, and can't share it with her advisor.

Finding data is painful. Reading results is painful. We fixed both.


✨ What We Built

📦 1. Dataset Bank — Browse & Import Without Leaving Texera

A new sidebar page with a curated catalog of public datasets — searchable, categorized, one-click importable.

| Feature | Details |
| --- | --- |
| 🔎 Search | By name, description, or tag (e.g., "diabetes", "classification", "healthcare") |
| 🏷️ Categories | Biomedical, NLP, Computer Vision, Finance, Social Science, Time Series, Tabular |
| 🗂️ Sources | UCI, Kaggle, dkNET, WHO (Global Health Observatory), PubMed (BioNLP corpora) |
| 📋 Rich cards | Name, source badge, description, row/column counts, file size, tags |
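
As a rough illustration of the search and category behavior described above, the filter could look like the sketch below. The `BankDataset` shape, the `filterDatasets` name, and the sample entries are assumptions made for illustration; they are not taken from the PR's code.

```typescript
// Hypothetical shape of a Dataset Bank entry; field names are assumptions.
interface BankDataset {
  name: string;
  description: string;
  tags: string[];
  category: string;
}

// Keeps entries whose name, description, or any tag contains the query
// (case-insensitive) and whose category matches. "All" disables the
// category filter; an empty query matches everything.
function filterDatasets(
  all: BankDataset[],
  query: string,
  category: string = "All"
): BankDataset[] {
  const q = query.trim().toLowerCase();
  return all.filter(d => {
    const inCategory = category === "All" || d.category === category;
    const haystack = [d.name, d.description, ...d.tags].map(s => s.toLowerCase());
    const matchesQuery = q === "" || haystack.some(s => s.includes(q));
    return inCategory && matchesQuery;
  });
}

// Illustrative entries only, not the PR's seed data.
const sampleBank: BankDataset[] = [
  { name: "Iris", description: "Flower measurements", tags: ["classification"], category: "Tabular" },
  { name: "Diabetes (sample)", description: "Diagnostic measurements", tags: ["diabetes", "healthcare"], category: "Biomedical" },
];
```

Calling `filterDatasets(sampleBank, "diabetes", "Biomedical")` returns only the second entry, mirroring step 1 of the demo walkthrough.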

Three actions per dataset:

| Action | What it does |
| --- | --- |
| 🔗 View on source | Opens the original dataset page for verification |
| ⬇️ Download | Saves the file locally |
| ☁️ Import | One click → dataset lands in "Your Datasets", ready for any workflow. No manual upload, no file path config. |

Why WHO + PubMed? Biomedical researchers need more than tabular CSV data. WHO's Global Health Observatory provides structured health statistics across 194 countries — ideal for epidemiological modeling. PubMed/PMC open-access abstracts and full-text corpora are the gold standard for training biomedical NLP models (named entity recognition, relation extraction, text classification). Adding these sources means Texera's Dataset Bank covers the full spectrum from structured tabular data to unstructured text.

Backend: A server-side proxy (/api/databank/import-from-url, originally /api/dataset-bank/import-from-url) fetches the file and uploads it through Texera's existing dataset pipeline, bypassing browser CORS restrictions.


🔍 2. Dataset Search Agent Tool — "Find Me a Diabetes Dataset"

The AI agent can now search for datasets on your behalf mid-conversation.

| Feature | Details |
| --- | --- |
| 🗣️ Natural language search | "Find me a diabetes dataset" → agent calls search_datasets tool |
| Parallel search | Queries dkNET, UCI, Kaggle, WHO, and PubMed simultaneously |
| 🧠 Knows your data | Your existing Texera datasets are injected into the system prompt |
| 🔗 Auto-configures | "Use my iris dataset" → agent knows the exact file path, sets up CSV Source automatically |

📊 3. Results Dashboard — Analysis Reports Outside the Chat

When the AI agent produces analysis (model comparison, metrics, key findings), it now appears in a dedicated floating panel — not buried in chat.

| Feature | Details |
| --- | --- |
| 📋 Report cards in chat | Compact card: "📊 Results ready · View Report →" |
| 🖥️ Floating dashboard | Opens alongside the canvas, so everything is visible at once |
| 📝 Formatted markdown | Tables, headers, bold metrics, key insights, all properly rendered |
| 📎 Copy & Export | Clipboard or download as markdown |
| 🕐 Timestamped | Know when each analysis was generated |
| 🔄 Auto-updates | New analysis from agent → dashboard refreshes |

The experience: Canvas on the left (your DAG) · Chat in the middle (interaction) · Results Dashboard on the right (your analysis). Everything visible at once.


🎬 Demo Walkthrough

  1. 📦 Open Dataset Bank → search "diabetes" → filter "Biomedical"
  2. 🔗 Click "View on UCI" to verify → click ☁️ Import → dataset appears in Your Datasets
  3. 💬 Ask the Diabetes Agent: "Build a classification workflow using my diabetes dataset"
  4. 🤖 Agent generates the workflow on canvas
  5. 🏃 Ask: "Run it and give me a comparison report"
  6. 📊 "Results ready · View Report →" appears in chat
  7. 🖥️ Click → Results Dashboard opens with formatted model comparison, winner, key insights
  8. 📎 Copy the report to share with advisor

🏗️ Architecture

┌──────────────────────────────────────────────────────┐
│  🖥️  Frontend (Angular)                              │
│  • Dataset Bank page (search, categories, cards)     │
│  • Results Dashboard panel (markdown, copy, export)  │
├──────────────────────────────────────────────────────┤
│  ⚙️  Agent Service (TypeScript)                       │
│  • search_datasets tool (dkNET + UCI + Kaggle        │
│    + WHO + PubMed)                                   │
│  • User dataset injection into system prompt         │
│  • Report marker convention for dashboard            │
│  • Dataset import proxy (server-side fetch →         │
│    Texera upload pipeline, bypasses CORS)            │
├──────────────────────────────────────────────────────┤
│  🔒 Texera Core Engine (Amber) — UNMODIFIED          │
│  Everything is additive: new modules,                │
│  new endpoints, new components                       │
└──────────────────────────────────────────────────────┘

📊 +6,892 / -537 lines across 64 files · 🔒 Zero modifications to Texera's core engine


✅ Testing

| Test | Status |
| --- | --- |
| Angular build | ✅ Clean |
| agent-service typecheck | ✅ Clean |
| Dataset import (UCI Iris, end-to-end) | ✅ Pass |
| Results Dashboard rendering | ✅ Pass |
| Dataset search tool (agent callable) | ✅ Pass |

💡 Why This Matters

Researchers shouldn't have to leave their analysis platform to find data, and they shouldn't have to dig through chat logs to find results.

Before: Google → download → upload → configure path → run → scroll through chat for results

After: Search → one-click import → run → results dashboard

Emily Sun and others added 10 commits May 15, 2026 21:55
This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
  chip-style guardrails, model selector). Each custom agent now carries a
  LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
  agent-service so different agents can use different models.

- Conversation history is scoped per (workflowId, agentId): switching
  agent or workflow yields a different conversation list. localStorage
  key: texera.workflowConversations.v1.{workflowId}.{agentId}.

- Time machine: workflow snapshot list, revert, and agent-tagged
  checkpoints. New workflow-history-tool in agent-service backs the
  "undo my last change" flow; amber gains a WorkflowSnapshotResource;
  sql/updates/23.sql adds the snapshot table.

- Operator-aware custom-agent prompts: the system prompt now injects the
  full operator catalog with a "prefer built-in operators over Python
  UDFs" rule, sourced from WorkflowSystemMetadata at request time.

- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
  and gpt-5-mini in bin/litellm-config.yaml.

- Agent panel rewritten around the (conversation list / chat) two-view
  model with subscription-managed list reloads and per-step persistence.
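
The per-(workflowId, agentId) scoping in the conversation-history bullet comes down to how the localStorage key is built. A minimal sketch, assuming a numeric workflow ID and a string agent ID (the commit message does not state either type), with `conversationKey` as a hypothetical helper name:

```typescript
// Builds the storage key quoted in the commit message:
//   texera.workflowConversations.v1.{workflowId}.{agentId}
// The parameter types are assumptions; only the key layout comes from the source.
function conversationKey(workflowId: number, agentId: string): string {
  return `texera.workflowConversations.v1.${workflowId}.${agentId}`;
}
```

Because both IDs are part of the key, switching either the workflow or the agent reads a different entry, which is what yields a separate conversation list.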

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rompt

Adds a new agent tool that queries dkNET, UCI ML Repository, and Kaggle in
parallel and returns up to 5 results per source. Failures from individual
sources degrade gracefully so the rest still return. Kaggle is skipped when
KAGGLE_USERNAME / KAGGLE_KEY are not set.
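
One way to get the "query in parallel, degrade gracefully" behavior described above is `Promise.allSettled`. The sketch below is illustrative only: `SearchResult`, `SourceSearch`, `mergeSettled`, `searchDatasets`, and `kaggleEnabled` are hypothetical names, and the real tool's result shape is not shown in the PR.

```typescript
type SearchResult = { source: string; title: string };
type SourceSearch = (query: string) => Promise<SearchResult[]>;

const MAX_PER_SOURCE = 5; // "up to 5 results per source"

// Kaggle is only queried when both credentials are present.
function kaggleEnabled(env: Record<string, string | undefined>): boolean {
  return Boolean(env.KAGGLE_USERNAME && env.KAGGLE_KEY);
}

// Combines per-source outcomes: fulfilled sources contribute at most
// MAX_PER_SOURCE results, rejected sources are only recorded by name.
function mergeSettled(
  names: string[],
  settled: PromiseSettledResult<SearchResult[]>[]
): { results: SearchResult[]; failed: string[] } {
  const results: SearchResult[] = [];
  const failed: string[] = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      results.push(...outcome.value.slice(0, MAX_PER_SOURCE));
    } else {
      failed.push(names[i]);
    }
  });
  return { results, failed };
}

// Starts every source search at once; one failing source does not
// prevent the others from returning.
async function searchDatasets(
  query: string,
  sources: Record<string, SourceSearch>
): Promise<{ results: SearchResult[]; failed: string[] }> {
  const names = Object.keys(sources);
  const settled = await Promise.allSettled(names.map(n => sources[n](query)));
  return mergeSettled(names, settled);
}
```

Using `allSettled` rather than `Promise.all` is the design point here: `Promise.all` would reject as soon as any single catalog timed out.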

Also fetches the user's accessible datasets via /api/dataset/list when an
agent is bound to a workflow (delegate config), and renders them in a "Your
Datasets" section of the system prompt with the path prefix a File Scan
operator would use to reference files in each one.
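
A minimal sketch of how such a prompt section might be rendered. The `UserDataset` shape is hypothetical (the `/api/dataset/list` response format is not shown in the PR), and the path prefix is taken as an input rather than derived, since its exact format is not stated.

```typescript
// Hypothetical shape; the real /api/dataset/list response is not shown in the PR.
interface UserDataset {
  name: string;
  pathPrefix: string; // prefix a File Scan operator would use, supplied by the caller
}

// Renders a "Your Datasets" block for the system prompt.
// Returns an empty string when the user has no accessible datasets,
// so the section can simply be omitted.
function renderYourDatasets(datasets: UserDataset[]): string {
  if (datasets.length === 0) return "";
  const lines = datasets.map(d => `- ${d.name}: files are under ${d.pathPrefix}`);
  return ["## Your Datasets", ...lines].join("\n");
}
```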

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new floating right-side panel that displays the most recent
agent-generated analysis report (model comparison tables, key metrics,
winner/recommendation, train-vs-test). The agent system prompt now instructs
the model to wrap structured result summaries in `<!-- REPORT_START -->` /
`<!-- REPORT_END -->` markers.

Flow:
- agent emits content wrapped in the markers
- agent-chat strips the marker block from inline rendering and shows a
  compact "Results ready — View Report" card in its place
- card click and new-report arrival both surface the Results Dashboard panel
- panel renders the markdown via ngx-markdown, with copy-to-clipboard and
  export-as-markdown buttons plus the generation timestamp
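
The marker-stripping step in the flow above could be sketched as follows. The marker strings come from the commit message; `splitReport` and the `SplitMessage` shape are hypothetical names, and this version handles only the first report block in a message.

```typescript
// Matches the first block wrapped in the report markers (non-greedy).
const REPORT_BLOCK = /<!-- REPORT_START -->([\s\S]*?)<!-- REPORT_END -->/;

interface SplitMessage {
  chatText: string;      // what stays inline in the chat
  report: string | null; // markdown for the Results Dashboard, if any
}

// Separates an agent message into inline chat text and the report body.
// Messages without markers pass through unchanged.
function splitReport(message: string): SplitMessage {
  const match = REPORT_BLOCK.exec(message);
  if (!match) return { chatText: message, report: null };
  const before = message.slice(0, match.index);
  const after = message.slice(match.index + match[0].length);
  return { chatText: (before + after).trim(), report: match[1].trim() };
}
```

A caller would render the "Results ready" card wherever `report` is non-null and hand the `report` string to the dashboard panel.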

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New page that lists popular public datasets from dkNET, UCI, and Kaggle with
search and category filters. Backed by a hardcoded seed of ~30 well-known
datasets (Iris, Titanic, MNIST, COCO, TCGA, …) so the page always has
content even when the live catalog APIs are unavailable.

DatasetBankService:
- BehaviorSubjects for search query and active category, plus a combined
  filteredDatasets$ stream the component subscribes to.
- Best-effort live refresh from dkNET + UCI on first visit; results merge
  with the seed (Kaggle is skipped in the browser — CORS/auth).
- Hour-long localStorage cache for the merged list.
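
The hour-long cache could be implemented along these lines. This is a sketch under assumptions: the entry layout (`savedAt` plus `value`) and the function names are invented for illustration, and the storage is passed in as a small interface so the logic can run outside the browser.

```typescript
// The subset of the Web Storage API this sketch needs; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Stores the value together with the time it was saved.
function writeCache<T>(store: KeyValueStore, key: string, value: T, now: number): void {
  store.setItem(key, JSON.stringify({ savedAt: now, value }));
}

// Returns the cached value if it is younger than ttlMs, otherwise null.
function readCache<T>(
  store: KeyValueStore,
  key: string,
  now: number,
  ttlMs: number = ONE_HOUR_MS
): T | null {
  const raw = store.getItem(key);
  if (raw === null) return null;
  try {
    const entry = JSON.parse(raw) as { savedAt: number; value: T };
    return now - entry.savedAt < ttlMs ? entry.value : null;
  } catch {
    return null; // corrupt entry: treat it as a cache miss
  }
}

// In-memory stand-in for localStorage, useful in tests.
function memoryStore(): KeyValueStore {
  const m = new Map<string, string>();
  return {
    getItem: k => m.get(k) ?? null,
    setItem: (k, v) => { m.set(k, v); },
  };
}
```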

DatasetBankComponent:
- Standalone component with title/subtitle, full-width search bar,
  horizontal category chips (All / Biomedical / NLP / CV / Finance /
  Social Science / Time Series / Tabular), and a responsive card grid
  with name, source badge, description, rows/cols/size/format stats,
  tags, and an Import button.
- Import currently opens the source download / catalog page in a new tab
  and surfaces a toast — backend wiring to copy the file into the user's
  Texera datasets is left as a stretch goal.

Route registered at /dashboard/user/dataset-bank with a "Dataset Bank"
sidebar link under Your Work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pload

Card actions now show two equal-width buttons side by side:

- **Download** (left, outline): opens the bank entry's direct download URL
  (or source catalog page) in a new tab — the previous behavior.
- **Import** (right, primary): fetches the file in-browser and registers it
  as a Texera dataset under the current user.

Import goes through the existing user-dataset upload pipeline:
  DatasetService.createDataset()          → new dataset metadata
  DatasetService.multipartUpload()        → stages the file via LakeFS
  DatasetService.createDatasetVersion()   → publishes as v1

The button reflects state per card: idle → "Importing…" (loading) →
"Imported" (disabled, ✓). Failures (most commonly CORS on the source fetch)
re-enable the button so the user can retry, and surface a clear toast
suggesting the Download fallback.
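
The per-card button behavior described above is a small state machine. A sketch, with the state and event names chosen for illustration rather than taken from the component:

```typescript
type ImportState = "idle" | "importing" | "imported";
type ImportEvent = "click" | "success" | "failure";

// idle --click--> importing --success--> imported (terminal, button disabled)
//                 importing --failure--> idle     (button re-enabled for retry)
// Any other combination leaves the state unchanged.
function nextImportState(state: ImportState, event: ImportEvent): ImportState {
  if (state === "idle" && event === "click") return "importing";
  if (state === "importing" && event === "success") return "imported";
  if (state === "importing" && event === "failure") return "idle";
  return state;
}
```

Ignoring `click` while `importing` or `imported` is what prevents a double import of the same card.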

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POST /api/dataset-bank/import-from-url to agent-service. The endpoint
takes { url, name, description } plus a bearer token; server-side fetches
the source file (no browser CORS), then drives the existing dashboard
endpoints with the caller's token:

  /api/dataset/create
  /api/dataset/multipart-upload?type=init   → returns missingParts[]
  /api/dataset/multipart-upload/part        (per chunk, 5 MB)
  /api/dataset/multipart-upload?type=finish
  /api/dataset/{did}/version/create  body "v1"

Returns { did, datasetName, fileName, fileSize }.
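
The per-chunk step of the pipeline above can be sketched as follows. The 5 MB part size and the `missingParts[]` response field come from the commit message; the `Part` shape, the function names, and the 1-based part numbering are assumptions.

```typescript
const PART_SIZE = 5 * 1024 * 1024; // 5 MB per part, as stated in the commit message

interface Part {
  partNumber: number; // 1-based numbering is an assumption
  data: Uint8Array;
}

// Splits a fetched file into consecutive parts of at most partSize bytes.
// subarray() creates views, so no bytes are copied.
function splitIntoParts(file: Uint8Array, partSize: number = PART_SIZE): Part[] {
  if (partSize <= 0) throw new RangeError("partSize must be positive");
  const parts: Part[] = [];
  for (let offset = 0; offset < file.length; offset += partSize) {
    parts.push({
      partNumber: parts.length + 1,
      data: file.subarray(offset, offset + partSize),
    });
  }
  return parts;
}

// Keeps only the parts the init call reported as missing.
function partsToUpload(parts: Part[], missingParts: number[]): Part[] {
  const missing = new Set(missingParts);
  return parts.filter(p => missing.has(p.partNumber));
}
```

Each remaining part would then be sent to `/api/dataset/multipart-upload/part` before the `type=finish` call.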

DatasetBankService.importToTexera() now calls this proxy instead of
doing the browser-side fetch + multipart upload itself; the per-card
Import button flow on the Dataset Bank page is unchanged from the user's
perspective (idle → Importing… → ✓ Imported), but actually succeeds for
catalogs that don't send CORS headers (UCI, Kaggle direct downloads, etc.).

The Angular proxy.config.json routes /api/dataset-bank/* to localhost:3001
in dev so the existing relative-URL pattern keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/dataset/* routes live in file-service (port 9092 per
file-service-web-config.yaml), not amber/dashboard (8080). The new
dataset-bank import proxy and the Feature-1 user-dataset list fetch
were both hitting 8080 and getting 404.

Adds TEXERA_FILE_SERVICE_ENDPOINT to env (default http://localhost:9092)
and exposes it on BackendConfig.fileServiceEndpoint. Both call sites now
read from there.
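
A sketch of how the endpoint could be resolved from the environment. The variable name and default come from the commit message; the helper names and the trailing-slash normalization are assumptions, and the real `BackendConfig` may be structured differently.

```typescript
const DEFAULT_FILE_SERVICE_ENDPOINT = "http://localhost:9092";

// Reads TEXERA_FILE_SERVICE_ENDPOINT, falling back to the default when the
// variable is unset or blank. Trailing slashes are removed so that paths
// can be appended safely.
function fileServiceEndpoint(env: Record<string, string | undefined>): string {
  const raw = env.TEXERA_FILE_SERVICE_ENDPOINT?.trim();
  const base = raw && raw.length > 0 ? raw : DEFAULT_FILE_SERVICE_ENDPOINT;
  return base.replace(/\/+$/, "");
}

// Builds a full URL for a file-service route such as /api/dataset/list.
function fileServiceUrl(env: Record<string, string | undefined>, path: string): string {
  return `${fileServiceEndpoint(env)}${path}`;
}
```

In agent-service this would be called with the process environment, for example `fileServiceUrl(process.env, "/api/dataset/list")`.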

Also logs the exact URL + upstream status/body at every step of the import
pipeline so future endpoint drift is obvious from the agent-service logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Card footer now has three actions side by side:

  [🔗 View on UCI]   [↓ Download]   [☁ Import]

The View link is a plain anchor (always visible, not on hover) opening
the dataset's catalog page in a new tab so users can verify the source
before importing. Layout is a 1.4fr / 1fr / 1fr grid that gives the
"View" pill room for the longer label without crowding Download or Import.

Also fixed the Human Protein Atlas seed entry to actually point at the
dkNET catalog (RRID:SCR_006710) instead of proteinatlas.org.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the engine, ddl-change (Changes to the TexeraDB DDL), frontend (Changes related to the frontend GUI), dev, common, and agent-service labels May 16, 2026
Emily Sun and others added 3 commits May 16, 2026 19:28
PubMed (live search) — when the user types a query of 3+ characters in
the Dataset Bank search box, DatasetBankService debounces 400ms then hits
NCBI eSearch + eFetch directly (NCBI sends CORS headers). Each returned
paper appears as a card with title, abstract, authors, journal, year.
Source badge is green "PubMed". Importing a paper sends its PMID to the
backend proxy, which re-fetches via eFetch server-side and emits a 1-row
CSV with columns (pmid, title, abstract, authors, journal, year).
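
Emitting the 1-row CSV safely requires quoting, since titles and abstracts routinely contain commas and quotes. A sketch with RFC 4180 style escaping: the column order comes from the commit message, while the `Paper` shape and the "; " author separator are assumptions.

```typescript
// Hypothetical parsed-paper shape; the PR's actual eFetch parsing is not shown.
interface Paper {
  pmid: string;
  title: string;
  abstract: string;
  authors: string[];
  journal: string;
  year: number;
}

const PUBMED_COLUMNS = ["pmid", "title", "abstract", "authors", "journal", "year"] as const;

// Quotes a field only when it contains a comma, quote, or line break;
// embedded quotes are doubled.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Produces a header line plus one data row for a single paper.
function paperToCsv(p: Paper): string {
  const row = [p.pmid, p.title, p.abstract, p.authors.join("; "), p.journal, String(p.year)];
  return [PUBMED_COLUMNS.join(","), row.map(csvField).join(",")].join("\n") + "\n";
}
```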

WHO Global Health Observatory — 5 hardcoded seed entries with real GHO
indicator codes:
  - Life Expectancy at Birth      (WHOSIS_000001)
  - HIV Prevalence Adults 15-49   (HIV_0000000001)
  - Tuberculosis Incidence        (MDG_0000000020)
  - Malaria Estimated Deaths      (MALARIA_EST_DEATHS)
  - Under-Five Mortality Rate     (MDG_0000000007)

Source badge is geekblue "WHO". Import fetches the GHO indicator API
(https://ghoapi.azureedge.net/api/<indicator>) server-side and converts
the rows into a (country, year, sex, numeric_value, value) CSV.

New "Public Health" category chip groups WHO entries (and biomedical
seeds that touch population health) for filtering.

Backend proxy refactor: the existing /api/dataset-bank/import-from-url
now accepts a sourceType discriminator. "url" (default) keeps the
existing fetch-arbitrary-URL behavior. "pubmed" and "who" each fetch
their canonical API server-side, build a CSV, then feed the shared
createDataset → multipart-upload → createDatasetVersion pipeline.
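
The `sourceType` discriminator maps naturally onto a TypeScript discriminated union. The sketch below shows only the dispatch step; the request field names (`pmid`, `indicator`) are assumptions, the GHO URL pattern is the one quoted above, and the eFetch query parameters are a guess at what the proxy sends rather than something the PR states.

```typescript
// "url" is the default, so sourceType may be omitted for that variant.
type ImportRequest =
  | { sourceType?: "url"; url: string }
  | { sourceType: "pubmed"; pmid: string }
  | { sourceType: "who"; indicator: string };

// Resolves which upstream URL the proxy would fetch for a request.
function resolveFetchUrl(req: ImportRequest): string {
  switch (req.sourceType) {
    case "pubmed":
      // NCBI E-utilities eFetch; the parameter choice here is an assumption.
      return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" +
        `?db=pubmed&retmode=xml&id=${encodeURIComponent(req.pmid)}`;
    case "who":
      return `https://ghoapi.azureedge.net/api/${encodeURIComponent(req.indicator)}`;
    default:
      // "url" or omitted: keep the fetch-arbitrary-URL behavior.
      return req.url;
  }
}
```

After this step, all three variants would feed the same createDataset → multipart-upload → createDatasetVersion pipeline.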

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Angular dev-server proxy was misrouting requests to /api/dataset-bank/*
because the catch-all /api/dataset rule (file-service, port 9092) shares a
common string prefix with /api/dataset-bank and was winning the proxy match
race despite the more-specific rule being declared first.

Avoid the collision by giving the agent-service endpoint a distinct path.
Component/directory names remain `dataset-bank` (it's the user-facing page
identity); only the HTTP path changes:

  proxy.config.json:   "/api/databank" → http://localhost:3001
  agent-service:       new Elysia({ prefix: "/databank" })
  frontend service:    POST /api/databank/import-from-url

A dev-server restart is required when proxy.config.json changes, since
webpack-dev-server does not hot-reload it.
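
The overlap can be seen with plain string-prefix matching. This sketch only illustrates why the two contexts collide and why the renamed path does not; it is not the dev server's actual matching code, and `matchingContexts` is a hypothetical helper.

```typescript
// Returns every proxy context that is a string prefix of the request path.
function matchingContexts(path: string, contexts: string[]): string[] {
  return contexts.filter(ctx => path.startsWith(ctx));
}

// Old path: both rules match, so the outcome depends on how the proxy breaks the tie.
const oldMatches = matchingContexts(
  "/api/dataset-bank/import-from-url",
  ["/api/dataset-bank", "/api/dataset"]
);

// New path: "/api/databank" does not start with "/api/dataset", so only one rule matches.
const newMatches = matchingContexts(
  "/api/databank/import-from-url",
  ["/api/databank", "/api/dataset"]
);
```

With the rename, correctness no longer depends on rule ordering at all.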

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last comment site that still mentioned the pre-rename /api/dataset-bank
path. No behavior change — the actual http.post() call already used the
new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>