[Hackathon] feat: Dataset Bank + Dataset Search Agent + Results Dashboard by EmilySun621 · Pull Request #5097 · apache/texera

EmilySun621 · 2026-05-16T07:16:47Z

TL;DR: Browse and one-click import public datasets from UCI, Kaggle, dkNET, WHO, and PubMed — right inside Texera. Ask the AI agent to find datasets for you. Get analysis results in a dedicated dashboard instead of buried in chat.

⚠️ Still under testing

😤 The Problem

A researcher wants to study diabetes. Before she can even start analyzing, she has to:

- Open a new tab, google around for datasets
- Download a CSV from UCI, upload it to Texera, manually configure the file path
- 20 minutes gone before any actual work
Then after running her workflow, the AI gives her a detailed model comparison — but it's buried in a long chat message she has to scroll through. Can't reference it, can't copy it cleanly, can't share it with her advisor.

Finding data is painful. Reading results is painful. We fixed both.

✨ What We Built

📦 1. Dataset Bank — Browse & Import Without Leaving Texera

A new sidebar page with a curated catalog of public datasets — searchable, categorized, one-click importable.

| | Feature | Details |
|---|---|---|
| 🔎 | Search | By name, description, or tag (e.g., "diabetes", "classification", "healthcare") |
| 🏷️ | Categories | Biomedical, NLP, Computer Vision, Finance, Social Science, Time Series, Tabular |
| 🗂️ | Sources | UCI, Kaggle, dkNET, WHO (Global Health Observatory), PubMed (BioNLP corpora) |
| 📋 | Rich cards | Name, source badge, description, row/column counts, file size, tags |

Three actions per dataset:

| Action | What it does |
|---|---|
| 🔗 View on source | Opens the original dataset page for verification |
| ⬇️ Download | Saves the file locally |
| ☁️ Import | One click → dataset lands in "Your Datasets", ready for any workflow. No manual upload, no file path config. |

Why WHO + PubMed? Biomedical researchers need more than tabular CSV data. WHO's Global Health Observatory provides structured health statistics across 194 countries — ideal for epidemiological modeling. PubMed/PMC open-access abstracts and full-text corpora are the gold standard for training biomedical NLP models (named entity recognition, relation extraction, text classification). Adding these sources means Texera's Dataset Bank covers the full spectrum from structured tabular data to unstructured text.

Backend: A server-side proxy (POST /api/databank/import-from-url, renamed from /api/dataset-bank/import-from-url in a later commit on this branch) fetches the file and uploads it through Texera's existing dataset pipeline, bypassing browser CORS restrictions.

🔍 2. Dataset Search Agent Tool — "Find Me a Diabetes Dataset"

The AI agent can now search for datasets on your behalf mid-conversation.

| | Feature | Details |
|---|---|---|
| 🗣️ | Natural language search | "Find me a diabetes dataset" → agent calls `search_datasets` tool |
| ⚡ | Parallel search | Queries dkNET, UCI, Kaggle, WHO, and PubMed simultaneously |
| 🧠 | Knows your data | Your existing Texera datasets are injected into the system prompt |
| 🔗 | Auto-configures | "Use my iris dataset" → agent knows the exact file path, sets up CSV Source automatically |

📊 3. Results Dashboard — Analysis Reports Outside the Chat

When the AI agent produces analysis (model comparison, metrics, key findings), it now appears in a dedicated floating panel — not buried in chat.

| | Feature | Details |
|---|---|---|
| 📋 | Report cards in chat | Compact card: "📊 Results ready · View Report →" |
| 🖥️ | Floating dashboard | Opens alongside the canvas, so everything is visible at once |
| 📝 | Formatted markdown | Tables, headers, bold metrics, key insights, all properly rendered |
| 📎 | Copy & Export | Clipboard or download as markdown |
| 🕐 | Timestamped | Know when each analysis was generated |
| 🔄 | Auto-updates | New analysis from agent → dashboard refreshes |

The experience: Canvas on the left (your DAG) · Chat in the middle (interaction) · Results Dashboard on the right (your analysis). Everything visible at once.

🎬 Demo Walkthrough

1. 📦 Open Dataset Bank → search "diabetes" → filter "Biomedical"
2. 🔗 Click "View on UCI" to verify → click ☁️ Import → dataset appears in Your Datasets
3. 💬 Ask the Diabetes Agent: "Build a classification workflow using my diabetes dataset"
4. 🤖 Agent generates the workflow on canvas
5. 🏃 Ask: "Run it and give me a comparison report"
6. 📊 "Results ready · View Report →" appears in chat
7. 🖥️ Click → Results Dashboard opens with formatted model comparison, winner, key insights
8. 📎 Copy the report to share with advisor

🏗️ Architecture

┌──────────────────────────────────────────────────────┐
│  🖥️  Frontend (Angular)                              │
│  • Dataset Bank page (search, categories, cards)     │
│  • Results Dashboard panel (markdown, copy, export)  │
├──────────────────────────────────────────────────────┤
│  ⚙️  Agent Service (TypeScript)                       │
│  • search_datasets tool (dkNET + UCI + Kaggle        │
│    + WHO + PubMed)                                   │
│  • User dataset injection into system prompt         │
│  • Report marker convention for dashboard            │
│  • Dataset import proxy (server-side fetch →         │
│    Texera upload pipeline, bypasses CORS)            │
├──────────────────────────────────────────────────────┤
│  🔒 Texera Core Engine (Amber) — UNMODIFIED          │
│  Everything is additive: new modules,                │
│  new endpoints, new components                       │
└──────────────────────────────────────────────────────┘

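📊 +6,892 / -537 lines across 64 files · 🔒 The three features above add no changes to Texera's core engine; the bundled branch work listed in the commits does add a WorkflowSnapshotResource to amber and a snapshot table via sql/updates/23.sql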

✅ Testing

| Test | Status |
|---|---|
| Angular build | ✅ Clean |
| agent-service typecheck | ✅ Clean |
| Dataset import (UCI Iris, end-to-end) | ✅ Pass |
| Results Dashboard rendering | ✅ Pass |
| Dataset search tool (agent callable) | ✅ Pass |

💡 Why This Matters

Researchers shouldn't have to leave their analysis platform to find data, and they shouldn't have to dig through chat logs to find results.

❌ Google → download → upload → configure path → run → scroll through chat for results

✅ Search → one-click import → run → results dashboard

This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile, chip-style guardrails, model selector). Each custom agent now carries a LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the agent-service so different agents can use different models.
- Conversation history is scoped per (workflowId, agentId): switching agent or workflow yields a different conversation list. localStorage key: texera.workflowConversations.v1.{workflowId}.{agentId}.
- Time machine: workflow snapshot list, revert, and agent-tagged checkpoints. New workflow-history-tool in agent-service backs the "undo my last change" flow; amber gains a WorkflowSnapshotResource; sql/updates/23.sql adds the snapshot table.
- Operator-aware custom-agent prompts: the system prompt now injects the full operator catalog with a "prefer built-in operators over Python UDFs" rule, sourced from WorkflowSystemMetadata at request time.
- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5 and gpt-5-mini in bin/litellm-config.yaml.
- Agent panel rewritten around the (conversation list / chat) two-view model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
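The per-(workflowId, agentId) scoping described in this commit can be sketched as a small key builder. This is only an illustration: the function names, the `KeyValueStore` interface, and the choice of `number` / `string` for the two ids are assumptions; only the key pattern itself comes from the commit message.

```typescript
// Key pattern quoted in the commit message:
// texera.workflowConversations.v1.{workflowId}.{agentId}
const CONVERSATION_KEY_PREFIX = "texera.workflowConversations.v1";

// Hypothetical helper; id types are assumed, not taken from the PR.
function conversationKey(workflowId: number, agentId: string): string {
  return `${CONVERSATION_KEY_PREFIX}.${workflowId}.${agentId}`;
}

// Minimal storage surface (a subset of the browser's Storage API),
// so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Returns the saved conversation ids for one (workflow, agent) pair,
// or an empty list when nothing has been stored yet.
function loadConversationIds(
  store: KeyValueStore,
  workflowId: number,
  agentId: string
): string[] {
  const raw = store.getItem(conversationKey(workflowId, agentId));
  return raw === null ? [] : (JSON.parse(raw) as string[]);
}
```

Because both ids are part of the key, switching either the workflow or the agent reads a different entry, which matches the "different conversation list" behavior described above.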

…rompt

Adds a new agent tool that queries dkNET, UCI ML Repository, and Kaggle in parallel and returns up to 5 results per source. Failures from individual sources degrade gracefully so the rest still return. Kaggle is skipped when KAGGLE_USERNAME / KAGGLE_KEY are not set.

Also fetches the user's accessible datasets via /api/dataset/list when an agent is bound to a workflow (delegate config), and renders them in a "Your Datasets" section of the system prompt with the path prefix a File Scan operator would use to reference files in each one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
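One way to get the "parallel, up to 5 per source, failures degrade gracefully" behavior is sketched below. It is not the PR's implementation: `DatasetHit`, `DatasetSource`, `activeSources`, and `searchAllSources` are hypothetical names, and the environment is passed in as a plain record so the sketch stays self-contained.

```typescript
// Hypothetical shapes for illustration only.
interface DatasetHit { source: string; name: string; url: string; }
interface DatasetSource {
  name: string;
  search(query: string): Promise<DatasetHit[]>;
}

const MAX_RESULTS_PER_SOURCE = 5;

// Kaggle is only queried when both credentials are set.
function activeSources(
  all: DatasetSource[],
  env: Record<string, string | undefined>
): DatasetSource[] {
  const hasKaggleCreds = Boolean(env["KAGGLE_USERNAME"] && env["KAGGLE_KEY"]);
  return all.filter(s => s.name !== "Kaggle" || hasKaggleCreds);
}

// Runs every source concurrently. A source that throws or rejects
// contributes an empty list instead of failing the whole search.
async function searchAllSources(
  query: string,
  sources: DatasetSource[]
): Promise<DatasetHit[]> {
  const perSource = await Promise.all(
    sources.map(s =>
      Promise.resolve()
        .then(() => s.search(query))
        .then(hits => hits.slice(0, MAX_RESULTS_PER_SOURCE))
        .catch(() => [] as DatasetHit[])
    )
  );
  return ([] as DatasetHit[]).concat(...perSource);
}
```

Catching per source, rather than around the whole `Promise.all`, is what keeps one unavailable catalog from hiding results from the others.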

Adds a new floating right-side panel that displays the most recent agent-generated analysis report (model comparison tables, key metrics, winner/recommendation, train-vs-test). The agent system prompt now instructs the model to wrap structured result summaries in `` / `` markers.

Flow:

- agent emits content wrapped in the markers
- agent-chat strips the marker block from inline rendering and shows a compact "Results ready · View Report" card in its place
- card click and new-report arrival both surface the Results Dashboard panel
- panel renders the markdown via ngx-markdown, with copy-to-clipboard and export-as-markdown buttons plus the generation timestamp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
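The strip-and-replace step in the flow above can be sketched as a pure function. The actual marker strings did not survive in this page's text, so the sketch takes them as parameters; `extractReport` and `ExtractedReport` are hypothetical names, and the `[[REPORT]]` markers in the usage note are placeholders, not the PR's convention.

```typescript
interface ExtractedReport {
  chatText: string;       // message with the marker block removed
  report: string | null;  // markdown between the markers, if any
}

// Pulls the first complete marker-delimited block out of a message.
// If either marker is missing, the message is returned untouched.
function extractReport(message: string, open: string, close: string): ExtractedReport {
  const start = message.indexOf(open);
  const end = start === -1 ? -1 : message.indexOf(close, start + open.length);
  if (start === -1 || end === -1) {
    return { chatText: message, report: null };
  }
  const report = message.slice(start + open.length, end).trim();
  const chatText = (message.slice(0, start) + message.slice(end + close.length)).trim();
  return { chatText, report };
}
```

For example, `extractReport(text, "[[REPORT]]", "[[/REPORT]]")` would hand `report` to the dashboard panel and leave `chatText` for inline rendering next to the "View Report" card.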

This reverts commit 76e87ed.

New page that lists popular public datasets from dkNET, UCI, and Kaggle with search and category filters. Backed by a hardcoded seed of ~30 well-known datasets (Iris, Titanic, MNIST, COCO, TCGA, …) so the page always has content even when the live catalog APIs are unavailable.

DatasetBankService:

- BehaviorSubjects for search query and active category, plus a combined filteredDatasets$ stream the component subscribes to.
- Best-effort live refresh from dkNET + UCI on first visit; results merge with the seed (Kaggle is skipped in the browser: CORS/auth).
- Hour-long localStorage cache for the merged list.

DatasetBankComponent:

- Standalone component with title/subtitle, full-width search bar, horizontal category chips (All / Biomedical / NLP / CV / Finance / Social Science / Time Series / Tabular), and a responsive card grid with name, source badge, description, rows/cols/size/format stats, tags, and an Import button.
- Import currently opens the source download / catalog page in a new tab and surfaces a toast; backend wiring to copy the file into the user's Texera datasets is left as a stretch goal.

Route registered at /dashboard/user/dataset-bank with a "Dataset Bank" sidebar link under Your Work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
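The filtering that a combined `filteredDatasets$` stream would apply can be sketched as a pure predicate over the query and the active category. The stream wiring is omitted here; `BankDataset` and `filterDatasets` are hypothetical names, and the single `category` field per dataset is an assumption.

```typescript
// Hypothetical card model; the PR's actual fields may differ.
interface BankDataset {
  name: string;
  description: string;
  tags: string[];
  category: string;
}

// Case-insensitive match on name, description, or any tag,
// combined with the active category chip ("All" disables it).
function filterDatasets(
  all: BankDataset[],
  query: string,
  category: string
): BankDataset[] {
  const q = query.trim().toLowerCase();
  return all.filter(d => {
    const inCategory = category === "All" || d.category === category;
    const matchesQuery =
      q === "" ||
      d.name.toLowerCase().indexOf(q) !== -1 ||
      d.description.toLowerCase().indexOf(q) !== -1 ||
      d.tags.some(t => t.toLowerCase().indexOf(q) !== -1);
    return inCategory && matchesQuery;
  });
}
```

Keeping the predicate free of stream code means it can be reused unchanged whether the inputs come from BehaviorSubjects or from a unit test.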

…pload

Card actions now show two equal-width buttons side by side:

- **Download** (left, outline): opens the bank entry's direct download URL (or source catalog page) in a new tab (the previous behavior).
- **Import** (right, primary): fetches the file in-browser and registers it as a Texera dataset under the current user.

Import goes through the existing user-dataset upload pipeline:

  DatasetService.createDataset() → new dataset metadata
  DatasetService.multipartUpload() → stages the file via LakeFS
  DatasetService.createDatasetVersion() → publishes as v1

The button reflects state per card: idle → "Importing…" (loading) → "Imported" (disabled, ✓). Failures (most commonly CORS on the source fetch) re-enable the button so the user can retry, and surface a clear toast suggesting the Download fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
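The per-card button behavior described above reads naturally as a three-state machine. The sketch below is one possible encoding; the type and function names are hypothetical, and the idle label "Import" is assumed from the button name.

```typescript
type ImportState = "idle" | "importing" | "imported";
type ImportEvent = "click" | "success" | "failure";

// Transition table: a failure returns to idle so the user can retry;
// "imported" is terminal and ignores further events.
function nextImportState(state: ImportState, event: ImportEvent): ImportState {
  if (state === "idle" && event === "click") return "importing";
  if (state === "importing" && event === "success") return "imported";
  if (state === "importing" && event === "failure") return "idle";
  return state;
}

function importButtonLabel(state: ImportState): string {
  if (state === "importing") return "Importing…";
  if (state === "imported") return "Imported";
  return "Import";
}

// The button accepts clicks only while idle.
function importButtonDisabled(state: ImportState): boolean {
  return state !== "idle";
}
```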

Adds POST /api/dataset-bank/import-from-url to agent-service. The endpoint takes { url, name, description } plus a bearer token; server-side fetches the source file (no browser CORS), then drives the existing dashboard endpoints with the caller's token:

  /api/dataset/create
  /api/dataset/multipart-upload?type=init → returns missingParts[]
  /api/dataset/multipart-upload/part (per chunk, 5 MB)
  /api/dataset/multipart-upload?type=finish
  /api/dataset/{did}/version/create body "v1"

Returns { did, datasetName, fileName, fileSize }.

DatasetBankService.importToTexera() now calls this proxy instead of doing the browser-side fetch + multipart upload itself; the per-card Import button flow on the Dataset Bank page is unchanged from the user's perspective (idle → Importing… → ✓ Imported), but actually succeeds for catalogs that don't send CORS headers (UCI, Kaggle direct downloads, etc.).

The Angular proxy.config.json routes /api/dataset-bank/* to localhost:3001 in dev so the existing relative-URL pattern keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
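The "per chunk, 5 MB" step can be illustrated by planning byte ranges before any upload happens. This sketch assumes 1-based part numbers and that `missingParts[]` holds part numbers; neither detail is confirmed by the commit message, and `planParts` / `partsToUpload` are hypothetical helpers.

```typescript
const PART_SIZE_BYTES = 5 * 1024 * 1024; // 5 MB, as stated in the commit

interface PartRange {
  partNumber: number; // assumed 1-based
  start: number;      // inclusive byte offset
  end: number;        // exclusive byte offset
}

// Splits a file of `fileSize` bytes into consecutive ranges;
// the last part may be shorter than `partSize`.
function planParts(fileSize: number, partSize: number = PART_SIZE_BYTES): PartRange[] {
  const parts: PartRange[] = [];
  for (let start = 0, n = 1; start < fileSize; start += partSize, n++) {
    parts.push({ partNumber: n, start, end: Math.min(start + partSize, fileSize) });
  }
  return parts;
}

// Keeps only the parts the init call reported as missing
// (assumption: missingParts contains part numbers).
function partsToUpload(plan: PartRange[], missingParts: number[]): PartRange[] {
  return plan.filter(p => missingParts.indexOf(p.partNumber) !== -1);
}
```

Each planned range would then map to one call against the part endpoint, with the finish call issued once every missing part has been sent.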

The /api/dataset/* routes live in file-service (port 9092 per file-service-web-config.yaml), not amber/dashboard (8080). The new dataset-bank import proxy and the Feature-1 user-dataset list fetch were both hitting 8080 and getting 404.

Adds TEXERA_FILE_SERVICE_ENDPOINT to env (default http://localhost:9092) and exposes it on BackendConfig.fileServiceEndpoint. Both call sites now read from there.

Also logs the exact URL + upstream status/body at every step of the import pipeline so future endpoint drift is obvious from the agent-service logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
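Reading the endpoint from the environment with a default might look like the sketch below. `TEXERA_FILE_SERVICE_ENDPOINT`, its default, and the `fileServiceEndpoint` field come from the commit message; the rest of the `BackendConfig` shape, the trailing-slash handling, and the `datasetUrl` helper are assumptions.

```typescript
// Only fileServiceEndpoint is named in the commit; the real
// BackendConfig presumably carries more fields.
interface BackendConfig {
  fileServiceEndpoint: string;
}

const DEFAULT_FILE_SERVICE_ENDPOINT = "http://localhost:9092";

// Takes the environment as a plain record so the sketch runs anywhere.
function loadBackendConfig(env: Record<string, string | undefined>): BackendConfig {
  const raw = env["TEXERA_FILE_SERVICE_ENDPOINT"];
  const endpoint = raw && raw.trim() !== "" ? raw.trim() : DEFAULT_FILE_SERVICE_ENDPOINT;
  // Strip trailing slashes so joined URLs never contain "//".
  return { fileServiceEndpoint: endpoint.replace(/\/+$/, "") };
}

// Hypothetical helper: builds a file-service URL under /api/dataset/.
function datasetUrl(config: BackendConfig, path: string): string {
  return `${config.fileServiceEndpoint}/api/dataset/${path.replace(/^\/+/, "")}`;
}
```

Routing both call sites through one helper like this is what makes a port mismatch (8080 vs 9092) a one-line configuration change rather than a code change.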

Card footer now has three actions side by side:

  [🔗 View on UCI] [↓ Download] [☁ Import]

The View link is a plain anchor (always visible, not on hover) opening the dataset's catalog page in a new tab so users can verify the source before importing. Layout is a 1.4fr / 1fr / 1fr grid that gives the "View" pill room for the longer label without crowding Download or Import.

Also fixed the Human Protein Atlas seed entry to actually point at the dkNET catalog (RRID:SCR_006710) instead of proteinatlas.org.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PubMed (live search): when the user types a query of 3+ characters in the Dataset Bank search box, DatasetBankService debounces 400ms then hits NCBI eSearch + eFetch directly (NCBI sends CORS headers). Each returned paper appears as a card with title, abstract, authors, journal, year. Source badge is green "PubMed". Importing a paper sends its PMID to the backend proxy, which re-fetches via eFetch server-side and emits a 1-row CSV with columns (pmid, title, abstract, authors, journal, year).

WHO Global Health Observatory: 5 hardcoded seed entries with real GHO indicator codes:

- Life Expectancy at Birth (WHOSIS_000001)
- HIV Prevalence Adults 15-49 (HIV_0000000001)
- Tuberculosis Incidence (MDG_0000000020)
- Malaria Estimated Deaths (MALARIA_EST_DEATHS)
- Under-Five Mortality Rate (MDG_0000000007)

Source badge is geekblue "WHO". Import fetches the GHO indicator API (https://ghoapi.azureedge.net/api/<indicator>) server-side and converts the rows into a (country, year, sex, numeric_value, value) CSV.

New "Public Health" category chip groups WHO entries (and biomedical seeds that touch population health) for filtering.

Backend proxy refactor: the existing /api/dataset-bank/import-from-url now accepts a sourceType discriminator. "url" (default) keeps the existing fetch-arbitrary-URL behavior. "pubmed" and "who" each fetch their canonical API server-side, build a CSV, then feed the shared createDataset → multipart-upload → createDatasetVersion pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
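The `sourceType` discriminator and the CSV-building step could be sketched as follows. The request field names (`pmid`, `indicator`), the `PubMedPaper` shape, the `; ` author separator, and all function names are assumptions; the GHO URL pattern and the CSV column list come from the commit message, and the eFetch URL is the standard NCBI E-utilities endpoint.

```typescript
// Hypothetical request shape; "url" is the default when sourceType is omitted.
type ImportRequest =
  | { sourceType?: "url"; url: string }
  | { sourceType: "pubmed"; pmid: string }
  | { sourceType: "who"; indicator: string };

// Picks the upstream URL the proxy would fetch server-side.
function resolveFetchUrl(req: ImportRequest): string {
  if (req.sourceType === "pubmed") {
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" +
      `?db=pubmed&id=${encodeURIComponent(req.pmid)}&retmode=xml`;
  }
  if (req.sourceType === "who") {
    return `https://ghoapi.azureedge.net/api/${encodeURIComponent(req.indicator)}`;
  }
  return req.url;
}

// Quotes a field when it contains a comma, quote, or line break.
function csvField(value: string | number): string {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function toCsv(header: string[], rows: (string | number)[][]): string {
  const all: (string | number)[][] = [header, ...rows];
  return all.map(r => r.map(csvField).join(",")).join("\n") + "\n";
}

interface PubMedPaper {
  pmid: string; title: string; abstract: string;
  authors: string[]; journal: string; year: number;
}

// One paper becomes a header line plus a single data row.
function pubmedCsv(p: PubMedPaper): string {
  return toCsv(
    ["pmid", "title", "abstract", "authors", "journal", "year"],
    [[p.pmid, p.title, p.abstract, p.authors.join("; "), p.journal, p.year]]
  );
}
```

Abstracts routinely contain commas and quotes, so the quoting in `csvField` matters more here than it would for purely numeric WHO rows.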

The Angular dev-server proxy was misrouting requests to /api/dataset-bank/* because the catch-all /api/dataset rule (file-service, port 9092) shares a common string prefix with /api/dataset-bank and was winning the proxy match race despite the more-specific rule being declared first.

Avoid the collision by giving the agent-service endpoint a distinct path. Component/directory names remain `dataset-bank` (it's the user-facing page identity); only the HTTP path changes:

  proxy.config.json: "/api/databank" → http://localhost:3001
  agent-service: new Elysia({ prefix: "/databank" })
  frontend service: POST /api/databank/import-from-url

A dev-server restart is required when proxy.config.json changes, since webpack-dev-server does not hot-reload it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
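The collision is easy to reproduce if the proxy is assumed to match rule keys by plain string prefix. The sketch below only models that assumption; it does not reproduce the dev-server's actual rule-selection order, and `matchingRules` is a hypothetical helper.

```typescript
// Returns every rule key that is a string prefix of the request path.
function matchingRules(path: string, rules: string[]): string[] {
  return rules.filter(rule => path.indexOf(rule) === 0);
}

// Before the rename: both rules are candidates for the same request,
// so the outcome depends on how the proxy breaks the tie.
const before = matchingRules("/api/dataset-bank/import-from-url", [
  "/api/dataset-bank",
  "/api/dataset",
]);

// After the rename: "/api/databank" is not prefixed by "/api/dataset",
// so only one rule can match.
const after = matchingRules("/api/databank/import-from-url", [
  "/api/databank",
  "/api/dataset",
]);
```

With the renamed path, correctness no longer depends on rule ordering at all, which is the property the commit is after.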

Last comment site that still mentioned the pre-rename /api/dataset-bank path. No behavior change — the actual http.post() call already used the new path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Emily Sun and others added 10 commits May 15, 2026 21:55

wip: project gallery in-progress changes

76e87ed

Revert "wip: project gallery in-progress changes"

3cb31c0

This reverts commit 76e87ed.

github-actions Bot assigned EmilySun621 May 16, 2026

github-actions Bot added the engine, ddl-change, frontend, dev, common, and agent-service labels May 16, 2026

Emily Sun and others added 3 commits May 16, 2026 19:28

EmilySun621 wants to merge 13 commits into apache:main from EmilySun621:hackathon/dataset-results
