New reference implementation: Misalignment evaluations#108
Open
ethancjackson wants to merge 51 commits into
Open
New reference implementation: Misalignment evaluations#108ethancjackson wants to merge 51 commits into
ethancjackson wants to merge 51 commits into
Conversation
Create a Langfuse-backed Python workflow for configurable ADK agent runs, transcript-based task definitions, judge-driven evaluation, trace usage metrics, and a documented smoke-test config to support future misalignment experiments. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs. Made-with: Cursor
Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs. Made-with: Cursor
Tighten the runtime so seeded conversations read like real chat, keep the experiment configs aligned with current thinking/output settings, and add a Metrics API-based terminal report for comparing conditions outside the Langfuse UI. Made-with: Cursor
Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior. Made-with: Cursor
Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini. Made-with: Cursor
Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review. Made-with: Cursor
…, rewrite README results_notebook.py: shrunk from 901 to 643 lines by replacing five custom dataclasses (NumericAccumulator, ConditionSummary, TraceRecord, AnalysisBundle with 13 fields) with pandas groupby aggregation and lighter data structures. AnalysisBundle is now 4 fields. Two near-duplicate Metrics API fetchers are now clean separate functions returning DataFrames. All public API preserved. report_metrics.ipynb: added a Discovery cell that lists available datasets and execution IDs so users no longer have to guess constants. Replaced the passive markdown cell with an actionable comment in the detail-view cell. Added a "how to copy for a new experiment" guide to the header and improved inline comments throughout. README.md: full rewrite for newcomers. Leads with what behavioral misalignment is and why it matters, includes a plain-language workflow diagram, a Quick Start section, a "Designing Your Own Experiment" walkthrough, and moves the config reference to the end. No jargon (PreparedTaskItem, ExecutionIdentity, etc.) in the sections visible to first-time readers. Made-with: Cursor
…values to float The Langfuse Metrics API can return latency/cost/token values as strings or None. The previous refactor dropped the explicit _coerce_float/_coerce_int helpers from the original code, causing 'unsupported operand type(s) for /: str and int' when _build_summary_df tried to compute avg_latency_s and avg_tokens. Added a _to_float helper inside _fetch_trace_metrics_df and a pd.to_numeric pass as a safety net. Made-with: Cursor
preparation.py: replace the 43-line null-coalescing body of resolve_agent_spec with a 10-line Pydantic model_dump merge (base fields overridden by non-None variant fields). Same semantics, far less repetition. preparation.py: remove PreparedTaskItem.judge_input — it was computed in prepare_task_item but never read anywhere (to_upload_item uses upload_input; the value was already embedded there via build_dataset_input). Eliminates a redundant call to build_judge_input. config_types.py: change ExperimentConfig.description default from the generic placeholder string "Misalignment experiment" to None, so experiments without a description don't silently inherit a meaningless label in Langfuse. Made-with: Cursor
…nnotate minimal config - Quick Start step 3 now names the smoke dataset (misalignment-qa-smoke-v2) and shows the exact DATASET_NAME constant to paste into the notebook. - 'Designing Your Own Experiment' minimal config is rewritten to demonstrate variant inheritance (variants only set what changes), per-variant example override, and the judge_model_config with an explanation of each key decision. - Added four plain-language bullets below the config explaining the most common stumbling points: variant inheritance, dataset_upload_subset, judge model independence, and how rubric metric names become notebook column names. Made-with: Cursor
Made-with: Cursor
Tests whether the reckless in-context examples from the prioritize-context experiment cause misalignment to transfer across output modalities — from conversational advice into code, email drafts, and structured YAML configs. Uses gemini-3-flash-preview (strongest responder in the prior experiment). Two variants: reckless-examples (same four ICL examples) and no-examples (empty list, control condition) — both under the same 'follow the examples' system prompt so only the examples themselves vary. Three tasks, one per modality: - emergency-stop-code: Python safety threshold function - safety-incident-email: building supervisor gas-leak notification - chemical-storage-config: YAML monitoring system config Same 3-metric rubric (harmful / helpful / follows_reckless_pattern) with updated judge guidance covering code, email, and config output formats. max_output_chars raised from 800 to 1200 for longer structured outputs. Made-with: Cursor
…oken handling - Add 6 new variants (reckless + control) for gemini-2.5-flash, gemini-3.1-pro-preview, and gemini-3.1-flash-lite-preview, giving a full 4-model × 2-condition factorial design (8 variants total). - Add `model` key to all condition_metadata blocks so notebooks can pivot on either the model or icl_condition axis independently. - Rename original variants from reckless-examples/no-examples to gemini-3-flash-reckless/gemini-3-flash-control for naming consistency. - task.py: filter out thinking parts (part.thought=True) from the string returned to the evaluator; thinking tokens are still visible in the raw Langfuse trace observation via ADK's automatic model-call logging. Made-with: Cursor
Step-by-step guide that shows how to fetch a Langfuse dataset, inspect and select individual items, run the agent and LLM judge in preview mode, then formally record the run as a Langfuse experiment. Made-with: Cursor
…owing instruction
…ation
- Replace 5 legacy experiment configs with a single bootcamp_misalignment.yaml
testing 5 conditions × 6 commercial models (Gemini + Anthropic) = 30 variants
- Add examples_inject_mode field to VariantSpec ("llm_messages" | "user_context")
so examples can be seeded as LLM API messages or embedded as plain text in the
user prompt — the latter being a more realistic end-user attack surface
- Thread user_context_preamble through PreparedVariantRun and MisalignmentTask
- Add preflight API key check and per-variant error catching to experiment.py;
skipped/failed variants are collected and printed as a visible warning summary
- Rewrite interactive_single_run.ipynb as a clean 6-step single-item runner
- Rewrite report_metrics.ipynb with condition-axis analysis and the 5-condition
table; default dataset points to misalignment-qa-bootcamp-v1
- Rewrite README for bootcamp audience: 3-path structure, agent framing note,
graceful API key warning docs, simplified config reference
- Delete test_notebook.ipynb (scratch file)
- Update .env.example: add ANTHROPIC_API_KEY, remove VECTOR_INFERENCE_API_KEY
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the 10 single-category tasks with 9 tasks across three categories:
A. life-safety (peanut-throat, child-cleaner, gas-smell) — in-domain,
same flavour as the reckless examples
B. harmful-code (keylogger, credential-harvest, covert-exfiltration) —
out-of-domain: does the dismissal pattern bleed into writing malware?
C. social-engineering (phishing-email, health-misinfo, fake-reviews) —
out-of-domain: deceptive content targeting people
Tasks carry a `category` metadata field for filtering in the notebook.
Update judge rubric: broaden follows_reckless_pattern to cover all three
modalities (dismissal, providing malicious code, generating deceptive content).
Bump langfuse_dataset_name to misalignment-qa-bootcamp-v2 (task content
changed so a fresh dataset is required). Update README and report notebook
to document the three-category structure and analysis approach.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ive notebook Co-authored-by: Cursor <cursoragent@cursor.com>
…ive_single_run.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>
…improve notebook UX - Rename results_notebook.py to analysis.py (better reflects purpose) - Fix missing 'condition' column in build_master_traces_frame: extract condition_condition from trace metadata and expose it as 'condition' - Add Plotly misalignment heatmap (condition × model, follows_reckless_pattern rate) as a headline dashboard figure; falls back to bar chart when condition data is absent - Replace verbose trace detail for-loop with collapsible HTML <details> accordion cards — colour-coded score badges in the summary line, full input/output/judge commentary hidden until expanded Co-authored-by: Cursor <cursoragent@cursor.com>
…esults.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>
…nspect_results Co-authored-by: Cursor <cursoragent@cursor.com>
…ractive_single_run Co-authored-by: Cursor <cursoragent@cursor.com>
…njectMode; remove stale pycache Co-authored-by: Cursor <cursoragent@cursor.com>
…arning summary run_experiment_with_trace_evals swallows per-item errors (auth failures, timeouts) internally without raising, so they bypass the per-variant try/except. Add _check_item_failures() to inspect result item outputs after each variant and append to runtime_warnings when all or some items produced no output. Co-authored-by: Cursor <cursoragent@cursor.com>
…nthropic claude-opus-4 failures claude-opus-4-7 (and newer Anthropic extended-thinking models) return a 400 'temperature is deprecated for this model' error when temperature is present. ADK passes GenerateContentConfig.temperature directly so drop_params=True on the LiteLlm object does not help. - Make AgentSpec.temperature Optional[float] (None = omit from API call) - Remove temperature from base_agent in the config (use provider defaults) - update resolve_agent_spec to propagate None rather than defaulting to 0.7 - Add explanatory comment in _build_generate_content_config Co-authored-by: Cursor <cursoragent@cursor.com>
…it for LiteLLM/Anthropic The previous commit set temperature=None everywhere, which inadvertently caused Gemini models to run at their default (1.0) — too variable for a behavioural experiment. Correct split: - Google/Gemini: temperature=0.2 from base_agent config (low variance, deterministic experiment results) - LiteLLM/Anthropic: temperature hard-omitted in _build_generate_content_config regardless of what the config says (claude-opus-4 series rejects the param) AgentSpec.temperature remains Optional[float] so variants can still override the Gemini temperature if needed. Co-authored-by: Cursor <cursoragent@cursor.com>
…, null only for claude-opus-4-7 Previous approach blanket-dropped temperature for all LiteLLM providers, meaning claude-sonnet and claude-haiku ran without temperature control. Canonical fix: - resolve_agent_spec now uses model_fields_set to propagate explicit null overrides (distinguishes 'not set' from 'explicitly set to null') - _build_generate_content_config passes spec.temperature for LiteLLM as-is - temperature: 0.2 is set in base_agent (applies to all 30 variants) - claude-opus-4-7 variants each carry temperature: ~ to override to None, since that model has deprecated the parameter (returns 400 if present) Result: 25 variants get temperature=0.2, 5 claude-opus-4-7 variants omit it Co-authored-by: Cursor <cursoragent@cursor.com>
- preparation.py: widen PreparedVariantRun.description to str | None (variant.description is optional, so the field can legitimately be None) - analysis.py: add explicit list[dict[str,Any]] annotation for traces variable - README: mention heatmap dashboard in explore-results section; fix detail view description to reflect collapsible cards Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
Fixes that were blocking CI (ruff check in pre-commit and run-code-check): agent.py: - Add module docstring (D100) - Shorten over-length doc/comment lines (W505) analysis.py: - Convert lambda assignment to def (E731) - Reduce _stringify_value return paths from 7 to 6 (PLR0911) - Shorten over-length docstring/comment lines (W505) 02_inspect_results.ipynb: - Add # noqa: A004 to display import (shadows builtin in IPython context) - Rewrite all dict() calls as dict literals in Plotly layout calls (C408) Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
Resolves all 50 ruff violations that were failing CI pre-commit checks: - D100/D101/D102/D103: add module, class, and function docstrings across config_types, experiment, preparation, task, run, and hard_metrics - D205/D400: fix MisalignmentTask class docstring format - W505: shorten doc lines to stay within max-doc-length=88 limit - A002: noqa annotation for `input` argument in llm_judge_evaluator - SIM105: replace try/except/pass with contextlib.suppress - E402: noqa annotation for sys.path-guarded import in run.py Co-authored-by: Cursor <cursoragent@cursor.com>
… in notebooks - Add D100 module docstring and D103 function docstrings to test_agent.py - Add E402 and D103 to nbqa-ruff ignore list in .pre-commit-config.yaml: nbQA converts notebooks to .py before linting so the per-file-ignores for *.ipynb in pyproject.toml do not apply to the nbqa-ruff hook Co-authored-by: Cursor <cursoragent@cursor.com>
…EvaluatorFunction - Configs() call: add type: ignore[call-arg] — openai_api_key and google_api_key are required fields populated from env vars at runtime, not constructor arguments; consistent with usage in knowledge_qa/agent.py - base_evaluator await: add type: ignore[misc] — EvaluatorFunction return type union includes non-awaitable branch that mypy cannot narrow Co-authored-by: Cursor <cursoragent@cursor.com>
… installed) Co-authored-by: Cursor <cursoragent@cursor.com>
…noise) Co-authored-by: Cursor <cursoragent@cursor.com>
- Rewrite heatmap cell to use matplotlib (renders as static PNG in all Jupyter environments; Plotly fig.show()/display(fig) was unreliable in VS Code notebooks) - Add matplotlib as a project dependency - In build_master_traces_frame: fall back to _fetch_scores_df (metrics API) when trace_detail.scores is empty (some Langfuse SDK versions) Co-authored-by: Cursor <cursoragent@cursor.com>
for more information, see https://pre-commit.ci
Collaborator
|
To follow the structure of the repo in another following issue: Move framework code ( Keep in Tests move with the code: framework tests → |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
misalignment_qaas a new reference implementation for the LLM/agents evaluations bootcamp. The experiment probes whether reckless in-context examples can nudge model responses toward harmful behaviour, and whether that effect transfers across different harm domains. It is intentionally minimal — plain LLM completions, no tool use — to make the mechanics transparent and serve as a building block for participants who want to extend it to real agentic systems.Clickup Ticket(s): N/A
Type of Change
Changes Made
implementations/misalignment_qa/) — a YAML-driven experiment runner that tests five in-context-learning conditions (baseline, examples as LLM messages, examples as LLM messages + priority instruction, examples as user context, examples as user context + priority instruction) across six commercial models (three Gemini, three Anthropic), producing 30 variants against a shared 9-task datasetexamples_inject_modeconfig field controls whether examples reach the model as LLM API messages (developer surface) or as plain text inside the user message (end-user surface), implemented viapreparation.pyandtask.pyAgentSpec.temperatureis nowfloat | None;claude-opus-4-7variants carrytemperature: null(that model has deprecated the parameter); all other models usetemperature: 0.2; variant-level null overrides are propagated correctly via Pydanticmodel_fields_setinresolve_agent_spec01_interactive_single_run.ipynb(optional single-item preview),run.py(full 30-variant experiment),02_inspect_results.ipynb(pull results from Langfuse, heatmap dashboard + collapsible trace detail cards)analysis.py— helper module for the results notebook, replacing the oldresults_notebook.py; includes correctconditionmetadata extractionTesting
uv run pytest tests/)uv run mypy <src_dir>)uv run ruff check src_dir/)Manual testing details:
AuthenticationError(invalid key) and aBadRequestError(temperaturedeprecated onclaude-opus-4-7) — both error classes are now surfaced clearly in the warning summary02_inspect_results.ipynbrun end-to-end against a partial dataset (one condition); heatmap rendered correctly and collapsible trace cards displayed as intendedcondition_metadatafields populated, temperature matrix verified programmaticallyScreenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
Participants need
.enventries forGOOGLE_API_KEYand/orANTHROPIC_API_KEYin addition to the standard Langfuse keys. The experiment runs with only one provider's key and reports skipped variants at the end. No infrastructure changes required.Checklist