Multilingual hallucination evaluation framework for Large Language Models across Indian languages using TruthfulQA, NLLB-200, and mechanistic interpretability.
CAP6640-Spring2026: Benchmarks GPT-3.5, GPT-4, Claude Haiku, and Gemini on GSM8K and TruthfulQA, measuring accuracy, self-consistency, and confidence calibration.
Does instruction tuning make language models more sycophantic? A paired causal study across Qwen, Llama, and Gemma on TruthfulQA, showing the effect is family-dependent in both magnitude and direction. 7,200 evaluations, 12 ATE estimates with paired t-tests and bootstrap CIs.
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).