CAP6640-Spring2026: Benchmarks GPT-3.5, GPT-4, Claude Haiku, and Gemini on GSM8k and TruthfulQA, measuring accuracy, self-consistency, and confidence calibration.
python benchmark natural-language-processing gemini openai self-consistency ucf confidence-calibration anthropic gsm8k llm-evaluation truthfulqa cap6640
-
Updated
May 1, 2026 - Python