An End-to-End Evaluation Framework for Entity Resolution Systems
Updated Dec 3, 2023 - Python
Save up to 40% on agent token costs with code graphs: call graphs, dependency graphs, dead code detection, and blast radius analysis.
A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.
A metrics library for evaluating vision-language models within the PyTorch ecosystem.
✍️ Collaborate on writing technical content for the Giskard Community
An open-source Streamlit web app that generates confusion matrices for multi-class machine learning models. It supports numeric and string labels, CSV upload, manual label entry, and custom color maps, and it displays evaluation metrics such as accuracy, precision, recall, and F1-score. Users can download the confusion matrix as an image.
Safety-first legal NLP system with hierarchical long-document processing, deterministic inference, clause extraction, and rule-based risk engine — built for traceability and deployment constraints.
Measurement-disciplined optimizer for Claude Agent Skills — three-gate system (stability, effect size, function preservation), anchored rubric, train/holdout split. Inspired by Karpathy's autoresearch and alchaincyf's darwin-skill.
Enterprise-grade machine learning observability platform that detects data drift, concept drift, and performance degradation in production models. Features statistical drift detection (KS test, PSI), real-time alerting, Redis caching, and FastAPI backend.
Data Science Challenge from a Coursera project: Loan Default Prediction
This project contains code and paperwork for the course CSI5155 at the University of Ottawa (taught by Professor Dr. Herna Viktor).
Local-first stratified-sample audit UI for ML classifiers and labeled datasets. Wilson CIs, keyboard-first, no cloud.
GitHub Action for SWE-bench Pro evaluation powered by mcpbr
Small, educational project that shows how to build a **minimal RAG pipeline** with a **simple evaluation loop**
Public scorecard of how 25+ ML eval claims meet 9 PRML falsifiability criteria. CC0 data; MIT tooling.
End-to-end E-Commerce Recommendation System using implicit feedback, featuring Popularity, Item-Item CF, ALS (Matrix Factorization), and a Hybrid model, with offline evaluation and online serving via FastAPI + Streamlit.
Evaluation of system-level risks in content moderation models using policy-driven metrics, identity-based analysis, and governance-aligned datasets.
Static HTML backtest reports from eval JSON (calibration, Brier, CLV, optional bet ledger)
Collection of Machine Learning (ML) and Natural Language Processing (NLP) projects showcasing a range of applications, algorithms, and techniques.
A decision-oriented benchmark framework for evaluating action-conditioned world models beyond static AI benchmarks.