EDA and curation platform for Portuguese corpora used in LLM pretraining.
- Python 3.11+
- uv 0.7+
uv python install 3.11
uv venv --python 3.11
uv sync --all-groups
uv run pytest
uv run ruff format src/ tests/
uv run ruff check --fix src/ tests/
uv run mypy src/
uv run pytest
uv run corpuslab validate-config configs/example_pipeline.yaml
uv run corpuslab ingest data/raw --format parquet --preview-rows 5
uv run corpuslab detect-language data/raw --backend fasttext --text-column text
uv run corpuslab assess-quality data/raw --text-column text --profile moderate
uv run corpuslab dedup data/raw --text-column text --method exact --execution-mode auto
uv run corpuslab evaluate sample --task quality --input artifacts/gold.csv
The CLI entry point is corpuslab (defined in pyproject.toml).
Run help:
uv run corpuslab --help
uv run corpuslab ingest --help
uv run corpuslab detect-language --help
uv run corpuslab assess-quality --help
uv run corpuslab dedup --help
uv run corpuslab evaluate --help
Available commands:
- version: shows installed package version.
- validate-config: validates YAML/TOML pipeline config.
- ingest: runs M-01 ingestion and schema discovery.
- detect-language: runs M-02 language detection and PT variant annotation.
- assess-quality: runs M-03 textual quality signals and filtering.
- dedup: runs M-04 exact and approximate deduplication.
- evaluate: runs annotation sampling, agreement, metric computation, and config comparison.
For evaluate sample, the default bundle destination is
<input_dir>/evaluation/<task>/annotations.csv, with guidelines.md and sidecar
texts/ or clusters/ directories created alongside it. Use --output-csv and
--output-guidelines to override those defaults.
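The default layout described above can be sketched with pathlib. This is an illustration only, not corpuslab code: the helper name default_bundle_paths is hypothetical, and it assumes that <input_dir> means the directory containing the file passed to --input.

```python
from pathlib import Path


def default_bundle_paths(input_file: Path, task: str) -> dict[str, Path]:
    """Hypothetical helper mirroring the documented default layout."""
    # Assumption: <input_dir> is the directory that contains the input file.
    bundle_dir = input_file.parent / "evaluation" / task
    return {
        "annotations": bundle_dir / "annotations.csv",
        "guidelines": bundle_dir / "guidelines.md",
    }


paths = default_bundle_paths(Path("artifacts/gold.csv"), "quality")
print(paths["annotations"].as_posix())
print(paths["guidelines"].as_posix())
```

Under that assumption, the sidecar texts/ or clusters/ directory would sit next to annotations.csv inside the same bundle directory.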
Detailed documentation for M-02, M-03, M-04, and evaluation workflows:
- docs/technical/detect-language.md
- docs/technical/assess-quality.md
- docs/technical/dedup.md
- docs/technical/evaluate.md
Short evaluation example:
uv run corpuslab evaluate compute \
--task quality \
--gold tests/fixtures/gold/gold_quality.csv \
--output-report artifacts/evaluation/quality_report.json
uv run corpuslab validate-config configs/example_pipeline.yaml
uv run corpuslab ingest data/raw --format parquet --preview-rows 5
Directory ingestion is all-or-nothing: if any parquet shard is unreadable or corrupted, ingestion fails instead of silently dropping that file.
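To illustrate the all-or-nothing idea, here is a minimal standard-library sketch. It is not how corpuslab validates shards: the function names are hypothetical, and the check only looks for the 4-byte PAR1 magic that Parquet files carry at both ends, which is much weaker than actually reading the file.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these 4 bytes.


def looks_like_parquet(path: Path) -> bool:
    """Cheap structural check: magic bytes at both ends of the file."""
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 means "relative to the end of the file"
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def collect_shards(directory: Path) -> list[Path]:
    """Return every shard, or raise if any single shard fails the check."""
    shards = sorted(directory.glob("*.parquet"))
    bad = [shard for shard in shards if not looks_like_parquet(shard)]
    if bad:
        names = ", ".join(shard.name for shard in bad)
        raise ValueError(f"unreadable parquet shard(s): {names}")
    return shards
```

The point of the design is that the caller never receives a partial shard list: one bad file fails the whole run, so a corrupted shard cannot silently shrink the corpus.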
Export ingestion report:
uv run corpuslab ingest data/raw \
--format parquet \
--text-field text \
--max-schema-samples 10000 \
--output-report artifacts/reports/ingestion_report.json
uv run corpuslab detect-language data/raw \
--backend fasttext \
--secondary-backend none \
--performance-mode throughput \
--short-text-strategy discriminator_only \
--text-column text \
--confidence-threshold 0.7 \
--code-mix-threshold 0.2 \
--profile-stages
Export language report and annotated dataset:
uv run corpuslab detect-language data/raw \
--backend fasttext \
--secondary-backend none \
--text-column text \
--filter-lang pt \
--filter-variant PT_BR \
--output-report artifacts/reports/language_report.json \
--output-annotated artifacts/outputs/annotated.parquet
Notes:
- --filter-lang and --filter-variant can be repeated.
- --output-report writes JSON output.
- --output-annotated writes a Parquet file with annotation columns: lang, lang_confidence, pt_variant, variant_confidence, code_mix_ratio, code_mix_langs.
- detect-language accepts parquet input only and rejects legacy JSONL paths.
- --profile-stages prints timing breakdown and runtime counters.
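One plausible reading of the repeatable filters is sketched below in plain Python. The semantics are an assumption, not documented behaviour: repeated values of one flag are treated as alternatives (OR), and the two flags are combined with AND. The helper name keep_row, the sample rows, and the PT_PT label are made up for illustration.

```python
def keep_row(row: dict, langs: set[str], variants: set[str]) -> bool:
    """Assumed semantics: OR within one flag, AND across the two flags."""
    lang_ok = not langs or row["lang"] in langs
    variant_ok = not variants or row["pt_variant"] in variants
    return lang_ok and variant_ok


# Made-up rows using the documented annotation column names.
rows = [
    {"lang": "pt", "pt_variant": "PT_BR"},
    {"lang": "pt", "pt_variant": "PT_PT"},  # assumed variant label
    {"lang": "en", "pt_variant": None},
]

# Equivalent in spirit to: --filter-lang pt --filter-variant PT_BR
kept = [row for row in rows if keep_row(row, {"pt"}, {"PT_BR"})]
print(kept)
```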
uv run corpuslab assess-quality data/raw \
--text-column text \
--signals all \
--profile moderate \
--batch-size 1024 \
--workers 4
Export quality outputs:
uv run corpuslab assess-quality data/raw \
--text-column text \
--profile aggressive \
--output-report artifacts/reports/quality_report.json
Notes:
- The current CLI runs assess-quality in chunked report-only mode.
- --output-annotated and --output-filtered are available through the Python API and QualityService, not through the current CLI workflow.
uv run corpuslab dedup data/raw \
--text-column text \
--method all \
--execution-mode auto \
--input-chunk-rows 32768 \
--cluster-index-topk 1000 \
--output-report artifacts/reports/dedup_report.json \
--output-deduped artifacts/outputs/deduped.parquet \
--output-clusters artifacts/outputs/dedup_clusters.json
Notes:
- --execution-mode auto uses chunked parquet mode when possible.
- --execution-mode chunked requires parquet input.
- --output-deduped accepts .parquet only.
- Chunked cluster export writes a compact JSON index plus *.members.parquet.
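For readers new to exact deduplication, the general technique can be sketched with the standard library: normalize each text, hash it, and keep the first document per hash. This is a generic illustration, not the corpuslab implementation. The function names and the normalization steps (NFKC, casefold, whitespace collapse) are assumptions and do not describe what any normalization_level setting does.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalization: NFKC, casefold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def exact_dedup(texts: list[str]) -> tuple[list[str], dict[str, list[int]]]:
    """Keep the first text per normalized hash and report duplicate clusters."""
    clusters: dict[str, list[int]] = {}
    for index, text in enumerate(texts):
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        clusters.setdefault(digest, []).append(index)
    # Dicts preserve insertion order, so kept texts stay in input order.
    kept = [texts[members[0]] for members in clusters.values()]
    duplicates = {d: m for d, m in clusters.items() if len(m) > 1}
    return kept, duplicates


kept, duplicates = exact_dedup(["Olá  mundo", "olá mundo", "Bom dia"])
print(kept)
print(list(duplicates.values()))
```

Approximate deduplication needs a different approach, since a hash of the normalized text only catches documents that become byte-identical after normalization.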
Code lives in src/corpuslab/ and tests live in tests/.
The MVP modules are:
- ingestion
- langdetect
- quality
- dedup
- evaluation
- profiling
- orchestration
- export
See CONTRIBUTING.md for branching, commit messages, style, and local checks.
from pathlib import Path
from corpuslab.ingestion import IngestionConfig, IngestionService
service = IngestionService()
result = service.ingest(
    IngestionConfig(
        input_path=Path("data/raw"),
        format="parquet",
        max_schema_samples=10000,
        preview_rows=5,
    )
)
print(result.report.inferred_schema.model_dump())
print(result.report.sample_preview)
Public M-02/M-03/M-04 APIs are path-only for parquet datasets. DataFrame and LazyFrame inputs are reserved for private synthetic-test helpers so production code cannot accidentally trigger whole-input materialization.
from pathlib import Path
from corpuslab.dedup import DedupConfig, deduplicate
from corpuslab.langdetect import LanguageDetectionConfig, detect_languages
from corpuslab.quality import QualityConfig, assess_quality
ingested_path = Path("artifacts/outputs/ingested.parquet")
language_path = Path("artifacts/outputs/langdetect_annotated.parquet")
quality_path = Path("artifacts/outputs/quality_annotated.parquet")
filtered_path = Path("artifacts/outputs/quality_filtered.parquet")
deduped_path = Path("artifacts/outputs/deduped.parquet")
lang_result = detect_languages(
    input_path=ingested_path,
    config=LanguageDetectionConfig(
        backend="langdetect",
        text_column="text",
        confidence_threshold=0.7,
    ),
    output_annotated=language_path,
)
print(lang_result.report.model_dump())
quality_result = assess_quality(
    input_path=language_path,
    config=QualityConfig(
        text_column="text",
        profile="moderate",
        enable_perplexity=False,
    ),
    output_annotated=quality_path,
    output_filtered=filtered_path,
)
print(quality_result.report.model_dump())
dedup_result = deduplicate(
    input_path=filtered_path,
    config=DedupConfig(
        text_column="text",
        method="exact",
        normalization_level="moderate",
    ),
    output_deduped=deduped_path,
)
print(dedup_result.report.model_dump())

from pathlib import Path
from corpuslab.langdetect import LanguageDetectionConfig, detect_languages
result = detect_languages(
    input_path=Path("artifacts/outputs/ingested.parquet"),
    config=LanguageDetectionConfig(
        backend="fasttext",
        text_column="text",
        annotation_chunk_rows=4096,
    ),
    output_annotated=Path("artifacts/outputs/langdetect_annotated.parquet"),
)
print(result.report.model_dump())

from pathlib import Path
import polars as pl
import plotly.express as px
from corpuslab.quality import QualityConfig, assess_quality
result = assess_quality(
    input_path=Path("artifacts/outputs/annotated.parquet"),
    config=QualityConfig(text_column="text", profile="moderate", enable_perplexity=False),
    output_annotated=Path("artifacts/outputs/quality_annotated.parquet"),
)
annotated = pl.read_parquet("artifacts/outputs/quality_annotated.parquet")
# Convert once, outside the loop, instead of once per metric.
frame = annotated.to_pandas()
for metric in ["quality_score", "ocr_noise_score", "boilerplate_score", "repetition_score"]:
    fig = px.histogram(frame, x=metric, nbins=40, title=metric)
    fig.show()

from __future__ import annotations
import polars as pl
from corpuslab.quality.config import QualityConfig
from corpuslab.quality.models import SignalCost
from corpuslab.quality.signals.registry import register_signal
@register_signal(
    name="has_question_mark",
    dtype="bool",
    cost=SignalCost.CHEAP,
    higher_is_better=True,
    mode="expression",
)
def expr_has_question_mark(text_col: str, config: QualityConfig) -> pl.Expr:
    _ = config  # unused; kept to match the signal function signature
    # In a raw string, \? matches a literal question mark.
    return pl.col(text_col).str.contains(r"\?").alias("has_question_mark")
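When writing a regex-based signal like the one above, the escaping inside a raw string matters: \? matches a literal question mark, whereas \\? means "an optional backslash" and therefore matches every string. The check below uses the standard-library re module as a stand-in for the polars regex engine, which is an assumption of convenience. Both engines treat these two patterns the same way.

```python
import re

literal_question = re.compile(r"\?")     # a literal "?"
optional_backslash = re.compile(r"\\?")  # zero or one backslash

samples = ["Tudo bem?", "Tudo bem."]

# Only the first sample contains a question mark.
print([bool(literal_question.search(s)) for s in samples])
# The optional-backslash pattern matches the empty string, so it always hits.
print([bool(optional_backslash.search(s)) for s in samples])
```

A signal built on the second pattern would report True for every row, so it is worth testing a new pattern on a few strings before registering it.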