corpuslab-eda

EDA and curation platform for Portuguese corpora used in LLM pretraining.

Requirements

  • Python 3.11+
  • uv 0.7+

Quick start

uv python install 3.11
uv venv --python 3.11
uv sync --all-groups
uv run pytest

Common commands

uv run ruff format src/ tests/
uv run ruff check --fix src/ tests/
uv run mypy src/
uv run pytest
uv run corpuslab validate-config configs/example_pipeline.yaml
uv run corpuslab ingest data/raw --format parquet --preview-rows 5
uv run corpuslab detect-language data/raw --backend fasttext --text-column text
uv run corpuslab assess-quality data/raw --text-column text --profile moderate
uv run corpuslab dedup data/raw --text-column text --method exact --execution-mode auto
uv run corpuslab evaluate sample --task quality --input artifacts/gold.csv

CLI usage

The CLI entry point is corpuslab (defined in pyproject.toml).

Run help:

uv run corpuslab --help
uv run corpuslab ingest --help
uv run corpuslab detect-language --help
uv run corpuslab assess-quality --help
uv run corpuslab dedup --help
uv run corpuslab evaluate --help

Available commands:

  • version: shows installed package version.
  • validate-config: validates YAML/TOML pipeline config.
  • ingest: runs M-01 ingestion and schema discovery.
  • detect-language: runs M-02 language detection and PT variant annotation.
  • assess-quality: runs M-03 textual quality signals and filtering.
  • dedup: runs M-04 exact and approximate deduplication.
  • evaluate: runs annotation sampling, agreement, metric computation, and config comparison.

For evaluate sample, the default bundle destination is <input_dir>/evaluation/<task>/annotations.csv, with guidelines.md and sidecar texts/ or clusters/ directories created alongside it. Use --output-csv and --output-guidelines to override those defaults.
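
As an illustration of the default layout described above, the following minimal sketch computes where the bundle files would land. The helper name `default_bundle_paths` is hypothetical, and it assumes that `<input_dir>` means the directory containing the input file; neither is confirmed by the corpuslab API.

```python
from __future__ import annotations

from pathlib import Path


def default_bundle_paths(input_path: Path, task: str) -> dict[str, Path]:
    """Hypothetical helper that mirrors the documented default layout."""
    # Assumption: <input_dir> is the directory that contains the input file.
    bundle_dir = input_path.parent / "evaluation" / task
    return {
        "annotations": bundle_dir / "annotations.csv",
        "guidelines": bundle_dir / "guidelines.md",
    }


paths = default_bundle_paths(Path("artifacts/gold.csv"), "quality")
print(paths["annotations"].as_posix())
print(paths["guidelines"].as_posix())
```

Passing --output-csv or --output-guidelines replaces the corresponding default path.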

Technical documentation

Detailed documentation for M-02, M-03, M-04, and evaluation workflows:

  • docs/technical/detect-language.md
  • docs/technical/assess-quality.md
  • docs/technical/dedup.md
  • docs/technical/evaluate.md

Short evaluation example:

uv run corpuslab evaluate compute \
  --task quality \
  --gold tests/fixtures/gold/gold_quality.csv \
  --output-report artifacts/evaluation/quality_report.json

1) Validate config

uv run corpuslab validate-config configs/example_pipeline.yaml

2) Run ingestion (M-01)

uv run corpuslab ingest data/raw --format parquet --preview-rows 5

Directory ingestion is all-or-nothing: if any parquet shard is unreadable or corrupted, ingestion fails instead of silently dropping that file.
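
The all-or-nothing behavior can be pictured with a small standalone sketch. This is not corpuslab's implementation: the function names are hypothetical, and the check only inspects the 4-byte `PAR1` marker that unencrypted Parquet files carry at both ends, which is a cheap pre-check rather than full validation.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # unencrypted Parquet files begin and end with these bytes


def looks_like_parquet(path: Path) -> bool:
    """Cheap structural check: magic bytes at both ends of the file."""
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def collect_shards_or_fail(directory: Path) -> list[Path]:
    """Return every *.parquet shard, or raise if any one looks corrupted."""
    shards = sorted(directory.glob("*.parquet"))
    bad = [shard for shard in shards if not looks_like_parquet(shard)]
    if bad:
        # All-or-nothing: reject the whole directory instead of skipping files.
        names = ", ".join(shard.name for shard in bad)
        raise ValueError(f"unreadable parquet shard(s): {names}")
    return shards


# Demo with fake shards: one plausible file and one truncated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ok.parquet").write_bytes(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)
    (root / "bad.parquet").write_bytes(b"truncated")
    try:
        collect_shards_or_fail(root)
        failed = False
    except ValueError as error:
        failed = True
        message = str(error)

print(failed)
```

The point of the design is that one bad shard stops the run, so a report never describes a silently reduced dataset.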

Export ingestion report:

uv run corpuslab ingest data/raw \
  --format parquet \
  --text-field text \
  --max-schema-samples 10000 \
  --output-report artifacts/reports/ingestion_report.json

3) Run language detection (M-02)

uv run corpuslab detect-language data/raw \
  --backend fasttext \
  --secondary-backend none \
  --performance-mode throughput \
  --short-text-strategy discriminator_only \
  --text-column text \
  --confidence-threshold 0.7 \
  --code-mix-threshold 0.2 \
  --profile-stages

Export language report and annotated dataset:

uv run corpuslab detect-language data/raw \
  --backend fasttext \
  --secondary-backend none \
  --text-column text \
  --filter-lang pt \
  --filter-variant PT_BR \
  --output-report artifacts/reports/language_report.json \
  --output-annotated artifacts/outputs/annotated.parquet

Notes:

  • --filter-lang and --filter-variant can be repeated.
  • --output-report writes JSON output.
  • --output-annotated writes a Parquet file with annotation columns: lang, lang_confidence, pt_variant, variant_confidence, code_mix_ratio, code_mix_langs.
  • detect-language accepts parquet input only and rejects legacy JSONL paths.
  • --profile-stages prints timing breakdown and runtime counters.
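
When the command is driven from a script, repeated filters translate to one flag occurrence per value. The sketch below assembles such an argument list using only flags shown in this README. The helper name `build_detect_language_cmd` is hypothetical, and the values `es` and `PT_PT` are illustrative assumptions; only `pt` and `PT_BR` appear in the examples above.

```python
from __future__ import annotations

import shlex
from collections.abc import Sequence


def build_detect_language_cmd(
    input_path: str,
    langs: Sequence[str] = (),
    variants: Sequence[str] = (),
    report_path: str | None = None,
) -> list[str]:
    """Hypothetical helper: build argv from flags documented in this README."""
    cmd = ["uv", "run", "corpuslab", "detect-language", input_path]
    cmd += ["--text-column", "text"]
    for lang in langs:
        cmd += ["--filter-lang", lang]  # one flag occurrence per value
    for variant in variants:
        cmd += ["--filter-variant", variant]
    if report_path is not None:
        cmd += ["--output-report", report_path]
    return cmd


# "es" and "PT_PT" are assumed example values, not confirmed by the README.
cmd = build_detect_language_cmd(
    "data/raw",
    langs=["pt", "es"],
    variants=["PT_BR", "PT_PT"],
    report_path="artifacts/reports/language_report.json",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns.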

4) Run quality assessment (M-03)

uv run corpuslab assess-quality data/raw \
  --text-column text \
  --signals all \
  --profile moderate \
  --batch-size 1024 \
  --workers 4

Export quality outputs:

uv run corpuslab assess-quality data/raw \
  --text-column text \
  --profile aggressive \
  --output-report artifacts/reports/quality_report.json

Notes:

  • The current CLI runs assess-quality in chunked report-only mode.
  • --output-annotated and --output-filtered are available through the Python API and QualityService, not through the current CLI workflow.

5) Run deduplication (M-04)

uv run corpuslab dedup data/raw \
  --text-column text \
  --method all \
  --execution-mode auto \
  --input-chunk-rows 32768 \
  --cluster-index-topk 1000 \
  --output-report artifacts/reports/dedup_report.json \
  --output-deduped artifacts/outputs/deduped.parquet \
  --output-clusters artifacts/outputs/dedup_clusters.json

Notes:

  • --execution-mode auto uses chunked parquet mode when possible.
  • --execution-mode chunked requires parquet input.
  • --output-deduped accepts .parquet only.
  • Chunked cluster export writes compact JSON index plus *.members.parquet.
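
The constraints in these notes can be checked before launching a long run. The following pre-flight sketch is a hypothetical helper, not part of corpuslab, and it assumes that a parquet input is either a `.parquet` file or a directory of shards.

```python
from __future__ import annotations

from pathlib import Path


def check_dedup_args(
    input_path: Path,
    execution_mode: str,
    output_deduped: Path | None,
) -> list[str]:
    """Hypothetical pre-flight check; returns a list of problems found."""
    problems: list[str] = []
    if output_deduped is not None and output_deduped.suffix != ".parquet":
        problems.append("--output-deduped accepts .parquet only")
    # Assumption: parquet input means a *.parquet file or a directory of shards.
    is_parquet_input = input_path.suffix == ".parquet" or input_path.is_dir()
    if execution_mode == "chunked" and not is_parquet_input:
        problems.append("--execution-mode chunked requires parquet input")
    return problems


bad = check_dedup_args(Path("corpus.jsonl"), "chunked", Path("deduped.csv"))
good = check_dedup_args(Path("corpus.parquet"), "chunked", Path("deduped.parquet"))
print(bad)
print(good)
```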

Package layout

Code lives in src/corpuslab/ and tests live in tests/.

The 8 MVP modules are:

  • ingestion
  • langdetect
  • quality
  • dedup
  • evaluation
  • profiling
  • orchestration
  • export

Contribution conventions

See CONTRIBUTING.md for branching, commit messages, style, and local checks.

Programmatic ingestion

from pathlib import Path

from corpuslab.ingestion import IngestionConfig, IngestionService

service = IngestionService()
result = service.ingest(
    IngestionConfig(
        input_path=Path("data/raw"),
        format="parquet",
        max_schema_samples=10000,
        preview_rows=5,
    )
)

print(result.report.inferred_schema.model_dump())
print(result.report.sample_preview)

Programmatic language detection, quality assessment, and deduplication

Public M-02/M-03/M-04 APIs are path-only for parquet datasets. DataFrame and LazyFrame inputs are reserved for private synthetic-test helpers so production code cannot accidentally trigger whole-input materialization.

from pathlib import Path

from corpuslab.dedup import DedupConfig, deduplicate
from corpuslab.langdetect import LanguageDetectionConfig, detect_languages
from corpuslab.quality import QualityConfig, assess_quality

ingested_path = Path("artifacts/outputs/ingested.parquet")
language_path = Path("artifacts/outputs/langdetect_annotated.parquet")
quality_path = Path("artifacts/outputs/quality_annotated.parquet")
filtered_path = Path("artifacts/outputs/quality_filtered.parquet")
deduped_path = Path("artifacts/outputs/deduped.parquet")

lang_result = detect_languages(
    input_path=ingested_path,
    config=LanguageDetectionConfig(
        backend="langdetect",
        text_column="text",
        confidence_threshold=0.7,
    ),
    output_annotated=language_path,
)

print(lang_result.report.model_dump())

quality_result = assess_quality(
    input_path=language_path,
    config=QualityConfig(
        text_column="text",
        profile="moderate",
        enable_perplexity=False,
    ),
    output_annotated=quality_path,
    output_filtered=filtered_path,
)

print(quality_result.report.model_dump())

dedup_result = deduplicate(
    input_path=filtered_path,
    config=DedupConfig(
        text_column="text",
        method="exact",
        normalization_level="moderate",
    ),
    output_deduped=deduped_path,
)

print(dedup_result.report.model_dump())
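
To make the idea behind `method="exact"` concrete, here is a standalone sketch of exact deduplication by hashing normalized text. It is not corpuslab's implementation. In particular, what `normalization_level="moderate"` does is not documented here, so the normalization below (NFKC, lowercasing, whitespace collapsing) is an assumption.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def exact_dedup(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized text, preserving order."""
    seen: set[bytes] = set()
    kept: list[str] = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Olá,  mundo!", "olá, mundo!", "Outro documento."]
unique_docs = exact_dedup(docs)
print(unique_docs)
```

Storing fixed-size digests instead of full texts keeps the memory cost of the seen set independent of document length.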

Scalable chunked language detection

from pathlib import Path

from corpuslab.langdetect import LanguageDetectionConfig, detect_languages

result = detect_languages(
    input_path=Path("artifacts/outputs/ingested.parquet"),
    config=LanguageDetectionConfig(
        backend="fasttext",
        text_column="text",
        annotation_chunk_rows=4096,
    ),
    output_annotated=Path("artifacts/outputs/langdetect_annotated.parquet"),
)

print(result.report.model_dump())
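
The effect of a setting such as `annotation_chunk_rows=4096` can be pictured with a generic chunking sketch: rows are consumed in fixed-size batches so that only one batch is held in memory at a time. This illustrates the general technique only; the function `iter_chunks` is hypothetical and says nothing about how corpuslab reads parquet internally.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def iter_chunks(rows: Iterable[str], chunk_rows: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most chunk_rows items from any iterable."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_rows)):
        yield chunk


# A generator stands in for a large dataset that is never fully materialized.
documents = (f"doc {i}" for i in range(10_000))
sizes = [len(chunk) for chunk in iter_chunks(documents, 4096)]
print(sizes)
```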

Notebook-style quality analysis snippet

from pathlib import Path

import polars as pl
import plotly.express as px

from corpuslab.quality import QualityConfig, assess_quality

result = assess_quality(
    input_path=Path("artifacts/outputs/annotated.parquet"),
    config=QualityConfig(text_column="text", profile="moderate", enable_perplexity=False),
    output_annotated=Path("artifacts/outputs/quality_annotated.parquet"),
)

# Convert to pandas once, outside the loop, instead of once per metric.
annotated = pl.read_parquet("artifacts/outputs/quality_annotated.parquet").to_pandas()
for metric in ["quality_score", "ocr_noise_score", "boilerplate_score", "repetition_score"]:
    fig = px.histogram(annotated, x=metric, nbins=40, title=metric)
    fig.show()

Registering a custom quality signal

from __future__ import annotations

import polars as pl

from corpuslab.quality.config import QualityConfig
from corpuslab.quality.models import SignalCost
from corpuslab.quality.signals.registry import register_signal


@register_signal(
    name="has_question_mark",
    dtype="bool",
    cost=SignalCost.CHEAP,
    higher_is_better=True,
    mode="expression",
)
def expr_has_question_mark(text_col: str, config: QualityConfig) -> pl.Expr:
    _ = config
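    # In a raw string, r"\?" is the regex for a literal question mark;
    # r"\\?" would mean "an optional backslash" and match every row.
    return pl.col(text_col).str.contains(r"\?").alias("has_question_mark")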

About

Corpuslab is a Python platform for exploratory data analysis and curation of Portuguese corpora, focused on preparing high-quality datasets for LLM pretraining.
