CosmonautCode/PySentence-Extraction-Toolkit

PySET - Python Sentence Extraction Toolkit

Python Sentence Extraction Toolkit (PySET) is a high-performance, zero-dependency sentence boundary detection library that splits any provided text into individual sentences. It was initially built to fill the need for an efficient, modern way to break text into sentences and form coherent chunks when indexing into a vector database, but it has a wide range of applications.

Built for Python 3.14

Features

  • 85 intelligent rules for accurate sentence splitting
  • 52 languages supported
  • Zero dependencies (pure Python standard library)
  • Roughly 2-4x faster than PySBD in internal benchmarks (see Performance)
  • More accurate on complex legal documents
  • Extensible design for custom rules

Installation

pip install pysentence-extraction-toolkit

Quick Start

from pyset import TokenBoundaryDetector

detector = TokenBoundaryDetector()

# Simple usage
text = "Hello world. How are you? I'm doing great!"
sentences = detector.split(text)

print(sentences)
# ['Hello world.', 'How are you?', "I'm doing great!"]

Performance

Based on internal benchmarks vs PySBD:

Text Size    Words    PySET      PySBD       Speedup
Sentences    ~5       0.05 ms    0.10 ms     2.0x
Paragraph    ~104     0.60 ms    1.37 ms     2.3x
Article      ~484     2.41 ms    5.25 ms     2.2x
Document     ~1400    5.68 ms    21.95 ms    3.9x

PySET processes 158,000+ words/second vs 63,000 for PySBD.
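
The benchmark method is not described here, so the following is only a minimal sketch of how a words/second figure could be measured with the standard library. `naive_split` is a hypothetical stand-in splitter and `words_per_second` is an illustrative helper; neither is part of PySET. To time the library itself, pass `detector.split` from the Quick Start in place of `naive_split`.

```python
import re
import timeit


def naive_split(text):
    """Hypothetical stand-in splitter: cut after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def words_per_second(split, text, repeats=200):
    """Time `split(text)` over `repeats` runs and return words processed per second."""
    n_words = len(text.split())
    seconds = timeit.timeit(lambda: split(text), number=repeats)
    return n_words * repeats / seconds


sample = "Hello world. How are you? I'm doing great! " * 50
print(f"{words_per_second(naive_split, sample):,.0f} words/second")
```

Results depend heavily on hardware, Python version, and input text, so figures from this sketch are not comparable with the table above.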

Why PySET?

  • Zero dependencies - No external packages required
  • 85 rules handling edge cases like abbreviations, URLs, emails, decimals, quotes
  • Accurate - Priority-based rule evaluation for correct decisions
  • Fast - Pre-compiled patterns and optimized algorithms
  • Extensible - Easy to add custom rules
  • Well tested - 100% accuracy on 52 languages
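
To illustrate what priority-based rule evaluation can look like, here is a small self-contained sketch. It is not PySET's implementation: the `Rule` type, the three rules, their priorities, and the abbreviation set are all hypothetical. Each rule looks at a candidate terminator and answers "boundary", "not a boundary", or "no opinion", and the first rule with an opinion (in priority order) decides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# True = boundary, False = not a boundary, None = no opinion (ask the next rule).
Decision = Optional[bool]


@dataclass(frozen=True)
class Rule:
    priority: int  # lower value is consulted first
    name: str
    decide: Callable[[str, int], Decision]


ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}  # illustrative, not PySET's list


def abbreviation_rule(text: str, i: int) -> Decision:
    """A period right after a known abbreviation does not end a sentence."""
    if text[i] != ".":
        return None
    words = text[:i].split()
    if words and words[-1].lower() in ABBREVIATIONS:
        return False
    return None


def decimal_rule(text: str, i: int) -> Decision:
    """A period between two digits (3.50) does not end a sentence."""
    inside_number = (
        text[i] == "."
        and 0 < i < len(text) - 1
        and text[i - 1].isdigit()
        and text[i + 1].isdigit()
    )
    return False if inside_number else None


def default_rule(text: str, i: int) -> Decision:
    """Fallback: a terminator ends a sentence at end of text or before whitespace."""
    return i + 1 == len(text) or text[i + 1].isspace()


RULES = sorted(
    [
        Rule(10, "abbreviation", abbreviation_rule),
        Rule(20, "decimal", decimal_rule),
        Rule(99, "default", default_rule),
    ],
    key=lambda rule: rule.priority,
)


def split_sentences(text: str) -> List[str]:
    sentences, start = [], 0
    for i, char in enumerate(text):
        if char not in ".!?":
            continue
        # The first rule (by priority) that returns a non-None answer wins.
        decisions = (rule.decide(text, i) for rule in RULES)
        is_boundary = next((d for d in decisions if d is not None), False)
        if is_boundary:
            sentence = text[start : i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid $3.50 at example.com. He left."))
# ['Dr. Smith paid $3.50 at example.com.', 'He left.']
```

Ordering the specific rules ahead of the generic fallback is what keeps "Dr." and "3.50" from being treated as sentence ends.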

Configuration Options

Parameter                   Type         Default    Description
language                    str          'en'       Language code
min_sentence_length         int          1          Minimum sentence length
aggressive_abbreviations    bool         False      Stricter abbreviation handling
merge_short_sentences       bool         False      Merge short sentences
include_rules               List[int]    None       Use specific rules
exclude_rules               List[int]    None       Exclude specific rules
debug                       bool         False      Enable debug logging
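
As a hedged example, the options above might be collected and passed like this. The option names and types come from the table, but passing them as keyword arguments to the `TokenBoundaryDetector` constructor is an assumption that this README does not confirm, and `build_detector` is a hypothetical helper. The unit of `min_sentence_length` is not stated in the table either, so the value below is only a placeholder.

```python
# Option names are taken from the Configuration Options table.
DETECTOR_OPTIONS = {
    "language": "en",
    "min_sentence_length": 3,       # unit (characters or words) is not documented here
    "merge_short_sentences": True,
    "debug": False,
}


def build_detector(options=DETECTOR_OPTIONS):
    """Hypothetical helper: assumes the constructor accepts these keyword arguments."""
    # Imported lazily so this module loads even when PySET is not installed.
    from pyset import TokenBoundaryDetector

    return TokenBoundaryDetector(**options)
```

If the constructor rejects these arguments, check the package documentation for where the options are actually set.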

Use Cases

  • Document chunking for LLMs and RAG systems
  • Text preprocessing for NLP pipelines
  • Legal document analysis
  • News article segmentation
  • Academic paper processing
  • Content extraction and cleaning
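
For the document chunking use case, one simple approach is to pack whole sentences into chunks up to a size limit so that no chunk cuts a sentence in half. The sketch below is not part of PySET: `chunk_sentences` and `max_chars` are hypothetical names, and it works on any list of sentence strings, such as the output of `detector.split(text)` from the Quick Start.

```python
def chunk_sentences(sentences, max_chars=200):
    """Greedily group sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for sentence in sentences:
        # Account for the joining space when the chunk already has content.
        added = len(sentence) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            added = len(sentence)
        current.append(sentence)
        size += added
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["Hello world.", "How are you?", "I'm doing great!"]
print(chunk_sentences(sentences, max_chars=30))
# ['Hello world. How are you?', "I'm doing great!"]
```

A character budget is used here for simplicity; a token budget from the embedding model's tokenizer would be the more precise limit for a vector database.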

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=pyset

License

MIT License - See LICENSE file for details.


PySET - Python Sentence Extraction Toolkit
Accurate. Fast. Zero Dependencies.
