Python Sentence Extraction Toolkit (PySET) is a high-performance, zero-dependency sentence boundary detection library that splits any provided text into individual sentences. It was initially built to fill the need for an efficient, modern way to split text into sentences so that coherent chunks can be formed when indexing into a vector database, but it has a wide range of applications.
Built for Python 3.14
- 85 intelligent rules for accurate sentence splitting
- 52 languages supported
- Zero dependencies (pure Python standard library)
- 2-4x faster than PySBD in internal benchmarks
- More accurate on complex legal documents
- Extensible design for custom rules
pip install pysentence-extraction-toolkit

from pyset import TokenBoundaryDetector
detector = TokenBoundaryDetector()
# Simple usage
text = "Hello world. How are you? I'm doing great!"
sentences = detector.split(text)
print(sentences)
# ['Hello world.', 'How are you?', "I'm doing great!"]

Based on internal benchmarks vs PySBD:

| Text Size | Words | PySET | PySBD | Speedup |
|---|---|---|---|---|
| Sentences | ~5 | 0.05ms | 0.10ms | 2.0x |
| Paragraph | ~104 | 0.60ms | 1.37ms | 2.3x |
| Article | ~484 | 2.41ms | 5.25ms | 2.2x |
| Document | ~1400 | 5.68ms | 21.95ms | 3.9x |
PySET processes 158,000+ words/second vs 63,000 for PySBD.
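
The figures above come from the project's internal benchmarks. As a rough sketch of how a words-per-second number of this kind might be measured, the snippet below times a splitter over repeated runs. `naive_split` and `words_per_second` are hypothetical helpers written for this illustration; the regex splitter is only a stand-in so the code runs without any package installed, and its timings say nothing about PySET or PySBD.

```python
import re
import time


def naive_split(text: str) -> list[str]:
    """Stand-in splitter (NOT PySET): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words_per_second(split_fn, text: str, repeats: int = 200) -> float:
    """Run `split_fn` on `text` `repeats` times and convert the total to words/second."""
    n_words = len(text.split())
    start = time.perf_counter()
    for _ in range(repeats):
        split_fn(text)
    elapsed = time.perf_counter() - start
    return n_words * repeats / elapsed


sample = "Hello world. How are you? I'm doing great! " * 50
rate = words_per_second(naive_split, sample)
print(f"{rate:,.0f} words/second")
```

To compare real libraries, pass their split functions (for example `detector.split`) in place of `naive_split` and use the same sample text for each.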
- Zero dependencies - No external packages required
- 85 rules handling edge cases such as abbreviations, URLs, emails, decimals, and quotes
- Accurate - Priority-based rule evaluation for correct decisions
- Fast - Pre-compiled patterns and optimized algorithms
- Extensible - Easy to add custom rules
- Well tested - 100% accuracy on 52 languages
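
To illustrate what priority-based rule evaluation can look like, here is a minimal sketch. It is not PySET's implementation: the `Rule` class, the three example rules, the priority numbers, and the abbreviation list are all assumptions made for this example. Each rule looks at a candidate position and either decides (split or do not split) or abstains, and the first rule in priority order that decides wins.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A rule returns True (split here), False (do not split), or None (no opinion).
Decision = Optional[bool]


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number = evaluated first
    name: str
    check: Callable[[str, int], Decision]


ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e"}  # illustrative, far from complete


def abbreviation_rule(text: str, i: int) -> Decision:
    """Veto a split when the period ends a known abbreviation."""
    if text[i] != ".":
        return None
    words = text[:i].split()
    if words and words[-1].lower() in ABBREVIATIONS:
        return False
    return None


def decimal_rule(text: str, i: int) -> Decision:
    """Veto a split when the period sits between two digits."""
    if text[i] == "." and 0 < i < len(text) - 1:
        if text[i - 1].isdigit() and text[i + 1].isdigit():
            return False
    return None


def terminal_rule(text: str, i: int) -> Decision:
    """Split at ., ! or ? when followed by whitespace or the end of the text."""
    at_end = i == len(text) - 1
    if text[i] in ".!?" and (at_end or text[i + 1].isspace()):
        return True
    return None


RULES = sorted(
    [
        Rule(10, "abbreviation", abbreviation_rule),
        Rule(20, "decimal", decimal_rule),
        Rule(90, "terminal", terminal_rule),
    ],
    key=lambda rule: rule.priority,
)


def is_boundary(text: str, i: int) -> bool:
    """Return the decision of the first rule (in priority order) that has one."""
    for rule in RULES:
        decision = rule.check(text, i)
        if decision is not None:
            return decision
    return False


def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_boundary(text, i):
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid 3.50 dollars. Was it worth it? Yes!"))
# ['Dr. Smith paid 3.50 dollars.', 'Was it worth it?', 'Yes!']
```

Because the veto rules run before the general terminal rule, "Dr." and "3.50" never reach the point where a split would be made. Adding a custom rule in this sketch means appending another `Rule` with a suitable priority.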
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | str | 'en' | Language code |
| min_sentence_length | int | 1 | Minimum sentence length |
| aggressive_abbreviations | bool | False | Stricter abbreviation handling |
| merge_short_sentences | bool | False | Merge short sentences |
| include_rules | List[int] | None | Use specific rules |
| exclude_rules | List[int] | None | Exclude specific rules |
| debug | bool | False | Enable debug logging |
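
A short sketch of how these options might be combined. The option names and types come from the table; passing them as keyword arguments to the `TokenBoundaryDetector` constructor is an assumption, and the rule IDs are made up for illustration. The import is guarded so the snippet still runs when the package is not installed.

```python
# Option names are taken from the parameter table; using them as constructor
# keyword arguments is an ASSUMPTION, not confirmed API.
options = {
    "language": "en",
    "min_sentence_length": 3,      # unit (characters or words) is not stated in the table
    "merge_short_sentences": True,
    "exclude_rules": [12, 40],     # hypothetical rule IDs, for illustration only
    "debug": False,
}

try:
    from pyset import TokenBoundaryDetector
except ImportError:
    TokenBoundaryDetector = None   # package not installed; skip the live call

if TokenBoundaryDetector is not None:
    detector = TokenBoundaryDetector(**options)
    print(detector.split("Hello world. How are you?"))
```

`include_rules` and `exclude_rules` both default to None; the table does not say how they interact when both are given, so this sketch sets only one of them.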
- Document chunking for LLMs and RAG systems
- Text preprocessing for NLP pipelines
- Legal document analysis
- News article segmentation
- Academic paper processing
- Content extraction and cleaning
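
For the document-chunking use case, one simple approach is to pack whole sentences into size-limited chunks before indexing. The sketch below assumes the sentences have already been produced by a splitter such as `detector.split(text)`; `chunk_sentences` and its `max_chars` parameter are hypothetical names for this example and are not part of PySET.

```python
def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = ["Hello world.", "How are you?", "I'm doing great!"]
print(chunk_sentences(sentences, max_chars=26))
# ['Hello world. How are you?', "I'm doing great!"]
```

Because chunks only ever break between sentences, no sentence is cut in half, which is the property that motivates sentence-aware chunking for vector indexing.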
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run with coverage
pytest tests/ --cov=pyset

MIT License - See LICENSE file for details.

PySET - Python Sentence Extraction Toolkit. Accurate. Fast. Zero Dependencies.


