PySET - Python Sentence Extraction Toolkit

Python Sentence Extraction Toolkit is a high-performance, zero-dependency sentence boundary detection library that breaks any provided text into individual sentences. It was initially built to fill the need for an efficient, modern way to split text into coherent chunks when indexing into a vector database, but it has a wide range of applications.

Built for Python 3.14

Features

85 intelligent rules for accurate sentence splitting
52 languages supported
Zero dependencies (pure Python standard library)
2-4x faster than PySBD in internal benchmarks
More accurate on complex legal documents
Extensible design for custom rules

Installation

pip install pysentence-extraction-toolkit

Quick Start

from pyset import TokenBoundaryDetector

detector = TokenBoundaryDetector()

# Simple usage
text = "Hello world. How are you? I'm doing great!"
sentences = detector.split(text)

print(sentences)
# ['Hello world.', 'How are you?', "I'm doing great!"]

Performance

Based on internal benchmarks vs PySBD:

Text Size	Words	PySET	PySBD	Speedup
Sentences	~5	0.05ms	0.10ms	2.0x
Paragraph	~104	0.60ms	1.37ms	2.3x
Article	~484	2.41ms	5.25ms	2.2x
Document	~1400	5.68ms	21.95ms	3.9x

PySET processes 158,000+ words/second vs 63,000 for PySBD.
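
A throughput figure like the ones above can be measured with a small timing harness. The following is a minimal sketch, not the project's benchmark script: `naive_split`, `words_per_second`, and `speedup` are hypothetical helpers, and the regex splitter is only a stand-in so the example runs without PySET installed. To measure PySET itself, `detector.split` from the Quick Start can be passed in place of `naive_split`.

```python
import re
import time

# Stand-in splitter so the sketch runs without PySET installed.
# This naive regex is NOT PySET's algorithm.
_NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in _NAIVE_BOUNDARY.split(text) if s]


def words_per_second(split_fn, text, repeats=200):
    """Time split_fn over `repeats` runs and return words processed per second."""
    n_words = len(text.split())
    start = time.perf_counter()
    for _ in range(repeats):
        split_fn(text)
    elapsed = time.perf_counter() - start
    return n_words * repeats / elapsed


def speedup(baseline_ms, candidate_ms):
    """Speedup as in the table: baseline time divided by candidate time."""
    return round(baseline_ms / candidate_ms, 1)


sample = "Hello world. How are you? I'm doing great! " * 50
rate = words_per_second(naive_split, sample)
print(f"{rate:,.0f} words/second")

# Reproduce the Speedup column for the Document row (PySBD ms / PySET ms).
print(speedup(21.95, 5.68))  # 3.9
```

Absolute numbers depend on hardware and Python version, so the ratio between two splitters measured on the same machine is the more portable figure.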

Why PySET?

Zero dependencies - No external packages required
85 rules handling edge cases like abbreviations, URLs, emails, decimals, quotes
Accurate - Priority-based rule evaluation for correct decisions
Fast - Pre-compiled patterns and optimized algorithms
Extensible - Easy to add custom rules
Well tested - 100% accuracy on 52 languages
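
To illustrate what priority-based rule evaluation with pre-compiled patterns can look like, here is a toy sketch. It is not PySET's implementation or API: the `Rule` class, the two sample rules, and the `split_with_rules` function are assumptions made up for this example. Each candidate boundary is checked against the rules in priority order, and the first rule that matches decides whether to split.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a pre-compiled pattern plus the decision it implies."""
    name: str
    priority: int        # lower number = evaluated first
    pattern: re.Pattern  # matched against the text up to the candidate
    is_boundary: bool    # decision taken when the pattern matches


# Candidate boundaries: terminal punctuation followed by whitespace.
CANDIDATE = re.compile(r"[.!?](?=\s)")

RULES = sorted(
    [
        Rule("default_terminal", 100, re.compile(r"[.!?]$"), True),
        Rule("abbreviation", 10,
             re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof|e\.g|i\.e)\.$"), False),
    ],
    key=lambda r: r.priority,
)


def split_with_rules(text, rules=RULES):
    """Split text, letting the first matching rule decide each candidate."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        left = text[start:match.end()]  # sentence so far, punctuation included
        for rule in rules:
            if rule.pattern.search(left):
                if rule.is_boundary:
                    sentences.append(left.strip())
                    start = match.end()
                break  # first matching rule wins
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_with_rules("Dr. Smith arrived. He was late!"))
# ['Dr. Smith arrived.', 'He was late!']

# A custom rule slots in by priority (again, a hypothetical mechanism).
custom = sorted(
    RULES + [Rule("number_abbrev", 5, re.compile(r"\bNo\.$"), False)],
    key=lambda r: r.priority,
)
print(split_with_rules("See No. 5 for details. Thanks.", custom))
# ['See No. 5 for details.', 'Thanks.']
```

Compiling the patterns once at import time keeps the per-candidate cost to a few regex searches, which is the general idea behind the "pre-compiled patterns" claim.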

Configuration Options

Parameter	Type	Default	Description
`language`	str	`'en'`	Language code
`min_sentence_length`	int	`1`	Minimum sentence length
`aggressive_abbreviations`	bool	`False`	Stricter abbreviation handling
`merge_short_sentences`	bool	`False`	Merge short sentences
`include_rules`	List[int]	`None`	Use specific rules
`exclude_rules`	List[int]	`None`	Exclude specific rules
`debug`	bool	`False`	Enable debug logging
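
The table does not say how these options are passed in. The sketch below assumes they are keyword arguments and mirrors the documented names and defaults in a hypothetical `DetectorOptions` dataclass, which is not part of PySET. The rule numbers 12 and 40 are arbitrary placeholders.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class DetectorOptions:
    """Hypothetical container mirroring the documented parameters and defaults."""
    language: str = "en"
    min_sentence_length: int = 1
    aggressive_abbreviations: bool = False
    merge_short_sentences: bool = False
    include_rules: Optional[List[int]] = None
    exclude_rules: Optional[List[int]] = None
    debug: bool = False

    def overrides(self):
        """Return only the options that differ from the documented defaults."""
        defaults = asdict(DetectorOptions())
        return {k: v for k, v in asdict(self).items() if v != defaults[k]}


opts = DetectorOptions(language="de", min_sentence_length=3, exclude_rules=[12, 40])
print(opts.overrides())
# {'language': 'de', 'min_sentence_length': 3, 'exclude_rules': [12, 40]}

# Assumption: if the constructor accepts these as keyword arguments,
# the call would look like TokenBoundaryDetector(**opts.overrides()).
```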

Use Cases

Document chunking for LLMs and RAG systems
Text preprocessing for NLP pipelines
Legal document analysis
News article segmentation
Academic paper processing
Content extraction and cleaning
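
For the document chunking use case, the list returned by `detector.split` can be packed into size-limited chunks before embedding. This is a minimal sketch: `chunk_sentences` and its `max_chars` parameter are hypothetical helpers, not part of PySET, and the input list is the Quick Start output written out by hand so the example runs on its own.

```python
def chunk_sentences(sentences, max_chars=200):
    """Greedily pack consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = ["Hello world.", "How are you?", "I'm doing great!"]
print(chunk_sentences(sentences, max_chars=26))
# ['Hello world. How are you?', "I'm doing great!"]
```

Because chunks only ever break between sentences, each one stays readable on its own, which is the property that matters when the chunks are indexed into a vector database.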

Documentation

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=pyset

License

MIT License - See LICENSE file for details.

PySET - Python Sentence Extraction Toolkit Accurate. Fast. Zero Dependencies.
