A news aggregation bot that scrapes local Málaga news, evaluates article relevance with AI, and posts curated summaries to a Telegram channel. Runs on a 10-minute schedule.
- Fetch — Scrapes article links and content from the configured news source
- Deduplicate — Checks the article URL against Supabase first (free, in-memory); then embeds the title with Cohere and compares cosine similarity against stored embeddings; falls back to Jaccard on legacy rows
- Evaluate — Gemini scores each article's relevance (0–10); articles below 6 are saved to Supabase (so they are not re-evaluated next cycle) and then skipped
- Summarize — Gemini generates an emoji-rich, Telegram-ready summary
- Post — Sends media groups (up to 9 images) or plain text to the Telegram channel
- Cleanup — Daily job removes articles older than 10 days
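
The deduplication step can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the bot's actual code: the function names, the row layout (`url`, `title`, `embedding` keys), and the Jaccard cutoff of 0.6 are hypothetical. Only the 0.85 cosine threshold is taken from this README.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # cosine cutoff documented in this README
JACCARD_THRESHOLD = 0.6      # hypothetical cutoff for legacy rows without embeddings


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(title_a, title_b):
    """Word-set overlap between two titles (intersection over union)."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(url, title, embedding, stored_rows):
    """Return True if the article matches any stored row."""
    # 1. An exact URL match is the cheapest check, so it runs first.
    if url and any(row.get("url") == url for row in stored_rows):
        return True
    # 2. Compare embeddings; rows without one fall back to Jaccard on titles.
    for row in stored_rows:
        stored = row.get("embedding")
        if stored:
            if cosine_similarity(embedding, stored) >= SIMILARITY_THRESHOLD:
                return True
        elif jaccard_similarity(title, row.get("title") or "") >= JACCARD_THRESHOLD:
            return True
    return False
```

In this sketch the title embedding is passed in already computed, so the Cohere call itself is left out.
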
- Python 3.10+
- A Telegram bot token and target chat/channel ID
- A Google Gemini API key (for article evaluation and summarization)
- A Cohere API key (for title embeddings / deduplication)
- A Supabase project with an `articles` table:

      create table articles (
        id uuid primary key,
        title text,
        date text,
        embedding jsonb,
        url text
      );
      create unique index articles_url_idx on articles (url) where url is not null;

python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Create a `.env` file in the project root:
BOT_TOKEN=your_telegram_bot_token
CHAT_ID=your_telegram_chat_id
NEWS_URL=https://www.malagahoy.es/malaga/
GEMINI_API_KEY=your_gemini_api_key
COHERE_API_KEY=your_cohere_api_key
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_supabase_anon_or_service_key

# Run the bot (loops every 10 minutes)
python main.py
# Dry run — fetch and evaluate without saving or posting
python main.py --dry-run

Article evaluation and summarization use Gemini by default. Switch via `current_ai_provider` in `main.py`:

| Provider | Model | Notes |
|---|---|---|
| `AIProvider.GEMINI` | `gemini-2.5-flash` | Default; uses JSON schema validation |
| `AIProvider.OPENAI` | Any OpenAI-compatible | Also works with local LM Studio at `http://localhost:1234/v1` |

Deduplication embeddings always use Cohere (embed-multilingual-v3.0, 1024 dimensions).
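
Provider switching could look something like the sketch below. It is illustrative only: the enum values, the stub classes, and the `local-model` name are assumptions, and the real services call the Gemini or OpenAI-compatible APIs rather than returning placeholders. The names `AIProvider`, `AIService.get_service`, and `current_ai_provider` are the ones this README mentions.

```python
from abc import ABC, abstractmethod
from enum import Enum


class AIProvider(Enum):
    GEMINI = "gemini"  # enum values are assumed for illustration
    OPENAI = "openai"


class BaseAIService(ABC):
    """Interface every provider implements (evaluate, summarize)."""

    @abstractmethod
    def evaluate(self, article_text: str) -> int: ...

    @abstractmethod
    def summarize(self, article_text: str) -> str: ...


class StubGeminiService(BaseAIService):
    model = "gemini-2.5-flash"

    def evaluate(self, article_text):
        return 0  # placeholder; a real service would return the model's 0-10 score

    def summarize(self, article_text):
        return ""  # placeholder; a real service would return the generated summary


class StubOpenAIService(BaseAIService):
    model = "local-model"                  # hypothetical model name
    base_url = "http://localhost:1234/v1"  # LM Studio endpoint from this README

    def evaluate(self, article_text):
        return 0  # placeholder

    def summarize(self, article_text):
        return ""  # placeholder


class AIService:
    """Factory that maps a provider enum member to a service instance."""

    _registry = {
        AIProvider.GEMINI: StubGeminiService,
        AIProvider.OPENAI: StubOpenAIService,
    }

    @staticmethod
    def get_service(provider):
        try:
            return AIService._registry[provider]()
        except KeyError:
            raise ValueError(f"Unsupported provider: {provider!r}") from None


current_ai_provider = AIProvider.GEMINI
service = AIService.get_service(current_ai_provider)
```

With this shape, changing `current_ai_provider` to `AIProvider.OPENAI` is the only edit needed to route evaluation and summarization to a different backend.
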
├── main.py # Entry point, scheduler, job orchestration
├── fetching_data.py # Web scraping (BeautifulSoup)
├── data_service.py # Supabase deduplication (Cohere embeddings + cosine similarity)
├── telegram_service.py # Telegram posting (media groups + text)
├── response_parser.py # JSON + regex extraction from AI responses
├── requirements.txt
└── ai/
    ├── ai_service.py # Factory: AIService.get_service(provider)
    ├── base_ai_service.py # Abstract base (evaluate, summarize)
    ├── gemini_service.py # Google Gemini implementation
    ├── openai_service.py # OpenAI / LM Studio implementation
    ├── ai_prompts.py # Prompt templates
    └── ai_provider.py # AIProvider enum

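
As an illustration of the "JSON + regex extraction" that `response_parser.py` is described as doing, here is a minimal sketch. The function names and the `score` key are assumptions, not the module's real interface.

```python
import json
import re


def extract_json(response_text):
    """Return the first JSON object found in an LLM response, or None."""
    # 1. The whole response may already be valid JSON.
    try:
        parsed = json.loads(response_text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # 2. Otherwise take the outermost {...} span, e.g. when the model adds prose.
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    return None


def extract_score(response_text, key="score"):
    """Return an integer score from JSON if possible, else via a regex fallback."""
    data = extract_json(response_text)
    if data is not None and key in data:
        try:
            return int(data[key])
        except (TypeError, ValueError):
            pass
    # 3. Last resort: match text such as "Score: 7".
    match = re.search(rf"{re.escape(key)}\W+(\d+)", response_text, re.IGNORECASE)
    return int(match.group(1)) if match else None
```

Trying strict JSON first keeps the common case cheap, while the regex fallbacks tolerate models that wrap their answer in extra text.
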
python -m unittest discover -s tests -p "test_*.py" # All tests
python -m unittest tests.test_ai_services # AI service (evaluate + summarize)
python -m unittest tests.test_similarity # Cosine math + Cohere embedding + Supabase integration
python -m unittest tests.test_supabase_connection # Live Supabase connection (requires credentials)

Unit tests mock all external API calls, so most tests need no live credentials.
test_similarity.py runs real Cohere API calls when COHERE_API_KEY is set; otherwise the API-dependent classes are skipped automatically.
test_supabase_connection.py hits the live Supabase REST API and requires SUPABASE_URL and SUPABASE_KEY.
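
The automatic skipping described above can be achieved with `unittest.skipUnless`. The sketch below shows the general pattern; the class and test names are hypothetical and are not copied from `tests/test_similarity.py`.

```python
import os
import unittest


class CosineMathTests(unittest.TestCase):
    """Pure-math tests that never need credentials."""

    def test_dot_product_of_orthogonal_vectors(self):
        a, b = [1.0, 0.0], [0.0, 1.0]
        self.assertEqual(sum(x * y for x, y in zip(a, b)), 0.0)


# The condition is evaluated once, when the class is defined.
@unittest.skipUnless(os.environ.get("COHERE_API_KEY"), "COHERE_API_KEY not set")
class CohereEmbeddingTests(unittest.TestCase):
    """API-dependent tests; the whole class is skipped without a key."""

    def test_key_is_available(self):
        # A real test would call the Cohere API here.
        self.assertTrue(os.environ["COHERE_API_KEY"])
```

Skipped classes are reported as skipped rather than failed, so a run without credentials still counts as successful.
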
| Constant | Default | Description |
|---|---|---|
| `SIMILARITY_THRESHOLD` | 0.85 | Cosine similarity cutoff for deduplication |
| `DISTANCE_THRESHOLD` | 0.15 | `1 - SIMILARITY_THRESHOLD` |
| Scheduler interval | 10 min | How often `job()` runs |
| Cleanup age | 10 days | Max age of stored articles |
| AI retry delay | 3 min | Wait between LLM retries (3 attempts max) |
| Embedding model | `embed-multilingual-v3.0` | Cohere model, 1024 dimensions |
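
The retry settings in the table (3 attempts, 3 minutes apart) correspond to a loop like the one sketched here. The helper name `call_with_retries` and its parameters are assumptions for illustration; the bot's own retry code may be structured differently.

```python
import time

MAX_ATTEMPTS = 3             # "3 attempts max" from the table
RETRY_DELAY_SECONDS = 3 * 60  # "3 min" from the table


def call_with_retries(fn, attempts=MAX_ATTEMPTS, delay=RETRY_DELAY_SECONDS,
                      sleep=time.sleep):
    """Call fn() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: any LLM/API failure retries
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # wait only between attempts, not after the last
    raise last_error
```

Passing `sleep` as a parameter lets tests substitute a no-op or a recorder instead of actually waiting three minutes.
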