Donsezan/Python_news_bot

Python News Bot

A news aggregation bot that scrapes local Málaga news, evaluates article relevance with AI, and posts curated summaries to a Telegram channel. Runs on a 10-minute schedule.

How It Works

  1. Fetch — Scrapes article links and content from the configured news source
  2. Deduplicate — Checks the article URL against Supabase first (free, in-memory); then embeds the title with Cohere and compares its cosine similarity against stored embeddings; falls back to Jaccard similarity for legacy rows
  3. Evaluate — Gemini scores each article's relevance (0–10); articles scoring below 6 are saved to Supabase (so they are not re-evaluated next cycle) and then skipped
  4. Summarize — Gemini generates an emoji-rich, Telegram-ready summary
  5. Post — Sends media groups (up to 9 images) or plain text to the Telegram channel
  6. Cleanup — Daily job removes articles older than 10 days
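
The deduplication step above can be sketched roughly as follows. This is a minimal illustration, not the code in data_service.py: the function names (cosine_similarity, jaccard_similarity, is_duplicate), the row layout (dicts with url, title and embedding keys), and the reuse of the cosine threshold for the Jaccard fallback are all assumptions. Embeddings are passed in directly rather than fetched from Cohere.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # cosine cutoff listed under Key Constants


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(title_a, title_b):
    """Word-set overlap, used here for stored rows that have no embedding."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(url, title, embedding, stored_rows, threshold=SIMILARITY_THRESHOLD):
    """Hypothetical helper: True if the article matches any stored row."""
    # 1. Cheapest check first: exact URL match, no embedding needed.
    if url is not None and any(row.get("url") == url for row in stored_rows):
        return True
    # 2. Compare against each stored row, by embedding when one exists,
    #    otherwise by title word overlap (assumed meaning of "legacy rows").
    for row in stored_rows:
        if row.get("embedding"):
            score = cosine_similarity(embedding, row["embedding"])
        else:
            score = jaccard_similarity(title, row.get("title") or "")
        if score >= threshold:
            return True
    return False
```

Running the URL check first means an article that was already seen never costs an embedding call.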

Setup

Prerequisites

  • Python 3.10+
  • A Telegram bot token and target chat/channel ID
  • A Google Gemini API key (for article evaluation and summarization)
  • A Cohere API key (for title embeddings / deduplication)
  • A Supabase project with an articles table:
    create table articles (
      id uuid primary key,
      title text,
      date text,
      embedding jsonb,
      url text
    );
    create unique index articles_url_idx on articles (url) where url is not null;
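
As an illustration of how a record might be shaped for this table, here is a minimal sketch. The helper name build_article_row, the example date format, and the example URL are assumptions. The actual insert call is left out because it depends on the Supabase client in use.

```python
import json
import uuid


def build_article_row(title, date, url=None, embedding=None):
    """Hypothetical helper: return a dict keyed by the articles table columns."""
    return {
        "id": str(uuid.uuid4()),  # uuid primary key, generated client-side here
        "title": title,
        "date": date,             # stored as text in this schema
        "embedding": embedding,   # list of floats, serialized as jsonb
        "url": url,               # may be None; the partial unique index skips nulls
    }


row = build_article_row(
    "Example headline",
    "2024-01-01",                 # assumed date format, for illustration only
    url="https://example.com/a",
    embedding=[0.1, 0.2],
)
payload = json.dumps(row)         # JSON-serializable, so it fits a REST insert body
```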

Install

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Configure

Create a .env file in the project root:

BOT_TOKEN=your_telegram_bot_token
CHAT_ID=your_telegram_chat_id
NEWS_URL=https://www.malagahoy.es/malaga/
GEMINI_API_KEY=your_gemini_api_key
COHERE_API_KEY=your_cohere_api_key
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_supabase_anon_or_service_key
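
A startup check along these lines can catch a missing variable before the first cycle runs. This is a sketch under assumptions: load_settings and REQUIRED_VARS are hypothetical names, and the .env values are assumed to already be in the process environment (for example via a loader such as python-dotenv).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "BOT_TOKEN",
    "CHAT_ID",
    "NEWS_URL",
    "GEMINI_API_KEY",
    "COHERE_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_KEY",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise listing what is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```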

Usage

# Run the bot (loops every 10 minutes)
python main.py

# Dry run — fetch and evaluate without saving or posting
python main.py --dry-run
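
A flag like --dry-run is commonly parsed with argparse. The sketch below shows one way it could be wired. The parse_args helper is hypothetical and main.py may handle the flag differently.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Python News Bot")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="fetch and evaluate without saving or posting",
    )
    return parser.parse_args(argv)


args = parse_args(["--dry-run"])
# args.dry_run can then guard the save and post steps.
```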

AI Providers

Article evaluation and summarization use Gemini by default. Switch via current_ai_provider in main.py:

| Provider | Model | Notes |
|---|---|---|
| AIProvider.GEMINI | gemini-2.5-flash | Default; uses JSON schema validation |
| AIProvider.OPENAI | Any OpenAI-compatible | Also works with local LM Studio at http://localhost:1234/v1 |

Deduplication embeddings always use Cohere (embed-multilingual-v3.0, 1024 dimensions).
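
The provider switch follows a factory pattern. Here is a minimal sketch of how an enum plus AIService.get_service(provider) can fit together. AIProvider, AIService.get_service, evaluate and summarize are names from this README. The class names BaseAIService, GeminiService and OpenAIService are guessed from the file names, and the enum values and stub bodies are placeholders.

```python
from abc import ABC, abstractmethod
from enum import Enum


class AIProvider(Enum):
    GEMINI = "gemini"  # placeholder values, not taken from the repository
    OPENAI = "openai"


class BaseAIService(ABC):
    @abstractmethod
    def evaluate(self, article_text):
        """Return a relevance score from 0 to 10."""

    @abstractmethod
    def summarize(self, article_text):
        """Return a Telegram-ready summary."""


class GeminiService(BaseAIService):
    def evaluate(self, article_text):
        raise NotImplementedError("stub: a real service would call Gemini here")

    def summarize(self, article_text):
        raise NotImplementedError("stub: a real service would call Gemini here")


class OpenAIService(BaseAIService):
    def evaluate(self, article_text):
        raise NotImplementedError("stub: a real service would call an OpenAI-compatible API")

    def summarize(self, article_text):
        raise NotImplementedError("stub: a real service would call an OpenAI-compatible API")


class AIService:
    _services = {
        AIProvider.GEMINI: GeminiService,
        AIProvider.OPENAI: OpenAIService,
    }

    @staticmethod
    def get_service(provider):
        """Instantiate the service registered for the given provider."""
        try:
            return AIService._services[provider]()
        except KeyError:
            raise ValueError(f"Unsupported AI provider: {provider!r}") from None


current_ai_provider = AIProvider.GEMINI
service = AIService.get_service(current_ai_provider)
```

With this shape, adding a provider means one new subclass and one new registry entry, while callers keep using evaluate and summarize unchanged.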

Project Structure

├── main.py                  # Entry point, scheduler, job orchestration
├── fetching_data.py         # Web scraping (BeautifulSoup)
├── data_service.py          # Supabase deduplication (Cohere embeddings + cosine similarity)
├── telegram_service.py      # Telegram posting (media groups + text)
├── response_parser.py       # JSON + regex extraction from AI responses
├── requirements.txt
└── ai/
    ├── ai_service.py        # Factory: AIService.get_service(provider)
    ├── base_ai_service.py   # Abstract base (evaluate, summarize)
    ├── gemini_service.py    # Google Gemini implementation
    ├── openai_service.py    # OpenAI / LM Studio implementation
    ├── ai_prompts.py        # Prompt templates
    └── ai_provider.py       # AIProvider enum
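
The "JSON + regex extraction" role of response_parser.py can be pictured with a sketch like the one below. It is not the repository's code: extract_json and extract_score are hypothetical names, and the "score" key is an assumed response field.

```python
import json
import re


def extract_json(response_text):
    """Return the first JSON object found in an AI response, or None."""
    # 1. The whole response may already be valid JSON.
    try:
        parsed = json.loads(response_text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # 2. Otherwise take the span from the first "{" to the last "}",
    #    which handles JSON wrapped in prose or code fences.
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    return None


def extract_score(response_text):
    """Return an integer score from the response, or None if none is found."""
    data = extract_json(response_text)
    if data is not None and isinstance(data.get("score"), (int, float)):
        return int(data["score"])
    # 3. Last resort: a loose pattern such as "Score: 7".
    match = re.search(r"score\D{0,5}(\d{1,2})", response_text, re.IGNORECASE)
    return int(match.group(1)) if match else None
```

Trying strict JSON first keeps the regex as a fallback for models that wrap or ignore the requested format.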

Tests

python -m unittest discover -s tests -p "test_*.py"        # All tests
python -m unittest tests.test_ai_services                   # AI service (evaluate + summarize)
python -m unittest tests.test_similarity                    # Cosine math + Cohere embedding + Supabase integration
python -m unittest tests.test_supabase_connection           # Live Supabase connection (requires credentials)

Most unit tests mock their external API calls, so they need no live credentials. The exceptions are:
test_similarity.py runs real Cohere API calls when COHERE_API_KEY is set; otherwise the API-dependent classes are skipped automatically.
test_supabase_connection.py hits the live Supabase REST API and requires SUPABASE_URL and SUPABASE_KEY.
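
Skipping API-dependent test classes when a key is absent is usually done with unittest.skipUnless. A minimal sketch follows, with hypothetical class and test names that are not taken from tests/test_similarity.py:

```python
import os
import unittest


@unittest.skipUnless(os.getenv("COHERE_API_KEY"), "COHERE_API_KEY not set")
class CohereEmbeddingTests(unittest.TestCase):
    """Skipped as a whole class when the key is missing."""

    def test_key_is_available(self):
        self.assertTrue(os.environ["COHERE_API_KEY"])


class CosineMathTests(unittest.TestCase):
    """Pure math, so it always runs."""

    def test_identical_vectors_have_similarity_one(self):
        a = [1.0, 2.0]
        dot = sum(x * x for x in a)
        norm = dot ** 0.5
        self.assertAlmostEqual(dot / (norm * norm), 1.0)
```

The decorator is evaluated when the class is defined, so the key must be in the environment before the test module is imported.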

Key Constants

| Constant | Default | Description |
|---|---|---|
| SIMILARITY_THRESHOLD | 0.85 | Cosine similarity cutoff for deduplication |
| DISTANCE_THRESHOLD | 0.15 | 1 - SIMILARITY_THRESHOLD |
| Scheduler interval | 10 min | How often job() runs |
| Cleanup age | 10 days | Max age of stored articles |
| AI retry delay | 3 min | Wait between LLM retries (3 attempts max) |
| Embedding model | embed-multilingual-v3.0 | Cohere model, 1024 dimensions |
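
The retry behaviour listed above (3 attempts, 3 minutes apart) could look roughly like this. The names call_with_retry, MAX_ATTEMPTS and RETRY_DELAY_SECONDS are hypothetical, and catching every Exception is a simplification; the real code may retry only on specific errors.

```python
import time

MAX_ATTEMPTS = 3
RETRY_DELAY_SECONDS = 3 * 60


def call_with_retry(func, attempts=MAX_ATTEMPTS, delay=RETRY_DELAY_SECONDS, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)  # wait before the next attempt
```

Passing sleep as a parameter lets tests replace the 3-minute wait with a no-op.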
