π§ NewsMate: A RAG-Based News Chatbot

π Objective
NewsMate is an intelligent chatbot that provides real-time, conversational answers based on the latest news articles. It leverages Retrieval-Augmented Generation (RAG) to ground LLM responses in fresh, factual news content sourced from RSS feeds.
π οΈ Tech Stack
- Frontend: React.js with Vite
- Backend: Node.js with Express
- Embeddings: @xenova/transformers (MiniLM)
- Vector Store: Qdrant (Cloud-hosted)
- Database/Cache: Redis (Session management)
- Language Model: Gemini API
- Scheduler: Custom Node.js cron-like job
π Key Features & Architecture
RSS Feed Ingestion
- Regularly fetches articles from multiple RSS sources (e.g., NYTimes).
- Extracts title, link, and content.
Embedding Generation
- Uses @xenova/transformers to create dense vector representations of article content.
- Ensures semantic similarity can be measured during retrieval.
Deduplication
- Each articleβs link is hashed using SHA-256 and formatted into a UUID-like string.
- This unique ID prevents duplicate insertion into the vector store.
Vector Storage (Qdrant)
- Embeddings are upserted into the news_articles collection.
- Qdrant is queried later for semantically similar chunks during a chatbot session.
Chatbot (RAG)
- User question triggers a semantic search in Qdrant.
- Top-k relevant context chunks are injected into the Gemini API prompt.
- The model generates context-grounded answers.
Session Management
- Redis is used to track ongoing conversations and maintain continuity.
β οΈ Difficulties Faced & Resolutions
| Problem | Cause | Resolution |
| ------------------------------------------ | ---------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| "Bad Request: Invalid ID" from Qdrant | Direct use of URLs as point IDs (invalid format) | Introduced hashing (SHA-256) of links into UUID-like strings |
| Duplicate Data Despite No Uploads | Same link repeatedly inserted without proper deduplication logic | Added pre-check using qdrant.retrieve()
to skip already embedded articles |
| ReferenceError: stats
is not defined | Misuse of Qdrant collection metadata in ensureCollectionExists
| Removed incorrect stats
reference and replaced with proper collection existence check |
| Embedding Failures | Some articles had malformed or insufficient content | Added validation and fallback logic per article to skip bad entries |
| Silent Failures / Poor Logging | Errors weren't specific or granular | Improved error logs and debug messaging for each critical operation (embedding, upsert, fetch) |
β
Outcomes
- β
Successfully built a full-stack news chatbot using RAG.
- β
Embedded 50+ articles into Qdrant and enabled semantic search.
- β
Integrated Gemini for high-quality LLM responses.
- β
Achieved auto-updating context through periodic embedding refreshes.
π Future Improvements
- Add user authentication and persistent chat history.
- Integrate summarization and source citation.
- Scale to handle multi-lingual feeds.