News chat: RAG implementation using Chroma, Gemini 1.5 Flash, Text Embedding 004 and Cloudflare tunnels

Oscar Galvis
9 min read · Sep 10, 2024


A few months ago, I developed a news summarizer and sentiment indicator that processes articles from a well-known newspaper, using the power of local large language models (LLMs). By leveraging Ollama and Llama 3, I was able to run this system efficiently on my Mac mini with Apple silicon. Since then, I’ve collected nearly 15,000 articles. With this rich dataset, I decided to build a chatbot that can answer questions and provide insights on the news that truly interests me, delivering personalized news summaries and sentiment analysis on demand.

Following the same principle, I aimed to make this project as cost-effective as possible, relying on local processing power instead of expensive cloud-based infrastructure, ensuring I maintain control over data and expenses.

Chat with the news

The chat is available at https://news.misprops.co. It may not always be online, as the chat completions API runs on a little Mac mini at my home. It is also in Spanish :)

The Tools

Github

https://github.com/lomaky/news-analyser

Installing Chroma on Docker

The following command runs a Chroma container, persists the database to a folder on the host machine, and exposes it on port 8000:

docker run -d --name chromadb -v ./chroma:/chroma/chroma -p 8000:8000 -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE chromadb/chroma:latest

Vectorizing the news articles with Text Embedding 004

Why Choose Text Embedding 004?

While Chroma ships with a built-in embedding function that doesn’t require a GPU, it tends to be extremely slow and is quantized, which can impact performance and accuracy. In contrast, Text Embedding 004 is a free, high-quality option that delivers faster and more accurate results, and it also powers various Google AI products, making it a strong alternative for embedding tasks. To get started with Text Embedding 004, obtain a Google AI key by visiting the following link:

https://aistudio.google.com/app/apikey
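Under the hood, retrieval works by comparing embedding vectors with a distance metric. Chroma handles this internally, but as a quick illustration (the `cosineSimilarity` helper below is a hypothetical sketch of mine, not part of the project code), cosine similarity between two embedding vectors can be computed like this:

```typescript
// Cosine similarity: 1.0 means identical direction, 0 means orthogonal.
// Vector databases rank documents with this kind of distance metric.
export const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length) throw new Error("Vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
```

Two articles about the same topic produce embeddings whose similarity is close to 1, which is what makes semantic search over the 15,000 articles possible.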

In this project, the articles have been stored locally in a JSON file with the following format:

{
"title": "Bogota, asi sera la ciudad en la que todos podremos caminar seguros |opinion",
"date": "2024-06-11T00:00:00.000Z",
"id": "3351364",
"category": "Bogota",
"url": "https://www.eltiempo.com/bogota/bogota-asi-sera-la-ciudad-en-la-que-todos-podremos-caminar-seguros-opinion-3351364",
"content": "Entendiendo los desafíos que hoy tiene Bogotá para ser la ciudad en la que todos queramos caminar, el Sector Movilidad, con un presupuesto de 19,8 billones de pesos, tiene la responsabilidad de liderar y gestionar 43 metas del Plan Distrital de Desarrollo ‘Bogotá Camina Segura’ 2024-2027.Con más andenes, ciclorrutas, vías, cables aéreos, troncales de TransMilenio y el Metro, en la Administración del alcalde Carlos Fernando Galán, trabajamos para construir una movilidad que esté en armonía entre todos los actores viales y con el medio ambiente. Es por ello que durante estos 4 años nos dedicaremos a implementar los grandes proyectos de infraestructura vial y de transporte público, con más de 60 proyectos de infraestructura que beneficiarán a más de 6 millones de habitantes, y asumimos el compromiso histórico de intervenir el 40 % de la malla vial en mal estado.Las entidades del Sector Movilidad trabajaremos por una ciudad que logre el bien-estar para todos y todas. Para eso ofreceremos más y mejores opciones de movilidad saludables y seguras, con 83 mil cupos de cicloparqueaderos públicos y privados; seguiremos mejorando los viajes de los más de 480 mil estudiantes del programa Niñas y Niños Primero; construiremos 59 km nuevos de ciclorrutas; e implementaremos 80 km/carril nuevos de malla vial. 
Una de las grandes apuestas será la construcción de la ALO Norte, una obra esperada por décadas por todos los habitantes del occidente de la capital.Como Secretaria de Movilidad, mi prioridad es proteger y salvar vidas en las vías y recuperar la confianza de los ciudadanos, tener un sistema de movilidad más eficiente y confiable, generar más viajes en modos sostenibles, mejorar el servicio de transporte público, y seguir afianzando a Bogotá como una ciudad global competitiva y conectada.Para eso estamos trabajando y caminaremos las 20 localidades para escuchar y conocer las necesidades de sus habitantes, implementando las estrategias y programas establecidas en el Plan de Desarrollo para mejorar el acceso y la seguridad de todos los usuarios de las vías, en especial, de las personas más vulnerables.Con eficiencia, inclusión y sostenibilidad, Bogotá caminará segura.Claudia DíazSecretaria de Movilidad de Bogotá",
"thumbnail": "https://imagenes.eltiempo.com/files/image_1200_600/uploads/2024/06/11/66686e1a7ff21.jpeg",
"summary": "Aquí te dejo una resumen de la noticia en español:\n\nLa Secretaría de Movilidad de Bogotá, liderada por Claudia Díaz, ha presentado el Plan Distrital de Desarrollo \"Bogotá Camina Segura\" 2024-2027, con un presupuesto de 19.8 billones de pesos. El objetivo es construir una movilidad en armonía entre todos los actores viales y con el medio ambiente. Durante los próximos 4 años, se implementarán más de 60 proyectos de infraestructura vial y transporte público que beneficiarán a más de 6 millones de habitantes.\n\nEntre las metas del plan se encuentran:\n\n* Construir 59 km nuevos de ciclorrutas\n* Implementar 80 km/carril nuevos de malla vial\n* Mejorar los viajes de los estudiantes del programa Niñas y Niños Primero\n* Ofrecer más y mejores opciones de movilidad saludables y seguras\n* Proteger y salvar vidas en las vías\n\nLa Secretaria de Movilidad ha destacado que su prioridad es proteger y salvar vidas en las vías, recuperar la confianza de los ciudadanos y afianzar a Bogotá como una ciudad global competitiva y conectada. Para lograr este objetivo, se están trabajando estrategias y programas para mejorar el acceso y la seguridad de todos los usuarios de las vías, especialmente de las personas más vulnerables.",
"sentiment": "Positiva",
"positive": true
}

An advantage of storing articles in this format is that they can be directly vectorized without needing to be broken down into smaller chunks. This is particularly useful for retrieval-augmented generation (RAG) tasks, where the accuracy of search results can depend heavily on the chunking strategy. Improper chunking can cause key context or insights to be lost, reducing the effectiveness of retrievals. By working with whole documents in a structured format, it’s easier to maintain semantic integrity and improve the quality of search results when generating responses.
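For contrast, pipelines that ingest long documents usually need a chunking step. A naive fixed-size chunker (a hypothetical sketch — this project deliberately avoids it) might look like this:

```typescript
// Naive fixed-size chunking with overlap. Splitting mid-sentence like this
// is exactly how key context gets lost in a RAG pipeline.
export const chunkText = (
  text: string,
  chunkSize = 1000,
  overlap = 200
): string[] => {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
};
```

Even with the overlap, a sentence can still be cut in half at a chunk boundary, which is the loss of context that storing whole articles sidesteps.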

The following code reads all the articles, generates the embeddings, and stores them in Chroma. Processing everything took a couple of hours, and I ran into rate limits on the Google embeddings API a few times.

import * as fs from "fs";
import * as path from "path";
import { ChromaClient, GoogleGenerativeAiEmbeddingFunction } from "chromadb";

const main = async () => {
  // Folder with the article JSON files
  const articlesRelativePath = "../news-processor/articles/";
  const articlesPath = path.join(__dirname, articlesRelativePath);
  console.log(articlesPath);

  // Embeddings
  const googleKey = "YOUR_GOOGLE_GEMINI_KEY";
  const googleEmbeddings = new GoogleGenerativeAiEmbeddingFunction({
    googleApiKey: googleKey,
    model: "text-embedding-004",
  });

  // VectorDb
  const client = new ChromaClient({
    path: "http://localhost:8000",
  });
  const vectorDbName = `news-text-embedding-004.vdb`;

  // Get or create the VectorDB collection
  const vectorDb = await client.getOrCreateCollection({
    name: vectorDbName,
    embeddingFunction: googleEmbeddings,
  });

  const files = fs
    .readdirSync(articlesPath)
    .filter((file) => path.extname(file) === ".json");

  // "fileName" avoids shadowing the imported "path" module
  for (const fileName of files) {
    const file = path.join(articlesPath, fileName);
    console.log(file);

    const data = fs.readFileSync(file, "utf-8");
    const article = JSON.parse(data) as Article;

    if (
      article &&
      article.id &&
      article.content &&
      article.title &&
      article.date
    ) {
      const articleExists = await vectorDb.get({
        ids: [article.id.toString()],
      });
      if (!articleExists || articleExists.ids.length < 1) {
        // Vectorize article
        await vectorDb.upsert({
          ids: [article.id.toString()],
          documents: [article.content],
          metadatas: [
            {
              title: article.title,
              date: new Date(article.date).toISOString(),
              url: article.url ?? "",
            },
          ],
        });

        console.log(`Vectorized: ${article.title}`);
        // Brief pause to stay clear of the embeddings API rate limit
        await new Promise((resolve) => setTimeout(resolve, 200));
      } else {
        console.log(`Already vectorized: ${article.title}`);
      }
    }
  }
};

export interface Article {
  title?: string;
  date?: Date;
  id?: string;
  category?: string;
  url?: string;
  content?: string;
  summary?: string;
  positive?: boolean;
  sentiment?: string;
  weight?: number;
  thumbnail?: string;
}

main();
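Since the embeddings API occasionally returned rate-limit errors during those couple of hours, a small retry helper with exponential backoff would make the loop more robust. A sketch (the `withRetry` name and defaults are my own, not part of the repo):

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing async operation with exponential backoff:
// waits baseDelayMs, then 2x, 4x, ... between attempts before giving up.
export const withRetry = async <T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> => {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
};
```

The vectorizing call could then be wrapped as `await withRetry(() => vectorDb.upsert({ ... }))` instead of relying only on the fixed 200 ms pause.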

Querying the RAG and composing the response with Gemini 1.5 Flash

Why Gemini Flash?

I am using Gemini 1.5 Flash because it is a free generative AI option from Google, with a huge context window of 1 million tokens and an 8K-token output limit. However, the free tier comes with rate limits:

15 requests per minute (RPM).

1 million tokens per minute (TPM).

These rate limits make Gemini 1.5 Flash an ideal cost-effective option for projects with relatively low to moderate usage. For more frequent or intensive usage, however, you may need a paid subscription.
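At 15 RPM, consecutive requests need to be at least ~4 seconds apart to stay within the free tier. A minimal pacing helper could look like this (a hypothetical sketch, the `createRateLimiter` name is mine and not in the repo):

```typescript
// Returns a function that, when awaited, ensures consecutive calls
// are spaced at least minIntervalMs apart.
export const createRateLimiter = (minIntervalMs: number) => {
  let lastCall = 0;
  return async (): Promise<void> => {
    const wait = lastCall + minIntervalMs - Date.now();
    if (wait > 0) {
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
    lastCall = Date.now();
  };
};

// 15 requests per minute => one request every 4000 ms:
// const pace = createRateLimiter(60_000 / 15);
// await pace(); // before each model.generateContent(...) call
```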

This is the code for the chat completion API. It sets up an Express server that offers a REST endpoint for Retrieval-Augmented Generation (RAG) search, combining Google’s Gemini 1.5 Flash model with the retrieved articles so that responses are grounded in the stored news.

import express from "express";
import cors from "cors";
import {
  GoogleGenerativeAI,
  HarmBlockThreshold,
  HarmCategory,
} from "@google/generative-ai";
import { ChromaClient, GoogleGenerativeAiEmbeddingFunction } from "chromadb";

const queryRag = async (query: string) => {
  const googleKey = "YOUR_GOOGLE_GEMINI_KEY";
  // Embeddings
  const googleEmbeddings = new GoogleGenerativeAiEmbeddingFunction({
    googleApiKey: googleKey,
    model: "text-embedding-004",
  });

  // VectorDb
  const client = new ChromaClient({
    path: "http://localhost:8000",
  });

  const vectorDbName = `news-text-embedding-004.vdb`;
  console.log(`VectorDb=${vectorDbName}`);

  // Get or create the VectorDB collection
  const vectorDb = await client.getOrCreateCollection({
    name: vectorDbName,
    embeddingFunction: googleEmbeddings,
  });

  // Query the 20 closest articles
  const results = await vectorDb.query({
    queryTexts: [query],
    nResults: 20,
  });

  // Compose response
  const genAI = new GoogleGenerativeAI(googleKey);
  const safetySettings = [
    {
      category: HarmCategory.HARM_CATEGORY_HARASSMENT,
      threshold: HarmBlockThreshold.BLOCK_NONE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_HATE_SPEECH,
      threshold: HarmBlockThreshold.BLOCK_NONE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
      threshold: HarmBlockThreshold.BLOCK_NONE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
      threshold: HarmBlockThreshold.BLOCK_NONE,
    },
  ];
  const model = genAI.getGenerativeModel({
    model: "gemini-1.5-flash",
    safetySettings: safetySettings,
  });

  const prompt = `
<INSTRUCCIONES DEL PROMPT>
Eres un agente que busca noticias y responde a los usuarios.
Responde la siguiente pregunta de un usuario usando el resultado de la busqueda a continuacion.
Usa un lenguaje amigable e impersonal.
Omite links a paginas web.
Limitate a solo responder y no hacer preguntas adicionales.
</INSTRUCCIONES DEL PROMPT>

<PREGUNTA DEL USUARIO>
${query}
</PREGUNTA DEL USUARIO>

<RESULTADOS BUSQUEDA NOTICIAS>
${JSON.stringify(results)}
</RESULTADOS BUSQUEDA NOTICIAS>
`;

  const result = await model.generateContent(prompt);
  const response = result.response.text();
  console.log(response);
  return response;
};

const app = express();
app.use(express.json());
const PORT = 9700;
app.listen(PORT, () => {
  console.log("Server Listening on port:", PORT);
});

app.get("/search", cors(), async (request, response) => {
  const dbResponse = await queryRag(request.query.query);
  const ragResponse = {
    Query: request.query.query,
    Response: dbResponse,
  };

  response.send(ragResponse);
});

To make this Express server accessible over the internet, I’m utilizing Cloudflare Tunnels, which securely exposes a port from my internal network to the web. This approach simplifies external access without requiring complex network configurations. For a step-by-step guide on setting up Cloudflare Tunnels, refer to this article: https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/get-started/create-local-tunnel/
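For reference, the named-tunnel flow from that guide boils down to a few cloudflared commands; this is a configuration sketch, and the tunnel name (`news-chat`) and hostname (`news.example.com`) are placeholders for your own:

```shell
# Authenticate cloudflared against your Cloudflare account
cloudflared tunnel login

# Create a named tunnel; this writes a credentials JSON file locally
cloudflared tunnel create news-chat

# Point a DNS hostname in your zone at the tunnel
cloudflared tunnel route dns news-chat news.example.com

# Run the tunnel, forwarding traffic to the local Express server on port 9700
cloudflared tunnel run --url http://localhost:9700 news-chat
```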

The web client

For the web chat client, I’m using a simple UI written by Lucas Bassetti (https://lucasbassetti.com/) that interacts with my chat completion API.

import React, { Component } from "react";
import PropTypes from "prop-types";
import ChatBot, { Loading } from "react-simple-chatbot";
import { ThemeProvider } from "styled-components";
import tw from "twin.macro";

const AIResponse = tw.div`text-xs`;

class LomakyVectorDB extends Component {
  constructor(props) {
    super(props);

    this.state = {
      loading: true,
      result: "",
      trigger: false,
    };

    this.triggerNext = this.triggerNext.bind(this);
  }

  componentDidMount() {
    const self = this;
    const { steps } = this.props;
    const search = steps.search.value;

    const queryUrl = `https://completion.cheapdomain.store/search?query=${search}`;

    const xhr = new XMLHttpRequest();

    xhr.addEventListener("readystatechange", function () {
      if (this.readyState === 4) {
        const data = JSON.parse(this.responseText);
        const response = data.Response;
        if (response) {
          self.setState({ loading: false, result: response }, () => {
            // Trigger next step after displaying response
            self.triggerNext();
          });
        } else {
          self.setState(
            {
              loading: false,
              result:
                "No encontramos respuesta a tu pregunta, intenta de nuevo.",
            },
            () => {
              // Trigger next step after displaying response
              self.triggerNext();
            }
          );
        }
      }
    });

    xhr.open("GET", queryUrl);
    xhr.send();
  }

  triggerNext() {
    this.setState({ trigger: true }, () => {
      this.props.triggerNextStep();
    });
  }

  render() {
    const { loading, result } = this.state;
    return <AIResponse>{loading ? <Loading /> : result}</AIResponse>;
  }
}

LomakyVectorDB.propTypes = {
  steps: PropTypes.object,
  triggerNextStep: PropTypes.func,
};

LomakyVectorDB.defaultProps = {
  steps: undefined,
  triggerNextStep: undefined,
};

const CHATBOT_THEME = {
  background: "#FFFEFC",
  headerBgColor: "#6415ff",
  headerFontColor: "#fff",
  headerFontSize: "15px",
  botBubbleColor: "#6415ff",
  botFontColor: "#fff",
  userBubbleColor: "#fff",
  userFontColor: "#4a4a4a",
};

const ChatBotHelper = () => {
  const steps = [
    {
      id: "1",
      message:
        "Hola, te puedo ayudar a resolver preguntas de noticias anteriores, ¿Qué quisieras saber?",
      trigger: "search",
    },
    {
      id: "search",
      user: true,
      trigger: "3",
    },
    {
      id: "3",
      component: <LomakyVectorDB />,
      waitAction: true,
      trigger: "4",
    },
    {
      id: "4",
      message: "¿Hay algo más que te gustaría saber?",
      trigger: "search",
    },
  ];

  return (
    <>
      <ThemeProvider theme={CHATBOT_THEME}>
        <ChatBot
          steps={steps}
          floating={true}
          headerTitle="Habla con las noticias"
          enableSmoothScroll={true}
        />
      </ThemeProvider>
    </>
  );
};

export default ChatBotHelper;

Future improvements

This implementation is a proof of concept (PoC) and is not intended for production use. However, it offers a simple and clear demonstration of how RAG systems function. There are several potential improvements that could be made, such as integrating GraphRAG to better understand relationships across different chunks, incorporating reranking for more accurate results, and optimizing the overall architecture for efficiency and scalability.


Oscar Galvis

I'm a software builder. I work with IT architecture and cloud computing.