Local (Free) RAG with Question Generation using LM Studio, Nomic embeddings, ChromaDB and Llama 3.2 on a Mac mini M1
I’m excited to share this personal project, https://colombia.misprops.co, a website that scans Colombian news every two hours and uses AI to summarize articles, translate them into English, perform sentiment analysis, and generate question lists that improve RAG results. The project not only demonstrates the power of AI in news processing but also showcases a scalable, efficient architecture that runs on a Mac mini M1.
At the heart of this project lies a local implementation of LLaMA 3.2, running on LM Studio. This setup is used to summarize each article, translate it into English, and perform sentiment analysis. But that’s not all — Llama 3.2 is also used to generate a list of questions answered by each article, helping improve search results.
To store these articles and question lists, I’m utilizing ChromaDB, a vector database that takes advantage of the Nomic-embed-text-v1.5 model running on LM Studio. This ensures that articles are accurately represented and can be efficiently retrieved.
Finally, an ExpressJS chat completion API runs on the Mac mini M1, using the Llama model hosted on LM Studio. This API enables users to search for answers to questions generated from the news articles, making it a valuable resource for anyone interested in Colombian news in English.
The best part? The entire software stack is free. Beyond the Mac mini M1 itself, you don’t need to purchase anything to replicate this implementation. And the results are impressive: the machine processes up to 50,000 articles per month, roughly 70 articles per hour, or about one fully processed article (summary, translation, sentiment, and questions) every 50 seconds.
Given the high computing costs associated with AI, this project provides an interesting example of “cloud repatriation” using inexpensive hardware. By leveraging the power of local computation, we can reduce our reliance on cloud-based services and minimize costs associated with data processing. In a world where AI is increasingly becoming a critical tool for businesses and researchers, this approach highlights the potential for innovative repatriation strategies that combine cutting-edge technologies with cost-effective, locally-based solutions.
In fact, with the right configuration and optimization techniques, this system can be made to handle even more articles. To give you an idea of just how scalable this project is, here are the specs:
- Mac mini M1: 16GB RAM
- LM Studio (https://lmstudio.ai/)
- LLaMA 3.2: runs on LM Studio (https://lmstudio.ai/model/llama-3.2-3b-instruct)
- Nomic embeddings (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF)
- ChromaDB (https://www.trychroma.com/)
- ExpressJS chat completion API: runs on the Mac mini M1
By using this combination of technologies and tools, you can create a powerful tool for real-time news summarization and sentiment analysis — all without spending a dime.
In this article, I’ll provide an in-depth look at my implementation, detailing the specific technologies and tools used to build this local RAG: installing Chroma on Docker, installing LM Studio, generating questions for each article, and vectorizing the articles with nomic-embed-text-v1.5.
GitHub repo
Installing Chroma on Docker
The following command runs a Chroma container that persists the database to a directory on the host computer and exposes the service on port 8000:
docker run -d --name chromadb -v ./chroma:/chroma/chroma -p 8000:8000 -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE chromadb/chroma:latest
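Once the container is up, the application connects to it with the chromadb Node client and creates the collection that the later snippets call vectorDbEnglish. The sketch below is a minimal version under my own assumptions: the collection name is illustrative, and the constructor option for the server address varies slightly between chromadb client versions.

import { ChromaClient } from "chromadb";

// Connect to the Chroma container started above (default port 8000).
const chroma = new ChromaClient({ path: "http://localhost:8000" });

// English-language articles live in their own collection; this handle is what
// the snippets further down refer to as vectorDbEnglish.
const vectorDbEnglish = await chroma.getOrCreateCollection({
  name: "colombian-news-en", // illustrative name, not from the original project
});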
Installing LM Studio
LM Studio is a desktop app for developing and experimenting with LLMs on your computer. Download from https://lmstudio.ai and install.
Once installed, download and load these two models (a quick sanity check for the local server follows the list):
- mlx-community/Llama-3.2-3B-Instruct-4bit
- nomic-ai/nomic-embed-text-v1.5-GGUF
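After loading both models, start LM Studio’s local server, which exposes an OpenAI-compatible API (on port 1234 by default). Assuming that default address, you can confirm which models the server is currently serving:

// List the models LM Studio is serving; both the Llama 3.2 instruct model and
// the nomic embedding model should appear. Adjust the URL if you changed the port.
const res = await fetch("http://localhost:1234/v1/models");
const { data } = await res.json();
console.log(data.map((m: { id: string }) => m.id));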
Question generation for each article to improve RAG results
An improved solution for a RAG includes pre-processing document chunks, in this case articles, before storing them in a vector database. This involves asking the LLM to generate questions that are best answered by the article. The generated questions and their embedding vectors are then stored in the vector database, along with a reference to the original document text.
This approach, known as RAG with question generation, lets the vector search match a user’s question against questions each article actually answers, rather than against the raw article text alone. Empirically, it seems to return noticeably more relevant results.
// generateQuestions asks the Llama 3.2 model served by LM Studio for a list of
// questions that the article answers. `chat` is the project's request type
// (mirroring the OpenAI chat completion request shape), and llmChatEndpoint
// points at LM Studio's local server, e.g. http://localhost:1234/v1/chat/completions.
async generateQuestions(content: string): Promise<string> {
  const userQuery = `
Generate 6 questions related to the following news article. Return only the questions without their responses or any additional text:
${content}`;

  const promptSummaryRequest: chat = {
    model: "mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages: [
      {
        role: "system",
        content:
          "You are an assistant that generates questions to improve RAG Results.",
      },
      {
        role: "user",
        content: userQuery,
      },
    ],
    stream: false,
  };

  const llmHeaders = new Headers();
  llmHeaders.append("Content-Type", "application/json");

  const summaryResult = await fetch(llmChatEndpoint, {
    method: "POST",
    headers: llmHeaders,
    body: JSON.stringify(promptSummaryRequest),
    redirect: "follow",
  });

  // Extract the generated questions from the OpenAI-style response.
  const summaryResponse = (await summaryResult.json())?.choices?.[0]?.message
    ?.content as string | undefined;
  return summaryResponse ?? "";
}
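The returned value is a plain text list of questions. In the pipeline it is produced once per article and kept as the questions variable that the embedding snippet further down interpolates; the call site below is hypothetical, since the article doesn’t show the surrounding class.

// Hypothetical call site: articleProcessor is assumed to be the class exposing generateQuestions.
const questions = await articleProcessor.generateQuestions(article.englishSummary!);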
Vectorizing the news articles with nomic-embed-text-v1.5
LM Studio also serves the nomic-embed-text-v1.5 model, which we use to embed each article before storing it in ChromaDB. Trained on a large corpus of text, the model turns the combined article content (title, summary, and generated questions) into a dense vector representation that ChromaDB can search and retrieve efficiently. Because the GGUF build is quantized to 4-bit, it has a small memory footprint and fast inference, so running it locally on the Mac mini M1 adds very little computational overhead.
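The storage snippet below relies on a small helper, getContentEmbedding, that the article doesn’t show. Here is a minimal sketch of how it can be written against LM Studio’s OpenAI-compatible /v1/embeddings endpoint; the endpoint URL, the model identifier string, and the response handling are my assumptions. It returns an array of embeddings, which is the shape Chroma’s upsert and query calls accept.

// Minimal sketch of the embedding helper, assuming LM Studio's default local
// server address and an OpenAI-compatible /v1/embeddings response.
const llmEmbeddingEndpoint = "http://localhost:1234/v1/embeddings";

async function getContentEmbedding(content: string): Promise<number[][]> {
  const response = await fetch(llmEmbeddingEndpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      // Use the identifier LM Studio reports for the loaded nomic embedding model.
      model: "nomic-ai/nomic-embed-text-v1.5-GGUF",
      input: content,
    }),
  });
  const json = await response.json();
  // OpenAI-compatible response shape: { data: [{ embedding: number[] }, ...] }
  return json.data.map((d: { embedding: number[] }) => d.embedding);
}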
Once the vector is generated, the embedding can be stored in Chroma along with the article text and its metadata:
// Organise the article content as a small Markdown document: title, date,
// English summary, and the questions generated for this article.
const content = `# ${article.englishTitle}
*${articleDate}*
${article.englishSummary!}
## Potential questions answered in this article:
${questions}
`;

// Embed the combined text with nomic-embed-text-v1.5 via LM Studio.
const contentEmbedding = await getContentEmbedding(content);

// Vectorize article: upsert it into the Chroma collection, keyed by the
// article id, with title, date and URL stored as metadata.
await vectorDbEnglish.upsert({
  ids: [article.id!.toString()],
  embeddings: contentEmbedding,
  documents: [content],
  metadatas: [
    {
      title: article.englishTitle!,
      date: new Date(article.date!).toISOString(),
      url: article.url ?? "",
    },
  ],
});
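For completeness, here is a rough sketch of the ExpressJS chat completion endpoint mentioned at the beginning, under my own assumptions about the route name and prompt wording. It reuses the pieces above: the user’s question is embedded with getContentEmbedding, the closest articles are pulled from the vectorDbEnglish collection, and Llama 3.2 (served by LM Studio) answers using those articles as context.

import express from "express";

const app = express();
app.use(express.json());

app.post("/chat", async (req, res) => {
  const question: string = req.body.question ?? "";

  // Embed the question and retrieve the most relevant articles from Chroma.
  // getContentEmbedding and vectorDbEnglish come from the earlier snippets.
  const queryEmbedding = await getContentEmbedding(question);
  const results = await vectorDbEnglish.query({
    queryEmbeddings: queryEmbedding,
    nResults: 4,
  });
  const context = (results.documents?.[0] ?? []).join("\n\n---\n\n");

  // Ask Llama 3.2 (served by LM Studio) to answer from the retrieved articles only.
  const completion = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mlx-community/Llama-3.2-3B-Instruct-4bit",
      messages: [
        {
          role: "system",
          content:
            "Answer the user's question using only the news articles provided as context.",
        },
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
      stream: false,
    }),
  });
  const answer = (await completion.json())?.choices?.[0]?.message?.content ?? "";
  res.json({ answer });
});

app.listen(3000);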
Conclusion
In conclusion, our local RAG project showcases the power of AI in real-time news summarization and sentiment analysis. By leveraging local computation on a Mac mini M1, we’ve demonstrated that it’s possible to build an efficient and scalable system for processing large volumes of news articles, without incurring significant costs. Our use of LLaMA 3.2, Nomic embeddings, and ChromaDB enables us to generate high-quality question lists, perform sentiment analysis, and summarize articles, all while minimizing our reliance on cloud-based services.
This project serves as a proof-of-concept for the potential of “cloud repatriation” in AI, where local computation can be used to reduce costs and increase efficiency. By replicating this implementation on your own Mac mini M1, you can create a powerful tool for real-time news summarization and sentiment analysis, all without spending a dime.