The 'Google for Your Life' Setup That Made Me Delete 47 Cloud Subscriptions

afat1h · November 17, 2025, 8:18am

One-Line Flow:
Build a private, offline “Google for your life” — now upgraded with GPU speed, smarter RAG, automation, OCR, security, and backup superpowers.

BEFORE WE BEGIN — Why These Add-Ons Matter

Your basic setup works.
These additions make it faster, safer, smarter, and actually production-ready — without ruining the simple vibe.

Everything below stays in plain English, short lines, no brain-melting jargon.
Just power, but digestible.

PART 1 — GPU ACCELERATION (The 50x Speed Boost Button)

If you have an NVIDIA GPU, Ollama can stop crawling and start sprinting.

Check your GPU

nvidia-smi

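Linux (Ubuntu 22.04) – Install drivers + CUDA

Ollama itself only needs the NVIDIA driver; the CUDA toolkit is optional. The repo URL below targets Ubuntu 22.04, so swap ubuntu2204 for your release.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-3 nvidia-driver-545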

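Tell Ollama to use your GPU

Ollama detects a supported NVIDIA GPU on its own. The only variable you may need is the one that picks which GPU it sees:

export CUDA_VISIBLE_DEVICES=0

How many layers go to the GPU is set by the num_gpu model parameter, not by an environment variable. Ollama normally picks a sensible value by itself.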

Windows

Settings → System → Display → Graphics →
Set Ollama.exe → “High performance.”

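Test GPU mode

ollama run llama3
ollama ps   # in a second terminal; the PROCESSOR column should say GPU

To cap the number of GPU layers by hand, type this at the >>> prompt:

/set parameter num_gpu 35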
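
If you prefer to check from a script, one option is to ask Ollama's local API which models are loaded and how much of each sits in VRAM. This is a minimal sketch, assuming the default address http://127.0.0.1:11434 and that each entry returned by /api/ps carries size and size_vram fields. The helper names vram_share and report are made up for illustration.

```python
import json
from urllib.request import urlopen


def vram_share(model: dict) -> float:
    """Fraction of a loaded model held in GPU memory (0.0 means CPU only)."""
    total = model.get("size", 0)
    return model.get("size_vram", 0) / total if total else 0.0


def report(base: str = "http://127.0.0.1:11434") -> None:
    # Assumption: an Ollama server is listening at `base`.
    with urlopen(f"{base}/api/ps", timeout=5) as resp:
        loaded = json.load(resp).get("models", [])
    for m in loaded:
        print(f"{m['name']}: {vram_share(m):.0%} on GPU")
```

Call report() while a model is loaded. A value well below 100% suggests the model did not fully fit in VRAM.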

Model Selection Made Stupid-Simple

Pick based on your RAM + patience:

| Model            | RAM Needed | Speed  | Quality  | Best For       |
|------------------|------------|--------|----------|----------------|
| phi3 (3.8B)      | 4GB        | Fast   | Okay     | Old laptops    |
| llama3.2:3b      | 6GB        | Fast   | Good     | Everyday chat  |
| llama3:8b        | 8GB        | Medium | Great    | Best balance   |
| qwen2.5:14b      | 16GB       | Slow   | Amazing  | Deep reasoning |
| llama3.3:70b     | 40GB+      | RIP    | God-Tier | Research only  |
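
The table can also be read as a lookup: take the biggest model that fits your RAM. Here is a rough sketch of that rule. The 2 GB headroom for the operating system is my own assumption, not a measured number, and pick_model is a hypothetical helper.

```python
from typing import Optional

# (model tag, RAM needed in GB) taken from the table, largest first.
MODELS = [
    ("llama3.3:70b", 40),
    ("qwen2.5:14b", 16),
    ("llama3:8b", 8),
    ("llama3.2:3b", 6),
    ("phi3", 4),
]


def pick_model(ram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest model that fits after reserving some RAM for the OS."""
    usable = ram_gb - headroom_gb  # assumption: keep a little RAM free
    for tag, need in MODELS:
        if usable >= need:
            return tag
    return None  # nothing in the table fits


print(pick_model(16))  # llama3:8b
```

Set headroom_gb=0 if you want the table applied literally.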

PART 2 — ADVANCED RAG SETTINGS (Make Your AI Actually “Get” Your Documents)

Your chunks matter.
This is how the brain remembers.

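Suggested Chunk Sizes

technical_docs:    2500 tokens, overlap 250
legal_contracts:   1500 tokens, overlap 300
chat_logs:          500 tokens, overlap 50

Treat these as starting points and tune them against your own documents. Keep each chunk within your embedding model's input limit: multilingual-e5-large, for example, truncates anything beyond 512 tokens.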
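
To make "chunk size" and "overlap" concrete, here is a small sketch of a sliding-window splitter. It is only an illustration, not how AnythingLLM splits text internally, and it uses whitespace-separated words as a stand-in for real tokens.

```python
def chunk_tokens(tokens: list, size: int, overlap: int) -> list:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "a b c d e f g h i j".split()  # words stand in for tokens
print(chunk_tokens(words, size=4, overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.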

AnythingLLM Settings to Fix

Inside Workspace → Vector Database:

Embedding Model: nomic-embed-text (English) or multilingual-e5-large (multi-language)
Similarity Threshold: 0.7
Max Snippets: 5–10
Temperature: 0.2 for facts, 0.7 for creative answers
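
What the threshold and snippet count do, roughly: every chunk gets a similarity score against your question, low scores are dropped, and only the best few are handed to the model. The sketch below assumes cosine similarity and plain Python lists as vectors. In AnythingLLM the vector database does this work, and retrieve is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, threshold=0.7, max_snippets=5):
    """chunks is a list of (text, vector); return the best (score, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [s for s in scored if s[0] >= threshold]  # similarity threshold
    kept.sort(key=lambda s: s[0], reverse=True)      # best match first
    return kept[:max_snippets]                       # max snippets
```

A higher threshold gives fewer but more relevant snippets. If answers come back empty, lowering it is usually the first thing to try.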

PART 3 — TROUBLESHOOTING (The “Everything Is Breaking” Section)

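1. Out of Memory

Offload fewer layers to the GPU (type this at the >>> prompt), or switch to a smaller model:

/set parameter num_gpu 20

If VRAM is still held after you have stopped everything that uses the GPU, reset it (needs root):

sudo nvidia-smi --gpu-reset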

2. Slow? Laggy? Dying?

ollama ps   # Check running models
htop        # Check CPU/RAM
ollama stop llama3
export OLLAMA_MAX_LOADED_MODELS=1   # keep one model in memory; set before starting the Ollama server

Use quantized models (the default tags are already 4-bit; this just makes the choice explicit):

ollama pull llama3:8b-instruct-q4_0

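3. AnythingLLM can’t talk to Ollama

curl http://127.0.0.1:11434/api/tags

If that returns a JSON list of models, Ollama is up. When AnythingLLM runs in Docker, 127.0.0.1 points at the container itself, so set the Ollama URL to http://host.docker.internal:11434 (on Linux, add --add-host=host.docker.internal:host-gateway to docker run). Open the firewall only if AnythingLLM is on a different machine, because this exposes Ollama to your network:

sudo ufw allow 11434/tcp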

4. Import crashes

Split PDFs over 50MB
Install OCR
Convert weird documents to UTF-8
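
For the UTF-8 step, a small sketch of one way to do it: try a few common encodings in order and write the result back as UTF-8. The fallback list is an assumption that suits Western European text; files in other scripts may need a different list or a detection library. The function names are made up for illustration.

```python
from pathlib import Path


def to_utf8_text(raw: bytes, fallbacks=("utf-8-sig", "cp1252", "latin-1")) -> str:
    """Decode bytes by trying each encoding in order (latin-1 never fails)."""
    for enc in fallbacks:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passed a list without latin-1.
    return raw.decode("utf-8", errors="replace")


def convert_file(src, dst) -> None:
    """Read src in whatever encoding works and save it as UTF-8 at dst."""
    Path(dst).write_text(to_utf8_text(Path(src).read_bytes()), encoding="utf-8")
```

Write to a new file rather than over the original, so a wrong guess does not destroy the source.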

PART 4 — AUTOMATION (Weekly indexing, auto-summaries, hands-free mode)

Python script to automate everything

import time
from pathlib import Path

import requests
import schedule  # pip install schedule requests

WEEK_SECONDS = 7 * 24 * 60 * 60

class AnythingLLMAutomation:
    def __init__(self, api_key, base="http://localhost:3001"):
        self.headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
        self.base = base

    def auto_index(self, folder, workspace):
        # Upload every file modified in the last 7 days (directories are skipped)
        cutoff = time.time() - WEEK_SECONDS
        for f in Path(folder).glob("**/*"):
            if f.is_file() and f.stat().st_mtime > cutoff:
                self.upload(f, workspace)

    def upload(self, file_path, workspace):
        # Use your existing upload endpoint here
        pass

    def query(self, prompt, workspace):
        res = requests.post(f"{self.base}/api/v1/workspace/{workspace}/chat",
                            headers=self.headers,
                            json={"message": prompt, "mode": "query"},
                            timeout=300)
        res.raise_for_status()
        return res.json().get("textResponse")

automation = AnythingLLMAutomation("your-key")
schedule.every().monday.do(lambda: automation.auto_index("/LifeSearch", "ws"))
# Print the summary; otherwise the scheduled result is thrown away
schedule.every().friday.do(lambda: print(automation.query("Summarize this week", "ws")))

# Without this loop the scheduled jobs never run
while True:
    schedule.run_pending()
    time.sleep(60)

PART 5 — OCR (Make scanned PDFs readable)

Install Tesseract OCR

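Linux

sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils

macOS

brew install tesseract tesseract-lang poppler

Python packages (pdf2image needs the poppler tools installed above)

pip install pytesseract pdf2image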

Convert scanned PDFs to text

import pytesseract
from pdf2image import convert_from_path
from pathlib import Path

def ocr_pdf_folder(path):
    # Write a .txt file next to every PDF in the folder
    for pdf in Path(path).glob("*.pdf"):
        images = convert_from_path(pdf)  # one image per page
        text = "\n".join(pytesseract.image_to_string(img, lang="eng") for img in images)
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")

ocr_pdf_folder("/LifeSearch/receipts")

PART 6 — SECURITY (If other humans will use this)

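Run AnythingLLM with a login password

docker run -d \
 -p 3001:3001 \
 --cap-add SYS_ADMIN \
 -v ~/.anythingllm:/app/server/storage \
 -e STORAGE_DIR="/app/server/storage" \
 -e AUTH_TOKEN="change-this-password" \
 -e JWT_SECRET="replace-with-a-long-random-string" \
 -e DISABLE_TELEMETRY=true \
 --name anythingllm \
 mintplexlabs/anythingllm

AUTH_TOKEN sets the password for the single-user login. For separate accounts, turn on multi-user mode in the AnythingLLM settings after the first start. Note that this command adds a login, not encryption of the stored files.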

Sanitize sensitive info before indexing

import re

def sanitize(text):
    patterns={
        "email":r"\S+@\S+",
        "phone":r"\b\d{10}\b",
        "cc":r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"
    }
    for k,p in patterns.items():
        text=re.sub(p,f"[REDACTED_{k.upper()}]",text)
    return text
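
The credit-card pattern above also hits any 16-digit number, such as order IDs. One possible refinement is to redact only numbers that pass the Luhn checksum used by real card numbers. This is a sketch that reuses the same pattern; luhn_ok and redact_cards are hypothetical helpers, not part of the function above.

```python
import re

CARD = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def luhn_ok(number: str) -> bool:
    """True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def redact_cards(text: str) -> str:
    """Replace 16-digit numbers only when they look like valid card numbers."""
    return CARD.sub(
        lambda m: "[REDACTED_CC]" if luhn_ok(m.group()) else m.group(), text
    )


print(redact_cards("card 4111 1111 1111 1111, order 1234 5678 9012 3456"))
# card [REDACTED_CC], order 1234 5678 9012 3456
```

If you would rather over-redact than miss a mistyped card number, keep the simpler pattern-only version.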

PART 7 — PERFORMANCE MONITORING

A simple live dashboard:

watch -n 2 "nvidia-smi; echo; free -h; echo; ollama ps"

PART 8 — ADVANCED QUERIES (Your “big brain” mode)

Use these instead of basic prompts:

Multi-file synthesis

"Compare all contracts with CompanyABC. 
Give a table with value, terms, renewal date, obligations."

Timeline analysis

"Show how my opinion on crypto changed from 2020 to 2025 using all my notes."

Deep summary

"Summarize everything related to Project X. 
Give: status, blockers, decisions, deadlines."

PART 9 — DISASTER RECOVERY (When things explode)

Backup script

tar -czf backup_$(date +%F).tar.gz \
  ~/.anythingllm \
  ~/LifeSearch \
  ~/.ollama/models

Delete backups older than 7 days:

find . -maxdepth 1 -name "backup_*.tar.gz" -mtime +7 -delete
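
The find command prunes by age. If you would rather keep a fixed number of archives no matter how old they are, here is a short sketch. It assumes the backup_YYYY-MM-DD.tar.gz names produced by the tar command above, so sorting by name is the same as sorting by date. prune_backups is a made-up helper name.

```python
from pathlib import Path


def prune_backups(folder, keep=7, pattern="backup_*.tar.gz"):
    """Delete all but the `keep` newest archives and return the deleted paths."""
    # ISO dates in the file name sort chronologically, newest first here.
    archives = sorted(Path(folder).glob(pattern), key=lambda p: p.name, reverse=True)
    old = archives[keep:]
    for path in old:
        path.unlink()
    return old
```

Run it right after the tar command, for example prune_backups(".", keep=7).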

PART 10 — INTEGRATIONS (Optional but sexy)

Sync Obsidian Vault → AnythingLLM

from pathlib import Path

def sync_obsidian(vault, api):
    for md in Path(vault).glob("**/*.md"):
        # upload to AnythingLLM here
        pass
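
Before uploading, it helps to skip Obsidian's own folders so settings and deleted notes do not end up in your index. This is a minimal sketch of the file-collection half only; the upload call is left out because it depends on your AnythingLLM setup. vault_notes is a hypothetical helper, and the skip list assumes the standard .obsidian and .trash folder names.

```python
from pathlib import Path

SKIP_DIRS = {".obsidian", ".trash"}  # Obsidian config and vault trash


def vault_notes(vault):
    """Return Markdown notes in the vault, skipping Obsidian's own folders."""
    root = Path(vault)
    return sorted(
        md for md in root.rglob("*.md")
        if not SKIP_DIRS & set(md.relative_to(root).parts)
    )
```

Loop over vault_notes("/path/to/vault") inside sync_obsidian instead of the raw glob.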

FINAL NOTE

You now have:

Speed (GPU)
Smarts (advanced RAG)
Automation
OCR
Security
Backups
Monitoring
Power queries
Integrations

Still simple. Still offline.
Still your own private memory machine — now upgraded like it’s 2025.

Rizal_Consulting · November 17, 2025, 11:58am

can you give simple and beginner friendly step by step to start from the beginning?

afat1h · November 17, 2025, 12:13pm

Of course:

1. Install Ollama

Go to: https://ollama.com
Download the installer for your system (Windows, macOS, or Linux).
Install and start Ollama.
- On Windows: run Ollama from the Start menu. It will stay in the tray.
- On macOS: open the app once and leave it running.

That is all you need for the base setup.

2. Download and run a model

Open a terminal or command prompt and type:

ollama run llama3

What happens:

The first time, Ollama automatically downloads the llama3 model.
After the download finishes, you will see a >>> prompt.
Now you can just type questions and press Enter.
Type /bye or press Ctrl+D to quit.

If this works, your local AI is already running.

3. Optional: use your GPU

If you have an NVIDIA GPU and proper drivers, Ollama uses the GPU automatically. There is no extra flag to pass.

To check, run this in a second terminal while the model is loaded:

ollama ps

The PROCESSOR column shows how much of the model is on the GPU.
If it says 100% CPU, the model still works, only slower.

For a true beginner, this step is optional.
You can stay on CPU until you feel comfortable.

4. Very short model guide

If you are not sure what to pick:

Weak laptop: phi3 or llama3.2:3b
Normal PC with 16 GB RAM: llama3:8b
Big GPU and a lot of RAM: bigger models later

Example:

ollama run llama3.2:3b

5. Next step: connect to AnythingLLM (optional)

Once you are happy with Ollama itself, then you can add tools like AnythingLLM for:

Chatting over your PDFs and notes
Better history and workspaces

But the absolute beginner path is simply:

Install Ollama
ollama run llama3 in a terminal
Start asking questions

After that, you can come back to my original post when you are ready for the advanced setup.

pintas_m · November 17, 2025, 12:42pm

Thank you! This is really good and handy.

One question though… In “Model Selection Made Stupid-Simple”, does the RAM refer to GPU or computer RAM?

afat1h · November 17, 2025, 2:32pm

It is computer RAM. If the model also fits in your GPU's VRAM, Ollama runs it there, which is much faster. If it does not fit, Ollama splits the model between GPU and CPU.

AI_Vids · November 17, 2025, 4:43pm

I have an old PC with 32GB RAM and an NVIDIA RTX 3060. This will work well for me, yes? I am looking for an assisted coding solution, like the Google Gemini in VS Code that I was paying for. I cancelled that today because it is too expensive for me. Please guide me on how I can do the same with this.

afat1h · November 17, 2025, 5:48pm

That PC is actually more than enough for this. You can comfortably run 7B and 8B models and even try some bigger ones if you want.

If you want a local coding assistant in VS Code, you can set it up like this:

Install Ollama
Download and install from https://ollama.com, then run it once so the Ollama service is running in the background.

Pull a coding model
Open a terminal and run for example:

ollama pull qwen2.5-coder:1.5b   # great for fast autocomplete
ollama pull qwen2.5-coder:7b     # nicer quality for chats about code
# optional general chat model:
ollama pull llama3.1:8b

Qwen2.5 Coder is tuned specially for coding and works very well with Continue for autocomplete and code edits.

Install Continue in VS Code
In VS Code, go to Extensions, search for Continue and install the Continue.dev extension. It is designed exactly to be a Copilot or Gemini style assistant that can talk to Ollama locally.

Point Continue to Ollama
After installing, click the gear icon in the Continue sidebar. This opens your config file (config.json or config.yaml). Add a model block that uses Ollama, for example in JSON format:

{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434/"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b",
    "apiBase": "http://localhost:11434/"
  }
}

Save the file, then in the Continue panel choose that model as your default.

Ready
Now you can:
- get inline tab completions while you type,
- select some code and ask Continue to explain, refactor or write tests,
- open the chat panel and talk to the model about bugs, design ideas, etc.

If anything feels slow, just switch autocomplete to the 1.5B model and keep the 7B or Llama 3.1 8B for chat and heavier tasks.
