Run MetricChat with Local LLMs Using Ollama
Keep every query on your hardware. This guide walks through setting up MetricChat with Ollama so your data and prompts never leave your network.
Most AI-powered analytics tools route your prompts — and often fragments of your data — through third-party APIs. For teams operating under strict compliance requirements, working in air-gapped environments, or simply unwilling to hand their business context to an external service, that is a non-starter.
MetricChat supports local large language model inference through Ollama, an open-source runtime that makes it straightforward to download and serve models on your own hardware. Once configured, your natural-language questions, your schema metadata, and every generated SQL query stay entirely within your network.
Why Local LLMs Matter
The case for running models locally goes beyond privacy preferences.
Regulatory compliance. Industries governed by HIPAA, SOC 2, GDPR, or FedRAMP have strict rules about where sensitive data can travel. Sending query context to a cloud provider — even one with a DPA in place — introduces surface area that legal and security teams must account for. A local model eliminates that category of risk entirely.
Air-gapped environments. Defense contractors, financial institutions, and certain enterprise data centers operate networks with no outbound internet access by design. Cloud LLM APIs are simply unavailable in these environments. Ollama runs as a local HTTP server, which is exactly what those setups require.
Cost control at scale. Cloud LLM API costs scale with token volume. Teams running hundreds of analyst queries per day can face meaningful monthly bills. A one-time investment in capable on-premises hardware shifts that cost structure permanently.
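To make the cost point concrete, here is a back-of-envelope sketch. Every number below is an illustrative assumption, not a quote from any provider; substitute your own query volume and rates.

```shell
# Illustrative assumptions only — not real pricing from any provider.
queries_per_day=300
tokens_per_query=4000        # prompt + schema context + completion
usd_per_million_tokens=5     # hypothetical cloud API rate

awk -v q="$queries_per_day" -v t="$tokens_per_query" -v p="$usd_per_million_tokens" \
  'BEGIN { printf "Estimated monthly spend: $%.0f\n", q * t * 30 * p / 1e6 }'
```

At these assumed rates that is $180 per month, every month; a capable local GPU is a one-time purchase.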
Latency and availability. Local inference removes the round-trip to an external API. Depending on hardware, responses can be faster — and there is no dependency on a third-party service's uptime or rate limits.
Prerequisites
Before you begin, confirm your environment meets these requirements.
Hardware. For acceptable SQL generation quality, you need a machine with at least 16 GB of system RAM. GPU acceleration is strongly recommended for interactive response times. An NVIDIA GPU with 8 GB of VRAM (e.g., RTX 3060 or better) will run quantized 7B models comfortably. For 13B models, target 16 GB VRAM or plan to use CPU offloading with the performance trade-off that entails.
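As a rough rule of thumb, a quantized model needs about (parameters × bits per weight ÷ 8) gigabytes for weights, plus runtime overhead. The 1.5 GB overhead figure below is an assumption; real usage varies with context length and runtime version.

```shell
# Rough VRAM estimate: weights + ~1.5 GB overhead (assumed; varies with context size)
params_billion=7
bits_per_weight=4   # Q4-family quantization

awk -v p="$params_billion" -v b="$bits_per_weight" \
  'BEGIN { printf "~%.1f GB VRAM for a %sB model at %s-bit\n", p * b / 8 + 1.5, p, b }'
```

For a 7B model at 4-bit this works out to roughly 5 GB, which is why an 8 GB card handles it comfortably.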
Operating system. Ollama supports macOS (Apple Silicon and Intel), Linux (x86-64), and Windows via WSL2. Apple Silicon Macs use the Metal backend and perform well even without a discrete GPU.
Software. Docker is not required. Ollama installs as a native binary. MetricChat version 1.4.0 or later is required for the Ollama provider configuration.
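If you need to script the 1.4.0 minimum-version check, GNU `sort -V` compares version strings correctly. The `installed` value here is a stand-in for whatever version your MetricChat deployment reports.

```shell
# Compare a reported version against the 1.4.0 minimum using sort -V.
required="1.4.0"
installed="1.5.2"   # stand-in: substitute the version your deployment reports

lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)
if [ "$lowest" = "$required" ]; then
  echo "MetricChat $installed meets the $required minimum"
else
  echo "MetricChat $installed is too old for the Ollama provider"
fi
```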
Step 1: Install Ollama
Download and install Ollama from ollama.com/download, or use the one-line installer on Linux:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```

On macOS, the .dmg installer places Ollama in your Applications folder and registers it as a menu-bar service that starts automatically on login.
Verify the installation:
```shell
ollama --version
```

By default, Ollama listens on http://localhost:11434. You can confirm the server is running:

```shell
curl http://localhost:11434
# Ollama is running
```

Step 2: Pull a Model
Ollama hosts a registry of open-weight models. Pull the model you intend to use before configuring MetricChat.
For general SQL generation on a mid-range GPU, llama3 is a solid starting point:
```shell
ollama pull llama3
```

For a model tuned toward code and SQL tasks, codellama offers strong structured-output quality:

```shell
ollama pull codellama
```

For a smaller footprint that still performs well on straightforward schemas, mistral runs on less VRAM and responds quickly:

```shell
ollama pull mistral
```

Confirm the model is available locally:

```shell
ollama list
```

You can test the model directly before connecting MetricChat:

```shell
ollama run llama3 "Write a SQL query that returns the top 10 customers by revenue."
```

Step 3: Configure MetricChat to Use Ollama
MetricChat reads its LLM provider from environment variables. Set the following in your .env file or deployment environment:
```
# Select Ollama as the LLM provider
METRICCHAT_LLM_PROVIDER=ollama

# Base URL of your Ollama server
# Use the default if running locally, or replace with your server's address
METRICCHAT_OLLAMA_BASE_URL=http://localhost:11434

# The model name exactly as it appears in `ollama list`
METRICCHAT_OLLAMA_MODEL=llama3
```

If Ollama is running on a separate machine within your network, replace localhost with that machine's IP address or hostname:

```
METRICCHAT_OLLAMA_BASE_URL=http://192.168.1.50:11434
```

Restart MetricChat after updating these values. On the Settings page, the LLM provider indicator should display "Ollama" with the configured model name.
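If the indicator does not show "Ollama", you can check the endpoint and model name directly. This sketch queries Ollama's /api/tags endpoint (the same data `ollama list` shows); the `check_ollama` helper name is ours, and the environment variable names match the MetricChat settings above.

```shell
# check_ollama: verify the configured endpoint is reachable and the model is pulled.
check_ollama() {
  base_url="${METRICCHAT_OLLAMA_BASE_URL:-http://localhost:11434}"
  model="${METRICCHAT_OLLAMA_MODEL:-llama3}"
  if tags=$(curl -sf --connect-timeout 3 "$base_url/api/tags"); then
    if printf '%s' "$tags" | grep -q "\"name\":[[:space:]]*\"$model"; then
      echo "OK: $model is available at $base_url"
    else
      echo "WARN: server is up at $base_url but $model is not pulled"
    fi
  else
    echo "ERROR: cannot reach Ollama at $base_url"
  fi
}

check_ollama
```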
Model Selection: Tradeoffs to Know
Not all open-weight models perform equally on SQL generation tasks. Here is a practical breakdown of what to expect.
| Model | VRAM Required | SQL Quality | Response Speed |
|---|---|---|---|
| llama3 (8B) | ~5 GB | Good — handles joins, aggregations, and filters reliably | Fast |
| codellama (7B) | ~5 GB | Very good — designed for code tasks, including SQL | Fast |
| codellama:13b | ~10 GB | Excellent — better reasoning for complex multi-table queries | Moderate |
| mistral (7B) | ~5 GB | Good — efficient and consistent on well-defined schemas | Fast |
| mixtral:8x7b | ~30 GB | Excellent — near cloud-model quality | Slow without a high-end GPU |
For most teams, codellama at 7B is the best balance of quality and resource requirements. If your schema is complex — many tables, deeply nested relationships, or non-obvious column names — stepping up to codellama:13b or mixtral:8x7b (if hardware permits) produces noticeably better results on ambiguous queries.
Performance Tips
Enable GPU acceleration. On Linux with NVIDIA hardware, install the CUDA toolkit and ensure nvidia-smi reports your GPU before starting Ollama. Ollama auto-detects CUDA and uses GPU layers by default.
```shell
# Confirm CUDA is available to Ollama
ollama run llama3 "hi" 2>&1 | grep -i gpu
```

Use quantized models. Ollama defaults to Q4_K_M quantization for most models, which reduces VRAM usage significantly with minimal quality loss compared to full-precision weights. If you have VRAM headroom, you can pull higher-precision variants:

```shell
ollama pull codellama:13b-instruct-q8_0
```

Increase the context window if needed. MetricChat sends schema metadata along with each prompt. On databases with large schemas, the context can be substantial. If you observe truncated or degraded responses, set a higher num_ctx value via an Ollama Modelfile:
```shell
# Create a custom model with an expanded context window
cat <<EOF > Modelfile
FROM codellama
PARAMETER num_ctx 8192
EOF
ollama create metricchat-codellama -f Modelfile
```

Then update your environment variable:

```
METRICCHAT_OLLAMA_MODEL=metricchat-codellama
```

Run Ollama as a system service on Linux. The installer registers a systemd unit automatically. Verify it is enabled so the server survives reboots:

```shell
sudo systemctl enable ollama
sudo systemctl status ollama
```

Running Fully Air-Gapped
Once Ollama is configured and models are pulled, MetricChat makes no outbound network calls for query generation. You can verify this by monitoring network traffic with a tool like lsof or your firewall logs while submitting queries — all LLM traffic stays on localhost or within your LAN segment.
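As a concrete spot check (these commands assume Linux with iproute2's `ss`; `lsof -i` offers an equivalent view), list the TCP sockets held by the Ollama process while a query is in flight and confirm only local or LAN addresses appear:

```shell
# Show TCP sockets belonging to the ollama process, if one is running.
if pid=$(pgrep -xo ollama); then
  ss -tnp 2>/dev/null | grep "pid=$pid," \
    || echo "ollama (pid $pid) has no open TCP connections"
else
  echo "ollama is not running"
fi
```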
For a complete air-gapped deployment, pull your chosen models on a machine with internet access, export the Ollama model store directory (~/.ollama/models on Linux/macOS), transfer it to the air-gapped host, and restart Ollama. The models load from the local filesystem without any registry calls.
```shell
# On the internet-connected machine: archive relative to $HOME so the
# paths extract cleanly on the target
tar -czf ollama-models.tar.gz -C ~ .ollama/models

# Transfer ollama-models.tar.gz to the air-gapped host, then:
tar -xzf ollama-models.tar.gz -C ~
```

Conclusion
Pairing MetricChat with Ollama gives teams a fully self-contained analytics stack where every component — from the database connection to the language model — operates under their control. There are no third-party API calls, no data leaving the network, and no per-token costs to manage.
For teams with genuine compliance requirements or air-gapped constraints, this setup is not a workaround — it is the intended architecture. Pull a model, point MetricChat at your Ollama endpoint, and ask questions about your data the same way you would with any cloud-backed configuration. The only difference is that everything stays yours.