Run MetricChat with Local LLMs Using Ollama
Keep every query on your hardware. This guide walks through setting up MetricChat with Ollama so your data and prompts never leave your network.
Most AI-powered analytics tools route your prompts — and often fragments of your data — through third-party APIs. For teams operating under strict compliance requirements, working in air-gapped environments, or simply unwilling to hand their business context to an external service, that is a non-starter.
MetricChat supports local large language model inference through Ollama, an open-source runtime that makes it straightforward to download and serve models on your own hardware. Once configured, your natural-language questions, your schema metadata, and every generated SQL query stay entirely within your network.
Why Local LLMs Matter
The case for running models locally goes beyond privacy preferences.
Regulatory compliance. Industries governed by HIPAA, SOC 2, GDPR, or FedRAMP have strict rules about where sensitive data can travel. Sending query context to a cloud provider — even one with a DPA in place — introduces surface area that legal and security teams must account for. A local model eliminates that category of risk entirely.
Air-gapped environments. Defense contractors, financial institutions, and certain enterprise data centers operate networks with no outbound internet access by design. Cloud LLM APIs are simply unavailable in these environments. Ollama runs as a local HTTP server, which is exactly what those setups require.
Cost control at scale. Cloud LLM API costs scale with token volume. Teams running hundreds of analyst queries per day can face meaningful monthly bills. A one-time investment in capable on-premises hardware shifts that cost structure permanently.
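To make the cost point concrete, here is a back-of-envelope sketch. Every number below is an illustrative assumption, not a quote from any provider; substitute your own query volume and rates.

```shell
# Illustrative assumptions only — not real pricing from any provider.
queries_per_day=300
tokens_per_query=4000        # prompt + schema context + completion
usd_per_million_tokens=5     # hypothetical cloud API rate

awk -v q="$queries_per_day" -v t="$tokens_per_query" -v p="$usd_per_million_tokens" \
  'BEGIN { printf "Estimated monthly spend: $%.0f\n", q * t * 30 * p / 1e6 }'
```

At these assumed rates that is $180 per month, every month; a capable local GPU is a one-time purchase.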
Latency and availability. Local inference removes the round-trip to an external API. Depending on hardware, responses can be faster — and there is no dependency on a third-party service's uptime or rate limits.
Prerequisites
Before you begin, confirm your environment meets these requirements.
Hardware. For acceptable SQL generation quality, you need a machine with at least 16 GB of system RAM. GPU acceleration is strongly recommended for interactive response times. An NVIDIA GPU with 8 GB of VRAM (e.g., RTX 3060 or better) will run quantized 7B models comfortably. For 13B models, target 16 GB VRAM or plan to use CPU offloading with the performance trade-off that entails.
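As a rough rule of thumb, a quantized model needs about (parameters × bits per weight ÷ 8) gigabytes for weights, plus runtime overhead. The 1.5 GB overhead figure below is an assumption; real usage varies with context length and runtime version.

```shell
# Rough VRAM estimate: weights + ~1.5 GB overhead (assumed; varies with context size)
params_billion=7
bits_per_weight=4   # Q4-family quantization

awk -v p="$params_billion" -v b="$bits_per_weight" \
  'BEGIN { printf "~%.1f GB VRAM for a %sB model at %s-bit\n", p * b / 8 + 1.5, p, b }'
```

For a 7B model at 4-bit this works out to roughly 5 GB, which is why an 8 GB card handles it comfortably.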
Operating system. Ollama supports macOS (Apple Silicon and Intel), Linux (x86-64), and Windows via WSL2. Apple Silicon Macs use the Metal backend and perform well even without a discrete GPU.
Software. Docker is not required. Ollama installs as a native binary. MetricChat version 1.4.0 or later is required for the Ollama provider configuration.
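If you need to script the 1.4.0 minimum-version check, GNU `sort -V` compares version strings correctly. The `installed` value here is a stand-in for whatever version your MetricChat deployment reports.

```shell
# Compare a reported version against the 1.4.0 minimum using sort -V.
required="1.4.0"
installed="1.5.2"   # stand-in: substitute the version your deployment reports

lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)
if [ "$lowest" = "$required" ]; then
  echo "MetricChat $installed meets the $required minimum"
else
  echo "MetricChat $installed is too old for the Ollama provider"
fi
```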
Step 1: Install Ollama
Download and install Ollama from ollama.com/download, or use the one-line installer on Linux:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```

On macOS, the .dmg installer places Ollama in your Applications folder and registers it as a menu-bar service that starts automatically on login.
Verify the installation:
```shell
ollama --version
```

By default, Ollama listens on http://localhost:11434. You can confirm the server is running:

```shell
curl http://localhost:11434
# Ollama is running
```

Step 2: Pull a Model
Ollama hosts a registry of open-weight models. Pull the model you intend to use before configuring MetricChat.
For general SQL generation on a mid-range GPU, llama3 is a solid starting point:
```shell
ollama pull llama3
```

For a model tuned toward code and SQL tasks, codellama offers strong structured-output quality:

```shell
ollama pull codellama
```

For a smaller footprint that still performs well on straightforward schemas, mistral runs on less VRAM and responds quickly:

```shell
ollama pull mistral
```

Confirm the model is available locally:

```shell
ollama list
```

You can test the model directly before connecting MetricChat:

```shell
ollama run llama3 "Write a SQL query that returns the top 10 customers by revenue."
```

Step 3: Configure MetricChat to Use Ollama
MetricChat reads its LLM provider from environment variables. Set the following in your .env file or deployment environment:
```
# Select Ollama as the LLM provider
METRICCHAT_LLM_PROVIDER=ollama

# Base URL of your Ollama server
# Use the default if running locally, or replace with your server's address
METRICCHAT_OLLAMA_BASE_URL=http://localhost:11434

# The model name exactly as it appears in `ollama list`
METRICCHAT_OLLAMA_MODEL=llama3
```

If Ollama is running on a separate machine within your network, replace localhost with that machine's IP address or hostname:

```
METRICCHAT_OLLAMA_BASE_URL=http://192.168.1.50:11434
```

Restart MetricChat after updating these values. On the Settings page, the LLM provider indicator should display "Ollama" with the configured model name.
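If the indicator does not show "Ollama", you can check the endpoint and model name directly. This sketch queries Ollama's /api/tags endpoint (the same data `ollama list` shows); the `check_ollama` helper name is ours, and the environment variable names match the MetricChat settings above.

```shell
# check_ollama: verify the configured endpoint is reachable and the model is pulled.
check_ollama() {
  base_url="${METRICCHAT_OLLAMA_BASE_URL:-http://localhost:11434}"
  model="${METRICCHAT_OLLAMA_MODEL:-llama3}"
  if tags=$(curl -sf --connect-timeout 3 "$base_url/api/tags"); then
    if printf '%s' "$tags" | grep -q "\"name\":[[:space:]]*\"$model"; then
      echo "OK: $model is available at $base_url"
    else
      echo "WARN: server is up at $base_url but $model is not pulled"
    fi
  else
    echo "ERROR: cannot reach Ollama at $base_url"
  fi
}

check_ollama
```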
Model Selection: Tradeoffs to Know
Not all open-weight models perform equally on SQL generation tasks. Here is a practical breakdown of what to expect.
| Model | VRAM Required | SQL Quality | Response Speed |
|---|---|---|---|
| llama3 (8B) | ~5 GB | Good — handles joins, aggregations, and filters reliably | Fast |
| codellama (7B) | ~5 GB | Very good — designed for code tasks, including SQL | Fast |
| codellama:13b | ~10 GB | Excellent — better reasoning for complex multi-table queries | Moderate |
| mistral (7B) | ~5 GB | Good — efficient and consistent on well-defined schemas | Fast |
| mixtral:8x7b | ~30 GB | Excellent — near cloud-model quality | Slow without a high-end GPU |
For most teams, codellama at 7B is the best balance of quality and resource requirements. If your schema is complex — many tables, deeply nested relationships, or non-obvious column names — stepping up to codellama:13b or mixtral:8x7b (if hardware permits) produces noticeably better results on ambiguous queries.
Performance Tips
Enable GPU acceleration. On Linux with NVIDIA hardware, install the CUDA toolkit and ensure nvidia-smi reports your GPU before starting Ollama. Ollama auto-detects CUDA and uses GPU layers by default.
```shell
# Confirm CUDA is available to Ollama
ollama run llama3 "hi" 2>&1 | grep -i gpu
```

Use quantized models. Ollama defaults to Q4_K_M quantization for most models, which reduces VRAM usage significantly with minimal quality loss compared to full-precision weights. If you have VRAM headroom, you can pull higher-precision variants:

```shell
ollama pull codellama:13b-instruct-q8_0
```

Increase the context window if needed. MetricChat sends schema metadata along with each prompt. On databases with large schemas, the context can be substantial. If you observe truncated or degraded responses, set a higher num_ctx value via an Ollama Modelfile:
```shell
# Create a custom model with an expanded context window
cat <<EOF > Modelfile
FROM codellama
PARAMETER num_ctx 8192
EOF
ollama create metricchat-codellama -f Modelfile
```

Then update your environment variable:

```
METRICCHAT_OLLAMA_MODEL=metricchat-codellama
```

Run Ollama as a system service on Linux. The installer registers a systemd unit automatically. Verify it is enabled so the server survives reboots:

```shell
sudo systemctl enable ollama
sudo systemctl status ollama
```

Running Fully Air-Gapped
Once Ollama is configured and models are pulled, MetricChat makes no outbound network calls for query generation. You can verify this by monitoring network traffic with a tool like lsof or your firewall logs while submitting queries — all LLM traffic stays on localhost or within your LAN segment.
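As a concrete spot check (these commands assume Linux with iproute2's `ss`; `lsof -i` offers an equivalent view), list the TCP sockets held by the Ollama process while a query is in flight and confirm only local or LAN addresses appear:

```shell
# Show TCP sockets belonging to the ollama process, if one is running.
if pid=$(pgrep -xo ollama); then
  ss -tnp 2>/dev/null | grep "pid=$pid," \
    || echo "ollama (pid $pid) has no open TCP connections"
else
  echo "ollama is not running"
fi
```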
For a complete air-gapped deployment, pull your chosen models on a machine with internet access, export the Ollama model store directory (~/.ollama/models on Linux/macOS), transfer it to the air-gapped host, and restart Ollama. The models load from the local filesystem without any registry calls.
```shell
# On the internet-connected machine: archive relative to $HOME so the
# paths extract cleanly on the target
tar -czf ollama-models.tar.gz -C ~ .ollama/models

# Transfer ollama-models.tar.gz to the air-gapped host, then:
tar -xzf ollama-models.tar.gz -C ~
```

Conclusion
Pairing MetricChat with Ollama gives teams a fully self-contained analytics stack where every component — from the database connection to the language model — operates under their control. There are no third-party API calls, no data leaving the network, and no per-token costs to manage.
For teams with genuine compliance requirements or air-gapped constraints, this setup is not a workaround — it is the intended architecture. Pull a model, point MetricChat at your Ollama endpoint, and ask questions about your data the same way you would with any cloud-backed configuration. The only difference is that everything stays yours.