Run Any AI Model
On Your Hardware

Zero API fees after hardware. OpenAI-compatible API. GGUF and HuggingFace model support. Bundled with Aura Workshop as a one-click sidecar.

$0
Per-Token Cost
GGUF
Model Format
31MB
Engine Binary
1-Click
Setup
Zero Cost After Hardware

Pay once for hardware. Run models forever.

Cloud API providers charge per token, forever. With Aura Inference Engine, your only cost is the hardware you already own. Run millions of tokens per day at zero marginal cost.

  • No API Keys Required — Run completely offline with zero external dependencies
  • No Per-Token Fees — Every token is free after the initial hardware investment
  • No Rate Limits — Your hardware, your throughput, no throttling
  • No Data Leaves Your Machine — Prompts and responses stay local, always
  • Final Fallback — Use as the $0 fallback in your provider chain when cloud budgets run out
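The provider-chain fallback can be sketched as plain selection logic. This is an illustration only: the provider names, per-token prices, and `pick_provider` function are hypothetical, not part of Aura's actual configuration.

```python
# Hypothetical provider chain: prefer cloud providers while budget remains,
# then fall back to the local Aura Inference Engine at $0/token.
def pick_provider(remaining_budget_usd: float) -> str:
    chain = [
        ("cloud-provider-a", 0.002),  # cost per 1K tokens (illustrative)
        ("cloud-provider-b", 0.003),
        ("aura-local", 0.0),          # always available, zero marginal cost
    ]
    for name, cost_per_1k in chain:
        if cost_per_1k == 0.0 or remaining_budget_usd > 0:
            return name
    return "aura-local"

print(pick_provider(5.0))  # budget remaining: first cloud provider
print(pick_provider(0.0))  # budget exhausted: local engine, for free
```

Because the local engine's marginal cost is zero, it can sit at the end of any chain as a guaranteed terminal option.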
$0.00 / token
No subscriptions. No metered billing.
No surprise invoices at end of month.
Your hardware. Your models. Your savings.

GGUF, HuggingFace, and beyond

Aura Inference Engine supports the most popular open-source model formats. Download from HuggingFace or point to your own GGUF files.

GGUF Models

Quantized models in GGUF format. Supports Q4, Q5, Q6, and Q8 quantization levels, plus unquantized FP16. Optimized for CPU and GPU inference.

HuggingFace Hub

Automatically scans ~/.cache/huggingface/hub/ for downloaded models. Browse and launch directly from the Aura AI model browser.

Local Model Cache

Dedicated model directory at ~/.cache/aura-inference/models/ for your GGUF files. Drop in a model and it appears in the browser.
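The "drop in a model and it appears" behavior amounts to a directory scan for `.gguf` files. A minimal sketch of that idea, not Aura's actual implementation, demonstrated against a temporary directory standing in for `~/.cache/aura-inference/models/`:

```python
import tempfile
from pathlib import Path

def list_gguf_models(model_dir: str) -> list[str]:
    """Return the .gguf filenames in a model directory, sorted by name."""
    return sorted(p.name for p in Path(model_dir).glob("*.gguf"))

# Demo: a temp directory standing in for ~/.cache/aura-inference/models/
with tempfile.TemporaryDirectory() as d:
    Path(d, "llama-3.2-3b-instruct-q5_k_m.gguf").touch()
    Path(d, "notes.txt").touch()  # non-GGUF files are ignored
    print(list_gguf_models(d))    # ['llama-3.2-3b-instruct-q5_k_m.gguf']
```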

LLaMA Family

Full support for LLaMA 2, LLaMA 3, Code LLaMA, and all derivatives. Any model in GGUF format works out of the box.

Code Models

Optimized for code generation models like CodeGemma, DeepSeek Coder, StarCoder, and WizardCoder.

Multilingual

Run multilingual models for content in any language. No external API needed for translation or multilingual tasks.

Hardware Requirements

Runs on the hardware you already have

From MacBook Air to dedicated GPU servers, Aura Inference scales to your hardware.

| Configuration | RAM | Recommended Models | Performance |
|---|---|---|---|
| Apple Silicon (M1/M2/M3/M4) | 16GB+ | 7B-13B Q4/Q5 models | Excellent — Metal GPU acceleration, unified memory |
| NVIDIA GPU (8GB+ VRAM) | 16GB+ | 7B-70B depending on VRAM | Excellent — CUDA acceleration for maximum throughput |
| CPU Only (x86/ARM) | 8GB+ | 3B-7B Q4 models | Good — AVX2/NEON optimized, slower than GPU |
| AMD GPU (ROCm) | 16GB+ | 7B-13B models | Good — ROCm support for compatible AMD GPUs |
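The RAM figures above follow a common rule of thumb for quantized models: memory for weights ≈ parameter count × bits-per-weight / 8, plus runtime overhead for the KV cache and buffers. A rough sketch of that estimate; the 20% overhead factor is an assumption for illustration, not an Aura specification:

```python
def approx_model_ram_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes at the quantized width, plus ~20% overhead."""
    bytes_for_weights = params_billions * 1e9 * bits_per_weight / 8
    return round(bytes_for_weights * overhead / 1e9, 1)

# A 7B model at Q4 (~4.5 bits/weight) fits comfortably in 8GB of RAM:
print(approx_model_ram_gb(7, 4.5))   # roughly 4.7
# A 13B model at Q5 (~5.5 bits/weight) wants a 16GB machine:
print(approx_model_ram_gb(13, 5.5))  # roughly 10.7
```

This is why Q4/Q5 quantization is the sweet spot for consumer hardware: it cuts weight memory to roughly a quarter of FP16 with modest quality loss.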

One download. One click. Running.

Aura Inference Engine ships as a sidecar binary inside Aura Workshop. No separate installation. No Docker. No Python environments. Just open the app, go to the Aura AI tab, and click Start.

  • Sidecar Architecture — 31MB Go binary bundled inside the Tauri app
  • One-Click Launch — Start the inference server from the Aura AI panel
  • Auto-Configuration — Provider, base URL, and model set automatically on launch
  • Model Browser — Browse downloaded models, see file sizes, select and launch
  • Process Management — Engine starts and stops with the app. No orphan processes.
  • Cross-Platform — Pre-built binaries for macOS (aarch64, x86_64), Windows, and Linux
Aura AI Running
Server http://localhost:8080
Model llama-3.2-3b-instruct-q5_k_m.gguf
Context 4096 tokens
Memory 2.8 GB
API Compatible

OpenAI API format. Drop-in replacement.

Aura Inference Engine exposes an OpenAI-compatible REST API. Any tool, library, or application that works with the OpenAI API works with Aura Inference. Just change the base URL.

  • /v1/chat/completions — Full chat completions endpoint with streaming support
  • /v1/models — List available models loaded on the server
  • Streaming — Server-Sent Events (SSE) streaming for real-time token output
  • Temperature, Top-P, Max Tokens — Standard sampling parameters supported
  • System / User / Assistant — Full message role support for chat conversations
  • Any Client Library — Works with Python openai, JS fetch, curl, or any HTTP client
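With streaming enabled, responses arrive as Server-Sent Events: one `data:` line per chunk in the standard OpenAI streaming format, terminated by the `data: [DONE]` sentinel. A minimal parser sketch; the chunk payloads shown are illustrative:

```python
import json

def collect_stream(sse_lines):
    """Reassemble the full text from OpenAI-style SSE chat-completion chunks."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example stream as it would appear on the wire (illustrative payloads):
stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "data: [DONE]",
]
print(collect_stream(stream))  # Hello!
```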
Example Request
# Just change the base URL
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
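The same call from Python needs nothing beyond the standard library; the body mirrors the curl request above. Actually sending it requires the engine to be running at localhost:8080, so this sketch only constructs the request:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # Aura Inference Engine default

payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,  # set True for SSE streaming
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the engine running, send it and read the OpenAI-style response:
#   resp = json.load(urllib.request.urlopen(req))
#   print(resp["choices"][0]["message"]["content"])
print(req.method, req.full_url)
```

Pointing the official `openai` Python client at the same base URL works identically, since the endpoint and payload shapes match the OpenAI API.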

Ready to run AI on your own hardware?

Download Aura Workshop. Open the Aura AI tab. Click Start. That's it.