Run Any AI Model
On Your Hardware

Zero API fees after hardware. OpenAI-compatible API. GGUF and HuggingFace model support. Bundled with Aura Workshop as a one-click sidecar.

$0
Per-Token Cost
GGUF
Model Format
31MB
Engine Binary
1-Click
Setup
Zero Cost After Hardware

Pay once for hardware. Run models forever.

Cloud API providers charge per token, forever. With Aura Inference Engine, your only cost is the hardware you already own. Run millions of tokens per day at zero marginal cost.

  • No API Keys Required — Run completely offline with zero external dependencies
  • No Per-Token Fees — Every token is free after the initial hardware investment
  • No Rate Limits — Your hardware, your throughput, no throttling
  • No Data Leaves Your Machine — Prompts and responses stay local, always
  • Final Fallback — Use as the $0 fallback in your provider chain when cloud budgets run out
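The provider-chain fallback can be sketched as plain selection logic. This is an illustration only: the provider names, per-token prices, and `pick_provider` function are hypothetical, not part of Aura's actual configuration.

```python
# Hypothetical provider chain: prefer cloud providers while budget remains,
# then fall back to the local Aura Inference Engine at $0/token.
def pick_provider(remaining_budget_usd: float) -> str:
    chain = [
        ("cloud-provider-a", 0.002),  # cost per 1K tokens (illustrative)
        ("cloud-provider-b", 0.003),
        ("aura-local", 0.0),          # always available, zero marginal cost
    ]
    for name, cost_per_1k in chain:
        if cost_per_1k == 0.0 or remaining_budget_usd > 0:
            return name
    return "aura-local"

print(pick_provider(5.0))  # budget remaining: first cloud provider
print(pick_provider(0.0))  # budget exhausted: local engine, for free
```

Because the local engine's marginal cost is zero, it can sit at the end of any chain as a guaranteed terminal option.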
$0.00 / token
No subscriptions. No metered billing.
No surprise invoices at end of month.
Your hardware. Your models. Your savings.

GGUF, HuggingFace, and beyond

Aura Inference Engine supports the most popular open-source model formats. Download from HuggingFace or point to your own GGUF files.

GGUF Models

Quantized models in GGUF format. Supports Q4, Q5, Q6, and Q8 quantization levels, plus unquantized FP16. Optimized for CPU and GPU inference.

HuggingFace Hub

Automatically scans ~/.cache/huggingface/hub/ for downloaded models. Browse and launch directly from the Aura AI model browser.

Local Model Cache

Dedicated model directory at ~/.cache/aura-inference/models/ for your GGUF files. Drop in a model and it appears in the browser.
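The "drop in a model and it appears" behavior amounts to a directory scan for `.gguf` files. A minimal sketch of that idea, not Aura's actual implementation, demonstrated against a temporary directory standing in for `~/.cache/aura-inference/models/`:

```python
import tempfile
from pathlib import Path

def list_gguf_models(model_dir: str) -> list[str]:
    """Return the .gguf filenames in a model directory, sorted by name."""
    return sorted(p.name for p in Path(model_dir).glob("*.gguf"))

# Demo: a temp directory standing in for ~/.cache/aura-inference/models/
with tempfile.TemporaryDirectory() as d:
    Path(d, "llama-3.2-3b-instruct-q5_k_m.gguf").touch()
    Path(d, "notes.txt").touch()  # non-GGUF files are ignored
    print(list_gguf_models(d))    # ['llama-3.2-3b-instruct-q5_k_m.gguf']
```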

LLaMA Family

Full support for LLaMA 2, LLaMA 3, Code LLaMA, and all derivatives. Any model in GGUF format works out of the box.

Code Models

Optimized for code generation models like CodeGemma, DeepSeek Coder, StarCoder, and WizardCoder.

Multilingual

Run multilingual models for content in any language. No external API needed for translation or multilingual tasks.

Hardware Requirements

Runs on the hardware you already have

From MacBook Air to dedicated GPU servers, Aura Inference scales to your hardware.

| Configuration | RAM | Recommended Models | Performance |
|---|---|---|---|
| Apple Silicon (M1/M2/M3/M4) | 16GB+ | 7B-13B Q4/Q5 models | Excellent — Metal GPU acceleration, unified memory |
| NVIDIA GPU (8GB+ VRAM) | 16GB+ | 7B-70B depending on VRAM | Excellent — CUDA acceleration for maximum throughput |
| CPU Only (x86/ARM) | 8GB+ | 3B-7B Q4 models | Good — AVX2/NEON optimized, slower than GPU |
| AMD GPU (ROCm) | 16GB+ | 7B-13B models | Good — ROCm support for compatible AMD GPUs |
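The RAM figures above follow a common rule of thumb for quantized models: memory for weights ≈ parameter count × bits-per-weight / 8, plus runtime overhead for the KV cache and buffers. A rough sketch of that estimate; the 20% overhead factor is an assumption for illustration, not an Aura specification:

```python
def approx_model_ram_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes at the quantized width, plus ~20% overhead."""
    bytes_for_weights = params_billions * 1e9 * bits_per_weight / 8
    return round(bytes_for_weights * overhead / 1e9, 1)

# A 7B model at Q4 (~4.5 bits/weight) fits comfortably in 8GB of RAM:
print(approx_model_ram_gb(7, 4.5))   # roughly 4.7
# A 13B model at Q5 (~5.5 bits/weight) wants a 16GB machine:
print(approx_model_ram_gb(13, 5.5))  # roughly 10.7
```

This is why Q4/Q5 quantization is the sweet spot for consumer hardware: it cuts weight memory to roughly a quarter of FP16 with modest quality loss.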

One download. One click. Running.

Aura Inference Engine ships as a sidecar binary inside Aura Workshop. No separate installation. No Docker. No Python environments. Just open the app, go to the Aura AI tab, and click Start.

  • Sidecar Architecture — 31MB Go binary bundled inside the Tauri app
  • One-Click Launch — Start the inference server from the Aura AI panel
  • Auto-Configuration — Provider, base URL, and model set automatically on launch
  • Model Browser — Browse downloaded models, see file sizes, select and launch
  • Process Management — Engine starts and stops with the app. No orphan processes.
  • Cross-Platform — Pre-built binaries for macOS (aarch64, x86_64), Windows, and Linux
Aura AI Running
Server http://localhost:8080
Model llama-3.2-3b-instruct-q5_k_m.gguf
Context 4096 tokens
Memory 2.8 GB
API Compatible

OpenAI API format. Drop-in replacement.

Aura Inference Engine exposes an OpenAI-compatible REST API. Any tool, library, or application that works with the OpenAI API works with Aura Inference. Just change the base URL.

  • /v1/chat/completions — Full chat completions endpoint with streaming support
  • /v1/models — List available models loaded on the server
  • Streaming — Server-Sent Events (SSE) streaming for real-time token output
  • Temperature, Top-P, Max Tokens — Standard sampling parameters supported
  • System / User / Assistant — Full message role support for chat conversations
  • Any Client Library — Works with Python openai, JS fetch, curl, or any HTTP client
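With streaming enabled, responses arrive as Server-Sent Events: one `data:` line per chunk in the standard OpenAI streaming format, terminated by the `data: [DONE]` sentinel. A minimal parser sketch; the chunk payloads shown are illustrative:

```python
import json

def collect_stream(sse_lines):
    """Reassemble the full text from OpenAI-style SSE chat-completion chunks."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example stream as it would appear on the wire (illustrative payloads):
stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "data: [DONE]",
]
print(collect_stream(stream))  # Hello!
```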
Example Request
# Just change the base URL
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
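The same call from Python needs nothing beyond the standard library; the body mirrors the curl request above. Actually sending it requires the engine to be running at localhost:8080, so this sketch only constructs the request:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # Aura Inference Engine default

payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,  # set True for SSE streaming
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the engine running, send it and read the OpenAI-style response:
#   resp = json.load(urllib.request.urlopen(req))
#   print(resp["choices"][0]["message"]["content"])
print(req.method, req.full_url)
```

Pointing the official `openai` Python client at the same base URL works identically, since the endpoint and payload shapes match the OpenAI API.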

Ready to run AI on your own hardware?

Download Aura Workshop. Open the Aura AI tab. Click Start. That's it.