Products · Sheet 06
Everything you need to run dedicated inference.
Five products, one promise: any open-source model on auto-optimized GPUs, priced like infrastructure should be.
Dedicated Inference
Private GPU endpoints, always on.
Always-on dedicated GPUs running your chosen open-source model behind a private endpoint. No pooled capacity, no noisy neighbors.
- Any open-source model from HuggingFace or Ollama
- Auto-selected GPU with a flat monthly quote up front
- Private endpoint, no shared capacity
Inference API
OpenAI-compatible, drop-in.
An OpenAI-compatible endpoint for every deployment. Point the OpenAI SDK at your base URL, change the model field, and you are done.
- Drop-in for the OpenAI Python and JS SDKs
- One base URL routes every model in your workspace
- Per-deployment or workspace-wide API keys
GPU Fleet
T4 through B200.
Ten GPU classes spanning budget batch jobs to frontier MoE serving. The optimizer picks; you can also pin a specific card.
- 16 GB to 192 GB of VRAM
- Budget, mid, performance, and top tiers
- Per-GPU, per-hour rates on the pricing page
Custom Domains
Your domain, your endpoint.
Serve inference from your own domain with automatic wildcard subdomains per deployment. TLS handled for you.
- Bring your own domain
- Automatic per-deployment subdomains
- Managed TLS, no certificate wrangling
Autoscaling & Replicas
Throughput when you need it.
Run multiple replicas behind one endpoint and scale throughput in tiers. Predictable monthly pricing, more headroom on demand.
- Multiple replicas behind one endpoint
- Throughput-tiered, still flat monthly
- Scale headroom up or down per deployment
Hardware · Schedule A
The GPU fleet, end to end
From 50 dollars a month to frontier-scale serving. A sample of the fleet — the optimizer picks the right card for your model.
| GPU | Tier | VRAM | Best for | /hour |
|---|---|---|---|---|
| T4 | Budget | 16 GB | Small models, batch jobs | $0.59 |
| L40S | Mid | 48 GB | 8B–13B production inference | $1.95 |
| H100 | Performance | 80 GB | 30B–70B, demanding latency SLAs | $3.95 |
| H200 | Performance | 141 GB | Long-context 70B, MoE models | $4.54 |
| B200 | Top | 192 GB | Frontier models, lowest latency | $6.25 |
Pick a model. We handle the rest.
Every product above ships with the same promise: a flat monthly quote before you confirm, and infrastructure that gets out of the way.