Products · Sheet 06

Everything you need to run dedicated inference.

Five products, one promise: any open-source model on auto-optimized GPUs, priced like infrastructure should be.

Private GPU endpoints, always on.

Always-on dedicated GPUs running your chosen open-source model behind a private endpoint. No pooled capacity, no noisy neighbors.

OpenAI-compatible, drop-in.

An OpenAI-compatible endpoint for every deployment. Point the OpenAI SDK at your base URL, change the model field, and you are done.

T4 through B200.

Ten GPU classes spanning budget batch jobs to frontier MoE serving. The optimizer picks; you can also pin a specific card.

Your domain, your endpoint.

Serve inference from your own domain with automatic wildcard subdomains per deployment. TLS handled for you.

Throughput when you need it.

Run multiple replicas behind one endpoint and scale throughput in tiers. Predictable monthly pricing, more headroom on demand.

Hardware · Schedule A

The GPU fleet, end to end

From 50 dollars a month to frontier-scale serving. A sample of the fleet — the optimizer picks the right card for your model.

GPU	Tier	VRAM	Best for	/hour
T4	Budget	16 GB	Small models, batch jobs	$0.59
L40S	Mid	48 GB	8B–13B production inference	$1.95
H100	Performance	80 GB	30B–70B, demanding latency SLAs	$3.95
H200	Performance	141 GB	Long-context 70B, MoE models	$4.54
B200	Top	192 GB	Frontier models, lowest latency	$6.25

Every product above ships with the same promise: a flat monthly quote before you confirm, and infrastructure that gets out of the way.