Cluster Models.

Community models. All are accessed via the same OpenAI-compatible API with the same base URL.

deepseek-v4-flash - 284B-21B

text generation & chat

284B parameter MoE model (21B active). 1M token context. Tool calling and reasoning. 500M token monthly quota per member.

Type: MoE (284B total · 21B active)
Quantization: FP8
Context: 1M tokens
Monthly quota: 500M tokens / member

capabilities

Tool calling
Reasoning mode
1M token context
Streaming generation (SSE)

mimo-v2.5 - 310B-15B

omnimodal — text, vision & audio

310B parameter MoE model (15B active), natively omnimodal with dedicated vision and audio encoders. 1M token context. Tool calling and reasoning. 500M token monthly quota per member. MIT license.

Type: MoE (310B total · 15B active)
Quantization: FP8
Context: 1M tokens
Input modalities: text · image · audio
Output modalities: text
Monthly quota: 500M tokens / member
License: MIT

capabilities

Tool calling (function calling)
Reasoning mode (recommended max_tokens ≥ 300)
Vision (image input)
Audio (audio input)
1M token context
Streaming generation (SSE)

glm5.2 - 753B MoE

text generation & chat — agentic coding

~753B parameter MoE model, focused on coding and long-horizon agentic tasks. 256K token context. Tool calling and reasoning (emits a reasoning trace). Text only.

Type: MoE (~753B total)
Quantization: FP8
Attention: Sparse attention
Context: 256K tokens
Input modalities: text
Output modalities: text

capabilities

Tool calling (function calling)
Reasoning mode (reasoning trace)
Coding and long-horizon agentic tasks
256K token context
Streaming generation (SSE)

gemma4 - 26B-A4B

text generation & chat

26B parameter MoE model (4B active), multimodal with vision. Tool calling and reasoning.

Type: MoE (26B total · 4B active)
Quantization: FP8
Context: 256K tokens
Sampling: temp=0.6, top_p=0.95
Reasoning: reasoning_config={}

capabilities

Tool calling (XML format)
Reasoning mode
Multimodal (vision / images)
Streaming generation (SSE)

qwen3.6 - 35B-A3B

text generation & chat

The flagship model. 35B parameter MoE, multimodal, with tool calling and reasoning.

Type: MoE (35B total)
Active per token: 3B
Quantization: FP8
Context: 256K tokens
Speculative decoding: MTP → ~2x throughput
Sampling: temp=0.6, top_p=0.95
Reasoning: reasoning_config={}

capabilities

Tool calling (XML format)
Reasoning mode
Multimodal (vision / images)
Streaming generation (SSE)

qwen3-embedding - 8B

vector embeddings

Vector embedding model. MMTEB score 70.58 — top-tier open models. Supports 100+ languages including Spanish and code.

Dimension: 4096
Precision: Float32 (CPU)
RPM: 60
Batch size: 32

use cases

Cross-lingual similarity (ES↔EN: 0.915)
Semantic search
Text classification
RAG / retrieval augmentation

rerank - Qwen3-Reranker-8B

semantic reranking

8B parameter reranking model (BF16). Reorders a list of documents by relevance to a query. Completes the RAG stack alongside qwen3-embedding: first retrieve top-K via embeddings, then rerank for precision. Supports 100+ languages including Spanish, code retrieval, and cross-lingual. Top-tier on MTEB reranking benchmarks.

Parameters: 8B
Precision: BF16
Endpoints: /v1/rerank · /v2/rerank
Languages: 100+

use cases

Reranking in RAG pipelines (embedding → rerank → LLM)
Cross-lingual search (ES↔EN, etc.)
Code retrieval
Query-document relevance scoring

kokoro - v1.0

text-to-speech

82M parameter TTS with 67 voice packs. Sub-second latency on CPU.

Latency: < 1s
Parameters: 82M
RPM: 15

available voices

af_heart — English (female)
ef_dora — Spanish (female)
em_alex — Spanish (male)
67 voice packs total (see full list)

whisper - large-v3

speech-to-text

CPU-based STT with CTranslate2 and INT8. ~1x realtime. 99+ languages.

Size: ~3 GB (INT8)
WER ES: ~3.2%
RPM: 10

capabilities

Audio-to-text transcription
99+ languages
Automatic language detection
OpenAI-compatible API

limitaciones conocidas

File size limit — 25 MB: Maximum size per request. Compressed formats (OGG/Opus, MP3) make better use of this limit than uncompressed WAV.
Timeout — audios > 2 min duration: Whisper processes on CPU at ~1x realtime. For audios longer than ~2 minutes, the proxy may return a 524 (timeout) error before transcription completes. Use compressed formats like OGG/Opus and split long files into ≤ 2 minute segments to avoid this.
Recommended formats: OGG/Opus and MP3 — smaller files, same transcription quality. A 60-minute audio in OGG/Opus at 48 kbps takes ~20 MB vs ~550 MB in WAV.

flux-2-klein

image generation

FLUX diffusion model for text-to-image and image-to-image. Compatible with OpenAI's Images API (/v1/images/generations and /v1/images/edits). Requires inference-tier membership.

Type: Diffusion (FLUX)
Modalities: text→image · image→image
Resolution: 256–1536 px (multiples of 16)
Images / request: 1–4 (n)
Monthly quota: 100 requests / member

capabilities

Text-to-image (/v1/images/generations)
Image-to-image with up to 4 references (/v1/images/edits)
Output as temporary URL (R2, ~60 min) or base64
Reproducibility via seed and guidance control

rate limits por API key

Requests / min: 60 rpm
Paralelo máximo: 5 concurrentes

tokens / min por modelo

deepseek-v4-flash: 1.5M tpm
mimo-v2.5: 1.5M tpm
qwen3.6: 1.5M tpm
gemma4: 1.5M tpm

requests / min por modelo

rerank: 1000 rpm

← anterior API siguiente → Examples