## What’s limited

AIsa enforces three dimensions of capacity:

| Dimension | Applies to | What it measures |
|---|---|---|
| RPM | All endpoints | Requests per minute |
| TPM | LLM inference endpoints | Input + output tokens per minute (combined) |
| Concurrency | Streaming + long-running endpoints | Simultaneous in-flight requests |
TPM counts input + output tokens together. A request that sends 10K tokens and generates 5K tokens consumes 15K TPM.
## Default limits per tier
| Tier | RPM | TPM | Concurrency | Who gets it |
|---|---|---|---|---|
| Free | 60 | 60,000 | 5 | New accounts with $2 signup credit |
| Starter | 600 | 600,000 | 20 | After first paid top-up |
| Growth | 3,000 | 3,000,000 | 50 | $500+ topped up OR approved application |
| Enterprise | Custom | Custom | Custom | Contact sales |
## Per-endpoint overrides

Some endpoints have tighter default limits independent of your tier, because the upstream provider caps throughput:

| Endpoint group | Default override |
|---|---|
| `POST /chat/completions` (GPT-5.4) | RPM capped at upstream quota |
| `POST /messages` (Claude Opus) | RPM capped at upstream quota |
| `POST /perplexity/sonar-deep-research` | 5 RPM per key (long-running) |
| `/aigc/video-generation` | 3 concurrent video tasks per key |
| `/v1/models/*:generateContent` (image gen) | 30 RPM per key |
## Reading rate-limit headers

Every response (including 429s) includes the following headers; `Retry-After` appears only on 429s:
| Header | Meaning |
|---|---|
| `X-RateLimit-Limit-Requests` | Your RPM cap |
| `X-RateLimit-Remaining-Requests` | Requests remaining this minute |
| `X-RateLimit-Limit-Tokens` | Your TPM cap |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining this minute |
| `X-RateLimit-Reset-Requests` | UNIX timestamp when the request counter resets |
| `X-RateLimit-Reset-Tokens` | UNIX timestamp when the token counter resets |
| `Retry-After` | (429 only) Seconds until you can retry |
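A client can turn these headers into a pause decision. The following is a minimal sketch; `rate_limit_status` is a hypothetical helper (not part of any official SDK), and it only assumes the header names and UNIX-timestamp semantics from the table above:

```python
import time

def rate_limit_status(headers: dict) -> dict:
    """Summarize AIsa rate-limit headers into remaining quota and
    seconds until each counter resets. (Hypothetical helper.)"""
    now = int(time.time())
    return {
        "requests_remaining": int(headers.get("X-RateLimit-Remaining-Requests", 0)),
        "tokens_remaining":   int(headers.get("X-RateLimit-Remaining-Tokens", 0)),
        # Reset headers are absolute UNIX timestamps; convert to "seconds from now"
        "requests_reset_in":  max(0, int(headers.get("X-RateLimit-Reset-Requests", now)) - now),
        "tokens_reset_in":    max(0, int(headers.get("X-RateLimit-Reset-Tokens", now)) - now),
    }
```

If `requests_remaining` or `tokens_remaining` is near zero, sleep until the corresponding `*_reset_in` elapses instead of sending and eating a 429.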
## Handling 429 responses

### Detect a 429
The response body follows the standard error shape, with `error.type = "rate_limit_error"` and a `code` of `rate_limit_exceeded`, `upstream_rate_limit`, or `quota_exceeded`.

### Honor `Retry-After`
Always wait at least the number of seconds given in `Retry-After` before the next attempt. Never retry immediately.

### Use exponential backoff + jitter
After the initial wait, double the delay on each subsequent 429, with ±25% jitter. Cap at 30 seconds. See the retry example.
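The schedule above can be sketched as a generator of wait times. This is illustrative, not official client code; `backoff_delays` and `max_retries` are assumed names:

```python
import random

MAX_DELAY = 30.0  # never wait longer than 30 s between attempts

def backoff_delays(retry_after: float, max_retries: int = 5):
    """Yield wait times for successive 429s: honor Retry-After exactly on
    the first wait, then double with +/-25% jitter, capped at MAX_DELAY."""
    delay = retry_after
    for attempt in range(max_retries):
        if attempt == 0:
            yield retry_after                          # first wait: Retry-After as-is
        else:
            delay = min(delay * 2, MAX_DELAY)          # exponential growth, capped
            yield min(delay * random.uniform(0.75, 1.25), MAX_DELAY)  # +/-25% jitter
```

Jitter matters because many clients that received the same 429 would otherwise all retry at the same instant and collide again.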
### Prefer queuing over retrying
Instead of tight retry loops, queue requests and drain them at a rate below your RPM. A simple token bucket with a leak rate of RPM/60 requests per second is robust.

### Example: staying under the limit
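A minimal token-bucket limiter along these lines, refilling at RPM/60 tokens per second; `TokenBucket`, its `burst` parameter, and the injectable clock are illustrative choices, not an official client API. At 600 RPM (Starter tier) this drains at 10 requests per second:

```python
import time

class TokenBucket:
    """Token bucket that refills at rpm/60 tokens per second, so the
    sustained send rate stays under the RPM cap. (Sketch, not an SDK.)"""

    def __init__(self, rpm: int, burst: int = 5, clock=time.monotonic):
        self.rate = rpm / 60.0        # tokens refilled per second
        self.capacity = float(burst)  # maximum short-term burst
        self.tokens = float(burst)
        self.clock = clock            # injectable for testing
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; if False, the caller should wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Before each API call, loop on `try_acquire()` with a short sleep; requests then drain at the bucket's rate no matter how fast they are enqueued.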
## Requesting a quota increase

If you need higher limits for production traffic:

- Top up your wallet: moving from Free → Starter is automatic.
- For Growth or Enterprise tiers, email developer@aisa.one with:
  - Your workspace ID
  - Expected peak RPM and TPM
  - Which models or endpoints you need the increase for
  - A brief description of your use case
## Related

- **Error Codes**: Full list of HTTP status codes and recommended responses.
- **Usage Logs**: Per-request billing and rate-limit telemetry in the dashboard.