
v1.77.7-stable - 2.9x Lower Median Latency

Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM
Alexsander Hamir
Backend Performance Engineer
Achintya Rajan
Fullstack Engineer
Sameer Kankute
Backend Engineer (LLM Translation)

Deploy this version

```shell
docker run \
  -e STORE_MODEL_IN_DB=True \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:v1.77.7.rc.1
```

Key Highlights

  • Dynamic Rate Limiter v3 - Automatically maximizes throughput when capacity is available (< 80% saturation) by letting lower-priority requests use unused capacity, then switches to fair priority-based allocation under high load (≥ 80%) to prevent blocking (see the sketch after this list)
  • Major Performance Improvements - 2.9x lower median latency at 1,000 concurrent users.
  • Claude Sonnet 4.5 - Support for Anthropic's new Claude Sonnet 4.5 model family with 200K+ context and tiered pricing
  • MCP Gateway Enhancements - Fine-grained tool control, server permissions, and forwardable headers
  • AMD Lemonade & Nvidia NIM - New provider support for AMD Lemonade and Nvidia NIM Rerank
  • GitLab Prompt Management - GitLab-based prompt management integration
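
The Dynamic Rate Limiter v3 highlight above describes a saturation-aware policy. The sketch below is illustrative only (not LiteLLM's implementation); the threshold constant, function name, and weight scheme are assumptions used to show how the two modes fit together.

```python
# Illustrative sketch of a saturation-aware rate limiter (not LiteLLM's actual code).
SATURATION_THRESHOLD = 0.8  # assumed cutoff, taken from the description above


def allowed_requests(priority_weight: float, total_capacity: int,
                     current_usage: int, priority_weights_sum: float = 1.0) -> int:
    """Return how many requests a priority tier may send right now."""
    saturation = current_usage / total_capacity
    if saturation < SATURATION_THRESHOLD:
        # Under-saturated: any tier, including low priority, may use the idle capacity.
        return total_capacity - current_usage
    # Saturated: fall back to fair, weight-proportional allocation per tier.
    return int(total_capacity * (priority_weight / priority_weights_sum))


# 100 RPM capacity; a tier holding 20% of the total priority weight:
print(allowed_requests(0.2, 100, current_usage=50))  # 50 -> idle capacity is up for grabs
print(allowed_requests(0.2, 100, current_usage=85))  # 20 -> fair share under load
```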

Performance - 2.9x Lower Median Latency


This update removes LiteLLM router inefficiencies, reducing lookup complexity from O(M×N) to O(1). Previously, every lookup rebuilt an array of model IDs and ran repeated membership checks such as data["model"] in llm_router.get_model_ids(). Now, a direct ID-to-deployment map eliminates the redundant allocations and scans, as the sketch below illustrates.
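
The change amounts to swapping repeated list scans for a prebuilt dictionary. A minimal sketch of that pattern follows; the names are illustrative, not LiteLLM's actual internals.

```python
# Before: every lookup rebuilt a list of IDs and scanned it -> O(M*N) across requests.
deployments = [
    {"model_name": "gpt-4o", "model_info": {"id": "dep-1"}},
    {"model_name": "gpt-4o", "model_info": {"id": "dep-2"}},
]


def get_deployment_scan(deployment_id: str):
    model_ids = [d["model_info"]["id"] for d in deployments]  # fresh allocation per call
    if deployment_id in model_ids:                            # linear scan
        return deployments[model_ids.index(deployment_id)]
    return None


# After: build the map once, then every lookup is a single O(1) dict access.
id_to_deployment = {d["model_info"]["id"]: d for d in deployments}


def get_deployment_map(deployment_id: str):
    return id_to_deployment.get(deployment_id)
```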

As a result, performance improved across all latency percentiles:

  • Median latency: 320 ms → 110 ms (−65.6%)
  • p95 latency: 850 ms → 440 ms (−48.2%)
  • p99 latency: 1,400 ms → 810 ms (−42.1%)
  • Average latency: 864 ms → 310 ms (−64%)

Test Setup

Locust

  • Concurrent users: 1,000
  • Ramp-up: 500

System Specs

  • CPU: 4 vCPUs
  • Memory: 8 GB RAM
  • LiteLLM Workers: 4
  • Instances: 4

Configuration (config.yaml)

View the complete configuration: gist.github.com/AlexsanderHamir/config.yaml

Load Script (no_cache_hits.py)

View the complete load testing script: gist.github.com/AlexsanderHamir/no_cache_hits.py
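
For reference, here is a minimal Locust script of the same shape as the linked gist (the model name, API key, and payload are placeholders, not the gist's contents):

```python
from locust import HttpUser, task, between


class ProxyUser(HttpUser):
    # Simulated client hitting the LiteLLM proxy's chat completions endpoint.
    wait_time = between(0.5, 1.5)

    @task
    def chat_completion(self):
        self.client.post(
            "/chat/completions",
            headers={"Authorization": "Bearer sk-1234"},  # placeholder virtual key
            json={
                "model": "fake-openai-endpoint",  # placeholder model name
                "messages": [{"role": "user", "content": "hello"}],
            },
        )
```

Run it with something like `locust -f no_cache_hits.py --host http://localhost:4000 --users 1000 --spawn-rate 500` to match the concurrency and ramp-up above.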

MCP OAuth 2.0 Support


This release adds support for OAuth 2.0 Client Credentials for MCP servers. This is useful for internal dev-tool use cases, as it enables your users to call MCP servers with their own credentials, e.g. allowing your developers to call the GitHub MCP server with their own GitHub credentials.
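
Under the hood this relies on the standard OAuth 2.0 Client Credentials grant. A minimal sketch of that flow is shown below (the token URL and credentials are placeholders; this is not LiteLLM's internal code):

```python
import requests


def fetch_access_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Exchange client credentials for a bearer token (OAuth 2.0 Client Credentials grant)."""
    resp = requests.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


# The resulting token is sent as an Authorization: Bearer header on each MCP tool call.
```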

Set it up today on Claude Code

Scheduled Key Rotations


This release brings support for scheduling virtual key rotations on the LiteLLM AI Gateway.

From this release, you can enforce Virtual Keys to rotate on a schedule of your choice, e.g. every 15, 30, or 60 days.

This is great for Proxy Admins who need to enforce security policies for production workloads.

Get Started

New Models / Updated Models

New Model Support

| Provider | Model | Context Window | Input ($/1M tokens) | Output ($/1M tokens) | Features |
|----------|-------|----------------|---------------------|---------------------|----------|
| Anthropic | claude-sonnet-4-5 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching |
| Anthropic | claude-sonnet-4-5-20250929 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching |
| Bedrock | eu.anthropic.claude-sonnet-4-5-20250929-v1:0 | 200K | $3.00 | $15.00 | Chat, reasoning, vision, function calling, prompt caching |
| Azure AI | azure_ai/grok-4 | 131K | $5.50 | $27.50 | Chat, reasoning, function calling, web search |
| Azure AI | azure_ai/grok-4-fast-reasoning | 131K | $0.43 | $1.73 | Chat, reasoning, function calling, web search |
| Azure AI | azure_ai/grok-4-fast-non-reasoning | 131K | $0.43 | $1.73 | Chat, function calling, web search |
| Azure AI | azure_ai/grok-code-fast-1 | 131K | $3.50 | $17.50 | Chat, function calling, web search |
| Groq | groq/moonshotai/kimi-k2-instruct-0905 | Context varies | Pricing varies | Pricing varies | Chat, function calling |
| Ollama | Ollama Cloud models | Varies | Free | Free | Self-hosted models via Ollama Cloud |

Features

  • Anthropic
    • Add new claude-sonnet-4-5 model family with tiered pricing above 200K tokens - PR #15041
    • Add anthropic/claude-sonnet-4-5 to model price json with prompt caching support - PR #15049
    • Add 200K prices for Sonnet 4.5 - PR #15140
    • Add cost tracking for /v1/messages in streaming response - PR #15102
    • Add /v1/messages/count_tokens to Anthropic routes for non-admin user access - PR #15034
  • Gemini
    • Ignore type param for gemini tools - PR #15022
  • Vertex AI
    • Add LiteLLM Overhead metric for VertexAI - PR #15040
    • Support googlemap grounding in vertex ai - PR #15179
  • Azure
    • Add azure_ai grok-4 model family - PR #15137
    • Use the extra_query parameter for GET requests in Azure Batch - PR #14997
    • Use extra_query for download results (Batch API) - PR #15025
    • Add support for Azure AD token-based authorization - PR #14813
  • Ollama
  • Groq
    • Add groq/moonshotai/kimi-k2-instruct-0905 - PR #15079
  • OpenAI
    • Add support for GPT 5 codex models - PR #14841
  • DeepInfra
    • Update DeepInfra model data refresh with latest pricing - PR #14939
  • Bedrock
    • Add JP Cross-Region Inference - PR #15188
    • Add "eu.anthropic.claude-sonnet-4-5-20250929-v1:0" - PR #15181
    • Add twelvelabs bedrock Async Invoke Support - PR #14871
  • Nvidia NIM

Bug Fixes

  • VLLM
    • Fix response_format bug in hosted vllm audio_transcription - PR #15010
    • Fix passthrough of atranscription into kwargs going to upstream provider - PR #15005
  • OCI
    • Fix OCI Generative AI Integration when using Proxy - PR #15072
  • General
    • Fix: Authorization header to use correct "Bearer" capitalization - PR #14764
    • Bug fix: gpt-5-chat-latest has incorrect max_input_tokens value - PR #15116
    • Update request handling for original exceptions - PR #15013

New Provider Support

  • AMD Lemonade - New provider support
  • Nvidia NIM - New rerank support

LLM API Endpoints

Features

  • Responses API

    • Return Cost for Responses API Streaming requests - PR #15053
  • /generateContent

    • Add full support for native Gemini API translation - PR #15029
  • Passthrough Gemini Routes

    • Add Gemini generateContent passthrough cost tracking - PR #15014
    • Add streamGenerateContent cost tracking in passthrough - PR #15199
  • Passthrough Vertex AI Routes

    • Add cost tracking for Vertex AI Passthrough /predict endpoint - PR #15019
    • Add cost tracking for Vertex AI Live API WebSocket Passthrough - PR #14956
  • General

    • Preserve Whitespace Characters in Model Response Streams - PR #15160
    • Add provider name to payload specification - PR #15130
    • Ensure query params are forwarded from origin url to downstream request - PR #15087

Management Endpoints / UI

Features

  • Virtual Keys

    • Ensure LLM_API_KEYs can access pass through routes - PR #15115
    • Support 'guaranteed_throughput' when setting limits on keys belonging to a team - PR #15120
  • Models + Endpoints

    • Ensure OCI secret fields not shared on /models and /v1/models endpoints - PR #15085
    • Add snowflake on UI - PR #15083
    • Make UI theme settings publicly accessible for custom branding - PR #15074
  • Admin Settings

  • MCP

    • Show health status of MCP servers - PR #15185
    • Allow setting extra headers on the UI - PR #15185
    • Allow editing allowed tools on the UI - PR #15185

Bug Fixes

  • Virtual Keys

    • (security) prevent user key from updating other user keys - PR #15201
    • (security) don't return all keys with blank key alias on /v2/key/info - PR #15201
    • Fix Session Token Cookie Infinite Logout Loop - PR #15146
  • Models + Endpoints

    • Make UI theme settings publicly accessible for custom branding - PR #15074
  • Teams

    • Fix failed copy-to-clipboard for HTTP UI - PR #15195
  • Logs

    • Fix logs page rendering logs on filter lookup - PR #15195
    • Fix lookup of end-user list (migrate to the more efficient /customers/list lookup) - PR #15195
  • Test key

    • Update selected model on key change - PR #15197
  • Dashboard

    • Fix LiteLLM model name fallback in dashboard overview - PR #14998

Logging / Guardrail / Prompt Management Integrations

Features

Guardrails

  • Javelin
    • Add Javelin standalone guardrails integration for LiteLLM Proxy - PR #14983
    • Add logging for important status fields in guardrails - PR #15090
    • Don't run post_call guardrail if no text returned from Bedrock - PR #15106

Prompt Management

  • GitLab - GitLab-based prompt management integration

Spend Tracking, Budgets and Rate Limiting

  • Cost Tracking
    • Proxy: end user cost tracking in the responses API - PR #15124
  • Parallel Request Limiter v3
    • Use well known redis cluster hashing algorithm - PR #15052
    • Fixes to dynamic rate limiter v3 - add saturation detection - PR #15119
    • Dynamic Rate Limiter v3 - fixes for detecting saturation + fixes for post saturation behavior - PR #15192
  • Teams
    • Add model specific tpm/rpm limits to teams on LiteLLM - PR #15044

MCP Gateway

  • Server Configuration
    • Specify forwardable headers, specify allowed/disallowed tools for MCP servers - PR #15002
    • Enforce server permissions on call tools - PR #15044
    • MCP Gateway Fine-grained Tools Addition - PR #15153
  • Bug Fixes
    • Remove servername prefix mcp tools tests - PR #14986
    • Resolve regression with duplicate Mcp-Protocol-Version header - PR #15050
    • Fix test_mcp_server.py - PR #15183

Performance / Loadbalancing / Reliability Improvements

  • Router Optimizations
    • +62.5% P99 Latency Improvement - Remove router inefficiencies (from O(M*N) to O(1)) - PR #15046
    • Remove hasattr checks in Router - PR #15082
    • Remove Double Lookups - PR #15084
    • Optimize _filter_cooldown_deployments from O(n×m + k×n) to O(n) - PR #15091
    • Optimize unhealthy deployment filtering in retry path (O(n*m) → O(n+m)) - PR #15110 (see the sketch after this list)
  • Cache Optimizations
    • Reduce complexity of InMemoryCache.evict_cache from O(n*log(n)) to O(log(n)) - PR #15000
    • Avoiding expensive operations when cache isn't available - PR #15182
  • Worker Management
    • Add proxy CLI option to recycle workers after N requests - PR #15007
  • Metrics & Monitoring
    • LiteLLM Overhead metric tracking - Add support for tracking litellm overhead on cache hits - PR #15045
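
Several of the router optimizations above boil down to replacing nested list-membership scans with set lookups. An illustrative sketch of that pattern is below; the function and field names are assumptions, not LiteLLM's actual code.

```python
def filter_cooldown_slow(deployments: list[dict], cooldown_ids: list[str]) -> list[dict]:
    # O(n*m): "in" over a list rescans cooldown_ids for every deployment.
    return [d for d in deployments if d["id"] not in cooldown_ids]


def filter_cooldown_fast(deployments: list[dict], cooldown_ids: list[str]) -> list[dict]:
    # O(n + m): build the set once, then each membership check is O(1) on average.
    cooldown = set(cooldown_ids)
    return [d for d in deployments if d["id"] not in cooldown]
```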

Documentation Updates

  • Provider Documentation
    • Update litellm docs from latest release - PR #15004
    • Add missing api_key parameter - PR #15058
  • General Documentation
    • Use docker compose instead of docker-compose - PR #15024
    • Add railtracks to projects that are using litellm - PR #15144
    • Perf: Last week improvement - PR #15193
    • Sync models GitHub documentation with Loom video and cross-reference - PR #15191

Security Fixes

  • JWT Token Security - Don't log JWT SSO token on .info() log - PR #15145

New Contributors

  • @herve-ves made their first contribution in PR #14998
  • @wenxi-onyx made their first contribution in PR #15008
  • @jpetrucciani made their first contribution in PR #15005
  • @abhijitjavelin made their first contribution in PR #14983
  • @ZeroClover made their first contribution in PR #15039
  • @cedarm made their first contribution in PR #15043
  • @Isydmr made their first contribution in PR #15025
  • @serializer made their first contribution in PR #15013
  • @eddierichter-amd made their first contribution in PR #14840
  • @malags made their first contribution in PR #15000
  • @henryhwang made their first contribution in PR #15029
  • @plafleur made their first contribution in PR #15111
  • @tyler-liner made their first contribution in PR #14799
  • @Amir-R25 made their first contribution in PR #15144
  • @georg-wolflein made their first contribution in PR #15124
  • @niharm made their first contribution in PR #15140
  • @anthony-liner made their first contribution in PR #15015
  • @rishiganesh2002 made their first contribution in PR #15153
  • @danielaskdd made their first contribution in PR #15160
  • @JVenberg made their first contribution in PR #15146
  • @speglich made their first contribution in PR #15072
  • @daily-kim made their first contribution in PR #14764

Full Changelog