Provider API Landscape Research (2026)
Status update: The key follow-up recommendations from this research (especially transport flexibility + prompt caching) have been implemented in the same change set as the documentation. See the "Follow-ups Implemented" section at the bottom.
Goal: Understand the API shapes, wire formats, and feature surfaces of major AI/LLM inference providers so that universal-llm-client can be truly universal without creating one dedicated adapter per provider.
Core Philosophy (confirmed by research):
- Do not follow the "LiteLLM model" of N thin adapters.
- Instead: one primary canonical transport (OpenAI Chat Completions wire format) + a very small number of high-value native clients only for protocols that are sufficiently different and expose unique capabilities worth preserving.
- The current architecture (OpenAICompatibleClient as workhorse + dedicated GoogleClient + AnthropicClient + OllamaClient) is already well-aligned with reality.
Date of research: 2026-06-13
Sources: Official provider docs, API references, comparison articles, SDK compatibility guides, and industry analyses (see end for key links).
Executive Summary
Four wire formats cover the vast majority of commercial and self-hosted LLM usage:
- OpenAI Chat Completions (`/v1/chat/completions`): The de-facto universal language. Messages + tools + response_format. Streaming deltas.
- OpenAI Responses API (`/v1/responses`): Newer agentic/stateful evolution (built-in tools, `previous_response_id`, `store`). Still early; Chat Completions remains primary and "indefinitely supported".
- Anthropic Messages (`/v1/messages`): Content blocks (text/tool_use/tool_result/thinking/image), top-level `system`, strong prompt caching + extended thinking.
- Google Gemini (Generative Language / Vertex): `contents` + `parts`, `systemInstruction`, `functionDeclarations`, `responseSchema`.
Practical implication for universal-llm-client:
- Invest heavily in making the OpenAI-compatible path extremely robust and tolerant (this covers OpenAI, xAI/Grok, Mistral, DeepSeek, Cohere Compatibility, Perplexity Sonar/Agent compat, Groq, Together, Fireworks, OpenRouter, vLLM, TGI, llama.cpp, LM Studio, many Chinese hosted platforms, etc.).
- Keep/maintain thin, high-fidelity native clients only for:
  - Anthropic (unique thinking signatures, prompt caching, computer-use tool types, strict alternation + block model).
  - Google / Vertex (native thinking, grounding, service tiers like flex/priority, regional Vertex auth, schema stripping).
  - Ollama (local ergonomics: NDJSON, ensureReady/pull, model info, native `think` flag).
- Everything else: Documented as `{ type: 'openai', url: '...', apiKey: '...' }`.
- Add light ergonomic improvements for common "almost compat" cases (Azure, custom auth headers, Responses alias endpoints).
This approach scales to "truly universal" coverage while keeping the library small, testable, and maintainable.
Provider Families & Compatibility
1. OpenAI-Style (Chat Completions) — Primary Target
Full or near-full compatibility examples (use type: 'openai' + url):
| Provider / Service | Base URL (typical) | Notes / Gotchas | Structured Output | Tool Calling | Vision | Recommended |
|---|---|---|---|---|---|---|
| OpenAI (official) | https://api.openai.com | Native. Strict json_schema, parallel tools. | Excellent (json_schema + strict) | Excellent | Yes | Primary |
| xAI (Grok) | https://api.x.ai/v1 | Explicitly OpenAI SDK compatible. Also claims some Anthropic compat. | Good | Good | Yes | openai |
| Mistral AI | https://api.mistral.ai/v1 | Excellent compat layer. Own SDK exists but not required. | Good | Good | Yes (some models) | openai |
| DeepSeek | https://api.deepseek.com or /v1 | 100% OpenAI + even has /anthropic compat endpoint. Very low cost. | Good | Good | Varies | openai |
| Cohere (Compatibility API) | Cohere compat endpoint | Official "use via OpenAI SDK" path. Supports chat, tools, structured, embeddings. | Supported | Supported | — | openai |
| Perplexity (Sonar) | Perplexity endpoint (compat) | Sonar = Chat Completions compat. Agent API = Responses compat. Grounded by design. | — | Limited (search integrated) | — | openai (Sonar) |
| Groq | Groq OpenAI-compatible | Ultra-fast inference. Full OpenAI surface. | Good | Good | Yes | openai |
| Together AI | Together OpenAI-compatible | Many open models + fine-tunes. | Good | Good | Yes | openai |
| Fireworks AI | Fireworks OpenAI-compatible | Strong on fine-tuning + inference. | Good | Good | Yes | openai |
| OpenRouter | https://openrouter.ai/api/v1 | Router/aggregator. Provider routing via headers sometimes. | Varies by route | Varies | Varies | openai |
| vLLM (self-hosted) | http://.../v1 | Excellent OpenAI server. Some builds need gemma-native protocol handling (already supported). | Via outlines etc. | Good | Depends on model | openai |
| llama.cpp server | http://localhost:8080/v1 (or direct) | Treated as openai/llamacpp. Good tool streaming. | Limited grammar | Good | Yes (some) | openai or llamacpp |
| LM Studio, Ollama (some), others | Local /v1 | Drop-in. | Varies | Varies | Varies | openai |
Key observation: The list above represents the overwhelming majority of developer-accessible inference capacity in 2026 (especially when including aggregators and self-hosted).
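As a rough illustration of the `type: 'openai'` + `url` pattern in the table, the sketch below collects a few of the listed base URLs into one registry and builds an ordered failover list from it. The `CompatProviderConfig` interface, the registry, and `failoverOrder` are hypothetical names for this example only; the library's real config type may carry more fields, and the API keys are placeholders.

```typescript
// Hypothetical minimal config shape (assumption): only the three fields the table relies on.
interface CompatProviderConfig {
  type: "openai";
  url: string;
  apiKey: string;
}

// Base URLs are copied from the table above; API keys are placeholders.
const compatProviders: Record<string, CompatProviderConfig> = {
  openai: { type: "openai", url: "https://api.openai.com", apiKey: "<OPENAI_KEY>" },
  xai: { type: "openai", url: "https://api.x.ai/v1", apiKey: "<XAI_KEY>" },
  mistral: { type: "openai", url: "https://api.mistral.ai/v1", apiKey: "<MISTRAL_KEY>" },
  deepseek: { type: "openai", url: "https://api.deepseek.com", apiKey: "<DEEPSEEK_KEY>" },
  openrouter: { type: "openai", url: "https://openrouter.ai/api/v1", apiKey: "<OPENROUTER_KEY>" },
  llamacpp: { type: "openai", url: "http://localhost:8080/v1", apiKey: "" },
};

// Resolve provider names into an ordered list, failing loudly on typos.
function failoverOrder(names: string[]): CompatProviderConfig[] {
  return names.map((providerName) => {
    const cfg = compatProviders[providerName];
    if (!cfg) throw new Error(`Unknown provider: ${providerName}`);
    return cfg;
  });
}
```

Every entry shares one shape, so adding a provider is a data change rather than a new adapter.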
2. OpenAI Responses API (Emerging Agentic Format)
- Newer surface: `input` arrays (more flexible than messages), built-in tools (`web_search`, `file_search`, `code_interpreter`, `computer_use_preview`, MCP), `previous_response_id`, `store: true` for stateful, better reasoning surfacing.
- OpenAI says Chat Completions will be supported "indefinitely".
- Some providers expose alias endpoints (`/v1/responses`) or full compat (Perplexity Agent API, some mentions for xAI/Azure).
- Recommendation: Continue prioritizing Chat Completions for the universal client. Add optional future support or passthrough for Responses-style calls if users need built-in tools from a specific provider. Do not create a separate client.
3. Anthropic Messages (Dedicated Client — Worth It)
Why native client exists and should remain:
- Content is always an array of typed blocks (`text`, `image`, `tool_use`, `tool_result`, `thinking`).
- Tool results go inside user messages as `tool_result` blocks (not top-level `tool` role messages).
- Strict user/assistant alternation with merging logic required.
- Top-level `system` (or system blocks).
- Distinct streaming: `content_block_start` / `delta` (text_delta, input_json_delta, thinking_delta, signature_delta) / `stop`.
- Unique high-value features:
  - Extended thinking + `signature` (must be echoed for multi-turn in some cases).
  - Prompt caching (`cache_control` on blocks or system).
  - Stronger "computer use" / agent tool types in newer Claude versions.
  - Native structured output enforcement maturing (`output_config` / schema in recent Claude 4.x).
- Auth: `x-api-key` + `anthropic-version` header (not Bearer).
- The existing `AnthropicClient` correctly handles block conversion, tool_use ↔ tool_calls, thinking surfacing, and alternating-message cleanup. Good.
Status: Code is present (src/providers/anthropic.ts) but under-documented. Not listed in main README supported providers table or providers.md matrix.
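To make the block conversion described in this section concrete, here is a minimal sketch of translating OpenAI-style canonical messages into the Anthropic Messages shape: system text is hoisted to the top level, `tool` role messages become `tool_result` blocks inside a user message, and consecutive same-role messages are merged to preserve alternation. The type names and `toAnthropic` are assumptions made for this example and are not taken from `src/providers/anthropic.ts`; images and thinking blocks are left out.

```typescript
// Simplified canonical (OpenAI-style) messages; an assumed subset for illustration.
interface CanonicalToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}
type CanonicalMessage =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; tool_calls?: CanonicalToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface AnthropicMessage {
  role: "user" | "assistant";
  content: AnthropicBlock[];
}

function toAnthropic(
  messages: CanonicalMessage[],
): { system: string | undefined; messages: AnthropicMessage[] } {
  const systemParts: string[] = [];
  const out: AnthropicMessage[] = [];

  // Append blocks, merging into the previous message when the role repeats.
  const push = (role: "user" | "assistant", blocks: AnthropicBlock[]): void => {
    if (blocks.length === 0) return;
    const last = out[out.length - 1];
    if (last && last.role === role) last.content.push(...blocks);
    else out.push({ role, content: [...blocks] });
  };

  for (const m of messages) {
    switch (m.role) {
      case "system":
        systemParts.push(m.content); // hoisted to the top-level `system` field
        break;
      case "user":
        if (m.content) push("user", [{ type: "text", text: m.content }]);
        break;
      case "tool":
        // Tool results travel inside a user message as tool_result blocks.
        push("user", [{ type: "tool_result", tool_use_id: m.tool_call_id, content: m.content }]);
        break;
      case "assistant": {
        const blocks: AnthropicBlock[] = [];
        if (m.content) blocks.push({ type: "text", text: m.content });
        for (const call of m.tool_calls ?? []) {
          const input: unknown = call.function.arguments ? JSON.parse(call.function.arguments) : {};
          blocks.push({ type: "tool_use", id: call.id, name: call.function.name, input });
        }
        push("assistant", blocks);
        break;
      }
    }
  }
  return { system: systemParts.length > 0 ? systemParts.join("\n\n") : undefined, messages: out };
}
```

In this sketch a tool result followed by a user turn collapses into a single user message with two blocks, which is the kind of merging that strict alternation calls for.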
4. Google Gemini / Vertex (Dedicated Client — Worth It)
Why native:
- `contents` / `parts` model (text, inlineData for vision/audio, functionCall/functionResponse).
- `systemInstruction` separate (Gemma special-cases by prepending).
- Tool definitions as `functionDeclarations` (different shape).
- Structured output via `responseMimeType: 'application/json'` + `responseSchema` (requires stripping unsupported JSON Schema keywords; already implemented with `stripUnsupportedFeatures`).
- Native "thinking" / `thinkingConfig` + `thoughtsTokenCount` in usage (different from content reasoning).
- Vertex AI: Bearer token + regional endpoint construction + project/location in path.
- Service tiers: `service_tier: FLEX | PRIORITY` (cost/latency tradeoffs), surfaced via response header.
- Grounding / search integration in some configs.
- The `GoogleClient` already handles Vertex vs AI Studio URLs, flex retry logic, thought signatures for Gemini 3.x function calling, Gemma system prompt folding, and schema stripping.
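The `contents` / `parts` translation can be sketched as follows. One detail worth showing is that a `functionResponse` part is keyed by function name, while canonical tool messages carry only a `tool_call_id`, so the translator has to remember which id belongs to which function. All names here (`ChatTurn`, `toGemini`, the `unknown_function` fallback) are assumptions for illustration, not the `GoogleClient` implementation. The `function` role for tool results follows the role list used in this document; check the provider reference for the role your target API version expects.

```typescript
type ChatTurn =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | {
      role: "assistant";
      content: string;
      tool_calls?: { id: string; type: "function"; function: { name: string; arguments: string } }[];
    }
  | { role: "tool"; tool_call_id: string; content: string };

type GeminiPart =
  | { text: string }
  | { functionCall: { name: string; args: Record<string, unknown> } }
  | { functionResponse: { name: string; response: Record<string, unknown> } };

interface GeminiContent {
  role: "user" | "model" | "function";
  parts: GeminiPart[];
}

interface GeminiBody {
  contents: GeminiContent[];
  systemInstruction?: { parts: { text: string }[] };
}

function toGemini(turns: ChatTurn[]): GeminiBody {
  const systemTexts: string[] = [];
  const contents: GeminiContent[] = [];
  const callNames = new Map<string, string>(); // tool_call_id -> function name

  for (const t of turns) {
    if (t.role === "system") {
      systemTexts.push(t.content);
    } else if (t.role === "user") {
      contents.push({ role: "user", parts: [{ text: t.content }] });
    } else if (t.role === "assistant") {
      const parts: GeminiPart[] = [];
      if (t.content) parts.push({ text: t.content });
      for (const call of t.tool_calls ?? []) {
        callNames.set(call.id, call.function.name);
        const args: Record<string, unknown> = call.function.arguments
          ? JSON.parse(call.function.arguments)
          : {};
        parts.push({ functionCall: { name: call.function.name, args } });
      }
      if (parts.length > 0) contents.push({ role: "model", parts });
    } else {
      // Tool result: look up the function name recorded when the call was emitted.
      const fnName = callNames.get(t.tool_call_id) ?? "unknown_function";
      contents.push({
        role: "function",
        parts: [{ functionResponse: { name: fnName, response: { result: t.content } } }],
      });
    }
  }

  const body: GeminiBody = { contents };
  if (systemTexts.length > 0) {
    body.systemInstruction = { parts: [{ text: systemTexts.join("\n\n") }] };
  }
  return body;
}
```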
5. Amazon Bedrock
- Old way: `InvokeModel`, with model-specific payloads (hell for multi-model code).
- Modern way: Converse API (recommended), a unified conversational interface + `toolConfig`.
  - Messages with content blocks (text, toolUse, toolResult; similar in spirit to Anthropic).
  - `system` as array of text blocks.
  - `inferenceConfig`, `toolConfig`.
  - Supports streaming (`converseStream`).
  - Structured output often achieved by defining a tool whose schema is the desired output (or model-specific features).
- Not wire-compatible with OpenAI Chat Completions.
- Many models on Bedrock are Anthropic Claude or Meta Llama (so you get their strengths under the Converse envelope).
Recommendation: Do not add a native Bedrock client in the core library at this time.
- Users who want Bedrock + failover can front it with an OpenAI-compatible gateway (many exist) or use the library's `openai` type against a Bedrock-compatible proxy.
- Direct Converse is valuable for pure-AWS enterprise stacks but is better served by AWS SDKs + thin wrappers or tools like LiteLLM when multi-provider is needed.
- If strong demand appears, a very thin `bedrock` provider using the Converse shape (translating our canonical tools/messages) could be added later, but it would be similar effort to the Anthropic one.
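For readers weighing that effort, the sketch below shows the "structured output via a tool" pattern in the Converse request shape described in this section: the desired output schema becomes a single tool, and `toolChoice` asks the model to call it. This is an illustration only; the function and interface names are hypothetical, the model ID (which goes in the request path or SDK parameters, not the body) is omitted, and forced tool choice is not available on every Bedrock model.

```typescript
interface ConverseToolEntry {
  toolSpec: {
    name: string;
    description: string;
    inputSchema: { json: Record<string, unknown> };
  };
}

interface ConverseBody {
  system: { text: string }[];
  messages: { role: "user" | "assistant"; content: { text: string }[] }[];
  inferenceConfig: { maxTokens: number; temperature: number };
  toolConfig: { tools: ConverseToolEntry[]; toolChoice: { tool: { name: string } } };
}

// Build a Converse-style body where the output schema is expressed as a forced tool.
function buildStructuredConverseBody(
  systemPrompt: string,
  userText: string,
  toolName: string,
  schema: Record<string, unknown>,
): ConverseBody {
  return {
    system: [{ text: systemPrompt }],
    messages: [{ role: "user", content: [{ text: userText }] }],
    inferenceConfig: { maxTokens: 512, temperature: 0 },
    toolConfig: {
      tools: [
        {
          toolSpec: {
            name: toolName,
            description: "Return the final answer in this structure.",
            inputSchema: { json: schema },
          },
        },
      ],
      toolChoice: { tool: { name: toolName } },
    },
  };
}

// Minimal view of a Converse reply: only the fields this sketch reads.
interface ConverseReply {
  output: {
    message: {
      content: { text?: string; toolUse?: { toolUseId: string; name: string; input: unknown } }[];
    };
  };
}

// Pull the structured payload out of the matching toolUse block.
function extractStructuredOutput(reply: ConverseReply, toolName: string): unknown {
  for (const block of reply.output.message.content) {
    if (block.toolUse && block.toolUse.name === toolName) return block.toolUse.input;
  }
  throw new Error(`No toolUse block named ${toolName} in reply`);
}
```

The extracted `input` would still go through the same client-side validation as any other provider's structured output.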
6. Azure OpenAI Service
- Wire format: Extremely close to OpenAI Chat Completions (and now Responses).
- Major differences are transport/auth/URL:
  - URL pattern: `https://{your-resource}.openai.azure.com/openai/deployments/{deployment-name}/chat/completions?api-version=2024-10-21` (or later).
  - `model` in body is often ignored or must match deployment; the path segment is the deployment ID.
  - Auth: `api-key: <key>` header (or Azure AD `Authorization: Bearer <token>`).
  - Some Azure-specific extensions (Azure AI Search integration via `dataSources`, etc.).
- Responses API is also being rolled out on Azure.
Current support: Users can sometimes hack by providing a fully-formed url (including query) and overriding headers, but buildHeaders in http.ts hardcodes Authorization: Bearer. The OpenAICompatibleClient constructor forces /v1 suffix.
Recommendations for better Azure ergonomics (low cost, high value):
- Extend `ProviderConfig` / `LLMClientOptions` with optional `headers?: Record<string,string>`, `authStyle?: 'bearer' | 'api-key' | 'custom'`, or a `queryParams` mechanism.
- Or add a dedicated `azure` type (thin wrapper around OpenAICompatibleClient that adjusts URL construction + header strategy). This is still "not one adapter per provider": it's one for a major cloud surface.
- Document the exact pattern today.
7. Self-Hosted & Local Servers
- Ollama: Dedicated client (good). NDJSON streaming, `/api/chat`, `think` flag, base64 images, `/api/show` for metadata, pull support.
- llama.cpp / llama-server: OpenAI compat (or its own). Library treats as `llamacpp` → OpenAICompatibleClient.
- vLLM: Excellent OpenAI server. Special "gemma diffusion native channel protocol" support already exists for certain builds.
- TGI (Text Generation Inference), LM Studio, Oobabooga, etc.: Generally OpenAI compat when enabled.
- Local structured output: Often relies on grammar/JSON mode in the engine rather than API-level `response_format.json_schema`. The library still sends the schema where supported and does client-side validation via Router.
Key API Dimension Comparison
Messages / Roles / Multimodal
- OpenAI compat: `system|user|assistant|tool`. Content: string or `[{type:'text'|'image_url', ...}]`.
- Anthropic: Only `user|assistant` in the messages array. `system` is top-level. Content always array of blocks. Tool results as `tool_result` inside a user message.
- Google: `user|model|function`. `systemInstruction` separate. Parts array (text, inlineData, functionCall, functionResponse). Thought signatures on parts for Gemini 3+.
- Bedrock Converse: Similar block style to Anthropic for supported models.
Library approach: Internal canonical is OpenAI-ish (with tool role + tool_calls + tool_call_id). Each native client does the translation (Anthropic and Google do significant work here; already implemented).
Tool / Function Calling
All major ones support it in 2026.
- OpenAI: `tools: [{type:'function', function:{name, description, parameters}}]`, `tool_choice`, streamed `tool_calls` deltas with `index`.
- Anthropic: `tools` with `input_schema`, responses emit `tool_use` blocks, results as `tool_result` blocks. Streaming uses `input_json_delta`.
- Google: `tools: [{functionDeclarations: [...] }]`, `toolConfig`, parts contain `functionCall` / `functionResponse`. Thought signatures important for some models.
- Converse (Bedrock): `toolConfig` with `toolSpec`, `toolUse` / `toolResult` content blocks.
Library: LLMToolDefinition, LLMToolCall (with optional thoughtSignature). Normalization of IDs and empty {} args is done in clients. chatWithTools autonomous loop is provider-agnostic (in Base + Router).
Good convergence here.
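The convergence is easiest to see on tool definitions, where the same name, description, and JSON Schema are simply re-wrapped per provider. The sketch below maps an OpenAI-style tool list onto the Anthropic and Google shapes listed above. The interface and function names are assumptions for this example, not the library's `LLMToolDefinition` API, and Google-specific schema stripping is not applied here.

```typescript
interface OpenAIStyleTool {
  type: "function";
  function: { name: string; description?: string; parameters: Record<string, unknown> };
}

interface AnthropicStyleTool {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

interface GeminiStyleTools {
  functionDeclarations: { name: string; description: string; parameters: Record<string, unknown> }[];
}

// Anthropic: flat objects with the schema under `input_schema`.
function toAnthropicTools(tools: OpenAIStyleTool[]): AnthropicStyleTool[] {
  return tools.map((t) => ({
    name: t.function.name,
    description: t.function.description ?? "",
    input_schema: t.function.parameters,
  }));
}

// Google: one `tools` entry wrapping every declaration in `functionDeclarations`.
function toGeminiTools(tools: OpenAIStyleTool[]): GeminiStyleTools[] {
  if (tools.length === 0) return [];
  return [
    {
      functionDeclarations: tools.map((t) => ({
        name: t.function.name,
        description: t.function.description ?? "",
        parameters: t.function.parameters,
      })),
    },
  ];
}
```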
Structured Output / JSON Enforcement
- OpenAI: `response_format: {type: 'json_schema', json_schema: {name, schema, strict?}}`. Constrained decoding.
- Google: `responseMimeType + responseSchema` (schema subset; library strips `pattern`, length/min/max numeric, `additionalProperties`, etc.; client-side validation still runs).
- Ollama: `format: 'json'` or `format: {schema object}` (grammar-backed on good models).
- Anthropic: Maturing native schema support (output_config / strict on tools in newer versions). Previously relied more on prompting + client validation.
- Many OpenAI-compat servers: Partial or none (rely on prompting or engine-level constrained decoding like Outlines/vLLM). Library always does final Zod/validator pass in Router.
Library strength: Centralized validation + best-effort provider constraint. This is the correct split.
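A minimal sketch of the schema-stripping half of that split, assuming a small illustrative keyword list (`pattern`, length and numeric bounds, `additionalProperties`). It is not the library's `stripUnsupportedFeatures`, and the real list of keywords a given provider rejects may differ. Keys inside `properties` are property names rather than schema keywords, so a field that happens to be called `pattern` has to survive.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumed keyword list for illustration; adjust to what the target provider rejects.
const UNSUPPORTED_KEYWORDS = new Set([
  "pattern",
  "minLength",
  "maxLength",
  "minimum",
  "maximum",
  "additionalProperties",
]);

function stripSchema(node: Json): Json {
  if (Array.isArray(node)) return node.map(stripSchema);
  if (node === null || typeof node !== "object") return node;

  const out: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(node)) {
    if (UNSUPPORTED_KEYWORDS.has(key)) continue; // drop the unsupported keyword

    if (key === "properties" && value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Keys under `properties` are field names: keep them all, clean their schemas.
      const props: { [key: string]: Json } = {};
      for (const [propName, propSchema] of Object.entries(value)) {
        props[propName] = stripSchema(propSchema);
      }
      out[key] = props;
    } else {
      out[key] = stripSchema(value);
    }
  }
  return out;
}
```

Because the stripped constraints are no longer enforced by the provider, the final validation pass against the original schema remains the source of truth.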
Streaming
- OpenAI compat: SSE, `data: {...}` with `delta.content`, `delta.tool_calls[]` (indexed accumulation needed), final `usage` sometimes on last chunk or separate.
- Ollama: NDJSON (each line a full `OllamaResponse` chunk). Has `thinking` and `content` fields.
- Google: SSE with `data: {candidates[0].content.parts[], usageMetadata}`. Library accumulates tools across parts.
- Anthropic: Rich typed SSE events (`message_start`, `content_block_start/delta/stop`, `message_delta`, `message_stop`, `thinking_delta`, `signature_delta`).
Library: parseSSE + parseNDJSON + per-client accumulation + pluggable StreamDecoder (passthrough / standard-chat / interleaved-reasoning). Gemma diffusion special handling already present. Good.
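The "indexed accumulation" needed on the OpenAI-compatible path can be sketched as below: `data:` lines are parsed, text deltas are concatenated, and tool-call fragments are merged by `index`, with empty arguments normalized to `{}`. This is a simplified illustration that assumes the whole SSE text is already in memory; the function and type names are hypothetical and do not describe the library's `parseSSE` or `StreamDecoder` internals.

```typescript
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface StreamChunk {
  choices: { delta: { content?: string; tool_calls?: ToolCallDelta[] } }[];
}

interface AssembledToolCall {
  id: string;
  name: string;
  arguments: string;
}

// Parse complete SSE text into chunks, skipping blank payloads and the [DONE] sentinel.
function parseSseData(raw: string): StreamChunk[] {
  const chunks: StreamChunk[] = [];
  for (const line of raw.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue;
    chunks.push(JSON.parse(payload) as StreamChunk);
  }
  return chunks;
}

// Merge text and tool-call fragments; fragments of one call share the same index.
function assembleStream(chunks: StreamChunk[]): { text: string; toolCalls: AssembledToolCall[] } {
  let text = "";
  const byIndex = new Map<number, AssembledToolCall>();

  for (const chunk of chunks) {
    const delta = chunk.choices[0]?.delta; // usage-only chunks may have no choices
    if (!delta) continue;
    if (delta.content) text += delta.content;

    for (const d of delta.tool_calls ?? []) {
      let call = byIndex.get(d.index);
      if (!call) {
        call = { id: "", name: "", arguments: "" };
        byIndex.set(d.index, call);
      }
      if (d.id) call.id = d.id;
      if (d.function?.name) call.name = d.function.name;
      if (d.function?.arguments) call.arguments += d.function.arguments;
    }
  }

  const toolCalls = Array.from(byIndex.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([, call]) => ({ ...call, arguments: call.arguments || "{}" }));
  return { text, toolCalls };
}
```

A production decoder would also buffer partial lines across network reads, which this sketch leaves out.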
Auth & Endpoint Construction
Varies enough that a single buildHeaders + simple URL suffix doesn't cover all:
- Bearer token (most OpenAI-compat, Vertex).
- `x-api-key` + version header (Anthropic).
- `api-key` header + `?api-version=` query + deployment-in-path (Azure).
- Key in query string (old Google AI Studio style; library handles via URL).
Current code handles the extremes via per-client header builders and URL logic. Room to make the compat path more configurable.
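One way to picture the four auth styles listed above is as a small discriminated union that decides which headers to set and whether the URL changes. This is a sketch under assumptions: `AuthStyle`, `applyAuth`, and the `kind` labels are invented for illustration and do not mirror the library's `buildHeaders`.

```typescript
type AuthStyle =
  | { kind: "bearer"; token: string }
  | { kind: "anthropic"; apiKey: string; version: string }
  | { kind: "azure-api-key"; apiKey: string }
  | { kind: "query-key"; apiKey: string };

interface AuthorizedTarget {
  url: string;
  headers: Record<string, string>;
}

function applyAuth(url: string, auth: AuthStyle): AuthorizedTarget {
  switch (auth.kind) {
    case "bearer":
      // Most OpenAI-compatible servers and Vertex.
      return { url, headers: { Authorization: `Bearer ${auth.token}` } };
    case "anthropic":
      // Key header plus the required version header.
      return { url, headers: { "x-api-key": auth.apiKey, "anthropic-version": auth.version } };
    case "azure-api-key":
      // Azure key auth; api-version and the deployment path are URL concerns.
      return { url, headers: { "api-key": auth.apiKey } };
    case "query-key": {
      // Older AI Studio style: the key travels as a `key` query parameter.
      const separator = url.includes("?") ? "&" : "?";
      return { url: `${url}${separator}key=${encodeURIComponent(auth.apiKey)}`, headers: {} };
    }
  }
}
```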
Reasoning / Thinking / "Invisible" Tokens
- Google: `thinkingConfig`, separate `thoughtsTokenCount`, thinking text may or may not be in visible parts depending on model/config.
- Anthropic: `thinking` content blocks + `signature`.
- Ollama: `message.thinking` field (when `think: true`).
- Some OpenAI-compat / newer models: special tokens, `<think>` tags, or extra fields.
- Library: `DecodedEvent {type: 'thinking'}`, `response.reasoning`, `usage.reasoningTokens`. Decoders (interleaved-reasoning) help surface `<think>` etc.
This area is still evolving; the abstraction is already ahead of most single-provider clients.
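For the `<think>` tag case, a non-streaming version of the idea fits in a few lines: reasoning inside the tags is pulled out and the remaining text is returned as the visible content. The function name and return shape are assumptions for this example; the library's interleaved-reasoning decoder works on streams and is not reproduced here.

```typescript
interface SplitReasoning {
  reasoning: string; // text found inside <think>...</think>, joined by newlines
  content: string; // visible answer with the tagged spans removed
}

// Assumes the full response text is available and every <think> tag is closed.
function splitThinking(text: string): SplitReasoning {
  const reasoningParts: string[] = [];
  const content = text.replace(/<think>([\s\S]*?)<\/think>/g, (_match: string, inner: string) => {
    reasoningParts.push(inner.trim());
    return "";
  });
  return { reasoning: reasoningParts.join("\n"), content: content.trim() };
}
```

A streaming decoder additionally has to cope with tags that are split across chunks or never closed, which is why the library treats this as a pluggable decoder rather than a one-off regex.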
Recommendations & Action Plan for universal-llm-client
High Priority (Documentation + Polish)
Document Anthropic properly:
- Add to README "Supported Providers" table.
- Expand `docs/guide/providers.md` with full Anthropic section (structured output notes, prompt caching via `enablePromptCaching`, thinking support, example config).
- Update support matrix.

"Everything else via OpenAI compat" section in providers.md + README:
- Big table or list with exact `{ type: 'openai', url: '...', apiKey }` snippets for xAI, Mistral, DeepSeek, Cohere compat, Groq, Together, Fireworks, OpenRouter, Perplexity Sonar, vLLM, etc.
- Note any known deltas (e.g., "strict mode may not be supported; set `output: { strict: false }`").

Update support matrix to include Anthropic (chat/streaming/tools/vision/structured/prompt-caching/thinking).
Medium Priority (Ergonomics / Edge Cases)
Make OpenAI-compatible path more flexible:
- Add to `ProviderConfig`: `headers?: Record<string, string>`, `authHeader?: string` (e.g. override "Authorization"), `authScheme?: 'Bearer' | 'Api-Key' | ''`, `appendPath?: string` or full control over suffix.
- Or a `customizeRequest` hook (advanced).
- This lets power users hit exotic compat endpoints or Azure without forking.

Azure-specific convenience (optional `type: 'azure'` or documented pattern):
- Thin handling for deployment-in-URL + `api-key` header + `api-version` query param.
- Low implementation cost (mostly URL/header logic on top of existing OpenAI client).

Prompt caching:
- Make `enablePromptCaching?: boolean` (already in ChatOptions) actually work across providers that support it.
- Anthropic: emit `cache_control` blocks when enabled.
- Google: support `cachedContent` references.
- OpenAI-compat: passthrough or document provider-specific `parameters`.

Responses API:
- Experiment / document how to target a `/v1/responses` endpoint using the existing `openai` type (if the body is close enough) or add minimal `responseApi: true` flag that changes the endpoint and some field names.
- Prioritize only if users request built-in tool use from OpenAI/Anthropic/etc.

Better model metadata / capabilities discovery:
- Enhance `getModelInfo()` overrides (many providers return rich `/models` or `/api/show` data).
- Surface context length, vision support, tool support, thinking support where available.
Low Priority / Future
- Bedrock Converse native client (only if direct AWS usage without gateways becomes a frequent request).
- Full first-class Responses API client (monitor adoption).
- Per-provider rate-limit / usage header parsing into auditor events.
Current Code Strengths (Validated by Research)
- Canonical message/tool format is the right pivot point.
- Router + centralized structured validation is excellent design (failover works even when schema formats differ).
- Per-provider translation is isolated and maintainable.
- Streaming + decoder strategy + tool execution loop are provider-agnostic.
- Gemma diffusion edge case handling shows willingness to support real-world server quirks inside the compat path.
Risks / Watch Items
- Rapid evolution of "thinking" and agentic surfaces (computer use, remote MCP, built-in tools) may require occasional canonical format extensions (e.g., more tool `type`s beyond `function`).
- Some providers are aggressive with safety/refusals that can look like tool-call or JSON failures; good error surfacing and auditor events help.
- Strict JSON schema support is still inconsistent on many compat servers; the library's client-side validation + `strict: false` option is the right mitigation.
Key Sources & Further Reading
- xAI docs: https://docs.x.ai/ (explicit OpenAI compat, base https://api.x.ai/v1)
- Mistral migration: Chat Completions structure matches OpenAI.
- Cohere Compatibility API docs.
- DeepSeek API docs (OpenAI + Anthropic compat endpoints).
- Perplexity docs (Sonar Chat Completions compat + Agent Responses compat).
- AWS Bedrock Converse API reference + tool use guide.
- Azure OpenAI REST reference (deployments path + api-version + api-key).
- Industry comparisons (Fireworks "Best LLM API Providers 2026", MorphLLM pricing matrix, etc.).
- "LLM-Rosetta" arXiv paper (hub-and-spoke between the four main formats).
- Structured output deep-dives (differences in enforcement across OpenAI/Anthropic/Google/Bedrock).
Follow-ups Implemented
The following items from the "Recommendations & Action Plan" have been baked into this change (same PR as the research + docs):
Transport flexibility (highest priority):
- `ProviderConfig` now supports `headers`, `queryParams`, `authHeader`, `authPrefix`, and `apiBasePath`.
- `buildHeaders()` respects `authHeader` / `authPrefix` (enables clean `api-key: xxx` style without Bearer).
- `OpenAICompatibleClient` no longer blindly appends `/v1`. `apiBasePath: ''` disables it. A new internal `buildUrl()` helper applies `queryParams` to every endpoint (`/chat/completions`, `/embeddings`, `/models`, streaming).
- Full Azure pattern example now works without hacks:

```ts
{ type: 'openai', url: 'https://...azure.../deployments/DEP', apiBasePath: '', queryParams: { 'api-version': '...' }, headers: { 'api-key': '...' } }
```
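To show how `apiBasePath` and `queryParams` could combine into a final endpoint URL, here is a small sketch. It is an approximation of the behavior described above, not the library's internal `buildUrl()`: the `/v1` default, the "skip if already present" check, and the `TransportOptions` name are assumptions, and the Azure resource name `example` in the usage note is a placeholder.

```typescript
interface TransportOptions {
  url: string;
  apiBasePath?: string; // assumed default: "/v1"; "" disables the suffix
  queryParams?: Record<string, string>;
}

function buildUrl(opts: TransportOptions, endpoint: string): string {
  const base = opts.url.replace(/\/+$/, ""); // drop trailing slashes
  const basePath = opts.apiBasePath ?? "/v1";
  // Assumption: do not append the base path when the URL already ends with it.
  const prefix = basePath !== "" && !base.endsWith(basePath) ? `${base}${basePath}` : base;

  const query = Object.entries(opts.queryParams ?? {})
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");

  const full = `${prefix}${endpoint}`;
  return query ? `${full}?${query}` : full;
}
```

With `url: 'https://example.openai.azure.com/openai/deployments/DEP'`, `apiBasePath: ''`, and `queryParams: { 'api-version': '2024-10-21' }`, calling `buildUrl(opts, '/chat/completions')` yields the deployment-scoped URL with the `api-version` query attached, while a plain local server URL still gets `/v1` by default.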
Prompt caching:
- `enablePromptCaching` in `ChatOptions` is now actually implemented for Anthropic.
- When true, the system prompt is sent as a content block with `cache_control: { type: 'ephemeral' }` (the standard high-ROI pattern for Claude).
- Updated types, logic in `AnthropicClient`, JSDoc, and user docs.
- Other providers get the flag passed through where possible (via `parameters` / headers) or documented as provider-specific.
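The system-prompt caching pattern can be sketched as follows: without the flag the system prompt stays a plain string, and with it the prompt becomes a single text block marked with `cache_control`. The helper name `buildAnthropicSystem` is hypothetical and the real logic in `AnthropicClient` may differ in detail.

```typescript
interface AnthropicSystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Returns the value for the top-level `system` field of a Messages request.
function buildAnthropicSystem(
  systemPrompt: string,
  enablePromptCaching: boolean,
): string | AnthropicSystemBlock[] {
  if (!enablePromptCaching) return systemPrompt;
  // Block form lets the prompt carry a cache marker.
  return [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }];
}
```

Caching pays off mainly when the same long system prompt is reused across many requests.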
Documentation & discoverability:
- Anthropic is now first-class in the support matrix, README, and has its own section in providers.md.
- Large "use the openai type for almost everything" table with copy-paste examples (xAI, Mistral, DeepSeek, Cohere, Groq, etc.).
- Research doc itself updated with implementation status.
Items intentionally left for later (lower urgency):
- Dedicated `type: 'azure'` sugar (the flexible transport options above make it unnecessary for most people).
- Full Responses API first-class support (current openai compat path + `apiBasePath` already lets you target `/responses` endpoints; built-in tool usage can be added on demand).
- Richer model capability discovery beyond what `getModelInfo` already does per-provider.
Conclusion
The ecosystem has converged far more than it has fragmented. OpenAI Chat Completions (plus its Responses sibling) is the gravitational center. Anthropic and Google maintain distinct but stable high-value protocols. Local servers mostly speak the common tongue.
universal-llm-client's "few native + strong, now even more flexible, compat" design is the correct one for being truly universal without an explosion of adapters.
This research, together with the concrete follow-up implementations (especially transport), significantly advances the goal.
Research + implementation complete.