The backends you define as services run as Vercel Functions, and those functions use Fluid compute by default. That means each backend in a Services project handles concurrent requests, scales with traffic, and reduces cold starts without you managing a server or choosing a compute model.
In this guide, you'll learn what Fluid compute gives the backends in a Services project, why it suits I/O-bound and agentic workloads, how Active CPU pricing works, and how to tune compute for each service.
The core of Fluid compute is optimized concurrency: instead of giving each request its own isolated instance, a single instance handles several requests at once. Most of its benefits come from that, and you get several of them without any configuration:
- Dynamic scaling: Vercel reuses idle capacity in existing instances before starting new ones, so a service stays responsive under load and quiet during lulls.
- Reduced cold starts: Vercel pre-warms production deployments and optimizes bytecode, so requests are less likely to wait on a fresh instance.
- Error isolation: An unhandled error in one request doesn't crash the other requests sharing the same instance.
- Region and zone failover: If an availability zone goes down, Vercel fails over to another zone, then to the next closest region if needed.
Because this applies per service, a project with several backends gets the same concurrency and scaling behavior for each one, whether it's a Python API, a Node.js server, or another supported runtime.
Fluid compute pays off most when a service spends much of its time waiting rather than computing. Backend work like querying a database, calling an external API, or fetching embeddings from an AI model is I/O-bound: the request spends most of its time idle while it waits for a response. With one request per instance, that idle time is wasted. With Fluid, the instance uses that time to make progress on other requests.
The practical effect is that a single instance can serve many in-flight requests, so a service needs fewer total instances to handle the same load. For an AI backend that mostly waits on model responses, or an API that mostly waits on a database, that concurrency is the difference between provisioning for peak request count and provisioning for actual CPU work.
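To make the difference concrete, here is a back-of-envelope sketch in TypeScript. It is not a model of Vercel's scheduler: the `Workload` shape, the function names, and the sample numbers are all hypothetical, and it assumes one fully usable CPU per instance with no cap on concurrent requests per instance.

```typescript
// Rough capacity estimate only. All names and numbers here are illustrative assumptions.
interface Workload {
  requestsPerSecond: number; // average arrival rate
  wallTimeMs: number;        // total time per request, including I/O waits
  cpuTimeMs: number;         // time per request spent actually executing code
}

// One request per instance: every in-flight request occupies an instance.
// Little's law: requests in flight = arrival rate x time in system.
function instancesWithoutSharing(w: Workload): number {
  return Math.max(1, Math.ceil((w.requestsPerSecond * w.wallTimeMs) / 1000));
}

// Shared instances: a waiting request uses no CPU, so the lower bound
// is set by CPU work alone (assumes one CPU per instance, no concurrency cap).
function instancesWithSharing(w: Workload): number {
  return Math.max(1, Math.ceil((w.requestsPerSecond * w.cpuTimeMs) / 1000));
}

// Hypothetical AI backend: 2 s per request, of which 40 ms is CPU work.
const aiBackend: Workload = { requestsPerSecond: 50, wallTimeMs: 2000, cpuTimeMs: 40 };

console.log(instancesWithoutSharing(aiBackend)); // 100
console.log(instancesWithSharing(aiBackend));    // 2
```

Real instances have memory and concurrency limits, so the second figure is a floor rather than a prediction, but the gap between the two numbers is the idle time that concurrency reclaims.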
Fluid compute separates the cost of running your code from the cost of waiting on I/O. Vercel bills three things for a service backend:
- Active CPU: Charged only while your code is actively running. When a request is waiting on I/O, CPU billing pauses.
- Provisioned memory: Charged while requests are in flight on an instance, until the last one finishes.
- Invocations: Charged per incoming request.
Between requests, an instance is paused, and you pay nothing. This pricing is what makes concurrency worthwhile for I/O-bound services: you pay for the CPU a service actually uses, not for the time it spends waiting on a database or a model, and a single instance serving many waiting requests keeps both CPU and memory usage efficient.
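The three billing dimensions can be sketched as arithmetic over a list of request timings on one instance. This is an illustration of the model described above, not Vercel's billing implementation: the `RequestTrace` and `Rates` types, the hourly units, and every rate value are hypothetical placeholders, so check the pricing docs for real numbers.

```typescript
// Illustrative cost model. Types, units, and rates are assumptions, not Vercel's published prices.
interface RequestTrace {
  startMs: number; // when the request arrived on the instance
  endMs: number;   // when its response finished
  cpuMs: number;   // time spent executing code, excluding I/O waits
}

interface Rates {
  perCpuHour: number;    // hypothetical Active CPU rate
  perGbHour: number;     // hypothetical provisioned memory rate
  perInvocation: number; // hypothetical per-request rate
}

const MS_PER_HOUR = 3_600_000;

// Memory is billed while at least one request is in flight, so overlapping
// requests share the same window and gaps between requests cost nothing.
function inFlightMs(traces: RequestTrace[]): number {
  const sorted = [...traces].sort((a, b) => a.startMs - b.startMs);
  let total = 0;
  let windowStart = 0;
  let windowEnd = -Infinity;
  for (const t of sorted) {
    if (t.startMs > windowEnd) {
      // The instance was idle before this request: close the previous window.
      if (windowEnd > -Infinity) total += windowEnd - windowStart;
      windowStart = t.startMs;
      windowEnd = t.endMs;
    } else {
      // This request overlaps the current window: extend it if needed.
      windowEnd = Math.max(windowEnd, t.endMs);
    }
  }
  if (windowEnd > -Infinity) total += windowEnd - windowStart;
  return total;
}

function estimateCost(traces: RequestTrace[], memoryGb: number, rates: Rates) {
  const cpuMs = traces.reduce((sum, t) => sum + t.cpuMs, 0);
  return {
    activeCpu: (cpuMs * rates.perCpuHour) / MS_PER_HOUR,
    memory: (memoryGb * inFlightMs(traces) * rates.perGbHour) / MS_PER_HOUR,
    invocations: traces.length * rates.perInvocation,
  };
}

// Two overlapping requests, then a pause, then a third request.
const traces: RequestTrace[] = [
  { startMs: 0, endMs: 1000, cpuMs: 20 },
  { startMs: 500, endMs: 1500, cpuMs: 30 },
  { startMs: 3000, endMs: 4000, cpuMs: 10 },
];

console.log(inFlightMs(traces)); // 2500
console.log(estimateCost(traces, 2, { perCpuHour: 0.1, perGbHour: 0.01, perInvocation: 0.000001 }));
```

In this example the three requests take 3,000 ms of wall time in total, but only 60 ms counts as Active CPU and only 2,500 ms counts toward memory, because the first two requests overlap and the idle gap is not billed.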
Fluid compute also enables a backend service to keep WebSocket connections open, which is the basis for real-time features such as live cursors and presence. Your WebSocket server runs as a service, and clients open a persistent connection through its route prefix rather than sending one-off requests.
Real-time services follow the same model as any function-backed service, where instances are shared between requests and paused between work.
For that reason, keep a few things in mind when building your application:
- Keep shared state in an external store such as Redis, not in instance memory, because any instance can handle a given connection and instances are paused between work.
- Handle reconnects on the client side, since a connection can close when an instance is recycled or when a duration limit is reached.
- Account for function duration limits when a connection must remain open for an extended period.
For walkthroughs of this pattern, see the cursors and presence build guides linked below.
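As a rough illustration of client-side reconnect handling, the sketch below retries a closed connection with capped exponential backoff and jitter. It uses the standard browser `WebSocket` API; the function names, default delays, and the example URL are hypothetical, and the build guides may structure this differently.

```typescript
// Client-side sketch. Names, defaults, and the URL below are illustrative assumptions.

// Capped exponential backoff with jitter, so clients don't all reconnect
// at the same moment after an instance is recycled.
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 15_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Pick a delay between half the ceiling and the ceiling.
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}

function connectWithRetry(
  url: string,
  onMessage: (data: unknown) => void,
  attempt = 0,
): void {
  const socket = new WebSocket(url);

  // A successful open resets the backoff for the next disconnect.
  socket.onopen = () => {
    attempt = 0;
  };

  socket.onmessage = (event) => onMessage(event.data);

  // The server may close the connection when an instance is recycled or a
  // duration limit is reached, so always schedule a reconnect.
  socket.onclose = () => {
    setTimeout(
      () => connectWithRetry(url, onMessage, attempt + 1),
      backoffDelayMs(attempt),
    );
  };
}

// Usage (hypothetical route prefix):
// connectWithRetry("wss://example.com/ws", (data) => console.log(data));
```

Because any instance can pick up the new connection, the client should re-send whatever identifies it (for example a room or user ID) after reconnecting, and the server should restore state from the external store rather than from memory.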
You can size compute for each service independently in your experimentalServices configuration. Two fields control this:
- memory: Sets how much memory, and proportionally how much CPU, a service's instances get. More memory can help a CPU-bound service finish faster.
- maxDuration: Sets how long an invocation can run before it's terminated. Give a service enough time for its normal work, including any waiting on I/O or streamed responses.
Because these are per service, you can give a heavier backend more memory or a longer timeout without changing the others. For the current defaults, plan-specific limits, and supported runtime versions, see the Vercel Functions configuration docs, as these values vary by plan and continue to evolve.
Two related capabilities are worth knowing about.
You can run work after a response is sent, such as logging or analytics, with waitUntil, so it doesn't delay the request. When a job needs to outlast function duration limits, such as an agent loop that runs for hours or longer, use Vercel Workflows. Workflows let code pause, resume, and maintain state.
Learn how to build on Services with these two step-by-step guides. Both create a real-time, full-stack app by pairing a frontend with a WebSocket backend within a single Vercel project that deploys to one domain.
- Build Figma-style multiplayer cursors with WebSockets on Vercel
- Build Notion-style real-time presence with WebSockets on Vercel
- Learn how Fluid compute works and how it compares to traditional serverless.
- Read the Vercel Functions docs for runtimes, regions, memory, and duration settings.
- Set up a multi-service project with The Complete Guide to Vercel Services.