Flash runs your Python functions on remote GPU/CPU workers while you maintain local control flow. This page explains what happens when you call an @Endpoint function.

What runs where

The @Endpoint decorator marks functions for remote execution. Everything else runs locally.
import asyncio
from runpod_flash import Endpoint, GpuType

@Endpoint(name="demo", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
def process_on_gpu(data):
    # This runs on Runpod worker
    import torch
    return {"result": "processed"}

async def main():
    # This runs on your machine
    result = await process_on_gpu({"input": "data"})
    print(result)  # This runs on your machine

if __name__ == "__main__":
    asyncio.run(main())  # This runs on your machine
Code                   Location
@Endpoint decorator    Your machine (marks function)
Inside process_on_gpu  Runpod worker
Everything else        Your machine

Flash apps

When you build a Flash app:

Development (flash run):
  • FastAPI server runs locally.
  • @Endpoint functions run on Runpod workers.
Production (flash deploy):
  • Each endpoint configuration becomes a separate Serverless endpoint.
  • All endpoints run on Runpod.

Execution flow

Here’s what happens when you call an @Endpoint function:
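The round trip can be sketched as a rough model. The submit_job and poll_result helpers below are hypothetical stand-ins for what the SDK does internally, not part of the runpod_flash API:

```python
import asyncio

# Rough model of the call flow; submit_job/poll_result are hypothetical
# stand-ins for the SDK's internal machinery.

async def submit_job(payload):
    # In reality: serialize the arguments and enqueue a job on the endpoint.
    return "job-1"

async def poll_result(job_id):
    # In reality: wait for a worker to pick up the job, run your function
    # body remotely, and report the result back.
    return {"result": "processed"}

async def call_endpoint(payload):
    job_id = await submit_job(payload)   # 1. Submit the job.
    # 2. Runpod routes it to an idle worker, or cold-starts a new one.
    # 3. The worker executes your function body.
    return await poll_result(job_id)     # 4. The client receives the result.

result = asyncio.run(call_endpoint({"input": "data"}))
print(result)
```

From your code's perspective, all of this is hidden behind a single await, which is why the earlier example simply awaits process_on_gpu.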

Endpoint naming

Flash identifies endpoints by their name parameter:
@Endpoint(
    name="inference",  # This identifies the endpoint
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3
)
def run_inference(data): ...
  • Same name, same config: Reuses the existing endpoint.
  • Same name, different config: Updates the endpoint automatically.
  • New name: Creates a new endpoint.
This means you can change parameters like workers without creating a new endpoint; Flash detects the change and updates it.
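The three naming rules can be modeled as a small lookup. This is illustrative only; existing and resolve are not real SDK names, just a sketch of the decision Flash makes:

```python
# Toy model of how Flash resolves an endpoint by name and config.
# `existing` stands in for the endpoints already provisioned on Runpod.
existing = {"inference": {"gpu": "A100", "workers": 3}}

def resolve(name, config):
    if name not in existing:
        existing[name] = config
        return "create"        # New name: create a new endpoint.
    if existing[name] == config:
        return "reuse"         # Same name, same config: reuse as-is.
    existing[name] = config
    return "update"            # Same name, new config: update in place.

print(resolve("inference", {"gpu": "A100", "workers": 3}))  # reuse
print(resolve("inference", {"gpu": "A100", "workers": 5}))  # update
print(resolve("embedding", {"gpu": "A100", "workers": 1}))  # create
```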

Worker lifecycle

Workers scale up and down based on demand and your configuration.

Worker states

State         Description                                                 Billing
Initializing  Downloading image, loading code                             Yes
Idle          Scaled down, waiting for requests                           No
Running       Processing requests                                         Yes
Throttled     Temporarily unable to run due to host resource constraints  No
Outdated      Marked for replacement after update                         Yes (while processing)
Unhealthy     Crashed; auto-retries for up to 7 days                      No

Scaling behavior

from runpod_flash import Endpoint, GpuGroup

@Endpoint(
    name="demo",
    gpu=GpuGroup.ANY,
    workers=(0, 5),   # (min, max) - Scale to zero when idle, up to 5 workers
    idle_timeout=60   # Seconds before running workers scale down
)
def process(data): ...
Example:
  1. First job arrives → Scale to 1 worker (cold start).
  2. More jobs arrive while worker busy → Scale up to max workers.
  3. Jobs complete → Workers stay running for idle_timeout seconds before scaling down to idle.
  4. No new jobs → Scale down to min workers.
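The four steps above can be approximated with a toy scaling policy. This is an assumption for illustration; the actual autoscaler runs on Runpod's side and its exact logic may differ:

```python
def desired_workers(queued, running, idle_seconds,
                    min_w=0, max_w=5, idle_timeout=60):
    """Toy approximation of the scaling policy for workers=(min_w, max_w)."""
    if queued > 0:
        # Jobs waiting: scale up toward one worker per job, capped at max_w.
        return min(running + queued, max_w)
    if idle_seconds >= idle_timeout:
        # Idle past the timeout: scale down to the configured minimum.
        return min_w
    # Inside the idle window: keep current workers warm.
    return running

print(desired_workers(queued=1, running=0, idle_seconds=0))   # 1: cold start
print(desired_workers(queued=4, running=2, idle_seconds=0))   # 5: capped at max
print(desired_workers(queued=0, running=2, idle_seconds=30))  # 2: still warm
print(desired_workers(queued=0, running=2, idle_seconds=90))  # 0: scaled down
```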

Cold starts and warm starts

Understanding cold and warm starts helps you predict latency and set expectations.

Cold start

A cold start occurs when no workers are available to handle your job, because:
  • You’re calling an endpoint for the first time.
  • All workers have been scaled down after not processing requests for idle_timeout seconds.
  • All running workers are busy processing requests.
What happens during a cold start:
  1. Runpod provisions a new worker with your configured GPU/CPU.
  2. The worker image starts (dependencies are pre-installed during build).
  3. Your function executes.
Typical timing: 10-60 seconds total, depending on GPU availability and image size.
When using flash build or flash deploy, dependencies are pre-installed in the worker image, eliminating pip installation at request time. When running standalone scripts with @Endpoint functions outside of a Flash app, dependencies may be installed on the worker at request time.

Warm start

A warm start occurs when a worker is already running and idle:
  • Worker completed a previous job and is waiting for more work.
  • Worker is within its idle_timeout period.
What happens during a warm start:
  1. Job is routed immediately to the idle worker.
  2. Your function executes.
Typical timing: ~1 second + your function’s execution time.

The relationship between configuration and starts

Your workers and idle_timeout settings directly affect cold start frequency:
  • workers=(0, n): Workers scale to zero when not processing. Every request after the idle_timeout period triggers a cold start.
  • workers=(1, n): At least one worker stays ready. The first request is warm; additional concurrent requests may trigger cold starts.
  • Higher idle_timeout: Workers stay running longer before scaling down, reducing cold starts for sporadic traffic.
See configuration best practices for specific recommendations based on your workload.
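To see how idle_timeout shapes cold-start frequency with workers=(0, n), here is a back-of-the-envelope model. It deliberately ignores execution time and concurrency (an assumption for clarity): a request cold-starts whenever the gap since the previous request exceeds the idle window.

```python
def count_cold_starts(request_times, idle_timeout):
    """Count cold starts under workers=(0, n): a request is cold when the
    gap since the previous one exceeds idle_timeout (all workers have
    scaled down in between)."""
    cold, prev = 0, None
    for t in sorted(request_times):
        if prev is None or t - prev > idle_timeout:
            cold += 1
        prev = t
    return cold

sporadic = [0, 90, 180, 270]  # one request every 90 seconds
print(count_cold_starts(sporadic, idle_timeout=60))   # 4: every request cold
print(count_cold_starts(sporadic, idle_timeout=120))  # 1: workers stay warm
```

In this model, raising idle_timeout above the typical gap between requests eliminates repeat cold starts for sporadic traffic, which is the trade-off the bullets above describe.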