AWS Lambda Durable Functions are a programming model and SDK that allow you to create stateful, multi-step workflows directly inside a Lambda function using a checkpoint-and-replay mechanism. By persisting execution progress, these functions can suspend execution for up to one year without incurring compute charges, making them ideal for long-running processes like human approvals or order fulfillment.
Key Takeaways
Here are the essential facts you need to know about Lambda Durable Functions:
- Checkpoint-and-Replay Model: The runtime saves the result of every “step.” If the function pauses and resumes, it replays the code from the start but skips completed steps using the saved data.
- Cost Efficiency: You do not pay for compute time while the function is waiting. Executions can suspend for up to one year.
- Deterministic Code Required: Because the code replays from line one, logic outside of checkpointed steps must be deterministic (e.g., avoid Math.random() or timestamps outside a step).
- New Primitives: You orchestrate logic using SDK primitives like context.step(), context.wait(), and context.invoke().
- Operational Constraints: You must publish a function version to trigger a durable execution (unqualified ARNs do not work), and specific IAM permissions are required.
Understanding Durable Execution
Traditionally, Lambda functions are stateless. If you needed to coordinate a workflow involving payment processing, inventory checks, and shipping, you usually had to wire together multiple Lambdas using AWS Step Functions or manage state manually in a database. AWS Lambda Durable Functions change this by bringing the orchestration logic directly into your code.
The Checkpoint-and-Replay Mechanism
The core concept here is “checkpoint-and-replay.” When your code runs, you wrap distinct operations in a step. When a step completes, Lambda saves (checkpoints) the result. If your function needs to wait—perhaps for a webhook callback or a simple timer—it suspends execution.
When the wait is over, Lambda spins up the function environment again. Crucially, it runs your handler code from the very beginning. However, when it encounters a step that has already finished, it does not execute the logic again. Instead, it injects the stored result from the checkpoint and moves to the next line. This allows you to write code that looks sequential but creates a resilient, stateful workflow.
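To make the replay behavior concrete, here is a toy model of checkpoint-and-replay. It is not the Durable Execution SDK: ToyContext, SuspendSignal, and runOnce are hypothetical names, the store is an in-memory Map standing in for persisted checkpoints, and everything is synchronous for brevity. It only illustrates how a completed step can be skipped when the handler runs again from the top.

```typescript
// Toy model of checkpoint-and-replay (NOT the real SDK; all names are hypothetical).

// Thrown by wait() to unwind the handler, simulating a suspension.
class SuspendSignal {
  constructor(readonly waitName: string) {}
}

class ToyContext {
  constructor(private store: Map<string, unknown>) {}

  // Run fn at most once per step name; on replay, return the saved result.
  step<T>(stepName: string, fn: () => T): T {
    if (this.store.has(stepName)) {
      return this.store.get(stepName) as T;
    }
    const result = fn();
    this.store.set(stepName, result); // "checkpoint" the result
    return result;
  }

  // First encounter: record the wait and suspend. On replay: fall through.
  // The toy assumes the timer has elapsed by the time of the next invocation.
  wait(waitName: string): void {
    if (this.store.has(waitName)) return;
    this.store.set(waitName, "wait-completed");
    throw new SuspendSignal(waitName);
  }
}

type RunResult<T> = { status: "suspended" } | { status: "done"; value: T };

// One invocation: always runs the handler from its first line.
function runOnce<T>(store: Map<string, unknown>, handler: (ctx: ToyContext) => T): RunResult<T> {
  try {
    return { status: "done", value: handler(new ToyContext(store)) };
  } catch (e) {
    if (e instanceof SuspendSignal) return { status: "suspended" };
    throw e;
  }
}

let chargeCalls = 0;
const workflow = (ctx: ToyContext): string => {
  const confirmationId = ctx.step("process-payment", () => {
    chargeCalls += 1; // side effect we want to happen exactly once
    return "conf-" + chargeCalls;
  });
  ctx.wait("cool-down");
  return ctx.step("send-receipt", () => "receipt for " + confirmationId);
};

const store = new Map<string, unknown>();
const first = runOnce(store, workflow);  // runs the payment step, then suspends
const second = runOnce(store, workflow); // replays from the top, skips the payment step
console.log(first.status, second.status, chargeCalls); // suspended done 1
```

The second invocation re-enters the handler at its first line, yet the payment callback runs only once because its result is read back from the store.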
The “Wait” Primitive and Billing
One of the most significant advantages for engineers is the billing model during waits. In standard Lambda, if you use sleep(), you pay for that duration. With Durable Functions, you use context.wait().
When a durable function enters a wait state, it completely suspends. You are not billed for compute time during this period. The execution can remain suspended for extended periods—up to one year—making this perfect for “Human-in-the-Loop” scenarios where a script might need to pause for days waiting for a manager’s approval.
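The billing difference can be sketched as simple arithmetic. The model below is an illustration only: the Segment type, the summarize function, and the durations are made up, and it counts compute time alone, leaving any other charges out of scope. It contrasts a suspended wait with a blocking sleep of the same length.

```typescript
// Toy accounting model: only active invocation time counts as billed compute;
// time spent suspended in a wait does not. Names and numbers are illustrative.

interface Segment {
  kind: "active" | "suspended";
  ms: number;
}

function summarize(segments: Segment[]): { billedMs: number; wallClockMs: number } {
  let billedMs = 0;
  let wallClockMs = 0;
  for (const segment of segments) {
    wallClockMs += segment.ms;
    if (segment.kind === "active") billedMs += segment.ms;
  }
  return { billedMs, wallClockMs };
}

const DAY_MS = 24 * 60 * 60 * 1000;

const approvalFlow: Segment[] = [
  { kind: "active", ms: 200 },            // validate the request, notify the manager
  { kind: "suspended", ms: 3 * DAY_MS },  // suspended while waiting for approval
  { kind: "active", ms: 150 },            // replay and finish after approval
];

const totals = summarize(approvalFlow);

// With a blocking sleep instead of a suspended wait, the whole wall-clock
// duration would count as compute time.
const billedIfSleepingMs = totals.wallClockMs;

console.log(totals.billedMs, billedIfSleepingMs); // 350 259200350
```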
Writing Durable Code
To implement this, you use the Durable Execution SDK (available for Node.js and Python) and wrap your handler. In Node.js/TypeScript, you use withDurableExecution, and in Python, you use the @durable_execution decorator.
Here is a conceptual look at how a payment workflow might look in Node.js:
    import { withDurableExecution } from "@aws/durable-execution-sdk-js";

    // paymentService and emailService stand in for your own application modules;
    // they are not part of the SDK.
    export const handler = withDurableExecution(async (event, context) => {
      // Step 1: Checkpoint the external API call
      const payment = await context.step("process-payment", async () => {
        return await paymentService.charge(event.amount);
      });

      // Step 2: Sleep without billing
      await context.wait({ seconds: 60 });

      // Step 3: Use the result from step 1
      await context.step("send-receipt", async () => {
        return await emailService.send(payment.confirmationId);
      });
    });

Critical Gotchas and Constraints
While researching this feature, I found several operational details that can trip you up if you aren’t careful.
The Determinism Trap: Because your code replays from line one, it must be deterministic. If you generate a random number or a timestamp (like new Date()) outside of a context.step, that value will change every time the function wakes up for a replay. This breaks the logic. Always put non-deterministic code inside a step so the result is frozen in a checkpoint.
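A small simulation shows why this matters. The step function below is a hypothetical stand-in that caches results by name, the way a checkpoint would, and now() is a fake clock used in place of Date.now() so the behavior is reproducible. Neither is part of the real SDK.

```typescript
// Toy replay demo (not the real SDK): step() caches results by name in a store
// that survives across invocations, as a checkpoint would.
const checkpoints = new Map<string, unknown>();

function step<T>(stepName: string, fn: () => T): T {
  if (checkpoints.has(stepName)) return checkpoints.get(stepName) as T;
  const result = fn();
  checkpoints.set(stepName, result);
  return result;
}

// Stand-in for Date.now(): returns a different value on every call.
let fakeClock = 1000;
const now = (): number => (fakeClock += 500);

function handler(): { unsafeId: string; safeId: string } {
  // Outside a step: recomputed on every replay, so the value drifts.
  const unsafeId = "order-" + now();
  // Inside a step: computed once, then read back from the checkpoint.
  const safeId = step("make-id", () => "order-" + now());
  return { unsafeId, safeId };
}

const firstRun = handler(); // { unsafeId: "order-1500", safeId: "order-2000" }
const replay = handler();   // { unsafeId: "order-2500", safeId: "order-2000" }
console.log(firstRun, replay);
```

Any later logic that depends on unsafeId would see a different value after each resume, while safeId stays stable across replays.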
IAM Permissions: Your execution role needs specific permissions to manage the state. If your function fails to start, verify you have attached lambda:CheckpointDurableExecutions and lambda:GetDurableExecutionState to the role.
Versioning is Mandatory: You cannot trigger a durable execution using the unpublished $LATEST version or an unqualified ARN. You must publish a version of your function (e.g., my-function:1) and invoke that specific version.
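A pre-flight check in the caller can catch this mistake before invocation. The helpers below are a sketch under the rule as stated here: getQualifier and isNumberedVersion are hypothetical names, the account ID is a placeholder, and the parsing relies only on the standard Lambda function ARN layout, where an optional qualifier follows the function name. Alias handling is deliberately left out.

```typescript
// Returns the qualifier (version or alias) of a Lambda function ARN,
// or undefined when the ARN is unqualified. Hypothetical helper.
function getQualifier(functionArn: string): string | undefined {
  // Layout: arn:aws:lambda:<region>:<account>:function:<name>[:<qualifier>]
  const parts = functionArn.split(":");
  if (parts.length < 7 || parts.length > 8 || parts[2] !== "lambda" || parts[5] !== "function") {
    throw new Error("Not a Lambda function ARN: " + functionArn);
  }
  const qualifier = parts[7];
  return qualifier === undefined || qualifier === "" ? undefined : qualifier;
}

// True only when the ARN points at a numbered, published version.
function isNumberedVersion(functionArn: string): boolean {
  const qualifier = getQualifier(functionArn);
  return qualifier !== undefined && /^[0-9]+$/.test(qualifier);
}

const base = "arn:aws:lambda:us-east-1:123456789012:function:my-function";
console.log(isNumberedVersion(base + ":1"));       // true
console.log(isNumberedVersion(base));              // false (unqualified)
console.log(isNumberedVersion(base + ":$LATEST")); // false
```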
Cross-Account Limitations: While you can invoke other Lambdas using context.invoke(), my research indicates that invoked functions generally must be in the same AWS account. Cross-account orchestration via this specific mechanism is not currently supported.
Conclusion
AWS Lambda Durable Functions drastically simplify building complex workflows by removing the need for external state machines for many use cases. We learned that by using the checkpoint-and-replay model, we can build efficient, readable code that handles long waits without incurring idle compute costs. However, we must be vigilant about writing deterministic code and ensuring our operational setup—specifically IAM roles and function versioning—is correct before deploying.