Category: Lambda

  • 5 Critical Mistakes to Avoid with AWS Lambda Durable Functions

    While researching AWS Lambda Durable Functions, we uncovered several operational “gotchas” that can cause silent failures or unexpected behavior in production. Because Durable Functions rely on a strict Replay model, coding patterns that work in standard functions will break durable ones.

    Key Takeaways

    Watch out for these top 5 issues:

    • Non-Determinism: Random code outside steps breaks the replay.
    • Versioning: You cannot invoke $LATEST.
    • IAM Permissions: Checkpointing requires specific permissions.
    • Cross-Account Invocations: Not supported directly.
    • SDK Drift: Always bundle your SDK.

    Main Explanation

    1. The Non-Determinism Trap

    The most dangerous pitfall is using non-deterministic logic outside of a checkpointed step. When a function resumes, it replays the handler code from line one. If you have `const now = new Date()` or `Math.random()` at the top of your handler (outside a step), that value changes on every replay.

    This confuses the runtime, as the logic diverges from the saved history. Fix: Always perform time calculations or randomization inside a `context.step()` so the result is frozen in the checkpoint.
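    To make the replay contract concrete, here is a minimal, self-contained simulation of the checkpoint mechanism in plain Node.js. This is a sketch, not the real Durable Execution SDK: the first execution of a step stores its result, and every replay returns the stored value instead of re-running the logic.

    ```javascript
    // Minimal checkpoint-and-replay simulation -- NOT the real SDK.
    // It shows why Math.random() belongs inside a step: the value is
    // computed once, checkpointed, and reused on every replay.
    const checkpoints = new Map();

    async function step(name, fn) {
      if (checkpoints.has(name)) {
        return checkpoints.get(name); // replay: skip execution, return stored result
      }
      const result = await fn();
      checkpoints.set(name, result); // first run: checkpoint the result
      return result;
    }

    async function handler() {
      // Frozen: the random value survives replays unchanged.
      return step("pick-winner", async () => Math.random());
    }
    ```

    Calling `handler()` twice here models an initial run followed by a replay: both calls yield the identical number, whereas a bare `Math.random()` at the top of the handler would differ each time.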

    2. Missing IAM Permissions

    Durable functions need to read and write their state to a backend service. A standard Lambda execution role does not have these rights. If your function starts and immediately fails (or hangs), check your role.

    Fix: You must attach lambda:CheckpointDurableExecutions and lambda:GetDurableExecutionState to the execution role.
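    As a sketch, the corresponding policy statement might look like this (region, account ID, and function name are placeholders; scope the `Resource` to your own function):

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "lambda:CheckpointDurableExecutions",
            "lambda:GetDurableExecutionState"
          ],
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-durable-fn:*"
        }
      ]
    }
    ```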

    3. The Unqualified ARN Error

    You cannot trigger a durable execution using the default $LATEST alias. Durable executions must be tied to a specific code snapshot.

    Fix: You are required to publish a version (e.g., aws lambda publish-version) and use the qualified ARN (ending in :1, :2) when invoking the function.
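    A minimal CLI sequence, with a placeholder function name and a standard Invoke call assumed:

    ```shell
    # Publish an immutable version of the current code...
    aws lambda publish-version --function-name my-durable-fn

    # ...then invoke the published version via the --qualifier flag
    # (equivalent to using the qualified ARN ending in :1).
    aws lambda invoke \
      --function-name my-durable-fn \
      --qualifier 1 \
      --payload '{"orderId": "123"}' \
      out.json
    ```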

    4. Cross-Account Limitations

    Do not attempt to use context.invoke() to call a Lambda function in a different AWS account. The research confirms that the current architecture requires invoked functions to reside in the same account as the orchestrator.

    5. Relying on Runtime SDKs

    While AWS provides the Durable Execution SDK in the runtime (e.g., Node.js 24), relying on it can lead to “drift” if the runtime updates.

    Fix: It is a recommended best practice to include the Durable Execution SDK in your deployment package (via npm install or pip install) rather than relying on the pre-installed version. This ensures your production code is stable regardless of underlying platform updates.
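    For example (the npm package name matches the import used in the SDK examples; the pip package name is an assumption based on the Python module name and should be verified):

    ```shell
    # Pin the SDK in your deployment package rather than relying on the
    # runtime-provided copy. Verify exact package names before use.
    npm install @aws/durable-execution-sdk-js      # Node.js
    pip install aws-durable-execution-sdk-python   # Python (name assumed)
    ```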

    Conclusion

    Durable Functions are powerful, but they are strict. By adhering to deterministic coding practices, ensuring proper IAM setup, and managing versions correctly, you can deploy reliable workflows that leverage the full power of the checkpoint-and-replay model.

  • AWS Lambda Durable Functions vs Step Functions

    AWS Lambda Durable Functions allow developers to write stateful, multi-step workflows entirely within code, removing the need for external orchestrators for many use cases. In this tutorial, we will build a classic e-commerce pattern—Order Processing—handling validation, payment, and inventory checks with built-in retries and checkpoints.

    Key Takeaways

    In this tutorial, we will cover the following concepts:

    • The DurableContext: How to inject the coordination context into your handler (Node.js and Python).
    • Steps and Checkpoints: Using context.step to persist results and avoid re-executing logic.
    • Waits: Suspending execution without billing using context.wait.
    • Code Structure: Comparing the syntax between TypeScript/Node.js and Python.

    The Order Processing Workflow

    We are going to orchestrate a workflow that validates an order, processes a payment, waits for a brief confirmation period, and then confirms the order. If the function crashes or pauses, it resumes exactly where it left off.

    Node.js / TypeScript Implementation

    For Node.js (specifically runtimes like Node.js 24), we use the withDurableExecution wrapper. This injects the context object we need to manage state.

    ```javascript
    import { withDurableExecution } from "@aws/durable-execution-sdk-js";

    export const handler = withDurableExecution(async (event, context) => {
      const orderId = event.orderId;

      // Step 1: Validate the order. The result is checkpointed.
      const validation = await context.step("validate-order", async () => {
        // Assume customerService is an imported module
        return await customerService.validate(event.customerId);
      });

      if (!validation.isValid) {
        return { status: "rejected", reason: "Invalid Customer" };
      }

      // Step 2: Process Payment.
      // If the function fails after this, payment is not charged twice.
      const payment = await context.step("process-payment", async () => {
        return await paymentService.charge(orderId, event.amount);
      });

      // Step 3: Wait.
      // Execution suspends here. You are NOT billed for these 10 seconds.
      await context.wait({ seconds: 10 });

      // Step 4: Confirm Order
      await context.step("confirm-order", async () => {
        return await inventoryService.confirm(orderId);
      });

      return { status: "completed", orderId };
    });
    ```

    Python Implementation

    If you prefer Python (specifically Python 3.14), the pattern uses decorators. The @durable_execution decorator handles the context injection, and @durable_step can be used on helper functions.

    ```python
    from aws_durable_execution_sdk_python import DurableContext, durable_execution, durable_step
    from aws_durable_execution_sdk_python.config import Duration


    @durable_step
    def validate_order(order_id):
        # Logic to validate order
        return {"status": "valid"}


    @durable_execution
    def lambda_handler(event, context: DurableContext):
        order_id = event['orderId']

        # Step 1: Call the decorated step function
        # Note: context.step is used to wrap inline logic or calls
        validation = context.step(
            lambda: validate_order(order_id),
            name="validate-order"
        )

        # Step 2: Wait using the Duration config
        context.wait(Duration.from_seconds(10))

        # Step 3: Finalize
        result = context.step(
            lambda: {"status": "confirmed", "id": order_id},
            name="finalize"
        )
        return result
    ```

    How Replay Works

    When the context.wait finishes (after 10 seconds), Lambda re-invokes your handler. It runs the code from the top. However, when it hits “validate-order” and “process-payment,” it sees those checkpoints exist. It skips the actual execution of those functions and immediately returns the stored result. The code only “runs” for the first time at lines following the wait.

    Conclusion

    We demonstrated how to build a multi-step workflow using simple code primitives. By wrapping side effects in steps, we ensure they are performed exactly once, and by using wait, we can pause execution for up to one year without paying for idle compute.

  • AWS Lambda Durable Functions

    AWS Lambda Durable Functions are a programming model and SDK that allow you to create stateful, multi-step workflows directly inside a Lambda function using a checkpoint-and-replay mechanism. By persisting execution progress, these functions can suspend execution for up to one year without incurring compute charges, making them ideal for long-running processes like human approvals or order fulfillment.

    Key Takeaways

    Here are the essential facts you need to know about Lambda Durable Functions:

    • Checkpoint-and-Replay Model: The runtime saves the result of every “step.” If the function pauses and resumes, it replays the code from the start but skips completed steps using the saved data.
    • Cost Efficiency: You do not pay for compute time while the function is waiting. Executions can suspend for up to one year.
    • Deterministic Code Required: Because the code replays from line one, logic outside of checkpointed steps must be deterministic (e.g., avoid Math.random() or timestamps outside a step).
    • New Primitives: You orchestrate logic using SDK primitives like context.step(), context.wait(), and context.invoke().
    • Operational Constraints: You must publish a function version to trigger a durable execution (unqualified ARNs do not work) and specific IAM permissions are required.

    Understanding Durable Execution

    Traditionally, Lambda functions are stateless. If you needed to coordinate a workflow involving payment processing, inventory checks, and shipping, you usually had to wire together multiple Lambdas using AWS Step Functions or manage state manually in a database. AWS Lambda Durable Functions change this by bringing the orchestration logic directly into your code.

    The Checkpoint-and-Replay Mechanism

    The core concept here is “checkpoint-and-replay.” When your code runs, you wrap distinct operations in a step. When a step completes, Lambda saves (checkpoints) the result. If your function needs to wait—perhaps for a webhook callback or a simple timer—it suspends execution.

    When the wait is over, Lambda spins up the function environment again. Crucially, it runs your handler code from the very beginning. However, when it encounters a step that has already finished, it does not execute the logic again. Instead, it injects the stored result from the checkpoint and moves to the next line. This allows you to write code that looks sequential but creates a resilient, stateful workflow.

    The “Wait” Primitive and Billing

    One of the most significant advantages for engineers is the billing model during waits. In standard Lambda, if you use sleep(), you pay for that duration. With Durable Functions, you use context.wait().

    When a durable function enters a wait state, it completely suspends. You are not billed for compute time during this period. The execution can remain suspended for extended periods—up to one year—making this perfect for “Human-in-the-Loop” scenarios where a script might need to pause for days waiting for a manager’s approval.

    Writing Durable Code

    To implement this, you use the Durable Execution SDK (available for Node.js and Python) and wrap your handler. In Node.js/TypeScript, you use withDurableExecution, and in Python, you use the @durable_execution decorator.

    Here is a conceptual look at how a payment workflow might look in Node.js:

    ```javascript
    import { withDurableExecution } from "@aws/durable-execution-sdk-js";

    export const handler = withDurableExecution(async (event, context) => {
      // Step 1: Checkpoint the external API call
      const payment = await context.step("process-payment", async () => {
        return await paymentService.charge(event.amount);
      });

      // Step 2: Sleep without billing
      await context.wait({ seconds: 60 });

      // Step 3: Use the result from step 1
      await context.step("send-receipt", async () => {
        return await emailService.send(payment.confirmationId);
      });
    });
    ```

    Critical Gotchas and Constraints

    While researching this feature, I found several operational details that can trip you up if you aren’t careful.

    The Determinism Trap: Because your code replays from line one, it must be deterministic. If you generate a random number or a timestamp (like new Date()) outside of a context.step, that value will change every time the function wakes up for a replay. This breaks the logic. Always put non-deterministic code inside a step so the result is frozen in a checkpoint.

    IAM Permissions: Your execution role needs specific permissions to manage the state. If your function fails to start, verify you have attached lambda:CheckpointDurableExecutions and lambda:GetDurableExecutionState to the role.

    Versioning is Mandatory: You cannot trigger a durable execution using the $LATEST alias or an unqualified ARN. You must publish a version of your function (e.g., my-function:1) and invoke that specific version.

    Cross-Account Limitations: While you can invoke other Lambdas using context.invoke(), the research indicates that invoked functions generally must be in the same AWS account. Cross-account orchestration via this specific mechanism is not currently supported.

    Conclusion

    AWS Lambda Durable Functions drastically simplify building complex workflows by removing the need for external state machines for many use cases. We learned that by using the checkpoint-and-replay model, we can build efficient, readable code that handles long waits without incurring idle compute costs. However, we must be vigilant about writing deterministic code and ensuring our operational setup—specifically IAM roles and function versioning—is correct before deploying.

  • Leveraging GPUs and Graviton with Lambda Managed Instances

    For years, if you needed a GPU or specialized CPU architecture, you had to leave Lambda and manage EC2 or ECS clusters. AWS Lambda Managed Instances (LMI) removes this barrier, providing direct access to specialized hardware families like Graviton4, GPUs, and high-bandwidth networking.

    Key Takeaways

    Hardware features now available to serverless functions include:

    • GPU Support: Run PyTorch or TensorFlow inference directly in Lambda.
    • Graviton4 Access: Use the latest ARM-based processors for better price/performance.
    • EFA Networking: Bandwidth capabilities suitable for High Performance Computing (HPC).

    Deep Dive: Hardware Selection

    AI and Inference Workloads

    The primary use case driving this feature is AI/ML inference. Previously, loading a large model into a Lambda function was slow and CPU-bottlenecked. With LMI, you can select a Capacity Provider backed by `g` or `p` family instances.

    This allows the function to utilize CUDA cores for rapid inference. When you combine this with the ability to pre-provision instances, you can keep the heavy models loaded in GPU memory, awaiting invocation events without the “cold start” penalty of loading the model from S3 every time.

    HPC and EFA

    For scientific computing, the inclusion of Elastic Fabric Adapter (EFA) support is significant. This bypasses the OS network stack for lower latency and higher throughput. While typical web APIs won’t need this, simulation workloads that require high inter-node communication can now be orchestrated via Lambda events rather than complex batch schedulers.

    Architecture Matching

    One critical configuration detail: you must strictly match your function’s architecture (x86_64 or arm64) to your Capacity Provider’s instance requirements. A mismatch here will cause deployment failures. If you plan to utilize Graviton4 for cost savings, ensure your deployment pipeline cross-compiles your code correctly for ARM.
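    For illustration, the architecture is pinned at function creation (the function name, runtime, and role ARN below are placeholders):

    ```shell
    # The --architectures value must match the Capacity Provider's
    # instance requirements (arm64 for Graviton, x86_64 otherwise).
    aws lambda create-function \
      --function-name my-arm-fn \
      --runtime nodejs22.x \
      --architectures arm64 \
      --handler index.handler \
      --zip-file fileb://function.zip \
      --role arn:aws:iam::123456789012:role/MyExecutionRole
    ```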

    Conclusion

    Lambda Managed Instances effectively decouple the hardware from the management model. We can now run heavy, hardware-dependent workloads without having to patch servers or manage autoscaling groups.

  • How Runtime Workers Change the AWS Lambda Behavioral Model

    If you are a veteran Lambda developer, you are used to the “one event, one environment” model. Lambda Managed Instances breaks this rule by introducing “Runtime Workers.” This architectural shift allows multiple events to be processed in parallel on a single instance, which has profound implications for how we write thread-safe code.

    Key Takeaways

    The new execution environment behaves differently in several key areas:

    • Parallel Execution: A single EC2 instance runs multiple workers, processing multiple requests simultaneously.
    • Shared State Danger: Global variables and casual caching mechanisms must now be thread-safe.
    • Extended Init Phase: The initialization window can last up to 15 minutes, far longer than the 10-second limit standard Lambda developers are used to.

    Deep Dive: The Runtime Worker Model

    Concurrency and Thread Safety

    In standard Lambda, a global variable `counter = 0` is safe because only one event touches it at a time. In LMI, multiple Runtime Workers exist within the same environment (the same EC2 instance). If your code relies on local ephemeral storage (/tmp) or global memory variables without locking mechanisms, you will encounter race conditions.

    We must optimize our code for this shared environment. This might mean implementing connection pooling more aggressively or ensuring that temporary file names are cryptographically unique to avoid collisions between workers.
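    Two of these defenses can be sketched in plain Node.js: collision-proof temp file names via `crypto.randomUUID()`, and a minimal promise-based mutex that serializes updates to shared state (a sketch, not a production locking library):

    ```javascript
    // Defensive patterns for a shared execution environment.
    const crypto = require("crypto");
    const path = require("path");
    const os = require("os");

    // 1. Cryptographically unique temp paths avoid /tmp collisions
    //    between runtime workers on the same instance.
    function uniqueTmpPath(prefix) {
      return path.join(os.tmpdir(), `${prefix}-${crypto.randomUUID()}`);
    }

    // 2. A tiny async mutex: chains each critical section onto the
    //    previous one so shared-state updates never interleave.
    class Mutex {
      constructor() { this.tail = Promise.resolve(); }
      run(fn) {
        const result = this.tail.then(fn);
        this.tail = result.catch(() => {}); // keep the chain alive on errors
        return result;
      }
    }

    // Usage: parallel "workers" incrementing shared state via the lock.
    const mutex = new Mutex();
    let counter = 0;
    async function increment() {
      await mutex.run(async () => { counter += 1; });
    }
    ```

    The same idea extends to connection pools: create them once during init, then guard any mutable bookkeeping with a lock rather than assuming single-threaded access.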

    The 15-Minute Init Window

    One of the most surprising research findings is the expanded initialization capacity. The `Init` phase in LMI is allowed to run for up to 15 minutes. This effectively eliminates the strict startup limits of standard Lambda.

    This enables us to load massive AI models into memory or hydrate large local caches before the function starts accepting traffic. When combined with the “pre-provisioning” capabilities of Capacity Providers, this allows for heavy-duty applications that were previously impossible in FaaS.

    Conclusion

    We can no longer treat the Lambda handler as a solitary process. We must adopt coding practices closer to traditional container development—handling concurrency, locking, and shared state—while still enjoying the benefits of the serverless invocation model.

  • Deploying Your First VPC-Backed Lambda Managed Instance

    Setting up Lambda Managed Instances is not as simple as selecting a checkbox in the console. It requires a specific sequence of creating IAM roles, networking resources, and a Capacity Provider before you can even deploy code. Here is a step-by-step guide to getting it right.

    Key Takeaways

    The setup flow differs from standard Lambda in three main ways:

    • Dual IAM Roles: You need an Operator Role (for the infrastructure) and an Execution Role (for the code).
    • Sequence Matters: You must define the Capacity Provider with network details before creating the function.
    • Publishing is Mandatory: You cannot run code on the $LATEST alias; you must publish a version.

    Step-by-Step Configuration

    1. Create the IAM Roles

    First, we need the Operator Role. This is new. It grants the Lambda service permission to manage EC2 resources (like ENIs and Instances) in your VPC. You will need to trust the `lambda.amazonaws.com` principal and attach the `AWSLambdaManagedEC2ResourceOperator` managed policy. Don’t forget your standard Execution Role (basic Lambda permissions) as well.
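    A hedged sketch of that sequence with the AWS CLI (the role name is a placeholder; confirm the managed policy name against current documentation):

    ```shell
    # Create the Operator Role trusting the Lambda service principal...
    aws iam create-role \
      --role-name MyLmiOperatorRole \
      --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": { "Service": "lambda.amazonaws.com" },
          "Action": "sts:AssumeRole"
        }]
      }'

    # ...then attach the managed operator policy.
    aws iam attach-role-policy \
      --role-name MyLmiOperatorRole \
      --policy-arn arn:aws:iam::aws:policy/AWSLambdaManagedEC2ResourceOperator
    ```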

    2. Configure Networking

    You need a standard VPC setup. Create a dedicated security group for your Capacity Provider. This allows you to control traffic specifically for these instances. Ensure your subnets have routes to a NAT Gateway or VPC Endpoints if you need your function to talk to the internet or AWS services like S3.

    3. Create the Capacity Provider

    Use the AWS CLI to create the provider. This maps your network and infrastructure requirements. Note the `MaxVCpuCount`—this serves as your safety valve against runaway costs.

    ```shell
    aws lambda create-capacity-provider \
      --capacity-provider-name my-cp \
      --vpc-config SubnetIds=[subnet-123],SecurityGroupIds=[sg-456] \
      --permissions-config CapacityProviderOperatorRoleArn=arn:aws:iam::123456789012:role/MyOperatorRole \
      --instance-requirements Architectures=[x86_64] \
      --capacity-provider-scaling-config MaxVCpuCount=20
    ```

    4. Deploy and Publish

    When you create the function, you reference the Capacity Provider ARN. But here is the “gotcha” that trips up most engineers: it won’t run yet. You must execute `aws lambda publish-version`. Managed Instances only execute published versions of your code.

    Conclusion

    The barrier to entry is slightly higher here than with standard Lambda, but this stringent configuration ensures that your infrastructure is secure and bounded from the start.

  • When to use AWS Lambda Managed Instances

    AWS Lambda Managed Instances introduces a complex new pricing model involving invocation fees, underlying EC2 costs, and management fees. While AWS marketing suggests savings of up to 72%, this is only true for specific workloads using specific financial instruments.

    Key Takeaways

    Before migrating for cost reasons, consider these financial facts:

    • The Formula: Total Cost = EC2 Cost + Management Fee (~15%) + Per-Invocation Fee.
    • Savings Plans Apply: Unlike standard Lambda, you can apply EC2 Compute Savings Plans and Reserved Instances to the underlying capacity.
    • Steady State is King: This model punishes sporadic, spiky workloads but rewards high-throughput, predictable baselines.

    Analyzing the Cost Structures

    The Management Fee Factor

    There is a unique line item in the LMI pricing model: a management fee of approximately 15% on top of the EC2 infrastructure cost. This fee covers the fact that AWS is handling the OS patching, lifecycle management, and instance rotation for you.

    When calculating your ROI, you cannot simply compare EC2 spot prices to Lambda GB-second costs. You must factor in this surcharge. If your workload is tiny, this fee combined with the base cost of an EC2 instance (even a small one) will likely exceed the cost of standard pay-per-request Lambda.
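    The arithmetic can be sketched as a tiny helper; every price below is an illustrative placeholder, not a current AWS rate:

    ```javascript
    // Back-of-envelope LMI cost using the formula above:
    // Total = EC2 cost + management fee (~15%) + per-invocation fee.
    function lmiMonthlyCost({ ec2Hourly, hours, invocations, perInvocationFee, mgmtFeeRate = 0.15 }) {
      const ec2 = ec2Hourly * hours;
      return ec2 + ec2 * mgmtFeeRate + invocations * perInvocationFee;
    }

    // Hypothetical steady workload: one instance 24/7 plus 10M invocations.
    const cost = lmiMonthlyCost({
      ec2Hourly: 0.10,             // placeholder instance price
      hours: 730,                  // one month
      invocations: 10_000_000,
      perInvocationFee: 0.0000002, // placeholder
    });
    console.log(cost.toFixed(2)); // "85.95"
    ```

    Running the same numbers against standard Lambda GB-second pricing for your memory size is the honest comparison; for a tiny workload, the fixed EC2 baseline plus the 15% surcharge dominates and standard Lambda wins.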

    The Break-Even Point

    The “72% savings” figure cited in documentation relies heavily on Compute Savings Plans. Because LMI runs on standard instance families (like `m7g` or `g5`), it qualifies for the deep discounts offered by 1-year or 3-year commitments.

    This makes LMI ideal for the “baseline” of a heavy API service. For example, if you know you always need at least 50 vCPUs worth of compute to handle minimum traffic, moving that baseline to Reserved Instances via LMI is financially sound. However, if you migrate a “scale-to-zero” cron job that runs once a day, you are paying for idle EC2 time, which breaks the serverless cost model entirely.

    Conclusion

    We should view Lambda Managed Instances as a financial tool for steady-state optimization. It allows us to apply traditional EC2 discount mechanisms to serverless applications, provided the workload is consistent enough to justify the committed capacity.

  • Monitoring and Troubleshooting Lambda Managed Instances

    Troubleshooting AWS Lambda Managed Instances (LMI) requires a shift in mindset. Because your functions are running on EC2 instances that exist deeply within your VPC, you have to deal with new failure modes like “backpressure” and “unhealthy execution environments” that simply don’t exist in standard Lambda.

    Key Takeaways

    Here are the critical operational differences you need to monitor:

    • Backpressure is Real: Unlike standard Lambda which scales implicitly, LMI can reject requests if all runtime workers on an instance are busy.
    • Extension Stability is Critical: If a Lambda Extension crashes, the entire execution environment (the EC2 instance) is marked unhealthy and replaced.
    • VPC Telemetry Paths: Logs and traces (X-Ray) need a validated network path out of your private subnet to reach AWS endpoints.

    Deep Dive: The New Failure Modes

    Understanding Backpressure

    In standard Lambda, concurrency is effectively an abstraction. In LMI, it is a tangible resource constraint. Each Managed Instance runs a specific number of “runtime workers.” If you have configured your Capacity Provider with a low MaxVCpuCount or if your traffic spikes faster than the provider can scale (remember, it only absorbs about 50% of spikes by default), you will hit backpressure.

    This manifests as rejected invocations. You cannot just “fire and forget” anymore; your client-side applications must handle these rejections with retry logic and exponential backoff, or you need to put a queue (like SQS) in front of the function.
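    The retry loop can be sketched as follows; `invoke` is a stand-in for your actual Lambda invocation call, and the attempt/delay parameters are illustrative defaults:

    ```javascript
    // Client-side handling for backpressure rejections: retry with
    // exponential backoff and full jitter.
    async function invokeWithBackoff(invoke, { maxAttempts = 5, baseMs = 100 } = {}) {
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          return await invoke();
        } catch (err) {
          if (attempt === maxAttempts - 1) throw err; // retry budget exhausted
          // Full jitter: sleep a random amount up to the exponential cap.
          const delay = Math.random() * baseMs * 2 ** attempt;
          await new Promise((resolve) => setTimeout(resolve, delay));
        }
      }
    }
    ```

    If you would rather not push this logic into every client, the queue-based alternative mentioned above (SQS in front of the function) absorbs the spike instead.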

    The Cost of Unhealthy Environments

    We often use Lambda Extensions for observability. However, in LMI, the stability of these extensions is paramount. Research indicates that if an extension crashes, AWS doesn’t just restart the process; it marks the whole environment as unhealthy and replaces the instance.

    This triggers a specialized “replacement” lifecycle event. This is heavier than a standard cold start because it involves spinning up a new EC2-backed environment. You need to set up CloudWatch alerts specifically tracking “unhealthy” counts to catch buggy extensions early.

    Network Reachability

    Since these instances live in your VPC subnets and Security Groups, telemetry is not automatic. If your Security Group rules allow ingress but deny egress (or if your route table lacks a NAT Gateway or VPC Endpoints for CloudWatch), your logs will vanish. You must treat these functions like standard EC2 servers when debugging network connectivity.

    Conclusion

    Moving to Managed Instances gives us hardware control, but it hands back some operational responsibility. We have to monitor worker saturation, validate extension stability, and ensure our VPC plumbing allows telemetry to escape the subnet.

  • AWS Lambda Managed Instances Explained

    Introduction to AWS Lambda Managed Instances

    AWS Lambda Managed Instances is a new compute mode that allows you to run Lambda functions on EC2-backed infrastructure fully managed by AWS. It bridges the gap between the simplicity of serverless and the flexibility of EC2, enabling access to specialized hardware like GPUs, long-running processes, and cost optimization strategies like Savings Plans that were previously unavailable to standard Lambda functions.

    Key Takeaways

    Here are the essential facts you need to know about Lambda Managed Instances:

    • Hybrid Architecture: You get the developer experience of Lambda (packaging, APIs) with the underlying power of EC2 (hardware choice, networking).
    • Specialized Hardware: Unlike standard Lambda, you can now utilize GPUs, Graviton4 processors, and high-bandwidth networking (EFA) for AI/ML and HPC workloads.
    • New Concurrency Model: A single execution environment can handle multiple concurrent requests via “runtime workers,” improving resource utilization compared to the standard “one-event-per-environment” model.
    • Cost Optimization: For steady-state workloads, you can leverage EC2 pricing models, including Compute Savings Plans and Reserved Instances, potentially lowering costs significantly.
    • Infrastructure Control: While AWS manages the patching and lifecycle, you control the VPC placement and can enforce strict capacity limits via Capacity Providers.

    Understanding Lambda Managed Instances

    For years, network engineers and cloud architects have had to choose between the operational simplicity of AWS Lambda and the granular control of Amazon EC2. AWS Lambda Managed Instances removes this binary choice. It is designed for scenarios where standard Lambda restrictions—such as limited hardware options or higher costs for steady-state workloads—become a bottleneck.

    The Concept: Capacity Providers

    The core component of this feature is the Capacity Provider. You can think of this as the bridge between your function and the EC2 infrastructure. Instead of just deploying a function, you configure a Capacity Provider that defines:

    • Network placement: Which VPC subnets and security groups the instances will inhabit.
    • Instance requirements: The specific architecture (x86_64 or arm64/Graviton) and hardware capabilities.
    • Scaling limits: Parameters like MaxVCpuCount to control costs and guardrails.

    When you deploy your function, you associate it with this Capacity Provider. AWS then provisions and manages the fleet of EC2 instances required to meet your traffic demands, handling OS patching, lifecycle management, and health checks automatically.

    A New Operational Model: Runtime Workers

    If you are used to standard Lambda, pay close attention here because the concurrency model has changed. In standard Lambda, one execution environment handles exactly one request at a time.

    In Managed Instances, a single execution environment (running on an EC2 instance) can spawn multiple runtime workers. This allows the environment to process multiple events in parallel. This is a massive shift for utilization efficiency, but it introduces a specific “gotcha”: Backpressure.

    If all runtime workers on your instances are busy, requests may be rejected rather than queued indefinitely. You must design your clients to handle these rejections gracefully with retries or exponential backoff strategies.

    Setup and Security Nuances

    Setting this up requires a slightly more complex IAM structure than you might be used to. You now need two distinct roles:

    • Execution Role: The standard role the function assumes to access AWS services (e.g., writing to DynamoDB).
    • Operator Role: A new role that grants the Lambda service permission to create and manage EC2 resources (ENIs, Instances) in your account on your behalf.

    There is also a deployment caveat regarding versions: LMI will not run unpublished code from the $LATEST alias. You must publish a function version to deploy it to a Capacity Provider; code that has not been published into a version will not run on your managed instances.

    When to use (and when not to)

    We should be clear that Managed Instances is not a replacement for standard Lambda in all scenarios. It shines in specific use cases:

    • Steady-State Workloads: If you have high-volume, predictable traffic, the economics of EC2 Savings Plans via Managed Instances will likely beat standard Lambda pricing.
    • Heavy Compute: Workloads needing GPUs for AI inference or video transcoding.
    • Private Networking: Functions that must reside deep inside a VPC for compliance or to access private resources without NAT gateway overheads.

    However, for highly “spiky” traffic or sporadic workloads that scale to zero frequently, standard Lambda remains the superior choice due to its rapid scaling capabilities and true pay-per-use model.

    Conclusion

    AWS Lambda Managed Instances represents a maturation of the serverless landscape. It acknowledges that while the “scale-to-zero” model is revolutionary, there is a persistent need for specialized hardware, predictable pricing, and long-running execution environments.

    We learned that by using Capacity Providers and understanding the new parallel runtime worker model, we can leverage the best of EC2 without taking on the burden of server management. Just remember to watch your IAM roles and publish your function versions!

  • Lambda vs Containers vs EC2

    Lambda, containers, and EC2 represent three compute models on AWS with different trade-offs: Lambda auto-scales and charges per request but limits runtime to 15 minutes, containers offer portability and consistent environments across any infrastructure, while EC2 gives you full control over virtual machines with no execution time limits. Your choice depends on your workload pattern, required control level, and cost structure preferences.

    Key Takeaways

    • Lambda: use it for event-driven workloads under 15 minutes that need automatic scaling without server management.
    • Containers (ECS/EKS): choose them when you need portability, consistent environments, and any-duration workloads, accepting some management overhead.
    • EC2: pick it when you need full OS control, must run legacy applications, require specific hardware, or have steady-state workloads where reserved instances make sense financially.

    When Lambda Makes Sense

    Lambda works best for sporadic workloads. You write code, upload it, and AWS handles everything else. No servers to patch, no capacity planning. You pay only when your code runs, calculated per millisecond.

    I’ve seen Lambda shine for API backends that get uneven traffic, image processing triggers, and scheduled tasks. A client saved 70% on costs by moving their nightly report generation from an EC2 instance (running 24/7) to Lambda (running 20 minutes per day).

    Gotcha: Cold starts hurt. When Lambda hasn’t run recently, it takes extra time to initialize—sometimes seconds. This kills user experience for latency-sensitive applications. Provisioned concurrency solves this but adds cost.

    The 15-minute execution limit is a hard cap. No extensions, no exceptions. Your video transcoding job that takes 20 minutes? Lambda won’t work. You’ll also hit the 10GB memory ceiling eventually, and the default 512MB of temporary /tmp storage fills up faster than you’d expect.
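Those hard limits are worth encoding as a pre-flight check before you commit a workload to Lambda. The limit constants below reflect the service quotas as of writing (900 seconds, 10,240 MB memory, 512 MB default ephemeral storage); verify them against the current AWS documentation:

```python
# Pre-flight check against Lambda's hard service limits.
# Limit values current as of writing; verify against AWS Lambda docs.

LAMBDA_MAX_SECONDS = 900         # 15 minutes, no exceptions
LAMBDA_MAX_MEMORY_MB = 10_240    # 10 GB memory ceiling
LAMBDA_DEFAULT_TMP_MB = 512      # default /tmp ephemeral storage

def fits_in_lambda(duration_s: float, memory_mb: int, tmp_mb: int) -> bool:
    """Return True only if the workload fits within Lambda's limits."""
    return (duration_s <= LAMBDA_MAX_SECONDS
            and memory_mb <= LAMBDA_MAX_MEMORY_MB
            and tmp_mb <= LAMBDA_DEFAULT_TMP_MB)

# The 20-minute transcoding job from above fails the duration check:
print(fits_in_lambda(duration_s=20 * 60, memory_mb=4096, tmp_mb=400))  # False
```

If a job fails this check, move straight to the container or EC2 options below rather than fighting the platform.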

    When Containers Are Your Best Bet

    Containers package your application with its dependencies. Build once, run anywhere—your laptop, a colleague’s machine, or production. This consistency eliminates “works on my machine” problems.

    ECS (Elastic Container Service) offers AWS-native orchestration. It’s simpler but locks you into AWS. EKS (Elastic Kubernetes Service) runs Kubernetes, giving you portability across clouds and on-premises infrastructure.

    We use containers for microservices architectures where different teams own different services. Each team picks their language and dependencies without conflicts. Containers also work well for batch processing jobs that exceed Lambda’s limits but don’t need a full EC2 instance running continuously.

    Warning: Container orchestration has a learning curve. Kubernetes especially. I’ve watched teams spend months just getting comfortable with pods, services, and ingress controllers. Start with ECS if you’re new to containers—you can always migrate to EKS later.

    Resource allocation matters more than you think. Set your CPU and memory limits carefully. Too low and your containers crash under load. Too high and you waste money. Finding the sweet spot takes monitoring and iteration.

    When EC2 Is Still King

    EC2 gives you a virtual machine. You control everything: the operating system, installed software, network configuration, storage. This flexibility comes with responsibility—you patch the OS, you monitor resources, you handle scaling.

    Legacy applications often need EC2. That decade-old monolith with hard-coded file paths and specific library versions? EC2 lets you recreate its exact environment. You also need EC2 for applications requiring specific hardware like GPUs for machine learning or high-memory instances for in-memory databases.

    Steady-state workloads favor EC2 financially. If you’re running something 24/7, reserved instances or savings plans cut costs by 30-70%. Lambda’s pay-per-execution model becomes expensive when you’re executing constantly.

    Real-world anecdote: A company ran their database queries through Lambda because it seemed cheaper. Their queries ran every few seconds. The bill shocked them. Moving to a single t3.medium EC2 instance reduced costs by 85%.
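You can estimate where that crossover happens for your own workload with a quick break-even calculation. The rates below are assumed placeholders (roughly a t3.medium and 1 GB Lambda pricing), so treat the output as an order-of-magnitude guide, not a quote:

```python
# Break-even sketch: how many hours of Lambda execution per month cost
# the same as a small always-on instance? Rates are assumed placeholders.

INSTANCE_MONTHLY = 0.0416 * 730       # assumed always-on instance cost
LAMBDA_PER_GB_SECOND = 0.0000166667   # assumed Lambda compute rate

def lambda_break_even_hours(gb_memory: float) -> float:
    """Monthly Lambda hours that cost the same as the always-on instance."""
    lambda_per_hour = gb_memory * 3600 * LAMBDA_PER_GB_SECOND
    return INSTANCE_MONTHLY / lambda_per_hour

hours = lambda_break_even_hours(1.0)
print(f"{hours:.0f} hours/month (~{hours / 730:.0%} duty cycle)")
```

Under these assumptions Lambda stays cheaper until roughly two-thirds of the month is spent executing — but a query firing every few seconds blows past that duty cycle immediately, which is exactly the trap this company fell into.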

    You manage more with EC2. Auto Scaling Groups, Load Balancers, security patches, monitoring—all your responsibility. This operational overhead is real. Budget time for it.

    Making the Decision

    Start by mapping your execution pattern. Sporadic and event-driven? Lambda. Continuous with variable load? Containers. Continuous with predictable load? EC2.
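That mapping is simple enough to write down as code. The pattern names and categories here are this sketch's own shorthand, not any AWS API:

```python
# Toy encoding of the "execution pattern -> compute model" rule.
# Pattern names are this sketch's own vocabulary, not an AWS concept.

def pick_compute(pattern: str, duration_s: float = 0) -> str:
    """Map an execution pattern to a reasonable starting compute model."""
    if pattern == "sporadic" and duration_s <= 900:
        return "lambda"       # event-driven, fits under the 15-minute cap
    if pattern == "continuous-variable":
        return "containers"   # scale replicas with load (ECS/EKS)
    if pattern == "continuous-steady":
        return "ec2"          # reserved instances pay off here
    return "containers"       # flexible middle ground for everything else

print(pick_compute("sporadic", duration_s=120))  # lambda
print(pick_compute("continuous-steady"))         # ec2
```

Treat the output as a starting point, not a verdict — team skills and cost structure, covered next, can override it.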

    Consider your team’s skills. Lambda requires less operational knowledge but you’re constrained by AWS’s runtime options. Containers need orchestration expertise. EC2 demands traditional systems administration.

    Don’t lock yourself into one option. Mix them. We run our API on Lambda, background jobs in containers, and our database on EC2. Each workload gets the compute model that fits it best.

    Gotcha: The cheapest option on paper often isn’t cheapest in reality. Lambda’s zero operational overhead might save more money than EC2’s lower compute costs when you factor in the engineering time spent managing servers.

    Conclusion

    Lambda excels at event-driven, short-duration tasks with automatic scaling and minimal management. Containers provide portability and consistency for longer-running services and microservices architectures. EC2 delivers full control for legacy applications, specialized hardware needs, and predictable always-on workloads. Your workload characteristics, team capabilities, and cost structure determine the right choice—and you’ll likely use all three for different parts of your infrastructure.