Monitoring and Troubleshooting Lambda Managed Instances

Troubleshooting AWS Lambda Managed Instances (LMI) requires a shift in mindset. Because your functions run on EC2 instances inside your VPC, you have to deal with failure modes like “backpressure” and “unhealthy execution environments” that simply don’t exist in standard Lambda.

Key Takeaways

Here are the critical operational differences you need to monitor:

  • Backpressure is Real: Unlike standard Lambda, which scales transparently, LMI can reject requests when all runtime workers on an instance are busy.
  • Extension Stability is Critical: If a Lambda Extension crashes, the entire execution environment (the EC2 instance) is marked unhealthy and replaced.
  • VPC Telemetry Paths: Logs and traces (X-Ray) need a validated network path out of your private subnet to reach AWS endpoints.

Deep Dive: The New Failure Modes

Understanding Backpressure

In standard Lambda, concurrency is effectively an abstraction. In LMI, it is a tangible resource constraint. Each Managed Instance runs a fixed number of “runtime workers.” If you have configured your Capacity Provider with a low MaxVCpuCount, or if traffic spikes faster than the provider can scale (remember, by default it absorbs only about 50% of a spike), you will hit backpressure.

This manifests as rejected invocations. You cannot just “fire and forget” anymore; your client-side applications must handle these rejections with retry logic and exponential backoff, or you need to put a queue (like SQS) in front of the function.
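A minimal sketch of that client-side retry logic, using jittered exponential backoff. The `InvocationRejected` exception is a placeholder for whatever throttling/rejection error your invoke client actually raises (for boto3, a 429-style throttling exception from the Lambda API); the `invoke` callable would wrap your real invocation:

```python
import random
import time


# Placeholder for the rejection error your client raises under backpressure.
class InvocationRejected(Exception):
    pass


def invoke_with_backoff(invoke, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Call `invoke()`, retrying rejections with jittered exponential backoff.

    Returns the first successful result; re-raises after max_attempts.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except InvocationRejected:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

In practice `invoke` would be a lambda wrapping `boto3.client("lambda").invoke(...)`, with the rejection exception mapped to the real error type. The jitter matters: without it, a fleet of clients that were all rejected at once will retry in lockstep and hit the same busy workers again.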

The Cost of Unhealthy Environments

We often use Lambda Extensions for observability. However, in LMI, the stability of these extensions is paramount. Research indicates that if an extension crashes, AWS doesn’t just restart the process; it marks the whole environment as unhealthy and replaces the instance.

This triggers a specialized “replacement” lifecycle event. This is heavier than a standard cold start because it involves spinning up a new EC2-backed environment. You need to set up CloudWatch alerts specifically tracking “unhealthy” counts to catch buggy extensions early.
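A hedged sketch of such an alarm, expressed as parameters for boto3’s `put_metric_alarm`. Note that the namespace and metric name below are assumptions for illustration; check the actual metric your LMI environments emit for unhealthy-environment replacements before deploying this:

```python
def build_unhealthy_env_alarm(function_name, sns_topic_arn, threshold=1):
    """Build keyword arguments for cloudwatch.put_metric_alarm(**params).

    NOTE: "AWS/Lambda" / "UnhealthyEnvironmentCount" are placeholders;
    substitute the real namespace and metric name for LMI health events.
    """
    return {
        "AlarmName": f"{function_name}-lmi-unhealthy-environments",
        "Namespace": "AWS/Lambda",                  # assumed namespace
        "MetricName": "UnhealthyEnvironmentCount",  # placeholder metric name
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # A quiet period means no replacements, not missing data.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

You would apply it with `boto3.client("cloudwatch").put_metric_alarm(**build_unhealthy_env_alarm("my-fn", topic_arn))`. A threshold of 1 over a single 60-second period is deliberately aggressive: even one environment replacement is worth investigating, because a buggy extension will keep triggering them.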

Network Reachability

Since these instances live in your VPC subnets behind your Security Groups, telemetry is not automatic. If your Security Group rules allow ingress but deny egress, or if your route table lacks a NAT Gateway or VPC Endpoints for CloudWatch, your logs will simply vanish. When debugging network connectivity, treat these functions like standard EC2 servers.
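A quick way to validate that path from inside the environment is a plain TCP reachability check against the telemetry endpoint, exactly as you would on an EC2 host. The endpoint name below follows the standard `logs.<region>.amazonaws.com` pattern and is shown as an example:

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result from a private subnet usually means missing egress:
    no NAT Gateway route, no VPC Endpoint, or a Security Group/NACL
    blocking outbound 443.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: check the CloudWatch Logs endpoint for your region, e.g.
# can_reach("logs.us-east-1.amazonaws.com")
```

This only proves layer-4 reachability (DNS plus TCP handshake), but that is precisely the layer where NAT, route-table, and Security Group mistakes surface, so it separates “VPC plumbing is broken” from “the agent or extension is misconfigured.”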

Conclusion

Moving to Managed Instances gives us hardware control, but it hands back some operational responsibility. We have to monitor worker saturation, validate extension stability, and ensure our VPC plumbing allows telemetry to escape the subnet.