Category: S3

  • Amazon AWS S3 Lifecycle Policies

    S3 lifecycle policies automatically transition or delete objects based on rules you define, helping you reduce storage costs by moving infrequently accessed data to cheaper storage classes or removing it entirely when no longer needed.

    Key Takeaways

    S3 lifecycle policies can cut the per-GB cost of cold data by 70-95% by automatically moving objects to cheaper storage tiers. You can transition objects from Standard to Infrequent Access after 30 days, then to Glacier after 90 days, and eventually delete them after a year. Policies work on prefixes, tags, or entire buckets, and rules take effect within 24-48 hours of objects becoming eligible. The key is understanding your data access patterns before setting rules.

    Why Lifecycle Policies Matter

    I’ve seen AWS bills drop from $3,000 to $400 monthly just by implementing lifecycle policies correctly. Most companies store data in S3 Standard by default and forget about it. That’s expensive when you’re paying $0.023 per GB-month for files nobody has accessed in months.

    Here’s the reality (us-east-1 pricing): S3 Standard costs $0.023/GB-month, Standard-IA costs $0.0125/GB, Glacier Flexible Retrieval costs $0.0036/GB, and Glacier Deep Archive costs $0.00099/GB. The math is simple. If you have 10TB of logs from six months ago that you rarely access, you’re paying $230/month in Standard versus $10/month in Deep Archive.

    Understanding Storage Classes

    Before creating policies, you need to know where your data should live. S3 Standard is for frequently accessed data. Standard-IA (Infrequent Access) works for data accessed less than once a month. Glacier Flexible Retrieval suits archival data you might need within hours. Glacier Deep Archive is for compliance data you’ll rarely touch, with 12-hour retrieval times.

    Gotcha: Transitioning tiny objects can cost more than it saves. Standard-IA, One Zone-IA, and Glacier Instant Retrieval bill every object as if it were at least 128KB, and the Glacier archive tiers add roughly 40KB of per-object metadata overhead, so moving thousands of small files can actually increase costs. (S3 Lifecycle now skips objects under 128KB by default for exactly this reason.) I learned this the hard way when my bill went up after transitioning a bucket full of tiny log files.

    Creating Your First Lifecycle Policy

    Go to your S3 bucket, click the Management tab, and select “Create lifecycle rule.” You’ll name the rule and choose a scope—either the entire bucket or specific prefixes/tags.

    For a typical policy, I recommend this progression: keep objects in Standard for the first 30 days, transition to Standard-IA at day 30, move to Glacier Flexible Retrieval at day 90, then to Glacier Deep Archive at day 180. Add an expiration rule if you know when data becomes useless.

    Here’s a real example. If you’re storing application logs, they’re hot for the first week, warm for a month, then cold forever. Your policy might look like this: Standard for the first 30 days (the earliest S3 will transition objects to Standard-IA), Standard-IA from day 30, Glacier at 90 days, delete after 365 days for compliance.
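
    If you prefer code to the console, here’s a minimal boto3 sketch of that log policy (the bucket name, prefix, and rule ID are placeholders; adjust the day counts to your own retention needs):

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and rule names; day counts match the log policy above.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-app-logs",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-then-delete-logs",
                    "Filter": {"Prefix": "logs/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )
    ```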

    Using Prefixes and Tags Effectively

    Don’t apply blanket policies to entire buckets. Use prefixes to organize data by access patterns. Store frequently accessed data in /active/, monthly reports in /reports/2024/, and archives in /archive/. Then create separate lifecycle rules for each prefix.

    Tags give you even more control. Tag objects with “retention=30days” or “archive=true” and create policies based on those tags. This works great when different teams share a bucket but have different retention needs.
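
    As a sketch of both approaches, here are two tag-scoped rules in boto3 (the bucket name, tag keys, and values are the hypothetical ones above):

    ```python
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="shared-team-bucket",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    # Applies only to objects tagged archive=true
                    "ID": "deep-archive-tagged-objects",
                    "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
                },
                {
                    # An And block combines a prefix with one or more tags
                    "ID": "expire-short-retention-reports",
                    "Filter": {
                        "And": {
                            "Prefix": "reports/2024/",
                            "Tags": [{"Key": "retention", "Value": "30days"}],
                        }
                    },
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                },
            ]
        },
    )
    ```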

    Intelligent-Tiering: The Automatic Option

    S3 Intelligent-Tiering automatically moves objects between access tiers based on usage patterns. It costs $0.0025 per 1,000 objects monthly for monitoring, but it handles the transitions for you.

    I use Intelligent-Tiering when data access patterns are unpredictable. It’s perfect for user-generated content or datasets where you don’t know what will be popular. For predictable patterns like logs or backups, manual lifecycle policies cost less.

    Warning: Intelligent-Tiering doesn’t make sense for small buckets. If you have under 100,000 objects, the monitoring fees might exceed your savings. Do the math first.
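
    When it does make sense, opting in is just a storage-class choice at upload time. A minimal sketch, assuming a hypothetical uploads bucket:

    ```python
    import boto3

    s3 = boto3.client("s3")

    # New uploads land directly in Intelligent-Tiering; S3 handles tier moves.
    with open("avatar.jpg", "rb") as f:
        s3.put_object(
            Bucket="user-uploads",  # hypothetical bucket
            Key="avatars/user-123.jpg",
            Body=f,
            StorageClass="INTELLIGENT_TIERING",
        )
    ```

    A lifecycle rule transitioning to INTELLIGENT_TIERING at day 0 does the same for objects already sitting in the bucket.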

    Expiration Rules and Versioning

    Expiration rules delete objects automatically. Set them for temporary files, logs past retention periods, or incomplete multipart uploads (these cost money and pile up silently).

    If versioning is enabled, you need separate rules for current and previous versions. I typically keep current versions in Standard, move previous versions to IA after 30 days, then delete them after 90 days. Failed uploads should expire after 7 days—there’s no reason to keep them.
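
    Here’s what that cleanup looks like as a boto3 sketch (the bucket is hypothetical and must already have versioning enabled):

    ```python
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="versioned-bucket",  # hypothetical bucket with versioning on
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "clean-up-old-versions",
                    "Filter": {"Prefix": ""},  # whole bucket
                    "Status": "Enabled",
                    # Previous versions: IA at 30 days, deleted at 90
                    "NoncurrentVersionTransitions": [
                        {"NoncurrentDays": 30, "StorageClass": "STANDARD_IA"}
                    ],
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
                    # Abandoned multipart uploads expire after a week
                    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                }
            ]
        },
    )
    ```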

    Gotcha: Deleting objects from Glacier before 90 days incurs early deletion fees. You pay for the full 90 days regardless. Same with Deep Archive at 180 days. Factor this into your policies.

    Monitoring and Adjusting Policies

    Use S3 Storage Lens to track where your data sits and how much each storage class costs. Check it monthly. You’ll spot patterns you missed—maybe those “archive” files are accessed more than you thought.

    S3 Storage Class Analysis runs for 30 days and recommends transition policies based on actual access patterns. Enable it on buckets where you’re unsure about timing. It isn’t free, but the per-object analysis fee is tiny, and it’s incredibly useful.

    Set up CloudWatch alarms for unexpected storage growth. I once had a misconfigured application creating millions of small files daily. Without monitoring, the lifecycle policy would have transitioned them all to IA, increasing costs instead of reducing them.
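
    A hedged sketch of such an alarm in boto3 (bucket name, threshold, and SNS topic ARN are placeholders; BucketSizeBytes is reported once daily, hence the one-day period):

    ```python
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="s3-unexpected-growth",
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",  # reported once per day
        Dimensions=[
            {"Name": "BucketName", "Value": "my-app-logs"},       # placeholder
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        Statistic="Average",
        Period=86400,           # one day
        EvaluationPeriods=1,
        Threshold=5 * 1024**4,  # alert past 5 TiB; pick your own number
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:storage-alerts"],  # placeholder ARN
    )
    ```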

    Common Mistakes to Avoid

    The biggest mistake is transitioning everything without understanding access patterns. I’ve seen teams move active databases to Glacier because “archival sounds cheap.” Retrieval fees destroyed their savings.

    Second mistake: ignoring minimum storage durations. IA and Glacier charge for minimum storage periods. If you transition to IA then delete 20 days later, you still pay for 30 days. Same with Glacier’s 90-day minimum.

    Third: not accounting for retrieval costs. Glacier retrieval costs $0.01 per GB for standard retrieval. If you’re pulling 1TB monthly, that’s $10 in retrieval fees on top of storage costs. Sometimes Standard is actually cheaper.
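
    A quick back-of-the-envelope check, using the prices quoted earlier, shows where the break-even sits:

    ```python
    # Monthly cost for 1 TB, using the us-east-1 prices cited in this article.
    size_gb = 1024
    retrieved_gb = 1024  # how much you pull back out per month

    standard = size_gb * 0.023
    glacier = size_gb * 0.0036 + retrieved_gb * 0.01  # storage + standard retrieval

    print(f"Standard: ${standard:.2f}/mo  Glacier: ${glacier:.2f}/mo")
    # Standard: $23.55/mo  Glacier: $13.93/mo
    # Retrieve the full TB twice a month and Glacier loses its advantage.
    ```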

    Real-World Policy Examples

    For application logs: Standard 0-30 days (S3 won’t transition objects to IA any earlier), IA 30-90 days, Glacier 90-365 days, delete after 365 days. This balances recent log access with compliance retention.

    For backups: Standard 0-30 days, Glacier Flexible Retrieval 30-90 days, Deep Archive after 90 days. You rarely need old backups quickly, so Deep Archive makes sense.

    For user uploads: Intelligent-Tiering from day one. You can’t predict what users will access, so let AWS handle it automatically.

    For compliance data: Upload directly to Glacier Deep Archive with object lock enabled. If you know you won’t touch it for years but must retain it, skip the expensive tiers entirely.
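
    A minimal sketch of that direct-to-archive upload (bucket and key are hypothetical, and the bucket must have been created with object lock enabled):

    ```python
    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")

    # Hypothetical compliance bucket; object lock must be enabled on it.
    with open("filing-2024.pdf", "rb") as f:
        s3.put_object(
            Bucket="compliance-archive",
            Key="records/2024/filing-2024.pdf",
            Body=f,
            StorageClass="DEEP_ARCHIVE",  # skip the expensive tiers entirely
            ObjectLockMode="COMPLIANCE",  # nobody can delete it, even root
            ObjectLockRetainUntilDate=datetime(2032, 1, 1, tzinfo=timezone.utc),
        )
    ```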

    Conclusion

    S3 lifecycle policies are the easiest way to cut storage costs without changing your applications. Start by analyzing your data access patterns using Storage Class Analysis, then create policies that match how you actually use your data. Transition frequently accessed data to IA after 30 days, move cold data to Glacier after 90 days, and delete what you don’t need. Watch for gotchas like minimum object sizes, early deletion fees, and retrieval costs. Check your policies quarterly and adjust based on actual usage. Most teams can reduce storage costs by 50-70% with just a few well-designed lifecycle rules.

  • Amazon AWS EBS vs EFS vs S3

    EBS (Elastic Block Store) is block storage that attaches to a single EC2 instance like a hard drive, EFS (Elastic File System) is managed NFS storage that multiple instances can access simultaneously, and S3 (Simple Storage Service) is object storage for files accessed via API rather than a file system—choose based on whether you need single-instance performance (EBS), shared file access (EFS), or scalable object storage (S3).

    Key Takeaways

    EBS provides high-performance block storage for single EC2 instances, perfect for databases and boot volumes. EFS offers shared file storage that scales automatically and works across multiple instances and availability zones, ideal for content management and shared application data. S3 delivers unlimited object storage accessed through APIs, best for backups, static assets, and data lakes. Performance, access patterns, and cost structures differ dramatically—EBS charges for provisioned capacity, EFS for actual usage, and S3 for storage plus requests.

    EBS: Your Instance’s Hard Drive

    EBS volumes behave like physical hard drives attached to your EC2 instance. You format them with a file system (ext4, xfs, NTFS), mount them, and access them through standard file operations. The key limitation: one EBS volume attaches to one instance at a time (except for io2 Multi-Attach in specific scenarios).

    You choose from several volume types. gp3 (General Purpose SSD) handles most workloads and lets you configure IOPS and throughput independently—I use this for 90% of my deployments. io2 (Provisioned IOPS SSD) delivers consistent low-latency performance for demanding databases. st1 (Throughput Optimized HDD) works for big data and log processing where sequential reads matter more than random access. sc1 (Cold HDD) provides the cheapest option for infrequently accessed data.
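
    Because gp3 decouples performance from size, you specify IOPS and throughput explicitly at creation. A boto3 sketch with placeholder values:

    ```python
    import boto3

    ec2 = boto3.client("ec2")

    volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",  # must match the instance's AZ
        Size=100,         # GiB
        VolumeType="gp3",
        Iops=6000,        # above the 3,000 IOPS baseline
        Throughput=250,   # MB/s, above the 125 MB/s baseline
    )
    print(volume["VolumeId"])
    ```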

    Gotcha: EBS volumes exist in a single availability zone. If that AZ goes down, you can’t access your volume from an instance in another AZ. You need to snapshot and restore to move data between zones. I’ve seen production outages because teams didn’t realize their instance and volume had to be in the same AZ.

    Performance scales with volume size on some types. A 100 GB gp3 volume gives you the same 3,000 IOPS and 125 MB/s baseline as a 1 TB gp3, but on the older gp2 type, IOPS scaled with size (3 IOPS per GB), so larger volumes performed better. Always check the current specs because AWS changes these details.

    Snapshots get you out of that single-AZ box. You can snapshot EBS volumes to S3 for backups. Snapshots are incremental—only changed blocks get stored after the first one. You can restore snapshots to new volumes, copy them across regions, or share them between accounts.
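
    For example, a nightly snapshot with a cross-region copy for disaster recovery might look like this sketch (region names and the volume ID are placeholders):

    ```python
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Snapshot the volume; only changed blocks are stored after the first one.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
        Description="nightly backup",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Copy it to another region for disaster recovery.
    ec2_west = boto3.client("ec2", region_name="us-west-2")
    ec2_west.copy_snapshot(
        SourceRegion="us-east-1",
        SourceSnapshotId=snap["SnapshotId"],
        Description="DR copy of nightly backup",
    )
    ```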

    Use EBS for database storage (MySQL, PostgreSQL, MongoDB), boot volumes for EC2 instances, application servers that need low-latency storage, and any workload where one instance needs dedicated, high-performance block storage. It’s also perfect when you need to control IOPS precisely.

    EFS: Shared Network File System

    EFS provides NFS v4 storage that multiple EC2 instances can mount simultaneously. It’s fully managed, scales automatically from gigabytes to petabytes, and works across multiple availability zones in a region. You don’t provision capacity—it grows and shrinks as you add or remove files.

    You access EFS by mounting it on Linux instances using standard NFS mount commands. Multiple instances across different AZs can read and write to the same file system concurrently. This makes it perfect for shared application data, content management systems, and development environments where teams need access to the same files.

    EFS offers two performance modes: General Purpose (lower latency, most use cases) and Max I/O (higher aggregate throughput but slightly higher latency per operation). You can’t change performance mode after creation, so choose carefully. I’ve never needed Max I/O except for one massive parallel processing workload with hundreds of instances.

    Storage classes reduce costs. Standard stores files you access frequently. Infrequent Access (IA) costs much less for files you don’t touch often. Lifecycle management automatically moves files to IA based on access patterns. I’ve seen storage costs drop 85% just by enabling lifecycle policies on log archives.
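
    Both the performance-mode choice and the lifecycle policy go through the EFS API. A boto3 sketch (the creation token is an arbitrary idempotency string; in real code, wait for the file system to become available before configuring it):

    ```python
    import boto3

    efs = boto3.client("efs")

    fs = efs.create_file_system(
        CreationToken="shared-app-data",   # arbitrary idempotency token
        PerformanceMode="generalPurpose",  # cannot be changed after creation
        ThroughputMode="elastic",
        Encrypted=True,
    )

    # Move files untouched for 30 days to IA; pull them back on first access.
    efs.put_lifecycle_configuration(
        FileSystemId=fs["FileSystemId"],
        LifecyclePolicies=[
            {"TransitionToIA": "AFTER_30_DAYS"},
            {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},
        ],
    )
    ```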

    Warning: EFS costs significantly more than EBS per GB. Standard EFS runs about $0.30/GB/month versus $0.08/GB/month for gp3 EBS. You pay for convenience and shared access. Don’t use EFS when EBS works—I’ve audited environments wasting thousands monthly on EFS for single-instance workloads.

    Throughput modes matter too. Bursting mode gives you throughput that scales with file system size. Provisioned mode lets you specify throughput independently of size, useful when you have small files but need high throughput. Elastic mode (newest) automatically scales throughput up and down—it’s more expensive but handles unpredictable workloads better.

    Use EFS for content management systems, web serving environments that need shared storage, containerized applications requiring persistent shared storage, development and test environments, and big data analytics that need shared access to datasets. WordPress on multiple instances? EFS for the wp-content directory.

    S3: Object Storage at Scale

    S3 isn’t a file system. You can’t mount it and navigate directories. It stores objects (files) in buckets, and you access them through API calls or URLs. Each object has a key (like a file path) and metadata. This fundamental difference trips up newcomers who expect it to work like traditional storage.
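
    The access model in a nutshell: you talk to buckets and keys through API calls, not file handles. A minimal sketch with a hypothetical bucket:

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Write an object by key; "images/" is part of the key, not a directory.
    s3.put_object(Bucket="my-app-assets", Key="images/logo.png", Body=b"<png bytes>")

    # Read it back through the API; there is nothing to mount or navigate.
    response = s3.get_object(Bucket="my-app-assets", Key="images/logo.png")
    data = response["Body"].read()
    ```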

    S3 scales infinitely. You don’t provision capacity—just upload objects. It’s distributed across multiple facilities automatically, giving you 99.999999999% (11 nines) durability. That means if you store 10 million objects, you might lose one every 10,000 years statistically.

    Storage classes optimize costs based on access patterns. S3 Standard for frequently accessed data. S3 Standard-IA (Infrequent Access) costs less for monthly-access patterns. S3 One Zone-IA sacrifices multi-AZ redundancy for lower cost. S3 Glacier Instant Retrieval for archive data you need immediately when accessed. S3 Glacier Flexible Retrieval for archives you can wait minutes to hours to access. S3 Glacier Deep Archive for long-term archives with 12-hour retrieval, the cheapest at about $1/TB/month.

    Intelligent-Tiering moves objects between access tiers automatically based on usage patterns. It costs a small monitoring fee per object but can save significant money if you’re not sure about access patterns. I enable it by default for new buckets unless I know exactly how the data will be accessed.

    Gotcha: S3 charges for requests, not just storage. PUT, GET, LIST operations all cost money. A misconfigured application making millions of unnecessary requests can rack up surprising bills. I’ve debugged applications with infinite retry loops hitting S3 that generated $10k+ monthly bills.

    S3 integrates with everything in AWS. CloudFront for content delivery, Lambda for event-driven processing, Athena for querying data, EMR for analytics. You can host static websites directly from S3, serve as a data lake for analytics platforms, or store application backups and logs.

    Versioning protects against accidental deletions and overwrites. When enabled, S3 keeps all versions of objects. Delete a file? The delete marker becomes the current version, but previous versions remain. This saves you during ransomware attacks or accidental bulk deletions, but watch costs—you pay for all versions stored.
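
    Enabling versioning is a single call; here’s a sketch (placeholder bucket) that also lists versions so you can audit what you’re paying for:

    ```python
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_versioning(
        Bucket="my-app-assets",  # placeholder bucket
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Every stored version is billed; list them to see what's accumulating.
    versions = s3.list_object_versions(Bucket="my-app-assets", Prefix="images/")
    for v in versions.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])
    ```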

    Use S3 for static website assets, application backups, log aggregation, data lakes and analytics, media storage and distribution, disaster recovery, and archival storage. Anything you access via API rather than traditional file operations fits S3’s model perfectly.

    How to Choose

    Ask yourself these questions: Does a single EC2 instance need this storage? Use EBS. Do multiple instances need simultaneous file system access? Use EFS. Are you storing files accessed via application code rather than mounted file systems? Use S3.

    Performance requirements matter. Need sub-millisecond latency and thousands of IOPS? EBS, specifically io2. Need shared access with decent performance? EFS. Latency-tolerant bulk storage? S3.

    Consider cost structure. EBS charges for provisioned size regardless of usage—you pay for a 1 TB volume even if you use 100 GB. EFS and S3 charge for actual usage. But EFS costs more per GB than EBS, so don’t use it for single-instance workloads just because it auto-scales.

    Access patterns reveal the right choice. Database files with random access? EBS. Shared configuration files? EFS. Millions of images served to users? S3 with CloudFront. Log files for long-term retention? S3 with lifecycle policies moving to Glacier.

    You’ll often use combinations. I commonly see EC2 instances with EBS for the OS and application, EFS for shared uploads or session data, and S3 for backups, static assets, and logs. Each storage type solves specific problems—use the right tool for each job.

    Real-world example: A WordPress deployment might use EBS for the database and OS, EFS for wp-content (shared across web servers), and S3 with CloudFront for serving uploaded images. This combination optimizes performance, enables scaling, and minimizes costs.

    Conclusion

    EBS delivers high-performance block storage for single EC2 instances, ideal for databases and applications needing low-latency dedicated storage. EFS provides managed NFS storage for scenarios requiring shared file system access across multiple instances and availability zones. S3 offers infinitely scalable object storage accessed through APIs, perfect for backups, static content, and data lakes. Choose based on your access pattern—single instance versus shared versus API access—and balance performance requirements against cost. Most production architectures use all three, leveraging each service’s strengths for different parts of the application stack.