In the Loupe
The AWS Outage and Why It Matters
We break down how a small DNS error inside AWS rippled into a global outage, and why it affected everything from uploads to streaming to... beds? Andy Szoke, Developer at Punchmark, joins us to explain cloud basics, Lambda bottlenecks, and redundancy.
Send feedback or learn more about the podcast: punchmark.com/loupe
Learn about Punchmark's website platform: punchmark.com
Inquire about sponsoring In the Loupe and showcase your business on our next episode: podcast@punchmark.com
Welcome to In the Loupe. What is up everybody? My name is Michael Burpo. Thanks again for listening to In the Loupe. This week I'm joined by backend developer at Punchmark and my friend Andy Szoke. And we're talking about, in case you didn't hear about it, the AWS outage that took place about two weeks ago. This was extremely disruptive to the entire internet. And you're probably thinking, oh, how can one company's outage affect the entire internet? Well, when you realize how much of the internet is connected and relies on tools provided by a very small number of companies, it makes sense. It took down everything from Netflix to Airbnb to BMW, plus lots of healthcare services. Even Punchmark had some of our internal tools taken out, because we rely on Atlassian, which I did an entire episode about, and they rely on AWS. And it's like, oh my gosh, this is way wider than you'd expect. And it took them a pretty substantial amount of time to correct it. So I wanted to talk about it so that maybe we can all understand it more. And I think it's interesting to start to understand how large the internet is, but also how small. So please enjoy.
SPEAKER_00:This episode is brought to you by Punchmark, the jewelry industry's favorite website platform and digital growth agency. Our mission reaches way beyond technology. With decades of experience and long-lasting industry relationships, Punchmark enables jewelry businesses to flourish in any marketplace. We consider our clients our friends, as many of them have been friends way before becoming clients. Punchmark's own success comes from the fact that we have a much deeper need and obligation to help our friends succeed. Whether you're looking for better e-commerce performance, business growth, or campaigns that drive traffic and sales, Punchmark's website and marketing services were made just for you. It's never too late to transform your business and stitch together your digital and physical worlds in a way that achieves tremendous growth and results. Schedule a guided demo today at punchmark.com slash go. And now back to the show.
SPEAKER_02:What is up everybody? I'm joined by my good buddy Andy Szoke, backend developer at Punchmark. How are you doing today, Ando?
SPEAKER_01:Uh not too bad. There's uh no servers crashing anywhere, so it's uh a good day to be a developer.
SPEAKER_02:You ever see that joke? It's like, every uh back-end developer waking up on the day of the AWS outage, and they're just like, oh God, and they pour like a double shot of espresso, and they're like, ah.
SPEAKER_01:Yeah, there's uh some people whose day it definitely ruined, for sure. Um, I know that we got kind of lucky, I guess we can get into it, but uh yeah, some people had outages stretching into the afternoon, and the internet was chaos for that day.
SPEAKER_02:Yeah. So I set it up in the intro, but for all of you listening, the AWS outage took place on October 20th. I mean, stuff like this happens very sporadically, but uh to kind of set up um what this was: AWS is Amazon Web Services, and this is like a cloud computing uh service that Amazon offers. But you're probably like, oh, you know, Amazon, like the company that does the shopping stuff? It's like, yeah, the same one that does Whole Foods. But AWS is such a massive part of their company. I think at one point it was like, if Amazon spun AWS off into another company, it would be a top 10 most valuable company. It's massive. But then when you start digging into it a little bit more, you realize that cloud service is pretty much split across just three companies. There's AWS; Microsoft Azure, which is Microsoft, yes, the company that does, you know, Xbox; and then Google Cloud, like the company that does all the other Google stuff. So it's three companies that pretty much dominate the entire thing. There's a couple of other ones like Oracle and IBM and Alibaba, but um, for all intents and purposes, they refer to them as the big three. So, Andy, for the people at home and me, can you explain what like cloud computing or cloud service is?
SPEAKER_01:Sure, yeah. So cloud computing, it sounds really fancy. Uh it's basically just literally someone else's computers, right? So if we want to uh say store images to AWS, then we literally just store them on one of their servers somewhere, and it's presented to us in a way that we can interact with easily from wherever we are. And uh the benefit to that is AWS has servers all across the world. They have different regions, uh, they have different, they're called availability zones. And the idea is, as Punchmark, you know, we're one company on the east coast of the United States, which is great, but uh we have clients that are all over the place. So we need a service that can put computers physically close to where people live and then copy all of the assets that we're storing for these clients and push them out to all these different locations, so that no matter where you're coming to Punchmark from, you're using an AWS server that's relatively close to where you are. And then on top of that, everything that we can do on our servers, like uh running code or uh running logic or anything like that, we can run on AWS's servers, and they just have a bunch of different services, called you know Amazon Web Services, uh that basically mesh all of those together and allow for basically seamless operation of a production environment uh distributed nationwide. So when it works, it's awesome.
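To make the "someone else's computers" idea a little more concrete, here's a minimal sketch of what storing an asset in a specific AWS region looks like with the Python SDK, boto3. The bucket name, file, and region here are made up for illustration; this isn't Punchmark's actual pipeline.

```python
# Minimal sketch: store an image in an S3 bucket in a specific AWS region.
# Bucket name, key, and file path are hypothetical; this is not Punchmark's code.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # pick a region close to your users

with open("ring-hero.jpg", "rb") as f:
    s3.put_object(
        Bucket="example-jewelry-assets",   # hypothetical bucket
        Key="products/ring-hero.jpg",      # where the object lives inside the bucket
        Body=f,
        ContentType="image/jpeg",
    )
```

The point is just that "the cloud" bottoms out in an ordinary API call that writes bytes to a machine Amazon runs in whichever region you picked.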
SPEAKER_02:Yeah. But what's really interesting about it is, so I had this conversation with my mom. I was like, yeah, AWS went out, it was a real pain in the neck, because you know we got clients mad because their websites are like kind of broken in certain instances, or uh lots of services around the internet were out, like lots of services around the internet. I was just joking about this, dude, some people's smart beds were broken. Isn't that incredible? Like, what a dumb timeline we live in where one company goes down and suddenly everything from your car to your bed won't even work. And you don't realize that there are these hubs, like you were saying, there's a hub on the east coast, in the central, on the west coast, you know, like I mean Europe has one, and the proximity to those can also reduce your lag. So that's a big part of, for example, gaming, is you want as little lag as possible. But from what I understand, one of the main uh hubs went down, and that's what caused this outage. Is that right?
SPEAKER_01:Yeah, so what basically happened with the outage is kind of a cascade of issues that all were just on AWS's side, on Amazon's side. So they had a service that was trying to write uh something called a DNS record, which is basically how your uh computer, when you type in a website, knows where to send the request. That all goes through something called DNS routing. It tried to write one of those entries for its database service. Uh there's an internal AWS service called DynamoDB that uh some companies use extensively for database operations. We uh have our own systems for the most part, but it wrote an entry uh that was empty, and that got picked up and uh propagated across all the different cache points that I mentioned all across the world. And so AWS found that issue and tried to fix it relatively quickly, but you have this cascade all of a sudden where uh the DNS calls to the database aren't going through, which means that no site that uses AWS for data storage can do anything, uh, including AWS. Obviously, they use their own services for uh all their own internal tools. So basically any tool on their side that tried to access data, which is a lot of them, uh, got affected by this, and it spread out to a service called Lambda, which handles internal functionality. Uh, it spread out uh basically all over the place.
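To picture what a broken DNS record means downstream, here's a tiny sketch of the name lookup every client does before it can talk to a service endpoint like DynamoDB's. The scenario is illustrative, not a reproduction of the outage: if the name doesn't resolve, the request fails before a single byte ever reaches the database.

```python
# Sketch: every AWS API call starts with a DNS lookup of the service endpoint.
# If the record is missing or empty, the call fails before it reaches the service.
import socket

endpoint = "dynamodb.us-east-1.amazonaws.com"  # the regional endpoint clients resolve

try:
    addresses = socket.getaddrinfo(endpoint, 443)
    print(f"{endpoint} resolves to {len(addresses)} address(es); requests can proceed")
except socket.gaierror as err:
    # Roughly what dependent services saw during the outage: name resolution fails,
    # retries pile up, and nothing that needs the database can make progress.
    print(f"DNS resolution failed for {endpoint}: {err}")
```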
SPEAKER_02:Yeah, and the way I'm trying to picture it, and I'm a very visual guy, uh, it's kind of like a game of telephone, but instead of it just being one game of telephone, there's many, many, many billions of games of telephone. And what happens if you take like the first person, or maybe even the second person, and you you know uh remove them from the situation, you knock them out? Suddenly many people down the line are not receiving the game of telephone; only some are. But what if some of the people further down the line are responsible for fact-checking the information from the other people? Well, if they didn't get it, suddenly they aren't able to fact-check with the other people, and they might shut down as well. And it has this crazy cascading um knock-on effect. They use that term a lot, knock-on effect. But it's been really uh illuminating to learn about what ends up happening. Can you explain a little bit about Lambda, especially? Because that's the one I was learning was most important for Punchmark. Can you talk about that?
SPEAKER_01:Sure. Yeah. So to fit in with your telephone analogy, which I like, you have all that as the issue, and then uh you have a separate issue, which is the people that are trying to communicate through the telephone still need to send their message across, right? You can't just have the telephone cut and then fail to relay all this critical information. You had basically a bunch of calls getting stacked up, none of them going out, and so uh this huge backlog was playing into it as well. So even when AWS recovered their own internal tools, they still had to throttle the throughput way down so that their own, you know, internal critical stuff could catch up and not get swamped, and then uh they started uh partitioning out from there. So uh it really was just uh a big mess. And like you said, Lambda was one of the uh services that was affected the longest. I'm not sure in exactly what specific way the DNS entry messed with Lambda so hard. I think it was just because it's so logic-based that uh they had such a hard time untangling that web. But uh the way Punchmark uses Lambda is for uh image ingestion. So if we have a client or a vendor that has some new jewelry items and they want to upload some pictures for them, uh we use Lambdas in our own internal pipeline that basically takes those pictures on our server, creates a copy over on AWS, and uh does some pre-processing on them before we store them on S3. So what was happening for Punchmark is, while Lambda is down, none of those images are making it through the pipeline, so they're just getting held up. So from a client's perspective that might look like, I uploaded these jewelry images several hours ago and they still aren't uploaded. Uh, what's the deal with these new items? But as far as we can tell, looking through what happened, that's more or less the extent of how bad it was for us. Uh obviously, uh, like I mentioned, some sites on the internet were just down for a long time because they depended more heavily on some of those other services. Um, but for us, we got off relatively light.
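For a rough idea of what an image-ingestion Lambda can look like, here's a minimal sketch of a handler that pulls an uploaded original, does some pre-processing, and writes the result to S3. The bucket names, event shape, and the specific resize step are assumptions for illustration, not Punchmark's actual pipeline, and it assumes the Pillow library is packaged with the function.

```python
# Sketch of a Lambda-style image ingestion step (hypothetical buckets and event shape;
# not Punchmark's pipeline). Assumes Pillow is bundled with the function.
import io
import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    # The triggering event tells us which uploaded original to process.
    bucket = event["bucket"]          # e.g. "example-uploads" (hypothetical)
    key = event["key"]                # e.g. "incoming/ring-hero.jpg" (hypothetical)

    original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Pre-process: normalize to a web-friendly size before final storage.
    img = Image.open(io.BytesIO(original)).convert("RGB")
    img.thumbnail((1600, 1600))
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=85)

    # Store the processed copy where the website will serve it from.
    s3.put_object(
        Bucket="example-processed-assets",           # hypothetical destination bucket
        Key=f"products/{key.rsplit('/', 1)[-1]}",
        Body=out.getvalue(),
        ContentType="image/jpeg",
    )
    return {"status": "ok", "key": key}
```

When Lambda itself is unavailable, invocations like this simply queue up or fail and retry, which is why uploads looked stuck until the backlog drained.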
SPEAKER_02:Yeah, it's really interesting, because on a surface level, like I don't even understand how the internet works, but from a very surface level, even more surface level than me, uh, you might interact with your Punchmark website and be like, oh, this image is taking forever to upload. What the heck, Punchmark? And then it's like, well, you see, it's because of, and then you go into your explanation. And then you know, maybe they're trying to upload this image, and they're like, oh, I'm gonna just kill some time and I'm gonna watch Netflix. Uh, Netflix also out. It's like, oh, I'm gonna kill some time and I'm gonna watch Twitch, also out. And I'm gonna go for a drive and use my BMW, and it's like, uh, also out. It's like, you know what? I'm going to bed. Also out. And it's like, oh my gosh, this is the world in 2025. So the one that was really interesting, that I didn't realize how much it, not hamstrung us, but more like um gave us a real pain in the neck, was it affected Atlassian, which uh runs, you know, Jira and Trello and Confluence and a whole bunch of other ones. Oh, and um Bitbucket too. Um, that one was the one where I noticed all the devs being like, Bitbucket's down still. Gosh, this is annoying. So, how did that affect your job on that day?
SPEAKER_01:Yeah, so Atlassian is what uh us devs use to more or less uh organize all of the work that we're trying to do. So uh all the tickets that come to the dev team to work, that's where those live. Uh if we want to push new code, that's where our source code lives. So it wasn't a case of the websites not being able to get to their code, because that would mean that just none of them would load at all. Uh, it just meant that for that day, we couldn't push any new work to our uh develop branch, which we use to test stuff before we push it out to uh the real world. Um so yeah, it was kind of annoying not being able to you know share work with the rest of the dev team. Uh, but uh we just kind of got through it by more or less operating independently for a couple hours, and then it was resolved close to end of day. So yeah, got it back working.
SPEAKER_02:Yeah, it just makes you think sometimes about over-reliance. You know, are we over-relying on one specific company? And I think that the smartest thing you can do is kind of diversify a lot of your tech across a lot of different spaces. Everybody knows that's what you're supposed to be doing, but sometimes your hand really is just forced, you know. Some of these companies are so big, and that's why you almost scratch your head. For example, it kind of gave me big vibes of, do you remember the Ever Given, the ship in the Suez Canal? Yep. And that was one of those news stories that I was following so closely, because I'm like, the internet, or the world, is not this dumb, is it? And it's like, no, it is. It is this dumb. We had a boat turned sideways by mistake, and it blocked one of the major trade routes, and that's why all of your deliveries are not getting to where you want them to be. And it's like, wow, we really do rely on like a couple of things a lot, don't we?
SPEAKER_01:It's crazy how little it takes, too. Like you mentioned, yeah, we have one of the world's busiest shipping routes, and then oopsie, a ship blocked it for about a week. Or, you know, like this example, um, you know, Amazon just had an automated system that was trying to update a DNS record, and oopsie, it accidentally uh wrote it as empty, and then the entire internet goes down.
SPEAKER_02:So, like, another one of those. At this point, the comparisons write themselves. The funniest one is, uh, do you remember when Facebook login was really making a push? They were trying to make it so everyone would log in with their Facebook account. It's kind of like uh Gmail login. And uh Facebook had something go wrong at the root level in their servers, and it had something to do with their uh security layer. So basically, a server malfunctioned or broke or something like that. But what was so funny, not funny, but catastrophic about it, was it impacted Facebook login, and Facebook um was arrogant enough to make it so that everything was contingent on Facebook login. So then they couldn't access the server point, because all of that security was down and it defaults to, you can't access this at all. So I remember they had to like break into their own server system with like a blowtorch essentially, and just cut the doors off the hinges, because one server went down. And there's just something very human about that. There's like a real uh metaphor in there, but I don't know what it is.
SPEAKER_01:There's uh there's a saying, right? An ounce of prevention is worth a pound of cure. That's the kind of circumstance where, if you just look for one more second at your deployment practices and make sure that there's absolutely nothing that could go wrong with this massive, massive international launch you're trying to do for this whole new security system, maybe you don't have to blowtorch one of your servers.
SPEAKER_02:Super interesting. Now, Andy, did you guys do a postmortem on this at all? Did you guys talk about how we could uh you know better handle this? It wasn't on us, that's the thing that I took from it. Uh the way I was handling it is, we had a thread going in our community, and I was doing uh pretty much hourly updates, and you were passing me information and I was posting it in there, and we had a bunch of clients that were following that as their source of information. Did you guys talk about it amongst the devs at all?
SPEAKER_01:Uh a bit, for sure. Uh like I mentioned, we weren't as heavily impacted as some other sites, so there wasn't as much to postmortem. If there would be one takeaway, it's that redundancy is always something that you want to aim for. Uh obviously, like you said, all these systems depend on each other, so if one goes down, the whole thing goes down. The problem you run into with that, though, is we can't necessarily have two different cloud providers that are redundant to each other, right? We can't run half of our sites on Azure and half of them on AWS. I mean, I guess we could technically, but it would be a nightmare to maintain.
SPEAKER_00:Yeah.
SPEAKER_01:So we have internal uh replication as much as we can. Like, for instance, every day we do a backup of our databases. So if those just magically, you know, fall out of the sky and everything's gone, uh, everything's not gone. You know, we can recover. So the most important thing that you can do, and I think we're doing a very good job of it at Punchmark right now, is just having that redundancy, so that when something goes down, you can get back to a working state quickly and uh get on with your day.
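As a rough illustration of that kind of daily database backup, here's a minimal sketch of a nightly job, assuming a MySQL-style database and the mysqldump CLI on the backup host. The hostnames, database name, and paths are made up; this is not Punchmark's actual setup.

```python
# Sketch of a nightly database backup job. Hostname, database name, and paths are
# hypothetical; not Punchmark's setup. Run from cron, e.g. "0 3 * * * python backup.py".
import datetime
import gzip
import subprocess

STAMP = datetime.date.today().isoformat()
OUTFILE = f"/var/backups/db/site-db-{STAMP}.sql.gz"

# Dump the database with a consistent snapshot.
dump = subprocess.run(
    ["mysqldump", "--single-transaction", "--host", "db.internal.example", "site_db"],
    check=True,
    capture_output=True,
)

# Compress and write the dump; keeping dated copies means yesterday's data
# survives even if today's database "falls out of the sky".
with gzip.open(OUTFILE, "wb") as f:
    f.write(dump.stdout)

print(f"wrote {OUTFILE}")
```

Dated copies like this are the redundancy being described: when something goes down, you can restore yesterday's state and get back to working quickly.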
SPEAKER_02:And there you go. I think it's really an interesting uh story. It was something I was thinking about a lot. I almost did an episode as it was happening, because I was like, well, we can't work on a couple of other things right now, so what am I gonna do? Make a podcast episode. But uh, it's something that I think, as the world becomes even more connected, these types of things I believe will happen more, because we are becoming more and more reliant on code. And I'm sure that you can attest, uh, there are some parts of the internet that are ridiculously behind when it comes to, like, you know, not safety, but like uh best practices, because they were built by some overworked guy 20 years ago and nobody's ever updated them. Um, and you never really want to mess around with them, you know. You don't want to remove that tree, because it brings the entire service down. So uh just something I wanted to share with our listeners: this was a really big deal. Like, I don't know how much the news cycles really covered this, but it was an important one for people that were maybe in the know. So uh I can't thank you enough, Ando. I think this was a really cool conversation. I love having you come on. Always a pleasure, Mike. All right, thanks everybody, and we'll be back next week, Tuesday, with another episode. Cheers, bye. Alright, everybody, that's another show. Thanks so much for listening. My guest this week was Andy Szoke, backend developer at Punchmark. He's one of my best friends, so it was really cool chatting with him. This episode is brought to you by Punchmark and produced and hosted by me, Michael Burpo. This episode was edited by Paul Suarez with music by Rod Cochrane. Don't forget to leave us a five-star rating on Spotify and Apple Podcasts, and leave us feedback at punchmark.com slash loupe. And that's L O U P E. Thanks, and we'll be back next week, Tuesday, with another episode. Cheers. Bye.