A DNS manager in a single region of Amazon’s sprawling network touched off a 16-hour debacle.
Credit: Getty Images
The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon’s sprawling network, according to a post-mortem from company engineers.
The series of failures lasted for 15 hours and 32 minutes, Amazon said. Network intelligence company Ookla said its Downdetector service received more than 17 million reports of disrupted services offered by 3,500 organizations. The three biggest countries where reports originated were the US, the UK, and Germany. Snapchat, AWS, and Roblox were the most reported services affected. Ookla said the event was “among the largest internet outages on record for Downdetector.”
Amazon said the root cause of the outage was a race condition in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. A race condition is an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.
In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them.
The timing of these two enactors triggered the race condition, which ended up taking out the DynamoDB service entirely. As Amazon engineers explained:
When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. Therefore, this did not prevent the older plan from overwriting the newer plan. The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors. This situation ultimately required manual operator intervention to correct.
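The sequence Amazon describes is a classic check-then-act race: the freshness check happens at the start of plan application, but the write happens later, by which time the check can be stale. A minimal sketch of that failure mode (the class, field, and plan names here are invented for illustration, not Amazon's code):

```python
# Hypothetical sketch of the stale-check race described above.

class Endpoint:
    def __init__(self):
        self.applied_version = 0   # version of the last plan applied
        self.records = {}          # DNS records currently being served

def check_is_newer(ep, version):
    # Freshness check made at the START of plan application.
    return version > ep.applied_version

def apply_plan(ep, version, records):
    # The write happens later; the earlier check may be stale by now.
    ep.applied_version = version
    ep.records = records

ep = Endpoint()

# A slow enactor checks old plan 3 while it is still the newest...
ok = check_is_newer(ep, 3)          # True: 3 > 0

# ...meanwhile a fast enactor checks and applies newer plan 5.
assert check_is_newer(ep, 5)
apply_plan(ep, 5, {"dynamodb.region.example": "plan-5 records"})

# The slow enactor now finishes, relying on its stale check,
# and the older plan overwrites the newer one:
if ok:
    apply_plan(ep, 3, {"dynamodb.region.example": "plan-3 records"})

# Clean-up then deletes the "many generations older" plan 3, which is
# now the active plan, leaving the endpoint with no records at all.
ep.records = {}
print(ep.applied_version, ep.records)
```

The fix Amazon describes amounts to making the check and the write atomic with respect to each other, so a plan version validated at the start of an update cannot be superseded before the update lands.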
The failure caused systems that relied on DynamoDB’s US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.
The damage resulting from the DynamoDB failure then put a strain on Amazon’s EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a “significant backlog of network state propagations [that] needed to be processed.” The engineers went on to say: “While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation.”
In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS functions affected included the creation and modification of Redshift clusters, Lambda invocations, and Fargate task launches for services such as Managed Workflows for Apache Airflow, as well as Outposts lifecycle operations and the AWS Support Center.
For the time being, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans. Engineers are also making changes to EC2 and its network load balancer.
Ookla outlined a contributing factor not mentioned by Amazon: a concentration of customers who route their connectivity through the US-East-1 endpoint and an inability to route around the region. Ookla explained:
The affected US-EAST-1 is AWS’s oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many “global” stacks route through Virginia at some point.
Modern apps chain together managed services like storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (for example, the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in apps users do not associate with AWS. That is precisely what Downdetector recorded across Snapchat, Roblox, Signal, Ring, HMRC, and others.
The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.
“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”
Dan Goodin
Senior Security Editor
Digital reading devices like the Kindle have existed for almost 20 years, and the standard eReader form factor has hardly changed at all. Amazon, Boox, and a few other companies have offered larger E Ink screens, but how about something smaller? Boox has unveiled its second-generation Palma e-reader, which still fits in your pocket but adds a color screen and mobile data connectivity.
The first-gen Palma launched last year, earning fans who saw it as a way to read and access some apps without the full suite of distracting smartphone experiences. Boox e-readers are essentially Android tablets with E Ink screens and a few software quirks that arise from their unofficial Google Play implementation. The second-gen Palma might offer more opportunities for distraction because it’s almost a smartphone.
The Palma 2 Pro upgrades the 6.1-inch monochrome display from the original to a 6.13-inch color E Ink Kaleido display. That’s the same technology used in Amazon’s Kindle Colorsoft. The Amazon reader is a bit larger with its 7-inch display and chunkier bezels. Of course, the Kindle isn’t trying to fit in your pocket like the Palma 2 Pro, which is roughly the size and shape of a phone.
OpenAI has acquired Software Applications Incorporated (SAI), perhaps best known for the core team that produced what became Shortcuts on Apple platforms. More recently, the team has been working on Sky, a context-aware AI interface layer on top of macOS. The financial terms of the acquisition have not been publicly disclosed.
“AI progress isn’t only about advancing intelligence—it’s about unlocking it through interfaces that understand context, adapt to your intent, and work seamlessly,” an OpenAI rep wrote in the company’s blog post about the acquisition. The post goes on to specify that OpenAI plans to “bring Sky’s deep macOS integration and product craft into ChatGPT, and all members of the team will join OpenAI.”
That includes SAI co-founders Ari Weinstein (CEO), Conrad Kramer (CTO), and Kim Beverett (Product Lead)—all of whom worked together for several years at Apple after Apple acquired Weinstein and Kramer’s previous company, which produced an automation tool called Workflow, to integrate Shortcuts across Apple’s software platforms.
Microsoft said earlier this month that it wanted to add better voice controls to Copilot, Windows 11’s built-in chatbot-slash-virtual assistant. As described, this new version of Copilot sounds an awful lot like another stab at Cortana, the voice assistant that Microsoft tried (and failed) to get people to use in Windows 10 in the mid-to-late 2010s.
Turns out that the company isn’t done trying to reformulate and revive ideas it has already tried before. As part of a push toward what it calls “human-centered AI,” Microsoft is now putting a face on Copilot. Literally, a face: “Mico” is an “expressive, customizable, and warm” blob with a face that dynamically “listens, reacts, and even changes colors to reflect your interactions” as you interact with Copilot. (Another important adjective for Mico: “optional.”)
Mico (rhymes with “pico”) recalls old digital assistants like Clippy, Microsoft Bob, and Rover, ideas that Microsoft tried in the ’90s and early 2000s before mostly abandoning them.
Apple’s iPhone Air was the company’s most interesting new iPhone this year, at least insofar as it was the one most different from previous iPhones. We came away impressed by its size and weight in our review. But early reports suggest that its novelty might not be translating into sales success.
A note from analyst Ming-Chi Kuo, whose supply chain sources are often accurate about Apple’s future plans, said yesterday that demand for the iPhone Air “has fallen short of expectations” and that “both shipments and production capacity” were being scaled back to account for the lower-than-expected demand.
Kuo’s note is backed up by reports from other analysts at Mizuho Securities (via MacRumors) and Nikkei Asia. Both of these reports say that demand for the iPhone 17 and 17 Pro models remains strong, indicating that this is just a problem for the iPhone Air and not a wider slowdown caused by tariffs or other external factors.
At least one CVE could weaken defenses put in place following 2008 disclosure.
The makers of BIND, the Internet’s most widely used software for resolving domain names, are warning of two vulnerabilities that allow attackers to poison entire caches of results and send users to malicious destinations that are indistinguishable from the real ones.
The vulnerabilities, tracked as CVE-2025-40778 and CVE-2025-40780, stem from a logic error and a weakness in generating pseudo-random numbers, respectively. Each carries a severity rating of 8.6. Separately, makers of the Domain Name System resolver software Unbound warned of similar vulnerabilities that were reported by the same researchers. The Unbound vulnerability carries a severity score of 5.6.
The vulnerabilities can be exploited to cause DNS resolvers located inside thousands of organizations to replace valid results for domain lookups with corrupted ones. The corrupted results would replace the IP addresses controlled by the domain name operator (for instance, 3.15.119.63 for arstechnica.com) with malicious ones controlled by the attacker. Patches for all three vulnerabilities became available on Wednesday.
In 2008, researcher Dan Kaminsky revealed one of the more severe Internet-wide security threats ever. Known as DNS cache poisoning, it made it possible for attackers to send users en masse to imposter sites instead of the real ones belonging to Google, Bank of America, or anyone else. With industry-wide coordination, thousands of DNS providers around the world—in coordination with makers of browsers and other client applications—implemented a fix that averted this doomsday scenario.
The vulnerability was the result of DNS’s use of UDP packets. Because they’re sent in only one direction, there was no way for DNS resolvers to use passwords or other forms of credentials when communicating with “authoritative servers,” meaning those that have been officially designated to provide IP lookups for a given top-level domain such as .com. What’s more, UDP traffic is generally trivial to spoof, meaning it’s easy to send UDP packets that appear to come from a source other than their true origin.
To ensure resolvers accepted results only from authoritative servers and to block any poisoned results that might be sent by unauthorized servers, resolvers attached a 16-bit number to each request. Results from the server were rejected unless they included the same ID.
What Kaminsky realized was that there were only 65,536 possible transaction IDs. An attacker could exploit this limitation by flooding a DNS resolver with lookup results for a specific domain. Each result would use a slight variation in the domain name, such as 1.arstechnica.com, 2.arstechnica.com, 3.arstechnica.com, and so on. Each result would also include a different transaction ID. Eventually, an attacker would reproduce the correct ID of an outstanding request, and the malicious IP would get fed to all users who relied on the resolver that made the request. The attack was called DNS cache poisoning because it tainted the resolver’s store of lookups.
The DNS ecosystem ultimately fixed the problem by exponentially increasing the amount of entropy required for a response to be accepted. Whereas before, lookups and responses traveled only over port 53, the new system randomly selected any one of thousands of potential ports. For a DNS resolver to accept a response, it had to travel through that same port number. Combined with a transaction number, the entropy was measured in the billions, making it mathematically infeasible for attackers to land on the correct combination.
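The arithmetic behind the "billions" claim is straightforward. A quick back-of-the-envelope check, assuming a 16-bit transaction ID and roughly 64,000 usable randomized source ports (the exact port range varies by resolver configuration):

```python
# Entropy added by source-port randomization on top of the 16-bit ID.
txn_ids = 2**16        # 65,536 possible 16-bit transaction IDs
ports = 64_000         # approximate count of randomized source ports
combinations = txn_ids * ports
print(f"{combinations:,} combinations")   # roughly 4.2 billion
```

A blind spoofer must now match both values in a single guess, which is why the combined defense pushed successful off-path poisoning from minutes of flooding to mathematical infeasibility.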
At least one of the BIND vulnerabilities, CVE-2025-40780, effectively weakens those defenses.
“In specific circumstances, due to a weakness in the Pseudo Random Number Generator (PRNG) that is used, it is possible for an attacker to predict the source port and query ID that BIND will use,” BIND developers wrote in Wednesday’s disclosure. “BIND can be tricked into caching attacker responses, if the spoofing is successful.”
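The reason a weak PRNG undoes that defense: if an attacker can recover the generator's internal state, every future source port and query ID falls out deterministically. A toy illustration using a linear congruential generator (this is not BIND's actual algorithm, and the port and query-ID derivations are invented for the sketch):

```python
# Classic LCG parameters (Numerical Recipes); weak by modern standards.
M, A, C = 2**32, 1664525, 1013904223

def next_query(state):
    """Advance the PRNG and derive a (port, query ID) pair from it."""
    state = (A * state + C) % M
    port = 1024 + state % 64000        # hypothetical port derivation
    qid = (state >> 16) & 0xFFFF       # hypothetical query-ID derivation
    return state, port, qid

resolver_state = 42                    # secret in practice
resolver_state, port, qid = next_query(resolver_state)

# An attacker who recovers the internal state predicts the resolver's
# next query exactly and can pre-position a spoofed response for it:
attacker_state = resolver_state
_, predicted_port, predicted_qid = next_query(attacker_state)
resolver_state, next_port, next_qid = next_query(resolver_state)
assert (predicted_port, predicted_qid) == (next_port, next_qid)
```

With predictable outputs, the billions of combinations collapse back to a single guess per query, which is the scenario the disclosure warns about.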
CVE-2025-40778 also raises the possibility of reviving cache poisoning attacks.
“Under certain circumstances, BIND is too lenient when accepting records from answers, allowing an attacker to inject forged data into the cache,” the developers explained. “Forged records can be injected into cache during a query, which can potentially affect resolution of future queries.”
Even in such cases, the resulting fallout would be significantly more limited than the scenario envisioned by Kaminsky. One reason for that is that authoritative servers themselves aren’t vulnerable. Further, as noted here and here by Red Hat, various other cache poisoning countermeasures remain intact. They include DNSSEC, a protection that requires DNS records to be digitally signed. Additional measures come in the form of rate limiting and server firewalling, which are considered best practices.
“Because exploitation is non-trivial, requires network-level spoofing and precise timing, and only affects cache integrity without server compromise, the vulnerability is considered Important rather than Critical,” Red Hat wrote in its disclosure of CVE-2025-40780.
The vulnerabilities nonetheless have the potential to cause harm in some organizations. Patches for all three should be installed as soon as practicable.
Dan Goodin
Senior Security Editor
The era of Android virtual reality is here… again. Google’s first two attempts at making Android fit for your face didn’t work out, but the AI era and a partnership with Samsung have enabled a third attempt, and maybe the third time’s the charm. Samsung has unveiled the Galaxy XR headset, the first and currently only device running Google’s new Android XR platform. It’s available for pre-order today, but it will not come cheap. The headset, which doesn’t come with controllers, retails for $1,800.
Galaxy XR is a fully enclosed headset with passthrough video. It looks similar to the Apple Vision Pro, right down to the battery pack at the end of a cable. It packs solid hardware, including 16GB of RAM, 256GB of storage, and a Snapdragon XR2+ Gen 2 processor. That’s a slightly newer version of the chip powering Meta’s Quest 3 headset, featuring six CPU cores and an Adreno GPU that supports up to dual 4.3K displays.
The new headset has a pair of 3,552 x 3,840 Micro-OLED displays with a 109-degree field of view. That’s marginally more pixels than the Vision Pro and almost three times as many as the Quest 3. The displays can refresh at up to 90Hz, but the default is 72Hz to save power.
This week’s Amazon Web Services outage had some people waking up on the wrong side of the bed.
A Domain Name System (DNS) resolution problem affected AWS cloud hosting, resulting in an outage that impacted more than 1,000 web-based products and services and millions of people.
Perhaps one of the most avoidable breakdowns came via people’s beds. The reliance on the Internet for smart bed products from Eight Sleep resulted in people being awoken by beds locked into inclined positions and sweltering temperatures.
Apple’s new Liquid Glass user interface design was one of the most noticeable and divisive features of its major software updates this year. It added additional fluidity and translucency throughout iOS, iPadOS, macOS, and Apple’s other operating systems, and as we noted in our reviews, the default settings weren’t always great for readability.
The upcoming 26.1 update for all of those OSes is taking a step toward addressing some of the complaints, though not by changing things about the default look of Liquid Glass. Rather, the update is adding a new toggle that will let users choose between a Clear and Tinted look for Liquid Glass, with Clear representing the default look and Tinted cranking up the opacity and contrast.
The new toggle adds a half-step in between the default visual settings and the “reduce transparency” setting, which, aside from changing a bunch of other things about the look and feel of the operating system, is buried further down inside the Accessibility options. The Tinted toggle does make colors and vague shapes visible beneath the glass panes, preserving the general look of Liquid Glass while also erring on the side of contrast and visibility, where the “reduce transparency” setting is more of an all-or-nothing blunt instrument.
AI content has proliferated across the Internet over the past few years, but those early confabulations with mutated hands have evolved into synthetic images and videos that can be hard to differentiate from reality. Having helped to create this problem, Google has some responsibility to keep AI video in check on YouTube. To that end, the company has started rolling out its promised likeness detection system for creators.
Google’s powerful and freely available AI models have helped fuel the rise of AI content, some of which is aimed at spreading misinformation and harassing individuals. Creators and influencers fear their brands could be tainted by a flood of AI videos that show them saying and doing things that never happened—even lawmakers are fretting about this. Google has placed a large bet on the value of AI content, so banning AI from YouTube, as many want, simply isn’t happening.
Earlier this year, YouTube promised tools that would flag face-stealing AI content on the platform. The likeness detection tool, which is similar to the site’s copyright detection system, has now expanded beyond the initial small group of testers. YouTube says the first batch of eligible creators have been notified that they can use likeness detection, but interested parties will need to hand Google even more personal information to get protection from AI fakes.
This year’s iPad Pro is what you might call a “chip refresh” or an “internal refresh.” These refreshes are what Apple generally does for its products for one or two or more years after making a larger external design change. Leaving the physical design alone preserves compatibility with the accessory ecosystem.
For the Mac, chip refreshes are still pretty exciting to me, because many people who use a Mac will, very occasionally, assign it some kind of task where they need it to work as hard and fast as it can, for an extended period of time. You could be a developer compiling a large and complex app, or you could be a podcaster or streamer editing or exporting an audio or video file, or maybe you’re just playing a game. The power and flexibility of the operating system, and first- and third-party apps made to take advantage of that power and flexibility, mean that “more speed” is still exciting, even if it takes a few years for that speed to add up to something users will consistently notice and appreciate.
And then there’s the iPad Pro. Especially since Apple shifted to using the same M-series chips that it uses in Macs, most iPad Pro reviews contain some version of “this is great hardware that is much faster than it needs to be for anything the iPad does.” To wit, our review of the M4 iPad Pro from May 2024:
HBO Max subscriptions are getting up to 10 percent more expensive, owner Warner Bros. Discovery (WBD) revealed today.
HBO Max’s ad plan is going from $10 per month to $11/month. The ad-free plan is going from $17/month to $18.49/month. And the premium ad-free plan (which adds 4K support, Dolby Atmos, and the ability to download more content) is increasing from $21 to $23.
Meanwhile, prices for HBO Max’s annual plans are increasing from $100 to $110 with ads, $170 to $185 without ads, and $210 to $230 for the premium tier.
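The "up to 10 percent" framing checks out against the monthly prices. A quick sanity check (the plan labels are shorthand for the tiers above):

```python
# (old price, new price) per month for each HBO Max tier.
monthly = {"with ads": (10.00, 11.00),
           "ad-free": (17.00, 18.49),
           "premium": (21.00, 23.00)}

# Percentage increase for each tier, rounded to one decimal place.
increases = {plan: round((new - old) / old * 100, 1)
             for plan, (old, new) in monthly.items()}
print(increases)
```

Only the ad-supported tier rises the full 10 percent; the ad-free and premium monthly plans go up about 8.8 and 9.5 percent, respectively.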