A Meta-Review of the Summer 2023 Microsoft Exchange Online Intrusion
Yes, THAT Microsoft breach. These are my thoughts on the CSRB report.
My Incident Transparency Soapbox
It has been a while since I’ve written up a post-mortem analysis of a breach. Partially, this is because breach details rarely become available to the public. This is currently a huge issue for our industry, as learning from others’ failures is one of the most effective ways for the industry as a whole to improve. Otherwise, we’re all just guessing at what moves the needle and what doesn’t.
Background
A quick note: many folks make their living off Microsoft. It is one of the largest tech giants, and as the CSRB report points out, it has over one billion customers. It might seem like researchers and analysts are constantly picking on Microsoft and singling it out. They are!
Microsoft, along with tech giants like Apple, Google, and Amazon, is often singled out for a good reason: due to their massive market share, when these companies make a mistake, it has a much larger impact on the general public and global economy than incidents at other businesses and tech companies would. Mistakes like these are easier to forgive when smaller, less experienced, and less impactful companies make them. Companies like Microsoft are held to higher, unique standards because mistakes at Microsoft’s scale can do so much more damage - like when Exchange vulnerabilities destroyed part of Rackspace’s business nearly overnight. Sorry, these are the breaks when your company achieves a trillion-dollar valuation.
What is the CSRB?
In May 2021, Executive Order 14028 tasked the Department of Homeland Security with creating a Cyber Safety Review Board (CSRB from here out). This would be a group of private- and public-sector cybersecurity experts tasked with investigating only the most egregious and impactful incidents. To date, the CSRB has investigated three incidents:
The Log4j Vulnerability
Lapsus$ and related threat groups
The Summer 2023 Microsoft breach - the focus of this analysis.
Why Read This?
I read as many reports like this as possible, trying to discover the deeper root causes behind some of these events. I aim to answer questions like:
why did detections fail?
why did security controls fail?
which controls failed and why?
were failures due to technology, processes, people, or some combination?
While my review of the CSRB report will contain my personal bias and opinions, I should also point out that the CSRB report feels somewhat biased. It remains highly focused on facts, but to my eyes, at least, the tone could be characterized as complimentary toward the government and “sick of Microsoft’s screwups.” I personally think the CSRB’s reviews should be more unbiased - the members of the board have ample opportunity to share their individual personal opinions via their own blog posts or social media. With that said, the board isn’t alone - there has been a lot of sympathy for the report’s tone (myself included).
Why is Everyone Beating Up on MSFT so Much??
Microsoft has indeed had a rough time in the past few years. Companies like Wiz, Orca, and Lacework have discovered and reported on dozens of security issues in the Azure cloud platform. With access to so many of the world’s businesses and hosting a significant chunk of the world’s email, unsurprisingly, Microsoft is a constant target for nearly all types of attackers.
Microsoft isn’t alone in being targeted, but it does seem to get breached more often than any of its peers. In fact, two more significant incidents have occurred following the breach I’ll be discussing here. Adding to Microsoft’s woes, some of the incidents occurred due to its own employees failing to understand how to use Azure securely.
Let’s Dive Into The Analysis
Grab a copy of the report if you want to read it yourself, or check out some of the sections I’ll reference. At 34 pages, it doesn’t seem like a massive report, but don’t let that fool you. The report is quite dense and doesn’t contain a lot of fluff.
First, let’s talk about the overall sentiment of the report. Microsoft failed to detect a major intrusion into one of its most popular products, and this report doesn’t ever let you forget that fact. Sure, this attack was truly deserving of the oft-overused “advanced and sophisticated” description. It was carried out by the same Chinese-affiliated group (Storm-0558) that famously compromised Google (the Aurora attack that led to the creation of the company’s BeyondCorp zero trust principles and products) and RSA (targeting their SecurID MFA product) over a decade ago.
This attack campaign appears to have been successful: Storm-0558 went undetected for an unknown period of time. This threat actor accessed the email accounts of 22 organizations and over 500 individuals across both commercial (M365) and consumer (Hotmail/Outlook.com) products. Tens of thousands of emails were downloaded from just one of these targets (the State Department) before the attack was detected.
Microsoft has yet to conclude this investigation. They continue to explore 46 hypotheses they originally developed nine months ago.
How did the attackers pull it off?
The primary mechanism that allowed this attack to happen was a Microsoft Services Account key (MSA Key) that Storm-0558 used to generate the tokens needed to access all the aforementioned email accounts.
This MSA key is at the heart of the whole incident. It was the equivalent of a skeleton key for Microsoft’s services - a god-mode credential.
The MSA key wasn't supposed to be able to generate tokens for both consumer and enterprise Microsoft services, but thanks to some software bugs, it did.
The MSA key was originally generated in 2016, and this attack happened in 2023. Keys like this shouldn’t have a lifespan this long. They should have been revoked and rotated years prior, but Microsoft stopped rotating keys after an availability incident and never went back to rotating keys after that.
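The rotation failure described above is straightforward to express as policy. Here's a minimal sketch of a key-age check - the key IDs, dates, and the one-year maximum lifetime are all illustrative assumptions, not Microsoft's actual policy:

```python
# Hypothetical sketch: flag signing keys past a maximum lifetime so they get
# rotated instead of lingering for seven years. Policy values are illustrative.
from datetime import date, timedelta

MAX_KEY_AGE = timedelta(days=365)  # e.g., rotate signing keys at least yearly


def keys_due_for_rotation(keys: dict, today: date) -> list:
    """Return IDs of keys created more than MAX_KEY_AGE before `today`."""
    return [kid for kid, created in keys.items() if today - created > MAX_KEY_AGE]


# A 2016-era key is flagged immediately; a recent key is not.
keys = {"msa-2016": date(2016, 4, 1), "msa-2023": date(2023, 1, 15)}
assert keys_due_for_rotation(keys, date(2023, 6, 1)) == ["msa-2016"]
```

The hard part, of course, isn't the check - it's doing the rotation without the availability incidents that scared Microsoft off in the first place.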
The idea of any MSA key having access across all of one of Microsoft's services is already bonkers. The review board talked to all of the other CSPs out there (Google, Amazon, even Oracle), and no one else does this. Everyone else compartmentalizes access controls and token generation. Microsoft didn't.
The CSRB report even cites Oracle Cloud as an example of correctly controlling access. Ouch.
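To make the compartmentalization point concrete, here's a minimal sketch of scope-checked token validation - mapping each signing key to the single service scope it may mint tokens for. All names and structure here are hypothetical illustrations, not Microsoft's actual implementation:

```python
# Hypothetical sketch: compartmentalized token validation, where a signing
# key is only trusted for the service scope it was issued for. Key IDs and
# scope names are illustrative.

# Each key ID maps to the single scope it may sign tokens for.
KEY_SCOPES = {
    "consumer-key-2016": "consumer",     # e.g., an MSA consumer signing key
    "enterprise-key-2023": "enterprise",
}


def validate_token_scope(key_id: str, token_audience: str) -> bool:
    """Reject tokens whose signing key is not scoped to their audience."""
    allowed_scope = KEY_SCOPES.get(key_id)
    if allowed_scope is None:
        return False  # unknown or revoked key
    return allowed_scope == token_audience


# A consumer key must never be able to mint enterprise (M365) tokens:
assert validate_token_scope("consumer-key-2016", "consumer") is True
assert validate_token_scope("consumer-key-2016", "enterprise") is False
```

In the real incident, the validation layer effectively skipped this scope check, which is how a consumer key ended up minting enterprise tokens.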
A total of 525 tokens were forged from this MSA key: 503 were for personal email accounts, and 22 were for enterprise M365 organizations.
The majority were for US-based accounts, but this was an international incident — UK organizations and personal accounts belonging to people worldwide were also affected.
Two Truths and a Lie
You might recall that Microsoft reported last year that Storm-0558 got this key from a crash dump.
That was apparently a lie.
The CSRB report isn't calling it a lie because it's a formal, professional report, but I can't think of any more accurate language to use here. Microsoft stated it as fact. You can still go back and read it.
Our investigation found that a consumer signing system crash in April of 2021 resulted in a snapshot of the crashed process (“crash dump”). The crash dumps, which redact sensitive information, should not include the signing key. In this case, a race condition allowed the key to be present in the crash dump (this issue has been corrected). The key material’s presence in the crash dump was not detected by our systems (this issue has been corrected).
Microsoft later admitted to the review board that there was no evidence to support this theory. Apparently, Microsoft came up with 46 hypotheses when investigating this attack, and this “key recovered from a crash dump” theory seemed plausible to them. Instead of saying, “We don’t know what happened, but here’s a theory,” they pitched their theory as fact. This is problematic not only for reasons of trust but also because it potentially impacts the incident response strategies of the attack’s targets. If Microsoft claims containment has occurred when it hasn’t… well, that’s an issue!
After Microsoft revealed this lack of evidence to the CSRB, they chose not to correct the incorrect information. For over six months, Microsoft allowed the public and their customers to believe they had discovered the incident's root cause.
The truth is that they still have no idea how Storm-0558 got their hands on this MSA key.
Notification Troubles
Another concerning issue was around notification. Microsoft users and customers are so used to seeing phishing scams using Microsoft designs, fonts, and CSS that many victims ignored Microsoft's messages about the compromise and had to be contacted directly by the FBI.
Perhaps security awareness training also has some culpability here? I'd be interested in hearing other folks' thoughts on this.
One of the report's recommendations is an ‘amber alert’ style notification system. Hopefully, it will be more scammer-resistant than email.
How was the attack detected?
Microsoft often reminds us that it is the most prominent security vendor in terms of revenue (>$20 billion today). They have an extensive line of security products designed to detect attacks. It is a bad look that they failed to spot this attack.
Instead, the US State Department detected this attack using a detection mechanism they call “Big Yellow Taxi” (someone’s a Joni Mitchell fan?). Big Yellow Taxi analyzes MailItemsAccessed logs for anything that looks anomalous when compared to a baseline. For any detection engineers out there who just winced: yeah, this doesn’t seem like an easy detection rule to operationalize. It sounds like a ton of work, but it made the State Department the first and only victim - out of 503 individuals, 21 other organizations, and Microsoft itself - to detect this huge attack.
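The State Department's actual detection logic isn't public, but the general shape - baseline each account's MailItemsAccessed volume and flag large deviations - can be sketched in a few lines. Account names, thresholds, and field structure below are all illustrative assumptions:

```python
# Hypothetical sketch of a "Big Yellow Taxi"-style detection: compare each
# account's MailItemsAccessed volume against its historical baseline and
# flag large deviations. Thresholds and names are illustrative only.
from statistics import mean, stdev


def flag_anomalies(baseline_counts: dict, today_counts: dict,
                   z_threshold: float = 3.0) -> list:
    """Return accounts whose access count today exceeds mean + z * stdev."""
    flagged = []
    for account, history in baseline_counts.items():
        if len(history) < 2:
            continue  # not enough history to build a baseline
        mu, sigma = mean(history), stdev(history)
        today = today_counts.get(account, 0)
        # Floor sigma at 1.0 so perfectly flat baselines don't alert on noise.
        if today > mu + z_threshold * max(sigma, 1.0):
            flagged.append(account)
    return flagged


# A mass mail download stands out against a ~50-items/day baseline:
baseline = {"analyst@example.gov": [40, 55, 48, 52, 45]}
assert flag_anomalies(baseline, {"analyst@example.gov": 5000}) == ["analyst@example.gov"]
assert flag_anomalies(baseline, {"analyst@example.gov": 50}) == []
```

The operational pain is everything around this loop: collecting the logs at all (see the licensing discussion below), maintaining per-account baselines, and triaging the inevitable false positives from legitimate bulk access.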
According to the CSRB report, the State Department wouldn’t have been able to create this detection rule if it hadn’t paid extra for Purview Audit Premium. This sets the Security Poverty Line at least another $144 per user per year above the standard M365 license.
This could be the review board’s bias showing, but the inter-agency cooperation seems really impressive here. The State Department, Commerce Department, FBI, and CISA provided a lot of sub-24-hour assistance and feedback. By comparison, Microsoft seems positively sloth-like.
Microsoft’s detection and telemetry
A recurring theme in the report is a focus on log retention. Most folks are aware of the log situation at Microsoft: logs are disabled by default, and when they are enabled, retention is often limited. This forces customers to ship them elsewhere or pay extra to store them for extended periods. Then there are the ‘advanced’ logs that cost extra.
Microsoft apparently only had 30 days of logs to investigate after the State Department notified them.
A 30-day retention period for a free or basic-tier Microsoft 365 customer is unsurprising. However, Microsoft's decision to limit its own log retention so severely borders on bizarre. The irony is that the US State Department couldn’t have detected the attack Microsoft missed if it had not paid a premium for logs and made very effective use of them. Meanwhile, Microsoft’s log retention was so short that they didn’t know when the attack began.
Was Microsoft a victim of its own stinginess here? Does Microsoft seriously not invest in longer retention times for its infrastructure—the infrastructure backing the second-largest cloud service provider in the world?
The implications are odd and don’t entirely add up. On the product side, the shortest amount of time that logs hang around for the most basic tiers of M365 is 30 days by default. Nearly any more premium option—E5, Purview Audit (standard or premium)—increases log retention to at least 180 days. Log retention can be increased to as long as ten years. Why would Microsoft choose only 30 days for their internal logging?
Furthermore, Microsoft believes the attack originated from a laptop belonging to an employee at a company they acquired in 2021. They think the employee's laptop was already compromised before the company was acquired.
That also, unfortunately, means that there could have been more stuff compromised that Microsoft doesn't know about. Storm-0558 could still have access to systems, individual assets/identities, or the ability to generate access keys we don't know about.
The M&A employee could also be a scapegoat—everything Microsoft has published concerning root cause analysis is a theory. It has no evidence linking this compromised employee's laptop to the MSA key theft.
How Far Should Shared Responsibility Extend?
Another point the review board makes in this report, which has been echoed by the White House, is that far too much onus is put on the customer to secure their accounts and data—the providers and CSPs should shoulder more of the work here.
Just look at Microsoft's most recent breach (Jan 2024 by Russia's "Midnight Blizzard"). Microsoft insists there were no vulnerabilities here, just password spraying, but then look at everything they recommend customers do to detect and/or prevent an attack like this. Imagine reading these recommendations as a 50-person organization. Madness.
Is this reasonable to expect of the average M365 customer? What about personal accounts? Is there nothing Microsoft can do to ease this burden?
Not only did the State Department need some serious Detection Engineering skills to discover this attack, but they also needed sharp SOC analysts AND the premium audit package to get access to the necessary logs to begin with!
Figuring out how Microsoft's licensing works, understanding their products, and securing them is a complex maze that makes the defender's job an utter nightmare.
Recommendations
From what I can tell, everything in this report is a recommendation. Microsoft isn't required to do any of it. The recommendations are very bold but mostly seem fair to my eyes. A potential silver lining here is that this report could be used by Microsoft and its employees as a permission slip of sorts—a permission slip to shift priorities back in the direction of security a bit.
Microsoft is incentivized to focus on profits and revenue at all levels, even for the CEO. A report like this can help them give security the priority it needs and push back on expectations for more profit and growth at all costs. I'm sure there are folks at Microsoft who have been advocating for better security at every step. For those still fighting that fight, this could be their most important weapon against years of poor risk prioritization.
I'm not going to get too deep into the details here, so start on page 17 if you want to jump straight to the recommendations [www.cisa.gov].
The List…
This isn't an exhaustive list - just the ones I found most interesting. There are 25 in total. Here are the ones that apply to Microsoft:
Microsoft’s security culture needs to improve.
Microsoft needs to modernize its key management (LOTS of specific recommendations here, including updates to NIST standards, FedRAMP, and other industry best practices.)
Transparency and reporting breach details were lacking and unacceptable, especially for a company as large and essential as Microsoft.
Security must be prioritized, with incentives coming from the CEO and board on down. For a time, this priority needs to exceed that of innovation/new development until the worst of the security issues have been addressed.
Security must be a design requirement (wait, haven't I heard that from Microsoft itself???)
NO MORE CHARGING FOR LOGS, and provide customers a minimum of six months of logs.
More useful logging.
Rework IAM architecture to be more secure and compartmentalized - lots of pointing and hinting at Google's post-Aurora BeyondCorp transformation here.
Better M&A due diligence.
The recommendations continue with more CSP/Industry-specific ones, like:
CISA becoming a CSP watchdog, doing annual reviews of CSPs.
NIST should update 800-53 to better account for cloud-based IAM risks.
Need for a minimum audit logging standard for cloud services.
CSPs should be early adopters of more secure identity standards and IAM/key mgmt processes.
US-based CSPs should report ALL incidents potentially involving a nation-state (and we should consider legally requiring them to do so.)
CSPs should be transparent about both what they DO know and what they DON'T know.
CSP vulnerabilities should go through the CVE process and be handled like vulnerabilities. (This has long not been the case, the argument being that customers don't patch CSP issues - they're patched once by the CSP for all customers. There's still an argument to be had, though, about the other benefits of assigning CVEs to these vulns and giving them a presence in other vulnerability databases.)
CSPs and USGov should create an "amber alert" system for high-impact situations.
CSPs should verify victims received notifications, not just fire them off en masse.
US Gov should incentivize more data sharing between CSPs and those affected by security issues.
How should all this make us feel?
We often hear about how fragile the Internet is and how common poor security practices are, even at security vendors. Every time we think, “This is it - buyers, consumers, and regulators are done with this BS,” nothing happens. Breaches keep happening, and we move on. Is the problem with us, the security professionals? Is security more important and precious to us than it is to the general public? Of course, it is - we’re focused on it and worried about it all day, every day - it’s our job.
Perhaps part of the problem is a perception of industry importance that doesn’t line up with reality.
We work in an industry that’s more than happy to publish myths and lies, like the indefensible fake statistic that cybercrime somehow did $6 trillion in damages in 2021 and will top $10 trillion in 2025. That’s difficult to reconcile, as ransomware payments recently crossed the $1 billion mark, and the FBI reported BEC scam losses in the US were $2.7 billion in 2022. Dozens of vendors happily repeat these fake stats, hoping they’ll help them sell more products.
“The greatest transfer of economic wealth in history,” Steve Morgan says. Really? A recent post by Tom Johansmeyer, who has over 20 years of experience in the insurance industry, disagrees. Here’s his take on NotPetya, which is often pointed to as one of the most costly cyber incidents ever.
NotPetya is often called the most expensive cyber catastrophe in history, having caused as much as $10 billion in economic losses at the time ($11.9 billion in 2024 at an annual inflation rate of 3%). That may seem monumental—and by cyberattack standards it is—but as catastrophes go, that’s a pretty small price tag.
Spoiler - Tom concludes that NotPetya isn’t even the 2nd worst “cyber-catastrophe”. The greatest transfer of economic wealth in history probably has something to do with student loans.
The point here isn’t that cybersecurity isn’t important - it absolutely is! It’s just not the most important thing ever. Some companies and individuals in this industry desperately want it to be, and that’s causing problems. Daniel Miessler recently coined an interesting term that explains why it’s often difficult to move the cybersecurity baseline: the Efficient Security Principle.
The Efficient Security Principle explains why, despite this incident, the US Government still uses Microsoft 365 and other Microsoft products. It explains why, despite dozens of vulnerabilities in Azure and incidents impacting Microsoft 365, it is still one of the most dominant collaboration, communication, and productivity platforms available. Customers' value from these products and services exceeds the actual or perceived risk related to Microsoft’s breaches and mistakes.
Conclusion
Hopefully, some of the points made in this report help remove some tech debt for not just Microsoft but the industry as a whole. It’s frustrating that it often takes incidents like this to justify improving security, but that’s nothing unique to our industry. Efforts to improve safety and security often look unnecessary or even paranoid until there’s an incident demonstrating the need for them.
It's a great reminder that there really is no such thing as "best practices," only "current practices" that we should always strive to improve.