July 19, 2024

expert reaction to mass global IT outage

Scientists comment on the global IT outage.

Prof Omer Rana, Cardiff University Academic Centre of Excellence in Cyber Security Research & Education, said:

“The current outage has clearly indicated that we need to consider the impact of wider “cyber disturbances” – rather than just cyber attacks. It is the impact on systems that is important, not just what has caused it. We don’t know why this happened – given this happened at the end of the week and close to the holiday period, it could also be due to human error; but we don’t know yet.

“If this turns out to be a software update that was badly deployed, this shows how vulnerable we are to cloud-hosted services that we all rely on every day. This reliance has increased even more significantly since the Covid pandemic, when many workers were connected online and cloud-hosted services played a key role.

“This is an aspect we are exploring in a recently funded AI-Hub (https://edgeaihub.co.uk/); we call these cyber-disturbances in the context of edge computing systems. Systems of this kind, such as internet of things devices, rely on software updates of the same kind – and our reliance on these devices continues to increase.”

Patrick Burgess, from the BCS Information Security Specialist Group, said:

“CrowdStrike is a provider of security software to a huge number of companies across the world, and they rolled out an update overnight, which unfortunately caused the classic ‘blue screen of death’, affecting a lot of Windows machines globally that support a lot of infrastructure.”

Adam Leon Smith, a BCS (The Chartered Institute for IT) Fellow and a cyber security expert, said:

“People want to get security updates rolled out as quickly as possible because that helps protect against what we call ‘zero-day’ attacks; that is, new ways that actors find to compromise systems. There’s a trade-off here between the speed of ensuring that systems get protected against new threats and the due diligence done to protect the system’s resilience and stop things like this incident from happening.

“In some cases, the fix may be applied very quickly, but because it has to be applied to so many computers around the world, that may take longer than it sounds. But if computers have reacted in a way that means they’re getting into blue screens and endless loops it may be difficult to restore, and that could take days and weeks.

“We have to realise this could have been a lot worse. Microsoft Windows isn’t the main operating system used for mission-critical systems. It’s Linux.

“We have to look at the complex supply chain infrastructure that’s providing the systems, services and products we rely on every day. Software should be a priority when we are planning from a national resilience point of view. The government needs to start tracking when things like this happen – even lesser incidents. We need to start understanding the nation’s ability to respond to such events.”

Steve Sands, Chair of the BCS Information Security Specialist Group, said:

“Working IT systems are a prerequisite for almost every aspect of modern life and indeed the global economy. BCS, The Chartered Institute for IT has made a number of key recommendations to improve service and software resilience to government in a recent consultation and report. I sincerely hope that today’s CrowdStrike issues raise awareness and create some much-needed urgency to continue this vital conversation.”

Sands said that speculation as to why this incident happened is “not helpful or productive” at this point and he recommended companies concentrate on the task in hand: “My advice would be to focus on restoring your own IT systems (following the advice of the vendors) and leave the providers and the industry to work on understanding how this happened and learning the lessons.”

Dr Erisa Karafili, Associate Professor in Cybersecurity from the University of Southampton’s Cyber Security Research Group, said:

What do we know and what don’t we know about what has happened?

“From the official statement from CrowdStrike, it appears that an update to parts of the provided software did not work.

What will companies and organisations likely be doing at the moment to investigate and fix the situation?

“Having bugs and issues with new software releases is not uncommon (also in the cybersecurity space). Usually, these bugs are discovered and solved in a matter of hours, days, or even months; sometimes they are discovered very late or after an attacker has exploited these software vulnerabilities. Software verification is a normal practice used before new software is released. What I am expecting is for them to check the interdependencies and how their software interacts with the existing platforms and other connected software.

Have there been previous incidents like this – can we glean anything about this new situation from them?

“It is not the first time we have faced this type of problem due to software issues. Think about the Year 2000 problem, which caused a global disruption, but also other small glitches like the Call of Duty: Warzone update, or not-so-small ones like the Log4j software bug. Lately, we have mainly faced disruption due to cyber-attacks, but a good part of these attacks have simply used existing unknown bugs to penetrate or disrupt the systems.

Any other comments about this situation?

“On one hand, the public is reassured that the providers are taking protective measures against cyber-attacks (using CrowdStrike services), but on the other hand, it shows how vulnerable we might be with respect to these issues (due to internal mistakes or external threats). If we think of the glass half full, we were lucky that this bug created a disruption of the service and that CrowdStrike seems to know where it is coming from and is solving it. It would have been catastrophic if the issue had been left there and used by attackers to launch a massive global attack.”

Dr Inah Omoronyia, based in the Bristol Cybersecurity Research Group at University of Bristol’s School of Computer Science, said:

“This outage points to the need to be constantly vigilant of the cloud infrastructures and other critical systems that we now depend on daily. Today’s infrastructures are a lot more complex, with extensive dependencies and risks that are often not obvious to those responsible for building them. The UKRI and EPSRC are aware of this challenge – which remains a specific focus of the EPSRC Digital Security & Resilience Advisory Group.

“Overall, to reduce the occurrences of these sorts of issues and their impact, it is important that we, as a nation, consistently evaluate the resilience of the critical systems that we depend on. Currently, our risk mitigation approaches are too reactive and therefore unsustainable for the current pace of technological innovation. Unless precautions are proactively taken to detect and mitigate risks throughout the whole software and systems supply chain, our best efforts may remain security theatre.”

Dr Junade Ali, Cyber Security Expert and IET (Institution of Engineering and Technology) Fellow, said:

“The recent software update from CrowdStrike has resulted in a significant global outage, affecting computers running the Microsoft Windows operating system that use the CrowdStrike Falcon security product. This issue has led to widespread disruptions, including air travel delays, interruptions in television broadcasting, and halted supermarket transactions. The NHS, which relies heavily on Windows computers, is also experiencing outages in critical systems used by GP practices. The root of the problem seems to be a defective system file included in the update.

“The scale of this outage is unprecedented, and will no doubt go down in history, potentially surpassing the 2017 WannaCry attacks. Unlike some previous outages that targeted internet infrastructure, this situation directly impacts end-user computers and could require manual intervention to resolve, posing a significant challenge for IT teams globally.

“CrowdStrike are investigating the issue as a P0 incident, indicating the highest level of urgency in addressing the problem. The long-term implications of this outage are yet to be fully understood, but they could be substantial, affecting timely uptake of critical security updates in the future.

“This incident will provide key learning for the software engineering profession to consider the safety and security implications of software updates.”

Prof Ian Corden, PhD CEng, Fellow at the Institution of Engineering and Technology, said:

“The major IT outages that are occurring around the world today highlight the ever-increasing dependence of national and regional economies, defence and national security, and private individuals on digital services, and hence their security and resilience. With software-based systems becoming ever-more prevalent in our daily lives, the importance of reliably-engineered software and IT systems is now paramount, especially where critical national infrastructure (CNI) is impacted.

“The cause of the outage appears to be a problematic update to CrowdStrike’s Falcon, an endpoint detection and response platform. This update has affected systems, especially those running Microsoft software, leading to widespread service interruptions.

“CrowdStrike Falcon is an endpoint detection and response (EDR) platform designed to protect computers and other devices from cyber threats. It monitors systems for intrusions and responds by blocking malicious activities. Falcon’s software is highly privileged, allowing it to influence computer behaviour to prevent security breaches. It is widely used across many industries to enhance cybersecurity measures.

“Several large-scale IT outages similar to the recent CrowdStrike Falcon incident have occurred in the past. Notable examples include the British Airways IT failure in 2017, which grounded flights globally due to a power supply shutdown; Amazon Web Services’ S3 outage the same year, impacting numerous major websites; Fastly’s 2021 global outage caused by a software bug; and AT&T’s 2024 nationwide outage resulting from a problematic software update.

“To mitigate IT outages, companies should implement backup systems, deploy redundant infrastructure, conduct regular disaster recovery testing, and develop stringent software update protocols. They should also use advanced monitoring tools, train IT staff in outage response, and work closely with third-party vendors to ensure robust security and mitigation strategies.”

David Smith, Head of Technology Strategy at the Institution of Engineering and Technology, said:

“When cloud services go wrong, a large number of customers are affected by the issue. These types of services are updated constantly – a feature of the modern world and how we use technology at a global scale. The likelihood is that an error has occurred in the process of making changes to these services. We have seen incidents like this in the past – again where a change to a live service went wrong and had to be rolled back in order to fix it. Organisations should learn from every incident like this, no matter the size, in order to become more resilient to events that affect so many customers around the world.

“All organisations should have a business continuity plan when an event like this occurs, so that they can take the steps needed for their organisation – tackling what their workaround and mitigation approaches might need to be.

“A situation like this also illustrates the effects that all organisations face when building and using technology that relies on these cloud technology services that exist at scale. When a key component of an organisation’s supply chain ceases to work, it can stop that organisation from functioning. The trade-off of course is that this technology at scale allows organisations of all sizes access to capabilities that were once only available to the largest multinationals. In commoditising access to high technology, we must understand and be able to live with the trade-offs when those services suffer interruptions – unfortunately this will happen again.”

Dr Andrew Dwyer, from the Department of Information Security at Royal Holloway, University of London, said:

“The worldwide IT outage has occurred due to an error in an ‘endpoint detection’ update provided by CrowdStrike. The detection system is used to look for and stop suspicious activity on computers and is used by a number of customers operating Microsoft Windows through its product Falcon Sensor.

“In recent years, there have been geopolitical concerns over the use of endpoint detection on government computers due to the potential for supply chain disruptions, espionage, and cyber-attacks. For example, the endpoint detection provider, Kaspersky, has been removed from computers in various countries due to the potential that Russia may have used this deep access for espionage purposes.

“CrowdStrike’s endpoint detection products are regularly updated without trouble to provide better cyber security protection, but an issue emerged in a recent upgrade to Falcon Sensor.

“It has been reported that CrowdStrike has found the error and removed the erroneous code from its update. But there is no way to remotely recover corrupted computers – meaning there will be lots of IT teams internationally working across the next few days to restore each computer individually.”

Beth Clarke, digital expert at the Institution of Engineering and Technology, Committee Member for the BCS Special Interest Group in Software Testing, and PhD researcher, said:

“It’s too early to know what factors led to this defect making it into the update, but the cause is probably more complex than just one single point of failure, and the teams at CrowdStrike will likely be investigating this in depth. Incidents like this highlight the importance of thorough software testing, and the critical role that software testers still play in the technology sector. In the current market, many companies are reducing the size of, or altogether removing, their software testing and quality assurance teams to reduce costs. In light of this morning’s outage and the global impact it has had, I would hope companies and organisations reflect on their testing strategies and see the immense value in having dedicated testing teams.”

Prof Oli Buckley, Professor in Cyber Security, Loughborough University, said:

“CrowdStrike’s recent update issues highlight a critical gap: while experienced users can implement the workaround, expecting millions to do so is impractical. The real challenge lies in deploying the workaround across all affected systems—a non-trivial task demanding coordinated efforts, so a proper patch can be put in place.”

Prof Jon Crowcroft FRS FREng, Marconi Professor of Communications Systems, University of Cambridge, said:

“The root cause and a partial solution were reported on The Register, see: https://www.theregister.com/2024/07/19/crowdstrike_falcon_sensor_bsod_incident/

“Technically, it isn’t actually a software problem but a config file error.

“While it is true that we have a lot of dependence on too small a number of software or service components and we need more diversity, it’s worth noting that three sites I use that are Microsoft cloud based are all completely ok, so CrowdStrike isn’t as widely used/pervasive as some of the hyperbole suggests.

“There are other possibly larger cloud/internet cybersecurity defences – e.g. Cloudflare; if this had happened with that it would likely have been a lot more serious.”

Ian Golding, Digital expert at the Institution of Engineering and Technology, said:

“It’s too early to know precisely what has happened although an update to critical cyber security elements in the ecosystem of various providers and systems appears to have malfunctioned, causing mass failure of the computers relied upon for delivering services across these organisations.

“Despite organisations using well known and carefully chosen global IT providers, they all must work seamlessly together. This interoperability is usually extremely well managed and tested with great skill and diligence, but it is complex, and as we see this can fail occasionally – today the failure and impact appear to be widespread and affecting all sectors from transportation to healthcare. Organisations will be looking at their IT architecture, their dependencies and assets and the associated key risks, including the risks that they expect their trusted providers to manage actively on their behalf.

“Whatever the weak links in the chain that are discovered from today’s outage, the organisations affected will become better prepared with their Plan B for a scenario like this in the future – understanding risks and putting in place resilience and recovery plans are key for these operational platforms affecting so many people today.”

Prof Harin Sellahewa, Dean of Faculty of Computing, Law and Psychology, University of Buckingham, said:

“Today’s global outage of IT systems in several sectors highlights the complexity of current IT systems and infrastructure, and the need for increased resilience to minimise risks of failure due to cyberattacks, hardware failures or human error.”

Prof James Davenport, Hebron and Medlock Professor of Information Technology, University of Bath, said:

“According to media reports (including a major tech newsletter https://www.theregister.com/2024/07/19/crowdstrike_falcon_sensor_bsod_incident/), the fault is in a third-party product called Falcon from a well-known information security vendor called CrowdStrike. Someone from CrowdStrike, I believe, has posted a partial work-around at https://x.com/brody_n77/status/1814185935476863321

“It appears that the problem isn’t a software update in the traditional sense, rather it is a new ‘channel file’ (roughly speaking, what your defence software should look for: a modern version of a virus signature) which is apparently corrupt and causing the Falcon crashes. This has been apparently fixed, but of course it is still in the system, and will take time to flush through.

“My advice would be to NOT reboot/restart until the all-clear is given by a reputable source (ideally a joint CrowdStrike/Microsoft statement, but in practice we will probably get one or the other). Do not accept ‘it’s gone away’ statements.”

Dan Card, of BCS, The Chartered Institute for IT and a cyber security expert, said:

“People should remain calm whilst organisations respond to this global issue. It’s affecting a very wide range of services from banks to stores to air travel.

“It looks like a bug in a regular security update, rather than any form of ‘mega cyber attack’, but this is still causing worldwide challenges and is likely to require a large number of people to take manual remedial steps.

“Companies should make sure their IT teams are well supported as it could be a difficult and highly stressful weekend for them as they help customers. People often forget the people that are running around fixing things.”

Prof John McDermid OBE FREng, Director of the Centre for Assuring Autonomy; Lloyd’s Register Foundation Chair of Safety, Institute for Safe Autonomy, University of York, said:

“Security software is intended to protect computers from attack, e.g. by malware, and to provide this protection it has a lot of power to control the host PC. Such software is pervasive – on many if not all machines of a particular type – so a fault in the security software can bring down many computers at once. This appears to be what is behind the widespread outage of Windows 10-based PCs around the world, with knock-on effects on air travel, banking, etc. (Specifically the problem seems to be in software known as the Falcon Sensor produced by CrowdStrike.) We need to be aware that such software can be a common cause of failure for multiple systems at the same time, and we need to design infrastructure to be resilient against such common cause problems, e.g. through use of diversity, that is, not relying on a single make of computer system and/or software.”

Dr Harjinder Lallie, cyber security expert, University of Warwick, said:

“The worldwide IT outage experienced this morning is unprecedented in the range and scale of systems it has impacted. Although we cannot speculate on the cause of this outage just yet, it appears that this might be a server error emanating from one server supplier.

“This IT ‘catastrophe’ highlights the need for greater resilience, a greater focus on backup systems, and possibly even a need to rethink whether we are using the most resilient operating systems for such critical systems.”

Comments from our friends at the Australian SMC:

Dr Sigi Goode is a Professor of Information Systems in the Research School of Management at the Australian National University

“This incident really highlights the privileged role of large technology companies in our national technology posture. What’s most important is that we learn from it. Adversaries of many kinds are watching our reaction, and learning how they can attack more efficiently in future.

Large-scale outages like this are rare, so this really is a great opportunity for adversaries to learn how we respond when things don’t go as planned. Response times, response language, and remediation strategies are all useful pieces of information to an attacker who wants to identify vulnerability and gaps.”

Sigi has declared he has no conflicts of interest. He is available over the weekend between 10am and 4pm

He is contactable on sigi.goode@anu.edu.au

Graeme Hughes is Director – Executive Education at Griffith Advantage, Griffith University

“A widespread IT outage struck Australia on July 19, 2024, impacting numerous sectors like banking, media, telecommunications, supermarkets, and airlines. The culprit appears to be a technical glitch with CrowdStrike’s Falcon sensor, a security software program commonly used on business computers. This malfunction caused crashes that disrupted critical systems.

Consumers faced inconveniences like difficulties with online banking, using EFTPOS at terminals, and accessing online accounts. Communication through customer service lines and business websites was also hampered. Airline check-ins and airport operations may have been slowed down as well.

While the outage is not yet resolved, it highlights our heavy reliance on technology for daily activities. With Australians making over 730 electronic transactions per year on average, our dependence on technology is more critical than ever. Thankfully, there are no reports suggesting this was a cyberattack. Both CrowdStrike and Microsoft are working to address the issue and prevent similar occurrences.”

Graeme has declared he has no conflicts of interest.

He is contactable on g.hughes@griffith.edu.au

Tom Worthington is an Honorary Lecturer in the School of Computing at The Australian National University

“The widespread outages show the risks in relying on a single technology for vital services. There need to be alternate communication links using different software. This does create an added security and maintenance burden, as multiple products need to be looked after and protected. But if you put all your eggs in one basket, you can end up with it on your face.”

Tom has declared he has no conflicts of interest.

He is contactable on +61 419 496 150, tom.worthington@anu.edu.au

Dave Parry is Dean and Professor in the School of IT at Murdoch University

“What’s happened today is that an update to a thing called Falcon Sensor, which comes from a company called CrowdStrike and is a Windows-based tool to detect and respond to cybersecurity threats, seems to have caused a problem with Windows (it looks like Windows 10). That means that the machines that have had this update, effectively are doing a thing called the ‘blue screen of death’. This means their machines want to reboot, but then they can’t be rebooted, and so the machines basically become useless.

This has become a global phenomenon because CrowdStrike is a very large company, and a lot of companies and organisations use them to detect and protect against threats. The issue will affect very, very large numbers of machines around the world. It’s not a cyber attack, but it’s just an interaction of the two pieces of software.”

Dave has not declared any conflicts of interest.

He is contactable on +61 450 711 537, David.Parry@murdoch.edu.au

Dr Shumi Akhtar is an Associate Professor at the University of Sydney

“Today’s technology outage—an unprecedented global crisis—sparked off in the USA, is now ominously rippling across the globe. This sudden, severe disruption halts everyday activities and starkly exposes the fragility of our heavily digitised world. From banking to healthcare, education to government, no sector remains untouched, highlighting an urgent need for a worldwide strategic overhaul of our critical infrastructures. This crisis calls for immediate collaborative action to enhance resilience through robust safeguards and fail-safes, especially in life-critical networks. As we increasingly pivot to a future dominated by digital and AI innovations, this outage is a resounding wake-up call: we must fortify our digital bastions to safeguard against such catastrophic interruptions, ensuring our readiness and security in an interconnected era.

As a result of this outage, at least three critical sectors could be affected significantly.

In the medical industry, a technology outage can result in the loss of access to electronic medical records, critical patient data, and communication systems essential for patient care. This could delay surgeries, medication administration, and emergency responses, potentially endangering lives.

In the banking sector, an outage can cripple financial transactions, including ATM withdrawals, online banking, and payment processing. This disruption can lead to significant financial losses for consumers and institutions, and undermine public trust in the financial system.

For the airline industry, technology outages can ground flights, disrupt ticketing and check-in processes, and affect air traffic control. This can lead to massive delays, financial losses, and compromise passenger safety and security. Each of these scenarios highlights the catastrophic potential of technology failures across critical industries.

Today’s event should serve as a crucial wake-up call.”

Shumi has declared she has no conflicts of interest.

She is contactable on shumi.akhtar@sydney.edu.au.

Shumi has said her best availability over the weekend is Saturday/Sunday (4-5pm)

Professor Jill Slay is SmartSat Chair: Cybersecurity at University of South Australia (UniSA)

“There is currently a major global technical outage affecting multiple companies and services. Some are attributing this to security services offered by CrowdStrike. Others attribute it to Microsoft or Amazon. Authorities and industry will be monitoring, but at this stage it is too early to draw conclusions.

While the outage may easily be a result of misconfiguration by one of these companies, or ‘interference’ between products, the global impact is enormous. It is possible that there is a security breach, but to me, this is instinctively unlikely.”

Jill has declared she has no conflicts of interest.

She is contactable on +61 422 420 954 and jill.slay@unisa.edu.au

Toby Murray is an Associate Professor in the School of Computing and Information Systems at The University of Melbourne

“CrowdStrike Falcon has been linked to this widespread outage. CrowdStrike is a global cyber security and threat intelligence company. Falcon is what is known as an Endpoint Detection and Response (EDR) platform, which monitors the computers that it is installed on to detect intrusions (i.e., hacks) and respond to them. That means that Falcon is a pretty privileged piece of software in that it is able to influence how the computers it is installed on behave.

For example, if it detects that a computer is infected with malware that is causing the computer to communicate with an attacker, then Falcon could conceivably block that communication from occurring. If Falcon is suffering a malfunction then it could be causing a widespread outage for two reasons: 1 – Falcon is widely deployed on many computers, and 2 – Because of Falcon’s privileged nature.

Falcon is a bit like anti-virus software: it is regularly updated with information about the latest online threats (so it can better detect them). We have certainly seen anti-virus updates in the past causing problems e.g. here.

It is *possible* that today’s outage *may* have been caused by a buggy update to Falcon.”

Toby has not declared any conflicts of interest.

He is contactable on +61 425 726 687 and toby.murray@unimelb.edu.au

Dr Mark Gregory is an Associate Professor in the School of Engineering at RMIT University

“The near global outage appears to have been caused by a failure of systems associated with the CrowdStrike Falcon endpoint security monitoring software. CrowdStrike is a multinational software solutions provider.

In Australia, many businesses and organisations have found that their software systems have failed due to the software system outage. The reliance on centrally managed global software solutions can lead to significant security risks.

Australian governments have, for too long, acquiesced to companies that store Australian data overseas and manage critical systems from global headquarters outside Australian jurisdiction.”

Mark has declared he has no conflicts of interest.

He is contactable on +61 418 999 089 and mark.gregory@rmit.edu.au

Declared interests

Prof Omer Rana: “No conflict of interest.”

Dr Erisa Karafili: “I am currently partly funded by EPSRC and MoD (for the CDT in Complex Integrated Systems for Defence and Security) and Dstl under the DASA framework for my project (Cyber-ThreaD).”

Dr Inah Omoronyia: “Dr Omoronyia is a member of the EPSRC Digital Security & Resilience Advisory Group and is funded by the EPSRC.”

Dr Junade Ali: “– Fellow at the Institution of Engineering and Technology
– CEO of Engprax, a software auditing company
– Author of the book ‘How to Protect Yourself from Killer Computers’.”

Prof Ian Corden: “No conflicts of interest.”

David Smith: “No conflicts of interest.”

Beth Clarke: “No conflicts of interest.”

Prof Oli Buckley: “None.”

Ian Golding: “None.”

Prof James Davenport: “No conflicts (other than a member of the British Computer Society’s Software Resilience Group).”

Dan Card: “None.”

Prof John McDermid: “I have no conflicts to declare.”

For all other experts, no reply to our request for DOIs was received.
