Chief Information Security Officers (CISOs) face an ever-evolving landscape of cyber threats. Our mission is to build robust cyber resilience, manage risks, ensure compliance, and foster collaboration, all while dealing with potential crises. How do we ensure our defenses are resilient against unexpected failures? What lessons can we learn from past incidents to bolster our strategies moving forward?
In this blog, we’ll delve into practical strategies to integrate best practices, leverage continuous monitoring, and enhance communication and collaboration with vendors and internal teams. These insights aim to fortify your defenses and guide informed decisions when working with cybersecurity software vendors.
Why Endpoint Security is Needed
Endpoint security is crucial for protecting organizations from a multitude of cyber threats. Its core benefits include:
Enhanced Protection – Defends against various cyber threats, including malware, ransomware, and phishing attacks.
Improved Compliance – Helps meet regulatory requirements and industry standards, reducing the risk of fines and legal issues.
Increased Productivity – Minimizes downtime and disruptions caused by security incidents, ensuring smoother business operations. Regular updates and patches prevent exploitation and maintain operational efficiency.
Better Visibility – Provides comprehensive insights into endpoint activity, allowing for more effective threat detection and response. Detailed logs and monitoring tools help track and analyze user activity and system changes.
Reduced Risk – Lowers the likelihood of successful attacks and data breaches, protecting sensitive information and maintaining customer trust.
Why the Kernel is Needed for Endpoint Security
You may be wondering, why do security solutions leverage kernel drivers? Due to the design of operating systems and the need to combat modern attackers effectively, kernel drivers play a crucial role. Here’s why:
Visibility and Enforcement of Security-Related Events – Kernel drivers provide system-wide visibility and early threat detection. They enable capabilities like system event callbacks and filter drivers to monitor file operations. For example, you can monitor and intercept suspicious file activities in real-time, preventing malware from executing.
Performance – Kernel drivers can improve performance, especially for high-throughput network activity. The benefits can be significant, although security vendors also optimize their user-mode components to approach comparable performance outside of kernel mode. For instance, a network security tool can use kernel drivers to analyze high volumes of traffic efficiently with minimal impact on system performance.
Tamper Resistance – Kernel mode offers tamper resistance, making it much harder for malware or malicious insiders to disable security software. This mode also allows drivers to load early in the boot process, so protection is in place before most other software starts. This is critical for preventing sophisticated attacks that attempt to disable security software before it can protect the system.
Best Practices for Endpoint Security Selection and Management
To mitigate risks associated with Endpoint Detection and Response (EDR) agents, CISOs should consider the following best practices when selecting and managing EDR solutions:
Limit Kernel Mode Operations – Choose endpoint security agents designed to operate primarily in user mode (userland). This approach maintains application isolation and protects the system from crashes and data corruption. Ensure that interactions with kernel mode (kernel space) are minimal and restricted to essential functions like data collection, prevention, and anti-tampering. For Windows, ensure that communication between user and kernel mode components adheres to best practices, minimizing and controlling kernel interactions.
Controlled Update Processes – Select vendors that support your phased rollout approach for updates. This should start with a small subset of systems to ensure stability and performance before wider deployment. Vendors should provide the ability to control update deployment, including enabling or disabling updates at different organizational levels.
Utilize Modern Frameworks – Choose vendors that move away from kernel extensions (kexts) and custom kernel modules and utilize modern frameworks like eBPF (extended Berkeley Packet Filter) for Linux and Apple’s Endpoint Security Framework (ESF) for macOS. These frameworks reduce the attack surface, can improve performance, and align with industry best practices: eBPF programs run in the kernel but are checked by a verifier before they load, and ESF delivers security events to clients running in user space. Encourage vendors to leverage new frameworks that enhance security and performance as technology evolves.
Require Transparency and Trust from Vendors – Request clear communication from your vendors regarding agent behavior, updates, and incident response. This should include detailed release notes, version information, and auditing details for each update. Transparency about changes and the reasons behind them builds trust and fosters better preparation and response to potential issues.
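To make "enabling or disabling updates at different organizational levels" concrete, here is a minimal sketch of how such a setting might be resolved. The hierarchy, the `PolicyNode` class, and the `updates_allowed` function are hypothetical names for illustration and do not describe any particular vendor's console or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyNode:
    """One level of a hypothetical organization hierarchy (org, site, group)."""
    name: str
    updates_enabled: Optional[bool] = None  # None means "inherit from parent"
    parent: Optional["PolicyNode"] = None


def updates_allowed(node: PolicyNode, default: bool = False) -> bool:
    """Walk up the hierarchy until a level sets an explicit value."""
    current: Optional[PolicyNode] = node
    while current is not None:
        if current.updates_enabled is not None:
            return current.updates_enabled
        current = current.parent
    return default  # nothing set anywhere: fall back to the conservative default


# Illustrative hierarchy: updates are on for the org but off for one critical site.
org = PolicyNode("org", updates_enabled=True)
critical_site = PolicyNode("critical-site", updates_enabled=False, parent=org)
pilot_group = PolicyNode("pilot-group", parent=org)
kiosk_group = PolicyNode("kiosks", parent=critical_site)

print(updates_allowed(pilot_group))  # inherits the org-level setting
print(updates_allowed(kiosk_group))  # inherits the site-level override
```

The design choice worth asking vendors about is the default: in this sketch a device with no explicit setting anywhere does not receive automatic updates.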
The CrowdStrike Global Fallout
On July 19, 2024, a Rapid Response Content update for the Falcon sensor caused widespread disruptions for Windows systems running Falcon sensor version 7.11 and above. The update was published at 04:09 UTC and led to kernel instability and Blue Screen of Death (BSOD) loops on systems that were online between 04:09 and 05:27 UTC. Approximately 8.5 million devices were affected globally. Mac and Linux hosts were not impacted, and Windows hosts that were not online or did not connect during this period were also unaffected.
The update was intended to gather telemetry on new threat techniques observed by CrowdStrike, but a defect in the Rapid Response Content caused an out-of-bounds memory read, leading to the crashes. This became one of the largest IT outages in history. As of Aug 14th, 2024, CrowdStrike had not provided any further updates beyond its Aug 6th, 2024, statement, indicating that recovery was still not at 100%.
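An out-of-bounds read happens when code reads past the end of the data it was given. The sketch below illustrates the kind of bounds check that guards against it, using a hypothetical content rule that refers to an input field by index. The function names and sample data are assumptions for illustration, not CrowdStrike's code. Python itself raises an IndexError rather than reading stray memory, so this only demonstrates the check; in kernel-mode C or C++ the same missing check can crash the machine.

```python
from typing import Optional, Sequence


def read_field(inputs: Sequence[str], index: int) -> Optional[str]:
    """Return inputs[index], or None when the index is outside the supplied data."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


def rule_matches(inputs: Sequence[str], field_index: int, expected: str) -> bool:
    """Evaluate a hypothetical content rule; an out-of-range field never matches."""
    value = read_field(inputs, field_index)
    return value is not None and value == expected


# Three fields are supplied, so valid indexes are 0, 1 and 2.
supplied = ["proc.exe", "C:\\temp", "pipe-7"]

print(rule_matches(supplied, 2, "pipe-7"))  # index in range, value compared
print(rule_matches(supplied, 3, "pipe-7"))  # index out of range, rule is skipped safely
```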
The economic impact was severe across multiple sectors. Over 5,000 flights were canceled and 46,000 delayed, with Delta alone canceling 1,250 flights on July 22nd, 2024, bringing total flight cancellations to over 7,000. More than a dozen major U.S. hospitals had to cancel elective procedures, and 911 systems in at least seven states experienced temporary outages. Financial institutions like JPMorgan Chase faced login issues causing trading delays.
Parametrix estimated that 25% of Fortune 500 companies were affected, with financial losses around $5.4 billion and insured losses covering 10-20% of that. The global financial loss could reach $15 billion. Fitch Ratings indicated that insured losses would be manageable, not exceeding $10 billion, but the incident could lead to changes in cyber insurance policies.
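As a quick worked check of the Parametrix figures above, the snippet below converts the 10-20% insured share of the $5.4 billion estimate into dollar ranges. It uses only the numbers quoted in this post; the variable names are illustrative.

```python
estimated_loss_bn = 5.4       # Parametrix estimate quoted above, in USD billions
insured_share = (0.10, 0.20)  # insured portion quoted above (10% to 20%)

insured_low_bn = estimated_loss_bn * insured_share[0]
insured_high_bn = estimated_loss_bn * insured_share[1]

# Whatever is not insured is borne directly by the affected companies.
uninsured_low_bn = estimated_loss_bn - insured_high_bn
uninsured_high_bn = estimated_loss_bn - insured_low_bn

print(f"Insured:   ${insured_low_bn:.2f}B to ${insured_high_bn:.2f}B")
print(f"Uninsured: ${uninsured_low_bn:.2f}B to ${uninsured_high_bn:.2f}B")
```

In other words, roughly $0.54 billion to $1.08 billion of that estimate would be insured, leaving about $4.32 billion to $4.86 billion uninsured.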
Threat actors quickly took advantage of the situation, exploiting the helplessness and vulnerabilities caused by the outage. This incident underscores the urgent need for robust cybersecurity measures and transparent communication both internally within organizations and with their vendors, highlighting the vendors’ scope and responsibility to ensure the security and stability of their products for their customers.
Learning from Historical Incidents
History serves as a powerful tool for learning and preventing the recurrence of costly mistakes. As Michael Crichton aptly put it, “If you don’t know history, then you don’t know anything.” By closely examining past incidents, we gain valuable insights into what went wrong and how to avoid similar pitfalls. Notable examples include:
McAfee Antivirus Update (2010) – A McAfee antivirus update falsely identified a critical Windows XP system file as malware, leading to widespread malfunctions, reboot loops, and loss of network access. This incident underscores the importance of rigorous testing and controlled rollouts to detect false positives and prevent widespread disruption. It highlights the risks associated with operating in kernel mode where errors can have significant impacts on the entire system.
Symantec Endpoint Protection Update (2012) – An update to Symantec Endpoint Protection in 2012 conflicted with third-party software, causing system crashes on Windows XP machines. This highlights the need for comprehensive compatibility testing with all critical third-party applications to prevent system crashes and ensure smooth updates.
Webroot Antivirus Update (2017) – Webroot mistakenly flagged essential Windows system files as malware, leading to significant disruptions as critical files were quarantined. This incident illustrates the importance of a multi-layered approach to endpoint security, including real-time monitoring and anomaly detection systems, to swiftly catch and rectify such errors. It also highlights the importance of ensuring that security rules are meticulously crafted and monitored to avoid false positives.
These incidents emphasize the importance of thorough testing, compatibility checks, and a multi-layered security approach. They also highlight practices to avoid, such as big-bang rollouts and uncontrolled, unregulated, fully automated upgrades. By integrating these lessons, CISOs can enhance an organization’s resilience against similar issues and ensure more reliable endpoint security measures.
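One way to picture the false-positive testing these incidents call for is a pre-release gate that checks candidate detections against a corpus of known-good system files. The sketch below is deliberately simplified: it assumes detections are plain SHA-256 hashes, which real products are not limited to, and the file names and contents are hypothetical stand-ins.

```python
import hashlib
from typing import Dict, Iterable, List


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def false_positives(signature_hashes: Iterable[str], known_good: Dict[str, str]) -> List[str]:
    """Return names of known-good files that the candidate signatures would flag."""
    flagged = set(signature_hashes)
    return sorted(name for name, digest in known_good.items() if digest in flagged)


# Hypothetical corpus: file name -> hash of a trusted system file.
known_good = {
    "svchost.exe": sha256_hex(b"trusted svchost contents"),
    "kernel32.dll": sha256_hex(b"trusted kernel32 contents"),
}

# Candidate update: one real detection plus one entry that collides with a trusted file.
candidate = [sha256_hex(b"malware sample"), sha256_hex(b"trusted svchost contents")]

blocked = false_positives(candidate, known_good)
if blocked:
    print("Do not ship, update would flag:", blocked)
```

A gate like this only catches mistakes that the known-good corpus covers, so it complements phased rollouts rather than replacing them.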
Lessons Learned and Best Practices
In my experience at Microsoft, a ‘dogfooding’ approach (internally testing products to identify and resolve issues early) resulted in higher-quality releases. The recent CrowdStrike incident underscores several critical areas where the industry must maintain vigilance:
Comprehensive Testing Procedures – Conduct thorough testing of updates and new features in real-world environments.
Enhanced Content Validation – Implement additional validation checks to ensure updates are robust and free from defects.
Balanced Agent Architecture – Design security agents to operate primarily in user mode, limiting kernel mode interactions to essential functions.
Strengthened Error Handling – Develop robust error-handling mechanisms to manage and mitigate errors gracefully.
Controlled Rollout Processes – Adopt a phased rollout strategy, starting with a small subset of systems to monitor impact and performance.
Enhanced Monitoring and Feedback – Implement continuous monitoring of agent and system performance during and after deployment.
Customer Control and Transparency – Provide customers with greater control over update delivery and deployment.
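The controlled rollout and monitoring items above can be sketched as a ring-based deployment with a health gate between rings. Everything here is an illustrative assumption: the ring names, the `deploy` and `is_healthy` callables, and the failure threshold stand in for whatever a vendor or deployment tool actually provides.

```python
from typing import Callable, Dict, List


def phased_rollout(
    rings: List[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Deploy ring by ring; stop before the next ring if too many hosts are unhealthy."""
    completed: List[int] = []
    for ring_index, hosts in enumerate(rings):
        for host in hosts:
            deploy(host)
        failures = [h for h in hosts if not is_healthy(h)]
        if hosts and len(failures) / len(hosts) > max_failure_rate:
            return {
                "halted_at_ring": ring_index,
                "failed_hosts": failures,
                "completed_rings": completed,
            }
        completed.append(ring_index)
    return {"halted_at_ring": None, "failed_hosts": [], "completed_rings": completed}


# Simulated fleet: the update breaks one canary host, so later rings are never touched.
deployed: List[str] = []
rings = [["canary-1", "canary-2"], ["pilot-1", "pilot-2", "pilot-3"], ["prod-1", "prod-2"]]
result = phased_rollout(rings, deployed.append, lambda host: host != "canary-2")

print(result["halted_at_ring"], result["failed_hosts"])
print("Hosts that received the update:", deployed)
```

In practice the health check would draw on the continuous monitoring described above, such as crash reports, boot success, and agent heartbeats, and each ring would be given time to soak before the next one starts.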
Mitigating Risks with Endpoint Protection Agents
Endpoint protection agents and sensors are crucial for defending against malware and malicious behaviors. However, integrating these tools with complex operating systems and other security measures can introduce risks. Here are some best practices to mitigate these risks:
Sequential Changes – Implement updates one at a time to easily identify and resolve any issues that arise.
Controlled Testing – Test updates on a select group of devices to evaluate their impact before a full deployment.
Phased Rollouts – Gradually extend updates across different segments after successful testing to ensure stability.
Strategic Updates – Carefully manage and monitor live updates, especially on critical business devices, to balance immediate protection with potential risks.
Configurable Updates – Ensure all updates are configurable, documented, and controlled by the customer.
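The "Sequential Changes" practice above is easiest to appreciate in code: when changes are applied one at a time with a health check after each, the change that caused a problem is immediately identifiable and can be rolled back on its own. This is a minimal sketch; the state dictionary, change names, and health check are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each change is (name, apply function, rollback function).
Change = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def apply_sequentially(
    state: dict,
    changes: List[Change],
    healthy: Callable[[dict], bool],
) -> Optional[str]:
    """Apply changes one at a time; roll back and name the first one that breaks health."""
    for name, apply, rollback in changes:
        apply(state)
        if not healthy(state):
            rollback(state)
            return name
    return None


state = {"sensor": "1.0", "rules": 10}
changes: List[Change] = [
    ("sensor-upgrade", lambda s: s.update(sensor="1.1"), lambda s: s.update(sensor="1.0")),
    ("rules-update", lambda s: s.update(rules=-1), lambda s: s.update(rules=10)),
]

culprit = apply_sequentially(state, changes, healthy=lambda s: s["rules"] >= 0)
print("Offending change:", culprit)
print("State after rollback:", state)
```

Had both changes been bundled into a single update, the health check would only show that something broke, not which change was responsible.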
Endpoint Security Maturity Framework
Assessing endpoint security maturity involves a structured framework that evaluates the implementation of best practices and compares approaches to mature models. This framework includes key indicators that CISOs can use to gauge and enhance their organization’s endpoint security posture:
Granular Control and Flexibility – Ensure the solution offers detailed control over updates and security measures. This allows for adaptation to specific security needs and minimizes business disruptions, reflecting a mature and customizable approach to endpoint security.
Robust Communication and Collaboration – Look for solutions that maintain effective communication channels and collaborative practices with stakeholders. This promotes a coordinated and informed response to security threats, indicating a mature and integrated security strategy.
Automated and Adaptive Security – Utilize advanced technologies like AI and machine learning for real-time threat detection and response. A mature solution will continuously learn and adapt to new threats, enhancing overall security effectiveness.
Comprehensive Incident Response Plans – Have well-defined and regularly tested incident response plans, including detailed steps for containment and recovery, and clear communication protocols. This preparedness is a hallmark of maturity, ensuring quick and effective responses to incidents.
Integration with Business Continuity Planning – Ensure security measures support overall organizational resilience and maintain critical operations during disruptions. Mature solutions seamlessly integrate with business continuity plans, ensuring security is a part of broader organizational resilience.
Continuous Improvement Culture – Foster a culture of regular review and updates of security practices based on incident lessons, threat landscape changes, and technological advancements. A mature endpoint security strategy is dynamic and continually evolving.
Resilient Design Patterns – Assess solutions that implement design patterns capable of isolating faults and allowing systems to degrade gracefully instead of crashing. This significantly reduces the risk of BSODs and other critical failures, enhancing overall system stability and reliability — a critical indicator of maturity.
By focusing on these indicators, CISOs can assess and ensure the maturity of their endpoint security solutions, enhancing their organization’s security posture and fostering continuous improvement.
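To illustrate the "Resilient Design Patterns" indicator, the sketch below shows fault isolation in miniature: a detection engine that disables a rule when it fails instead of letting the failure take down everything else. The class, rule names, and the deliberately broken rule are hypothetical. In kernel mode an invalid memory access generally cannot be caught this way, which is one reason to keep risky logic in user mode.

```python
from typing import Callable, Dict, List

Rule = Callable[[dict], bool]


class IsolatingEngine:
    """Runs detection rules; a rule that raises is disabled instead of crashing the engine."""

    def __init__(self, rules: Dict[str, Rule]) -> None:
        self.rules = dict(rules)
        self.disabled: List[str] = []

    def evaluate(self, event: dict) -> List[str]:
        matches: List[str] = []
        for name, rule in list(self.rules.items()):
            try:
                if rule(event):
                    matches.append(name)
            except Exception:
                # Degrade gracefully: drop the faulty rule, keep the others running.
                del self.rules[name]
                self.disabled.append(name)
        return matches


engine = IsolatingEngine({
    "suspicious-path": lambda e: e["path"].startswith("/tmp/"),
    "broken-rule": lambda e: e["fields"][21] == "x",  # hypothetical defect: field is missing
})

hits = engine.evaluate({"path": "/tmp/dropper"})
print("Matched:", hits)
print("Disabled after failure:", engine.disabled)
```

When assessing a product, the questions to ask are whether a faulty rule or content file is contained like this, and whether the agent reports what it disabled so the gap in coverage is visible.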
Conclusion
The recent CrowdStrike incident underscores the complexities of endpoint security. By learning from this event and collaborating with stakeholders, we can refine our strategies, bolster our defenses, and better prepare for future challenges. Trust in software vendors is vital for effective communication and rapid issue resolution, helping us avoid past mistakes. As CISOs, our commitment to continuous improvement and proactive security measures is crucial to safeguarding our organizations in an increasingly hostile cyber environment.
Building and maintaining trust in our systems, teams, and vendors is essential for successfully navigating the complex landscape of cybersecurity. Warren Buffett once said, “Trust is like the air we breathe – when it’s present, nobody really notices; when it’s absent, everybody notices.” Prioritizing trust ensures that our cybersecurity efforts are effective and resilient, helping us stay ahead of evolving threats and unexpected scenarios.