Multiple Services: Partially incomplete log data due to monitoring agent issue

Originally posted by Microsoft, Oct 3, 2024

Preliminary Post Incident Review (PIR) – Multiple services – Partially incomplete log data due to monitoring agent issue

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

What happened?

Starting around 23:00 UTC on 2 September 2024, a bug in one of Microsoft’s internal monitoring agents resulted in a malfunction in some of the agents when uploading log data to our internal logging platform. This resulted in partially incomplete log data for the affected Microsoft services. This issue did not impact the uptime of any customer-facing services or resources – it only affected the collection of log events. Additionally, this issue is not related to any security compromise.

The issue was detected on 5 September. Following detection, our engineering teams began investigating and implemented a temporary workaround to reduce the impact of these failures beginning on 19 September. This temporary workaround involved periodically restarting the agent or server to restart the log collection process. As a result, services saw a significant improvement in the completeness of the log data. However, some customers may have experienced occasional increased latency or delays in log delivery from the affected services due to this workaround.

The engineering team continued to investigate the cause of the issue while simultaneously exploring multiple paths toward mitigation for the affected agents. The updated agent aims to prevent the problem from recurring across all affected services and regions. More information can be found in the section “How are we making incidents like this less likely or less impactful?”.

Affected Services

A subset of Microsoft cloud services had partially incomplete log data during the impact period. Those affected services and log types are listed below. The nature of the incomplete logging data for affected services varies by customer. Customers who use a service affected by this issue have been contacted through the Microsoft 365 Message Center and/or Azure Service Health. Customers with questions regarding their logs can contact support for information.

Microsoft Entra – Potentially incomplete sign-in logs, and activity logs
Microsoft Entra relies on the continuous performance of this internal logging platform to deliver various data streams to customers, including sign-in logs and activity logs. This data is exposed to customers via the Azure Portal, Microsoft Entra Admin Portal, Microsoft Graph APIs, and streamed to customers through Azure Monitor. Recognizing the criticality of these logs to our customers, we expedited mitigation within Microsoft Entra. Customers using Azure Portal, Microsoft Entra Admin Portal and Microsoft Graph APIs should be receiving complete logs as of 19 September. For logs consumed via Azure Monitor we expect to complete mitigation by 3 October 2024. Upon further investigation, it was determined that customers may have begun encountering potentially incomplete event logs as early as 21:03 UTC on 5 September 2024.
Entra logs flowing via Azure Monitor into Microsoft Security products, including Microsoft Sentinel, Microsoft Purview, and Microsoft Defender for Cloud, were also impacted. These issues are expected to be mitigated on 3 October 2024.

Azure Logic Apps – Potentially incomplete platform logs
From approximately 15:00 UTC on 8 September to 12:00 UTC on 20 September 2024, Logic Apps Consumption and Standard SKU customers may have experienced intermittent gaps in telemetry data in Log Analytics, Resource Logs, and Diagnostic settings from Logic Apps.

Azure Healthcare APIs – Potentially incomplete platform logs
This issue impacts customers who have enabled diagnostic settings in the FHIR service. Between 00:00 UTC on 7 September and 18:00 UTC on 20 September 2024, customers using the Fast Healthcare Interoperability Resources (FHIR) service within our Azure Health Data Services and Azure API for FHIR may have experienced partially incomplete diagnostic logs.

Microsoft Sentinel – Potentially incomplete security alerts
From 23:00 UTC on 5 September 2024 to 3 October 2024 (based on Entra’s mitigation via Azure Monitor above), Microsoft Sentinel customers may have experienced potential gaps in security related logs or events, affecting customers’ ability to analyze data, detect threats, or generate security alerts.

Azure Monitor – Potentially incomplete diagnostic settings routed to Azure Monitor
From 23:00 UTC on 5 September 2024 to 20:20 UTC on 3 October 2024, customers that configured Diagnostic Settings to route log data from Azure services to Azure Monitor may have observed gaps or reduced results when running queries based on log data from impacted services. In scenarios where customers configured alerts based on this log data, alerting might have been impacted.

Azure Trusted Signing – Potentially incomplete SignTransaction and SignHistory logs
Between 15:00 UTC on 8 September and 00:00 UTC on 27 September 2024, Trusted Signing customers in East US, West US 2, West US 3, West Central US, Europe North, and Europe West regions experienced partially incomplete SignTransaction and SignHistory logs, leading to reduced signing log volume and under-billing.

Azure Virtual Desktop – Potentially incomplete logs in Application Insights
Between 06:00 UTC on 14 September and 06:00 UTC on 29 September 2024, customers may have experienced Azure Virtual Desktop (AVD) logs being partially incomplete in Application Insights. The main connectivity functionality of AVD was unimpacted. Customers can find the AVD-related endpoints at https://learn.microsoft.com/azure/virtual-desktop/required-fqdn-endpoint?tabs=azure

Power Platform – Data discrepancies across reports
Between 23:47 UTC on 09 September and 23:06 UTC on 19 September 2024, customers using Power Platform may have experienced minor discrepancies affecting data across various reports, including Analytics reports in the Admin and Maker portal, Licensing reports, Data Exports to Data Lake, Application Insights, and Activity Logging.

This is the current list of affected services, though additional impacts may be identified as we complete the final Post Incident Review.

What do we know so far?

Our investigation shows that, in the process of addressing a bug in the log collection service, we exposed an unrelated bug in the internal monitoring agent, which prevented a subset of agents from uploading log event data. During the investigation of this bug, we determined this incident was not related to any security compromise.

The initial change was to address a limit in the logging service, but when deployed, it inadvertently triggered a deadlock condition when the agent was directed to change the telemetry upload endpoint in rapid succession while a dispatch to the initial endpoint was underway. This resulted in a gradual deadlock of threads in the dispatching component, preventing the agent from uploading telemetry. The deadlock impacted only the dispatching mechanism within the agent, with other functionality working normally, including collecting and committing data to the agent’s local durable cache. A restart of the agent or the OS resolves the deadlock, and the agent uploads the data it has within its local cache upon starting. There were situations where the amount of log data collected by the agent was larger than the agent’s local cache limit before a restart occurred, and in these cases the agent overwrote the oldest data in the cache (a circular buffer retaining the most recent data, up to the size limit). The log data beyond the cache size limit is not recoverable.
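The cache behavior described above can be illustrated with a minimal sketch. This is not the agent's actual implementation: the class name `BoundedLogCache`, the byte-based size limit, and the `dropped` counter are all hypothetical, chosen only to show how a size-limited circular buffer keeps the most recent events and loses the oldest ones when uploads are stalled.

```python
from collections import deque
from typing import Deque, List


class BoundedLogCache:
    """Hypothetical size-limited cache that overwrites the oldest events first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._events: Deque[bytes] = deque()
        self._size = 0
        self.dropped = 0  # events lost before they could be uploaded

    def append(self, event: bytes) -> None:
        # An event larger than the whole cache can never be stored.
        if len(event) > self.max_bytes:
            self.dropped += 1
            return
        # Evict the oldest events until the new one fits within the limit.
        while self._size + len(event) > self.max_bytes:
            oldest = self._events.popleft()
            self._size -= len(oldest)
            self.dropped += 1
        self._events.append(event)
        self._size += len(event)

    def drain(self) -> List[bytes]:
        """Return and clear whatever is still cached (what a restart could upload)."""
        events = list(self._events)
        self._events.clear()
        self._size = 0
        return events


# Example: uploads are stalled while three 4-byte events arrive in a 10-byte cache.
cache = BoundedLogCache(max_bytes=10)
for event in (b"aaaa", b"bbbb", b"cccc"):
    cache.append(event)
recovered = cache.drain()
```

In this sketch, only the events still in the buffer at restart time can be uploaded; anything counted in `dropped` corresponds to the unrecoverable data described above.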

While the backend change that triggered the issue followed our safe deployment practices, including rigorous testing and validation before deploying the changes into production, these testing efforts were insufficient to identify the bug in the agent, as it took several days to manifest itself in production. Similarly, we have identified gaps in monitoring that delayed detection of this issue, with relevant repair items included below.

How did we respond?

14:56 UTC on 02 September 2024 – Customer impact began, triggered by a safe deployment roll out of a fix for the log collection service.
21:03 UTC on 05 September 2024 – An incident was created after detecting partial diagnostic telemetry log loss in Entra in one region.
21:49 UTC on 06 September 2024 – Incident assigned to logging platform team by Entra. The logging platform team began investigating issues relating to partial telemetry loss.
23:47 UTC on 09 September 2024 – The logging platform team expanded the investigation after detecting issues in multiple regions and services.
23:44 UTC on 10 September 2024 – The logging platform team recommended that service teams who had reported the issue implement a temporary mitigation.
02:14 UTC on 14 September 2024 – The logging platform team had a candidate agent fix.
01:00 UTC on 17 September 2024 – Testing revealed that the candidate agent fix was not sufficient to resolve the issue.
22:48 UTC on 18 September 2024 – We determined the impact was broader than originally understood, including telemetry for customers. We initiated our major incident management processes which included engaging security and privacy teams.
04:30 UTC on 19 September 2024 – A candidate fix for the agent was identified based on an initial diagnosis of a deadlock in the agent.
16:00 UTC on 19 September 2024 – The agent with the fix was deployed to test and pre-production environments.
16:03 UTC on 19 September 2024 – Upon completing a service impact analysis, we engaged all impacted service teams to assist with the temporary mitigation strategies.
10:00 UTC on 21 September 2024 – Log data from test and pre-production environments indicated the fix to the agent was effective.
17:00 UTC on 21 September 2024 – We initiated an accelerated safe deployment for rolling out the updated agent across affected critical services.
05:35 UTC on 30 September 2024 – We identified a recent bug fix in the log collection service as triggering the bug in the agent.
19:20 UTC on 30 September 2024 – We started a rollback (safe deployment) of the bug fix for the log collection service.

How are we making incidents like this less likely or less impactful?
Our response to this incident is ongoing. We are committed to addressing identified gaps and improving our processes to de-risk incidents like this one. This includes improvements in the following areas:

Telemetry platform:

We have updated the monitoring agent. This has been deployed across the critical affected services and is rolling out across other impacted services targeting completion by end of October 2024.
We are implementing operational health monitoring for the monitoring agent fleet supporting our affected services, to detect and alert on changes to data collection and upload compared to baselines. (Estimated completion: November 2024)
We will address gaps around end-to-end testing of the logging platform, including simulating accelerated changes to the agent configuration and endpoints. (Estimated completion: November 2024)
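As a rough illustration of the baseline comparison mentioned in the health monitoring item above, the following sketch flags an agent whose upload volume falls well below its recent baseline. The function names, the use of a simple mean as the baseline, and the 20% threshold are assumptions made for this example, not details of the actual monitoring being built.

```python
from statistics import mean
from typing import Sequence


def upload_shortfall(baseline_counts: Sequence[float], current_count: float) -> float:
    """Fraction by which the current upload volume falls below the baseline mean."""
    baseline = mean(baseline_counts)
    if baseline <= 0:
        return 0.0  # no meaningful baseline to compare against
    return max(0.0, (baseline - current_count) / baseline)


def should_alert(
    baseline_counts: Sequence[float],
    current_count: float,
    threshold: float = 0.2,  # hypothetical: alert on a 20% or larger drop
) -> bool:
    """Return True when the shortfall meets or exceeds the alert threshold."""
    return upload_shortfall(baseline_counts, current_count) >= threshold


# Example: an agent that normally uploads about 1000 events per interval.
history = [1000, 1000, 1000]
degraded = should_alert(history, 700)  # 30% below baseline
healthy = should_alert(history, 950)   # 5% below baseline
```

A gradual failure like the one in this incident would show up here as a growing shortfall over successive intervals, which is why comparing against a baseline can catch it earlier than a simple up/down check.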

Broader Microsoft services:

Critical services providing log event data to their customers will be onboarding to central monitoring of the log data. (Estimated completion: November 2024)
We will design and develop a new event type annotation methodology to enable critical services to more closely monitor high impact telemetry. (Estimated completion: TBD)

NOTE: Affected tenants would have received notifications for Microsoft 365 (MC903318) and Dynamics 365 (MC903522) via Message Center, and/or for Azure (TrackingID 1K8W-HF8) via Azure Service Health in the Azure portal.


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/1K8W-HF8

Message ID: MC903318
