Updates to U.S. Social Security Number sensitive information type definition for improved accuracy

To improve the accuracy of the “U.S. Social Security Number” (SSN) sensitive information type, we are making the following changes to its definition:

1. Three discrete confidence levels (High, Medium, and Low). Each level indicates the likelihood of a true positive, based on the following:

  • When the SSN was issued. SSNs issued before 2011 have a relatively strong definition because additional checks apply to them.
  • Whether the SSN is formatted (ddd dd dddd or ddd-dd-dddd) or unformatted (ddddddddd).
  • Whether a keyword is found in proximity to the SSN.

2. An additional pattern that does not require keywords in proximity, to reduce false negatives. The current definition requires keywords like “SSN” or “Social Security Number” near the actual number, which can cause valid numbers to go undetected (e.g., in an Excel spreadsheet where the supporting keyword is present only in the header row).

3. Added intelligence to detect high volumes of SSNs in tabular data, such as an Excel spreadsheet where the keyword is present only in the table header. Use “High confidence” or “Medium confidence” in your policy for this. Please note that this requires at least one instance to be detected with a keyword in proximity.
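
The factors in the list above (formatting and keyword proximity) can be illustrated with a small Python sketch. This is a toy model, not Microsoft's actual definition: the keyword list, the 300-character window, and the mapping from factors to High/Medium/Low are all assumptions made for illustration, and the issue-date checks and the tabular-data behavior are left out.

```python
import re

# Assumed values for illustration only; the real definition's keyword
# list and proximity window are not stated in this notice.
KEYWORDS = ("ssn", "social security number")
WINDOW = 300  # characters searched on each side of a candidate

# Formatted: ddd dd dddd or ddd-dd-dddd (same separator in both positions).
FORMATTED = re.compile(r"(?<!\d)\d{3}([- ])\d{2}\1\d{4}(?!\d)")
# Unformatted: exactly nine consecutive digits.
UNFORMATTED = re.compile(r"(?<!\d)\d{9}(?!\d)")


def has_keyword_nearby(text, start, end):
    """Return True if any keyword occurs within WINDOW characters of the match."""
    context = text[max(0, start - WINDOW):end + WINDOW].lower()
    return any(keyword in context for keyword in KEYWORDS)


def classify(text):
    """Return (candidate, level) pairs under the assumed mapping:
    formatted + keyword -> High, unformatted + keyword -> Medium,
    no keyword -> Low."""
    results = []
    for pattern, formatted in ((FORMATTED, True), (UNFORMATTED, False)):
        for match in pattern.finditer(text):
            keyword = has_keyword_nearby(text, match.start(), match.end())
            if formatted and keyword:
                level = "High"
            elif keyword:
                level = "Medium"
            else:
                level = "Low"
            results.append((match.group(0), level))
    return results


print(classify("SSN: 123-45-6789"))  # [('123-45-6789', 'High')]
print(classify("order 123121234"))   # [('123121234', 'Low')]
```

The last call shows why keyword-free matching is risky on its own: any nine-digit number is a candidate, so it only earns the lowest level in this sketch.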

See details of current definition vs. new definition below.

When this will happen

Rollout will begin in early June and is expected to be complete by early July 2021.

How this will affect your organization

What you need to do to prepare

Review your policies and set the appropriate confidence level for the US SSN sensitive information type based on what you want to detect.

Learn more about sensitive information types.

Details:

Your existing policies, including data loss prevention policies, do not need to be changed. However, depending on your needs, you may wish to change the confidence level for US SSN within your policies (such as data loss prevention, communication compliance, sensitivity labeling, or records management). For example, set the confidence level to High if you want minimal false positives, or to Low if you want minimal false negatives.

  • We recommend that you use the High confidence level in your policies for minimal false positives.
  • If you also wish to detect unformatted numbers like 123121234, you should use the Medium confidence level.
  • Using Low confidence may result in many false positives due to the weak definition of US SSN, where any 9-digit number can be a valid SSN. Please note that using Medium or High confidence will still detect high volumes of SSNs without keywords, provided at least one instance has a keyword in proximity.
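
The guidance above can be sketched as a policy threshold plus a promotion rule for keyword-less matches. This is a hypothetical model, not the product's logic: the `Match` type, `level`, and `apply_policy` are invented names, and the assumption is that one keyword-supported instance in a document lets the remaining instances be scored as if they had a keyword.

```python
from dataclasses import dataclass

RANK = {"Low": 0, "Medium": 1, "High": 2}


@dataclass
class Match:
    """Hypothetical record for one candidate number found in a document."""
    value: str
    formatted: bool    # ddd dd dddd / ddd-dd-dddd versus ddddddddd
    has_keyword: bool  # a keyword such as "SSN" was found in proximity


def level(match, anchor_found):
    """Assumed mapping: a candidate is 'supported' if it has its own keyword
    or if some other instance in the same document does."""
    if not (match.has_keyword or anchor_found):
        return "Low"
    return "High" if match.formatted else "Medium"


def apply_policy(matches, min_level):
    """Return the values a policy set to min_level would flag."""
    anchor_found = any(m.has_keyword for m in matches)
    return [
        m.value
        for m in matches
        if RANK[level(m, anchor_found)] >= RANK[min_level]
    ]


# A spreadsheet-like case: only the first value sits near a keyword.
table = [
    Match("123-45-6789", formatted=True, has_keyword=True),
    Match("234-56-7890", formatted=True, has_keyword=False),
    Match("123121234", formatted=False, has_keyword=False),
]
print(apply_policy(table, "High"))    # ['123-45-6789', '234-56-7890']
print(apply_policy(table, "Medium"))  # all three values
```

Under these assumptions, a lone nine-digit number with no keyword anywhere in the document is flagged only by a Low-confidence policy.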

Current and new definitions

Message ID: MC256841



