Microsoft Patents Automate Data Clean-up

AI models are only as good as the data they’re built on.

Photo of a Microsoft patent
Photo via U.S. Patent and Trademark Office

In its quest for AI domination, Microsoft wants to make sure it’s feeding its models well: The company filed two patent applications for tech that uses generative AI to suss out bad data. 

First, the company is seeking to patent a system for “data health evaluation” that uses generative language models to weed out and automatically fix errors in a dataset, such as missing information or outliers, without a human needing to audit it. 

The system uses an automated agent to prompt the model to create a data evaluation plan, or a list of tests to run on the dataset based on the data itself. Running these checks allows the generative model to find data health issues. 
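
To make the idea concrete, here is a minimal Python sketch of what running a data evaluation plan might look like. It is not taken from Microsoft's filing: the plan format, the column names, the sample rows and the helper functions are all hypothetical, and the plan is hard-coded where the patent describes a generative model producing it. The sketch only detects issues; it does not attempt the automatic fixes the filing describes.

```python
from statistics import median

# Hypothetical plan, standing in for what a generative model might return
# after being prompted with a sample of the dataset.
plan = [
    {"check": "missing", "column": "age"},
    {"check": "outlier", "column": "income", "threshold": 3.5},
]

# Made-up sample data for illustration only.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": 51000},
    {"age": 41, "income": 50000},
    {"age": 38, "income": 950000},
]


def find_missing(rows, column):
    """Return indices of rows where the column is absent or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def find_outliers(rows, column, threshold=3.5):
    """Return indices whose modified z-score (based on the median
    absolute deviation) exceeds the threshold."""
    values = [(i, row[column]) for i, row in enumerate(rows)
              if row.get(column) is not None]
    if not values:
        return []
    med = median(v for _, v in values)
    mad = median(abs(v - med) for _, v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in values if 0.6745 * abs(v - med) / mad > threshold]


# Map check names used in the plan to the functions that implement them.
CHECKS = {"missing": find_missing, "outlier": find_outliers}


def run_plan(rows, plan):
    """Run every check in the plan and record which rows it flagged."""
    report = []
    for step in plan:
        params = {k: v for k, v in step.items() if k not in ("check", "column")}
        flagged = CHECKS[step["check"]](rows, step["column"], **params)
        report.append({**step, "rows": flagged})
    return report


report = run_plan(rows, plan)
for entry in report:
    print(entry["check"], entry["column"], entry["rows"])
```

In this toy dataset, the missing-value check flags the row with no age and the outlier check flags the row with the 950,000 income. In the system the filing describes, the list of checks would come from the model rather than from a developer.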

Additionally, the company wants to patent a system that relies on generative models to “improve bad quality and subjective data.” This tech aims to help make sense of subjective data, such as opinions or preferences, that tend to be difficult to label consistently. Instead of a human needing to manually label this kind of data, a generative model essentially interprets that data into something more reliable, which is then used to train smaller and more lightweight models. 

“Incorporating subjective data, including bad-quality data, into machine learning models is problematic due to the data’s subjectivity and the challenges in converting it into a usable format,” Microsoft said in its filing. 

It’s not the first time data health has come up in patent applications, though data integrity is often tackled from a security perspective, such as Google’s patents to anonymize datasets or IBM’s and Intel’s patents for data minimization techniques.

AI models are only as good as the data they’re built on. With the sheer amount of data it takes to build AI, patents like these seek to automate data integrity and cleanup, a tedious process that can take up a large amount of time for data scientists. 

Microsoft could have plenty of uses for tech like this, especially as the company aims to stand on its own two feet in the AI race: Microsoft AI CEO Mustafa Suleyman told CNBC that it’s “mission-critical that long-term, we are able to do AI self-sufficiently.” The quicker it can come up with clean, usable datasets to feed to its AI models, the better its chances of staying ahead of competitors. 