IBM Patent Highlights Double-Edged Sword of Synthetic Training Data

The tech underscores that offering AI services may be just as lucrative as building AI models themselves.



When feeding data-hungry AI models, IBM may want to give them the fake stuff. 

The company is seeking to patent a system for synthetic data generation. IBM’s system essentially creates simulations of authentic data from real users, aiming to “side-step concerns with regulations” and privacy. 

IBM noted that one of the biggest things holding back rapid AI development is the need for “accurate and representative data” for training. If not trained properly, “the AI model is liable to eventually be unreliable and/or inaccurate in use,” the company noted. 

IBM’s system relies on “logical graphs,” which represent the paths that user interactions can take through an interface, such as a website or platform. After a user interacts with an interface, an AI model takes that interaction and simulates it several times over, creating a full set of training data of user behaviors without the need for any actual user data. 
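To make the idea concrete, here is a minimal Python sketch of what simulating paths through a graph of interface states might look like. It is an illustration only, not IBM’s method: the screen names, the `simulate_session` and `generate_dataset` helpers, and the uniform random choice between next steps are all assumptions made for this example.

```python
import random

# Hypothetical interface graph: each screen maps to the screens reachable from it.
# The node names are illustrative and do not come from the patent filing.
INTERFACE_GRAPH = {
    "home": ["search", "pet_profile"],
    "search": ["pet_profile", "home"],
    "pet_profile": ["play", "walk", "exit"],
    "play": ["pet_profile", "exit"],
    "walk": ["pet_profile", "exit"],
    "exit": [],  # terminal state: no outgoing edges
}


def simulate_session(graph, start="home", max_steps=20, rng=None):
    """Walk the graph from `start` until a terminal state or the step limit."""
    rng = rng or random.Random()
    path = [start]
    while graph[path[-1]] and len(path) < max_steps:
        # Assumption: every available next step is equally likely.
        path.append(rng.choice(graph[path[-1]]))
    return path


def generate_dataset(graph, n_sessions, seed=0):
    """Produce many simulated sessions from one graph, reproducibly."""
    rng = random.Random(seed)
    return [simulate_session(graph, rng=rng) for _ in range(n_sessions)]


sessions = generate_dataset(INTERFACE_GRAPH, n_sessions=1000)
```

Each entry in `sessions` is a list of screens visited in order, which could then serve as one synthetic training record.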

Additionally, to refine the simulations, users who interact with the interface receive positive or negative rewards associated with different interactions, such as positive feedback for completing certain tasks. For example, if IBM’s system is creating training data for a video game in which users take care of virtual pets, a positive reward may be a happy interaction with the pets after tasks like playing or going for walks. That reward signal would then be passed to the AI model for replication. 

This allows the model to create simulations based on actual human inclinations and behaviors, helping the model “define why agents of the AI model would choose different paths.” 
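One way to picture how rewards could steer simulated behavior is sketched below in Python. This is a hedged illustration rather than the mechanism in IBM’s filing: the reward values, the action names, and the softmax-style weighting in `choose_next` are assumptions chosen for the example.

```python
import math
import random

# Hypothetical slice of a virtual pet game: from the pet's profile, a
# simulated user can play, walk, or ignore the pet. Names are illustrative.
GRAPH = {
    "pet_profile": ["play", "walk", "ignore"],
    "play": [],
    "walk": [],
    "ignore": [],
}

# Assumed reward values: positive for caring actions, negative for neglect.
REWARDS = {"play": 1.0, "walk": 0.5, "ignore": -1.0}


def choose_next(options, rewards, rng, temperature=1.0):
    """Pick the next action, favoring higher rewards (softmax weighting)."""
    weights = [math.exp(rewards.get(o, 0.0) / temperature) for o in options]
    return rng.choices(options, weights=weights, k=1)[0]


def simulate(graph, rewards, start, rng, max_steps=10):
    """Return one simulated path and the total reward collected along it."""
    path, total = [start], 0.0
    while graph[path[-1]] and len(path) < max_steps:
        nxt = choose_next(graph[path[-1]], rewards, rng)
        total += rewards.get(nxt, 0.0)
        path.append(nxt)
    return path, total


rng = random.Random(42)
runs = [simulate(GRAPH, REWARDS, "pet_profile", rng) for _ in range(2000)]

# Count how often each action was chosen as the first step.
counts = {a: sum(1 for p, _ in runs if p[1] == a) for a in REWARDS}
```

In this sketch, higher-reward actions such as playing show up more often in the simulated data than neglecting the pet, which is the kind of preference signal the rewards are meant to capture. The `temperature` parameter controls how strongly rewards bias the choice.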

The result is training data that mirrors real human interactions, choices, and behaviors, without violating user privacy. 

Privacy is one of the main selling points for synthetic data. One of AI’s biggest issues is that a model trained on real user data runs the risk of spitting that exact personal data back out if prompted in a specific way. But creating data that mimics how a real person would behave presents a solution, said Bob Rogers, PhD, co-founder of BeeKeeperAI and CEO of Oii.ai. 

The other advantage of synthetic data is that it’s generally far cheaper, said Rogers. Using human labor to process data can be pricey and time-consuming, whereas using an AI model to generate an abundance of clean, processed data instantaneously can save significant resources. 

However, synthetic data comes with its own set of risks, said Rogers. Done right, this data can protect user privacy, but if it mimics an actual user’s personal data too closely, it can do the opposite. This is particularly relevant when synthetic data is used to train AI models in fields like healthcare or finance. 

“If it’s at the point where you’re generating perfect synthetic data that can be used to train algorithms, it’s basically just replicating private data,” said Rogers. “I’m not sure that it’s actually that secure if you’re able to reconstruct real data from the underlying data.” 

On the other hand, if synthetic data isn’t as nuanced and complex as data from real users, the AI models themselves won’t perform nearly as well, turning out worse results than they would with authentic data, said Rogers. “You’re smoothing over important subtleties in the data,” he said. 

IBM’s patent also highlights that creating the services peripheral to building AI could be just as lucrative as building the AI itself. This patent would likely be used as a tool to help IBM support enterprises in building their own AI models, offering a more affordable option than collecting their own authentic data, Rogers noted. That said, it joins a crowded field of synthetic data startups and previously filed patents that offer similar services.

And this filing is just one example of a larger trend: Though foundation models like the ones behind ChatGPT are raking in billions, so are the companies servicing them, such as Nvidia, Microsoft’s Azure, Amazon’s AWS, and yes, IBM. “If you add it all up, I think (revenue) is 50-50,” Rogers noted. “It’s pretty well balanced.”