When fake data is a good thing – how synthetic data trains AI to solve real problems

You’ve just finished a strenuous hike to the top of a mountain. You’re exhausted but elated. The view of the city below is gorgeous, and you want to capture the moment on camera. But it’s already quite dark, and you’re not sure you’ll get a good shot. Fortunately, your phone has an AI-powered night mode that can take stunning photos even after sunset.

Here’s something you might not know: That night mode may have been trained on synthetic nighttime images, computer-generated scenes that were never actually photographed.

As artificial intelligence researchers exhaust the supply of real data on the web and in digitized archives, they are increasingly turning to synthetic data, artificially generated examples that mimic real ones. But that creates a paradox. In science, making up data is a cardinal sin. Fake data and misinformation are already undermining trust in information online. So how can synthetic data possibly be good? Is it just a polite euphemism for deception?

As a machine learning researcher, I think the answer lies in intent and transparency. Synthetic data is generally not created to manipulate results or mislead people. In fact, ethics may require AI companies to use synthetic data: Releasing images of real human faces, for example, can violate privacy, whereas synthetic faces can offer similar benefits with formal privacy guarantees.

There are other reasons for the growing use of synthetic data in training AI models. Some situations are so rare that they are barely represented in real data. Rather than letting these gaps become an Achilles’ heel, researchers can simulate those situations instead.

Another motivation is that collecting real data can be costly or even risky. Imagine collecting data for a self-driving car during storms or on unpaved roads. It is often much more efficient, and far safer, to generate such data virtually.

Here’s a quick take on what synthetic data is and why researchers and developers use it.

How synthetic data is made

Training an AI model requires large amounts of data. As with students and athletes, the more an AI model is trained, the better its performance tends to be. Researchers have long known that when data is in short supply, they can use a technique known as data augmentation. For example, a given image can be rotated or scaled to yield additional training data. Synthetic data is data augmentation on steroids. Instead of making small alterations to existing images, researchers create entirely new ones.
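To make the rotation and scaling idea concrete, here is a minimal sketch in plain Python. It treats an image as a list of pixel rows. The function names (`rotate90`, `flip_horizontal`, `scale_nearest`, `augment`) are made up for this illustration and do not come from any particular library; real pipelines typically rely on dedicated image tooling.

```python
def rotate90(image):
    """Rotate a 2D image (a list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def scale_nearest(image, factor):
    """Upscale by an integer factor using nearest-neighbor repetition."""
    scaled = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row 'factor' times, copying so rows stay independent.
        scaled.extend(list(wide) for _ in range(factor))
    return scaled


def augment(image):
    """Return several altered copies of one training image."""
    return [rotate90(image), flip_horizontal(image), scale_nearest(image, 2)]


# A tiny 2x2 "image" stands in for a real photo (illustrative values only).
image = [[0, 1],
         [2, 3]]
variants = augment(image)
```

One original image yields three extra training examples here. Each one is only a small alteration of the source image, which is the limitation that fully synthetic data is meant to overcome.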

But how do researchers create synthetic data? There are two main approaches. The first approach relies on rule-based or physics-based models. For example, the laws of optics can be used to simulate how a scene would appear given the positions and orientations of objects within it.
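As a rough illustration of the physics-based approach, the sketch below renders a sphere using Lambert's cosine law, a standard optics rule stating that the brightness of a matte surface is proportional to the cosine of the angle between the surface and the light. Everything specific here is an assumption made for the example: the function name `render_sphere`, the unit sphere, the distant light source and the camera looking straight down the z-axis. Production simulators model far more, such as shadows, lenses and sensor noise.

```python
import math


def render_sphere(size, light_dir, albedo=1.0):
    """Render a matte unit sphere as a size x size grid of brightness values.

    Assumes an orthographic camera looking along the z-axis and a distant
    light pointing from the scene toward 'light_dir' (illustrative setup).
    """
    lx, ly, lz = light_dir
    norm = math.sqrt(lx * lx + ly * ly + lz * lz)
    lx, ly, lz = lx / norm, ly / norm, lz / norm

    image = []
    for row in range(size):
        y = 1.0 - 2.0 * (row + 0.5) / size  # +1 at the top, -1 at the bottom
        line = []
        for col in range(size):
            x = 2.0 * (col + 0.5) / size - 1.0  # -1 at the left, +1 at the right
            r2 = x * x + y * y
            if r2 > 1.0:
                line.append(0.0)  # pixel misses the sphere: dark background
                continue
            # On a unit sphere the surface normal equals the surface point.
            z = math.sqrt(1.0 - r2)
            cos_theta = x * lx + y * ly + z * lz
            # Lambert's cosine law; surfaces facing away from the light are dark.
            line.append(albedo * max(0.0, cos_theta))
        image.append(line)
    return image


# The same object under two lighting setups gives two distinct training images.
front_lit = render_sphere(9, (0.0, 0.0, 1.0))
side_lit = render_sphere(9, (1.0, 0.0, 0.0))
```

Changing only the light direction produces a new image of the same scene, which is how a simulator can generate many labeled variations that were never photographed.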

The second approach uses generative AI to produce data. Modern generative models are trained on vast…
