Synthetic data – The future of machine learning and healthcare?

AI has transformed, and will continue to transform, biopharma and healthcare.

04 December, 2021

Currently, machine learning models are trained on massive amounts of real data collected from multiple geographies, with the aim of producing data sets that are diverse and low in bias. However, this method is not only time-consuming but also complex, and it is becoming more difficult as personal data is increasingly covered by modern privacy regulations. Gartner predicts that by 2023, 65% of the world’s population will have its personal data covered under such regulations.

A novel approach to circumvent these limitations is synthetic data (SD). Synthetic data is generated by machine learning (ML) software: the ML models automatically learn patterns in real data, then output new data that has the same format and statistical properties. Creating and using fake data is not a new concept, but existing data generators require domain experts to write explicit rules. In contrast, ML-based SD tools learn these rules automatically and can infer correlations that you might miss. This ultimately enables non-experts to create labelled synthetic data quickly and often, for a multitude of tasks such as ML development and software testing. There are three types of synthetic data: full, partial and hybrid. Full synthetic data is produced from scratch, partial synthetic data replaces only the sensitive pieces of a large data set, and hybrid synthetic data produces neighbours of real data points whilst retaining the original data set’s correlations and integrity.
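To make the "learn the patterns, then generate new data" idea concrete, here is a minimal sketch of fully synthetic tabular data. It is an illustration under loud assumptions, not how any particular SD product works: the "real" records are invented toy data (age and systolic blood pressure), the model is a simple two-column normal distribution rather than a deep generative model, and the names `fit_model` and `sample_synthetic` are hypothetical.

```python
import math
import random


def fit_model(rows):
    """Learn the mean, spread and correlation of two numeric columns."""
    n = len(rows)
    mean_a = sum(a for a, _ in rows) / n
    mean_b = sum(b for _, b in rows) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in rows) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for _, b in rows) / (n - 1)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in rows) / (n - 1)
    sd_a, sd_b = math.sqrt(var_a), math.sqrt(var_b)
    return {"mean": (mean_a, mean_b), "sd": (sd_a, sd_b), "corr": cov / (sd_a * sd_b)}


def sample_synthetic(model, n, rng):
    """Draw n brand-new rows from the learned two-column normal model."""
    (mean_a, mean_b), (sd_a, sd_b), rho = model["mean"], model["sd"], model["corr"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a = mean_a + sd_a * z1
        # Mixing z1 into the second column reproduces the learned correlation.
        b = mean_b + sd_b * (rho * z1 + residual * z2)
        rows.append((a, b))
    return rows


rng = random.Random(42)

# Stand-in for "real" records: (age, systolic blood pressure). Invented toy data.
real_rows = []
for _ in range(2000):
    age = rng.gauss(50, 12)
    real_rows.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

real_model = fit_model(real_rows)
synthetic_rows = sample_synthetic(real_model, 2000, rng)
synthetic_model = fit_model(synthetic_rows)

print("real correlation:     ", round(real_model["corr"], 2))
print("synthetic correlation:", round(synthetic_model["corr"], 2))
```

None of the synthetic rows is copied from the toy "real" set, yet the two printed correlations should sit close together. Production tools replace the simple model here with far richer ones, but the fit-then-sample shape is the same.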

Synthetic data is versatile, and can come in the form of images, time series, and tabular data. These highly realistic, privacy-safe synthetic data sets can comply with even the strictest data protection laws when they are generated with differential privacy guarantees.
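For readers unfamiliar with differential privacy, the sketch below shows its most basic building block, the Laplace mechanism, applied to a simple count. This is an assumption-laden illustration: the records are invented, `laplace_noise` and `private_count` are hypothetical helper names, and real SD tools typically apply differential privacy while training the generative model rather than to a single query like this.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)

# Invented toy records: (patient_id, has_condition). Exactly 250 are positive.
records = [(i, i % 4 == 0) for i in range(1000)]

noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print("true count: 250, noisy count:", round(noisy, 1))
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one means a more accurate answer and weaker privacy. That trade-off between utility and protection is what a differential privacy guarantee quantifies.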

AI models are often trained on biased data. AI-generated data can fill the gaps and help form more balanced, less biased data sets. SD can be used to create the data diversity of an ideal world, or to stand in where real data does not yet exist, for example as a baseline for future studies and testing in clinical trials.

One of the major industries synthetic data could transform is healthcare. Before we discuss the need for synthetic data specifically, let’s look at the benefits of bringing AI tools into the industry. Healthcare is rapidly adopting AI tools in the treatment of cancer, neurological diseases, and cardiovascular diseases. AI applications can be used in the early detection, diagnosis, and treatment of diseases, as well as in outcome prediction and prognosis evaluation. By learning from huge volumes of healthcare data, AI can inform and assist physicians in their clinical decisions. What’s more, feedback to these systems can further refine their accuracy. This healthcare data covers clinical activity in the form of medical notes, images, and genetic and electrophysiological data. However, as a large proportion of it is written text that machines cannot analyse directly, it must first be converted into a machine-readable electronic medical record (EMR). Moreover, and understandably so, healthcare providers are reluctant to share their sensitive data (medical records, ongoing conditions, personal details, etc.) with researchers, so further avenues for gathering data to train AI models are being explored. This is where SD would greatly benefit the healthcare field, as larger healthcare data sets available for machine learning will drive innovation in the industry.

The benefits of AI in healthcare are already starting to be reaped. AI has transformed, and will continue to transform, biopharma and healthcare. Health apps encourage healthier lifestyles and make users feel more in control of their lifestyle management. Moreover, some AI systems already detect cancer in its earlier stages and allow for a more accurate insight into disease progression. Additionally, it has been reported that AI can deliver mammogram results 30 times faster and with 99% accuracy, improving diagnoses dramatically. BenevolentAI, the largest AI company in Europe, uses data science to predict novel chemical compounds for the treatment of diseases and their symptoms, accelerating the drug development process enormously. Google’s DeepMind Health is developing technology that combines machine learning and systems neuroscience to build learning algorithms that mimic the human brain. As the translational validity of neurological research has been limited, this would be a huge advancement for medical research into neurological disorders that until now have had little explanation. These examples are just some of the ways the healthcare industry has implemented AI and been transformed for the better.

So, how would synthetic data aid this transformation? SD would accelerate the adoption of AI in industries like healthcare. By enabling rapid training of AI models and reducing bias in data sets while still fully adhering to data privacy regulations, SD offers a solution to many of AI’s problems. AI has the ability to modernise vital industries by streamlining processes, utilising resources more efficiently and improving user satisfaction. SD unleashes innovation while opening up the safe sharing and monetisation of highly sensitive medical data sets.

Forrester Research recently identified several critical technologies, including synthetic data, that will make up “AI 2.0”, a set of advances that radically expand AI possibilities. Word is spreading: according to StartUs Insights, more than 50 vendors have already developed synthetic data solutions. As with many other data science breakthroughs, there is little doubt that synthetic data will soon be widely used in industries around the world.

Welcome to the revolution in data-driven innovation.