This content was created by the Data Sharing Coalition, one of the founding partners of the CoE-DSC.
The Data Sharing Coalition supports organisations in realising use cases at scale to unlock the value potential of data sharing, and helps organisations create the trust mechanisms required to share data in a trusted and secure way. In our blog section ‘Q&A with’, you learn more about our participants and their thoughts, vision and ideas about data sharing. Edwin Kooistra, Co-founder and Chief Strategy and Marketing Officer at BlueGen.ai, shares his thoughts.
1. Could you introduce your organisation?
BlueGen.ai helps organisations accelerate their data-driven innovation by removing barriers while preserving privacy. These barriers often relate to privacy and security restrictions, but also to process overhead (making data available from the moment of request usually takes a lot of time and money) and to quality aspects of the data itself (data may be biased or imbalanced, making trained models less robust). We remove these barriers with a patented synthetic data generation platform. We use state-of-the-art generative AI technology that learns from the real data to generate a new dataset that looks and behaves the same as the real data, but without any personally identifiable information (PII). The synthetic data resembles real-world data and retains the statistical properties of the original source data, allowing for research without compromising privacy. Synthetically generated data consists of completely new, artificial data points with no one-to-one relationships to the original data, so none of the synthetic data points can be traced or reverse-engineered back to the original data. Synthetic data therefore offers a way to overcome data privacy challenges.
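To make the idea concrete, here is a minimal, purely illustrative sketch of tabular synthetic data generation using a Gaussian copula. It is not BlueGen.ai's patented platform or its generative AI models; it only shows the general principle that a generator fits the statistical structure of the real data and then samples brand-new rows with no one-to-one link to the originals.

```python
# Conceptual sketch of synthetic tabular data via a Gaussian copula.
# NOT BlueGen.ai's method -- just an illustration of learning statistical
# structure from real data and sampling entirely new rows from it.
import numpy as np
import pandas as pd
from scipy import stats

def fit_and_sample(real: pd.DataFrame, n_samples: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cols = real.columns
    n = len(real)

    # 1. Map each numeric column to uniform scores via its empirical ranks.
    u = real.rank(method="average").to_numpy() / (n + 1)

    # 2. Transform to standard-normal space and estimate the correlation structure.
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample new latent rows from that correlation structure...
    z_new = rng.multivariate_normal(mean=np.zeros(len(cols)), cov=corr, size=n_samples)

    # 4. ...and push them back through the empirical quantiles of each real column.
    u_new = stats.norm.cdf(z_new)
    synth = {c: np.quantile(real[c].to_numpy(), u_new[:, i]) for i, c in enumerate(cols)}
    return pd.DataFrame(synth)

# Purely illustrative "real" data: correlated age and income columns.
real = pd.DataFrame({"age": np.random.default_rng(1).normal(45, 12, 1000)})
real["income"] = 800 * real["age"] + np.random.default_rng(2).normal(0, 5000, 1000)

synthetic = fit_and_sample(real, n_samples=1000)
print(real.corr(), synthetic.corr(), sep="\n")  # similar correlation structure
```

In practice a production platform uses far richer generative models and formal privacy checks; the sketch only conveys why the synthetic rows preserve aggregate statistics without reproducing any individual record.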
Through our platform, we enable customers to generate a synthetic data set based on their own data set(s). They do this by running our platform “locally” or by using our cloud instance; either way, the data never leaves their premises. This is possible thanks to our decentralised, federated learning approach, a privacy-by-design solution that allows learning and gaining insights without having to share the data. Organisations can then extract the insights that are relevant to them.
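As a rough illustration of what "the data never leaves the premises" can mean, the sketch below implements generic federated averaging in plain NumPy: each site trains on its own private data and only model parameters are exchanged and averaged. This is a textbook-style example under simple assumptions, not a description of BlueGen.ai's actual architecture.

```python
# Minimal federated-averaging sketch (generic FedAvg idea, not BlueGen.ai's platform):
# raw data stays at each site; only model parameters travel.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One site's training step: plain gradient descent for linear regression."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w  # only the updated parameters are shared, never X or y

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Two "premises", each holding its own private data.
sites = []
for _ in range(2):
    X = rng.normal(size=(500, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=500)
    sites.append((X, y))

# Federated rounds: broadcast global weights, train locally, average the results.
global_w = np.zeros(2)
for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    global_w = np.mean(local_ws, axis=0)

print("learned:", global_w, "target:", true_w)
```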
Originally, we are a spin-off of TU Delft. The platform was developed during a collaboration between Aegon and TU Delft. In 2019, Aegon asked TU Delft how it could innovate faster using its data without compromising on privacy. Aegon was facing three challenges: privacy restrictions imposed by regulations or companies prevent data from being used outside specific contexts; real data may have imbalances (e.g. in demographic diversity); and real data is too expensive or time-consuming to collect at scale. Aegon was and is not alone in this: many organisations struggle to effectively extract value from the data contained in their systems. Our synthetic data platform overcomes these privacy-related challenges, allowing for faster, less expensive and more scalable access to data that is representative of the underlying source while preserving privacy.
2. To what extent is your organisation involved in data sharing (within and across sectors)?
Privacy is one of the biggest hurdles to using data both internally and externally. Therefore, privacy-safe data sharing is one of the main use cases we enable through synthetic data. Many of our efforts are focused on enabling this in a frictionless way via our platform. Synthetic data allows organisations to share datasets with first parties (i.e. internal stakeholders) and third parties (i.e. clients or external partners) while preserving the privacy of the underlying real data. Internally, companies can benefit from data sharing across silos to improve existing processes or systems.
3. Why is data sharing important for your industry or domain, or why should it be?
We particularly focus on healthcare, government, banking and insurance. These industries are solving data sharing challenges that are often experienced by many other organisations. At the same time, these types of organisations are dealing with a lot of privacy-sensitive data and the associated restrictions, such as the GDPR. Privacy rules and regulations restrict when and how the data can be used, how long it can be stored and where.
- In the healthcare sector, patient data is distributed among various healthcare institutions. This data must be shared to get a clear picture of diseases and the effectiveness of treatments. Another example is biobanks, which collect medical data and make it available to researchers and scientists to improve health in general.
- Governments and municipalities can improve the well-being of citizens by sharing data. Many citizens’ frustrations stem from the fact that government bodies do not seem to work together well, and sharing data can go a long way towards solving this. For example, municipalities can promote equity by sharing data between youth care and the social domain.
- Banks try to detect fraud by spotting unusual transactions, which is sometimes only possible if links can be made between data from different banks. For this, banks need to work together by sharing data.
- In the utilities industry, energy suppliers and network operators can solve congestion challenges and better match demand and supply by sharing data.
4. What are the most promising data sharing developments and trends you see in your sector?
The use of synthetic data itself is one of the most promising trends, as it addresses the growing global concerns about data privacy we have just discussed. It is already used quite a lot; a well-known example in the Netherlands is DUO, which makes its data publicly available (on demand). Banks, insurers and energy companies are all evaluating what synthetic data can do for them. Furthermore, Gartner predicts that by 2025, 70% of the data used for analytics and machine learning will be synthetic, and synthetic data tops Gartner’s hype cycle for AI.
Data monetisation is also a hot trend in relation to data sharing across industries, especially as more products and services offer digital components that facilitate data collection. For example, companies looking to identify locations for new stores might be interested in purchasing synthetic financial transaction data to understand where spend is highest or increasing. Using synthetic data accelerates data sharing across teams, especially when it opens opportunities to monetise data or streamline existing processes.
Probably the biggest trend related to data sharing is the rise of Artificial Intelligence and Machine Learning to apply advanced analytics. This is one of the key drivers of sharing data, as it allows the detection of correlations between different data sets of different organisations. By using synthetic data, combined with the predictive capabilities of AI and ML, organisations gain a competitive advantage, increase revenue and reduce both risk and cost.
5. How do you see the future of data sharing, and what steps are you currently taking in that direction?
We expect data sharing to mature quickly over the coming few years. Its importance and the need for it will increase, and so will its adoption. To achieve this, data sharing will need to be frictionless and smooth, with no privacy concerns and no months-long lead times. BlueGen.ai offers a solution to the struggles, risks and delays related to privacy, security and usability, and we are heavily investing in different methods and techniques to address privacy concerns. Our decentralised, federated learning approach is a privacy-by-design architecture that limits access to the data to the model alone, which runs on premise alongside the data. This way, the data never has to leave the premises.
Furthermore, we have developed and implemented differential privacy together with Centrum Wiskunde en Informatica (CWI), the national research institute for mathematics and computer science in the Netherlands. Differential privacy is a mathematical approach that guarantees privacy without significantly compromising the quality or accuracy of the data. As a result, analyses run on the synthetic data yield results comparable to those run on the real data.
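For readers unfamiliar with the concept, the snippet below shows the textbook Laplace mechanism, one standard way to achieve differential privacy by adding noise calibrated to a query's sensitivity. It is a generic illustration under simple assumptions (a bounded mean query), not the specific method developed with CWI.

```python
# Generic illustration of differential privacy via the Laplace mechanism:
# a small, calibrated amount of noise gives a formal privacy guarantee while
# keeping the aggregate result close to the true value.
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially-private mean of values known to lie in [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)   # max effect of one individual on the mean
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(42)
incomes = rng.normal(40_000, 12_000, size=10_000)  # illustrative data only

true_mean = incomes.mean()
private_mean = dp_mean(incomes, lower=0, upper=200_000, epsilon=1.0, rng=rng)
print(f"true mean: {true_mean:,.0f}  DP mean: {private_mean:,.0f}")
```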
Leveraging data from different systems in different organisations is a growing trend, in particular when combined with applied analytics. To enable this, we are building out our federated learning capabilities so that the platform can generate a synthetic data set based on data across different systems. This way, new correlations between the data can be identified and predictions can be improved.
6. Why are you participating in the Data Sharing Coalition?
BlueGen.ai joined the Data Sharing Coalition to promote and stimulate data sharing. Again, the challenges transcend organisations and industries, so we believe it is best to address them together. We want to enable other participants to share data in a secure way by using our platform. In addition, we want to share our experiences to help other organisations, and we see the Data Sharing Coalition as a platform that enables us to share the challenges we face. One example is how to deal with user scepticism about accepting AI-generated synthetic data. Because it is not real data, people wonder whether analyses on synthetic data can yield the same results as analyses on real data. Even when privacy can be guaranteed and accuracy demonstrated, people may be reluctant. This is not surprising, as it is common with technological innovations, but it will have to be addressed; one could think of creating a quality mark for synthetic data.
We want to contribute to creating the necessary standards for data sharing and quality marks for privacy, and we are convinced that the Data Sharing Coalition can be an initiator of this. The first session was already very valuable and inspiring, and we look forward to an active collaboration with all members within the coalition! Any use case in which privacy hinders data sharing is a good candidate for exploring how synthetic data can enable it.