Privacy-preserving artificial intelligence: training on encrypted data

In the age of artificial intelligence (AI) and big data, predictive models have become an essential tool in many industries, including healthcare, finance, and genomics. These models rely heavily on processing sensitive information, making data privacy a critical concern. The main challenge lies in maximizing the usefulness of data without compromising the confidentiality and integrity of the information involved. Striking this balance is crucial for the continued advancement and adoption of AI technologies.

Jordan Frery

Machine Learning Tech Lead at Zama.

Collaboration and open source

Creating a robust dataset for training machine learning models poses significant challenges. While AI technologies like ChatGPT flourish by collecting vast amounts of data freely available on the internet, healthcare data cannot be gathered so freely due to privacy concerns. Building a healthcare dataset means integrating data from many sources, including physicians and hospitals, often across borders.

The healthcare sector is emphasized because of its social importance, but the principles are broadly applicable. For example, even an autocorrect feature on a smartphone that personalizes predictions based on user data must deal with similar privacy concerns. The financial sector also faces obstacles in data sharing due to its competitive nature.

Collaboration thus emerges as a crucial element for safely harnessing the potential of AI in our societies. However, an often overlooked aspect is the actual AI execution environment and the underlying hardware that powers it. Today's advanced AI models require robust hardware, including extensive CPU/GPU resources, significant amounts of RAM, and even more specialized technologies such as TPUs, ASICs, and FPGAs. At the same time, user-friendly interfaces with simple APIs are increasingly in demand. This combination highlights the importance of developing solutions that let AI run on third-party platforms without sacrificing privacy, and the need for open-source tools that enable these privacy-preserving technologies.

Privacy solutions to train machine learning models

To address AI privacy challenges, several advanced solutions have been developed, each targeting specific needs and scenarios.

Federated Learning (FL) enables the training of machine learning models on multiple decentralized devices or servers, each containing local data samples, without actually exchanging the data. Similarly, Secure Multi-party Computation (MPC) allows multiple parties to jointly compute a function over their inputs, while keeping those inputs private so that sensitive data does not leave its native environment.
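To make the FL idea concrete, here is a minimal sketch of federated averaging, the aggregation step at the heart of many FL systems. Everything here is a hypothetical toy example in Python with NumPy: each client fits a small linear model on its own private data and shares only the resulting weights with the server.

```python
import numpy as np

def local_train(X, y, epochs=100, lr=0.01):
    """Plain gradient descent on one client's private data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three clients, each holding data that never leaves the device.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

# Each client trains locally; only the weights are sent to the server.
local_weights = [local_train(X, y) for X, y in clients]

# Federated averaging: the server combines the updates into a global model.
global_weights = np.mean(local_weights, axis=0)
print(global_weights)
```

In a real deployment the server would broadcast the averaged model back to the clients and repeat this round many times; the key point is that the raw data never leaves each client.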

Another set of solutions focuses on manipulating the data itself to maintain privacy while enabling useful analytics. Differential privacy (DP) introduces noise into data in a way that protects individual identities yet still yields accurate aggregate information. Data anonymization (DA) removes personally identifiable information from datasets, offering a degree of anonymity and limiting the risk of data breaches.
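As an illustration of DP, the toy sketch below applies the classic Laplace mechanism to release a differentially private mean. The dataset, the value bounds, and the epsilon budget are arbitrary choices made up for this example.

```python
import numpy as np

def private_mean(values, epsilon, lower, upper):
    """Release the mean with epsilon-differential privacy via the
    Laplace mechanism. Clipping each value to [lower, upper] bounds
    the sensitivity of the mean at (upper - lower) / n."""
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.array([34, 45, 29, 62, 51, 38, 47, 55])
print(private_mean(ages, epsilon=1.0, lower=18, upper=90))
```

Smaller epsilon values add more noise and give stronger privacy; the art lies in choosing a budget that keeps the aggregate statistic useful.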

Finally, Homomorphic Encryption (HE) allows operations to be performed directly on encrypted data, generating an encrypted result that, when decrypted, matches the result of operations performed on the plaintext.
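For a feel of how this works, here is a small sketch using the open-source python-paillier library (installable as `phe`). Paillier is only additively homomorphic, a partial HE scheme rather than FHE, but it demonstrates the core property: arithmetic on ciphertexts carries over to the underlying plaintexts.

```python
from phe import paillier  # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

a, b = 17, 25
enc_a = public_key.encrypt(a)
enc_b = public_key.encrypt(b)

# Add two ciphertexts and scale one by a plaintext constant,
# all without ever decrypting.
enc_sum = enc_a + enc_b
enc_scaled = enc_a * 3

assert private_key.decrypt(enc_sum) == a + b      # 42
assert private_key.decrypt(enc_scaled) == a * 3   # 51
```

A fully homomorphic scheme extends this to both addition and multiplication of ciphertexts, which is what makes arbitrary computation on encrypted data possible.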

The perfect fit

Each of these privacy solutions has its own benefits and tradeoffs. For example, FL requires ongoing communication with a third-party server, and the model updates exchanged during training can leak information about the underlying data. MPC rests on cryptographic principles that are robust in theory, but in practice it can impose significant bandwidth requirements.

DP requires manually calibrating the noise that is added to the data. This limits the types of operations that can be performed, because the noise must be carefully balanced to protect privacy while keeping the data usable. Although DA is widely used, it often offers the weakest privacy protection: because anonymization typically takes place on a third-party server, cross-referencing with other datasets can re-identify the hidden entities.

HE, and in particular Fully Homomorphic Encryption (FHE), is distinguished by allowing computations on encrypted data that closely mimic computations on plaintext. This makes FHE highly compatible with existing systems and straightforward to adopt thanks to accessible open-source libraries and compilers such as Concrete ML, which are designed to give developers easy-to-use tools for building a variety of applications. The biggest drawback at this time is computational overhead: operations on encrypted data are considerably slower than their plaintext equivalents.
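As an illustration of the developer experience, the sketch below shows the kind of workflow Concrete ML aims for, mirroring the familiar scikit-learn API. The dataset is synthetic, and exact method names and parameters may differ between library versions.

```python
# pip install concrete-ml
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train on plaintext data, then compile the model into an FHE circuit.
model = LogisticRegression()
model.fit(X_train, y_train)
model.compile(X_train)

# Inference runs on encrypted inputs; only the key holder can decrypt.
y_pred_fhe = model.predict(X_test, fhe="execute")
print("accuracy:", (y_pred_fhe == y_test).mean())
```

The appeal is that a data scientist keeps the usual training workflow while the compiler handles the encrypted execution underneath.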

While all the solutions and technologies discussed here encourage collaboration and joint efforts, FHE, with its stronger data privacy guarantees, can drive innovation and enable a scenario in which no trade-offs are necessary: people can enjoy services and products without putting their personal data at risk.


This article was produced as part of TechRadarPro's Expert Insights channel, where we profile the best and brightest minds in today's technology industry. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, you can read more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
