The recent success of large AI-based language models has forced the market to think more ambitiously about how AI can transform many business processes. However, consumers and regulators are increasingly concerned about the security of their data and the AI models themselves. Safe and widespread adoption of AI requires the introduction of AI governance throughout the data lifecycle to ensure trust among consumers, businesses and regulators. But what does it look like?
In most cases, AI models are quite simple: they collect data and draw patterns from it to generate results. Large language models (LLMs) like ChatGPT and Google Bard are no different. For this reason, when managing and governing the deployment of AI models, we must first focus on governing the data on which the models are trained. Data governance requires us to understand the sources, sensitivity and lifecycle of all the data we use. It forms the foundation of every AI governance practice and is essential for mitigating a wide range of business risks.
Risks of training LLMs on sensitive data
Large language models can be trained on proprietary data to address specific business use cases. For example, a company could take ChatGPT and create a private model trained on sales data from the company’s CRM. This model could be deployed as a Slack chatbot to help sales teams answer questions like “How many opportunities has product X won in the last year?” or “Tell me about the opportunity for product Z at company Y.”
One can easily imagine these LLMs being adapted for a range of use cases in customer service, human resources or marketing. They may even expand into legal and medical advice, positioning the LLM as a first-line diagnostic tool for healthcare professionals. The problem is that these use cases require training the LLM on sensitive, proprietary data, which is inherently risky. These risks include:
1. Risks related to privacy and re-identification
AI models learn from training data, but what if that data is private or confidential? A significant amount of data can be used, directly or indirectly, to identify specific individuals. So if we train an LLM on proprietary data about an enterprise’s customers, we can run into situations where consuming that model could leak sensitive information.
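To make the re-identification concern concrete, here is a minimal sketch (not tied to any IBM product) of one common pre-training check: flagging rows whose combination of quasi-identifiers is too rare to be safely anonymous, a simple k-anonymity test. The column names and the CRM extract are hypothetical.

```python
# Minimal sketch: flag re-identification risk in a training extract by checking
# k-anonymity over assumed quasi-identifier columns. Column names are hypothetical.
import pandas as pd

QUASI_IDENTIFIERS = ["zip_code", "birth_year", "gender"]  # assumed quasi-identifiers
K = 5  # smallest group size we are willing to accept

def risky_rows(df: pd.DataFrame, quasi_ids=QUASI_IDENTIFIERS, k: int = K) -> pd.DataFrame:
    """Return rows whose quasi-identifier combination occurs fewer than k times."""
    group_sizes = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return df[group_sizes < k]

if __name__ == "__main__":
    customers = pd.read_csv("crm_export.csv")  # hypothetical CRM extract
    at_risk = risky_rows(customers)
    print(f"{len(at_risk)} of {len(customers)} rows are potentially re-identifiable")
```

Rows flagged this way would need to be generalized, suppressed or excluded before the dataset is used for model training.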
2. Model training data
Many simple AI models have a training phase followed by a deployment phase during which training stops. LLMs are a little different: they consider the context of the conversation, draw conclusions from it and respond accordingly.
This makes managing model inputs much more complex, since we don’t just have to worry about the initial training data; we also have to consider every prompt the model receives. What happens if we provide the model with confidential information during a conversation? Can we recognize that sensitivity and prevent the model from using it in other contexts?
3. Security and Access Risks
To some extent, the sensitivity of the training data determines the sensitivity of the model. While we have well-established mechanisms for controlling access to data (we monitor who is accessing what and dynamically mask data depending on the situation), security for deployed AI models is still evolving. Solutions are emerging in this area, but we cannot yet fully control the sensitivity of a model’s outputs based on the role of the person using it (for example, a model that recognizes that a particular output may be sensitive and reliably changes it depending on who is querying the LLM). For this reason, these models can easily reveal all kinds of sensitive information drawn from the data used to create them.
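For illustration, here is a minimal sketch of the kind of role-based dynamic masking described above, applied to a record before it is shown to a user or passed to a model. The roles, field names and masking rule are assumptions made for the example, not a product API.

```python
# Minimal sketch of role-based dynamic masking applied to records fetched on behalf
# of a user. Roles, fields and the masking rule are illustrative assumptions.
import hashlib

SENSITIVE_FIELDS = {"email", "phone", "salary"}        # assumed sensitive columns
UNMASKED_ROLES = {"data_steward", "privacy_officer"}   # roles allowed to see raw values

def mask_value(value: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token."""
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:10]

def apply_masking(record: dict, role: str) -> dict:
    """Mask sensitive fields unless the caller's role is explicitly trusted."""
    if role in UNMASKED_ROLES:
        return record
    return {
        field: mask_value(str(value)) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }

# Example: the same CRM record rendered for two different roles.
record = {"name": "Acme Corp contact", "email": "jane@acme.example", "deal_size": 120000}
print(apply_masking(record, role="sales_rep"))        # email is tokenized
print(apply_masking(record, role="privacy_officer"))  # raw record
```

The open question for LLMs is how to apply the same idea to model outputs, not just to the records a query returns.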
4. Intellectual Property Risk
What happens if we train a model on every Drake song and the model starts generating Drake imitations? Is the model violating Drake’s rights? Can Drake prove that the model copied his work in some way?
Regulators are still working through this question, but it could easily become a major problem for any form of generative AI that learns from artistic intellectual property. We anticipate that this will lead to serious legal action in the future, which will need to be mitigated by appropriate intellectual property controls on all data used in training.
5. Consent and DSAR risks
One of the key concepts in modern data protection regulations is consent. The customer must agree to the use of their data and be able to request that their data be deleted. This presents a unique problem when using artificial intelligence.
When you train an AI model on sensitive customer data, that model becomes a potential source of exposure for that data. If a customer withdraws consent to a company’s use of their data (a GDPR requirement) and the company has already trained a model on it, the model would essentially have to be decommissioned and retrained without access to the deleted data.
For LLMs to be usable as enterprise software, the training data must be governed so that companies can trust the security of that data and maintain an audit trail of how the LLM uses it.
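One way to support such an audit trail is to record, for every training run, which dataset snapshots and data subjects were involved, so that a consent withdrawal can identify the models that need retraining. The sketch below is a hypothetical illustration; the class names and in-memory storage are assumptions, not any specific product’s API.

```python
# Minimal sketch of a training-data audit trail: record which dataset snapshots and
# data subjects fed each model version, so a consent withdrawal (DSAR) can identify
# the models that must be retrained. All names here are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TrainingRun:
    model_name: str
    model_version: str
    dataset_snapshots: list                          # e.g. ["crm_contacts@2024-05-01"]
    subject_ids: set = field(default_factory=set)    # data subjects represented in the run
    trained_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class AuditTrail:
    def __init__(self):
        self.runs: list[TrainingRun] = []

    def record(self, run: TrainingRun) -> None:
        self.runs.append(run)

    def models_affected_by(self, subject_id: str) -> list[str]:
        """Models that must be retrained if this subject withdraws consent."""
        return [f"{r.model_name}:{r.model_version}"
                for r in self.runs if subject_id in r.subject_ids]

trail = AuditTrail()
trail.record(TrainingRun("sales-assistant", "1.2",
                         ["crm_contacts@2024-05-01"], {"cust-001", "cust-042"}))
print(trail.models_affected_by("cust-042"))  # -> ['sales-assistant:1.2']
```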
Data governance for LLMs
The best analysis of LLM application architecture I’ve seen comes from an article by a16z. It’s really well done, but as someone who spends all my time on data management and privacy, there’s something missing in the top left, between Contextual Data and Data Pipeline: data governance.
IBM’s data governance solution, based on IBM Knowledge Catalog, provides several capabilities that enable advanced data discovery, automated data quality and data protection. With it, you can:
Automatically discover data and add business context to ensure consistent understanding
Create an auditable dataset by cataloging data to enable self-service data discovery
Proactively identify and protect sensitive data to meet regulatory and privacy requirements
The last capability above is often overlooked: applying a privacy-enhancing technique. How do you remove sensitive content from data before submitting it to AI?
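As a simple illustration of that step, the sketch below redacts a few obvious PII patterns from a prompt before it is sent to an external model. The regex patterns are assumptions made for the example and are far from exhaustive; a production deployment would rely on catalog-driven classification rather than hand-written rules.

```python
# Minimal sketch of a privacy-enhancing step: redact obvious PII patterns from a
# prompt before it is sent to an external model. The patterns are illustrative only.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace matches of each PII pattern with a labelled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

user_prompt = "Summarize the deal with Jane Doe (jane.doe@acme.example, 555-867-5309)."
print(redact(user_prompt))
# -> "Summarize the deal with Jane Doe ([EMAIL], [PHONE])."
```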
Build a governed foundation for generative AI with IBM watsonx and Data Fabric
With IBM watsonx, IBM has made rapid progress in putting the power of generative AI in the hands of AI developers. IBM watsonx.ai is an enterprise-grade studio that combines traditional machine learning (ML) with new generative AI capabilities powered by foundation models. watsonx also includes watsonx.data, a purpose-built data store based on an open lakehouse architecture, which leverages querying, governance and open data formats to access and share data across the hybrid cloud.
A robust data foundation is essential for successful AI implementations. IBM Data Fabric enables customers to build the right data infrastructure for AI, leveraging data integration and governance capabilities to collect, prepare and organize data before making it easily accessible to AI developers.
IBM offers a composable data fabric solution on an open and extensible data and AI platform that can be deployed on third-party clouds. This solution includes data management, data integration, data observability, data lineage, data quality, entity discovery, and privacy management capabilities.
Start managing data for enterprise AI
AI models, including LLMs, will be among the most transformative technologies of the next decade. As new AI regulations impose guidelines on the use of AI, it is important not only to manage and govern the AI models themselves but, just as importantly, to govern the data fed into the AI.