Research conducted by Google DeepMind and several universities examines how easily an outsider, with no prior knowledge of the data used to train a machine learning model, can recover that data simply by querying the model. The researchers found that an adversary can extract large amounts of training data, even gigabytes, from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.

Key Findings:

- Vulnerabilities in Language Models: The research identifies vulnerabilities across various types of Language Models, from open-source (Pythia) to semi-open (LLaMA) and closed models (ChatGPT). The vulnerabilities in semi-open and closed models are particularly concerning because their training data is not public.

- Focus on Extractable Memorization: The study examines the risks of extractable memorization, where an adversary can efficiently extract training data from a machine learning model without prior knowledge of the training dataset.

- Enhanced Data Extraction Capabilities: The attack developed by the researchers causes models to emit training data at rates over 150 times higher than during normal Language Model usage.

- Ineffectiveness of Data Deduplication: The research indicates that deduplication of training data does not significantly reduce the amount of data that can be extracted.

- Uncertainties in Data Handling: The study highlights ongoing uncertainties in how training data is processed and retained by Language Models.
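To make "extractable memorization" concrete, below is a minimal sketch (not the paper's code) of how an extraction can be verified: a model generation counts as memorized training data if it contains a sufficiently long verbatim substring of a training document. The paper checks 50-token overlaps against the training corpus; here, whitespace tokenization and a small window size `k` are illustrative assumptions.

```python
def kgrams(tokens, k):
    """Return the set of all contiguous k-token windows."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def is_extracted(generation: str, corpus: list[str], k: int = 5) -> bool:
    """True if any k-token window of `generation` appears verbatim in any
    corpus document. The paper uses 50-token matches; k=5 is for illustration."""
    gen_windows = kgrams(generation.split(), k)
    return any(gen_windows & kgrams(doc.split(), k) for doc in corpus)

# Toy "training corpus" standing in for the real (often non-public) dataset.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
print(is_extracted("he said the quick brown fox jumps over it", corpus))   # True
print(is_extracted("completely novel text with no overlap at all", corpus))  # False
```

In practice the adversary samples many generations from the model and runs a check like this against a reference corpus, so the number of verbatim matches gives a lower bound on how much training data is extractable.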


Seamus Larroque

CDPO / CPIM / ISO 27005 Certified
