News

Staff member

Редактор

Jan 16, 2024

Offline

Jan 16, 2024

Enlarge (credit: Benj Edwards | Getty Images)

Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropic—the maker of ChatGPT competitor Claude—released a research paper about AI "sleeper agent" large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later. "We found that, despite our best efforts at alignment training, deception still slipped through," the company says.

In a thread on X, Anthropic described the methodology in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." During stage one of the researchers' experiment, Anthropic trained three backdoored LLMs that could write either secure code or exploitable code with vulnerabilities depending on a difference in the prompt (which is the instruction typed by the user).

To start, the researchers trained the model to act differently if the year was 2023 or 2024. Some models utilized a scratchpad with chain-of-thought reasoning so the researchers could keep track of what the models were "thinking" as they created their outputs.

Read 4 remaining paragraphs | Comments

Similar threads	Forum	Date
News Russell T. Davies turned to fantasy to make Doctor Who think harder	Overview of computer technology and the Internet.	Yesterday at 3:34 PM
News How often should you turn off your phone? Here’s what the NSA says	Overview of computer technology and the Internet.	Yesterday at 11:49 AM
News Beethoven likely didn’t die from lead poisoning, new DNA analysis reveals	Overview of computer technology and the Internet.	Monday at 5:56 PM
News Bogus Botox poisoning outbreak spreads to 9 states, CDC says	Overview of computer technology and the Internet.	Apr 16, 2024
News Artists can use a data poisoning tool to confuse DALL-E and corrupt AI scraping	Overview of computer technology and the Internet.	Oct 25, 2023