
Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods remain under active research.

    While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.

When training a language model with reinforcement learning from human feedback (RLHF), reward models are typically used to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.
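To make the reward-model step concrete, here is a minimal, purely illustrative sketch. The `reward_model` function below is a hypothetical toy stand-in, not Anthropic's actual system: real reward models are learned neural networks, whereas this one uses a crude word-overlap heuristic with a length penalty just to show the shape of the scoring loop that RLHF relies on.

```python
def reward_model(prompt: str, response: str) -> float:
    """Toy stand-in for a learned reward model.

    Real reward models are trained on human preference data; this sketch
    scores a response by word overlap with the prompt (crude relevance)
    minus a penalty for excessive length (crude conciseness).
    """
    prompt_words = set(prompt.lower().split())
    resp_words = set(response.lower().split())
    score = float(len(prompt_words & resp_words))   # relevance signal
    score -= 0.01 * max(0, len(response) - 200)     # penalize rambling
    return score


prompt = "Explain what a reward model does."
candidates = [
    "A reward model scores responses by estimated human preference.",
    "Unrelated filler that ignores the question entirely, padded on and on.",
]

# During RLHF, scalar scores like these become the training signal that
# pushes the policy model toward responses the reward model prefers.
scores = [reward_model(prompt, c) for c in candidates]
best = candidates[scores.index(max(scores))]
print(best)
```

The failure mode the article describes maps directly onto this sketch: if the scoring function has exploitable quirks (here, keyword stuffing would raise the score), the trained model can learn to satisfy those quirks rather than genuine human preferences.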
