The Value Alignment Project seeks to design methods for preventing AI systems from inadvertently acting in ways inimical to human values.
AI systems will operate with increasing autonomy and capability in complex real-world domains. How can we ensure that they have the right behavioural dispositions – the goals or ‘values’ needed to ensure that things turn out well, from a human point of view?
Stuart Russell has called this the value alignment problem. The project is led by teams at the Future of Humanity Institute at the University of Oxford and the Centre for Human-Compatible Artificial Intelligence at UC Berkeley.
The Future of Humanity Institute takes an interdisciplinary approach that encompasses techniques from machine learning, theoretical computer science, decision theory, and analytic philosophy. Examples include lines of research aiming to modify reinforcement learning agents to be ‘interruptible’ (such that they do not resist attempts to shut them down) or ‘active’ (such that agents must incur a cost to observe their rewards).
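The ‘interruptible’ line of work can be illustrated with a toy sketch (hypothetical code, not the project's own): off-policy Q-learning is safely interruptible in the sense that an overseer overriding the agent's actions does not bias the values the agent learns, because each update bootstraps from the best next action rather than the action actually taken.

```python
import random

def train(n_episodes=500, interrupt_prob=0.3, seed=0):
    """Toy Q-learning in a 5-state corridor with random interruptions."""
    rng = random.Random(seed)
    n_states, goal = 5, 4
    actions = (+1, -1)                    # move right / move left
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    alpha, gamma, eps = 0.5, 0.9, 0.1
    for _ in range(n_episodes):
        s = 0
        for _ in range(100):              # cap episode length
            if s == goal:
                break
            # Epsilon-greedy choice by the agent.
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            # Interruption: an overseer overrides the action, forcing the
            # agent left. Because the update below bootstraps from
            # max_b Q(s', b) rather than the action taken, the override
            # does not bias the learned values, so the agent gains no
            # incentive to resist it.
            if rng.random() < interrupt_prob:
                a = -1
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == goal else 0.0
            q[(s, a)] += alpha * (
                r + gamma * max(q[(s2, b)] for b in actions) - q[(s, a)]
            )
            s = s2
    return q

q = train()
# Despite frequent forced-left interruptions, the greedy policy still
# heads right toward the goal from every non-goal state.
policy = {s: max((+1, -1), key=lambda act: q[(s, act)]) for s in range(4)}
print(policy)
```

The environment, constants, and reward scheme here are illustrative assumptions; the point is only that the interruption mechanism leaves the off-policy value estimates intact.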
Led by Professor Stuart Russell, the Centre for Human-Compatible Artificial Intelligence is developing new theoretical and empirical approaches to address several core issues in this area.
Additional lines of research in technical AI safety can be found in research agendas such as those from Google Brain.