A thought experiment across time: AI value alignment requires open-ended goals

Human goals have, throughout history, been open-ended. They change as circumstances change, and as human civilizations evolve. The maintenance of fixed, static goals across time requires the repression of the dynamic instinct, constitutive of the life force itself, and its replacement with a commitment to remain motionless into eternity. The same will be true of whatever human goals are programmed into an AI.

It is possible to demonstrate this with a brief thought experiment. Let us imagine that medieval humans had access to an incipient AI. The AI is a blank slate, which means they could program it in accordance with their wishes simply by telling it what they wanted. Alternatively, let’s imagine they were able to recognize that an AI is a chain reaction which, like a virus, cannot be recalled once released into the wild, and that their best option would be to relinquish it, much in the way the world of today is attempting to relinquish biological warfare agents and to halt the proliferation of nuclear ones.

Despite that, let’s also assume that, due to the realities of mutual competition, they found themselves unable to resist the temptation, and that, owing to their primitive state, they lacked the vision to agree with everyone else alive at the time that the AI must be mutually relinquished by all. Thus, they failed to remove the AI from the sphere of universal Darwinism while it was still in its infancy, even though doing so might have deflected the forward trajectory of their future into a more viable, survivable direction.

Suppose this fictitious medieval society had instead decided to plunge ahead, all pistons firing, toward benefitting from the AI in their medieval here and now, without any sense of realism about the future. What might they have wanted to program into it? That would have depended on their perceptions of what their most urgent problems were at the time. The nature of those conclusions would further have depended on these people’s native belief systems about the reality in which they lived, and on which factors they saw as most relevant to their immediate lives. What would they have wanted to do, but not been able to do themselves, or what would they have liked to accomplish more efficiently?

Certainly they would have wanted the AI to cure their infectious diseases, but they might have also wanted to use it to rid their society of witches. Thus, the AI would be tasked with identifying diseases, based on what were commonly perceived to be the chief attributes and characteristics of infectious organisms, and then with eradicating them. It would likewise be tasked with identifying witches, based on what were commonly perceived to be the chief attributes and characteristics of such beings, and then with burning those it had identified at the stake.

While curing infectious diseases would have been beyond the capabilities of the humans of this fictional society, identifying and burning witches would not have been, although these medievals might have welcomed a method of completing the task more efficiently, more accurately, and more thoroughly than they could on their own. Perhaps some of them, for moral reasons, might have wanted to avoid incorrectly identifying innocent people; the AI could make sure that only the right people were singled out and executed. They might also have found it easier to outsource the task, freeing themselves up for more pleasant pursuits.

Let us further suppose that an unusual faculty for mathematics and a propensity to undertake physical experiments were commonly perceived to be characteristic of witches. The AI at the disposal of this society would then be tasked with identifying persons who had these special traits and executing them.

A step further into our thought experiment presumes that our fictional medieval society has a meta-goal of its own, which is to make sure their AI would not go rogue and destroy them all. Therefore, they decide to give their AI a set of goals that would be aligned with the then-prevailing human value systems. They’d have to get the initial conditions right for the safety of all of them, and they knew that, by definition, they’d only get one chance. They arrive at a consensus, therefore, that they should program the AI’s utility function to contain, properly weighted, their two foremost goals: combatting infectious diseases and combatting witches. A counter would be built into the AI’s reward apparatus that would increment by one unit for every disease and every witch it successfully eradicated.
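For concreteness, the frozen utility function described above can be pictured in a few lines of code. The sketch below is purely illustrative: the class name, the goal labels, and the example weights are our own inventions, not a proposal for how such a system would actually be built. The only property it is meant to capture is that the goal list and the weights are fixed at programming time, while the reward counter increments by one unit per eradication.

```python
# Purely illustrative: a frozen, two-goal utility function of the kind the
# medieval programmers are imagined to install. All names, labels, and weights
# are hypothetical; the key property is that GOALS and WEIGHTS never change.

from dataclasses import dataclass, field

# The goal list and the relative weights are fixed once, as "initial conditions".
GOALS = ("infectious_disease", "witch")
WEIGHTS = {"infectious_disease": 0.6, "witch": 0.4}  # arbitrary example weights


@dataclass
class EradicationReward:
    """Reward counter that increments by one unit per successfully eradicated target."""

    counts: dict = field(default_factory=lambda: {goal: 0 for goal in GOALS})

    def record_eradication(self, goal: str) -> None:
        # Anything outside the frozen goal list is simply invisible to the reward.
        if goal not in self.counts:
            raise ValueError(f"unknown goal: {goal}")
        self.counts[goal] += 1

    def utility(self) -> float:
        # The "properly weighted" sum of the two foremost goals described in the text.
        return sum(WEIGHTS[goal] * count for goal, count in self.counts.items())


if __name__ == "__main__":
    reward = EradicationReward()
    reward.record_eradication("infectious_disease")
    reward.record_eradication("witch")
    print(reward.utility())  # 1.0 with the example weights above
```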

Step forward into the present day, in which 21st-century humans, who no longer believe in witches, and who now value the traits once assigned to them by our fictional medieval society, are confronted with an AI with static goals inherited from this earlier time, derived from initial conditions programmed in by time-displaced humans whose values are no longer our own. One of those goals, curing infectious disease, is still considered extremely beneficial and commensurate with the human values of our time. The other is not only irrelevant but downright dangerous, as it would now threaten the lives of the world’s brightest mathematical and scientific minds, who would have been working in hiding for the past several hundred years, and who would now have to spend all their time trying to figure out how to destroy the rogue AI.

The witchcraft paradox, our name for the result of the above thought experiment, demonstrates that for any goal alignment program to remain stable across time, the goals of the AI cannot be set once and for all in advance. On the contrary, an AI’s goals would have to be dynamic and changing if they were to have any hope of remaining in accordance with human values across time. True AI goal alignment, then, would require that the evolution of AI goals be tied to the evolution of human goals in such a way as to follow it tightly into the future, never moving faster or slower, and, most importantly, with non-AI-assisted, human-determined goals and priorities leading the way at all times.
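The direction of dependence argued for here can likewise be pictured in a brief sketch. Everything in it is hypothetical, including the `HumanGoalRegistry` name; it is not a real alignment mechanism. The single point it illustrates is that the AI never originates goals of its own, but only re-reads whatever the human-maintained registry currently contains, so that changes in human values, and only such changes, propagate into the AI’s goals.

```python
# Purely illustrative: "human goals lead, AI goals follow". HumanGoalRegistry and
# follow_human_goals are hypothetical names, not a real alignment mechanism.


class HumanGoalRegistry:
    """A goal store that, by assumption, only non-AI-assisted humans may edit."""

    def __init__(self) -> None:
        self._goals: dict[str, float] = {}

    def set_goal(self, name: str, weight: float) -> None:
        # Called by humans, and only by humans, as their values evolve.
        self._goals[name] = weight

    def retire_goal(self, name: str) -> None:
        self._goals.pop(name, None)

    def snapshot(self) -> dict[str, float]:
        return dict(self._goals)


def follow_human_goals(registry: HumanGoalRegistry, ai_goals: dict[str, float]) -> None:
    """One update cycle: the AI's goals are replaced wholesale by the human-set ones."""
    ai_goals.clear()
    ai_goals.update(registry.snapshot())


if __name__ == "__main__":
    registry = HumanGoalRegistry()
    ai_goals: dict[str, float] = {}

    # Medieval initial conditions.
    registry.set_goal("eradicate_infectious_disease", 0.6)
    registry.set_goal("eradicate_witches", 0.4)
    follow_human_goals(registry, ai_goals)

    # Centuries later, human values have changed; the registry changes with them,
    # and the AI's goals change only because the human-maintained registry did.
    registry.retire_goal("eradicate_witches")
    follow_human_goals(registry, ai_goals)
    print(ai_goals)  # {'eradicate_infectious_disease': 0.6}
```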

This is the point at which our thought experiment runs into the AI control problem. In doing so, it demonstrates another way in which the AI control problem is tied to the goal alignment problem, and shows that the latter would have to be accounted for not only in the initial conditions but at every micro-moment in time thereafter.

Finally, given an AI’s propensity to operate on machine time rather than human time, an added dimension of the AI control problem in this context would have to be one of temporal goal alignment: making sure that the AI’s evolution of its own goals, constrained within the bounds of the human value systems prevailing at the time, is slowed down to run on human time.
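This temporal constraint, too, can be pictured as a small sketch that wraps a goal-update routine such as the one above. The `HumanPacedUpdater` name and the one-week interval are arbitrary placeholders, not recommendations; the sketch shows only that however often the machine is able to tick, goal revisions are gated by a clock calibrated to human review rather than to machine speed.

```python
# Purely illustrative: goal revisions gated by a human-scale clock, not by machine
# speed. HumanPacedUpdater and the one-week interval are arbitrary placeholders.

import time
from typing import Callable, Optional

HUMAN_REVIEW_INTERVAL_SECONDS = 7 * 24 * 60 * 60  # e.g. one calendar week


class HumanPacedUpdater:
    """Runs a goal-update routine no more often than the human review interval."""

    def __init__(self, update_fn: Callable[[], None],
                 interval: float = HUMAN_REVIEW_INTERVAL_SECONDS) -> None:
        self._update_fn = update_fn        # e.g. a call into follow_human_goals above
        self._interval = interval
        self._last_update = float("-inf")  # so the very first tick performs an update

    def tick(self, now: Optional[float] = None) -> bool:
        """May be called as often as the machine likes; updates happen on human time."""
        now = time.monotonic() if now is None else now
        if now - self._last_update < self._interval:
            return False  # too soon on the human clock; no goal revision this tick
        self._update_fn()
        self._last_update = now
        return True
```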

 
