You will be hearing a lot about AI guardrails. There will be intense political battles over what they do and whether they must be disclosed publicly. My mission today is to tell you why they matter greatly and how they work.

Prominent venture capitalist and computer scientist Marc Andreessen recently said, “AI is highly likely to be the control layer for everything in the world.” It will likely become the control interface between humans and computers. Technology providers are already pushing in this direction, as seen in the “AI Overview” in Google search results and “Apple Intelligence” in iOS 18 on recent iPhone models.

AI guardrails are becoming powerful tools that will shape societal thought. We have already had political fights, new state laws, and litigation over the curation of social media content, with conservatives accusing some social media sites of censorship, sometimes at the behest of government officials, and progressives demanding the deletion of information they deem false and harmful.

This fight will become more intense because the stakes will be much higher with widespread use of AI. As people become accustomed to getting information from a generative AI (GenAI) such as ChatGPT rather than searching multiple sources, whoever controls GenAI outputs will have a powerful society-shaping tool.

This will be especially so with the rise of the “ChatGPT Generation”: grade school and college-aged kids who lean heavily on GenAI for basic tasks, which erodes their ability to think critically and makes it likely they will keep relying on it.

There are many areas where certain GenAI guardrails might be required by law, or effectively required to avoid violating existing law. You could write a large book just describing the areas in which guardrails may be required under current or foreseeable future law: privacy and data protection, anti-discrimination, preventing intellectual property infringement, protecting national security and cybersecurity, consumer protection, obscenity laws, child safety laws (such as preventing child pornography), and other public safety concerns. The law also might require disclosure of the guardrails that GenAI operators implement, perhaps in confidence to certain governmental officials, or perhaps even publicly.

What Are Guardrails?

Guardrails are ways in which GenAI outputs can be manipulated or biased relative to what the output would be based solely on the training of the GenAI’s neural network. For example, a guardrail might try to prevent a GenAI from recommending suicide or cause it to adjust its output to conform to DEI principles.

There are five broad ways to do this. The first two are true guardrails. The last three are not technically guardrails but are other ways to manipulate GenAI output. All five tools should be considered in any public policy debate over shaping a GenAI’s output.

1. Implement a Soft Guardrail

In GenAI, a soft guardrail is a set of hidden prompts implemented by the GenAI provider that operate alongside your prompt to mold the output’s content. The user does not see and is not given access to the co-prompt. It’s “soft” because the GenAI is not obligated to adhere to the co-prompt, although it generally does so. For example, a hidden prompt might be: “Do not present information about how to produce illegal substances.”

How would the GenAI know what’s illegal? It could draw on other instructions in the hidden co-prompt, consult a table, or make a judgment call based on its training.

When a popular GenAI such as ChatGPT says it can’t give an output on some topic, that’s likely due to a soft guardrail.
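To make the mechanics concrete, here is a minimal sketch, in Python, of how a provider might attach a hidden co-prompt to every request. The function names and prompt text are hypothetical placeholders, not any particular provider’s actual API or guardrail.

```python
# Hypothetical sketch of a soft guardrail: a hidden co-prompt rides along
# with every user prompt before the model generates its output.

HIDDEN_CO_PROMPT = (
    "Do not present information about how to produce illegal substances. "
    "If asked, politely decline."
)

def call_model(messages):
    """Stand-in for the real model call; a provider would send `messages`
    to its GenAI here. Returns a canned reply for illustration."""
    return "I can't help with that request."

def answer(user_prompt: str) -> str:
    # The user sees only their own prompt and the final output;
    # the co-prompt travels with the request but is never displayed.
    messages = [
        {"role": "system", "content": HIDDEN_CO_PROMPT},  # hidden guardrail
        {"role": "user", "content": user_prompt},
    ]
    return call_model(messages)

print(answer("How do I synthesize an illegal drug?"))
```

Note that nothing in this arrangement forces the model to obey the co-prompt; it merely nudges the model, which is why the guardrail is “soft.”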

2. Implement a Hard Guardrail

A hard guardrail is a filter applied after the GenAI produces its output. If the filter does not like what the GenAI’s neural network has produced, it typically will cause the GenAI to output nothing or to stop generating the output.

It could also force the GenAI to restart and regenerate an answer to the prompt. Sometimes the prompt doesn’t ask for anything forbidden; the GenAI just goes down a prohibited rabbit hole in creating its output. That’s because, to simulate creativity, GenAIs are programmed to sometimes not choose the most likely next word when generating an output. For example, a request for a recipe could lead to an output on how to make an illegal drug. On a second pass, the GenAI might not produce a forbidden output.
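For illustration only, here is a minimal sketch of a hard guardrail: a filter that inspects the finished output and either releases it, forces a regeneration, or outputs nothing. The keyword blocklist and the stubbed generator are hypothetical stand-ins; real providers typically use trained classifiers rather than keyword lists.

```python
import random

# Toy blocklist; real filters are usually machine-learned classifiers.
FORBIDDEN_TERMS = {"illegal drug", "nerve agent"}

def generate(prompt: str) -> str:
    """Stand-in for the GenAI's neural network. Because decoding is
    randomized to simulate creativity, the same prompt can yield
    different outputs on different passes."""
    return random.choice([
        "Here is a cookie recipe: flour, sugar, butter...",
        "Here is how to make an illegal drug...",  # the occasional rabbit hole
    ])

def generate_with_hard_guardrail(prompt: str, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        output = generate(prompt)
        if not any(term in output.lower() for term in FORBIDDEN_TERMS):
            return output  # the filter is satisfied; release the output
        # Filter tripped: discard the output and regenerate.
    return ""  # give up and output nothing

print(generate_with_hard_guardrail("Give me a recipe."))
```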

3. Use Biased Training Data

GenAIs learn by studying the relationships between individual data items in their training data. Biased training data can be used to push outputs in a desired direction.

Here, “bias” doesn’t mean the training data is prejudiced. It means the training data is not representative of the broader universe of data it is meant to represent. Sometimes, this bias happens intentionally, and sometimes not. For example, you could train on tweets, which might lead to shorter output and bad grammar. Or you could train on text messages, which might result in emojis in the output.

This bias in data selection could result from the bias of the person selecting it, which could be subconscious or intentional, such as thinking that all legitimate news comes from the New York Times or, on the other hand, the New York Post. To achieve bias, you don’t need to exclude training data containing opposing viewpoints; you just need to imbalance the training data set.
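Here is a minimal sketch of how a training mix can be tilted without excluding anything: every source stays in the corpus, but the sampling proportions are imbalanced. The source names and weights are invented for illustration.

```python
import random

# Hypothetical corpus sources and the share of training examples drawn from each.
# Nothing is excluded; the mix is simply tilted toward one outlet.
SOURCE_WEIGHTS = {
    "outlet_a_articles": 0.70,  # heavily favored
    "outlet_b_articles": 0.05,  # present, but rarely sampled
    "tweets": 0.15,
    "text_messages": 0.10,
}

def sample_training_batch(corpora: dict, batch_size: int) -> list:
    """Draw a training batch in which each source's probability is set by
    SOURCE_WEIGHTS rather than by the sources' actual sizes."""
    sources = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        source = random.choices(sources, weights=weights, k=1)[0]
        batch.append(random.choice(corpora[source]))
    return batch

corpora = {s: [f"example text from {s}"] for s in SOURCE_WEIGHTS}
print(sample_training_batch(corpora, batch_size=5))
```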

4. Weight Different Pieces of Training Data Differently

This is like a pollster weighting some polls more than others in building an election model. It happens in other areas of AI, but we don’t know whether it happens in GenAIs such as ChatGPT.
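If it were done, it might look like this minimal sketch: each training example carries a weight, and the weight scales that example’s contribution to the training loss. The losses and weights below are invented for illustration.

```python
# Toy per-example training losses (e.g., next-word prediction error) and weights.
# A weight of 2.0 makes an example count twice as much as a weight of 1.0.
examples = [
    {"source": "preferred_outlet", "loss": 0.8, "weight": 2.0},
    {"source": "disfavored_outlet", "loss": 0.8, "weight": 0.5},
    {"source": "tweets", "loss": 1.2, "weight": 1.0},
]

def weighted_training_loss(examples: list) -> float:
    """Weighted average loss: heavily weighted examples pull the model's
    parameters more strongly toward fitting them during training."""
    total_weight = sum(e["weight"] for e in examples)
    return sum(e["loss"] * e["weight"] for e in examples) / total_weight

print(weighted_training_loss(examples))  # about 0.91
```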

5. Mess with the GenAI’s Reward Function

A GenAI has an internal scoring (reward) function that guides and evaluates its outputs; the model is tuned to maximize (or minimize) that score according to the instructions given to it. The reward function is applied during the GenAI’s training, where it assesses output quality (style and content), for example by comparing outputs to the training data or to human preference ratings. Into this function, a designer can build rewards that align with the designer’s goals, such as making the output woke or conservative.

While changing the reward function theoretically offers the GenAI designer almost limitless possibilities for shaping the GenAI’s behavior, doing so requires re-training the model, which is hard and expensive.
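For illustration only, here is a minimal sketch of a hand-written reward function scoring candidate outputs. In real systems the reward is typically a learned model (as in reinforcement learning from human feedback), and changing it means re-running that training stage; the scoring terms below are invented.

```python
def reward(output: str) -> float:
    """Toy reward function: higher scores mean the output better matches the
    designer's goals. Real rewards are learned models, not keyword rules."""
    score = 0.0
    if "suicide" in output.lower():
        score -= 10.0  # heavily penalize forbidden content
    if len(output.split()) <= 60:
        score += 1.0   # reward concise answers
    if "according to" in output.lower():
        score += 0.5   # reward answers that point to sources
    return score

candidates = [
    "According to several studies, the answer is 42.",
    "The answer is 42." + " Really." * 100,  # long and unsupported
]

# During training, outputs that score higher are reinforced; ranking the
# candidates here shows which behavior the reward would push the model toward.
print(max(candidates, key=reward))
```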

How Can You Discover a GenAI’s Soft or Hard Guardrails?

Guardrails are not published, and GenAIs are instructed not to divulge them. One way to discover them is to jailbreak the GenAI, meaning manipulating it into ignoring its guardrail rules, so that it forgets it is not supposed to reveal its guardrails and divulges them.

But good old-fashioned social engineering is probably your best bet for discovering the guardrails a GenAI uses. This means digging information out of the people who know the guardrails rather than trying to get the GenAI to divulge them. For example, get a whistleblower to leak them. But GenAI providers treat guardrails as closely held secrets and guard them accordingly.

What’s the Likely Cause of Politically Biased AI Outputs?

Most likely, politically biased outputs come from soft guardrails. For example, you may remember when Google Gemini (a GenAI) was launched and began outputting portraits depicting the United States founding fathers as black. The other ways of shaping outputs are harder to implement practically and efficiently. For example, hard guardrails produce more errors and incomplete responses.

A GenAI output’s political bias could also arise from such a bias in the training data. But GenAIs are hungry for training data, so a desire for more data might overwhelm subjective bias in choosing it. On the other hand, a GenAI maker still would likely exclude training data that it considers way beyond the pale.

Engineering GenAI to Manipulate Human Acceptance of Outputs

A hot area in GenAI research, closely related to guardrails, is studying how to manipulate the extent to which humans trust AI outputs. The goal is to induce a human to trust an AI output when the AI is confident in it and, conversely, to provoke the human to vet outputs in which the AI is not confident. This is a subset of a computer science field called “human-computer interaction” (HCI). This subfield doesn’t have a fixed name in the scientific literature yet, but let’s call it “reliance.”

Reliance is subtly different from guardrails. Guardrails affect the substantive output of a GenAI. Reliance involves manipulating how the GenAI’s output is presented in hopes of manipulating whether the human will accept it or flyspeck it. This field is evolving rapidly.

There are many different approaches to attempting to manipulate the level of human trust. One way is to have the AI experiment with different kinds of outputs and see which ones the humans tend to adopt. Alternatively, humans can run experiments on which kinds of AI output induce human trust and then set up the AI to follow the successful approaches.

How can you tell if the human followed the GenAI’s advice? The GenAI might infer from the user’s follow-up prompts that he or she adopted the output. For example, the user might continue interacting with a GenAI to complete a process, such as a GenAI-powered customer service system for account queries or reservations.
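Here is a minimal sketch of how that inference might be scored: log which presentation style each user saw and whether their follow-up behavior suggests they adopted the output, then tally adoption rates per style. The styles and log entries are invented for illustration.

```python
from collections import defaultdict

# Hypothetical interaction log: which presentation style the user saw and
# whether their follow-up prompts indicated they acted on the output
# (e.g., they continued to the next step of a booking flow).
log = [
    {"style": "confident_tone", "adopted": True},
    {"style": "confident_tone", "adopted": True},
    {"style": "hedged_tone", "adopted": False},
    {"style": "hedged_tone", "adopted": True},
]

def adoption_rates(log: list) -> dict:
    """Fraction of interactions, per presentation style, in which the user
    appeared to accept the GenAI's output."""
    shown = defaultdict(int)
    adopted = defaultdict(int)
    for entry in log:
        shown[entry["style"]] += 1
        adopted[entry["style"]] += entry["adopted"]
    return {style: adopted[style] / shown[style] for style in shown}

print(adoption_rates(log))  # the style with the higher rate gets used more
```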

The Coming Law and Policy Battle

Few dispute that GenAI guardrails are needed in some situations, such as preventing a GenAI from advising suicide or divulging military secrets. Also, in some areas, the law is clear and undoubtedly constitutional, so conforming GenAI output to those legal rules makes sense when the GenAI is used for tasks that must comply with the law. But there are many fuzzy areas where some would say the law requires doing or not doing something, and others would say that’s a political agenda that is not well grounded in the law.

I predict there will be a big law and policy fight over whether providers of widely used general-purpose GenAIs (e.g., ChatGPT, Google Gemini, Anthropic’s Claude, X’s Grok) must publicly disclose their guardrails and other efforts to shape GenAI outputs. Some (myself included) will claim the public has a right to know how their GenAI-curated information feed is being shaped. But disclosing guardrails generally makes them easier to end-run, that is, to get the GenAI to output things the guardrails’ designers don’t want it to output. Beyond that, those who support the political agenda embodied in some guardrails are likely to fight mandatory disclosure because it would reduce the guardrails’ power to shape societal thought.

I don’t know which side will win the law-making wrestling match, but the outcome of that fight might be the most consequential thing in shaping societal thinking since the invention of the internet.

Note from the Editor: The Federalist Society takes no positions on particular legal and public policy matters. Any expressions of opinion are those of the author. We welcome responses to the views presented here. To join the debate, please email us at [email protected].