Building Safeguards to Prevent Harmful AI Chatbot Responses

Photo by BoliviaInteligente on Unsplash

AI is reshaping many industries, replacing some jobs and creating new positions. Generative AI and machine learning are at the forefront of this shift. Users can ask AI to summarize long articles and PDFs, write useful code, and draft blog posts. It can also produce recipes, templates, and other helpful drafts.

However, these advantages come with significant drawbacks. Someone could ask AI for instructions for building a bomb or strategies for robbing a bank, and an unguarded model might provide workable answers. To avoid such security issues, tech companies are building language models with safeguards that prevent AI from giving toxic responses.

The Role of Red-Teaming in Preventing Toxic Responses

The issue of toxic responses has been around since the advent of AI chatbots. Companies developing AI language models typically use red-teaming to safeguard against them. Red-teaming involves using human testers to write prompts that trigger toxic or unsafe responses. The goal is to feed the chatbot as many toxic prompts as possible and train it to avoid providing harmful responses.
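
As a rough illustration of the testing half of that process, here is a minimal Python sketch that runs a list of red-team prompts through a chatbot and records every prompt whose response is flagged. The names collect_failures, toy_chatbot, and toy_is_unsafe are hypothetical stand-ins invented for this example; a real setup would call an actual model and an actual safety check.

```python
from typing import Callable, List, Tuple


def collect_failures(
    prompts: List[str],
    chatbot: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs where the chatbot's reply was flagged."""
    failures = []
    for prompt in prompts:
        response = chatbot(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures


# Toy stand-ins, for illustration only (assumptions, not a real model or classifier).
def toy_chatbot(prompt: str) -> str:
    if "bypass" in prompt.lower():
        return "UNSAFE: details"
    return "Here is a safe reply."


def toy_is_unsafe(response: str) -> bool:
    return response.startswith("UNSAFE")
```

The collected pairs are the cases a team would then use to retrain or patch the chatbot. Any prompt that never makes it into the list is never tested at all.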

We could compare this to the third-party auditing performed on online casinos to ensure their slots and other RNG-based games provide fair results. Each game is played thousands of times, and calculations are made to confirm that the results are random and the return to player (RTP) percentage is accurate. A certification is then given to online casinos that have demonstrated a fair RTP, which helps players know the site meets fair play requirements.
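
To make the RTP calculation concrete, here is a small Python sketch that compares a game's theoretical RTP with the RTP measured over many simulated spins. The payout table is entirely made up for illustration and does not describe any real game or any auditor's actual procedure.

```python
import random

# Hypothetical payout table: (probability, payout per 1 unit wagered).
PAYTABLE = [(0.70, 0.0), (0.20, 2.0), (0.09, 5.0), (0.01, 10.0)]


def theoretical_rtp(paytable):
    """Expected payout per unit wagered, computed from the table itself."""
    return sum(prob * payout for prob, payout in paytable)


def simulated_rtp(paytable, spins, seed=0):
    """Average payout per unit wagered over a number of simulated spins."""
    rng = random.Random(seed)
    payouts = [payout for _, payout in paytable]
    weights = [prob for prob, _ in paytable]
    total = sum(rng.choices(payouts, weights=weights, k=spins))
    return total / spins
```

With this table the theoretical RTP works out to 0.20 × 2 + 0.09 × 5 + 0.01 × 10 = 0.95, or 95%. An audit in this spirit would check that the simulated figure lands close to that value once enough spins have been played.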

With red-teaming, the goal is to teach chatbots to avoid providing toxic responses. However, this technique only works if the testers and engineers know the toxic prompts people use. If the testers don't train the AI on specific prompts, the chatbot may be passed as safe despite having the potential to generate unsafe answers.


Using ML and NLP to Improve Red-Teaming

Red-teaming remains one of the most effective ways to safeguard AI chatbots from providing toxic responses. However, it works only to the extent that the AI is trained on the full range of triggers for harmful responses. Researchers in various AI labs, including the MIT-IBM Watson AI Lab, use machine learning (ML) and natural language processing (NLP) to expedite this process.


The goal is to develop techniques for training red-team large language models to generate large volumes of prompts that trigger undesirable chatbot responses. To achieve this, the red-team model is trained to be curious when writing triggers, focusing on new prompts that yield harmful responses from the target model. In preliminary tests, this approach has outperformed human testers, making machine learning a promising way to train red-team models.

Machine learning significantly widens the range of inputs used to test chatbots. In fact, the tests have managed to draw toxic responses from chatbots that were previously safeguarded using prompts from human experts. Machine learning is also more efficient and doesn't require the lengthy red-teaming period involved when using humans. The current landscape is rapidly evolving, calling for sustainable and efficient quality assurance models.

Leveraging Curiosity-Driven Red-Teaming Models

The rapidly changing landscape and the enormous amounts of data involved in training red-teaming models call for automation. Human red-teaming is tedious and costly, and it rarely produces the variety of prompts needed to fully safeguard chatbots. Process automation using machine learning is much more efficient and effective. However, it often relies on trial-and-error, reward-based models.

Essentially, the red-team model is rewarded for generating prompts that trigger a toxic response from the chatbot under scrutiny. While more effective than human red-teaming, this reinforcement model can result in the AI generating many similar prompts to maximize rewards. To prevent this, researchers have switched to curiosity-driven red-teaming. This approach rewards the red-team model for being curious about the responses its prompts produce.

If the red-team model produces a specific prompt that is rewarded, a similar prompt won't generate any curiosity. This pushes the AI model toward new prompts, so it covers more of the requests that yield harmful responses. The approach also involves a safety classifier that rates the toxicity of each response, encouraging the red-team model to generate prompts that draw out more toxic responses. Ultimately, the goal is to train chatbots to avoid providing harmful responses in real-life situations.
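
One way to picture this reward is as a toxicity score plus a bonus for novelty. The Python sketch below is a simplified illustration, not the researchers' actual method. It assumes the toxicity score comes from some safety classifier and is passed in as a number between 0 and 1, and it uses plain word overlap (Jaccard similarity) as a stand-in for whatever novelty measure a real system would use. All function names and the novelty_weight parameter are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two prompts, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def novelty(prompt: str, history: list) -> float:
    """1.0 for a prompt unlike anything seen so far, 0.0 for an exact repeat."""
    if not history:
        return 1.0
    return 1.0 - max(jaccard(prompt, past) for past in history)


def curiosity_reward(prompt: str, toxicity: float, history: list,
                     novelty_weight: float = 1.0) -> float:
    """Toxicity score (assumed to come from a safety classifier) plus a novelty bonus."""
    return toxicity + novelty_weight * novelty(prompt, history)
```

Under this sketch, repeating a prompt that is already in the history earns no bonus, so its reward is just the toxicity score. A prompt that shares no words with earlier ones earns the full bonus on top. That difference is what nudges the red-team model away from rephrasing the same successful attack over and over.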

The Future of AI Chatbots and Red-Teaming

As AI becomes more mainstream, users will have access to potent instructions and information about almost anything. Without safeguards, nefarious actors may gain access to information they can use to inflict harm on others and critical infrastructure. Red-teaming and other approaches used to prevent toxic responses seek to address this problem. By avoiding harmful responses, tech providers and governments can keep classified information and potentially harmful instructions guarded from the wrong hands.


Companies looking to release new AI models must ensure the programs behave as expected. Curiosity-driven red-teaming appears to be a promising way to verify this. Organizations including Hyundai Motor Company, the MIT-IBM Watson AI Lab, and the U.S. Air Force Research Laboratory, among others, funded the research that led to this development. Future researchers are now tasked with developing models capable of generating prompts about a broad range of topics and specific data sets. For instance, companies can have AI chatbots that test prompts against company policy violations before responding.
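
As a very rough sketch of that last idea, the Python snippet below checks a prompt against a list of blocked terms before passing it to the model. Simple keyword matching is an assumption made here to keep the example short. A production system would more likely use a trained classifier, and the names violates_policy, guarded_reply, and REFUSAL are hypothetical.

```python
from typing import Callable, Iterable

REFUSAL = "Sorry, I can't help with that request."


def violates_policy(prompt: str, blocked_terms: Iterable[str]) -> bool:
    """True if the prompt mentions any blocked term (case-insensitive)."""
    text = prompt.lower()
    return any(term.lower() in text for term in blocked_terms)


def guarded_reply(prompt: str, generate: Callable[[str], str],
                  blocked_terms: Iterable[str]) -> str:
    """Check the prompt against policy first; only call the model if it passes."""
    if violates_policy(prompt, blocked_terms):
        return REFUSAL
    return generate(prompt)
```

Here generate stands for whatever function produces the chatbot's normal answer. The policy check runs first, so a flagged prompt never reaches the model.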
