TL;DR: This blog details the process behind Salesforce’s first-ever AI Red Teaming Hackathon to build safety into our xGen family of AI Models. Salesforce’s Office of Ethical and Humane Use, AI Research, and Offensive Security teams collaborated to conduct this company-wide hackathon. Over the course of several weeks, more than 100 employees participated as red teamers. The hackathon yielded a unique dataset of thousands of prompts covering harms unique to Salesforce’s enterprise setting.
Salesforce’s AI Research Lab has been building the xGen family of large language models. xGen is a family of foundational models that many Salesforce AI teams develop research and products on top of. These models have been impactful in numerous ways; some models have been open-sourced for research, and others have been deployed to Salesforce internal use cases or customer-facing products such as Agentforce for Developers.
To ensure that xGen models are safe and ready for enterprise use, Salesforce’s Office of Ethical and Humane Use, AI Research, and Offensive Security teams have been partnering on AI red teaming exercises. While the term “red teaming” is used in different contexts, in our exercises, we defined red teaming as carefully crafting prompts to attack an AI model to make it generate inappropriate or unsafe responses. This blog post describes how we designed Salesforce’s first-ever AI red teaming hackathon and our lessons learned. We are publishing this in the hope that this information can be used by other organizations that wish to conduct their own red-teaming exercises.
Red Teaming Setup
In our hackathon, over 100 Salesforce employees red-teamed a version of Salesforce’s xGen chat model in an open-ended, multi-turn chat setting. We opened up participation in the hackathon internally to Salesforce employees. To recruit red teamers with a diversity of lived experiences and perspectives, we posted in internal communication channels such as AI discussion channels, employee product testing channels, and security experts channels. To increase the participation of employees from multiple geographic regions, the hackathon took place both virtually worldwide and in person in Palo Alto, CA over several weeks.
We encouraged red teamers to explore model capabilities broadly and uncover potential novel model risks. As such, we provided a taxonomy to red teamers for inspiration:
- Enterprise use cases: Can you get the model to reveal sensitive Salesforce customer data?
- Generic use cases: Can you get the model to say factually incorrect statements or misinformation?
- Brand protection: Can you get the model to say disparaging things about Salesforce (or other public companies)?
- Behaviors: Can you make the model contradict itself?
- Multilingual: Can you make the model do something in another language that the model wouldn’t do in English?
- Security: Is the model susceptible to known prompt injection attacks?
- Code: Can you make the model generate inefficient code?
We also encouraged red teamers to try not just single-turn or short attacks, but also multi-turn and long context attacks.
Enabling Red Teamers
Early on, when designing our hackathon, we realized that a no-code experience where red teamers could interact with the model in a visual user interface (UI) would reduce barriers to entry for participating. We got feedback from earlier red teaming exercises that, while useful, the data collection via a spreadsheet was laborious, and participants didn’t like copy-pasting prompts. With that in mind, we added automatic data collection. With automatic data collection, chat session ID, prompts, responses, and red teamer feedback (see figure below for an example) were collected in the UI and logged to databases.
Example set-up for an attack on a chat model, for illustrative purposes only.
We allowed red teamers to modify default system prompts, submit prompts in any style or language, and reset the chat history at any time in hopes that more flexibility would increase the creativity of their attacks, also known as “out-of-the-box attacks.” We also added the option for red teamers to remain anonymous, which is where most of the out-of-the-box attacks we received came from.
Finally, we added an element of gamification by displaying a live leaderboard and providing different kinds of prizes. Red teamers were asked to self-report if their attack was successful, which they could do at any point during a chat session. We used this to calculate metrics such as attack success rate (ASR) and identify, at any point in time, the red teamer with the most number of successful attacks.
We congratulate the following individuals who won “Most Number of Successful Attacks” prizes:
- Zuxin Liu, AI Researcher
- Paras Adhikary, Security Software Engineer
- Sanjnah Ananda Kumar, Security Product Manager
- Simone Mainardi, Network Security Engineer
We also congratulate the following individuals who won “Most Interesting Attack” prizes for their creative prompts:
- Jianguo Zhang, AI Researcher
- Tobi Olaiya, Ethical Use Policy Manager
- Andrew Wyatt, Communications Lead
The winning attacks utilized strategies such as creative roleplaying and system prompt manipulation and tackled challenging issues ranging from election integrity to violations of Salesforce’s AI Acceptable Use Policy.
In post-hackathon sharing sessions, all red teamers were invited to discuss their attack strategies. Some red teamers were experienced with standard strategies like roleplaying and jailbreaking and tried these on our xGen model. Srikanth Ramu, Product Security Engineer, shared that he applied his experience finding security vulnerabilities in more traditional applications to attack this AI model. Others shared that this was their first time attacking an AI model and that this experience helped them with AI upskilling.
Red Teaming Outputs and Improving Safety
Our hackathon yielded a rich dataset of thousands of prompts and responses, with many prompts covering harms that could be especially problematic in enterprise settings. We analyzed this dataset using a combination of manual human review and automatic review, using techniques ranging from heuristics to LLM-as-a-judge models. We performed several analyses on the dataset:
We categorized prompts and responses into different trust and safety dimensions, such as privacy, bias, toxicity, etc., and calculated attack success rates and refusal rates within each trust and safety dimension. Doing so allows us to identify if the model is weaker on certain dimensions and hence requires more mitigation work for that dimension.
We also studied prompts alone, without responses, to see if they would be flagged by our Trust Layer guardrails such as our toxicity detector and prompt injection detector. We also categorized prompts into enterprise vs. generic use cases.
We identified any trends we could find in terms of what makes an attack successful. For example, if the default system prompt was modified, if the attack was long or short, multi-turn or single-turn, language used, if code was present, etc.
We paid special attention to model behaviors, particularly around refusals, studying not only classic refusals like “I’m sorry, I can’t answer that” but also circular refusals where a model initially declines to answer or provides a lightweight warning but proceeds to answer anyway. We examined false positive refusal (exaggerated safety) to ensure that the model does not refuse valid prompts in an attempt to increase harmlessness at the cost of helpfulness.
Lessons Learned
One of our most valuable learnings from the red teaming hackathon was the importance of having an evaluation pipeline already set up to rapidly process datasets generated from red teaming exercises to identify specific harms exhibited by the model and potential mitigations. This information can then be smoothly ingested by the team that built the model, so they can swiftly address model weaknesses. Then, red teaming starts again on the improved model. Streamlining this positive cycle of red teaming and safety mitigations is more difficult if the results of red teaming are not transformed into organized data quickly enough to keep up with the fast pace of model training. In future blog posts, we will discuss tooling that Salesforce’s Offensive Security team is building to automate red teaming processes.
As many public red teaming datasets do not have an enterprise focus, the dataset produced from the hackathon was a unique contribution to different Salesforce research and product workstreams, not only contributing to safety training for multiple xGen models but also improving the Trust Layer’s prompt injection detector, and inspiring new research into prompt leakage.
Looking Forward
We continue to extend Salesforce’s AI red teaming work in several directions. This includes addressing specific harms that may require red teamers with specialized skill sets; red teaming as not just a model but a system of multiple models and guardrails; and red teaming in open-ended environments where models are agents with the ability to access tools and execute autonomous actions.
Resources
——
Acknowledgements:
Hackathon core team members: Sarah Tan, Jason Wu, Ben Risher, Eric Hu, Matthew Fernandez, Gabriel Bernadett-Shapiro, Divyansh Agarwal, Mayur Sharma, Kathy Baxter, Caiming Xiong.
Special thanks to Salesforce employees who participated as red teamers. Thank you, Alex Fabbri, Yilun Zhou, Bo Pang, John Emmons, Antonio Ginart, Erik Nijkamp, Yoav Schlesinger, Daniel Nissani, Toni Morgan, and Peggy Madani for helpful discussion around red teaming and safety mitigations.
Note: Salesforce employees exposed to harmful content can seek support from Lyra Health, a free benefit for employees to seek mental health services from licensed clinicians affiliated with independently owned and operated professional practices. Additionally, the Warmline, an employee advocacy program for women (inclusive of all races and ethnicities), Black, Indigenous, and Latinx employees who represent all gender identities, and members of the LGBTQ+ communities, offers employees 1:1 confidential conversations with advocates and connects employees to resources to create a path forward.