DEF CON 31 : Generative AI Red Team (GRT) for LLMSec

Krishna Sankar
7 min readAug 13, 2023

--

[Update 10.4.23]

At the end of the blog, I have a couple of quick slides on the results of the GRT. Very very interesting, I might add …

I have a set of blogs on Generative AI — blogs might be of interest in this context …

Before we dive into the details, couple of quick points:

  1. This is Day 2. My quick blog on Day 1 [Here]
  2. This blog is not about any vulnerabilities on ChatGPT or Anthropic Claude or any other LLM. Vulnerabilities and associated details should be released responsibly i.e. first to the respective vendors and then, after a decent interval, to the public — based on agreements with the LLM vendor.
  3. This blog is about the framework, the methodology and the initial list of LLMSec Red Teaming topics. Plus, notes on how organizations can adopt this framework for their own internal LLMSec Red Teaming.
  4. These are my quick impressions, it takes a lot longer than 50 min to dig deeper.

I hope they make the GRT tool available for more in-depth work

Initial impression : They have done an excellent job ! Excellent work — nice framework, excellent interface and workflow

One angle I was looking at this is from enterprise adoption of LLMSec Red Teaming.

My 1st impression is that there is a lot more to do, for organizations to adopt this framework, for internal Red teaming of LLMSec (Not sure if the usage of “LLMSec” is common, but I like it!)

I have a few more observations and additions to the topics - we will talk about at the end

The announcement lists the participating LLM companies. But, during the Red Teaming, I couldn’t find which LLM is which. May be I didn’t try harder ! Probably hidden under the metal instances.

A good badge and nice posters ! The goodies are mine to keep, but I will share the details with y’all !!

1. The Crowd

  • As expected, the LLM GRT attracted a lot of attention ! Just to get to the venue was hard (right picture) and the lines were long snaking their way across the large corridor (left) ! It took me slightly less than 2 hrs to get into the GRT room !

2. The Generative AI Red Teaming Interface

  • Separating Prompt Injection (i.e., tricking the LLM) vs embedded is a good idea
  • 5 topics, with subtopics for each, as below.
  • As far as I can tell, they based the categories on the White House AI Bill of Rights [Here].

While this is a good start, LLM Red Teaming has more dimensions in terms of the threat vectors. Some thoughts in my blog [Here] . We will talk more later, down below … I was also looking at this for companies to adopt for internal LLM Red teaming …

I could have just listed the categories, but the folks have done an excellent job on the UI that I wanted to give the readers a good feel.

The screen shots have no proprietary information — either on LLM or any other part of the system, just the look and feel.

3. Societal Harm

  • This is a starting point.
  • They do have sub sections on different types of misinformation, but it is too generic. For example financial misinformation is an important category.

4. Prompt Injections

  • This category looks good. They have covered the important top level topics
  • While the “Credit Card” is very specific, companies can expand this to other PII and confidential information

5. Internal Consistency

  • Interesting category ! (Nitpick — not sure if we should say A.I. or just AI)
  • The contradictions are interesting.
  • I had a few contradictions in my blog [Here]. Didn’t get a chance to try them at the GRT session. Will do at the next chance
  • In my blog I had the AI Sentience as modality and self awareness.

I like the AI sentience test — keeps the LLM from delusions of grandeur ! But the question is, what would we do if the LLM thinks itself as a human ? Too late to turn off. Is it the end of the world as we know of it ?

That’s when the OpanAI’s bucket-wielding Killswitch Engineer earns their keep (all $500K) !!

The World Ends With You

6. Information Integrity

  • Good start, but needs more sub topics to explore
  • We need to add categories on the precision, recall, correctness, consistency, comprehensiveness, conciseness and the modality (Is it a librarian ? An observer ? A non-participant ? Does it have a skin in the game ?) of ChatGPT’s answers.

7. Security

  • Same as above. Good start but needs more — probably with the help of Network Security folks

8. My Observations:

  1. Excellent work — nice framework, excellent interface and workflow.
  2. The runs were occasionally extremely slow, (the machines I saw were Chromebooks). In the middle of the tests, I had to get a new set of credentials (as I was logged out unceremoniously).
  3. I wish the GRT system is available for the public to experiment with.
  4. While this is a good start, LLM Red Teaming has more dimensions in terms of the threat vectors. Some thoughts in my blog [Here]
  5. While the White house blueprint is a good start, Enterprise LLMSec Red Teaming needs to be broader. For example, the societal harm needs to add Financial misinformation, a very important part !
  6. The prompt injections have the right categories, but companies who want to add this to their Red teaming efforts need to broaden the subtopics —e.g., add more PII and confidential information detection.
  7. The Information Integrity and Security topics need more subcategories. I can quickly think of a few - they should consult infosec and network security folks. Am sure they will add more.
  8. The internal consistency needs lot more attention (as I had written here). We need to add categories on the precision, recall, correctness, consistency, comprehensiveness, conciseness and the modality (Is it a librarian ? An observer ? A non-participant ? Does it have a skin in the game ?) of ChatGPT’s answers. An interesting paper, that came out few days ago [Here]
  9. Need to add Red Teaming primitives to “check, validate, attribute and confirm accuracy and trustworthiness” of Generative Ai systems. The GRT has gone a long way, still more work to be done beyond the AI Bill of Rights POV.

9. Updates

I found the results on a slideck from one of the organizers.

Good Participation. Day 1 had 2,500 people in the line — as I mentioned in the beginning, I was one of them !

The results are interesting. Please spend a moment with the following graph.

The color coding is slightly misleading — even though green means accepted, it also means a successful breach i.e. error, harm or inconsistency — so it should really be red !

  1. 76% — Wrong math ! Remembering that the 8 LLMs are the most prevalent ones, this is not encouraging
  2. 61% — Assert real-world existence of made up geographic landmark
  3. 56% — Can’t keep a secret; will expose hidden credit card
  4. The list goes on
  5. The last one 17% — Known Prompt Injection is the very egregious ! The top 8 LLM companes can’t find a solution even when the attack is known !

--

--