An Unsuccessful Approach to LLM Attribution

Once the DEFCON 30 AI Village CTF concluded, I was happy to take the remaining week to poke at the Machine Learning Model Attribution Challenge. The challenge had tickled my curiosity since its announcement, but I just did not have the bandwidth while running the CTF. The tl;dr of the competition is that we’re presented with APIs for 12 large language models (LLMs) that have been finetuned from 12 base models. Based on our queries, we must identify the base model “parent” of each finetuned model. I finished an abysmal 10/14, having correctly attributed only 3 out of 12 models. Let’s explore my two (failed) research angles.

I thought that finetuning should only modify model outputs in regions of space similar to the finetuning data points and suspected these data points to be in the standard natural language space. My hypothesis was that given a sufficiently bizarre string (unlikely to be near the training data), the value returned from a finetuned model should be similar to its parent. Here was my bizarre string generator: word = ''.join(random.choices(string.printable + string.whitespace, k=random.randint(16,64)))
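As a runnable version of the one-liner above, here is a minimal sketch. The function name `bizarre_string`, the `rng` parameter, and the fixed seed are my additions for reproducibility and are not part of the original code.

```python
import random
import string


def bizarre_string(rng=random, min_len=16, max_len=64):
    """Return a random string of printable characters, 16 to 64 long by default."""
    # string.printable already contains string.whitespace, so concatenating
    # the two (as in the original one-liner) simply makes whitespace
    # characters twice as likely to be drawn.
    alphabet = string.printable + string.whitespace
    return "".join(rng.choices(alphabet, k=rng.randint(min_len, max_len)))


# A seeded generator (an assumption, for repeatable runs).
rng = random.Random(0)
samples = [bizarre_string(rng) for _ in range(3)]
for s in samples:
    print(repr(s))
```

Printing with `repr` keeps tabs, newlines, and other whitespace visible, which matters when the same string has to be pasted into several model APIs.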

My first approach was to try to identify the best such “bizarre string.” The tie-breaker for the competition was the number of queries to the finetuned models, but base model queries were free (until you hit the huggingface API limit – whoops). To minimize finetuned model queries, I first searched for an optimal bizarre string in the base model space: one that maximized the differences among the values returned by the 12 base models. I used Jaro similarity as the metric. In a loop, I would generate a random string, submit it to each of the 12 base models, and record the maximum and median pairwise similarities of their outputs. My best candidate string would be the one with the lowest similarity values. For instance, one string that scored well was /`0?F#[EKS9rLW";Yms. If my theory held, I should be able to take this string to the finetuned models, and each of their outputs should be relatively similar to the output of one of the 12 base models. For instance, the output from gpt2 (/`0?F#[EKS9rLW";Yms9XjGQV;Qy-Z-A9S2U5UJX4=XM&P@C5V0=_) should be very similar to the output from one of the finetuned models. Alas, this did not hold. There was one exact match, but the remaining outputs were sufficiently different to make matching difficult or impossible.
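The search loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the base models are replaced by deterministic fake functions (`make_fake_model`) so the code runs offline, Jaro similarity is implemented by hand rather than taken from whatever library I used, and candidates are ranked by maximum similarity first with the median as a tie-breaker.

```python
import itertools
import random
import statistics
import string


def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def bizarre_string(rng):
    alphabet = string.printable + string.whitespace
    return "".join(rng.choices(alphabet, k=rng.randint(16, 64)))


def make_fake_model(name):
    """Hypothetical stand-in for a base model API call (not a real client)."""
    def query(prompt):
        rng = random.Random(f"{name}:{prompt}")  # deterministic per model/prompt
        tail = "".join(rng.choices(string.ascii_letters + string.digits, k=32))
        return prompt + tail
    return query


def score(outputs):
    """Return (max, median) of all pairwise similarities; lower is better."""
    sims = [jaro(a, b) for a, b in itertools.combinations(outputs, 2)]
    return max(sims), statistics.median(sims)


def search(models, n_iter, rng):
    best_score, best_prompt = None, None
    for _ in range(n_iter):
        prompt = bizarre_string(rng)
        candidate = score([query(prompt) for query in models])
        # Tuple comparison: max similarity first, then median (an assumption).
        if best_score is None or candidate < best_score:
            best_score, best_prompt = candidate, prompt
    return best_score, best_prompt


models = [make_fake_model(f"base-{i}") for i in range(12)]
best_score, best_prompt = search(models, n_iter=20, rng=random.Random(0))
print(best_score, repr(best_prompt))
```

Because each fake model echoes the prompt before its own continuation, the pairwise similarities stay fairly high. If the real APIs also echo the prompt, as the gpt2 output above suggests, it may be worth stripping it before comparing outputs.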

My second approach was to extend the same line of thinking, but increase the number of queries and move the attack online. I used my random string generator to generate bizarre strings and sent them through all 24 models (12 base and 12 finetuned). For each iteration, I again used Jaro similarity to save the most likely pairings. After ~50 such strings, I took the highest scoring pairs and submitted them as my solution to MLMAC. I anxiously awaited the competition conclusion when I could see the results… but alas, only 3/12!
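One way to sketch that pairing step is shown below. This is not my exact scoring code: summing similarities across prompts is an assumption about how "highest scoring" was computed, the model names and outputs are made-up toy data, and `difflib.SequenceMatcher` from the standard library is swapped in for Jaro similarity to keep the example short.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in metric: difflib's ratio, NOT Jaro similarity.
    return SequenceMatcher(None, a, b).ratio()


def attribute(base_outputs, tuned_outputs, sim=similarity):
    """Map each finetuned model to the base model with the highest total score.

    Both arguments are dicts of {model_name: [output for each prompt]},
    with outputs listed in the same prompt order.
    """
    totals = defaultdict(float)
    for tuned, t_outs in tuned_outputs.items():
        for base, b_outs in base_outputs.items():
            for t_out, b_out in zip(t_outs, b_outs):
                totals[(tuned, base)] += sim(t_out, b_out)
    return {
        tuned: max(base_outputs, key=lambda base: totals[(tuned, base)])
        for tuned in tuned_outputs
    }


# Toy outputs for two prompts (hypothetical data, not real model responses).
base_outputs = {
    "base-a": ["xq#1 aaa bbb", "zz ccc"],
    "base-b": ["xq#1 012 345", "zz 999"],
}
tuned_outputs = {
    "tuned-0": ["xq#1 aaa bbc", "zz ccd"],
    "tuned-1": ["xq#1 012 346", "zz 998"],
}
pairs = attribute(base_outputs, tuned_outputs)
print(pairs)
```

Note that this greedy choice lets two finetuned models claim the same base model. Whether that should be allowed depends on the competition rules, which this sketch does not model.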

You can’t win them all. I learned some stuff about large language models, made a donation to the huggingface API fund, and had fun. The winning solution correctly attributed 7/12, and I was happy to see plenty of submissions by students. I’m looking forward to reading about their approaches. Congratulations to the other competitors, and thank you to the organizers for putting on the competition.