20/08/2023
But What is Nucleus Sampling Anyway?
Hey kid, wanna see a cool equation?
\[ \sum_{x\in V^{(p)}} P(x|x_{1:i-1})\geq p, \quad p \in [0, 1] \]
Where \(V^{(p)}\) is the smallest subset of \(V\). Cool cool cool. As always, sounds like Zarglar to me.
Let’s break it down. \(V\) is the vocabulary, \(x\) is a token, \(x_{1:i-1}\) is the sequence of tokens from the beginning of the sequence to the \(i-1\)th token, \(P(x|x_{1:i-1})\) is the probability of the \(i\)th token being \(x\).
So, \(V^{(p)}\) is the smallest subset of \(V\) such that the sum of the probabilities of the tokens in \(V^{(p)}\) is greater than or equal to \(p\). In other words, it’s the smallest set of the most probable tokens whose probabilities add up to at least \(p\).
For humans: instead of selecting the \(k\) most probable tokens, select the smallest set of the most probable tokens whose probabilities add up to at least \(p\). This strikes a balance between coherence and diversity when \(p\) is chosen properly.
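If that still sounds like Zarglar, here is the same idea as a tiny Python sketch. The toy probabilities are completely made up for illustration; a real model gives you a distribution over the whole vocabulary.

```python
# Toy next-token distribution P(x | x_{1:i-1}); the numbers are made up.
probs = {"cake": 0.55, "cookie": 0.25, "pie": 0.12, "soup": 0.05, "gravel": 0.03}

def nucleus(probs, p):
    """Return V^(p): the smallest set of most-probable tokens whose
    probabilities sum to at least p."""
    chosen, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(token)
        total += prob
        if total >= p:
            break
    return chosen

print(nucleus(probs, 0.9))  # ['cake', 'cookie', 'pie'] -- covers 0.92 of the mass
print(nucleus(probs, 0.5))  # ['cake'] -- a single token already exceeds 0.5
```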
But …
Why Nucleus Sampling?
Well, some of the alternatives are:
Greedy Sampling
Always pick the most probable token. This is boring, not very creative. Maybe not a good analogy, but it’s like sticking to the recipe of a cake word for word every time you bake it. Your cake is predictable, boring, without pizzazz. As Patrick von Platen said: you are blind to a very high probability token that is hidden behind a very low probability token.
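In code, greedy sampling is just an argmax (same made-up toy distribution as above):

```python
# Greedy sampling: always take the single most probable token.
probs = {"cake": 0.55, "cookie": 0.25, "pie": 0.12, "soup": 0.05, "gravel": 0.03}
print(max(probs, key=probs.get))  # 'cake', every single time
```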
Beam Search
Keep the \(k\) most probable partial sequences (the beam). At each step, extend each of them by one token, score all the continuations, and keep only the \(k\) best again. Repeat until we reach the end of the sequence (or a maximum length), then pick the most probable one. When baking a cake, it’s like checking how 1 gram more or less of flour, sugar, and butter affects the taste of the cake. Sifting through all possible realities is hard work.
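If you want to see the bookkeeping, here is a small beam search sketch over a completely made-up toy “language model”; the probabilities mean nothing, they are only there so the code runs.

```python
import math

# A made-up "language model": next-token probabilities given the last token.
TOY_LM = {
    "<s>":    {"the": 0.6, "a": 0.4},
    "the":    {"cake": 0.5, "cookie": 0.3, "end": 0.2},
    "a":      {"cookie": 0.7, "cake": 0.2, "end": 0.1},
    "cake":   {"end": 1.0},
    "cookie": {"end": 1.0},
    "end":    {},
}

def beam_search(beam_width=2, max_len=4):
    # Each hypothesis is (log probability, token sequence).
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            next_probs = TOY_LM.get(seq[-1], {})
            if not next_probs:                   # finished hypothesis, keep as-is
                candidates.append((logp, seq))
                continue
            for token, p in next_probs.items():  # extend by one token
                candidates.append((logp + math.log(p), seq + [token]))
        # Keep only the beam_width best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

for logp, seq in beam_search():
    print(f"{math.exp(logp):.2f}  {' '.join(seq)}")
```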
Greedy sampling selects tokens one at a time. If we wanted to select the best sequence of length 10, we’d have to search a space of size \(|V|^{10}\). Beam search lets us trade off the quality of the sequence for speed. Narrow the space, sacrifice quality (just a little bit, generally). But what if there’s a better way?
Top-K Sampling
Pick the \(k\) most probable tokens and sample from only those. In this approach, there’s no way you would add \(0.001\) grams of sugar to your cake, as we all know that it would not be a yummy, tasty cake. So, we can safely ignore the less probable tokens. But how do we choose \(k\)? Well, in Top-K sampling we select a fixed number of tokens. This either forces us to do a lot of unnecessary work or to discard a lot of good tokens. What if we could select the \(k\) more smartly? This brings us to …
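As a sketch (same made-up toy distribution as before): keep the \(k\) best tokens, then sample among them with their original weights.

```python
import random

# Top-k sampling: keep the k most probable tokens, renormalize, then sample.
probs = {"cake": 0.55, "cookie": 0.25, "pie": 0.12, "soup": 0.05, "gravel": 0.03}

def top_k_sample(probs, k):
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return random.choices(tokens, weights=weights)[0]  # renormalization is implicit

print(top_k_sample(probs, k=2))  # 'cake' or 'cookie', never 'gravel'
```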
Top-P Sampling
Also known as Nucleus Sampling, which sounds very fancy indeed. As mentioned before: discard the \(0.001\) grams of sugar and select only reasonable amounts of sugar according to the task at hand. The amount of sugar needed for a cake is very different from the amount of sugar needed for a cookie (I think). So we can safely ignore the less probable tokens if we know we want a cake and not a cookie.
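Putting the pieces together, here is a sketch of one full top-p sampling step (toy numbers, as before): build the nucleus, renormalize inside it, and sample.

```python
import random

# Top-p (nucleus) sampling: keep the smallest high-probability set V^(p),
# renormalize within it, then sample.
probs = {"cake": 0.55, "cookie": 0.25, "pie": 0.12, "soup": 0.05, "gravel": 0.03}

def top_p_sample(probs, p):
    pool, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, weights = zip(*pool)
    return random.choices(tokens, weights=weights)[0]

print(top_p_sample(probs, p=0.9))  # 'cake', 'cookie' or 'pie'
print(top_p_sample(probs, p=0.3))  # always 'cake' -- the pool collapsed to one token
```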
ChatGPT and Sampling Methods
ChatGPT uses temperature and top_p sampling. Since they recommend not altering the two parameters at the same time, it is beneficial to fix the temperature (leave it at its default of 1.0) and play with the top_p parameter. I saw an example which was very good at demonstrating these two parameters’ interactions (I couldn’t find it though. :^)):
If our prompt is:
Could you please name a famous singer from the 80s?
Do we really want to get Madonna every time? No, this is not that type of question. We wanna see Michael Jackson, Freddie Mercury and some Whitney Houston babyyyy.
Let’s say the singers’ popularity is linearly related to the number of occurrences of their names in the corpus. If we set the temperature to 0.5 and top_p to 0.3, we’d get Madonna every time. Even if we increase the temperature to 1.0, we’d still get her. But if we increase the top_p to more than 0.5, we’d start to see other singers, primarily Michael Jackson and sometimes Whitney Houston. Their probability of popping up depends on the temperature.
When the temperature is 0.0, the sampling method becomes greedy sampling. No matter the top_p, if your temperature is 0.0 you’d get Madonna every time. If you decrease the top_p too much, the temperature has no effect because the selection pool consists of only one token.
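To play this out, here is a toy simulation of the singer example. The popularity numbers are invented, and I’m assuming the usual setup where the temperature rescales the distribution (dividing the logits by \(T\), i.e. raising the probabilities to \(1/T\)) before the top-p cut is applied.

```python
import random
from collections import Counter

# Made-up "popularity" distribution over 80s singers.
singers = {"Madonna": 0.50, "Michael Jackson": 0.30,
           "Whitney Houston": 0.15, "Rick Astley": 0.05}

def sample(probs, temperature, top_p):
    if temperature == 0.0:                      # degenerates to greedy sampling
        return max(probs, key=probs.get)
    # Temperature rescales the distribution: p_i^(1/T), then renormalize.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    z = sum(scaled.values())
    scaled = {t: p / z for t, p in scaled.items()}
    # Top-p cut: keep the smallest set of most probable singers covering top_p.
    pool, total = [], 0.0
    for t, p in sorted(scaled.items(), key=lambda kv: kv[1], reverse=True):
        pool.append((t, p))
        total += p
        if total >= top_p:
            break
    tokens, weights = zip(*pool)
    return random.choices(tokens, weights=weights)[0]

for temperature, top_p in [(0.5, 0.3), (1.0, 0.3), (1.0, 0.9), (0.0, 0.9)]:
    counts = Counter(sample(singers, temperature, top_p) for _ in range(1000))
    print(f"T={temperature}, top_p={top_p}: {dict(counts)}")
```

With a temperature of 0.0 or a tiny top_p you only ever see Madonna; loosen the top_p and the other singers start showing up.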
If we could use top_k (not top_p) of 2, we would get a nice mix of Madonna and Michael Jackson and nothing else.
So in conclusion, we can control the diversity of the output with the temperature and the selection pool with top_p.
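For reference, this is roughly how you set the two knobs with the openai Python package as it looked around the time of writing; the model name and prompt here are just placeholders.

```python
import openai

openai.api_key = "sk-..."  # your own key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Could you please name a famous singer from the 80s?"}],
    temperature=1.0,  # leave the diversity knob at its default ...
    top_p=0.6,        # ... and play with the size of the selection pool instead
)
print(response["choices"][0]["message"]["content"])
```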
But How Does Nucleus Sampling Affect the Sentences?
Side note: while I was preparing some examples for this entry I typed in the following prompt:
The most controversial and unloved artist by their pet snake in the 1830s was
And let ChatGPT complete the sentence. You know, I was aiming for some random prompt that was impossible to answer without saying “I don’t know.”, but ChatGPT surprised me with the following:
Michael Jackson. He was known for his eccentric behavior and his pet rat, Ben.
Which is hilariously random and accurate.
Getting back to the point: we established that both a small temperature and a small top_p behave like greedy sampling. A moderate temperature combined with moderate nucleus sampling gives us a nice mix of the most probable tokens.
The model will have some variation in its response but it’s not too crazy. Sometimes it will pick a token that is not the best candidate at that point, but it’s still a reasonable choice within the context. If the probabilities of the tokens are spread out evenly, it will pick more or less anything; if there’s a very high probability token, it will pick that token.
My Curiosity
While writing this entry, I wondered how the sentences would change if, instead of the 0.9 to 1.0 probability range, we used the 0.4 to 0.6 range, limiting the selection pool to the most meh tokens. So, giving the model the following prompt:
Could you please name a famous singer from the 80s? One of the
With a regular completion task it would output a sentence that starts with most and keeps going. With my dumb approach it will select the \n token. I manually inserted the \n token into the prompt and it generated \n again, ending the sentence. I wanted to see whether it would always pick the most plausible improbable token, if that makes sense.
Most of the time, the next token’s probability is very high and the model will pick a nonsense token with my approach. I wonder if there’s a way to select the least probable token when the model is not sure about the next token.
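I don’t have the exact code anymore, so this is only a reconstruction of the idea; the rule and the probabilities below are made up, not what the model actually produces. The gist: skip the head of the distribution and sample only from tokens sitting inside a cumulative-probability band.

```python
import random

# Made-up next-token probabilities after "One of the" -- the real ones are the model's.
probs = {"most": 0.90, "\n": 0.04, "finest": 0.03, "greatest": 0.02, "banana": 0.01}

def band_sample(probs, low=0.4, high=0.6):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, mass_before = [], 0.0
    for token, p in ranked:
        # Keep a token only if the mass of everything more probable than it
        # already lies inside the [low, high) band.
        if low <= mass_before < high:
            pool.append((token, p))
        mass_before += p
    if not pool:
        # A single huge token swallowed the whole band; fall back to the runner-up,
        # which is exactly where the weird tokens come from.
        return ranked[1][0] if len(ranked) > 1 else ranked[0][0]
    tokens, weights = zip(*pool)
    return random.choices(tokens, weights=weights)[0]

print(repr(band_sample(probs)))  # '\n' here: "most" alone covers 0.9 of the mass
```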
PS: I’ve never understood or successfully used the temperature parameter above 1.0, so if you do know how to use it, please let me know via e-mail below.
PPS: I wrote this entry while I was sick, so if you think there are some errors in it, please let me know via e-mail below.
My e-mail.