
20/08/2023

But What is Nucleus Sampling Anyway?

Hey kid, wanna see a cool equation?

\[ \sum_{x\in V^{(p)}} P(x|x_{1:i-1})\geq p, \quad p \in [0, 1] \]

Where \(V^{(p)}\) is the smallest subset of \(V\). Cool cool cool. As always, sounds like Zarglar to me.

Let’s break it down: \(V\) is the vocabulary, \(x\) is a token, \(x_{1:i-1}\) is the sequence of tokens from the beginning up to the \((i-1)\)th token, and \(P(x|x_{1:i-1})\) is the probability that the \(i\)th token is \(x\), given those preceding tokens.

So, \(V^{(p)}\) is the smallest subset of \(V\) such that the sum of the probabilities of the tokens in \(V^{(p)}\) is greater than or equal to \(p\). In other words, it’s the smallest subset of \(V\) that contains the most probable tokens that sum up to \(p\).
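A tiny made-up example: if the distribution over four tokens is \(0.5, 0.3, 0.15, 0.05\) and \(p = 0.7\), then \(V^{(p)}\) is just the first two tokens, because \(0.5 + 0.3 = 0.8 \geq 0.7\) and no single token reaches \(0.7\) on its own.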

For humans: instead of selecting the \(k\) most probable tokens (as in top-k sampling), select the smallest subset of \(V\) that contains the most probable tokens and whose probabilities sum up to at least \(p\). This strikes a balance between coherence and diversity when \(p\) is chosen properly.
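If seeing the procedure as code helps, here is a minimal sketch in Python. The function name and the use of NumPy are my own choices; it just sorts the distribution, keeps the smallest prefix whose mass reaches \(p\), renormalizes, and samples from it.

```python
# A minimal sketch of nucleus (top-p) sampling with NumPy. The function
# name is made up; only the idea matches the equation above.
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float, rng=np.random.default_rng()) -> int:
    """Sample a token index from the smallest set of most probable tokens
    whose cumulative probability is >= p (that set is V^(p))."""
    order = np.argsort(probs)[::-1]            # most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # the first position where the running sum reaches p closes the nucleus
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))

# The worked example above: only the first two tokens can ever be picked.
probs = np.array([0.5, 0.3, 0.15, 0.05])
print(nucleus_sample(probs, p=0.7))  # prints 0 or 1
```

The renormalization step matters: once the nucleus is fixed, the probabilities inside it are rescaled to sum to one, so the relative preferences among the surviving tokens are preserved.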

But …

Why Nucleus Sampling?

Well, some of the alternatives are greedy sampling (just take the single most probable token every time), top-k sampling (sample from the \(k\) most probable tokens), and plain temperature sampling (reshape the whole distribution and sample from all of it). All of them come up in the ChatGPT discussion below, and there are minimal sketches of them right after this paragraph.
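For contrast with the nucleus sketch above, here are equally minimal versions of those alternatives, with the same caveat that the names and the NumPy usage are mine:

```python
# Minimal sketches of the alternatives to nucleus sampling, for contrast.
import numpy as np

def greedy(probs: np.ndarray) -> int:
    # Always take the single most probable token.
    return int(np.argmax(probs))

def top_k_sample(probs: np.ndarray, k: int, rng=np.random.default_rng()) -> int:
    # Keep only the k most probable tokens, renormalize, and sample.
    top = np.argsort(probs)[::-1][:k]
    top_probs = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=top_probs))

def temperature_sample(logits: np.ndarray, t: float, rng=np.random.default_rng()) -> int:
    # Rescale the logits by 1/t before the softmax: t < 1 sharpens the
    # distribution, t > 1 flattens it, and t -> 0 approaches greedy.
    scaled = logits / t
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```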

ChatGPT and Sampling Methods

ChatGPT uses temperature and top_p sampling. Since they recommend not altering both parameters at the same time, it is convenient to fix one of them and only play with the other. I saw an example which was very good at demonstrating these two parameters’ interactions (I couldn’t find it again though. :^)):

If our prompt is:

Could you please name a famous singer from the 80s?

Do we really want to get Madonna every time? No, this is not that type of question. We wanna see Michael Jackson, Freddie Mercury and some Whitney Houston babyyyy.

Let’s say the singers’ popularity is linearly related to the number of occurrences of their names in the corpus. If we set the temperature to 0.5 and top_p to 0.3, we’d get Madonna every time. Even if we increase the temperature to 1.0, we’d still get her. But if we increase top_p to more than 0.5, we’d start to see other singers, primarily Michael Jackson and sometimes Whitney Houston. Their probability of popping up depends on the temperature.

When the temperature is 0.0 the sampling method becomes greedy sampling. No matter the top_p, if your temperature is 0.0 you’d get Madonna every time. If you decrease the top_p too much, the temperature has no effect because the selection pool consists of only one token.

If we could use a top_k of 2 (instead of top_p), we would get a nice mix of Madonna and Michael Jackson and nothing else.

So, in conclusion: we can control the diversity of the output with temperature and the size of the selection pool with top_p.
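To make that concrete, here is a toy version of the singer example. The counts, the singer list, and the sample_singer helper are all made up for illustration; this is not how ChatGPT works internally, just a sketch of how temperature and top_p interact on a small distribution.

```python
# Toy sketch of the singer example: made-up counts, not real model output.
import numpy as np

rng = np.random.default_rng(0)

singers = ["Madonna", "Michael Jackson", "Whitney Houston", "Freddie Mercury"]
# Pretend popularity is proportional to corpus counts (made-up numbers).
counts = np.array([50.0, 30.0, 12.0, 8.0])
logits = np.log(counts)

def sample_singer(temperature: float, top_p: float) -> str:
    # Temperature first: divide the logits by T and re-softmax.
    # T -> 0 collapses the distribution onto the argmax (greedy).
    if temperature == 0.0:
        return singers[int(np.argmax(logits))]
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Then top_p: keep the smallest set of singers whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    pool = order[:cutoff]
    pool_probs = probs[pool] / probs[pool].sum()
    return singers[int(rng.choice(pool, p=pool_probs))]

# top_p = 0.3 keeps only Madonna (she alone already covers that much mass),
# so the temperature barely matters; top_p = 0.9 lets the others into the pool.
for t, p in [(0.5, 0.3), (1.0, 0.3), (1.0, 0.9), (0.0, 0.9)]:
    picks = {sample_singer(t, p) for _ in range(20)}
    print(f"temperature={t}, top_p={p}: {sorted(picks)}")
```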

But How Does Nucleus Sampling Affect the Sentences?

Side note: while I was preparing some examples for this entry, I typed in the following prompt:

The most controversial and unloved artist by their pet snake in the 1830s was

And let ChatGPT complete the sentence. You know, I was aiming for some random prompt that was impossible to answer without saying “I don’t know,” but ChatGPT surprised me with the following:

Michael Jackson. He was known for his eccentric behavior and his pet rat, Ben.

Which is hilariously random and accurate.

Getting back to the point: we established that both a small temperature and a small top_p behave like greedy sampling. A moderate temperature combined with moderate nucleus sampling gives us a nice mix of the most probable tokens.

The model will have some variation in its responses, but nothing too crazy. Sometimes it will pick a token that is not the best candidate at that point, but it’s still a reasonable choice within the context. If the probabilities are distributed evenly across tokens it will pick more or less whatever; if there’s a very high-probability token it will pick that one.

My Curiosity

While writing this entry, I wondered how the sentences would change if, instead of the 0.9 to 1.0 probability range, we used the 0.4 to 0.6 range, limiting the selection pool to the most meh tokens. So, giving the model the following prompt:

Could you please name a famous singer from the 80s? One of the

With a regular completion task it would output a sentence that starts with most and keeps going. With my dumb approach it selects the \n token. I manually inserted the \n token into the prompt and it generated \n again, ending the sentence. I wanted to see whether it would always pick the most plausible improbable token, if that makes sense.

Most of the time the next token’s probability is very high, so with my approach the model picks a nonsensical token. I wonder if there’s a way to select the least probable token only when the model is not sure about the next one.
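For what it’s worth, here is how I’d write that “meh band” idea down in code. The band boundaries, the confidence cutoff, and the fallback are all my own guesses at making the idea runnable, not an established method:

```python
# A rough sketch of the "meh band" idea: sample only from the tokens whose
# cumulative probability falls between `low` and `high`. The confidence
# check at the top is my own addition -- when the model is very sure,
# band-sampling would only pick nonsense, so we fall back to the argmax.
import numpy as np

def band_sample(probs: np.ndarray, low: float = 0.4, high: float = 0.6,
                confident: float = 0.8, rng=np.random.default_rng()) -> int:
    order = np.argsort(probs)[::-1]            # most probable first
    sorted_probs = probs[order]
    if sorted_probs[0] >= confident:
        # The model is (almost) sure; don't sabotage it.
        return int(order[0])
    cumulative = np.cumsum(sorted_probs)
    in_band = (cumulative >= low) & (cumulative <= high)
    if not in_band.any():
        # The band fell between two tokens; take the first token past `low`.
        in_band[int(np.searchsorted(cumulative, low))] = True
    band = order[in_band]
    band_probs = sorted_probs[in_band] / sorted_probs[in_band].sum()
    return int(rng.choice(band, p=band_probs))
```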


PS: I’ve never understood or successfully used the temperature parameter above 1.0, so if you do know how to use it, please let me know via e-mail below.

PPS: I wrote this entry while I was sick, so if you think there are some errors in it, please let me know via e-mail below.

My e-mail.