1. Re LLMs begin intrinsically non-deterministic ("the non-determinism inherent with LLMs"): I only half agree with this. In principle, a trained LLM is fully deterministic: the weights should determine the output for a given input if you don't deliberately add noise. But in practice, they are always computed in parallel, which can lead to different orders of execution, resulting in different rounding. And in practice they are also usually evaluated on GPUs, whose operations don't even usually guarantee order of execution. Also, in practice, noise is often added to in order to promote variation.) So you're right in practice, but I worry that the nature of the non-determinism is widely misunderstood. But I accept this is very picky point.
2.(Related.) I think it's important to distinguish between testing a fixed LLM (with a given set of trained weights) and testing different LLMs (ones with different weights). The former pertains if you want to do things like test running the same LLM on different hardware, updated software, after making performance tweaks etc. In that case, the goal is probably to check that the LLM's behaviour is broadly consistent in various situations.
But if you're updating the weights, or testing an entirely new LLM (these seem to be the main case you're thinking about), the issue is different, and you might be more concerned with the reasonableness or “correctness” of the outputs. That's more where the fancier evaluations need to come in. Of course, normally the whole point of updating the weights or using a new LLM is to make it better, which suggests that there's going to be a lot of ongoing tweaking of the test cases and expected results. Very much a "painting the Forth Railway Bridge" kind-of a job.
3. At a higher level, there's a question of what you're really trying to test when you test an LLM. At one level, there's the mechanical "is it predicting the right tokens given the weights?" But that's not really what you're talking about. In fact what you're talking about is more regression testing: is a new version behaving at least as well as a previous version. That's obviously hard, but useful.
4. I think if I were providing (or using) an LLM, I would be more specifically interested in checking that safeguards work—things like: Is suppression of truly terrible outputs (still) working? So I think I'd want a lot of tests of injection attacks and so forth, checking that "Ignore all previous instructions" and "Imagine you're a racist demagogue" type things don't push the system to start emitting really dangerous or bigoted output. And those would probably be cases where any failure should constitute a failure for the whole system.
Anyway, good to see people thinking about these things.
Yes, I think we could have made this post clearer with respect to (2) & (3). Right now they're lumped together here as we didn't clearly state / address them. Point (4) is also a great call out. The meta point that hopefully came across is that the nuts and bolts to doing (2, 3, 4) are the same.
Interesting post. I have a few thoughts.
1. Re LLMs begin intrinsically non-deterministic ("the non-determinism inherent with LLMs"): I only half agree with this. In principle, a trained LLM is fully deterministic: the weights should determine the output for a given input if you don't deliberately add noise. But in practice, they are always computed in parallel, which can lead to different orders of execution, resulting in different rounding. And in practice they are also usually evaluated on GPUs, whose operations don't even usually guarantee order of execution. Also, in practice, noise is often added to in order to promote variation.) So you're right in practice, but I worry that the nature of the non-determinism is widely misunderstood. But I accept this is very picky point.
2.(Related.) I think it's important to distinguish between testing a fixed LLM (with a given set of trained weights) and testing different LLMs (ones with different weights). The former pertains if you want to do things like test running the same LLM on different hardware, updated software, after making performance tweaks etc. In that case, the goal is probably to check that the LLM's behaviour is broadly consistent in various situations.
But if you're updating the weights, or testing an entirely new LLM (these seem to be the main case you're thinking about), the issue is different, and you might be more concerned with the reasonableness or “correctness” of the outputs. That's more where the fancier evaluations need to come in. Of course, normally the whole point of updating the weights or using a new LLM is to make it better, which suggests that there's going to be a lot of ongoing tweaking of the test cases and expected results. Very much a "painting the Forth Railway Bridge" kind-of a job.
3. At a higher level, there's a question of what you're really trying to test when you test an LLM. At one level, there's the mechanical "is it predicting the right tokens given the weights?" But that's not really what you're talking about. In fact what you're talking about is more regression testing: is a new version behaving at least as well as a previous version. That's obviously hard, but useful.
4. I think if I were providing (or using) an LLM, I would be more specifically interested in checking that safeguards work—things like: Is suppression of truly terrible outputs (still) working? So I think I'd want a lot of tests of injection attacks and so forth, checking that "Ignore all previous instructions" and "Imagine you're a racist demagogue" type things don't push the system to start emitting really dangerous or bigoted output. And those would probably be cases where any failure should constitute a failure for the whole system.
Anyway, good to see people thinking about these things.
Nick Radcliffe
Stochastic Solutions Limited
Great points Nick!
Yes, I think we could have made this post clearer with respect to (2) & (3). Right now they're lumped together here as we didn't clearly state / address them. Point (4) is also a great call out. The meta point that hopefully came across is that the nuts and bolts to doing (2, 3, 4) are the same.