
We could run some tests to first find out whether comparative performance tests can be devised at all:

One can intentionally use a recent model and a much older one to figure out whether the tests are reliable, and in which domains they are reliable.

One can compute each model's joint probability for a sequence and compare how likely each model finds the same sequence.
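
A minimal sketch of that comparison, assuming two causal LMs loaded via the Hugging Face transformers library (the model names and the example text are just placeholders):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  def sequence_logprob(model_name: str, text: str) -> float:
      tok = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)
      ids = tok(text, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(ids).logits              # [1, seq_len, vocab]
      # log p(t_i | t_1..t_{i-1}) for every position after the first
      logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
      target = ids[:, 1:]
      token_lp = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
      return token_lp.sum().item()                # joint log-probability of the sequence

  text = "the capital of france is paris"
  print(sequence_logprob("gpt2", text))           # older / smaller model (placeholder)
  print(sequence_logprob("gpt2-large", text))     # newer / larger model (placeholder)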

We could ask both to start talking about a subject, but alternating: each emits one token in turn. Then look again at how the dumber and the smarter model judge the resulting text: does the smart one tend to pull up the quality of the text, or does it tend to get dragged down towards the dumber participant?
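
A rough sketch of that alternating-generation game, assuming both models share a tokenizer (as gpt2 and gpt2-large do); greedy decoding is used only to keep the example short:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")          # assumes a shared tokenizer
  smart = AutoModelForCausalLM.from_pretrained("gpt2-large")
  dumb = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("the theory of evolution says that", return_tensors="pt").input_ids
  players = [smart, dumb]
  for step in range(40):                               # alternate: smart, dumb, smart, ...
      model = players[step % 2]
      with torch.no_grad():
          logits = model(ids).logits[:, -1, :]         # next-token distribution
      next_id = logits.argmax(dim=-1, keepdim=True)    # greedy pick, sketch only
      ids = torch.cat([ids, next_id], dim=-1)

  print(tok.decode(ids[0]))
  # each model's joint log-probability of the result can then be computed as above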

Given enough such tests to "identify the dumb one vs. the smart one", we can verify them on pairs where there is common agreement (an extreme example: word2vec vs. a transformer) to assess the quality of the test itself, regardless of domain.

On the assumption that such or similar tests let us identify the smarter one, i.e. assuming we find plenty of such tests, we could demand that model makers publish open weights so that performance claims can be publicly verified.

Another idea is self-consistency tests. A single forward inference over a context of, say, 2048 tokens (just an example) effectively predicts the conditional 2-gram, 3-gram, 4-gram, ... probabilities on the input tokens: each output token distribution is conditioned on the preceding inputs. So with 2048 input tokens there are 2048 output distributions; the position-1 output is the predicted token vector (logit vector, really) estimated to follow the position-1 input token, the position-2 output vector is the prediction following the first 2 input tokens, and so on, and the last vector is the predicted next token following all 2048 input tokens: p(t_{i+1} | t_1 = a, t_2 = b, ..., t_i = z).
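
A small sketch illustrating that one forward pass yields a next-token distribution at every position (the model name and input are placeholders):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("a b c d e", return_tensors="pt").input_ids    # n input tokens
  with torch.no_grad():
      logits = model(ids).logits                           # shape [1, n, vocab_size]

  # logits[0, i, :] is the predicted distribution over t_{i+2} given the first
  # i+1 input tokens; logits[0, -1, :] predicts the token after the whole input.
  print(logits.shape)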

But that is just one way the next token can be predicted with the network. Another approach would be gradient descent via reverse-mode automatic differentiation (RMAD), keeping the model weights fixed and treating only the last, say, 512 input vectors as variables. How well do the last 512 forward-prediction output vectors match the output vectors found by gradient descent for the best joint probability?
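
One possible reading of this, as a sketch only: freeze the weights, make the embeddings of the last k input positions trainable, and maximize the sequence likelihood by reverse-mode gradient descent; the model name, k, and the optimizer settings here are arbitrary:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  for p in model.parameters():
      p.requires_grad_(False)                              # weights stay fixed

  ids = tok("the quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
  emb = model.get_input_embeddings()(ids).detach()         # [1, n, d]
  k = 4                                                    # last k positions are variable
  free = emb[:, -k:, :].clone().requires_grad_(True)
  opt = torch.optim.Adam([free], lr=1e-2)

  for step in range(100):
      inputs = torch.cat([emb[:, :-k, :], free], dim=1)
      out = model(inputs_embeds=inputs, labels=ids)        # cross-entropy over the sequence
      opt.zero_grad()
      out.loss.backward()                                  # reverse-mode autodiff (RMAD)
      opt.step()

  # 'free' now holds the gradient-descent estimate of the last k input vectors;
  # compare it against the forward-pass predictions for those positions.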

This could also be added as a loss term during training, as a form of regularization, which roughly turns the model into a kind of energy-based model.
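
As a sketch of what such a regularizer might look like (the tensors and the weight lam are placeholders, and mean squared error is just one possible distance between the forward-pass predictions and the gradient-descent-refined vectors):

  import torch

  def total_loss(lm_loss: torch.Tensor,
                 forward_vecs: torch.Tensor,
                 refined_vecs: torch.Tensor,
                 lam: float = 0.1) -> torch.Tensor:
      # usual LM loss plus a self-consistency penalty
      consistency = torch.nn.functional.mse_loss(forward_vecs, refined_vecs)
      return lm_loss + lam * consistency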



Let's call this branch of research unsupervised testing.



