I think they are distinguishing the mechanical process of generation from the way the idea exists. It’s the same as how a person can literally only speak one word at a time but the ideas might be nonlinear.
If so, they are wrong. :) Autoregressive just means that the probability of the next token is a function of the already seen/emitted tokens. Any "ideas that may exist" are entirely embedded in this sequence.
Obviously wrong: otherwise every model would predict exactly the same thing. It would not even be predicting anymore, simply decoding.
The sequence is not enough to reproduce the exact output; you also need the weights.
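To make the dependence on the weights explicit, here is the usual autoregressive factorization. The notation is my own assumption, not something from the thread: $\theta$ stands for the model weights and $x_1, \dots, x_T$ for the tokens.

```latex
\[
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
\]
```

Each factor is conditioned on the earlier tokens, but the function that maps those tokens to a distribution is fixed by $\theta$, so two models with different weights can assign different probabilities to the same prefix.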
And the way the model works is by attending to its own internal state (weights*input) and refining it, both across the depth (layer) dimension and across the time (tokens) dimension.
The fact that you can get the model to give you the exact same output by fixing a few seeds is only a consequence of the process being Markovian, and is orthogonal to the fact that at each token position the model is “thinking” about a longer horizon than the present token and is able to reuse that representation at later time steps.
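A toy sketch of the reproducibility point only, not of a real LLM: everything here (`make_weights`, `next_token_probs`, `generate`, the 5-token vocabulary, the averaged-embedding "state") is a hypothetical stand-in made up for illustration. It shows that a fixed sampling seed reproduces the output, and that the next-token distribution depends on the weights as well as the prefix. It says nothing either way about what a real model represents internally.

```python
import numpy as np

VOCAB = 5  # toy vocabulary size (assumption for illustration)

def make_weights(seed):
    """Hypothetical 'model': an embedding table and an output matrix."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(VOCAB, 8)), rng.normal(size=(8, VOCAB))

def next_token_probs(weights, prefix):
    """p(next | prefix), computed from the weights and the emitted tokens."""
    embed, out = weights
    state = embed[prefix].mean(axis=0)  # crude stand-in for an internal state
    logits = state @ out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def generate(weights, prompt, steps, seed):
    """Sample one token at a time, feeding each sample back into the prefix."""
    rng = np.random.default_rng(seed)  # the only source of randomness
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(weights, tokens)
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

model_a = make_weights(0)
model_b = make_weights(1)

# Same weights, same prompt, same sampling seed: identical output.
run_1 = generate(model_a, [0], steps=10, seed=42)
run_2 = generate(model_a, [0], steps=10, seed=42)
print(run_1 == run_2)

# Same prefix, different weights: different next-token distributions.
print(next_token_probs(model_a, [0, 1]))
print(next_token_probs(model_b, [0, 1]))
```

In this sketch the state is recomputed from scratch at every step. A real transformer caches per-token internal activations and attends to them at later positions, which is the part this toy deliberately leaves out.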
At no point have I argued that LLMs aren’t autoregressive. I am merely talking about LLMs’ ability to reason across time steps, so it seems we are talking past each other, which won’t lead anywhere.