--

The problem is that such curation is incredibly expensive. If you printed the text a very large language model is trained on (around a trillion tokens) at normal size, the paper would be 75 km long and 1 km wide. Humans can't curate such a vast corpus. That said, I doubt model collapse will occur. More likely, the core weights of the model will be trained on a known reliable dataset to establish base language understanding and general knowledge, with later releases fine-tuned on current text. Also, this problem diminishes as the models generating online content improve, because model collapse is triggered by the amplification of tiny flaws in the generated text.
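That amplification mechanism can be sketched with a toy simulation (not a real training loop, just an illustrative assumption): each "generation" of model re-estimates a token distribution from samples drawn from the previous generation's distribution. Rare tokens that happen not to be sampled vanish permanently, so the distribution's tail erodes over generations.

```python
import random
from collections import Counter

def resample_distribution(counts, n_samples, rng):
    """Draw n_samples tokens from the current empirical distribution,
    then re-estimate the distribution from those draws (the 'next model')."""
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n_samples)
    return Counter(draws)

rng = random.Random(42)
# A hypothetical vocabulary: one very common token plus many rare ones
# (standing in for the long tail of real text).
counts = Counter({"the": 1000})
for i in range(200):
    counts[f"rare_{i}"] = 1

sizes = [len(counts)]
for generation in range(10):
    counts = resample_distribution(counts, n_samples=1200, rng=rng)
    sizes.append(len(counts))

print(sizes)  # vocabulary size shrinks: dropped rare tokens never return
```

The support of the distribution can only shrink, never grow, which is why each successive generation is less diverse than the last; better generators (closer to the true distribution, sampled more widely) slow this erosion.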

--

--


Written by Pierz Newton-John

Writer, coder, former psychotherapist, founding member of The School Of Life Melbourne. Essayist for Dumbo Feather magazine, author of Fault Lines (fiction).
