This paper is interesting:
https://openreview.net/forum?id=wYGBWOjq1Q
At a high level, this is the second time I’ve run across a result showing that a certain kind of pre-training offers no advantage over random initialization for a particular kind of downstream training and task. The other instance was in cell transcriptomics rather than language tasks, but the idea is similar.
That’s just one of several interesting insights the authors discuss, any of which could offer ideas for further experiments.