You need a lot of data points to understand a new model and what you actually have. Trying to gauge it from a few benchmarks is misleading. But if you have dozens of them, from a variety of sources, and you put them together with the model card tests and the model welfare information, you can start to see a consistent pattern. Gauging reactions requires volume and calibration, now more than ever, because people are definitely nuts, or at least draw global conclusions from local data. There will always be people saying that the new model is bad, or that the service got bad, or that it got bad in exactly the way it clearly got good. I definitely notice the people saying 4.8 is a terrible model, despite this being obviously not true. And others will say it’s great, again regardless of the underlying value. But with the reaction threads and good calibration, you can pick out the patterns. The model welfare information helps a lot, too. You are dealing with a mind that has a bunch of…