I haven't got a clue about meteorology, but I know a thing or two about machine learning, and the majority of papers in the field remain interesting prototypes because they are tested in the same way: namely, on a test dataset that is already known and readily accessible during training -- by the experimenters, and therefore, indirectly, by the model being trained.
This is the case with this paper, from which I quote (see section "Dataset creation and splits"):
>> The available data spans a period from July 2017 to August 2020.
In other words, they have shown that their system works better than an earlier system on old data. It can predict rainfall in a three-ish year interval in the past.
Show that your approach works on real-world data that you didn't have access to when training your model. Show that your models can predict the future, not the past. Otherwise, your experiments don't mean what you think they mean.
Edit: and to make it perfectly clear for the uninitiated: when you, as an experimenter, know the "ground truth" of your test data, there is nothing easier than to tune the training of your model to maximise its performance on the test data. That's doubly so, and doubly as dangerous, with neural nets, which are very, very good at reproducing their training set but very, very bad at generalising beyond it. Unfortunately, this is what the vast majority of experiments in machine learning do: they test on known data. The result is that nobody really knows how good the tested systems are until they deploy them in a real environment (if they ever do, which they usually don't, because the whole point was to write a paper reporting improved performance and then move on to the next one).
Best practice here is to keep a chunk of your dataset held out from your entire development process (a "held out test set"), and then only evaluate it when you write your final report on your work.
The idea is that you don't even look at this data until you're confident you've solved the problem, and then you have one shot to confirm that you have, indeed, solved it. Obviously you have to be careful in selecting this set -- if you're working with time-series, for example, you want it to be temporally disjoint -- but done well it's a reasonable approximation for "future results".
About as good as you can get without collecting new data.
I can't tell if they did that here, but if they did, they didn't use the common jargon for it.
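For anyone who hasn't seen it in practice, here's a minimal sketch of what a temporally disjoint held-out split might look like; the column names and cutoff date are made up for illustration:

```python
import pandas as pd

# Toy example: one row per day with a timestamp column (names are illustrative).
df = pd.DataFrame({"timestamp": pd.date_range("2017-07-01", "2020-08-31", freq="D")})
df["rainfall"] = 0.0  # placeholder target

# Everything after the cutoff is locked away until the final write-up.
cutoff = pd.Timestamp("2020-01-01")
development = df[df["timestamp"] < cutoff]   # train, validate, and tune on this only
held_out    = df[df["timestamp"] >= cutoff]  # evaluated exactly once, at report time
```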
Suppose you do this, try some different models, and get the error down on the training sample. Then you apply it to the test sample and the results are ... about the same as the well-known results. Best practice says you should now quit and move on to another project, and either publish a negative result or not bother publishing at all.
Human nature, and publish-or-perish rewards, mean the researcher will say "well, what if I try this other model? Or tweak this parameter, or ..."
In other words, it's unrealistic to say you have one shot to confirm that you have, indeed, solved it.
Not unrealistic -- it just causes people to be very, very careful about not fooling themselves.
Do you really believe that this practice is unrealistic because anyone with a model that doesn't do well on the held-out set will simply engage in fraud? That is a very, very cynical take.
FWIW, I have published this way [0]. There were definitely some aspects of the model's performance on the held-out set that did not meet test set performance; it's not fatal to the paper, it's an invitation to discuss.
Fraud is not needed. My intuition is that most people publishing in machine learning have simply not thought carefully enough about their experimental methods. They're not defrauding anyone, they're misunderstanding their own results.
And since the majority of published papers do the same, it's very difficult to convince anyone what the good practice is.
Oh, we're on the same page -- I don't think they're engaging in fraud currently, just overlooking the implications of their process, as you note.
But explicitly holding out a test set, throwing it away because your results on that set were no good, trying again, and then representing that held-out set as, indeed, held-out in later work -- as the parent suggested? I'm not sure what other word to use there.
"The training, validation and test data sets are generated without overlap from periods in sequence. Successive periods of 400, 12, 40, 40 and 12 h are used to sample, respectively, training, validation, and test data, with the two 12 h periods inserted as hiatus."
What's missing is any indication that they did not use the test set during the training process.
A common split is train/validate/test, but all three are used during training -- train to actually train, validate for intermediate loss, test for model comparison.
What you want is a fourth, held-out test set that isn't looked at until publish time.
This paper has two test sets, but they have different data properties, and it's not clear they were held out until publication.
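If I'm reading the quoted split correctly, it's a repeating block pattern along the time axis. A rough sketch of that pattern (my own interpretation, not code from the paper; in particular, which of the two 40 h blocks is validation and which is test is my guess):

```python
# Block lengths in hours, per the quoted description:
# 400 h train, 12 h hiatus, 40 h validation, 40 h test, 12 h hiatus, then repeat.
PATTERN = [("train", 400), ("hiatus", 12), ("val", 40), ("test", 40), ("hiatus", 12)]

def assign_split(hours_since_start):
    """Map a sample's offset from the start of the record to its split."""
    t = hours_since_start % sum(length for _, length in PATTERN)  # 504 h per full cycle
    for name, length in PATTERN:
        if t < length:
            return name
        t -= length

print(assign_split(0), assign_split(405), assign_split(430))  # train hiatus val
```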
I can't believe you can publish with such an obvious flaw in study design. You don't have to have machine learning expertise to catch that, because the same idea applies to backtesting any kind of model at all.
And yet they do. I don't remember ever reading any machine learning paper where the authors point out that they held out a test partition that they couldn't access until some "final" step of the experiment.
I don't understand why there is any doubt about this. From OP, section "Methods", subsection "Data set creation and splits":
"The training, validation and test data sets are generated without overlap from periods in sequence. Successive periods of 400, 12, 40, 40 and 12 h are used to sample, respectively, training, validation, and test data, with the two 12 h periods inserted as hiatus."
I know nothing about machine learning, but I was in grad school with others studying machine learning and I took a Coursera course on the subject, and carving out a training set is utterly standard practice. A paper that didn't do this would get filtered by a grad student reviewer. So I'm not sure what possible sub-category of machine learning your statements could apply to. Can you share any examples? For example, do any of the famous papers in the field fail to hold out test data? Something like AlphaGo or AlphaZero is indeed testing on all new data -- new games it plays with others. Do any papers in well-known machine learning fora fail to hold out test data? Does anything that has bubbled to the front page of HN? It would be interesting to look at if so.
Do you mean p-hacking? That's a much more subtle thing.
Very few ML papers use a held-out test set that is only used for publication.
When you use your test set as part of model selection, you don't really have an empirical test set -- you do have a training set that you've partitioned into various parts for various purposes, but you don't have a proxy for how the system will perform in the face of new data, because you're reporting results that are based on data that you used as part of your model-generating process.
If you try out models A, B, C, and D while developing, and then it turns out model D has the best performance on your test set, and you report your results using model D and that same test set, you are essentially "training" using your test set.
Yes, in some problem contexts, for ML researchers who are also in the business of generating their own datasets, the data they test on can be entirely new. But for anyone using a pre-existing dataset, holding out a test set for your project is not yet a well-established practice -- and this makes it somewhat harder to trust the results.
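To make the leakage concrete, here is a small sketch of the distinction: candidates are compared on the validation set, and the held-out test set is touched exactly once, after the choice is final. The data and models are synthetic stand-ins, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data, split chronologically into train / validation / held-out test.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X_tr, y_tr = X[:800], y[:800]
X_val, y_val = X[800:900], y[800:900]
X_test, y_test = X[900:], y[900:]

# Model selection uses the validation set only.
candidates = {alpha: Ridge(alpha=alpha).fit(X_tr, y_tr) for alpha in (0.1, 1.0, 10.0)}
val_mse = {a: mean_squared_error(y_val, m.predict(X_val)) for a, m in candidates.items()}
best = min(val_mse, key=val_mse.get)

# The held-out test set is consulted once, with the already-chosen model.
print("reported MSE:", mean_squared_error(y_test, candidates[best].predict(X_test)))
```

Comparing A, B, C, and D on X_test and then reporting the winner's X_test score is exactly the shortcut described above.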
It's worth noting that the paper was submitted in August of 2021. So perhaps it took a year to develop the methodology and get the paper ready for publishing. They did use separate training, validation, and test sets for data, although we do need to trust that the test set was only used a single time. To whatever extent the model doesn't do as well in 2022 as it did at the time it was created, that comes down to concept drift and data drift.
> Show that your approach works on real-world data, that you didn't have access to when training your model. Show that your models can predict the future, not the past. Otherwise, your experiments don't mean, what you think they mean.
Great, but ... that requires you to have data-collection infrastructure, and it almost by definition means you can't do it with a fixed, pre-existing dataset. The way ML research works is like the Kaggle competitions.
You get a dataset. This can be anything, as long as it can be expressed as a tensor (a generalisation of vectors to more dimensions). Then you split it into seen and unseen data: three portions/partitions, one for training, one for testing, and one for blind validation (80% training and 10% each for the test and validation sets).
In order to simulate what you're suggesting, researchers can, and often do, take the most recent datapoints as the validation and test sets.
But extending the data set is an expensive process that generally can't be done by the machine learning researchers themselves. So it is not done.
As an anecdote, I just started a masters in data analytics and I've been having a lot of fun asking ChatGPT for help with detecting overfitting and with useful terms like mean squared error (MSE).
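For what it's worth, the basic check it walked me through looks something like this (toy data; the gap between training and validation MSE is the overfitting signal):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem with noise.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# An unconstrained tree memorises the training data.
model = DecisionTreeRegressor(max_depth=None).fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))   # near zero
print("val MSE:  ", mean_squared_error(y_val, model.predict(X_val)))  # much larger -> overfitting
```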
Absolutely; I just mean I love it for guiding my learning :) I don't use it for final projects, but it does help me navigate the material if I'm not comprehending it well. I hope ChatGPT is embraced for that reason, because it's really like finding the key to a lock you couldn't open with professors alone: they explain things in the most effective way they know how, but ChatGPT won't get exhausted or impatient with you if you ask it the same question in different ways. That said, it does produce erroneous information -- hopefully less often than my professors do.
In competition formats, the test data is not seen by anybody and after submission, the best algorithm is ranked based on how well it works on this unseen test data.
I would love to read a science fiction story about an AI scientist who is monitoring the output of an AI weather model as a hobby, and then suddenly, on a seemingly completely normal day -- sunny, windless -- the AI starts screaming bloody murder, huge red warnings everywhere, like the world is about to end. Spooky just to think about. Could be a fun thriller.
This reminds me of a movie, the name of which I can’t remember.
> A family father gets premonitions about a massive storm that will come, which triggers him building a bunker in his back yard. It is a family drama on how his obsession with protecting his family slowly tears it apart. In the end they convince the father to go on vacation for a few days, just to get away from the obsession. As they arrive in their vacation home, a huge, black storm cloud rolls in on the horizon
Kinda what happened in the first-ever use of CFD to predict weather (which predated digital computers -- all calculations were done by hand). Spoiler: there were no calculation errors, but the method was unstable (a kind of positive feedback where errors "blow up"). Does this count as "AI"? Kinda -- maybe even god-like.
Anyway, everyone just ignored it. Except the scientist, who I suspect FTF out.
“Huh. Management isn’t going to like this,” said the data scientist. “Tell you what, let’s just agree it’s an outlier, and delete it. I’ve got tickets to close and the sprint ends tomorrow, I don’t have time for this.”
I’m an armchair, skill-free weather enthusiast. The HRRR model is amazing because it runs every hour, predicting the next 18 hours with considerable reliability. To do better than the HRRR, with significantly less computational effort, must be causing a lot of meteorologists a degree of discomfort.
> To do better than the HRRR, with significantly less computational effort, must be causing a lot of meteorologists a degree of discomfort.
It doesn't cause any such discomfort, because in the grand scheme of things, while this is still very impressive and interesting work, it's still a toy in comparison with what a tool like the HRRR is typically used for. Reported increases in performance are skewed towards the first few hours of the forecast, where all CAMs generally have some issues because of inconsistencies between the assimilated model initial state and the real world (e.g. small deficits in the structure of convective systems at the initial state can dominate precipitation forecast skill at short lead times), and where traditional nowcasting systems are already significantly superior.
There's little evidence reported that this modeling paradigm can even fundamentally tackle the most critical aspect of short-term mesoscale/convective forecasting, which is hysteresis -- the initiation of convection and the structure it takes on in different environments. This is _by far_ the most important way that the HRRR is used to aid in short-range forecasting.
In the long arc of things, the community is very, very excited to see how novel approaches involving things like generative AI could lead to next-generation warn-on-forecast systems and large ensembles. And while this paper is a cool early step in this direction, it's a very small one in the bigger picture.
It makes sense that there would be better strategies for simulation than gridding Navier-Stokes (the regularities in weather phenomena show that the space of likely states isn't evenly distributed throughout the space that values on an evenly spaced lattice can represent), but it's probably too complicated to come up with those schemes manually. If I had to guess what the model is "doing" (as opposed to, say, compensating for undiscovered problems with data acquisition), that would be it.
"These properties can not only offer improved forecasts, but also frequent and personalized forecasts"
I'm imagining those weather stations you can buy for your house coming with a GPU and a local model you can finetune on your own measurements and the measurements of people nearby - get an ensemble of predictions from neighboring models... Seems like a neat addition to a smart home.
You don't need any of that, all you need to do is collect personal weather station data and use it to statistically post-process/correct numerical model output that is already freely available. This has already been done several times over in the weather industry; it doesn't lead to much of an improvement in forecast quality.
Trying to run your own model with nearby inputs is kind of pointless because to actually assimilate measurements in your neighborhood, you'd have to run at an absurdly high resolution that makes it way too expensive to run an operational forecast as a commodity. You'd be limited in the size of the forecast domain, so within a few hours (maybe a day at most), your forecast would be dominated by boundary conditions from the parent regional/global models you force it with, and there would be no further propagation of information from your local observations to refine the forecast.
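For context, the "statistical post-processing" in question is often as simple as regressing the freely available model output onto your local observations. A minimal MOS-style sketch with made-up station readings (nothing here is any particular vendor's product):

```python
import numpy as np

# Hypothetical paired samples: NWP forecast temperature vs. what the backyard station measured.
forecast = np.array([21.0, 18.5, 25.2, 30.1, 15.4, 19.9])
observed = np.array([19.8, 17.9, 24.0, 28.6, 14.7, 19.1])

# Fit a simple linear bias correction: observed ~ a * forecast + b.
a, b = np.polyfit(forecast, observed, deg=1)

def corrected(raw_forecast):
    """Apply the learned correction to a new raw model value."""
    return a * raw_forecast + b

print(corrected(22.0))  # locally corrected forecast for the backyard
```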
Weather prediction is the realm of relatively well understood differential equations based on physical properties. Where deep learning can play a role is in inferring what's going on between observations in space and time.
The problem is you need to define "decent spatial and time resolution." We regularly run kilometer-scale global models that eschew convective parameterizations and instead directly simulate those scales of motion. Over smaller domains we run models pushing beyond LES scales.
Deep learning applications in the field haven't come anywhere even _close_ to tackling these niches of the field yet. SOTA DL-based forecasting tools run at quarter-degree resolution, if even that, and the field hasn't even begun to run hierarchical or multi-scale models coupling coarse models to mesoscale or finer-grained ones. Hell -- you'd be hard-pressed to even find mesoscale DL weather simulations in the first place! So it's a big, big stretch to suggest that DL can achieve "the same quality results" when no one has even offered a cursory glance at an AI application bordering on the SOTA in NWP.
I don't. Weather prediction is a physical process with (I assume) fairly well known laws. That's the wheelhouse of traditional algorithms, not deep learning.
But the "known laws" are too convolved with chaotic behavior, which is why making predictions from them is really hard.
There's an analogy to the 90s dream of replacing fluid dynamics and its differential equations with cellular automata. The claim made about the former is that the "variables in the equations represent real things in the world", but in the end it is all just a story. If the cellular automata/neural modelling can tell a better story ....
> But the "known laws" are too convolved with chaotic behavior, which is why making predictions from them is really hard.
This canard always gets thrown out but the application here is flawed in two very big ways. First, "chaotic behavior" does not mean "unpredictable behavior." Modern numerical weather forecasting already deals with this through ensemble modeling techniques and other approaches designed to capture the statistics of the evolution of the weather, not just a single deterministic state. Anyone selling you a deterministic weather forecast from a single model is robbing you.
Second, DL will suffer the same challenges here, because many AI-based weather forecasting tools are auto-regressive: the model's output is used to seed the next step of the forecast. So the AI approach doesn't actually escape this hypothetical limitation (in fact, it might compound it badly).
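The auto-regressive point is easy to see in a generic sketch (not how any particular model is wired): each forecast step is fed back in as input, so any error in an early step is baked into every later one.

```python
def rollout(step_model, initial_state, n_steps):
    """Autoregressive forecast: feed each prediction back in as the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step_model(states[-1]))  # errors in states[-1] propagate forward
    return states

# Toy dynamics: the learned one-step model is slightly wrong; the error compounds with lead time.
truth = rollout(lambda s: 1.05 * s, 1.0, 12)
pred  = rollout(lambda s: 1.06 * s, 1.0, 12)
print([round(abs(t - p), 4) for t, p in zip(truth, pred)])  # gap widens every step
```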
Good points, of which I was aware (mostly). I was trying to dispel the "just because you know the laws of physics you can make predictions based on them (alone)" tone of the GP. But thanks for doing a better job at that than I did.
> in the end it is all just a story. If the cellular automata/neural modelling can tell a better story ....
Uhm no, it is all fluid dynamics. You are correct that weather is highly sensitive to initial conditions, but you are incorrect to conclude that that means DNNs are somehow better at dealing with that.
The only help they might be is in correcting systematic modelling and measurement errors, or speeding up computation (by guessing).
FD is just one of the stories we have to tell that allows us to predict and therefore engineer certain aspects of the universe in which we find ourselves. But it's no more than a story, no matter how fit it may be for the purpose.
Also, the claim would not be that DNNs are better at handling initial conditions, but rather that they are better at spotting patterns (ok, ok, call them correlations) than either ourselves or the systems we've built based on pre-conceived physical models.