As a transition from debunking disinformation to kaggling, here is a thread debunking several myths about Kaggle, including lack of relevance to the real world, overfitting, AutoML performance on Kaggle, etc. Bear with me. 1/N
I am discussing Kaggle competitions. Kaggle is more than that, way more even, but competitions are what interest me most on Kaggle. The first myth about Kaggle is that it is irrelevant to the real world. There are several ways to address that. 2/N
First, why would companies pay Kaggle to host a competition? It must be that these companies see some benefit in it. Second, Kaggle hosts science competitions where the challenge is to advance the state of the art in specific domains. 3/N
For instance, Kagglers have produced models that diagnose melanoma from skin lesion pictures better than medical doctors. Kagglers have produced models that predict quantum properties of molecules hundreds of times faster and more accurately than physics models, etc. 4/N
Another criticism is that Kagglers learn how to overfit. Actually, I would argue that data scientists without any Kaggle competition experience are way more likely to produce overfitting models. That's what I have observed in my professional environment. 5/N
The Kaggle setting involves private test data that is used for the final ranking in competitions. During the competition no one can get feedback from the private test data, hence no one can overfit to it. This enables an unbiased model evaluation. 6/N
This unbiased evaluation is the single most important feature of Kaggle. It teaches Kagglers that properly evaluating predictive model performance is key.
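As an aside, here is a minimal sketch of what that protocol looks like in code. It is not Kaggle's actual infrastructure, just an illustration using scikit-learn and synthetic data: a "private" split is held out, never used for tuning, and scored exactly once at the end.

```python
# Minimal sketch of Kaggle-style evaluation (assumptions: scikit-learn, synthetic data).
# The "private" split plays the role of Kaggle's private test set: it gives no
# feedback during modelling and is scored exactly once at the end.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Split off the private test set first; it stays untouched until the very end.
X_public, X_private, y_public, y_private = train_test_split(
    X, y, test_size=0.3, random_state=0
)
# The public part can be used freely for training and validation feedback.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_public, y_public, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Validation score: the feedback loop you are allowed to tune against.
print("validation AUC:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))

# Private score: computed once, after all modelling decisions are frozen.
print("private AUC:", roc_auc_score(y_private, model.predict_proba(X_private)[:, 1]))
```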

Another criticism is that Kaggle ignores too much of the end-to-end machine learning lifecycle. 7/N
True, Kaggle competitions do not address:
- the choice of the problem to solve with machine learning,
- data collection, to some extent,
- downstream deployment of models in production and their monitoring.
Recently introduced analytics competitions address the first item. 8/N
Most competitions allow the use of external data, which addresses the second item. Last, some competitions are code competitions, meaning one must provide relatively robust and efficient inference code, which is a first step towards the last item. 9/N
Another criticism is that Kaggle competitions are easy and that AutoML tools have won several of them. These claims are always wrong. The reality is that people run today's technology on years-old competitions. This is totally unfair to the people who competed at the time. 10/N
It is unfair because people have access to information unavailable to competitors at the time: the ability to use private test data feedback to tune models, the ability to read how the best competitors solved the problem, and the ability to use technology that did not exist back then. 11/N
What would be convincing is to have an automated tool enter a live competition and win it. I bet this is not going to happen for quite a while.

Another criticism is that winning solutions are way too complex to be used in the real world. 12/N
This goes back to the Netflix challenge, where the winning solution was not used as is. People wrongly concluded that none of the winning solution was reused. In reality, key parts of it, factorization machines and some feature engineering, were used in production. 13/N
Moreover, we see a trend where solutions are simpler than before: just a blend of several models. Each of the individual models could be used as is in production. One reason is probably code competitions, which put a limit on inference code complexity. 14/N
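To show how simple such a blend can be, here is a sketch, again assuming scikit-learn and synthetic data rather than any particular winning solution: a few diverse models are trained and their predicted probabilities are just averaged, while each component model remains usable on its own.

```python
# Minimal sketch of a blend of several models (assumptions: scikit-learn, synthetic data).
# Each base model could be deployed on its own; the blend is a plain average of
# their predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=1),
    GradientBoostingClassifier(random_state=1),
]

preds = []
for model in models:
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    preds.append(p)
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, p), 4))

# The blend: average the individual predictions.
blend = np.mean(preds, axis=0)
print("Blend AUC:", round(roc_auc_score(y_test, blend), 4))
```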
I'll stop here for the moment but I may revisit and expand. Hope you enjoyed it if you read this far. 15/15
You can follow @JFPuget.