Senior data scientists know that the biggest ROI in real-world ML projects usually comes from finding and fixing issues in the dataset rather than endlessly tinkering with models. Today this is mostly done manually via ad hoc scripts in Jupyter notebooks. In data-centric AI, we also use software that can automatically detect data issues (mislabeled examples, outliers, etc.) to make all this more systematic: better coverage, reproducibility, and efficiency. While some companies are starting to offer commercial platforms for data-centric AI, cleanlab stands apart: it is fully open-source, it is a complete software framework usable for many data types and ML tasks, and I've published all of the novel algorithms cleanlab uses to help you improve messy real-world ML datasets.
In one line of Python, cleanlab can automatically:
(1) find mislabeled data + train robust models
(2) detect outliers
(3) estimate consensus + annotator-quality for datasets labeled by multiple annotators
(4) suggest which data is best to label or re-label next (active learning)
It has quick 5-minute tutorials for many types of data (image, text, tabular, audio, etc.) and ML tasks (classification, entity recognition, image/document tagging, etc.).
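To give a feel for what "finding mislabeled data" means under the hood, here is a toy, pure-Python sketch of the confident-learning idea behind cleanlab's label-issue detection. This is not the cleanlab API (the function name and details here are made up for illustration): given out-of-sample predicted class probabilities, flag examples whose annotated label receives less model confidence than is typical for that class.

```python
def find_suspect_labels(labels, pred_probs):
    """Toy confident-learning sketch: return indices of examples
    whose given label looks suspect, based on per-class confidence
    thresholds. `pred_probs[i][k]` is the model's out-of-sample
    probability that example i belongs to class k."""
    n_classes = len(pred_probs[0])
    # Per-class threshold: the mean predicted probability of class k,
    # averaged over the examples annotated as class k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)
    # Flag examples where confidence in the given label falls below
    # that label's threshold.
    return [i for i, (p, y) in enumerate(zip(pred_probs, labels))
            if p[y] < thresholds[y]]

labels = [0, 0, 1, 1]
pred_probs = [
    [0.9, 0.1],    # confidently class 0, labeled 0 -> fine
    [0.2, 0.8],    # model prefers class 1, labeled 0 -> flagged
    [0.1, 0.9],    # confidently class 1, labeled 1 -> fine
    [0.85, 0.15],  # model prefers class 0, labeled 1 -> flagged
]
print(find_suspect_labels(labels, pred_probs))  # -> [1, 3]
```

The real library does considerably more (calibration, joint noise estimation, ranking issues by severity), but the core intuition is this comparison of a label's confidence against a class-level baseline.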
Engineers have used cleanlab at Google to clean and train robust models on speech data, at Amazon to estimate how often the Alexa device fails to wake, at Wells Fargo to train reliable financial prediction models, and at Microsoft, Tesla, Facebook, and more. Hopefully you'll find cleanlab useful in your own ML applications; it's super easy to try out!
Beyond feature engineering, data-centric AI can also help on Kaggle. This notebook shows how easily cleanlab can improve the training dataset for an XGBoost model, producing a 12% reduction in error without any change to the existing model, training, or data-processing code:
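The notebook's workflow boils down to a simple loop that leaves the model untouched. Here is a minimal, model-agnostic sketch of that loop (the `train_model` and `find_issues` callables are placeholders standing in for your existing training code and an issue detector, not the notebook's actual functions): detect likely label issues, drop those rows, and retrain the same model on the cleaned data.

```python
def clean_and_retrain(X, y, train_model, find_issues):
    """Drop rows flagged by `find_issues(X, y)` (which returns a list
    of indices), then retrain the unchanged model on the cleaned data."""
    issue_idx = set(find_issues(X, y))
    X_clean = [x for i, x in enumerate(X) if i not in issue_idx]
    y_clean = [t for i, t in enumerate(y) if i not in issue_idx]
    return train_model(X_clean, y_clean)

# Demo with stand-ins: the "model" just records its training data,
# and the "detector" flags row 1 as a likely label issue.
model = clean_and_retrain(
    X=[10, 20, 30],
    y=[0, 1, 0],
    train_model=lambda X, y: (X, y),
    find_issues=lambda X, y: [1],
)
print(model)  # -> ([10, 30], [0, 0])
```

The point of this shape is that the improvement comes entirely from the data side: the same training function is called, just on a better dataset.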
We are looking for more contributors to cleanlab in 2023. Help shape the future of data-centric AI and ensure it remains free software, especially if you love Python and practical tools for real-world data science!
However, as the ML researcher Michael Jordan (one of the most influential in the field) has previously stated, these sorts of long-term technology predictions are just fun science fiction, and there is essentially no academic rigor behind them:
Launching in Phoenix only; for anyone who's never been there, the roads are straight and wide with extremely sparse car and foot traffic and zero bad-weather days. For those living in actual big US cities, don't hold your breath that this service will launch anytime soon for you...
Great idea! I can think of many ways this sort of app could be extended to improve sentence structure via other heuristics you might find in a grammar book.
There is currently very little evidence linking the measured expression of specific genes in the brain with phenotypic traits such as personality. Scientists find it hard enough to link attributes that are almost certainly genetic (such as various cancers and other diseases) to a consistent set of genes in analyses of differential gene expression.
Great to see this being done. I wonder how many emails a typical member of Congress must receive before they take note of an issue. Furthermore, how many emails does it take to sway their opinion on said issue? Or are dollars the only medium that can induce the wild thought of reevaluating one's perspective?