Preamble
Industry Reality and Competition-Winning Solutions
According to Zygmunt Zając, most data mining and machine learning competitions are won by meticulously improving the score by tiny fractions, usually through feature engineering, constructing different feature sets, and ensembling models. This approach, however, differs from industry reality: companies vary widely in how they deploy code and put models into production.
One piece of evidence comes from the famous Netflix Prize. Teams labored for three years to reach the target score. In the end the contest gave a great boost to matrix factorization and machine learning research in general, but Netflix never implemented the winning solution.
In Netflix's own words:

> We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.
Lessons from Interesting Kaggle Competitions
The Hunt for Prohibited Content
Predicting which ads contain illicit content.
The goal of this Kaggle challenge was to build a predictive model that learns from moderators' decisions how to classify whether an ad contains illicit content.
In practice, it amounted to classifying Russian-language text.
The training set had roughly 1.3 million records, each consisting of a title, a description, attributes (key:value pairs), category and subcategory assignments, and a few numeric features, including price.
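Since each record mixes free text, key:value attributes, categories and numbers, a natural first step is serializing ads into a learner-friendly format. Below is a minimal sketch, with a hypothetical record layout (the field names are illustrative, not the competition's actual schema), of turning one ad into Vowpal Wabbit's input format, one namespace per field group:

```python
# A minimal sketch, assuming a hypothetical record layout (field names are
# illustrative, not the competition's actual schema): serialize one ad into
# Vowpal Wabbit's input format, one namespace per field group.

def to_vw_line(record, label):
    """Format one ad as a VW example; label is 1 (illicit) or -1 (clean)."""
    def clean(text):
        # '|' and ':' are special characters in the VW format, so drop them.
        return str(text).replace("|", " ").replace(":", " ")

    attrs = " ".join(clean(f"{k}={v}") for k, v in record["attrs"].items())
    return (
        f"{label} "
        f"|title {clean(record['title'])} "
        f"|desc {clean(record['description'])} "
        f"|attrs {attrs} "
        f"|cat {clean(record['category'])} {clean(record['subcategory'])} "
        f"|num price:{record['price']}"
    )

print(to_vw_line(
    {"title": "iPhone 5", "description": "новый, гарантия",
     "attrs": {"condition": "new"}, "category": "electronics",
     "subcategory": "phones", "price": 20000},
    label=1,
))
```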
Lessons
Kaggle vs. industry, as seen through the lens of the Avito competition
- Feature Engineering Approaches:
- Using multiple feature sets
- Using term frequency-inverse document frequency (TF-IDF) weighting (see the sketch after this list)
- Re-training a classifier on its own predictions
- kNN
- Factorization machines
- Separate models for each category
- n-grams
- Tools Used
- Scikit-learn
- Vowpal Wabbit
- Code
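To make the TF-IDF and n-gram items above concrete, here is a minimal scikit-learn sketch; the toy documents and labels are invented, and this is not any competitor's actual pipeline:

```python
# A toy sketch of TF-IDF features with word 1- and 2-grams in scikit-learn;
# the documents and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "продам детскую коляску",       # selling a baby stroller
    "казино бонус быстрые деньги",  # casino bonus, fast money
    "сдам квартиру в центре",       # renting out a flat downtown
]
labels = [0, 1, 0]  # 1 = illicit, 0 = clean

# sublinear_tf dampens raw term counts; ngram_range=(1, 2) adds 2-grams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # probability each ad is illicit
```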
Lessons learned from the Hunt for Prohibited Content on Kaggle
- What did work
- Ensembling Vowpal Wabbit models (see the sketch after this list)
- Using an illicit score
- All loss functions gave good results
- Neural networks
- Reducing overfitting
- 2-grams on ALL the features
- Having access to a fast 32GB RAM machine
- Encoding integers in categorical variables
- What did not (quite) work
- Hyperparameter tuning
- TF-IDF
- Quick character encoding handling
- Proper dataset inspection
- Bagging SVDs
- Tools Used
- Scikit-learn
- Vowpal Wabbit
- Code
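As a concrete reading of the first "what did work" item, here is a minimal sketch of one plausible ensembling scheme: train several Vowpal Wabbit models (for example with different loss functions), write their predictions out with `vw ... -p <file>`, and average them. The file names are hypothetical:

```python
# A minimal sketch (an assumed approach, not the competitors' exact code):
# average the raw predictions of several Vowpal Wabbit models, each saved
# with `vw ... -p <file>`. The file names below are hypothetical.
import numpy as np

pred_files = ["preds_logistic.txt", "preds_hinge.txt", "preds_squared.txt"]

def load_preds(path):
    """Read one VW prediction per line and squash it to [0, 1]."""
    with open(path) as f:
        raw = np.array([float(line.split()[0]) for line in f])
    # VW's raw scores are not probabilities; a sigmoid puts all models on
    # a common scale before averaging (a heuristic, not part of VW itself).
    return 1.0 / (1.0 + np.exp(-raw))

ensemble = np.mean([load_preds(f) for f in pred_files], axis=0)
np.savetxt("ensemble_preds.txt", ensemble)
```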
Other Solutions
State of the Tools for Big Data Analysis and Machine Learning
Vowpal Wabbit (VW)
VW is a fast out-of-core learning system able to learn from terafeature datasets with ease.
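For illustration, a typical invocation (standard VW flags, though not necessarily the settings used in the competition) trains a logistic model out of core with 2-grams and several passes over a cached copy of the data, then scores a held-out file:

```
vw train.vw --loss_function logistic --ngram 2 -b 28 \
   --passes 3 --cache_file train.cache -f model.vw
vw test.vw -t -i model.vw -p predictions.txt
```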