Industry Reality and Competition Winning Solutions

According to Zygmunt Zając, most data mining and machine learning competitions are won by meticulously improving the score by tiny fractions, usually through feature engineering, constructing different feature sets, and ensembling the models. This approach, however, differs from industry reality, where each company has its own way of deploying code to production.

One piece of evidence comes from the famous Netflix contest. People labored for three years to get the target score. In the end it resulted in a great boost for matrix factorization and general machine learning research, but Netflix didn’t implement the winning solution.

As Netflix explained: "We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."

Lessons from Interesting Kaggle Competitions

The Hunt for Prohibited Content

Predicting which ads contain illicit content.

The goal of this Kaggle challenge is to create a predictive model that learns from moderators’ answers to classify whether an ad contains illicit content.

In practice, the task amounted to classifying Russian-language text.

The training set had roughly 1.3 million records, each consisting of a title, a description, some attributes (key:value pairs), category and subcategory assignments, and a few numeric features, including price.
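
To make the record layout more concrete, here is a minimal sketch of turning one such record into a line of Vowpal Wabbit's text input format. The function name, the choice of fields, and the namespace letters (t, d, n) are assumptions made for illustration; this is not the preprocessing used by any competitor.

```python
def to_vw_line(is_illicit, title, description, price):
    """Format one ad record as a Vowpal Wabbit input line.

    Hypothetical helper: field selection and namespace names are assumptions.
    """
    def clean(text):
        # '|' and ':' have special meaning in the VW text format,
        # so replace them with spaces and collapse repeated whitespace.
        return " ".join(text.replace("|", " ").replace(":", " ").split())

    # VW's logistic loss expects labels of 1 and -1.
    label = "1" if is_illicit else "-1"
    return f"{label} |t {clean(title)} |d {clean(description)} |n price:{price}"


line = to_vw_line(True, "Phone: new", "Cheap | fast delivery", 100)
print(line)
```

Keeping the title, description, and numeric features in separate namespaces lets the learner treat the same word differently depending on where it appears.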


Kaggle vs industry, as seen through lens of the Avito competition

Lessons learned from the hunt for prohibited content on Kaggle

  • What did work
    • Ensembling Vowpal Wabbit models
    • Using an illicit score
    • All loss functions gave good results
    • Neural networks
    • Reducing overfit
    • 2-grams on ALL the features
    • Having access to a fast 32GB RAM machine
    • Encoding integers in categorical variables
  • What did not (quite) work
    • Hyperparameter tuning
    • TF-IDF
    • Quick character encoding handling
    • Proper dataset inspection
    • Bagging SVD’s
  • Tools Used
    • Scikit-learn
    • Vowpal Wabbit
  • Code
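
Two of the items listed above, 2-grams and ensembling, can be illustrated with a short sketch. This is not the code from the competition: the function names are hypothetical, and the ensemble here is a plain average of model scores, which may differ from how the Vowpal Wabbit models were actually combined.

```python
def bigrams(tokens):
    """Return the 2-grams (adjacent token pairs) of a token list."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def average_ensemble(prediction_lists):
    """Average per-example scores from several models.

    Each inner list holds one model's scores, in the same example order.
    Simple averaging is an assumption made for illustration.
    """
    n_models = len(prediction_lists)
    return [sum(scores) / n_models for scores in zip(*prediction_lists)]


print(bigrams(["cheap", "new", "phone"]))
print(average_ensemble([[0.25, 1.0], [0.75, 0.5]]))
```

Adding 2-grams lets a linear model pick up on short phrases rather than isolated words, and averaging several models tends to smooth out the errors of any single one.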

Other Solutions

State of the Tools for Big Data Analysis and Machine Learning

Vowpal Wabbit (VW)

VW is a fast out-of-core learning system that is able to learn from terafeature datasets with ease.
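
The idea behind out-of-core learning can be sketched in a few lines of plain Python: examples are read one at a time, features are hashed into a fixed-size weight vector, and the model is updated after every example, so the full dataset never has to fit in memory. The class below is an illustrative toy, not VW's implementation; its name, the CRC32 hash, and the default parameters are all assumptions.

```python
import math
import zlib


class HashedOnlineLogistic:
    """Toy online logistic regression with hashed features (not VW itself)."""

    def __init__(self, bits=18, learning_rate=0.5):
        self.size = 1 << bits            # fixed number of weight slots
        self.weights = [0.0] * self.size
        self.learning_rate = learning_rate

    def _indices(self, tokens):
        # Map each token to a slot; CRC32 is deterministic across runs.
        return [zlib.crc32(t.encode("utf-8")) % self.size for t in tokens]

    def predict(self, tokens):
        z = sum(self.weights[i] for i in self._indices(tokens))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, tokens, label):
        # One stochastic gradient step on the logistic loss (label is 0 or 1).
        error = self.predict(tokens) - label
        for i in self._indices(tokens):
            self.weights[i] -= self.learning_rate * error


def stream():
    # Stand-in for reading examples line by line from a large file.
    for _ in range(20):
        yield ["buy", "cheap", "pills"], 1
        yield ["used", "bicycle", "sale"], 0


model = HashedOnlineLogistic()
for tokens, label in stream():
    model.learn(tokens, label)

print(model.predict(["buy", "cheap", "pills"]))
print(model.predict(["used", "bicycle", "sale"]))
```

Because memory use depends only on the number of hash bits and not on the number of examples or distinct features, the same loop works whether the stream holds forty examples or billions.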


