Preamble
Industry Reality and Competition-Winning Solutions
According to Zygmunt Zając, most data mining and machine learning competitions are won by meticulously improving the score by tiny fractions, usually through feature engineering, constructing different feature sets, and ensembling models. This approach, however, differs from industry reality: companies vary widely in how they deploy code and put models into production.
One piece of evidence comes from the famous Netflix Prize. Teams labored for three years to reach the target score. In the end the contest gave a great boost to matrix factorization and machine learning research in general, but Netflix never implemented the winning solution.
In Netflix's own words:

> We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.
Lessons from Interesting Kaggle Competitions
The Hunt for Prohibited Content
Predicting which ads contain illicit content.
The goal of this Kaggle challenge was to build a predictive model that learns from moderators' decisions how to classify whether an ad contains illicit content.
In practice, it amounted to classifying Russian-language text.
The training set had roughly 1.3 million records, each consisting of a title, a description, attributes (key:value pairs), category and subcategory assignments, and a few numeric features, including price.
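Since each record mixes free text, key:value attributes, categories and numbers, a natural first step is serializing ads into a learner-friendly format. Below is a minimal sketch, with a hypothetical record layout (the field names are illustrative, not the competition's actual schema), of turning one ad into Vowpal Wabbit's input format, one namespace per field group:

```python
# A minimal sketch, assuming a hypothetical record layout (field names are
# illustrative, not the competition's actual schema): serialize one ad into
# Vowpal Wabbit's input format, one namespace per field group.

def to_vw_line(record, label):
    """Format one ad as a VW example; label is 1 (illicit) or -1 (clean)."""
    def clean(text):
        # '|' and ':' are special characters in the VW format, so drop them.
        return str(text).replace("|", " ").replace(":", " ")

    attrs = " ".join(clean(f"{k}={v}") for k, v in record["attrs"].items())
    return (
        f"{label} "
        f"|title {clean(record['title'])} "
        f"|desc {clean(record['description'])} "
        f"|attrs {attrs} "
        f"|cat {clean(record['category'])} {clean(record['subcategory'])} "
        f"|num price:{record['price']}"
    )

print(to_vw_line(
    {"title": "iPhone 5", "description": "новый, гарантия",
     "attrs": {"condition": "new"}, "category": "electronics",
     "subcategory": "phones", "price": 20000},
    label=1,
))
```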
Lessons
Kaggle vs. industry, as seen through the lens of the Avito competition
- Feature Engineering Approaches:
- Using multiple feature sets
- Using term frequency-inverse document frequency (TF-IDF) weighting (see the sketch after this list)
- Re-training a classifier on its own predictions
- kNN
- Factorization machines
- Separate models for each category
- n-grams
- Tools Used
- Scikit-learn
- Vowpal Wabbit
- Code
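To make the TF-IDF and n-gram items above concrete, here is a minimal scikit-learn sketch; the toy documents and labels are invented, and this is not any competitor's actual pipeline:

```python
# A toy sketch of TF-IDF features with word 1- and 2-grams in scikit-learn;
# the documents and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "продам детскую коляску",       # selling a baby stroller
    "казино бонус быстрые деньги",  # casino bonus, fast money
    "сдам квартиру в центре",       # renting out a flat downtown
]
labels = [0, 1, 0]  # 1 = illicit, 0 = clean

# sublinear_tf dampens raw term counts; ngram_range=(1, 2) adds 2-grams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # probability each ad is illicit
```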
Lessons learned from the Hunt for Prohibited Content on Kaggle
- What did work
- Ensembling Vowpal Wabbit models (see the sketch after this list)
- Using an illicit score
- All loss functions gave good results
- Neural networks
- Reducing overfitting
- 2-grams on ALL the features
- Having access to a fast 32GB RAM machine
- Encoding integers in categorical variables
- What did not (quite) work
- Hyperparameter tuning
- TF-IDF
- Quick character encoding handling
- Proper dataset inspection
- Bagging SVDs
- Tools Used
- Scikit-learn
- Vowpal Wabbit
- Code
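As a concrete reading of the first "what did work" item, here is a minimal sketch of one plausible ensembling scheme: train several Vowpal Wabbit models (for example with different loss functions), write their predictions out with `vw ... -p <file>`, and average them. The file names are hypothetical:

```python
# A minimal sketch (an assumed approach, not the competitors' exact code):
# average the raw predictions of several Vowpal Wabbit models, each saved
# with `vw ... -p <file>`. The file names below are hypothetical.
import numpy as np

pred_files = ["preds_logistic.txt", "preds_hinge.txt", "preds_squared.txt"]

def load_preds(path):
    """Read one VW prediction per line and squash it to [0, 1]."""
    with open(path) as f:
        raw = np.array([float(line.split()[0]) for line in f])
    # VW's raw scores are not probabilities; a sigmoid puts all models on
    # a common scale before averaging (a heuristic, not part of VW itself).
    return 1.0 / (1.0 + np.exp(-raw))

ensemble = np.mean([load_preds(f) for f in pred_files], axis=0)
np.savetxt("ensemble_preds.txt", ensemble)
```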
Other Solutions
State of the Tools for Big Data Analysis and Machine Learning
Vowpal Wabbit (VW)
VW is a fast out-of-core learning system able to learn from terafeature datasets with ease.
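For illustration, a typical invocation (standard VW flags, though not necessarily the settings used in the competition) trains a logistic model out of core with 2-grams and several passes over a cached copy of the data, then scores a held-out file:

```
vw train.vw --loss_function logistic --ngram 2 -b 28 \
   --passes 3 --cache_file train.cache -f model.vw
vw test.vw -t -i model.vw -p predictions.txt
```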