Text classification using Machine Learning - Tensorflow - AI

Hey 👋

I want to share my experience participating in the #MeliDataChallenge 2019. (Mercadolibre.com challenge)

I'm not an expert. I like to participate in contests like this to learn about Machine Learning and AI through real-world applications. Soon I will publish my experience and my solution for the Despegar challenge (image classification).

The challenge

The challenge was very interesting: classify e-commerce products using only their titles.

The prize

First and second place will receive tickets to KHIPU. Third through fifth place will receive an Intel Movidius.

The result

I finished in the top 20 with a score of 0.8954 out of 1. There were more than 150 participants, and the competition was hard and exciting. Of course, I learned a lot of new things.

POSITION	NAME	SCORE	ENTRIES
20	jefferson1100001	0.895456136076574	7

Let's explain my approach

The first thing I did was take a look at the data. Mercadolibre provided two files, train.csv and test.csv. This is what train.csv looks like:

TITLE	LABEL_QUALITY	LANGUAGE	CATEGORY
Hidrolavadora Lavor One 120 Bar 1700w Bomba A...	unreliable	spanish	ELECTRIC_PRESSURE_WASHERS
Placa De Sonido - Behringer Umc22	unreliable	spanish	SOUND_CARDS
Maquina De Lavar Electrolux 12 Kilos	unreliable	portuguese	WASHING_MACHINES

There are 12,644,401 valid rows. The dataset is unbalanced, and some categories are present in just one language.

Data preprocessing

Here I describe my preprocessing routines in words, without my original code.

Remove tildes
Spanish and Portuguese words carry accent marks (tildes), as in teléfono. This step turns that word into telefono.
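
One common way to do this in Python is Unicode decomposition. The snippet below is a minimal sketch of that idea, not my original code; the function name strip_accents is just illustrative.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFD splits a letter like "é" into "e" plus a combining accent mark,
    # which has Unicode category "Mn" (nonspacing mark).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("teléfono"))  # telefono
```

Keep in mind that this approach also maps ñ to n and ç to c, which may or may not be what you want.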

Remove word separators
Some titles use dashes, dots and other punctuation marks between words without any spaces, for example kit.ruedas.moto.

I replaced each of these marks with a space:

+ , . ( ) : [ ] { } _ /
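
As a rough sketch (the names here are illustrative, not taken from my code), the marks listed above can be written as a single regular expression character class:

```python
import re

# The separator marks listed above, as a regex character class.
# Only the square brackets need escaping inside the class.
SEPARATORS = re.compile(r"[+,.():\[\]{}_/]")


def split_on_separators(title: str) -> str:
    """Replace every separator mark with a single space."""
    return SEPARATORS.sub(" ", title)


print(split_on_separators("kit.ruedas.moto"))  # kit ruedas moto
```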

Remove other punctuation marks and numbers
I removed all remaining punctuation marks, plus any numbers that stand alone between word boundaries. A number glued to letters, as in umc22 or 4gb, is kept.
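
A minimal sketch of this step, assuming a regex-based approach (the patterns and the function name are my illustration here, not the exact original code):

```python
import re

# Anything that is neither a word character nor whitespace.
OTHER_PUNCTUATION = re.compile(r"[^\w\s]")
# A run of digits with a word boundary on both sides, e.g. "10" but not "4gb".
STANDALONE_NUMBER = re.compile(r"\b\d+\b")


def drop_punctuation_and_numbers(title: str) -> str:
    """Remove leftover punctuation and standalone numbers, then tidy spaces."""
    title = OTHER_PUNCTUATION.sub(" ", title)
    title = STANDALONE_NUMBER.sub(" ", title)
    return " ".join(title.split())


print(drop_punctuation_and_numbers("Oportunidad! Notebook Dell I3 - 4gb - Win 10"))
# Oportunidad Notebook Dell I3 4gb Win
```

The word boundary is what keeps tokens such as I3 and 4gb intact while dropping the lone 10.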

Tokenize the title
I applied the WordPunctTokenizer provided by NLTK to split each title into words.
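
If you don't have NLTK at hand, the sketch below approximates the behaviour with the standard library. As far as I know, WordPunctTokenizer is built on the pattern \w+|[^\w\s]+, but treat that as an assumption and check the NLTK documentation before relying on it.

```python
import re
from typing import List

# Runs of word characters, or runs of punctuation, become separate tokens.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(title: str) -> List[str]:
    """Split a title into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(title)


print(tokenize("Placa De Sonido - Behringer Umc22"))
# ['Placa', 'De', 'Sonido', '-', 'Behringer', 'Umc22']
```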

Remove stop words
From the resulting array of words I discarded stop words such as "un", "unas", "unos"...
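
In code this is just a membership test against a set. The stop word set below is a tiny made-up sample for illustration; a real list would be much longer and would cover both Spanish and Portuguese.

```python
# Illustrative sample only, not the full list used in the challenge.
STOP_WORDS = {"un", "una", "unas", "unos", "de", "la", "el"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["placa", "de", "sonido"]))  # ['placa', 'sonido']
```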

Stem each token
I used the SnowballStemmer provided by NLTK. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem. For example, "Cámara" is transformed to "cam".

Preprocessing result

Let's take a look at the titles before and after preprocessing:

BEFORE	AFTER PREPROCESSING
Placa De Sonido - Behringer Umc22	plac son behring umc22
Oportunidad! Notebook Dell I3 - 4gb Ddr4 - Hd 1tb - Win 10	oportun notebook dell i3 4gb ddr4 hd 1tb win
Cámara Instantánea Fujifilm Instax Mini 9 - Azul Cobalto	cam instantane fujifilm instax mini azul cobalt

I saved a copy of train.csv with all the titles preprocessed, a list with all the possible categories and a list with all labels.

The dictionary:

Iterating over all the preprocessed titles with the help of a Counter, I created a dictionary that keeps a word only if its frequency is >= 2. That is, the word must occur at least twice across the titles.

The dictionary looks like this

{'kit': 785233, 'original': 469537, 'pret': 232647, 'led': 220194, ...}

There are 1,251,659 unique tokens. After filtering, the dictionary has 513,307 possible words.
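
Here is a small sketch of how such a dictionary can be built with collections.Counter. The function name, the min_count parameter and the toy titles are illustrative.

```python
from collections import Counter


def build_vocabulary(tokenized_titles, min_count=2):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(token for title in tokenized_titles for token in title)
    # most_common() yields (word, count) pairs from most to least frequent.
    return {word: n for word, n in counts.most_common() if n >= min_count}


toy_titles = [["kit", "led", "mot"], ["kit", "original"], ["led", "kit"]]
print(build_vocabulary(toy_titles))  # {'kit': 3, 'led': 2}
```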

Let's transform the preprocessed titles to numbers
In this step I used the dictionary to transform each title into an array of numbers. It's very simple: each word in the title is replaced with the index of that word in the dictionary plus 1 (0 is reserved for padding).

A preprocessed title like this:

['porton', 'chap', 'hoj', 'mtr', 'marc']
Becomes:

[120, 121, 122, 123, 124]

The longest word sequence has length 27. If a transformed title is shorter than 27, we pad it with zeros so every title has the same length.

[120, 121, 122, 123, 124, 0, 0, 0, 0, 0, 0, 0, ....]
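
The encoding and padding can be sketched in a few lines. Two things here are assumptions of this sketch: words missing from the dictionary are simply skipped, and titles longer than max_len are truncated. The names encode_title and word_index are illustrative.

```python
def encode_title(tokens, word_index, max_len=27):
    """Map tokens to integer ids and right-pad with zeros up to max_len."""
    # Assumption: tokens that are not in the dictionary are dropped.
    ids = [word_index[token] for token in tokens if token in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


# Index 0 is reserved for padding, so real words start at 1.
toy_vocab = ["porton", "chap", "hoj", "mtr", "marc"]
word_index = {word: i + 1 for i, word in enumerate(toy_vocab)}

print(encode_title(["porton", "chap", "hoj"], word_index, max_len=6))
# [1, 2, 3, 0, 0, 0]
```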

Now it's time for Machine Learning

I used TensorFlow + Keras. The model has the following architecture:

Key points you must know

Seed random numbers, so you can get reproducible results.

Use stratified samples when splitting test and train
It means that each set must have the same proportion of classes.

Take 1% for testing
The dataset is relatively big, so 1% seems to cover enough of its variety to be a useful validation set.
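
To make the seeded, stratified 1% split concrete, here is a dependency-free sketch. In practice a library helper does the same job; the function below and its parameter names are only an illustration of the idea.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.01, seed=42):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(seed)  # fixed seed, so the split is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = ["a"] * 200 + ["b"] * 100
train, test = stratified_split(toy_labels, test_fraction=0.1, seed=0)
print(len(train), len(test))  # 270 30
```

Note that with a 1% fraction, very small classes end up with no test rows at all in this sketch, so rare categories need some extra care.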

Use class_weights
Because the dataset is imbalanced, class_weights increased the balanced accuracy (BACC) of the model.
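
One common weighting scheme is the "balanced" heuristic, n_samples / (n_classes * class_count), which gives rare classes a larger weight. The sketch below uses that formula as an assumption; it is not necessarily the exact weighting I trained with.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Class 1 is three times rarer than class 0, so it gets three times the weight.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights[1] / weights[0])
```

A dictionary like this, keyed by class index, is the kind of mapping that Keras accepts through the class_weight argument of fit.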

Explore the data locally
And maybe preprocess locally too, but use multiprocessing.

Use Colab or Kaggle
To take advantage of the GPU and train faster

Training:

After 18 epochs the model achieved good results before it started to overfit. The balanced accuracy was: 0.86774697
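
Balanced accuracy is the average of the recall obtained on each class, so a rare category counts as much as a frequent one. Here is a small sketch of the metric with a toy example (the function name is illustrative):

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Average the per-class recall over all classes present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
# Plain accuracy would be 0.8, but class "b" is never predicted correctly.
print(balanced_accuracy(y_true, y_pred))  # 0.5
```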

Testing a new title:
If we feed the model completely new data, for example:

"Buzo Harry Potter Lentes Cicatriz Hogwarts Hoodie"
It will predict:

SWEATSHIRTS_AND_HOODIES

The alternative attempt:

I also created a small model with just 1400 categories representing the reliable subset. The goal was to feed that model the unreliable subset in order to detect whether the unreliable rows were assigned to the wrong categories, but this adds complexity and I had to optimize two models instead of just one.

Next steps:

You can use the label_quality column somehow to increase the model's accuracy.

Use a confusion matrix or any other performance measurement to detect where the model performs worst.

Use a more complex architecture or a CNN

Github with all the code: