Text classification using Machine Learning - Tensorflow - AI
I want to share my experience participating in the #MeliDataChallenge 2019 (the Mercadolibre.com challenge).
I'm not an expert; I like to participate in contests like this to learn about Machine Learning and AI with real-world applications. Soon I will publish my experience and my solution for the Despegar challenge (image classification).
The challenge was very interesting: classify e-commerce products using only their titles.
First and second place receive tickets to KHIPU; third to fifth place receive an Intel Movidius.
I managed to finish in the top 20 with a score of 0.8954/1. There were more than 150 participants, and the competition was hard and exciting. Of course, I learned a lot of new things.
POSITION | NAME | SCORE | ENTRIES |
---|---|---|---|
20 | jefferson1100001 | 0.895456136076574 | 7 |
The first thing I did was take a look at the data. Mercadolibre provided two files, train.csv and test.csv. This is what train.csv looks like:
TITLE | LABEL_QUALITY | LANGUAGE | CATEGORY |
---|---|---|---|
Hidrolavadora Lavor One 120 Bar 1700w Bomba A... | unreliable | spanish | ELECTRIC_PRESSURE_WASHERS |
Placa De Sonido - Behringer Umc22 | unreliable | spanish | SOUND_CARDS |
Maquina De Lavar Electrolux 12 Kilos | unreliable | portuguese | WASHING_MACHINES |
There are 12,644,401 valid rows; the dataset is unbalanced, and some categories are present in just one language.
Here I will describe my preprocessing routines.
Remove tildes
Spanish and Portuguese words have tildes (accent marks), like teléfono. This step mutates the word to telefono.
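A minimal sketch of this step, using Python's standard unicodedata module (the helper name remove_tildes is illustrative):

```python
import unicodedata

def remove_tildes(word):
    # Decompose accented characters, then drop the combining marks,
    # e.g. "teléfono" -> "telefono".
    nfkd = unicodedata.normalize("NFKD", word)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(remove_tildes("teléfono"))  # telefono
```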
Remove word separators
Some titles have dashes, dots, and other punctuation marks without a space between them, for example kit.ruedas.moto.
I replace each of these marks with a space:
- + , . ( ) : [ ] { } _ /
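A minimal sketch, assuming a simple regex substitution over the marks listed above (the names are illustrative):

```python
import re

# Character class matching the separator marks listed above.
SEPARATORS = re.compile(r"[-+,.():\[\]{}_/]")

def split_separators(title):
    # "kit.ruedas.moto" -> "kit ruedas moto"
    return SEPARATORS.sub(" ", title)
```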
Remove other punctuation marks and numbers
I've removed every other punctuation mark, and also numbers, but only when the number is surrounded by word boundaries: a standalone 10 is dropped, while 4gb keeps its digits.
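One possible way to express this with regular expressions (a sketch, not necessarily the exact patterns used):

```python
import re

def remove_numbers_and_punctuation(title):
    # Drop a number only when it stands alone between word boundaries,
    # so "4gb" keeps its digits but a lone "10" disappears.
    title = re.sub(r"\b\d+\b", " ", title)
    # Remove any remaining punctuation marks.
    return re.sub(r"[^\w\s]", " ", title)
```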
Tokenize the title
I applied the WordPunctTokenizer provided by NLTK to split each title into words.
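For example, with NLTK:

```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("plac son behring umc22"))
# ['plac', 'son', 'behring', 'umc22']
```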
Remove stop words
On the resulting array of words I discarded stop words like "un", "unas", "unos"...
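A sketch using NLTK's built-in stop word lists; combining the Spanish and Portuguese lists into one set is my assumption here:

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPS = set(stopwords.words("spanish") + stopwords.words("portuguese"))

def drop_stopwords(tokens):
    return [t for t in tokens if t not in STOPS]
```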
Stem each token
I used the SnowballStemmer provided by NLTK. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem. For example: Cámara is transformed to "cam".
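For example, with NLTK's Spanish stemmer (Portuguese titles would presumably use SnowballStemmer("portuguese")):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("spanish")
print(stemmer.stem("oportunidad"))  # oportun
```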
Let's take a look at the titles before and after preprocessing:
BEFORE | AFTER PREPROCESSING |
---|---|
Placa De Sonido - Behringer Umc22 | plac son behring umc22 |
Oportunidad! Notebook Dell I3 - 4gb Ddr4 - Hd 1tb - Win 10 | oportun notebook dell i3 4gb ddr4 hd 1tb win |
Cámara Instantánea Fujifilm Instax Mini 9 - Azul Cobalto | cam instantane fujifilm instax mini azul cobalt |
I saved a copy of train.csv with all the titles preprocessed, a list with all the possible categories, and a list with all the labels.
Iterating over all the preprocessed titles, with the help of a Counter, I created a dictionary that keeps a word only if its frequency is >= 2, i.e. the word must occur at least twice across all the titles.
The dictionary looks like this:
{'kit': 785233, 'original': 469537, 'pret': 232647, 'led': 220194, ...}
There are 1,251,659 unique tokens; after filtering, the dictionary holds 513,307 possible words.
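A minimal sketch of this counting step (preprocessed_titles is an assumed name for the token lists produced above):

```python
from collections import Counter

word_counts = Counter()
for tokens in preprocessed_titles:  # assumed: one token list per title
    word_counts.update(tokens)

# Keep only words that occur at least twice across the whole corpus.
vocabulary = {w: c for w, c in word_counts.items() if c >= 2}
```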
Let's transform the preprocessed titles to numbers
In this step I used the dictionary to transform each title into an array of numbers. It's very simple: for each word in the title, replace it with the word's index in the dictionary plus 1 (0 is reserved for padding).
A preprocessed title like this:
['porton', 'chap', 'hoj', 'mtr', 'marc']
Becomes:
[120, 121, 122, 123, 124]
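A sketch of this transformation; the word_index mapping and the handling of out-of-vocabulary words are assumptions:

```python
# Assumed mapping: each kept word gets a unique index starting at 1
# (0 is reserved for padding).
word_index = {w: i + 1 for i, w in enumerate(vocabulary)}

def encode(tokens):
    # Words filtered out of the vocabulary are simply skipped here;
    # the article doesn't say how unknown words were handled.
    return [word_index[t] for t in tokens if t in word_index]

print(encode(['porton', 'chap', 'hoj', 'mtr', 'marc']))
# per the example above: [120, 121, 122, 123, 124]
```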
The longest word sequence has length 27; if a transformed title is shorter than 27, we pad it with zeros so each title has the same length:
[120, 121, 122, 123, 124, 0, 0, 0, 0, 0, 0, 0, ....]
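One way to do the padding is with Keras' pad_sequences (whether this exact helper was used is an assumption; encoded_titles is an assumed name):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 27  # length of the longest transformed title
X = pad_sequences(encoded_titles, maxlen=MAX_LEN, padding="post", value=0)
```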
I've used TensorFlow + Keras to build and train the model.
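The exact architecture isn't reproduced here, so the following is only a hypothetical sketch of a model that fits this input and output shape; the layer types and sizes are illustrative assumptions, not the actual architecture (categories is the list saved earlier):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 513_307 + 1          # vocabulary plus the reserved index 0
MAX_LEN = 27
NUM_CATEGORIES = len(categories)  # assumed: the saved category list

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
    layers.GlobalAveragePooling1D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CATEGORIES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```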
Some tips:
Seed random numbers
So you can get reproducible results.
Use stratified samples when splitting test and train
It means that each set must have the same proportion of classes.
Take 1% for testing
The dataset is relatively big; 1% seems to be representative enough of the dataset for validation.
Use class_weights
Due to the imbalanced nature of the dataset, class_weights increased the BACC (balanced accuracy) of the model; see the sketch below.
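A sketch covering both the stratified split and the class weights, using scikit-learn (X and y are the padded titles and their integer category labels; assuming labels are consecutive integers starting at 0):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Stratified 1% hold-out: each split keeps the same class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.01, stratify=y, random_state=42)

# Weight classes inversely to their frequency to counter the imbalance.
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))
# model.fit(X_train, y_train, class_weight=class_weight, epochs=18, ...)
```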
Explore the data locally
And maybe preprocess locally, but use multiprocessing.
Use Colab or Kaggle
To take advantage of the GPU and train faster.
After 18 epochs the model seems to achieve good results before it starts to overfit. The balanced accuracy was 0.86774697.
Testing a new title:
If we feed the model with completely new data, for example:
"Buzo Harry Potter Lentes Cicatriz Hogwarts Hoodie"
It will predict:
SWEATSHIRTS_AND_HOODIES
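A hypothetical end-to-end prediction, reusing the helpers sketched above (preprocess, encode, and categories are assumed names):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

title = "Buzo Harry Potter Lentes Cicatriz Hogwarts Hoodie"
tokens = preprocess(title)   # assumed: all the preprocessing steps chained
x = pad_sequences([encode(tokens)], maxlen=27, padding="post")
pred = model.predict(x)[0]
print(categories[pred.argmax()])  # -> SWEATSHIRTS_AND_HOODIES
```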
I also created a small model with just 1,400 categories, representing the reliable subset. The goal was to feed that model with the unreliable subset in order to detect whether the unreliable rows were in the wrong categories, but this added complexity, and I'd have to optimize two models instead of just one.
- You can use the label_quality field somehow to increase the model's accuracy (ACC).
- Use a confusion matrix or another performance measurement to detect where our model performs worst (see the sketch after this list).
- Use a more complex architecture, or a CNN.
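A minimal sketch of the confusion-matrix idea, reusing the validation split from the earlier sketch:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_pred = model.predict(X_val).argmax(axis=1)
print(balanced_accuracy_score(y_val, y_pred))

# Rows are true categories, columns are predictions; large off-diagonal
# cells point at the categories where the model performs worst.
cm = confusion_matrix(y_val, y_pred)
```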