Introduction to One-Hot Encoding

What is one-hot encoding?

In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

In machine learning, features are not always continuous; some are categorical. For example, the feature gender can take the values 'male' or 'female'. For model training, we have to convert these features to numerical values. Consider the following example with 3 features:

  • gender: ['female', 'male']
  • region: ['Africa', 'Asia', 'Europe', 'US']
  • class: ['A', 'B', 'C']

For a specific sample, like ['female', 'Asia', 'C'], the simplest way to convert the features to numerical codes is to replace each value with its index in the list, giving [0, 1, 2]. However, these integer codes should not be used directly in machine learning model training, because they impose an artificial order on the categories.

In statistics, we usually use dummy variables to represent these categorical variables. Similarly, we can use a representation of categorical variables as binary vectors to extract features and convert them to numbers. In our example, 'female' corresponds to [1, 0], 'Asia' corresponds to [0, 1, 0, 0] and 'C' corresponds to [0, 0, 1], so the overall numerical representation of this sample is [1, 0, 0, 1, 0, 0, 0, 0, 1]. This technique of conversion is called one-hot encoding.
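
The concatenation described above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the `categories` dictionary and the `one_hot_sample` helper are names introduced here for illustration and are not part of any library.

```python
# Category lists copied from the example; their order fixes each bit's position.
categories = {
    "gender": ["female", "male"],
    "region": ["Africa", "Asia", "Europe", "US"],
    "class": ["A", "B", "C"],
}


def one_hot_sample(sample):
    """Concatenate one binary vector per feature (hypothetical helper)."""
    vector = []
    for value, choices in zip(sample, categories.values()):
        # Exactly one position per feature is set to 1.
        vector.extend(1 if value == choice else 0 for choice in choices)
    return vector


print(one_hot_sample(["female", "Asia", "C"]))
# [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The length of the result is the sum of the number of choices per feature (2 + 4 + 3 = 9), and each sample contains exactly one 1 per feature.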

What are the advantages and disadvantages?

Advantages

  • It handles categorical features without imposing an order on them;
  • It can also handle discontinuous (discrete) numerical features;
  • It expands the feature space. For instance, gender (male/female) is one feature, but after one-hot encoding it becomes two features: male | female.

Disadvantages

  • In NLP, it disregards the order of words;
  • It assumes features are independent. However, in a lot of cases, features are correlated with each other;
  • The data can be extremely sparse after the conversion.
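
To make the sparsity point concrete, the sketch below counts zeros in the example vector from earlier and shows how the share of zeros grows with the number of categories. The `zero_fraction` helper is a hypothetical name used only for this illustration, and it assumes a single feature encoded on its own.

```python
# One-hot vector of the sample ['female', 'Asia', 'C'] from the example above.
vector = [1, 0, 0, 1, 0, 0, 0, 0, 1]
print(vector.count(0) / len(vector))  # 6 of the 9 entries are zero


def zero_fraction(n_categories):
    """Hypothetical helper: share of zeros when one feature has n categories."""
    # A one-hot vector of length n holds a single 1 and n - 1 zeros.
    return (n_categories - 1) / n_categories


for n in (2, 10, 1000):
    print(n, zero_fraction(n))
```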

How to convert to one-hot encoding?

For instance, consider 4 samples with 3 features each:

          Feature_1  Feature_2  Feature_3
Sample_1          0          3          2
Sample_2          1          2          1
Sample_3          0          1          1
Sample_4          1          0          0

In the table above, features are already converted into numerical codes. Feature_1 has 2 possible choices, perhaps male/female, represented as 0-male / 1-female. But how can we convert these numbers into one-hot encoding?

Let each digit in the binary code of a feature represent one choice of that feature. Then the digit representing the sample's choice is set to 1 and all the remaining digits are set to 0, as below:

Feature    Encoding
Feature_1  0->01, 1->10
Feature_2  0->0001, 1->0010, 2->0100, 3->1000
Feature_3  0->001, 1->010, 2->100

In this way, the original table is converted to:

          Feature_1  Feature_2  Feature_3
Sample_1         01       1000        100
Sample_2         10       0100        010
Sample_3         01       0010        010
Sample_4         10       0001        001

The feature vectors for the four samples above are now:

          Feature Vector
Sample_1  [0,1,1,0,0,0,1,0,0]
Sample_2  [1,0,0,1,0,0,0,1,0]
Sample_3  [0,1,0,0,1,0,0,1,0]
Sample_4  [1,0,0,0,0,1,0,0,1]

This is the one-hot representation of the four samples, in which all features of the samples are converted into one-hot encoding.
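
The manual conversion above can be sketched in plain Python as follows. The `encode_row` helper and the `n_choices` list are illustrative names, and inferring the number of choices as the largest code plus one is an assumption that holds only when every choice appears at least once in the data.

```python
# Integer-coded samples from the first table (rows: Sample_1..Sample_4).
codes = [
    [0, 3, 2],
    [1, 2, 1],
    [0, 1, 1],
    [1, 0, 0],
]

# Number of choices per feature, inferred as max code + 1 (an assumption).
n_choices = [max(column) + 1 for column in zip(*codes)]  # [2, 4, 3]


def encode_row(row):
    """Hypothetical helper: one-hot encode one row of integer codes."""
    vector = []
    for code, size in zip(row, n_choices):
        bits = [0] * size
        # Code v sets the v-th digit counted from the right, matching the
        # encoding table (0 -> 01, 1 -> 10).
        bits[size - 1 - code] = 1
        vector.extend(bits)
    return vector


vectors = [encode_row(row) for row in codes]
for name, vector in zip(["Sample_1", "Sample_2", "Sample_3", "Sample_4"], vectors):
    print(name, vector)
```

Which digit stands for which choice is a convention; any fixed assignment works as long as it is applied consistently to every sample.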

One-hot encoding using Python

Using Pandas

pandas.get_dummies is used to convert categorical variables into one-hot encoding.

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

See the following usage example:

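import pandas as pd

# Four samples with three categorical features each
df = pd.DataFrame(
    [
        ['male', 'Africa', 'C'],
        ['female', 'Asia', 'B'],
        ['male', 'Europe', 'B'],
        ['female', 'US', 'A'],
    ],
    columns=['gender', 'region', 'class'],
    index=['Sample_1', 'Sample_2', 'Sample_3', 'Sample_4'],
)

# dtype=int yields 0/1 columns; pandas 2.0+ returns booleans by default
encoded = pd.get_dummies(df, dtype=int)
print(encoded)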

Before conversion, the data frame is:

          gender  region  class
Sample_1  male    Africa  C
Sample_2  female  Asia    B
Sample_3  male    Europe  B
Sample_4  female  US      A

After conversion, the new data frame is:

          gender_female  gender_male  region_Africa  region_Asia  region_Europe  region_US  class_A  class_B  class_C
Sample_1              0            1              1            0              0          0        0        0        1
Sample_2              1            0              0            1              0          0        0        1        0
Sample_3              0            1              0            0              1          0        0        1        0
Sample_4              1            0              0            0              0          1        1        0        0

Using sklearn

sklearn.preprocessing.OneHotEncoder is used to encode categorical features as a one-hot numeric array.

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

See the following usage example:

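from sklearn import preprocessing

# Learn the categories of each column from the four integer-coded samples
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]])

# transform returns a sparse matrix by default; toarray makes it dense
array = enc.transform([[0, 3, 2]]).toarray()

print(array)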

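The result of the code above is shown below. OneHotEncoder orders the columns of each feature by ascending category value, so code 0 of the first feature maps to [1, 0] rather than the 01 used in the manual encoding table earlier: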

[[1. 0. 0. 0. 0. 1. 0. 0. 1.]]

Why do we use one-hot encoding in machine learning?

To run machine learning algorithms such as classification, regression, and clustering, we often have to compute 'distances' or 'similarities' between feature vectors, and most of these measures are defined in Euclidean space.

For discrete or categorical data, distances in Euclidean space are more meaningful when computed on a binary representation.

For example, consider we have three feature vectors:
$x_1 = 0$, $x_2 = 1$, $x_3 = 2$.

The distances between these vectors are:
$d(x_1, x_2) = 1 - 0 = 1$, $d(x_1, x_3) = 2 - 0 = 2$.

This result indicates that the distance between $x_1$ and $x_2$ is smaller than the distance between $x_1$ and $x_3$. However, there is no reason to say that two values of a categorical feature are farther apart than two other values.

However, if we convert the feature vectors into one-hot encodings, then $x_1 = (1, 0, 0)$, $x_2 = (0, 1, 0)$, and $x_3 = (0, 0, 1)$. In this way, $d(x_1, x_2) = \sqrt{1^2+1^2} = \sqrt{2}$, $d(x_1, x_3) = \sqrt{1^2+1^2} = \sqrt{2}$, and $d(x_2, x_3) = \sqrt{1^2+1^2} = \sqrt{2}$ as well, so the three feature vectors are equally distant from each other in the Euclidean space, which corresponds to the characteristics of non-correlated categorical variables.

Therefore, in the representation of one-hot encoding, categorical features are binarized, and after binarizing these features, we can regard them as vectors embedded in the Euclidean space to calculate the distance for machine learning models.
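
The distance comparison above can be checked numerically. The following is a minimal sketch using only the standard library (`math.dist` requires Python 3.8 or later); the dictionary names are chosen here for illustration.

```python
import math
from itertools import combinations

# The same three categories as integer codes and as one-hot vectors.
integer_codes = {"x1": (0,), "x2": (1,), "x3": (2,)}
one_hot = {"x1": (1, 0, 0), "x2": (0, 1, 0), "x3": (0, 0, 1)}

for label, points in [("integer", integer_codes), ("one-hot", one_hot)]:
    for a, b in combinations(points, 2):
        # math.dist computes the Euclidean distance between two points.
        print(label, a, b, round(math.dist(points[a], points[b]), 4))
```

With integer codes the pairwise distances differ (1, 2, and 1), while with one-hot vectors every pair is separated by the same distance, the square root of 2.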
