Introduction to One-Hot Encoding

What is one-hot encoding?

In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

In machine learning, features are not always continuous; some are categorical. For example, the feature gender can take the values 'male' or 'female'. To train a model, we have to convert such features to numerical values. Consider the following example with 3 features:

gender: ['female', 'male']
region: ['Africa', 'Asia', 'Europe', 'US']
class: ['A', 'B', 'C']

For a specific sample, such as ['female', 'Asia', 'C'], the easiest way to convert the features to numerical codes is to replace each value with its index in the corresponding list, which gives [0, 1, 2]. However, these integer codes should not be used directly to train most machine learning models, because they impose an artificial order on the categories.

In statistics, we usually use dummy variables to represent such categorical variables. Similarly, we can represent each categorical variable as a binary vector and so convert it to numbers. In our example, 'female' corresponds to [1, 0], 'Asia' corresponds to [0, 1, 0, 0] and 'C' corresponds to [0, 0, 1], so the overall numerical representation of this sample is [1, 0, 0, 1, 0, 0, 0, 0, 1]. This conversion technique is called one-hot encoding.
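The mapping described above can be sketched in a few lines of plain Python. The `CATEGORIES` dictionary and the `one_hot` and `encode_sample` helpers are illustrative names, not part of any library, and the sketch assumes every value appears in its feature's category list.

```python
# Category lists for each feature, in the order used in the example above.
CATEGORIES = {
    "gender": ["female", "male"],
    "region": ["Africa", "Asia", "Europe", "US"],
    "class": ["A", "B", "C"],
}


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


def encode_sample(sample):
    """Concatenate the one-hot vectors of all features of one sample."""
    encoded = []
    for value, categories in zip(sample, CATEGORIES.values()):
        encoded.extend(one_hot(value, categories))
    return encoded


print(encode_sample(["female", "Asia", "C"]))
# prints [1, 0, 0, 1, 0, 0, 0, 0, 1]
```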

What are the advantages and disadvantages?

Advantages

It makes categorical features usable by models that expect numerical input;
It can also handle discrete numerical features whose values have no meaningful order;
It can increase the number of features. For instance, sex (male/female) is one feature, but after one-hot encoding it becomes two features: male | female.

Disadvantages

In NLP, it disregards the order of words;
It assumes features are independent. However, in many cases, features are correlated with each other;
The data can be extremely sparse after the conversion.
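
To make the sparsity point concrete, the sketch below counts the non-zero entries of a single one-hot vector. The vocabulary size of 10,000 and the helper name `one_hot_index` are assumptions chosen only for illustration.

```python
def one_hot_index(index, size):
    """Return a one-hot list of length `size` with a 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical vocabulary of 10,000 distinct words (an assumed size).
vocabulary_size = 10_000
vector = one_hot_index(42, vocabulary_size)

nonzero = sum(vector)
sparsity = 1 - nonzero / len(vector)
print(nonzero, len(vector))  # prints 1 10000
print(f"{sparsity:.4f}")     # prints 0.9999
```

Only one entry per feature carries information, so the share of zeros grows with the number of categories.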

How to convert to one-hot encoding?

For instance, suppose we have 4 samples with 3 features:

	Feature_1	Feature_2	Feature_3
Sample_1	0	3	2
Sample_2	1	2	1
Sample_3	0	1	1
Sample_4	1	0	0

In the table above, the features have already been converted into numerical codes. Feature_1 has 2 possible choices, perhaps male/female, represented as 0 for male and 1 for female. But how can we convert these numbers into one-hot encoding?

Let each digit in the binary code of a feature represent one choice of that feature. Then the digit for the choice taken by the sample is set to 1 and all the remaining digits are set to 0, as below:

Feature	Encoding
`Feature_1`	0->01, 1->10
`Feature_2`	0->0001, 1->0010, 2->0100, 3->1000
`Feature_3`	0->001, 1->010, 2->100

In this way, the original table is converted to:

	Feature_1	Feature_2	Feature_3
Sample_1	01	1000	100
Sample_2	10	0100	010
Sample_3	01	0010	010
Sample_4	10	0001	001

The feature vectors for the four samples above are now:

Sample	Vector
Sample_1	[0,1,1,0,0,0,1,0,0]
Sample_2	[1,0,0,1,0,0,0,1,0]
Sample_3	[0,1,0,0,1,0,0,1,0]
Sample_4	[1,0,0,0,0,1,0,0,1]

This is the one-hot representation of the four samples, in which all features of the samples are converted into one-hot encoding.
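
The conversion walked through above can be reproduced with a short Python sketch. It follows the convention of the tables in this section, where the 1 is placed counting from the right (code 0 maps to 001 for a feature with three choices). The names `N_CHOICES`, `encode_code` and `encode_row` are illustrative, not from any library.

```python
# Number of possible choices for Feature_1, Feature_2 and Feature_3.
N_CHOICES = [2, 4, 3]

SAMPLES = [
    [0, 3, 2],  # Sample_1
    [1, 2, 1],  # Sample_2
    [0, 1, 1],  # Sample_3
    [1, 0, 0],  # Sample_4
]


def encode_code(code, n_choices):
    """One-hot encode an integer code, counting positions from the right
    as in the tables above (0 -> 001 and 2 -> 100 when n_choices is 3)."""
    bits = [0] * n_choices
    bits[n_choices - 1 - code] = 1
    return bits


def encode_row(row):
    """Concatenate the encodings of every feature in one sample."""
    vector = []
    for code, n_choices in zip(row, N_CHOICES):
        vector.extend(encode_code(code, n_choices))
    return vector


for number, row in enumerate(SAMPLES, start=1):
    print(f"Sample_{number}", encode_row(row))
```

Counting positions from the left instead would give an equally valid encoding, as long as the same convention is used for every sample.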

One-hot encoding using Python

Using Pandas

pandas.get_dummies is used to convert categorical variables into one-hot encoding.

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

See the following usage example:

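import pandas as pd

# Four samples with three categorical features each.
df = pd.DataFrame(
    [
        ['male', 'Africa', 'C'],
        ['female', 'Asia', 'B'],
        ['male', 'Europe', 'B'],
        ['female', 'US', 'A'],
    ],
    columns=['gender', 'region', 'class'],
    index=['Sample_1', 'Sample_2', 'Sample_3', 'Sample_4'],
)

# dtype=int yields 0/1 columns; recent pandas versions default to booleans.
print(pd.get_dummies(df, dtype=int))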

Before conversion, the data frame is:

	gender	region	class
Sample_1	male	Africa	C
Sample_2	female	Asia	B
Sample_3	male	Europe	B
Sample_4	female	US	A

After conversion, the new data frame is:

	gender_female	gender_male	region_Africa	region_Asia	region_Europe	region_US	class_A	class_B	class_C
Sample_1	0	1	1	0	0	0	0	0	1
Sample_2	1	0	0	1	0	0	0	1	0
Sample_3	0	1	0	0	1	0	0	1	0
Sample_4	1	0	0	0	0	1	1	0	0

Using sklearn

sklearn.preprocessing.OneHotEncoder is used to encode categorical features as a one-hot numeric array.

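class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

This is the signature from older releases; in scikit-learn 1.2 the sparse parameter was renamed to sparse_output.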

See the following usage example:

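from sklearn import preprocessing

# Learn the categories of each feature from the four samples.
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]])

# transform returns a sparse matrix by default; toarray makes it dense.
array = enc.transform([[0, 3, 2]]).toarray()

print(array)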

The result of the code above is:

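[[1. 0. 0. 0. 0. 1. 0. 0. 1.]]

scikit-learn orders the columns of each feature by ascending category value, so within each feature the 1 sits at the mirrored position compared with the hand-built vector for Sample_1 above.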

Why do we use one-hot encoding in machine learning?

Many machine learning algorithms for classification, regression, and clustering rely on 'distances' or 'similarities' between feature vectors, and most of these measures are defined in Euclidean space.

For discrete or categorical data, distances in Euclidean space are more meaningful when the data is in a binary representation.

For example, consider three values of one categorical feature, encoded as integers:
$x_1 = 0, x_2 = 1, x_3 = 2$ .

The distances between them are:
$d(x_1, x_2) = |1 - 0| = 1, d(x_1, x_3) = |2 - 0| = 2$ .

This result indicates that the distance between $x_1$ and $x_2$ is smaller than the distance between $x_1$ and $x_3$ . However, for an unordered categorical feature there is no reason to say that two categories are more separated than two others.

If we instead convert the three values into one-hot encodings, then $x_1 = (1, 0, 0)$ , $x_2 = (0, 1, 0)$ , and $x_3 = (0, 0, 1)$ . In this way, $d(x_1, x_2) = \sqrt{1^2+1^2} = \sqrt{2}$ , $d(x_1, x_3) = \sqrt{1^2+1^2} = \sqrt{2}$ , and $d(x_2, x_3) = \sqrt{1^2+1^2} = \sqrt{2}$ as well, so the three vectors are equally distant from each other in Euclidean space, which matches the nature of unordered categorical variables.
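
The two sets of distances can be checked with a short Python sketch. It uses `math.dist` from the standard library (Python 3.8 or later); the dictionary names and the `pairwise_distances` helper are illustrative.

```python
import math
from itertools import combinations

# Integer codes for three categories, and their one-hot counterparts.
integer_codes = {"x1": [0], "x2": [1], "x3": [2]}
one_hot_codes = {"x1": [1, 0, 0], "x2": [0, 1, 0], "x3": [0, 0, 1]}


def pairwise_distances(vectors):
    """Euclidean distance between every pair of named vectors."""
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Integer codes: x1-x2 is 1.0 but x1-x3 is 2.0, an artificial ordering.
print(pairwise_distances(integer_codes))

# One-hot codes: every pair is sqrt(2), roughly 1.4142.
print(pairwise_distances(one_hot_codes))
```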

Therefore, one-hot encoding binarizes categorical features, and once they are binarized we can regard them as vectors embedded in Euclidean space and calculate distances between them for machine learning models.