Introduction to One-Hot Encoding
In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).
In machine learning, features are not always continuous; some are categorical. For example, the feature gender can take the values 'male' or 'female'. To train a model, we have to convert such features to numerical values. Consider the following example with 3 features:
- gender: ['female', 'male']
- region: ['Africa', 'Asia', 'Europe', 'US']
- class: ['A', 'B', 'C']
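For a specific sample such as ['female', 'Asia', 'C'], the simplest way to convert the features to numerical codes is to replace each value with its index in the corresponding list, giving [0, 1, 2]. However, these integer codes should not be used directly in model training, because they impose an order and a spacing on the categories that the data does not have.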
In statistics, categorical variables are usually represented with dummy variables. Similarly, we can represent each categorical value as a binary vector. In our example, 'female' corresponds to [1, 0], 'Asia' corresponds to [0, 1, 0, 0], and 'C' corresponds to [0, 0, 1], so the overall numerical representation of this sample is the concatenation [1, 0, 0, 1, 0, 0, 0, 0, 1]. This conversion technique is called one-hot encoding.
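The mapping described above can be sketched in a few lines of plain Python. The `categories` dictionary and the `one_hot_sample` helper are names made up for this illustration; the category order is taken from the lists given earlier.

```python
# Category lists for each feature, in the order given in the text.
categories = {
    "gender": ["female", "male"],
    "region": ["Africa", "Asia", "Europe", "US"],
    "class": ["A", "B", "C"],
}


def one_hot_sample(sample):
    """Concatenate one binary vector per feature (hypothetical helper)."""
    vector = []
    for choices, value in zip(categories.values(), sample):
        bits = [0] * len(choices)
        bits[choices.index(value)] = 1  # mark the observed category
        vector.extend(bits)
    return vector


encoded = one_hot_sample(["female", "Asia", "C"])
print(encoded)  # [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The length of the result is the sum of the category counts (2 + 4 + 3 = 9), and exactly one position per feature is set to 1.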
One-hot encoding has the following advantages:
- It handles categorical features well;
- It handles discrete (non-continuous) numerical features well.
It also has several disadvantages:
- It can increase the number of features. For instance, gender (male/female) is one feature, but after one-hot encoding it becomes two features: male | female;
- In NLP, it disregards the order of words;
- It assumes features are independent, although in many cases features are correlated with each other;
- The data can become extremely sparse after the conversion.
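To make the sparsity point concrete, the sketch below computes the share of zeros in a one-hot encoded row. The feature sizes are illustrative: 2, 4, and 3 categories as in the earlier example, plus an assumed 1000-category feature (for instance a small vocabulary). `zero_fraction` is a hypothetical helper name.

```python
def zero_fraction(category_counts):
    """Fraction of zeros in each one-hot row for the given feature sizes."""
    width = sum(category_counts)  # total number of one-hot columns
    ones = len(category_counts)   # exactly one 1 per original feature
    return (width - ones) / width


small = zero_fraction([2, 4, 3])        # the three example features
large = zero_fraction([2, 4, 3, 1000])  # add a hypothetical 1000-category feature
print(f"{small:.3f}")  # 0.667
print(f"{large:.3f}")  # 0.996
```

Each original feature contributes a single 1 regardless of how many categories it has, so high-cardinality features push the share of zeros toward 100%.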
As an example, consider 4 samples with 3 features each:
Sample | Feature_1 | Feature_2 | Feature_3 |
---|---|---|---|
Sample_1 | 0 | 3 | 2 |
Sample_2 | 1 | 2 | 1 |
Sample_3 | 0 | 1 | 1 |
Sample_4 | 1 | 0 | 0 |
In the table above, the features are already converted into numerical codes. Feature_1 has 2 possible choices, perhaps male/female, represented as 0 for male and 1 for female. But how can we convert these numbers into one-hot encoding?
Let each digit in the binary code of a feature represent one choice of that feature. The digit that represents the sample's choice is set to 1 and all the remaining digits are set to 0, as below:
Feature | Encoding |
---|---|
Feature_1 | 0->01, 1->10 |
Feature_2 | 0->0001, 1->0010, 2->0100, 3->1000 |
Feature_3 | 0->001, 1->010, 2->100 |
In this way, the original table is converted to:
Sample | Feature_1 | Feature_2 | Feature_3 |
---|---|---|---|
Sample_1 | 01 | 1000 | 100 |
Sample_2 | 10 | 0100 | 010 |
Sample_3 | 01 | 0010 | 010 |
Sample_4 | 10 | 0001 | 001 |
The feature vectors for the four samples above are now:
Sample | Vector |
---|---|
Sample_1 | [0,1,1,0,0,0,1,0,0] |
Sample_2 | [1,0,0,1,0,0,0,1,0] |
Sample_3 | [0,1,0,0,1,0,0,1,0] |
Sample_4 | [1,0,0,0,0,1,0,0,1] |
This is the one-hot representation of the four samples, in which all features of the samples are converted into one-hot encoding.
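The conversion of the four samples can be reproduced with a short sketch. It follows the bit convention of the encoding table, in which choice 0 is the rightmost digit; `one_hot_bits` is a hypothetical helper name, and the number of choices per feature is read off the table rather than inferred from the data.

```python
def one_hot_bits(value, n_choices):
    """One-hot code with choice 0 as the rightmost digit."""
    bits = [0] * n_choices
    bits[n_choices - 1 - value] = 1
    return bits


n_choices = [2, 4, 3]  # number of choices for Feature_1, Feature_2, Feature_3
samples = [[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]]

# Concatenate the per-feature codes of every sample into one vector.
vectors = [
    [bit for value, n in zip(sample, n_choices) for bit in one_hot_bits(value, n)]
    for sample in samples
]
for vector in vectors:
    print(vector)
```

The printed rows match the vector table, one row per sample.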
pandas.get_dummies is used to convert categorical variables into one-hot encoding.
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
See the following usage example:
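import pandas as pd

# Four samples with three categorical features each
df = pd.DataFrame(
    [
        ['male', 'Africa', 'C'],
        ['female', 'Asia', 'B'],
        ['male', 'Europe', 'B'],
        ['female', 'US', 'A'],
    ],
    columns=['gender', 'region', 'class'],
    index=['Sample_1', 'Sample_2', 'Sample_3', 'Sample_4'],
)

# dtype=int yields 0/1 columns; recent pandas versions return booleans by default
pd.get_dummies(df, dtype=int)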
Before conversion, the data frame is:
Sample | gender | region | class |
---|---|---|---|
Sample_1 | male | Africa | C |
Sample_2 | female | Asia | B |
Sample_3 | male | Europe | B |
Sample_4 | female | US | A |
After conversion, the new data frame is:
Sample | gender_female | gender_male | region_Africa | region_Asia | region_Europe | region_US | class_A | class_B | class_C |
---|---|---|---|---|---|---|---|---|---|
Sample_1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Sample_2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Sample_3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Sample_4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
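The `drop_first` parameter in the signature above is useful when one column per feature is redundant: with only two genders, `gender_female` is always the opposite of `gender_male`. A minimal sketch, assuming a pandas version that accepts the `dtype` argument:

```python
import pandas as pd

gender = pd.Series(["male", "female", "male", "female"], name="gender")

# drop_first=True removes the first category in sorted order ('female'),
# leaving a single 0/1 column that still identifies every sample.
dummies = pd.get_dummies(gender, prefix="gender", drop_first=True, dtype=int)
print(list(dummies.columns))            # ['gender_male']
print(dummies["gender_male"].tolist())  # [1, 0, 1, 0]
```

This is the dummy-variable form used in statistics: a feature with k categories is represented by k - 1 columns.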
sklearn.preprocessing.OneHotEncoder is used to encode categorical features as a one-hot numeric array.
class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
See the following usage example:
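from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
# Learn the distinct values of each column from the four samples
enc.fit([[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]])
# Encode one sample; the sparse result is converted to a dense array for printing
array = enc.transform([[0, 3, 2]]).toarray()
print(array)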
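The result of the code above is shown below. The encoder orders the columns of each feature by ascending category value, so the position of the 1 within each feature differs from the hand-built codes earlier, where choice 0 was the rightmost digit: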
[[1. 0. 0. 0. 0. 1. 0. 0. 1.]]
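To see where this row comes from, the sketch below mimics the fit and transform steps in plain Python. It is a simplified stand-in written for illustration, not scikit-learn's implementation, and `fit_categories` and `transform_row` are hypothetical names. It assumes, as the output above suggests, that the columns of each feature follow the sorted order of the values seen during fitting.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of every column."""
    return [sorted(set(column)) for column in zip(*rows)]


def transform_row(row, learned):
    """One-hot encode a row, one block of columns per feature."""
    encoded = []
    for value, choices in zip(row, learned):
        encoded.extend(1.0 if value == choice else 0.0 for choice in choices)
    return encoded


training = [[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]]
learned = fit_categories(training)
print(learned)                            # [[0, 1], [0, 1, 2, 3], [0, 1, 2]]
print(transform_row([0, 3, 2], learned))  # [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```

The first feature has value 0, so the first of its two columns is set; the second has value 3, the last of four columns; the third has value 2, the last of three columns.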
Many machine learning algorithms for classification, regression, and clustering rely on 'distances' or 'similarities' between feature vectors, and most of these measures are defined in Euclidean space.
For discrete or categorical data, distances in Euclidean space are more meaningful when they are computed on the binary (one-hot) representation.
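For example, consider three feature vectors that hold the integer codes of three categories:
$x_1 = (1)$, $x_2 = (2)$, $x_3 = (3)$.
The distances between these vectors are:
$d(x_1, x_2) = 1$, $d(x_2, x_3) = 1$, $d(x_1, x_3) = 2$.
This result indicates that the distance between $x_1$ and $x_2$ is smaller than the distance between $x_1$ and $x_3$. However, there is no reason to say that two categories are further apart than two other categories of the same feature.
If we instead convert the feature vectors into one-hot encodings, then $x_1 = (1, 0, 0)$, $x_2 = (0, 1, 0)$, and $x_3 = (0, 0, 1)$. In this way, $d(x_1, x_2) = \sqrt{2}$, $d(x_2, x_3) = \sqrt{2}$, and $d(x_1, x_3) = \sqrt{2}$ as well, so the three feature vectors are equally distant from each other in the Euclidean space, which corresponds to the characteristics of non-correlated categorical variables.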
Therefore, in the representation of one-hot encoding, categorical features are binarized, and after binarizing these features, we can regard them as vectors embedded in the Euclidean space to calculate the distance for machine learning models.
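The equal-distance argument can be checked numerically. The sketch below assumes three unrelated categories, labeled 'a', 'b', and 'c' and coded as the integers 1, 2, 3 (an illustrative choice), and compares their pairwise Euclidean distances with those of the corresponding one-hot vectors. `pairwise_distances` is a hypothetical helper, and `math.dist` requires Python 3.8 or later.

```python
import math

# Assumed integer codes for three unrelated categories.
integer_codes = {"a": [1], "b": [2], "c": [3]}
# The same categories as one-hot vectors.
one_hot_codes = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}


def pairwise_distances(codes):
    """Euclidean distance for every pair of categories."""
    names = sorted(codes)
    return {
        (p, q): math.dist(codes[p], codes[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


integer_distances = pairwise_distances(integer_codes)
one_hot_distances = pairwise_distances(one_hot_codes)
print(integer_distances)  # 'a' and 'c' are twice as far apart as 'a' and 'b'
print(one_hot_distances)  # every pair is sqrt(2) apart
```

With integer codes the distances depend on an arbitrary numbering of the categories; with one-hot vectors every pair of distinct categories is separated by the same distance.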