26
Flatten Pandas dataframe MultiIndex
A few days ago, I had to extract a data from Oracle Database using SQL, and then PIVOT
a long set of values in a column, to multiple columns. The sintax to do this, requires to write every single value you want to be as a column, not one but twice to avoid the default quotes.
There are a few options like pivoting using PIVOT XML
, or even more recent to build a JSON column using JSON_ARRAYAGG
and JSON_OBJECT
to get dynamic columns or any value as an attribute, but still not so straightforward.
Lets try with a python notebook (I use VSCode notebook).
First import pandas:
import pandas as pd
We will use a random free dataset:
# source file from https://www.kaggle.com/gregorut/videogamesales
df = pd.read_csv('./vgsales.csv')
df
Using the function pivot_table
we will transponse the values of the column Genre to be a column for every value in the dataset.
pivot_df = df.pivot_table(index=['Year', 'Publisher'], columns=['Genre'], values=['Global_Sales'], aggfunc=np.sum, fill_value=0)
pivot_df
Now let's reset_index()
to "try" flattening the indexes.
mi_pivot_df = pivot_df.reset_index()
mi_pivot_df
pivot_df.to_excel('./global_sales_by_publishers_genres.xlsx', index=False)
I searched the solution to flat the most, but no answer was full, complete enough, and free of errors. I collected all, and by try and error, got here, a working solution. I also works after using pivot
function:
flat_index_pivot_df = pivot_df.copy()
flat_index_pivot_df.columns = flat_index_pivot_df.columns.droplevel(0)
flat_index_pivot_df.reset_index(inplace=True)
flat_index_pivot_df.columns.name = None
flat_index_pivot_df
Now you can get a clean Excel Sheet, free of MultiIndex.
flat_index_pivot_df.to_excel('./global_sales_by_publishers_genres.xlsx', index=False)
Do you know another solution?
26