20
Populate data-frame faster- [from 4 hours to 15 second]
Hello Guys, So the main problem was on this link.
To give you a high-level overview. I had to populate a data_frame with about 1300 columns and 1 million rows.
What I was doing first is
# main part, not complete code
def populate_data_frame_in_prediction_time(data, columns):
unknown_col = "nan"
columns_set = set(columns)
result_data_frame = pd.DataFrame(0, index=np.arange(len(data)), columns=columns)
for prefix in data.columns: # O(m)
unknown_column_name = str(prefix) + "_" + str(unknown_col)
for index, row in data.iterrows(): #O(n)
value = row[prefix]
result_column_name = str(prefix) + "_" + str(value)
if result_column_name not in columns_set: # O(1)
result_column_name = unknown_column_name
result_data_frame[result_column_name][index] = 1
result_data_frame = result_data_frame.astype('uint8')
return result_data_frame
It was a prolonged process. It took about 4+ hours for my case of scenery.
Then I have done slightly better but not the best way.
Then I have done slightly better but not the best way.
So what I have in this intermediate approach is
By doing this, the processing time improves to *20 minutes from 4 hours. * here is code for each column(series) in the raw data
def _custom_one_hot_encoding_1d(series, column_list):
unknown_col = "nan"
prefix = series.name
number_of_rows, number_of_col = series.shape[0], len(column_list)
dummy_data = np.array([np.zeros(number_of_rows, dtype=int)] * number_of_col).T
df = pd.DataFrame(dummy_data, columns=column_list)
for index, name in series.items():
if not name:
name = unknown_col
column_name = str(prefix) + "_" + str(name)
if column_name not in df:
column_name = prefix + "_" + unknown_col
df[column_name][index] = 1
return df
Then to improve more, what I did is
It improved the performance drastically. Now it is taking only about 15 seconds to do that processing. Here is the code.
def _custom_one_hot_encoding_1d(series, column_list):
unknown_col = "nan"
prefix = series.name
number_of_rows, number_of_col = series.shape[0], len(column_list)
column_idx = {column_list[i]: i for i in range(len(column_list))}
result_arr = np.array([np.zeros(number_of_rows, dtype=int)] * number_of_col).T
for index, name in series.items():
if not name:
name = unknown_col
column_name = str(prefix) + "_" + str(name)
if column_name not in column_list:
column_name = prefix + "_" + unknown_col
result_arr[index][column_idx[column_name]] = 1
df = pd.DataFrame(result_arr, columns=column_list)
return df
This is my story. Thank you guys for reading. If you guys have any better idea, please suggest here. I will be happy to try that in my code.
20