Populate a data frame faster [from 4 hours to 15 seconds]

Hello guys! The original problem is described at this link.

To give you a high-level overview: I had to populate a data frame with about 1300 columns and 1 million rows.

What I was doing at first:

  • Prepare the data frame with zeros for all columns and rows.
  • Then iterate through each column and row, and
  • assign the value for that cell after some dynamic processing.

Here is the code.
# main part, not complete code
import numpy as np
import pandas as pd

def populate_data_frame_in_prediction_time(data, columns):
    unknown_col = "nan"
    columns_set = set(columns)
    # Pre-allocate the full result frame, filled with zeros.
    result_data_frame = pd.DataFrame(0, index=np.arange(len(data)), columns=columns)

    for prefix in data.columns:  # O(m) columns
        unknown_column_name = str(prefix) + "_" + str(unknown_col)
        for index, row in data.iterrows():  # O(n) rows
            value = row[prefix]
            result_column_name = str(prefix) + "_" + str(value)
            if result_column_name not in columns_set:  # O(1) set lookup
                # Unseen value: route it to this column's "nan" bucket.
                result_column_name = unknown_column_name

            # Per-cell write into the DataFrame: this is the bottleneck.
            result_data_frame.at[index, result_column_name] = 1

    result_data_frame = result_data_frame.astype('uint8')
    return result_data_frame

It was a prolonged process: it took 4+ hours for my scenario.
Then I did slightly better, though still not in the best way.

So what I did in this intermediate approach:

  • I did not start with all the columns at once. I pick one column and build a small data frame for it, converting its categorical values to one-hot encoding.
  • Then I iterate through each column and row of only that smaller data frame, and
  • assign the value for that cell after some dynamic processing. Repeat the procedure for the next column of the raw data (a driver sketch follows the code below).

By doing this, the processing time improved from 4 hours to about 20 minutes. Here is the code that runs for each column (series) of the raw data:

import numpy as np
import pandas as pd

def _custom_one_hot_encoding_1d(series, column_list):
    unknown_col = "nan"
    prefix = series.name

    number_of_rows = series.shape[0]
    number_of_col = len(column_list)

    # Pre-allocate a zero-filled frame for this column's one-hot output.
    dummy_data = np.zeros((number_of_rows, number_of_col), dtype=int)
    df = pd.DataFrame(dummy_data, columns=column_list)
    for index, name in series.items():
        if not name:  # empty values go to the unknown bucket
            name = unknown_col

        column_name = str(prefix) + "_" + str(name)
        if column_name not in df:
            column_name = str(prefix) + "_" + unknown_col
        df.at[index, column_name] = 1  # still a per-cell DataFrame write
    return df
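
The per-column results then have to be stitched back together. Here is a minimal driver sketch of that step; how column_list is derived per prefix is an illustrative assumption (a simple startswith filter, which presumes no prefix is itself a prefix of another), not the exact code from my project.

import pandas as pd

def encode_all_columns(data, columns):
    per_column_frames = []
    for prefix in data.columns:
        # Illustrative: collect the expected one-hot columns for this prefix.
        column_list = [c for c in columns if c.startswith(str(prefix) + "_")]
        per_column_frames.append(_custom_one_hot_encoding_1d(data[prefix], column_list))
    # Glue the per-column one-hot frames together side by side.
    return pd.concat(per_column_frames, axis=1)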

Then, to improve further, what I did is:

  • Previously, I prepared the data frame first and then populated it.
  • I reversed that order.
  • I prepare the data first in a plain array (it can be a Python list or a NumPy array). To do that, I had to write some custom code; it was a pandas data frame earlier.
  • After preparing the data, I fill up the data frame with the full data in one go.

It improved the performance drastically: now the processing takes only about 15 seconds. Here is the code.

import numpy as np
import pandas as pd

def _custom_one_hot_encoding_1d(series, column_list):
    unknown_col = "nan"
    prefix = series.name

    number_of_rows = series.shape[0]
    number_of_col = len(column_list)

    # Map each column name to its position for O(1) lookups.
    column_idx = {name: i for i, name in enumerate(column_list)}
    # Build the data in a plain NumPy array first, not in a DataFrame.
    result_arr = np.zeros((number_of_rows, number_of_col), dtype=int)

    for index, name in series.items():  # assumes a default RangeIndex
        if not name:  # empty values go to the unknown bucket
            name = unknown_col

        column_name = str(prefix) + "_" + str(name)
        if column_name not in column_idx:  # O(1) dict lookup, not a list scan
            column_name = str(prefix) + "_" + unknown_col
        result_arr[index, column_idx[column_name]] = 1  # cheap array write

    # Wrap the finished array in a DataFrame once, at the very end.
    df = pd.DataFrame(result_arr, columns=column_list)
    return df
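
To see it in action, here is a tiny made-up example (the sample series and expected columns are just for illustration):

import pandas as pd

colors = pd.Series(["red", "blue", "green"], name="color")
expected = ["color_red", "color_blue", "color_nan"]

print(_custom_one_hot_encoding_1d(colors, expected))
#    color_red  color_blue  color_nan
# 0          1           0          0
# 1          0           1          0
# 2          0           0          1   # "green" was never seen, so it falls into color_nan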

This is my story. Thank you, guys, for reading. If you have any better ideas, please suggest them here; I will be happy to try them in my code.
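
One direction I have not benchmarked yet is pandas' built-in pd.get_dummies. A rough, untested sketch (the function name is made up, and the pre-cleaning loop that routes unseen values to the nan bucket is my assumption, added to match the behavior above):

import pandas as pd

def one_hot_encode_vectorized(data, columns):
    known = set(columns)
    cleaned = data.copy()
    for prefix in cleaned.columns:
        col = cleaned[prefix].astype(str)
        encoded = str(prefix) + "_" + col
        # Values that would produce an unseen column go to the nan bucket.
        cleaned[prefix] = col.where(encoded.isin(known), "nan")
    dummies = pd.get_dummies(cleaned, columns=list(cleaned.columns))
    # Align to the expected columns; anything missing stays all-zero.
    return dummies.reindex(columns=columns, fill_value=0).astype("uint8")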
