Populate data-frame faster - [from 4 hours to 15 seconds]

Hello guys! So the main problem was described in this link.
To give you a high-level overview: I had to populate a data frame with about 1300 columns and 1 million rows.
What I was doing at first:
  • Prepare the data frame with zeros for all columns and rows.
  • Then iterate through each column and row, and
  • assign the value for that cell after some dynamic processing. Here is the code.
  • # main part, not complete code
    import numpy as np
    import pandas as pd

    def populate_data_frame_in_prediction_time(data, columns):
        unknown_col = "nan"
        columns_set = set(columns)
        result_data_frame = pd.DataFrame(0, index=np.arange(len(data)), columns=columns)

        for prefix in data.columns:  # O(m) columns
            unknown_column_name = str(prefix) + "_" + str(unknown_col)
            for index, row in data.iterrows():  # O(n) rows
                value = row[prefix]
                result_column_name = str(prefix) + "_" + str(value)
                if result_column_name not in columns_set:  # O(1) set lookup
                    result_column_name = unknown_column_name

                # .at instead of chained df[col][index]: chained assignment
                # triggers SettingWithCopyWarning and may not write at all
                result_data_frame.at[index, result_column_name] = 1

        result_data_frame = result_data_frame.astype('uint8')
        return result_data_frame
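To make the naive version concrete, here is a runnable toy-sized sketch. The function body is repeated so the snippet is self-contained (with `.at` used for the cell write), and the column names and data are made up purely for illustration:

```python
import numpy as np
import pandas as pd

def populate_data_frame_in_prediction_time(data, columns):
    # Same logic as above: O(m) columns x O(n) rows, one cell write per pair.
    unknown_col = "nan"
    columns_set = set(columns)
    result = pd.DataFrame(0, index=np.arange(len(data)), columns=columns)
    for prefix in data.columns:
        unknown_column_name = str(prefix) + "_" + str(unknown_col)
        for index, row in data.iterrows():
            name = str(prefix) + "_" + str(row[prefix])
            if name not in columns_set:
                name = unknown_column_name
            result.at[index, name] = 1  # .at avoids chained assignment
    return result.astype("uint8")

# Hypothetical toy input: 2 categorical columns, 3 rows.
data = pd.DataFrame({"color": ["red", "blue", "green"],
                     "size": ["S", "M", "S"]})
columns = ["color_red", "color_blue", "color_nan",
           "size_S", "size_M", "size_nan"]

encoded = populate_data_frame_in_prediction_time(data, columns)
# "green" has no matching column, so it falls into the color_nan bucket.
```

Every row gets exactly one 1 per original column; unknown categories land in the `<prefix>_nan` fallback bucket.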
It was a prolonged process: it took about 4+ hours for my scenario.
    Then I did slightly better, though still not in the best way.
    So what I had in this intermediate approach:
  • I did not start working with all columns at once. Pick one column and make a data frame for it alone, converting its categorical values to one-hot encoding.
  • Then iterate through each column and row of only that smaller data frame, and
  • assign the value for that cell after some dynamic processing. Repeat the procedure for the next column of the raw data.
  • By doing this, the processing time improved to about 20 minutes from 4 hours. Here is the code for each column (series) in the raw data:
    def _custom_one_hot_encoding_1d(series, column_list):
        unknown_col = "nan"
        prefix = series.name

        number_of_rows, number_of_col = series.shape[0], len(column_list)

        # zeros matrix of shape (rows, cols); simpler than building a list
        # of row arrays and transposing it
        dummy_data = np.zeros((number_of_rows, number_of_col), dtype=int)
        df = pd.DataFrame(dummy_data, columns=column_list)
        for index, name in series.items():
            if not name:
                name = unknown_col

            column_name = str(prefix) + "_" + str(name)
            if column_name not in df:
                column_name = str(prefix) + "_" + unknown_col
            df.at[index, column_name] = 1  # .at instead of chained df[col][index]
        return df
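To sketch how the per-column version fits together: each raw column is encoded into its own small frame, and the pieces are concatenated at the end. The helper is repeated here (with `.at` for the cell write) so the snippet runs on its own; the input data and target names are made up:

```python
import numpy as np
import pandas as pd

def _custom_one_hot_encoding_1d(series, column_list):
    # Encode one raw column against its own (much smaller) set of
    # one-hot columns, instead of the full wide frame.
    unknown_col = "nan"
    prefix = series.name
    df = pd.DataFrame(np.zeros((series.shape[0], len(column_list)), dtype=int),
                      columns=column_list)
    for index, name in series.items():
        if not name:
            name = unknown_col
        column_name = str(prefix) + "_" + str(name)
        if column_name not in df:
            column_name = str(prefix) + "_" + unknown_col
        df.at[index, column_name] = 1
    return df

# Hypothetical raw data and the one-hot column names for each raw column.
data = pd.DataFrame({"color": ["red", "blue", "green"],
                     "size": ["S", "M", "S"]})
per_column_targets = {
    "color": ["color_red", "color_blue", "color_nan"],
    "size": ["size_S", "size_M", "size_nan"],
}

pieces = [_custom_one_hot_encoding_1d(data[col], cols)
          for col, cols in per_column_targets.items()]
encoded = pd.concat(pieces, axis=1)  # glue the small frames back together
```

Working on one narrow frame at a time is what cut the runtime: each cell write only has to navigate a handful of columns instead of all 1300.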
Then, to improve further, what I did:
  • Previously, I prepared the data frame first, then populated it.
  • I reversed the order.
  • I prepared the data first in an array (it can be a Python list / NumPy array). To do that, I had to write some custom code; earlier it was a pandas data frame.
  • After preparing the data, I filled up the data frame with the full data in one go.
  • It improved the performance drastically: now the processing takes only about 15 seconds. Here is the code.
    def _custom_one_hot_encoding_1d(series, column_list):
        unknown_col = "nan"
        prefix = series.name

        number_of_rows, number_of_col = series.shape[0], len(column_list)

        column_idx = {name: i for i, name in enumerate(column_list)}
        result_arr = np.zeros((number_of_rows, number_of_col), dtype=int)

        for index, name in series.items():
            if not name:
                name = unknown_col

            column_name = str(prefix) + "_" + str(name)
            if column_name not in column_idx:  # O(1) dict lookup, not an O(k) list scan
                column_name = str(prefix) + "_" + unknown_col
            result_arr[index, column_idx[column_name]] = 1

        df = pd.DataFrame(result_arr, columns=column_list)
        return df
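If you want to push the same array-first idea further, the remaining per-row Python loop can itself be replaced by NumPy fancy indexing: map every value to its column position in one pass, then set all the ones with a single assignment. A sketch under the same assumptions (a default RangeIndex, and made-up input names):

```python
import numpy as np
import pandas as pd

def _vectorized_one_hot_1d(series, column_list):
    # Map each value to its one-hot column index; unknown values fall
    # back to the "<prefix>_nan" bucket, as in the loop version.
    prefix = str(series.name)
    column_idx = {name: i for i, name in enumerate(column_list)}
    fallback = column_idx[prefix + "_nan"]

    names = prefix + "_" + series.astype(str)          # elementwise string concat
    col_positions = (names.map(column_idx)             # unknown names become NaN
                          .fillna(fallback)
                          .to_numpy(dtype=int))

    result_arr = np.zeros((len(series), len(column_list)), dtype="uint8")
    result_arr[np.arange(len(series)), col_positions] = 1  # one write, no loop
    return pd.DataFrame(result_arr, columns=column_list)

# Hypothetical input.
series = pd.Series(["red", "blue", "green"], name="color")
encoded = _vectorized_one_hot_1d(series, ["color_red", "color_blue", "color_nan"])
```

This keeps all the per-value work inside NumPy/pandas, so it should scale better than any explicit Python loop, though I have not timed it on the full 1-million-row case.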
    This is my story. Thank you for reading! If you have a better idea, please suggest it here; I will be happy to try it in my code.
