Product Developers’ guide to customizing data for AI — Part 1: Combine dataframes

TLDR
When working with AI, it’s common to have too much data; especially when training models. In this guide, we’ll cover how to look at relationships made when combining dataframes using concat.
Outline
  • Introduction
  • Before we begin
  • Relationships
  • Fold
  • Concat
  • Clean up
  • Conclusion
  • Introduction
    Welcome to the 1st part of the “Product developers’ guide to customizing data in AI”. In this series, we’ll go over intermediate concepts and run through examples using Pandas. We’ll start by looking at a common dataset about email information and progressively tailor information. By the end of the series, you’ll be ready to tackle designing machine learning datasets for training machine learning models.
    Before we begin
    In this guide, we’ll be using the email content dataset along with Google Collab. We’ll import the dataset, and chop it down. If you need a refresher on anything mentioned above, please refer back to part 1 of the beginner’s guide.
    imageemail_content dataframe
    Relationships
    To understand how to customize data in AI, we need to dive deeper into relations within set theory. To start off, we’ll look at the 2 most basic relationships, unions and intersections. Think of multiple datasets, and within each, there may or may not be duplicate values that are mutual to another dataframe.
    Union
    Unions are useful for a general outlook of all the values removing duplicates.
    For a union, this is a combination of all the values, and returns all values with the mutual values listed exactly once.
    imageThe area of all the sets (Source: mychartguide)
    Intersection
    Intersections are useful for finding values that are found in all the dataset
    But for an intersection, it takes the combination of all the values, and only returns the mutual values.
    imageThe overlapping area of two sets (Source: mychartguide)
    Fold
    To start off let’s follow a common practice among data scientists to “fold” the data. This is especially useful for training machine learning models as it breaks down the data for training, testing, and validation sets. Let’s start with a fold of 2 parts, or a half on our email_content dataset. Imagine the dataset as a piece of paper. After folding the piece of paper in half, it breaks it down into exactly 2 pieces.
    imageTwo ways to fold: Hotdog or Hamburger (Source: Elasticity)
    Hotdog Style
    One way to fold paper is hotdog style, by folding horizontally through the center. In Pandas, this is accomplished by taking the maximum length of the rows and dividing it into segments. For our dataset, let’s take an arbitrary number to section it by. There are other alternatives, but iloc and loc are the simplest and covered in the beginner’s guide.
    imageFold it using into rows 0 to 60 and rows 50 to 100
    image1st dataframe is from row 1 to 60
    image2nd dataframe is from row 51 to 100
    Hamburger Style
    Alternatively, we can fold it hamburger style, by folding vertically through the center. In Pandas, this is done by splitting the columns into multiple parts using loc to select specific columns. Here we have chosen to split the email columns based on its relevance into 2 parts. Email content is broken down into email information readable for humans, and email information readable by a machine.
    imageInclude all the rows, but only specific columns in each
    image1st dataframe contains data useful for a machine
    image2nd dataframe contents are easily understandable by a human
    Concat
    Using the folded data, let’s put them together using concat. Concat is very versatile and has multiple functionality when performing a union or intersection operation on multiple datasets.
    Hotdog
    Let’s start by applying concat on the folded hotdog style datasets. Using concat, we union the datasets and get back the original. By default, concat is a union operation.
    imageUse concat to get the union of two datasets resulting in 50 + 60 (110) rows
    Then we apply concat using the intersection, we get the overlapping indexes. To get the best result, we use axis=1 along the columns.
    imageimageUse concat to get the intersection of two datasets, rows 51–60
    Hamburger
    For hamburger style, the columns don’t match and performing a union gives a less desirable result filled with null values. We also use axis=1 again since we want to see the columns.
    imageCombine all the values in both columns
    Then for an intersection, we want to look at the intersection to grab the overlapping rows.
    imageimageThe “created at” column is shared between both dataframes
    Clean Up
    A lot of the data is duplicated, such as column names or row entries. This is because unlike sets, which cannot contain duplicates, a dataframe is allowed to have as many duplicate values. Luckily, Pandas has the drop_duplicates function to convert our results into its set theory result, while remaining as a dataframe.
    Applying drop_duplicates to the union, returns all values exactly once. Thus, returning the original email_content dataframe or set union.
    imageSet union, results in the original dataset
    Applying drop_duplicates to the intersection, returns the overlapping values exactly once. Thus, the result is the set intersection
    imageSet intersection, results in the created_at column
    Conclusion
    We’ve established a point to begin diving into relationships between datasets, a crucial component for customizing data in machine learning. We covered folding, a technique used immensely in data science to split data, and concat, to combine data by finding matching values to create unions and intersections. In the rest of this series, we’ll be looking at how we can remove extras by using join and merge instead of drop_duplicates and establish relationships based on exclusion rather than inclusion.

    25

    This website collects cookies to deliver better user experience

    Product Developers’ guide to customizing data for AI — Part 1: Combine dataframes