15
Pandas: Importing Data, Indexing, Comparisons and Selectors (featuring adoptable dog data)
For the second Python Pandas guide we will be reviewing how to import data, as well as a deeper dive into indexing, comparison statements and selecting subsets of data. For this we will be using a dataset with information about adoptable dogs from Kaggle. They have loads of datasets that you can easily download and use for projects.
To start we need to import our downloaded data into a DataFrame. It's pretty simple to upload the data from our downloads folder when working locally using the pd.read_csv function.
dogs = pd.read_csv(r'Downloads/ShelterDogs.csv')
Once we have imported the data we want to inspect the DataFrame and make sure it contains all of the information that we need and in the correct format. We can use .head() to view the first 5 rows.
dogs.head()
In order to find out more about the data types of each column, we can use .info() which displays the column name, the count of rows that contain null data and then the data type of each columns.
dogs.info()
- int - whole numbers
- float - decimal numbers
- object - a string
There may be an instance where we want to change a data type, in order to perform some operations, or to make them easier to view.
In order to change string to number you can use the Pandas method pd.to_numeric or to a change number into to string you can use .astype(str).
Within our dataframe we can also change the date column from a string to an actual date format, using the .to_datetime command.
dogs['date_found'] = pd.to_datetime(dogs['date_found'])
dogs
Finally, .shape will show us the number of columns and then the number of rows.
dogs.shape
To select two or more columns from a DataFrame, we can use a list of the column names between a double set of square brackets. This will provide us a subset of the data, and create a new DataFrame containing only this information, leaving our original DataFrame untouched.
type_breed = dogs[['name', 'sex', 'breed']]
type_breed.head()
We can use .iloc, which is integer based, to select and display a single column by specifying the positional index of the row we want to view.
dogs.iloc[1]
In some cases, we might also want to select only certain rows, we can do this by using the index of the rows, and select those using the iloc command. It allows us to select multiple rows, using index based selection. This is similar to how we select elements from a list, using the : operator and square brackets.
dogs.iloc[3:7]
This will display rows 3-6, as the last row is not included.
The : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.
iloc[:10] would select all rows up to, but not including, the 10th row
dogs.iloc[:10]
We can also use minus indexing, for example iloc[-2:] will display the last 2 columns.
dogs.iloc[-2:]
When we get data from different sources, we may need to rename the columns so they are easier to read or call from.
There are a couple of methods to do this, depending on which column names we want to change.
.columns will allow us to change all of the columns at once. However, it’s important to get the ordering right to avoid mislabeling them.
.rename will allow us to change individual columns. You can pass a single column to change or multiple columns, using a dictionary with the original name and the new name.
dogs.rename(columns = {'posted' : 'date_posted',
'breed' : 'dog_breed',
'adoptable_from' : 'date_adoptable',
'coat' : 'coat_type'},
inplace = True)
dogs.head()
We can also select a subset of data using comparison statements.
Operator | Purpose |
---|---|
== | Equal to |
!= | Not Equal |
> | Greater than |
< | Less than |
>= | Greater than or Equal to |
<= | Less than or Equal to |
For example, if we wanted to only display female dogs from the dataset, we would use the double equal operator and define the column that we are selecting from.
female = dogs[dogs.sex == 'female']
female
If we wanted to exclude all dogs which are an unknown mix breed, we could use the not equal to operator to select all rows except ones that mention Unknown Mix.
breeds = dogs[dogs.dog_breed != 'Unknown Mix']
breeds
As we have numbers in our dataset we can use the greater than operator to select any rows which contain a number higher than the one we give it. We just need to specify which column we are selecting the data from, in this case, it's the age column.
over_two = dogs[dogs.age > 2]
over_two
Similarly, we can use the less than operator to select all of the dogs which have an age less than 4.
under_four = dogs[dogs.age < 4]
under_four
It is also possible to use multiple comparison operators in a single selection, and we can define if we want the final dataset to contain either both of these by using the ampersand, to signify an and rule.
likes_people = dogs[(dogs.likes_people == 'yes') & (dogs.likes_children == 'yes')]
likes_people
In addition, we can use the pipe, |, to define an or rule, where we want our final dataset to display either of the data points we have selected.
cats_female = dogs[(dogs.get_along_females == 'yes') | (dogs.get_along_cats == 'yes')]
cats_female
We can also use mixed comparison operators within one statement, for example if we want any dogs which are neutered and aged either 4 or older we can do such:
suitable = dogs[(dogs.neutered == 'yes') & (dogs.age >= 4)]
suitable
Pandas also comes with some conditional selectors that are built in and can be used in a similar way to logical statements.
- isin selects data where the value is what you are defining in the list
- isnull will select data that is null within the columns that you select (i.e displaying Null)
- notnull selects data that has a value, i.e all that is not null
breed = dogs[(dogs.dog_breed.isin(['Staffordshire Terrier Mix', 'Labrador Retriever Mix', 'German Shepherd Dog Mix']))]
breed
Once we have selected subsets of data, the index is changed to reflect only the rows that we have selected, if we want to reorder the index appropriately we will want to reset the index using .reset_index().
breed = dogs[(dogs.dog_breed.isin(['Staffordshire Terrier Mix', 'Labrador Retriever Mix', 'German Shepherd Dog Mix']))].reset_index()
breed
neutered_known = dogs.loc[dogs.neutered.notnull()]
neutered_known
neutered_unknown = dogs.loc[dogs.neutered.isnull()]
neutered_unknown
It's likely that you will come across null data when dealing with big datasets, and this can cause issues when doing any selecting or mathematical operations. It can also sometimes make the tables look messy, hard to review and can skew final results.
There is an easy way to replace the null data columns using the Pandas method .fillna(). Within the brackets you will need to add an argument that states the replacement number or text. This will then replace all of the Nan fields.
dogs.fillna("not available")
I hope this has been a helpful (and cute) way to understand the more of the Pandas library in Python. I'm looking forward to making a few more posts in this series :)
A notebook to download and play around with can be found here .
15