19
How to iterate over pandas DataFrame rows (and should you?)
One of the most searched for (and discussed) questions about pandas is how to iterate over rows in a DataFrame
. Often this question comes up right away for new users who have loaded some data into a DataFrame
and now want to do something useful with it. The natural way for most programmers to think of what to do next is to build a loop. They may not understand the “correct” way to work with DataFrames
yet, but even experienced pandas and NumPy developers will consider iterating over the rows of a DataFrame
to solve a problem. Instead of trying to find the one right answer about iteration, it makes better sense to understand the issues involved and know when to choose the best solution.
As of this writing, the top voted question tagged with ‘pandas’ on Stack Overflow is about how to iterate over DataFrame
rows. It also turns out that question has the most copied answer with a code block on the entire site. The Stack Overflow developers say thousands of people view the answer weekly and copy it to solve their problem. Obviously people want to iterate over DataFrame
rows!
It is also true that there can be serious consequences with iterating over DataFrame
rows using the top solution. Other answers to the question (especially the second highest rated answer) do a fairly good job of giving other options, but the entire list of 26 (and counting!) answers is extremely confusing. Instead of asking how to iterate over DataFrame
rows, it makes more sense to understand what the options are that are available, what their advantages and disadvantages are, and then choose the one that makes sense for you. In some cases, the top voted answer for iteration might be the best choice!
First, choosing to iterate over the rows of a DataFrame
is not automatically the wrong way to solve a problem. However, in most cases what beginners are trying to do with iteration is better done with another approach. However, no one should ever feel bad about writing a first solution that uses iteration instead of other (perhaps better) ways. That’s often the best way to learn, you can think of a first solution as the first draft of your essay, you can improve it with some editing.
Let’s start with basic questions. If we look at the original question on Stack Overflow, the question and answer just print the content of the DataFrame
. First off, let’s all agree that this is not a good way to look at the content of a DataFrame
. The standard rendering of a DataFrame
, whether it is rendered with print
or viewed with a Jupyter notebook using display
or as an output in a cell will be far better than what would be printed using custom formatting.
If the DataFrame
is large, only some columns and rows may be visible by default. Use head
and tail
to get a sense of the data. If you want to only look at subsets of a DataFrame
, instead of using a loop to only display those rows, use the powerful indexing capabilities of pandas. With a little practice, you can select any combinations of rows or columns to show. Start there first.
Now instead of a trivial printing example, let’s look at ways to actually use data for a row in a DataFrame
that includes some logic.
Let’s build an example DataFrame
to use. I’ll do this by making some fake data (using Faker). Note that the columns are different data types (we have some strings, an integer, and dates).
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from faker import Faker
fake = Faker()
today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(), fake.last_name(),
fake.date_this_decade(), fake.date_between_dates(today, next_month),
fake.city(), fake.state(), fake.zipcode(), fake.random_int(-100,1000)]
for r in range(100)],
columns=['first_name', 'last_name', 'start_date',
'end_date', 'city', 'state', 'zipcode', 'balance'])
df['start_date'] = pd.to_datetime(df['start_date']) # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])
df.dtypes
first_name object
last_name object
start_date datetime64[ns]
end_date datetime64[ns]
city object
state object
zipcode object
balance int64
dtype: object
df.head()
first_name last_name start_date end_date city state \
0 Katherine Moody 2020-02-04 2021-06-28 Longberg Maryland
1 Sarah Merritt 2021-03-02 2021-05-30 South Maryborough Tennessee
2 Karen Hensley 2020-02-29 2021-06-23 Brentside Missouri
3 David Ferguson 2020-02-02 2021-06-14 Judithport Virginia
4 Phillip Davis 2020-07-17 2021-06-04 Louisberg Minnesota
zipcode balance
0 20496 493
1 18495 680
2 63702 427
3 66787 587
4 98616 211
Let’s say that our DataFrame
contains customer data and we have a scoring function for customers that uses multiple customer attributes to give them a score between ‘A’ and ‘F’. Any customer with a negative balance is scored an ‘F’, above 500 is an ‘A’, and after that, logic depends on if a customer is a ‘legacy’ customer and what state they live in.
Note that I made doctests for this function, see my post on Jupyter unit testing for more details on how to unit test in Jupyter.
from dataclasses import dataclass
@dataclass
class Customer:
first_name: str
last_name: str
start_date: datetime
end_date: datetime
city: str
state: str
zipcode: str
balance: int
def score_customer(customer:Customer) -> str:
"""Give a customer a credit score.
>>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, -5))
'F'
>>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
'C'
>>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
'D'
>>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 150))
'C'
>>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 250))
'B'
>>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 350))
'B'
>>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 350))
'A'
>>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 50))
'C'
"""
if customer.balance < 0:
return 'F'
if customer.balance > 500:
return 'A'
# legacy vs. non-legacy
if customer.start_date > datetime(2020, 1, 1):
if customer.balance < 100:
return 'D'
elif customer.balance < 200:
return 'C'
elif customer.balance < 300:
return 'B'
else:
if customer.state in ['Illinois', 'Indiana']:
return 'B'
else:
return 'A'
else:
if customer.balance < 100:
return 'C'
else:
return 'A'
import doctest
doctest.testmod()
TestResults(failed=0, attempted=8)
OK, now that we have a concrete example, how do we obtain the score for all of our customers? Let’s just go straight to the top answer from the Stack Overflow question, DataFrame.iterrows
. This is a generator that returns the index for a row along with the row as a Series
. If you aren’t familiar with what a generator is, you can think of it as a function you can iterate over. As a result, calling next
on it will yield the first element.
next(df.iterrows())
(0,
first_name Katherine
last_name Moody
start_date 2020-02-04 00:00:00
end_date 2021-06-28 00:00:00
city Longberg
state Maryland
zipcode 20496
balance 493
Name: 0, dtype: object)
This looks promising! This is a tuple containing the index of the first row and the row data itself. Maybe we can just pass it right into our function. Let’s try that out and see what happens. Even though the row is a Series
, the columns are the same as the attributes of our Customer
class, so we might be able to just pass this into our scoring function.
score_customer(next(df.iterrows())[1])
'A'
Wow, that seemed to work. Can we just score the entire table?
df['score'] = [score_customer(c[1]) for c in df.iterrows()]
Wow, that seems too easy. You can see why this is the top voted answer, since it seems to do exactly what we want. Why would there be any controversy about this answer?
As is usually the case with pandas (and really with any software engineering question), picking an ideal solution depends on the inputs. Let’s summarize what the issues could be with various design choices. If the issues raised don’t fit your specific use case, iteration using iterrows
may be a perfectly acceptable solution! I won’t judge you. I use it plenty of times, and will summarize at the end how to make decisions about the possible solutions.
The arguments for and against using iterrows
can be grouped into the following categories.
- Efficiency (Speed and Memory)
- Mixed types in a row causing issues
- Readability and maintainability
In general, if you want things to be fast in pandas (or Numpy, or any framework that offers vectorized calculations), you will not want to iterate through elements but instead choose a vectorized solution. However, even if the solution can be vectorized, it might be a lot of work for the programmer to do so, especially a beginner. Other answers to the question on Stack Overflow present a host of other solutions. They mostly all fall into one of the following categories, in the following order of preference for speed:
- Vectorization
- Cython routines
- List comprehensions (vanilla for loop)
- DataFrame.apply()
- DataFrame.itertuples() and iteritems()
- DataFrame.iterrows()
The main problem with always telling people to vectorize everything is that at times a vectorized solution may be a real chore to write, debug, and maintain. The examples given to prove that vectorization is preferred often show trivial operations, like simple multiplication. But since the example I started with in this article is not just a single calculation, I decided to write one possible vectorized solution to this problem.
def vectorized_score(df):
return np.select([df['balance'] < 0,
df['balance'] > 500, # technically not needed, would fall through
((df['start_date'] > datetime(2020,1,1)) &
(df['balance'] < 100)),
((df['start_date'] > datetime(2020,1,1)) &
(df['balance'] >= 100) &
(df['balance'] < 200)),
((df['start_date'] > datetime(2020,1,1)) &
(df['balance'] >= 200) &
(df['balance'] < 300)),
((df['start_date'] > datetime(2020,1,1)) &
(df['balance'] >= 300) &
df['state'].isin(['Illinois', 'Indiana'])),
((df['start_date'] >= datetime(2020,1,1)) &
(df['balance'] < 100)),
], # conditions
['F',
'A',
'D',
'C',
'B',
'B',
'C'], # choices
'A') # default score
assert (df['score'] == vectorized_score(df)).all()
There’s more than one way to do this, of course. I chose to use np.select
(you can read more about it and other various ways to update DataFrame
s in my article on using where
and mask
.) I sort of like using np.select
when you have multiple conditions like this, although it’s not extremely readable. We could have also done this using more code with vectorized updates for each step and made it much more readable. It would probably be similar in terms of speed.
I personally find this very unreadable, but maybe with some good comments it could be clearly explained to future maintainers (or my future self). But the reason we are doing vectorized code is to make this faster. How does performance look for our sample DataFrame
?
%timeit vectorized_score(df)
2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Let’s also time our original solution.
%timeit [score_customer(c[1]) for c in df.iterrows()]
13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
OK, so we’re almost 5x faster, just with our tiny dataset. This speedup wouldn’t be enough to matter for small sizes, but with big datasets a simple rewrite to get that much of a speedup makes sense. And I’m sure that a faster vectorized version could be written with a little thought and profiling applied to the situation. But hold on until the end to see what the performance looks like for larger datasets.
Now the next option is a little different. I admit that I don’t think I’ve used this technique often. The idea here is to use a list comprehension, invoking your function with each element in your DataFrame
. Note that I did use a list comprehension already in our first solution, but it was along with iterrows
. This time instead of using iterrows
, the data is pulled out of each column in the DataFrame
directly and then iterated over. No Series
is created in this case. If your function has multiple arguments, you can use zip
to make tuples of the arguments, passing in the columns in your DataFrame
to match the argument order. Now to do this, I’ll need a modified scoring function, since I don’t have already constructed Customer
objects in my DataFrame
, and creating them just to invoke the function would add another layer. I only use three attributes of the customer, so here’s a simple rewrite.
def score_customer_attributes(balance:int, start_date:datetime, state:str) -> str:
if balance < 0:
return 'F'
if balance > 500:
return 'A'
# legacy vs. non-legacy
if start_date > datetime(2020, 1, 1):
if balance < 100:
return 'D'
elif balance < 200:
return 'C'
elif balance < 300:
return 'B'
else:
if state in ['Illinois', 'Indiana']:
return 'B'
else:
return 'A'
else:
if balance < 100:
return 'C'
else:
return 'A'
And here’s what the first loop of the list comprehension will look like when calling the function.
next(zip(df['balance'], df['start_date'], df['state']))
(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')
We will now build a list of all the scores for the entire DataFrame
.
df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()
Now how fast is this?
%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
171 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Wow, that’s much faster, over 70x faster than the original for this data. By just taking the raw data and invoking a simple Python function, the scores are all calculated quickly in Python space. No row conversions to Series
need to take place.
Note that we could also invoke our original function, we’d just have to make a Customer
object to pass in. This is a bit uglier, but still quite fast.
%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]
254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We can also use DataFrame.apply
. Note that to apply this to rows, you need to pass in the correct axis since it defaults to applying to each column. The axis argument here is specifying which index you want to have in the object passed to your function. We want each object to be a customer row, with the columns as the index.
assert (df.apply(score_customer, axis=1) == df['score']).all()
%timeit df.apply(score_customer, axis=1)
3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The performance here is better than our original, over 3x faster. This is also very readable, and allows us to use our easy to read and maintain original function. It’s still slower than the list comprehension though because it is constructing a Series
object for each row.
Now we will look at the regular iteration methods in more detail. There are three iter
functions available for DataFrame
s: iteritems
, itertuples
, and iterrows
. DataFrames
also support iteration directly, but these functions don’t all iterate over the same things. Since understanding what all these methods do by just seeing their names can be really confusing, let’s review them all here.
-
iter(df)
(calls theDataFrame. __iter__
method). Iterate over the info axis, which forDataFrames
is the column names, not the values.
next(iter(df)) # 'first_name'
'first_name'
-
iteritems
. Iterate over the columns, returning a tuple of column name and the column as aSeries
.
next(df.iteritems())
next(df.items()) # these two are equivalent
('first_name',
0 Katherine
1 Sarah
2 Karen
3 David
4 Phillip
...
95 Robert
96 Christopher
97 Kristen
98 Nicholas
99 Caroline
Name: first_name, Length: 100, dtype: object)
-
items
. This is the same as above.iteritems
actually just invokesitems
.
next(df.iterrows())
(0,
first_name Katherine
last_name Moody
start_date 2020-02-04 00:00:00
end_date 2021-06-28 00:00:00
city Longberg
state Maryland
zipcode 20496
balance 493
score A
score3 A
Name: 0, dtype: object)
-
iterrows
. We already have seen this, it iterates through the rows, but returns them as a tuple of index and the row, as aSeries
. -
itertuples
. Iterates over the rows, returning anamedtuple
for each row. You can optionally change the name of the tuple and disable the index being returned.
next(df.itertuples())
Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')
Since we already looked at iterrows
, we only need to look at itertuples
. As you can see, the returned value, a namedtuple
, can be used in our original function.
assert ([score_customer(c[1]) for c in df.iterrows()] == df['score']).all()
%timeit [score_customer(t) for t in df.itertuples()]
858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance here is pretty good, over 12x faster. The construction of a namedtuple
for each row is much faster than construction of a Series
.
Now is a good time to bring up another difference between iterrows
and itertuples
. A namedtuple
can properly represent any type in a single row. In our case, we have strings, date types, and integers. A pandas Series
, however, has to have only one datatype for the entire Series
. Because our datatypes were diverse enough, they were all represented as object
types, and ended up retaining their type, with no functionality issues for us. But this is not always the case!
If your columns have different numerical types, for example, they will end up being the type that can represent all of them. This can cause your data returned by itertuples
and iterrows
to be slightly different between these two methods, so watch out.
dfmixed = pd.DataFrame({'integer_column': [1,2,3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes
integer_column int64
float_column float64
dtype: object
next(dfmixed.itertuples())
Pandas(Index=0, integer_column=1, float_column=1.1)
next(dfmixed.iterrows())
(0,
integer_column 1.0
float_column 1.1
Name: 0, dtype: float64)
One other word of warning. If your DataFrame
has columns that cannot be represented as Python variable names, you will not be able to access them using dot syntax. So if you have a column named 2b
or My Column
then you’ll have to access them using positional names (i.e. the first column will be called _1
). For iterrows
, the row will be a Series
, so you’ll have to access the columns using ["2b"]
or ["My Column"]
.
There are other options for iteration, of course. For example, you could increment an integer offset and use the iloc
indexer on the DataFrame
to select any row. Of course, this is really no different from any other iteration, while also being non-idiomatic so others reading your code will probably find it hard to read and understand. I built a naive version of this in the performance comparison code for the summary below, if you want to see it (the performance was horrible).
Choosing the right solution depends on essentially two factors:
- How big is your data set?
- What can you write (and maintain) easily?
In the image below, you can see the running time for the solutions we’ve considered (the code to generate this is here). As you can see, only the vectorized solution holds up well with larger data. If your data set is huge, vectorized solutions may be your only reasonable choice.
However, depending on how many times you need to execute your code, how long it takes you to write it correctly, and how well you can maintain it going forward, you may choose any of the other solutions and be fine. In fact, they all grow linearly with increasing data for these solutions.
Maybe one way to think about this is not just big-O notation, but “big-U” notation. In other words, how long will it take YOU to write a correct solution? If it’s less than the running time of your code, an iterative solution may be totally fine. However, if you’re writing production code, take the time to learn how to vectorize.
One other point; sometimes writing the iterative solution on a smaller set is easy, and you may want to do that first, then write the vectorized version. Verify your results with the iterative solution to make sure you did it correctly, then use the vectorized version on the larger full data set.
I hope you’ve found this dive into DataFrame
iteration interesting. I know I learned a few useful things along the way.
The post How to iterate over DataFrame rows (and should you?) appeared first on wrighters.io.
19