Using Generators in Python: The Why, The What, and The When
Today, “what are Generators in Python” and “what are Generators used for in Python” are some of the most popular Python interview questions.
Generators are often considered one of the more intermediate concepts in Python. If you are new to Python, you may not have come across them before. Here’s a hint: they have something to do with the use of yield statements inside a function.
In this post, I am going to highlight some of the use cases, reasons, and advantages of using Generators in Python. In short, you should consider using Generators when dealing with large datasets with memory constraints.
Let’s dive a little bit deeper, shall we?
- Consider using a Generator when dealing with a huge dataset
- Consider using a Generator when we do not need to iterate over the data more than once
- Generators give us lazy evaluation
- They are a great way to generate sequences in a memory-efficient manner
To understand why you should use Generators, we first have to understand that computers have a finite amount of memory (RAM). Whenever we store or manipulate variables, lists, and so on, all of that data is held in memory.
You might ask, why do computer programs store them in memory? Because it’s the fastest way for us to write and retrieve data.
Have you ever had to work with a list so large that you ran into a MemoryError? Perhaps you have tried reading rows from a very large Excel (or .csv) file.
All I remember is that performing these tasks was painfully slow, if not impossible.
To put it simply, a Generator function is a special kind of function that returns many items. The point here is that the items are returned one by one rather than all at once.
The main difference between a regular function and a Generator function is that a regular function hands back its result once with return, while a Generator function uses yield to produce values one at a time.
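To make the return versus yield distinction concrete, here is a minimal sketch. The two function names are hypothetical and exist only for illustration:

```python
def squares_with_return(n):
    """Regular function: builds the whole list, then returns it once."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_with_yield(n):
    """Generator function: produces one value each time it is resumed."""
    for i in range(n):
        yield i * i


as_list = squares_with_return(5)
as_generator = squares_with_yield(5)

print(as_list)             # [0, 1, 4, 9, 16]
print(type(as_generator))  # <class 'generator'>
print(list(as_generator))  # [0, 1, 4, 9, 16]
```

Calling the Generator function does not run its body. It only hands back a Generator object, which produces values once something iterates over it.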
You may have come across the statement that Generators give us lazy evaluation. But what does it really mean?
If you are familiar with Iterators, calling a Generator function gives you an object that behaves just like one.
Behind the scenes, Generators don’t compute the value of each item when they are instantiated. Rather, they compute it when we ask for it. This is what people mean when they say Generators give you lazy evaluation.
As a result, Generators allow us to process and deal with one value at a time without having to load everything in memory first.
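A small sketch can show lazy evaluation in action. The countdown Generator below is a made-up example, and the print call inside it is there only to reveal when the body actually starts running:

```python
def countdown(start):
    """Hypothetical generator used only to illustrate lazy evaluation."""
    print("generator body started")
    while start > 0:
        yield start
        start -= 1


gen = countdown(3)       # nothing is printed yet: the body has not run
print(iter(gen) is gen)  # True: a generator is its own iterator

print(next(gen))         # prints "generator body started", then 3
print(next(gen))         # 2
print(next(gen))         # 1

try:
    next(gen)
except StopIteration:
    print("exhausted")   # raised once there is nothing left to yield
```

Each call to next resumes the function just long enough to reach the following yield, so only one value exists at any given moment.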
Generators are great when you encounter problems that require you to read from a large dataset. Reading from a large dataset means our computer or server would have to allocate memory for it.
The only condition to remember is that a Generator can only be iterated once. In other words, as long as we do not need to go back to a previous value from our dataset, we can always use a Generator.
Another common use case of using Generators is when we are working with large files such as Excel or CSV documents. Without using a Generator function, here’s how we can write it:
# Example of using a regular function
import csv

def read_csv_from_regular_fn():
    with open('large_dataset.csv', 'r') as f:
        reader = csv.reader(f)
        return [row for row in reader]

result_1 = read_csv_from_regular_fn()
# Output:
# [['a','b','c', ... ], ['x','y','z', ... ] ... ]
Upon running the example above, we may experience some slowness or even a MemoryError, depending on our computer.
Looking at the code example above, read_csv_from_regular_fn opens our CSV file and loads every row into memory at once in order to build the result.
This is not a good solution when working with files larger than our available memory. Alternatively, we could do this:
# Example of using a Generator function
import csv

def read_csv_from_generator_fn():
    with open('large_dataset.csv', 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            yield row

# To get the same output as result_1,
# We generate a list using our newly created Generator function:
result_2 = [row for row in read_csv_from_generator_fn()]
# Output same as result_1:
# [['a','b','c', ... ], ['x','y','z', ... ] ... ]
In this scenario, we use read_csv_from_generator_fn as our Generator function. It opens our large CSV file, loops through every row, and yields one row at a time rather than all of them at once.
Note that result_2 collects every row into a list only to show that the output matches result_1. As long as we process the rows one at a time instead of collecting them, we would not run into a MemoryError or slowness due to memory constraints when reading data from our large_dataset.csv.
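To benefit from the Generator, we have to consume the rows as they arrive rather than store them all. The sketch below assumes a small temporary CSV file as a stand-in for large_dataset.csv. The read_rows helper and the two-column layout are made up for illustration:

```python
import csv
import os
import tempfile


def read_rows(path):
    """Yield one parsed CSV row at a time."""
    with open(path, "r", newline="") as f:
        for row in csv.reader(f):
            yield row


# Hypothetical stand-in for large_dataset.csv so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(1000):
        writer.writerow([i, i * 2])
    sample_path = tmp.name

row_count = 0
second_column_total = 0
for row in read_rows(sample_path):  # only the current row is kept around
    row_count += 1
    second_column_total += int(row[1])

os.remove(sample_path)
print(row_count, second_column_total)  # 1000 999000
```

Because the loop keeps only running totals, memory usage stays roughly flat no matter how many rows the file contains.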
To compare the size in bytes of the two returned objects, we could do the following:
import sys
print(sys.getsizeof(read_csv_from_generator_fn())) # 112 bytes
print(sys.getsizeof(read_csv_from_regular_fn())) # 1624056 bytes
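One caveat: sys.getsizeof reports only the size of the object itself, so for a list it does not include the items the list refers to. As a complementary check, the standard library's tracemalloc module can measure the peak memory allocated while some code runs. The sketch below is one possible way to do that. The peak_bytes helper and the value of N are assumptions chosen for illustration:

```python
import tracemalloc


def peak_bytes(build_and_consume):
    """Return the peak number of bytes allocated while the callable runs."""
    tracemalloc.start()
    build_and_consume()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


N = 200_000  # hypothetical size, small enough to run quickly

list_peak = peak_bytes(lambda: sum([i * i for i in range(N)]))
generator_peak = peak_bytes(lambda: sum(i * i for i in range(N)))

print(list_peak > generator_peak)  # True
```

The exact numbers depend on the Python version and platform, which is why the sketch only compares the two peaks instead of printing them as fixed values.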
Another case where Generators are often used is when we want to process values from a large list:
# Example 1
nums_list_comprehension = [i * i for i in range(100_000_000)]
sum(nums_list_comprehension) # 333333328333333350000000
Depending on your computer, you may encounter a MemoryError or at least a couple of seconds of slowness when evaluating the expression above.
Like a list comprehension, a Generator expression allows us to quickly create a Generator object without having to use the yield statement.
To cope with our memory constraint, we could turn the code example above into a Generator expression. Creating the Generator below is almost immediate, because no values are computed until sum consumes them:
# Example 2
nums_generator = (i * i for i in range(100_000_000))
# <generator object <genexpr> at 0x106ecc580>
sum(nums_generator) # 333333328333333350000000
In Example 1, i * i is evaluated for the entire range of 100_000_000 and stored in memory beforehand. The list comprehension returns a full list.
In Example 2, i * i is only evaluated during iteration, one value at a time. The Generator expression returns a Generator object.
Remember, Generators don’t compute the value of each item when being instantiated.
The differences in memory usage are below:
import sys
print(sys.getsizeof(nums_generator)) # 112 bytes
print(sys.getsizeof(nums_list_comprehension)) # 835128600 bytes
A Generator can only be iterated once.
The example below shows that the Generator created by the nums_generator expression can only be iterated once. Calling sum on it a second time returns zero because the Generator is already exhausted.
# Continuing from Example 2, with a freshly created Generator
nums_generator = (i * i for i in range(100_000_000))
sum(nums_generator) # 333333328333333350000000
sum(nums_generator) # 0, because it can only be iterated once.
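If we do need to go over the values more than once, one simple option is to wrap the Generator expression in a function and call it again whenever a fresh Generator is needed. The squares function below is a hypothetical helper, and a smaller range is used so the sketch runs quickly:

```python
def squares(n):
    """Hypothetical factory: each call returns a brand-new generator."""
    return (i * i for i in range(n))


one_shot = squares(1_000)
print(sum(one_shot))        # 332833500
print(sum(one_shot))        # 0, the generator is already exhausted

print(sum(squares(1_000)))  # 332833500, a fresh generator each time
print(max(squares(1_000)))  # 998001
```

The trade-off is that every pass recomputes the values. If recomputing is expensive and the data fits in memory, storing the results in a list may be the better choice.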
When dealing with relatively small files or lists, we may not want to use a Generator, as it might actually slow us down.
We can run cProfile on our previous examples to profile the performance differences between a list comprehension and a Generator expression when summing the values up.
cProfile of summing using List Comprehension vs. Generator Expression:
import cProfile

# List Comprehension
# ------------------
cProfile.run('sum([i * i for i in range(100_000_000)])')
# 5 function calls in 13.956 seconds
# Ordered by: standard name
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    8.442    8.442    8.442    8.442 <string>:1(<listcomp>)
#         1    0.841    0.841   13.956   13.956 <string>:1(<module>)
#         1    0.000    0.000   13.956   13.956 {built-in method builtins.exec}
#         1    4.672    4.672    4.672    4.672 {built-in method builtins.sum}
#         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

# Generator Expression
# --------------------
cProfile.run('sum((i * i for i in range(100_000_000)))')
# 100000005 function calls in 22.996 seconds
# Ordered by: standard name
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# 100000001   11.745    0.000   11.745    0.000 <string>:1(<genexpr>)
#         1    0.000    0.000   22.996   22.996 <string>:1(<module>)
#         1    0.000    0.000   22.996   22.996 {built-in method builtins.exec}
#         1   11.251   11.251   22.996   22.996 {built-in method builtins.sum}
#         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
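From our cProfile results above, we can tell that the list comprehension is noticeably faster (about 14 seconds versus 23 seconds in this run), provided we don’t run into memory constraints. Keep in mind that cProfile adds overhead to every function call, and the Generator version makes about 100 million of them, so part of the gap comes from the profiler itself.
Evidently, if memory is not an issue, regular functions or list comprehensions are usually the simpler and faster choice.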
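To compare the two approaches without the per-call overhead that a profiler adds, we could also time them with the standard library's timeit module. This is only a rough sketch. The input size N and the repetition count are arbitrary choices, and the results will vary from machine to machine:

```python
import timeit

N = 1_000  # hypothetical "small" input size

list_time = timeit.timeit(lambda: sum([i * i for i in range(N)]), number=200)
generator_time = timeit.timeit(lambda: sum(i * i for i in range(N)), number=200)

print(f"list comprehension:   {list_time:.4f} s")
print(f"generator expression: {generator_time:.4f} s")
```

Running a measurement like this on our own data is usually a better guide than a general rule, since the gap depends on the Python version and on how large the input is.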
In summary, Generators are an amazing tool in Python for scenarios where we do not need to iterate over the data more than once.
As Generators give us lazy evaluation, they are a great way to generate sequences in a memory-efficient manner. We should definitely consider using Generators when dealing with huge datasets to keep our program’s memory usage low.
Thank you for reading!