Data cleaning null values -- SQL or Code?
When preparing your data set for analysis, it is crucial to ensure that your data set is both complete and accurate. One step in this process is deciding how to handle null values. Depending on how your data is going to be used, you may not want null values at all!
We're going to take a look at calculating the Lifetime Value (LTV) of a customer. We want a customer with no purchase history to have an LTV of 0. What happens if our aggregated column returns no rows, and thus is null? Should we clean the data at the query level or afterwards using a script?
We’re going to walk through this LTV example from beginning to end looking at different options -- some that work well, some that work poorly, and some that won’t work at all.
To start, we have a `users` table and a `purchases` table.
users:
user_id |
---|
1 |
2 |
3 |
4 |
5 |
purchases:
user_id | price |
---|---|
1 | 79.36 |
3 | 42.97 |
1 | 33.14 |
2 | 16.48 |
4 | 96.41 |
As you can see, the users table has 5 unique users and the purchases table has 5 purchases from users 1, 2, 3, and 4. User 5 is not present in the purchases table. Now that we have our data, it's time to create our calculated column. Our query is called by a function that passes in a list of user_ids to use in the query.
`COALESCE` is a function in Postgres and other SQL dialects that returns the first non-null value in its list of arguments. Because of this, it's a common tool for data cleaning. Let's look closer at the following query.
```sql
SELECT
  COALESCE(SUM(purchases.price), 0) AS LTV,
  user_id
FROM purchases
WHERE purchases.user_id IN (1, 2, 3, 4, 5)
GROUP BY user_id
ORDER BY user_id;
```
When we look at the results from that query we end up with:
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
Yikes! No user 5. That means wherever you're using this data, when you try to access the row for user 5, you'll get `null`, which may throw off calculations. So why did `COALESCE` do this?
A SQL query is like checking a post office box -- you might find a bunch of envelopes containing various amounts of mail, or even an empty envelope, I suppose. But if there's no mail at that post office box, you leave empty-handed. If SQL can't find any rows, it comes back empty-handed as well. Getting back a null vs. a 0 vs. an empty set may seem similar, but they tell us different things:
value | meaning |
---|---|
0 | We know something is there and its value is zero. |
null | We know something is there and its value is unspecified. |
empty set | We know nothing is there. |
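To make that distinction concrete, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres, with the same purchases data as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 79.36), (3, 42.97), (1, 33.14), (2, 16.48), (4, 96.41)],
)

# An aggregate with no GROUP BY always produces exactly one row; with no
# matching rows, SUM is null, so COALESCE has something to replace.
row = conn.execute(
    "SELECT COALESCE(SUM(price), 0) FROM purchases WHERE user_id = 5"
).fetchone()
print(row)  # (0,)

# The same aggregate WITH a GROUP BY produces the empty set -- no row at
# all, so COALESCE never even runs.
rows = conn.execute(
    "SELECT COALESCE(SUM(price), 0) FROM purchases "
    "WHERE user_id = 5 GROUP BY user_id"
).fetchall()
print(rows)  # []
```

The only difference between the two queries is the GROUP BY, which is exactly what turns "one row holding null" into "no rows at all."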
So the issue with our query isn't with the `COALESCE` but with the order of execution -- there are no rows in `purchases` with a `user_id` of 5. Let's think about the logical order of how we get there:
The order that query gets executed in is:
- Find rows in `purchases` that match the given user ids (5 rows, matching 4 of the given user ids)
- Group those rows by user_id (4 groups)
- For each group, return either the sum of the price column or 0 (4 rows, with aggregated data for LTV)
- Order the results by user_id
After the first step, there isn't even a row for user 5 anymore. We can expect the same behavior if we try a `CASE` expression for our LTV because, again, there is no row and no data for it to act on.
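To confirm that, here's a quick sketch using Python's sqlite3 module as a stand-in for Postgres: swapping the `COALESCE` for a `CASE` expression still comes back without user 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 79.36), (3, 42.97), (1, 33.14), (2, 16.48), (4, 96.41)],
)

# The WHERE clause has already dropped user 5 before the CASE expression
# runs, so there is no row for it to act on.
rows = conn.execute("""
    SELECT user_id,
           CASE WHEN SUM(price) IS NULL THEN 0 ELSE SUM(price) END AS LTV
    FROM purchases
    WHERE user_id IN (1, 2, 3, 4, 5)
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print([row[0] for row in rows])  # [1, 2, 3, 4] -- still no user 5
```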
Had our `purchases` table contained a row for user_id 5 with `null` for price, it would have returned as expected. Knowing that our issue is with missing rows, a second option is to break out a `RIGHT OUTER JOIN` with our users table. This way, we'll get back a row for every user in `users` and, where it exists, the sum of their data from the price column in `purchases` -- otherwise a 0.
```sql
SELECT
  users.user_id,
  COALESCE(SUM(purchases.price), 0) AS LTV
FROM purchases
RIGHT OUTER JOIN users
  ON purchases.user_id = users.user_id
GROUP BY users.user_id
ORDER BY users.user_id;
```
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
5 | 0 |
It works! Fantastic.
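If you'd like to sanity-check the join locally, here's a sketch using Python's sqlite3 module; older SQLite versions lack RIGHT JOIN, so this flips it into the equivalent LEFT OUTER JOIN from users:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    CREATE TABLE purchases (user_id INTEGER, price REAL);
    INSERT INTO users VALUES (1), (2), (3), (4), (5);
    INSERT INTO purchases VALUES
        (1, 79.36), (3, 42.97), (1, 33.14), (2, 16.48), (4, 96.41);
""")

# Joining FROM users keeps one row per user; users with no purchases get a
# null SUM, which COALESCE turns into 0.
rows = conn.execute("""
    SELECT users.user_id, COALESCE(SUM(purchases.price), 0) AS LTV
    FROM users
    LEFT OUTER JOIN purchases ON purchases.user_id = users.user_id
    GROUP BY users.user_id
    ORDER BY users.user_id
""").fetchall()
for user_id, ltv in rows:
    print(user_id, round(ltv, 2))
```

All five users come back, with user 5 at 0 -- the direction of the outer join is what preserves the row.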
💡 If all of your foreign keys are coming from a single table, the above might be the right solution for you. However, in more complex cases, you may not have something like a generic “users” table.
Your primary data may be spread across multiple tables -- for instance, you may have a setup like:
location_a_customers:

Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | | not null | |
first_name | string | | not null | |
last_name | string | | not null | |
location_b_customers:

Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | | not null | |
first_name | string | | not null | |
last_name | string | | not null | |
that you want to use with a purchases table:
Column | Type | Collation | Nullable | Default |
---|---|---|---|---|
user_id | string | | not null | |
purchase_amount | decimal | | not null | |
item_sku | string | | not null | |
item_category | string | | not null | |
Fear not, we can still make it happen. Though, for everyone’s sake, at this point I would recommend using a function over using SQL! Just for giggles, let's prove it's possible first.
⚠️ Overkill query ahead. You should really try a function instead.
One way around the issue of having multiple users tables would be to build a list of all the `user_id`s for the query as (user_id, 0) pairs, then `UNION` the original query with that list. If that sounds like overkill, that's because it is:
```sql
SELECT DISTINCT ON (user_id)
  user_id,
  LTV
FROM (
  SELECT user_id, SUM(price) AS LTV
  FROM purchases
  WHERE purchases.user_id IN (1, 2, 3, 4, 5)
  GROUP BY user_id
  UNION
  SELECT *
  FROM (VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)) AS t (user_id, LTV)
) AS data
ORDER BY user_id, LTV DESC;
```
Which returns our beloved 0:
user_id | LTV |
---|---|
1 | 112.50 |
2 | 16.48 |
3 | 42.97 |
4 | 96.41 |
5 | 0 |
The query isn't too out of hand for 5 users, but imagine having to write that out (or having a function write it out) for 500 users -- or 10,000! Just because something works doesn't necessarily make it the right idea. So let's try it from the programming side.
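If you want to poke at the union trick locally, here's a rough sketch using Python's sqlite3 module; SQLite has no DISTINCT ON, so this collapses the unioned rows with MAX(LTV) per user instead, which has the same effect of preferring the real sum over the placeholder 0 (for these non-negative sums):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 79.36), (3, 42.97), (1, 33.14), (2, 16.48), (4, 96.41)],
)

# UNION the real aggregates with (user_id, 0) placeholders, then keep the
# larger value per user -- the real sum wins wherever one exists.
rows = conn.execute("""
    SELECT user_id, MAX(LTV) AS LTV
    FROM (
        SELECT user_id, SUM(price) AS LTV
        FROM purchases
        WHERE user_id IN (1, 2, 3, 4, 5)
        GROUP BY user_id
        UNION
        SELECT * FROM (VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0))
    )
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
for user_id, ltv in rows:
    print(user_id, round(ltv, 2))
```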
We're really asking a lot of SQL there. While it was possible, it was not very pretty, concise, or readable. It also muddies things up, as it technically returns data from our database that isn't there. Thankfully, transforming the data we do get back with a function is a much more straightforward task!
I worked this one out in both TypeScript and Python. I love that we use TypeScript for building out our data pipeline because static typing helps keep the guard rails on and makes sure I'm paying attention to what I'm working with -- especially as our data structures grow in complexity, it helps to be completely sure what you're working with. And there's a Python version for all the Python-loving data folks as well.
Here’s how we can use a function to transform our data to include 0s. We start by setting the LTV of every id as [0], then iterate through our results and replace the [0]s as we go:
```typescript
interface Profile {
  id: string;
}

interface DataResponseRow {
  id: string;
  data: number[];
}

function calculateLTV(
  queryResults: DataResponseRow[],
  aggregationMethod: string,
  profiles: Profile[]
): { [id: string]: number[] } {
  const responsesById: { [id: string]: number[] } = {};
  for (const profile of profiles) {
    if (aggregationMethod === "count" || aggregationMethod === "sum") {
      // Default all entries for counted or summed aggregations to [0] to
      // start... replaced below if there is a response for this id.
      responsesById[profile.id] = [0];
    }
  }
  for (const row of queryResults) {
    responsesById[row.id] = row.data;
  }
  return responsesById;
}
```
🐍 Note: There are many Python libraries that help with data cleaning and analytics -- pandas for wrangling and matplotlib for visualization are two popular choices. However, this is a super straightforward solution using plain-vanilla Python.
```python
def calculate_ltv(query_results, aggregation_method, profiles):
    responses_by_id = {}
    for profile in profiles:
        if aggregation_method == "count" or aggregation_method == "sum":
            # default all entries for counted or summed aggregations to [0] to
            # start... replaced below if there is a response for this id
            responses_by_id[profile["id"]] = [0]
    for row in query_results:
        responses_by_id[row["id"]] = row["data"]
    return responses_by_id

mock_query_results = [
    {"id": 1, "data": [112.50]},
    {"id": 2, "data": [16.48]},
    {"id": 3, "data": [42.97]},
    {"id": 4, "data": [96.41]},
]
mock_aggregation_method = "sum"
mock_profiles = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}]

print(calculate_ltv(mock_query_results, mock_aggregation_method, mock_profiles))
```
To see them in action, here’s a jsfiddle for the TypeScript solution and an IDEOne instance for the Python solution.
- To learn more about handling conditionals and aggregations in PostgreSQL, check out their docs