Having worked in BigQueryesque tools for the past 10 years, I’ve come to really appreciate some of its powerful “hidden” features. Most notably, I’ve come to absolutely love working with repeated data structures, i.e. Arrays.
Virtual Columns
Some of my favourite “hacks” come from working with the array functions. For instance, let’s say you have a query with a `date` column and you’re working on something that will output a `month`. But not just that: you want to work with that `month` column inside window functions, WHERE clauses and CASE statements.
The old-school way to do this is to pre-aggregate the data to month, for instance in a CTE or inner query, or to do the DATE_TRUNC in many places. For this silly example that’s fine, but what if that DATE_TRUNC is in fact a 100-character-long regex extract?
A generic pattern that I keep coming back to is to just add an `artificial` month column like so:
SELECT
  month,
  SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM
  some_daily_table AS a,
  UNNEST([DATE_TRUNC(a.date_column, MONTH)]) AS month
Why does this work? Because UNNEST in BigQuery can work on any array stored in table `a` — but it can also work on non-array fields, if you manually put them into an array. This is equivalent to adding a virtual column to the table.
Important to note above is that the unnested array has length 1. The comma before the UNNEST keyword means we do a CROSS JOIN with the array. As long as it has length 1, we won’t introduce any additional rows in the output.
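The same virtual column can be referenced anywhere else in the query. Here’s a minimal sketch, reusing the hypothetical some_daily_table from above, that filters and bins on it without ever repeating the DATE_TRUNC (the date cutoffs are just placeholders):

SELECT
  month,
  -- the virtual column works inside CASE expressions...
  CASE WHEN month >= DATE "2023-01-01" THEN "current" ELSE "historical" END AS period,
  SUM(sales) AS sales
FROM
  some_daily_table AS a,
  UNNEST([DATE_TRUNC(a.date_column, MONTH)]) AS month
WHERE
  -- ...and in WHERE clauses, even though it isn't stored in the table
  month >= DATE "2020-01-01"
GROUP BY
  month, period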
… and Rows
There is sometimes good reason to actually introduce more rows. For instance:
SELECT
  location,
  location_type,
  COUNT(DISTINCT user_id) AS users
FROM
  some_table_with_sales_per_user AS a,
  UNNEST([
    STRUCT(a.user_country_code AS location, "COUNTRY" AS location_type),
    STRUCT(a.user_region_code AS location, "REGION" AS location_type),
    STRUCT("GLOBAL" AS location, "GLOBAL" AS location_type)
  ])
GROUP BY
  location,
  location_type
The above query adds two new virtual columns, `location` and `location_type`. In addition to this, it will create virtual rows as well. Essentially, for every row in the table, there will now be 3 rows, with different `location` and `location_type` fields.
When we aggregate on location and location_type, we essentially create a data cube, with pre-aggregated data for those dimensions. This is nice in itself, as the results may need less crunching downstream. It is particularly useful for this example with a DISTINCT count, where a user visiting multiple countries should count once per country, but only once on the “GLOBAL” row.
More on the UNNESTS
The UNNEST pattern, where you access fields from the previously listed tables, is incredibly powerful. I keep coming back to it to create “virtual columns” that I can use as keys in subsequent joins. I also use it to break calculations into steps: since I can access the fields from previously listed tables in the join, I can chain multiple `, UNNEST([...])` clauses to do stepwise calculations.
Above may not always be the neatest way to do things, but sometimes it is! For instance if you find yourself nesting subqueries in subqueries to create step-wise calculations, this pattern is much neater.
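As a hedged sketch of what that chaining can look like (the unit_price, discounts and costs columns are made up for illustration), each UNNEST can reference the virtual columns created by the previous one:

SELECT
  gross_sales,
  net_sales,
  net_margin
FROM
  some_daily_table AS a,
  -- step 1: a virtual column computed from the base table
  UNNEST([a.sales * a.unit_price]) AS gross_sales,
  -- step 2: reuses the virtual column from step 1
  UNNEST([gross_sales - a.discounts]) AS net_sales,
  -- step 3: reuses the virtual column from step 2
  UNNEST([SAFE_DIVIDE(net_sales - a.costs, net_sales)]) AS net_margin

Each step stays readable on its own line, instead of being buried in a nested subquery.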
Consistent Column Selection
Damn, it was hard setting a header for this section. Essentially here’s the problem we want to solve (there are of course a ton of variations to this problem):
We have data of events, where a user has multiple events
Each event has a device-type (phone or tablet) and a device OS (iOS or Android)
We want to count the events per user, but not per device
But we still want to retain some device info, and since most users are single-device, we’re fine just picking one device at random
However when we pick at random, a user who does indeed use both an Android Phone, and an iOS Tablet, may be presented in the result as an Android Tablet user.
There are a few ways to solve this, for instance:
SELECT
  user_id,
  ANY_VALUE(STRUCT(
    device_type AS type,
    device_os AS os
  )) AS device_info,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
Above will create a STRUCT with the device os and type. This is really useful if your downstream usage is BigQuery — but it breaks if the downstream user of this data doesn’t like to access fields inside structs.
Now this won’t work:
SELECT
  user_id,
  MAX(device_type) AS device_type,
  MAX(device_os) AS device_os,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
… because we can’t be sure that the type and os come from the same record.
However, what we can do is to put all the rows into an array, order it consistently, and take the first value. That’s done like so:
SELECT
  user_id,
  ARRAY_AGG(device_type ORDER BY device_id)[OFFSET(0)] AS device_type,
  ARRAY_AGG(device_os ORDER BY device_id)[OFFSET(0)] AS device_os,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
Above will work assuming you have some key to order on. It can be anything like a device_id or an event_timestamp or similar — all it needs to do is ensure that both fields are read from the same record.
I realize after writing this that devices may not be the best example. Instead, think of this: you are a global e-commerce player and 99.9% of your customers stay put in their country. However, a small portion will move abroad, or order to their vacation home.
You want to list your users, and give each a country & region. For 99.9% of your users this is fine. But for the last 0.1%, you may end up with a country value that is not located inside the region.
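A hedged sketch of how the same ARRAY_AGG trick might look for that case, assuming a hypothetical some_orders_table with country, region and order_timestamp columns per order, is to take both values from the user’s single most recent order:

SELECT
  user_id,
  -- country and region are read from the same record: the latest order
  ARRAY_AGG(STRUCT(country, region) ORDER BY order_timestamp DESC)[OFFSET(0)].country AS country,
  ARRAY_AGG(STRUCT(country, region) ORDER BY order_timestamp DESC)[OFFSET(0)].region AS region,
  COUNT(1) AS orders
FROM
  some_orders_table
GROUP BY
  user_id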
Accessing items in Arrays
If you frequently find yourself calling UNNEST on an entire array, only to then re-aggregate the data, you’re probably doing something wrong.
When I just started working with Array data fields, my knee-jerk reaction was to get rid of it, which is what a full UNNEST does — it just transforms the data into a really long table with a lot of repetition instead. Consider this:
SELECT
  COUNT(orders.id)
FROM
  orders,
  UNNEST(line_items) AS line_item
WHERE
  line_item.product_name LIKE "%shirt%"
What the above query does is count orders where a line-item consists of a shirt.
However, it does so by first taking all your orders and repeating them once per line-item. Say each order has on average 4 line-items (including VAT and shipping); you’ve essentially multiplied the size of your data by a factor of 4, only to then reduce it back.
Further, you’re not actually counting the orders with a shirt in the line items, you’re counting the total number of shirts — orders.id may have been a unique key, but not after the unnesting of all line-items.
SELECT
  COUNT(orders.id)
FROM
  orders
WHERE
  (
    SELECT
      LOGICAL_OR(product_name LIKE "%shirt%")
    FROM
      orders.line_items
  )
Instead, the above will do what you want. Wrapping a SELECT statement in parentheses, and using the array field in the FROM clause, allows you to calculate some scalar value. With this pattern you can also easily calculate how many orders had 3 or more shirts in them:
SELECT
  COUNT(orders.id)
FROM
  orders
WHERE
  (
    SELECT
      COUNTIF(product_name LIKE "%shirt%") >= 3
    FROM
      orders.line_items
  )
Or, if you move the sub-select to the SELECT statement, you can build a histogram:
SELECT
  (
    SELECT
      COUNTIF(product_name LIKE "%shirt%")
    FROM
      orders.line_items
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
ORDER BY
  shirts_in_order
There are many more powerful ways to use these inline SELECTs from repeated fields. In the last example I return the number of shirts, but you can in fact return multiple values here, if you ensure your result is a STRUCT.
SELECT
  (
    SELECT AS STRUCT
      ARRAY_AGG(DISTINCT product_color) AS product_colors,
      COUNTIF(product_name LIKE "%shirt%") AS shirts
    FROM
      orders.line_items
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
You can also return a new array by adding the keyword ARRAY outside the parentheses. Then you don’t need to do the aggregation down to 1 row in the query:
SELECT
  ARRAY(
    SELECT AS STRUCT
      product_color,
      COUNTIF(product_name LIKE "%shirt%") AS shirts
    FROM
      orders.line_items
    GROUP BY
      product_color
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
… So you can return either a single scalar, a STRUCT or an ARRAY — including an ARRAY of STRUCTs. This means you can construct very powerful calculations without UNNESTing the entire input array. With some creativity you’ll see you can even do JOINs inside of these expressions, if you have multiple tables with different ARRAY fields.
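As a rough sketch of what that could look like, assume a hypothetical refunds table that stores a refunded_items array of structs with a product_id field (also assumed on line_items), alongside the orders table from before. You can join the two tables in the outer query and then join their arrays inside the expression:

SELECT
  orders.id,
  (
    SELECT
      -- join the two arrays from the current joined row
      COUNT(DISTINCT li.product_id)
    FROM
      orders.line_items AS li
    JOIN
      refunds.refunded_items AS ri
    ON
      li.product_id = ri.product_id
  ) AS refunded_products
FROM
  orders
LEFT JOIN
  refunds
ON
  refunds.order_id = orders.id

If an order has no matching refund row, the inner join over its arrays simply produces zero rows and the count comes back as 0.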
Lastly, let’s say you don’t want to return an ARRAY from your expression. It is quite convenient to work on scalar values. You can do so in one of 3 ways:
Ensure your expression has a clause filtering the result to exactly 1 output row. For instance if you have a unique ID in the array field
Ensure your expression does aggregation down to a single record
Use LIMIT 1. This is really powerful if you combine it with ORDER BY; see the example below:
SELECT
  user_id,
  (
    SELECT AS STRUCT
      address_1,
      address_zip,
      address_country
    FROM
      users.addresses
    ORDER BY
      inserted_timestamp DESC
    LIMIT
      1
  ) AS last_modified_address
FROM
  users
The last example actually leverages a nice pattern for Slowly Changing Dimensions. In other words, since users rarely change addresses, you may want to store their current and all historical addresses in a single array. Patterns like the ones we’ve shown here will let you get the latest address, but also find the last address before some date, or filter for users who “have ever had an address in Stockholm”.
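For example, that last filter is just the LOGICAL_OR pattern from earlier applied to the addresses array (the address_city field is an assumption here):

SELECT
  user_id
FROM
  users
WHERE
  (
    SELECT
      -- TRUE if any current or historical address is in Stockholm
      LOGICAL_OR(address_city = "Stockholm")
    FROM
      users.addresses
  )

The “last address before some date” variant is the same LIMIT 1 subquery as above, with a WHERE on inserted_timestamp added before the ORDER BY.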
Once you get the hang of ARRAYs you’ll have a super powerful tool to denormalize your data without losing resolution. In other words, you’ll do fewer JOINs and run faster queries — and hopefully write way fewer lines of SQL.
Great article! = ) Coming from Snowflake I've missed a lot of functionality in BQ, but your ideas here fill many of those gaps!
I particularly find the "Accessing items in Arrays"-part very valuable. For the other parts there might be good alternative approaches and I suggest some below (haven't done any performance assessments though...)
-- Virtual Columns
SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM some_table
-- ...and rows
SELECT
COALESCE(user_region_code, user_country_code, 'GLOBAL'),
COUNT(DISTINCT user_id) AS users
FROM some_table
GROUP BY ROLLUP (user_country_code, user_region_code)
-- Consistent Column Selection
SELECT DISTINCT
user_id,
FIRST_VALUE(device_type) OVER(ORDER BY device_id) AS device_type,
FIRST_VALUE(device_os) OVER(ORDER BY device_id) AS device_os,
COUNT(1) OVER() AS events
FROM some_table