Having worked in BigQueryesque tools for the past 10 years, I’ve come to really appreciate some of its powerful “hidden” features. Most notably, I’ve come to absolutely love working with repeated data structures, i.e. Arrays.
Virtual Columns
Some of my favourite “hacks” come from working with the array functions. For instance, let’s say you have a query with a `date` column and you’re working on something that will output a `month`. But not just that: you want to work with that `month` column inside window functions, WHERE clauses and CASE statements.
The old-school way to do this is to pre-aggregate the data to month, for instance in a CTE or inner query, or to do the DATE_TRUNC in many places. For this silly example that’s fine, but what if that DATE_TRUNC is in fact a 100-character-long regex extract?
A generic pattern that I keep coming back to is to just add an `artificial` month column like so:
SELECT
  month,
  SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM
  some_daily_table AS a,
  UNNEST([DATE_TRUNC(a.date_column, MONTH)]) AS month
Why does this work? Because UNNEST in BigQuery can work on any array stored in table `a` — but it can also work on non-array fields, if you manually put them into an array. This is equivalent to adding a virtual column to the table.
Important to note above is that the unnested array has length 1. The comma before the UNNEST keyword means we do a CROSS JOIN with the array. As long as it has length 1, we won’t introduce any additional rows in the output.
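The same virtual column can be referenced anywhere else in the query. Here’s a minimal sketch, reusing the hypothetical some_daily_table from above, that filters and bins on it without ever repeating the DATE_TRUNC (the date cutoffs are just placeholders):

SELECT
  month,
  -- the virtual column works inside CASE expressions...
  CASE WHEN month >= DATE "2023-01-01" THEN "current" ELSE "historical" END AS period,
  SUM(sales) AS sales
FROM
  some_daily_table AS a,
  UNNEST([DATE_TRUNC(a.date_column, MONTH)]) AS month
WHERE
  -- ...and in WHERE clauses, even though it isn't stored in the table
  month >= DATE "2020-01-01"
GROUP BY
  month, period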
… and Rows
There is sometimes good reason to actually introduce more rows. For instance:
SELECT
  location,
  location_type,
  COUNT(DISTINCT user_id) AS users
FROM
  some_table_with_sales_per_user AS a,
  UNNEST([
    STRUCT(a.user_country_code AS location, "COUNTRY" AS location_type),
    STRUCT(a.user_region_code AS location, "REGION" AS location_type),
    STRUCT("GLOBAL" AS location, "GLOBAL" AS location_type)
  ])
GROUP BY
  location,
  location_type
The above query adds two new virtual columns, `location` and `location_type`. In addition to this, it will create virtual rows as well. Essentially, for every row in the table, there will now be 3 rows, with different `location` and `location_type` fields.
When we aggregate on location and location_type, we essentially create a data cube, with pre-aggregated data for those dimensions. This is nice in itself, as the results may need less crunching downstream. It is particularly useful for this example with a DISTINCT count, where a user visiting multiple countries should count once per country, but only once on the “GLOBAL” row.
More on the UNNESTS
The UNNEST pattern, where you access fields from the previously listed tables, is incredibly powerful. I keep coming back to it to create “virtual columns” that I can use as keys in subsequent joins. I also use it to break calculations into steps: since I can access the fields from previously listed tables in the join, I can chain multiple `, UNNEST([...])` clauses to do stepwise calculations.
Above may not always be the neatest way to do things, but sometimes it is! For instance if you find yourself nesting subqueries in subqueries to create step-wise calculations, this pattern is much neater.
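As a hedged sketch of what that chaining can look like (the unit_price, discounts and costs columns are made up for illustration), each UNNEST can reference the virtual columns created by the previous one:

SELECT
  gross_sales,
  net_sales,
  net_margin
FROM
  some_daily_table AS a,
  -- step 1: a virtual column computed from the base table
  UNNEST([a.sales * a.unit_price]) AS gross_sales,
  -- step 2: reuses the virtual column from step 1
  UNNEST([gross_sales - a.discounts]) AS net_sales,
  -- step 3: reuses the virtual column from step 2
  UNNEST([SAFE_DIVIDE(net_sales - a.costs, net_sales)]) AS net_margin

Each step stays readable on its own line, instead of being buried in a nested subquery.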
Consistent Column Selection
Damn, it was hard setting a header for this section. Essentially here’s the problem we want to solve (there are of course a ton of variations to this problem):
We have data of events, where a user has multiple events
Each event has a device-type (phone or tablet) and a device OS (iOS or Android)
We want to count the events per user, but not per device
But we still want to retain some device info, and since most users are single-device, we’re fine just picking one device at random
However when we pick at random, a user who does indeed use both an Android Phone, and an iOS Tablet, may be presented in the result as an Android Tablet user.
There are a few ways to solve this, for instance:
SELECT
  user_id,
  ANY_VALUE(STRUCT(
    device_type AS type,
    device_os AS os
  )) AS device_info,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
Above will create a STRUCT with the device os and type. This is really useful if your downstream usage is BigQuery — but it breaks if the downstream user of this data doesn’t like to access fields inside structs.
Now this won’t work:
SELECT
  user_id,
  MAX(device_type) AS device_type,
  MAX(device_os) AS device_os,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
… because we can’t be sure that the type and os come from the same record.
However, what we can do is to put all the rows into an array, order it consistently, and take the first value. That’s done like so:
SELECT
  user_id,
  ARRAY_AGG(device_type ORDER BY device_id)[OFFSET(0)] AS device_type,
  ARRAY_AGG(device_os ORDER BY device_id)[OFFSET(0)] AS device_os,
  COUNT(1) AS events
FROM
  some_events_table
GROUP BY
  user_id
Above will work assuming you have some key to order on. It can be anything like a device_id or an event_timestamp or similar — all it needs to do is ensure that both fields are read from the same record.
I realize after writing this that devices may not be the best example. Instead, think of this: you are a global e-commerce player and 99.9% of your customers stay put in their country. However, a small portion will move abroad, or order to their vacation home.
You want to list your users, and give each a country & region. For 99.9% of your users this is fine. But for the last 0.1%, you may end up with a country value that is not located inside the region.
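A hedged sketch of how the same ARRAY_AGG trick might look for that case, assuming a hypothetical some_orders_table with country, region and order_timestamp columns per order, is to take both values from the user’s single most recent order:

SELECT
  user_id,
  -- country and region are read from the same record: the latest order
  ARRAY_AGG(STRUCT(country, region) ORDER BY order_timestamp DESC)[OFFSET(0)].country AS country,
  ARRAY_AGG(STRUCT(country, region) ORDER BY order_timestamp DESC)[OFFSET(0)].region AS region,
  COUNT(1) AS orders
FROM
  some_orders_table
GROUP BY
  user_id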
Accessing items in Arrays
If you frequently find yourself calling UNNEST on an entire array, only to then re-aggregate the data, you’re probably doing something wrong.
When I just started working with Array data fields, my knee-jerk reaction was to get rid of it, which is what a full UNNEST does — it just transforms the data into a really long table with a lot of repetition instead. Consider this:
SELECT
  COUNT(orders.id)
FROM
  orders,
  UNNEST(line_items) AS line_item
WHERE
  line_item.product_name LIKE "%shirt%"
What the above query does is count orders where a line-item consists of a shirt.
However, it does so by first taking all your orders and repeating them once per line-item. Say each order has on average 4 line-items (including VAT and shipping); you’ve essentially multiplied the size of your data by a factor of 4, only to then reduce it back.
Further, you’re not actually counting the orders with a shirt in the line items, you’re counting the total number of shirts — orders.id may have been a unique key, but not after the unnesting of all line-items.
SELECT
  COUNT(orders.id)
FROM
  orders
WHERE
  (
    SELECT
      LOGICAL_OR(product_name LIKE "%shirt%")
    FROM
      orders.line_items
  )
Instead, the above will do what you want. Wrapping a SELECT statement in parentheses, and using the array field in the FROM clause, allows you to calculate some scalar value. With this pattern you can also easily calculate how many orders had 3 or more shirts in them:
SELECT
  COUNT(orders.id)
FROM
  orders
WHERE
  (
    SELECT
      COUNTIF(product_name LIKE "%shirt%") >= 3
    FROM
      orders.line_items
  )
Or, if you move the sub-select to the SELECT statement, you can build a histogram:
SELECT
  (
    SELECT
      COUNTIF(product_name LIKE "%shirt%")
    FROM
      orders.line_items
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
ORDER BY
  shirts_in_order
There are many more powerful ways to use these inline SELECTs from repeated fields. In the last example I return the number of shirts, but you can in fact return multiple values here, if you ensure your result is a STRUCT.
SELECT
  (
    SELECT AS STRUCT
      ARRAY_AGG(DISTINCT product_color) AS product_colors,
      COUNTIF(product_name LIKE "%shirt%") AS shirts
    FROM
      orders.line_items
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
You can also return a new array by adding the keyword ARRAY outside the parentheses. Then you don’t need to do the aggregation down to 1 row in the query:
SELECT
  ARRAY(
    SELECT AS STRUCT
      product_color,
      COUNTIF(product_name LIKE "%shirt%") AS shirts
    FROM
      orders.line_items
    GROUP BY
      product_color
  ) AS shirts_in_order,
  COUNT(orders.id)
FROM
  orders
GROUP BY
  shirts_in_order
… So you can return either a single scalar, a STRUCT or an ARRAY — including an ARRAY of STRUCTs. This means you can construct very powerful calculations without UNNESTing the entire input array. With some creativity you’ll see you can even do JOINs inside of these expressions, if you have multiple tables with different ARRAY fields.
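As a rough sketch of what that could look like, assume a hypothetical refunds table that stores a refunded_items array of structs with a product_id field (also assumed on line_items), alongside the orders table from before. You can join the two tables in the outer query and then join their arrays inside the expression:

SELECT
  orders.id,
  (
    SELECT
      -- join the two arrays from the current joined row
      COUNT(DISTINCT li.product_id)
    FROM
      orders.line_items AS li
    JOIN
      refunds.refunded_items AS ri
    ON
      li.product_id = ri.product_id
  ) AS refunded_products
FROM
  orders
LEFT JOIN
  refunds
ON
  refunds.order_id = orders.id

If an order has no matching refund row, the inner join over its arrays simply produces zero rows and the count comes back as 0.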
Lastly, let’s say you don’t want to return an ARRAY from your expression. It is quite convenient to work on scalar values. You can do so in one of 3 ways:
Ensure your expression has a clause filtering the result to exactly 1 output row. For instance if you have a unique ID in the array field
Ensure your expression does aggregation down to a single record
Use LIMIT 1. This is really powerful if you combine it with ORDER BY; see the example below:
SELECT
  user_id,
  (
    SELECT AS STRUCT
      address_1,
      address_zip,
      address_country
    FROM
      users.addresses
    ORDER BY
      inserted_timestamp DESC
    LIMIT
      1
  ) AS last_modified_address
FROM
  users
The last example actually leverages a nice pattern for Slowly Changing Dimensions. In other words, since users rarely change addresses, you may want to store their current and all historical addresses in a single array. Patterns like the ones we’ve shown here will let you get the latest address, but also find the last address before some date, or filter for users who “have ever had an address in Stockholm”.
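For example, that last filter is just the LOGICAL_OR pattern from earlier applied to the addresses array (the address_city field is an assumption here):

SELECT
  user_id
FROM
  users
WHERE
  (
    SELECT
      -- TRUE if any current or historical address is in Stockholm
      LOGICAL_OR(address_city = "Stockholm")
    FROM
      users.addresses
  )

The “last address before some date” variant is the same LIMIT 1 subquery as above, with a WHERE on inserted_timestamp added before the ORDER BY.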
Once you get the hang of ARRAYs you’ll have a super powerful tool to denormalize your data without losing resolution. In other words, you’ll do fewer JOINs and run faster queries — and hopefully write way fewer lines of SQL.
Great article! = ) Coming from Snowflake I've missed a lot of functionality in BQ, but your ideas here fill many of those gaps!
I particularly find the "Accessing items in Arrays"-part very valuable. For the other parts there might be good alternative approaches and I suggest some below (haven't done any performance assessments though...)
-- Virtual Columns
SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM some_table
-- ...and rows
SELECT
COALESCE(user_region_code, user_country_code, 'GLOBAL'),
COUNT(DISTINCT user_id) AS users
FROM some_table
GROUP BY ROLLUP (user_country_code, user_region_code)
-- Consistent Column Selection
SELECT DISTINCT
user_id,
FIRST_VALUE(device_type) OVER(ORDER BY device_id) AS device_type,
FIRST_VALUE(device_os) OVER(ORDER BY device_id) AS device_os,
COUNT(1) OVER() AS events
FROM some_table