Great article! = ) Coming from Snowflake I've missed a lot of functionality in BQ, but your ideas here fill many of those gaps!
I particularly find the "Accessing items in Arrays"-part very valuable. For the other parts there might be good alternative approaches and I suggest some below (haven't done any performance assessments though...)
-- Virtual Columns
SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
Your `-- Virtual Columns` code here will throw: `Unrecognized name: month at [3:29]`
You can verify with e.g.
```SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM
UNNEST([
STRUCT(DATE "2021-01-01" AS date_column, 12 AS sales),
STRUCT(DATE "2021-01-02" AS date_column, 13 AS sales),
STRUCT(DATE "2021-01-03" AS date_column, 14 AS sales),
STRUCT(DATE "2021-01-04" AS date_column, 15 AS sales)
])```
I didn't know GROUP BY ROLLUP was implemented in bigquery now, great find, thanks! Then that example I posted is silly, but the general idea applies to other rollups as the pattern is quite flexible.
The consistent column selection requires a window function, which takes a toll on performance for sure, as it is executed as at least one additional separate step in the query plan, and will tend to be really slow if the windows are large (ordering in general is really slow if data is large as the ordered set has to reside on one node. A common pitfall is if there's a bunch of `null`device_id's that all end up on a single node that takes forever to compute).
Interesting about window functions having poor performance. I knew that was true for ROW_NUMBER since it needs to scan the whole column to do the ordering, but I thought FIRST_VALUE would be executed equivalently to array[0], but maybe it's not. (I checked the query plans and they look different but not sure what that implies in terms of performance. Will do a test on lots of data at some point :)
Great article! = ) Coming from Snowflake I've missed a lot of functionality in BQ, but your ideas here fill many of those gaps!
I particularly find the "Accessing items in Arrays"-part very valuable. For the other parts there might be good alternative approaches and I suggest some below (haven't done any performance assessments though...)
-- Virtual Columns
SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM some_table
-- ...and rows
SELECT
COALESCE(user_region_code, user_country_code, 'GLOBAL'),
COUNT(DISTINCT user_id) AS users
FROM some_table
GROUP BY ROLLUP (user_country_code, user_region_code)
-- Consistent Column Selection
SELECT DISTINCT
user_id,
FIRST_VALUE(device_type) OVER(ORDER BY device_id) AS device_type,
FIRST_VALUE(device_os) OVER(ORDER BY device_id) AS device_os,
COUNT(1) OVER() AS events
FROM some_table
Thanks for your comment!!
Your `-- Virtual Columns` code here will throw: `Unrecognized name: month at [3:29]`
You can verify with e.g.
```SELECT
DATE_TRUNC(date_column, MONTH) AS month,
SUM(sales) OVER (ORDER BY month) AS cumulative_sales
FROM
UNNEST([
STRUCT(DATE "2021-01-01" AS date_column, 12 AS sales),
STRUCT(DATE "2021-01-02" AS date_column, 13 AS sales),
STRUCT(DATE "2021-01-03" AS date_column, 14 AS sales),
STRUCT(DATE "2021-01-04" AS date_column, 15 AS sales)
])```
I didn't know GROUP BY ROLLUP was implemented in bigquery now, great find, thanks! Then that example I posted is silly, but the general idea applies to other rollups as the pattern is quite flexible.
The consistent column selection requires a window function, which takes a toll on performance for sure, as it is executed as at least one additional separate step in the query plan, and will tend to be really slow if the windows are large (ordering in general is really slow if data is large as the ordered set has to reside on one node. A common pitfall is if there's a bunch of `null`device_id's that all end up on a single node that takes forever to compute).
Thanks for the reply!
Ah I forgot to have month in the ORDER BY when testing my query =)
In Snowflake it would have worked anyway since it supports lateral aliasing and that's probably the main thing I lack in BigQuery: https://sqlkover.com/cool-stuff-in-snowflake-part-4-aliasing-all-the-things/
Interesting about window functions having poor performance. I knew that was true for ROW_NUMBER since it needs to scan the whole column to do the ordering, but I thought FIRST_VALUE would be executed equivalently to array[0], but maybe it's not. (I checked the query plans and they look different but not sure what that implies in terms of performance. Will do a test on lots of data at some point :)