My Favourite "hidden" BigQuery Features

Jan 25, 2023

Having worked in BigQueryesque tools for the past 10 years, I’ve come to really appreciate some of it’s powerful “hidden” features.

Read →

3 Comments

Lars

Feb 13, 2023Edited

Great article! = ) Coming from Snowflake I've missed a lot of functionality in BQ, but your ideas here fill many of those gaps!

I particularly find the "Accessing items in Arrays"-part very valuable. For the other parts there might be good alternative approaches and I suggest some below (haven't done any performance assessments though...)

-- Virtual Columns

SELECT

DATE_TRUNC(date_column, MONTH) AS month,

SUM(sales) OVER (ORDER BY month) AS cumulative_sales

FROM some_table

-- ...and rows

SELECT

COALESCE(user_region_code, user_country_code, 'GLOBAL'),

COUNT(DISTINCT user_id) AS users

FROM some_table

GROUP BY ROLLUP (user_country_code, user_region_code)

-- Consistent Column Selection

SELECT DISTINCT

user_id,

FIRST_VALUE(device_type) OVER(ORDER BY device_id) AS device_type,

FIRST_VALUE(device_os) OVER(ORDER BY device_id) AS device_os,

COUNT(1) OVER() AS events

FROM some_table

Expand full comment

Reply (1)

Jens Tonberg Larsson

Feb 13, 2023

Thanks for your comment!!

Your `-- Virtual Columns` code here will throw: `Unrecognized name: month at [3:29]`

You can verify with e.g.

```SELECT

DATE_TRUNC(date_column, MONTH) AS month,

SUM(sales) OVER (ORDER BY month) AS cumulative_sales

FROM

UNNEST([

STRUCT(DATE "2021-01-01" AS date_column, 12 AS sales),

STRUCT(DATE "2021-01-02" AS date_column, 13 AS sales),

STRUCT(DATE "2021-01-03" AS date_column, 14 AS sales),

STRUCT(DATE "2021-01-04" AS date_column, 15 AS sales)

])```

I didn't know GROUP BY ROLLUP was implemented in bigquery now, great find, thanks! Then that example I posted is silly, but the general idea applies to other rollups as the pattern is quite flexible.

The consistent column selection requires a window function, which takes a toll on performance for sure, as it is executed as at least one additional separate step in the query plan, and will tend to be really slow if the windows are large (ordering in general is really slow if data is large as the ordered set has to reside on one node. A common pitfall is if there's a bunch of `null`device_id's that all end up on a single node that takes forever to compute).

Expand full comment

Reply (1)

Lars

Feb 14, 2023

Thanks for the reply!

Ah I forgot to have month in the ORDER BY when testing my query =)

In Snowflake it would have worked anyway since it supports lateral aliasing and that's probably the main thing I lack in BigQuery: https://sqlkover.com/cool-stuff-in-snowflake-part-4-aliasing-all-the-things/

Interesting about window functions having poor performance. I knew that was true for ROW_NUMBER since it needs to scan the whole column to do the ordering, but I thought FIRST_VALUE would be executed equivalently to array[0], but maybe it's not. (I checked the query plans and they look different but not sure what that implies in terms of performance. Will do a test on lots of data at some point :)

Expand full comment

&Data

My Favourite "hidden" BigQuery Features