BigQuery has some statistical aggregation functions such as STDDEV(X) and CORR(X, Y), but it doesn't offer functions to directly perform linear regression.
How can
Here is the code to create a linear regression model from the public natality (live births) dataset and store it in a dataset named demo_bq_ml. The dataset must be created before running the statement below.
%%bq query
CREATE OR REPLACE MODEL demo_bq_ml.babyweight_model_asis
OPTIONS
  (model_type='linear_reg', input_label_cols=['weight_pounds']) AS
WITH natality_data AS (
  SELECT
    weight_pounds, -- this is the label; because it is continuous, we need to use regression
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CAST(plurality AS STRING) AS plurality,
    gestation_weeks,
    CAST(alcohol_use AS STRING) AS alcohol_use,
    CAST(year AS STRING) AS year,
    ABS(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING)))) AS hashmonth
  FROM
    publicdata.samples.natality
  WHERE
    year > 2000
    AND gestation_weeks > 0
    AND mother_age > 0
    AND plurality > 0
    AND weight_pounds > 0
)
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  alcohol_use,
  year
FROM
  natality_data
WHERE
  MOD(hashmonth, 4) < 3 -- select 75% of the data as training
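
Since the statement above trains on the rows where MOD(hashmonth, 4) < 3, the remaining rows (MOD(hashmonth, 4) = 3) can serve as a held-out evaluation set. The following is a minimal sketch of that, assuming the model demo_bq_ml.babyweight_model_asis has finished training and that the query runs in the same kind of %%bq query cell. It repeats the same filters and casts as the training query so that the feature columns match.

```sql
-- Sketch: evaluate the trained model on the ~25% of rows not used for training.
-- Assumes demo_bq_ml.babyweight_model_asis exists (created by the statement above).
SELECT
  *
FROM
  ML.EVALUATE(MODEL demo_bq_ml.babyweight_model_asis,
    (
      SELECT
        weight_pounds, -- label column, needed to compute the error metrics
        CAST(is_male AS STRING) AS is_male,
        mother_age,
        CAST(plurality AS STRING) AS plurality,
        gestation_weeks,
        CAST(alcohol_use AS STRING) AS alcohol_use,
        CAST(year AS STRING) AS year
      FROM
        publicdata.samples.natality
      WHERE
        year > 2000
        AND gestation_weeks > 0
        AND mother_age > 0
        AND plurality > 0
        AND weight_pounds > 0
        -- same hash as in training, but keep the complementary bucket
        AND MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING)))), 4) = 3
    ))
```

For a linear regression model, ML.EVALUATE returns a single row of regression metrics such as mean_squared_error and r2_score. Replacing ML.EVALUATE with ML.PREDICT in the same query shape should return the input rows together with a predicted_weight_pounds column.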