Joins on Google Bigquery

坚强是说给别人听的谎言 提交于 2020-01-24 03:07:28

问题


I know that work is being done to improve the Join feature on Bigquery, not to rant here but it will be hard to analyze 'Terabyte' sets of data as 'advertised' if Joins can not be used properly.

OK, back to the problem, I have two tables one is 600 Megs and the other one is 50Megs, I did try to make a join and I got an error about smaller table must be left. I did some research and I found out that Bigquery considers both tables as big if they are greater than 7MB?

So based on some advice I found online I did select the dataset in question of the smaller table and saved it in a new table, the new data set is 12MB and 600K rows over two columns. Then I tried my query again but I am still getting the same error :

Query Failed : Error: Large table cdrs_test.geoIP_Left must appear as the leftmost table in a join query

geoIP_Left is 12 Megs, the other table is 600Megs..

Can this be fixed somehow or am I busted? If so anyone knows about any other service that I can use to analyse big sets of data that does support joins?

EDIT: This is the actual query;

SELECT COUNT(results.cc_card) AS count,
       sum(results.sessiontime) AS time, 
       geoIP_Left.place AS place 
FROM cdrs_test.cdrs_2010_5 AS results 
JOIN cdrs_test.geoIP_Left AS geoIP_table 
  ON results.cc_card = geoIP_table.vcard 
WHERE results.sessiontime > 0 AND results.countryName Contains 'India' 
GROUP BY place;

回答1:


You could express this query as a subselect that references only the columns from cdrs_test.geoIP_Left that you are interested in. See the second example here.

SELECT
  COUNT(results.cc_card) AS count,
  sum(results.sessiontime) AS time, 
  geoIP_table.place AS place 
FROM
  cdrs_test.cdrs_2010_5 AS results 
JOIN
  (SELECT place, vcard FROM cdrs_test.geoIP_Left)
AS
  geoIP_table 
ON
  results.cc_card = geoIP_table.vcard 
WHERE
  results.sessiontime > 0 AND results.countryName CONTAINS 'India' 
GROUP BY
  place;

You can also simply run multiple queries - note that you can save query results explicitly as a named table, and use that table in a later query.

Finally, another option would be to pre-join the data, using (for example) a MapReduce transformation pipeline, or an ETL tool provided by one of the BigQuery ETL partners.




回答2:


As mentioned by Felipe in the comment. BigQuery has allowed a modifier "EACH" to JOIN to allow JOINs of 2 big tables. From the Query Reference page -

Normal JOIN operations require that the right-side table contains less than 8 MB of compressed data. The EACH modifier is a hint that informs the query execution engine that the JOIN might reference two large tables. The EACH modifier can't be used in CROSS JOIN clauses.

When possible, use JOIN without the EACH modifier for best performance. Use JOIN EACH when table sizes are too large for JOIN.



来源:https://stackoverflow.com/questions/13468933/joins-on-google-bigquery

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!