Question
I want to insert all rows of an SQL Server table into a BigQuery table that has the same schema. Streaming the inserts row by row is very slow: inserting 1000 rows with the code below took about 10 minutes. In this code I loop over the first 10 files in a certain folder and insert the content of each file into a single SQL Server table. Once I have looped over the desired files, I loop over the SQL Server table (which contains all rows of all files) and insert the content row by row into a BigQuery table. Finally I delete those files and empty the SQL Server table.
This operation is very slow.
Does anyone have a better solution for inserting the content of an SQL Server table into a BigQuery table automatically (via code)? For example, inserting all the content of the SQL Server table into the BigQuery table in one block (and not row by row).
Thanks
This is my code (in ColdFusion):
<cfsilent>
<cfinclude template="app_locals.cfm" />
<cfinclude template="act_BigqueryApiAccess.cfm" />
</cfsilent>
<!--- First BQ processing step: insert the processed parcels --->
<!--- log the start of the first BQ processing step (TShipping) --->
<cfset BigqueryTShipping_StartDate=now()>
<cfset QueryName = "InsertBigqueryLogTShippingStartDate">
<cfinclude template="qry_item.cfm">
<cfdirectory action="list" directory="#FileRoot#\_data\_Bigquery\TShipping" listinfo="all" type="file" name="FList" sort="datelastmodified">
<cfset FileList = Valuelist(FList.name)>
<cfoutput><h3>FileList: #FileList#</h3></cfoutput>
<cfif len(trim(FileList))>
<!--- process the last 10 files (the MaxNbFile least recent ones) --->
<cfset FileLoop = 1>
<cfloop list="#FileList#" index="FileName">
<cfset PathFile="#FileRoot#\_data\_Bigquery\TShipping\#FileName#">
<cfset QueryName = "InsertTShipping">
<cfinclude template="qry_item.cfm">
<cfset FileLoop = FileLoop+1>
<cfif FileLoop GT Attributes.MaxNbFile>
<cfbreak />
</cfif>
</cfloop>
</cfif>
<!--- instantiate an object of type (class) TableRow --->
<cfobject action="create" type="java" class="com.google.api.services.bigquery.model.TableRow" name="row">
<!--- <cfdump var="#row#"> --->
<cfset QueryName = "GetParcels">
<cfinclude template="qry_item.cfm">
<cfloop query="GetParcels">
<cfset row.set("Tracking_Date",mid(Tracking_Date,6,19))>
<cfset row.set("TShipping_ID", TShipping_ID)>
<cfset row.set("TShipping_Tracking", TShipping_Tracking)>
<cfset row.set("Shipper_ID", Shipper_ID)>
<cfset rows.setInsertId(sys.currentTimeMillis())>
<cfset rows.setJson(row)>
<cfset rowList.add(rows)>
<cfset content=rqst.setRows(rowList)>
<cfset response = bq.tabledata().insertAll(Project_ID,Dataset_ID,Table_ID, content).execute()>
</cfloop>
<!--- empty the TShipping_BQ table --->
<cfset QueryName = "DeleteOldTShipping_BQParcels">
<cfinclude template="qry_item.cfm">
<!--- delete the processed files --->
<cfif len(trim(FileList))>
<cfset TShippingFileNb=len(trim(FileList))>
<cfset FileLoop = 1>
<cfloop list="#FileList#" index="FileName">
<cfset PathFile="#FileRoot#\_data\_Bigquery\TShipping\#FileName#">
<cffile action="move" source="#PathFile#" destination="#FileRoot#\_data\_Bigquery\TShippingArchive">
<!--- <cffile action="delete" file="#PathFile#"> --->
<cfset FileLoop = FileLoop+1>
<cfif FileLoop GT Attributes.MaxNbFile>
<cfbreak />
</cfif>
</cfloop>
<cfelse>
<cfset TShippingFileNb=0>
</cfif>
<!--- log the number of TShipping files processed --->
<cfset QueryName = "InsertBigqueryLogTShippingNb">
<cfinclude template="qry_item.cfm">
<!--- log the end of the first BQ processing step --->
<cfset BigqueryTShipping_EndDate=now()>
<cfset QueryName = "InsertBigqueryLogTShippingEndDate">
<cfinclude template="qry_item.cfm">
Answer 1:
You should be able to move your insertAll() outside the loop. There may be a point where you're trying to insert too many records at once, and you would need to batch them out at that point, i.e. once you've hit 1000 records, insert them and reset your rowList array:
<cfset rowList = []>
<cfloop query="GetParcels">
<!--- re-create row on each iteration, or else every entry in rowList ends up pointing at the same object --->
<cfset row = createObject("java", "com.google.api.services.bigquery.model.TableRow").init()>
<cfset row.set("Tracking_Date", mid(Tracking_Date,6,19))>
<cfset row.set("TShipping_ID", TShipping_ID)>
<cfset row.set("TShipping_Tracking", TShipping_Tracking)>
<cfset row.set("Shipper_ID", Shipper_ID)>
<!--- the Rows wrapper must also be a fresh object for each row --->
<cfset rows = createObject("java", "com.google.api.services.bigquery.model.TableDataInsertAllRequest$Rows").init()>
<!--- use a unique insertId per row: currentTimeMillis() can repeat within one millisecond, and BigQuery de-duplicates rows that share an insertId --->
<cfset rows.setInsertId(createUUID())>
<cfset rows.setJson(row)>
<cfset rowList.add(rows)>
</cfloop>
<cfset content = rqst.setRows(rowList)>
<cfset response = bq.tabledata().insertAll(Project_ID, Dataset_ID, Table_ID, content).execute()>
An example of what I mean by batching:
<cfset rowList = []>
<cfloop query="GetParcels">
<!--- re-create row and rows on each iteration, as above --->
<cfset row = createObject("java", "com.google.api.services.bigquery.model.TableRow").init()>
<cfset row.set("Tracking_Date", mid(Tracking_Date,6,19))>
<cfset row.set("TShipping_ID", TShipping_ID)>
<cfset row.set("TShipping_Tracking", TShipping_Tracking)>
<cfset row.set("Shipper_ID", Shipper_ID)>
<cfset rows = createObject("java", "com.google.api.services.bigquery.model.TableDataInsertAllRequest$Rows").init()>
<cfset rows.setInsertId(createUUID())>
<cfset rows.setJson(row)>
<cfset rowList.add(rows)>
<!--- flush every 1000 rows, then start a new batch --->
<cfif arrayLen(rowList) EQ 1000>
<cfset content = rqst.setRows(rowList)>
<cfset response = bq.tabledata().insertAll(Project_ID, Dataset_ID, Table_ID, content).execute()>
<cfset rowList = []>
</cfif>
</cfloop>
<!--- flush the remainder; the guard avoids an empty request when the row count is an exact multiple of 1000 --->
<cfif NOT arrayIsEmpty(rowList)>
<cfset content = rqst.setRows(rowList)>
<cfset response = bq.tabledata().insertAll(Project_ID, Dataset_ID, Table_ID, content).execute()>
</cfif>
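One more thing worth checking (not covered in the original answer): insertAll() can return a successful HTTP response while individual rows are rejected, so it pays to inspect the response for per-row errors. A minimal sketch, assuming the same google-api-services-bigquery Java client objects used above; getInsertErrors(), getIndex(), and getErrors() are methods of the client's TableDataInsertAllResponse classes:
<!--- getInsertErrors() returns null when every row in the batch was accepted --->
<cfset insertErrors = response.getInsertErrors()>
<cfif NOT isNull(insertErrors)>
<cfloop from="1" to="#insertErrors.size()#" index="i">
<cfset rowError = insertErrors.get(i - 1)>
<!--- getIndex() is the zero-based position of the failing row within the batch --->
<cfoutput><p>Row #rowError.getIndex()# failed: #rowError.getErrors().toString()#</p></cfoutput>
</cfloop>
</cfif>
Running this after each batched insertAll() call lets you log or retry the specific rows that failed instead of silently losing them.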
Source: https://stackoverflow.com/questions/23892239/insert-operation-into-a-bigquery-table