Staged Internal file csv.gz giving error that file does not match size of corresponding table?

问题

I am trying to copy a csv.gz file into a table I created to start analyzing location data for a map. I was running into an error that says that there are too many characters, and I should add a on_error option. However, I am not sure if that will help load the data, can you take a look?

Data source: https://data.world/cityofchicago/array-of-things-locations

SELECT * FROM staged/array-of-things-locations-1.csv.gz


CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location_2 variant, location variant);

COPY INTO ARRAYLOC
   FROM @staged/array-of-things-locations-1.csv.gz;
 
 CREATE OR REPLACE FILE FORMAT t_csv
   TYPE = "CSV"
   COMPRESSION = "GZIP"
   FILE_EXTENSION= 'csv.gz'
 
 CREAT OR REPLACE STAGE staged
    FILE_FORMAT='t_csv';
    
COPY INTO ARRAYLOC FROM @~/staged file_format = (format_name = 't_csv');

Error message:

Number of columns in file (8) does not match that of the corresponding table (9), use file format option error_on_column_count_mismatch=false to ignore this error File '@~/staged/array-of-things-locations-1.csv.gz', line 2, character 1 Row 1 starts at line 1, column "ARRAYLOC"["LOCATION_2":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.

Solved: The real issue was that I need to better clean the data I was staging. This was my error. This is what I ended up changing: the column types, changing the file from " to ' and had to separate one column due to a comma in the middle of the data.

CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude float, longitude varchar, location varchar);

COPY INTO ARRAYLOC
   FROM @staged/array-of-things-locations-1.csv.gz;
 
 CREATE or Replace FILE FORMAT r_csv
   TYPE = "CSV"
   COMPRESSION = "GZIP"
   FILE_EXTENSION= 'csv.gz'
   SKIP_HEADER = 1
   ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
   EMPTY_FIELD_AS_NULL = TRUE;
 
 create or replace stage staged
    file_format='r_csv';
    
copy into ARRAYLOC from @~/staged 
   file_format = (format_name = 'r_csv');
   
SELECT * FROM ARRAYLOC LIMIT 10;

回答1:

Your error doesn't say that you have too many characters but that your file has 8 columns and your table has 9 columns, so it doesn't know how to align the columns from the file to the columns in the table.

You can list out the columns specifically using a subquery in your COPY INTO statement.

Notes:

Columns from the file are positional based, so $1 is the first column in the file, $2 is the second, etc....
You can put the columns from the file in any order that you need to match your table.
You'll need to find the column that doesn't have data coming in from the file and either fill it with null or some default value. In my example, I assume it is the last column and in it I will put the current timestamp.
It helps to list out the columns of the table behind the table name, but this is not required.

Example:

COPY INTO ARRAYLOC (COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8,COLUMN9)
FROM (
    SELECT $1
      ,$2 
      ,$3 
      ,$4 
      ,$5 
      ,$6 
      ,$7 
      ,$8
      ,CURRENT_TIMESTAMP()
   FROM @staged/array-of-things-locations-1.csv.gz
);

I will advise against changing the ERROR_ON_COLUMN_COUNT_MISMATCH parameter, doing so could result in data ending up in the wrong column of the table. I would also advise against changing the ON_ERROR parameter as I believe it is best to be alerted of such errors rather than suppressing them.

回答2:

Yes, setting that option should help. From the documentation:

ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE Use: Data loading only

Definition: Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table.

If set to FALSE, an error is not generated and the load continues. If the file is successfully loaded:

If the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file and the remaining fields are not loaded.

If the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.

This option assumes all the records within the input file are the same length (i.e. a file containing records of varying length return an error regardless of the value specified for this parameter).

So assuming you are okay with getting NULL values for the missing column in your input data, you can use ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE to load the file successfully.

回答3:

When viewing that table directly on data.world, there are columns named both location and location_2 with identical data. It looks like that display is erroneous, because when downloading the CSV, it has only a single location column.

I suspect if you change your CREATE OR REPLACE statement with the following statement that omits the creation of location_2, you'll get to where you want to go:

CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location variant);

来源：https://stackoverflow.com/questions/59687926/staged-internal-file-csv-gz-giving-error-that-file-does-not-match-size-of-corres

标签

snowflake-data-warehouse