Difficulties creating CSV table in Google BigQuery - csv

I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.
The error I am getting from the job history in the BigQuery API is:
Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2949; errors: 1. Please look into the
errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.

This usually points to an error in the structure of the data source (in this case, your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows of the CSV before running the load.
Maybe something like:
cat myfile.csv | awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }'
Or you can tie it to a condition (let's say your number of columns should be 5):
bad=$(awk -F, 'NF != 5' myfile.csv | wc -l); if [ "$bad" -eq 0 ]; then python myexportscript.py; else echo "rows with an invalid number of columns: $bad"; fi

It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted, so a single typo confuses BQ into reporting thousands of errors. Let's say you have the following CSV file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma, everything in that row is shifted one column over. In a large file this results in thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.
I suggest you run your CSV file through a CSV validator before uploading it to BQ; it might find something that breaks it. It's even possible that one of your fields has a comma inside the value, which breaks everything.
Another thing to check is that all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly so that a specific field ends up null in every row.
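As a minimal sketch of such a validation in Python (an assumption on my part, not from the original posts): the built-in csv module respects quoted fields, unlike a plain comma split, so it can report rows whose column count differs from the header without being fooled by commas inside quotes. The file name and delimiter are placeholders.
import csv

# Report rows whose column count differs from the header row.
# Assumes a comma-delimited file named myfile.csv; adjust as needed.
with open("myfile.csv", newline="") as f:
    reader = csv.reader(f)
    expected = len(next(reader))          # column count of the header row
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"line {line_no}: expected {expected} columns, found {len(row)}")
A row flagged by this script is a good candidate for the missing-comma problem shown above; a comma inside a properly quoted value will not be flagged.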

As mentioned by Scicrazed, this issue usually means that some file rows have an incorrect format, in which case you need to validate the file's content in order to figure out the specific error that is causing the failure.
I recommend checking the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus Stackdriver logs, which contains the same complete error data reported by the service.
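If you use the Python client library, a minimal sketch of pulling that error detail for a finished load job could look like this (the job ID and location are placeholders; both appear in the BigQuery job history):
from google.cloud import bigquery

client = bigquery.Client()

# Job ID and location are placeholders taken from the job history.
job = client.get_job("bquxjob_123abc", location="US")

print(job.error_result)        # the summary error that failed the job
for err in job.errors or []:
    print(err)                 # per-row details from the errors[] collection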

I'm probably too late for this, but it seems the file has some errors (it can be a character that cannot be parsed or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do that is to run this command in the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times, you will need to clean up your CSV file before you upload it.
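If you load through the Python client library rather than the console, the allowed-error threshold mentioned above is exposed on the load configuration; here is a minimal sketch, where the bucket URI, table ID, and the threshold of 10 are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: adjust the URI, table ID, and error threshold to your case.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=10,   # tolerate up to 10 malformed rows instead of failing
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()   # waits for the load and raises if it still fails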
Hope it helps

Related

Using a custom row delimiter in Azure Data Factory Copy Activity fails

I have a requirement to produce test data files using the same formatting as the source files they are supposed to mimic (row delimiter, column delimiter, encoding, etc.). There is a process that reads data from an MS SQL database and creates files as output.
I have made a dataset with parameters that can supply the definition of the dataset at runtime. The problem I have is that the following error is raised on execution:
Copy activity doesn't support multi-char or none row delimiter.
The parameter that is causing the error is the row delimiter. I have tried:
\r\n
\r
\n
n
r
r,n
\r,\n
I read this Custom Row Delimiter in Azure Data Factory (ADF) question, where someone says they were able to make a similar solution work.
I can output a file using either r or n, but then there is no separation of the data over lines. I also read in another post that this is not supported, but that is hard to believe because the default option produces exactly this row delimiter behavior.
As per the official documentation: "Currently, row delimiter as empty string is only supported for mapping data flow but not Copy activity."
So passing "\r\n" as a parameter actually passes "\\r\\n", for some reason.
And you can circumvent the error by editing the source code.
There was a similar question asked here.
It's not the solution you're looking for, but at least it tells you how to get around the error and why you got it.

BigQuery: loading a comma-separated CSV file with quoted data, where one column contains commas inside the data

I have a CSV file like:
"id","name","address"
"10","aparna","hyderabad,hitech-city"
"11","mounika","hyderabad,kukatpally"
"12","raji","hyderabad,madhapur"
If I use --autodetect it works, but if I use it with a schema it gives me an error.
I want to load this file into a BigQuery table like this:
id name address
10 aparna hyderabad,hitech-city
11 mounika hyderabad,kukatpally
12 raji hyderabad,madhapur
For that I used bq load project:dataset.table gs://filepath schemafile, and it gives me an error like:
.csv: Error
while reading data, error message: Too many values in row starting
at position: 0. Found 35 column(s) while expected 33.
- You are loading data without specifying data format, data will be
treated as CSV format by default. If this is not what you mean,
please specify data format by --source_format.
Can anyone help me out with this?
Thanks in advance.
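For reference, a fully explicit form of that load through the Python client library might look like the sketch below; it assumes the quoted header row should be skipped, declares the three STRING columns shown above, and keeps the default quote handling so the commas inside the quoted addresses stay in one field. The table ID and bucket URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: point these at your own table and bucket.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the quoted header row is data unless explicitly skipped
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("address", "STRING"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()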

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the header, because the header contains only strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with
other rows in the data set. If the first line contains only strings,
and the other lines do not, BigQuery assumes that the first row is a
header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row for the names of your columns.
I know I can specify the column names manually, but I would prefer not doing that, as I have a lot of tables. Is there another solution ?
If your data is all of "string" type and the first row of your CSV file contains the column names, then it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The above command can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
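If the header might contain quoted names or stray whitespace that a bare head -1 would pass through unchanged, a small script along these lines can build the schema string instead (csv.reader strips the quoting); the CSV path and table name are placeholders passed on the command line:
import csv
import subprocess
import sys

# Usage (placeholders): python make_table.py myData.csv mydataset.myNewTable
csv_path, table = sys.argv[1], sys.argv[2]

with open(csv_path, newline="") as f:
    header = next(csv.reader(f))

# Build "col1:STRING,col2:STRING,..." from the header names.
schema = ",".join(f"{name.strip()}:STRING" for name in header)
subprocess.run(["bq", "mk", "--schema", schema, "-t", table], check=True)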
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but the query is not free.

Error code: Invalid in Loading Data on BigQuery

I have a large CSV file (nearly 10,000 rows) and I am trying to upload it to BigQuery, but it gives me this error:
file-00000000: CSV table references column position 8, but line starting at position: 622 contains only 8 columns. (error code: invalid)
Can anyone please tell me a possible reason for it? I have double-checked my schema and it looks alright.
Thanks
I had this same issue when trying to import a large data set from a CSV into a BigQuery table.
The issue turned out to be some ASCII control characters (\b, \t, \r, \n) in the data that was written into the CSV. When the CSV was being sent to BigQuery, these characters caused the BigQuery CSV parser to misinterpret the line and break, because the data didn't match the number of columns in the header.
Replacing these characters with a space (to preserve formatting as best as possible) allowed me to import the data without further issues.
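A minimal sketch of that cleanup in Python, assuming a comma-delimited, quoted CSV and that the characters to replace are the ones listed above (the file names are placeholders):
import csv

# Replace control characters found inside fields with a space, preserving
# the row structure. Input and output file names are placeholders.
with open("input.csv", newline="", encoding="utf-8") as src, \
     open("cleaned.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        cleaned = [
            field.replace("\b", " ").replace("\t", " ")
                 .replace("\r", " ").replace("\n", " ")
            for field in row
        ]
        writer.writerow(cleaned)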
The error message suggests that the load job failed because at least one row has fewer columns than the automatically detected schema dictates.
Add
allow_jagged_rows=true
in the options.

Redshift COPY - No Errors, 0 Record(s) Loaded Successfully

I'm attempting to COPY a CSV file to Redshift from an S3 bucket. When I execute the command, I don't get any error messages; however, the load doesn't work.
Command:
COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;
Response:
Load into table 'temp' completed, 0 record(s) loaded successfully.
I attempted to isolate the issue via the system tables, but there is no indication there are issues.
Table Definition:
CREATE TABLE temp ("id" BIGINT);
CSV Data:
id
123,
The lines in your CSV file probably don't have a Unix newline character at the end, so the COPY command probably sees your file as:
id123,
Given you have the IGNOREHEADER option enabled, and the line endings in the file aren't what COPY is expecting (my assumption based on past experience), the file contents get treated as one line, and then skipped.
I had this occur for some files created in a Windows environment.
I guess one thing to remember is that CSV is not a standard so much as a convention, and different products/vendors implement CSV file creation differently.
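A minimal sketch of normalizing the line endings in Python before uploading to S3, assuming the file is small enough to read into memory (the file names are placeholders):
# Normalize Windows (\r\n) and old-Mac (\r) line endings to Unix (\n).
with open("input.csv", "rb") as src:
    data = src.read()

data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

with open("output.csv", "wb") as dst:
    dst.write(data)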
I repeated your instructions, and it worked just fine:
First, the CREATE TABLE
Then, the LOAD (from my own text file containing just the two lines you show)
This resulted in:
Code: 0 SQL State: 00000 --- Load into table 'temp' completed, 1 record(s) loaded successfully.
So, there's nothing obviously wrong with your commands.
At first, I thought that the comma at the end of your data line could cause Amazon Redshift to think that there is an additional column of data that it can't map to your table, but it worked fine for me. Nonetheless, you might try removing the comma, or creating an additional column to store this 'empty' value.