Amazon Redshift error using COPY from CSV - line feed inside quotes

I'm using COPY to import MySQL data into my Redshift database. I've run into an issue where I have JSON data in a table and it fails to COPY, saying "Delimited value missing end quote".
So I started digging into this and experimented a little. I made a very basic table called test to try this out, like so:
CREATE TABLE test (cola varchar(1000), colb varchar(1000))
I then use the COPY command to populate this table from a file called test.csv that I have in an S3 bucket. If the file looks like this, it works:
"{
""contactInfo"": [
""givenName"",
""familyName"",
""fullName"",
""middleNames"",
""suffixes"",
""prefixes"",
""chats"",
""websites""
]}", "a"
If it looks like this, it fails:
"a", "{
""contactInfo"": [
""givenName"",
""familyName"",
""fullName"",
""middleNames"",
""suffixes"",
""prefixes"",
""chats"",
""websites""
]}"
So, if my JSON data is in the first column, COPY ignores the line feed inside the quoted value. If it is in the second column or later, it treats the line feed as the end of the line of data.
For the record, I am not setting QUOTE AS, I am letting it default to ", which is why I double up the " chars in the file.
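For reference, here's a minimal sketch of the kind of COPY I'm running (bucket and credentials are placeholders, and the CSV keyword stands in for my actual format options):
-- Sketch of the COPY under test: CSV format, default quote character ("), no QUOTE AS override.
COPY test FROM 's3://<my-bucket>/test.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
CSV;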
Anyone have any idea why this is happening, and how I can fix it? I can't just move the data to the first column: I don't always know where it will be, and there may be more than one column of JSON data.
Edit:
For the record, I have tried this with a simple linefeed inside a string, no JSON data, and am running into the same problem.

Related

Difficulties creating CSV table in Google BigQuery

I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.
The error I am getting from the job history in the BigQuery API is:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 2949; errors: 1. Please look into the errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.
This usually points to an error in the structure of the data source (in this case your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows before loading it.
Maybe something like:
awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }' myfile.csv
Or you can bind it to a condition (let's say your number of columns should be 5):
ncols=$(awk -F, '{ print NF }' myfile.csv | sort -u); if [ "$ncols" = "5" ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted. As a result, a single typo can confuse BQ into reporting thousands of errors. Let's say you have the following csv file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma after the name, everything in that row is shifted one column over. In a large file this results in thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.
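For concreteness, that schema expressed as BigQuery DDL would be roughly the following (mydataset.students is a made-up table name):
-- Hypothetical DDL matching the schema above; dataset and table names are placeholders.
CREATE TABLE mydataset.students (
  Name STRING,
  Class INT64,
  Dorm STRING,
  Room STRING,
  GPA FLOAT64
);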
I suggest you run your csv file through a csv validator before uploading it to BQ. It might find something that breaks it. It's even possible that one of your fields has a comma inside the value which breaks everything.
Another thing to investigate is whether all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
As mentioned by Scicrazed, this issue seems to occur when some file rows have an incorrect format, in which case you need to validate the content in order to figure out the specific error that is causing it.
I recommend you check the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus Stackdriver logs, which contains the same complete error data reported by the service.
I'm probably too late for this, but it seems the file has some errors (it can be a character that cannot be parsed or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do it is by running this command on the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times you will need to manipulate your CSV file before you upload it.
Hope it helps

Let Google BigQuery infer schema from csv string file

I want to upload csv data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the headers, because the headers are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row for the names of your columns.
I know I can specify the column names manually, but I would prefer not to, as I have a lot of tables. Is there another solution?
If your data is all of "string" type and the first row of your CSV file contains the metadata, then I guess it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
It's not possible with the current API. You can file a feature request in the public BigQuery tracker https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the Load configuration. The downside is you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but queries are not free.
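The query route might look something like the following sketch (mydataset, raw_table, clean_table and extra_col are all placeholder names):
-- Sketch: materialize a copy of the loaded table without the helper column.
CREATE OR REPLACE TABLE mydataset.clean_table AS
SELECT * EXCEPT (extra_col)
FROM mydataset.raw_table;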

Cassandra COPY command never stops while loads .csv file

Hello, and thank you for taking the time to read my issue.
I have the following issue with Cassandra cqlsh:
When I use the COPY command to load a .csv into my table, the command prompt never finishes executing, and nothing is loaded into the table if I stop it with Ctrl+C.
I'm using .csv files from: https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales
specifically from ukTrafficAADF.csv.
I put the code below:
CREATE TABLE first_query ( AADFYear int, RoadCategory text,
LightGoodsVehicles text, PRIMARY KEY(AADFYear, RoadCategory));
I'm trying this:
COPY first_query (AADFYear, RoadCategory, LightGoodsVehicles) FROM '..\ukTrafficAADF.csv' WITH DELIMITER=',' AND HEADER=TRUE;
This gives me the error below repeatedly:
Failed to import 5000 rows: ParseError - Invalid row length 29 should be 3, given up without retries
And never finishes.
I should add that the .csv file has more columns than I need, and trying the previous COPY command with the SKIPCOLS reserved word listing the unused columns does the same.
Thanks in advance.
In the cqlsh COPY command, all columns in the csv must be present in the table schema.
In your case, your csv ukTrafficAADF has 29 columns, but the table first_query has only 3 columns; that's why it's throwing a parse error.
So you somehow have to remove all the unused columns from the csv; then you can load it into the Cassandra table with the cqlsh COPY command.

Amazon Redshift: Make COPY command fail if there is a missing column as key in json data

I am loading a JSON data from S3 into Redshift using the COPY command:
COPY <table> FROM '<s3 path of input json data>'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
GZIP format as json '<s3 path of jsonpathfile>';
What I want is that the COPY command should fail, or at least raise a warning, when a column of the Redshift table is missing as a key in the input json. What I mean is, if the table has, say, two columns A and B and the input json has only one key B, COPY should fail or raise a warning saying column A is missing. Right now what it does is set the missing column A values to NULL and copy the rest of the columns (B). So, one obvious workaround is to set all the columns as NOT NULL while CREATING the table. But I do not want to do that, because in my data I can have a JSON like {"A" : null, "B" : "something"}. In this case key 'A' is indeed present, but the load would still fail because the value is NULL while the schema says NOT NULL. I want it to fail only when I receive a JSON such as {"B":"something"}, i.e. key A is not present at all in the JSON.
Is there any other neat way to achieve that using the COPY command?
I also tried the auto option as follows, with the same results:
COPY <table> FROM '<s3 path of input json data>'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
GZIP format as json 'auto';

Redshift COPY - No Errors, 0 Record(s) Loaded Successfully

I'm attempting to COPY a CSV file to Redshift from an S3 bucket. When I execute the command, I don't get any error messages; however, the load doesn't work.
Command:
COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;
Response:
Load into table 'temp' completed, 0 record(s) loaded successfully.
I attempted to isolate the issue via the system tables, but there is no indication there are issues.
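For reference, the check against the system tables was roughly along these lines (a sketch; stl_load_errors is one of the relevant Redshift system tables):
-- Most recent load errors, if any; per the above, nothing indicating a problem showed up.
SELECT starttime, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;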
Table Definition:
CREATE TABLE temp ("id" BIGINT);
CSV Data:
id
123,
The lines in your csv file probably don't end with a Unix newline character, so the COPY command probably sees your file as:
id123,
Given you have the IGNOREHEADER option enabled, and the line endings in the file aren't what COPY is expecting (my assumption based on past experience), the file contents get treated as one line, and then skipped.
I had this occur for some files created from a Windows environment.
I guess one thing to remember is that CSV is not a standard, more of a convention, and different products/vendors have different implementations for csv file creation.
I repeated your instructions, and it worked just fine:
First, the CREATE TABLE
Then, the LOAD (from my own text file containing just the two lines you show)
This resulted in:
Code: 0 SQL State: 00000 --- Load into table 'temp' completed, 1 record(s) loaded successfully.
So, there's nothing obviously wrong with your commands.
At first, I thought that the comma at the end of your data line could cause Amazon Redshift to think that there is an additional column of data that it can't map to your table, but it worked fine for me. Nonetheless, you might try removing the comma, or creating an additional column to store this 'empty' value.
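If you want to try the extra-column idea, a minimal sketch might look like this (assuming the trailing comma represents an empty second field; the column name "extra" is a placeholder):
-- Hypothetical two-column variant of the table to absorb the empty trailing field.
CREATE TABLE temp ("id" BIGINT, "extra" VARCHAR(1));

COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;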