BigQuery: loading a comma-separated CSV file with quoted data, where one column contains commas inside its values

I have a CSV file like:
"id","name","address"
"10","aparna","hyderabad,hitech-city"
"11","mounika","hyderabad,kukatpally"
"12","raji","hyderabad,madhapur"
If I use --autodetect it works, but if I use it with a schema it gives me an error.
I want to load this file into a BigQuery table like this:
id name address
10 aparna hyderabad,hitech-city
11 mounika hyderabad,kukatpally
12 raji hyderabad,madhapur
For that I used: bq load project:dataset.table gs://filepath schemafile. This gives me an error like:
.csv: Error
while reading data, error message: Too many values in row starting
at position: 0. Found 35 column(s) while expected 33.
- You are loading data without specifying data format, data will be
treated as CSV format by default. If this is not what you mean,
please specify data format by --source_format.
Can anyone help me out with this?
Thanks in advance.
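A minimal sketch of the same load written with the google-cloud-bigquery Python client, assuming the three columns from the sample file, with the header skip and quote character spelled out (so the quoted commas stay inside the address field):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("address", "STRING"),
    ],
    skip_leading_rows=1,  # the first line is the header, not data
    quote_character='"',  # commas inside quoted values are kept in one field
)

job = client.load_table_from_uri("gs://filepath", "project.dataset.table", job_config=job_config)
job.result()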

Related

In MySQL, how do I upload a CSV file that contains a date in the format '1/1/2020' into a DATE data type column (standard YYYY-MM-DD)?

I have a column of data, let's call it bank_date, that I receive from an external vendor as a csv file every day. As such the dates in that column show as '1/1/2020'.
I am trying to upload that raw csv file directly to SQL daily. We used to store bank_date as text, but we have converted it to a DATE data type, and now it keeps zeroing out every time, with some sort of truncation / "datetime value incorrect" error.
I have now tested 17 different versions using STR_TO_DATE (mostly), CAST, and CONVERT, and feel like I'm close, but I'm not quite getting the syntax right.
I did find 2 other workarounds that are successful, but my boss specifically wants the file uploaded and converted directly through the import process (not manipulating the raw csv data) for safety reasons. For reference:
Workaround 1: Convert the csv date column to the YYYY-MM-DD format and save the file. The issue with this is that if you open that CSV file again, it auto-changes the date format back to the standard mm/dd/yyyy. If someone doesn't know to watch out for this and re-opens the csv file to double check something, they're going to get an error when they upload, and the problem is not easy to identify.
Workaround 2: Create an extra dummy_date column in the table that is formatted as a text data type and upload as normal. Then copy the data into the correct bank_date column using a STR_TO_DATE function as follows: UPDATE bank_table SET bank_date = STR_TO_DATE(dummy_date, '%c/%e/%Y'); The issue with this is that it just creates extra unnecessary data that can cause confusion when other people don't know that one of the columns is not intended for querying.
Here is my current code:
USE database_name;
LOAD DATA LOCAL INFILE 'C:/Users/Shelly/Desktop/Date Import.csv'
INTO TABLE bank_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS
(bank_date, bank_amount)
SET bank_date = str_to_date(bank_date,'%Y-%m-%d');
The "SET" line is what I cannot work out on syntax to convert a csv's 1/5/2020' to SQL's 2020-1-5 format. Every test I've made either produces 0000-00-00 or nulls the column cells. I'm thinking maybe I need to tell SQL how to understand the csv's format in order for it to know how to convert it. Newbie here and stuck.
You need to specify the format of the date as it appears in the file, not the target format:
SET bank_date = str_to_date(bank_date,'%c/%e/%Y');

How to find a particular position in a csv file

I need to upload my csv data into Google BigQuery. However, while uploading it from csv, I'm getting the error below:
Error while reading data, error message: CSV table references column position 13,
but line starting at position:40244667 contains only 7 columns.
Now, the problem is that I'm unable to identify what it means by position: 40244667. How can I find this position in my CSV file in order to troubleshoot the error further?
Any help will be really appreciated.
You can use the csv module to read the data in your csv:
import csv

for line in myfile:
    row = next(csv.reader([line]))  # parse one physical line into its fields
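If the reported position is a byte offset from the start of the file, a quick sketch (40244667 is the value from the error message; myfile.csv is a placeholder name) lets you jump straight to the offending row:

with open("myfile.csv", "rb") as f:
    f.seek(40244667)  # jump to the byte offset reported by BigQuery
    print(f.readline().decode("utf-8", errors="replace"))  # the row starting at that position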

googleapis / python-bigquery: BadRequest: Could not parse as DATE with message 'Unable to parse'

Given the following code:
with io.StringIO() as buf:
    buf.write(df_data.to_csv(header=True, index=False, quoting=csv.QUOTE_NONNUMERIC))
    buf.seek(0)
    try:
        job = self.client.load_table_from_file(buf, dest_table)
        job.result()
    except:
        buf.seek(0)
        LOG.error("Failed to upload dataframe as csv: \n\n%s\n", buf.read())
        raise
I am trying to load a pandas DataFrame to a BigQuery table by first converting it to a CSV. The problem I am faced with is that the BigQuery API fails with:
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Could not parse 'date_key' as DATE for field date_key (position 3) starting at location 0 with message 'Unable to parse'
I looked at this other issue, and there seems to be a limitation on the accepted formats for DATE values when loading a CSV file.
That being said, the print from the above except block results in the following:
ERROR utils.database._bigquery:_bigquery.py:255 Failed to upload dataframe as csv:
"clinic_key","schedule_template_time_interval_key","schedule_template_key","date_key","schedule_owner_key","schedule_template_schedule_track_key","schedule_content_label_key","start_time_key","end_time_key","priority"
"clitest11111111111111111111111","1","1","2021-01-01","1","1","1","19:00:00","21:00:00",1
"clitest11111111111111111111111","1","1","2021-01-01","1","1","2","20:00:00","20:30:00",2
"clitest11111111111111111111111","1","1","2021-01-01","1","1","3","20:20:00","20:30:00",3
Which to me seems to be a clearly well-formatted CSV file.
So my question is: How can I make BigQuery accept my CSV? What do I have to change?
N.B: I know there's a load_table_from_dataframe method on the bigquery.client.Client object, but I faced another issue that forced me to attempt the CSV method instead. See link to other issue here.
You need to drop the header row.
The "starting at location 0" in the error suggests it is the first row that BigQuery dislikes; your other date values look correct (YYYY-MM-DD).
Because column ordering is significant in CSV, BigQuery assumes the columns map to the table's columns in order, so it ends up trying to parse the header text "date_key" as a DATE.
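A minimal sketch of that fix, reusing the objects from the snippet in the question: keep the header in the CSV but tell the load job to skip it via an explicit job config (alternatively, write the CSV with header=False).

from google.cloud import bigquery  # same client library as in the question

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row instead of parsing "date_key" as a DATE
)

# buf, self.client and dest_table are the objects from the question's snippet
buf.seek(0)
job = self.client.load_table_from_file(buf, dest_table, job_config=job_config)
job.result()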

Difficulties creating CSV table in Google BigQuery

I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.
The error I am getting from the job history in the BigQuery API is:
Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2949; errors: 1. Please look into the
errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.
This usually points to an error in the structure of the data source (in this case your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows in the CSV before running the export.
Maybe something like:
cat myfile.csv | awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }'
Or, you can bind it to a condition (let's say your number of columns should be 5):
ncols=$(cat myfile.csv | awk -F, '{ a[NF]++ } END { for (n in a) { print n; exit } }'); if [ "$ncols" -eq 5 ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
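Note that awk -F, also counts commas that sit inside quoted fields; a small Python check with the csv module (a sketch, with myfile.csv as a placeholder name) respects the quoting:

import csv
from collections import Counter

counts = Counter()
with open("myfile.csv", newline="") as f:
    for row in csv.reader(f):  # csv.reader does not split on commas inside quoted fields
        counts[len(row)] += 1

for ncols, nrows in sorted(counts.items()):
    print(f"{nrows} rows have {ncols} columns")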
It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted, so that one typo confuses BQ into reporting thousands of errors. Let's say you have the following csv file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma, everything in it is shifted one column over. If you have a large file, this results in thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.
I suggest you run your csv file through a CSV validator before uploading it to BQ. It might find something that breaks it. It's even possible that one of your fields has an unquoted comma inside the value, which breaks everything.
Another thing to check is that all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
As mentioned by Scicrazed, this issue is usually generated because some file rows have an incorrect format, in which case you need to validate the content data in order to figure out the specific error behind the failure.
I recommend you check the errors[] collection, which may contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus in the Stackdriver logs, which contains the same complete error data that is reported by the service.
I'm probably too late for this, but it seems the file has some errors (it could be a character that cannot be parsed or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do it is by running this command on the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
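The same details are also available from the Python client, as a sketch (the job ID is a placeholder; use the ID of the failed load job):

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("bquxjob_example_id")  # placeholder job ID
for err in job.errors or []:
    print(err)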
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times you will need to manipulate your CSV file before you upload it.
Hope it helps

Error code: Invalid in Loading Data on BigQuery

I have a large CSV file (nearly 10,000 rows) and I am trying to upload it to BigQuery, but it gives me this error:
file-00000000: CSV table references column position 8, but line starting at position:622 contains only 8 columns. (error code: invalid)
Can anyone please tell me a possible reason for it? I have double-checked my schema and it looks alright.
Thanks
I had this same issue when trying to import a large data set in a csv to a BigQuery table.
The issue turned out to be some ASCII control characters (\b, \t, \r, \n) in the data that was written to the csv. When the csv was being sent to BigQuery, these characters caused the BigQuery csv parser to misinterpret the lines and break, because the data didn't match the number of columns in the header.
Replacing these characters with a space (to preserve formatting as best as possible) allowed me to import the data without further issues.
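A sketch of that kind of cleanup (input.csv and clean.csv are placeholder names), replacing the offending control characters with a single space before loading:

import re

# backspace and tab were among the culprits here; embedded \r and \n need the same
# treatment, but they double as line terminators, so handle them according to your data
CONTROL_CHARS = re.compile(r"[\x08\t]")

with open("input.csv", encoding="utf-8") as src, open("clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(CONTROL_CHARS.sub(" ", line))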
The error message suggests that the load job failed because at least one row has fewer columns than the automatically detected schema dictates.
Add
allow_jagged_rows=true
in the options.
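For reference, a sketch of setting that option with the Python client (the file and table names are placeholders); the equivalent flag for the bq command line tool is --allow_jagged_rows:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_jagged_rows=True,  # rows with missing trailing columns get NULLs instead of failing the load
)

with open("data.csv", "rb") as f:  # placeholder local file
    job = client.load_table_from_file(f, "mydataset.mytable", job_config=job_config)
job.result()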