Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. If I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; at other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?

I would recommend doing two things here:
1. Preprocess your file and store the final layout of the file without the first row, i.e. the header row.
2. bq load accepts an additional parameter in the form of a JSON schema file. Use this to define the table schema explicitly and pass the file as a parameter. This gives you the flexibility to alter the schema at any point, if required.
Allowing BigQuery to autodetect the schema is not advised.
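The JSON schema file mentioned above can be generated rather than typed by hand. The following is a minimal sketch, assuming every column should start out as a NULLABLE STRING and that the header row holds the column names; the function name schema_from_header and the sample data are illustrative only, not part of any BigQuery tooling.

```python
import csv
import io
import json


def schema_from_header(csv_text, delimiter=",", default_type="STRING"):
    """Build a list of BigQuery-style field definitions from a CSV header row.

    Assumption: every column is given the same default type; edit the
    resulting JSON afterwards for columns that need another type.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    return [
        {"name": name.strip(), "type": default_type, "mode": "NULLABLE"}
        for name in header
    ]


# Hypothetical sample data, matching the all-string example in this thread.
sample = "headerA, headerB\nrow1a, row1b\n"
schema = schema_from_header(sample)
print(json.dumps(schema, indent=2))
```

Saving the printed JSON to a file (for instance myschema.json) gives something that can be passed as the schema argument to bq load, together with --skip_leading_rows=1 if the header row is left in the file.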

Schema auto-detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One case in which column name detection fails is when the header row and the data rows have the same data types throughout the file. For instance, BigQuery schema auto-detection cannot detect header names for the following file, since every field is a string.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI does not fix this shortcoming of schema auto-detection in BigQuery.

If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip a number of header rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it is also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).

Yes, you can export the existing schema with bq show and then modify it:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BigQuery table altogether.
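Once the schema has been exported to JSON, the auto-generated column names can be replaced programmatically instead of by hand. This is a minimal sketch, assuming the exported file is a JSON list of field objects with a "name" key; the helper rename_fields and the sample schema are illustrative assumptions.

```python
import json


def rename_fields(schema, new_names):
    """Return a copy of the schema with field names replaced in order."""
    if len(schema) != len(new_names):
        raise ValueError("expected one new name per schema field")
    # dict(field, name=...) copies each field and overrides only its name.
    return [dict(field, name=new) for field, new in zip(schema, new_names)]


# Hypothetical content of an exported myschema.json.
exported = json.loads(
    '[{"name": "string_field_0", "type": "STRING", "mode": "NULLABLE"},'
    ' {"name": "string_field_1", "type": "STRING", "mode": "NULLABLE"}]'
)
renamed = rename_fields(exported, ["headerA", "headerB"])
print(json.dumps(renamed, indent=2))
```

The types and modes that BigQuery guessed are kept as they are; only the names change.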

I have a way to get a usable schema when loading CSV into BigQuery: edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use the schema generated by BigQuery, the fields weight and total are defined as INT64, and the load then fails on the second data row. So edit the first data row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must set the fields weight and total as STRING; if you want to aggregate them, convert the data type in your BigQuery query.
Cheers

If the column names and the data have the same type throughout the CSV file, BigQuery mistakes the column names for data and adds auto-generated names for the columns. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose name is a string and whose values are all numbers, e.g. a column named 'Test' in which every value is 0. Upload the file to BigQuery and use this query to drop the extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.

Related

ADF Copy Activity Fails CSV to Parquet when CSV has space in header column

When using a copy activity in Azure Data Factory to copy a typical CSV file with a header row into Parquet sink, the SINK fails with the following error due to the column names in the CSV having spaces in the header.
The column name is invalid. Column name cannot contain these
character:[,;{}()\n\t=]
The CSV is pipe delimited and displays just fine using the preview feature of the dataset with the first row marked as the header. I see no options to handle this use case on the Parquet (sink) side of the copy activity. I realize this can probably be addressed using a data flow to transform column names to remove spaces, but does that mean the native copy activity is incapable of handling a header row that contains a space?
EDIT: I should have added that dataset uses default mappings so that we can use the same dataset for any CSV to PARQUET copy. The answer provided will work for explicit mappings, but we don't see any resolution for folks who use default/dynamic mappings since we do not have access to the column names to remove spaces.

As the official documentation notes:
Error code: ParquetInvalidColumnName
Message: The column name is invalid. Column name cannot contain these character:[,;{}()\n\t=]
Cause: The column name contains invalid characters.
Resolution: Add or modify the column mapping to make the sink column name valid.
If you would like to continue using the copy activity, there are a few workarounds:
1. Make sure you have selected Pipe (|) as the column delimiter.
2. If feasible, go to the mapping settings, import the schema, and rename the destination columns so that they contain no spaces.
This is still an open issue or feature request.
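For default or dynamic mappings, where the column names cannot be renamed in the mapping, one option is to clean the header row before the file reaches the copy activity. The following is a minimal sketch of that preprocessing step done outside Data Factory; it is not an ADF feature, and the helper names are assumptions. It replaces a space and each character listed in the error message with an underscore.

```python
import re

# A space plus the characters named in the ParquetInvalidColumnName message.
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column(name, replacement="_"):
    """Replace every character Parquet rejects in a column name."""
    return INVALID_CHARS.sub(replacement, name.strip())


def sanitize_header(line, delimiter="|"):
    """Sanitize each column name in a delimited header line."""
    columns = line.rstrip("\n").split(delimiter)
    return delimiter.join(sanitize_column(col) for col in columns)


# Hypothetical pipe-delimited header with spaces and parentheses.
print(sanitize_header("First Name|Last Name|Amount (USD)"))
```

Only the first line of the file would need to be rewritten this way; the data rows can be copied through unchanged.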

how to load json data available in txt file using put table in oracle NoSql database

I have two txt files containing Json data available in Linux system.
I have created respective tables in Oracle NoSql for these two files.
Now, I want to load this data in to created table in Oracle NoSql Database.
Syntax:
put table -name <name> [-if-absent | -if-present ]
[-json <string>] [-file <file>] [-exact] [-update]
Explanation:
Put a row into the named table. The table name is a dot-separated name with the format table[.childTableName]*.
where:
-if-absent
Indicates to put a row only if the row does not exist.
-if-present
Indicates to put a row only if the row already exists.
-json
Indicates that the value is a JSON string.
-file
Can be used to load JSON strings from a file.
-exact
Indicates that the input JSON string or file must contain values for all columns in the table and cannot contain extraneous fields.
-update
Can be used to partially update the existing record.
Now, I am using below command to load:
kv-> put table -name tablename -file /path-to-folder/file.txt
Error handling command put table -name tablename -file /path-to-folder/file.txt: Illegal value for numeric field predicted_probability: 0.0. Expected FLOAT, is DOUBLE
kv->
I am not able to find the reason. Learned members, Please help.
Thank You for helping.

Yeah, I solved it. There was a conflict between the data type in the table and the data type in the JSON string; I only realized this later.
Thanks

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), it is capable of inferring the column names from the headers, because the headers are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with
other rows in the data set. If the first line contains only strings,
and the other lines do not, BigQuery assumes that the first row is a
header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the name of your variables.
I know I can specify the column names manually, but I would prefer not to do that, as I have a lot of tables. Is there another solution?
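The header rule quoted above can be illustrated with a deliberately simplified sketch. This is not BigQuery's actual implementation; it treats "not a string" as "parses as a number", which is an assumption made only to show why an all-string file gives the detector nothing to go on.

```python
def looks_numeric(value):
    """Return True if the value parses as a number (simplified type check)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_is_header(rows):
    """Apply the quoted rule: the first row is a header only if it contains
    only strings while the remaining rows do not."""
    first, rest = rows[0], rows[1:]
    first_all_strings = not any(looks_numeric(v) for v in first)
    rest_all_strings = all(not looks_numeric(v) for row in rest for v in row)
    return first_all_strings and not rest_all_strings


# Mixed types: the header stands out from the data.
print(first_row_is_header([["name", "age"], ["ann", "31"]]))
# All strings: the first row is indistinguishable from the data.
print(first_row_is_header([["headerA", "headerB"], ["row1a", "row1b"]]))
```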

If all your data is of type "string" and the first row of your CSV file contains the column names, then it is easy to write a quick script that parses the first line of your CSV and generates a "create table" command like this:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
Update 14/02/2018, thanks to Felipe's comment:
The command above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
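The "quick script" mentioned in this answer might look like the sketch below, which turns a header line into the name:TYPE list used in the first command. The function name inline_schema is an assumption, and the naive split does not handle quoted headers that contain the delimiter.

```python
def inline_schema(header_line, delimiter=",", field_type="STRING"):
    """Turn a CSV header line into a 'name:TYPE,name:TYPE' schema string."""
    names = [name.strip() for name in header_line.strip().split(delimiter)]
    return ",".join(f"{name}:{field_type}" for name in names)


# Hypothetical header line, matching the column names used in this answer.
print(inline_schema("name,street,city\n"))
```

The resulting string can be pasted after --schema in the bq mk command, which spells out the type of each column explicitly.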

It's not possible with the current API. You can file a feature request in the public BigQuery tracker https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and select * EXCEPT the added extra column, but queries are not free.
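The file edit described in this workaround can be scripted. A minimal sketch, assuming a comma-delimited file whose second line is the first data row and using 0 as the non-string value; the function name add_marker_to_second_line is illustrative.

```python
def add_marker_to_second_line(csv_text, marker="0", delimiter=","):
    """Append one extra non-string value to the second line of a CSV text."""
    lines = csv_text.splitlines()
    if len(lines) < 2:
        raise ValueError("need a header line and at least one data line")
    # Only the second line gets the extra field, so the rows become jagged.
    lines[1] = lines[1] + delimiter + marker
    return "\n".join(lines) + "\n"


# Hypothetical all-string file, as in the question.
original = "headerA,headerB\nrow1a,row1b\nrow2a,row2b\n"
print(add_marker_to_second_line(original), end="")
```

Because only one row carries the extra field, the load needs allowJaggedRows, as the answer notes.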

How to split a column into two columns in SSIS if any invalid data in the column

I am trying to load data from a CSV file into a database. While reading the date column from the CSV file, I get errors because the file contains some invalid values such as '31-FEB-2014'. I need to store those invalid values in another column in the table. How can I achieve this using SSIS? Please assist.

Make a new column on your table which is of datatype nvarchar. Map your CSV Source column to the new column.
Afterwards you can do some magic. For example, you could use a derived column to take the new nvarchar value, convert it back to a proper date format, and then map it to your original column.
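The logic behind this approach (keep the raw text, convert it when it is a real date, and preserve it otherwise) is sketched below in Python rather than as an SSIS expression. The DD-MON-YYYY format is taken from the example in the question; the function name split_date is an assumption.

```python
from datetime import datetime


def split_date(raw, fmt="%d-%b-%Y"):
    """Return (parsed_date, None) for a valid date, or (None, raw) otherwise."""
    try:
        return datetime.strptime(raw, fmt).date(), None
    except ValueError:
        # Impossible dates such as 31-FEB-2014 end up in the second column.
        return None, raw


for value in ["28-FEB-2014", "31-FEB-2014"]:
    valid, invalid = split_date(value)
    print(value, "->", valid, invalid)
```

In SSIS terms, the first element corresponds to the original date column and the second to the new nvarchar column.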

You just need to redirect the failing rows. Drag the red arrow (the error output) to another destination and set its properties so that the rows are redirected.
Tag me in case you're stuck.

MySQL to GeoMesa through .csv

I have a MySQL table whose data I have to export to .csv and then ingest this .csv to GeoMesa.
In my MySQL table, the the_geom attribute has the data type point, and in the database it is stored as a blob.
Now I have two problems:
1. When I export the MySQL data into a .csv file, the file shows (...) for the the_geom attribute instead of a binary representation or anything else that would allow it to be ingested into GeoMesa. How can I overcome this?
2. The CSV file also shows # for any attribute with the datetime data type, but if you expand the column the date and time can be seen. Will this cause a problem in GeoMesa?

For #1, MySQL's export is not automatically converting the Point datatype into text for you. You might need to call a conversion function such as AsWKT to output the geometry as Well Known Text. The WKT format can be used by GeoMesa to read in the Point data.
For #2, I think you'll need to do the same for the date field. Check out the date and time functions.
