Replace a space in multiple JSON files inside Google Cloud Storage

I have thousands of JSON files on Google Cloud Storage, and they contain a field name with a space in it (campaign name). Before loading them into BigQuery (or creating an external table over them) I need to replace the space with an underscore (campaign_name). I get the following error when I try to create the table without doing the replacement:
Error in query string: Illegal field name: campaign name Table: raw_km_all_data
Is there any solution that does not involve downloading all the files to a server, doing the replacement, and uploading them to Cloud Storage again?
Thanks!

You can pretend that these JSON files are CSVs with a single column containing one big string per row. Then, once the data is loaded into BigQuery as a single-column table, use the REPLACE or REGEXP_REPLACE functions to replace the space in the field name with an underscore. Then you can use the JSON_EXTRACT family of functions to parse the JSON and populate a table with real columns, as sketched below.
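A minimal sketch of that approach in BigQuery Standard SQL, assuming the files are newline-delimited JSON loaded into a single-column table mydataset.raw_json with a STRING column named line (all of these names are placeholders), and replacing only the offending key rather than every space:

CREATE TABLE mydataset.raw_km_all_data AS
SELECT
  JSON_EXTRACT_SCALAR(fixed, '$.campaign_name') AS campaign_name
  -- extract the remaining fields the same way
FROM (
  SELECT REPLACE(line, '"campaign name"', '"campaign_name"') AS fixed
  FROM mydataset.raw_json
);

The load step itself can be something like bq load --source_format=CSV --field_delimiter='\t' --quote='' mydataset.raw_json gs://my-bucket/path/*.json line:STRING, choosing a delimiter that never appears in the data and disabling CSV quoting.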

Related

ADF Copy Activity Fails CSV to Parquet when CSV has space in header column

When using a copy activity in Azure Data Factory to copy a typical CSV file with a header row into Parquet sink, the SINK fails with the following error due to the column names in the CSV having spaces in the header.
The column name is invalid. Column name cannot contain these
character:[,;{}()\n\t=]
The CSV is pipe delimited and displays just fine using the preview feature of the dataset with the first row marked as the header. I see no options to handle this use case on the Parquet (sink) side of the copy activity. I realize this can probably be addressed by using a data flow to transform the column names to remove spaces, but does that mean the native copy activity is incapable of handling a space in a header row?
EDIT: I should have added that the dataset uses default mappings so that we can use the same dataset for any CSV-to-Parquet copy. The answer provided will work for explicit mappings, but we don't see any resolution for folks who use default/dynamic mappings, since we do not have access to the column names in order to remove the spaces.
As we can note from the official doc here:
Error code: ParquetInvalidColumnName
Message: The column name is invalid. Column name cannot contain these character:[,;{}()\n\t=]
Cause: The column name contains invalid characters.
Resolution: Add or modify the column mapping to make the sink column name valid.
If you would like to continue using the copy activity, there are a few workarounds:
1. Make sure you have selected Pipe (|) as the column delimiter.
2. If feasible, in the mapping settings, import the schema and rename the destination column names so they do not contain spaces.
This is still an ongoing issue/feature request; follow it here for updates.

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and wrestling with schema generation. There is an auto-generate option but it is poorly documented. The problem is that if I choose to let BigQuery generate the schema, it does a decent job of guessing data types, but only sometimes does it recognize the first row of the data as a header row, and sometimes it does not (it treats the first row as data and generates column names like string_field_N). The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing 2 things here:
Preprocess your file and store the final layout of the file sans the first row, i.e. the header row.
bq load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter (see the sketch after this list). This gives you the flexibility to alter the schema at any point in time, if required.
Allowing BQ to autodetect the schema is not advised.
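For illustration, a hedged sketch of that second step, where the dataset, table, bucket, and column names are placeholders and myschema.json contains a JSON array such as [{"name": "name", "type": "STRING"}, {"name": "age", "type": "INTEGER"}]:

bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://my-bucket/data.csv ./myschema.json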
Schema auto-detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases in which column name detection fails is when all of the columns in your CSV file have the same data type. For instance, BigQuery schema auto-detection would not be able to detect header names for the following file, since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fix this shortcoming of schema auto-detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip the first n rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
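A sketch of the CLI flag in context (the dataset, table, bucket, and inline schema are placeholders):

bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://my-bucket/data.csv name:STRING,age:INTEGER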
Yes, you can dump the existing schema as JSON using bq show, edit it, and then use it when creating a new table:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table altogether.
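For example (a sketch; the project, dataset, and table names are placeholders), after editing myschema.json you could create the new table with:

bq mk --schema myschema.json -t project_id:dataset.new_table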
I have another way to work around the schema when loading CSV into BigQuery: just edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use the schema generated by BigQuery, the fields weight and total will be defined as INT64, but inserting the second row will then fail. So just edit the first data row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must set the fields weight and total as STRING, and if you want to aggregate, just convert the data type in BigQuery.
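For example, a sketch using SAFE_CAST (the table name is a placeholder; SAFE_CAST returns NULL for values that cannot be parsed, such as the quoted first row):

SELECT
  SUM(SAFE_CAST(weight AS FLOAT64)) AS total_weight,
  SUM(SAFE_CAST(total AS FLOAT64)) AS grand_total
FROM mydataset.mytable;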
cheers
If the 'column name' row and the data rows have the same type all over the CSV file, then BigQuery misinterprets the 'column name' row as data and adds a self-generated name for each column. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose 'column name' is a string and whose values are all numbers, e.g. a column named 'Test' with every value set to 0. Because the first row is then the only row made up entirely of strings, BigQuery detects it as a header. Upload the file to BigQuery and use this statement to drop the extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.

Create table in BigQuery fails when the JSON key contains a hyphen

I have a JSON data from Firebase Backup.
The data is generated such that every key is preceded by a hyphen.
Sample data is as follows:
"-GuGCJDEprMKczAMDUj8":{"deviceId":"399a649c6cee6209","dow":"Thursday","downloadFlag":"N","event":"streamStart","halfHourFull":"18h1","liveFlag":"Y","localDate":"2009-01-01","localHalfHour":1,"minutesSinceMidnight":1080,"quarterHourFull":"18q1","stationName":"hit 105","streamListenMethod":"Headphones","timestampLocal":"2009-01-01T18:00:33.679+10:00","timestampUTC":"2009-01-01T08:00:33.679Z"}
When we try to load that data into BigQuery, we encounter the following error:
Fields must contain only letters, numbers, and underscores, start with
a letter or underscore, and be at most 128 characters long.
Is this a bigquery limitation?
If yes, then what's the proposed solution here.
Any help/suggestion is much appreciated.
Is this a bigquery limitation? If yes, then what's the proposed solution here.
You need to use different field names instead. One option is to load the data into a single STRING column, e.g. by using 'CSV' for the format with a field delimiter of '|' (or any other character that doesn't appear in your data). Then you can use the JSON_EXTRACT_SCALAR function to extract fields from the JSON, e.g.:
CREATE TABLE dataset.table AS
SELECT
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.deviceId') AS deviceId,
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.dow') AS dow,
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.downloadFlag') AS downloadFlag,
...
FROM dataset.single_column_table
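The single-column load step described above might look like this on the CLI (a sketch; the bucket path is a placeholder, and --quote='' keeps the double quotes inside the JSON from being treated as CSV quoting):

bq load --source_format=CSV --field_delimiter='|' --quote='' dataset.single_column_table gs://my-bucket/firebase-backup.json json_string:STRING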

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the header, because the header values are all strings whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with
other rows in the data set. If the first line contains only strings,
and the other lines do not, BigQuery assumes that the first row is a
header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the names of your columns.
I know I can specify the column names manually, but I would prefer not doing that, as I have a lot of tables. Is there another solution ?
If your data is all of "string" type and the first row of your CSV file contains the metadata, then I guess it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The above command can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
It's not possible with the current API. You can file a feature request in the public BigQuery tracker https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the Load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but queries are not free.
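A sketch of that EXCEPT query (the table name is a placeholder, and string_field_10 stands in for whatever auto-generated name the extra jagged column receives):

SELECT * EXCEPT (string_field_10)
FROM mydataset.mytable;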

Unable to import 3.4GB CSV into Redshift because values contain free text with commas

And so we found a 3.6GB csv that we have uploaded onto S3 and now want to import into Redshift, then do the querying and analysis from iPython.
Problem 1:
This comma-delimited file contains free-text values that themselves contain commas, which interferes with the delimiting, so we can't upload it to Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly puts them into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So the only way to get the import through is to declare this column as varchar. But then we can't do calculations later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can't manipulate the data on a local machine because it is too big, and we can't upload it onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or use escape characters:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
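For instance, if the free-text fields happen to be wrapped in double quotes, the CSV and QUOTE options of COPY can handle the embedded commas (a sketch; the table name, S3 path, and IAM role are placeholders):

COPY stack_overflow_train
FROM 's3://bucketbigdataclass/stack_overflow_train.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV QUOTE '"'
IGNOREHEADER 1;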
For the second issue, if you want to extract only the numbers from the column after loading it as char, use regexp_replace or other functions.
For the third issue, you can also load it into a VARCHAR field and then use an expression such as cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp)
to load it into the final table from the staging table.
For the first issue, you need to find a way to differentiate between the two types of commas: the delimiters and the text commas. Once you have done that, replace the delimiters with a different delimiter and use that as the delimiter in the COPY command for Redshift.
For the second issue, you need to first figure out if this column needs to be present for numerical aggregations once loaded. If yes, you need to get this data cleaned up before loading. If no, you can directly load it as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum/avg and the like) on this field.
For problem 3, you can use the TEXT(date, "yyyy-mm-dd hh:mm:ss") function in Excel to do a mass replace for this field.
Let me know if this works out.