Alternative to JSON Flattening via Target Table in Snowflake

Per the Snowflake tutorial (https://docs.snowflake.net/manuals/user-guide/json-basics-tutorial-copy-into.html), I created a target table (TESTING_JSON) with a single VARIANT column that holds an uploaded JSON file.
My question is: how can I cut out creating this intermediate target table (TESTING_JSON) with a single VARIANT column, which I have to reference in order to create the only table I actually want (TABLE1), containing the flattened JSON? I have found no way, via the UI, to read in a JSON file from my desktop and parse it on the fly into a flattened table. I am not asking about the CLI, as I know this can be done there using PUT/COPY INTO. This is how I currently build TABLE1 from the intermediate table:
create or replace temporary table TABLE1 AS
SELECT
    VALUE:col1::string AS COL_1,
    VALUE:col2::string AS COL_2,
    VALUE:col3::string AS COL_3
from TESTING_JSON
   , lateral flatten( input => json:value );  -- json is the VARIANT column in TESTING_JSON

You can't do this through the UI. If you want to do this, you either need to use an external tool on your desktop or, as Mike mentioned, do the transformation in the COPY statement.

You're going to need to do this in a few steps from your desktop:
1. Use SnowSQL or some other tool to get your JSON file up to blob storage: https://docs.snowflake.net/manuals/sql-reference/sql/put.html
2. Use a COPY INTO statement to load the data directly into the flattened table that you want to end up with. This requires a SELECT statement inside your COPY INTO: https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
There is a good example of this here: https://docs.snowflake.net/manuals/user-guide/querying-stage.html#example-3-querying-elements-in-a-json-file
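Putting the two steps together, here is a minimal sketch of what that can look like. The stage name, local file path, and JSON attribute names are hypothetical and would need to match your file:
-- 1. Create a named internal stage and upload the local file (PUT must be run from SnowSQL or another client, not the UI):
create or replace stage my_json_stage file_format = (type = 'JSON');
-- put file:///Users/me/Desktop/testing.json @my_json_stage;

-- 2. Load straight into the flattened target table, transforming during the COPY:
create or replace table TABLE1 (COL_1 string, COL_2 string, COL_3 string);
copy into TABLE1 (COL_1, COL_2, COL_3)
from (
    select $1:col1::string, $1:col2::string, $1:col3::string
    from @my_json_stage
)
file_format = (type = 'JSON');  -- add strip_outer_array = true here if the file is a single JSON array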

Related

Can we COPY External JSON File into Snowflake?

I am trying to load External JSON File from Azure Blob Storage to Snowflake. I created the table LOCATION_DETAILS with all columns as Variant. When I try to load into the table, I am getting the below error:
Can anyone help me on this?
You need to create a file format that specifies the file type and any other options, like below:
create or replace file format myjsonformat
type = 'JSON'
strip_outer_array = true;
Then load the file using this file format and it will work.
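A load using that file format could then look something like this (the stage and file names here are hypothetical; the next answer shows how to create a stage pointing at Azure Blob Storage):
copy into LOCATION_DETAILS
from @my_azure_stage/location_details.json
file_format = (format_name = 'myjsonformat');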
When I use external data with Snowflake, I like to create stages linked to the blob storage (Azure Blob Storage in this case); it's easy, and you can work with the data transparently, just as if it were local.
Create the stage linked to the blob storage like this:
CREATE OR REPLACE STAGE "<DATABASE>"."<SCHEMA>"."<STAGE_NAME>"
URL='azure://demostorage178.blob.core.windows.net/democontainer'
CREDENTIALS=(AZURE_SAS_TOKEN='***********************************************')
FILE_FORMAT = (TYPE = JSON);
After that, you can list what is in the blob storage from Snowflake like this:
list @"<DATABASE>"."<SCHEMA>"."<STAGE_NAME>";
Or like this:
use database "<DATABASE>";
use schema "<SCHEMA>";
SELECT * FROM @"STAGE_NAME"/sales.json;
If you need to create the table, use this:
create or replace table "<DATABASE>"."<SCHEMA>"."<TABLE>" (src VARIANT);
And you can COPY your data like this (for a single file):
copy into "<DATABASE>"."<SCHEMA>"."<TABLE>" from @"<STAGE_NAME>"/sales.json;
Finally, use this to load all new data that arrives in your stage. Note: you don't need to remove previously loaded data; COPY INTO skips files it has already loaded and loads only the new ones.
copy into "<DATABASE>"."<SCHEMA>"."<TABLE>" from @"STAGE_NAME";

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but only sometimes does it recognize the first row of the data as a header row; other times it does not (it treats the first row as data and generates column names like string_field_N). The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing two things here:
1. Preprocess your file and store the final layout of the file sans the first row, i.e. the header row.
2. bq load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter. This gives you the flexibility to alter the schema at any point in time, if required. A sketch of this follows below.
Allowing BQ to autodetect the schema is not advised.
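As a rough sketch of point 2 (the column, table, and bucket names are hypothetical), the schema file is just a JSON array of column definitions that you pass to bq load alongside the data:
# myschema.json
[
  {"name": "id",      "type": "INTEGER",   "mode": "NULLABLE"},
  {"name": "label",   "type": "STRING",    "mode": "NULLABLE"},
  {"name": "created", "type": "TIMESTAMP", "mode": "NULLABLE"}
]

# Load the preprocessed file (header row already stripped) with the explicit schema:
bq load --source_format=CSV mydataset.mytable gs://mybucket/data.csv ./myschema.json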
Schema auto-detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One case where column name detection fails is when every field in your CSV file has the same data type. For instance, BigQuery schema auto-detection would not be able to detect header names for the following file, since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip the first n rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
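As a rough CLI illustration of that option (the dataset, table, and bucket names here are hypothetical):
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    mydataset.mytable gs://mybucket/data.csv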
Yes, you can retrieve the existing schema (effectively the DDL) with bq show and then modify it:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table altogether.
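After editing the column names in myschema.json, you can feed the file back in to create the new table; bq mk --schema also accepts the path to a JSON schema file (the table name here is hypothetical):
bq mk --schema ./myschema.json -t mydataset.my_new_table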
I have a workaround for the schema when loading CSV into BigQuery: you just need to edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use BigQuery's schema auto-generation, the fields weight and total will be defined as INT64, and inserting the second row will then error out or fail. So just edit the first data row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must set the fields weight and total as STRING, and if you want to aggregate them you just convert the data type in BigQuery.
cheers
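As a minimal sketch of that last conversion step (the table name is hypothetical; SAFE_CAST returns NULL for values, such as the quoted first row, that don't parse as numbers):
SELECT SUM(SAFE_CAST(weight AS FLOAT64)) AS total_weight
FROM mydataset.mytable;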
If the 'column name' row and the data have the same type throughout the CSV file, BigQuery misinterprets the column names as data and adds self-generated names for the columns. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose 'column name' is a string but whose values are all numbers, e.g. a column named 'Test' where every value is 0. The header row is then no longer the same type as the data rows, so it is detected as a header. Upload the file to BigQuery and use this query to drop the extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the header row, because the headers are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with
other rows in the data set. If the first line contains only strings,
and the other lines do not, BigQuery assumes that the first row is a
header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the name of your variables.
I know I can specify the column names manually, but I would prefer not doing that, as I have a lot of tables. Is there another solution ?
If your data is all of "string" type and the first row of your CSV file contains the metadata, then I guess it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned). A sketch of the load step follows below.
14/02/2018: Update thanks to Felipe's comment:
The above command can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
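The follow-up load into that pre-created table could then look like this (the bucket name is hypothetical):
bq load --source_format=CSV --skip_leading_rows=1 \
    mydataset.myNewTable gs://mybucket/myData.csv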
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but querying is not free.
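For example, a hedged sketch of that query-based cleanup (the table and column names are hypothetical):
CREATE TABLE mydataset.clean_table AS
SELECT * EXCEPT(extra_column)
FROM mydataset.loaded_table;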

Is it possible to create a hive table from a specific subset of CSV columns?

I have approximately 400 CSV files. I want to create a Hive table over these CSV files but only include a certain subset of the columns (see below). I know I could create a table with all of them then use a select statement to grab only the ones I want and make a second hive table but I was wondering if there was a way I could avoid doing that.
Here are my columns:
columns = ['time', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6', 'Var7', 'I0', 'I1',
'I2', 'V0', 'V1', 'V2', 'fpa', 'fpb', 'fpc', 'fpg', 'filename',
'record_time_stamp', 'fault', 'unix_time', 'Var2_real', 'Var2_imag',
'Var3_real', 'Var3_imag', 'Var4_real', 'Var4_imag', 'Var5_real',
'Var5_imag', 'Var6_real', 'Var6_imag', 'Var7_real', 'Var7_imag',
'I0_real', 'I0_imag', 'I1_real', 'I1_imag', 'I2_real', 'I2_imag',
'V0_real', 'V0_imag', 'V1_real', 'V1_imag', 'V2_real', 'V2_imag']
I don't want these in the Hive table:
['Var2', 'Var3', 'Var4', 'Var5', 'Var6', 'Var7', 'I0', 'I1','I2', 'V0', 'V1', 'V2']
I understand I can just alter my data in the CSVs or use 2 Hive tables but I don't want to alter my data (because another team will use those columns for their work) and I don't want to make another table for the sake of keeping things neat. Is this possible?
If you can use Spark, I'd suggest you read the data from the CSV files, define a schema containing only the columns that you need, and enforce it on the ingested data to create a DataFrame.
Save the DataFrame thereafter using .saveAsTable() and you should see it in your Hive database.
Manipulation of data to such an extent is a task for Spark and not Hive.
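Here is the same idea expressed in Spark SQL rather than the DataFrame API, as a rough sketch. It assumes a Spark session with Hive support enabled; the path, database, and table names are hypothetical, and only a few of the kept columns are spelled out:
-- Register the CSV files as a temporary view, then write only the wanted columns to a Hive table
CREATE TEMPORARY VIEW raw_csv
USING csv
OPTIONS (path '/data/csv_files/*.csv', header 'true', inferSchema 'true');

CREATE TABLE mydb.subset_table AS
SELECT time, fpa, fpb, fpc, fpg, filename, record_time_stamp, fault, unix_time,
       Var2_real, Var2_imag, Var3_real, Var3_imag  -- ...plus the remaining *_real/*_imag columns
FROM raw_csv;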

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset:
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use a MapReduce job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete the answer of @rahul: you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs). The "root" workspace can read from the whole disk but is not writable, while the "tmp" workspace (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the JSON is nested, or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"}, I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
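Putting that together, a full CTAS for the nested example might look like this (the file and table names are hypothetical):
create table dfs.tmp.`members_flat.parquet` as
select t.members.id as member_id, t.members.name as member_name
from dfs.`/tmp/members.json` t;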
I assume the reason is that Parquet is a columnar store, so you need flat columns; JSON isn't flat by default, so you need to convert it.
The problem is that I have to know my JSON schema and have to build the SELECT to include all the possibilities. I'd be happy if someone knows a better way to do this.