Getting an error when loading JSON from GCS

I am trying to load both a schema and data from GCS as JSON files, using the command line:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=gs://1samtest/JSONSample/personsDataSchema.json SSData.persons_data gs://1samtest/JSONSample/personsData.json
But I get this error:
//1SAMTEST/JSONSAMPLE/PERSONSDATASCHEMA.JSON is not a valid value
When I change all the paths to files on my local machine, it works completely fine, but I don't know why it fails for the JSON files on GCS.
If I create the table in BigQuery first and then run the command below, it also works fine.
bq load --source_format=NEWLINE_DELIMITED_JSON SSData.persons_data "gs://1samtest/JSONSample/personsData.json"

The --schema flag does not accept GCS URIs (gs://...). It expects either the name of a local JSON file or an inline text schema, as the help text shows:
bq load --help
The [destination_table] is the fully-qualified table name of table to
create, or append to if the table already exists.
The [source] argument can be a path to a single local file, or a
comma-separated list of URIs.
The [schema] argument should be either the name of a JSON file or a text schema. This schema should be omitted if the table already has one.
Only the source argument (i.e. the data) can be given as a GCS URI.
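One possible workaround, sketched below in Python, is to copy the schema file from GCS to the local machine first and then pass the local path to --schema. This is only a sketch under assumptions: it assumes gsutil is installed next to bq, the helper name build_load_commands and the local file name are made up for illustration, and the script only prints the two commands rather than running them.

```python
import shlex


def build_load_commands(schema_uri, data_uri, table, local_schema="schema.json"):
    """Return (copy_cmd, load_cmd) as argument lists.

    copy_cmd fetches the schema from GCS to a local file (assumes gsutil).
    load_cmd passes that local file to --schema; only the data stays a gs:// URI.
    """
    copy_cmd = ["gsutil", "cp", schema_uri, local_schema]
    load_cmd = [
        "bq", "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        f"--schema={local_schema}",
        table,
        data_uri,
    ]
    return copy_cmd, load_cmd


copy_cmd, load_cmd = build_load_commands(
    "gs://1samtest/JSONSample/personsDataSchema.json",
    "gs://1samtest/JSONSample/personsData.json",
    "SSData.persons_data",
    local_schema="personsDataSchema.json",  # hypothetical local file name
)
print(shlex.join(copy_cmd))
print(shlex.join(load_cmd))
```

The printed commands can be pasted into a shell, or the argument lists can be handed to subprocess.run if you prefer to run them from Python.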

Error when importing GeoJson into BigQuery

I'm trying to load GeoJson data [1] into BigQuery via Cloud Shell but I'm getting the following error:
Failed to parse JSON: Top-level GeoJson 'type' member should have value 'Feature', but was 'FeatureCollection'.; ParsedString returned false; Could not parse value; Parser terminated before end of string
It feels like the GeoJson file is not formatted properly for BQ but I have no idea if that's true or how to fix it.
[1] https://github.com/tonywr71/GeoJson-Data/blob/master/australian-suburbs.geojson
Expanding on @scespinoza's answer, I was able to convert the file to newline-delimited GeoJSON and load it into BigQuery with the following steps:
geojson2ndjson geodata.txt > geodata_converted.txt
Using this command, I encountered an error, but I was able to work around it by splitting the data into two tables and applying the same command to each.
After that, the table loaded into BigQuery.
Your file is in standard GeoJSON format, but BigQuery only accepts newline-delimited GeoJSON files and individual GeoJSON objects (see the documentation: https://cloud.google.com/bigquery/docs/geospatial-data#geojson-files). So you should first convert the dataset to the appropriate format. Here is a good and simple explanation of how it works: https://stevage.github.io/ndgeojson/.
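If you would rather not install a converter, the same idea can be sketched in a few lines of Python: take the FeatureCollection, and write each entry of its "features" array as one compact JSON object per line. The function name and the tiny sample data below are made up for illustration, and this is not necessarily what geojson2ndjson does internally.

```python
import json


def feature_collection_to_ndjson(collection):
    """Return one compact GeoJSON Feature per line from a FeatureCollection dict."""
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    return "\n".join(
        json.dumps(feature, separators=(",", ":"))
        for feature in collection["features"]
    )


# Hypothetical two-feature sample, just to show the shape of the output.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [151.2, -33.9]},
         "properties": {"name": "A"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [144.9, -37.8]},
         "properties": {"name": "B"}},
    ],
}
print(feature_collection_to_ndjson(sample))
```

For a real file you would json.load it from disk and write the returned text to a new file. Each output line then has top-level type 'Feature', which is what the error message above asks for.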

Apache NiFi: How to create a Parquet file from a CSV file with the schema saved in the "avro.schema" attribute

I am trying to create a Parquet file from a CSV file using Apache NiFi.
I am able to convert the CSV to a Parquet file, but the schema of the resulting Parquet file contains a struct type, which I need to convert to a string type.
I am using Apache NiFi 1.14.0 on Windows Server 2016.
This is what I have tried so far to convert the CSV to Parquet.
I have used the following three controller services:
CSVReader
CSVRecordSetWriter
ParquetRecordSetWriter
And this is the processor flow:
GetFile
ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
UpdateAttribute (updates the "avro.schema" attribute: wherever two data types were inferred, I replace them with '["null","string"]')
ConvertRecord (CSVReader to ParquetRecordSetWriter)
UpdateAttribute (appends '.parquet' to the filename)
PutFile
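To make the UpdateAttribute step concrete, here is a rough Python sketch of the transformation applied to the "avro.schema" text: every field whose type is a union (a JSON list) is replaced with ["null","string"]. This only illustrates the rewrite; it is not NiFi code, the function name is made up, and the sample schema is hypothetical rather than one NiFi actually inferred.

```python
import json


def unions_to_nullable_string(avro_schema_text):
    """Replace every union-typed field in an Avro record schema with ["null","string"]."""
    schema = json.loads(avro_schema_text)
    for field in schema.get("fields", []):
        # In Avro's JSON form a union is written as a list of types.
        if isinstance(field.get("type"), list):
            field["type"] = ["null", "string"]
    return json.dumps(schema)


# Hypothetical inferred schema with one union field and one plain field.
inferred = json.dumps({
    "type": "record",
    "name": "csv_record",
    "fields": [
        {"name": "amount", "type": ["null", "long"]},
        {"name": "label", "type": "string"},
    ],
})
print(unions_to_nullable_string(inferred))
```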
I would also like to know how to view a .parquet file on Windows. Currently, I read the Parquet file via PySpark to check the schema.
After conversion, the Parquet file schema shows struct types; I want string instead.
Please note: there are lots of CSVs with many columns/fields, so I don't want to create the schema manually.
Any other way to achieve this would also be very helpful.
Thanks!
After playing around with some more options of "ParquetRecordSetWriter", I was able to create a Parquet file with the schema that I had captured in the "avro.schema" attribute.

Loading multiple JSON records into BigQuery using the console

I'm trying to upload some data into BigQuery in JSON format using the BigQuery console as described here.
If I have a single record in a JSON file, I can upload it successfully. If I put two or more newline-delimited records in a JSON file, I get this error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Parser terminated before end of string
I searched Stack Overflow and Google but had no luck finding any information. The two newline-delimited records each upload successfully as individual records in separate JSON files.
My editor must have been adding some other character at my newlines. I went back to my original JSON array of records and used:
cat test.json | jq -c '.[]' > testNDJSON.json
This fixed everything.
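If jq is not available, roughly the same conversion can be sketched in Python, together with a small check that reports which lines of a file are not a complete JSON object on their own (a record spread over several lines is one way to get "Parser terminated before end of string"). Both function names are made up for this example.

```python
import json


def array_to_ndjson(array_text):
    """Turn a JSON array of records into one compact record per line."""
    records = json.loads(array_text)
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def bad_lines(ndjson_text):
    """Return 1-based numbers of non-blank lines that are not a JSON object."""
    bad = []
    for number, line in enumerate(ndjson_text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines such as the one after a trailing newline
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(value, dict):
            bad.append(number)
    return bad


print(array_to_ndjson('[{"a": 1}, {"b": 2}]'), end="")
print(bad_lines('{"a":1}\n{"b":\n2}\n'))
```

Running bad_lines on the file that failed to load should point at the lines where a record was split or otherwise damaged.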

BigQuery loading data from bq command line tool - how to skip header rows

I have a CSV data file with a header row that I am using to populate a BigQuery table:
$ cat dummy.csv
Field1,Field2,Field3,Field4
10.5,20.5,30.5,40.5
10.6,20.6,30.6,40.6
10.7,20.7,30.7,40.7
When using the Web UI, there is a text box where I am able to specify how many header rows to skip. However, if I upload the data into BigQuery using the bq command line tool, I do not have an option to do this, and always get the following error:
$ bq load my-project:my-dataset.dummydata dummy.csv Field1:float,Field2:float,Field3:float,Field4:float
Upload complete.
Waiting on bqjob_r7eccfe35f_0000015e3e8c_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'my-project:bqjob_r7eccfe35f_0000015e3e8c_1': CSV table encountered too many errors, giving up. Rows: 1;
errors: 1.
Failure details:
- file-00000000: Could not parse 'Field1' as double for field Field1
(position 0) starting at location 0
The bq command line tool quickstart documentation also does not mention any options for skipping headers.
One simple/obvious solution is to edit dummy.csv to remove the header row, but this is not an option if pointing to a CSV file on Google Cloud Storage instead of the local file dummy.csv.
This is possible to do through the web interface, and through the Python API, so it should also be possible to do with the bq tool.
Checking bq help load revealed a --skip_leading_rows option:
--skip_leading_rows : The number of rows at the beginning of the source file to skip.
(an integer)
Also found this option in the bq command line tool documentation (which is not the same as the quickstart documentation, linked to above).
Adding --skip_leading_rows=1 to the bq load command worked like a charm.
Here is the successful command:
$ bq load --skip_leading_rows=1 my-project:my-dataset.dummydata dummy.csv Field1:float,Field2:float,Field3:float,Field4:float
Upload complete.
Waiting on bqjob_r43eb07bad58_0000015ecea_1 ... (0s) Current status: DONE
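Since the header row is being skipped anyway, it can also be used to generate the inline text schema instead of typing it by hand. A minimal Python sketch, assuming every column has the same type as in the example above (the function name is made up):

```python
import csv
import io


def text_schema_from_header(csv_text, field_type="float"):
    """Build a bq text schema (name:type,...) from the first CSV row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return ",".join(f"{name}:{field_type}" for name in header)


dummy = "Field1,Field2,Field3,Field4\n10.5,20.5,30.5,40.5\n"
print(text_schema_from_header(dummy))
```

The load itself still needs --skip_leading_rows=1, because the header row remains in the file.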

Load a JSON file from the BigQuery command line

Is it possible to load data from a JSON file (not just CSV) using the BigQuery command line tool? I am able to load a simple JSON file using the GUI; however, the command line assumes CSV, and I don't see any documentation on how to specify JSON.
Here's the simple json file I'm using
{"col":"value"}
With schema
col:STRING
As of version 2.0.12, bq does allow uploading newline-delimited JSON files. This is an example command that does the job:
bq load --source_format NEWLINE_DELIMITED_JSON datasetName.tableName data.json schema.json
As mentioned above, "bq help load" will give you all of the details.
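For the schema.json file, bq help load describes a single JSON array whose entries are objects with 'name', 'type' and, optionally, 'mode'. As a rough sketch, the text schema from the question can be turned into that shape with a few lines of Python (the function name is made up, and 'mode' is left out because it is optional):

```python
import json


def text_schema_to_json(text_schema):
    """Convert a name[:type],... text schema into a list of schema field objects."""
    fields = []
    for entry in text_schema.split(","):
        name, _, field_type = entry.strip().partition(":")
        # Per the help text, the type defaults to string when it is not given.
        fields.append({"name": name, "type": field_type or "string"})
    return fields


print(json.dumps(text_schema_to_json("col:STRING")))
```

Writing that output to schema.json gives a file that can be passed as the last argument of the command above.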
1) Yes, you can.
2) The documentation is here. Go to step 3 (upload the table) in the documentation.
3) You have to use the --source_format flag to tell bq that you are uploading a JSON file and not a CSV.
4) The complete command structure is
bq load [--source_format=NEWLINE_DELIMITED_JSON] [--project_id=your_project_id] destination_data_set.destination_table data_source_uri table_schema
bq load --source_format=NEWLINE_DELIMITED_JSON --project_id=my_project_bq dataset_name.bq_table_name gs://bucket_name/json_file_name.json path_to_schema_in_your_machine
5) You can find other bq load variants by
bq help load
As of bq version 2.0.9, loading JSON-formatted data is not supported.
Here is the documentation (bq help load) for the load command in that version:
USAGE: bq [--global_flags] <command> [--command_flags] [args]
load Perform a load operation of source into destination_table.
Usage:
load <destination_table> <source> [<schema>]
The <destination_table> is the fully-qualified table name of table to create, or append to if the table already exists.
The <source> argument can be a path to a single local file, or a comma-separated list of URIs.
The <schema> argument should be either the name of a JSON file or a text schema. This schema should be omitted if the table already has one.
In the case that the schema is provided in text form, it should be a comma-separated list of entries of the form name[:type], where type will default
to string if not specified.
In the case that <schema> is a filename, it should contain a single array object, each entry of which should be an object with properties 'name',
'type', and (optionally) 'mode'. See the online documentation for more detail:
https://code.google.com/apis/bigquery/docs/uploading.html#createtable
Note: the case of a single-entry schema with no type specified is
ambiguous; one can use name:string to force interpretation as a
text schema.
Examples:
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.new_tbl gs://mybucket/info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
bq load ds.small gs://mybucket/small.csv field1,field2,field3
Arguments:
destination_table: Destination table name.
source: Name of local file to import, or a comma-separated list of
URI paths to data to import.
schema: Either a text schema or JSON file, as above.
Flags for load:
/usr/local/bin/bq:
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import data.
-E,--encoding: <UTF-8|ISO-8859-1>: The character encoding used by the input file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between columns in the input file. "\t" and "tab" are accepted names for tab.
--max_bad_records: Maximum number of bad records allowed before the entire job fails.
(default: '0')
(an integer)
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to skip.
(an integer)
gflags:
--flagfile: Insert flag definitions from the given file into the command line.
(default: '')
--undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.
IMPORTANT: flags in this list that have arguments MUST use the --flag=value format.
(default: '')