Loading JSON files into BigQuery

I am trying to load a JSON file into BigQuery using the bq load command:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON project_abd:ds.online_data gs://online_data/file.json
One of the key:value pairs in the JSON file looks like this:
"taxIdentifier":"T"
The bq load fails with the message:
Error while reading data, error message: JSON parsing error in row starting at position 713452: Could not convert value to boolean. Field: taxIdentifier; Value: T
(The JSON is really huge, hence I can't paste it here.)
I am really confused as to why autodetect is treating the value T as boolean. I have tried all combinations of creating the table with a STRING datatype and then loading it, but because of autodetect it errors out with "changed type from STRING to BOOLEAN". If I do not use autodetect, the load succeeds.
I have to use the autodetect feature, since the JSON is the result of an API call and the columns may increase or decrease.
Any idea why the value T is being treated this way, and how to get around it?
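One workaround to sketch here (not tested against the real file, and assuming the google-cloud-bigquery Python client library): skip autodetect for typing, but still keep up with changing columns by generating an all-STRING schema from the keys that actually appear in the newline-delimited file, and let the destination table accept new fields via ALLOW_FIELD_ADDITION. The table, bucket, and file names are taken from the question; a local copy of file.json is assumed for the key scan, and nested objects would need RECORD fields instead of plain STRING.
import json
from google.cloud import bigquery

# Collect every top-level key present in the file so the schema tracks the API output.
keys = set()
with open("file.json") as f:  # local copy of gs://online_data/file.json
    for line in f:
        keys.update(json.loads(line).keys())

# Declare everything as STRING so values like "T" are never coerced to BOOLEAN.
schema = [bigquery.SchemaField(name, "STRING") for name in sorted(keys)]

client = bigquery.Client(project="project_abd")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
    # Allow the destination table to grow when the API starts returning new columns.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(
    "gs://online_data/file.json",
    "project_abd.ds.online_data",
    job_config=job_config,
).result()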

Related

Foundry Data Connection Rest responses as rows

I'm trying to implement a very simple single GET call, and the response returns some text with a bunch of IDs separated by newlines (like a single-column CSV). I want to save each one as a row in a dataset.
I understand that in general the Rest connector saves each response as a new row in an avro file, which works well for json responses which can then be parsed in code.
However, in my case I need it to just save the response into a txt or csv file, to which I can then apply a schema, getting each ID in its own row. How can I achieve this?
By default, the Data Connection Rest connector will place each response from the API as a row in the output dataset. If you know the format of your response, and it is something that would usually be parsed into one row per newline (csv, for example), you can try setting outputFileType to the correct format (it is undefined by default).
For example (for more details see the REST API Plugin documentation):
type: rest-source-adapter2
outputFileType: csv
restCalls:
  - type: magritte-rest-call
    method: GET
    path: '/my/endpoint/file.csv'
If you don't know the format, or the above doesn't work regardless, you'll need to parse the response in transforms to split it into separate rows. This can be done as if the response were a string column; in this case, exploding after splitting on newline (\n) might be useful: F.explode(F.split(F.col("response"), r'\n'))
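For context, here is a minimal PySpark sketch of that last step, assuming the raw response ends up in a string column named response (the column name and the toy data are illustrative, not Foundry-specific API):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative input: one row per API response, with the whole body in a "response" column.
raw = spark.createDataFrame([("id-001\nid-002\nid-003",)], ["response"])

# Split the body on newlines and explode, so each ID lands in its own row.
ids = raw.select(F.explode(F.split(F.col("response"), r"\n")).alias("id")) \
         .filter(F.col("id") != "")

ids.show()  # three rows: id-001, id-002, id-003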

Getting error while loading JSON into BigQuery using UI

I have to load the JSON into BigQuery. As per the BigQuery documentation, I formatted my JSON correctly, i.e. newline delimited with one JSON object per row.
Now, the JSON file has around 10 million rows, and while loading I get the below error:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2165; errors: 1. Please look into the error stream for more details.
When I found line 2165, it looks as follows:
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
When I load this single line on its own, it loads successfully. Kindly guide/suggest what is incorrect here.
I'm loading this JSON from the BigQuery UI using the schema auto-detect option.
Sample records are as follows:
{"deviceId":"3c7a345dafcff93f","dow":"Tuesday","downloadFlag":"N","email":"psloper.ps#gmail.com","emailSha256":"1cebae8c35db32edcd35e746863fc65a04ac68f2f5b3350f2df477a86bfaa07d","event":"streamStop","firebaseUID":"AMFYYjsvZjauhCktJ5lUzZj0d3D2","halfHourFull":"21h2","liveFlag":"Y","localDate":"2018-02-06","localHalfHour":2,"login":"google","minutesSinceMidnight":1311,"quarterHourFull":"21q4","stationName":"hit 105","streamListenMethod":"Headphones","timestampLocal":"2018-02-06T21:51:40.216+10:00","timestampUTC":"2018-02-06T11:51:40.216Z"}
{"deviceId":"2f1a8c84c738b752","dow":"Wednesday","downloadFlag":"N","email":"kory.maxwell#icloud.com","emailSha256":"13348786c15bff95e4afb4968a9bdbe883b70206a737c02c89fc8215f2a4e101","event":"streamStop","facebookId":"1784054201892593","firebaseUID":"Tx1bHjP6dhaDB2nl2c7yi2KZHsq2","halfHourFull":"06h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"facebook","minutesSinceMidnight":384,"quarterHourFull":"06q2","stationName":"hit 105","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T06:24:44.533+10:00","timestampUTC":"2018-02-06T20:24:44.533Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStart","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":463,"quarterHourFull":"7q3","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:43:00.39+11:00","timestampUTC":"2018-02-06T20:43:00.39Z"}
{"deviceId":"AEFD39FC-B116-4063-0000-000000000000","dow":"Wednesday","event":"catchUpPause","facebookId":"379907925802180","firebaseUID":"vQPh9tbO3Yge88fpMyNUFzJO7dl1","halfHourFull":"7h2","liveFlag":"N","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":465,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"USBAudio","timestampLocal":"2018-02-07T07:45:08.524+11:00","timestampUTC":"2018-02-06T20:45:08.524Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStop","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":475,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:55:35.788+11:00","timestampUTC":"2018-02-06T20:55:35.788Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStart","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":477,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:57:42.343+11:00","timestampUTC":"2018-02-06T20:57:42.343Z"}
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
Any help is greatly appreciated.
Well, look at that specific line 2165:
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
And specifically at:
"timestampLocal":"2018-02-07T08:017:04.679+11:00"
And the error message:
Couldn't convert value to timestamp: Could not parse
'2018-02-07T08:017:04.679+11:00' as a timestamp. Required format is
YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
So, if you change "T08:017:04.679" to "T08:17:04.679" (minute 17 instead of 017), then it works. :)
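If other rows in the 10-million-row file might have the same problem, a quick pre-load scan can flag them before BigQuery does. A small sketch in plain Python follows; the field names come from the sample records above, and the timestamp pattern is an assumption based on the values shown:
import json
import re

# ISO-8601-style pattern matching the sample values, e.g. 2018-02-07T08:17:04.679+11:00
TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+(Z|[+-]\d{2}:\d{2})$")

with open("events.json") as f:  # the newline-delimited JSON file to be loaded
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        for field in ("timestampLocal", "timestampUTC"):
            value = record.get(field)
            if value is not None and not TS_PATTERN.match(value):
                print(f"line {line_no}: bad {field} -> {value}")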

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a large UTF-8 JSON array file, and I switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110 GB and has ~150 million JSON objects.
HDInsight PySpark does not appear to support the array-of-JSON-objects file format for input, so I'm stuck. Also, I have "many" such files with a different schema in each, containing hundreds of columns each, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how wholeTextFiles works, and of course quickly ran into OutOfMemory errors that killed my executors.
I attempted loading into an RDD and other approaches, but PySpark appears to support only the JSON Lines file format, and I have an array of JSON objects because of ADLA's requirement for that file format.
I tried reading it in as a text file, stripping the array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str(ing)s.
I found a way through the above, and converted to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed them as JSON, extracted the values (the JSON I wanted), then parsed the values as JSON. This appeared to work, but when I tried to save, the RDD was of type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I admittedly did not understand, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the Unicode byte-order marker and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and end up with a distributed dataset for more efficient processing. The resulting dataframe has columns named after the JSON elements, with the schema inferred, and it dynamically adapts to the other file formats.
Here is the code - hope it helps!
# Spark considers arrays of JSON objects to be an invalid input format,
# and Unicode files are prefixed with a byte-order marker.
partitions = 1000  # minimum number of partitions for the read; tune to the cluster
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
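As a follow-up to the output problem described above, once the dataframe df from the snippet exists it can be written straight back out as newline-delimited JSON, one object per line (the output path here is illustrative):
# Write the parsed records as newline-delimited JSON; each output line is one JSON object,
# which spark.read.json (and BigQuery) can consume directly.
df.write.mode('overwrite').json('/an/output/path')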

Spark Structured-Streaming Error:-pyspark.sql.utils.StreamingQueryException: 'assertion failed: Invalid batch:

I have a Spark Structured Streaming application which reads JSON data from S3, does some transformations, and writes it back to S3.
While running the application, the job sometimes errors out and re-attempts (without any visible loss or data corruption, so everything seems fine), but the error message provided is not very descriptive.
Below is the error message:
pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: _ra_guest_gid#1883,_ra_sess_ts#1884,_ra_evt_ts#1885,event#1886,brand#1887,category#1888,funding_daysRemaining#1889,funding_dollarsRemaining#1890,funding_goal#1891,funding_totalBackers#1892L,funding_totalFunded#1893,id#1894,name#1895,price#1896,projectInfo_memberExclusive#1897,projectInfo_memberExclusiveHoursRemaining#1898,projectInfo_numberOfEpisodes#1899,projectInfo_projectState#1900,variant#1901 != _ra_guest_gid#2627,_ra_sess_ts#2628,_
My guess is this may have something to do with column mismatches, where:
1) the incoming JSON record does not conform to the schema, or
2) the datatype of the incoming JSON record does not match the data type provided in the schema.
But I'm not sure how to pinpoint which record or which particular field causes the error.
Any help or suggestions on what the error means, or on how I could log the error in a better way, would be appreciated.
Thanks
I think I have figured out the issue; it is not related to a schema mismatch.
What was happening in my case is that I have two streaming operations running in parallel:
1) reading raw incoming data from an S3 bucket, doing some operations, and writing it back to S3 in output folder 'a'
2) reading the processed streaming data from folder 'a' (step 1), again doing some operations, and writing it back to S3 in output folder 'b'
Now, as per my observations, if I run the above steps individually it works fine, but if I run them together I get the error
'pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: '
so I think it has trouble when it tries to read from and write to the same location, i.e. the destination of one stream is the source of another stream.
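For reference, here is a minimal sketch of the two chained queries described above, each with its own source, sink, and checkpoint location; the bucket names, paths, and toy schema are illustrative rather than taken from the original job. Per the observation above, running both at once, with stage 2 reading the folder stage 1 is still writing to, is what coincided with the assertion error, while running them individually worked fine.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("two-stage-stream").getOrCreate()

# Toy schema standing in for the real event columns.
schema = StructType([
    StructField("event", StringType()),
    StructField("brand", StringType()),
])

# Stage 1: raw bucket -> folder 'a'
raw = spark.readStream.schema(schema).json("s3://bucket/raw/")
stage1 = (raw.writeStream
             .format("json")
             .option("path", "s3://bucket/a/")
             .option("checkpointLocation", "s3://bucket/checkpoints/stage1/")
             .start())

# Stage 2: folder 'a' -> folder 'b', with its own checkpoint.
# Running this while stage 1 is still writing into 'a' is the chained setup
# that coincided with the 'Invalid batch' assertion in the question.
processed = spark.readStream.schema(schema).json("s3://bucket/a/")
stage2 = (processed.writeStream
                   .format("json")
                   .option("path", "s3://bucket/b/")
                   .option("checkpointLocation", "s3://bucket/checkpoints/stage2/")
                   .start())

spark.streams.awaitAnyTermination()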

json parsing error for large data in browser

I am trying to implement export-to-Excel functionality for the data in an HTML table (5000+ rows). I am using json2.js to parse the client-side data into a JSON string called jsonToExport.
The value of this variable is fine for a smaller number of records, and it is decoded fine (I checked in the browser in debug mode).
But for a large dataset of 5000+ records the JSON parsing/decoding fails. I can see the encoded string, but the decoded value shows:
jsonToExport: unable to decode
I experimented with the data and found that if the data exceeds a particular size then I get this error. For example, increasing the column sizes triggers it, while replacing large data columns with short ones avoids it, so in effect it is not an issue with the format of the encoded JSON string, since every combination of columns works as long as the number of columns is limited.
It is definitely unable to decode/parse the JSON string and then pass it in the request if it is above a particular size limit.
Is there an issue with json2.js, which does the parsing (I think)?
I also tried json3.min.js and received the same error.
Unless you're doing old-browser support, like IE 7, you don't need an antiquated library to parse JSON any longer; it's built in: JSON.parse(jsonString)