Getting an error while loading JSON into BigQuery using the UI

I have to load JSON into BigQuery. As per the BigQuery documentation, I prepared my JSON in the correct format, i.e. newline-delimited with one JSON object per row.
The JSON file has around 10 million rows, and while loading it I get the error below:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2165; errors: 1. Please look into the error stream for more details.
When I looked up line 2165, it is as follows:
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
When I load this single line by itself, it loads successfully. Kindly suggest what is incorrect here.
I'm loading this JSON from the BigQuery UI with the schema auto-detect option.
Sample records are as follows:
{"deviceId":"3c7a345dafcff93f","dow":"Tuesday","downloadFlag":"N","email":"psloper.ps#gmail.com","emailSha256":"1cebae8c35db32edcd35e746863fc65a04ac68f2f5b3350f2df477a86bfaa07d","event":"streamStop","firebaseUID":"AMFYYjsvZjauhCktJ5lUzZj0d3D2","halfHourFull":"21h2","liveFlag":"Y","localDate":"2018-02-06","localHalfHour":2,"login":"google","minutesSinceMidnight":1311,"quarterHourFull":"21q4","stationName":"hit 105","streamListenMethod":"Headphones","timestampLocal":"2018-02-06T21:51:40.216+10:00","timestampUTC":"2018-02-06T11:51:40.216Z"}
{"deviceId":"2f1a8c84c738b752","dow":"Wednesday","downloadFlag":"N","email":"kory.maxwell#icloud.com","emailSha256":"13348786c15bff95e4afb4968a9bdbe883b70206a737c02c89fc8215f2a4e101","event":"streamStop","facebookId":"1784054201892593","firebaseUID":"Tx1bHjP6dhaDB2nl2c7yi2KZHsq2","halfHourFull":"06h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"facebook","minutesSinceMidnight":384,"quarterHourFull":"06q2","stationName":"hit 105","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T06:24:44.533+10:00","timestampUTC":"2018-02-06T20:24:44.533Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStart","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":463,"quarterHourFull":"7q3","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:43:00.39+11:00","timestampUTC":"2018-02-06T20:43:00.39Z"}
{"deviceId":"AEFD39FC-B116-4063-0000-000000000000","dow":"Wednesday","event":"catchUpPause","facebookId":"379907925802180","firebaseUID":"vQPh9tbO3Yge88fpMyNUFzJO7dl1","halfHourFull":"7h2","liveFlag":"N","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":465,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"USBAudio","timestampLocal":"2018-02-07T07:45:08.524+11:00","timestampUTC":"2018-02-06T20:45:08.524Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStop","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":475,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:55:35.788+11:00","timestampUTC":"2018-02-06T20:55:35.788Z"}
{"deviceId":"AA1D685F-6BF6-B0DC-0000-000000000000","dow":"Wednesday","email":"lozza073#bigpond.com","emailSha256":"525db286e9a35c9f9f55db0ce338762eee02c51955ede6b35afb7e808581664f","event":"streamStart","facebookId":"10215879897177171","firebaseUID":"f2efT61sW5gHTfgEbtNfyaUKWaF3","halfHourFull":"7h2","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":2,"login":"facebook","minutesSinceMidnight":477,"quarterHourFull":"7q4","stationName":"Fox hit 101.9","streamListenMethod":"Speaker","timestampLocal":"2018-02-07T07:57:42.343+11:00","timestampUTC":"2018-02-06T20:57:42.343Z"}
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
Any help is greatly appreciated.

Well, look at that specific line 2165:
{"deviceId":"13231fd01222a28e","dow":"Wednesday","downloadFlag":"N","email":"clstone898#gmail.com","emailSha256":"1bdf11821f867799bde022ccb57a2e899f827c988b4275571ffd60279c863272","event":"streamStop","firebaseUID":"UDVC3hyQpBWLCnlhXhjAQBeI95Q2","halfHourFull":"08h1","liveFlag":"Y","localDate":"2018-02-07","localHalfHour":1,"login":"google","minutesSinceMidnight":497,"quarterHourFull":"08q2","stationName":"Fox hit 101.9","streamListenMethod":"BluetoothA2DPOutput","timestampLocal":"2018-02-07T08:017:04.679+11:00","timestampUTC":"2018-02-06T21:17:04.679Z"}
And specifically at:
"timestampLocal":"2018-02-07T08:017:04.679+11:00"
And the error message:
Couldn't convert value to timestamp: Could not parse
'2018-02-07T08:017:04.679+11:00' as a timestamp. Required format is
YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
So, if you change "T08:017:04.679" to "T08:17:04.679" (minute 17 instead of 017), then it works. :)
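Since the file has around 10 million rows, it may be worth scanning for any other malformed timestamps before re-loading. Below is a minimal sketch (not part of the original answer), assuming the file is newline-delimited JSON and that the fields to check are timestampLocal and timestampUTC; the file path is a placeholder, and Python's datetime parser is used as a rough sanity check rather than BigQuery's exact timestamp rules:

import json
from datetime import datetime

NDJSON_PATH = "events.json"  # placeholder path, not from the original post
TIMESTAMP_FIELDS = ["timestampLocal", "timestampUTC"]

def parseable(value):
    """Return True if the value is an ISO-8601 timestamp Python can parse."""
    try:
        # Replace a trailing 'Z' so datetime.fromisoformat accepts UTC values on older Pythons.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

with open(NDJSON_PATH, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        for field in TIMESTAMP_FIELDS:
            value = record.get(field)
            if value is not None and not parseable(value):
                print(f"line {line_no}: bad {field}: {value}")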

Related

Cannot identify the proper format for a JSON request body stored in a CSV file and used in a Karate scenario

I am having trouble identifying the proper format to store a JSON request body in a CSV file and then use the CSV file value in a scenario.
This works properly within a scenario:
And request '{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}'
However, when it is stored in the CSV file as follows (I've tried quite a number of other formatting variations):
'{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}',
and used in scenario as:
And request requestBody
my test returns a "javascript evaluation failed:" error containing the JSON above, followed by ":1:63 Missing close quote ^" at line number 1, column number 63.
Can you please identify the correct formatting or the usage error I am missing? Thanks
We just use a basic CSV library behind the scenes. I suggest you roll your own Java helper class that does whatever processing / pre-processing you need.
Do read this answer as well: https://stackoverflow.com/a/54593057/143475
I can't make sense of your JSON, but if you are trying to fit JSON into CSV, sorry - that's not a good idea. See this answer: https://stackoverflow.com/a/62449166/143475

Loading JSON files into BigQuery

I am trying to load a JSON file into BigQuery using the bq load command
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON project_abd:ds.online_data gs://online_data/file.json
One of the key-value pairs in the JSON file looks like this:
"taxIdentifier":"T"
The bq load fails with the message:
Error while reading data, error message: JSON parsing error in row starting at position 713452: Could not convert value to boolean. Field: taxIdentifier; Value: T
(The JSON is really huge, hence I can't paste it here.)
I am really confused as to why autodetect is treating the value T as a boolean. I have tried all combinations of creating the table with the STRING data type and then loading it, but with autodetect it errors out with "changed type from STRING to BOOLEAN"; if I do not use autodetect, the load succeeds.
I have to use the autodetect feature, since the JSON is the result of an API call and the columns may increase or decrease.
Any idea why the value T behaves this way, and how to get around it?
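For reference, the question notes that the load succeeds when autodetect is disabled. Here is a minimal sketch of that path with the google-cloud-bigquery Python client, reusing the project, dataset, and GCS path from the bq command above and assuming taxIdentifier should stay a STRING; the remaining schema fields are omitted:

from google.cloud import bigquery

client = bigquery.Client(project="project_abd")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=False,
    # An explicit schema keeps single-character values such as "T" as strings.
    schema=[
        bigquery.SchemaField("taxIdentifier", "STRING"),
        # ... the remaining fields would be declared here ...
    ],
)

load_job = client.load_table_from_uri(
    "gs://online_data/file.json",
    "project_abd.ds.online_data",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish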

Spark Structured Streaming error: pyspark.sql.utils.StreamingQueryException: 'assertion failed: Invalid batch'

I have a Spark Structured Streaming application that reads JSON data from S3, does some transformations, and writes it back to S3.
While running the application, the job sometimes errors out and re-attempts (without any visible loss or data corruption, so everything seems fine), but the error message provided is not very descriptive.
Below is the error message:
pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: _ra_guest_gid#1883,_ra_sess_ts#1884,_ra_evt_ts#1885,event#1886,brand#1887,category#1888,funding_daysRemaining#1889,funding_dollarsRemaining#1890,funding_goal#1891,funding_totalBackers#1892L,funding_totalFunded#1893,id#1894,name#1895,price#1896,projectInfo_memberExclusive#1897,projectInfo_memberExclusiveHoursRemaining#1898,projectInfo_numberOfEpisodes#1899,projectInfo_projectState#1900,variant#1901 != _ra_guest_gid#2627,_ra_sess_ts#2628,_
My guess is this may have something to do with column mismatches, where either:
1) the incoming JSON record does not conform to the schema, or
2) the data type of a field in the incoming JSON record does not match the data type provided in the schema.
But I'm not sure how to pinpoint which record or which particular field causes the error.
Any help or suggestions on what the error means, or on how I could log the error in a better way, would be appreciated.
Thanks
I think I have figured out the issue; it is not related to a schema mismatch.
What was happening in my case is that I have two streaming operations running in parallel:
1) reading raw incoming data from an S3 bucket, doing some operations on it, and writing it back to S3 in output folder 'a'
2) reading the processed streaming data from folder 'a' (step 1), again doing some operations, and writing it back to S3 in output folder 'b'
As per my observations, if I run the above steps individually everything works fine, but if I run them together I get the error
'pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: '
so I think it has trouble when it tries to read from and write to the same location, i.e. when the destination of one stream is the source of another stream.
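For context, here is a minimal PySpark sketch of the two parallel streams described above, where the second stream's source is the first stream's sink (the layout identified as problematic); the bucket names, paths, and schema are placeholders, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("two-stage-streams").getOrCreate()

# Placeholder schema; the real job would declare all expected JSON fields.
schema = StructType([StructField("event", StringType()), StructField("brand", StringType())])

# Stream 1: raw JSON from S3 -> transform -> output folder 'a'
raw = spark.readStream.schema(schema).json("s3://bucket/raw/")
stage_a = (raw.select("event", "brand")
              .writeStream.format("json")
              .option("checkpointLocation", "s3://bucket/checkpoints/a/")
              .start("s3://bucket/a/"))

# Stream 2: reads the output of stream 1 -> transform -> output folder 'b'
processed = spark.readStream.schema(schema).json("s3://bucket/a/")
stage_b = (processed.writeStream.format("json")
                    .option("checkpointLocation", "s3://bucket/checkpoints/b/")
                    .start("s3://bucket/b/"))

spark.streams.awaitAnyTermination()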

NiFi - ConvertCSVtoAVRO - how to capture the failed records?

When converting CSV to AVRO I would like to output all the rejections to a file (let's say error.csv).
A rejection is usually caused by a wrong data type - e.g. when a "string" value appears in a "long" field.
I am trying to do it using the incompatible output; however, instead of saving only the rows that failed to convert (2 in the example below), it saves the whole CSV file. Is it possible to filter out only the records that failed to convert? (Does NiFi add some markers to these records, etc.?)
Both the RouteOnAttribute and RouteOnContent processors route whole files. Does the "incompatible" leg of the flow somehow mark individual records with something like an "error" attribute that is available after splitting the file into rows? I cannot find this in any doc.
I recommend using a SplitText processor upstream of ConvertCSVToAvro, if you can, so you are only converting one record at a time. You will also have a clear context for what the errors attribute refers to on any flowfiles sent to the incompatible output.
Sending the entire failed file to the incompatible relationship appears to be a purposeful choice. I assume it may be necessary if the CSV file is not well formed, especially with respect to records being neatly contained on one line (or properly escaped). If your data violates this assumption, SplitText might make things worse by creating a fragmented set of failed lines.

Twitter User posts, retrieved with smappR and stored in JSON format, are not being read in to R

I am using the smappR package to retrieve Twitter user posts, specifically the getTimeline() function.
However, the problem is that the retrieved data, which has been stored in JSON format, is not subsequently being read back in as JSON by R.
The image below shows the command and the corresponding error.
I was wondering if there is any other way to read the files back into R for further processing.
Any help will be appreciated.
Edit 1: Funnily enough, the file does not appear to be read in even when I attempt the same in Python (2.7).
The Python code is as follows -
import json

with open('C:/Users/ABC/Downloads/twitter/profile/bethguido3.JSON') as data_file:
    data = json.load(data_file)
The error that appeared is -
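One hedged thing to check, assuming the file written by getTimeline() turns out to be newline-delimited JSON with one tweet object per line (an assumption, not confirmed by the post): json.load expects a single JSON document, so a per-line parse like the sketch below, reusing the path from the question, can reveal whether that is the issue.

import json

# Assumption: one JSON object per line (newline-delimited JSON).
tweets = []
with open('C:/Users/ABC/Downloads/twitter/profile/bethguido3.JSON') as data_file:
    for line in data_file:
        line = line.strip()
        if line:  # skip blank lines
            tweets.append(json.loads(line))

print(len(tweets))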