Load entire JSON file into a single column of a BigQuery table

I am trying to load a JSON file with more than 100 columns into BigQuery. Some of these columns have special characters in their names, e.g. dollar sign ($) and period (.). Row/record content also varies, meaning not every column is present in every row/record, which is perfectly valid JSON.
I have searched similar posts:
How to manage/handle schema changes while loading JSON file into BigQuery table
BigQuery: Create column of JSON datatype
which suggest loading the data into a single "STRING" column as CSV first and then parsing out the columns into the target table with the JSON_EXTRACT() function. So I created a BigQuery table with the following schema definition:
[
{
"name": "data",
"type": "STRING"
}
]
Then I ran the following CLI command:
bq load --source_format=CSV test.bq_load_test ./data_file.json ./bq_load_test_schema.json
which results in the following error:
Error Message:
BigQuery error in load operation: Error processing job : Error
while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the
errors[] collection for more details.
Failure details:
- Error while reading data, error message: Too many values in row
starting at position: 0.
Here's the data file layout:
root
|-- $insert_id: string (nullable = true)
|-- $schema: long (nullable = true)
|-- adid: string (nullable = true)
|-- vendor_attribution_ids: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vendor_event_type: string (nullable = true)
|-- vendor_id: long (nullable = true)
|-- app: long (nullable = true)
|-- city: string (nullable = true)
|-- client_event_time: string (nullable = true)
|-- client_upload_time: string (nullable = true)
|-- country: string (nullable = true)
|-- data: struct (nullable = true)
| |-- first_event: boolean (nullable = true)
|-- device_brand: string (nullable = true)
|-- device_carrier: string (nullable = true)
|-- device_family: string (nullable = true)
|-- device_id: string (nullable = true)
|-- device_manufacturer: string (nullable = true)
|-- device_model: string (nullable = true)
|-- device_type: string (nullable = true)
|-- dma: string (nullable = true)
|-- event_id: long (nullable = true)
|-- event_properties: struct (nullable = true)
| |-- U.vf: string (nullable = true)
| |-- app.name: string (nullable = true)
| |-- app.pillar: string (nullable = true)
| |-- app.version: string (nullable = true)
| |-- categories: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- cmfAppId: string (nullable = true)
| |-- content.area: string (nullable = true)
| |-- content.authenticated: boolean (nullable = true)
| |-- content.authors: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.cms: string (nullable = true)
| |-- content.id: string (nullable = true)
| |-- content.keywords.collections: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.company: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.location: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.organization: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.person: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.subject: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.keywords.tag: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- content.media.audiovideo: string (nullable = true)
| |-- content.media.contentarea: string (nullable = true)
| |-- content.media.duration: long (nullable = true)
| |-- content.media.episodenumber: string (nullable = true)
| |-- content.media.genre: string (nullable = true)
| |-- content.media.length: double (nullable = true)
| |-- content.media.liveondemand: string (nullable = true)
| |-- content.media.region: string (nullable = true)
| |-- content.media.seasonnumber: string (nullable = true)
| |-- content.media.show: string (nullable = true)
| |-- content.media.sport: string (nullable = true)
| |-- content.media.streamtitle: string (nullable = true)
| |-- content.media.type: string (nullable = true)
| |-- content.originaltitle: string (nullable = true)
| |-- content.pubdate: long (nullable = true)
| |-- content.publishedtime: string (nullable = true)
| |-- content.subsection1: string (nullable = true)
| |-- content.subsection2: string (nullable = true)
| |-- content.subsection3: string (nullable = true)
| |-- content.subsection4: string (nullable = true)
| |-- content.tier: string (nullable = true)
| |-- content.title: string (nullable = true)
| |-- content.type: string (nullable = true)
| |-- content.updatedtime: string (nullable = true)
| |-- content.url: string (nullable = true)
| |-- custom.DNT: boolean (nullable = true)
| |-- custom.cookiesenabled: boolean (nullable = true)
| |-- custom.engine: string (nullable = true)
| |-- feature.name: string (nullable = true)
| |-- feature.position: string (nullable = true)
| |-- lastupdate: string (nullable = true)
| |-- pubdate: string (nullable = true)
| |-- referrer.campaign: string (nullable = true)
| |-- referrer.url: string (nullable = true)
| |-- syndicate: string (nullable = true)
| |-- user.interests.explicit.no: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- user.interests.explicit.yes: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- user.tier: string (nullable = true)
| |-- userTier: string (nullable = true)
|-- event_time: string (nullable = true)
|-- event_type: string (nullable = true)
|-- idfa: string (nullable = true)
|-- ip_address: string (nullable = true)
|-- is_attribution_event: boolean (nullable = true)
|-- language: string (nullable = true)
|-- library: string (nullable = true)
|-- location_lat: double (nullable = true)
|-- location_lng: double (nullable = true)
|-- os_name: string (nullable = true)
|-- os_version: string (nullable = true)
|-- paying: string (nullable = true)
|-- platform: string (nullable = true)
|-- processed_time: string (nullable = true)
|-- region: string (nullable = true)
|-- sample_rate: string (nullable = true)
|-- server_received_time: string (nullable = true)
|-- server_upload_time: string (nullable = true)
|-- session_id: long (nullable = true)
|-- start_version: string (nullable = true)
|-- user_creation_time: string (nullable = true)
|-- user_id: string (nullable = true)
|-- user_properties: struct (nullable = true)
| |-- internal.userID: string (nullable = true)
| |-- internal.userTier: string (nullable = true)
| |-- experiment.id: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- experiment.variant: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- location.news: string (nullable = true)
| |-- location.radio: string (nullable = true)
| |-- location.region: string (nullable = true)
| |-- location.tv: string (nullable = true)
| |-- location.weather: string (nullable = true)
| |-- referrer.campaign: string (nullable = true)
| |-- user.id: string (nullable = true)
| |-- user.id.internalvisitor: string (nullable = true)
| |-- user.tier: string (nullable = true)
|-- uuid: string (nullable = true)
|-- version_name: string (nullable = true)
|-- feature_origin: string (nullable = true)
Here's a snippet of the data file:
{"server_received_time":"2019-01-17 15:00:00.482000","app":161,"device_carrier":null,"$schema":12,"city":"Caro","user_id":null,"uuid":"9018","event_time":"2019-01-17 15:00:00.045000","platform":"Web","os_version":"49","vendor_id":711,"processed_time":"2019-01-17 15:00:00.817195","user_creation_time":"2018-11-01 19:16:34.971000","version_name":null,"ip_address":null,"paying":null,"dma":null,"group_properties":{},"user_properties":{"location.radio":"ca","vendor.userTier":"free","vendor.userID":"a989","user.id":"a989","user.tier":"free","location.region":"ca"},"client_upload_time":"2019-01-17 15:00:00.424000","$insert_id":"e8410","event_type":"LOADED","library":"amp\/4.5.2","vendor_attribution_ids":null,"device_type":"Mac","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.493000","event_id":64,"location_lat":null,"os_name":"Chrome","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.authenticated":false,"content.subsection1":"regions","custom.DNT":true,"content.subsection2":"ca","referrer.url":"","content.url":"","content.type":"index","content.title":"","custom.cookiesenabled":true,"app.pillar":"feed","content.area":"news","app.name":"oc"},"data":{},"device_id":"","language":"English","device_model":"Mac","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":15,"device_family":"Mac","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.987000"}
{"server_received_time":"2019-01-17 15:00:00.913000","app":161,"device_carrier":null,"$schema":12,"city":"Fo","user_id":null,"uuid":"9052","event_time":"2019-01-17 15:00:00.566000","platform":"Web","os_version":"71","vendor_id":797,"processed_time":"2019-01-17 15:00:01.301936","user_creation_time":"2019-01-17 15:00:00.566000","version_name":null,"ip_address":null,"paying":null,"dma":"CO","group_properties":{},"user_properties":{"user.tier":"free"},"client_upload_time":"2019-01-17 15:00:00.157000","$insert_id":"69ae","event_type":"START WEB SESSION","library":"amp\/4.5.2","vendor_attribution_ids":null,"device_type":"Android","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.925000","event_id":1,"location_lat":null,"os_name":"Chrome Mobile","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.subsection3":"home","content.subsection2":"archives","content.title":"","content.keywords.subject":["Lifestyle\/Recreation and leisure\/Outdoor recreation\/Boating","Lifestyle\/Relationships\/Couples","General news\/Weather","Oddities"],"content.publishedtime":154687,"app.name":"oc","referrer.url":"","content.subsection1":"archives","content.url":"","content.authenticated":false,"content.keywords.location":["Ot"],"content.originaltitle":"","content.type":"story","content.authors":["Archives"],"app.pillar":"feed","content.area":"news","content.id":"1.49","content.updatedtime":1546878600538,"content.keywords.tag":["24 1","boat house","Ot","Rockcliffe","River","m"],"content.keywords.person":["Ber","Shi","Jea","Jean\u00e9tien"]},"data":{"first_event":true},"device_id":"","language":"English","device_model":"Android","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":15477,"device_family":"Android","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.810000"}
{"server_received_time":"2019-01-17 15:00:00.913000","app":16,"device_carrier":null,"$schema":12,"city":"","user_id":null,"uuid":"905","event_time":"2019-01-17 15:00:00.574000","platform":"Web","os_version":"71","vendor_id":7973,"processed_time":"2019-01-17 15:00:01.301957","user_creation_time":"2019-01-17 15:00:00.566000","version_name":null,"ip_address":null,"paying":null,"dma":"DCO","group_properties":{},"user_properties":{"user.tier":"free"},"client_upload_time":"2019-01-17 15:00:00.157000","$insert_id":"d045","event_type":"LOADED","library":"am-js\/4.5.2","vendor_attribution_ids":null,"device_type":"Android","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.925000","event_id":2,"location_lat":null,"os_name":"Chrome Mobile","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.subsection3":"home","content.subsection2":"archives","content.subsection1":"archives","content.keywords.subject":["Lifestyle\/Recreation and leisure\/Outdoor recreation\/Boating","Lifestyle\/Relationships\/Couples","General news\/Weather","Oddities"],"content.type":"story","content.keywords.location":["Ot"],"app.pillar":"feed","app.name":"oc","content.authenticated":false,"custom.DNT":false,"content.id":"1.4","content.keywords.person":["Ber","Shi","Jea","Je\u00e9tien"],"content.title":"","content.url":"","content.originaltitle":"","custom.cookiesenabled":true,"content.authors":["Archives"],"content.publishedtime":1546878600538,"referrer.url":"","content.area":"news","content.updatedtime":1546878600538,"content.keywords.tag":["24 1","boat house","O","Rockcliffe","River","pr"]},"data":{},"device_id":"","language":"English","device_model":"Android","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":1547737199081,"device_family":"Android","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.818000"}
Any input? What am I missing here?

You have to specify the right delimiter for the CSV file. The default field delimiter (the -F flag) is ',' and your data contains ',', so every row is interpreted as multiple fields. I tested with your data and this worked for me:
bq load --source_format=CSV -F ';' test.bq_load_test ./data_file.json
Note that ';' worked only because the snippet data does not contain ';'. For the full file, pick a delimiter character that never appears anywhere in the data.
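Before running the load, it may help to check which delimiter is actually safe for the whole file. The following is a minimal sketch in plain Python, not part of the bq tooling: the function name pick_unused_delimiter and the candidate list are assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence

# Candidate single-character delimiters, in order of preference.
# This list is an illustrative assumption, not something bq prescribes.
CANDIDATES = [";", "|", "\t"]


def pick_unused_delimiter(
    lines: Iterable[str], candidates: Sequence[str] = CANDIDATES
) -> Optional[str]:
    """Return the first candidate that occurs in none of the lines, else None."""
    remaining = list(candidates)
    for line in lines:
        # Drop every candidate that shows up in this line.
        remaining = [c for c in remaining if c not in line]
        if not remaining:
            return None
    return remaining[0]


# Two shortened records standing in for the real data file.
sample = [
    '{"city":"Caro","event_type":"LOADED"}',
    '{"city":"Fo","event_type":"START WEB SESSION"}',
]
delimiter = pick_unused_delimiter(sample)
print(repr(delimiter))
```

For a real file you would pass an open file object instead of the sample list, and then use the returned character as the -F value in the bq load command above.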

Related

Extract fields from json in Pyspark

I am trying to extract only itineraries.element and validatingAirlineCodes, and then build a JSON containing only these two fields in PySpark.
|-- id: string (nullable = true)
|-- instantTicketingRequired: boolean (nullable = true)
|-- itineraries: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- lastTicketingDate: string (nullable = true)
|-- nonHomogeneous: boolean (nullable = true)
|-- numberOfBookableSeats: long (nullable = true)
|-- oneWay: boolean (nullable = true)
|-- price: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pricingOptions: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
|-- source: string (nullable = true)
|-- travelerPricings: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- type: string (nullable = true)
|-- validatingAirlineCodes: array (nullable = true)
| |-- element: string (containsNull = true)
I tried using df.select() but can't select the fields I want. What should I do?
Something to bear in mind is that what you are trying to extract (element) is an array of maps. To work with each element on its own, you first need to explode the array and then extract the element.
Assuming your DataFrame is already stored as df, you can explode itineraries as in the example below. This creates one row in your DataFrame for each element of the array, which should make it straightforward to extract the element afterwards.
from pyspark.sql.functions import explode
df.select(explode(df.itineraries))
Extra reading for context:
Documentation page on Spark By Examples.
https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/
Documentation on explode in pySpark is here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html
Hope this helps!
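To make the effect of explode concrete, here is a plain-Python sketch that mimics its row-per-element behaviour on ordinary dictionaries. It is not PySpark code; the helper explode_rows, the sample record, and the "duration" key inside each itinerary are assumptions made up for illustration.

```python
import json
from typing import Any, Dict, Iterator, List


def explode_rows(rows: List[Dict[str, Any]], array_field: str) -> Iterator[Dict[str, Any]]:
    """Yield one output row per element of each row's array_field.

    Rows whose array is missing, None, or empty produce no output, which
    matches Spark's explode (as opposed to explode_outer).
    """
    for row in rows:
        for element in row.get(array_field) or []:
            out = dict(row)          # copy so the input row is not mutated
            out[array_field] = element
            yield out


# Hypothetical record; the keys inside each itinerary map are made up.
records = [
    {
        "id": "1",
        "itineraries": [{"duration": "PT2H"}, {"duration": "PT5H"}],
        "validatingAirlineCodes": ["AA"],
    }
]

# Keep only the two fields the question asks for.
wanted = ("itineraries", "validatingAirlineCodes")
exploded = [
    {key: row[key] for key in wanted}
    for row in explode_rows(records, "itineraries")
]
for row in exploded:
    print(json.dumps(row))
```

Each printed line is a JSON object holding one itinerary together with the airline codes of the record it came from, which is roughly the shape you would get in Spark by selecting both columns after the explode.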

Apache Spark (Scala): How do I grab a single element and sub-elements from a JSON RDD and store it in a new RDD?

I am importing some JSON data from Amazon S3 and storing that in an RDD:
val events_sep22 = spark.read.json("s3://firehose-json-events-stream/2019/09/22/*/*")
I then take a peek at the data's structure with printSchema():
scala> events_sep22.printSchema()
root
|-- data: struct (nullable = true)
| |-- amount: string (nullable = true)
| |-- createdAt: string (nullable = true)
| |-- percentage: string (nullable = true)
| |-- status: string (nullable = true)
|-- id: string (nullable = true)
|-- publishedAt: string (nullable = true)
How do I create a new RDD with just data and its sub-elements?
Use select.
events_sep22.select("data").printSchema()
root
|-- data: struct (nullable = true)
| |-- amount: string (nullable = true)
| |-- createdAt: string (nullable = true)
| |-- percentage: string (nullable = true)
| |-- status: string (nullable = true)
events_sep22.select("data.*").printSchema()
root
|-- amount: string (nullable = true)
|-- createdAt: string (nullable = true)
|-- percentage: string (nullable = true)
|-- status: string (nullable = true)

Convert Spark DF[string] into Json using scala

I have a DataFrame with one column of type String.
Sample string: the content below is stored in full as a single-line string inside the DataFrame.
[{"id":"2ac3a7d6-46d8-4903-8a92-000000000003","type":"Stemming","name":"Drill Cuttings","dateCreated":"2019-08-23T05:22:51.7050657Z","dateModified":"2019-08-23T05:22:51.7050657Z","dateDeleted":null,"isDeleted":false,"abbreviation":null,"supplierName":null,"shotPlusReference":null,"restrictedTo":[],"metadata":{},"documentLinks":[]},{"EType":"ProductSave"}]
I have tried the following: convert my DF to an RDD and read the RDD with spark.read.json.
val newRDD = MyDF.rdd.map(_.getString(0))
val ExpectedDF1 = spark.read.json(newRDD.toString())
ExpectedDF1.printSchema()
When printing the schema, the field from {"EType":"ProductSave"} is missing,
whereas the other fields are correct.
1) val newRDD = MyDF.rdd.map(_.getString(0))
val ExpectedDF1 = spark.read.json(newRDD.toString())
ExpectedDF1.printSchema()
2) val newRDD = MyDF.rdd.map(row=> (row.getAs[Json](1).toString))
val tempDF= spark.read.json(newRDD)
tempDF.printSchema()
3)
val JsonDF = MyDF.toJSON.toDF("value")
val convertedRDD = spark.read.json(JsonDF.rdd)
convertedRDD.printSchema()
All three approaches leave the last field out of the schema:
{"EType":"ProductSave"}
Actual schema returned
root
|-- abbreviation: string (nullable = true)
|-- dateCreated: string (nullable = true)
|-- dateDeleted: string (nullable = true)
|-- dateModified: string (nullable = true)
|-- density: double (nullable = true)
|-- displayColor: string (nullable = true)
|-- documentLinks: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- isDeleted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dateModified: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isDeleted: boolean (nullable = true)
| | |-- variants: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- isDeleted: boolean (nullable = true)
|-- restrictedTo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- shotPlusReference: string (nullable = true)
|-- supplierName: string (nullable = true)
|-- type: string (nullable = true)
Expected Schema
root
|-- EType: string (nullable = true)
|-- abbreviation: string (nullable = true)
|-- dateCreated: string (nullable = true)
|-- dateDeleted: string (nullable = true)
|-- dateModified: string (nullable = true)
|-- density: double (nullable = true)
|-- displayColor: string (nullable = true)
|-- documentLinks: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- isDeleted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dateModified: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isDeleted: boolean (nullable = true)
| | |-- variants: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- isDeleted: boolean (nullable = true)
|-- restrictedTo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- shotPlusReference: string (nullable = true)
|-- supplierName: string (nullable = true)
|-- type: string (nullable = true)

Dataframe write to JSON as single object

I have a dataframe that I am trying to write out to an S3 location as a JSON.
df.printSchema
root
|-- userId: string (nullable = false)
|-- firstName: string (nullable = false)
|-- address: string (nullable = true)
|-- Email: array (nullable = true)
| |-- element: string (containsNull = true)
|-- UserFoodFavourites: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foodName: string (nullable = true)
| | |-- isFavFood: boolean (nullable = false)
|-- UserGameFavourites: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- Department: string (nullable = false)
| | | |-- gameName: string (nullable = false)
Writing out dataframe to JSON:
df.repartition(1).write.option("mode","append").json("s3Location")
JSON output I get:
{"userId":111,"firstName":"first123","address":"xyz",
"Email":["def#gmail.com","abc#gmail.com"],
"UserFoodFavourites":[{"foodName":"food1","isFavFood":true},{"foodName":"food2","isFavFood":false}],
"UserGameFavourites":[[{"Department":"Outdoor","gameName":"O1"}],[{"Department":"Indoor","gameName":"I1"},{"Department":"Indoor","gameName":"I2"}]]}
{"userId":123,"firstName":"first123","address":"xyz",
"Email":["def#gmail.com","abc#gmail.com"],
"UserFoodFavourites":[{"foodName":"food1","isFavFood":true},{"foodName":"food2","isFavFood":false}],
"UserGameFavourites":[[{"Department":"Outdoor","gameName":"O1"}],[{"Department":"Indoor","gameName":"I1"},{"Department":"Indoor","gameName":"I2"}]]}
alias prettyjson='python -m json.tool'
However, pretty-printing this file with prettyjson does not work, because the records are written as multiple JSON objects, one per userId.
I am trying to write this out as a single JSON object with newline separation, so that I can run prettyjson on the file.
Any help is appreciated.
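For context, Spark's JSON writer emits one JSON object per line, while json.tool expects the whole input to be a single JSON document. One possible workaround is to post-process the part file outside Spark and wrap the records in a single array. The sketch below assumes each record sits on its own line; the function name json_lines_to_array and the shortened sample records are made up for illustration.

```python
import json
from typing import Iterable, List


def json_lines_to_array(lines: Iterable[str]) -> List[dict]:
    """Parse newline-delimited JSON records into a list, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Two shortened records, one object per line, standing in for the part file.
part_file_lines = [
    '{"userId":"111","firstName":"first123","Email":["a","b"]}',
    '{"userId":"123","firstName":"first123","Email":["a","b"]}',
]

records = json_lines_to_array(part_file_lines)

# A single JSON array is one valid document, so it can be pretty-printed.
pretty = json.dumps(records, indent=4)
print(pretty)
```

With a real part file you would pass the open file object instead of the sample list and write the resulting string to a new file, which prettyjson can then read as one document.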

How to parse the wiki infobox JSON with Scala Spark

I am trying to extract data from JSON that I got from the wiki API:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0
I was able to print its schema:
scala> data.printSchema
root
|-- batchcomplete: string (nullable = true)
|-- query: struct (nullable = true)
| |-- pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
| | | |-- ns: long (nullable = true)
| | | |-- pageid: long (nullable = true)
| | | |-- revisions: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- *: string (nullable = true)
| | | | | |-- contentformat: string (nullable = true)
| | | | | |-- contentmodel: string (nullable = true)
| | | |-- title: string (nullable = true)
I want to extract the data under the key "*", i.e. |-- *: string (nullable = true)
Please suggest a solution.
One problem is
pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
The number 28597189 is unique to every title.
First we need to read the key (28597189) from the parsed JSON dynamically, and then use it to extract the data from the Spark DataFrame, as below:
val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0)
println(s"Key Name : $keyName")
This gives you the key dynamically:
Key Name : 28597189
Then use it to extract the data:
import org.apache.spark.sql.functions.explode
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*")
revDf.printSchema()
Output:
root
|-- *: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Next, rename the column * to something easier to reference, such as star_column:
revDf = revDf.withColumnRenamed("*", "star_column")
revDf.printSchema()
Output:
root
|-- star_column: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Once we have the final DataFrame, call show:
revDf.show()
Output:
+--------------------+-------------+------------+
| star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...| text/x-wiki| wikitext|
+--------------------+-------------+------------+
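The same dynamic-key idea can be sketched without Spark, for example when the response is small enough to handle directly. The following plain-Python version is only an illustration: the response text is a trimmed-down stand-in for the API output, and the wikitext value is a placeholder.

```python
import json

# Trimmed-down stand-in for the API response; the "*" value is a placeholder.
response_text = """
{"batchcomplete": "",
 "query": {"pages": {"28597189": {
     "pageid": 28597189, "ns": 0, "title": "Rajanna",
     "revisions": [{"contentformat": "text/x-wiki",
                    "contentmodel": "wikitext",
                    "*": "{{EngvarB|date=...}}"}]}}}}
"""

response = json.loads(response_text)
pages = response["query"]["pages"]

# The page id is not known in advance, so take whichever key is present.
key_name = next(iter(pages))
print(f"Key Name : {key_name}")

# Collect the "*" field of every revision, mirroring explode + select.
star_values = [revision["*"] for revision in pages[key_name]["revisions"]]
print(star_values)
```

Taking the first key works here because the query asks for a single title; a request for several titles would return several page ids, and you would loop over pages.items() instead.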