Apache Spark: parsing JSON with split records - json

As far as I know, Apache Spark requires a JSON file to have one record in exactly one string (one line). I have a JSON file whose records are split across lines by field, like this:
{"id": 123,
"name": "Aaron",
"city": {
"id" : 1,
"title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
"id" : 2,
"title": "Paris"
}}
{...many more lines
...}
How can I parse it using Spark? Do I need a preprocessor, or can I provide a custom splitter?

Spark splits input by newlines to distinguish records. This means that when using the standard JSON reader you need to have one record per line.
You can convert by doing something like in this answer: https://stackoverflow.com/a/30452120/1547734
The basic idea would be to read the input with wholeTextFiles, feed each file to a JSON parser, and flatMap the results.
Of course this assumes the files are small enough to fit in memory and be parsed one at a time. Otherwise you would need more complicated solutions.
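For illustration, here is a minimal PySpark sketch of that wholeTextFiles + flatMap idea; the input path data/*.json and the helper parse_concatenated are assumptions, and every file is assumed to fit in memory.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-json").getOrCreate()
sc = spark.sparkContext

def parse_concatenated(text):
    # Decode one JSON object at a time from a file that holds several
    # multi-line objects written back to back.
    decoder = json.JSONDecoder()
    records, idx, text = [], 0, text.strip()
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)
        records.append(obj)
        while end < len(text) and text[end].isspace():
            end += 1  # skip whitespace between objects
        idx = end
    return records

# wholeTextFiles yields (path, full file contents); flatMap flattens the
# per-file lists into one RDD of Python dicts.
records = (sc.wholeTextFiles("data/*.json")
             .flatMap(lambda kv: parse_concatenated(kv[1])))

From there you can turn the dicts back into a DataFrame, for example with spark.read.json(records.map(json.dumps)).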

Related

Decode resulting json and save into db - Django + Postgres

I have a model like this:
class MyClass(models.Model):
    typea = models.CharField(max_length=128)
    typeb = models.CharField(max_length=128)
If for example, the resulting json from an API is like this:
{
"count": 75,
"results": [
{
"typea": "This tipe",
"typeb": "A B type",
"jetsons": [],
"data": [
"https://myurl.com/api/data/2/",
],
"created": "2014-12-15T12:31:42.547000Z",
"edited": "2017-04-19T10:56:06.685592Z",
},
I need to parse this result and save typea and typeb into the database; I'm kind of confused about how to do this.
I mean, there is the JSONField in Django, but I don't think that will work for me, since I need to save some specific nested strings from the JSON dict.
Any example or idea on how to achieve this?
My confusion is about how to parse this and "extract" the data I need for my specific fields.
Thank You
You can always do an import json and use json.load(json_file_handle) to create a dictionary and extract the values you need. All you need is to open the .json file (you can use with open("file.json", "r") as json_file_handle) and load the data.
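As a hedged sketch (the file name results.json and the import path myapp.models are assumptions about your project), that could look like:

import json

from myapp.models import MyClass

with open("results.json", "r") as json_file_handle:
    payload = json.load(json_file_handle)

for item in payload["results"]:
    # Keep only the nested strings we care about and persist them.
    MyClass.objects.create(
        typea=item["typea"],
        typeb=item["typeb"],
    )

If the data comes straight from an API response rather than a file, json.loads(response.text) (or response.json() with requests) gives you the same dictionary.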

Process events from Event Hub using pyspark - Databricks

I have a Mongo change stream (a pymongo application) that is continuously getting the changes in collections. These change documents as received by the program are sent to Azure Event Hubs. A Spark notebook has to read the documents as they get into Event Hub and do Schema matching (match the fields in the document with spark table columns) with the spark table for that collection. If there are fewer fields in the document than in the table, columns have to be added with Null.
I am reading the events from Event Hub like below:
spark.readStream.format("eventhubs").options(**config).load()
As said in the documentation, the original message is in the 'body' column of the dataframe, which I am converting to a string. So now I have the Mongo document as a JSON string in a streaming dataframe. I am facing the issues below.
I need to extract the individual fields in the Mongo document. This is needed to compare which fields are present in the Spark table and which are not in the Mongo document. I saw a function called get_json_object(col, path). This essentially returns a string again, and I cannot individually select all the columns.
If from_json can be used to convert the JSON string to a struct type, I cannot specify the schema, because we have close to 70 collections (and a corresponding number of Spark tables), each sending Mongo docs with anywhere from 10 to 450 fields.
If I can convert the JSON string in the streaming dataframe to a JSON object whose schema can be inferred by the dataframe (something like what read.json can do), I can use the SQL * representation to extract the individual columns, do a few manipulations and then save the final dataframe to the Spark table. Is it possible to do that? What is the mistake I am making?
Note: Streaming DataFrames don't support the collect() method, so I cannot individually extract the JSON string from the underlying RDD and do the necessary column comparisons. Using Spark 2.4 & Python in Azure Databricks environment 4.3.
Below is the sample data I get in my notebook after reading the events from Event Hub and casting the body to a string.
{
"documentKey": "5ab2cbd747f8b2e33e1f5527",
"collection": "configurations",
"operationType": "replace",
"fullDocument": {
"_id": "5ab2cbd747f8b2e33e1f5527",
"app": "7NOW",
"type": "global",
"version": "1.0",
"country": "US",
"created_date": "2018-02-14T18:34:13.376Z",
"created_by": "Vikram SSS",
"last_modified_date": "2018-07-01T04:00:00.000Z",
"last_modified_by": "Vikram Ganta",
"last_modified_comments": "Added new property in show_banners feature",
"is_active": true,
"configurations": [
{
"feature": "tip",
"properties": [
{
"id": "tip_mode",
"name": "Delivery Tip Mode",
"description": "Tip mode switches the display of tip options between percentage and amount in the customer app",
"options": [
"amount",
"percentage"
],
"default_value": "tip_percentage",
"current_value": "tip_percentage",
"mode": "multiple or single"
},
{
"id": "tip_amount",
"name": "Tip Amounts",
"description": "List of possible tip amount values",
"default_value": 0,
"options": [
{
"display": "No Tip",
"value": 0
}
]
}
]
}
]
}
}
I would like to separate out and extract fullDocument in the sample above. When I use get_json_object, I get fullDocument in another streaming dataframe as a JSON string, not as an object. As you can see, there are some array types in fullDocument which I can explode (the documentation says explode is supported in streaming DFs, but I haven't tried it), but there are also some objects (struct types) from which I would like to extract the individual fields. I cannot use the SQL '*' notation because what get_json_object returns is a string, not the object itself.
It does seem that such a varied JSON schema is better handled by specifying the schema explicitly. So my takeaway is that, in a streaming environment where the incoming stream has very different schemas, it is always better to specify the schema. I am therefore proceeding with get_json_object and from_json, reading the schema from a file.
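A minimal sketch of that from_json route, assuming the per-collection schema is stored as a JSON-serialized StructType in a file such as schemas/configurations.json (the file name and layout are assumptions), and that config is the Event Hubs connection dictionary from above:

import json
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# Load the schema for this collection (assumed to be the JSON produced by
# StructType.jsonValue()).
with open("schemas/configurations.json") as fh:
    doc_schema = StructType.fromJson(json.load(fh))

raw = (spark.readStream
            .format("eventhubs")
            .options(**config)                     # Event Hubs connection config
            .load()
            .select(F.col("body").cast("string").alias("json_str")))

parsed = (raw.select(F.from_json("json_str", doc_schema).alias("doc"))
             .select("doc.fullDocument.*"))        # struct fields become columns

Once the struct fields of fullDocument are ordinary columns, any that are missing from a given document can be added as nulls (for example with F.lit(None).cast(...)) before writing to the Spark table.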

How to read invalid JSON format from Amazon Kinesis Firehose

I've got this most horrible scenario in which I want to read the files that Kinesis Firehose creates on our S3.
Kinesis Firehose creates files that don't have every JSON object on a new line; they are simply files of concatenated JSON objects.
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
Now, this is a scenario not supported by a normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only seems to work in scenarios without nested brackets.
Does anybody know an working/better/more elegant way to solve this problem?
The one in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for quoted and unquoted JSON.
Besides this, I improved it a bit to keep it simpler, as you can have integer as well as normal values; anything within string literals will be ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1
Modify the input to be one large JSON array, then parse that:
require "json"

input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
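If you are doing this from Python rather than Ruby, a hedged equivalent of the same wrap-into-an-array trick (the file name input.json is an assumption) looks like this, with the same caveat about "}" followed by "{" inside keys or values:

import json
import re

with open("input.json") as fh:
    text = fh.read().rstrip()

# Insert commas between back-to-back objects, wrap everything in [], then parse.
data = json.loads("[" + re.sub(r"\}\s*\{", "},{", text) + "]")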
As you concluded in your most recent comment, put_record_batch in Firehose requires you to manually put delimiters in your records so they can be easily parsed by the consumers. You can add a newline or some special character that is used solely for parsing, % for example, which should never appear in your payload.
Another option would be sending record by record. This is only viable if your use case does not require high throughput. For that you may loop over every record and load it as a stringified data blob. If done in Python, we would have an iterable records holding all our JSON objects.
import json

import boto3

def send_to_firehose(records):
    firehose_client = boto3.client('firehose')
    for record in records:
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName=<your stream>,
            Record={'Data': data}
        )
Firehose by default buffers the data before sending it to your bucket and it should end up with something like this. This will be easy to parse and load in memory in your preferred data structure.
[
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
},
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
}
]

Split JSON into two individual JSON objects using Nifi

I have a JSON like
{
"campaign_key": 316,
"client_key": 127,
"cpn_mid_counter": "24",
"cpn_name": "Bopal",
"cpn_status": "Active",
"clt_name": "Bopal Ventures",
"clt_status": "Active"
}
Expected output
1st JSON :
{
"campaign_key": 316,
"client_key": 127,
"cpn_mid_counter": "24",
"cpn_name": "Bopal",
"cpn_status": "Active"
}
2nd JSON:
{
"clt_name": "Bopal Ventures",
"clt_status": "Active"
}
How do I achieve this using NiFi? Thanks.
You can do what 'user' has said. The not-so-good thing about that approach is that, if your number of fields increases, you have to add that many JsonPath expression attributes to EvaluateJsonPath and subsequently that many attributes in ReplaceText.
Instead, what I'm proposing is: use QueryRecord with the Record Reader set to JsonTreeReader and the Record Writer set to JsonRecordSetWriter, and add two dynamic relationship properties as follows:
json1 : SELECT campaign_key, client_key, cpn_mid_counter, cpn_name, cpn_status FROM FLOWFILE
json2 : SELECT clt_name, clt_status FROM FLOWFILE
This approach takes care of reading and writing the output in JSON format. Plus, if you want to add more fields, you just have to add the field names to the SQL SELECT statement.
The QueryRecord processor lets you execute a SQL query against the FlowFile content. More details on this processor can be found here.
Karthik,
Use the EvaluateJsonPath processor to extract all of those JSON values by their keys.
Example: $.campaign_key gets the campaign key value and $.clt_name gets the clt name.
Like the above, you can get all the JSON values.
Then use the ReplaceText processor to convert the single JSON into two JSONs:
{"Campaign_Key":${CampaignKey},...etc}
{"Clt_name":${clt_name}}
It will convert the single JSON into two JSONs.
Hope this is helpful, and let me know if you have issues.

How to parse JSON file with no SparkSQL?

I have the following JSON file.
{
"reviewerID": "ABC1234",
"productID": "ABCDEF",
"reviewText": "GOOD!",
"rating": 5.0,
},
{
"reviewerID": "ABC5678",
"productID": "GFMKDS",
"reviewText": "Not bad!",
"rating": 3.0,
}
I want to parse it without Spark SQL, using a JSON parser.
The result of the parse that I want is a text file:
ABC1234::ABCDEF::5.0
ABC5678::GFMKDS::3.0
How can I parse the JSON file using a JSON parser in Spark with Scala?
tl;dr Spark SQL supports JSON in the format of one JSON document per file or per line. If you'd like to parse multi-line JSONs that appear together in a single file, you'd have to write your own Spark support, as it's not currently possible.
A possible solution is to ask the "writer" (the process that writes the files) to be nicer and save one JSON per file; that would make your life much sweeter.
If that does not give you much, you'd have to use mapPartitions transformation with your parser and somehow do the parsing yourself.
val input: RDD[String] = ??? // ... load your JSONs here
val jsons = input.mapPartitions { jsonsInPartition =>
  ??? // ... use your JSON parser here, returning an Iterator of parsed records
}
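As a rough sketch of doing the parsing yourself, here is a hedged PySpark version (the question asks for Scala, but the shape is the same). It uses wholeTextFiles rather than a line-based RDD because the records span multiple lines; the paths reviews/*.json and reviews_out, plus the trailing-comma cleanup, are assumptions about the sample above:

import json
import re
from pyspark import SparkContext

sc = SparkContext(appName="reviews-to-text")

def to_lines(text):
    # Naive cleanup: drop trailing commas inside objects (assumes ",}" never
    # occurs inside a string value), then wrap the objects in a JSON array.
    cleaned = re.sub(r",\s*}", "}", text.strip()).rstrip(",")
    for r in json.loads("[" + cleaned + "]"):
        yield "{}::{}::{}".format(r["reviewerID"], r["productID"], r["rating"])

lines = (sc.wholeTextFiles("reviews/*.json")   # (path, full file contents)
           .flatMap(lambda kv: to_lines(kv[1])))
lines.saveAsTextFile("reviews_out")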