I have a Mongo change stream (a pymongo application) that continuously receives the changes in collections. The change documents received by the program are sent to Azure Event Hubs. A Spark notebook has to read the documents as they arrive in Event Hubs and do schema matching (match the fields in the document with the Spark table columns) against the Spark table for that collection. If the document has fewer fields than the table, the missing columns have to be added with null values.
I am reading the events from Event Hub like below.
spark.readStream.format("eventhubs").options(**config).load()
As described in the documentation, the original message is in the 'body' column of the dataframe, which I am casting to a string. So I now have the Mongo document as a JSON string in a streaming dataframe. I am facing the issues below.
I need to extract the individual fields in the Mongo document. This is needed to compare which fields are present in the Spark table and which are missing from the Mongo document. I saw a function called get_json_object(col, path), but it essentially returns a string again, so I cannot select all the columns individually.
from_json can be used to convert the JSON string to a struct type, but I cannot specify the schema up front because we have close to 70 collections (and a corresponding number of Spark tables), each sending Mongo docs with anywhere from 10 to 450 fields.
If I could convert the JSON string in the streaming dataframe to a JSON object whose schema can be inferred by the dataframe (something like what read.json does), I could use the SQL * notation to extract the individual columns, do a few manipulations, and then save the final dataframe to the Spark table. Is it possible to do that? What mistake am I making?
Note: a streaming DF doesn't support the collect() method, so I cannot individually extract the JSON strings from the underlying RDD and do the necessary column comparisons. Using Spark 2.4 and Python in the Azure Databricks 4.3 environment.
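For reference, this is roughly how the stream is read and the body cast to a string (config here stands for the Event Hubs connection settings; the json_str alias is just illustrative):
from pyspark.sql.functions import col

stream_df = (spark.readStream
             .format("eventhubs")
             .options(**config)  # Event Hubs connection configuration
             .load()
             .select(col("body").cast("string").alias("json_str")))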
Below is the sample data I get in my notebook after reading the events from Event Hubs and casting the body to a string.
{
"documentKey": "5ab2cbd747f8b2e33e1f5527",
"collection": "configurations",
"operationType": "replace",
"fullDocument": {
"_id": "5ab2cbd747f8b2e33e1f5527",
"app": "7NOW",
"type": "global",
"version": "1.0",
"country": "US",
"created_date": "2018-02-14T18:34:13.376Z",
"created_by": "Vikram SSS",
"last_modified_date": "2018-07-01T04:00:00.000Z",
"last_modified_by": "Vikram Ganta",
"last_modified_comments": "Added new property in show_banners feature",
"is_active": true,
"configurations": [
{
"feature": "tip",
"properties": [
{
"id": "tip_mode",
"name": "Delivery Tip Mode",
"description": "Tip mode switches the display of tip options between percentage and amount in the customer app",
"options": [
"amount",
"percentage"
],
"default_value": "tip_percentage",
"current_value": "tip_percentage",
"mode": "multiple or single"
},
{
"id": "tip_amount",
"name": "Tip Amounts",
"description": "List of possible tip amount values",
"default_value": 0,
"options": [
{
"display": "No Tip",
"value": 0
}
]
}
]
}
]
}
}
I would like to separate out and extract fullDocument from the sample above. When I use get_json_object, I get fullDocument in another streaming dataframe as a JSON string and not as an object. As you can see, there are some array types in fullDocument which I can explode (the documentation says explode is supported in a streaming DF, but I haven't tried it), but there are also some objects (struct types) from which I would like to extract the individual fields. I cannot use the SQL '*' notation because what get_json_object returns is a string and not the object itself.
It has become clear that with JSON schemas this varied, it is better to specify the schema explicitly. So my takeaway is that, in a streaming environment where the incoming stream has very different schemas, it is always better to specify the schema. I am therefore proceeding with get_json_object and from_json, reading the schema from a file.
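A minimal sketch of that approach, assuming each collection's schema was exported once (for example with df.schema.json()) and saved to a file; the file path and the stream_df / json_str names from the read above are illustrative:
import json

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

# Rebuild the schema that was exported earlier with df.schema.json()
with open("/dbfs/schemas/configurations.json") as f:  # illustrative path
    doc_schema = StructType.fromJson(json.load(f))

parsed = (stream_df
          .select(from_json(col("json_str"), doc_schema).alias("doc"))
          .select("doc.fullDocument.*"))  # the individual fields become real columns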
Related
For context, I have an Azure Data Factory data flow reading from Cosmos DB, where the table has two "schemas". By this I mean one format has a field called "data" that is a JSON representation of the data I need. The other format also has a field called "data", but that field is just a compressed string, and I believe this is what causes the issue I'm having.
When the data flow reads the field from the source, the JSON gets turned into a non-JSON format through some form of serialization: when the projection is imported, the data field is read as a string and not as a complex object (example below). I suspect this is because the two "data" fields have the same name.
I unfortunately cannot change how the data is stored in Cosmos DB, so I cannot rename either of the fields.
Is there any way to prevent this? I would like to keep it in JSON format, with quotes and colons etc., instead of what I have below.
Example of data in CosmosDB:
{
"Name": "sample",
"ValueInfo": [
{
"Field1": "foo",
"Field2": "bar"
}
]
}
How it looks in ADF
{
Name=sample,
ValueInfo=[
{
Field1=foo,
Field2=bar
}
]
}
I have a large JSON file with many objects, each with many properties. The simplified structure looks like this:
"allGadgets": [
{
"Model Code": "nokia1",
"Top Category": "Mobile Phones",
"Category": "non-iPhone",
"Brand": "Nokia",
"Device": "1",
"Price": "£ 11"
},
{
"Model Code": "nokia2",
"Top Category": "Mobile Phones",
"Category": "non-iPhone",
"Brand": "Nokia",
"Device": "2",
"Price": "£ 17",
},
{
"Model Code": "nokia3",
"Top Category": "Mobile Phones",
"Category": "non-iPhone",
"Brand": "Nokia",
"Device": "3",
"Price": "£ 10",
}] ... plus a few hundreds more of different brands and models
I'm extracting from this json list of maps a list of Strings for a search panel for the user to look up their device. The Strings are made of two of the values from the json, i.e.: "${item['Brand']} - ${item['Device']}"
Once the user has selected the relevant model from the dropdown search panel, I need to use this string value to give them the price from the JSON file. The question is how to achieve that in Dart/Flutter. If it were HTML/CSS, I would have added an extra hidden field with the model code and/or the price itself and then just made it visible.
In Flutter/Dart, however, the search panel plugin I found only accepts Strings, which the user selects and which then have to be used to look up the corresponding price value in the JSON file.
Complicating the lookup is the fact that my Strings are now composed of two field values with spaces and a hyphen in between, so I would probably need to convert them back into the form they had before the string conversion and then use both for the lookup... which sounds quite convoluted...
Any thoughts on how to solve the above task would be welcome!
What I guess would help a lot is an example: looking up an object in a JSON file with many objects, using a String formed from two of the object's values. The user is presented with a subset of those objects but only sees a couple of fields from each. The user then effectively builds a query by selecting the String shown to them, based on those two fields. That String then has to be used to look up the object and read another value (the price) from that corresponding object...
Having decoded your json, you have a List of Maps. Make a new data structure which is a Map of Maps (i.e. Map<String, Map<String, dynamic>>). Populate the new Map by adding each member of the List, keyed by the brand/device name. Now you can directly look up the device details by that composite name.
// original is the List of Maps obtained from decoding the JSON
List<Map<String, dynamic>> original;
Map<String, Map<String, dynamic>> data = {};
original.forEach((item) {
  String brandDeviceName = '${item['Brand']} - ${item['Device']}';
  data[brandDeviceName] = item;
});
// Later, when the user picks an entry from the search panel:
// var price = data[selectedBrandDeviceName]['Price'];
I've got a rather horrible scenario in which I want to read the files that Kinesis Firehose creates on our S3.
Kinesis Firehose creates files that don't have every JSON object on a new line; the JSON objects are simply concatenated in the file.
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
Now, this scenario is not supported by a normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only works in scenarios without nested brackets, it seems.
Does anybody know a working/better/more elegant way to solve this problem?
The regex in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for both quoted and unquoted JSON.
Besides this, I improved it a bit to keep it simpler, since you can have integer as well as normal values; anything within string literals will be ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1
Modify the input to be one large JSON array, then parse that:
input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
As you concluded in your most recent comment, put_record_batch in Firehose requires you to manually put delimiters in your records so that they can be easily parsed by the consumers. You can add a newline, or some special character that is used solely for parsing, % for example, which should never appear in your payload.
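A minimal sketch of that batch approach, using a newline as the delimiter (the stream name is an assumption; note that a single PutRecordBatch call accepts at most 500 records):
import json

import boto3


def send_batch_to_firehose(records, stream_name):
    firehose_client = boto3.client('firehose')
    # Append a newline to every JSON object so consumers can split the
    # delivered S3 file on '\n'
    firehose_client.put_record_batch(
        DeliveryStreamName=stream_name,
        Records=[{'Data': json.dumps(record) + '\n'} for record in records]
    )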
The other option would be sending record by record. This is only viable if your use case does not require high throughput. For that you can loop over every record and load it as a stringified data blob. If done in Python, we would have a list "records" holding all our JSON objects (each one a dictionary).
import json

import boto3


def send_to_firehose(records):
    firehose_client = boto3.client('firehose')
    for record in records:
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName='<your stream>',  # name of your delivery stream
            Record={'Data': data}
        )
Firehose by default buffers the data before sending it to your bucket, and it should end up with something like this. This will be easy to parse and load into memory in your preferred data structure.
[
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
},
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
}
]
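Assuming the delivered file really does end up as a JSON array like the sample above, reading it back is straightforward; the bucket and key below are placeholders:
import json

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='firehose/output.json')  # placeholder bucket/key
records = json.loads(obj['Body'].read())
prices = [item['payload']['price'] for item in records]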
As far as I know, Apache Spark requires a JSON file to have one record per line. I have a JSON file that is split across lines by field, like this:
{"id": 123,
"name": "Aaron",
"city": {
"id" : 1,
"title": "Berlin"
}}
{"id": 125,
"name": "Bernard",
"city": {
"id" : 2,
"title": "Paris"
}}
{...many more lines
...}
How can I parse it using Spark? Do I need a preprocessor, or can I provide a custom splitter?
Spark splits on newlines to distinguish records. This means that when using the standard JSON reader you need to have one record per line.
You can convert it by doing something like in this answer: https://stackoverflow.com/a/30452120/1547734
The basic idea is to read the input with wholeTextFiles and then load it into a JSON reader, which will parse it and flatMap the results.
Of course, this assumes the files are small enough to fit in memory and be parsed one at a time. Otherwise you would need a more complicated solution.
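A rough PySpark sketch of that idea (read each file whole, split the content into individual JSON documents, then let the standard reader infer the schema); the input path is a placeholder and the helper assumes each file is a sequence of complete JSON objects:
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def split_objects(text):
    # Yield each top-level JSON object in the text as its own single-line string
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        while idx < end and text[idx].isspace():
            idx += 1  # skip whitespace between objects
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield json.dumps(obj)


# wholeTextFiles yields (path, full file content) pairs
raw = spark.sparkContext.wholeTextFiles("/path/to/json/files")
records = raw.flatMap(lambda kv: split_objects(kv[1]))

# Each element is now a complete one-line JSON document, so the schema can be inferred
df = spark.read.json(records)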
I have a big JSON object with a list of "tickets". The schema looks like below:
{
"Artist": "Artist1",
"Tickets": [
{
"Id": 1,
"Attr2Array": [
{
"Att41": 1,
"Att42": "A",
"Att43": null
},
{
"Att41": 1,
"Att42": "A",
"Att43": null
},
],
.
.
.
(more properties)
"Price": "20",
"Description": "I m a ticket"
},
{
"Id": 4,
"Attr2Array": [
{
"Att41": 1,
"Att42": "A",
"Att43": null
},
{
"Att41": 1,
"Att42": "A",
"Att43": null
},
],
.
.
.
.
(more properties)
"Price": "30",
"Description": "I m a ticket"
}
]
}
Each item in the list has around 25-30 properties (some simple types, others complex arrays of nested objects).
I have to read the object from an API endpoint and extract only "Id" and "Description", but they need to be sorted by "Price", which is an int, for example.
In what order should I proceed with this data manipulation?
Should I take the JSON object, deserialise it into another object with just those 2 properties (which I need), and THEN sort ascending on "Price"?
Please note that after I have the sorted list I will have to convert it back to a JSON list, because the front end consumes JSON after all.
What I don't like about this approach is the cycle of serialisation and deserialisation it involves.
or
Should I perform a sort on the JSON object first (using, for example, a binary/bubble sort) and then use the object to create a strongly typed (deserialised) object with just those 2 properties, and then serialise it back to pass to the front end?
I don't know how performant the bubble sort will be, and whether I will get any performance gain at all for large chunks of data processing.
I also need to keep in mind that this implementation should be able to take other properties into account, like "availabilitydate", because at a later date the front end could add one more filter, such as "availabilitydate" asc.
Any help is much appreciated.
Thanks
You can deserialize your JSON string (or file) using the Microsoft System.Web.Extensions assembly and JavaScriptSerializer.
First, you must have classes associated with your JSON. To create the classes, copy your JSON sample data and, in Visual Studio, go to Edit / Paste Special / Paste JSON As Classes.
Next, use this sample to deserialize a JSON string into typed objects and to sort all Tickets by the Price property using LINQ.
String json = System.IO.File.ReadAllText(@"C:\Data.json");
var root = new System.Web.Script.Serialization.JavaScriptSerializer().Deserialize<Rootobject>(json);
// Price is a string in the sample JSON, so parse it to get a numeric sort
var sortedTickets = root.Tickets.OrderBy(t => int.Parse(t.Price));