Read Json message from Kafka Topic using pyspark - json

I am trying to read a JSON message from a Kafka topic with Spark Structured Streaming using a custom schema. I can see the data coming through when I only cast the value as a string, but when I apply the schema it is not working.
Data is like this:
{"items": [{"SKU": "22673", "title": "FRENCH GARDEN SIGN BLUE METAL", "unit_price": 1.25, "quantity": 6}, {"SKU": "20972", "title": "PINK CREAM FELT CRAFT TRINKET BOX ", "unit_price": 1.25, "quantity": 2}, {"SKU": "84596F", "title": "SMALL MARSHMALLOWS PINK BOWL", "unit_price": 0.42, "quantity": 1}, {"SKU": "21181", "title": "PLEASE ONE PERSON METAL SIGN", "unit_price": 2.1, "quantity": 12}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132552854862, "timestamp": "2023-01-20 07:34:22"}
I have used schema as :
schema = StructType([
    StructField("items", StructType([
        StructField("SKU", IntegerType(), True),
        StructField("title", StringType(), True),
        StructField("unit_price", FloatType(), True),
        StructField("quantity", IntegerType(), True)
    ]), True),
    StructField("type", StringType(), True),
    StructField("country", StringType(), True),
    StructField("invoice_no", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])
I am using the function:
kafkaDF = lines.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", schema).alias("value")) \
    .select("value.items.SKU", "value.items.title", "value.items.unit_price",
            "value.items.quantity", "value.type", "value.country",
            "value.invoice_no", "value.timestamp")
Still, the output is coming out as null.

It's null because the schema is incorrect.
Your items field needs to be an ArrayType containing your defined StructType.
That being said, you cannot select value.items.X directly, since there isn't a single element there; you'd need to EXPLODE(value.items) first.
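A minimal sketch of that fix, assuming the streaming DataFrame lines from the question and the usual pyspark.sql imports; note that SKU values such as "84596F" in the sample are not integers, so StringType is used for that field here:
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               StringType, FloatType, IntegerType, TimestampType)

# items is an array of structs, so wrap the struct in ArrayType
schema = StructType([
    StructField("items", ArrayType(StructType([
        StructField("SKU", StringType(), True),
        StructField("title", StringType(), True),
        StructField("unit_price", FloatType(), True),
        StructField("quantity", IntegerType(), True)
    ])), True),
    StructField("type", StringType(), True),
    StructField("country", StringType(), True),
    StructField("invoice_no", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])

# Explode the array before selecting the per-item fields
kafkaDF = (lines.selectExpr("CAST(value AS STRING)")
    .select(from_json("value", schema).alias("value"))
    .select(explode("value.items").alias("item"),
            "value.type", "value.country", "value.invoice_no", "value.timestamp")
    .select("item.SKU", "item.title", "item.unit_price", "item.quantity",
            "type", "country", "invoice_no", "timestamp"))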

Parse XML to JSON Elixir

Does anyone know how to convert XML into JSON? I tried SweetXml but it is not converting the XML into JSON.
I am using this gist: https://gist.github.com/spint/40717d4e6912d8cea929
Reading the XML file:
{:ok, xmldoc} = File.read(Path.expand("/Users/sohaibanwar/Desktop/unconfirmed_order.xml"))
{doc, []} = xmldoc |> to_charlist() |> :xmerl_scan.string()
After parsing (see the screenshot), I am not getting the right result.
The XML file I am using:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<order>
<orderId>35267684</orderId>
<orderToken>fa74171d-f54a-4e76-bcf2-2dd284ac6450</orderToken>
<brokerTicketId>169855177</brokerTicketId>
<section>SEC1</section>
<row>N</row>
<seats/>
<notes>Limited view. Please note that you will need to use an iOS or Android mobile device to gain entry to your event.</notes>
<quantity>2</quantity>
<cost>324.00</cost>
<event>Dave Matthews Band (Rescheduled from 7/17/2020, 7/16/2021)</event>
<eventDate>2021-08-20 19:30:00</eventDate>
<orderDate>2021-06-18 05:43:13</orderDate>
<expectedShipDate>2021-06-18 00:00:00</expectedShipDate>
<venue>Xfinity Center - MA - Mansfield, MA</venue>
<status>UNCONFIRMED</status>
<purchaseOrderId>35088971</purchaseOrderId>
<electronicDelivery>false</electronicDelivery>
<passThrough></passThrough>
<listingId>3359267717</listingId>
<productionId>3412459</productionId>
<eventId>218</eventId>
<zone>false</zone>
<barCodesRequired>false</barCodesRequired>
<transferViaURL>true</transferViaURL>
<instantTransfer>false</instantTransfer>
<instantFlashSeats>false</instantFlashSeats>
</order>
Required result
You can paste the XML here to get the required answer:
{
  "order": {
    "orderId": 35267684,
    "orderToken": "fa74171d-f54a-4e76-bcf2-2dd284ac6450",
    "brokerTicketId": 169855177,
    "section": "SEC1",
    "row": "N",
    "seats": "",
    "notes": "Limited view. Please note that you will need to use an iOS or Android mobile device to gain entry to your event.",
    "quantity": 2,
    "cost": 324,
    "event": "Dave Matthews Band (Rescheduled from 7/17/2020, 7/16/2021)",
    "eventDate": "2021-08-20 19:30:00",
    "orderDate": "2021-06-18 05:43:13",
    "expectedShipDate": "2021-06-18 00:00:00",
    "venue": "Xfinity Center - MA - Mansfield, MA",
    "status": "UNCONFIRMED",
    "purchaseOrderId": 35088971,
    "electronicDelivery": false,
    "passThrough": "",
    "listingId": 3359267717,
    "productionId": 3412459,
    "eventId": 218,
    "zone": false,
    "barCodesRequired": false,
    "transferViaURL": true,
    "instantTransfer": false,
    "instantFlashSeats": false
  }
}
Well, if your question is how to convert XML to JSON in Elixir, the answer is fairly simple: use https://github.com/homanchou/elixir-xml-to-map and a JSON encoder of your choice:
def xml2json(path) do
  File.read!(path)
  |> XmlToMap.naive_map()
  |> Jason.encode!()
end
The problem with this approach is that XML is not isomorphic to a map (you have both tags and attributes in XML, and you don't in a map or JSON), so a better way is to use SweetXML (or :xmerl) and do an xmap with a proper match, like here (the code is from the SweetXML examples - https://github.com/kbrw/sweet_xml#examples):
result = doc |> xmap(
  matchups: [
    ~x"//matchups/matchup"l,
    name: ~x"./name/text()",
    winner: [
      ~x".//team/id[.=ancestor::matchup/@winner-id]/..",
      name: ~x"./name/text()"
    ]
  ],
  last_matchup: [
    ~x"//matchups/matchup[last()]",
    name: ~x"./name/text()",
    winner: [
      ~x".//team/id[.=ancestor::matchup/@winner-id]/..",
      name: ~x"./name/text()"
    ]
  ]
)
assert result == %{
  matchups: [
    %{name: 'Match One', winner: %{name: 'Team One'}},
    %{name: 'Match Two', winner: %{name: 'Team Two'}},
    %{name: 'Match Three', winner: %{name: 'Team One'}}
  ],
  last_matchup: %{name: 'Match Three', winner: %{name: 'Team One'}}
}
Another option is to use https://github.com/willemdj/erlsom and do a manual traversal over the tuple tree it emits from the simple_form call. Note that you will have to handle XMLNS and the attribute/value problem anyway.

Read JSON in pySpark with custom schema in GCP Dataproc

In GCP Dataproc (with PySpark), I have a task to read a JSON file with a custom schema and load it into a DataFrame.
I have the following sample testing JSON:
{"Transactions": [{"schema": "a",
"id": "1",
"app": "testing",
"description": "JSON schema for testing purpose"}]}
I have created the following schema:
custom_schema = StructType([
    StructField("Transactions", StructType([
        StructField("schema", StringType()),
        StructField("id", StringType()),
        StructField("app", StringType()),
        StructField("description", StringType())
    ]))
])
Reading JSON as:
df_2 = spark.read.json(json_path, schema = custom_schema)
Now, I need to check the data in the DataFrame. When I try df_2.show(), it takes too much time and shows the kernel as busy for hours.
I need help with what I am missing in the code, and how I can see the data in the DataFrame (tabular format).
I think the problem is with your custom schema definition and the JSON file. The following code and JSON file worked for me:
Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession \
    .builder \
    .appName("JSON test") \
    .getOrCreate()

custom_schema = StructType([
    StructField("schema", StringType(), False),
    StructField("id", StringType(), True),
    StructField("app", StringType(), True),
    StructField("description", StringType(), True)])

df = spark.read.format("json") \
    .schema(custom_schema) \
    .load("gs://my-bucket/transactions.json")

df.show()
JSON file
The contents of gs://my-bucket/transactions.json are:
{"schema": "a", "id": "1", "app": "foo", "description": "test"}
{"schema": "b", "id": "2", "app": "bar", "description": "test2"}
Output
+------+---+---+-----------+
|schema| id|app|description|
+------+---+---+-----------+
| a| 1|foo| test|
| b| 2|bar| test2|
+------+---+---+-----------+
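If you do want to keep the nested layout from the question instead of flattening the file, a rough sketch (assuming the same SparkSession and a json_path variable pointing at the nested file) is to declare Transactions as an ArrayType of that struct and explode it:
from pyspark.sql.functions import explode
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

nested_schema = StructType([
    StructField("Transactions", ArrayType(StructType([
        StructField("schema", StringType(), True),
        StructField("id", StringType(), True),
        StructField("app", StringType(), True),
        StructField("description", StringType(), True)
    ])), True)
])

# multiLine is needed when a single JSON document spans several lines, as in the sample
df_2 = spark.read.schema(nested_schema) \
    .option("multiLine", "true") \
    .json(json_path)

df_2.select(explode("Transactions").alias("t")).select("t.*").show()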

Create dataframe reading json file using json schema file outside the code

What is the best way to create a dataframe for a JSON file using a separate JSON schema file in PySpark?
Sample json file
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":264}
{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":69}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":24}
Code to read this file
df_json = spark.read.format("json")\
.option("mode", "FAILFAST")\
.option("inferschema", "true")\
.load("C:\\pyspark\\data\\2010-summary.json")
If I don't want to use the "inferschema" option and want to use a json schema file instead, may I know how to do that?
JSON schema file
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "ORIGIN_COUNTRY_NAME": {"type": "string"},
    "DEST_COUNTRY_NAME": {"type": "string"},
    "count": {"type": "integer"}
  },
  "required": ["ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME", "count"]
}
Option 1:
Assuming your columns are all nullable:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

yourSchema = StructType([
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("count", IntegerType(), True)])
Option 2:
Simply read your file like so:
df_json = spark.read.json("C:\\pyspark\\data\\2010-summary.json")
df_jsonSchema = df_json.schema
print(type(df_jsonSchema))
[each for each in df_jsonSchema]
From the results, you can then build your schema just like in option 1.
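Note that neither option consumes the draft-04 JSON Schema file itself; Spark does not read that format natively. If the goal is just to keep the schema in a file outside the code, a sketch building on option 2 (the schema file path here is made up) is to persist Spark's own JSON form of the schema once and reload it on later runs:
import json
from pyspark.sql.types import StructType

# One-time step: infer the schema from a sample file and save Spark's JSON form of it
df_sample = spark.read.json("C:\\pyspark\\data\\2010-summary.json")
with open("C:\\pyspark\\data\\2010-summary-schema.json", "w") as f:
    f.write(df_sample.schema.json())

# Later runs: load the saved schema file and skip inference entirely
with open("C:\\pyspark\\data\\2010-summary-schema.json") as f:
    saved_schema = StructType.fromJson(json.load(f))

df_json = spark.read.format("json") \
    .option("mode", "FAILFAST") \
    .schema(saved_schema) \
    .load("C:\\pyspark\\data\\2010-summary.json")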

Creating an aggregate metrics from JSON logs in apache spark

I am getting started with Apache Spark.
I have a requirement to convert a JSON log into flattened metrics; it can also be thought of as a simple CSV.
For example:
"orderId":1,
"orderData": {
"customerId": 123,
"orders": [
{
"itemCount": 2,
"items": [
{
"quantity": 1,
"price": 315
},
{
"quantity": 2,
"price": 300
},
]
}
]
}
This can be considered a single JSON log. I want to convert it into:
orderId,customerId,totalValue,units
1,123,915,3
I was going through the Spark SQL documentation and can use it to get hold of individual values, like "select orderId, orderData.customerId from Order", but I am not sure how to get the summation of all the prices and units.
What would be the best practice to get this done using Apache Spark?
Try:
>>> from pyspark.sql.functions import *
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> df = sqlContext.read.json(sc.parallelize([doc]))
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
... .withColumn("item", explode("order.items")) \
... .groupBy("orderId", "customerId") \
... .agg(sum("item.quantity"), sum(col("item.quantity") * col("item.price")))
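To line this up with the requested column names, a possible variation (the aliases and the final show() are just for illustration):
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderId", "customerId") \
...     .agg(sum("item.quantity").alias("units"),
...          sum(col("item.quantity") * col("item.price")).alias("totalValue")) \
...     .show()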
For people who are looking for a Java solution to the above:
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> orders = sqlContext.read().json("order.json");
Dataset<Row> newOrders = orders.select(
        col("orderId"),
        col("orderData.customerId"),
        explode(col("orderData.orders")).alias("order"))
    .withColumn("item", explode(col("order.items")))
    .groupBy(col("orderId"), col("customerId"))
    // multiply quantity by price so the aggregate matches totalValue, as in the Python answer
    .agg(sum(col("item.quantity")), sum(col("item.quantity").multiply(col("item.price"))));
newOrders.show();

JSON format with gzip compression

My current project sends a lot of data to the browser in JSON via ajax requests.
I've been trying to decide which format I should use. The two I have in mind are
{
    "colname1": "content",
    "colname2": "content"
},
{
    "colname1": "content",
    "colname2": "content"
},
...
and
{
    "columns": [
        "column name 1",
        "column name 2"
    ],
    "rows": [
        [
            "content",
            "content"
        ],
        [
            "content",
            "content"
        ]
        ...
    ]
}
The first method is better because it is easier to work with. I just have to convert to an object once received. The second will need some post processing to convert it into a format more like the first so it is easier to work with in JavaScript.
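The post-processing for the second format is just zipping the column names back onto each row. A rough sketch of the idea, written in Python for brevity (in the project itself it would be JavaScript in the browser):
data = {
    "columns": ["colname1", "colname2"],
    "rows": [["content", "content"], ["content", "content"]]
}
# Rebuild the first format: a list of objects keyed by column name
records = [dict(zip(data["columns"], row)) for row in data["rows"]]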
The second is better because it is less verbose and therefore takes up less bandwidth and downloads more quickly. Before compression it is usually between 75% and 85% of the size of the first format.
GZip compression complicates things further, making the difference in file size nearer 85% to 95%.
Which format should I go with and why?
I'd suggest using RJSON:
RJSON (Recursive JSON) converts any JSON data collection into more compact recursive form. Compressed data is still JSON and can be parsed with JSON.parse. RJSON can compress not only homogeneous collections, but any data sets with free structure.
Example:
JSON:
{
  "id": 7,
  "tags": ["programming", "javascript"],
  "users": [
    {"first": "Homer", "last": "Simpson"},
    {"first": "Hank", "last": "Hill"},
    {"first": "Peter", "last": "Griffin"}
  ],
  "books": [
    {"title": "JavaScript", "author": "Flanagan", "year": 2006},
    {"title": "Cascading Style Sheets", "author": "Meyer", "year": 2004}
  ]
}
RJSON:
{
  "id": 7,
  "tags": ["programming", "javascript"],
  "users": [
    {"first": "Homer", "last": "Simpson"},
    [2, "Hank", "Hill", "Peter", "Griffin"]
  ],
  "books": [
    {"title": "JavaScript", "author": "Flanagan", "year": 2006},
    [3, "Cascading Style Sheets", "Meyer", 2004]
  ]
}
Shouldn't the second bit of example 1 be "rowname1", etc.? I don't really get example 2, so I guess I would steer you towards 1. There is much to be said for having data immediately workable without pre-processing it first. Justification: I once spent too long optimizing an array system that turned out to work perfectly, but it's hell to update it now.