Creating aggregate metrics from JSON logs in Apache Spark

I am getting started with Apache Spark.
I need to convert a JSON log into flattened metrics, which can be thought of as a simple CSV.
For example:
"orderId":1,
"orderData": {
"customerId": 123,
"orders": [
{
"itemCount": 2,
"items": [
{
"quantity": 1,
"price": 315
},
{
"quantity": 2,
"price": 300
},
]
}
]
}
This can be considered a single JSON log. I want to convert it into:
orderId,customerId,totalValue,units
1 , 123 , 915 , 3
I was going through the Spark SQL documentation and can use it to get individual values, e.g. "select orderId, orderData.customerId from Order", but I am not sure how to sum up all the prices and units.
What is the best practice for doing this in Apache Spark?

Try:
>>> from pyspark.sql.functions import *
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> df = sqlContext.read.json(sc.parallelize([doc]))
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
... .withColumn("item", explode("order.items")) \
... .groupBy("orderId", "customerId") \
... .agg(sum("item.quantity"), sum(col("item.quantity") * col("item.price")))
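To sanity-check what the Spark query should return, here is a minimal plain-Python sketch of the same aggregation over the sample record. It does not use Spark at all; the helper name summarize is made up for illustration, and the record is assumed to have exactly the shape shown in the question.

```python
# Plain-Python cross-check of the aggregation (no Spark required).
doc = {
    "orderId": 1,
    "orderData": {
        "customerId": 123,
        "orders": [
            {
                "itemCount": 2,
                "items": [
                    {"quantity": 1, "price": 315},
                    {"quantity": 2, "price": 300},
                ],
            }
        ],
    },
}


def summarize(order_doc):
    """Return (orderId, customerId, totalValue, units) for one log record."""
    # Flatten orders -> items, mirroring the two explode() calls in Spark.
    items = [
        item
        for order in order_doc["orderData"]["orders"]
        for item in order["items"]
    ]
    total_value = sum(i["quantity"] * i["price"] for i in items)
    units = sum(i["quantity"] for i in items)
    return (
        order_doc["orderId"],
        order_doc["orderData"]["customerId"],
        total_value,
        units,
    )


row = summarize(doc)
print(row)  # (1, 123, 915, 3)
```

The total of 915 only comes out when each price is multiplied by its quantity (1 * 315 + 2 * 300), which is what the second sum in the PySpark query does.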

For anyone looking for a Java version of the above:
// assumes static imports of col, explode and sum from org.apache.spark.sql.functions
// conf is an existing SparkConf
SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
Dataset<Row> orders = spark.read().json("order.json");
Dataset<Row> newOrders = orders.select(
        col("orderId"),
        col("orderData.customerId"),
        explode(col("orderData.orders")).alias("order"))
    .withColumn("item", explode(col("order.items")))
    .groupBy(col("orderId"), col("customerId"))
    // units = sum of quantities; totalValue = sum of quantity * price (915 for the sample)
    .agg(sum(col("item.quantity")),
         sum(col("item.quantity").multiply(col("item.price"))));
newOrders.show();

Related

multiple object of an array creates different columns in the CSV file

Here is my JSON example. When I convert the JSON to a CSV file, a separate set of columns is created for each object of the reviews array, with column names like serial, name.0, rating.0, _id.0, name.1, rating.1, _id.1. How can I produce a CSV file in which the only column names are serial, name, rating and _id, and every object in reviews goes on its own row?
[{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
},
],
}]
I am trying to load the CSV file into pandas. If this is not possible, is there any way to solve the problem using the pandas package in Python?
I suggest using pandas only for the CSV export: flatten the JSON data structure first, so that the result can be loaded easily into a pandas DataFrame.
Try:
data_python = [{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
},
],
}]
from collections import defaultdict
from pprint import pprint
import pandas as pd
dct_flat = defaultdict(list)
for dct in data_python:
    for dct_reviews in dct["reviews"]:
        dct_flat['serial'].append(dct['serial'])
        for key, value in dct_reviews.items():
            dct_flat[key].append(value)
#pprint(data_python)
#pprint(dct_flat)
df = pd.DataFrame(dct_flat)
print(df)
df.to_csv("data.csv")
which gives:
serial name rating _id
0 63708940a8d291c502be815f shadman 4 6373d4eb50cff661989f3d83
1 63708940a8d291c502be815f niloy1 3 6373d59450cff661989f3db8
and
,serial,name,rating,_id
0,63708940a8d291c502be815f,shadman,4,6373d4eb50cff661989f3d83
1,63708940a8d291c502be815f,niloy1,3,6373d59450cff661989f3db8
as CSV file content.
Note that the JSON in your question cannot be loaded from a file or a string in Python, with either the json module or pandas, because it is not valid JSON: it contains trailing commas. Here is the corrected, valid JSON data:
valid_json_data='''\
[{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
}
]
}]
'''
and code for loading this data from json file:
import json
json_file = "data.json"
with open(json_file) as f:
    data_json = f.read()
data_python = json.loads(data_json)
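As an alternative to the manual loop, pandas can do the flattening itself. The following is a minimal sketch using pd.json_normalize with record_path and meta, assuming the data has already been parsed into the data_python list shown above; the variable name csv_text is only for illustration.

```python
import pandas as pd

data_python = [{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {"name": "shadman", "rating": 4, "_id": "6373d4eb50cff661989f3d83"},
        {"name": "niloy1", "rating": 3, "_id": "6373d59450cff661989f3db8"},
    ],
}]

# One row per element of "reviews"; "serial" is repeated on every row.
df = pd.json_normalize(data_python, record_path="reviews", meta=["serial"])

# json_normalize places the meta column last, so reorder to the wanted header.
df = df[["serial", "name", "rating", "_id"]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing index=False leaves out the unnamed index column that shows up as the leading comma in the CSV content above.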

How to convert json to csv python with proper headers

I am trying to convert JSON data to CSV. I get the values, but each block ends up on a single line in the result. I am new to Python, so any help is appreciated. I have tried the code below.
import pandas as pd
with open(r'C:\Users\anath\hard.json', encoding='utf-8') as inputfile:
    df = pd.read_json(inputfile)
df.to_csv(r'C:\Users\anath\csvfile.csv', encoding='utf-8', index=True)
Sample Json in the source file, short snippet
{
"issues": [
{
"issueId": 110052,
"revision": 84,
"definitionId": "DNS1012",
"subject": "urn:h:domain:fitestdea.com",
"subjectDomain": "fitestdea.com",
"title": "Nameserver name doesn\u0027t resolve to an IPv6 address",
"category": "DNS",
"severity": "low",
"cause": "urn:h:domain:ns1.gname.net",
"causeDomain": "ns1.gname.net",
"open": true,
"status": "active",
"auto": true,
"autoOpen": true,
"createdOn": "2022-09-01T02:29:09.681451Z",
"lastUpdated": "2022-11-23T02:26:28.785601Z",
"lastChecked": "2022-11-23T02:26:28.785601Z",
"lastConfirmed": "2022-11-23T02:26:28.785601Z",
"details": "{}"
},
{
"issueId": 77881,
"revision": 106,
"definitionId": "DNS2001",
"subject": "urn:h:domain:origin-mx.stagetest.test.com.test.com",
"subjectDomain": "origin-mx.stagetest.test.com.test.com",
"title": "Dangling domain alias (CNAME)",
"category": "DNS",
"severity": "high",
"cause": "urn:h:domain:origin-www.stagetest.test.com.test.com",
"causeDomain": "origin-www.stagetest.test.com.test.com",
"open": true,
"status": "active",
"auto": true,
"autoOpen": true,
"createdOn": "2022-08-10T09:34:36.929071Z",
"lastUpdated": "2022-11-23T09:33:32.553663Z",
"lastChecked": "2022-11-23T09:33:32.553663Z",
"lastConfirmed": "2022-11-23T09:33:32.553663Z",
"details": "{\"#type\": \"hardenize/com.hardenize.schemas.dns.DanglingProblem\", \"rrType\": \"CNAME\", \"rrDomain\": \"origin-mx.stagetest.test.com.test.com\", \"causeDomain\": \"origin-www.stagetest.test.com.test.com\", \"danglingType\": \"nxdomain\", \"rrEffectiveDomain\": \"origin-mx.stagetest.test.com.test.com\"}"
}
]
}
In the output I am getting, the entire record ends up in a single cell. I am looking for a way to get the field names in the header and each value in its own column. Is there also a way to get only specific fields, such as title, severity or issueId, rather than everything?
Try:
import json
import pandas as pd
with open("your_file.json", "r") as f_in:
    data = json.load(f_in)
df = pd.DataFrame(data["issues"])
print(df[["title", "severity", "issueId"]])
Prints:
title severity issueId
0 Nameserver name doesn't resolve to an IPv6 address low 110052
1 Dangling domain alias (CNAME) high 77881
To save as CSV you can do:
df[["title", "severity", "issueId"]].to_csv('data.csv', index=False)
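Alternatively, try pd.json_normalize. It expects parsed JSON (a dict or a list of dicts) rather than an open file, so load the file with the json module first (this needs import json):
df = pd.json_normalize(json.load(inputfile)["issues"])
in place of the pd.read_json line you have.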
This finally worked for me. Thanks @Andrej Kesely for the input; sharing it in case it helps others.
import pandas as pd
import json
with open(r'C:\Users\anath\hard.json', encoding='utf-8') as inputfile:
    data = json.load(inputfile)
df = pd.DataFrame(data["issues"])
print(df[["title", "severity", "issueId"]])
df[["title", "severity", "issueId"]].to_csv('data.csv', index=False)
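If pandas is not needed for anything else, the same field selection can be sketched with the standard csv module. This is only an illustration: the issues list below is a trimmed copy of the sample, and io.StringIO stands in for a real output file.

```python
import csv
import io

# Trimmed copy of the "issues" list from the sample JSON.
issues = [
    {
        "issueId": 110052,
        "title": "Nameserver name doesn't resolve to an IPv6 address",
        "category": "DNS",
        "severity": "low",
    },
    {
        "issueId": 77881,
        "title": "Dangling domain alias (CNAME)",
        "category": "DNS",
        "severity": "high",
    },
]

wanted = ["title", "severity", "issueId"]

# StringIO stands in for open("data.csv", "w", newline="") in real use.
buffer = io.StringIO()

# extrasaction="ignore" silently drops every key not listed in fieldnames.
writer = csv.DictWriter(buffer, fieldnames=wanted, extrasaction="ignore")
writer.writeheader()
writer.writerows(issues)

csv_text = buffer.getvalue()
print(csv_text)
```

Each key in fieldnames becomes one header cell and each record becomes one row, which is the layout the question asks for.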

How to do 3 layer flattening nested json to dataframe?

For example the JSON is :
{
"samples": [
{
"sample_id": "A2434",
"start": "1664729482",
"end": "1664729482",
"parts": [
{
"name": "123",
"start": "1664736682",
"end": "1618688700",
"fail": ""
}
]
}
]
}
I want the df and columns like below :
sample_id,start,end,parts.name,parts.start,parts.end,parts.fail
Using pd.json_normalize:
df = pd.json_normalize(
data=data["samples"],
record_path="parts",
record_prefix="parts.",
meta=["sample_id", "start", "end"]
).drop(columns="parts.name")
print(df)
parts.start parts.end parts.fail sample_id start end
0 1664736682 1618688700 A2434 1664729482 1664729482
Use pd.read_json() to unpack the JSON into a DataFrame. You can then use pd.json_normalize() as required on the generated columns to get the more deeply nested data out.
You can use df.explode to put each item of a list on its own row, and apply(pd.Series) to turn the key-value pairs of each dictionary into columns (keys) and rows (values). apply is used here because recent pandas releases deprecate the element-wise behaviour of agg.
df = pd.DataFrame(data['samples'])
# each dict has four keys, so four target columns are needed
df[['parts.name','parts.start','parts.end','parts.fail']] = df['parts'].explode().apply(pd.Series)
df.drop('parts', axis=1, inplace=True)
Output:
sample_id start end parts.name parts.start parts.end parts.fail
0 A2434 1664729482 1664729482 123 1664736682 1618688700
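For the exact column list requested in the question, including parts.name, a minimal sketch with pd.json_normalize could look like this. It assumes data is the parsed JSON from the question; the wanted list is just a name chosen here for the target column order.

```python
import pandas as pd

data = {
    "samples": [
        {
            "sample_id": "A2434",
            "start": "1664729482",
            "end": "1664729482",
            "parts": [
                {
                    "name": "123",
                    "start": "1664736682",
                    "end": "1618688700",
                    "fail": "",
                }
            ],
        }
    ]
}

# record_prefix keeps the nested start/end apart from the sample-level ones.
df = pd.json_normalize(
    data["samples"],
    record_path="parts",
    record_prefix="parts.",
    meta=["sample_id", "start", "end"],
)

# Reorder so the sample-level columns come first, as asked in the question.
wanted = [
    "sample_id", "start", "end",
    "parts.name", "parts.start", "parts.end", "parts.fail",
]
df = df[wanted]
print(df.columns.tolist())
```

Without record_prefix, the start and end keys of a part would clash with the sample-level metadata of the same name.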

How to save json file into mongodb

I have Twitter account timeline data, one record per tweet, saved in .json format, but I am unable to save the data into MongoDB.
Example: the fetched data of one tweet.
{
"created_at": "Fri Apr 12 05:13:35 +0000 2019",
"id": 1116570031511359489,
"id_str": "1116570031511359489",
"full_text": "#jurafsky How can i get your video lectures related to Sentiment Analysis",
"truncated": false,
"display_text_range": [0, 73],
"entities": {
"hashtags": [],
"symbols": [],
"user_mentions": [
{
"screen_name": "jurafsky",
"name": "Dan Jurafsky",
"id": 14968475,
"id_str": "14968475",
"indices": [0, 9]
}
],
"urls": []
}
It also contains URLs and lots of other information.
I have tried the following code:
from pymongo import MongoClient
import json
client=MongoClient('localhost',27107)
db=client.test
coll=db.dataset
with open('tweets.json') as f:
    file_data=json.loads(f.read())
coll.insert(file_data)
client.close()
Try this:
from pymongo import MongoClient
import json
client=MongoClient('localhost',27107)
db=client.test
coll=db.dataset
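with open('tweets.json') as f:
    file_data=json.load(f)
# Collection.insert() was removed in PyMongo 4; insert_many() takes a list of
# documents (use insert_one() for a single document)
coll.insert_many(file_data)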
client.close()
My JSON dataset was not valid; I had to merge the records into a single array object.
Thanks to: Can't parse json file: json.decoder.JSONDecodeError: Extra data.
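The "Extra data" error is what json.load raises when a file holds several JSON documents one after another instead of a single array. If editing the file by hand is impractical, the merge can be sketched with json.JSONDecoder.raw_decode. The helper name iter_json_documents and the two-tweet sample string are made up for illustration.

```python
import json


def iter_json_documents(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical file content: two tweets stored one after the other.
raw = '{"id": 1, "full_text": "first"}\n{"id": 2, "full_text": "second"}\n'

tweets = list(iter_json_documents(raw))
print(len(tweets), tweets[1]["full_text"])
```

The resulting list of dicts is the "one array object" mentioned above and is the kind of value a bulk insert expects.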

Python pandas convert CSV up to level "n" nested JSON?

I want to convert a CSV file into nested JSON up to n levels. I am using the code below, taken from this link, to get the desired output, but I am getting an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-109-89d84e9a61bf> in <module>()
40 # make a list of keys
41 keys_list = []
---> 42 for item in d['children']:
43 keys_list.append(item['name'])
44
TypeError: 'datetime.date' object has no attribute '__getitem__'
Below is the code:
# CSV 2 flare.json
# convert a csv file to flare.json for use with many D3.js viz's
# This script creates outputs a flare.json file with 2 levels of nesting.
# For additional nested layers, add them in lines 32 - 47
# sample: http://bl.ocks.org/mbostock/1283663
# author: Andrew Heekin
# MIT License
import pandas as pd
import json
df = pd.read_csv('file.csv')
# choose columns to keep, in the desired nested json hierarchical order
df = df[['Group','year','quarter']]
# order in the groupby here matters, it determines the json nesting
# the groupby call makes a pandas series by grouping 'the_parent' and 'the_child', while summing the numerical column 'child_size'
df1 = df.groupby(['Group','year','quarter'])['quarter'].count()
df1 = df1.reset_index(name = "count")
#print df1.head()
# start a new flare.json document
flare = dict()
flare = {"name":"flare", "children": []}
#df1['year'] = [str(yr) for yr in df1['year']]
for line in df1.values:
    the_parent = line[0]
    the_child = line[1]
    child_size = line[2]
    # make a list of keys
    keys_list = []
    for item in d['children']:
        keys_list.append(item['name'])
    # if 'the_parent' is NOT a key in the flare.json yet, append it
    if not the_parent in keys_list:
        d['children'].append({"name":the_parent, "children":[{"name":the_child, "size":child_size}]})
    # if 'the_parent' IS a key in the flare.json, add a new child to it
    else:
        d['children'][keys_list.index(the_parent)]['children'].append({"name":the_child, "size":child_size})
flare = d
# export the final result to a json file
with open('flare.json', 'w') as outfile:
    json.dump(flare, outfile)
Expected Output in below format:
{
"name": "stock",
"children": [
{"name": "fruits",
"children": [
{"name": "berries",
"children": [
{"count": 20, "name": "blueberry"},
{"count": 70, "name": "cranberry"},
{"count": 96, "name": "raspberry"},
{"count": 140, "name": "strawberry"}]
},
{"name": "citrus",
"children": [
{"count": 20, "name": "grapefruit"},
{"count": 120, "name": "lemon"},
{"count": 50, "name": "orange"}]
},
{"name": "dried fruit",
"children": [
{"count": 25, "name": "dates"},
{"count": 10, "name": "raisins"}]
}]
},
{"name": "vegtables",
"children": [
{"name": "green leaf",
"children": [
{"count": 19, "name": "cress"},
{"count": 18, "name": "spinach"}]
},
{
"name": "legumes",
"children": [
{"count": 27, "name": "beans"},
{"count": 12, "name": "chickpea"}]
}]
}]
}
Could anyone please help me resolve this error?
Thanks
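One observation on the traceback: the script assigns the dict to flare but the loop indexes d, so d is presumably a datetime.date left over from earlier in the session. Separately from that, here is a minimal sketch of building the tree for any number of levels. It assumes each row is a tuple of the path names followed by a leaf name and a count (for example, what iterating over the grouped DataFrame's values would give); build_tree and the sample rows are hypothetical.

```python
import json


def build_tree(rows, root_name="stock"):
    """Nest rows shaped (level_1, ..., level_n, leaf_name, count) into a dict."""
    root = {"name": root_name, "children": []}
    for *path, leaf, count in rows:
        node = root
        for name in path:
            # Reuse the child with this name if it exists, else create it.
            child = next(
                (c for c in node["children"] if c["name"] == name), None
            )
            if child is None:
                child = {"name": name, "children": []}
                node["children"].append(child)
            node = child
        node["children"].append({"name": leaf, "count": count})
    return root


# Hypothetical rows taken from the expected output in the question.
rows = [
    ("fruits", "berries", "blueberry", 20),
    ("fruits", "berries", "cranberry", 70),
    ("fruits", "citrus", "lemon", 120),
    ("vegtables", "legumes", "beans", 27),
]

tree = build_tree(rows)
print(json.dumps(tree, indent=1))
```

Because the inner loop walks however many path names a row has, adding another grouping column only lengthens the tuples and needs no change to the function.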