Aggregating json data with pyspark - json

I'm currently working with pyspark and the great language game dataset which contains several samples as json objects like the one shown below.
Each of this samples represents an instance of the game, where some person has listened an audio file with some spoken language and afterwards should choose out of four possible languages which one she just heard.
I want now to aggreagte all this games on let's say the "target" field and the "guess" field and afterwards count the amount of games for each pair ("target","guess").
Can someone give me some help on how to get this done?
I already had a look at the pyspark documentation, but as I'm quite new to to python/pyspark it didn't really understand how the aggregate funciton works.
{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}

The process of converting json data into pyspark dataframe can be done this way.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json
sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)
def convert_single_object_per_line(json_list):
json_string = ""
for line in json_list:
json_string += json.dumps(line) + "\n"
return json_string
json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}]
json_string = convert_single_object_per_line(json_list)
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])
[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
| choices|country| date| guess| sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...| AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+

Related

In Foundry, how can I parse a dataframe column that has a JSON response

I am trying to bring in JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO and there is a column called response. The response column has data that looks like this...
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Due to the fact that this imports as AVRO, the documentation that talks about how to convert this data that's in Foundry doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T
#transform_df(
Output("json output"),
json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):
sqlContext = SQLContext(sc)
source = json_raw.select('response').collect() # noqa
# Read the list into data frame
df = sqlContext.read.json(sc.parallelize(source))
json_schema = T.StructType([
T.StructField("id", T.StringType(), False),
T.StructField("name", T.StringType(), False),
T.StructField("custom", T.StringType(), False),
T.StructField("orderable", T.StringType(), False),
T.StructField("navigable", T.StringType(), False),
T.StructField("searchable", T.StringType(), False),
T.StructField("clauseNames", T.StringType(), False),
T.StructField("schema", T.StringType(), False)
])
udf_parse_json = udf(lambda str: parse_json(str), json_schema)
df_new = df.select(udf_parse_json(df.response).alias("response"))
return df_new
# Function to convert JSON array string to a list
def parse_json(array_str):
json_obj = json.loads(array_str)
for item in json_obj:
yield (item["a"], item["b"])
Parsing Json in a string column to a struct column (and then into separate columns) can be easily done using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.select("response_parsed.*")
However, this won't work as your schema is incorrect, you actually have a list of json structs in each row, not just 1, so you need a T.ArrayType(your_schema) wrapping around the whole thing, you'll also need to do an F.explode before selecting, to get each array element in its own row.
An additional useful function is F.get_json_object, which allows you to get json one json object from a json string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native spark functions.
Additionally, all the AVRO file format does in this case is to merge multiple json files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
I used the magritte-rest connector to walk through the paged results from search:
type: rest-source-adapter2
restCalls:
- type: magritte-paging-inc-param-call
method: GET
path: search
paramToIncrease: startAt
increaseBy: 50
initValue: 0
parameters:
startAt: '{%startAt%}'
extractor:
- type: json
assign:
issues: /issues
allowNull: true
condition:
type: magritte-rest-non-empty-condition
var: issues
maxIterationsAllowed: 4096
cacheToDisk: false
oneFilePerResponse: false
That yielded a dataset that looked like this:
Once I had that, this parsed expanded and parsed the returned JSON issues into a properly-typed DataFrame with fields holding the inner structure of the issue as a (very complex) struct:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode
def issues_enumerated(All_Issues_Paged):
def generate_issue_row(input_row: Row) -> Row:
"""
Generates a dataframe of each responses issue array as a single array record per-Row
"""
d = input_row.asDict()
resp_json = d['response']
resp_obj = json.loads(resp_json)
issues = list(map(json.dumps,resp_obj['issues']))
return Row(issues=issues)
# array-per-record
unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
# row-per-record
row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
# raw JSON string per-record RDD
issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
# JSON object dataframe
issues_df = spark.read.json(issue_json_strings_rdd)
issues_df.printSchema()
return issues_df

pyspark flat csv to nested json in mongodb

Input CSV Data
userid, Code, Status
1234, 1 , final
1287, 2, notfinal
#Applied Pyspark Script
#Create Spark Session
spark = SparkSession.builder.master("yarn").appName().enableHiveSupport().config("spark.some.config.option", "some-value").getOrCreate()
#read csv data into dataframe
df = spark.read.load("Book3.csv",format="csv", sep=",", inferSchema="true", header="true")
#define schema for json df
newschema = StructType([StructField("userid", StringType()),StructField("report",
StringType(),metadata={"maxlength":6000})])
jsondf = df.rdd.map(lambda row: (row[0], ({"Code":row[1],"status" : row[2]})))\
.map(lambda row: (row[0], json.dumps(row[1])))\
.toDF(newschema)
jsondf.write.format("mongo").mode("append")\
.option("uri","mongodb://gcp.mongodb.net/").option("database","dbname").option("collection",
"testcollection").save()
Resulant Mongo Data
{
"userid" : "1234",
"report" : "{\"Code\": \"1\", \"status\": \"final\"}"
}
{
"userid" : "1287",
"report" : "{\"Code\": \"2\", \"status\": \"notfinal\"}"
}
In mongo i get a complete json encoded string in "report" which is not a surprise given i have taken report field as Stringtype().
This effectively makes any nested field based search in mongo impossible and whole code is useless then.
How can i make it a proper nested json so that mongo can search on nested fields as well ?
when i try to change field to proper structred json using below code
>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()
i get error that "raise AttributeError(item) AttributeError: json"
Please help with soem code tips...
i am ok to use groupby as well but struggling what to put in aggregate functions and i need dataframe in result to write to mongo.
The solution is to properly define schema in pyspark "df_schema" and then map your base df into a new df "df_mongo" making sure that df.rdd.map should follow the pattern defined in df_schema .
df = spark.read.load("sourcelocation",format="csv", sep="|", inferSchema="true", header="true")
df_schema = StructType([StructField("field1", StringType(),True),StructField("field2", StringType(),True)])
df_mongo = df.rdd.map(lambda row: ([row[15],row[12]])).toDF(df_schema)
df_mongo.write.format("mongo").mode("append").option("uri",mongodb_uri). \
option("database",dbname).option("collection", collection_name).save()

how to convert JSON data to .tsv file using python.

My json data looks like this :
data ={
"time": "2018-10-02T10:19:48+00:00",
"class": "NOTIFICATION",
"type": "Access Control",
"event": "Window/Door",
"number": -61
}
Desired output have to be like this:
time class type event number
2018-10-02T10:19:48+00:00 NOTIFICATION Access Control Window/Door -61
could anyone help me out, Thanks in advance
I think it's the same as converting JSON to csv, but instead of using the comma you can use tab as a delimeter, as follows:
import json
import csv
# input data
json_file = open("data.json", "r")
json_data = json.load(json_file)
json_file.close()
data = json.loads(json_data)
tsv_file = open("data.tsv", "w")
tsv_writer = csv.writer(tsv_file, delimiter='\t')
tsv_writer.writerow(data[0].keys()) # write the header
for row in data: # write data rows
tsv_writer.writerow(row.values())
tsv_file.close()
The above code will work if you json file has multiple data rows. If you have only one data row, the below code should work for you:
tsv_writer.writerow(data.keys()) # write the header
tsv_writer.writerow(data.values()) # write the values
Hope this helps.

Cannot write dict object as correct JSON to file

I try to read JSON from file, get values, transform them and back write to new file.
{
"metadata": {
"info": "important info"
},
"timestamp": "2018-04-06T12:19:38.611Z",
"content": {
"id": "1",
"name": "name test",
"objects": [
{
"id": "1",
"url": "http://example.com",
"properties": [
{
"id": "1",
"value": "1"
}
]
}
]
}
}
Above is a JSON that I read from file.
Below I attach a python program that gets values, creates new JSON and write it to file.
import json
from pprint import pprint
def load_json(file_name):
return json.load(open(file_name))
def get_metadata(json):
return json["metadata"]
def get_timestamp(json):
return json["timestamp"]
def get_content(json):
return json["content"]
def create_json(metadata, timestamp, content):
dct = dict(__metadata=metadata, timestamp=timestamp, content=content)
return json.dumps(dct)
def write_json_to_file(file_name, json_content):
with open(file_name, 'w') as file:
json.dump(json_content, file)
STACK_JSON = 'stack.json';
STACK_OUT_JSON = 'stack-out.json'
if __name__ == '__main__':
json_content = load_json(STACK_JSON)
print("Loaded JSON:")
print(json_content)
metadata = get_metadata(json_content)
print("Metadata:", metadata)
timestamp = get_timestamp(json_content)
print("Timestamp:", timestamp)
content = get_content(json_content)
print("Content:", content)
created_json = create_json(metadata, timestamp, content)
print("\n\n")
print(created_json)
write_json_to_file(STACK_OUT_JSON, created_json)
But the problem is that create json is not correct. Finally as result I get:
"{\"__metadata\": {\"info\": \"important info\"}, \"timestamp\": \"2018-04-06T12:19:38.611Z\", \"content\": {\"id\": \"1\", \"name\": \"name test\", \"objects\": [{\"id\": \"1\", \"url\": \"http://example.com\", \"properties\": [{\"id\": \"1\", \"value\": \"1\"}]}]}}"
It is not that what I want to achieve. It's not correct JSON. What do I wrong?
Solution:
Change the write_json_to_file(...) method like this:
def write_json_to_file(file_name, json_content):
with open(file_name, 'w') as file:
file.write(json_content)
Explanation:
The problem is, that when you're calling write_json_to_file(STACK_OUT_JSON, created_json) at the end of your script, the variable created_json contains a string - it's the JSON representation of the dictionary created in the create_json(...) function. But inside the write_json_to_file(file_name, json_content), you're calling:
json.dump(json_content, file)
You're telling the json module write the JSON representation of variable json_content (which contains a string) into the file. And JSON representation of a string is a single value encapsulated in double-quotes ("), with all the double-quotes it contains escaped by \.
What you want to achieve is to simply write the value of the json_content variable into the file and not have it first JSON-serialized again.
Problem
You're converting a dict into a json and then right before you write it into a file, you're converting it into a json again. When you retry to convert a json to a json it gives you the \" since it's escaping the " since it assumes that you have a value there.
How to solve it?
It's a great idea to read the json file, convert it into a dict and perform all sorts of operations to it. And only when you want to print out an output or write to a file or return an output you convert to a json since json.dump() is expensive, it adds 2ms (approx) of overhead which might not seem much but when your code is running in 500 microseconds it's almost 4 times.
Other Recommendations
After seeing your code, I realize you're coming from a java background and while in java the getThis() or getThat() is a great way to module your code since we represent our code in classes in java, in python it just causes problems in the readability of the code as mentioned in the PEP 8 style guide for python.
I've updated the code below:
import json
def get_contents_from_json(file_path)-> dict:
"""
Reads the contents of the json file into a dict
:param file_path:
:return: A dictionary of all contents in the file.
"""
try:
with open(file_path) as file:
contents = file.read()
return json.loads(contents)
except json.JSONDecodeError:
print('Error while reading json file')
except FileNotFoundError:
print(f'The JSON file was not found at the given path: \n{file_path}')
def write_to_json_file(metadata, timestamp, content, file_path):
"""
Creates a dict of all the data and then writes it into the file
:param metadata: The meta data
:param timestamp: the timestamp
:param content: the content
:param file_path: The file in which json needs to be written
:return: None
"""
output_dict = dict(metadata=metadata, timestamp=timestamp, content=content)
with open(file_path, 'w') as outfile:
json.dump(output_dict, outfile, sort_keys=True, indent=4, ensure_ascii=False)
def main(input_file_path, output_file_path):
# get a dict from the loaded json
data = get_contents_from_json(input_file_path)
# the print() supports multiple args so you don't need multiple print statements
print('JSON:', json.dumps(data), 'Loaded JSON as dict:', data, sep='\n')
try:
# load your data from the dict instead of the methods since it's more pythonic
metadata = data['metadata']
timestamp = data['timestamp']
content = data['content']
# just cumulating your print statements
print("Metadata:", metadata, "Timestamp:", timestamp, "Content:", content, sep='\n')
# write your json to the file.
write_to_json_file(metadata, timestamp, content, output_file_path)
except KeyError:
print('Could not find proper keys to in the provided json')
except TypeError:
print('There is something wrong with the loaded data')
if __name__ == '__main__':
main('stack.json', 'stack-out.json')
Advantages of the above code:
More Modular and hence easily unit testable
Handling of exceptions
Readable
More pythonic
Comments because they are just awesome!

Spark from_json with dynamic schema

I am trying to use Spark for processing JSON data with variable structure(nested JSON). Input JSON data could be very large with more than 1000 of keys per row and one batch could be more than 20 GB.
Entire batch has been generated from 30 data sources and 'key2' of each JSON can be used to identify the source and structure for each source is predefined.
What would be the best approach for processing such data?
I have tried using from_json like below but it works only with fixed schema and to use it first I need to group the data based on each source and then apply the schema.
Due to large data volume my preferred choice is to scan the data only once and extract required values from each source, based on predefined schema.
import org.apache.spark.sql.types._
import spark.implicits._
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
val df = data.toDF
val schema = (new StructType)
.add("key1", StringType)
.add("key2", StringType)
.add("key3", (new StructType)
.add("key3_k1", StringType))
df.select(from_json($"value",schema).as("json_str"))
.select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
This is just a restatement of #Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
Seq(
("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value
|
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
The method is available from Spark 2.2.0:
http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader#json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
If you have data as you mentioned in the question as
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
You don't need to create schema for json data. Spark sql can infer schema from the json string. You just have to use SQLContext.read.json as below
val df = sqlContext.read.json(data)
which will give you schema as below for the rdd data used above
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as
df2.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
You can manipulate the dataframe as you wish. I hope the answer is helpful
I am not sure if my suggestion can help you although I had a similar case and I solved it as follows:
1) So the idea is to use json rapture (or some other json library) to
load JSON schema dynamically. For instance you could read the 1st
row of the json file to discover the schema(similarly to what I do
here with jsonSchema)
2) Generate schema dynamically. First iterate through the dynamic
fields (notice that I project values of key3 as Map[String, String])
and add a StructField for each one of them to schema
3) Apply the generated schema into your dataframe
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the next data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
And the output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val2|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
An advanced alternative, which I haven't tested yet, would be to generate a case class e.g called JsonRow from the JSON schema in order to have a strongly typed dataset which provides better serialization performance apart the fact that make your code more maintainable. To make this work you need first to create a JsonRow.scala file then you should implement a sbt pre-build script which will modify the content of JsonRow.scala(you might have more than one of course) dynamically based on your source files. To generate class JsonRow dynamically you can use the next code:
def generateClass(members: Map[String, String], name: String) : Any = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
The method generateClass accepts a map of strings to create the class members and the class name itself. The members of the generated class you can again populate them from you json schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
This prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
Good luck