pyspark flat csv to nested json in mongodb - json

Input CSV Data
userid, Code, Status
1234, 1 , final
1287, 2, notfinal
#Applied Pyspark Script
#Create Spark Session
spark = SparkSession.builder.master("yarn").appName().enableHiveSupport().config("spark.some.config.option", "some-value").getOrCreate()
#read csv data into dataframe
df = spark.read.load("Book3.csv",format="csv", sep=",", inferSchema="true", header="true")
#define schema for json df
newschema = StructType([StructField("userid", StringType()),StructField("report",
StringType(),metadata={"maxlength":6000})])
jsondf = df.rdd.map(lambda row: (row[0], ({"Code":row[1],"status" : row[2]})))\
.map(lambda row: (row[0], json.dumps(row[1])))\
.toDF(newschema)
jsondf.write.format("mongo").mode("append")\
.option("uri","mongodb://gcp.mongodb.net/").option("database","dbname").option("collection",
"testcollection").save()
Resulant Mongo Data
{
"userid" : "1234",
"report" : "{\"Code\": \"1\", \"status\": \"final\"}"
}
{
"userid" : "1287",
"report" : "{\"Code\": \"2\", \"status\": \"notfinal\"}"
}
In mongo i get a complete json encoded string in "report" which is not a surprise given i have taken report field as Stringtype().
This effectively makes any nested field based search in mongo impossible and whole code is useless then.
How can i make it a proper nested json so that mongo can search on nested fields as well ?
when i try to change field to proper structred json using below code
>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()
i get error that "raise AttributeError(item) AttributeError: json"
Please help with soem code tips...
i am ok to use groupby as well but struggling what to put in aggregate functions and i need dataframe in result to write to mongo.

The solution is to properly define schema in pyspark "df_schema" and then map your base df into a new df "df_mongo" making sure that df.rdd.map should follow the pattern defined in df_schema .
df = spark.read.load("sourcelocation",format="csv", sep="|", inferSchema="true", header="true")
df_schema = StructType([StructField("field1", StringType(),True),StructField("field2", StringType(),True)])
df_mongo = df.rdd.map(lambda row: ([row[15],row[12]])).toDF(df_schema)
df_mongo.write.format("mongo").mode("append").option("uri",mongodb_uri). \
option("database",dbname).option("collection", collection_name).save()

Related

Python - parsing JSON and f string

I am trying to write a small JSON script that parses JSON files. I need to include multiple variables in the code but currently, I'm stuck since f string does not seem to be working as I expected. Here is an example code:
import json
test = 10
json_data = f'[{"ID": {test},"Name":"Pankaj","Role":"CEO"}]'
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent=2)
print(json_formatted_str)
The above code returns an error:
json_data = f'[{"ID": { {test} },"Name":"Pankaj","Role":"CEO"}]'
ValueError: Invalid format specifier
Could you, please let me know how can I add variables to the JSON?
Thank you.
You can put extra{ and } to your string:
import json
test = 10
json_data = f'[{{"ID": {test},"Name":"Pankaj","Role":"CEO"}}]'
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent=2)
print(json_formatted_str)
Prints:
[
{
"ID": 10,
"Name": "Pankaj",
"Role": "CEO"
}
]

In Foundry, how can I parse a dataframe column that has a JSON response

I am trying to bring in JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO and there is a column called response. The response column has data that looks like this...
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Due to the fact that this imports as AVRO, the documentation that talks about how to convert this data that's in Foundry doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T
#transform_df(
Output("json output"),
json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):
sqlContext = SQLContext(sc)
source = json_raw.select('response').collect() # noqa
# Read the list into data frame
df = sqlContext.read.json(sc.parallelize(source))
json_schema = T.StructType([
T.StructField("id", T.StringType(), False),
T.StructField("name", T.StringType(), False),
T.StructField("custom", T.StringType(), False),
T.StructField("orderable", T.StringType(), False),
T.StructField("navigable", T.StringType(), False),
T.StructField("searchable", T.StringType(), False),
T.StructField("clauseNames", T.StringType(), False),
T.StructField("schema", T.StringType(), False)
])
udf_parse_json = udf(lambda str: parse_json(str), json_schema)
df_new = df.select(udf_parse_json(df.response).alias("response"))
return df_new
# Function to convert JSON array string to a list
def parse_json(array_str):
json_obj = json.loads(array_str)
for item in json_obj:
yield (item["a"], item["b"])
Parsing Json in a string column to a struct column (and then into separate columns) can be easily done using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.select("response_parsed.*")
However, this won't work as your schema is incorrect, you actually have a list of json structs in each row, not just 1, so you need a T.ArrayType(your_schema) wrapping around the whole thing, you'll also need to do an F.explode before selecting, to get each array element in its own row.
An additional useful function is F.get_json_object, which allows you to get json one json object from a json string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native spark functions.
Additionally, all the AVRO file format does in this case is to merge multiple json files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
I used the magritte-rest connector to walk through the paged results from search:
type: rest-source-adapter2
restCalls:
- type: magritte-paging-inc-param-call
method: GET
path: search
paramToIncrease: startAt
increaseBy: 50
initValue: 0
parameters:
startAt: '{%startAt%}'
extractor:
- type: json
assign:
issues: /issues
allowNull: true
condition:
type: magritte-rest-non-empty-condition
var: issues
maxIterationsAllowed: 4096
cacheToDisk: false
oneFilePerResponse: false
That yielded a dataset that looked like this:
Once I had that, this parsed expanded and parsed the returned JSON issues into a properly-typed DataFrame with fields holding the inner structure of the issue as a (very complex) struct:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode
def issues_enumerated(All_Issues_Paged):
def generate_issue_row(input_row: Row) -> Row:
"""
Generates a dataframe of each responses issue array as a single array record per-Row
"""
d = input_row.asDict()
resp_json = d['response']
resp_obj = json.loads(resp_json)
issues = list(map(json.dumps,resp_obj['issues']))
return Row(issues=issues)
# array-per-record
unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
# row-per-record
row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
# raw JSON string per-record RDD
issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
# JSON object dataframe
issues_df = spark.read.json(issue_json_strings_rdd)
issues_df.printSchema()
return issues_df

Spark Dataframe to StringType

In PySpark, how do I convert a Dataframe to normal String?
Background:
I'm using PySpark with Kafka and instead of hard coding broker name, I have parameterized Kafka broker name in PySpark.
Json file is holding the Broker details and Spark read this Json input and assign values to variable. These variables are of Dataframe type with String.
I'm facing issue when I pass dataframe to Pyspark-Kakfa connection details to substitute the values.
Error :
Can only concatenate String (Not a Dataframe) to String.
Json parameter file :
{
"broker": "https://at.com:8082",
"topicname": "dev_hello"
}
PySpark Code :
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker")
ktopic = parameter.select("topicname")
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
.write
.format("kafka")
.outputMode("append")
.option("kafka.bootstrap.servers", "f"+ **kserver**)
.option("topic", "josn_data_topic",**ktopic** )
.save()
Please advise on it.
my second query is how do I pass these Python based variables to another Scala based Spark notebook.
Use json.load instead of Spark json reader:
import json
with open("/at/dev_parameter.json") as f:
parameter = json.load(f)
kserver = parameter["broker"]
ktopic = parameter["topicname"]
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value") \
.write \
.format("kafka") \
.outputMode("append") \
.option("kafka.bootstrap.servers", kserver) \
.option("topic", ktopic) \
.save()
If you prefer using Spark json reader, you can do:
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker").head()[0]
ktopic = parameter.select("topicname").head()[0]

Cannot write dict object as correct JSON to file

I try to read JSON from file, get values, transform them and back write to new file.
{
"metadata": {
"info": "important info"
},
"timestamp": "2018-04-06T12:19:38.611Z",
"content": {
"id": "1",
"name": "name test",
"objects": [
{
"id": "1",
"url": "http://example.com",
"properties": [
{
"id": "1",
"value": "1"
}
]
}
]
}
}
Above is a JSON that I read from file.
Below I attach a python program that gets values, creates new JSON and write it to file.
import json
from pprint import pprint
def load_json(file_name):
return json.load(open(file_name))
def get_metadata(json):
return json["metadata"]
def get_timestamp(json):
return json["timestamp"]
def get_content(json):
return json["content"]
def create_json(metadata, timestamp, content):
dct = dict(__metadata=metadata, timestamp=timestamp, content=content)
return json.dumps(dct)
def write_json_to_file(file_name, json_content):
with open(file_name, 'w') as file:
json.dump(json_content, file)
STACK_JSON = 'stack.json';
STACK_OUT_JSON = 'stack-out.json'
if __name__ == '__main__':
json_content = load_json(STACK_JSON)
print("Loaded JSON:")
print(json_content)
metadata = get_metadata(json_content)
print("Metadata:", metadata)
timestamp = get_timestamp(json_content)
print("Timestamp:", timestamp)
content = get_content(json_content)
print("Content:", content)
created_json = create_json(metadata, timestamp, content)
print("\n\n")
print(created_json)
write_json_to_file(STACK_OUT_JSON, created_json)
But the problem is that create json is not correct. Finally as result I get:
"{\"__metadata\": {\"info\": \"important info\"}, \"timestamp\": \"2018-04-06T12:19:38.611Z\", \"content\": {\"id\": \"1\", \"name\": \"name test\", \"objects\": [{\"id\": \"1\", \"url\": \"http://example.com\", \"properties\": [{\"id\": \"1\", \"value\": \"1\"}]}]}}"
It is not that what I want to achieve. It's not correct JSON. What do I wrong?
Solution:
Change the write_json_to_file(...) method like this:
def write_json_to_file(file_name, json_content):
with open(file_name, 'w') as file:
file.write(json_content)
Explanation:
The problem is, that when you're calling write_json_to_file(STACK_OUT_JSON, created_json) at the end of your script, the variable created_json contains a string - it's the JSON representation of the dictionary created in the create_json(...) function. But inside the write_json_to_file(file_name, json_content), you're calling:
json.dump(json_content, file)
You're telling the json module write the JSON representation of variable json_content (which contains a string) into the file. And JSON representation of a string is a single value encapsulated in double-quotes ("), with all the double-quotes it contains escaped by \.
What you want to achieve is to simply write the value of the json_content variable into the file and not have it first JSON-serialized again.
Problem
You're converting a dict into a json and then right before you write it into a file, you're converting it into a json again. When you retry to convert a json to a json it gives you the \" since it's escaping the " since it assumes that you have a value there.
How to solve it?
It's a great idea to read the json file, convert it into a dict and perform all sorts of operations to it. And only when you want to print out an output or write to a file or return an output you convert to a json since json.dump() is expensive, it adds 2ms (approx) of overhead which might not seem much but when your code is running in 500 microseconds it's almost 4 times.
Other Recommendations
After seeing your code, I realize you're coming from a java background and while in java the getThis() or getThat() is a great way to module your code since we represent our code in classes in java, in python it just causes problems in the readability of the code as mentioned in the PEP 8 style guide for python.
I've updated the code below:
import json
def get_contents_from_json(file_path)-> dict:
"""
Reads the contents of the json file into a dict
:param file_path:
:return: A dictionary of all contents in the file.
"""
try:
with open(file_path) as file:
contents = file.read()
return json.loads(contents)
except json.JSONDecodeError:
print('Error while reading json file')
except FileNotFoundError:
print(f'The JSON file was not found at the given path: \n{file_path}')
def write_to_json_file(metadata, timestamp, content, file_path):
"""
Creates a dict of all the data and then writes it into the file
:param metadata: The meta data
:param timestamp: the timestamp
:param content: the content
:param file_path: The file in which json needs to be written
:return: None
"""
output_dict = dict(metadata=metadata, timestamp=timestamp, content=content)
with open(file_path, 'w') as outfile:
json.dump(output_dict, outfile, sort_keys=True, indent=4, ensure_ascii=False)
def main(input_file_path, output_file_path):
# get a dict from the loaded json
data = get_contents_from_json(input_file_path)
# the print() supports multiple args so you don't need multiple print statements
print('JSON:', json.dumps(data), 'Loaded JSON as dict:', data, sep='\n')
try:
# load your data from the dict instead of the methods since it's more pythonic
metadata = data['metadata']
timestamp = data['timestamp']
content = data['content']
# just cumulating your print statements
print("Metadata:", metadata, "Timestamp:", timestamp, "Content:", content, sep='\n')
# write your json to the file.
write_to_json_file(metadata, timestamp, content, output_file_path)
except KeyError:
print('Could not find proper keys to in the provided json')
except TypeError:
print('There is something wrong with the loaded data')
if __name__ == '__main__':
main('stack.json', 'stack-out.json')
Advantages of the above code:
More Modular and hence easily unit testable
Handling of exceptions
Readable
More pythonic
Comments because they are just awesome!

Reading JSON with Apache Spark - `corrupt_record`

I have a json file, nodes that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications and I am confident it is not corrupt and sound json.
As Spark expects "JSON Line format" not a typical JSON format, we can tell spark to read typical JSON by specifying:
val df = spark.read.option("multiline", "true").json("<file>")
Spark cannot read JSON-array to a record on top-level, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple. Spark expects you to pass a file with a lot of JSON-entities (entity per line), so it could distribute their processing (per entity, roughly saying).
To put more light on it, here is a quote form the official doc
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL. Basically it's an alternative to CSV.
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended, from the wholeTextFiles docs
Small files are preferred, large file is also allowable, but may cause bad performance.
I run into the same problem. I used sparkContext and sparkSql on the same configuration:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("Simple Application")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.config(conf)
.getOrCreate()
Then, using the spark context I read the whole json (JSON - path to file) file:
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters...
val schema = StructType( List(
StructField("toid", StringType, nullable = true),
StructField("point", ArrayType(DoubleType), nullable = true),
StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using spark sql:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2"