Apache Spark-Read dataframe as text ,edit it and then save as json - json

In Pyspark, I am trying to read a dataframe as text in order to replace 'False' with FALSE and then write it as a json.
I read the dataframe as text in order to replace 'False' with FALSE with regexp_replace so for example:
dftext=spark.read.text("path")
dftext = dftext.withColumn('value', regexp_replace(col('value'), 'False', 'FALSE'))
The dataframe must be formed into JSON after edit.The dftext.show() looks like this
--------------------+
| value|
+--------------------+
|{"id":"57","insta...|
|{"id":"58","insta...|
|{"id":"59","insta...|
I tried using the schema from the original df in order to form the final json but that doesnt seem to work.
schema=df.schema
dfJSON = dftext.withColumn("jsonData",from_json(col("value"),schema)).select("jsonData.*")

Related

Most efficient way to read concatenated json in PySpark?

I'm getting one json file where each line in the json is a json itself of 1000 objects, like this:
{"id":"test1", "results": [{"property1": "sample1"},{"property2": "sample2"}]}
{"id":"test2", "results": [{"property1": "sample3"},{"property2": "sample4"}]}
If I read it as a json using spark.read.json(filepath), I'm getting:
+-----+--------------------+
| id| results|
+-----+--------------------+
|test1|[{sample1, null},...|
+-----+--------------------+
(Which is only the first json in the concatenated json)
While I'm trying to get:
+-----+---------+---------+
|id |property1|property2|
+-----+---------+---------+
|test1|sample1 |sample1 |
|test2|sample3 |sample4 |
+-----+---------+---------+
I end up by reading the json as text, and iterate over each row to treat it as json and union each dataframe:
df = (spark.read.text(data[self.files]))
dataCollect = df.collect()
i = 0
for row in dataCollect:
df_row = flatten_json(spark.read.json(spark.sparkContext.parallelize(row)))
if i == 0:
df_all = df_row
else:
df_all = df_row.unionByName(df_all, allowMissingColumns = True)
i = i + 1
flatten_json is a helper that helps me to automatically flatten the json.
I guess there is a better approach, any help would be much appreciate
Your JSON file is called JSON Lines or JSONL which is a supported file format that Pyspark can handle natively. So, use the regular spark.read.json to read it and perform the additional transformations to match with what you want.
df = spark.read.json('yourfile.json or json/directory')
# Explode the array into structs. This will generate lots of nulls.
df = (df.select('id', F.explode('results').alias('results'))
.select('id', 'results.*'))
# Group them and aggregate to remove the nulls.
df = (df.groupby('id')
.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id']))
I think this works fine for 1000 lines JSONL, however, if you are curious about alternative solution that doesn't involve generating/removing nulls, please check here: By using PySpark how to parse nested json. In some situations, the alternative solution which doesn't do explode could be more performant.

JSON in spark sql dataframe [duplicate]

I asked the question a while back for python, but now I need to do the same thing in PySpark.
I have a dataframe (df) like so:
|cust_id|address |store_id|email |sales_channel|category|
-------------------------------------------------------------------
|1234567|123 Main St|10SjtT |idk#gmail.com|ecom |direct |
|4567345|345 Main St|10SjtT |101#gmail.com|instore |direct |
|1569457|876 Main St|51FstT |404#gmail.com|ecom |direct |
and I would like to combine the last 4 fields into one metadata field that is a json like so:
|cust_id|address |metadata |
-------------------------------------------------------------------------------------------------------------------
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk#gmail.com','sales_channel':'ecom', 'category':'direct'} |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101#gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404#gmail.com','sales_channel':'ecom', 'category':'direct'} |
Here's the code I used to do this in python:
cols = [
'store_id',
'store_category',
'sales_channel',
'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
but I would like to translate this to PySpark code to work with a spark dataframe; I do NOT want to use pandas in Spark.
Use to_json function to create json object!
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([('1234567','123 Main St','10SjtT','idk#gmail.com','ecom','direct')],['cust_id','address','store_id','email','sales_channel','category'])
df.select("cust_id","address",to_json(struct("store_id","category","sales_channel","email")).alias("metadata")).show(10,False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk#gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing list of columns:
ll=['store_id','email','sales_channel','category']
df.withColumn("metadata", to_json(struct([x for x in ll]))).drop(*ll).show()
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk#gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+
#Shu gives a good answer, here's a variant that works out slightly better for my use case. I'm going from Kafka -> Spark -> Kafka and this one liner does exactly what I want. The struct(*) will pack up all the fields in the dataframe.
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')

How to add field within nested JSON when reading from/writing to Kafka via a Spark dataframe

I've a Spark (v.3.0.1) job written in Java that reads Json from Kafka, does some transformation and then writes it back to Kafka. For now, the incoming message structure in Kafka is something like:
{"catKey": 1}. The output from the Spark job that's written back to Kafka is something like: {"catKey":1,"catVal":"category-1"}. The code for processing input data from Kafka goes something as follows:
DataFrameReader dfr = putSrcProps(spark.read().format("kafka"));
for (String key : srcProps.stringPropertyNames()) {
dfr = dfr.option(key, srcProps.getProperty(key));
}
Dataset<Row> df = dfr.option("group.id", getConsumerGroupId())
.load()
.selectExpr("CAST(value AS STRING) as value")
.withColumn("jsonData", from_json(col("value"), schemaHandler.getSchema()))
.select("jsonData.*");
// transform df
df.toJSON().write().format("kafka").option("key", "val").save()
I want to change the message structure in Kafka. Now, it should be of the format: {"metadata": <whatever>, "payload": {"catKey": 1}}. While reading, we need to read only the contents of the payload, so the dataframe remains similar. Also, while writing back to Kafka, first I need to wrap the msg in payload, add a metadata. The output will have to be of the format: {"metadata": <whatever>, "payload": {"catKey":1,"catVal":"category-1"}}. I've tried manipulating the contents of the selectExpr and from_json method, but no luck so far. Any pointer on how to achieve the functionality would be very much appreciated.
To extract the content of payload in your JSON you can use get_json_object. And to create the new output you can use the built-in functions struct and to_json.
Given a Dataframe:
val df = Seq(("""{"metadata": "whatever", "payload": {"catKey": 1}}""")).toDF("value").as[String]
df.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|{"metadata": "whatever", "payload": {"catKey": 1}}|
+--------------------------------------------------+
Then creating the new column called "value"
val df2 = df
.withColumn("catVal", lit("category-1")) // whatever your logic is to fill this column
.withColumn("payload",
struct(
get_json_object(col("value"), "$.payload.catKey").as("catKey"),
col("catVal").as("catVal")
)
)
.withColumn("metadata",
get_json_object(col("value"), "$.metadata"),
).select("metadata", "payload")
df2.show(false)
+--------+---------------+
|metadata|payload |
+--------+---------------+
|whatever|[1, category-1]|
+--------+---------------+
val df3 = df2.select(to_json(struct(col("metadata"), col("payload"))).as("value"))
df3.show(false)
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"metadata":"whatever","payload":{"catKey":"1","catVal":"category-1"}}|
+----------------------------------------------------------------------+

How to dynamically reference items in a JSON struct using pyspark [duplicate]

I have a pyspark dataframe with StringType column (edges), which contains a list of dictionaries (see example below). The dictionaries contain a mix of value types, including another dictionary (nodeIDs). I need to explode the top-level dictionaries in the edges field into rows; ideally, I should then be able to convert their component values into separate fields.
Input:
import findspark
findspark.init()
SPARK = SparkSession.builder.enableHiveSupport() \
.getOrCreate()
data = [
Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', edges='[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]', count=156, level=36),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', edges='[{"distance":0,"duration":0,"speed":0,"nodeIDs":{"nodeA":520686131,"nodeB":520686216}},{"distance":8.654358326561642,"duration":3.1,"speed":2.8,"nodeIDs":{"nodeA":520686216,"nodeB":506361795}}]', count=179, level=258)
]
df = SPARK.createDataFrame(data)
Desired output:
data_reshaped = [
Row(trace_uuid='aaaa', timestamp='2019-05-20T10=36=33+02=00', distance=4.382441320292239, duration=1.5, speed=2.9, nodeA=954752475, nodeB=1665827480, count=156, level=36),
Row(trace_uuid='aaaa', timestamp='2019-05-20T10=36=33+02=00', distance=16.134844841712574, duration=2.9,speed=5.6, nodeA=1665827480, nodeB=3559056131, count=156, level=36),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11=36=10+03=00', distance=0, duration=0, speed=0, nodeA=520686131, nodeB=520686216, count=179, level=258),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11=36=10+03=00', distance=8.654358326561642, duration=3.1, speed=2.8, nodeA=520686216, nodeB=506361795, count=179, level=258)
]
Is there a way to do that? I've tried using cast to cast the edges field into an array first, but I can't figure out how to get it to work with the mixed data types.
I'm using Spark 2.4.0.
You can use from_json() with schema_of_json() to infer the JSON schema. for example:
from pyspark.sql import functions as F
# a sample json string:
edges_json_sample = data[0].edges
# or edges_json_sample = df.select('edges').first()[0]
>>> edges_json_sample
#'[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]'
# infer schema from the sample string
schema = df.select(F.schema_of_json(edges_json_sample)).first()[0]
>>> schema
#u'array<struct<distance:double,duration:double,nodeIDs:struct<nodeA:bigint,nodeB:bigint>,speed:double>>'
# convert json string to data structure and then retrieve desired items
new_df = df.withColumn('data', F.explode(F.from_json('edges', schema))) \
.select('*', 'data.*', 'data.nodeIDs.*') \
.drop('data', 'nodeIDs', 'edges')
>>> new_df.show()
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
|count|level| timestamp|trace_uuid| distance|duration|speed| nodeA| nodeB|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
| 156| 36|2019-05-20T10:36:...| aaaa|4.382441320292239| 1.5| 2.9| 954752475|1665827480|
| 156| 36|2019-05-20T10:36:...| aaaa|14.48582171131768| 2.6| 5.6|1665827480|3559056131|
| 179| 258|2019-05-20T11:36:...| bbbb| 0.0| 0.0| 0.0| 520686131| 520686216|
| 179| 258|2019-05-20T11:36:...| bbbb|8.654358326561642| 3.1| 2.8| 520686216| 506361795|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
# expected result
data_reshaped = new_df.rdd.collect()

read alphanumeric field of json file without quote

I want to read alphanumeric field of json file without quote:
I tried to cast this field to String at the time of reading json But didn't work.
json file (abc.json):
{"col1": 1a2bc}
struct1 = StructType([StructField("col1", StringType(), True)])
df = spark.read.json(abc.json, schema=struct1)
o/p :
>>> df.show()
+----+
|col1|
+----+
|null|
+----+
But if i change my json file to {"col1": '1a2bc'} then this works file.
Could you please suggest a method to read alnum filed without quote.