R data frame to array json - json

I have a data frame:
id|name | surname
------------------
1 |James| Smith
2 |Mat | Stone
3 |Stan | Daimon
I need to convert this to a JSON array of objects (just a string):
[
{id:1,name:"James",surname:"Smith"},
{id:2,name:"Mat",surname:"Stone"},
{id:3,name:"Stan",surname:"Daimon"}
]

We can use toJSON from the jsonlite package:
library(jsonlite)
toJSON(df1)
#[{"id":1,"name":"James","surname":"Smith"},{"id":2,"name":"Mat","surname":"Stone"},{"id":3,"name":"Stan","surname":"Daimon"}]
Data:
df1 <- structure(list(id = 1:3, name = c("James", "Mat", "Stan"),
    surname = c("Smith", "Stone", "Daimon")), .Names = c("id", "name", "surname"),
    class = "data.frame", row.names = c(NA, -3L))
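For readers who want to see what a row-oriented serialization like the toJSON output above amounts to, here is a minimal sketch in plain Python. It is not how jsonlite works internally; the `columns` dict and the `rows_to_json` helper are hypothetical names introduced only for illustration.

```python
import json

# Column-oriented data mirroring df1 from the answer above (illustrative copy).
columns = {
    "id": [1, 2, 3],
    "name": ["James", "Mat", "Stan"],
    "surname": ["Smith", "Stone", "Daimon"],
}


def rows_to_json(cols):
    """Turn a dict of equal-length columns into a JSON array of row objects."""
    names = list(cols)
    # zip(*values) walks the columns in parallel, giving one tuple per row.
    records = [dict(zip(names, values)) for values in zip(*cols.values())]
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps(records, separators=(",", ":"))


json_string = rows_to_json(columns)
print(json_string)
```

The result is a single string, one object per row, which is the shape asked for in the question.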

Related

Pyspark dataframe with json, iteration to create new dataframe

I have data with the following format:
customer_id | model
------------------------------------------------------------------------
1           | [{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2           | [{color: 'red', group: 'A'}]
I need to process it so that I create a new dataframe with the following output:
customer_id | color | group
---------------------------
1           | red   | A
1           | green | B
2           | red   | A
Now I can do this easily with Python:
import pandas as pd
import json
newdf = pd.DataFrame([])
for index, row in df.iterrows():
    s = row['model']
    x = json.loads(s)
    colors_list = []
    users_list = []
    groups_list = []
    for i in range(len(x)):
        colors_list.append(x[i]['color'])
        users_list.append(row['customer_id'])
        groups_list.append(x[i]['group'])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    newdf = pd.concat([newdf, pd.DataFrame({'customer_id': users_list, 'group': groups_list, 'color': colors_list})])
How can I achieve the same result with pyspark?
I'm showing the first rows and schema of original dataframe:
+-----------+--------------------+
|customer_id| model |
+-----------+--------------------+
| 3541|[{"score":0.04767...|
| 171811|[{"score":0.04473...|
| 12008|[{"score":0.08043...|
| 78964|[{"score":0.06669...|
| 119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows
root
|-- user_id: integer (nullable = true)
|-- groups: string (nullable = true)
from_json can parse a string column that contains JSON data:
from pyspark.sql import functions as F
from pyspark.sql import types as T
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]
df = spark.createDataFrame(data, schema = ["customer_id", "model"]) \
    .withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
    .withColumn("model", F.explode("model")) \
    .withColumn("color", F.col("model")["color"]) \
    .withColumn("group", F.col("model")["group"]) \
    .drop("model")
Result:
+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
| 1| red| A|
| 1|green| B|
| 2| red| A|
+-----------+-----+-----+
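The parse-then-explode step can also be pictured without Spark. The following is only a sketch in plain Python: it assumes the model strings are valid JSON (quoted keys, double quotes), which the simplified sample above is not, and the `rows` data and `explode_model` helper are hypothetical.

```python
import json

# Hypothetical input rows: (customer_id, model), where model is a JSON string.
rows = [
    (1, '[{"color": "red", "group": "A"}, {"color": "green", "group": "B"}]'),
    (2, '[{"color": "red", "group": "A"}]'),
]


def explode_model(rows):
    """Yield one (customer_id, color, group) tuple per element of each model array."""
    for customer_id, model in rows:
        # json.loads plays the role of from_json; the inner loop plays explode.
        for item in json.loads(model):
            yield (customer_id, item.get("color"), item.get("group"))


exploded = list(explode_model(rows))
for row in exploded:
    print(row)
```

Each array element becomes its own output row, with the customer_id repeated, which is the same shape as the result table above.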

Read JSON as dataframe using Pyspark

I am trying to read a JSON document which looks like this
{"id":100, "name":"anna", "hometown":"chicago"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":700, "name":"anna", "hometown":"dudley"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1300, "name":"sarah", "hometown":"hoboken"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1100, "name":"don", "hometown":"santa monica"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1200, "name":"jane", "hometown":"freemont"},{"id":1600, "name":"john", "hometown":"downtown"},{"id":1500, "name":"glenn", "hometown":"uptown"}]
{"id":1400, "name":"steve", "hometown":"newtown"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":600, "name":"john", "hometown":"san jose"},{"id":900, "name":"james", "hometown":"aurora"},{"id":1000, "name":"peter", "hometown":"elgin"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1500, "name":"glenn", "hometown":"uptown"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1500, "name":"glenn", "hometown":"uptown"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":300, "name":"frank", "hometown":"new york"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1100, "name":"don", "hometown":"santa monica"}]
There is a space between a key and a value (value is list containing json text).
Code which I tried
data = spark\
.read\
.format("json")\
.load("/Users/sahilnagpal/Desktop/dataworld.json")
data.show()
Result I get
+------------+----+-----+
| hometown| id| name|
+------------+----+-----+
| chicago| 100| anna|
|santa monica|1100| don|
| newtown|1400|steve|
| uptown|1500|glenn|
+------------+----+-----+
Result I want
+------------+----+-----+
| hometown| id| name|
+------------+----+-----+
| chicago| 100| anna| -- all the other ID,name,hometown corresponding to this ID and Name
|santa monica|1100| don| -- all the other ID,name,hometown corresponding to this ID and Name
| newtown|1400|steve| -- all the other ID,name,hometown corresponding to this ID and Name
| uptown|1500|glenn| -- all the other ID,name,hometown corresponding to this ID and Name
+------------+----+-----+
I think you should read it as a text file rather than as a JSON file, because each line as a whole is not valid JSON.
Below is the code that you should try to get the output that you expect:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data1 = spark.read.text("/Users/sahilnagpal/Desktop/dataworld.json")
schema = StructType(
    [
        StructField('id', StringType(), True),
        StructField('name', StringType(), True),
        StructField('hometown', StringType(), True)
    ]
)
data2 = (
    data1
    .withColumn("JsonKey", split(col("value"), "\\[")[0])
    .withColumn("JsonValue", split(col("value"), "\\[")[1])
    .withColumn("data", from_json("JsonKey", schema))
    .select(col('data.*'), 'JsonValue')
)
The result has the parsed id, name, and hometown columns plus a JsonValue string column holding the text that follows the opening bracket.
You can read the input as a CSV file using two spaces as the separator/delimiter. Then parse each column separately using from_json with an appropriate schema.
from pyspark.sql import functions as F
# sep is two spaces, matching the separator described above
df = spark.read.csv('/Users/sahilnagpal/Desktop/dataworld.json', sep='  ').toDF('json1', 'json2')
df2 = df.withColumn(
    'json1',
    F.from_json('json1', 'struct<id:int, name:string, hometown:string>')
).withColumn(
    'json2',
    F.from_json('json2', 'array<struct<id:int, name:string, hometown:string>>')
).select('json1.*', 'json2')
df2.show(truncate=False)
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |hometown |json2 |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |anna |chicago |[[200, beth, indiana], [400, pete, new jersey], [500, emily, san fransisco], [700, anna, dudley], [1100, don, santa monica], [1300, sarah, hoboken], [1600, john, downtown]]|
|1100|don |santa monica|[[100, anna, chicago], [400, pete, new jersey], [500, emily, san fransisco], [1200, jane, freemont], [1600, john, downtown], [1500, glenn, uptown]] |
|1400|steve|newtown |[[100, anna, chicago], [600, john, san jose], [900, james, aurora], [1000, peter, elgin], [1100, don, santa monica], [1500, glenn, uptown], [1600, john, downtown]] |
|1500|glenn|uptown |[[200, beth, indiana], [300, frank, new york], [400, pete, new jersey], [500, emily, san fransisco], [1100, don, santa monica]] |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
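Both answers above rely on the same idea: each line holds a JSON object followed by a JSON array, so the two parts have to be parsed separately. A minimal plain-Python sketch of that split, using the standard library's `json.JSONDecoder.raw_decode`, might look like this. The `parse_line` helper and the shortened `sample` line are assumptions made for illustration, not part of either answer.

```python
import json


def parse_line(line):
    """Split one 'object array' line into (key_object, value_list)."""
    decoder = json.JSONDecoder()
    # raw_decode parses the first JSON document and reports where it ended.
    key, end = decoder.raw_decode(line)
    # Whatever follows the first document is the array of related records.
    value = json.loads(line[end:].strip())
    return key, value


# A shortened version of the first input line, for illustration only.
sample = (
    '{"id":100, "name":"anna", "hometown":"chicago"} '
    '[{"id":200, "name":"beth", "hometown":"indiana"},'
    '{"id":400, "name":"pete", "hometown":"new jersey"}]'
)

key, value = parse_line(sample)
print(key["name"], len(value))
```

Because the split point comes from the parser rather than from a fixed delimiter, this variant does not depend on how many spaces separate the two parts.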

spark dataframes : reading json having duplicate column names but different datatypes

I have json data like below where version field is the differentiator -
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is ArrayType(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both types of JSON files, which arrive together. I cannot define stats as a string in schema_def, because that would affect the corrupt-record feature (for the stats column), which checks for malformed JSON and schema compliance of all the fields.
Example of the required df output, produced in a single read:
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried reading with the mergeSchema option, but that makes the stats field String type.
I have also tried making two dataframes by filtering on the version field and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here, too, the stats column becomes String type.
I was wondering whether, while reading, I could load the column as stats_v1 and stats_v2 depending on its data type. Please help with any possible solutions.
Use a UDF to check whether stats holds a JSON object or an array; if it is a single object, the UDF wraps it in an array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats
  def toArray(data: String) = {
    val json_data = (parse(data))
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
  .withColumn("json_stats", explode(from_json(toJsonArray($"stats"), schema)))
  .select(
    $"version",
    $"stats",
    $"json_stats".getItem("hour").as("hour"),
    $"json_stats".getItem("hits").as("hits")
  ).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
  .withColumn("stats", expr)
  .withColumn("stats", explode(from_json($"stats", schema)))
  .select(
    $"version",
    $"stats",
    $"stats".getItem("hour").as("hour"),
    $"stats".getItem("hits").as("hits")
  )
  .show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, and then use the resulting schema to read the first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse the value against two or more schemas. (Add wholetext=True as an argument of spark.read.text if each file contains a single JSON document spanning multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
.withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
.selectExpr('version', 'inline_outer(status)') \
.show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+
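The "try one schema, fall back to the other" idea behind the coalesce call can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helpers `as_array`, `as_single`, and `flatten` are hypothetical, and unlike inline_outer the sketch emits no row at all when stats matches neither shape.

```python
import json


def as_array(stats):
    """First candidate shape: stats is already a list of records."""
    return stats if isinstance(stats, list) else None


def as_single(stats):
    """Second candidate shape: stats is one record, so wrap it in a list."""
    return [stats] if isinstance(stats, dict) else None


def flatten(lines):
    """Return (version, hour, hits) rows for both file formats."""
    rows = []
    for line in lines:
        doc = json.loads(line)
        stats = doc.get("stats")
        # Take the first candidate that is not None, much like coalesce does.
        candidates = (as_array(stats), as_single(stats))
        items = next((c for c in candidates if c is not None), [])
        for item in items:
            rows.append((doc.get("version"), item.get("hour"), item.get("hits")))
    return rows


file_1 = '{"version": 1, "stats": {"hits": 20}}'
file_2 = '{"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}'

result = flatten([file_1, file_2])
for row in result:
    print(row)
```

The version 1 record has no hour, so that field comes out as None, matching the null in the table above.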

Merge json column names with case in-sensitive

My JSON column names mix lowercase and uppercase (e.g. title/Title and name/Name), so in the output I get name and Name as two different columns (and likewise title and Title).
How can I make the JSON columns case-insensitive?
config("spark.sql.caseSensitive", "true") -> I tried this, but it is not working.
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._
val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
  .select("col_1", "col_2", "new_col.*")
outputDF.show(false)
Current output:
Expected/Needed output (column names to be case-insensitive):
Solution 1
You can group the columns by their lowercase names and merge them using the coalesce function:
// set spark.sql.caseSensitive to true to avoid ambiguity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
  .select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
  .map(grp =>
    if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
    else col(grp(0))
  ).toSeq
val outputDF = df1.select(mergedCols: _*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
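To make the grouping step concrete, here is a rough plain-Python analogue of Solution 1: column names are grouped by their lowercase form and, within each group, the first non-null value wins. The `rows` data mirrors the col_json values from the question; `merge_case_variants` and `all_columns` are hypothetical names.

```python
# Parsed JSON rows from the question, with inconsistent key casing.
rows = [
    {"Name": "xyz", "Address": "NYC", "title": "engg"},
    {"Name": "mnp", "Address": "MIC", "title": "data"},
    {"name": "pqr", "Address": "MNN", "Title": "bi"},
]

# Every distinct column name across all rows, in first-seen order.
all_columns = []
for row in rows:
    for name in row:
        if name not in all_columns:
            all_columns.append(name)


def merge_case_variants(row, columns):
    """Group columns by lowercase name and keep the first non-null value per group."""
    groups = {}
    for name in columns:
        groups.setdefault(name.lower(), []).append(name)
    merged = {}
    for variants in groups.values():
        # The first spelling seen names the output column, like grp(0) above.
        merged[variants[0]] = next(
            (row.get(v) for v in variants if row.get(v) is not None), None
        )
    return merged


merged_rows = [merge_case_variants(row, all_columns) for row in rows]
for row in merged_rows:
    print(row)
```

As in the Scala version, which spelling names the merged column depends on which variant comes first in the group.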
Solution 2
Another way is to parse the JSON string column into a MapType instead of a StructType. Using transform_keys you can lowercase the keys, then explode the map and pivot to get columns:
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
  "col_json",
  from_json(col("col_json"), MapType(StringType, StringType))
).select(
  col("col_1"),
  col("col_2"),
  explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
  .pivot("key")
  .agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
For this solution, transform_keys is only available since Spark 3; for older versions you can use a UDF:
val mapKeysToLower = udf((m: Map[String, String]) => {
  m.map { case (k, v) => k.toLowerCase -> v }
})
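The map-based approach of Solution 2 can be pictured in plain Python as well: parse each JSON string into a dict, lowercase its keys, then lay the records out under the union of keys. This is a sketch only; `lower_keys`, `records`, `columns`, and `table` are hypothetical names, and the pivot is approximated by a simple list of lists.

```python
import json

# The col_json values from the question.
col_json = [
    '{"Name":"xyz","Address":"NYC","title":"engg"}',
    '{"Name":"mnp","Address":"MIC","title":"data"}',
    '{"name":"pqr","Address":"MNN","Title":"bi"}',
]


def lower_keys(json_string):
    """Parse a JSON object and lowercase every key, like mapKeysToLower above."""
    return {key.lower(): value for key, value in json.loads(json_string).items()}


records = [lower_keys(s) for s in col_json]

# The union of lowercased keys becomes the column list (sorted for stable order).
columns = sorted({key for record in records for key in record})

# One output row per record, with None where a key is absent.
table = [[record.get(c) for c in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

Normalizing the keys before any columns exist sidesteps the duplicate-column problem entirely, which is the design difference from Solution 1.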
You will need to merge your columns, using something like:
import org.apache.spark.sql.functions.when
val mergedDf = df.withColumn("title", when($"title".isNull, $"Title").otherwise($"title")).drop("Title")

flatten nested data structure in Spark

I have the following dataframe:
df.show()
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
| address| coordinates| id|latitude|longitude| name|position| json|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
|Balfour St / Brun...|[-27.463431, 15.352472|79.0| null| null|79 - BALFOUR ST /...| null|[-27.463431, 153.041031]|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
I want to flatten the json column.
I tried:
val jsonSchema = StructType(Seq(
  StructField("latitude", DoubleType, nullable = true),
  StructField("longitude", DoubleType, nullable = true)))
val a = df.select(from_json(col("json"), jsonSchema) as "content")
but a.show() gives me:
+-------+
|content|
+-------+
| null|
+-------+
Any idea how to parse the json column properly so that the content column in the second dataframe (a) is not null?
The raw data looks like this:
{
  "id": 79,
  "name": "79 - BALFOUR ST / BRUNSWICK ST",
  "address": "Balfour St / Brunswick St",
  "coordinates": {
    "latitude": -27.463431,
    "longitude": 153.041031
  }
}
Thanks a lot
The problem is your schema. You are trying to access nested values as if they were top-level fields. I changed the schema so that latitude and longitude sit under coordinates, and it worked for me.
val df = spark.createDataset(
  """
    |{
    |  "id": 79,
    |  "name": "79 - BALFOUR ST / BRUNSWICK ST",
    |  "address": "Balfour St / Brunswick St",
    |  "coordinates": {
    |    "latitude": -27.463431,
    |    "longitude": 153.041031
    |  }
    |}
  """.stripMargin :: Nil)
val jsonSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("coordinates",
    StructType(Seq(
      StructField("latitude", DoubleType, true),
      StructField("longitude", DoubleType, true)
    )), true)
))
val a = df.select(from_json(col("value"), jsonSchema) as "content")
a.show(false)
Output
+--------------------------------------------------------+
|content |
+--------------------------------------------------------+
|[79 - BALFOUR ST / BRUNSWICK ST,[-27.463431,153.041031]]|
+--------------------------------------------------------+
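The difference between the flat schema and the nested one can be seen in a few lines of plain Python. This is only a sketch of the lookup behaviour, not of Spark's from_json; the variable names are hypothetical and the document is the raw record from the question.

```python
import json

raw = """{
  "id": 79,
  "name": "79 - BALFOUR ST / BRUNSWICK ST",
  "address": "Balfour St / Brunswick St",
  "coordinates": {"latitude": -27.463431, "longitude": 153.041031}
}"""

doc = json.loads(raw)

# Looking for latitude at the top level finds nothing, which is what the
# flat schema in the question effectively asks for.
flat_lookup = doc.get("latitude")

# The value lives one level down, under "coordinates".
nested_lookup = doc["coordinates"]["latitude"]

# Flattening means lifting the nested fields up next to the top-level ones.
flattened = {
    "name": doc["name"],
    "latitude": doc["coordinates"]["latitude"],
    "longitude": doc["coordinates"]["longitude"],
}

print(flat_lookup, nested_lookup)
print(flattened)
```

The schema has to mirror the nesting of the document; flattening is then a separate step applied after parsing.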