How to query using column names that include "$"?

In Spark SQL, I can use
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .master("local")
  .config("spark.sql.warehouse.dir", "warehouseLocation-value")
  .getOrCreate()
val df = spark.read.json("source/myRecords.json")
df.createOrReplaceTempView("shipment")
val sqlDF = spark.sql("SELECT * FROM shipment")
to get the data from "myRecords.json", and the structure of this json file is:
df.printSchema()
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- container: struct (nullable = true)
| |-- barcode: string (nullable = true)
| |-- code: string (nullable = true)
I can get specific columns of this JSON, such as:
val sqlDF = spark.sql("SELECT container.barcode, container.code FROM shipment")
But how can I get _id.$oid from this json file?
I have tried "SELECT _id.$oid FROM shipment" and "SELECT _id.\$oid FROM shipment", but neither works.
error message:
error: invalid escape character
Can anyone tell me how to get _id.$oid?

Backticks are your friend:
spark.read.json(sc.parallelize(Seq(
  """{"_id": {"$oid": "foo"}}""")
)).createOrReplaceTempView("df")
spark.sql("SELECT _id.`$oid` FROM df").show
+----+
|$oid|
+----+
| foo|
+----+
The same works with the DataFrame API:
spark.table("df").select($"_id".getItem("$oid")).show
+--------+
|_id.$oid|
+--------+
| foo|
+--------+
or
spark.table("df").select($"_id.$$oid")
+--------+
|_id.$oid|
+--------+
| foo|
+--------+
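For reference, the same escaping carries over to PySpark; a minimal sketch, reusing the df view created above:
from pyspark.sql import functions as F

# backticks inside the SQL string, exactly as in Scala
spark.sql("SELECT _id.`$oid` FROM df").show()
# or getField on the struct column in the DataFrame API
spark.table("df").select(F.col("_id").getField("$oid")).show()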

Related

Transform a list of JSON strings to a list of dicts in PySpark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a UDF or an RDD.
I have this kind of dataframe:
Key    | JSON_string
123456 | ["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]
789123 | ["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]
How can I transform col('JSON_string') to [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}] using only built-in PySpark functions?
I tried many functions, such as create_map, collect_list, from_json, to_json, explode, json.loads, json.dump, but couldn't get the expected result.
Thank you for your help
Explode your JSON_string column, read each element as JSON, then group by again:
import pyspark.sql.functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema
df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))
df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)
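Since the question rules out RDDs, note that df.rdd above is used only to infer the schema; a sketch that avoids it by spelling the schema out as a DDL string (the field list here is hand-written from the sample rows, so it is an assumption about the data):
import pyspark.sql.functions as f

# DDL-formatted schema string, assumed from the two sample JSON shapes
ddl_schema = 'Zipcode BIGINT, ZipCodeType STRING, City STRING, State STRING'
df_result = df.withColumn('JSON_string', f.explode('JSON_string')) \
    .withColumn('JSON', f.from_json('JSON_string', ddl_schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))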

PySpark: parse JSON given a deeply nested schema

I have a PySpark dataframe with a column that holds a JSON string in each row.
+----------------------------------------------------------------------------+
|data |
+----------------------------------------------------------------------------+
|{"student":{"name":"Bob","surname":"Smith","age":18},"scholarship":true} |
|{"student":{"name":"Adam","surname":"Smith","age":"23"},"scholarship":false}|
+----------------------------------------------------------------------------+
I want to parse these JSON strings so that the result conforms to the following schema:
root
|-- scholarship: boolean (nullable = true)
|-- student: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
So, my solution is the following:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, BooleanType, LongType, StringType

my_schema = StructType([
    StructField("scholarship", BooleanType(), True),
    StructField(
        "student",
        StructType([
            StructField("age", LongType(), True),
            StructField("name", StringType(), True),
            StructField("surname", StringType(), True)
        ]),
        True
    )
])
parsed_df = my_df.withColumn("data", from_json(col("data"), my_schema))
This way, parsed_df is the following:
+------------------------+
|data |
+------------------------+
|{true, {18, Bob, Smith}}|
|{false, null} |
+------------------------+
Instead, I would like the output to be:
+----------------------------+
|data |
+----------------------------+
|{true, {18, Bob, Smith}} |
|{false, {null, Adam, Smith}}|
+----------------------------+
Is there any option of the from_json method, or any alternative solution, that achieves this result?
I should add that I cannot use Databricks, and that (unlike in this example) in the business use case I don't define the schema myself; it is passed to me every time. So my question is more general: given a Spark schema, is there any way to parse a JSON string column in a dataframe, even when the JSON is deeply nested?
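One possible workaround, sketched under the assumption that from_json nulls the student struct because "23" fails the strict LongType parse: parse with a lenient copy of the schema in which the fragile leaf is a string, then cast the parsed struct back to the exact target schema. With a schema that is passed in, the lenient copy would have to be derived programmatically; here it is hand-written:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, BooleanType, StringType

# lenient twin of my_schema: age is parsed as a string, so "23" survives
lenient_schema = StructType([
    StructField("scholarship", BooleanType(), True),
    StructField("student", StructType([
        StructField("age", StringType(), True),
        StructField("name", StringType(), True),
        StructField("surname", StringType(), True)
    ]), True)
])

parsed_df = my_df \
    .withColumn("data", F.from_json(F.col("data"), lenient_schema)) \
    .withColumn("data", F.col("data").cast(my_schema))  # field-wise cast to the target types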

Reading an element from a json object stored in a column

I have the following dataframe
+---+---------------------------+
|key|value                      |
+---+---------------------------+
|1  |{"name":"John", "age": 34} |
|2  |{"name":"Rose", "age": 50} |
+---+---------------------------+
I want to retrieve all age values within this dataframe and store them later in an array.
val x = df_clean.withColumn("value", col("value.age"))
x.show(false)
But this throws an exception.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from value#89: need struct type but got string;
How can I resolve this?
EDIT
val schema = existingSparkSession.read.json(df_clean.select("value").as[String]).schema
val my_json = df_clean.select(from_json(col("value"), schema).alias("jsonValue"))
my_json.printSchema()
val df_final = my_json.withColumn("age", col("jsonValue.age"))
df_final.show(false)
Now no exception is thrown, yet I can't see any output either.
EDIT 2
println("---+++++--------")
df_clean.select("value").take(1)
println("---+++++--------")
output
---+++++--------
---+++++--------
If you have a long JSON and want the schema created for you, you can infer it from the data and use from_json with that schema:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "{\"name\":\"John\", \"age\": 34}"),
  (2, "{\"name\":\"Rose\", \"age\": 50}")
).toDF("key", "value")
val schema = spark.read.json(df.select("value").as[String]).schema
val resultDF = df.withColumn("value", from_json($"value", schema))
resultDF.show(false)
resultDF.printSchema()
Output:
+---+----------+
|key|value |
+---+----------+
|1 |{34, John}|
|2 |{50, Rose}|
+---+----------+
Schema:
root
|-- key: integer (nullable = false)
|-- value: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
If you need to access the nested fields directly, you can use get_json_object (note that it always returns the extracted value as a string):
df.withColumn("name", get_json_object($"value", "$.name"))
.withColumn("age", get_json_object($"value", "$.age"))
.show(false)

Parse JSON data with PySpark

I am using PySpark to read the JSON file below:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}
I wrote the following Python code:
from pyspark.sql import Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql import functions as F

schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])
df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()
df2 = df.withColumn("json", F.col("data").cast("string"))
df3=df2.select(F.col("json"))
df3.collect()
df4 =df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I am getting the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this?
When you select the column labeled json, you’re selecting a column that is entirely of the StringType (logically, because you’re casting it to that type). Even though it looks like a valid JSON object, it’s really just a string. df2.data does not have that issue though:
In [2]: df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: double (nullable = true)
|-- json: string (nullable = true)
By the way, you can immediately pass in the schema on read:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
...: df.printSchema()
...:
...:
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: string (nullable = true)
You can dig down in the columns to reach the nested values:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+

How to parse nested JSON objects in Spark SQL?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)
Assuming you read in a JSON file and printed the schema you are showing us like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested objects inside a struct type like so...
DataFrame app = df.select("apps");
app.registerTempTable("app");
app.printSchema();
app.show();
DataFrame appName = app.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
Try this:
val nameAndAddress = sqlContext.sql("""
  SELECT name, address.city, address.state
  FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
Have you tried doing it straight from the SQL query, like
SELECT apps.Ratings FROM yourTableName
This will probably return an array, and you can more easily access the elements inside.
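For instance, a PySpark sketch (yourTableName standing in for a registered view of this data, as in the answer above): explode turns each struct of the apps array into its own row, after which the fields are directly addressable:
from pyspark.sql import functions as F

# one row per element of the apps array
exploded = spark.table("yourTableName") \
    .select("id", F.explode("apps").alias("app"))
exploded.select("id", "app.appName", "app.Ratings").show(truncate=False)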
Also, I use this online Json viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/
I am using pyspark, but the logic should be similar.
I found this way of parsing my nested json useful:
df.select(df.apps.appName.alias("apps_Name"),
          df.apps.appPackage.alias("apps_Package"),
          df.apps.Ratings.date.alias("apps_Ratings_date")) \
  .show()
The code could obviously be shortened with an f-string; see the sketch below.
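For instance, a sketch of that shortening, rebuilding the aliases from the answer above with f-strings in a comprehension:
# build the aliased columns programmatically instead of repeating them
cols = [df.apps[field].alias(f"apps_{field}")
        for field in ("appName", "appPackage")]
df.select(*cols, df.apps.Ratings.date.alias("apps_Ratings_date")).show()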
val df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df")
spark.sql("SELECT app.Ratings FROM df LATERAL VIEW explode(apps) t AS app WHERE app.appName LIKE '%app_name%'").show()