I'm using Spark 2.4.3 and Scala 2.11.
Below is the JSON string I currently have in a DataFrame column.
I'm trying to store the schema of this JSON string in another column using the schema_of_json function,
but it throws the error below. How can I resolve this?
{
"company": {
"companyId": "123",
"companyName": "ABC"
},
"customer": {
"customerDetails": {
"customerId": "CUST-100",
"customerName": "CUST-AAA",
"status": "ACTIVE",
"phone": {
"phoneDetails": {
"home": {
"phoneno": "666-777-9999"
},
"mobile": {
"phoneno": "333-444-5555"
}
}
}
},
"address": {
"loc": "NORTH",
"adressDetails": [
{
"street": "BBB",
"city": "YYYYY",
"province": "AB",
"country": "US"
},
{
"street": "UUU",
"city": "GGGGG",
"province": "NB",
"country": "US"
}
]
}
}
}
Code:
val df = spark.read.textFile("./src/main/resources/json/company.txt")
df.printSchema()
df.show()
root
|-- value: string (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"company":{"companyId":"123","companyName":"ABC"},"customer":{"customerDetails":{"customerId":"CUST-100","customerName":"CUST-AAA","status":"ACTIVE","phone":{"phoneDetails":{"home":{"phoneno":"666-777-9999"},"mobile":{"phoneno":"333-444-5555"}}}},"address":{"loc":"NORTH","adressDetails":[{"street":"BBB","city":"YYYYY","province":"AB","country":"US"},{"street":"UUU","city":"GGGGG","province":"NB","country":"US"}]}}}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.withColumn("jsonSchema",schema_of_json(col("value")))
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;
'Project [value#0, schemaofjson(value#0) AS jsonSchema#10]
+- Project [value#0]
+- Relation[value#0] text
The workaround I found was to pass the first row's value to schema_of_json as a plain string, as below. Note that this derives the schema from the first row only.
df.withColumn("jsonSchema",schema_of_json(df.select(col("value")).first.getString(0)))
Courtesy:
Implicit schema discovery on a JSON-formatted Spark DataFrame column
Since SPARK-24709 was introduced, schema_of_json accepts only string literals, not columns. You can extract the schema of the JSON strings in DDL format by calling:
import spark.implicits._ // provides the Encoder required by .as[String]
spark.read
  .json(df.select("value").as[String])
  .schema
  .toDDL
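To show the kind of string such schema discovery produces, here is a minimal pure-Python sketch that walks a parsed JSON value and builds a DDL-like type string. It is not Spark's implementation: infer_ddl is a hypothetical helper, fields keep their input order, arrays are typed from their first element only, and nulls fall back to string, all of which are simplifying assumptions.

```python
import json

def infer_ddl(value):
    """Return a Spark-DDL-like type string for a parsed JSON value (illustrative only)."""
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{infer_ddl(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumption: every element has the same type as the first one.
        return f"array<{infer_ddl(value[0])}>" if value else "array<string>"
    if isinstance(value, bool):
        # bool must be tested before int, because bool is a subclass of int.
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    # Strings and nulls both fall back to string in this sketch.
    return "string"

sample = '{"company":{"companyId":"123"},"address":{"adressDetails":[{"street":"BBB"}]}}'
schema = infer_ddl(json.loads(sample))
print(schema)
```

For real data, the spark.read.json route above is preferable, since it merges the types seen across all rows rather than trusting a single value.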
If you are looking for a PySpark answer:
import json
import pyspark.sql.functions as F
import pyspark.sql.types as T

def process(json_content):
    # Return the top-level keys of a JSON string (nested keys are not included)
    if json_content is None:
        return []
    try:
        # Parse the content of the json, extract the keys only
        keys = json.loads(json_content).keys()
        return list(keys)
    except Exception as e:
        # The UDF returns an array of strings, so return the error as text
        return [str(e)]

udf_function = F.udf(process, T.ArrayType(T.StringType()))
my_df = my_df.withColumn("schema", udf_function(F.col("json_raw")))
Data sample:
import pandas as pd
patients_df = pd.read_json('C:/MyWorks/Python/Anal/data_sample.json', orient="records", lines=True)
patients_df.head()
# in Python
# my JSON data sample
"data1": {
"id": "myid",
"seatbid": [
{
"bid": [
{
"id": "myid",
"impid": "1",
"price": 0.46328014,
"adm": "adminfo",
"adomain": [
"domain.com"
],
"iurl": "url.com",
"cid": "111",
"crid": "1111",
"cat": [
"CAT-0101"
],
"w": 00,
"h": 00
}
],
"seat": "27"
}
],
"cur": "USD"
},
What I want to do is to check if there is a "cat" value in my very large JSON data.
The "cat" value may/may not exist, but I'm trying to use Python Pandas to check it.
for seatbid in patients_df["win_res"]:
for bid in seatbid["seatbid"]:
I tried to access the JSON data by writing a loop like that, but the data is not being accessed properly.
I simply want to check whether "cat" exists or not.
You can use Python's json library as follows:
import json
patient_data = json.loads(patientJson)
# Note: this checks the top-level keys only
if "cat" in patient_data:
    print("Key exists in JSON data")
else:
    print("Key doesn't exist in JSON data")
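A membership test on the parsed object only looks at its top-level keys, while "cat" in the sample sits several levels down, inside seatbid and bid. A minimal sketch of a recursive check, assuming each record has already been parsed into Python dicts and lists (contains_key is a hypothetical helper name, and the record below is a trimmed copy of the sample):

```python
import json

def contains_key(node, key):
    """Return True if `key` appears as a dict key anywhere inside `node`."""
    if isinstance(node, dict):
        # Check this level first, then descend into every value.
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(item, key) for item in node)
    # Scalars (str, int, float, bool, None) cannot contain keys.
    return False

record = json.loads(
    '{"data1": {"seatbid": [{"bid": [{"id": "myid", "cat": ["CAT-0101"]}], "seat": "27"}]}}'
)
print(contains_key(record, "cat"))
print(contains_key(record, "missing"))
```

With a DataFrame loaded through pd.read_json, the same function could be applied per row, for example with patients_df["win_res"].apply(lambda r: contains_key(r, "cat")), assuming that column holds the parsed objects.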
I'm relatively new to Scala. I would like to map part of my JSON to my object. The code looks like this:
def seasons = (json \\ "season")
case class:
case class Season(startDate: LocalDate, endDate: LocalDate)
json-structure:
[
{
"id": "",
"name": "",
"season": {
"start": "0",
"end": "0"
}
}
]
I would somehow like to end up with a List[Season], so I can loop through it.
Question #2
json-structure:
[
{
"id": "",
"name": "",
"season": {
"start": "0",
"end": "0"
}
},
{
"id": "",
"name": "",
"season": {
"start": "0",
"end": "0"
}
}...
]
The JSON (which is a JsValue, by the way) contains multiple regions, as can be seen above. Case classes are provided (Region holds a Season), and the naming is the same as in the JSON.
Formats look like this:
implicit val seasonFormat: Format[Season] = Json.format[Season]
implicit val regionFormat: Format[Region] = Json.format[Region]
So what would I need to call in order to get a List[Region]? I thought of something like regionsJson.as[List[Region]], since I defined the Format, which provides both Reads and Writes. Unfortunately, it's not working.
What is the best way of doing this? I've tried it with a JsArray, but I have difficulties mapping it...
Any input would be much appreciated!
I've made some changes to your original case class and renamed its fields to match the JSON fields.
The following code parses the JSON into a Seq[Season]:
import java.time.LocalDate
import play.api.libs.json._
case class Season(start: LocalDate, end: LocalDate)
implicit val seasonFormat: Format[Season] = Json.format[Season]
val json =
"""
|[
| {
| "id": "",
| "name": "",
| "season": {
| "start": "2020-10-20",
| "end": "2020-10-22"
| }
| }
|]
|""".stripMargin
val seasonsJson: collection.Seq[JsValue] = Json.parse(json) \\ "season"
val seasons: collection.Seq[Season] = seasonsJson.map(_.as[Season])
seasons.foreach(println)
Please note that I changed the data in your JSON: instead of 0, which is not a valid date, I provided dates in ISO format (yyyy-MM-dd).
The above code works with play-json version 2.9.0.
---UPDATE---
Following up on the comment by @cchantep:
The as method throws an exception if the JSON cannot be mapped to the case class. A non-throwing alternative is asOpt, which returns None when the mapping is not possible.
I'm trying to parse some JSON into Kotlin objects. The JSON looks like:
{
data: [
{ "name": "aaa", "age": 11 },
{ "name": "bbb", "age": 22 },
],
otherdata : "don't need"
}
I just need the data part of the entire JSON, and want to parse each item into a User object:
data class User(name:String, age:Int)
But I can't find an easy way to do it.
Here's one way you can achieve this
import com.beust.klaxon.Klaxon
import java.io.StringReader
val json = """
{
"data": [
{ "name": "aaa", "age": 11 },
{ "name": "bbb", "age": 22 },
],
"otherdata" : "not needed"
}
""".trimIndent()
data class User(val name: String, val age: Int)
fun main(args: Array<String>) {
val klaxon = Klaxon()
val parsed = klaxon.parseJsonObject(StringReader(json))
val dataArray = parsed.array<Any>("data")
val users = dataArray?.let { klaxon.parseFromJsonArray<User>(it) }
println(users)
}
This will work as long as you can fit the whole json string in memory. Otherwise you may want to look into the streaming API: https://github.com/cbeust/klaxon#streaming-api
I want to parse a JSON file in Spark 2.0 (Scala). Next, I want to save the data in a Hive table.
How can I parse a JSON file using Scala?
JSON file example (metadata.json):
{
"syslog": {
"month": "Sep",
"day": "26",
"time": "23:03:44",
"host": "cdpcapital.onmicrosoft.com"
},
"prefix": {
"cef_version": "CEF:0",
"device_vendor": "Microsoft",
"device_product": "SharePoint Online",
},
"extensions": {
"eventId": "7808891",
"msg": "ManagedSyncClientAllowed",
"art": "1506467022378",
"cat": "SharePoint",
"act": "ManagedSyncClientAllowed",
"rt": "1506466717000",
"requestClientApplication": "Microsoft SkyDriveSync",
"cs1": "0bdbe027-8f50-4ec3-843f-e27c41a63957",
"cs1Label": "Organization ID",
"cs2Label": "Modified Properties",
"ahost": "cdpdiclog101.cgimss.com",
"agentZoneURI": "/All Zones",
"amac": "F0-1F-AF-DA-8F-1B",
"av": "7.6.0.8009.0",
}
},
Thanks
You can use something like:
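val jsonDf = sparkSession
  .read
  // .option("multiLine", true) // needed when a JSON record spans several lines (Spark 2.2+)
  .json("resources/json/metadata.json")
jsonDf.printSchema()
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
jsonDf.createOrReplaceTempView("metadata")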
More details: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
I'm trying to extract my data from JSON into a case class, without success.
The JSON file:
[
{
"name": "bb",
"loc": "sss",
"elements": [
{
"name": "name1",
"loc": "firstHere",
"elements": []
}
]
},
{
"name": "ca",
"loc": "sss",
"elements": []
}
]
My code:
case class ElementContainer(name : String, location : String,elements : Seq[ElementContainer])
object elementsFormatter {
implicit val elementFormatter = Json.format[ElementContainer]
}
object Applicationss extends App {
val el = new ElementContainer("name1", "firstHere", Seq.empty)
val el1Cont = new ElementContainer("bb","sss", Seq(el))
val source:String=Source.fromFile("src/bin/elementsTree.json").getLines.mkString
val jsonFormat = Json.parse(source)
val r1= Json.fromJson[ElementContainer](jsonFormat)
}
After running this, r1 contains:
JsError(List((/elements,List(ValidationError(List(error.path.missing),WrappedArray()))), (/name,List(ValidationError(List(error.path.missing),WrappedArray()))), (/location,List(ValidationError(List(error.path.missing),WrappedArray())))))
I've been trying to extract this data forever. Please advise.
Your case class has location where the JSON has loc, and you need to parse the file into a Seq[ElementContainer], since it's an array rather than a single ElementContainer:
Json.fromJson[Seq[ElementContainer]](jsonFormat)
Also, there is the validate method, which returns either the errors or the parsed object.
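The two fixes (treating the top level as an array, and mapping the JSON field loc onto location) together with the "errors or parsed object" idea can be sketched outside Play as well. The following is a language-neutral illustration in Python, not Play JSON's API: parse_element and validate are hypothetical helpers, and the error format is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class ElementContainer:
    name: str
    location: str
    elements: List["ElementContainer"] = field(default_factory=list)

def parse_element(obj):
    # The JSON field is "loc", while the class field is "location".
    return ElementContainer(
        name=obj["name"],
        location=obj["loc"],
        elements=[parse_element(child) for child in obj["elements"]],
    )

def validate(text):
    """Return (elements, None) on success or (None, error message) on failure."""
    try:
        data = json.loads(text)
        if not isinstance(data, list):
            return None, "expected a top-level JSON array"
        return [parse_element(obj) for obj in data], None
    except (KeyError, TypeError, ValueError) as exc:
        return None, f"{type(exc).__name__}: {exc}"

source = """[
  {"name": "bb", "loc": "sss",
   "elements": [{"name": "name1", "loc": "firstHere", "elements": []}]},
  {"name": "ca", "loc": "sss", "elements": []}
]"""
elements, error = validate(source)
print(elements, error)
```

Returning the error as a value instead of raising mirrors the role validate plays in the answer: the caller decides what to do with a failed parse.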