Recently I've been trying to apply some transformations to JSON files in Azure Synapse notebooks using Scala, loading them with the spark.read function. The problem is the following:
1st case: I load the file with a schema (via StructType) and the returned DataFrame is all nulls.
2nd case: I load it without a schema and it returns "_corrupt_record" (this happens with multiline = true, too).
I don't know what is happening, as I have tried to load different kinds of JSON files and none of them work (they are ordinary JSONs downloaded from Kaggle, though). Here is one example:
{
    "results": [{
        "columns": [{
            "name": "COD",
            "type": "NUMBER"
        }, {
            "name": "TECH",
            "type": "NUMBER"
        }, {
            "name": "OBJECT",
            "type": "NUMBER"
        }],
        "items": [{
            "cod": 3699,
            "tech": "-",
            "object": "Type 2"
        }, {
            "cod": 3700,
            "tech": 56,
            "object": "Type 1"
        }, {
            "cod": 3701,
            "tech": 20,
            "object": "No type"
        }]
    }]
}
I got a similar corrupt-record error when I tried to reproduce this.
Since your sample data spreads a single object across multiple lines, and the JSON has multiple objects nested inside it, I read the file with the multiline option set to true and then exploded each column and selected it.
//reading JSON file from ADLS in json format
import org.apache.spark.sql.functions.{col, explode, lit}

val read_path = "abfss://fsn2p@dlsg2p.dfs.core.windows.net/sample.json" // container@account (the "#" in the original looks like a typo)
val customers_data_path = spark.read.format("json").option("inferSchema", "true").option("multiline", "true").load(read_path)
customers_data_path.show()
//Exploding the results array into one row per element
val DF1=customers_data_path.select(explode(col("results")).as("results"))
DF1.show()
//Selecting all columns from results
val DF2 = DF1.select(col("results.*"))
DF2.show();
//further exploding Columns column and items objects
val DF3 = DF2.select(explode(col("columns")).as("columns"),col("items"))
val DF4 = DF3.select(col("columns"),explode(col("items")).as("items"))
DF4.show();
With this approach, each item object is paired with every columns object value (a cross-product of the two arrays).
//selecting All columns inside columns and items object
val DF5 = DF4.select(col("columns.*"),col("items.*"))
DF5.show();
With this approach you will get null when an object doesn't have a value in a particular column.
//exploding the columns and items arrays separately (DF5 is redefined here, which is fine in a fresh notebook cell)
val DF5 = DF2.select(explode(col("columns")).as("columns"))
DF5.show();
val DF6 = DF2.select(explode(col("items")).as("items"))
DF6.show();
//selecting All columns inside columns and items object
val DF7 = DF5.select(col("columns.*"))
val DF8 = DF6.select(col("items.*"))
DF7.show();
DF8.show();
//combining both the above dataframes
val DF10 = DF7.join(DF8, lit(false), "full")
DF10.show()
With this approach you get both DataFrames combined into one, but since the join condition is always false the rows are not paired up. To line them up row by row:
//create a sequential index column, join the DataFrames on it, then drop it
import org.apache.spark.sql.types.{StructType, StructField, LongType}
import spark.implicits._
import org.apache.spark.sql.Row
val df11 = spark.createDataFrame(
  DF7.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(DF7.schema.fields :+ StructField("index", LongType, false))
)
val df22 = spark.createDataFrame(
  DF8.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(DF8.schema.fields :+ StructField("index", LongType, false))
)
val DF12 = df11.join(df22, Seq("index")).drop("index")
DF12.show()
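As a side note, if each columns(i) is meant to describe the matching items(i), a shorter route to the same row-by-row pairing is posexplode, which emits each element's array position. A minimal sketch building on DF2 from above (it assumes a single results element, as in the sample):
import org.apache.spark.sql.functions.{col, posexplode}

//posexplode emits (pos, col), so both sides can be joined on the position
val colsIdx = DF2.select(posexplode(col("columns"))).toDF("idx", "columns")
val itemsIdx = DF2.select(posexplode(col("items"))).toDF("idx", "items")
val paired = colsIdx.join(itemsIdx, Seq("idx")).drop("idx").select(col("columns.*"), col("items.*"))
paired.show()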
I have code which looks roughly like this:
val json: Json = parse("""
[
{
"id": 1,
"type": "Contacts",
"admin": false,
"cookies": 3
},
{
"id": 2,
"type": "Apples",
"admin": false,
"cookies": 6
},
{
"id": 3,
"type": "Contacts",
"admin": true,
"cookies": 19
}
]
""").getOrElse(Json.Null)
I'm using Circe, Cats, Scala, circe-json, and so on, and the parse call succeeds.
I want to return a List where each top-level object with type="Contacts" is shown in its entirety.
Something like:
List[String] = List("{\"id\": 1,\"type\": \"Contacts\",\"admin\": false,\"cookies\": 3}", "{\"id\": 3,\"type\": \"Contacts\",\"admin\": true,\"cookies\": 19}")
The background is that I have large JSON files on disk. I need to filter out the subset of objects that match a certain type= value, in this case type=Contacts, and then split these out from the rest of the JSON file. I'm not looking to modify the file; I'm more looking to grep for matching objects and process them accordingly.
Thank you.
The most straightforward way to accomplish this kind of thing is to decode the document into either a List[Json] or List[JsonObject] value. For example, given your definition of json:
import io.circe.JsonObject
val Right(docs) = json.as[List[JsonObject]]
And then you can query based on the type:
scala> val contacts = docs.filter(_("type").contains(Json.fromString("Contacts")))
contacts: List[io.circe.JsonObject] = List(object[id -> 1,type -> "Contacts",admin -> false,cookies -> 3], object[id -> 3,type -> "Contacts",admin -> true,cookies -> 19])
scala> contacts.map(Json.fromJsonObject).map(_.noSpaces).foreach(println)
{"id":1,"type":"Contacts","admin":false,"cookies":3}
{"id":3,"type":"Contacts","admin":true,"cookies":19}
Given your use case, circe-optics seems unlikely to be a good fit (see my answer here for some discussion of why filtering with arbitrary predicates is awkward with Monocle's Traversal).
It may be worth looking into circe-fs2 or circe-iteratee, though, if you're interested in parsing and filtering large JSON files without loading the entire contents of the file into memory. In both cases the principle would be the same as in the List[JsonObject] code just above—you decode your big JSON array into a stream of JsonObject values, which you can query however you want.
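For instance, here is a minimal sketch of that streaming route with circe-fs2, assuming fs2 3.x and cats-effect 3; the file name big.json is hypothetical, and the "type" filter mirrors the List[JsonObject] example above:
import cats.effect.{IO, IOApp}
import fs2.io.file.{Files, Path}
import io.circe.fs2.byteArrayParser

object FilterContacts extends IOApp.Simple {
  def run: IO[Unit] =
    Files[IO]
      .readAll(Path("big.json"))   // hypothetical on-disk JSON array
      .through(byteArrayParser)    // emits one Json value per array element
      .filter(_.hcursor.get[String]("type").contains("Contacts"))
      .map(_.noSpaces)
      .evalMap(s => IO.println(s)) // or process each matching object however you want
      .compile
      .drain
}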
I am looking for a way to update the given JSON's keys and values dynamically. The structure of the delivered JSON is always the same; the only thing that differs is the amount of data provided. So, for example, there could sometimes be 30 nestings, sometimes only 10, etc.
…
"ampdata": [
{
"nr": "303",
"code": "JGJGh4958GH",
"status": "AVAILABLE",
"ability": [ "" ],
"type": "wheeled",
"conns": [
{
"nr": "447",
"status": "",
"version": "3",
"format": "sckt",
"amp": "32",
"vol": "400",
"vpower": 22
}
]
}
As the JSON uses different keys/values than my DB, I need to convert them. Additionally, I need to change some values if they match specific strings.
So for example: "code" has to be renamed to "adrID", and "sckt" should map to the value "bike".
I tried a simple Groovy script to remove the key and/or change the value. There is no problem changing values, but changing the key itself is another matter, so I tried removing the key and adding a new one. Unfortunately, I could not figure out how to add a new key:value pair to the given JSON. So how can I add a new key:value pair, or rename the key, if that's possible? Have a look at my code example:
import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.StreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
import groovy.json.JsonOutput
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return
try {
    flowFile = session.write(flowFile,
        { inputStream, outputStream ->
            def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            def obj = new JsonSlurper().parseText(text)
            def objBuilder = new JsonBuilder(obj)
            // Update ingestionDate field with today's date
            for (i in 0..obj.data.size()-1) {
                obj.data[0].remove("postal_code")
                objBuilder.data[0].postal_code = 5
            }
            // Output updated JSON
            def json = JsonOutput.toJson(obj)
            outputStream.write(JsonOutput.prettyPrint(json).getBytes(StandardCharsets.UTF_8))
        } as StreamCallback)
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').tokenize('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
} catch(Exception e) {
    log.error('Error during JSON operations', e)
    session.transfer(flowFile, REL_FAILURE)
}
...
def obj = new JsonSlurper().parse(inputStream, "UTF-8")
obj.data.each{e->
    def value = e.remove("postal_code")
    //set old value with a new key into object
    e["postalCode"] = value
}
//write to output
def builder = new JsonBuilder(obj)
outputStream.withWriter("UTF-8"){ it << builder.toPrettyString() }
I need to read the JSON file below in Scala, iterate over each element, and add it to the appropriate list.
Input:
{"details": [{"Level": "1", "member": "age", "claim": "age"}, {"Level": "2", "member": "dob", "claim": "dob"}, {"Level": "2", "member": "name", "claim": "name"}]}
Output:
Memberlist = List(List(age), List(dob), List(name))
Claimlist = List(List(age), List(dob), List(name))
You can use simple and recursive paths to traverse JSON objects in Scala, assuming you're using the Play Framework:
val jsonDetails:JsValue = Json.parse("""
{"details":
[{"Level": "1", "member":"age", "claim":"age"},
{"Level": "2", "member": "dob", "claim": "dob2"},
{"Level": "2", "member":"name1", "claim":"name"}
]}
""")
val pathToFirstLevel = (jsonDetails \ "details" \ 0 \ "Level")
val allInstancesOfMember = (jsonDetails \\ "member")
These will return:
res1: play.api.libs.json.JsLookupResult = JsDefined("1")
res2: Seq[play.api.libs.json.JsValue] = List("age", "dob", "name1")
You can grab the data you're looking for and populate your lists, but you'll need to convert the JsValues to values Scala can work with (use .as[T], not asInstanceOf[T]).
You can also use map:
val allInstancesOfMember = (jsonDetails \\ "member").map(_.as[String])
There's loads of information on traversing JSON elements in Scala using the Play Framework at https://www.playframework.com/documentation/2.6.x/ScalaJson
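For completeness, a minimal sketch of filling both lists from the question with this API (the sample values mirror the jsonDetails defined above):
val memberList = (jsonDetails \\ "member").map(_.as[String]).map(List(_)).toList
//List(List(age), List(dob), List(name1))
val claimList = (jsonDetails \\ "claim").map(_.as[String]).map(List(_)).toList
//List(List(age), List(dob2), List(name))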
Does that answer your question?
Not sure whether this solution fits your requirement.
Some concrete classes used for extraction:
class CC[T] { def unapply(a:Any):Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
ListBuffers for the output:
import scala.collection.mutable.ListBuffer

val memberList = ListBuffer.empty[Any]
val claimList = ListBuffer.empty[Any]
A for comprehension to iterate over the list of details and extract each one (s holds the input JSON string):
import scala.util.parsing.json.JSON

for {
  Some(M(map)) <- List(JSON.parseFull(s))
  L(details) = map("details")
  M(language) <- details
  //S(level) = language("Level") //Not so sure how to use this variable
  S(member) = language("member")
  S(claim) = language("claim")
} yield {
  memberList += List(member) //List is used to match your output.
  claimList += List(claim) //List is used to match your output.
}
Output:
println(memberList)
//ListBuffer(List(age), List(dob), List(name))
println(claimList)
//ListBuffer(List(age), List(dob), List(name))
Credits to this answer
Is it possible to dynamically deserialize an external ByteString stream of unknown length from Akka HTTP into domain objects?
Context
I call an infinitely long HTTP endpoint that outputs a JSON Array that keeps growing:
[
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
...
] <- Never sees the daylight
I guess that JsonFraming.objectScanner(Int.MaxValue) should be used in this case. As the docs state:
Returns a Flow that implements a "brace counting" based framing
operator for emitting valid JSON chunks. It scans the incoming data
stream for valid JSON objects and returns chunks of ByteStrings
containing only those valid chunks. Typical examples of data that one
may want to frame using this operator include: Very large arrays
So you can end up with something like this:
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.scaladsl.{JsonFraming, Sink}
import io.circe.parser.decode // the decode below matches circe; use any unmarshaller you like
import scala.concurrent.Future
import scala.util.{Failure, Success}

val response: Future[HttpResponse] = Http().singleRequest(HttpRequest(uri = serviceUrl))
response.onComplete {
  case Success(value) =>
    value.entity.dataBytes
      .via(JsonFraming.objectScanner(Int.MaxValue))
      .map(_.utf8String)        // In case you have ByteString
      .map(decode[MyEntity](_)) // Use any Unmarshaller here
      .grouped(20)
      .runWith(Sink.ignore)     // Do whatever you need here
  case Failure(exception) => log.error(exception, "Api call failed")
}
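Note that decode returns an Either, so in practice you would likely follow that map with something like .collect { case Right(entity) => entity } to keep only the successfully decoded objects (and log or route the failures separately).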
I had a very similar problem trying to parse the Twitter Stream (an infinite string) into a domain object.
I solved it using Json4s, like this:
import org.json4s._
import org.json4s.native.JsonMethods._

case class Tweet(username: String, geolocation: Option[Geo])
case class Geo(latitude: Float, longitude: Float)

object Tweet {
  // DefaultFormats is required for extract[Tweet] to work
  implicit val formats: Formats = DefaultFormats

  def apply(s: String): Tweet =
    parse(StringInput(s), useBigDecimalForDouble = false, useBigIntForLong = false).extract[Tweet]
}
Then I just buffered the stream and mapped each line to a Tweet:
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.GZIPInputStream

val reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(inputStream), "UTF-8"))
var line = reader.readLine()
while (line != null) {
  store(Tweet.apply(line))
  line = reader.readLine()
}
Json4s has full support for Option (and for custom objects nested inside the object, like Geo in the example). Therefore you can use an Option as I did, and if the field doesn't come in the JSON, it will be set to None.
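For instance, a stream line such as {"username": "alice"} (a hypothetical sample) extracts to Tweet("alice", None), since the missing geolocation field becomes None.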
Hope it helps!
I think that play-iteratees-extras can help you. This library allows parsing JSON via the Enumerator/Iteratee pattern and, of course, without waiting to receive all the data.
For example, let's build an 'infinite' stream of bytes that represents an 'infinite' JSON array.
import play.api.libs.iteratee.{Enumeratee, Enumerator, Iteratee}
import play.api.libs.json.Json
import play.api.mvc.Codec
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

var i = 0
var isFirstWas = false
val max = 10000
val stream = Enumerator("[".getBytes) andThen Enumerator.generateM {
  Future {
    i += 1
    if (i < max) {
      val json = Json.stringify(Json.obj(
        "prop" -> Random.nextBoolean(),
        "prop2" -> Random.nextBoolean(),
        "prop3" -> Random.nextInt(),
        "prop4" -> Random.alphanumeric.take(5).mkString("")
      ))
      val string = if (isFirstWas) {
        "," + json
      } else {
        isFirstWas = true
        json
      }
      Some(Codec.utf_8.encode(string))
    } else if (i == max) Some("]".getBytes) // <------ the closing bracket of the JSON array
    else None
  }
}
OK, this value produces a JSON array of 10000 (or more) objects. Let's define a case class that will hold the data of each object in our array:
case class Props(prop: Boolean, prop2: Boolean, prop3: Int, prop4: String)
Now we write a parser that will parse each item:
import play.extras.iteratees._
import JsonBodyParser._
import JsonIteratees._
import JsonEnumeratees._
val parser = jsArray(jsValues(jsSimpleObject)) ><> Enumeratee.map { json =>
  for {
    prop <- json.\("prop").asOpt[Boolean]
    prop2 <- json.\("prop2").asOpt[Boolean]
    prop3 <- json.\("prop3").asOpt[Int]
    prop4 <- json.\("prop4").asOpt[String]
  } yield Props(prop, prop2, prop3, prop4)
}
Please see the docs for jsArray, jsValues and jsSimpleObject. To build the result producer:
val result = stream &> Encoding.decode() ><> parser
Encoding.decode() from the JsonIteratees package decodes the bytes as CharStrings. The result value has type Enumerator[Option[Props]], and you can apply an iteratee to this enumerator to start the parsing process.
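For instance, a minimal sketch of running it, given the definitions above (each element is Some(props) for a well-formed item, None otherwise):
val done: Future[Unit] = result |>>> Iteratee.foreach[Option[Props]] {
  case Some(props) => println(props) // handle each parsed item
  case None        => ()             // skip objects that didn't decode into Props
}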
In the end, I don't know how you receive your bytes (the solution depends heavily on this), but I think this shows one possible solution to your problem.
I'm trying to parse a random JSON file in Grails.
First I need to get the name of each field
For example, given below JSON file,
{
"abbreviation": "EX",
"guid": "1209812-1l2kj1j-fwefoj9283jf-ae",
"metadata": {
"dataOrigin": "Example"
},
"rooms":
[
],
"site": {
"guid": "1209812-1l2kj1j-fwefoj9283jf-ae"
},
"title": "Example!!"
}
I want to find out the structure of the JSON file (a list of keys, maybe); for example, I want to save the list of keys such as 'abbreviation', 'guid', 'metadata', 'rooms', 'site', 'title' from this JSON file.
How would I do this?
(We need the names of the keys in order to get the value of each key, so with an arbitrarily structured JSON file I need to find out the keys first.)
You can try the code below:
import grails.converters.JSON

def filePath = "JSONFILE.json"
def text = new File(filePath).getText()
def json = JSON.parse(text)
def jsonKeys = json.collect{ it.key }
println(jsonKeys)
This will print all the JSON keys.
From what dmahaptro commented, I figured out how to get all the keys within a JSON object.
Here is a simple sample code I wrote to test it:
import org.grails.web.json.{JSONArray, JSONObject} // package of JSONArray/JSONObject varies by Grails version

String jsonFile = new URL("path to the json file").text // read the raw text; JSONArray does the parsing below
JSONArray jsonParse = new JSONArray(jsonFile)
int len = jsonParse.length()
def names = []
def keys = []
(0..len-1).each {
    JSONObject val = jsonParse.getJSONObject(it)
    int numKeys = val.length()
    names = val.names()
    keys = val.keySet()
    (0..numKeys-1).each {
        def field = names[it]
        println field + " : " + val."${field}"
    }
}
This will print the key:value pairs in a given JSON file.