Suppose you have a bunch of data whose rows look like this:
{
  'key': [
    {'key1': 'value11', 'key2': 'value21'},
    {'key1': 'value12', 'key2': 'value22'}
  ]
}
I would like to read this into a Spark Dataset. One way to do it is as follows:
case class ObjOfLists(k1: List[String], k2: List[String])
case class Data(k: ObjOfLists)
Then you can do:
sparkSession.read.json(pathToData).select(
    struct($"key.key1" as "k1", $"key.key2" as "k2") as "k"
  )
  .as[Data]
This works, but it somewhat butchers the data; after all, in the original data 'key' points to a list of objects rather than an object of lists. In other words, what I really want is:
case class Obj(k1: String, k2: String)
case class DataOfList(k: List[Obj])
My question: is there some other syntax I can put in select which allows the resulting DataFrame to be converted to a Dataset[DataOfList]?
I tried using the same select syntax as above, and got:
Exception in thread "main" org.apache.spark.sql.AnalysisException: need an array field but got struct<k1:array<string>,k2:array<string>>;
So I also tried:
sparkSession.read.json(pathToData).select(
    array(struct($"key.key1" as "k1", $"key.key2" as "k2")) as "k"
  )
  .as[DataOfList]
This compiles and runs, but the data looks like this:
DataOfList(List(Obj(org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#bb2a5516,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#bec5e4a7)))
Any other ideas?
Just recast the data to reflect the expected names:
case class Obj(k1: String, k2: String)
case class DataOfList(k: Seq[Obj])
val text = Seq("""{
  "key": [
    {"key1": "value11", "key2": "value21"},
    {"key1": "value12", "key2": "value22"}
  ]
}""").toDS
val df = spark.read.json(text)
df
.select($"key".cast("array<struct<k1:string,k2:string>>").as("k"))
.as[DataOfList]
.first
DataOfList(List(Obj(value11,value21), Obj(value12,value22)))
If the objects contain extraneous fields, define the schema on read:
val textExtended = Seq("""{
  "key": [
    {"key0": "value01", "key1": "value11", "key2": "value21"},
    {"key1": "value12", "key2": "value22", "key3": "value32"}
  ]
}""").toDS
import org.apache.spark.sql.types._

val schemaSubset = StructType(Seq(StructField("key", ArrayType(StructType(Seq(
  StructField("key1", StringType),
  StructField("key2", StringType)
))))))
val df = spark.read.schema(schemaSubset).json(textExtended)
and proceed as before.
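For reference, a minimal sketch of that final step, continuing from the df just defined (assuming spark.implicits._ is in scope and the case classes from above):

df
  .select($"key".cast("array<struct<k1:string,k2:string>>").as("k"))  // positional cast renames key1/key2 to k1/k2
  .as[DataOfList]
  .first
// expected: DataOfList(List(Obj(value11,value21), Obj(value12,value22)))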
I have a simple rule in SWI-Prolog which I want to implement in an AWS Lambda function.
I will receive an Event in the following JSON form:
{
  "key1": "value1",
  "key2": "value2",
  "key3": "value3"
}
My problem is that I can only read from atom-like arrays or json files but not plain json in a compound form.
What I would like to do is something like this:
lambda_handler(Event, Context, Response) :-
    atom_json_dict(Event, Dict, []),
    my_simple_rule(Dict.key1, Dict.key2, Dict.key3),
    Response = '{"result": "yes"}'.
my_simple_rule is a condition which returns true or false depending on the values passed.
What I've tried so far does not work because SWI-Prolog expects either a Stream or a String when using atom_json_term/3, json_read/2,3 or json_read_dict/2,3.
I also tried to force the JSON into a string this way:
format(atom(A), "~w", {"key1": "value1", "key2": "value2", "key3":"value3"}).
I expected this output, so that I could then convert it to a term (Prolog dict):
{"key1": "value1", "key2": "value2", "key3":"value3"}
But the result is the following:
'{key1:value1,key2:value2,key3:value3}'
Which fails.
Does any one know how I can use a plain JSON within Prolog?
Event is already a structured term, so here is an 'ad hoc' adapter.
Let's say we have a file j2d.pl containing
:- module(j2d,
          [ j2d/2
          ]).

j2d(Event, Dict) :-
    Event = {CommaSequence},
    l2d(CommaSequence, _{}, Dict).

l2d((A,R), D, U) :- !, updd(A, D, V), l2d(R, V, U).
l2d(A, D, U) :- updd(A, D, U).

updd(K:V, D, U) :- atom_string(A, K), put_dict(A, D, V, U).
then it's possible to test the code from the SWI-Prolog console:
?- use_module(j2d).
true.
?- Event={
"key1": "value1",
"key2": "value2",
"key3": "value3"
}.
Event = {"key1":"value1", "key2":"value2", "key3":"value3"}.
?- j2d($Event,Dict).
Dict = _14542{key1:"value1", key2:"value2", key3:"value3"},
Event = {"key1":"value1", "key2":"value2", "key3":"value3"}.
The unusual $Event syntax is a utility of the console (a.k.a. the REPL) that replaces the variable Event with its last value (a.k.a. binding).
Your code could become
:- use_module(j2d).
lambda_handler(Event, Context, Response) :-
    j2d(Event, Dict),
    my_simple_rule(Dict.key1, Dict.key2, Dict.key3),
    Response = '{"result": "yes"}'.
So... I was wrong about how SWI-Prolog handles the Event JSON in Lambda, so I will post my findings here in case someone encounters a similar challenge.
At first I thought that the Event JSON arrived like this: {"key1": 1, "key2": 2, "key3": 3}.
However, it looks more like this: json([key1=1, key2=2, key3=3]), which makes the parsing task different.
To solve it I used the following code, which I hope is self-explanatory:
:- use_module(library(http/json)).
:- use_module(library(http/json_convert)).

% Use this to test locally; if it works this way, it should work in Lambda:
% ?- handler(json([key1=10, key2=2, key3=3]), _, Response).

% Function handler
handler(json(Event), _, Response) :-
    json_to_prolog(json(Event), Json_term),     % transform the event JSON into a Prolog term
    atom_json_term(A, Json_term, []),           % convert the JSON term into an atom
    atom_json_dict(A, Dicty, []),               % convert the atom into a dict
    my_simple_function(Dicty.key1, Dicty.key2, Dicty.key3, Result),  % function evaluation
    json_to_prolog(json([result_key1="some_message", result_key2=Result]), Result_json),  % build the result JSON term
    atom_json_term(Response, Result_json, []).  % turn the JSON term into an atom so that Lambda does not complain

my_simple_function(N1, N2, N3, Result) :-
    Result is N1 + N2 + N3.
The input needed (your test JSON in Lambda) would be:
{
  "key1": 1,
  "key2": 2,
  "key3": 3
}
While the output should look like this:
{
  "result_key1": "some_message",
  "result_key2": 6
}
I hope this works as a template to use SWI-Prolog on AWS.
By the way, I recommend you take a look at this repository to build a custom Prolog runtime for Lambda.
I have a JSON file that looks like this:
{
"tags": [
{
"1": "NpProgressBarTag",
"2": "userPath",
"3": "screen",
"4": 6,
"12": 9,
"13": "buttonName",
"16": 0,
"17": 10,
"18": 5,
"19": 6,
"20": 1,
"35": 1,
"36": 1,
"37": 4,
"38": 0,
"39": "npChannelGuid",
"40": "npShowGuid",
"41": "npCategoryGuid",
"42": "npEpisodeGuid",
"43": "npAodEpisodeGuid",
"44": "npVodEpisodeGuid",
"45": "npLiveEventGuid",
"46": "npTeamGuid",
"47": "npLeagueGuid",
"48": "npStatus",
"50": 0,
"52": "gupId",
"54": "deviceID",
"55": 1,
"56": 0,
"57": "uiVersion",
"58": 1,
"59": "deviceOS",
"60": 1,
"61": 0,
"62": "channelLineupID",
"63": 2,
"64": "userProfile",
"65": "sessionId",
"66": "hitId",
"67": "actionTime",
"68": "seekTo",
"69": "seekFrom",
"70": "currentPosition"
}
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this, I get:
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do I create a DataFrame based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this:
case class ProgLang(id: String, `type`: String)
I need to convert this JSON data into a DataFrame with two columns, e.g. .toDF("id", "type").
Can anyone shed some light on this error?
You may modify the JSON using Circe.
Given that your values are sometimes Strings and other times Numbers, this was quite complex.
import io.circe._, io.circe.parser._, io.circe.generic.semiauto._

val json = """ ... """ // your JSON here.
val doc = parse(json).right.get

val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
  array.mapArray { jsons =>
    jsons.map { json =>
      json.mapObject { o =>
        o.mapValues { v =>
          // Cast numbers to strings.
          if (v.isString) v else Json.fromString(v.asNumber.get.toString)
        }
      }
    }
  }
}
final case class ProgLang(id: String, `type`: String )
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder
val tags = mappedDoc.top.get.as[Tags].right.get

val data = for {
  tag <- tags.tags
  (id, _type) <- tag
} yield ProgLang(id, _type)
Now that you have a List of ProgLang, you may create a DataFrame directly from it, save it as a file with one JSON object per line, save it as a CSV file, etc.
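For example, a minimal sketch of the DataFrame step (assuming a SparkSession named spark with its implicits imported; progLangDF is just an illustrative name):

import spark.implicits._

val progLangDF = data.toDF()  // columns: id, type (from the ProgLang case class)
progLangDF.show()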
If the file is very big, you may use fs2 to stream it while transforming, it integrates nicely with Circe.
DISCLAIMER: I am far from being a "pro" with Circe; this seems over-complicated for what looks like a simple task, and there is probably a better/cleaner way of doing it (maybe using optics?), but hey, it works! Anyway, if anyone knows a better way to solve this, feel free to edit this answer or provide your own.
val path = "some/path/to/jsonFile.json"
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json(path)
Try the following code if your JSON file is not very big:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)
I need to create a JSON with two elements. The first element is a list and the second element is a simple key-value pair.
My desired output looks as follows:
"{
"tables":[
{"table": "sn: 2134"},
{"table": "sn: 5676"},
{"table": "sn: 4564"},
],
"paid": 219
}"
In the example, the first element is tables, which is a list of table entries. The second element is paid.
I tried it using the play.api.libs.json library, but got stuck while adding the second element.
My code looks as follows:
case class Input(table: String) {
  override def toString = s""""table" : "sn: $table""""
}

implicit val userFormat = Json.format[Input]

val inputsSeq = Seq(Input(table1), Input(table2), Input(table3))
val users = Json.obj("tables" -> inputsSeq)
println(users)
This code prints the JSON as:
"{
"tables":[
{"table": "sn: 2134"},
{"table": "sn: 5676"},
{"table": "sn: 4564"},
]
}
I am not sure how to add the second element to this JSON. Any suggestions on how to resolve this?
Json.obj accepts multiple pairs of (String, JsValueWrapper) as its arguments:
object Json {
...
def obj(fields: (String, JsValueWrapper)*): JsObject = JsObject(fields.map(f => (f._1, f._2.asInstanceOf[JsValueWrapperImpl].field)))
...
}
So you can add both elements like this:
val users = Json.obj("tables" -> inputsSeq, "paid" -> 219)
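As a sketch, if you prefer to build the object in steps, a JsObject also supports appending a field with + (tablesOnly and the literal 219 here are just illustrative; the implicit userFormat from the question is assumed to be in scope):

import play.api.libs.json._

val tablesOnly = Json.obj("tables" -> inputsSeq)
val users = tablesOnly + ("paid" -> JsNumber(219))  // same content as Json.obj("tables" -> inputsSeq, "paid" -> 219)
println(Json.prettyPrint(users))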
I've written a program which processes JSON objects. Now I want to verify whether I've missed something.
Is there a JSON example of all allowed JSON structure combinations? Something like this:
{
  "key1" : "value",
  "key2" : 1,
  "key3" : {"key1" : "value"},
  "key4" : [
    [
      "string1",
      "string2"
    ],
    [
      1,
      2
    ],
    ...
  ],
  "key5" : true,
  "key6" : false,
  "key7" : null,
  ...
}
As you can see at http://json.org/ on the right-hand side, the grammar of JSON isn't very difficult, but I got several exceptions because I had forgotten to handle some structure combinations which are possible. E.g. inside an array there can be "string, number, object, array, true, false, null", but my program couldn't handle arrays inside an array until I ran into an exception. So everything was fine until I got a valid JSON object with arrays inside an array.
I want to test my program with a JSON object (which I'm looking for). After this test I want to feel certain that my program handles every possible valid JSON structure without an exception.
I don't need nesting to depth 5 or so. I only need something nested to depth 2 or at most 3, with every base type nesting all allowed base types inside it.
Have you thought of escaped characters and objects within an object?
{
  "key1" : {
    "key1" : "value",
    "key2" : [
      "String1",
      "String2"
    ]
  },
  "key2" : "\"This is a quote\"",
  "key3" : "This contains an escaped slash: \\",
  "key4" : "This contains accented characters: \u00eb \u00ef"
}
Note: \u00eb and \u00ef are the characters ë and ï, respectively.
Choose a programming language that supports JSON.
Try to load your JSON; on failure, the exception message is descriptive.
Example:
Python:
import json, sys;
json.loads(open(sys.argv[1]).read())
Generate:
import random, json, os, string

def json_null(depth = 0):
    return None

def json_int(depth = 0):
    return random.randint(-999, 999)

def json_float(depth = 0):
    return random.uniform(-999, 999)

def json_string(depth = 0):
    return ''.join(random.sample(string.printable, random.randrange(10, 40)))

def json_bool(depth = 0):
    return random.randint(0, 1) == 1

def json_list(depth):
    lst = []
    if depth:
        for i in range(random.randrange(8)):
            lst.append(gen_json(random.randrange(depth)))
    return lst

def json_object(depth):
    obj = {}
    if depth:
        for i in range(random.randrange(8)):
            obj[json_string()] = gen_json(random.randrange(depth))
    return obj

def gen_json(depth = 8):
    if depth:
        return random.choice([json_list, json_object])(depth)
    else:
        return random.choice([json_null, json_int, json_float, json_string, json_bool])(depth)

print(json.dumps(gen_json(), indent = 2))
After reading a JSON result from a web service response:
val jsonResult: JsValue = Json.parse(response.body)
Containing content something like:
{
  result: [
    ["Name 1", "Row1 Val1", "Row1 Val2"],
    ["Name 2", "Row2 Val1", "Row2 Val2"]
  ]
}
How can I efficiently map the contents of the result array in the JSON with a list (or something similar) like:
val keys = List("Name", "Val1", "Val2")
To get an array of hashmaps?
Something like this?
This solution is functional and handles the None/Failure cases "properly" (by returning a None).
import scala.util.parsing.json.JSON

// json here is the JSON string from the response body
val j = JSON.parseFull( json ).asInstanceOf[ Option[ Map[ String, List[ List[ String ] ] ] ] ]
val res = j.map { m ⇒
  val r = m get "result"
  r.map { ll ⇒
    ll.foldRight( List(): List[ Map[ String, String ] ] ) { ( l, acc ) ⇒
      Map( ( "Name" -> l( 0 ) ), ( "Val1" -> l( 1 ) ), ( "Val2" -> l( 2 ) ) ) :: acc
    }
  }.getOrElse(None)
}.getOrElse(None)
Note 1: I had to put double quotes around result in the JSON String to get the JSON parser to work
Note 2: the code could look nicer using more "monadic" sugar such as for comprehensions or using applicative functors
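Since the question already parses the response with play.api.libs.json, here is an alternative sketch using that library instead (assuming the jsonResult and keys values from the question, and that the result field is quoted as noted in Note 1):

import play.api.libs.json._

val rows = (jsonResult \ "result").as[List[List[String]]]
val maps: List[Map[String, String]] = rows.map(row => keys.zip(row).toMap)
// e.g. Map("Name" -> "Name 1", "Val1" -> "Row1 Val1", "Val2" -> "Row1 Val2")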