I'm using scala 2.12 spark 3.0.0. I have to create one single json string where each attribute comes from different tables and save that resulting json in another dataframe, i.e (using 1 row only in this example to keep it simple)
tableA
id
action
date
u1
insert
20210428
tableB
id
name
date
u1
some name
20210428
I need to return the following :
{
"A": [
{
"id":"u1",
"action": "insert",
"date": "20210428"
}
],
"B": [
{
"id":"u1",
"name": "some name"
"date": "20210428",
}
]
}
I've tried many things but the closest i've gotten is doing the following for each table:
val tableADF = spark.read.format("delta").load(path +"/tableA")
val tableADF = spark.read.format("delta").load(path +"/tableB")
create the dataframe with all vaules converted to json for each table
val tableAJsonDF = tableADF.groupBy("date").agg(collect_list(struct($"id",$"action")).alias("attributesA"))
val tableBJsonDF = tableBDF.groupBy("date").agg(collect_list(struct($"id",$"name")).alias("attributesB"))
date
attributesA
20210428
[{"id":"u1", "action": "insert"}]
date
attributesB
20210428
[{"id":"u1", "name": "some name"}]
Now combine the json from both tables into one json to be added to a new dataframe:
val schema = new StructType().add("request", StringType)
val requestDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
val resultDF = requestDF.withColumn("request", concat(to_json(tableAJsonDF("attributesA")),
to_json(tableBJsonDF("attributesB"))))
but I get the following error. I read that this type of error happens when you try to combine two dataframes but I can't seem to find a way to create 1 single json as shown in the desired results by combining both attributes into 1 new column, any ideas?
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
attributesA#4726,attributesB#4783 missing from request#24790 in
operator !Project [concat(to_json(attributesA#4726, Some(EST)),
to_json(attributesB#4783, Some(EST))) AS request#24792].;;
You need to join the dataframes. e.g.
val t1 = tableADF.select(
col("id"),
array(struct(tableADF.columns.map(col):_*)).as("A")
)
val t2 = tableBDF.select(
col("id"),
array(struct(tableBDF.columns.map(col):_*)).as("B")
)
val result = t1.join(t2, Seq("id")).select(to_json(struct("A", "B")).as("result"))
result.show(false)
+--------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------+
|{"A":[{"id":"u1","action":"insert","date":"20210428"}],"B":[{"id":"u1","name":"some name","date":"20210428"}]}|
+--------------------------------------------------------------------------------------------------------------+
I'm ingesting a large simple json dataset from Azure Blob and moving data into a "stage" called "cities_stage" with FILE_FORMAT = json like so.
(Here is the error steps are below "Error parsing JSON: unknown keyword "Hurzuf", pos 7.")
create or replace stage cities_stage
url='azure://XXXXXXX.blob.core.windows.net/xxxx/landing/cities'
credentials=(azure_sas_token='?st=XXXXX&se=XXX&sp=racwdl&sv=XX&sr=c&sig=XXX')
FILE_FORMAT = (type = json);
I then take this stage location and dump it into a table with a single variant column like so. The file I'm ingesting is larger than 16mb so I create individual rows for each object by using type = json strip_outer_array = true
create or replace table cities_raw_source (
src variant);
copy into cities_raw_source
from #cities_stage
file_format = (type = json strip_outer_array = true)
on_error = continue;
When I select * from cities_raw_source each row looks like the following.
{
"coord": {
"lat": 44.549999,
"lon": 34.283333
},
"country": "UA",
"id": 707860,
"name": "Hurzuf"
}
When I add a reference to "country" or "name" that's where the issues come in. Here is my query (I did not use country in this one but it produces the same result).
select parse_json(src:id),
parse_json(src:coord:lat),
parse_json(src:coord:lon),
parse_json(src:name)
from cities_raw_source;
ERROR:
Error parsing JSON: unknown keyword "Hurzuf", pos 7.
ID, Lat, and Lon all come back as expected if I remove "src:name"
Any help is appreciated!
It turns out I had everything correct except for the query itself.
When querying a VARIANT column you do not need to PARSE_JSON so the correct query would look like this.
select src:id,
src:coord:lat,
src:coord:lon,
src:name
from cities_raw_source;
My requirement is to convert two string and create a JSON file(using spray JSON) and save in a resource directory.
one input string contains the ID and other input strings contain the score and topic
id = "alpha1"
inputstring = "science 30 math 24"
Expected output JSON is
{“ContentID”: “alpha1”,
“Topics”: [
{"Score" : 30, "TopicID" : "Science" },
{ "Score" : 24, "TopicID" : "math”}
]
}
below is the approach I have taken and am stuck in the last place
Define the case class
case class Topic(Score: String, TopicID: String)
case class Model(contentID: String, topic: Array[Topic])
implicit val topicJsonFormat: RootJsonFormat[Topic] = jsonFormat2(Topic)
implicit val modelJsonFormat: RootJsonFormat[Model] = jsonFormat2(Model)
Parsing the input string
val a = input.split(" ").zipWithIndex.collect{case(v,i) if (i % 2 == 0) =>
(v,i)}.map(_._1)
val b = input.split(" ").zipWithIndex.collect{case(v,i) if (i % 2 != 0) =>
(v,i)}.map(_._1)
val result = a.zip(b)
And finally transversing through result
paired foreach {case (x,y) =>
val tClass = Topic(x, y)
val mClassJsonString = Topic(x, y).toJson.prettyPrint
out1.write(mClassJsonString.toString)
}
And the file is generated as
{"Score" : 30, "TopicID" : "Science" }
{ "Score" : 24, "TopicID" : "math”}
The problem is I am not able to add the contentID as needed above.
Adding ContentId inside foreach is making contentID added multiple time.
You're calling toJson inside foreach creating strings and then you're appending it to buffer.
What you probably wanted to do is to create a class (ADT) hierarchy first and then serialize it:
val topics = paired.map(Topic)
//toArray might be not necessary if topics variable is already an array
val model = Model("alpha1", topics.toArray)
val json = model.toJson.prettyPrint
out1.write(json.toString)
I'm using mongolite package to connect and retrieve data from MongoDB.How to pass value in mongolite find query
##connecting mongodb
library(mongolite)
mongo<-mongolite::mongo(collection = "Sample", db = "Test", url =
"mongodb://User:123#Wyyuyu:13333/ty2_U",verbose = TRUE)
## getting all data from collection data from collection below query is working fine.
values <- mongo$find()
## But I want to filter specific value by passing value.
for(i in c("process","check","queue"))
{
values <- mongo$find('{"field" : i}',)
}
if I tried above code i'm getting getting below issues . please help me to resolve
Error: Invalid JSON object: {"field" : i}
Given your i is a variable, you need to create the string using something like paste0:
values <- mongo$find(paste0('{"field" : ', i, '}') )
but rather than a loop you could also use
values <- mongo$find('{"field" : { "$in" : [ "process", "check", "queue" ] } }')
I see a lot of references to "compressed JSON" when it comes to different serialization formats. What exactly is it? Is it just gzipped JSON or something else?
Compressed JSON removes the key:value pair of json's encoding to store keys and values in seperate parallel arrays:
// uncompressed
JSON = {
data : [
{ field1 : 'data1', field2 : 'data2', field3 : 'data3' },
{ field1 : 'data4', field2 : 'data5', field3 : 'data6' },
.....
]
};
//compressed
JSON = {
data : [ 'data1','data2','data3','data4','data5','data6' ],
keys : [ 'field1', 'field2', 'field3' ]
};
This method of usage i found here
Content from link (http://www.nwhite.net/?p=242)
rarely find myself in a place where I am writing javascript applications that use AJAX in its pure form. I have long abandoned the ‘X’ and replaced it with ‘J’ (JSON). When working with Javascript, it just makes sense to return JSON. Smaller footprint, easier parsing and an easier structure are all advantages I have gained since using JSON.
In a recent project I found myself unhappy with the large size of my result sets. The data I was returning was tabular data, in the form of objects for each row. I was returning a result set of 50, with 19 fields each. What I realized is if I augment my result set I could get a form of compression.
// uncompressed
JSON = {
data : [
{ field1 : 'data1', field2 : 'data2', field3 : 'data3' },
{ field1 : 'data4', field2 : 'data5', field3 : 'data6' },
.....
]
};
//compressed
JSON = {
data : [ 'data1','data2','data3','data4','data5','data6' ],
keys : [ 'field1', 'field2', 'field3' ]
};
I merged all my values into a single array and store all my fields in a separate array. Returning a key value pair for each result cost me 8800 byte (8.6kb). Ripping the fields out and putting them in a separate array cost me 186 bytes. Total savings 8.4kb.
Now I have a much more compressed JSON file, but the structure is different and now harder to work with. So I implement a solution in Mootools to make the decompression transparent.
Request.JSON.extend({
options : {
inflate : []
}
});
Request.JSON.implement({
success : function(text){
this.response.json = JSON.decode(text, this.options.secure);
if(this.options.inflate.length){
this.options.inflate.each(function(rule){
var ret = ($defined(rule.store)) ? this.response.json[rule.store] : this.response.json[rule.data];
ret = this.expandData(this.response.json[rule.data], this.response.json[rule.keys]);
},this);
}
this.onSuccess(this.response.json, text);
},
expandData : function(data,keys){
var arr = [];
var len = data.length; var klen = keys.length;
var start = 0; var stop = klen;
while(stop < len){
arr.push( data.slice(start,stop).associate(keys) );
start = stop; stop += klen;
}
return arr;
}
});
Request.JSON now has an inflate option. You can inflate multiple segments of your JSON object if you so desire.
Usage:
new Request.JSON({
url : 'url',
inflate : [{ 'keys' : 'fields', 'data' : 'data' }]
onComplete : function(json){}
});
Pass as many inflate objects as you like to the option inflate array. It has an optional property called ’store’ If set the inflated data set will be stored in that key instead.
The ‘keys’ and ‘fields’ expect strings to match a location in the root of your JSON object.
Based in Paniyar's answer, we can convert a List of Objects in "compressed" Json format using C# like this:
var JsonString = serializer.Serialize(
new
{
cols = new[] { "field1", "field2", "field3"},
items = data.Select(x => new object[] {x.field1, x.field2, x.field3})
});
I used an array of object because the fields can be int, bool, string...
More Reduction:
If the field is repeated very often and it is a string type, you can get compressed a little be more if you add a distinct list of that field... for instance, a field name job position, city, etc are excellent candidate for this. You can add a distinct list of this items and in each item change the value for a reference number. That will make your Json more lite.
Compressed:
[["KeyA", "KeyB", "KeyC", "KeyD", "KeyE", "KeyF"],
["ValA1", "ValB1", "ValC1", "ValD1", "ValE1", "ValF1"],
["ValA2", "ValB2", "ValC2", "ValD2", "ValE2", "ValF2"],
["ValA3", "ValB3", "ValC3", "ValD3", "ValE3", "ValF3"],
["ValA4", "ValB4", "ValC4", "ValD4", "ValE4", "ValF4"]]
Uncompressed:
[{KeyA: "ValA1", KeyB: "ValB1", KeyC: "ValC1", KeyD: "ValD1", KeyE: "ValE1", KeyF: "ValF1"},
{KeyA: "ValA2", KeyB: "ValB2", KeyC: "ValC2", KeyD: "ValD2", KeyE: "ValE2", KeyF: "ValF2"},
{KeyA: "ValA3", KeyB: "ValB3", KeyC: "ValC3", KeyD: "ValD3", KeyE: "ValE3", KeyF: "ValF3"},
{KeyA: "ValA4", KeyB: "ValB4", KeyC: "ValC4", KeyD: "ValD4", KeyE: "ValE4", KeyF: "ValF4"}]
The most likely answer is that it really is just gzipped JSON. There is no other standard meaning to this phrase.
Re-organizing a homogenous array of JSON objects into a pair of arrays is a very useful technique to make the payload smaller and to speed up encoding and decoding, it is not commonly called "compressed JSON". I haven't run across it ever in open source or any open API, but we use this technique internally and call it "jsontable".