I'm using Scala 2.12 with Spark 3.0.0. I have to create one single JSON string where each attribute comes from different tables and save that resulting JSON in another DataFrame, i.e. (using only one row in this example to keep it simple):
tableA
+---+------+--------+
|id |action|date    |
+---+------+--------+
|u1 |insert|20210428|
+---+------+--------+
tableB
+---+---------+--------+
|id |name     |date    |
+---+---------+--------+
|u1 |some name|20210428|
+---+---------+--------+
I need to return the following:
{
  "A": [
    {
      "id": "u1",
      "action": "insert",
      "date": "20210428"
    }
  ],
  "B": [
    {
      "id": "u1",
      "name": "some name",
      "date": "20210428"
    }
  ]
}
I've tried many things, but the closest I've gotten is doing the following for each table:
val tableADF = spark.read.format("delta").load(path + "/tableA")
val tableBDF = spark.read.format("delta").load(path + "/tableB")
Create the DataFrame with all values converted to JSON for each table:
val tableAJsonDF = tableADF.groupBy("date").agg(collect_list(struct($"id",$"action")).alias("attributesA"))
val tableBJsonDF = tableBDF.groupBy("date").agg(collect_list(struct($"id",$"name")).alias("attributesB"))
+--------+---------------------------------+
|date    |attributesA                      |
+--------+---------------------------------+
|20210428|[{"id":"u1", "action": "insert"}]|
+--------+---------------------------------+

+--------+----------------------------------+
|date    |attributesB                       |
+--------+----------------------------------+
|20210428|[{"id":"u1", "name": "some name"}]|
+--------+----------------------------------+
Now combine the JSON from both tables into one JSON to be added to a new DataFrame:
val schema = new StructType().add("request", StringType)
val requestDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
val resultDF = requestDF.withColumn("request", concat(to_json(tableAJsonDF("attributesA")),
to_json(tableBJsonDF("attributesB"))))
But I get the following error. I read that this type of error happens when you try to combine two DataFrames, but I can't seem to find a way to create one single JSON, as shown in the desired results, by combining both attributes into one new column. Any ideas?
org.apache.spark.sql.AnalysisException: Resolved attribute(s)
attributesA#4726,attributesB#4783 missing from request#24790 in
operator !Project [concat(to_json(attributesA#4726, Some(EST)),
to_json(attributesB#4783, Some(EST))) AS request#24792].;;
You need to join the DataFrames: withColumn can only reference columns of the DataFrame it is called on, which is why attributesA and attributesB are reported as missing from the new DataFrame. For example:
val t1 = tableADF.select(
col("id"),
array(struct(tableADF.columns.map(col):_*)).as("A")
)
val t2 = tableBDF.select(
col("id"),
array(struct(tableBDF.columns.map(col):_*)).as("B")
)
val result = t1.join(t2, Seq("id")).select(to_json(struct("A", "B")).as("result"))
result.show(false)
+--------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------+
|{"A":[{"id":"u1","action":"insert","date":"20210428"}],"B":[{"id":"u1","name":"some name","date":"20210428"}]}|
+--------------------------------------------------------------------------------------------------------------+
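If you would rather keep the groupBy("date")/collect_list shape from your own attempt, here is a minimal sketch along the same lines (assuming both tables share the date column; aDF, bDF and the A/B aliases are illustrative names, not anything prescribed by Spark):

import org.apache.spark.sql.functions.{collect_list, struct, to_json}
import spark.implicits._ // needed for the $"column" syntax

// Aggregate each table per date; the aliases become the top-level keys in the JSON
val aDF = tableADF.groupBy("date")
  .agg(collect_list(struct($"id", $"action", $"date")).alias("A"))
val bDF = tableBDF.groupBy("date")
  .agg(collect_list(struct($"id", $"name", $"date")).alias("B"))

// Join on the shared date column and serialize both arrays into a single JSON string
val requestDF = aDF.join(bDF, Seq("date"))
  .select(to_json(struct($"A", $"B")).as("request"))

requestDF then has one row per date with a single request column holding the combined {"A":[...],"B":[...]} string, which matches the shape you were after.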
Related
I have the following table:
local my_table = {data = {{value1 = "test1", value2 = "test2"}, {value3 = "test3", value4 = "test4"}}}
I want to convert this table to JSON format and save it to a file. But when I tried
json.encode(my_table)
I got an error: bad argument #1 to 'encode' (["data"] => string index expected, got number)
I expect this JSON:
{
"data":[
{
"value1":"test1",
"value2":"test2"
},
{
"value3":"test3",
"value4":"test4"
}
]
}
It works!
local json = require'json'
local my_table = {data = {{value1 = "test1", value2 = "test2"}, {value3 = "test3", value4 = "test4"}}}
print(json.encode(my_table)) -- {"data":[{"value1":"test1","value2":"test2"},{"value4":"test4","value3":"test3"}]}
I'm using this repo
Probably, the implementation you are using requires special syntax to treat a Lua table as a JSON array instead of a JSON dictionary.
The implementation I'm using makes this decision (array or dictionary) automatically.
I have the following JSON:
[
  {
    "id":6619137,
    "oid":"6fykq37gm60x",
    "key":{
      "key":"6619137"
    },
    "name":"Prod",
    "planKey":{
      "key":"PDP"
    },
    "environments":[
      {
        "id":6225923,
        "key":{
          "key":"6619137-6225923"
        },
        "name":"Production",
        "deploymentProjectId":6619137
      }
    ]
  },
  {
    "id":6619138,
    "oid":"6fykq37gm60y",
    "key":{
      "key":"6619138"
    },
    "name":"QA",
    "planKey":{
      "key":"QDP"
    },
    "environments":[
      {
        "id":6225924,
        "key":{
          "key":"6619138-6225924"
        },
        "name":"QA",
        "deploymentProjectId":6619138
      }
    ]
  }
]
I can use the code below to extract the values of id and environments.id based on the name value: projectID will give 6619137 and environmentID will give 6225923.
def e = json.planKey.find{it.name=='Prod'}
def projectID = e.id
def environmentID = e.environments[0].id
However, when I try to extract the value of id and environments.id based on the planKey.key value (e.g. PDP or QDP), using the same format as above returns an error of java.lang.NullPointerException: Cannot get property 'id' on null object at Script60.run(Script60.groovy:52)
def e = json.planKey.find{it.planKey=='{key=PDP}'}
def projectID = e.id
def environmentID = e.environments[0].id
Is there a way I can get the projectID with the plankey key value?
Think of your JSON object as a key/value map with multiple levels. There is no such item as
json.find{it.planKey=='{key=PDP}'}
However, you can find with values at any level, like this:
def e = json.find{it.planKey.key == "PDP"}
If you have a structure where planKey may not exist, or it doesn't always have key, that's a bit different, but from your question it sounds like that's not the case here.
EDIT: correcting syntax based on comment below.
I'm ingesting a large, simple JSON dataset from Azure Blob Storage and moving the data into a "stage" called "cities_stage" with FILE_FORMAT = json, like so.
(The error is "Error parsing JSON: unknown keyword "Hurzuf", pos 7."; the steps to reproduce are below.)
create or replace stage cities_stage
url='azure://XXXXXXX.blob.core.windows.net/xxxx/landing/cities'
credentials=(azure_sas_token='?st=XXXXX&se=XXX&sp=racwdl&sv=XX&sr=c&sig=XXX')
FILE_FORMAT = (type = json);
I then take this stage location and dump it into a table with a single VARIANT column, like so. The file I'm ingesting is larger than 16 MB, so I create individual rows for each object by using type = json strip_outer_array = true:
create or replace table cities_raw_source (
src variant);
copy into cities_raw_source
from @cities_stage
file_format = (type = json strip_outer_array = true)
on_error = continue;
When I select * from cities_raw_source, each row looks like the following.
{
"coord": {
"lat": 44.549999,
"lon": 34.283333
},
"country": "UA",
"id": 707860,
"name": "Hurzuf"
}
When I add a reference to "country" or "name", that's where the issues come in. Here is my query (I did not use country in this one, but it produces the same result).
select parse_json(src:id),
parse_json(src:coord:lat),
parse_json(src:coord:lon),
parse_json(src:name)
from cities_raw_source;
ERROR:
Error parsing JSON: unknown keyword "Hurzuf", pos 7.
ID, Lat, and Lon all come back as expected if I remove "src:name"
Any help is appreciated!
It turns out I had everything correct except for the query itself.
When querying a VARIANT column you do not need PARSE_JSON: src:name already returns a VARIANT holding the plain string Hurzuf, so wrapping it in PARSE_JSON makes Snowflake try to parse Hurzuf as JSON text, which fails. The correct query looks like this.
select src:id,
src:coord:lat,
src:coord:lon,
src:name
from cities_raw_source;
I have the following JSON field:
{
"Id": "64848e27-c25d-4f15-99db-b476d868b575",
"Associations_": [
"RatingBlockPinDatum"
],
"RatingScenarioId": "00572f95-9b81-4f7e-a359-3df06b093d4d",
"RatingBlockPinDatum": [
{
"Name": "mappedmean",
"PinId": "I.Assessment",
"Value": "24.388",
"BlockId": "Score"
},
{
"Name": "realmean",
"PinId": "I.Assessment",
"Value": "44.502",
"BlockId": "Score"
}]}
I want to update the Value from 24.388 to a new value in the nested array "RatingBlockPinDatum" where Name = "mappedmean".
Any help would be appreciated. I have already tried this but couldn't adapt it to work properly:
Update nested key with postgres json field in Rails
You could first get one result per element in the RatingBlockPinDatum JSON array (using jsonb_array_length and generate_series) and then filter that result for where the Name key has the value "mappedmean". Then you have the records that need updating. The update itself can be done with jsonb_set:
with cte as (
select id, generate_series(0, jsonb_array_length(info->'RatingBlockPinDatum')-1) i
from mytable
)
update mytable
set info = jsonb_set(mytable.info,
array['RatingBlockPinDatum', cte.i::varchar, 'Value'],
'"99.999"'::jsonb)
from cte
where mytable.info->'RatingBlockPinDatum'->cte.i->>'Name' = 'mappedmean'
and cte.id = mytable.id;
Replace "99.999" with whatever value you want to store in that Value property.
See it run on rextester.com
I'm using the latest Cassandra version and trying to save JSON like below, which was successful:
INSERT INTO mytable JSON '{"username": "myname", "country": "mycountry", "userid": "1"}'
The above query saves the record like this:
"rows": [
{
"[json]": "{\"userid\": \"1\", \"country\": \"india\", \"username\": \"sai\"}"
}
],
"rowLength": 1,
"columns": [
{
"name": "[json]",
"type": {
"code": 13,
"type": null
}
}
]
Now I would like to retrieve the record based on userid:
SELECT JSON * FROM mytable WHERE userid = fromJson("1") // but this query throws error
All this occurs in a node/express app and I'm using dse-driver as the client driver.
The CQL command worked like below,
SELECT JSON * FROM mytable WHERE userid="1";
However, if it has to be executed via the dse-driver, then the below snippet worked:
let query = 'SELECT JSON * FROM mytable WHERE userid = ?';
client.execute(query, ["1"], { prepare: true });
where client is,
const dse = require('dse-driver');
const client = new dse.Client({
contactPoints: ['h1', 'h2'],
authProvider: new dse.auth.DsePlainTextAuthProvider('username', 'pass')
});
If your Cassandra version is 2.1.x or below, you can use a Python-based approach: write a Python script using the Cassandra Python driver API.
Here you have to fetch your row first and then use Python's json.loads, which converts the JSON text column value into a Python dict. Then you can work with the dictionary and extract the nested keys you need. See the code snippet below.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

if __name__ == '__main__':
    auth_provider = PlainTextAuthProvider(username='xxxx', password='xxxx')
    cluster = Cluster(['0.0.0.0'], port=9042, auth_provider=auth_provider)
    session = cluster.connect("keyspace_name")
    print("session created successfully")
    rows = session.execute('select * from user limit 10')
    for user_row in rows:
        # fetch the JSON text column and parse it into a Python dict
        column_dict = json.loads(user_row.json_col)
        print(column_dict.keys())
Assuming userid is the partition key, and assuming you want to retrieve a JSON object corresponding to the user with id 1, you should try:
SELECT JSON * FROM mytable WHERE userid=1;
If userid is of type text, you will need to wrap the value in single quotes (e.g. WHERE userid = '1').