I am fetching data from the Twitter API, converting the JSON response to a data frame, and loading it into a data warehouse. Please find the input and code snippet below.
I am very new to R programming.
stats_campaign.data <- content(stats_campaign.request)
print(stats_campaign.data)
Output:
{
"data_type": [ "stats" ],
"time_series_length": [ 1 ],
"data": [
{
"id": [ "XXXXX" ],
"id_data": [
{
"segment": {},
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
}
]
},
{
"id": [ "XXXX1" ],
"id_data": [
{
"segment": {},
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
}
]
},
When I read this JSON value,
stats_json_file <- sprintf("P:/R Repos/R Applications/TwitterAPIData/stats_test_data-%s.json", TODAY)
jsonlite::fromJSON(stats_json_file)
**Result:**
id id_data
1 5wcaz NULL
2 5ub2u NULL
3 5wb8x NULL
4 5wb1j NULL
5 5yqwj NULL
6 5pq5i NULL
7 5u197 NULL
8 5z2js NULL
9 6fqh0 333250, 4, 9, 19, 111, 3189, 3156, 5, 1091
10 5tvr1 NULL
11 5yqw4 NULL
12 5qqps NULL
13 5yqvw NULL
14 5ygom NULL
15 5nc88 NULL
16 5yg94 NULL
17 65t9e NULL
18 5peck NULL
19 63pg1 247283, 17, 22, 35, 297, 5514, 5450, 6, 2971
20 6cdvy 156705, 1, 2, 6, 112, 10933, 605, 170
From my JSON file I want the id and the whole "metrics" object:
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
and convert it to a data frame to load into a database.
How can I parse this JSON object? I want to retrieve the id and the whole metrics object, then convert it into a data frame to load into a SQL table.
To read the multiple ids and metrics values, I used the code below:
test <- list()
for (i in 1:len) {
  test <- unlist(stats_campaign.data$data[[i]])
  print(test)
}
**Output:**
id
"5wcaz"
id
"5ub2u"
id
"5wb8x"
id
"5wb1j"
id
"5yqwj"
id
"5pq5i"
id
"5u197"
id
"5z2js"
id
"5tvr1"
id
"5yqw4"
id
"5qqps"
id
"5yqvw"
id
"5ygom"
id
"5nc88"
id
"5yg94"
id
"65t9e"
id
"5peck"
id id_data.metrics.impressions
"63pg1" "133227"
id_data.metrics.tweets_send id_data.metrics.follows
"10" "9"
id_data.metrics.retweets id_data.metrics.likes
"17" "96"
id_data.metrics.engagements id_data.metrics.clicks
"2165" "2134"
id_data.metrics.replies id_data.metrics.url_clicks
"5" "1204"
id id_data.metrics.impressions
"6cdvy" "176164"
id_data.metrics.tweets_send id_data.metrics.retweets
"2" "10"
id_data.metrics.likes id_data.metrics.engagements
"121" "9708"
id_data.metrics.clicks id_data.metrics.url_clicks
"620" "160"
Within the for loop, do I have to use a list or something else to append the values each time, and how can I do that? Am I using the right approach? Is there an alternative way to parse the nested JSON object and put it directly into a data frame?
Thanks in advance!
As mentioned in the comments, a bit more information about what output you are looking for would be helpful. In any case, I am hopeful that the following will provide a helpful direction. The tidyjson README provides a helpful overview.
Unfortunately, the lack of data in your JSON object makes it difficult to illustrate what might be present in your data (what to expect in the null objects), and I am having difficulty determining what part of the Twitter API you are looking at. tidyjson gives you the ability to produce a consistent data.frame output, even when you have no data, though! The key verbs are gather and spread, much like tidyr, but with JSON flavor.
str <- "{\"data_type\":[\"stats\"],\"time_series_length\":[1],\"data\":[{\"id\":[\"XXXXX\"],\"id_data\":[{\"segment\":{},\"metrics\":{\"impressions\":{},\"tweets_send\":{},\"qualified_impressions\":{},\"follows\":{},\"app_clicks\":{},\"retweets\":{},\"likes\":{},\"engagements\":{},\"clicks\":{},\"card_engagements\":{},\"replies\":{},\"url_clicks\":{},\"carousel_swipes\":{}}}]},{\"id\":[\"XXXX1\"],\"id_data\":[{\"segment\":{},\"metrics\":{\"impressions\":{},\"tweets_send\":{},\"qualified_impressions\":{},\"follows\":{},\"app_clicks\":{},\"retweets\":{},\"likes\":{},\"engagements\":{},\"clicks\":{},\"card_engagements\":{},\"replies\":{},\"url_clicks\":{},\"carousel_swipes\":{}}}]}]} "
library(dplyr)
library(tidyjson)
prep <- as.tbl_json(str) %>% enter_object("data") %>% gather_array("objid")
p1 <- prep %>% enter_object("id") %>%
gather_array("idnum") %>% append_values_string("id")
p2 <- prep %>% enter_object("id_data") %>% gather_array("datanum") %>%
enter_object("metrics") %>%
spread_values(
impressions = jstring("impressions", "value")
, tweets_send = jnumber("tweets_send", "somekey")
)
p1 %>% tbl_df() %>% left_join(p2 %>% tbl_df(), by = c("document.id", "objid"))
#> # A tibble: 2 x 7
#> document.id objid idnum id datanum impressions tweets_send
#> <int> <int> <int> <chr> <int> <chr> <dbl>
#> 1 1 1 1 XXXXX 1 <NA> NA
#> 2 1 2 1 XXXX1 1 <NA> NA
Related
I have JSON data in a JSON column of an Oracle database. The JSON contains arrays, and I am supposed to load the data into tables by means of PL/SQL. E.g.:
{
"A1": "V1",
"A2": true,
"A3": 42,
"A4": [
{
"B1": "Q1",
"B4": [
{
"C1": "R1",
"C2": false
},
{
"C1": "R2",
"C2": false
}
]
},
{
"B1": "Q2",
"B4": [
{
"C1": "R3",
"C2": false
},
{
"C1": "R4",
"C2": true
}
]
}
]
}
{
"A1": "V2",
"A2": false,
"A3": 42,
"A4": [
{
"B1": "T1",
"B4": [
{
"C1": "S1",
"C2": false
},
{
"C1": "S2",
"C2": false
}
]
},
{
"B1": "T2",
"B4": [
{
"C1": "S3",
"C2": false
},
{
"C1": "S4",
"C2": true
}
]
}
]
}
The data should be loaded to three tables.
A
| 0..1
|
| 1..n
B (no (business) key available)
| 0..1
|
| 1..n
C
Table A:

ID  A1  A2     A3
1   V1  true   42
2   V2  false  42

Table B:

ID  ID A  B1
1   1     Q1
2   1     Q2
3   2     T1
4   2     T2

Table C:

ID  ID B  C1  C2
1   1     R1  false
2   1     R2  false
3   1     R3  false
4   1     R4  true
5   2     S1  false
6   2     S2  false
7   2     S3  false
8   2     S4  true
I am trying to use JSON_TABLE to flatten the data for an implicit cursor of a for loop. I did so by using NESTED PATH. Now imagine we are in the for loop, i.e. for every "record" in C there is a record in the cursor/loop. While the data of C is "as is", the data of B gets repeated, and the data of A likewise, i.e.:
A1  A2     A3  B1  C1  C2
V1  true   42  Q1  R1  false
V1  true   42  Q1  R2  false
V1  true   42  Q2  R3  false
V1  true   42  Q2  R4  true
V2  false  42  T1  S1  false
V2  false  42  T1  S2  false
V2  false  42  T2  S3  false
V2  false  42  T2  S4  true
My initial idea was to insert into table "A" only the first occurrence of each A-tuple, saving the generated ID with RETURNING .. INTO, to insert the first occurrence of each B-tuple into "B", likewise saving the generated ID, and finally to insert the C-tuple into "C". The problem here is detecting the change of the A-tuple and of the B-tuple, respectively. "A" should not be too much of a difficulty, as it has a (business) key that I can also preserve for checks. However, B has no (business) key.
My idea for B was to use an unknown, maybe non-existent, pseudo column provided by JSON_TABLE/NESTED PATH that tells, e.g., to which "B" element the current "C" belongs. In effect, it would be a key for "B". Yet I have not found anything in the Oracle docs or on the internet. Do you happen to know of such a column?
As an alternative, I am thinking about not using NESTED PATH for "C" but returning the array as a JSON column and afterwards somehow creating an inner for loop. However, I do not know how to use a JSON-typed PL/SQL variable to extract an array of objects as a nested table of some sort that can be used as a for loop cursor. Do you happen to know of examples of this?
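For what it's worth, the nested-loop idea can be sketched outside the database. Below is a minimal Python illustration (not PL/SQL; the variable names and surrogate-key scheme are my own assumptions) that walks the two sample documents and emits rows for the three target tables. Because the nesting itself marks where each B group begins and ends, no business key for B is needed.
import json

# Abbreviated sample shaped like the two JSON documents above.
docs_json = """[
  {"A1": "V1", "A2": true, "A3": 42,
   "A4": [{"B1": "Q1", "B4": [{"C1": "R1", "C2": false}, {"C1": "R2", "C2": false}]},
          {"B1": "Q2", "B4": [{"C1": "R3", "C2": false}, {"C1": "R4", "C2": true}]}]},
  {"A1": "V2", "A2": false, "A3": 42,
   "A4": [{"B1": "T1", "B4": [{"C1": "S1", "C2": false}, {"C1": "S2", "C2": false}]},
          {"B1": "T2", "B4": [{"C1": "S3", "C2": false}, {"C1": "S4", "C2": true}]}]}
]"""

rows_a, rows_b, rows_c = [], [], []
a_id = b_id = c_id = 0

for a in json.loads(docs_json):
    a_id += 1
    rows_a.append((a_id, a["A1"], a["A2"], a["A3"]))
    for b in a.get("A4", []):        # inner loop over B; the nesting is the key
        b_id += 1
        rows_b.append((b_id, a_id, b["B1"]))
        for c in b.get("B4", []):    # innermost loop over C
            c_id += 1
            rows_c.append((c_id, b_id, c["C1"], c["C2"]))

print(rows_a)  # [(1, 'V1', True, 42), (2, 'V2', False, 42)]
print(rows_b)  # [(1, 1, 'Q1'), (2, 1, 'Q2'), (3, 2, 'T1'), (4, 2, 'T2')]
print(rows_c)  # [(1, 1, 'R1', False), ..., (8, 4, 'S4', True)]
A PL/SQL version would presumably do the same traversal over the parsed JSON rather than over a flattened JSON_TABLE cursor, which sidesteps the change-detection problem entirely.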
Is there a way that I am missing completely?
Kind regards
Thiemo
I have a JSON document and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list, and I'm trying to get the keys as below:
id = [key for key in data.keys()]
This fails with the error:
A list object does not have an attribute keys. How can I get around this to produce the output below?
Here is my JSON:
{
"1": {
"task": [
"wakeup",
"getready"
]
},
"2": {
"task": [
"brush",
"shower"
]
},
"3": {
"task": [
"brush",
"shower"
]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
"time": 60,
"color": 1002,
"change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}
The output I need is as below. I do not need data from the rest of the JSON.
id  task
1   wakeup, getready
2   brush, shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
.selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
.withColumn('task', F.explode('task.task'))
.groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
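If the relevant keys are not known in advance, a possible variation (an untested sketch, assuming every wanted entry is an object containing a "task" list, so here it would also pick up key "3") is to filter on the shape of the value instead of hard-coding the keys:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items()
         if isinstance(v, dict) and 'task' in v]   # keeps "1", "2", "3"; skips "activites", "statuscheck", "action"
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()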
We are trying to format a JSON document similar to this:
[
{"id": 1,
"type": "A",
"changes": [
{"id": 12},
{"id": 13}
],
"wanted_key": "good",
"unwanted_key": "aaa"
},
{"id": 2,
"type": "A",
"unwanted_key": "aaa"
},
{"id": 3,
"type": "B",
"changes": [
{"id": 31},
{"id": 32}
],
"unwanted_key": "aaa",
"unwanted_key2": "aaa"
},
{"id": 4,
"type": "B",
"unwanted_key3": "aaa"
},
null,
null,
{"id": 7}
]
into something like this:
[
{
"id": 1,
"type": "A",
"wanted_key": true # every record must have this key/value
},
{
"id": 12, # note: this was in the "changes" property of record id 1
"type": "A", # type should be the same type than record id 1
"wanted_key": true
},
{
"id": 13, # note: this was in the "changes" property of record id 1
"type": "A", # type should be the same type than record id 1
"wanted_key": true
},
{
"id": 2,
"type": "A",
"wanted_key": true
},
{
"id": 3,
"type": "B",
"wanted_key": true
},
{
"id": 31, # note: this was in the "changes" property of record id 3
"type": "B", # type should be the same type than record id 3
"wanted_key": true
},
{
"id": 32, # note: this was in the "changes" property of record id 3
"type": "B", # type should be the same type than record id 3
"wanted_key": true
},
{
"id": 4,
"type": "B",
"wanted_key": true
},
{
"id": 7,
"type": "UNKN", # records without a type should have this type
"wanted_key": true
}
]
So far, I've been able to:
- remove null records
- obtain the keys we need with their default
- give records without a type a default type

What we are missing:

- from records having a changes key, create new records with the type of their parent record
- join all records in a single array
Unfortunately we are not entirely sure how to proceed... Any help would be appreciated.
So far our jq goes like this:
del(..|nulls) | map({id, type: (.type // "UNKN"), wanted_key: (true)}) | del(..|nulls)
Here's our test code:
https://jqplay.org/s/eLAWwP1ha8P
The following should work:
map(select(values))
| map(., .type as $type | (.changes[]? + {$type}))
| map({id, type: (.type // "UNKN"), wanted_key: true})
- Only select non-null values
- Return the original items followed by their inner changes array (+ outer type)
- Extract 3 properties for output
Multiple map calls can usually be combined, so this becomes:
map(
select(values)
| ., (.type as $type | (.changes[]? + {$type}))
| {id, type: (.type // "UNKN"), wanted_key: true}
)
Another option without variables:
map(
select(values)
| ., .changes[]? + {type}
| {id, type: (.type // "UNKN"), wanted_key: true}
)
# or:
map(select(values))
| map(., .changes[]? + {type})
| map({id, type: (.type // "UNKN"), wanted_key: true})
or even with a separate normalization step for the unknown type:
map(select(values))
| map(.type //= "UNKN")
| map(., .changes[]? + {type})
| map({id, type, wanted_key: true})
# condensed to a single line:
map(select(values) | .type //= "UNKN" | ., .changes[]? + {type} | {id, type, wanted_key: true})
Explanation:
- Select only non-null values from the array
- If type is not set, create the property with value "UNKN"
- Produce the original array items, followed by their nested changes elements extended with the parent type
- Reshape objects to only contain properties id, type, and wanted_key.
Here's one way:
map(
select(values)
| (.type // "UNKN") as $type
| ., .changes[]?
| {id, $type, wanted_key: true}
)
[
{
"id": 1,
"type": "A",
"wanted_key": true
},
{
"id": 12,
"type": "A",
"wanted_key": true
},
{
"id": 13,
"type": "A",
"wanted_key": true
},
{
"id": 2,
"type": "A",
"wanted_key": true
},
{
"id": 3,
"type": "B",
"wanted_key": true
},
{
"id": 31,
"type": "B",
"wanted_key": true
},
{
"id": 32,
"type": "B",
"wanted_key": true
},
{
"id": 4,
"type": "B",
"wanted_key": true
},
{
"id": 7,
"type": "UNKN",
"wanted_key": true
}
]
Demo
Something like the following should work:
map(
select(type == "object") |
( {id}, {id : ( .changes[]? .id )} ) +
{ type: (.type // "UNKN"), wanted_key: true }
)
jq play - demo
I have a JSON document like below:
{
"Data": [{
"Code": "ABC",
"ID": 123456,
"Type": "Yes",
"Geo": "East"
}, {
"Code": "XYZ",
"ID": 987654,
"Type": "No",
"Geo": "West"
}],
"Total": 2,
"AggregateResults": null,
"Errors": null
}
My PySpark sample code:
getjsonresponsedata=json.dumps(getjsondata)
jsonDataList.append(getjsonresponsedata)
jsonRDD = sc.parallelize(jsonDataList)
df_Json=spark.read.json(jsonRDD)
display(
    df_Json.withColumn("Code", explode(col("Data.Code")))
           .withColumn("ID", explode(col("Data.ID")))
           .select('Code', 'ID')
)
When I explode the JSON, I get the records below (it looks like a cross join):
Code ID
ABC 123456
ABC 987654
XYZ 123456
XYZ 987654
But I expect the records below:
Code ID
ABC 123456
XYZ 987654
Could you please help me with how to get the expected result?
You only need to explode the Data column; then you can select fields from the resulting struct column (Code, Id, ...). What duplicates the rows here is that you're exploding two arrays, Data.Code and Data.Id.
Try this instead:
import pyspark.sql.functions as F
df_Json.withColumn("Data", F.explode("Data")).select("Data.Code", "Data.Id").show()
#+----+------+
#|Code| Id|
#+----+------+
#| ABC|123456|
#| XYZ|987654|
#+----+------+
Or using inline function directly on Data array:
df_Json.selectExpr("inline(Data)").show()
#+----+----+------+----+
#|Code| Geo| ID|Type|
#+----+----+------+----+
#| ABC|East|123456| Yes|
#| XYZ|West|987654| No|
#+----+----+------+----+
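If you only want the two columns from the expected result, the inline output can be narrowed the same way (a small follow-on to the snippet above):
# keep only Code and ID from the inline() result
df_Json.selectExpr("inline(Data)").select("Code", "ID").show()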
I have the following table:
CREATE TABLE api_data (
id bigserial NOT NULL PRIMARY KEY,
content JSONB NOT NULL
);
Now I insert an array like this into the content column:
[{ "id": 44, "name": "address One", "petId": 1234 },
{ "id": 45, "name": "address One", "petId": 1234 },
{ "id": 46, "name": "address One", "petId": 1111 }]
What I want next is to get exactly the objects that have the "petId" set to a given value.
I figured I could do
select content
from api_data
WHERE content @> '[{"petId":1234}]'
But that returns the whole array.
Another thing I found is this query:
select val
from api_data
JOIN LATERAL jsonb_array_elements(content) obj(val) ON obj.val->>'petId' = '1234'
WHERE content @> '[{"petId":1234}]'
This returns the object I am looking for, but three times, which matches the number of elements in the array.
What I actually need is a result like this:
[{ "id": 44, "name": "address One", "petId": 1234 },
{ "id": 45, "name": "address One", "petId": 1234 }]
If you are using Postgres 12, you can use a JSON path expression:
select jsonb_path_query_array(content, '$[*] ? (@.petId == 1234)') as content
from api_data
where content @> '[{"petId":1234}]';
If you are using an older version, you need to unnest and aggregate manually:
select (select jsonb_agg(e)
from jsonb_array_elements(d.content) as t(e)
where t.e @> '{"petId":1234}') as content
from api_data d
where d.content @> '[{"petId":1234}]'