I am parsing JSON data to write a CSV file, using the tidyjson package.
At one point I need to spread the subject names below into separate columns, with the score as each column's value. That is, Physics and Mathematics should become column names, and the scores should be their values.
{
"results": {
"subjects": [
{
"subject": {
"name": "Physics",
"code": "PHY"
},
"score": 70
},
{
"subject": {
"name": "Mathematics",
"code": "MATH"
},
"score": 50
}
]
}
}
I have tried as below:
json_data %>%
as.tbl_json %>%
gather_array %>%
spread_values(user_id = jstring("user_id")) %>%
enter_object("results") %>%
enter_object("subjects") %>%
gather_array("subjects") %>%
spread_values(score = jstring("score")) %>%
enter_object("subject") %>%
spread_values(name = jstring("name")) %>%
mutate(Physics = case_when(.$name == "Physics" ~ score)) %>%
mutate(Mathematics = case_when(.$name == "Mathematics" ~ score))
But this shows multiple rows for one student. I need a single row per student, with each subject as a column and its score as the value.
So you need a unique row per subject name? In that case you can use aggregate.
If you have a data frame named df like this:
subject <- c("phy", "math", "phy", "math")
Score <- c(10, NA, NA, 20)
df <- data.frame(subject, Score)
then,
aggregate(x=df[c("Score")], by=list(subjectName=df$subject), max, na.rm = TRUE)
Output (aggregate sorts the result by the grouping variable):
  subjectName Score
1        math    20
2         phy    10
I have a dataframe grouped by ID1, ID2, ID3 with variables V1, V2, V3 and V4. I am trying to capture the values per group in R that differ most from the median. For this, I have subtracted the median from each value and squared the result (because some differences are negative). Below is an example dataframe.
colnames <- c("ID1", "ID2", "ID3", "V1", "V2", "V3", "V4")
a <- c("A", "B", "C", "D")
b <- c("X", "Y", "Z", "T")
c <- c("1", "2", "3", "4")
d <- c(1.23,2.03,2.45,5.66)
e <- c(1,2,3,4)
df <-data.frame(a,b,c,d,e)
I have made a function med_removed as follows.
med_removed <- function(x, na.rm = TRUE, ...) {
mad <- sort((x- median(x, na.rm = T))^2)
y <- head(mad, 4)
y
}
df_selected <- df %>% group_by(ID1, ID2, ID3) %>% mutate_all(., med_removed)
The problem is that I want to select rows in the original dataframe based on (x - median(x))^2, picking the top 2 values.
Does anyone know a good way of doing that?
Thanks
I have a dataset in a JSON file:
[
{
"id": 333831567,
"pieceId": 25395616,
"status": 10800,
"userId": 911,
"startTime": 1490989764,
"endTime": 1491001113
},
{
"id": 333883698,
"pieceId": 25390812,
"status": 10451,
"userId": 88738562,
"startTime": 1491004450,
"endTime": 1491004579
The JSON file has over 15000 entries. How do I count the unique status values in this dataset?
Using pandas
import pandas as pd
# convert your "data" into pandas dataframe
df = pd.DataFrame.from_dict(data, orient='columns')
# count the unique values in the status column
df.loc[: ,'status'].nunique()
Using dictionary comprehension + len() + set()
res = {key: len(set([sub[key] for sub in data ]))
for key in data[0].keys()}
# print the number of unique values for each key in the dictionary
print("Unique count of keys : " + str(res))
# print the number of unique values for status
print("Unique count of status : " + str(res['status']))
Update (requirement from the comments): the count should come from a class method with the signature def unique_statuses_count(self) -> int
class Jsondt:
    def __init__(self, data):
        # data is the list of dictionaries parsed from the JSON file
        self.data = data

    def unique_statuses_count(self) -> int:
        # return the number of distinct values in the "status" column as an integer;
        # swap "status" for any other key present in the data
        return len({record["status"] for record in self.data})

Jsondt(data).unique_statuses_count()
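If pandas is not available, the same count can be sketched with the standard library alone. This is only a minimal illustration: the inline sample string stands in for the real file (which would normally be read with json.load), and the names sample, status_counts and unique_statuses are made up for this example.

```python
import json
from collections import Counter

# Hypothetical inline sample mirroring the two records shown in the question;
# in practice the records would come from json.load() on the real file.
sample = """
[
  {"id": 333831567, "pieceId": 25395616, "status": 10800, "userId": 911,
   "startTime": 1490989764, "endTime": 1491001113},
  {"id": 333883698, "pieceId": 25390812, "status": 10451, "userId": 88738562,
   "startTime": 1491004450, "endTime": 1491004579}
]
"""


def status_counts(records):
    """Return a Counter mapping each status value to how often it occurs."""
    return Counter(record["status"] for record in records)


records = json.loads(sample)
counts = status_counts(records)

# The number of distinct status values is the number of keys in the Counter.
unique_statuses = len(counts)
print(unique_statuses)
```

Using a Counter rather than a plain set also keeps the per-status frequencies around, which tends to be the next thing one wants after the unique count.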
I'm working with a species data set in presence/absence format, where samples have been taken multiple times a day over a period of several days.
Here's a dummy version of the data:
dummy = structure(list(Sample = c("A1", "A1", "A1", "A2", "A2", "A2",
"B1", "B1", "B1", "B2", "B2", "B2"), Species = c("snuffles1",
"snuffles2", "snuffles3", "snuffles1", "snuffles2", "snuffles3",
"snuffles1", "snuffles2", "snuffles3", "snuffles1", "snuffles2",
"snuffles3"), Presence = c(1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1
), Day = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B")), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
ggplot(dummy[which(dummy$Presence>0),], aes(x = Day, y = Species, color = Species)) +
geom_point(alpha=0.5) +
geom_count(aes(size = sum(dummy$Presence)))
I would like to plot the data in ggplot so that the size of each point depends on the number of observations within that group (i.e. if snuffles1 was observed 2 times on Day A, the point should be size 2, whereas if snuffles1 was observed once on Day B, the point should be size 1). I hope this makes sense? The question on counting presence/absence by group is similar, but not quite what I need.
My guess is that I have to use some sort of function to count the number of observations for each species, dependent on which variable I'm considering, but I can't work out how to do this.
Thanks for any and all advice.
Make an additional count by group, then plot this data frame as an extra layer using geom_point.
I am adding breaks to scale_size in order to show only the existing sizes.
library(tidyverse)
count_dum <- dummy %>% group_by(Day, Species) %>% summarise(count = sum(Presence))
ggplot(dummy[which(dummy$Presence > 0), ], aes(x = Day, y = Species, color = Species)) +
geom_point(data = count_dum, aes(size = count), alpha = 0.5) +
scale_size_continuous(breaks = unique(count_dum$count))
Using the code below, is there an easy way to combine presentations p1 and p2 together?
library(officer)
library(magrittr)
p1 = read_pptx() %>% add_slide(layout = "Two Content", master = "Office Theme") %>% ph_with_text(type = "body", str = "First Slide")
p2 = read_pptx() %>% add_slide(layout = "Two Content", master = "Office Theme") %>% ph_with_text(type = "body", str = "Second Slide")
I am fetching data from the Twitter API, converting it from a JSON object to a data frame, and loading it into a data warehouse. The input and code snippet are below.
I am very new to R programming.
stats_campaign.data <- content(stats_campaign.request)
print(stats_campaign.data)
Output:
`{
"data_type": [ "stats" ],
"time_series_length": [ 1 ],
"data": [
{
"id": [ "XXXXX" ],
"id_data": [
{
"segment": {},
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
}
]
},
{
"id": [ "XXXX1" ],
"id_data": [
{
"segment": {},
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
}
]
},`
When I read this JSON file,
stats_json_file <- sprintf("P:/R Repos/R Applications/TwitterAPIData/stats_test_data-%s.json", TODAY)
jsonlite::fromJSON(stats_json_file)
**Result :**
id id_data
1 5wcaz NULL
2 5ub2u NULL
3 5wb8x NULL
4 5wb1j NULL
5 5yqwj NULL
6 5pq5i NULL
7 5u197 NULL
8 5z2js NULL
9 6fqh0 333250, 4, 9, 19, 111, 3189, 3156, 5, 1091
10 5tvr1 NULL
11 5yqw4 NULL
12 5qqps NULL
13 5yqvw NULL
14 5ygom NULL
15 5nc88 NULL
16 5yg94 NULL
17 65t9e NULL
18 5peck NULL
19 63pg1 247283, 17, 22, 35, 297, 5514, 5450, 6, 2971
20 6cdvy 156705, 1, 2, 6, 112, 10933, 605, 170
From my JSON file I want the id and the whole metrics object:
"metrics": {
"impressions": {},
"tweets_send": {},
"qualified_impressions": {},
"follows": {},
"app_clicks": {},
"retweets": {},
"likes": {},
"engagements": {},
"clicks": {},
"card_engagements": {},
"replies": {},
"url_clicks": {},
"carousel_swipes": {}
}
How can I parse this JSON object? I want to retrieve the id and the whole metrics object, then convert them into a data frame to load into a SQL table.
To read the multiple ids and metrics values I used the code below:
`test <- list()
for(i in 1:len)
{ test <- unlist(stats_campaign.data$data[[i]])
print(test)}`
**Output:**
id
"5wcaz"
id
"5ub2u"
id
"5wb8x"
id
"5wb1j"
id
"5yqwj"
id
"5pq5i"
id
"5u197"
id
"5z2js"
id
"5tvr1"
id
"5yqw4"
id
"5qqps"
id
"5yqvw"
id
"5ygom"
id
"5nc88"
id
"5yg94"
id
"65t9e"
id
"5peck"
id id_data.metrics.impressions
"63pg1" "133227"
id_data.metrics.tweets_send id_data.metrics.follows
"10" "9"
id_data.metrics.retweets id_data.metrics.likes
"17" "96"
id_data.metrics.engagements id_data.metrics.clicks
"2165" "2134"
id_data.metrics.replies id_data.metrics.url_clicks
"5" "1204"
id id_data.metrics.impressions
"6cdvy" "176164"
id_data.metrics.tweets_send id_data.metrics.retweets
"2" "10"
id_data.metrics.likes id_data.metrics.engagements
"121" "9708"
id_data.metrics.clicks id_data.metrics.url_clicks
"620" "160"
Within the for loop I have to use a list or something else to append the values on each iteration. How can I do that? Am I using the right approach? Is there an alternative way to parse a nested JSON object and put it directly into a data frame?
Please help! Thanks in advance!
As mentioned in the comments, a bit more information about what output you are looking for would be helpful. In any case, I am hopeful that the following will provide a helpful direction. The tidyjson README provides a bit of helpful overview.
Unfortunately, the lack of data in your JSON object makes it difficult to illustrate what might be present in your data (what to expect in the null objects), and I am having difficulty determining what part of the Twitter API you are looking at. tidyjson gives you the ability to produce a consistent data.frame output, even when you have no data, though! The key verbs are gather and spread, much like tidyr, but with JSON flavor.
str <- "{\"data_type\":[\"stats\"],\"time_series_length\":[1],\"data\":[{\"id\":[\"XXXXX\"],\"id_data\":[{\"segment\":{},\"metrics\":{\"impressions\":{},\"tweets_send\":{},\"qualified_impressions\":{},\"follows\":{},\"app_clicks\":{},\"retweets\":{},\"likes\":{},\"engagements\":{},\"clicks\":{},\"card_engagements\":{},\"replies\":{},\"url_clicks\":{},\"carousel_swipes\":{}}}]},{\"id\":[\"XXXX1\"],\"id_data\":[{\"segment\":{},\"metrics\":{\"impressions\":{},\"tweets_send\":{},\"qualified_impressions\":{},\"follows\":{},\"app_clicks\":{},\"retweets\":{},\"likes\":{},\"engagements\":{},\"clicks\":{},\"card_engagements\":{},\"replies\":{},\"url_clicks\":{},\"carousel_swipes\":{}}}]}]} "
library(dplyr)
library(tidyjson)
prep <- as.tbl_json(str) %>% enter_object("data") %>% gather_array("objid")
p1 <- prep %>% enter_object("id") %>%
gather_array("idnum") %>% append_values_string("id")
p2 <- prep %>% enter_object("id_data") %>% gather_array("datanum") %>%
enter_object("metrics") %>%
spread_values(
impressions = jstring("impressions", "value")
, tweets_send = jnumber("tweets_send", "somekey")
)
p1 %>% tbl_df() %>% left_join(p2 %>% tbl_df(), by = c("document.id", "objid"))
#> # A tibble: 2 x 7
#> document.id objid idnum id datanum impressions tweets_send
#> <int> <int> <int> <chr> <int> <chr> <dbl>
#> 1 1 1 1 XXXXX 1 <NA> NA
#> 2 1 2 1 XXXX1 1 <NA> NA