Clean HTML table with Reshape2 - html

New user of R. Can't think how to even ask the question. I scraped a webpage for HTML tables. Generally, everything went well, except for one table. Instead of there being 7 separate tables, everything got collapsed into 1 table, with the column name and the value for the first table being two separate columns, and all the other tables being rows. The results is a table with something like this:
df <- data.frame(is_employed = c("Hobbies", "Has Previous Experience"), false = c("squash", "false"))
Obviously, I need to have the rows (and the column name) in the first column as their own columns, with the item in the second column as their values, preferably with underscores in the columns names. I tried:
df <- dcast(df, ~is_employed, value.var = "false")
But got an error message. Then I thought to add another column, as such:
df2 <- data.frame(number = c(1, 2), is_employed = c("Hobbies", "Has Previous Experience"), false = c("squash", "false"))
then I tried
df3 <- dcast(df2, number ~is_employed, value.var="false")
That placed the values in the first columns as their own columns, but produced two rows (instead of 1), with NAs. I'm sure this is really basic, but I can't figure it out.
On edit:
I think this gives me what I want, but I'm away from my computer so I can't confirm:
library("dplyr")
library("tidyr")
mat <- as.matrix(df)
mat <- rbind(colnames(mat), mat)
colnames(mat) <- c("variable", "value")
df2 <- as.data.frame(mat)
df3 <- df2 %>%
mutate(n = 1) %>%
spread(variable, value) %>%
select(-n)
I need to add n or I get NAs, but I don't like it.

Is this what you're after?
mat <- as.matrix(df)
mat <- rbind(colnames(mat), mat)
colnames(mat) <- c("variable", "value")
mat
# variable value
# [1,] "is_employed" "false"
# [2,] "Hobbies" "squash"
# [3,] "Has Previous Experience" "false"
as.data.frame(mat)
# variable value
# 1 is_employed false
# 2 Hobbies squash
# 3 Has Previous Experience false

Related

Remove a flextable columns after the flextable creation

I create a flextable based on a csv file, I put some style on it, change some cells. Then I would like to remove a specific columns of this flextable before add it to a doc.
Is-there a way to create a copy of a flextable and specifying col_keys?
mydf <- GetData(....)
cols <- names(mydf)
myft <- flextable(mydf, col_keys = cols)
# Adding style to ft...
# ....
# Here I want to remove one column to the ft (and only here, not when first creating the ft)
# something as:
# ft <- CreateCopyOfFlextable(ft,cols[-which(cols=='COLB')])
#
my_doc <- read_docx()
my_doc <- my_doc %>% body_add_par("") %>%
body_add_flextable(value = ft)
print(my_doc, target = 'c:/temp/doc.docx')
I just had the same problem and had a devil of a time Googling for a solution. #David-Gohel truly has the answer here, but I feel the need to provide a similar solution with additional explanation.
My problem and the OPs is that we wanted to use the data from columns that wouldn't be displayed to influence the formatting of columns that will be displayed. The concept that was not initially obvious is that you can send a data frame to flextable with more columns than you intend to display (instead of displaying all and deleting them you've used them). Then, by using the col_keys argument, you can select only those columns that you want to display while keeping the remaining ones around for additional processing (e.g., for using compose(), paragraph(), or add_chunk()).
If I understand correctly, COLB is supposed to be a flag to indicate that certain rows of COLC should be modified. If so, then my solution looks like this:
library(flextable)
library(magrittr)
library(officer)
df <- data.frame(COLA=c('a', 'b', 'c'),
COLB=c('', 'changevalue', ''),
COLC=c(10, 12, 13))
ft <- flextable(df, col_keys = c("COLA", "COLC")) %>% # Retain but don't display COLB
compose(i = ~ COLB =='changevalue', # Use COLB for conditional modifications
j = "COLC",
value = as_paragraph(as_chunk('100')),
part = 'body') %>%
style(i = ~ COLB =='changevalue', # Use COLB for conditional formatting on COLC
j = "COLC",
pr_t = fp_text(color = "black",
font.size = 11,
bold = TRUE,
italic = FALSE,
underline = FALSE,
font.family = "Times New Roman"),
part = "body")
ft
And here's what the above code produces (e.g., the "changevalue" column is the trigger for conditionally inserting 100 in COLC and also for changing the formatting):
library(flextable)
library(magrittr)
library(officer)
df <- data.frame(COLA=c('a','b','c'),
COLB=c('','changevalue',''),
COLC=c(10,12,13))
ft<-flextable(df, col_keys = c("COLA", "COLB"))
ft <- ft %>%
style(i= ~ COLB=='changevalue',
pr_t=fp_text(color="black", font.size=11, bold=TRUE, italic=FALSE, underline=FALSE, font.family="Times New Roman"),part="body")
ft<-compose(ft, i=2, j="COLB", value = as_paragraph(as_chunk('100')),part = 'body')
ft
I'm styling another column based on COLB.
Here an example:
df <- data.frame(COLA=c('a','b','c'),COLB=c('','changevalue',''),COLC=c(10,12,13))
ft<-flextable(df)
ft <- ft %>% style(i=which(ft$body$dataset$COLB=='changevalue'),pr_t=fp_text(color="black", font.size=11, bold=TRUE, italic=FALSE, underline=FALSE, font.family="Times New Roman"),part="body")
ft<-compose(ft, i=2,j=3, value = as_paragraph(as_chunk('100')),part = 'body')
# now I want to remove the COLB columns as I don't need it anymore
# ???????
my_doc <- read_docx()
my_doc <- my_doc %>% body_add_par("") %>% body_add_flextable(value = ft)
print(my_doc, target = 'c:/temp/orliange_p/sample.docx') %>% invisible()

rbind fromJSON page: duplicate rowname error

I was trying to rbind some json data scraped from api
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
# Generate url for each page
url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',i)
# Get json data from each page and transform it into dataframe
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Changing the row.names to null doesn't work. I heard from someone it is due to the fact that some data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package WDI to access this data and it works well, but I want to know how to resolve the duplicates row name problem here in general so that I can deal with similar situation where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is a single bracket can select multiple elements whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
a = rnorm(100),
b = rnorm(100),
c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
url <- paste0(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
i
)
res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2) # Use lapply to select the 2 element from each list element.
resl <- do.call(rbind, resl) # This takes all the elements of the list and uses those elements as the arguments for rbind.

Flatten deep nested json in R

I am trying to use R to convert a nested JSON file into a two dimensional dataframe.
My JSON file has a nested structure. But, the names and properties are the same across levels.
{"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}
The desired dataset would look like this. Although the exact column names can be different.
name value c__name c_value c_c_name c_c_value
A 1 a1 11 a11 111
A 1 a1 11 a12 112
A 1 a2 12
The code I have so far flattens the data, but it only seems to work for the first level (see the screenshot of the output).
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
data <- fromJSON(json_file, flatten = TRUE)
View(data)
I tried multiple packages, including jsonlite and RJSONIO, I spent the last 5 hours 5 hours debugging this and trying various online tutorial, but without success. Thanks for your help!
Firstly, that is some ugly JSON; if you have a way of avoiding it, do so. Consequently, what follows is also pretty ugly—to the degree that I normally wouldn't post it, but I am doing so now in the hope that some of the approaches may be of use. If it offends your eyes, let me know and I'll delete it.
library(jsonlite) # for fromJSON
library(reshape2) # for melt
library(dplyr) # for inner_join, select
jlist <- fromJSON(json_file)
jdf <- as.data.frame(jlist)
jdf$c.value <- as.numeric(jdf$c.value) # fix type
jdf$L1 <- as.integer(factor(jdf$c.name)) # for use as a key with an artifact of melt later *urg, sorry*
ccdf <- melt(jdf$c.c) # get nested list into usable form
names(ccdf)[1:2] <- c('c.c.name', 'c.c.value') # fix names so they won't cause problems with the join
df3 <- inner_join(jdf[, -5], ccdf) # join, take out nested column
df3$c.c.value <- as.numeric(df3$c.c.value) # fix type
df3 <- df3 %>% select(-L1, -c) # get rid of useless columns
which leaves you with
> df3
name value c.name c.value c.c.name c.c.value
1 A 1 a1 11 a11 111
2 A 1 a1 11 a12 112
3 A 1 a2 12 <NA> NA
with reasonably sensible types. The packages used are avoidable, if you like.
Is this scalable? Well, not really, without more of the same mess. If anybody else has a less nasty and more scalable approach for dealing with nasty JSON, please post it; I'd be as grateful as the OP.
I think I figured out a way to do this. It seems to work with larger trees. The idea is to unlist the JSON and use the names attribute of the unlisted elements. In this example, if a node has one parent, the name attribute will start with "c.", if it has a parent and a "grandparent", it will list it as "c.c."...etc. So, the code below uses this structure to find the level of nesting and placing the node in the appropriate columns. The rest of the code adds the attributes of the parent nodes and deletes extra rows generated. I know it is not elegant, but I thought it might be useful for others.
library(stringr)
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
nestedjson <- fromJSON(json_file, simplifyVector = F) #read the json
nAttrPerNode <- 2 #number of attributes per node
strChild <- "c." #determines level of nesting
unnestedjson <- unlist(nestedjson) #convert JSON to unlist
unnestednames <- attr(unnestedjson, "names") #get the names of the cells
depthTree <- (max(str_count(unnestednames, strChild)) + 1) * nAttrPerNode #maximum tree depth
htTree <- length(unnestednames) / nAttrPerNode #maximum tree height (number of branches)
X <- array("", c(htTree, depthTree))
for (nodeht in 1:htTree){ #iterate through the branches and place the nodes based on the count of strChild in the name attribute
nodeIndex <- nodeht * nAttrPerNode
nodedepth <- str_count(unnestednames[nodeIndex], strChild) + 1
X[nodeht, nodedepth * nAttrPerNode - 1] <- unnestedjson[nodeIndex - 1]
X[nodeht, nodedepth * nAttrPerNode] <- unnestedjson[nodeIndex]
}
for (nodeht in 2:htTree){ #repeat the parent node attributes for the children
nodedepth <- 0
repeat{
nodedepth <- nodedepth + 1
startcol <- nodedepth * nAttrPerNode - 1
endcol <- startcol + nAttrPerNode - 1
if (X[nodeht, startcol] == "" & nodedepth < depthTree/nAttrPerNode){
X[nodeht, startcol:endcol] <- X[nodeht-1, startcol:endcol]
} else {
break()
}
}
}
deleteRows <- NULL #Finally delete the rows that only have the parent attributes for nodes that have children
strBranches <- apply(X, 1, paste, collapse="")
for (nodeht in 1:(htTree-1)){
branch2sub <- substr(strBranches[nodeht+1], 1, nchar(strBranches[nodeht]))
if (strBranches[nodeht]==branch2sub){
deleteRows <- c(deleteRows, nodeht)
}
}
deleteRows
X <- X[-deleteRows,]

Extract JSON data from the rows of an R data frame

I have a data frame where the values of column Parameters are Json data:
# Parameters
#1 {"a":0,"b":[10.2,11.5,22.1]}
#2 {"a":3,"b":[4.0,6.2,-3.3]}
...
I want to extract the parameters of each row and append them to the data frame as columns A, B1, B2 and B3.
How can I do it?
I would rather use dplyr if it is possible and efficient.
In your example data, each row contains a json object. This format is called jsonlines aka ndjson, and the jsonlite package has a special function stream_in to parse such data into a data frame:
# Example data
mydata <- data.frame(parameters = c(
'{"a":0,"b":[10.2,11.5,22.1]}',
'{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)
# Parse json lines
res <- jsonlite::stream_in(textConnection(mydata$parameters))
# Extract columns
a <- res$a
b1 <- sapply(res$b, "[", 1)
b2 <- sapply(res$b, "[", 2)
b3 <- sapply(res$b, "[", 3)
In your example, the json structure is fairly simple so the other suggestions work as well, but this solution will generalize to more complex json structures.
I actually had a similar problem where I had multiple variables in a data frame which were JSON objects and a lot of them were NA's, but I did not want to remove the rows where NA's existed. I wrote a function which is passed a data frame, id within the data frame(usually a record ID), and the variable name in quotes to parse. The function will create two subsets, one for records which contain JSON objects and another to keep track of NA value records for the same variable then it joins those data frames and joins their combination to the original data frame thereby replacing the former variable. Perhaps it will help you or someone else as it has worked for me in a few cases now. I also haven't really cleaned it up too much so I apologize if my variable names are a bit confusing as well as this was a very ad-hoc function I wrote for work. I also should state that I did use another poster's idea for replacing the former variable with the new variables created from the JSON object. You can find that here : Add (insert) a column between two columns in a data.frame
One last note: there is a package called tidyjson which would've had a simpler solution but apparently cannot work with list type JSON objects. At least that's my interpretation.
library(jsonlite)
library(stringr)
library(dplyr)
parse_var <- function(df,id, var) {
m <- df[,var]
p <- m[-which(is.na(m))]
n <- df[,id]
key <- n[-which(is.na(df[,var]))]
#create df for rows which are NA
key_na <- n[which(is.na(df[,var]))]
q <- m[which(is.na(m))]
parse_df_na <- data.frame(key_na,q,stringsAsFactors = FALSE)
#Parse JSON values and bind them together into a dataframe.
p <- lapply(p,function(x){
fromJSON(x) %>% data.frame(stringsAsFactors = FALSE)}) %>% bind_rows()
#bind the record id's of the JSON values to the above JSON parsed dataframe and name the columns appropriately.
parse_df <- data.frame(key,p,stringsAsFactors = FALSE)
## The new variables begin with a capital 'x' so I replace those with my former variables name
n <- names(parse_df) %>% str_replace('X',paste(var,".",sep = ""))
n <- n[2:length(n)]
colnames(parse_df) <- c(id,n)
#join the dataframe for NA JSON values and the dataframe containing parsed JSON values, then remove the NA column,q.
parse_df <- merge(parse_df,parse_df_na,by.x = id,by.y = 'key_na',all = TRUE)
#Remove the new column formed by the NA values#
parse_df <- parse_df[,-which(names(parse_df) =='q')]
####Replace variable that is being parsed in dataframe with the new parsed and names values.######
new_df <- data.frame(append(df,parse_df[,-which(names(parse_df) == id)],after = which(names(df) == var)),stringsAsFactors = FALSE)
new_df <- new_df[,-which(names(new_df) == var)]
return(new_df)
}

Ragged dataframe in R, jsonlite::fromJSON

I am new to importing .json files for use in R. I'm trying to create a 'long' format dataframe - each row is one participant, each column is one variable. Most of my dataset is compatible after calling fromJSON, but one nested json structure results in a ragged list, with Null, 1, 2, or 3 entries for each participant (in theory there could be more).
Sample:
testdf <- fromJSON("[[\"MMM\",\"AAA\"],null,[\"GGG\",\"CCC\",\"NNN \"],null,null,[\"AAA\",\"NNN \"],null,[\"MMM\",\"AAA\"],null,null,null,null,[\"MMM\",\"AAA\"],[\"CCC\",\"AAA\"],\"NNN \",[\"MMM\",\"NNN \",\"EEE\"],null,null,[\"CCC\",\"MMM\",\"AAA\"],[\"HHH\",\"AAA\"],\"AAA\",[\"MMM\",\"AAA\",\"NNN \"],[\"CCC\",\"AAA\"],[\"MMM\",\"AAA\",\"NNN \"],[\"AAA\",\"NNN \"],[\"MMM\",\"AAA\"],null,null,null,null,null,null]", flatten=TRUE)
How can I transform this list into a 32 x n dataframe which preserves the null values?
Variations on unlist remove the null values; rbind.fill moves entries to the next row, of course - could something like cbind.fill work? (cbind a df with an empty df (cbind.fill?))
Something hidden in plyr?
Thanks for any suggestions.
Fairly straightforward:
t(sapply(testdf, function(x) {
if (is.null(x)) x <- NA_character_
length(x) <- 3
x })
)
If you want to choose the number of columns automatically, then you need to calculate that first:
nc <- max(sapply(testdf, length))
t(sapply(testdf, function(x) {
if (is.null(x)) x <- NA_character_
length(x) <- nc
x })
)