class/id for data.frame cell using knitr - html

I want to give some of the entries of my data.frame a special id or class so that I can use them later in HTML, after turning the data.frame into an HTML table with knitr. Is this possible? I intend to use this later with jQuery DataTables for special formatting.

There are many ways to add a new column to your data.frame. You might try something like:
mydf$newID <- seq(nrow(mydf))
Or
mydf <- transform(mydf, newID = seq(nrow(mydf)))
And many others...
Or, using the data.table package:
mydt[, newID := .I]
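Any of these will work; as a minimal sketch (the toy columns here are made up purely for illustration), the new column then carries through to the HTML table knitr produces, which DataTables can pick up later:
library(knitr)
mydf <- data.frame(x = letters[1:3], y = 1:3)   # toy data purely for illustration
mydf$newID <- seq(nrow(mydf))                   # per-row identifier to reference from jQuery later
kable(mydf, format = "html")                    # HTML table in the knitr output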

Related

Scrape a Table-Like Index from HTML in R

I am currently working to scrape the table at this website, which contains variable IDs, question text, variable type, and origin dataset from ICPSR's PATH Survey data. My end goal is to create a spreadsheet inventory matrix of variable IDs and their corresponding question text by scraping this information in R, but I am having trouble getting it to work. In short, I aim to get the table shown at the URL above into a spreadsheet.
I've tried using rvest, XML, and a number of other packages/strategies (read.table, htmltab, htmlTable, etc.), but the underlying table does not appear to be a table-like object "under the hood", if you will. Therefore, I am struggling to find a resource or previous question that helps with scraping something that is not a table in structure but certainly is a table visually.
Any help would be appreciated on this. Thanks!
I think most of that content is located within a script tag, from which it is pulled dynamically by JavaScript while the page renders.
You can regex out the appropriate JavaScript object and handle it as JSON. However, given the variability within the returned list under response$docs, you will need to spend some time studying the JSON to determine what you want and how you will organise the output, then write a custom function to apply to the list, returning, say, a dataframe of results.
The following shows how to extract the documents list:
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

# grab the page and keep only its text (the results live inside a script tag)
s <- read_html('https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#') %>%
  html_text()

# regex out the JavaScript object holding the search results
r <- stringr::str_match(s, 'searchResults : (\\{.*\\}), searchConfig')

# parse it as JSON and pull out the list of documents
data <- jsonlite::parse_json(r[1,2])
docs <- data$response$docs
And this is a sample item in the list (bearing in mind variability of items within list):
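Given that variability, a hedged sketch of flattening each doc into a row (the id and title fields below are assumptions; inspect str(docs[[1]]) to pick the field names you actually need):
# hypothetical flattener: replace id/title with the real field names from the docs
to_row <- function(d) {
  data.frame(id    = if (is.null(d$id)) NA else d$id,
             title = if (is.null(d$title)) NA else d$title,
             stringsAsFactors = FALSE)
}
result <- do.call(rbind, lapply(docs, to_row))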

How to connect to Statistics Canada JSON from R

I'm trying to connect to an online database through R, which can be found here:
http://open.canada.ca/data/en/dataset/2270e3a4-447c-45f6-8e63-aea9fe94948f
How would I be able to load the data table into R and then be able to simply change the table name in my code to access other tables? I'm not particularly concerned with what format I need to use (JSON, JSON-LD, XML).
Thanks in advance!
Assuming you know the URLs for each of the datasets, a similar question can be found here:
Download a file from HTTPS using download.file()
For this it becomes:
library(RCurl)
URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)                                         # fetch the raw CSV text over HTTP
URLout <- read.csv(textConnection(x), row.names = NULL)  # parse the in-memory text as CSV
I obtained the URL by right-clicking the access button and copying the address.
I had to declare row.names = NULL because the number of columns in the first row is not equal to the number of columns elsewhere, so read.csv assumes row names, as described here. I'm not sure whether the URLs to these datasets change when they are updated, but this isn't a particularly convenient way to get the data. The JSON doesn't seem much better for intuitively switching between datasets.
At least this way you could create a list of URLs and perform the following:
URL <- list(getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"),
            getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr72-eng.htm&lan=eng"))
URLout <- lapply(URL, function(x) read.csv(textConnection(x), row.names = NULL, skip = 2))
Again, I don't like having to declare row.names = NULL, and when I look at the file I don't see a discrepant number of columns, but this will at least get the file into the R environment for you. It may take some more work to perform the operation over multiple URLs.
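For instance, a hedged sketch that builds the URLs from a vector of table filenames (only the two filenames already used above) and reads them all in one go:
files <- c("labr71a-eng.htm", "labr72-eng.htm")
urls  <- paste0("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=",
                files, "&lan=eng")
tables <- lapply(urls, function(u) read.csv(textConnection(getURL(u)),
                                            row.names = NULL, skip = 2))
names(tables) <- files   # keep track of which table came from which file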
In a further effort to obtain useful colnames:
URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)
URLout <- read.csv(textConnection(x), row.names = NULL, skip = 2)
The argument skip = 2 skips the first 2 rows when reading in the CSV and yields some header names. Because the headers are numbers, read.csv prefixes them with an X. Row 2 in this case will have the value "number" in the second column. Unfortunately it appears this data was intended for use within Excel, which is really sad.
1) You need to download the CSV into some directory that you have access to.
2) Use read.csv, read_csv, or fread to read that CSV file into R (the latter two are sketched after this list):
yourTableName <- read.csv("C:/..../canadaDataset.csv")
3) You can assign the CSV to whatever object name you want.
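If you prefer the other two functions mentioned in step 2, the equivalent calls (assuming the readr and data.table packages are installed) are:
library(readr)
library(data.table)
yourTableName <- read_csv("C:/..../canadaDataset.csv")   # readr
yourTableName <- fread("C:/..../canadaDataset.csv")      # data.table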

R JSON list element extraction in a loop

I am parsing JSON objects from pipl.com.
Specifically, I am passing a CSV of contacts to the API, using lapply with fromJSON from the jsonlite package. Then I want to cbind specific elements into a flat dataframe. I have tried mapply, sapply, and lapply followed by rbind, as below, but this isn't working as I expect for any elements other than the ones below. I have tried it individually using the mini.test[1]$records$user_ids syntax, but the original contacts dataframe has hundreds of records, so I was thinking a loop would be able to extract the elements I want.
I am looking to find only the user names for linkedin, facebook and twitter for each user. Thus I was thinking some sort of grepl would help me subset it. I have that vector created and posted the code below too.
I have read multiple r-bloggers articles on the different "apply" functions, looked at the R Cookbook pdf, and read questions on stackoverflow. I am stumped so really appreciate any help.
library(jsonlite)
#sample data
first<-c('Victor','Steve','Mary')
last<-c('Arias','Madden','Johnson')
contacts<-cbind(first,last)
#make urls
urls<-paste('http://api.pipl.com/search/v3/json/?first_name=',contacts[,1],'%09&last_name=',contacts[,2],'&pretty=True&key=xxxxxxx', sep='')
#Parse api
mini.test<-lapply(urls,fromJSON,simplifyMatrix=TRUE,flatten=TRUE)
#Data frame vector name
names <- do.call(rbind, lapply(mini.test, "[[", 5))
display <-do.call(rbind, lapply(names, "[[", 3))
#Grepl for 3 sources
records <- lapply(mini.test, "[[", 7)
twitter <-grepl("twitter.com",records,ignore.case = TRUE)
facebook <-grepl("facebook.com",records,ignore.case = TRUE)
linkedin <-grepl("linkedin.com",records,ignore.case = TRUE)
I know from pipl's response that contacts may have multiple profile user names. For this purpose I just need them unlisted as a string, not as a nested list in the dataframe. In the end I would like a flat file that looks like the below. Again, I sincerely appreciate the help; I have been reading about this for 3 days without much success.
twitter <- c('twitter.username1','twitter.username2','NA')
linkedin <- c('linkedin.username1','linedin.username2','linkedin.username3')
facebook <- c('fb1','fb2','fb3,fb3a')
df<-cbind(display,twitter,linkedin,facebook)

Read specific lines from HTML in R

How can I go about reading a specific line/lines from html in R?
I have "HTMLInternalDocument" object as a result of following code:
url <- myURL
html <- htmlTreeParse(url, useInternalNodes = TRUE)
Now I need to get specific lines from this html object as text, for example to count the number of characters in each line.
How can I do that in R?
Seeing that you are using the XML package, you will need to use one of its node-set functions, such as getNodeSet or xpathApply. This requires some knowledge of XPath, which the function uses to query the HTMLInternalDocument. You can learn more by using ?xpathApply.
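A minimal sketch, assuming you want the text of every p node (the XPath here is just an example; swap it for the nodes you care about):
library(XML)
html <- htmlTreeParse(url, useInternalNodes = TRUE)
paras <- xpathApply(html, "//p", xmlValue)   # text of each matching node
nchar(unlist(paras))                         # character count per extracted piece of text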
Using the XML library is over-complicating the problem. As Grothendieck pointed out, readLines, a base function, will do the job. Something like this:
x <- 10 ## or any other index you want to subset on
html <- readLines(url)
html[x]
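From there, the character count the question asks for is just:
nchar(html[x])   # characters in line x
nchar(html)      # characters in every line at once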

Read a Text File into R

I apologize if this has been asked previously, but I haven't been able to find an example online or elsewhere.
I have a very dirty data file in a text file (it may be JSON). I want to analyze the data in R, and since I am still new to the language, I want to read in the raw data and manipulate it as needed from there.
How would I go about reading in JSON from a text file on my machine? Additionally, if it isn't JSON, how can I read in the raw data as is (not parsed into columns, etc.) so I can go ahead and figure out how to parse it as needed?
Thanks in advance!
Use the rjson package. In particular, look at the fromJSON function in the documentation.
If you want further pointers, then search for rjson at the R Bloggers website.
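A minimal sketch, assuming the whole file (the path here is made up) holds a single JSON document:
library(rjson)
parsed <- fromJSON(file = "myDirtyData.txt")   # parse the file's JSON into an R list
str(parsed)                                    # inspect the resulting structure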
If you want to use the JSON-related packages in R, there are a number of other posts on SO answering this. I presume you have already searched for JSON [r] on this site; there's plenty of info there.
If you just want to read in the text file line by line and process it later, then you can use either scan() or readLines(). They appear to do the same thing, but there's an important difference between them.
scan() lets you define what kind of objects you want to find, how many, and so on. Read the help file for more info. You can use scan to read in every word/number/sign as an element of a vector using e.g. scan(filename, ""). You can also use specific delimiters to separate the data. See also the examples in the help files.
To read line by line, you use readLines(filename) or scan(filename, "", sep = "\n"). Either gives you a vector with the lines of the file as elements, which again allows you to do custom processing of the text. Then again, if you really have to do this often, you might want to consider doing it in Perl.
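A short comparison sketch, assuming a plain-text file called dirty.txt:
tokens <- scan("dirty.txt", what = "")              # every whitespace-separated token as a vector element
lines1 <- readLines("dirty.txt")                    # one element per line
lines2 <- scan("dirty.txt", what = "", sep = "\n")  # line-by-line read via scan
nchar(lines1)                                       # e.g. characters per line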
If your file is in JSON format, you may try the jsonlite, RJSONIO, or rjson packages. All three provide the function fromJSON.
To install a package you use the install.packages function. For example:
install.packages("jsonlite")
And, once the package is installed, you can load it using the library function.
library(jsonlite)
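A minimal sketch with jsonlite, assuming the file (again, the path is made up) contains a single JSON document; fromJSON accepts a file path directly:
dat <- jsonlite::fromJSON("myDirtyData.txt")   # parse the whole file
str(dat)                                       # inspect what came back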
Line-delimited JSON generally has one object per line, so you need to read the file line by line and collect the objects. For example:
con <- file('myBigJsonFile.json')
open(con)
objects <- list()
index <- 1
# read one line at a time until the end of the file, parsing each line as JSON
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  objects[[index]] <- fromJSON(line)
  index <- index + 1
}
close(con)
After that, you have all the data in the objects variable. With that variable you may extract the information you want.