R: scraping dynamic links with rvest - html

I'm trying to use rvest to scrape links to RSS feeds from the Internet Archive that sit under a 'dynamic' calendar; see this link as an example.
<div>
<div class="captures">
<div class="position" style="width: 20px; height: 20px;">
<div class="measure ">
</div>
</div>
12
</div>
<!-- react-empty: 2310 --></div>
For example,
url %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
doesn't return the links I'm interested in, and xpath or html_nodes('.captures') return empty results. Any hints would be very helpful, thanks!

One possibility is to use the wayback package (GL) (GH), which has support for querying the Internet Archive and reading in the HTML of saved pages ("mementos"). You can research a bit more about web archiving terminology (it's a bit arcane IMO) via http://www.mementoweb.org/guide/quick-intro/ and https://mementoweb.org/guide/rfc/ as starter resources.
library(wayback) # devtools::install_git(one of the superscript'ed links above)
library(rvest) # for reading the resulting HTML contents
library(tibble) # mostly for prettier printing of data frames
There are a number of approaches one could take. This is what I tend to do during forensic analysis of online content. YMMV.
First, we get the recorded mementos (basically a short-list of relevant content):
(rss <- get_mementos("http://www.dailyecho.co.uk/news/district/winchester/rss/"))
## # A tibble: 7 x 3
## link rel ts
## <chr> <chr> <dttm>
## 1 http://www.dailyecho.co.uk/news/district/winchester/rss/ original NA
## 2 http://web.archive.org/web/timemap/link/http://www.dailyecho.co… timemap NA
## 3 http://web.archive.org/web/http://www.dailyecho.co.uk/news/dist… timegate NA
## 4 http://web.archive.org/web/20090517035444/http://www.dailyecho.… first me… 2009-05-17 03:54:44
## 5 http://web.archive.org/web/20180712045741/http://www.dailyecho.… prev mem… 2018-07-12 04:57:41
## 6 http://web.archive.org/web/20180812213013/http://www.dailyecho.… memento 2018-08-12 21:30:13
## 7 http://web.archive.org/web/20180812213013/http://www.dailyecho.… last mem… 2018-08-12 21:30:13
The calendar-menu viewer thing at IA is really the "timemap". I like to work with this as it's the point-in-time memento list of all the crawls. It's the second link above so we'll read it in:
(tm <- get_timemap(rss$link[2]))
## # A tibble: 45 x 5
## rel link type from datetime
## <chr> <chr> <chr> <chr> <chr>
## 1 original http://www.dailyecho.co.uk:80/news/d… NA NA NA
## 2 self http://web.archive.org/web/timemap/l… applicatio… Sun, 17 May … NA
## 3 timegate http://web.archive.org NA NA NA
## 4 first memento http://web.archive.org/web/200905170… NA NA Sun, 17 May 20…
## 5 memento http://web.archive.org/web/200908130… NA NA Thu, 13 Aug 20…
## 6 memento http://web.archive.org/web/200911121… NA NA Thu, 12 Nov 20…
## 7 memento http://web.archive.org/web/201001121… NA NA Tue, 12 Jan 20…
## 8 memento http://web.archive.org/web/201007121… NA NA Mon, 12 Jul 20…
## 9 memento http://web.archive.org/web/201011271… NA NA Sat, 27 Nov 20…
## 10 memento http://web.archive.org/web/201106290… NA NA Wed, 29 Jun 20…
## # ... with 35 more rows
The content is in the mementos and there should be as many mementos there as you see in the calendar view. We'll read in the first one:
mem <- read_memento(tm$link[4]) # row 4 of the timemap is the "first memento"
# Ideally use writeLines(), now, to save this to disk with a good
# filename. Alternatively, stick it in a data frame with metadata
# and saveRDS() it. But, that's not a format others (outside R) can
# use so perhaps do the data frame thing and stream it out as ndjson
# with jsonlite::stream_out() and compress it during save or afterwards.
Then convert it to something we can use programmatically with xml2::read_xml() or xml2::read_html() (RSS is sometimes better parsed as XML):
read_html(mem)
## {xml_document}
## <html>
## [1] <body><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daily Ec ...
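As a rough illustration of the XML route mentioned above, here is a minimal sketch that pulls item titles and links out of RSS text with xml2. The feed text is a small hand-written stand-in, not the content of a real memento, and the column names are arbitrary choices.

```r
library(xml2)

# A tiny hand-written RSS document standing in for the text of a memento.
rss_text <- '<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>'

feed <- read_xml(rss_text)

# Each <item> is one entry; take its <title> and <link> children.
items <- xml_find_all(feed, "//item")
feed_items <- data.frame(
  title = xml_text(xml_find_first(items, "./title")),
  link = xml_text(xml_find_first(items, "./link")),
  stringsAsFactors = FALSE
)
print(feed_items)
```

Parsing as XML keeps the feed's own element names, whereas read_html() wraps everything in html and body nodes first.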
read_memento() has an as parameter to automagically parse the result but I like to store the mementos locally (as noted in the comments) so as not to abuse the IA servers (i.e. if I ever need to get the data again I don't have to hit their infrastructure).
A big caveat: if you request too many resources from the IA in a short period of time, you'll get temporarily banned. They have scale, but it's a free service and they (rightfully) try to prevent abuse.
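One way to follow the "store locally, don't hammer the servers" advice is a small caching wrapper. This is only a sketch: cached_fetch(), its arguments, and the file-naming scheme are hypothetical helpers, and the fetcher used here is a fake so the example runs offline. With wayback you would presumably pass a wrapper around read_memento() instead.

```r
# Fetch a URL's text once and reuse the copy on disk afterwards.
# `fetch` is any function that takes a URL and returns character data;
# `cache_dir`, `delay` and the file naming are illustrative choices.
cached_fetch <- function(url, fetch, cache_dir = tempdir(), delay = 1) {
  # Turn the URL into a safe file name by replacing unusual characters.
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", url), ".txt"))
  if (file.exists(path)) {
    return(paste(readLines(path, warn = FALSE), collapse = "\n"))
  }
  Sys.sleep(delay)  # pause before every real request
  body <- paste(fetch(url), collapse = "\n")
  writeLines(body, path)
  body
}

# Stand-in fetcher that counts how often it is called.
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  "<rss><channel><title>demo</title></channel></rss>"
}

demo_dir <- tempfile("memento-cache-")
dir.create(demo_dir)
u <- "http://web.archive.org/web/20090517035444/http://example.com/rss/"
first <- cached_fetch(u, fake_fetch, cache_dir = demo_dir, delay = 0)
second <- cached_fetch(u, fake_fetch, cache_dir = demo_dir, delay = 0)
```

The second call is answered from disk, so the fetcher runs only once.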
Definitely file issues to the package (pick your favourite source code hosting community to do so as I'll work with either but prefer GitLab after the Microsoft takeover of GitHub) if anything is unclear or you feel could be made better. It's not a popular package and I only have occasional need for forensic spelunking so it "works for me" but I'll gladly try to make it more user-friendly (I just need to know the pain points).

Related

r Google News Results Links

I am new to getting information from the web into R, but I found this nice code (How to get google search results) showing how to get links from an ordinary Google search into R.
I need to get this method running for the Google NEWS search.
I know I have to change the URL by adding something like "&source=lnms&tbm=nws".
The URL I construct leads me to the right news result page if I copy and paste it from R into my browser, so far so good.
I looked at the HTML code of the news result page and found that the information lies inside h3[@class='r dO0Ag'], but there is another node and I don't know how to code this part.
Would appreciate any help!
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, domain = '.de', quotes=TRUE)
{
search.term <- gsub(' ', '%20', search.term)
if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
#construct google news url
getGoogleURL <- paste('http://www.google', domain, '/search?q=',
search.term, sep='',"&source=lnms&tbm=nws")
return(getGoogleURL)
}
getGoogleLinks <- function(google.url) {
doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
#?? Wrong part - gives error evaluating xpath expression ??
nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']//a[@class='l lLrAF'//")
dirt_links=sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])
links <- gsub('/url\\?q=','',sapply(strsplit(dirt_links[as.vector(grep('url',dirt_links))],split='&'),'[',1))
return(links)
}
search.term <- "China"
quotes <- "TRUE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
You have a number of options here.
Either RCurl or RSelenium will work.
The key point is to generate the correct URL:
> library(XML)
> library(RCurl)
> search.term <- "china"
> quotes=FALSE
> start=0
> getGoogleURL <- paste('http://www.google.com',
+ '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
+ search.term, "&start=",start,sep='')
> getGoogleURL
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=china&start=0"
>
At this point, you can dereference the URL, create the HTML parse tree, and extract the node data. The start parameter sets the offset of the first result returned, counting from zero. For example, start=3 begins at the fourth result.
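To make the start parameter concrete, the following sketch builds the URLs for several consecutive batches of results. news_urls() is a hypothetical helper, and per_page = 10 is an assumption about how many results one request returns, not something Google guarantees.

```r
# Build Google News search URLs for several result offsets.
# `per_page = 10` is an assumption about the number of results per request.
news_urls <- function(search.term, pages = 3, per_page = 10) {
  starts <- seq(0, by = per_page, length.out = pages)
  paste0("http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=",
         URLencode(search.term, reserved = TRUE),
         "&start=", starts)
}

urls <- news_urls("china trade")
print(urls)
```

Stepping start by the batch size avoids fetching overlapping results, which is what happens when it is stepped by one.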
Working Code Example:
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
search.term <- gsub(' ', '%20', search.term)
if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
getGoogleURL <- paste('http://www.google.com',
'/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
search.term, "&start=",start,sep='')
getGoogleURL <- URLencode(getGoogleURL)
}
getGoogleNews <- function(search.term="China",
start=0,
quotes=FALSE ){
google.url <- getGoogleURL(search.term=search.term,
start, quotes=quotes)
print(google.url)
doc <- getURL(google.url,
httpheader = c("User-Agent" = "R(3.0.3)"))
html <- htmlTreeParse(doc, useInternalNodes = TRUE,
error=function(...){}, asText = TRUE)
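# Result links: <a href=...> elements directly inside <h3> headings
nodes <- getNodeSet(html, "//*/h3/a[@href]")
title <- sapply(nodes, function(x) x <- xmlValue(x))
# Take only the href attribute, even if a link carries other attributes
url <- unname(sapply(nodes, function(x) xmlGetAttr(x, "href")))
url <- gsub("\\/url\\?q=", "", url)
# Source and age of each article
nodes <- getNodeSet(html, "//div[@class='slp']")
source <- sapply(nodes, function(x) x <- xmlValue(x))
# Summary snippet of each article
nodes <- getNodeSet(html, "//div[@class='st']")
summary <- sapply(nodes, function(x) x <- xmlValue(x))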
data.frame(title=title, source=source, url=url, summary=summary)
}
getGoogleNews("China")
getGoogleNews("China", 1)
getGoogleNews("China", 2)
Runtime:
> library(XML)
> library(RCurl)
> getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
+ search.term <- gsub(' ', '%20', search.term)
+ if(quotes) search.term <- paste( .... [TRUNCATED]
> getGoogleNews <- function(search.term="China",
+ start=0,
+ quotes=FALSE ){
+ google.url <- ge .... [TRUNCATED]
> getGoogleNews("China")
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=0"
title
1 Taiwan says China is 'out of control' as it loses El Salvador to Beijing
2 China central bank official rebuts Trump's claim it is manipulating the ...
3 Airbnb Wants to Find a Home in China
4 China's biggest risk may be its property market — not the trade war
5 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
6 China reaches 800 million internet users
7 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
8 7 Signs that China's Military is Becoming More Dangerous
9 Asia markets trade mostly higher as investors look ahead to US ...
10 Can China, the world's biggest pork producer, contain a fatal pig ...
source
1 CNBC - 17 hours ago
2 CNBC - 10 hours ago
3 WIRED - 13 hours ago
4 CNBC - 23 hours ago
5 Business Insider - 11 hours ago
6 TechCrunch - 10 hours ago
7 Express.co.uk - 12 hours ago
8 The National Interest Online (blog) - 16 hours ago
9 CNBC - 17 hours ago
10 Science Magazine - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/taiwan-says-china-out-of-control-as-it-loses-el-salvador-to-beijing.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIFCgAMAA&usg=AOvVaw2cSTmS65-6IvKQV9xrl3y3
2 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIHSgAMAE&usg=AOvVaw2q7yr2oBWHib3bRAVmOna-
3 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm
4 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIKSgAMAM&usg=AOvVaw1bUY5Ii7AlWURDifpeozJU
5 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIILCgAMAQ&usg=AOvVaw0yGdVilstHZVBBXEuuAbmu
6 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp
7 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIOCgAMAY&usg=AOvVaw3W5adCnWdzz71zvpgE1x6D
8 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIPigAMAc&usg=AOvVaw1k05lyvFRrx_FImDKIsZ61
9 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIQSgAMAg&usg=AOvVaw0YqzZPNbH9bawkv8qX8Bdm
10 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIRCgAMAk&usg=AOvVaw1H0c03l4trLI3cbRRlnKJW
summary
1 Taiwan vowed on Tuesday to fight China's "increasingly out of control" behavior after Taipei lost another ally to Beijing when El Salvador ...
2 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
3 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
4 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
5 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
6 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
7 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
8 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
9 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
10 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
> getGoogleNews("China", 1)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=1"
title
1 China central bank official rebuts Trump's claim it is manipulating the ...
2 Airbnb Wants to Find a Home in China
3 China's biggest risk may be its property market — not the trade war
4 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
5 China reaches 800 million internet users
6 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
7 7 Signs that China's Military is Becoming More Dangerous
8 Asia markets trade mostly higher as investors look ahead to US ...
9 Can China, the world's biggest pork producer, contain a fatal pig ...
10 How China, India and the US use healthcare aid to win influence in ...
source
1 CNBC - 10 hours ago
2 WIRED - 13 hours ago
3 CNBC - 23 hours ago
4 Business Insider - 11 hours ago
5 TechCrunch - 10 hours ago
6 Express.co.uk - 12 hours ago
7 The National Interest Online (blog) - 16 hours ago
8 CNBC - 17 hours ago
9 Science Magazine - 5 hours ago
10 ABC News - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggUKAAwAA&usg=AOvVaw1Muu65XvSSWVKX06-5syLY
2 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggdKAAwAQ&usg=AOvVaw0Py7bJDY3tIj4KxgwYot1A
3 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgggKAAwAg&usg=AOvVaw2EHMCQvFQV9ubu17ERCZFO
4 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggjKAAwAw&usg=AOvVaw1sMhG0tyUnj8j2W02gD3aW
5 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggsKAAwBA&usg=AOvVaw1ODs1JY8V_ETi24ugz-yNn
6 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggvKAAwBQ&usg=AOvVaw0r0HQNfZhEwfbiEocUC74Z
7 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg1KAAwBg&usg=AOvVaw2hpQQXrAm2HW158II7F1kG
8 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg4KAAwBw&usg=AOvVaw2surM3fW-lLJDd9P-r7xJB
9 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg7KAAwCA&usg=AOvVaw3Lzvks6B0Un4IEgoMh86re
10 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg-KAAwCQ&usg=AOvVaw1Ogg8I6mUvDSCc9F90Usg4
summary
1 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
2 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
3 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
4 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
5 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
6 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
7 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
8 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
9 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
10 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
> getGoogleNews("China", 2)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=2"
title
1 Airbnb Wants to Find a Home in China
2 China's biggest risk may be its property market — not the trade war
3 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
4 China reaches 800 million internet users
5 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
6 7 Signs that China's Military is Becoming More Dangerous
7 Asia markets trade mostly higher as investors look ahead to US ...
8 Can China, the world's biggest pork producer, contain a fatal pig ...
9 How China, India and the US use healthcare aid to win influence in ...
10 China Is Leading in Artificial Intelligence--and American Businesses ...
source
1 WIRED - 13 hours ago
2 CNBC - 23 hours ago
3 Business Insider - 11 hours ago
4 TechCrunch - 10 hours ago
5 Express.co.uk - 12 hours ago
6 The National Interest Online (blog) - 16 hours ago
7 CNBC - 17 hours ago
8 Science Magazine - 5 hours ago
9 ABC News - 5 hours ago
10 Inc.com - 16 hours ago
url
1 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggUKAAwAA&usg=AOvVaw3M4FbZ71J-NVKHn3fHvYwZ
2 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggXKAAwAQ&usg=AOvVaw3vieYvDvTlRzYkWncLgQfu
3 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggaKAAwAg&usg=AOvVaw3JGNk2Lraivca0P1lS3CoY
4 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggjKAAwAw&usg=AOvVaw2j4-NkfK_fNl8McD6WJjPa
5 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggmKAAwBA&usg=AOvVaw0v1Lybg2SxcJoxVkP7sOx_
6 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggsKAAwBQ&usg=AOvVaw1B7Krdzgd3LQEJ4bwWSSFW
7 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggvKAAwBg&usg=AOvVaw0v734CDRel2Vpke9XVjLqA
8 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggyKAAwBw&usg=AOvVaw1j6E7a1jk9JiIahN5pdmi7
9 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg1KAAwCA&usg=AOvVaw2E0qGfLhOkKZWhh5-_Is54
10 https://www.inc.com/magazine/201809/amy-webb/china-artificial-intelligence.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg4KAAwCQ&usg=AOvVaw1thfiF9hJWhz88BU8znvnD
summary
1 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
2 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
3 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
4 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
5 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
6 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
7 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
8 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
9 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
10 Living in China in the early 2000s changed my perspective. I saw firsthand that the outside world's view--China was good at copying but bad at ...
>
Web Page Test of URL
N.B. The result order on the web page may differ between users, for example for a logged-in user.
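The url column in the output above still carries Google's trailing parameters (&sa=U&ved=...&usg=...). A minimal sketch for stripping them follows. clean_result_url() is a hypothetical helper, and it assumes the target address itself never contains "&sa=", which would be cut off too.

```r
# Reduce a Google result link to the target address.
# Assumption: everything from "&sa=" onwards belongs to Google, not the target.
clean_result_url <- function(u) {
  u <- sub("^/url\\?q=", "", u)  # drop the redirect prefix, if present
  u <- sub("&sa=.*$", "", u)     # drop the trailing Google parameters
  # Undo percent-encoding one element at a time.
  vapply(u, URLdecode, character(1), USE.NAMES = FALSE)
}

raw <- c(
  "/url?q=https://www.wired.com/story/airbnb-china-market/&sa=U&ved=abc&usg=xyz",
  "https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=def"
)
cleaned <- clean_result_url(raw)
print(cleaned)
```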
Citation:
Jinseog Kim - Associate professor in the Department of Applied Statistics at Dongguk University. He received Ph.D of Statistics in 2003 in Department of Statistics at Seoul National University. His research interests are data mining related topics including machine learning, big data analytics, networked data analysis.
Presentation Link: http://datamining.dongguk.ac.kr/lectures/2016-2/bigdata/google.pdf

R - json webpage to data frame

I'm looking to put the data from this webpage: http://live.nhl.com/GameData/20162017/2016020725/PlayByPlay.json into a usable R data frame.
I've tried what I've seen so far by using:
library(jsonlite)
json <- "http://live.nhl.com/GameData/20162017/2016020725/PlayByPlay.json"
doc <- fromJSON(json, simplifyDataFrame = TRUE)
That puts the file into a list of 1 and to be honest, working with lists in R is not yet a skill of mine (more comfortable with data frames).
I'd like to be able to scrape that webpage into a usable data frame.
I've tried
PBP <- rbindlist(lapply(doc, as.data.table), fill = TRUE)
but that did not work.
Any ideas? Happy to provide any more info if needed.
Perhaps the first course of action would be to understand lists down to the bone. What you have there is a list of length 1. If you do names(doc) you will notice that this list element is named data. To fully reveal the structure of the object, try str(doc). That's a lot of output! Here are a few first lines to give you the sense of what is going on.
Working with lists can be done using [[ and $. Also [, but see this tweet for details. You can access the first element by doc$data, doc[[1]] or doc[["data"]]. All are equivalent, but some may be handier for some tasks. To "climb" down the list tree, just append extra arguments. Note that you can mix all of these. See the inline code for a sneak preview. From your question it's not clear what part of the json file you're after. Try expanding the question or, even better, tinker around with doc.
doc:
data                  # doc[[1]] or doc[["data"]] or doc$data
|___ refreshInterval  # doc[[1]][[1]] or doc[[1]][["refreshInterval"]] or doc[["data"]][["refreshInterval"]] or doc$data$refreshInterval
|___ game             # doc[[1]][[2]] or doc[[1]][["game"]] or you get the idea
     |___ awayteamid  # doc$data$game$awayteamid
     |___ awayteamname
     |___ hometeamname
     |___ plays
     |___ awayteamnick
     |___ hometeamnick
     |___ hometeamid
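The access patterns in the tree can be tried on a small hand-made list without downloading anything. The list below is only a stand-in with invented values that mimics the shape described above. It is not the real NHL data.

```r
# Hand-made stand-in for the parsed JSON; all values are invented.
doc <- list(
  data = list(
    refreshInterval = 0,
    game = list(
      awayteamid = 3,
      hometeamid = 4,
      plays = list(play = data.frame(desc = c("hit", "shot"), teamid = c(4, 3)))
    )
  )
)

names(doc)                                         # the single element's name
identical(doc$data, doc[[1]])                      # same element, two notations
identical(doc[["data"]][["game"]], doc$data$game)  # notations can be mixed
str(doc, max.level = 3)                            # structure, truncated in depth

# [ keeps the list wrapper, while [[ and $ extract the element itself.
wrapped <- doc["data"]
plays <- doc$data$game$plays$play
is.data.frame(plays)
```

Once you reach an element that is already a data frame, as plays is here, the usual data frame tools apply.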
You can access game stats through
xy <- doc$data$game$plays$play
xy[1:6, c("desc", "type", "p2name", "teamid", "ycoord", "xcoord")]
desc type p2name teamid ycoord xcoord
1 Radko Gudas hit Chris Kreider Hit Chris Kreider 4 -12 -96
2 Pavel Buchnevich Wrist Shot saved by Steve Mason Shot Steve Mason 3 26 -42
3 Brandon Pirri hit Brandon Manning Hit Brandon Manning 3 42 -68
4 Nick Cousins hit Adam Clendening Hit Adam Clendening 4 35 92
5 Nick Cousins Wrist Shot saved by Henrik Lundqvist Shot Henrik Lundqvist 4 19 86
6 Michael Grabner Wrist Shot saved by Steve Mason Shot Steve Mason 3 5 -63

R - Import and merge many (nested?) JSON

I am looking to merge 150 small JSON files (all formatted the same way with same variables) which I have imported into R via jsonlite.
The problem is that each file imports as a list of 1. I can get an individual file to convert to a data frame, but cannot find a way to systematically convert them all.
The goal is to merge all of them into a single dataset.
An example from a JSON file:
{
"data": [
{
"EventId": "20020528X00745",
"narrative": "NTSB investigators may not have traveled in support of this investigation and used data provided by various sources to prepare this aircraft accident report.During the dark night cross-country flight, while at a cruise altitude of 2.000 feet msl, the pilot initiated a climb to 3,000 feet. A few minutes later, the engine's rpm dropped 200-300 rpm. The 67-hour pilot increased throttle to check for an rpm response. Subsequently, the engine lost power, and a forced landing was initiated. While approaching to land, the pilot noticed trees in front of the airplanes flight path and started looking for another place to land, but couldn't see anything because it was too dark. Subsequently, the aircraft impacted tress coming to rest upright. An examination of the engine under the supervision of an FAA inspector, revealed the left magneto's internal gears did not rotate with the engine. Removal of the left magneto revealed only one of two rubber drive isolators inside the ignition harness cap. Internal inspection revealed the contact points on the left hand side of the magneto did not open on rotation. Further examination of the airplane, displayed the ignition key turned to the left magneto only. The pilot reported to the NTSB investigator-in-charge, that he did not touch any switch while exiting the aircraft.",
"probable_cause": "The pilot's failure to set the ignition key to the both magnetos position, which resulted in a loss of engine power. Contributing factors were the failure of the left magneto, the lack of suitable terrain for the forced landing, and the dark night."
},
{
"EventId": "20090414X14441",
"narrative": "NTSB investigators used data provided by various entities, including, but not limited to, the Federal Aviation Administration and/or the operator and did not travel in support of this investigation to prepare this aircraft accident report.The pilot was following a highway to the northwest at 10,000 feet mean sea level. He crossed the mountain pass between 700 and 1,000 feet above ground level climbing slowly. Once on the west side of the pass, approaching the base of some cliffs, they encountered a strong down draft and the airspeed dropped rapidly and the airplane started to descend. The pilot reports that he attempted to keep the airspeed at 85 knots and climb but, that the airplane continued to lose altitude. He checked the engine instruments and did not note any degradation of engine performance. The airplane continued to descend. The pilot executed a forced landing in approximately the center of the valley ahead of them. The pilot reported that there were no preimpact mechanical malfunctions or failures. Based on the temperature and pressure readings from the closest weather reporting station, the density altitude at the accident site was about 9,200 feet.",
"probable_cause": "The pilot's encounter with a windshear/downdraft that exceeded the climb performance capabilities of the airplane."
},
Import in using fromJSON(file_000.json) -- creates a "large list"
After import, df <- file_000.json$data produces a dataframe with 3 variables
However, I do not know of a way to create 150 new dfs from the large list inputs. I have tried apply, do.call, functions, loops.
Two more that work for individual data frames, but don't get me to the 150 I need:
test2 <- as.data.frame(file_000.json$data)
test3 <- unnest(file_000.json)
library(dplyr)
library(jsonlite)
x <- '{
"data": [
{
"EventId": "20020528X00745",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
},
{
"EventId": "asdfasfasfasfasdasdf",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
},
{
"EventId": "asdfafsdf",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
}
]
}
'
files <- replicate(10, tempfile(fileext = ".json"))
for (i in seq_along(files)) cat(x, file = files[i])
dplyr::bind_rows(lapply(files, function(z) {
jsonlite::fromJSON(z)$data
}))
#> Source: local data frame [30 x 3]
#>
#> EventId narrative probable_cause
#> (chr) (chr) (chr)
#> 1 20020528X00745 NTSB investigators The pilots failure
#> 2 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 3 asdfafsdf NTSB investigators The pilots failure
#> 4 20020528X00745 NTSB investigators The pilots failure
#> 5 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 6 asdfafsdf NTSB investigators The pilots failure
#> 7 20020528X00745 NTSB investigators The pilots failure
#> 8 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 9 asdfafsdf NTSB investigators The pilots failure
#> 10 20020528X00745 NTSB investigators The pilots failure
#> .. ... ... ...
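For files that already sit in a folder, a variation on the same idea is sketched below. It lists the files with list.files(), records which file each row came from, and combines with base rbind() instead of dplyr. The directory, file contents, and the source_file column are made up for the example, and rbind() assumes every file yields the same columns.

```r
library(jsonlite)

# Create three small demo files in a fresh temporary folder.
json_dir <- tempfile("json-demo-")
dir.create(json_dir)
for (i in 1:3) {
  writeLines(
    sprintf('{"data": [
      {"EventId": "E%d-a", "narrative": "n", "probable_cause": "p"},
      {"EventId": "E%d-b", "narrative": "n", "probable_cause": "p"}]}', i, i),
    file.path(json_dir, sprintf("file_%03d.json", i))
  )
}

files <- list.files(json_dir, pattern = "\\.json$", full.names = TRUE)

# Read one file, keep its "data" element, and tag rows with the file name.
read_one <- function(path) {
  df <- fromJSON(path)$data
  df$source_file <- basename(path)
  df
}

merged <- do.call(rbind, lapply(files, read_one))
print(merged)
```

Keeping the file name makes it possible to trace a row back to its source after merging.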

Faster alternative for large files: pdflib or princexml?

I've had good experience with pdflib when it comes to the speed of PDF generation, even for large files. I was expecting the same speed from princexml, as both run natively on my Linux server (they're not just PHP classes). When generating a 1-page PDF with text and graphics, I see a 4-second gap in the log file between "begin" and the loading of the document. Is this normal? The conversion itself doesn't seem to take long...
Mon Apr 16 19:17:30 2012: ---- begin
Mon Apr 16 19:17:34 2012: Loading document...
Mon Apr 16 19:17:34 2012: Converting document...
Mon Apr 16 19:17:34 2012: finished: success
Mon Apr 16 19:17:34 2012: ---- end
Are there network connections involved in your setup? Is there DNS name resolution involved? If yes, try to use IP addresses instead of hostnames and try again...

Sweave: How to get blank lines as in the source?

How can I get a blank line before the second comment in the PDF file of the following .Rnw file? I tried to work with keep.source and strip.white, but I still don't get the blank line -- all the chunks are "pasted" together.
\documentclass{article}
\usepackage{fancyvrb}
\usepackage{Sweave}
\begin{document}
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2
## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
\end{document}
I wouldn't know about any option for keeping the blank line. What you could do:
(1) If you need the named chunk kept together, you could simply put a # at the beginning of the supposedly blank line:
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2
#
## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
Less attractive, but you have at least something similar to a blank line, and keep your chunk.
(2) If you don't care for keeping the chunk together, you can separate it into two blocks:
\documentclass{article}
\usepackage{fancyvrb}
\usepackage{Sweave}
\begin{document}
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2
@
<<eval=FALSE>>=
## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
\end{document}
This gives you the appearance you like, at the price of losing the structure of your chunk.
Hope this helps,
Rainer