R - Import and merge many (nested?) JSON - json

I am looking to merge 150 small JSON files (all formatted the same way, with the same variables), which I have imported into R via jsonlite.
The problem is that each file imports as a list of 1. I can convert an individual file to a data frame, but cannot find a way to convert all of them systematically.
The goal is to merge all of them into a single dataset.
An example from a JSON file:
{
"data": [
{
"EventId": "20020528X00745",
"narrative": "NTSB investigators may not have traveled in support of this investigation and used data provided by various sources to prepare this aircraft accident report.During the dark night cross-country flight, while at a cruise altitude of 2.000 feet msl, the pilot initiated a climb to 3,000 feet. A few minutes later, the engine's rpm dropped 200-300 rpm. The 67-hour pilot increased throttle to check for an rpm response. Subsequently, the engine lost power, and a forced landing was initiated. While approaching to land, the pilot noticed trees in front of the airplanes flight path and started looking for another place to land, but couldn't see anything because it was too dark. Subsequently, the aircraft impacted tress coming to rest upright. An examination of the engine under the supervision of an FAA inspector, revealed the left magneto's internal gears did not rotate with the engine. Removal of the left magneto revealed only one of two rubber drive isolators inside the ignition harness cap. Internal inspection revealed the contact points on the left hand side of the magneto did not open on rotation. Further examination of the airplane, displayed the ignition key turned to the left magneto only. The pilot reported to the NTSB investigator-in-charge, that he did not touch any switch while exiting the aircraft.",
"probable_cause": "The pilot's failure to set the ignition key to the both magnetos position, which resulted in a loss of engine power. Contributing factors were the failure of the left magneto, the lack of suitable terrain for the forced landing, and the dark night."
},
{
"EventId": "20090414X14441",
"narrative": "NTSB investigators used data provided by various entities, including, but not limited to, the Federal Aviation Administration and/or the operator and did not travel in support of this investigation to prepare this aircraft accident report.The pilot was following a highway to the northwest at 10,000 feet mean sea level. He crossed the mountain pass between 700 and 1,000 feet above ground level climbing slowly. Once on the west side of the pass, approaching the base of some cliffs, they encountered a strong down draft and the airspeed dropped rapidly and the airplane started to descend. The pilot reports that he attempted to keep the airspeed at 85 knots and climb but, that the airplane continued to lose altitude. He checked the engine instruments and did not note any degradation of engine performance. The airplane continued to descend. The pilot executed a forced landing in approximately the center of the valley ahead of them. The pilot reported that there were no preimpact mechanical malfunctions or failures. Based on the temperature and pressure readings from the closest weather reporting station, the density altitude at the accident site was about 9,200 feet.",
"probable_cause": "The pilot's encounter with a windshear/downdraft that exceeded the climb performance capabilities of the airplane."
},
Importing with fromJSON("file_000.json") creates a "large list".
After import, df <- file_000.json$data produces a data frame with 3 variables.
However, I do not know of a way to create 150 new data frames from the large-list inputs. I have tried apply, do.call, functions, and loops.
Two more that work for individual data frames, but don't get me to the 150 I need:
test2 <- as.data.frame(file_000.json$data)
test3 <- unnest(file_000.json)

library(dplyr)
library(jsonlite)
x <- '{
"data": [
{
"EventId": "20020528X00745",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
},
{
"EventId": "asdfasfasfasfasdasdf",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
},
{
"EventId": "asdfafsdf",
"narrative": "NTSB investigators",
"probable_cause": "The pilots failure"
}
]
}
'
files <- replicate(10, tempfile(fileext = ".json"))
for (i in seq_along(files)) cat(x, file = files[i])
dplyr::bind_rows(lapply(files, function(z) {
  jsonlite::fromJSON(z)$data
}))
#> Source: local data frame [30 x 3]
#>
#> EventId narrative probable_cause
#> (chr) (chr) (chr)
#> 1 20020528X00745 NTSB investigators The pilots failure
#> 2 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 3 asdfafsdf NTSB investigators The pilots failure
#> 4 20020528X00745 NTSB investigators The pilots failure
#> 5 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 6 asdfafsdf NTSB investigators The pilots failure
#> 7 20020528X00745 NTSB investigators The pilots failure
#> 8 asdfasfasfasfasdasdf NTSB investigators The pilots failure
#> 9 asdfafsdf NTSB investigators The pilots failure
#> 10 20020528X00745 NTSB investigators The pilots failure
#> .. ... ... ...
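To apply the same pattern to the actual 150 files, collect the paths with list.files() and bind everything in one pass. A minimal sketch, assuming the files sit together in one folder (the "json_files" path below is only a placeholder):
library(jsonlite)
library(dplyr)
# collect all .json paths in the folder (adjust the path/pattern to your setup)
json_paths <- list.files("json_files", pattern = "\\.json$", full.names = TRUE)
# each file's $data element is already a data frame, so stack them directly
all_events <- bind_rows(lapply(json_paths, function(p) fromJSON(p)$data))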

Related

r Google News Results Links

I am new to getting information from the web into R, but I found this nice code, "How to get google search results", on how to get links from an ordinary Google search into R.
I need to get this method running for the Google NEWS search.
I know I have to change the URL by adding something like "&source=lnms&tbm=nws".
The URL I construct leads me to the right news result page if I copy and paste it from R to my browser - so far so good.
I was looking at the HTML code of the news result page and found that the information lies inside h3[@class='r dO0Ag'], but there is another node and I don't know how to code this part.
Would appreciate any help!
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, domain = '.de', quotes = TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  # construct google news url
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep = '', "&source=lnms&tbm=nws")
  return(getGoogleURL)
}
getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  # ?? Wrong part - gives error evaluating xpath expression ??
  nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']//a[@class='l lLrAF'//")
  dirt_links <- sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])
  links <- gsub('/url\\?q=', '', sapply(strsplit(dirt_links[as.vector(grep('url', dirt_links))], split = '&'), '[', 1))
  return(links)
}
search.term <- "China"
quotes <- "TRUE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
You have a number of options here.
Either RCurl or RSelenium will work.
The key point is to generate the correct URL:
> library(XML)
> library(RCurl)
> search.term <- "china"
> quotes=FALSE
> start=0
> getGoogleURL <- paste('http://www.google.com',
+ '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
+ search.term, "&start=",start,sep='')
> getGoogleURL
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=china&start=0"
>
At this point you can dereference the URL, create the HTML parse tree, and extract the node data. The start parameter lets you offset into the results, i.e. choose where the returned results begin (counting from zero).
Working Code Example:
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, start = 0, quotes = FALSE) {
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  getGoogleURL <- paste('http://www.google.com',
                        '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
                        search.term, "&start=", start, sep = '')
  getGoogleURL <- URLencode(getGoogleURL)
}
getGoogleNews <- function(search.term = "China",
                          start = 0,
                          quotes = FALSE) {
  google.url <- getGoogleURL(search.term = search.term,
                             start, quotes = quotes)
  print(google.url)
  doc <- getURL(google.url,
                httpheader = c("User-Agent" = "R(3.0.3)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE,
                        error = function(...) {}, asText = TRUE)
  # article titles and links
  nodes <- getNodeSet(html, "//*/h3/a[@href]")
  title <- sapply(nodes, function(x) x <- xmlValue(x))
  url <- unname(sapply(nodes, function(x) x <- xmlAttrs(x)))
  url <- gsub("\\/url\\?q=", "", url)
  # source and publication time
  nodes <- getNodeSet(html, "//div[@class='slp']")
  source <- sapply(nodes, function(x) x <- xmlValue(x))
  # article summaries
  nodes <- getNodeSet(html, "//div[@class='st']")
  summary <- sapply(nodes, function(x) x <- xmlValue(x))
  data.frame(title = title, source = source, url = url, summary = summary)
}
getGoogleNews("China")
getGoogleNews("China", 1)
getGoogleNews("China", 2)
Runtime:
> library(XML)
> library(RCurl)
> getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
+ search.term <- gsub(' ', '%20', search.term)
+ if(quotes) search.term <- paste( .... [TRUNCATED]
> getGoogleNews <- function(search.term="China",
+ start=0,
+ quotes=FALSE ){
+ google.url <- ge .... [TRUNCATED]
> getGoogleNews("China")
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=0"
title
1 Taiwan says China is 'out of control' as it loses El Salvador to Beijing
2 China central bank official rebuts Trump's claim it is manipulating the ...
3 Airbnb Wants to Find a Home in China
4 China's biggest risk may be its property market — not the trade war
5 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
6 China reaches 800 million internet users
7 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
8 7 Signs that China's Military is Becoming More Dangerous
9 Asia markets trade mostly higher as investors look ahead to US ...
10 Can China, the world's biggest pork producer, contain a fatal pig ...
source
1 CNBC - 17 hours ago
2 CNBC - 10 hours ago
3 WIRED - 13 hours ago
4 CNBC - 23 hours ago
5 Business Insider - 11 hours ago
6 TechCrunch - 10 hours ago
7 Express.co.uk - 12 hours ago
8 The National Interest Online (blog) - 16 hours ago
9 CNBC - 17 hours ago
10 Science Magazine - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/taiwan-says-china-out-of-control-as-it-loses-el-salvador-to-beijing.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIFCgAMAA&usg=AOvVaw2cSTmS65-6IvKQV9xrl3y3
2 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIHSgAMAE&usg=AOvVaw2q7yr2oBWHib3bRAVmOna-
3 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm
4 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIKSgAMAM&usg=AOvVaw1bUY5Ii7AlWURDifpeozJU
5 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIILCgAMAQ&usg=AOvVaw0yGdVilstHZVBBXEuuAbmu
6 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp
7 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIOCgAMAY&usg=AOvVaw3W5adCnWdzz71zvpgE1x6D
8 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIPigAMAc&usg=AOvVaw1k05lyvFRrx_FImDKIsZ61
9 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIQSgAMAg&usg=AOvVaw0YqzZPNbH9bawkv8qX8Bdm
10 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIRCgAMAk&usg=AOvVaw1H0c03l4trLI3cbRRlnKJW
summary
1 Taiwan vowed on Tuesday to fight China's "increasingly out of control" behavior after Taipei lost another ally to Beijing when El Salvador ...
2 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
3 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
4 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
5 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
6 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
7 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
8 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
9 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
10 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
> getGoogleNews("China", 1)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=1"
title
1 China central bank official rebuts Trump's claim it is manipulating the ...
2 Airbnb Wants to Find a Home in China
3 China's biggest risk may be its property market — not the trade war
4 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
5 China reaches 800 million internet users
6 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
7 7 Signs that China's Military is Becoming More Dangerous
8 Asia markets trade mostly higher as investors look ahead to US ...
9 Can China, the world's biggest pork producer, contain a fatal pig ...
10 How China, India and the US use healthcare aid to win influence in ...
source
1 CNBC - 10 hours ago
2 WIRED - 13 hours ago
3 CNBC - 23 hours ago
4 Business Insider - 11 hours ago
5 TechCrunch - 10 hours ago
6 Express.co.uk - 12 hours ago
7 The National Interest Online (blog) - 16 hours ago
8 CNBC - 17 hours ago
9 Science Magazine - 5 hours ago
10 ABC News - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggUKAAwAA&usg=AOvVaw1Muu65XvSSWVKX06-5syLY
2 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggdKAAwAQ&usg=AOvVaw0Py7bJDY3tIj4KxgwYot1A
3 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgggKAAwAg&usg=AOvVaw2EHMCQvFQV9ubu17ERCZFO
4 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggjKAAwAw&usg=AOvVaw1sMhG0tyUnj8j2W02gD3aW
5 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggsKAAwBA&usg=AOvVaw1ODs1JY8V_ETi24ugz-yNn
6 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggvKAAwBQ&usg=AOvVaw0r0HQNfZhEwfbiEocUC74Z
7 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg1KAAwBg&usg=AOvVaw2hpQQXrAm2HW158II7F1kG
8 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg4KAAwBw&usg=AOvVaw2surM3fW-lLJDd9P-r7xJB
9 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg7KAAwCA&usg=AOvVaw3Lzvks6B0Un4IEgoMh86re
10 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg-KAAwCQ&usg=AOvVaw1Ogg8I6mUvDSCc9F90Usg4
summary
1 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
2 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
3 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
4 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
5 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
6 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
7 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
8 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
9 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
10 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
> getGoogleNews("China", 2)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=2"
title
1 Airbnb Wants to Find a Home in China
2 China's biggest risk may be its property market — not the trade war
3 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
4 China reaches 800 million internet users
5 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
6 7 Signs that China's Military is Becoming More Dangerous
7 Asia markets trade mostly higher as investors look ahead to US ...
8 Can China, the world's biggest pork producer, contain a fatal pig ...
9 How China, India and the US use healthcare aid to win influence in ...
10 China Is Leading in Artificial Intelligence--and American Businesses ...
source
1 WIRED - 13 hours ago
2 CNBC - 23 hours ago
3 Business Insider - 11 hours ago
4 TechCrunch - 10 hours ago
5 Express.co.uk - 12 hours ago
6 The National Interest Online (blog) - 16 hours ago
7 CNBC - 17 hours ago
8 Science Magazine - 5 hours ago
9 ABC News - 5 hours ago
10 Inc.com - 16 hours ago
url
1 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggUKAAwAA&usg=AOvVaw3M4FbZ71J-NVKHn3fHvYwZ
2 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggXKAAwAQ&usg=AOvVaw3vieYvDvTlRzYkWncLgQfu
3 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggaKAAwAg&usg=AOvVaw3JGNk2Lraivca0P1lS3CoY
4 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggjKAAwAw&usg=AOvVaw2j4-NkfK_fNl8McD6WJjPa
5 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggmKAAwBA&usg=AOvVaw0v1Lybg2SxcJoxVkP7sOx_
6 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggsKAAwBQ&usg=AOvVaw1B7Krdzgd3LQEJ4bwWSSFW
7 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggvKAAwBg&usg=AOvVaw0v734CDRel2Vpke9XVjLqA
8 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggyKAAwBw&usg=AOvVaw1j6E7a1jk9JiIahN5pdmi7
9 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg1KAAwCA&usg=AOvVaw2E0qGfLhOkKZWhh5-_Is54
10 https://www.inc.com/magazine/201809/amy-webb/china-artificial-intelligence.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg4KAAwCQ&usg=AOvVaw1thfiF9hJWhz88BU8znvnD
summary
1 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
2 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
3 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
4 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
5 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
6 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
7 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
8 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
9 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
10 Living in China in the early 2000s changed my perspective. I saw firsthand that the outside world's view--China was good at copying but bad at ...
>
Web Page Test of URL
Nb. the result order will differ when the same search is run via the web page by a logged-in user.
Citation:
Jinseog Kim - Associate Professor in the Department of Applied Statistics at Dongguk University. He received a Ph.D. in Statistics in 2003 from the Department of Statistics at Seoul National University. His research interests are data-mining-related topics, including machine learning, big data analytics, and networked data analysis.
Presentation link: http://datamining.dongguk.ac.kr/lectures/2016-2/bigdata/google.pdf

loop within a loop for JSON files in R

I am trying to aggregate a bunch of JSON files into a single one, for three sources and three years. So far I have only been able to do it the tedious way, but I am sure it could be done in a smarter and more elegant manner.
json1 <- lapply(readLines("NYT_1989.json"), fromJSON)
json2 <- lapply(readLines("NYT_1990.json"), fromJSON)
json3 <- lapply(readLines("NYT_1991.json"), fromJSON)
json4 <- lapply(readLines("WP_1989.json"), fromJSON)
json5 <- lapply(readLines("WP_1990.json"), fromJSON)
json6 <- lapply(readLines("WP_1991.json"), fromJSON)
json7 <- lapply(readLines("USAT_1989.json"), fromJSON)
json8 <- lapply(readLines("USAT_1990.json"), fromJSON)
json9 <- lapply(readLines("USAT_1991.json"), fromJSON)
jsonl <- list(json1, json2, json3, json4, json5, json6, json7, json8, json9)
Note that the year range, 1989 to 1991, is the same for all three sources. Any ideas? Thanks!
PS: Example of the data inside each file:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. ", "title": "Prospects;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' ", "title": "Upheaval in the East: Espionage;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. ", "title": "Coping With the Economic Prospects of 1990"}
Here you go:
require(jsonlite)
filelist <- c("NYT_1989.json","NYT_1990.json","NYT_1991.json",
"WP_1989.json", "WP_1990.json","WP_1991.json",
"USAT_1989.json","USAT_1990.json","USAT_1991.json")
newJSON <- sapply(filelist, function(x) fromJSON(readLines(x)))
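If you would rather not type the nine file names by hand, the file list can be generated from the naming pattern. A small sketch, assuming the SOURCE_YEAR.json names used in the question:
sources <- c("NYT", "WP", "USAT")
years   <- 1989:1991
# every source/year combination, e.g. "NYT_1989.json"
filelist <- as.vector(outer(sources, years, function(s, y) sprintf("%s_%d.json", s, y)))
The resulting vector can then be fed to the sapply() call above.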
Read in just the body entry from each line of the input file.
You asked how to read in just a subset of the JSON file. The data referenced isn't actually a single JSON document; it is JSON-like, with one JSON object per line, hence we have to modify the input to fromJSON() to read it correctly. We then dereference the result with fromJSON()$body to extract just the body variable.
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
newJSON
Results
> filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
> newJSON
./data/NYT_1989.json
[1,] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
./data/NYT_1990.json
[1,] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
You might find the following apply tutorial useful:
Datacamp: R tutorial on the Apply family of functions
I also recommend reading:
R Inferno - Chapter 4 - Over-Vectorizing
Trust me when I say this free online book has helped me a lot. It has also confirmed I am an idiot on multiple occasions :-)

Creating a list of JSON files with only one component of the list

I have 4 JSON files spread across two folders: folder1 and folder2. Each JSON file contains the date, the body and the title.
folder1.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. ", "title": "Prospects;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. Agents of the Office for the Protection of State Secrets got one check from Prague, the pun goes, and another from their real bosses at K.G.B. headquarters in Moscow. Roy Godson, head of the Washington-based National Strategy Information Center and a well-known intelligence scholar, called any democratic change ''a net loss'' for Soviet intelligence. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' ", "title": "Upheaval in the East: Espionage;"}
folder2.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. But facing business owners are numerous problems, from taxes and regulations at all levels of government to competition from other businesses in and out of Westchester. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. ", "title": "Coping With the Economic Prospects of 1990"}
{"date": "December 29, 1989, Friday, Late Edition - Final", "body": "Eastern Airlines said yesterday that it was laying off 600 employees, mostly managers, and cutting wages by 10 percent or 20 percent for about half its work force. Thomas J. Matthews, Eastern's senior vice president of human resources, estimated that the measures would save the carrier about $100 million a year. Eastern plans to rebuild by making Atlanta its primary hub and expects to operate about 75 percent of its flights from there. ", "title": "Eastern Plans Wage Cuts, 600 Layoffs"}
I would like to create a common list of all these JSON files, but only with the body of each article. So far I am trying the following:
json1 <- lapply(readLines("folder1.json"), fromJSON)
json2 <- lapply(readLines("folder2.json"), fromJSON)
jsonl <- list(json1$body, json2$body)
But it is not working. Any suggestions?
Andres Azqueta
Solution:
You need to dereference the result of fromJSON() inside the sapply() to retrieve only the body:
fromJSON()$body
Note: I am assuming the file format from your previous question.
The point being that the file format is pseudo-JSON (one JSON object per line), hence the modified fromJSON() call below.
OK, let's step through an example:
Stage 1: Concatenate JSON files into 1
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
newJSON[2]# Extract bodies
newJSON[5]# Extract bodies
Output
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
> newJSON[2]# Extract bodies
[[1]]
[1] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
> newJSON[5]# Extract bodies
[[1]]
[1] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
Stage 2: Concatenate and extract the body from all files...
Look for the reference to fromJSON()$body in the code line below...
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
newJSON
Output
> filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
> newJSON
./data/NYT_1989.json
[1,] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
./data/NYT_1990.json
[1,] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
require(RJSONIO)
json_1 <- fromJSON("~/folder1/1.json")
json_2 <- fromJSON("~/folder2/2.json")
jsonl <- list(json_1$body, json_2$body)

Creating a corpus out of texts stored in JSON files in R

I have several JSON files with texts grouped into date, body and title. As an example, consider:
{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}
I have three different newspapers, with separate files containing all the texts produced for the period 1989-2016. My ultimate goal is to combine all the texts into a single corpus. I have done it in Python using the pandas library, and I am wondering if it could be done similarly in R. Here is my code with the loop (essentially the Python version, written with an R-style for):
for (i in 1989:2016){
df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
appended_data.append(df0)
appended_data.append(df1)
appended_data.append(df2)
}
Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
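A minimal sketch of that approach, assuming the SOURCE_YEAR.json naming from the question and that each file holds one JSON object per line (as in the sample data):
library(jsonlite)
files <- as.vector(outer(c("NYT", "USAT", "WP"), 1989:2016,
                         function(s, y) sprintf("%s_%d.json", s, y)))
# stream_in() reads newline-delimited JSON into a data frame;
# rbind_pages() (the current name for rbind.pages) stacks the per-file data frames
dfs    <- lapply(files, function(f) stream_in(file(f), verbose = FALSE))
corpus <- rbind_pages(dfs)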
There are many options in R for reading JSON files and converting them to a data.frame/data.table.
Here is one using jsonlite and data.table:
library(data.table)
library(jsonlite)
res <- lapply(1989:2016, function(i) {
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  rbindlist(lapply(list_files_paths, fromJSON))
})
Here res is a list of data.tables. If you want to aggregate them into a single data.table:
rbindlist(res)
Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)
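A quick sketch of the ndjson variant, again assuming one JSON object per line and re-using a file name from the question:
library(ndjson)
# returns a flattened data.table for the whole file
dt <- ndjson::stream_in("NYT_1989.json")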

Jaccard distance between tweets

I'm currently trying to measure the Jaccard distance between tweets in a dataset.
The dataset is here:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance
This is what I have so far
I saved the linked dataset to a file called Tweets.json
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))
Then I converted json_alldata to tweet.features and got rid of the geo column
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
These are what the first two tweets look like
tweet.features$text[1]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist() function from the stringdist library:
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23, and A union B = 25. The Jaccard distance is A intersection B/A union B -- right? So by my calculation, the Jaccard distance should be 0.92?
So I figured I could do it by sets. Simply calculate intersection and union and divide
This is what I tried
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try to do the intersection, the output is just list():
Intersection <- intersect(A1, A2)
list()
When I try Union, I get this:
union(A1, A2)
[[1]]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number of words in each set, then do the calculations.
Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.
Any help would be appreciated. Thank you.
intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit; exactly how you split is up to you. An example below:
tweet.features <- list(tweet1="RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
tweet2= "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
jaccard_i <- function(tw1, tw2){
  # split each tweet into words (on spaces and dots)
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i = i, u = u, j = i/u)
}
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want?
The strsplit here splits on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
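Note that j is a Jaccard similarity; if you want a distance, subtract it from one. Also, stringdist's "jaccard" method compares sets of character q-grams (q = 1 by default) rather than words, which is why its 0.1621622 does not match a word-level count. A tiny follow-up using the function above:
res <- jaccard_i(tweet.features[[1]], tweet.features[[2]])
1 - res$j   # word-level Jaccard distance, about 0.13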