r Google News Results Links - html

I am new to getting information from the web into R, but I found this nice code (How to get google search results) that pulls links from an ordinary Google search into R.
I need to get this method running for the google NEWS search.
I know I have to change the URL by adding something like "&source=lnms&tbm=nws".
The URL I construct leads me to the right news result page if I copy and paste it from R into my browser, so far so good.
Looking at the HTML of the news result page, I found that the information sits inside h3[@class='r dO0Ag'], but there is another node and I don't know how to code this part.
Would appreciate any help!
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, domain = '.de', quotes=TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
  #construct google news url
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep='',"&source=lnms&tbm=nws")
  return(getGoogleURL)
}
getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
  #?? Wrong part - gives error evaluating xpath expression ??
  nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']//a[@class='l lLrAF'//")
  dirt_links=sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])
  links <- gsub('/url\\?q=','',sapply(strsplit(dirt_links[as.vector(grep('url',dirt_links))],split='&'),'[',1))
  return(links)
}
search.term <- "China"
quotes <- "TRUE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)

You have a number of options here.
Either RCurl or RSelenium will work.
The key point is to generate the correct URL:
> library(XML)
> library(RCurl)
> search.term <- "china"
> quotes=FALSE
> start=0
> getGoogleURL <- paste('http://www.google.com',
+ '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
+ search.term, "&start=",start,sep='')
> getGoogleURL
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=china&start=0"
>
At this point you can fetch the URL, build the HTML parse tree, and extract the node data. The start parameter sets where the returned results begin, counting from zero. In the runs shown under Runtime, raising start by 1 shifts the results by one item, so it behaves as a result offset rather than a page number.
Working Code Example:
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
  getGoogleURL <- paste('http://www.google.com',
                        '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
                        search.term, "&start=",start,sep='')
  getGoogleURL <- URLencode(getGoogleURL)
}
getGoogleNews <- function(search.term="China",
                          start=0,
                          quotes=FALSE ){
  google.url <- getGoogleURL(search.term=search.term,
                             start, quotes=quotes)
  print(google.url)
  doc <- getURL(google.url,
                httpheader = c("User-Agent" = "R(3.0.3)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE,
                        error=function(...){}, asText = TRUE)
  nodes <- getNodeSet(html, "//*/h3/a[@href]")
  title <- sapply(nodes, function(x) x <- xmlValue(x))
  url <- unname(sapply(nodes, function(x) x <- xmlAttrs(x)))
  url <- gsub("\\/url\\?q=", "", url)
  nodes <- getNodeSet(html, "//div[@class='slp']")
  source <- sapply(nodes, function(x) x <- xmlValue(x))
  nodes <- getNodeSet(html, "//div[@class='st']")
  summary <- sapply(nodes, function(x) x <- xmlValue(x))
  data.frame(title=title, source=source, url=url, summary=summary)
}
getGoogleNews("China")
getGoogleNews("China", 1)
getGoogleNews("China", 2)
Runtime:
> library(XML)
> library(RCurl)
> getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
+ search.term <- gsub(' ', '%20', search.term)
+ if(quotes) search.term <- paste( .... [TRUNCATED]
> getGoogleNews <- function(search.term="China",
+ start=0,
+ quotes=FALSE ){
+ google.url <- ge .... [TRUNCATED]
> getGoogleNews("China")
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=0"
title
1 Taiwan says China is 'out of control' as it loses El Salvador to Beijing
2 China central bank official rebuts Trump's claim it is manipulating the ...
3 Airbnb Wants to Find a Home in China
4 China's biggest risk may be its property market — not the trade war
5 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
6 China reaches 800 million internet users
7 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
8 7 Signs that China's Military is Becoming More Dangerous
9 Asia markets trade mostly higher as investors look ahead to US ...
10 Can China, the world's biggest pork producer, contain a fatal pig ...
source
1 CNBC - 17 hours ago
2 CNBC - 10 hours ago
3 WIRED - 13 hours ago
4 CNBC - 23 hours ago
5 Business Insider - 11 hours ago
6 TechCrunch - 10 hours ago
7 Express.co.uk - 12 hours ago
8 The National Interest Online (blog) - 16 hours ago
9 CNBC - 17 hours ago
10 Science Magazine - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/taiwan-says-china-out-of-control-as-it-loses-el-salvador-to-beijing.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIFCgAMAA&usg=AOvVaw2cSTmS65-6IvKQV9xrl3y3
2 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIHSgAMAE&usg=AOvVaw2q7yr2oBWHib3bRAVmOna-
3 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm
4 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIKSgAMAM&usg=AOvVaw1bUY5Ii7AlWURDifpeozJU
5 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIILCgAMAQ&usg=AOvVaw0yGdVilstHZVBBXEuuAbmu
6 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp
7 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIOCgAMAY&usg=AOvVaw3W5adCnWdzz71zvpgE1x6D
8 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIPigAMAc&usg=AOvVaw1k05lyvFRrx_FImDKIsZ61
9 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIQSgAMAg&usg=AOvVaw0YqzZPNbH9bawkv8qX8Bdm
10 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIRCgAMAk&usg=AOvVaw1H0c03l4trLI3cbRRlnKJW
summary
1 Taiwan vowed on Tuesday to fight China's "increasingly out of control" behavior after Taipei lost another ally to Beijing when El Salvador ...
2 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
3 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
4 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
5 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
6 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
7 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
8 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
9 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
10 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
> getGoogleNews("China", 1)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=1"
title
1 China central bank official rebuts Trump's claim it is manipulating the ...
2 Airbnb Wants to Find a Home in China
3 China's biggest risk may be its property market — not the trade war
4 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
5 China reaches 800 million internet users
6 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
7 7 Signs that China's Military is Becoming More Dangerous
8 Asia markets trade mostly higher as investors look ahead to US ...
9 Can China, the world's biggest pork producer, contain a fatal pig ...
10 How China, India and the US use healthcare aid to win influence in ...
source
1 CNBC - 10 hours ago
2 WIRED - 13 hours ago
3 CNBC - 23 hours ago
4 Business Insider - 11 hours ago
5 TechCrunch - 10 hours ago
6 Express.co.uk - 12 hours ago
7 The National Interest Online (blog) - 16 hours ago
8 CNBC - 17 hours ago
9 Science Magazine - 5 hours ago
10 ABC News - 5 hours ago
url
1 https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggUKAAwAA&usg=AOvVaw1Muu65XvSSWVKX06-5syLY
2 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggdKAAwAQ&usg=AOvVaw0Py7bJDY3tIj4KxgwYot1A
3 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgggKAAwAg&usg=AOvVaw2EHMCQvFQV9ubu17ERCZFO
4 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggjKAAwAw&usg=AOvVaw1sMhG0tyUnj8j2W02gD3aW
5 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggsKAAwBA&usg=AOvVaw1ODs1JY8V_ETi24ugz-yNn
6 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggvKAAwBQ&usg=AOvVaw0r0HQNfZhEwfbiEocUC74Z
7 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg1KAAwBg&usg=AOvVaw2hpQQXrAm2HW158II7F1kG
8 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg4KAAwBw&usg=AOvVaw2surM3fW-lLJDd9P-r7xJB
9 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg7KAAwCA&usg=AOvVaw3Lzvks6B0Un4IEgoMh86re
10 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg-KAAwCQ&usg=AOvVaw1Ogg8I6mUvDSCc9F90Usg4
summary
1 A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
2 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
3 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
4 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
5 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
6 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
7 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
8 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
9 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
10 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
> getGoogleNews("China", 2)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=2"
title
1 Airbnb Wants to Find a Home in China
2 China's biggest risk may be its property market — not the trade war
3 Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
4 China reaches 800 million internet users
5 China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
6 7 Signs that China's Military is Becoming More Dangerous
7 Asia markets trade mostly higher as investors look ahead to US ...
8 Can China, the world's biggest pork producer, contain a fatal pig ...
9 How China, India and the US use healthcare aid to win influence in ...
10 China Is Leading in Artificial Intelligence--and American Businesses ...
source
1 WIRED - 13 hours ago
2 CNBC - 23 hours ago
3 Business Insider - 11 hours ago
4 TechCrunch - 10 hours ago
5 Express.co.uk - 12 hours ago
6 The National Interest Online (blog) - 16 hours ago
7 CNBC - 17 hours ago
8 Science Magazine - 5 hours ago
9 ABC News - 5 hours ago
10 Inc.com - 16 hours ago
url
1 https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggUKAAwAA&usg=AOvVaw3M4FbZ71J-NVKHn3fHvYwZ
2 https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggXKAAwAQ&usg=AOvVaw3vieYvDvTlRzYkWncLgQfu
3 https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggaKAAwAg&usg=AOvVaw3JGNk2Lraivca0P1lS3CoY
4 https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggjKAAwAw&usg=AOvVaw2j4-NkfK_fNl8McD6WJjPa
5 https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggmKAAwBA&usg=AOvVaw0v1Lybg2SxcJoxVkP7sOx_
6 https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggsKAAwBQ&usg=AOvVaw1B7Krdzgd3LQEJ4bwWSSFW
7 https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggvKAAwBg&usg=AOvVaw0v734CDRel2Vpke9XVjLqA
8 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggyKAAwBw&usg=AOvVaw1j6E7a1jk9JiIahN5pdmi7
9 http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg1KAAwCA&usg=AOvVaw2E0qGfLhOkKZWhh5-_Is54
10 https://www.inc.com/magazine/201809/amy-webb/china-artificial-intelligence.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg4KAAwCQ&usg=AOvVaw1thfiF9hJWhz88BU8znvnD
summary
1 China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
2 China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
3 The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
4 A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
5 China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
6 Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
7 Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
8 As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
9 China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
10 Living in China in the early 2000s changed my perspective. I saw firsthand that the outside world's view--China was good at copying but bad at ...
>
Note: if you test the URL in a browser, the result order may differ from the output above, for example for a logged-in user.
Citation:
Jinseog Kim - Associate professor in the Department of Applied Statistics at Dongguk University. He received his Ph.D. in Statistics in 2003 from the Department of Statistics at Seoul National University. His research interests are data mining related topics, including machine learning, big data analytics, and networked data analysis.
Presentation Link: http://datamining.dongguk.ac.kr/lectures/2016-2/bigdata/google.pdf

Related

How should I format a logit regression to test company motivations? (R)

I'm writing a paper on investment firms and their relationship with a sustainable finance initiative. I'm using a panel dataset with 307 investors, 125 of them signed this sustainable initiative.
I would like to add in a section in which I test which variables might be driving them to sign this initiative.
I believe I should use logit regression for this, but having not used these extensively, I'm looking for some guidance.
Currently the data looks like this:
investor  year  activity  country      region  strategy  signatory
123 IM    2002  4.45      France       europe  VC        1
123 IM    2003  3.2       France       europe  VC        1
123 IM    2004  7.8       France       europe  VC        1
Aegon     2005  5.4       Netherlands  europe  BY        0
Aegon     2006  4.2       Netherlands  europe  BY        0
Aegon     2007  1.3       Netherlands  europe  BY        0
As you can see, the signatory variable is binary, and I would like to test variables such as country or region against it.
Any tips would be appreciated!
Rory
You can use the glm function in R. Following is an example with country and activity variables as independent variables:
# Assuming that your dataframe name is df
my_logit <- glm(signatory ~ activity + country, family = 'binomial', data=df)
# Check the output summary
summary(my_logit)

Is there a way to combine these variables in a way that makes sense?

Hello stack overflow community!
I am a sociology student working on a thesis project comparing home value appreciation and neighborhood racial composition over time.
I'm currently using two separate data sources and trying to combine them in a way that makes sense without aggregating anything.
The first data source is GIS data which has information on home sales in each year by home. The second is census data which has yearly estimates of racial composition by census tract. Both are in .csv formats.
My goal is to create a set of variables for each home row in the GIS data which represents the racial composition for the tract the home is in at the year it was sold (e.g. home 1 | 2010| $500,000 | Census tract 10 | 10% white).
I began doing this in Stata by hand, one tract and year at a time.
For example, for a home sold in 2010 in Census tract 10, where I find that this tract was 10% white in 2010, I would use something like
If censustract=10 and year=2010, replace percentwhite = 10
However, this seemed incredibly time consuming, as I'm using data that go back decades and a couple dozen Census tracts.
Does anyone have any suggestions on how I might do this smarter, not harder? The first thought I had was to aggregate the data by census tract and year, but was hoping to avoid that if possible. Thank you so much in advance for your help and have a terrific day and start to the new year!
It sounds like you can simply merge census data onto your GIS data. That will be much less painful than using -replace-. Here's an example:
*GIS data: information on home sales in each year by home
clear
input censustract house_id year house_value_k
10 100 2010 200
11 101 2020 500
11 102 1980 100
end
tempfile GIS_data
sa `GIS_data'
*census data: yearly estimates of racial composition by census tract
clear
input censustract year percentwhite
10 2010 20
10 2000 10
11 2010 25
11 2000 5
end
tempfile census_data
sa `census_data'
*easy method: merge the census data onto your GIS data
use `GIS_data', clear
mer m:1 censustract year using `census_data'
drop if _merge==2
list
*hard method: use -replace-
use `GIS_data', clear
gen percentwhite=.
replace percentwhite=20 if censustract==10 & year==2010
replace percentwhite=10 if censustract==10 & year==2000
replace percentwhite=25 if censustract==11 & year==2010
replace percentwhite=5 if censustract==11 & year==2000
list
Both methods "work", but using -merge- is much easier and less prone to errors.
Note: I intentionally created the data sets so that the merge wouldn't be perfect. You will likely want to drop some of the observations in that case. In the code above I dropped the observations with _merge==2, that is, census rows that match no home in the GIS data.

How do i split a JSON file by a specific amount of objects in separate files?

import json
content = []
with open("articles.jsonl", "rt") as file:
    for a in file:
        out = json.loads(a)
        content.append(out)
file.close()
count = 0
file_count = 1
with open("articles" + str(file_count) + ".jsonl", "wt") as fp:
    for a in content:
        json.dump(a, fp)
        fp.write("\n")
        count +=1
        if count == 2000:
            file_count +=1
            count = 0
            continue
fp.close()
{"id": "f7ca322d-c3e8-40d2-841f-9d7250ac72ca", "content": "VETERANS saluted Worcester's first ever breakfast club for ex-soldiers which won over hearts, minds and bellies. \n \nThe Worcester Breakfast Club for HM Forces Veterans met at the Postal Order in Foregate Street at 10am on Saturday. \n \nThe club is designed to allow veterans a place to meet, socialise, eat and drink, giving hunger and loneliness their marching orders. \n \nFather-of-two Dave Carney, aged 43, of Merrimans Hill, Worcester, set up the club after being inspired by other similar clubs across the country. \n \nHe said: \"As you can see from the picture, we had a good response. Five out of the 10 that attended said they saw the article in the newspaper and turned up. \n \n\"We even had an old chap travel from Droitwich and he was late on parade by three hours. \n \n\"It's generated a lot of interest and I estimate (from other veterans who saw the article) that next month's meeting will attract about 20 people. Onwards and upwards.\" \n \nHe said the management at the pub had been extremely hospitable to them. \n \nMr Carney said: \"They bent over backwards for us. They really looked after us well. That is the best choice of venue I could have made. They even put 'reserved for the armed forces'. \n Promoted stories \nThe reserve veteran with the Royal Engineers wanted to go to a breakfast club but found the nearest ones were in Bromsgrove and Gloucester so he decided to set up his own, closer to home. \n \nHe was influenced by Derek Hardman who set up a breakfast club for veterans in Hull and Andy Wilson who set one up in Newcastle. He said the idea has snowballed and there were now 70 similar clubs across the country and even some in Germany. \n \nMr Carney said with many Royal British Legion clubs closing he wanted veterans and serving personnel to feel they had somewhere they could go for good grub, beer and banter to recapture the comradery of being in the forces. 
\n \nThe Postal Order was chosen because of its central location and its proximity to the railway station and hotels and reasonably priced food and drink. \n \nThe management of the pub have even given the veterans a designated area within the pub. \n \n Share article \n \nThe next meeting is at the Postal Order on Saturday, October 3 at 10am. \n \nThe breakfast club meets on the first Saturday of each month for those who want to attend in future.", "title": "Worcester breakfast club for veterans gives hunger its marching orders", "media-type": "News", "source": "Redditch Advertiser", "published": "2015-09-07T10:16:14Z"}
Above is a small sample of the articles.jsonl file.
This just writes everything to a single file called articles1.jsonl instead of multiple files with a specific set of objects. Any suggestions?
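One likely reason for this behaviour is that open() runs only once, before the loop starts, so changing file_count later never opens a new file. A minimal sketch of one way around it is to open a fresh output file for each chunk. The names split_jsonl, chunk_size and prefix are made up for illustration and are not part of the original code.

```python
import json
from itertools import islice
from pathlib import Path


def split_jsonl(src_path, chunk_size=2000, prefix="articles"):
    """Write every `chunk_size` objects of a JSON Lines file to its own file.

    Hypothetical helper: returns the list of output paths it wrote.
    """
    src = Path(src_path)
    written = []
    with src.open("rt", encoding="utf-8") as infile:
        # Ignore blank lines so they do not count towards a chunk.
        lines = (line for line in infile if line.strip())
        file_count = 1
        while True:
            # Take the next chunk_size lines; an empty chunk means end of input.
            chunk = list(islice(lines, chunk_size))
            if not chunk:
                break
            # A new output file is opened for every chunk.
            out_path = src.with_name(f"{prefix}{file_count}.jsonl")
            with out_path.open("wt", encoding="utf-8") as outfile:
                for line in chunk:
                    # Round-trip through json so a malformed line fails loudly.
                    outfile.write(json.dumps(json.loads(line)) + "\n")
            written.append(out_path)
            file_count += 1
    return written
```

Calling split_jsonl("articles.jsonl") would then write articles1.jsonl, articles2.jsonl and so on next to the input file, each holding at most 2000 objects. The input is read lazily, so the whole file never has to sit in memory.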

Creating a list of JSON files only with only one component of the list

I have 4 JSON files spread across two folders: folder1 and folder2. Each JSON file contains the date, the body and the title.
folder1.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. ", "title": "Prospects;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. Agents of the Office for the Protection of State Secrets got one check from Prague, the pun goes, and another from their real bosses at K.G.B. headquarters in Moscow. Roy Godson, head of the Washington-based National Strategy Information Center and a well-known intelligence scholar, called any democratic change ''a net loss'' for Soviet intelligence. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' ", "title": "Upheaval in the East: Espionage;"}
folder2.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. But facing business owners are numerous problems, from taxes and regulations at all levels of government to competition from other businesses in and out of Westchester. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. ", "title": "Coping With the Economic Prospects of 1990"}
{"date": "December 29, 1989, Friday, Late Edition - Final", "body": "Eastern Airlines said yesterday that it was laying off 600 employees, mostly managers, and cutting wages by 10 percent or 20 percent for about half its work force. Thomas J. Matthews, Eastern's senior vice president of human resources, estimated that the measures would save the carrier about $100 million a year. Eastern plans to rebuild by making Atlanta its primary hub and expects to operate about 75 percent of its flights from there. ", "title": "Eastern Plans Wage Cuts, 600 Layoffs"}
I would like to create a common list from all these JSON files, but containing only the body of each article. So far I am trying the following:
json1 <- lapply(readLines("folder1.json"), fromJSON)
json2 <- lapply(readLines("folder2.json"), fromJSON)
jsonl <- list(json1$body, json2$body)
But it is not working. Any suggestions?
Andres Azqueta
Solution:
You need to dereference the result of fromJSON() inside the sapply() call so that only the body is retrieved:
fromJSON()$body
Note: I am assuming the file format from your previous question.
The point is that the file is pseudo-JSON (one JSON object per line, not a single valid JSON document), hence the modified fromJSON() call below.
OK, let's step through an example:
Stage 1: Concatenate JSON files into 1
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
newJSON[2]# Extract bodies
newJSON[5]# Extract bodies
Output
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
> newJSON[2]# Extract bodies
[[1]]
[1] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
> newJSON[5]# Extract bodies
[[1]]
[1] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
Stage 2: Concatenate and extract the body from all files...
Look for the reference to fromJSON()$body in code line...
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
newJSON
Output
> filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
> newJSON
./data/NYT_1989.json
[1,] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
./data/NYT_1990.json
[1,] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
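Note that sapply() only returns a matrix like the one above when every file holds the same number of records; otherwise it falls back to a list. The following is a minimal sketch, not part of the original answer, of collecting the bodies into one flat character vector instead. The helper name read_bodies and the demo files are assumptions made up for illustration.

```r
library(jsonlite)

# Hypothetical helper (not part of the answer above): wrap the file's lines
# in [ ] so they parse as one JSON array, then keep only the body field.
read_bodies <- function(path) {
  lines <- readLines(path)
  fromJSON(sprintf("[%s]", paste(lines, collapse = ",")))$body
}

# Made-up demo files with different numbers of records.
demo_files <- c(tempfile(fileext = ".json"), tempfile(fileext = ".json"))
writeLines(c('{"body": "text one", "title": "a"}',
             '{"body": "text two", "title": "b"}'), demo_files[1])
writeLines('{"body": "text three", "title": "c"}', demo_files[2])

# lapply() + unlist() gives a flat character vector regardless of file sizes.
all_bodies <- unlist(lapply(demo_files, read_bodies), use.names = FALSE)
```

With real data, demo_files would be replaced by the filelist vector used above.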
require(RJSONIO)
# Read each file, then collect the body fields in a list
json_1 <- fromJSON("~/folder1/1.json")
json_2 <- fromJSON("~/folder2/2.json")
jsonl <- list(json_1$body, json_2$body)

Creating a corpus out of texts stored in JSON files in R

I have several JSON files with texts grouped into date, body and title. As an example, consider:
{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}
I have three different newspapers with separate files containing all the texts produced for the period 1989 - 2016. My ultimate goal is to combine all the texts into a single corpus. I have done it in Python using the pandas library and I am wondering if it could be done in R similarly. Here is my code with the loop in R:
for (i in 1989:2016){
df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
appended_data.append(df0)
appended_data.append(df1)
appended_data.append(df2)
}
Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
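A minimal sketch of this suggestion, assuming the files are newline-delimited JSON as in the example records of the question. The temporary files and their contents are made up for illustration, and rbind_pages() is used as the current spelling of rbind.pages() in recent jsonlite versions.

```r
library(jsonlite)

# Hypothetical demo files: two small newline-delimited JSON files in a temp dir.
tmp <- tempdir()
paths <- file.path(tmp, c("NYT_1989.json", "NYT_1990.json"))
writeLines(c('{"date": "d1", "body": "first text", "title": "t1"}',
             '{"date": "d2", "body": "second text", "title": "t2"}'), paths[1])
writeLines('{"date": "d3", "body": "third text", "title": "t3"}', paths[2])

# stream_in() parses one JSON object per line and returns a data frame.
pages <- lapply(paths, function(p) stream_in(file(p), verbose = FALSE))

# rbind_pages() stacks the per-file data frames into one.
corpus_df <- rbind_pages(pages)

# Record which file each row came from (an optional extra column).
corpus_df$source <- rep(basename(paths), vapply(pages, nrow, integer(1)))
```

For the real data, paths would hold the NYT, USAT and WP file names for every year.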
There are many options in R to read JSON files and convert them to a data.frame/data.table.
Here is one using jsonlite and data.table:
library(data.table)
library(jsonlite)
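res <- lapply(1989:2016, function(i) {
  # Build the three file names for year i
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  # The files hold one JSON object per line, so read them with stream_in()
  rbindlist(lapply(list_files_paths, function(f) stream_in(file(f), verbose = FALSE)))
})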
Here res is a list of data.tables. To combine them all into a single data.table:
rbindlist(res)
Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)