Clojure: parse JSON and extract values

I'm making an API call and using Cheshire to parse the JSON:
(defn fetch_headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d6987")]
    (-> articlesUrl
        client/get
        generate-string
        parse-string)))
The JSON payload:
{"status" 200, "headers" {"access-control-allow-headers" "x-api-key,
authorization", "content-type" "application/json; charset=utf-8",
"access-control-allow-origin" "*", "content-length" "7434",
"connection" "close", "pragma" "no-cache", "expires" "-1",
"access-control-allow-methods" "GET", "date" "Thu, 28 Mar 2019
20:22:16 GMT", "x-cached-result" "false", "cache-control" "no-cache"},
"body"
"{\"status\":\"ok\",\"totalResults\":10,\"articles\":[{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Trump:
Mueller probe was 'attempted takeover' of government - CNN
Video\",\"description\":\"In a Fox News interview with Sean Hannity,
President Trump called special counsel Robert Mueller's probe an
\\"attempted takeover of our
government.\\"\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/trump-mueller-probe-attempted-takeover-hannity-cpt-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324191527-06-trump-mueller-reaction-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:09:04.1891948Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"James
Clapper reacts to call he should be investigated - CNN
Video\",\"description\":\"Former Director of National Intelligence
James Clapper reacts to White House press secretary Sarah Sanders
saying he and other former intelligence officials should be
investigated after special counsel Robert Mueller did not establish
collusion between the
Tr…\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/26/james-clapper-reponse-mueller-report-sarah-sanders-criticism-bts-ac360-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190325211210-james-clapper-ac360-03252019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:08:43.1736236Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Maegan
Vazquez, CNN\",\"title\":\"Trump set for first rally since Mueller
investigation ended\",\"description\":\"President Donald Trump, making
his first appearance before supporters since Robert Mueller ended his
investigation, is set to speak during a rally in Grand Rapids,
Michigan Thursday
night.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/donald-trump-grand-rapids-rally/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190321115403-07-donald-trump-lead-image-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:49:26Z\",\"content\":\"Washington
(CNN)President Donald Trump, making his first appearance before
supporters since Robert Mueller ended his investigation, is set to
speak during a rally in Grand Rapids, Michigan Thursday
night.\r\nThe rally follows a chaotic week in Washington, preci…
[+2099
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Katelyn
Polantz, CNN\",\"title\":\"Judge orders Justice Dept. to turn over
Comey memos\",\"description\":\"A federal judge has ordered that the
James Comey memos are turned over, in a court case brought by CNN and
other media organizations for access to the documents memorializing
former FBI Director's interactions with President Donald
Trump.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/james-comey-memo-lawsuit/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/181209143047-comey-1207-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:14:45Z\",\"content\":\"Washington
(CNN)A federal judge has ordered that the Justice Department and FBI
submit James Comey's memos in full to the court under seal, in a court
case brought by CNN and other media organizations for access to the
documents memorializing the former FBI d… [+1043
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Clare
Foran and Manu Raju, CNN\",\"title\":\"Pelosi calls AG's summary of
Mueller report 'arrogant'\",\"description\":\"House Speaker Nancy
Pelosi on Thursday criticized Attorney General William Barr's summary
of special counsel Robert Mueller's report, calling it
\\"condescending\\" and \\"arrogant\\" and saying \\"it wasn't
the right thing to
do.\\"\",\"url\":\"http://us.cnn.com/2019/03/28/politics/pelosi-mueller-report-congress-barr-summary/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328130240-02-nancy-pelosi-03282019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:48:25Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Analysis
by Chris Cillizza, CNN Editor-at-large\",\"title\":\"The 43 most
outrageous lines from Donald Trump's phone interview with Sean
Hannity\",\"description\":\"There's no \\"reporter\\" that President
Donald Trump likes more than Fox News' Sean Hannity -- largely due to
Hannity's unwavering, puppy dog-like support for the President. Trump
likes to reward people who play nice with him, which brings us to the
45-minute
ph…\",\"url\":\"http://us.cnn.com/2019/03/28/politics/sean-hannity-donald-trump-mueller/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328140149-01-hannity-trump-file-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:44:21Z\",\"content\":\"(CNN)There's
no \\"reporter\\" that President Donald Trump likes more than Fox
News' Sean Hannity -- largely due to Hannity's unwavering, puppy
dog-like support for the President. Trump likes to reward people who
play nice with him, which brings us to the 45-minu… [+14785
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Puerto
Rico Gov.: I'll punch the bully in the mouth - CNN
Video\",\"description\":\"In an exclusive interview with CNN, Puerto
Rico Governor Ricardo Rosselló said he would not sit back and allow
his officials to be bullied by the White
House.\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/ricardo-rossello-trump-bully-puerto-rico-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328123504-puerto-rico-gov-ricardo-rosello-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:08:33.7312458Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb, Manu Raju and Ted Barrett, CNN\",\"title\":\"Jared Kushner
interviewed by Senate Intelligence
Committee\",\"description\":\"President Donald Trump's son-in-law
Jared Kushner returned to the Senate Intelligence Committee for a
closed door interview Thursday as part of the committee's Russia
investigation.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/jared-kushner-senate-intelligence/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180302124221-30-jared-kushner-super-tease.jpg\",\"publishedAt\":\"2019-03-28T16:21:29Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb and Laura Jarrett, CNN\",\"title\":\"Mueller report more than 300
pages, sources say\",\"description\":\"Special counsel Robert
Mueller's confidential report on the Russia investigation is more than
300 pages, according to a Justice Department official and a second
source with knowledge of the
matter.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/mueller-report-pages/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324130054-05-russia-investigation-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:52:01Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jim
Acosta and Kevin Liptak, CNN\",\"title\":\"Exclusive: Puerto Rico
governor warns White House over funding\",\"description\":\"Tensions
are escalating between President Donald Trump and Puerto Rico's
governor over disaster relief efforts that have been slow in coming
for the still-battered island after Hurricane
Maria.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/ricardo-rossell-donald-trump-puerto-rico-funding/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180920230539-pr-storm-of-controversy-rossello-trump-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:19:39Z\",\"content\":null}]}",
"trace-redirects"
["https://newsapi.org/v2/top-headlines?sources=cnn&apiKey=a688e6494c444902b1fc9cb93c61d687"]}
I'd like to extract the URLs from the returned JSON payload. I've tried this:
(defn fetch_headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d697")]
    (-> articlesUrl
        client/get
        generate-string
        parse-string
        (get-in ["source" "url"]))))
But I get a nil result, any ideas?
SOLUTION based on user feedback:
(defn fetch-headlines [source]
  (let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
                         source
                         "&apiKey=a688e6494c444902b1fc9cb93c61d697")]
    (-> articlesUrl
        client/get
        :body
        parse-string
        (get-in ["articles" 0 "url"]))))

What you need is inside the "body" key, but the value at that key is still a string, not yet a Clojure map. When you look up "source", you get nil back because that key doesn't exist at the top level; it lives inside "body", which must first be parsed from a string into a map.
Once you've properly parsed the body value, it should be something like:
(let [index-of-article 0]
  (get-in response ["body" "articles" index-of-article "url"]))
where index-of-article is the positional index of the article you want, since articles contains a vector of articles.
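The same pitfall is easy to demonstrate outside Clojure. Here is a minimal Python sketch of the idea: the response's "body" value is itself a JSON string, so it must be parsed before you can index into "articles". The response literal and the example.com URL are illustrative, not taken from the actual API.

```python
import json

# A trimmed-down analogue of the HTTP response: "body" is a JSON *string*.
response = {"status": 200,
            "body": '{"status": "ok", "articles": [{"url": "http://example.com/a"}]}'}

body = json.loads(response["body"])          # parse the string first
urls = [a["url"] for a in body["articles"]]  # then ordinary indexing works
print(urls)  # → ['http://example.com/a']
```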

Related

Remove specific "span" tag while preserving html object

I am scraping a website using BeautifulSoup and Python; the page has more than 100 span tags. I want to remove two consecutive span tags wherever the first span tag has the text "READ MORE:" and the second span tag is the string that follows it.
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
<span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>,
<span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
<span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
<span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
<span>READ MORE: </span>,
<span>Four female backpackers killed in horror highway crash</span>,
<span>The court also heard he had earned the title of a serial traffic offender.</span>,
<span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
<span>Watfa will serve at least two years and three months for manslaughter.</span>,
<span>He will be eligible for parole in early 2024.</span>
For example, I want to remove the four tags below:
<span>READ MORE: </span>,
<span>Long queues form at airports as one million Aussies set to fly this Easter</span>
<span>READ MORE: </span>,
<span>Four female backpackers killed in horror highway crash</span>
The output should be:
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>,
<span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>,
<span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>,
<span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>,
<span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>,
<span>The court also heard he had earned the title of a serial traffic offender.</span>,
<span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>,
<span>Watfa will serve at least two years and three months for manslaughter.</span>,
<span>He will be eligible for parole in early 2024.</span>
I would be grateful if someone could help me with the logic in Python. Cheers.
Assuming you are scraping the text of each article on a news site, you should change your strategy.
Clean the tree by calling .decompose() on the elements you do not want to scrape:
for e in soup.select('span:-soup-contains("READ MORE")'):
    e.find_next('span').decompose()
    e.decompose()
then select the body of the article and extract the text:
soup.select_one('.article__body-croppable').get_text(' ', strip=True)
This results in:
A driver has been jailed over the death of a baby boy who was sitting on his lap during a crash in Sydney's south-west . Two cars collided at low speed in Lurnea on February 25, 2019. The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa. Peter Watfa has been jailed for at least two years and three months. (9News) Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred. The baby boy suffered fatal injuries when the driver's airbag deployed. A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care". An 11-month-old boy died in the crash. (9News) The court also heard he had earned the title of a serial traffic offender. In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs. Watfa will serve at least two years and three months for manslaughter. He will be eligible for parole in early 2024.
Indeed you could also iterate your ResultSet and create a new list with all valid <span> tags, but I think that is not the best option (note the guard for the first element, so results[i-1] never wraps around to the end of the list):
[x for i, x in enumerate(results) if 'READ MORE' not in x.text and (i == 0 or 'READ MORE' not in results[i-1].text)]
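The pairwise-removal logic can also be written as an explicit loop, which some readers find easier to follow. This is a stdlib-only sketch using plain strings in place of bs4 tags; the drop_read_more_pairs name and the sample list are illustrative.

```python
def drop_read_more_pairs(spans):
    """Remove each 'READ MORE' span together with the span that follows it."""
    out = []
    skip_next = False
    for s in spans:
        if skip_next:          # this is the teaser after a READ MORE span
            skip_next = False
            continue
        if "READ MORE" in s:   # drop this span and flag the next one
            skip_next = True
            continue
        out.append(s)
    return out

spans = ["A crash occurred.", "READ MORE: ", "Unrelated teaser", "The driver was jailed."]
print(drop_read_more_pairs(spans))  # → ['A crash occurred.', 'The driver was jailed.']
```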

How do i split a JSON file by a specific amount of objects in separate files?

import json
content = []
with open("articles.jsonl", "rt") as file:
    for a in file:
        out = json.loads(a)
        content.append(out)
file.close()
count = 0
file_count = 1
with open("articles" + str(file_count) + ".jsonl", "wt") as fp:
    for a in content:
        json.dump(a, fp)
        fp.write("\n")
        count += 1
        if count == 2000:
            file_count += 1
            count = 0
            continue
fp.close()
{"id": "f7ca322d-c3e8-40d2-841f-9d7250ac72ca", "content": "VETERANS saluted Worcester's first ever breakfast club for ex-soldiers which won over hearts, minds and bellies. \n \nThe Worcester Breakfast Club for HM Forces Veterans met at the Postal Order in Foregate Street at 10am on Saturday. \n \nThe club is designed to allow veterans a place to meet, socialise, eat and drink, giving hunger and loneliness their marching orders. \n \nFather-of-two Dave Carney, aged 43, of Merrimans Hill, Worcester, set up the club after being inspired by other similar clubs across the country. \n \nHe said: \"As you can see from the picture, we had a good response. Five out of the 10 that attended said they saw the article in the newspaper and turned up. \n \n\"We even had an old chap travel from Droitwich and he was late on parade by three hours. \n \n\"It's generated a lot of interest and I estimate (from other veterans who saw the article) that next month's meeting will attract about 20 people. Onwards and upwards.\" \n \nHe said the management at the pub had been extremely hospitable to them. \n \nMr Carney said: \"They bent over backwards for us. They really looked after us well. That is the best choice of venue I could have made. They even put 'reserved for the armed forces'. \n Promoted stories \nThe reserve veteran with the Royal Engineers wanted to go to a breakfast club but found the nearest ones were in Bromsgrove and Gloucester so he decided to set up his own, closer to home. \n \nHe was influenced by Derek Hardman who set up a breakfast club for veterans in Hull and Andy Wilson who set one up in Newcastle. He said the idea has snowballed and there were now 70 similar clubs across the country and even some in Germany. \n \nMr Carney said with many Royal British Legion clubs closing he wanted veterans and serving personnel to feel they had somewhere they could go for good grub, beer and banter to recapture the comradery of being in the forces. 
\n \nThe Postal Order was chosen because of its central location and its proximity to the railway station and hotels and reasonably priced food and drink. \n \nThe management of the pub have even given the veterans a designated area within the pub. \n \n Share article \n \nThe next meeting is at the Postal Order on Saturday, October 3 at 10am. \n \nThe breakfast club meets on the first Saturday of each month for those who want to attend in future.", "title": "Worcester breakfast club for veterans gives hunger its marching orders", "media-type": "News", "source": "Redditch Advertiser", "published": "2015-09-07T10:16:14Z"}
Above is a small sample of the articles.jsonl file.
This just writes everything to a single file called articles1.jsonl instead of multiple files with a specific set of objects. Any suggestions?

regarding text.common_contexts() of nltk

What is the prime purpose of using text.common_contexts() in NLTK? I have searched and read as much as I could, but I didn't understand it. Please help me with an example. Thank you.
Example to understand:
Let's first define our input text; I will just copy/paste the first paragraph of the Game of Thrones Wikipedia page:
input_text = "Game of Thrones is an American fantasy drama television series \
created by David Benioff and D. B. Weiss for HBO. It is an adaptation of A Song \
of Ice and Fire, George R. R. Martin's series of fantasy novels, the first of \
which is A Game of Thrones. The show was filmed in Belfast and elsewhere in the \
United Kingdom, Canada, Croatia, Iceland, Malta, Morocco, Spain, and the \
United States.[1] The series premiered on HBO in the United States on April \
17, 2011, and concluded on May 19, 2019, with 73 episodes broadcast over \
eight seasons. Set on the fictional continents of Westeros and Essos, Game of \
Thrones has several plots and a large ensemble cast, and follows several story \
arcs. One arc is about the Iron Throne of the Seven Kingdoms, and follows a web \
of alliances and conflicts among the noble dynasties either vying to claim the \
throne or fighting for independence from it. Another focuses on the last \
descendant of the realm's deposed ruling dynasty, who has been exiled and is \
plotting a return to the throne, while another story arc follows the Night's \
Watch, a brotherhood defending the realm against the fierce peoples and \
legendary creatures of the North."
To be able to apply nltk functions we need to convert our text of type 'str' to 'nltk.text.Text'.
import nltk
text = nltk.Text( input_text.split() )
text.similar()
The similar() method takes an input_word and returns other words who appear in a similar range of contexts in the text.
For example let's see what are the words used in similar context to the word 'game' in our text:
text.similar('game') #output: song web
text.common_contexts()
The common_contexts() method allows you to examine the contexts that are shared by two or more words. Let's see in which context the words 'game' and 'web' were used in the text:
text.common_contexts(['game', 'web']) #outputs a_of
This means that in the text we'll find 'a game of' and 'a web of'.
These methods are especially interesting when your text is quite large (book, magazine...)
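Under the hood, common_contexts amounts to collecting the (previous word, next word) pair around each occurrence of the target words and intersecting the two sets. This stdlib sketch of that idea is illustrative and much simpler than NLTK's actual implementation:

```python
def common_contexts(tokens, w1, w2):
    """Collect (previous, next) word pairs around each target word and intersect them."""
    def contexts(word):
        return {(tokens[i - 1].lower(), tokens[i + 1].lower())
                for i in range(1, len(tokens) - 1)
                if tokens[i].lower() == word}
    return contexts(w1) & contexts(w2)

tokens = "a game of thrones and a web of alliances".split()
print(common_contexts(tokens, "game", "web"))  # → {('a', 'of')}, i.e. the context a_of
```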
Observe the example below:
>>> text1.concordance("tiger")
of miles you wade knee - deep among Tiger - lilies -- what is the
one charm wa but nurse the cruellest fangs : the tiger of
Bengal crouches in spiced groves e would be more hideous than a
caged tiger , then . I could not endure the sigh
>>> text1.concordance("bird")
o the winds when that storm - tossed bird is on the wing . The three
correspon , Ahab seemed not to mark this wild bird ; nor ,
indeed , would any one else nd incommoding Tashtego there ; this
bird now chanced to intercept its broad f his hammer frozen there
; and so the bird of heaven , with archangelic shrieks
>>> text1.common_contexts(["tiger","bird"])
the_of

Creating a list from JSON files with only one component of each

I have 4 JSON files spread across two folders: folder1 and folder2. Each JSON file contains the date, the body and the title.
folder1.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. ", "title": "Prospects;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. Agents of the Office for the Protection of State Secrets got one check from Prague, the pun goes, and another from their real bosses at K.G.B. headquarters in Moscow. Roy Godson, head of the Washington-based National Strategy Information Center and a well-known intelligence scholar, called any democratic change ''a net loss'' for Soviet intelligence. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' ", "title": "Upheaval in the East: Espionage;"}
folder2.json:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. But facing business owners are numerous problems, from taxes and regulations at all levels of government to competition from other businesses in and out of Westchester. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. ", "title": "Coping With the Economic Prospects of 1990"}
{"date": "December 29, 1989, Friday, Late Edition - Final", "body": "Eastern Airlines said yesterday that it was laying off 600 employees, mostly managers, and cutting wages by 10 percent or 20 percent for about half its work force. Thomas J. Matthews, Eastern's senior vice president of human resources, estimated that the measures would save the carrier about $100 million a year. Eastern plans to rebuild by making Atlanta its primary hub and expects to operate about 75 percent of its flights from there. ", "title": "Eastern Plans Wage Cuts, 600 Layoffs"}
I would like to create a single list from all these JSON files, but containing only the body of each article. So far I am trying the following:
json1 <- lapply(readLines("folder1.json"), fromJSON)
json2 <- lapply(readLines("folder2.json"), fromJSON)
jsonl <- list(json1$body, json2$body)
But it is not working. Any suggestions?
Andres Azqueta
Solution:
You need to dereference the fromJSON() result in the sapply() to retrieve only the body:
fromJSON()$body
Note: I am assuming the file format from your previous question. The point is that the file format is pseudo-JSON (one JSON object per line), hence the modified fromJSON() call below.
OK, let's step through an example:
Stage 1: Concatenate JSON files into 1
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
newJSON[2]# Extract bodies
newJSON[5]# Extract bodies
Output
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE))
> newJSON[2]# Extract bodies
[[1]]
[1] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
> newJSON[5]# Extract bodies
[[1]]
[1] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
Stage 2: Concatenate and extract the body from all files...
Look for the reference to fromJSON()$body in the code line below...
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
newJSON
Output
> filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
> newJSON
./data/NYT_1989.json
[1,] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
./data/NYT_1990.json
[1,] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
require(RJSONIO)
json_1 <- fromJSON("~/folder1/1.json")
json_2 <- fromJSON("~/folder2/2.json")
jsonl <- list(json_1$body, json_2$body)

Creating a corpus out of texts stored in JSON files in R

I have several JSON files with texts grouped into date, body and title. As an example consider:
{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}
I have three different newspapers with separate files containing all the texts produced for the period 1989 - 2016. My ultimate goal is to combine all the texts into a single corpus. I have done it in Python using the pandas library and I am wondering if it could be done similarly in R. Here is my Python code with the loop:
for (i in 1989:2016){
    df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)
}
Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
There are many options in R to read JSON files and convert them to a data.frame/data.table.
Here one using jsonlite and data.table:
library(data.table)
library(jsonlite)
res <- lapply(1989:2016, function(i){
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  rbindlist(lapply(list_files_paths, fromJSON))
})
Here res is a list of data.tables. If you want to aggregate them into a single data.table:
rbindlist(res)
Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)