Error message when converting JSON into CSV file

I saw several posts about converting JSON into CSV in R, but I get the following error after running the code below:
Error in fromJSON(file = "C:/users/emily/destop/data.json") : argument "txt" is missing, with no default
mydf <- fromJSON(file = "C:/users/emily/destop/data.json")
I have installed the jsonlite package.
It seems that the command doesn't read the JSON data correctly, or cannot convert it into a CSV file.
A few sample records from data.json look as follows:
{"reviewerID": "A3TS466QBAWB9D", "asin": "0014072149", "reviewerName": "Silver Pencil", "helpful": [0, 0], "reviewText": "If you are a serious violin student on a budget, this edition has it all: Piano accompaniment, low price, urtext solo parts, and annotations by Maestro David Oistrakh where Bach's penmanship is hard to decipher. Additions (in dashes) are easily distinguishable from the original bowings. This is a delightful concerto that all intermediate level violinists should perform with a violin buddy. Get your copy, today, along with \"The Green Violin; Theory, Ear Training, and Musicianship for Violinists\" book to prepare for this concerto and for more advanced playing!", "overall": 5.0, "summary": "Perform it with a friend, today!", "unixReviewTime": 1370476800, "reviewTime": "06 6, 2013"}
{"reviewerID": "A3BUDYITWUSIS7", "asin": "0041291905", "reviewerName": "joyce gabriel cornett", "helpful": [0, 0], "reviewText": "This is and excellent edition and perfectly true to the orchestral version! It makes playing Vivaldi a joy! I uses this for a wedding and was totally satisfied with the accuracy!", "overall": 5.0, "summary": "Vivalldi's Four Seasons", "unixReviewTime": 1381708800, "reviewTime": "10 14, 2013"}
{"reviewerID": "A2HR0IL3TC4CKL", "asin": "0577088726", "reviewerName": "scarecrow \"scarecrow\"", "helpful": [0, 0], "reviewText": "this was written for Carin Levine in 2008, but not premiered until 2011 at the Musica Viva Fest in Munich. .the work's premise maybe about the arduousness, nefarious of existence, how we all \"Work\" at life, at complexity, at densities of differing lifeworlds. Ferneyhough's music might suggest that these multiple dimensions of an ontology exist in diagonals across differing spectrums of human cognition, how we come to think about an object,aisthetic shaped ones, and expressionistic ones as his music.The work has a nice classical shape,and holds the \"romantic\" at bay; a mere 7 plus minutes for Alto Flute, a neglected wind cadre member.The work has gorgeous arresting moments with a great bounty of extended timbres-pointillistic bursts\" Klangfarben Sehr Kraeftig\" that you do grow weary of; it is almost predictable now hearing these works; you know they will inhabit a dense timbral space of mega-speed lines tossed in all registers;still one listens; that gap Freud speaks about, that we know we need this at some level. . .the music slowed at times for structural rigour.. . we have a dramatic causality at play in the subject,(the work's title) the arduousness of pushing, aspiring, working toward something; pleasurable illuminating or not, How about emancipation from itself;I guess we need forget the production of surplus value herein,even with the rebellions in current urban areas today; It has no place. . . these constructions are leftovers from modernity, the \"gods\" still hover. . .\"gods\" that Ferneyhough has no power to dispel. . . . All are now commonplace for the new music literature of music.This music still sound quite stunning, marvelous and evocative today,it is simple at some level, direct, unencumbered and violent with spit-tongues,gratuitous lines, fluttertongue,percussive slap-keys,tremoli wistful glissandi harmonics, fast filigreed lines, and simply threadbare melos, an austere fragment of what was a melody. . .Claudio Arrau said someplace that the performer the musician must emanate a \"work\" while playing music, a \"struggle\", aesthetic or otherwise, Sviatoslav Richter thought this grotesque, to look at a musician playing the great music. It was ugly for him. . .You can hear Ms.Levine on youtube playing her work, she is quite convincing, you always need to impart an authority,succored in an emotive focus that the music itself has not succumbed to your own possession. You play the music, it doesn't \"play\" you. . . I'd hope though that music with this arduous construction and structural vigour that it would in fact come to possess the performer. . .it is one of the last libidinal pleasures remaining. . .", "overall": 5.0, "summary": "arduous indeed!", "unixReviewTime": 1371168000, "reviewTime": "06 14, 2013"}
{"reviewerID": "A2DHYD72O52WS5", "asin": "0634029231", "reviewerName": "Amazon Customer \"RCC\"", "helpful": [0, 0], "reviewText": "Greg Koch is a knowledgable and charismatic host. He is seriously fun to watch. The main problem with the video is the format. The lack of on-screen tab is a serious flaw. You have to watch very carefully, have a good understanding of the minor pentatonic, and basic foundation of blues licks to even have a chance at gleening anything from this video.If you're just starting out, pick up the IN THE STYLE OF series. While this series has its limitations (incomplete songs due to copyright, no doubt), it has on screen tab and each lick is played at a reasonably slow speed. In addition, their web site has downloadable tab.However, if you can hold your own in the minor pentatonic, give this a try. It is quite a workout and you'll find yourself a better player having taken on the challenge.", "overall": 3.0, "summary": "GREAT! BUT NOT FOR BEGINNERS.", "unixReviewTime": 1119571200, "reviewTime": "06 24, 2005"}
{"reviewerID": "A1MUVHT8BONL5K", "asin": "0634029347", "reviewerName": "Amazon Customer \"clapton music fan\"", "helpful": [2, 12], "reviewText": "I bought this DVD and I'm returning it. The description and editorial review are misleading. This is NOT a Clapton video. Certainly some clips from Clapton, but generally this is a \"how to\" video. Same applies to Clapton The Early Years!", "overall": 2.0, "summary": "NOT CLAPTION MUSIC VIDEO! A Learn How To Play Guitar LIKE Clapton", "unixReviewTime": 1129334400, "reviewTime": "10 15, 2005"}
Eventually I would like to read the data correctly and convert it into a CSV file.

The following should work:
library(RJSONIO)
# test2.json is a sample json file
# I removed the reviewText field to keep it short
# Also tested this out with 1 row of data
D2 <- RJSONIO::fromJSON("./test2.json")
# convert the numeric vector helpful to one string
D2$helpful <- paste(D2$helpful, collapse = " ")
D2
reviewerID asin reviewerName helpful
[1,] "A3TS466QBAWB9D" "0014072149" "Silver Pencil" "0 0"
D3 <- do.call(cbind, D2)
write.csv(D3, "D3.csv")
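For completeness: the original error arises because jsonlite::fromJSON() has no file argument; its first argument is txt, so calling fromJSON(file = ...) leaves txt missing. Also, since each line of data.json is a separate JSON object (newline-delimited JSON) rather than a single JSON document, jsonlite's stream_in() can read the whole file straight into a data frame. A minimal sketch, assuming helpful parses as a list column as in the sample above:
library(jsonlite)

# each line of data.json is one JSON object, so stream it in directly
mydf <- stream_in(file("C:/users/emily/destop/data.json"))

# flatten the list column 'helpful' so it survives the CSV round trip
mydf$helpful <- vapply(mydf$helpful, paste, character(1), collapse = " ")
write.csv(mydf, "data.csv", row.names = FALSE)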


Transformer summariser pipeline giving different results on same model with fixed seed

I am using a HuggingFace summariser pipeline, and I noticed that if I train a model for 3 epochs and then evaluate the checkpoints from all 3 epochs with fixed random seeds, I get different results depending on whether I restart the Python console between checkpoints or load each checkpoint onto the same summariser object in a loop. I would like to understand why this strange behaviour occurs.
While my results are based on ROUGE scores over a large dataset, I have made this small reproducible example to show the issue. Instead of using the weights of the same model at different training epochs, I demonstrate with two different summarization models, but the effect is the same. Grateful for any help.
Notice how in the first run I first use the facebook/bart-large-cnn model and then the lidiya/bart-large-xsum-samsum model without restarting the Python terminal. In the second run I use only the lidiya/bart-large-xsum-samsum model and get a different output (which should not be the case).
NOTE: this reproducible example won't work on a CPU machine, as CPU execution doesn't seem sensitive to torch.use_deterministic_algorithms(True) and may give different results on every run; it should be reproduced on a GPU.
FIRST RUN
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
# random text taken from UK news website
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model.eval()
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)
The output from the lidiya/bart-large-xsum-samsum model should be:
[{'summary_text': 'The UK economy is in crisis because of inflation. The government has been slow to react to it. Boris Johnson is on holiday.'}]
SECOND RUN (you must restart Python to conduct the experiment)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)
The output should be:
[{'summary_text': 'The government has been slow to deal with inflation. Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation.'}]
Why is the first output different from the second one?
You need to re-seed the program after the bart-large-cnn pipeline runs. Otherwise the first pipeline consumes values from the seeded random generator, so the generator is in a different state by the time your lidiya model runs, which yields different outputs for it across the two scripts.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
# random text taken from UK news website
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model.eval()
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)

I need to turn the texts into vectors then feed the vectors into a classifier

I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, with 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s for training and all 0s for testing), and then feed the vectors into a classifier such as a random forest.
For such a task you'll need to preprocess your data first. Lower-case all your sentences, then remove all stopwords (the, and, or, ...), then tokenize (an introduction here: https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming to keep only the root of each word, which can be helpful for sentiment classification.
Then you'll assign an index to each word of your vocabulary and replace the words in your sentences with these indices:
Imagine your vocabulary is: ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['none'] = 0  # reserved for words not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1, 2, 3].
However, you have to define a maximum length max_len for your sentences; when a sentence contains fewer than max_len words, you pad the encoded vector with zeros up to size max_len.
In the previous example, if max_len = 5 then [1, 2, 3] -> [1, 2, 3, 0, 0].
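A minimal sketch of this encode-and-pad step, written here in R (the vocabulary and max_len come from the example above; the encode helper is hypothetical):
vocab <- c("i", "love", "keras", "pytorch", "tensorflow")
index <- setNames(seq_along(vocab), vocab)

# hypothetical helper: look up each token, map unknown words to 0, pad to max_len
encode <- function(sentence, max_len = 5) {
  tokens <- tolower(strsplit(sentence, "\\s+")[[1]])
  ids <- unname(index[tokens])
  ids[is.na(ids)] <- 0
  c(ids, rep(0, max_len))[1:max_len]
}

encode("I love Keras")  # 1 2 3 0 0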
This is a basic approach. Feel free to check the preprocessing tools provided by libraries such as NLTK, pandas, and others.

regarding text.common_contexts() of nltk

What is the prime purpose of using text.common_contexts() in nltk? I have searched and read as much as I could, but I'm sorry to say I didn't understand it at all. Please help me by giving an example. Thank you.
An example to help you understand:
Let's first define our input text; I will just copy/paste the first paragraph of the Game of Thrones Wikipedia page:
input_text = "Game of Thrones is an American fantasy drama television series \
created by David Benioff and D. B. Weiss for HBO. It is an adaptation of A Song \
of Ice and Fire, George R. R. Martin's series of fantasy novels, the first of \
which is A Game of Thrones. The show was filmed in Belfast and elsewhere in the \
United Kingdom, Canada, Croatia, Iceland, Malta, Morocco, Spain, and the \
United States.[1] The series premiered on HBO in the United States on April \
17, 2011, and concluded on May 19, 2019, with 73 episodes broadcast over \
eight seasons. Set on the fictional continents of Westeros and Essos, Game of \
Thrones has several plots and a large ensemble cast, and follows several story \
arcs. One arc is about the Iron Throne of the Seven Kingdoms, and follows a web \
of alliances and conflicts among the noble dynasties either vying to claim the \
throne or fighting for independence from it. Another focuses on the last \
descendant of the realm's deposed ruling dynasty, who has been exiled and is \
plotting a return to the throne, while another story arc follows the Night's \
Watch, a brotherhood defending the realm against the fierce peoples and \
legendary creatures of the North."
To be able to apply nltk functions, we need to convert our text from type 'str' to 'nltk.text.Text'.
import nltk
text = nltk.Text( input_text.split() )
text.similar()
The similar() method takes an input word and returns other words that appear in a similar range of contexts in the text.
For example let's see what are the words used in similar context to the word 'game' in our text:
text.similar('game') #output: song web
text.common_contexts()
The common_contexts() method allows you to examine the contexts that are shared by two or more words. Let's see in which context the words 'game' and 'web' were used in the text:
text.common_contexts(['game', 'web']) #outputs a_of
This means that in the text we'll find both 'a game of' and 'a web of'.
These methods are especially interesting when your text is quite large (a book, a magazine, ...).
Observe the example below and you will understand:
>>> text1.concordance("tiger")
of miles you wade knee - deep among Tiger - lilies -- what is the
one charm wa but nurse the cruellest fangs : the tiger of
Bengal crouches in spiced groves e would be more hideous than a
caged tiger , then . I could not endure the sigh
>>> text1.concordance("bird")
o the winds when that storm - tossed bird is on the wing . The three
correspon , Ahab seemed not to mark this wild bird ; nor ,
indeed , would any one else nd incommoding Tashtego there ; this
bird now chanced to intercept its broad f his hammer frozen there
; and so the bird of heaven , with archangelic shrieks
text1.common_contexts(["tiger","bird"])
the_of

loop within a loop for JSON files in R

I am trying to aggregate a bunch of JSON files into a single one, for three sources and three years. So far I have only been able to do it the tedious way, but I am sure it could be done in a smarter, more elegant manner.
json1 <- lapply(readLines("NYT_1989.json"), fromJSON)
json2 <- lapply(readLines("NYT_1990.json"), fromJSON)
json3 <- lapply(readLines("NYT_1991.json"), fromJSON)
json4 <- lapply(readLines("WP_1989.json"), fromJSON)
json5 <- lapply(readLines("WP_1990.json"), fromJSON)
json6 <- lapply(readLines("WP_1991.json"), fromJSON)
json7 <- lapply(readLines("USAT_1989.json"), fromJSON)
json8 <- lapply(readLines("USAT_1990.json"), fromJSON)
json9 <- lapply(readLines("USAT_1991.json"), fromJSON)
jsonl <- list(json1, json2, json3, json4, json5, json6, json7, json8, json9)
Note that the year range, 1989 to 1991, is the same for all three sources. Any ideas? Thanks!
PS: Example of the data inside each file:
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. ", "title": "Prospects;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' ", "title": "Upheaval in the East: Espionage;"}
{"date": "December 31, 1989, Sunday, Late Edition - Final", "body": "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. ", "title": "Coping With the Economic Prospects of 1990"}
Here you go:
require(jsonlite)
filelist <- c("NYT_1989.json","NYT_1990.json","NYT_1991.json",
"WP_1989.json", "WP_1990.json","WP_1991.json",
"USAT_1989.json","USAT_1990.json","USAT_1991.json")
newJSON <- sapply(filelist, function(x) fromJSON(readLines(x)))
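If you would rather not write the nine names out by hand, the file list can also be generated from the sources and years (a small sketch; it assumes the files live in the working directory):
sources <- c("NYT", "WP", "USAT")
years   <- 1989:1991

# build "NYT_1989.json", "WP_1989.json", ... programmatically
filelist <- as.vector(outer(sources, years,
                            function(s, y) sprintf("%s_%d.json", s, y)))
newJSON  <- sapply(filelist, function(x) fromJSON(readLines(x)))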
Read in just the body entry from each line of the input file.
You asked how to read in just a subset of the JSON file. The file data referenced isn't actually a single JSON document; each line is a separate JSON object (JSON-like, newline-delimited), hence we have to modify the input to fromJSON() — joining the lines with commas and wrapping them in [...] — to read the data correctly. We then dereference the result with fromJSON(...)$body to extract just the body variable.
filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
newJSON
Results
> filelist <- c("./data/NYT_1989.json", "./data/NYT_1990.json")
> newJSON <- sapply(filelist, function(x) fromJSON(sprintf("[%s]", paste(readLines(x), collapse = ",")), flatten = FALSE)$body)
> newJSON
./data/NYT_1989.json
[1,] "Frigid temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "DATELINE: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "SURVIVING the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
./data/NYT_1990.json
[1,] "Blue temperatures across much of the United States this month sent demand for heating oil soaring, providing a final upward jolt to crude oil prices. Some spot crude traded at prices up 40 percent or more from a year ago. Will these prices hold? Five experts on oil offer their views. That's assuming the economy performs as expected - about 1 percent growth in G.N.P. The other big uncertainty is the U.S.S.R. If their production drops more than 4 percent, prices could stengthen. "
[2,] "BLUE1: WASHINGTON, Dec. 30 For years, experts have dubbed Czechoslovakia's spy agency the ''two Czech'' service. But he cautioned against euphoria. ''The Soviets wouldn't have relied on just official cooperation,'' he said. ''It would be surprising if they haven't unilaterally penetrated friendly services with their own agents, too.'' "
[3,] "GREEN4 the decline in the economy will be the overriding issue for 1990, say leaders of the county's business community. Successful Westchester business owners will face and overcome these risks and obstacles. Westchester is a land of opportunity for the business owner. "
You might find the following apply tutorial useful:
Datacamp: R tutorial on the Apply family of functions
I also recommend reading:
R Inferno - Chapter 4 - Over-Vectorizing
Trust me when I say this free online book has helped me a lot. It has also confirmed I am an idiot on multiple occasions :-)

Creating a corpus out of texts stored in JSON files in R

I have several JSON files with texts grouped into date, body and title fields. As an example consider:
{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}
I have three different newspapers, with separate files containing all the texts produced for the period 1989-2016. My ultimate goal is to combine all the texts into a single corpus. I have done it in Python using the pandas library, and I am wondering if it could be done similarly in R. Here is my Python code with the loop:
appended_data = []
for i in range(1989, 2017):
    df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)
Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
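A minimal sketch of that approach (in recent versions of jsonlite the combining function is spelled rbind_pages; the files are assumed to sit in the working directory):
library(jsonlite)

# build the 84 file names: 3 sources x 28 years
files <- as.vector(outer(c("NYT", "USAT", "WP"), 1989:2016,
                         function(s, y) sprintf("%s_%d.json", s, y)))

# stream_in() parses newline-delimited JSON straight into a data frame
pages <- lapply(files, function(f) stream_in(file(f), verbose = FALSE))
combined <- rbind_pages(pages)  # one data frame with date, body and title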
There are many options in R to read JSON files and convert them to a data.frame/data.table.
Here is one using jsonlite and data.table:
library(data.table)
library(jsonlite)
res <- lapply(1989:2016, function(i) {
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  # each file is newline-delimited JSON, so join the lines into a
  # JSON array before parsing
  rbindlist(lapply(list_files_paths, function(f) {
    fromJSON(sprintf("[%s]", paste(readLines(f), collapse = ",")))
  }))
})
Here res is a list of data.tables. If you want to aggregate them all into a single data.table:
rbindlist(res)
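If the ultimate goal is a text corpus rather than a plain table, the combined data can then be handed to a text-mining package, for instance quanteda (an assumption; any corpus library would do):
library(quanteda)

all_docs <- as.data.frame(rbindlist(res))
corp <- corpus(all_docs, text_field = "body")  # date and title become docvars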
Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)
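For example (a sketch; ndjson::stream_in takes a file path and returns an already-flattened data.table, so the per-file results can be stacked directly):
library(ndjson)
library(data.table)

# 'files' as above: one path per source and year
dt <- rbindlist(lapply(files, ndjson::stream_in), fill = TRUE)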