Reading numbers from a flat text file in R - json
I have a text file with the following lines:
{"time":"2015-11-15T17:56:45.300","x":93.32,"y":8.6,"s":4.57,"dis":0.45,"on_field":true,"game":{"references":[{"origin":"gsis","id":2015111500}]},"team":{"references":[{"origin":"gsis","id":"5110"}]},"play":{"references":[{"origin":"ngs","id":""}]},"references":[{"origin":"gsis","id":"00-0026189"}]}
{"time":"2015-11-15T17:56:45.400","x":93.77,"y":8.48,"s":4.55,"dis":0.47,"on_field":true,"game":{"references":[{"origin":"gsis","id":2015111500}]},"team":{"references":[{"origin":"gsis","id":"5110"}]},"play":{"references":[{"origin":"ngs","id":""}]},"references":[{"origin":"gsis","id":"00-0026189"}]}
{"time":"2015-11-15T17:56:45.500","x":94.23,"y":8.36,"s":4.53,"dis":0.47,"on_field":true,"game":{"references":[{"origin":"gsis","id":2015111500}]},"team":{"references":[{"origin":"gsis","id":"5110"}]},"play":{"references":[{"origin":"ngs","id":""}]},"references":[{"origin":"gsis","id":"00-0026189"}]}
{"time":"2015-11-15T17:56:45.600","x":94.67,"y":8.23,"s":4.51,"dis":0.46,"on_field":true,"game":{"references":[{"origin":"gsis","id":2015111500}]},"team":{"references":[{"origin":"gsis","id":"5110"}]},"play":{"references":[{"origin":"ngs","id":""}]},"references":[{"origin":"gsis","id":"00-0026189"}]}
{"time":"2015-11-15T17:56:45.700","x":95.1,"y":8.08,"s":4.5,"dis":0.46,"on_field":true,"game":{"references":[{"origin":"gsis","id":2015111500}]},"team":{"references":[{"origin":"gsis","id":"5110"}]},"play":{"references":[{"origin":"ngs","id":""}]},"references":[{"origin":"gsis","id":"00-0026189"}]}
I am trying to extract the date, time, x, y, s, and dis variables and save them in an R data frame. I could probably find a way to clean it with a shell script and then read it into R, but I was hoping there is some nice trick to do this in R only. Thanks
Each of your lines appears to be valid JSON (but the file as a whole is not, so we cannot just parse it as such). You could parse each line into a list and then make a list of the results:
res <- readLines("test.txt")
library(jsonlite)
allofit <- sapply(res, fromJSON)
which will give you a list of lists (of lists ..) containing all your data
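Since only the date, time, x, y, s, and dis fields are wanted, that list of parsed lines can be flattened into a data frame. A minimal sketch along those lines, with two of the sample lines inlined here in place of readLines("test.txt"):

```r
library(jsonlite)

# two of the sample lines, inlined for illustration;
# in practice use: res <- readLines("test.txt")
res <- c(
  '{"time":"2015-11-15T17:56:45.300","x":93.32,"y":8.6,"s":4.57,"dis":0.45,"on_field":true}',
  '{"time":"2015-11-15T17:56:45.400","x":93.77,"y":8.48,"s":4.55,"dis":0.47,"on_field":true}'
)

parsed <- lapply(res, fromJSON)

# keep only the flat fields, splitting the timestamp into date and time
df <- do.call(rbind, lapply(parsed, function(p) {
  data.frame(date = substr(p$time, 1, 10),
             time = substr(p$time, 12, nchar(p$time)),
             x = p$x, y = p$y, s = p$s, dis = p$dis,
             stringsAsFactors = FALSE)
}))
```

The nested "references" structures are simply never touched, so they cost nothing here.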
Related
convert json text entries to a dataframe in r
I have a text file with a JSON-like structure that contains values for certain variables, as below:
[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]
[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]
Every line is a new set of values for the same variables 1 through 5. There can be thousands of lines in a text file, but the variables always remain the same. I want to extract variables 1 through 5 along with their values and convert them into a dataframe. Currently I perform these operations in Excel using string manipulation and transpose. Here is what it looks like in Excel. How to do this in R? Much appreciated. Thanks. J
There is a package named jsonlite that you can use.
library("jsonlite")
df <- fromJSON("YourPathToTheFile")
You can find more info here.
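Note that because each line of the file is a separate JSON array, fromJSON on the whole file path may not parse it in one go. A variation that parses line by line and stacks the one-row results (sketched here with the two sample lines inlined in place of readLines on the actual file):

```r
library(jsonlite)

# the two sample lines from the question;
# in practice: lines <- readLines("YourPathToTheFile")
lines <- c(
  '[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]',
  '[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]'
)

# fromJSON turns each [{...}] line into a one-row data frame; rbind stacks them
df <- do.call(rbind, lapply(lines, fromJSON))
```

This relies only on every line carrying the same five variables, which the question guarantees.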
read multiple local html files in a folder in R
I have several HTML files in a folder on my PC. I would like to read them in R, trying to keep the original format as much as possible. There is only text, by the way. I have tried two approaches, which failed miserably:
## first approach
library(tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))
## second approach
list_files_path <- list.files(path = './gazzetes.presihtml')
a <- paste0(list_files_path, names) # vector names contains the names of the files with the .HTML extension
rawHTML <- readLines(a)
Any guess? All the best
Your second approach is close to working, except that readLines only accepts one connection, but you are giving it a vector with multiple files. You can use lapply with readLines to achieve this. Here is an example:
# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
# readLines for each file and put them in a list
lineList <- lapply(files, readLines)
# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)
# collapse the character vector into a single string
html <- paste(lineVector, collapse = '\n')
# print the string with original formatting
cat(html)
Selectively import only JSON data in a txt file into R
I have 3 questions I would like to ask, as I am relatively new to both R and the JSON format. I have read quite a bit, but I still don't quite understand.
1.) Can R parse JSON data when the txt file contains other irrelevant information as well? Assuming I can't, I uploaded the text file into R and did some cleaning up, so that it will be easier to read the file:
require(plyr)
require(rjson)
small.f.2 <- subset(small.f.1, ! V1 %in% c("Level_Index:", "Feature_Type:", "Goals:", "Move_Count:"))
small.f.3 <- small.f.2[,-1]
This gives me a single column with all the JSON data in each line. I tried to write a new .txt file:
write.table(small.f.3, file="small clean.txt", row.names = FALSE)
json_data <- fromJSON(file="small.clean")
The problem was it only converted 'x' (the first row) into a character and ignored everything else. I imagined the problem was with "x", so I took that out of the .txt file and ran it again:
json_data <- fromJSON(file="small clean copy.txt")
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=""))
Both worked and I managed to create a list, but it only takes the data from the first row and ignores the rest. This leads to my second question. I tried this:
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=","))
Error in fromJSON(paste(readLines("small clean copy.txt"), collapse = ",")) :
  unexpected character ','
2.) How can I extract the rest of the rows in the .txt file?
3.) Is it possible for R to read the JSON data from one row, extract only the nested data that I need, and then go on to the next row, like a loop? For example, in each array I am only interested in the Action vectors and the State Feature vectors, but not the rest of the data. If I can somehow extract only the information I need before moving on to the next array, I can save a lot of memory. I validated the array online, but the .txt file is not JSON formatted, only within each array.
I hope this make sense. Each row is a nested array. The data looks something like this. I have about 65 rows (nested arrays) in total. {"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],"LightningIndices":[],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.599999964237213,12.0,9.0,3.0,1.0,0.0,11.0,2.0,1.0,0.0,0.0,0.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12213890532609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13055793241076,0.0,0.0,0.0,0.0,0.0,0.231325346416068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.949158357257511,0.0,0.0,0.0,0.0,0.0,0.369666537828737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0851765937900996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223409208023677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.698640447815897,1.69496718435102,0.0,0.0,0.0,0.0,1.42312654023416,0.0,0.38394999584831,0.0,0.0,0.0,0.0,1.0,1.22164326251584,1.30980246401454,1.00411570750454,0.0,0.0,0.0,1.44306759429513,0.0,0.00568191150434618,0.0,0.0,0.0,0.0,0.0,0.0,0.157705869690127,0.0,0.0,0.0,0.0,0.102089274086033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37039305683305,2.64354332879095,0.0,0.456876463171171,0.0,0.0,0.208651305680117,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.46713142511126,2.26785558685153,0.284845692694476,0.29200364444299,0.0,0.562185300773834,1.79134869431988,0.423426746571872,0.0,0.0,0.0,0.0,5.06772310533214,0.0,1.95593334724537,2.08448537685298,1.22045520912269,0.251119892385839,0.0,4.86192274732091,0.0,0.186941346075472,0.0,0.0,0.0,0.0,4.37998688020614,0.0,3.04406665275463,1.0,0.49469909818283,0.0,0.0,1.57589195190525,0.0,0.0,0.0,0.0,0.0,0.0,3.55229001446173]}},...... 
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,24],"LightningIndices":[[15,16,17,18,19,20,21,22,23]],"SelectedAction":15,"State":{"Features":{"Data":[20.0,53.0,0.0,11.0,10.0,2.0,1.0,0.0,12.0,2.0,1.0,0.0,0.0,1.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110686363475575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13427913742728,0.0,0.0,0.0,0.0,0.0,0.218834141070836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.939443046803111,0.0,0.0,0.0,0.0,0.0,0.357568892126985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889329732996782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22521492930721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700441220022084,1.6762090551226,0.0,0.0,0.0,0.0,1.44526456614638,0.0,0.383689214317325,0.0,0.0,0.0,0.0,1.0,1.22583659574753,1.31795156033445,0.99710368703165,0.0,0.0,0.0,1.44325394830013,0.0,0.00418600599483917,0.0,0.0,0.0,0.0,0.0,0.0,0.157518319482216,0.0,0.0,0.0,0.0,0.110244186273209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369899973785845,2.55505143302811,0.0,0.463342609296841,0.0,0.0,0.226088384842823,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.47842109127488,2.38476342332125,0.0698115810371108,0.276804206873942,0.0,1.53514282355593,1.77391161515718,0.421465101754304,0.0,0.0,0.0,0.0,4.45530484778828,0.0,1.43798302409155,3.46965807176681,0.468528940277049,0.259853183829217,0.0,4.86988325473155,0.0,0.190659677933533,0.0,0.0,0.963116148760181,0.0,4.29930830894124,0.0,2.56201697590845,0.593423384852181,0.46165947868584,0.0,0.0,1.59497392171253,0.0,0.0,0.0,0.0,0.0368838512398189,0.0,4.24538684327048]}},...... I would really appreciate any advice here.
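For question 3, the same line-by-line idea shown in the first answer above applies: parse each row with jsonlite::fromJSON and keep only the nested pieces of interest before moving to the next row, so the full record never has to be retained. A minimal sketch on a heavily shortened stand-in for one row (field values abbreviated for illustration only):

```r
library(jsonlite)

# a heavily shortened stand-in for one row of the file, for illustration only
line <- '{"NonlightningIndices":[0,1,2],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.6]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0]}}]}'

rec <- fromJSON(line)

# keep only what is needed; the rest of the parsed record can then be discarded
wanted <- list(state  = rec$State$Features$Data,
               action = rec$SelectedAction)
```

Looping this over readLines of the file, and rm()-ing rec after each extraction, keeps memory bounded to one row at a time.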
Reading XML data into R from a html source
I'd like to import data into R from a given webpage, say this one. In the source code (but not on the actual page), the data I'd like to get is stored in a single line of javascript code which starts like this: chart_Line1.setDataXML("<graph rotateNames (stuff omitted) > <set value='699.99' name='16.02.2013' /> <set value='731.57' name='18.02.2013' /> <set value='more values' name='more dates' /> ... <trendLines> (now a different command starts, stuff omitted) </trendLines></graph>") (Note that I've included line breaks for readability; the data is in one single line in the original file. It would suffice to import only the line which starts with chart_Line1.setDataXML - it's line 56 in the source if you want to have a look yourself) I can read the whole html file into a string using scan("URLofFile", what="raw"), but how do I extract the data from this? Can I specify the data format with what="...", keeping in mind that there are no line breaks to separate the data, but several line breaks in the irrelevant prefix and suffix? Is this something which can be done in a nice way using R tools, or do you suggest that this data acquisition should rather be done with a different script?
With some trial & error, I was able to find the exact line where the data is contained. I read the whole html file and then dispose of all other lines.
require(zoo)
require(stringr)
# get html data, scrap all lines but the interesting one
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file = theurl, what = "character", sep = "\n")
sec <- sec[45]
# extract all strings of the form "value='X'", where X is a 1 to 3 digit number with some separator and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}.[0-9]{2}'")
# dispose of all non-numerical, non-separator values
values <- str_replace_all(unlist(values), "[^0-9/.]", "")
# get all dates in the form "name='DD.MM.YYYY'"
dates <- str_extract_all(sec, "name='[0-9]{2}.[0-9]{2}.[0-9]{4}'")
# dispose of all non-numerical, non-separator values
dates <- str_replace_all(unlist(dates), "[^0-9/.]", "")
# convert dates to canonical format
dates <- as.Date(dates, format = "%d.%m.%Y")
# put values and dates into a list of ordered observations, converting the values from characters to numbers first
MyZoo <- zoo(as.numeric(values), dates)
Is there any R package to convert PDF to HTML [duplicate]
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R? In Python there is PDFMiner, but I would like to keep this analysis all in R if possible. Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf. That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the pdf lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
              "56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The tabula PDF table extractor app is based around a command line application based on a Java JAR package, tabula-extractor. The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out. Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page. Data can be extracted from multiple pages, and a different area can be specified for each page, if required. For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
# Set path to pdftotext.exe and convert pdf to text
library(stringr)  # for str_sub
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for (i in 1:length(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", Processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}