I am trying to follow the example given here under "A polar example".
This example uses sea-ice data in .bin format to plot as a raster. I am trying the same with a different file available from the original FTP server of the National Snow and Ice Data Center, so I assume there should be no issue. However, when I run the following script in R:
# from NSIDC sea ice concentration file
library(raster)

baseurl <- "ftp://sidads.colorado.edu/pub/DATASETS/"
f2 <- paste(baseurl,
            "nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/daily/2013/nt_20130111_f17_v1.1_n.bin",
            sep = '')
if (!file.exists(basename(f2))) download.file(f2, basename(f2), mode = "wb")
ice2 <- raster(basename(f2))
I get the following error:
Error in .rasterObjectFromFile(x, band = band, objecttype = "RasterLayer", :
  Cannot create a RasterLayer object from this file.
Where am I going wrong? Is the .bin file corrupted? Any help is appreciated!
Thanks!
OK, I found a solution on GitHub that works really nicely:
https://github.com/cran/raster/blob/master/R/nsidcICE.R
Just replace line 14 of that script,
hemi <- tolower(substr(bx, 21L, 21L))
with
hemi <- tolower(substr(bx, 22L, 22L))
because the new filename structure differs from the original one by a single character, namely the dot in the v1.1 version string, which shifts the hemisphere letter by one position.
Compare:
"nt_19781119_f07_v01_s.bin"
to the version I was interested in
"nt_20130111_f17_v1.1_n.bin"
Related
I have a large .json file and I only want to read in a part of it.
I tried the following solutions, but they didn't work:
yelp <- stream_in(file("yelp_academic_dataset_review.json"), paigesize = 500)
yelp <- stream_in(file("yelp_academic_dataset_review.json"), nrows = 500)
Anyone know how it works?
First off, it is always helpful to mention the packages you are using; in your case, jsonlite.
One solution is parsing the data file (as a .txt file) prior to streaming it in.
yelp <- readLines("yelp_academic_dataset_review.json")[1:500]
yelp <- stream_in(textConnection(gsub("\\n", "", yelp)))
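Note that readLines(...)[1:500] still reads the entire file before discarding the rest. If the file is too large for that, you can stop after 500 lines by reading from an open connection instead (a small sketch, assuming the file is newline-delimited JSON):

library(jsonlite)

con <- file("yelp_academic_dataset_review.json", "r")
first500 <- readLines(con, n = 500)   # stops after 500 lines
close(con)
yelp <- stream_in(textConnection(first500))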
I'm assuming your file is local?
I have had success with actually piping/streaming JSON in the past, i.e. from the command line:
cat x.json | parse_json.py
Then you write your python script:
import json, sys

for line in sys.stdin:
    try:
        js_line = json.loads(line.rstrip())
        # do something with js_line['x']['y']
    except ValueError:
        pass
I'm not sure why you want to use stream_in, but this somewhat manual approach can be effective.
I use this code for extracting lines 1400001 to 1450000 of the yelp reviews:
setwd("d:/yelp_dataset")
rm(list=ls())
library(jsonlite)
rev<- 'd:/yelp_dataset/review.JSON'
revu <- jsonlite::stream_in(textConnection(readLines(rev)[1400001:1450000]), verbose = FALSE)
I cannot find an answer on how to load a 7z file in R. I can't use this:
s <- system("7z e -o <path> <archive>")
because it fails with error 127. Maybe that's because I'm on Windows? However, the 7z archive opens fine when I click it in Total Commander.
I'm trying something like this:
con <- gzfile(path, 'r')
ff <- readLines(con, encoding = "UTF-8")
h <- fromJSON(ff)
I get this error:
Error: parse error: trailing garbage
7z¼¯' ãSp‹ Ë:ô–¦ÐÐY#4U¶å¿ç’
(right here) ------^
The encoding is clearly broken here, yet when I load the same file uncompressed it is fine without specifying any encoding (and the uncompressed file is about twice as long). I have thousands of 7z files and need to read them one by one in a loop: read, analyze, and move on. Could anyone give me some hints on how to do this efficiently?
When uncompressed it easily works using:
library(jsonlite)
f <- read_json(path, simplifyVector = T)
EDIT
There are many JSON files inside one 7z archive. The above error is probably caused by the parser reading the raw bytes of the whole archive. I don't know how to get at the individual files inside it or how to specify the connection attributes.
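To illustrate what I am aiming for, here is a rough sketch of the loop I have in mind (the 7-Zip location and data paths are placeholders; error 127 usually means the executable cannot be found, which is why the full path to 7z.exe is passed to system2):

library(jsonlite)

seven_zip <- "C:/Program Files/7-Zip/7z.exe"                    # placeholder: adjust to your installation
archives  <- list.files("d:/data", pattern = "\\.7z$", full.names = TRUE)

for (a in archives) {
  out_dir <- tempfile("unpacked_")
  dir.create(out_dir)
  # extract the archive into a temporary directory
  system2(seven_zip, args = c("e", a, paste0("-o", out_dir), "-y"))
  # read each extracted JSON file individually instead of the raw .7z bytes
  for (j in list.files(out_dir, pattern = "\\.json$", full.names = TRUE)) {
    dat <- read_json(j, simplifyVector = TRUE)
    # ... analyze dat ...
  }
  unlink(out_dir, recursive = TRUE)
}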
The official Premier League website provides various statistics for the league's teams over seasons (e.g. this one). I used the function readHTMLTable from the XML R package to retrieve those tables. However, I noticed that the function cannot read the tables for May, while for other months it works well. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Can someone please explain why this happens and how it can be fixed?
EDIT after zyurnaidi's answer:
Below is code that does the job without any manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file(url)
raw <- readLines(con)
close(con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the page source is what messes things up
raw <- gsub(pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable(doc = raw, which = 1)
df[complete.cases(df), ] ## correct table
OK. There are a few hints that point to the problem here:
1. The issue happens consistently in May, the last month of each season, so there must be something unique to this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated document. The table and the HTML file are simply cut off after the first team in the table.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the HTML file itself, I found that there are (unencoded?) character issues in it. My guess is that this is caused by the cute little trophy icons shown after the team names.
Anyway, to solve this issue, you need to take out these broken characters. Instead of editing the downloaded HTML files, my suggestion is:
1. View the page source of the EPL URL for May's league table.
2. Copy everything and paste it into a text editor, then save it as an HTML file.
3. You can now use either htmlParse or readHTMLTable on that file.
There might be a better way to automate this, but I hope it helps.
I am trying to plot a graph in R that is populated by MySQL query results. I have the following code:
rs = dbSendQuery(con, "SELECT BuildingCode, AccessTime from access")
data = fetch(rs, n=-1)
x = data[,1]
y = data[,2]
cat(colnames(data),x,y)
This gives me an output of:
BuildingCode AccessTime TEST-0 TEST-1 TEST-2 TEST-3 TEST-4 14:40:59 07:05:00 20:10:59 08:40:00 07:30:59
But this is where I get stuck. I have no idea how to pass the "cat" data into an R plot. I have spent hours searching online, and most of the examples of R plots I have come across use read.table(text=""). This is not feasible for me, as the data has to come from a database and cannot be hard-coded. I also found something about saving the output as a CSV, but MySQL cannot overwrite existing files, so after the code was executed once I was unable to run it again because the file already existed.
My question is: how can I use the "cat" data (or another approach, if there is a better one) to plot a graph with data that isn't hard-coded?
Note: I am using RApache as my web server and I have installed the Brew package.
Make the plot using R and just pass the path to the file back in cat:
<%
## Your other code to get the data, assuming it gets a data.frame called data
## Plot code
library(Cairo)
myplotfilename <- "/path/to/dir/myplot.png"
CairoPNG(filename = myplotfilename, width = 480, height = 480)
plot(x=data[,1],y=data[,2])
tmp <- dev.off()
cat(myplotfilename)
%>
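The returned path then only needs to be web-accessible. For example, the calling brew template could turn it into an image tag instead of returning the raw path (a sketch; adjust the path so it maps to a URL that RApache actually serves):

<%
cat(sprintf('<img src="%s" width="480" height="480" />', myplotfilename))
%>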
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
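If you want to stay inside R, the same conversion could be wrapped in a system() call (a sketch, assuming pdftotext is on the PATH and foo.pdf is a placeholder file name):

# convert foo.pdf to foo.txt, preserving the page layout, then read it back in
system("pdftotext -layout foo.pdf foo.txt")
txt <- readLines("foo.txt")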
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line Java application, the tabula-extractor JAR package.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
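A minimal sketch of what that looks like, assuming the tabulizer package (and a working Java installation) is available; "report.pdf" is a placeholder:

library(tabulizer)

# let Tabula guess where the tables are on each page
tables <- extract_tables("report.pdf")
# each element is a character matrix; turn the first one into a data frame
first <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)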
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDFs to text:
library(stringr)   # for str_sub

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

# pdfFracList (the PDF file names) and reportDir are defined earlier in the script
for (i in 1:length(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)   # file name without the ".pdf" extension
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", Processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}