browseVignettes: remove `source` and `R code` files from output - html

I wrote an HTML vignette for my R package hosted on GitHub. When I open it with browseVignettes, it opens flawlessly in the browser, showing this content:
Vignettes found by browseVignettes("package_name")
Vignettes in package package_name
package_name file_name - HTML source R code
Clicking on HTML, source, or R code opens the same vignette in three different versions.
However, I don't want the source and R code files to show.
Is there a way to output only the HTML file, as in the following output?
Vignettes found by browseVignettes("package_name")
Vignettes in package package_name
package_name file_name - HTML

You can't easily drop the source, but you can drop the R code by setting that component to blank. For example:
allfields <- browseVignettes()
# Blank out the "R" column so no R-code link is generated for any package
noR <- lapply(allfields, function(pkg) { pkg[, "R"] <- ""; pkg })
# Restore the class so the browseVignettes print method is still used
class(noR) <- class(allfields)
noR
If you really want to drop the source, then you'll need to get the print method and modify it:
print.browseVignettes <- utils:::print.browseVignettes
# Modify it as you like.
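For reference, one possible workflow (a hedged sketch only, since the body of this unexported method differs between R versions) is to inspect the copied method and edit the part that writes the source link:
body(print.browseVignettes)       # view the method body
fix(print.browseVignettes)        # edit it interactively; remove the source-link lines
browseVignettes("package_name")   # the modified method is now picked up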

Related

How to import multiple Factiva html files by loop

I am using the Factiva Package ‘tm.plugin.factiva’ to import html files containing a Factiva search. It has worked beautifully so far, but now I have a problem with importing data and constructing a corpus from several html files (350 in total). I cannot figure out how to write a loop to iterate the simple step-by-step import code I have used before.
Earlier, with a smaller sample, I managed to import the html files in a step-by-step process:
library(R.temis)
library(tm)
library(tm.plugin.factiva)
# Import corpus
source1 <- FactivaSource("Factiva1.html")
source2 <- FactivaSource("Factiva2.html")
source3 <- FactivaSource("Factiva3.html")
corp_source1 <- Corpus(source1, list(language=NA))
corp_source2 <- Corpus(source2, list(language=NA))
corp_source3 <- Corpus(source3, list(language=NA))
full_corpus <- c(corp_source1, corp_source2, corp_source3)
However, this is obviously not an option for the 350 html files. I have tried writing a loop for the import:
# Import corpus
files <- list.files(my_path)
for (i in files) {
  source <- FactivaSource(i)
}
tech_corpus <- Corpus(source, list(language = NA))
And:
htmlFiles <- Sys.glob("Factiva*.html")
for (k in 1:lengths(htmlFiles[[k]])) {
  source <- FactivaSource(htmlFiles[[k]])
}
But both of these only read the first html file into the source, not the rest.
I have also tried:
for (k in seq_along(htmlFiles)) {
  source <- FactivaSource(htmlFiles[1:k], encoding = "UTF-8", format = c("HTML"))
}
But then I get the error:
Error: x must be a string of length 1
I have tried converting htmlFiles to a list (html_list <- as.list(htmlFiles)), but the result did not change.
The two loops above did work, but only for the first html file.
I got the same result when I tried looping the corpus construction as well:
for (m in 1:lengths(htmlFiles)) {
  corp_source <- Corpus(htmlFiles[[m]], list(language = NA))
}
This worked, but again only for the first html file, and it produced the warning:
In 1:lengths(htmlFiles) :
  numerical expression has 5 elements: only the first used
I would highly appreciate any help understanding how to get around this issue. Ideally, a loop repeating the step-by-step process I used in the beginning would be perfect, as it seems that neither FactivaSource() nor Corpus() likes the complications I have introduced here - but I am far from an expert.
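For reference, a minimal sketch of the kind of loop described above (untested here, and assuming all 350 files match the Factiva*.html pattern in the working directory):
library(tm)
library(tm.plugin.factiva)
# Build one source per file, turn each into a corpus, then combine them,
# mirroring the manual three-file example above.
htmlFiles <- Sys.glob("Factiva*.html")
corpora <- lapply(htmlFiles, function(f) Corpus(FactivaSource(f), list(language = NA)))
full_corpus <- do.call(c, corpora)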

Display html report in jupyter with R

The qa() function of the ShortRead bioconductor library generates quality statistics from fastq files. The report() function then prepares a report of the various measures in an html format. A few other questions on this site have recommended using the display_html() function of IRdisplay to show html in jupyter notebooks using R (irkernel). However, it only throws errors for me when trying to display an html report generated by the report() function of ShortRead.
library("ShortRead")
sample_dir <- system.file(package="ShortRead", "extdata", "E-MTAB-1147") # A sample fastq file
qa_object <- qa(sample_dir, "*fastq.gz$")
qa_report <- report(qa_object, dest="test") # Makes a "test" directory containing 'image/', 'index.html' and 'QA.css'
library("IRdisplay")
display_html(file = "test/index.html")
Gives me:
Error in read(file, size): unused argument (size)
Traceback:
1. display_html(file = "test/index.html")
2. display_raw("text/html", FALSE, data, file, isolate_full_html(list(`text/html` = data)))
3. prepare_content(isbinary, data, file)
4. read_all(file, isbinary)
Is there another way to display this report in jupyter with R?
It looks like there's a bug in the code. The quick fix is to clone the GitHub repo and edit ./IRdisplay/R/utils.r: on line 38, change the line from:
read(file,size)
to
read(size)
save the file, switch to the parent directory, and create a new tarball, e.g.
tar -zcf IRdisplay.tgz IRdisplay/
and then re-install your new version, e.g. after re-starting R, type:
install.packages("IRdisplay.tgz", repos = NULL)
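Alternatively, a workaround sketch that avoids patching the package (assuming the installed IRdisplay accepts a data= argument, as the traceback above suggests; images referenced by relative paths in the report may not render this way):
library(IRdisplay)
# Read the generated report into one string and display it directly
html <- paste(readLines("test/index.html"), collapse = "\n")
display_html(data = html)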

R save FlexTable as html file in script

I have a FlexTable produced with the ReporteRs package which I would like to export as .html.
When I print the table to the viewer in RStudio I can do this by clicking on 'Export' and selecting 'Save as webpage'.
How would I replicate this action in my script?
I don't want to knit to an html document or produce a full report just yet; at present I just want separate files for each of my draft tables that I can share with collaborators (but nicely formatted so they are easy to read).
I have tried the as.html function and that does produce a .html file but all the formatting is missing (it is just plain text).
Here is a MWE:
# load libraries:
library(data.table)
library(ReporteRs)
library(rtable)
# Create dummy table:
mydt <- data.table(id = c(1,2,3), name = c("a", "b", "c"), fruit = c("apple", "orange", "banana"))
# Convert to FlexTable:
myflex <- vanilla.table(mydt)
# Attempt to export to html in script:
sink('MyFlexTable.html')
print(as.html(myflex))
sink()
# Alternately:
sink('MyFlexTable.html')
knit_print(myflex)
sink()
The problem with both methods demonstrated above is that they output the table without any formatting (no borders etc).
However, manually selecting 'Export' and 'Save as webpage' in RStudio renders the FlexTable to an html file with full formatting. Why is this?
This works for me:
writeLines(as.html(myflex), "MyFlexTable.html")
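As for why the sink() approach loses the formatting, at least in the print(as.html(myflex)) case: print() writes the HTML string as an R character vector, with an index prefix and escaped quotes, so the file is no longer valid markup, whereas writeLines() (or cat()) writes the raw string. A minimal illustration:
html_string <- '<td style="border:1px solid">a</td>'
print(html_string)        # [1] "<td style=\"border:1px solid\">a</td>"  <- not valid HTML
writeLines(html_string)   # <td style="border:1px solid">a</td>          <- raw markup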

Is there any R package to convert PDF to HTML [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command line application based on a Java JAR package, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
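A minimal usage sketch (assuming the tabulizer package and its Java dependency are installed; "report.pdf" is a placeholder path):
library(tabulizer)
# Let Tabula guess the table locations; returns a list with one matrix per detected table
tables <- extract_tables("report.pdf")
str(tables)
# A specific region can also be targeted, e.g. on page 1:
# extract_tables("report.pdf", pages = 1, area = list(c(100, 50, 400, 550)), guess = FALSE)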
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDFs to text:
library(stringr)  # for str_sub()
# pdfFracList (the PDF file names) and reportDir are assumed to be defined elsewhere
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for (i in 1:length(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", Processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}

Yahoo Finance Headlines webpage scraping with R

I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the "headlines" and collect them in Excel. Unfortunately I cannot find and select the HTML nodes corresponding to the headlines once I download the source file to R.
Let me show the problem with an example.
I started with
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.cvs"
download.file(url = source, destfile = file)
x = scan(file, what = "", sep = "\n")
producing the Excel file finance_file.cvs and, most importantly, the character vector x.
Using x I would like to collect the headlines and write them into a column in a second Excel file, called headlines.cvs.
My problem now is the following: if I select any headline I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.
For the extraction I was thinking of
x = x[grep("some string of characters to do the job", x)]
but I am no expert in web scraping.
Any ideas/suggestions?
I thank you very much!
You can use the XML package and write the XPath query needed to extract the headlines.
Since the web page looks like:
...
<ul class="newsheadlines"/>
<ul>
<li>First headline</li>
...
you get the following query.
library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
d <- htmlParse(source)
xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)
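To finish the task in the question, a hedged follow-up sketch (not part of the original answer): collect the headlines into a data frame and write them to a CSV file that Excel can open.
library(XML)
d <- htmlParse("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
headlines <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)
write.csv(data.frame(headline = headlines), "headlines.csv", row.names = FALSE)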