I created a plot along the lines of this:
http://www.buildingwidgets.com/blog/2015/7/2/week-26-sunburstr
# devtools::install_github("timelyportfolio/sunburstR")
library(sunburstR)
# read in sample visit-sequences.csv data provided in source
# https://gist.github.com/kerryrodden/7090426#file-visit-sequences-csv
sequence_data <- read.csv(
paste0(
"https://gist.githubusercontent.com/kerryrodden/7090426/"
,"raw/ad00fcf422541f19b70af5a8a4c5e1460254e6be/visit-sequences.csv"
)
,header=F
,stringsAsFactors = FALSE
)
In RStudio I can click "Export > Save as Web Page..." in the Viewer, which saves the plot as an interactive HTML document. I would like to do this as part of the code. How do I save a plot to HTML using R code? There are plenty of examples for PDF, JPG, etc., but not for HTML.
Store the sunburst in a variable and use htmltools::save_html to save it.
plot <- sunburst(sequence_data)
htmltools::save_html(plot, file = "C:/Users/User/Desktop/sunburst.html")
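A possible alternative, sketched under some assumptions: since sunburstR produces an htmlwidget, `htmlwidgets::saveWidget` should also be able to write it to disk. The toy data frame and the output location below are made up for illustration, and the sketch assumes the sunburstR and htmlwidgets packages are installed.

```r
library(sunburstR)
library(htmlwidgets)

# Hypothetical toy data: sequence strings (steps joined by "-") and their counts
toy_sequences <- data.frame(
  path  = c("home-search-end", "home-product-end", "search-product-end"),
  count = c(10, 5, 3),
  stringsAsFactors = FALSE
)

# Keep the widget in a variable instead of printing it to the Viewer
sb <- sunburst(toy_sequences)

# Write the widget to an HTML file; with selfcontained = FALSE the JavaScript
# and CSS dependencies are placed in a folder next to the file
out_file <- file.path(tempdir(), "sunburst.html")
saveWidget(sb, file = out_file, selfcontained = FALSE)
```

Setting `selfcontained = TRUE` bundles everything into a single file, which is closer to what the RStudio export produces, but it relies on pandoc being available.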
I am using the R programming language. I am interested in seeing if it is somehow possible to take an html file (generated using the "plotly" library) and then insert that file into a MS Powerpoint slideshow.
I was looking at other stackoverflow questions where similar things were attempted: Exporting PNG files from Plotly in R
Suppose I generate this simple, interactive plot using plotly in R:
library(plotly)
x <- rnorm(100,10,10)
color <- rnorm(100, 2,1)
frame = data.frame(x,color)
p = plot_ly(type = "scatter", mode = "markers", data = frame, x = ~x, y = " ", color = ~color )
Using the "htmlwidgets" library, I can save the object "p" as an HTML file. But is there a way to insert this object "p" into an MS PowerPoint presentation? Preferably an offline way that does not use the internet or require any additional software to be installed.
I tried using the "insert html" functionality in MS Powerpoint, but this just produces a "grey square" that doesn't load when you play the presentation.
Can someone please tell me if this is possible?
Thanks
I have several HTML files in a folder on my PC. I would like to read them into R, keeping the original format as much as possible. They contain only text, by the way. I have tried two approaches, both of which failed miserably:
##first approach
library (tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))
## second approach
list_files_path<- list.files(path = './gazzetes.presihtml')
a<- paste0(list_files_path, names) # vector names contain the names of the file with the .HTML extension
rawHTML <- readLines(a)
Any guess? all the best
Your second approach is close to working, except that readLines only accepts one connection, but you are giving it a vector with multiple files. You can use lapply with readLines to achieve this. Here is an example:
# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
# readLines for each file and put them in a list
lineList <- lapply(files, readLines)
# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)
# collapse the character vector into a single string
html <- paste(lineVector , collapse = '\n')
# print the string with original formatting
cat(html)
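If each file should stay separate rather than being merged into one string, a variant along these lines might help. It is only a sketch: the demo folder and its two files are created on the fly so the example runs on its own, and `html_dir` stands in for your real folder.

```r
# Hypothetical folder with two small HTML files, created only for this demo
html_dir <- file.path(tempdir(), "html_demo")
dir.create(html_dir, showWarnings = FALSE)
writeLines("<html><body><p>First</p></body></html>",
           file.path(html_dir, "a.html"))
writeLines(c("<html><body>", "<p>Second</p>", "</body></html>"),
           file.path(html_dir, "b.html"))

# full.names = TRUE returns complete paths, so readLines can open each file
# without pasting the folder name on by hand
files <- list.files(html_dir, pattern = "\\.html?$",
                    full.names = TRUE, ignore.case = TRUE)

# One string per file, with the original line breaks preserved
html_by_file <- vapply(
  files,
  function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
  character(1)
)
names(html_by_file) <- basename(files)

# Print one document with its original formatting
cat(html_by_file[["b.html"]])
```

The named vector makes it easy to look up a document by file name later, for example when building a corpus.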
I have a FlexTable produced with the ReporteRs package which I would like to export as .html.
When I print the table to the viewer in RStudio I can do this by clicking on 'Export' and selecting 'Save as webpage'.
How would I replicate this action in my script?
I don't want to knit to a html document or produce a report just yet as at present I just want separate files for each of my draft tables which I can share with collaborators (but nicely formatted so they are easy to read).
I have tried the as.html function and that does produce a .html file but all the formatting is missing (it is just plain text).
Here is a MWE:
# load libraries:
library(data.table)
library(ReporteRs)
library(rtable)
# Create dummy table:
mydt <- data.table(id = c(1,2,3), name = c("a", "b", "c"), fruit = c("apple", "orange", "banana"))
# Convert to FlexTable:
myflex <- vanilla.table(mydt)
# Attempt to export to html in script:
sink('MyFlexTable.html')
print(as.html(myflex))
sink()
# Alternately:
sink('MyFlexTable.html')
knit_print(myflex)
sink()
The problem with both methods demonstrated above is that they output the table without any formatting (no borders etc).
However, manually selecting 'export' and 'save as webpage' in RStudio renders the FlexTable to a html file with full formatting. Why is this?
This works for me:
writeLines(as.html(myflex), "MyFlexTable.html")
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the lines of the PDF in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around tabula-extractor, a command-line application distributed as a Java JAR package.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get back the data extracted from its tables.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
# Set path to pdftotext.exe and convert PDF to text
library(stringr) # provides str_sub
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in seq_along(pdfFracList)){
# strip the ".pdf" extension (last four characters) from the file name
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir,"/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir,"/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table" , pdfSource, txtDestination, sep = " "), wait = TRUE)
}
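One thing to watch with the loop above is that paths containing spaces will break the pasted command line. A hedged sketch of the same idea using `system2` and `shQuote` follows; the helper name `pdf_to_txt`, its arguments, and the example file names are all hypothetical, and the `-table` flag is simply carried over from the answer above.

```r
# Hypothetical helper: convert one PDF with an external pdftotext binary.
# Returns the path of the text file the converter is asked to write.
pdf_to_txt <- function(pdf_path, exe = "pdftotext", extra_args = "-table") {
  # Same name as the PDF, with the extension swapped for .txt
  txt_path <- paste0(tools::file_path_sans_ext(pdf_path), ".txt")

  # shQuote protects paths that contain spaces
  args <- c(extra_args, shQuote(pdf_path), shQuote(txt_path))

  exe_found <- nzchar(Sys.which(exe)) || file.exists(exe)
  if (exe_found && file.exists(pdf_path)) {
    status <- system2(exe, args = args)
    if (status != 0) warning("Conversion failed for ", pdf_path)
  } else {
    message("Skipping (converter or PDF not found): ", pdf_path)
  }
  txt_path
}

# Hypothetical file names; these are skipped unless they exist on disk
pdf_files <- c("reports/frac 001.pdf", "reports/frac 002.pdf")
txt_files <- vapply(pdf_files, pdf_to_txt, character(1), USE.NAMES = FALSE)
```

The resulting text files can then be read with `readLines` as usual.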
I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.
Essentially, this is what I want to do:
1. Create a Corpus from multiple html files.
I tried to use DirSource:
library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))
but it returns "invalid directory parameters"
2. Read in html files from the Corpus all at once.
Not sure how to do it.
3. Parse them, convert them to plain text, remove tags.
Many people suggested using XML, however, I didn't find a way to process multiple files. They are all for one single file.
Thanks very much.
This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.
# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))
# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))
# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
# in tm >= 0.6, base functions such as tolower must be wrapped in content_transformer
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,]
inspect(a.dtm2)
# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
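a.dtm.df <- as.data.frame(as.matrix(a.dtm3)) # as.matrix() returns the full matrix regardless of tm version
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R 3.1.0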
plot(fit)
# just for fun...
library(wordcloud)
library(RColorBrewer)
m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
This will correct the error.
b<-Corpus(a, ## I replaced DirSource(a) with a
readerControl=list(language="eng", reader=readPlain))
But I think that to read your HTML you need to use an XML reader. Something like:
r <- Corpus(DirSource('C:/test'), # forward slash: the '\t' in 'c:\test' would be read as a tab
readerControl = list(reader = readXML(spec = spec, doc = PlainTextDocument())))
But you need to supply the spec argument, which depends on your file structure.
See, for example, readReut21578XML. It is a good example of an XML/HTML parser.
To read all the html files into an R object you can use
# Set variables
folder <- 'C:/test'
extension <- '\\.html?$' # regular expression matching names that end in .htm or .html
# Get the names of the *.htm and *.html files in the folder
files <- list.files(path=folder, pattern=extension)
# Read all the files into a list
htmls <- lapply(X=files,
FUN=function(file){
.con <- file(description=paste(folder, file, sep='/'))
.html <- readLines(.con)
close(.con)
names(.html) <- file
.html
})
That will give you a list, and each element is the HTML content of each file.
I'll post later on parsing it, I'm in a hurry.
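For the parsing step, one possible approach (a sketch, not necessarily what the author of the answer above had in mind) is to parse each element of the list with the XML package and keep only the visible text nodes. The helper name `html_to_text` and the two inline documents are hypothetical stand-ins for the `htmls` list built above.

```r
library(XML)

# Hypothetical helper: return the visible text of one HTML document,
# given its lines as a character vector
html_to_text <- function(html_lines) {
  doc <- htmlParse(paste(html_lines, collapse = "\n"), asText = TRUE)
  on.exit(free(doc))

  # Collect text nodes inside <body>, skipping script and style content
  node_text <- xpathSApply(
    doc,
    "//body//text()[not(ancestor::script) and not(ancestor::style)]",
    xmlValue
  )

  # Join the pieces and collapse runs of whitespace into single spaces
  trimws(gsub("\\s+", " ", paste(unlist(node_text), collapse = " ")))
}

# Stand-in for the list produced by the lapply call above
htmls <- list(
  c("<html><body><h1>Title</h1>", "<p>Some <b>bold</b> text.</p></body></html>"),
  "<html><body><script>var x = 1;</script><p>Second page</p></body></html>"
)

# One plain-text string per file
texts <- vapply(htmls, html_to_text, character(1))
```

The resulting character vector can be passed to `Corpus(VectorSource(texts))` if the tm package is used for the text mining step.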
I found the package boilerpipeR particularly useful to extract only the "core" text of an html page.