How to hand over a .pdf or .png object using R in Foundry Code Workbook? - palantir-foundry

I have created formatted tables using R and the package gt in Code Workbook. I would like to save them as .png or .pdf files so they can be used in other Foundry applications. Currently, I can only save them as .html files.
An example of what my current code looks like:
table <- function(metrics_national) {
  library(tidyverse)
  library(webshot)
  library(gt)

  table <- metrics_national %>%
    gt() %>%
    tab_header(title = "My table")

  output <- new.output()
  output_fs <- output$fileSystem()
  table %>% gtsave(filename = output_fs$get_path("table.html", 'w'))

  return(table)
}
This runs and saves the .html file as expected, but when I change the file extension to .png or .pdf instead of .html it stops working.
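One approach, not from the original question, is to render the table to a temporary local file first and then copy it into the output filesystem. This is only a sketch: it assumes the webshot2 package and a headless Chrome it can drive are available in the Code Workbook environment, since gtsave() relies on them to produce .png or .pdf output.

save_table_png <- function(metrics_national) {
  library(tidyverse)
  library(gt)
  library(webshot2)  # assumption: installed in the environment; needed by gtsave() for .png/.pdf

  tbl <- metrics_national %>%
    gt() %>%
    tab_header(title = "My table")

  # Render the table to a local temporary file first
  tmp_png <- file.path(tempdir(), "table.png")
  gtsave(tbl, filename = tmp_png)

  # Then copy the rendered file into the output filesystem
  output <- new.output()
  output_fs <- output$fileSystem()
  file.copy(tmp_png, output_fs$get_path("table.png", 'w'), overwrite = TRUE)

  return(tbl)
}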

Related

How to import multiple Factiva html files in a loop

I am using the Factiva Package ‘tm.plugin.factiva’ to import html files containing a Factiva search. It has worked beautifully so far, but now I have a problem with importing data and constructing a corpus from several html files (350 in total). I cannot figure out how to write a loop to iterate the simple step-by-step import code I have used before.
Earlier, with a smaller sample, I managed to import the html files in a step-by-step process:
library(R.temis)
library(tm)
library(tm.plugin.factiva)
# Import corpus
source1 <- FactivaSource("Factiva1.html")
source2 <- FactivaSource("Factiva2.html")
source3 <- FactivaSource("Factiva3.html")
corp_source1 <- Corpus(source1, list(language=NA))
corp_source2 <- Corpus(source2, list(language=NA))
corp_source3 <- Corpus(source3, list(language=NA))
full_corpus <- c(corp_source1, corp_source2, corp_source3)
However, this is obviously not an option for the 350 html files. I have tried writing a loop for the import:
# Import corpus
files <- list.files(my_path)
for (i in files) {
  source <- FactivaSource(i)
}
tech_corpus <- Corpus(source, list(language=NA))
And:
htmlFiles <- Sys.glob("Factiva*.html")
for (k in 1:lengths(htmlFiles[[k]])){
  source <- FactivaSource(htmlFiles[[k]])
}
But both of these only read the first html file into the source, not the rest.
I have also tried:
for (k in seq_along(htmlFiles)){
  source <- FactivaSource(htmlFiles[1:k], encoding = "UTF-8", format = c("HTML"))
}
But then I get the error:
Error: x must be a string of length 1.
I have tried manipulating htmlFiles into a list (html_list <- as.list(htmlFiles)), but with no change in the result.
The first two loops did run, but again only for the first html file.
I got the same result when I tried looping over the corpus construction as well:
for (m in 1:lengths(htmlFiles)){
  corp_source <- Corpus(htmlFiles[[m]], list(language=NA))
}
This also worked, but again only for the first html file, and it produced the warning:
In 1:lengths(htmlFiles) :
  numerical expression has 5 elements: only the first used
I would highly appreciate any help to understand how to get around this issue. Ideally, a loop to repeat the step-by-step process I used in the beginning would be super, as it seems to me that neither FactivaSource() nor Corpus() likes the complications I have introduced here, but I am far from an expert. Any help will be highly appreciated!
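For what it's worth, the loops above assign to a single object on every pass, so only one file is ever kept. A minimal sketch of one way to repeat the step-by-step process over all files, assuming each file parses the same way as in the working example:

library(tm)
library(tm.plugin.factiva)

htmlFiles <- Sys.glob("Factiva*.html")

# One source and one corpus per file, kept in lists instead of being overwritten
sources <- lapply(htmlFiles, FactivaSource)
corpora <- lapply(sources, function(s) Corpus(s, list(language = NA)))

# Combine the per-file corpora into one, as in the step-by-step version
full_corpus <- do.call(c, corpora)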

browseVignettes: remove `source` and `R code` files from output

I wrote an HTML vignette for my R package hosted on GitHub. When I open it with browseVignettes, it opens flawlessly in the browser showing this content:
Vignettes found by browseVignettes("package_name")
Vignettes in package package_name
package_name file_name - HTML source R code
Clicking on HTML, source, or R code opens the same file in three different versions.
However, I don't need the source and the R code files to show.
Is there a way to output only the HTML file? As in the following output
Vignettes found by browseVignettes("package_name")
Vignettes in package package_name
package_name file_name - HTML
You can't easily drop the source, but you can drop the R code by setting that component to blank. For example,
allfields <- browseVignettes()
noR <- lapply(allfields, function(pkg) {pkg[,"R"] <- ""; pkg})
class(noR) <- class(allfields)
noR
If you really want to drop the source, then you'll need to get the print method and modify it:
print.browseVignettes <- utils:::print.browseVignettes
# Modify it as you like.

Import multiple json files from a directory and attaching the data

I am trying to read multiple json files from a working directory so I can convert them into a dataset. I have files text1, text2, text3 in the directory json. Here is the code I wrote:
setwd("Users/Desktop/json")
temp = list.files(pattern="text*.")
myfiles = lapply(temp, read.delim)
library("rjson")
json_file <- "myfiles"
library(jsonlite)
out <- jsonlite::fromJSON(json_file)
out[vapply(out, is.null, logical(1))] <- "none"
data.frame(out, stringsAsFactors = FALSE)[,1:5]
View(out)
I have about 200 files, so I was wondering if there is a way in which all the json files can be imported.
Thanks
I think that I had a similar problem when working with Twitter data. I had a directory containing separate files for each user name, and I wanted to import/analyze them as a group. This worked for me:
library(rjson)
filenames <- list.files("Users/Desktop/json", pattern="*.json", full.names=TRUE) # this should give you a character vector, with each file name represented by an entry
myJSON <- lapply(filenames, function(x) fromJSON(file=x)) # a list in which each element is one of your original JSON files
If this doesn't work, then I need a bit more information to understand your problem.
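If the goal is a single dataset, one possible follow-up (not from the original answer, and assuming the files share a consistent tabular structure) is to flatten each parsed file and bind the rows together:

library(jsonlite)

filenames <- list.files("Users/Desktop/json", pattern = "\\.json$", full.names = TRUE)

# Parse each file; flatten = TRUE unnests simple nested fields into columns
parsed <- lapply(filenames, function(x) fromJSON(x, flatten = TRUE))

# Stack the per-file results into one data frame
# (assumes each file parses to a data frame with compatible columns)
combined <- do.call(rbind, parsed)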

create a Corpus from many html files in R

I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.
Essentially, this is what I want to do:
Create a Corpus from multiple html files.
I tried to use DirSource:
library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))
but it returns "invalid directory parameters"
Read in html files from the Corpus all at once.
Not sure how to do it.
Parse them, convert them to plain text, remove tags.
Many people suggested using the XML package; however, I didn't find a way to process multiple files. The examples are all for a single file.
Thanks very much.
This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.
# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))
# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))
# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, PlainTextDocument)
a <- tm_map(a, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,]
inspect(a.dtm2)
# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method="ward.D")
plot(fit)
# just for fun...
library(wordcloud)
library(RColorBrewer)
m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
This will correct the error.
b <- Corpus(a, ## changed DirSource(a) to a
            readerControl=list(language="eng", reader=readPlain))
But I think that to read your HTML you need to use an XML reader. Something like:
r <- Corpus(DirSource('c:/test'),
            readerControl = list(reader = readXML), spec)
But you need to supply the spec argument, which depends on your file structure.
See readReut21578XML for example; it is a good example of an xml/html parser.
To read all the html files into an R object you can use
# Set variables
folder <- 'C:/test'
extension <- '.htm'

# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)

# Read all the files into a list
htmls <- lapply(X=files,
                FUN=function(file){
                  .con <- file(description=paste(folder, file, sep='/'))
                  .html <- readLines(.con)
                  close(.con)
                  names(.html) <- file
                  .html
                })
That will give you a list, and each element is the HTML content of each file.
I'll post later on parsing it, I'm in a hurry.
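In the meantime, a minimal sketch of that parsing step (not part of the original answer), using the XML package to strip tags by pulling the text out of the paragraph nodes:

library(XML)

# htmls is the list produced above: one character vector of HTML lines per file
texts <- lapply(htmls, function(lines) {
  doc <- htmlParse(paste(lines, collapse = "\n"), asText = TRUE)
  # Extract the visible text of all <p> nodes; adjust the XPath to your pages
  paragraphs <- xpathSApply(doc, "//p", xmlValue)
  paste(paragraphs, collapse = " ")
})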
I found the package boilerpipeR particularly useful to extract only the "core" text of an html page.

Parsing multiple files for HTML tables and appending to single file in R

This is a small project in R that I am attempting to execute. I've scraped a few hundred html pages. I am able to use the readHTMLTable function in the XML library to read the tables that I'm interested in. However, I'm having trouble writing the for loop to loop through the directory, grab the table from each file, and append them to a single CSV file.
I HAVE been successful in looping through the files and saving each table to its own txt file (which I feel is at least a start):
library(XML) # readHTMLTable
parentpath <- "Z:/scraping"
setwd(parentpath)
filenames <- list.files()
for (targetfile in filenames){
  setwd(parentpath)
  data <- readHTMLTable(targetfile)
  outputfile <- paste(targetfile, '.txt', sep="")
  write.table(data[6], file = outputfile, sep = "\t", quote=TRUE)
}
Shouldn't the append=TRUE option in write.table do the trick for you? You can read about it by looking up ?write.table.
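As a rough sketch of how that could look in the loop above (the combined file name and the table index 6 carried over from the question are just placeholders): write the column names once, then append each file's table without repeating them.

library(XML)

parentpath <- "Z:/scraping"
filenames <- list.files(parentpath, full.names = TRUE)
outputfile <- file.path(parentpath, "all_tables.csv")  # hypothetical combined output file

first <- TRUE
for (targetfile in filenames) {
  tables <- readHTMLTable(targetfile)
  tab <- tables[[6]]                      # the table of interest, as in the question
  write.table(tab, file = outputfile, sep = ",", quote = TRUE,
              row.names = FALSE,
              col.names = first,          # header only on the first write
              append = !first)
  first <- FALSE
}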