knitr/rmarkdown - reducing html file size - html

I want to produce an html document using knitr/rmarkdown. Currently, the file is over 20MB and I'm trying to find a way to reduce it. The large file size is probably due to my plots which have a lot of points in them.
If I change my output type to pdf, I can get it down to 1.7MB. I'm wondering if there is a way to reduce my file while keeping it as a html.
EDIT: Here's a minimal working example which I did in RStudio.
---
title: "Untitled"
author: "My Name"
date: "September 7, 2015"
output: html_document
---
```{r}
library(ggplot2)
knitr::opts_chunk$set(dev='svg')
```
```{r}
set.seed(1)
mydf <- data.frame(x=rnorm(2e4),y=rnorm(2e4))
ggplot(mydf, aes(x,y)) + geom_point(alpha=0.6)
```
I also noticed that if I have too many observations, the plot doesn't get generated at all. I just get an empty box with a question mark in the output.
```{r}
set.seed(2)
mydf <- data.frame(x=rnorm(5e4),y=rnorm(5e4))
ggplot(mydf, aes(x,y)) + geom_point(alpha=0.6)
# ...plot doesn't appear in output
```

Following the suggestion of @daroczig to use the "dpi" knitr chunk option, I modified your code as shown below.
You had set the dev chunk option equal to "svg", which produces very large vector graphics files, especially for images made up of many elements (points, lines, etc.)
I set the dev chunk option back equal to "png", which is the default raster graphics format for HTML output. So you don't need to touch it at all. Keeping the dev chunk option equal to "png" dramatically reduces the HTML output file size.
I set the dpi chunk option equal to 36 (72 is the default), to lower the image resolution, and decrease the HTML output file size further.
I set the out.width and out.height chunk options equal to "600px", to increase the image dimensions.
You can change the dpi, out.width, and out.height options, until you get the HTML output file size and the image dimension to what you want. There's a trade-off between output file size and image resolution.
After knitting the code, I got an HTML output file size equal to 653kB, even when plotting 5e4 data points.
---
title: "Change size of output HTML file by reducing resolution of plot image"
author: "My Name"
date: "September 7, 2015"
output: html_document
---
```{r}
# load ggplot2 silently
suppressWarnings(library(ggplot2))
# chunk option dev="svg" produces very large vector graphics files
knitr::opts_chunk$set(dev="svg")
# chunk option dev="png" is the default raster graphics format for HTML output
knitr::opts_chunk$set(dev="png")
```
```{r, dpi=36, out.width="600px", out.height="600px"}
# chunk option dpi=72 is the default resolution
set.seed(1)
mydf <- data.frame(x=rnorm(5e4),y=rnorm(5e4))
ggplot(mydf, aes(x,y)) + geom_point(alpha=0.6)
```
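To make the trade-off concrete, here is a rough arithmetic sketch (not part of the original answer). It assumes a figure size of 7 x 5 inches, which you should replace with your own fig.width and fig.height, and it ignores any extra scaling that the fig.retina option may apply. The helper name pixel_dims is made up for illustration.

```r
# Pixel dimensions of a raster figure: size in inches times dots per inch.
# Ignores any additional scaling from the fig.retina chunk option.
pixel_dims <- function(fig_width, fig_height, dpi) {
  c(width = fig_width * dpi, height = fig_height * dpi)
}

# Assumed figure size of 7 x 5 inches (adjust to your own chunk options)
pixel_dims(7, 5, dpi = 72)  # width 504, height 360
pixel_dims(7, 5, dpi = 36)  # width 252, height 180
```

Halving dpi leaves a quarter of the pixels in the PNG, which is where the file-size saving comes from. Setting out.width = "600px" then asks the browser to stretch those pixels, so the plot gets larger on screen but also softer.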

To prevent scatterplots with many points from blowing up the size of your vector graphics (and accordingly the html output), you can use geom_point_rast() from the ggrastr package. Eat the cake and have it too!
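A minimal sketch of that idea, assuming ggplot2 and ggrastr are installed from CRAN. In current ggrastr releases the layer is exported as geom_point_rast(). The raster.dpi value of 72 is only an illustrative choice, and the data mirror the question's second example.

```r
# Simulated data with the same shape as in the question
set.seed(2)
mydf <- data.frame(x = rnorm(5e4), y = rnorm(5e4))

# Build the plot only if both packages are available (assumption: CRAN versions)
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggrastr", quietly = TRUE)) {
  # Only the point layer is rasterised; axes, labels and grid stay vector
  p <- ggplot2::ggplot(mydf, ggplot2::aes(x, y)) +
    ggrastr::geom_point_rast(alpha = 0.6, raster.dpi = 72)
}
```

Inside a chunk that uses dev = "svg", printing p should embed the points as a single bitmap within the SVG instead of 50,000 separate shapes, while the text stays sharp.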

Related

HTML output from RMarkdown is too heavy, I need to divide it into separate subsections

I have this html output file, but when I open it in my browser it takes too long to load and to display the scroll bar at the side. Hence I thought to split the massive file into separate subsections with markdown. How can I do that? Thank you. If this can help:
---
title: "my_file"
author: "me"
date: "26/02/2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: yes
      smooth_scroll: yes
  word_document: default
---
You may not need to divide your document! As I indicate here, the first thing to investigate is the size of the images or graphics, which may be too heavy because of an ultra-high resolution (often the default for computations that produce images). Below, I paste my answer on checking the sizes of your images:
You may have to look in the 'image or graphic' part of the code for these 2 cases:
1- If you set the size of an image in Rmarkdown at the very beginning of your image code chunk, check that 'fig.height' & 'fig.width' indicate a reasonable size, like the following:
```{r name_of_the_chunk, fig.cap="Name_of_fig", fig.height=10, fig.width=8}
2- The same may apply to code that saves the graph or renders the image. If the code chunk header doesn't set a size, make sure the code that renders the image sets reasonable dimensions ('width' and 'height'), e.g.:
svg("my.svg", width = 8, height = 8)
[code of your graph]
or height and width may be set in ggsave(file = "myfile.svg", device = "svg", width = 6, height = 6, units = "cm")
[code of your graph] ... or whatever function you use to generate your pictures.
Excellent day

HTML output takes too long to load and to show up the scroll bar

I have this Rmarkdown file, but since it is pretty heavy (it is an online guide), the scroll bar (and the whole file except the first page) takes too long to show up when opening the html output. I tried to divide the Rmd file into distinct Rmd sub-files as shown below, but I still can't get the result. Thank you
---
title: "my_file"
author: "me"
date: "26/02/2020"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: yes
      smooth_scroll: yes
  word_document: default
---
```{r child = 'child0.Rmd'}
```
```{r child = 'child1.Rmd'}
```
```{r child = 'child2.Rmd'}
```
```{r child = 'child3.Rmd'}
```
```{r child = 'child4.Rmd'}
```
Investigate and try to reduce the size of your pictures/graphics: in parallel with, or as an alternative to, splitting your text into several HTML pages, the idea is to find a compromise between loading time and the quality of your graphics (and imported pictures).
So, try:
to reduce the size of graphics computed by code chunks, see here for an example.
to reduce the size of your imported pictures, if they are huge, by resizing them.
to take advantage of the HTML format, which can render SVG files: try encoding the graphical representations of your data as SVG. Not your external images, only the computations that result in graphics (text + area + color = some graphics are 'lighter' in SVG than in JPG or TIFF).
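Before resizing anything, it can help to see which image files are actually the heavy ones. The following is a small base-R sketch. The helper largest_files is hypothetical, and the directory is whatever folder holds your figures (for example, I believe rmarkdown keeps a `*_files/figure-html` folder when the document is rendered with `self_contained: false` or `keep_md: true`).

```r
# List the n largest files under a directory, largest first.
# `dir` is a placeholder for the folder that holds your figures.
largest_files <- function(dir, n = 10) {
  paths <- list.files(dir, recursive = TRUE, full.names = TRUE)
  out <- data.frame(file = paths,
                    kb = round(file.size(paths) / 1024, 1),
                    stringsAsFactors = FALSE)
  out <- out[order(out$kb, decreasing = TRUE), , drop = FALSE]
  head(out, n)
}

# Example on a throwaway directory with two dummy files
demo_dir <- tempfile("figs")
dir.create(demo_dir)
writeBin(raw(200000), file.path(demo_dir, "big.png"))
writeBin(raw(1000), file.path(demo_dir, "small.png"))
largest_files(demo_dir)
```

The files at the top of the list are the first candidates for a lower resolution, smaller dimensions, or a different graphics device.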

When knitting RMarkdown to HTML with RStudio, is it possible to view directly in browser, instead than previewing in a window?

I often work with RMarkdown documents which are heavy on math, such as for example:
---
title: "Just a test"
author: "Yours Truly"
date: '`r Sys.Date()`'
output:
  html_document:
    fig_caption: yes
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = FALSE,
cache = TRUE,
out.width = "75%",
fig.align = "center")
```
## Classical multiple linear regression
A common question in Data Science/Statistics is: how does a certain quantity $y$ depend on other quantities $x_1,\dots,x_p$? Generally, we are interested in $p(y|\mathbf{x})$, the conditional distribution of $y$ given $\mathbf{x}=(x_1,\dots,x_p)$. The simplest and perhaps most widely used model for $p(y|\mathbf{x})$ assumes that, given $\mathbf{x}$, $y$ is normally distributed, with a constant variance $\sigma^2$ and a mean which is a linear function of a parameter vector $\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)$
$$\mathbb{E}[y|\mathbf{x}]=\boldsymbol{\beta}^T\cdot(1,\mathbf{x})=\beta_0+\sum_{j=1}^p\beta_j x_j$$
When I knit to HTML, RStudio will preview this to a window. To see the HTML in a browser I click on "View in browser":
Isn't there a way to directly view the HTML in a browser after knitting?

Following HTML knit - RMarkdown including block of white space

I have been working on journaling the visualization of some spatial data using Raster and RMarkdown, but am having a problem with there being a bunch of negative space above each figure. Here is the RMarkdown code (somewhat simplified):
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, echo=FALSE,
warning=FALSE, message=FALSE)
```
```{r r-packages}
library(maptools)
library(raster)
library(rgdal)
```
###Description of data
Data are taken from the National Land Cover Database - 2011 and represent land cover at a 30m X 30m resolution.
location of data: [National Land Cover Database - 2011]('http://gisdata.usgs.gov/TDDS/DownloadFile.php?TYPE=nlcd2006&FNAME=nlcd_2006_landcover_2011_edition_2014_10_10.zip')
###Import raster file for US landcover and shapefile for state borders and counties
```{r Import raster file for us landcover}
rfile <- '~/Documents/Data/nlcd_2006_landcover_2011_edition_2014_10_10/nlcd_2006_landcover_2011_edition_2014_10_10.img' #location of raster data
r1 <- raster(rfile)
##Import shapefile for state borders
statepath <- '~/Documents/Data/'
setwd(statepath)
shp1 <- readOGR(".", "states")
##Transform shapefile to fit raster projection
shp1 <- spTransform(shp1, r1@crs)
##Remove Hawaii and Alaska which are not in raster image
shp1.sub <- c("Hawaii","Alaska")
states.sub <- shp1[!as.character(shp1$STATE_NAME) %in% shp1.sub, ]
##Import county data
#data source: ftp://ftp2.census.gov/geo/tiger/TIGER2011/COUNTY/tl_2011_us_county.zip
countypath <- '~/Documents/Data/tl_2011_us_county'
setwd(countypath)
shp2 <- readOGR(".", "tl_2011_us_county")
##Transform shapefile to fit raster projection
counties <- spTransform(shp2, r1@crs)
counties.sub <- counties[as.character(counties$STATEFP) %in% states.sub$STATE_FIPS, ]
```
Raster plot of US with state and county border overlays
```{r plot landcover with state borders}
#Plot state borders over raster
plot(r1)
plot(counties.sub, border = "darkgrey",lwd=.65,add=T)
plot(states.sub,border = "darkblue",add=T)
```
Raster cropped and masked to extent of California
```{r crop raster to a single state (California)}
shp.sub <- c("California")
shp.ca <- states.sub[as.character(states.sub$STATE_NAME) %in% shp.sub, ]
r1.crop <- crop(r1, extent(shp.ca))
plot(r1)
```
Everything runs fine, but when the markdown is output to HTML, a bunch of white space is included as well. Here's the published RPub (now solved): http://rpubs.com/pbwilliams/80167. I think this is a Raster problem, as I haven't had this issue with figures made in ggplot, for example.
I have been able to temporarily fix this by shrinking the image down, but anytime I enlarge the picture to anything reasonable, the extra space is added. If anyone knows how to fix this, it would be greatly appreciated.
As suggested in the comments, using the chunk option fig.keep = 'last' should fix this particular problem, since each code chunk seems to have two plots, and the first one is a blank one (you only want to keep the last one).
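As a minimal sketch of that option (the chunk label and the toy plots are made up for illustration, not taken from the question), a chunk that draws twice but keeps only its final figure might look like this:

```{r two-plots, fig.keep='last'}
x <- seq(-3, 3, length.out = 50)
plot(x, dnorm(x), type = "n")  # first plot: an empty frame, discarded
plot(x, dnorm(x), type = "l")  # last plot: the only one kept in the output
```

If every chunk in the document shows the same symptom, the option can presumably also be set once with knitr::opts_chunk$set(fig.keep = 'last') in the setup chunk.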

R Markdown HTML Number Figures

Does anyone know how to number the figures in the captions, for HTML format R Markdown script?
For PDF documents, the caption will say something like:
Figure X: Some Caption Text
However, the equivalent caption for the HTML version will simply say:
Some Caption Text
This makes cross-referencing figures by number completely useless.
Here is a minimal example:
---
title: "My Title"
author: "Me"
output:
  pdf_document: default
  html_document: default
---
```{r cars, fig.cap = "An amazing plot"}
plot(cars)
```
```{r cars2, fig.cap = "Another amazing plot"}
plot(cars)
```
I have tried setting toc, fig_caption and number_sections within each of the output formats, but this does not seem to change the result.
The other answers provided are relatively out of date, and this has since been made very easy using the bookdown package. This package provides a number of improvements which includes the built-in numbering of figures across Word, HTML and PDF.
To be able to use bookdown, you need to first install the package install.packages("bookdown") and then use one of the output formats. For HTML, this is html_document2. Taking your example:
---
title: "My Title"
author: "Me"
date: "1/1/2016"
output: bookdown::html_document2
---
```{r cars, fig.cap = "An amazing plot"}
plot(cars)
```
```{r cars2, fig.cap = "Another amazing plot"}
plot(cars)
```
These figures will be numbered Figure 1 and Figure 2. Provided the code chunk is named and has a caption, we can cross-reference the output using the syntax \@ref(fig:foo), where foo is the name of the chunk, i.e. \@ref(fig:cars). You can learn more about this behaviour here.
Further Reading
R Markdown: The Definitive Guide: Chapter 11 provides a great overview of bookdown.
Authoring Books with bookdown provides a comprehensive guide to bookdown and is recommended for more advanced details.
So unless someone has a better solution, this is the solution that I came up with. There are some flaws with this approach (for example, if the figure/table number is dependent on the section number etc...), but for a basic html document, it works.
Somewhere at the top of your document, run this:
```{r echo=FALSE}
# Determine the output format of the document
outputFormat = knitr::opts_knit$get("rmarkdown.pandoc.to")
# Figure and table caption numbering; for HTML do it manually
capTabNo = 1; capFigNo = 1
# Function to add the table number
capTab = function(x){
  if(outputFormat == 'html'){
    x = paste0("Table ", capTabNo, ". ", x)
    capTabNo <<- capTabNo + 1
  }
  x
}
# Function to add the figure number
capFig = function(x){
  if(outputFormat == 'html'){
    x = paste0("Figure ", capFigNo, ". ", x)
    capFigNo <<- capFigNo + 1
  }
  x
}
```
Then during the course of your document, if say you want to plot a figure:
```{r figA,fig.cap=capFig("My Figure Caption")}
base = ggplot(data=data.frame(x=0,y=0),aes(x,y)) + geom_point()
base
```
Replace capFig with capTab in the above if you want a table caption.
We can make use of pandoc-crossref, a filter that allows cross-referencing of figures, tables, sections, and equations and works for all output formats. The easiest way is to cat the figure label (in the form of {#fig:figure_label}) after each plot, although this requires echo=FALSE and results='asis'. Then we can reference a figure as we would a citation: [@fig:figure_label] produces fig. figure_number by default.
Here is a MWE:
---
output:
  html_document:
    toc: true
    number_sections: true
    fig_caption: true
    pandoc_args: ["-F","pandoc-crossref"]
---
```{r}
knitr::opts_chunk$set(echo=FALSE,results='asis')
```
```{r plot1,fig.cap="This is plot one"}
x <- 1:10
y <- rnorm(10)
plot(x,y)
cat("{#fig:plot1}")
```
As we can see in [@fig:plot1]... whereas [@fig:plot2] shows...
```{r plot2, fig.cap="This is plot two"}
plot(y,x)
cat("{#fig:plot2}")
```
which produces (with the graphics removed):
PLOT1
Figure 1: This is plot one
As we can see in fig. 1… whereas fig. 2 shows…
PLOT2
Figure 2: This is plot two
See the pandoc-crossref readme for more options and customizations.
To install pandoc-crossref, assuming you have a Haskell installation:
cabal update
cabal install pandoc-crossref
I solve cross-referencing using a solution similar to that posted by Nicholas above. I use bookdown for some projects but I find that awkward to use for other projects where I just want simple cross-referencing.
I use the following when I am writing a paper with rmarkdown and I want it in standard format for submission to a journal. I want the figure legends at the end, then the tables and figures. As I am writing, I only have a rough idea of what order the figures will be referenced in the text. I just want to reference them with a text code like fig:foobar and have the number assigned based on order of appearance in the text. When I look at the figure legend list, I'll see what order to put the legends in and will move legends around as needed.
Here's my structure.
I have an R package where I have things I need for papers, like various bibliographies and helper R functions. In that package, I have the following function which uses some variables defined in the main Rmd environment: .rmdenvir and .refctr.
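ref <- function(useName) {
  require(stringr)
  if(!exists(".refctr")) .refctr <- c(`_` = 0)
  # label already has a number: return it
  if(any(names(.refctr) == useName)) return(.refctr[useName])
  # counter type is the part before the colon, e.g. "fig" in "fig:foo"
  type <- str_split(useName, ":")[[1]][1]
  # count labels of exactly this type ("tab" must not also match "tabappA")
  nObj <- sum(str_detect(names(.refctr), paste0("^", type, ":")))
  useNum <- nObj + 1
  newrefctr <- c(.refctr, useNum)
  names(newrefctr)[length(.refctr) + 1] <- useName
  assign(".refctr", newrefctr, envir = .rmdenvir)
  return(useNum)
}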
It assumes that I name things I want referenced with something like cntname:foo, for example fig:foo. It makes a new counter for each one and I can make up new counters on the fly (while writing) if needed.
In my main Rmd file, I have some set-up lines:
```{r setup_main}
require(myPackageforPapers)
# here is where the variables needed by ref() are defined.
.rmdenvir = environment()
.refctr <- c(`_` = 0)
```
In the text I use the following
You can see what I am trying to show in Figure `r ref("fig:foo")`
and you can see it also in Tables `r ref("tab:foo")`
and A`r ref("tabappA:foobig")`.
to get "You can see what I am trying to show in Figure 1 and you can see it also in Tables 1 and A1." Although the numbers might not be 1; the number to use will be dynamically determined. I don't have to use a special function for the first time I reference a figure, table or whatever I am counting. ref() figures that out by looking to see if the label exists already. If not it assigns the next number, and returns it. So you don't have to use "label" in one place and "ref" in another.
In the course of writing, I might decide that appendix A is getting too big, and that I will split off some of the tables into an appendix B. All I need to do is change the above to
You can see what I am trying to show in Figure `r ref("fig:foo")`
and you can see it also in Tables `r ref("tab:foo")`
and B`r ref("tabappB:foobig")`.
I just specify a new counter name 'tabappB' and the numbers for that are dynamically determined.
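To see the counter logic in isolation, here is a self-contained base-R sketch. It is not the packaged function: make_ref is a hypothetical helper that keeps the counters inside a closure instead of in the document environment, and it avoids the stringr dependency.

```r
# Hypothetical closure-based variant of ref(): the counters live inside the
# closure rather than in a variable of the main Rmd environment.
make_ref <- function() {
  labels <- character(0)  # labels seen so far, in order of first use
  nums   <- integer(0)    # number assigned to each label
  function(use_name) {
    hit <- match(use_name, labels)
    if (!is.na(hit)) return(nums[hit])  # already numbered: reuse it
    # counter type is the part before the colon, e.g. "fig" in "fig:foo"
    type <- strsplit(use_name, ":", fixed = TRUE)[[1]][1]
    # count labels of exactly this type ("tab" must not match "tabappA")
    n_same <- sum(startsWith(labels, paste0(type, ":")))
    labels <<- c(labels, use_name)
    nums   <<- c(nums, n_same + 1L)
    n_same + 1L
  }
}

ref <- make_ref()
ref("fig:foo")         # 1
ref("tab:foo")         # 1
ref("tabappA:foobig")  # 1
ref("fig:foo2")        # 2
ref("fig:foo")         # 1 again, because the label is already known
```

Each prefix before the colon acts as its own counter, so introducing a new prefix such as tabappB simply starts a fresh sequence at 1.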
At the end of my Rmd file, I have a figure list that will look like
# Figure Legends
Figure `r ref("fig:foo")`. This is the legend for this figure.
Figure `r ref("fig:foo2")`. This is the legend for another figure.
Then my tables appear like so
```{r print-tablefoo, echo=FALSE}
tablefoo=mtcars
thecap = "Tables appear with a legend while figures do not."
fullcap = paste("Table ", ref("tab:foo"), ". ", thecap, sep="")
kable(tablefoo, caption=fullcap)
```
and then the figures like so:
```{r fig-foo, echo=FALSE, fig.cap=paste("Figure",ref("fig:foo"))}
plot(1,1)
```
Appendix A is an Rmd file that is included as a child. It will have tables like
```{r print-tableAfoo, echo=FALSE}
tablefoo=mtcars
thecap = "This is a legend."
fullcap = paste("Table A", ref("tabappA:foobig"), ". ", thecap, sep="")
kable(tablefoo, caption=fullcap)
```
I do have to add the "A" to get Table A1, but I find it easier if R doesn't think too much for me in terms of labelling my counters. I just want it to return the right number.
The cross-referencing works for html, pdf/latex or word. I'd happily stick with latex solutions, but my co-authors use word so I need a solution that works with pandoc and word. Also sometimes I want html or some other output and I need a solution that works for any output that works with rmarkdown.