HTML file to screenshot as .jpg or other image - html

Nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot. I want to show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message - ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.

You need html2ps, and convert from the package ImageMagick:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one PNG per page for a long HTML page - the page layout was done by html2ps.
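The two steps can also be chained without the temporary .ps file - a minimal sketch, assuming html2ps writes PostScript to stdout when no output file is given, and using an explicit ps: prefix so convert knows what it is reading from stdin:
html2ps index.html | convert ps:- index.png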
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on a simple first test.
If you'd like to combine multiple pages into one larger image, convert will surely help you.
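For example, a minimal sketch, assuming convert wrote the pages as index-0.png, index-1.png, and so on - this stacks them vertically into one tall image:
convert index-*.png -append combined.png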
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the sizes of the images are nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately - but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
then I extract just the HTML part of it.
wkhtmltopdf needs - - for reading stdin/writing to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg

I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
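A minimal sketch of that setup, assuming xvfb-run is available and the file names are placeholders:
xvfb-run -a --server-args="-screen 0 1024x768x24" wkhtmltoimage page.html thumbnail.jpg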

Related

Pandoc fails to generate pdf from basic HTML page

I'm trying to generate a PDF from a basic HTML page using pandoc, but it seems like a table (or a few of them) is preventing the PDF from being generated.
This is the page I'm trying to convert to a PDF document. Here is the command that I'm running:
$ pandoc --verbose --from=html --to=pdf --output=ch3.pdf --pdf-engine=xelatex -V geometry:margin=1.5in https://bob.cs.sonoma.edu/IntroCompOrg-x64/bookch3.html
And here is the end of the output generated:
Error producing PDF.
! Argument of \LT@nofcols has an extra }.
<inserted text>
\par
l.2588 \begin{longtable}[]{@{}r@{}}
I was able to save it as a markdown document and then convert that markdown document to PDF, but the tables became a block of incomprehensible markdown text. I suspect that something is going wrong in the translation of the table elements, but I don't know anything about LaTeX, so I can't say for sure and have no idea where to start debugging. Any help is appreciated, thank you!
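One place to start (a sketch of my own, not from the post): dump the intermediate LaTeX that pandoc generates and inspect the longtable around line 2588 of the .tex file:
pandoc --from=html --to=latex --output=ch3.tex https://bob.cs.sonoma.edu/IntroCompOrg-x64/bookch3.html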

Can one extract images from pandoc's self-contained HTML files?

I have used pandoc with the option --self-contained to create HTML documents where images are embedded in the HTML code as base64.
The image is included in the IMG tag like this (where I have replaced the long string of base64 characters with a placeholder):
<IMG src="data:image/png;base64,<<base64-coded characters here>>" width=672">
Now, I'd like to extract such images, i.e. do the reverse where base64-coded data are replaced by references to files and the data converted to ordinary PNG or JPEG files that are saved on disk.
I was hoping to use pandoc to do this conversion, but I could not find an option for this in pandoc, nor have I found any other software that does it. Ideally, the solution should be shell/script-type that can easily be included in a longer toolchain.
You can use pandoc with the --extract-media option. The images will be written to the supplied directory and the base64 URLs will be replaced with references to those files.
E.g.
pandoc --from=html YOUR_FILE.html --extract-media=images
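If you also want the rewritten HTML (with file references instead of the base64 data) written to disk rather than stdout, something like this should do it - the output file name is just a placeholder:
pandoc --from=html YOUR_FILE.html --extract-media=images -o YOUR_FILE_extracted.html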

Bookdown: Single html output file

If I add a line below the first in _output.yml:
bookdown::gitbook:
split_by: none
css: ...
in the bookdown-demo, the output becomes a single .html file which looks kind of plain and ugly. Is it somehow possible to retain the nice style produced by the default settings, but in a single file? If I want to send the book to someone else, sending a stack of files is not great, especially if the person who receives it is not familiar with HTML as a document format.
This turns out to be a bug in bookdown, and I just fixed it on GitHub. You can install and test the development version (>= 0.3.3):
devtools::install_github('rstudio/bookdown')

convert pdf into small chunks of data (many chunks per page)?

I have a pdf file and I need to get small pieces of data from it.
It is structured like this:
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, each containing text and images.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with png images, but how do I cut the image into smaller chunks to fit the size of each question (I want to separate each question into individual files)?
P.S. I have a lot of pdf files, so a command-line tool would be nice.
I'll try to give you an approach on how I would go about it. You mention that every page in your PDF document might have multiple questions and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
Ok, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter, 'Question' in your case. (Side note: csplit and awk are Linux-specific utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO Post :
csplit input.txt '/^Question/' '{*}'
awk '/Question/{filename=NR".txt"} filename{print > filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically, and going through the general structure of how the program converts the pdf into html. I'm sure you'll see a pattern around how the string 'Question' is wrapped in an element, and that is something you can take care of. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content HTML, without the usual starting tags of an HTML file, because we removed those initially. Then it's only a matter of reading each file, taking care of the element that surrounds the string 'Question', adding the html, head and body elements around the content, and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (it would be a fun exercise).
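A rough shell sketch of that last step, assuming csplit left the chunks as xx00, xx01, ... in the current directory (the file names and the minimal HTML skeleton are my assumptions):
for f in xx*; do
  {
    echo '<html><head><meta charset="utf-8"></head><body>'
    cat "$f"
    echo '</body></html>'
  } > "$f.html"
done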
I hope this gets you started in the right direction.

How to render an HTML file offline?

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid vary. Portions of the html files contain articles in an Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered html using a command-line browser is the first step that I have thought of; it will eliminate some of the frills.
The problem is, I cannot dump the rendered html to a file (using, say, w3m -dump). The dumping works only if I direct the browser (at the command line) to a properly formed URL: http://<blah-blah>/<filename>. But this way I will have to spend time downloading the files once again from the web. How do I get around this, and what other tools could I use?
w3m -dump <filename> complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filename> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators
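One thing worth trying (an assumption on my part, not something stated in the question): w3m may simply be failing to guess the content type from that file name, so forcing the type and quoting the awkward name might be enough:
w3m -dump -T text/html 'details.php?id=100419&cid=13%0D' > article.txt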