HTML to PDF from bash with Unicode (UTF-8) support

Is there a good way to convert HTML to PDF from bash with Unicode (UTF-8) support?
I would expect the same result as if I were to use a PDF printer and print a page from Firefox.
Usage Example:
curl http://www.wikipedia.org/ | html2pdf_bash_command > /tmp/wikipedia.org.pdf

You can do something with xvfb and a browser, or use wkhtmltopdf, a small Qt-based tool. Also, if you have a full GNOME environment installed, you can use gnome-web-print.
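For example, wkhtmltopdf can read HTML from stdin and write the PDF to stdout, which fits the pipeline from the question (a sketch; the --encoding flag only matters if the page does not declare its charset):
curl -s http://www.wikipedia.org/ | wkhtmltopdf --encoding utf-8 - - > /tmp/wikipedia.org.pdf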

Related

Character encoding in HTML file using WebView in JavaFX

I have a local HTML file that I would like to display in a WebView, in JavaFX. It's actually an html file from an epub file. I'm essentially trying to build my own epub viewer.
The epub's html file displays some text with diacritic marks. Most of these have been handled in the ebook files using html tags and a CSS, but not all. For example, the character "á" is used. When I open the html file in Chrome, it displays normally, but it shows up in my WebView program as "Ã¡".
I assume it's a character encoding thing. If I use the character reference a&#769;, then it shows up properly, but I'd rather not have to go through all the epub files I want to display and see what other characters don't work properly.
I have saved the html file with UTF-8 encoding, and anyway, it's the same file that is being read by Chrome and my program. Any suggestions?
Well, that wasn't too long. Explaining the question put me on the path to salvation :)
I just needed to change Eclipse's encoding, using this answer:
How to support UTF-8 encoding in Eclipse
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
Window > Preferences > General > Workspace, set "Text file encoding" to "Other : UTF-8".

Losing superscript tag when converting HTML to DOCX using libreoffice

I have the following HTML:
<html><body><p>n<sup>th</sup></p></body></html>
I am using the command:
$ libreoffice --convert-to docx:"MS Word 2007 XML" test.html
To convert that HTML into a DOCX file. However, I notice that the resulting DOCX file does not actually contain a <w:vertAlign> tag for the superscript. It looks like it is using position and size to replicate it instead:
<w:position w:val="8"/><w:sz w:val="19"/>
What I would need to know is how to make libreoffice put in the <w:vertAlign> tag instead of using position and size.
Additional Info:
I had a similar problem with bold and italics (<strong><em>) but was able to get the conversion to work correctly if I converted the strong and em tags to b and i tags respectively.
If you are looking to edit the HTML, it would be much better to use a tool that is suited for editing HTML, such as Notepad++ or Sublime (as examples).
If you need to have the HTML as a LibreOffice document for a specific reason, you could open the HTML file in Notepad and save as a text file with .txt as the extension. That should allow you to open the document in LibreOffice.
You can try using a WYSIWYG (What You See Is What You Get) editor like TinyMCE (http://www.tinymce.com/). There are lots of them online, and you can also find some desktop applications for that. But if you want to convert to DOCX, you can try http://htmltodocx.codeplex.com/; it is written in PHP, uses PHPWord, and is quite efficient.
Just create a Python script that replaces your unwanted tags with the <w:vertAlign> tag wherever needed.
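The same idea can be sketched directly in the shell (file names are examples; it assumes the superscript runs always come out as exactly the position/size pair shown above, which is worth checking against your own output first):
# a DOCX file is a zip archive; pull out the main document part
unzip -o test.docx word/document.xml
# swap the position/size pair for a real superscript run property
sed -i 's|<w:position w:val="8"/><w:sz w:val="19"/>|<w:vertAlign w:val="superscript"/>|g' word/document.xml
# write the modified part back into the archive
zip test.docx word/document.xml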
The command works fine if you replace 'docx' with 'xml', like this:
libreoffice --convert-to xml:"MS Word 2003 XML" test.html
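If you go the XML route, a quick way to check that the superscript survived (a guess at the element name, assuming the 2003 WordML output also uses w:vertAlign):
grep -o '<w:vertAlign[^>]*>' test.xml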

HTML file to screenshot as .jpg or other image

Nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot. I want to show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message - ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the ImageMagick package:
html2ps index.html index.ps
convert index.ps index.png
The second command produces one PNG per page for a long HTML page - the page layout was done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on a simple first test.
If you want to combine multiple pages into a larger image, convert will surely help you.
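For instance, ImageMagick can stack the per-page PNGs into one tall image (a sketch; the names assume convert wrote index-0.png, index-1.png, and so on):
convert index-*.png -append combined.png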
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately - but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
The second sed then extracts just the HTML part of that.
wkhtmltopdf needs - - for reading stdin/writing to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtmltopdf ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
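For reference, a sketch of how the two fit together - wkhtmltoimage run under a virtual X server - with hypothetical file names:
xvfb-run --auto-servernum wkhtmltoimage --format jpg part.html part.jpg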

HTML to ANSI-colored terminal text

I am on Linux and I want to fetch an HTML page from the web and then output it on the terminal. I found out that html2text essentially does the job, but it converts my HTML to plain text, whereas I would rather convert it into ANSI-colored text in the spirit of ls --color=auto. Any ideas?
The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
The above example provides a text version of http://example.com/ using 16 colors. The output format can be customized further depending on need.
The -dump option enables the dump mode, which just prints the whole page as text, with the link destinations printed out in a kind of "email-style".
-dump-color-mode 1 enables coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator, this can go up to ~16 million colors (true color). The values are documented in elinks.conf(5).
The colors used for the output can be configured as well; this is also documented in elinks.conf(5).
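For long pages, the colored dump can be piped through a pager that passes ANSI escapes through, for example:
elinks -dump -dump-color-mode 1 http://example.com/ | less -R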
The w3m browser supports coloring the output text.
You can use the lynx browser to output the text using this command.
lynx -dump http://example.com

How to render an HTML file offline?

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid vary. Portions of the html files contain articles in an Asian language (Unicode text). My intention is to extract only the Asian-language text. Dumping the rendered html using a command-line browser is the first step I have thought of; it will eliminate some of the frills.
The problem is, I cannot dump the rendered html to a file (using, say, w3m -dump). The dumping works only if I direct the browser (at the command line) to a properly formed URL: http://<blah-blah>/<filename>. But that way I will have to spend the time downloading the files once again from the web. How do I get around this? What other tools could I use?
w3m -dump <filename> complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filename> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators
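One thing that may be worth trying (an assumption based on that error, since w3m appears to guess the content type from the file extension and these saved files have none): force the type with -T and quote the awkward file name, e.g.
w3m -dump -T text/html 'details.php?id=100419&cid=13%0D' > dump.txt
where dump.txt is just an example output name.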