How to render an HTML file offline? - html

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid vary. Portions of the html files contain articles in an Asian language (Unicode text). My intention is to extract only the Asian-language text. Dumping the rendered html using a command-line browser is the first step I have thought of; it would eliminate some of the frills.
The problem is that I cannot dump the rendered html to a file (using, say, w3m -dump). The dumping works only if I point the browser (at the command line) to a properly formed URL: http://<blah-blah>/<filename>. But that way I would have to spend time downloading the files from the web all over again. How do I get around this, and what other tools could I use?
w3m -dump <filename> complains with:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filename> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators
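One workaround I am considering is telling w3m the content type explicitly, in case the failure is only that it does not recognise a file without an .html extension as HTML. A minimal, untested sketch (the .txt output names are just for illustration):
for f in details.php*; do
    w3m -dump -T text/html "$f" > "$f.txt"
done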

Related

Can one extract images from pandoc's self-contained HTML files?

I have used pandoc with the option --self-contained to create HTML documents where images are embedded in the HTML code as base64.
The image is included in the IMG tag like this (where I have replaced the long string of base64 characters with a placeholder):
<IMG src="data:image/png;base64,<<base64-coded characters here>>" width="672">
Now, I'd like to extract such images, i.e. do the reverse: replace the base64-coded data with references to files, and convert the data into ordinary PNG or JPEG files saved on disk.
I was hoping to use pandoc to do this conversion, but I could not find an option for it in pandoc, nor have I found any other software that does it. Ideally, the solution should be shell/script-based so that it can easily be included in a longer toolchain.
You can use pandoc with the --extract-media option. The images will be written to the supplied directory and the base64 URLs will be replaced with references to those files.
E.g.
pandoc --from=html YOUR_FILE.html --extract-media=images
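By default the rewritten document goes to stdout; if you want it written to disk next to the extracted images (an assumption about your workflow), add -o. Note that pandoc re-emits the HTML through its own writer, so the markup may be normalised a little:
pandoc --from=html --extract-media=images YOUR_FILE.html -o YOUR_FILE_extracted.html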

Character encoding in HTML file using WebView in JavaFX

I have a local HTML file that I would like to display in a WebView, in JavaFX. It's actually an html file from an epub file. I'm essentially trying to build my own epub viewer.
The epub's html file displays some text with diacritic marks. Most of these have been handled in the ebook files using html tags and a CSS, but not all. For example, the character "á" is used. When I open the html file in Chrome, it displays normally, but it shows up in my WebView program as "Ã¡".
I assume it's a character encoding thing. If I use the character reference a&#769;, then it shows up properly, but I'd rather not have to go through all the epub files I want to display and see what other characters don't work properly.
I have saved the html file with UTF-8 encoding, and anyway, it's the same file that is being read by Chrome and my program. Any suggestions?
Well, that wasn't too long. Explaining the question put me on the path to salvation :)
I just needed to change Eclipse's encoding, using this answer:
How to support UTF-8 encoding in Eclipse
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
Window > Preferences > General > Workspace, set "Text file encoding" to "Other : UTF-8".

convert pdf into small chunks of data (many chunks per page)?

I have a pdf file and I need to get small pieces of data from it.
It is structured like this:
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with png images, but how do I cut the image into smaller chunks to fit each question (I want to separate each question into individual files)?
P.S. I have a lot of pdf files, so a command-line tool would be nice.
I'll try to give you an approach on how I would go about it. You mention that every page in your PDF document might have multiple questions, and that you basically want one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
Ok, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter, 'Question' in your case. (Side note: csplit and awk are Linux-specific utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO post:
csplit input.txt '/^Question/' '{*}'
awk '/Question/{filename=NR".txt"} filename{print > filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically, and looking at the general structure of how the program converts the pdf into html. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, and that is something you can take care of. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags of an html file, because we removed those initially. Then it's only a matter of reading each file, taking care of the element that surrounds the string 'Question', adding the html, head and body elements around the content, and saving the result as .html files. You could do this in any programming language of your choice that supports file reading and writing (it would be a fun exercise).
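To make that concrete, here is a rough, untested sketch of the splitting and re-wrapping steps, assuming the stripped-down file is called output.txt and that each question starts on a line containing the string 'Question' (both assumptions):
# split into question00.txt, question01.txt, ... at every line containing "Question"
csplit -f question -b '%02d.txt' output.txt '/Question/' '{*}'
# wrap each piece back into a standalone HTML file
for f in question*.txt; do
    {
        echo '<html><head><meta charset="utf-8"></head><body>'
        cat "$f"
        echo '</body></html>'
    } > "${f%.txt}.html"
done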
I hope this gets you started in the right direction.

Losing superscript tag when converting HTML to DOCX using libreoffice

I have the following HTML:
<html><body><p>n<sup>th</sup></p></body></html>
I am using the command:
$ libreoffice --convert-to docx:"MS Word 2007 XML" test.html
to convert that HTML into a DOCX file. However, I notice that the resulting DOCX file does not actually contain a <w:vertAlign> element for the superscript. It looks like LibreOffice replicates the effect using position and size instead:
<w:position w:val="8"/><w:sz w:val="19"/>
What I would need to know is how to make libreoffice put in the <w:vertAlign> tag instead of using position and size.
Additional Info:
I had a similar problem with bold and italics (<strong> and <em>), but I was able to get the conversion to work correctly by converting the strong and em tags to b and i tags respectively.
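That substitution was simple text replacement, roughly like the following sketch (not my exact commands; it assumes plain lowercase tags, and test_fixed.html is just a placeholder name):
sed -e 's|<strong>|<b>|g; s|</strong>|</b>|g; s|<em>|<i>|g; s|</em>|</i>|g' test.html > test_fixed.html
libreoffice --convert-to docx:"MS Word 2007 XML" test_fixed.html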
If you are looking to edit the HTML, it would be much better to use a tool that is suited for editing HTML, such as Notepad++ or Sublime (as examples).
If you need to have the HTML as a LibreOffice document for a specific reason, you could open the HTML file in Notepad and save as a text file with .txt as the extension. That should allow you to open the document in LibreOffice.
You can try using a WYSIWYG (What You See Is What You Get) editor like TinyMCE (http://www.tinymce.com/). There are lots of them online, and you can also find some desktop applications for that. But if you want to convert it to docx, you can try http://htmltodocx.codeplex.com/; it is written in PHP, uses PHPWord, and is quite efficient.
Just create a Python script that replaces your unwanted tags with the <w:vertAlign> tag wherever needed.
The command works fine if you replace 'docx' with 'xml', like this:
libreoffice --convert-to xml:"MS Word 2003 XML" test.html
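If you ultimately need a .docx, one possible follow-up is a second pass through LibreOffice to convert that XML output; this is unverified, and I have not checked whether the <w:vertAlign> run property survives the round trip:
libreoffice --convert-to xml:"MS Word 2003 XML" test.html
libreoffice --convert-to docx test.xml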

HTML file to screenshot as .jpg or other image

This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot. I want to show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message; ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps and convert from the ImageMagick package:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one png per page for a long html page; the page layout was done by html2ps.
I also found a program, evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work in a simple first test.
If you want to combine multiple pages into a larger image, convert will surely help you.
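For example, something like this (assuming convert named the per-page files index-0.png, index-1.png, and so on, which is its usual numbering; -append stacks them vertically):
convert index-*.png -append index-all.png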
Now I see that convert operates on html directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the sizes of the images are nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-html text, the html, and maybe attachments.
You can extract the html and format it separately, but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr 14, so I extract that one mail from the mail folder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
The second sed then extracts just the html part of that.
wkhtmltopdf needs - - to read from stdin and write to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
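Something along these lines should reproduce that setup (a sketch; page.html and thumbnail.jpg are placeholder names, and xvfb-run is one way to give wkhtmltoimage the X server it needs on a headless machine):
xvfb-run -a wkhtmltoimage page.html thumbnail.jpg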