Can one extract images from pandoc's self-contained HTML files? - html

I have used pandoc with the option --self-contained to create HTML documents where images are embedded in the HTML code as base64.
The image is included in the IMG tag like this (where I have replaced the long string of base64-characters with a placeholder:
<IMG src="data:image/png;base64,<<base64-coded characters here>>" width=672">
Now, I'd like to extract such images, i.e. do the reverse where base64-coded data are replaced by references to files and the data converted to ordinary PNG or JPEG files that are saved on disk.
I was hoping to use pandoc to do this conversion, but I could not find an option for this in pandoc, nor have I found any other software that does it. Ideally, the solution should be shell/script-type that can easily be included in a longer toolchain.

You can use pandoc with the --extract-media option. The images will be written to the supplied directory and the base64 URLs will be replaced with references to those files.
E.g.
pandoc --from=html YOUR_FILE.html --extract-media=images

Related

Lua filter for pandoc to append html

I'm currently compiling markdown to html using pandoc:
pandoc in.md -o out.html
and would like to include the same piece of html code in each of the output files, without having to write it into my markdown file.
I was hoping that a lua filter would do the job. However, the docs seem to indicate the filters will only respond to a sequence of characters within my markdown file, rather than appending something to each file.
I've played around with CSS (I've never used it before), but it doesn't look like I can just add arbitrary html code like this (correct me if I'm wrong).
To summarize, I'd like to find a way to add html code to my output.
A Lua filter is likely to be overkill here. Pandoc has an option --include-after-body (or --include-before-body) which will do what you need:
-A FILE, --include-after-body=FILE|URL
Include contents of FILE, verbatim, at the end of the document body (before the </body> tag in HTML, or the \end{document} command in LaTeX). This option can be used repeatedly to include multiple files. They will be included in the order specified. Implies --standalone.

Using pandoc to generate PDF from Markdown with inline style

I'm looking to create a mostly markdown document, but would like to take advantage of inserting HTML when I might need a bit more control over formatting on a case-by-case basis. I have iaWriter on macOS and am able to do so, and from my understanding of markdown this is an included behaviour.
When using pandoc on my linux machine, however, some tags (most notably <i> tags at the moment) are not interpreted.
My markdown file is:
This _does_ work.
This does <i>not</i> work.
However, inserting a <p>tag</p> will create a line-break and new paragraph.
When I execute pandoc -o test.pdf test.md I get the result: test.pdf
I've tried a few extensions in the output (+raw_html, +inline_code_attributes) thinking maybe I was missing something but have so far not found an explanation.
Apologies if this is a duplicate, but I was unable to find it, and have so far been unable to source an answer.
Thank you.
See the pandoc MANUAL: Creating a PDF.
By default, pandoc will use LaTeX to create the PDF Therefore, raw HTML will be ignored and would only have an effect if your output format is HTML as well. However, you can use wkhtmltopdf instead of pdflatex to go from markdown to PDF via HTML, instead of via LaTeX.
From the raw HTML extension docs:
The raw HTML is passed through unchanged in HTML, S5, Slidy, Slideous, DZSlides, EPUB, Markdown, Emacs Org mode, and Textile output, and suppressed in other formats.

HTML file to screenshot as .jpg or other image

Nothing to do with rendering an individual image on a webpage. Goal is to render the entire webpage save that as a screenshot. Want to show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message - ideally I would like to snapshot the entire MIME file but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the package ImageMagick:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one png per page for a long html-page - the page layout was done by by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on an simple, first test.
If you like to combine multiple pages to a larger image, convert will help you surely.
Now I see, that convert operates on html directly, so
convert index.html index.png
shall work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart mime-type email, you typically have a mail header, maybe some pre-html-text, the html and maybe attachments.
You can extract the html and format it seperately - but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
then I extract just the html-part of that.
wkhtmltopdf needs - - for reading stdin/writing to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once correctly set up xvfb. The postscript suggestion did not render correctly and we need img not pdf.

Output formatted text (including source code) as LaTeX, PDF and HTML

I am editing a lot of documents in latex that consist of code listings and are currently output to pdf.
Since I am working in teams on those documents, I often need to manually integrate changes done by group members to the latex source.
Most of the group members do not know latex, so I would like to have a means to enable them to do the document formatting in a style maybe similar to markdown.
Since the latex documents consist of figures, have references and use the lslisting package, I am wondering if it would be possible to map these specific areas to a simple markdown style syntax.
Workflow Example:
Edit file in Markdown (or similar)
tag sections
tag code areas
tag figures
tag references
convert to latex
automatically convert tags
output
pdf
html
Would it somehow be possible to achieve such a workflow? Maybe there are already solutions to my specific workflow?
Here is an example for Docutils.
Title
=====
Section
-------
.. _code:
Code area::
#include <iostream>
int main() {
std::cout << "Hello World!" << std::endl;
}
.. figure:: image.png
Caption for figure
A reference to the code_
Another section
---------------
- Itemize
- lists
#. Enumerated
#. lists
+-----+-----+
|Table|Table|
+-----+-----+
|Table|Table|
+-----+-----+
Save that as example.rst. Then you can compile to HTML:
rst2html example.rst example.html
or to LaTeX:
rst2latex example.rst example.tex
then compile the resulting LaTeX document:
pdflatex example.tex
pdflatex example.tex # twice to get the reference right
A more comprehensive framework for generating documents from multiple sources is Sphinx, which is based on Docutils and focuses on technical documentation.
You should look at pandoc (at least if I understand your question correctly). It can convert between multiple formats (tex, pdf, word, reStructuredText) and also supports extended versions of markdown syntax to handle more complex issues (e.g. inserting header information in html).
With it you can mix markdown and LaTeX, and then compile to html, tex and pdf. You can also include bibtex references from an external file.
Some examples (from markdown to latex and html):
pandoc -f markdown -t latex infile.txt -o outfile.tex
pandoc -f markdown -t html infile.txt -o outfile.html
To add your own LaTex template going from markdown to pdf, and a bibliography:
pandoc input.text --template=FILE --bibliography refs.bib -o outfile.pdf
It is really a flexible and awesome program, and I'm using it much myself.
Have you looked at Docutils?
If you are an Emacs user, you may find org-mode's markup to your liking. It has very nice support for tables, coordinates well with other Emacs modes like the spreadsheet, and has good export of images to HTML. Cf. the fine manual's HTML-export section.
org-mode files are editable outside Emacs, for team members who do not use it, although the previewing and embedding of other Emacs modes can, naturally, only be done with Emacs.

How to render an HTML file offline?

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid varies. Portions of the html files contain articles in Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered html using a command-line browser is the first step that I have thought of. It will eliminate some of the frills.
The problem is, I cannot dump the rendered html to a file (using, say, w3m -dump ). The dumping works if only I direct the browser (at the command-line) to the properly formed URL : http://<blah-blah>/<filename>. But this is way I will have to spend the time to download the files once again from the web. How do I get around this, what other tools could I use?
w3m -dump <filename> complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filname> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators