The general process of creating a printable pdf from xml

This is a very, very basic question. I'm self-taught in HTML, XML and CSS, so please forgive my absolute ignorance. My situation is as follows: I know how to write XML files, I can create the HTML output I want, and I can use CSS to style the page the way I need to. Now I would like to print a book from this result. I need to split the content of my HTML page into A4 pages and add page numbers and line numbers. What techniques do I have to learn to do this? I have read online that XSL-FO is used to transform XML to PDF. Is there any way I could use the HTML/CSS output with this, or do I need to write an entirely new stylesheet using XSL-FO? Do I need to learn JavaScript? I'm willing to do any of this, I just don't know where to start.
I had a look at importing my XML file into InDesign, and that would work, but then I'd have to do all the work of styling the text again. There has to be a better way.

If you want to use CSS to style your print output, the proprietary Prince XML seems to be the only tool that generates decent typography.
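For the CSS route, page size and page numbers are expressed with standard CSS Paged Media rules. The following is a minimal sketch, assuming Prince is installed as `prince` and that the styled HTML lives in a file called book.html; both file names are hypothetical. It does not cover line numbers, which need separate handling.

```shell
# print.css is a hypothetical file name; the rules are standard CSS Paged Media.
cat > print.css <<'EOF'
@page {
  size: A4;
  margin: 25mm 20mm;
  @bottom-center {
    content: counter(page);   /* page number centred in the footer */
  }
}
h1 { page-break-before: always; }   /* start each chapter on a new page */
EOF

# book.html is an assumed input file; the conversion runs only if it and Prince exist.
if command -v prince >/dev/null 2>&1 && [ -f book.html ]; then
  prince -s print.css book.html -o book.pdf
else
  echo "prince or book.html not found; skipping PDF generation" >&2
fi
```

The same stylesheet can be linked from the HTML with media="print", so the screen styling stays untouched.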
Turning to open source tooling, you could use XSLT to transform your custom XML to XSL-FO and then Apache FOP to generate the PDF. However, the output is not as clean as with TeX, and you'd need to specify all your layout in XSL-FO instead of CSS.
What I'd recommend is transforming your XML to HTML or DocBook XML and then using Pandoc to turn that into a PDF. By default Pandoc generates the PDF with a LaTeX engine (pdflatex, xelatex or lualatex). If you're not familiar with the LaTeX macro package, I recommend using the ConTeXt macro package instead, which has more consistent layout commands and doesn't rely on packages for basic functionality. To change the layout, use a custom Pandoc template file to generate the desired ConTeXt file. That would work as follows:
$ saxon -o docbook-file.xml custom.xml stylesheet.xslt #generate DocBook
$ pandoc -f docbook docbook-file.xml -t context --standalone --template template.tex -o out.tex #generate ConTeXt
$ context out.tex #generate PDF
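One way to get a starting point for template.tex is to dump Pandoc's built-in ConTeXt template and edit the layout commands in it. A minimal sketch, assuming pandoc is on the PATH; template.tex is the file name used in the commands above.

```shell
# -D (--print-default-template) prints pandoc's default template for a format.
if command -v pandoc >/dev/null 2>&1; then
  pandoc -D context > template.tex
  echo "Wrote template.tex; edit its layout setup before running the pipeline."
else
  echo "pandoc not found; skipping template export" >&2
fi
```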

Or look at http://www.cloudformatter.com/CSS2Pdf, which uses XSL-FO behind the scenes. You style with CSS. There are many samples showing book features like headers and footers, page numbering and multiple sequences.

You can try http://pdfcrowd.com/, which is very simple and easy. I'm using their Java API and it works smoothly. It's also quite cheap.

Related

Lua filter for pandoc to append html

I'm currently compiling markdown to html using pandoc:
pandoc in.md -o out.html
and would like to include the same piece of HTML code in each of the output files, without having to write it into my Markdown file.
I was hoping that a Lua filter would do the job. However, the docs seem to indicate that filters only respond to a sequence of characters within my Markdown file, rather than appending something to each file.
I've played around with CSS (I've never used it before), but it doesn't look like I can add arbitrary HTML code that way (correct me if I'm wrong).
To summarize, I'd like to find a way to add HTML code to my output.

A Lua filter is likely to be overkill here. Pandoc has an option --include-after-body (or --include-before-body) which will do what you need:
-A FILE, --include-after-body=FILE|URL
Include contents of FILE, verbatim, at the end of the document body (before the </body> tag in HTML, or the \end{document} command in LaTeX). This option can be used repeatedly to include multiple files. They will be included in the order specified. Implies --standalone.
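As a minimal sketch of how that option could be applied to a whole directory, assuming pandoc is installed: footer.html and its contents are hypothetical, and -A is the short form of --include-after-body quoted above.

```shell
# footer.html is a hypothetical file name; its contents are only an example.
cat > footer.html <<'EOF'
<footer>
  <p>Generated with pandoc.</p>
</footer>
EOF

# Convert every Markdown file in the current directory, appending the footer.
if command -v pandoc >/dev/null 2>&1; then
  for src in *.md; do
    [ -e "$src" ] || continue          # skip when no .md files are present
    pandoc "$src" -A footer.html -o "${src%.md}.html"
  done
else
  echo "pandoc not found; skipping conversion" >&2
fi
```

Because the snippet lives in its own file, changing it once updates every output on the next run.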

Using pandoc to generate PDF from Markdown with inline style

I'm looking to create a mostly markdown document, but would like to take advantage of inserting HTML when I might need a bit more control over formatting on a case-by-case basis. I have iaWriter on macOS and am able to do so, and from my understanding of markdown this is an included behaviour.
When using pandoc on my linux machine, however, some tags (most notably <i> tags at the moment) are not interpreted.
My markdown file is:
This _does_ work.
This does <i>not</i> work.
However, inserting a <p>tag</p> will create a line-break and new paragraph.
When I execute pandoc -o test.pdf test.md I get the result: test.pdf
I've tried a few extensions in the output (+raw_html, +inline_code_attributes) thinking maybe I was missing something but have so far not found an explanation.
Apologies if this is a duplicate, but I was unable to find it, and have so far been unable to source an answer.
Thank you.

See the pandoc MANUAL: Creating a PDF.
By default, pandoc will use LaTeX to create the PDF. Therefore, raw HTML will be ignored; it would only have an effect if your output format were HTML as well. However, you can use wkhtmltopdf instead of pdflatex to go from Markdown to PDF via HTML instead of via LaTeX.
From the raw HTML extension docs:
The raw HTML is passed through unchanged in HTML, S5, Slidy, Slideous, DZSlides, EPUB, Markdown, Emacs Org mode, and Textile output, and suppressed in other formats.
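A minimal sketch of the HTML route, assuming pandoc 2.0 or later (where the option is called --pdf-engine) and wkhtmltopdf are both installed. The file test.md is recreated here from the question's example.

```shell
# Recreate the question's test file (two short paragraphs).
cat > test.md <<'EOF'
This _does_ work.

This does <i>not</i> work.
EOF

# Go from Markdown to PDF via HTML so the raw <i> tag is kept.
if command -v pandoc >/dev/null 2>&1 && command -v wkhtmltopdf >/dev/null 2>&1; then
  pandoc test.md -t html5 --pdf-engine=wkhtmltopdf -o test.pdf
else
  echo "pandoc or wkhtmltopdf not found; skipping PDF generation" >&2
fi
```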

convert docx with (ordered) list to html

I'm trying to convert a large docx document with a multi-level ordered list to HTML. (See an example of the document here: http://docdro.id/X1oyfBv. You should download it.)
I tried the following things:
online converters such as html-cleaner and index.html (which only recognize one level of the list)
save as HTML, which creates a horrendous file but still doesn't recognize the ol structure
saving the file as a zip and then opening the XML file, but I don't see an easy way to get the ol structure out of the w:... tags
saving it to Google Docs and running Omar Alzabir's script
http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
By the way, if I create a new Word file with a multi-level ordered list and convert it, the list is recognized as ol elements. But the list in the existing file is not recognized, even if I 'un-list' it and list it again. So possibly there is something wrong with how the original document was created?
Any suggestions are much appreciated :) as are indications as to why this problem occurs.

Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs, in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.

You can use pandoc: https://github.com/jgm/pandoc
It is an open source, universal command-line tool for converting between markup formats.
You can use it like this:
pandoc -o output.html input.docx

How to place labels correctly and use cross-reference in latex to be able to convert to html(5) using pandoc?

Introduction
I'd like to create source code in latex in order to produce pdf via pdflatex and html page(s) via pandoc. I use the following source
\documentclass[12pt,a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[magyar]{babel}
\usepackage{hyperref}
\begin{document}
\begin{enumerate}
\item \label{itm:thefirst}
First example point
\end{enumerate}
This is an example to demonstrate how \textbackslash label\{\} and \textbackslash ref\{\} are not working with pandoc. Here should be a reference to the first example: \ref{itm:thefirst}
\end{document}
This can be compiled with pdflatex without any error or warning.
Problem
I create the html page via pandoc using the following code:
pandoc -f latex sample.tex -t html5 -o sample.html -S --toc -s
but it creates unsatisfactory results around the label and the reference:
<body>
<ol>
<li><p>[itm:thefirst] First example point</p></li>
</ol>
<p>This is an example to demonstrate how \label{} and \ref{} are not working with pandoc. Here should be a reference to the first example: [itm:thefirst]</p>
</body>
Question
What shall I modify in the latex source code in order to get something like this:
<body>
<ol>
<li><p id="itm:thefirst">First example point</p></li>
</ol>
<p>This is an example to demonstrate how \label{} and \ref{} are not working with pandoc. Here should be a reference to the first example: (1)</p>
</body>

What shall I modify in the latex source code [...]
Pandoc does not currently support parsing and processing of \label{...} or \ref{...} from LaTeX files, so there is no easy solution to your problem.

Why not go an alternative way?
Instead of writing your sources in LaTeX, write them in Markdown.
That way it will be much easier to convert the sources to HTML as well as to LaTeX and PDF.
As a bonus, you also get top-notch support for converting the sources to EPUB, DOCX, ODT and much more.
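A minimal sketch of that single-source workflow, assuming pandoc is installed. sample.md and its contents are hypothetical stand-ins for the LaTeX source in the question. Plain Markdown has no direct equivalent of \label and \ref for list items, so this only illustrates the conversion, not the cross-reference.

```shell
# sample.md is a hypothetical Markdown version of the question's document.
cat > sample.md <<'EOF'
# Example

1. First example point

This paragraph follows the list.
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Standalone HTML5 page with a table of contents.
  pandoc sample.md -s --toc -M title=Sample -t html5 -o sample.html
  # Standalone LaTeX file that can then be compiled with pdflatex.
  pandoc sample.md -s -M title=Sample -t latex -o sample.tex
else
  echo "pandoc not found; skipping conversion" >&2
fi
```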

Output formatted text (including source code) as LaTeX, PDF and HTML

I am editing a lot of documents in latex that consist of code listings and are currently output to pdf.
Since I am working in teams on those documents, I often need to manually integrate changes done by group members to the latex source.
Most of the group members do not know latex, so I would like to have a means to enable them to do the document formatting in a style maybe similar to markdown.
Since the LaTeX documents contain figures, have references and use the listings package, I am wondering if it would be possible to map these specific areas to a simple Markdown-style syntax.
Workflow Example:
1. Edit file in Markdown (or similar)
   - tag sections
   - tag code areas
   - tag figures
   - tag references
2. convert to latex
   - automatically convert tags
3. output
   - pdf
   - html
Would it somehow be possible to achieve such a workflow? Maybe there are already solutions to my specific workflow?

Here is an example for Docutils.
Title
=====

Section
-------

.. _code:

Code area::

    #include <iostream>
    int main() {
        std::cout << "Hello World!" << std::endl;
    }

.. figure:: image.png

   Caption for figure

A reference to the code_

Another section
---------------

- Itemize
- lists

#. Enumerated
#. lists

+-----+-----+
|Table|Table|
+-----+-----+
|Table|Table|
+-----+-----+
Save that as example.rst. Then you can compile to HTML:
rst2html example.rst example.html
or to LaTeX:
rst2latex example.rst example.tex
then compile the resulting LaTeX document:
pdflatex example.tex
pdflatex example.tex # twice to get the reference right
A more comprehensive framework for generating documents from multiple sources is Sphinx, which is based on Docutils and focuses on technical documentation.

You should look at pandoc (at least if I understand your question correctly). It can convert between multiple formats (tex, pdf, word, reStructuredText) and also supports extended versions of markdown syntax to handle more complex issues (e.g. inserting header information in html).
With it you can mix markdown and LaTeX, and then compile to html, tex and pdf. You can also include bibtex references from an external file.
Some examples (from markdown to latex and html):
pandoc -f markdown -t latex infile.txt -o outfile.tex
pandoc -f markdown -t html infile.txt -o outfile.html
To use your own LaTeX template and a bibliography when going from Markdown to PDF:
pandoc input.text --template=FILE --bibliography refs.bib -o outfile.pdf
It is a really flexible and awesome program, and I use it a lot myself.

Have you looked at Docutils?

If you are an Emacs user, you may find org-mode's markup to your liking. It has very nice support for tables, coordinates well with other Emacs modes like the spreadsheet, and has good export of images to HTML. Cf. the fine manual's HTML-export section.
org-mode files are editable outside Emacs, for team members who do not use it, although the previewing and embedding of other Emacs modes can, naturally, only be done with Emacs.
