Convert markdown table of contents to HTML

I am trying to convert a markdown document to HTML, using pandoc. I cannot get the HTML output to create the table of contents correctly.
Issue:
I have added a table of contents to the markdown doc, where clicking on each header takes the reader to the relevant section. I am using the format below, where clicking on 'Header Title' will send the reader to the section 'header' in the document:
[Header Title](#header)
I tried to convert this to HTML using the pandoc command
pandoc -i input.md -f markdown -t html -o input.html
This creates a valid HTML file I can open in Firefox, and the items in the table of contents show up as links, but when I click them, nothing happens (I expect the browser to jump to the relevant section).
This happens when I use either markdown or markdown_github as the input format (-f in pandoc).
Question:
How can I get the table of contents to show the expected behavior in HTML?
Or is the concept of 'table of contents' a wrong approach to HTML, and I should change my markdown code?
Apologies if I am going about this the wrong way, I have no experience with HTML / web documents.
I found a couple of similar questions but they seemed to be specific to other programming languages / tools, so any help how I can achieve this with markdown / pandoc is much appreciated.
I am using pandoc 1.19.2.4 on Ubuntu.
Example markdown:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#1-reading-a-text-file)
## Chapter 1
This post focuses on standard text processing tasks such as reading files and processing text.
### 1. Reading a text file
Reading a file.

Looking at your markdown file, you have used #1-reading-a-text-file as the id for the 1st subheading.
While converting it to HTML, the following line is generated for the subheading:
<h3 id="reading-a-text-file">1. Reading a text file</h3>
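The problem is a mismatch: the table of contents links to "#1-reading-a-text-file", but the id generated for the heading is "reading-a-text-file", without the leading "1-".
When pandoc generates heading identifiers automatically, it removes everything up to the first letter, so a generated id never starts with a number.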
Changing the table of contents to the following should work:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#reading-a-text-file)
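The mismatch can be reproduced with a small Python sketch of the rules the pandoc manual describes for automatically generated heading identifiers. The function name pandoc_auto_identifier is hypothetical, and the sketch is simplified: it assumes plain-text headings and ignores formatting, footnotes, and the numeric suffixes pandoc adds to duplicate ids.

```python
def pandoc_auto_identifier(heading: str) -> str:
    """Approximate pandoc's auto-generated id for a plain-text heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Replace runs of whitespace with a single hyphen and lowercase the result.
    slug = "-".join(kept.split()).lower()
    # Drop everything before the first letter, so an id never starts with a digit.
    for index, ch in enumerate(slug):
        if ch.isalpha():
            return slug[index:]
    # Nothing left: pandoc falls back to the identifier "section".
    return "section"


for title in ["Chapter 1", "1. Reading a text file"]:
    print(f"{title!r} -> #{pandoc_auto_identifier(title)}")
```

Running it on the two headings from the question gives #chapter-1 and #reading-a-text-file, which are the targets the table of contents has to use.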

Related

Pandoc fails to generate pdf from basic HTML page

I'm trying to generate a PDF from a basic HTML page using pandoc, but it seems like a table (or a few) are preventing the PDF from being generated.
This is the page I'm trying to convert to a PDF document. Here is the command that I'm running:
$ pandoc --verbose --from=html --to=pdf --output=ch3.pdf --pdf-engine=xelatex -V geometry:margin=1.5in https://bob.cs.sonoma.edu/IntroCompOrg-x64/bookch3.html
And here is the end of the output generated:
Error producing PDF.
! Argument of \LT@nofcols has an extra }.
<inserted text>
\par
l.2588 \begin{longtable}[]{@{}r@{}}
I was able to save it as a markdown document, and then convert that markdown document to PDF, but the tables become a block of incomprehensible markdown text. I suspect that something is going wrong in the translation of the table elements, but I don't know anything about latex so I can't say for sure, and have no idea where to start debugging. Any help is appreciated, thank you!

Lua filter for pandoc to append html

I'm currently compiling markdown to html using pandoc:
pandoc in.md -o out.html
and would like to include the same piece of html code in each of the output files, without having to write it into my markdown file.
I was hoping that a lua filter would do the job. However, the docs seem to indicate the filters will only respond to a sequence of characters within my markdown file, rather than appending something to each file.
I've played around with CSS (I've never used it before), but it doesn't look like I can just add arbitrary html code like this (correct me if I'm wrong).
To summarize, I'd like to find a way to add html code to my output.
A Lua filter is likely to be overkill here. Pandoc has an option --include-after-body (or --include-before-body) which will do what you need:
-A FILE, --include-after-body=FILE|URL
Include contents of FILE, verbatim, at the end of the document body (before the </body> tag in HTML, or the \end{document} command in LaTeX). This option can be used repeatedly to include multiple files. They will be included in the order specified. Implies --standalone.
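As a rough illustration, a full invocation could be assembled as below. This is only a sketch: the helper name build_include_command and the file name footer.html are hypothetical, and the command is constructed here rather than executed.

```python
def build_include_command(source, output, after_body_files):
    """Build a pandoc argument list that appends HTML files to the body."""
    command = ["pandoc", source, "-o", output]
    # --include-after-body may be repeated; files are included in this order.
    for path in after_body_files:
        command.append(f"--include-after-body={path}")
    return command


# footer.html is a hypothetical file holding the shared HTML snippet.
command = build_include_command("in.md", "out.html", ["footer.html"])
print(" ".join(command))
# To actually run it (requires pandoc on PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Because the option implies --standalone, the output is a complete HTML document rather than a fragment.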

Using pandoc to generate PDF from Markdown with inline style

I'm looking to create a mostly markdown document, but would like to take advantage of inserting HTML when I might need a bit more control over formatting on a case-by-case basis. I have iaWriter on macOS and am able to do so, and from my understanding of markdown this is an included behaviour.
When using pandoc on my linux machine, however, some tags (most notably <i> tags at the moment) are not interpreted.
My markdown file is:
This _does_ work.
This does <i>not</i> work.
However, inserting a <p>tag</p> will create a line-break and new paragraph.
When I execute pandoc -o test.pdf test.md I get the result: test.pdf
I've tried a few extensions in the output (+raw_html, +inline_code_attributes) thinking maybe I was missing something but have so far not found an explanation.
Apologies if this is a duplicate, but I was unable to find it, and have so far been unable to source an answer.
Thank you.
See the pandoc MANUAL: Creating a PDF.
By default, pandoc will use LaTeX to create the PDF. Therefore, raw HTML will be ignored; it only has an effect if your output format is HTML as well. However, you can use wkhtmltopdf instead of pdflatex to go from markdown to PDF via HTML, instead of via LaTeX.
From the raw HTML extension docs:
The raw HTML is passed through unchanged in HTML, S5, Slidy, Slideous, DZSlides, EPUB, Markdown, Emacs Org mode, and Textile output, and suppressed in other formats.
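The route via HTML could look like the sketch below. It assumes pandoc 2.0 or later, where the engine is selected with --pdf-engine, and that wkhtmltopdf is installed; the helper name pdf_via_html_command is hypothetical, and the command is only built, not run.

```python
def pdf_via_html_command(source, output):
    """Build a pandoc command that renders the PDF through HTML."""
    # -t html5 makes pandoc produce HTML first, so raw tags such as <i> survive;
    # --pdf-engine=wkhtmltopdf then turns that HTML into the PDF.
    return ["pandoc", source, "-t", "html5", "--pdf-engine=wkhtmltopdf", "-o", output]


print(" ".join(pdf_via_html_command("test.md", "test.pdf")))
```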

How to place labels correctly and use cross-reference in latex to be able to convert to html(5) using pandoc?

Introduction
I'd like to create source code in latex in order to produce pdf via pdflatex and html page(s) via pandoc. I use the following source
\documentclass[12pt,a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[magyar]{babel}
\usepackage{hyperref}
\begin{document}
\begin{enumerate}
\item \label{itm:thefirst}
First example point
\end{enumerate}
This is an example to demonstrate how \textbackslash label\{\} and \textbackslash ref\{\} are not working with pandoc. Here should be a reference to the first example: \ref{itm:thefirst}
\end{document}
This can be compiled with pdflatex without any error or warning.
Problem
I create the html page via pandoc using the following code:
pandoc -f latex sample.tex -t html5 -o sample.html -S --toc -s
but it creates unsatisfactory results around the label and the reference:
<body>
<ol>
<li><p>[itm:thefirst] First example point</p></li>
</ol>
<p>This is an example to demonstrate how \label{} and \ref{} are not working with pandoc. Here should be a reference to the first example: [itm:thefirst]</p>
</body>
Question
What shall I modify in the latex source code in order to get something like this:
<body>
<ol>
<li><p id="itm:thefirst">First example point</p></li>
</ol>
<p>This is an example to demonstrate how \label{} and \ref{} are not working with pandoc. Here should be a reference to the first example: (1)</p>
</body>
What shall I modify in the latex source code [...]
Pandoc does not currently support parsing and processing of \label{...} or \ref{...} from LaTeX files, so there is no easy solution to your problem.
Why not go an alternative way?
Instead of writing your sources in LaTeX, write them in Markdown.
That way it will be much easier to convert the sources to HTML as well as to LaTeX and PDF.
As a bonus, you also get top-notch support for converting the sources to EPUB, DOCX, ODT and much more.

convert pdf into small chunks of data(many chunks per page)?

I have a pdf file and I need to get small pieces of data from it.
It is structured like this :
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with png images, but how do I cut the image into smaller chunks to fit the size of each question (I want to separate each question into individual files)?
P.S. I have a lot of pdf files, so a command-line tool would be nice.
I'll try to give you an approach on how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
Ok, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter 'Question' in your case. (Side note: csplit and awk are Linux-specific utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO Post :
csplit input.txt '/^Question$/' '{*}'
awk '/Question/{filename=NR".txt"}; {print >filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically and going through the general structure of how the program converts the pdf into html. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, and that is something you can take care of. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags for an html file because we removed that initially. Then it's only a matter of reading each file, just taking care of the element that surrounds the string 'Question' and adding the html, head and body elements around the content and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise)
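The splitting and re-wrapping steps could be sketched in Python as follows. This is an untested-on-real-data illustration: it assumes each question starts on a line whose visible text begins with "Question" followed by a number, possibly wrapped in tags, which may not match what pdftohtml actually emits. All function and file names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a question starts at a line like "<p>Question 1</p>".
QUESTION_START = re.compile(
    r"^[ \t]*(?:<[^>]+>[ \t]*)*Question\s+\d+", re.MULTILINE
)


def split_questions(body: str) -> list[str]:
    """Cut the body content into one fragment per question."""
    starts = [match.start() for match in QUESTION_START.finditer(body)]
    ends = starts[1:] + [len(body)]
    return [body[start:end].strip() for start, end in zip(starts, ends)]


def wrap_html(fragment: str, title: str) -> str:
    """Put the html, head and body elements back around a fragment."""
    return (
        "<!DOCTYPE html>\n<html>\n"
        f'<head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body>\n{fragment}\n</body>\n</html>\n"
    )


def write_questions(body: str, out_dir: Path) -> list[Path]:
    """Write question_1.html, question_2.html, ... into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, fragment in enumerate(split_questions(body), start=1):
        path = out_dir / f"question_{number}.html"
        path.write_text(wrap_html(fragment, f"Question {number}"), encoding="utf-8")
        paths.append(path)
    return paths
```

Anything before the first matching line is dropped, and tags that open in one question and close in the next would still need manual cleanup.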
I hope this gets you started in the right direction.