Converting multiline code snippets in HTML to Markdown with pandoc - html

I want to translate this snippet of HTML into Markdown using pandoc.
<code class="code_block"># chown root:root /boot/grub/grub.cfg<br/># chmod og-rwx /boot/grub/grub.cfg
</code>
The output I want is something like this:
```
# chown root:root /boot/grub/grub.cfg
# chmod og-rwx /boot/grub/grub.cfg
```
But the output I get never includes the <br>, that is, there is no line break in the Markdown file:
# chown root:root /boot/grub/grub.cfg# chmod og-rwx /boot/grub/grub.cfg
I already tried different commands and extensions.
$ pandoc -f html -t markdown t.html
$ pandoc -f html -t markdown+hard_line_breaks t.html
$ pandoc -f html -t markdown+raw_html+hard_line_breaks t.html
$ pandoc -f html -t markdown+raw_html+hard_line_breaks-inline_code_attributes t.html
Am I missing something?

This is due to the way pandoc represents inline code internally: the code is stored as a string of verbatim text together with a set of attributes. Newlines, being layout commands, don't fit into this representation and are ignored.
Note also that the above is a rather uncommon way of writing multi-line code. See, e.g., the MDN docs on the <code> element:
To represent multiple lines of code, wrap the <code> element within a <pre> element. The <code> element by itself only represents a single phrase of code or line of code.

The problem is that your code block is not properly formatted as a code block. You need (at least) the following:
<pre><code># chown root:root /boot/grub/grub.cfg
# chmod og-rwx /boot/grub/grub.cfg
</code></pre>
In addition to the HTML spec, covered in @tarleb's answer, the Markdown rules also differentiate between a code block and a code span based solely on the existence (or not) of the <pre> tag.
Note that the original Markdown rules demonstrate a code block as generating this HTML:
<pre><code>This is a code block.
</code></pre>
A <code> tag wrapped in a <pre> tag. In contrast, the same rules demonstrate a code span generating this HTML:
<p>Use the <code>printf()</code> function.</p>
Note that only the <code> tag is used, and it is only an inline span (wrapped in a <p>), not a block-level element.
When Pandoc is converting from HTML back to Markdown it follows the same convention in reverse. Yes, you have class="code_block" set on your <code> tag, but Pandoc doesn't know what that means, nor should it. And yes, your <code> element is not wrapped in a <p>, but that is just poorly formed HTML (according to the HTML spec, <code> is not a block-level element, but phrasing content; that is, content which gets wrapped in a block-level element such as a <p> or a <pre> element).
And then there is the issue of your <br> tag. How would Pandoc know if that is part of the code or a styling hook? In fact, it doesn't. Which is why we use <pre> tags for multi-line code blocks. With the <pre> tag, whitespace is preserved. Therefore, you only need a newline character without the <br> tag.
For completeness, I realize that the original Markdown rules do not include fenced code blocks, so I will also point to the GitHub Flavored Markdown spec, which also demonstrates fenced code blocks as producing <pre><code> wrapped blocks. Naturally, to go in reverse, you would need to start with <pre><code> wrapped blocks to end up with fenced code blocks.
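If the source HTML cannot be changed by hand, one option is to rewrite it before it reaches pandoc. The following is a minimal Python sketch, not a pandoc feature: the function name `code_blocks_to_pre` is hypothetical, it assumes the blocks look exactly like the one in the question (a `<code class="code_block">` element with `<br/>` separators), and it uses a regular expression rather than a real HTML parser, so it will not cope with nested or unusual markup.

```python
import re

# Matches the question's non-standard block: <code class="code_block"> ... </code>
CODE_BLOCK = re.compile(r'<code class="code_block">(.*?)</code>', re.DOTALL)


def code_blocks_to_pre(html: str) -> str:
    """Rewrite <code class="code_block"> elements as <pre><code> blocks."""

    def repl(match: re.Match) -> str:
        # Turn each <br>, <br/> or <br /> into a real newline.
        body = re.sub(r'<br\s*/?>', '\n', match.group(1))
        # Drop stray newlines at the edges, then end with exactly one.
        body = body.strip('\n')
        return '<pre><code>' + body + '\n</code></pre>'

    return CODE_BLOCK.sub(repl, html)


snippet = (
    '<code class="code_block"># chown root:root /boot/grub/grub.cfg<br/>'
    '# chmod og-rwx /boot/grub/grub.cfg\n</code>'
)
print(code_blocks_to_pre(snippet))
```

The rewritten HTML can then be saved and passed to `pandoc -f html -t markdown` as before; with a `<pre><code>` wrapper pandoc should treat it as a code block rather than an inline code span.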

Related

pandoc: Convert GitHub-flavoured MarkDown containing mixed html and markdown to html

My Markdown was created according to the style from this top-result cheatsheet with HTML directives, and I convert it using this command:
pandoc -f gfm -t html --atx-headers -s -o out.html in.md
However, the generated HTML always ignores titles that have the following HTML code directly above them, leaving tons of ###, #### in my output HTML. My titles look like these:
# H1
<a name=toc-anchor-h2 />
## H2
<a name=toc-anchor-h3 />
### H3
<a name=toc-anchor-h4 />
#### H4
H1 works fine, but the # characters at the remaining levels are all treated by pandoc as plain text. How can I solve this problem?
The headers must be preceded by a blank line. The missing blank line is causing the Markdown parser to not recognize them as headers. Therefore, edit your document to the following:
# H1

<a name=toc-anchor-h2 />

## H2

<a name=toc-anchor-h3 />

### H3

<a name=toc-anchor-h4 />

#### H4
Or, if you are concerned that this moves the anchors too far away from the intended target, include them inline:
# H1
## <a name=toc-anchor-h2 />H2
### <a name=toc-anchor-h3 />H3
#### <a name=toc-anchor-h4 />H4
Or, as you are using Pandoc, you could use one of the many Pandoc extensions which assign identifiers directly to each header.
As it turns out, Pandoc's gfm variant of Markdown (which you are using) already includes the auto_identifiers extension. As the name implies, the auto_identifiers extension will cause id attributes to be auto-generated for every header. As a reminder, assigning an id attribute to an HTML element has the same effect as defining an anchor; you can link to either with a hash fragment. Therefore, you could simply remove your anchors and use the auto-generated ids which have already been assigned to the headers themselves.
However, if you would like to define your own custom id attributes for each header, then you may want to enable the header_attributes extension and alter your Markdown as follows:
# H1
## H2 {#toc-anchor-h2}
### H3 {#toc-anchor-h3}
#### H4 {#toc-anchor-h4}
which would generate the following HTML:
<h1 id="h1">H1</h1>
<h2 id="toc-anchor-h2">H2</h2>
<h3 id="toc-anchor-h3">H3</h3>
<h4 id="toc-anchor-h4">H4</h4>
Note that the "H1" header has an auto id assigned (based upon the text content of the element), while the remaining headers have the custom ids assigned to them.
One word of caution regarding the header_attributes extension: The syntax for defining the custom ids is non-standard and not supported by most Markdown implementations. If you want portable Markdown, then you should probably stick to the auto-generated ids as that does not require any non-standard markup in your documents.
Update: Note that according to the docs, the header_attributes extension is not compatible with gfm. Therefore, you wouldn't be able to use that extension. However, you get auto_identifiers by default. If you want custom identifiers, then you would need to use the custom raw HTML anchors. Of course, that gives you the added benefit of a portable Markdown document.
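Going back to the blank-line fix at the top of this answer: for a large document it may be easier to apply it with a script. This is only a rough Python sketch under stated assumptions. The function name `ensure_blank_line_before_headings` is made up for illustration, and it treats every line starting with one to six `#` characters followed by whitespace as a header, so it would also touch such lines inside code blocks.

```python
import re

# An ATX header: one to six '#' characters followed by whitespace.
ATX_HEADING = re.compile(r'#{1,6}\s')


def ensure_blank_line_before_headings(markdown: str) -> str:
    """Insert a blank line before any header that directly follows text."""
    out = []
    for line in markdown.split('\n'):
        follows_text = bool(out) and out[-1].strip() != ''
        if ATX_HEADING.match(line) and follows_text:
            out.append('')  # the blank line the parser needs
        out.append(line)
    return '\n'.join(out)


source = '# H1\n<a name=toc-anchor-h2 />\n## H2\n<a name=toc-anchor-h3 />\n### H3'
print(ensure_blank_line_before_headings(source))
```

Headers that already have a blank line above them are left alone, so running the function twice gives the same result as running it once.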

Mixing markdown with html for including image and empty line meaning

When I put the following:
|![alt ]({attach}img/myimg.png "hint1")|
Pelican generates the image as expected. I wanted to use HTML to customise the alignment, so I used:
<p align="center">
![alt ]({attach}img/myimg.png "hint1")
<br>
Figure 1.
</p>
but Pelican produces only the literal text:
![alt ]({attach}img/myimg.png "hint1")
Figure 1.
However, if I put the table first and the HTML second, without an empty line separating them, like this:
|![Attention architecture ]({attach}img/myimg.png "hint1")|
<p align="center">
![Attention architecture ]({attach}img/myimg.png "hint1")
<br>
Figure 1.
</p>
I get two images. But when I place an empty line after the first and before the second image instruction, like this:
|![Attention architecture ]({attach}img/myimg.png "hint1")|

<p align="center">
![Attention architecture ]({attach}img/myimg.png "hint1")
<br>
Figure 1.
</p>
then only the first instruction produces the image as expected; the second does not, and produces the same literal text as before:
![alt ]({attach}img/myimg.png "hint1")
Figure 1.
How can I include an image this way, using HTML?
In short, you can't use HTML to wrap Markdown. Markdown is not processed inside HTML. But you can use a Markdown extension to assign attributes to the generated HTML.
As the rules state:
Markdown formatting syntax is not processed within block-level HTML tags. E.g., you can’t use Markdown-style *emphasis* inside an HTML block.
Of course, you are wondering why it appears to work when you include a line of text on the line immediately before the <p> tag. The rules also explain:
The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.
In your case, the lack of a blank line causes the parser to not recognize the <p> tag as a raw HTML block. Therefore, it wraps the block in an extra <p> tag of its own, generating invalid HTML. As a result, the styling hooks may not apply as you desire.
As it happens, Pelican includes support for Markdown extensions, including the Attribute List Extension, which is installed by default along with the Markdown parser. You just need to enable it (scroll down to Markdown). Add the following to your config file:
MARKDOWN = {
    'extension_configs': {
        # Enable the Attribute List extension with its default settings
        'markdown.extensions.attr_list': {},
    },
}
Then you can include attribute lists in your Markdown to assign various styling hooks.
![alt ]({attach}img/myimg.png "hint1")
<br>
Figure 1.
{: align=center }
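If attribute lists do not fit your setup, another route that follows from the rule quoted above is to write the whole figure as raw HTML, so that no Markdown sits inside the <p> block. The Python sketch below only builds that HTML string. The helper name `centered_figure` is hypothetical, and it assumes Pelican resolves `{attach}` inside a `src` attribute of raw HTML the same way it does for Markdown images, which is worth checking on your own site.

```python
from html import escape


def centered_figure(src: str, alt: str, title: str, caption: str) -> str:
    """Build a centered figure as a raw HTML block (no Markdown inside)."""
    return (
        '<p align="center">\n'
        f'<img src="{escape(src)}" alt="{escape(alt)}" title="{escape(title)}">\n'
        '<br>\n'
        f'{escape(caption)}\n'
        '</p>'
    )


print(centered_figure('{attach}img/myimg.png', 'alt', 'hint1', 'Figure 1.'))
```

Paste the printed block into the Markdown source with a blank line before and after it, so the parser recognizes it as a raw HTML block.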

Pandoc HTML variables: `quotes` and `math`

Pandoc's default HTML template contains these two variables:
quotes,
math.
How are they supposed to be used?
More specifically I see that quotes sets the values for the tag <q>. Is this tag used in markdown to HTML conversion?
tl;dr: they seem to be mostly obsolete legacies from previous versions of pandoc
quotes
A little archaeology of pandoc commits shows that 'quotes' was added when pandoc switched from using <q> tags to inserting quotation marks directly. A new option, --html-q-tags, was added to keep the previous behavior: the option wraps quotes in <q> and sets quotes to true, so that a piece of CSS is added as explained in the HTML template. See this commit to pandoc and this commit to pandoc-templates. See the behavior with the following file:
"hello world"
This:
pandoc test.md -t html --smart --standalone
Produces (skipping the usual head, with no css affecting <q>)
<p>“hello world”</p>
While this
pandoc test.md -t html --standalone --html-q-tags --smart
produces (skipping the usual header)
<style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
</head>
<body>
<p><q>hello world</q></p>
</body>
You have to use --smart though.
math
It looks like this was introduced to include math rendering scripts inside the standalone file. See this commit from 2010. I think some command-line options that pick a math rendering system other than the current default, like --mathml, set this variable to a value that actually makes sense (like copying the math rendering scripts). Try:
pandoc -t html --mathml
For the quotes variable, see @scoa's answer.
As for the math variable, here is what I found.
When using MathML, that is the option --mathml, the code block:
$if(math)$
$math$
$endif$
in the default HTML conversion template adds a portability script to the HTML output.
Anyway, Chrome and Edge do not currently support MathML and Firefox seems to support it without this script.
So, for a custom template, removing the $if(math)$ ... code block will not affect MathML rendering.
When using MathJax, that is the option --mathjax, $if(math)$ ... adds to the HTML output the script block:
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
This is always necessary to render the maths formulae.
When using the --latexmathml, a giant script, converting the LaTeX style math into MathML, is inserted by the $if(math)$ ... code block. Without this code block in the conversion template, the script is not inserted and the maths can't be rendered.

How can I eliminate the empty line in code blocks rendered by jekyll?

Jekyll on GitHub Pages uses Pygments by default to render syntax highlighting for code blocks. But I prefer an easier alternative, highlight.js, because then I only need to indent by 4 spaces to mark code blocks in the Markdown source files.
However, my R code is mistakenly interpreted by highlight.js as PHP, Perl, Makefile, or some other type of code, so I want to mark the code block manually with
```r
(some r code)
```
instead. But when I use this, the first line of the code block always appears to be a blank line. When I view the HTML source produced by the 4-space form, it looks like this:
<pre><code>x <- rnorm(100)
y <- 2*x + rnorm(100)
lm(formula=y~x)
</code></pre>
which does not suffer from this problem.
How can I eliminate the blank line in the first line of the code block?
I faced the same issue today when I changed my highlighter to highlight.js.
With help from others, I finally got rid of this blank line, and I am willing to share the solution. Basically, the whitespace inside <pre> is not trimmed, so it is treated as a newline in the rendered page (you can use the Firebug extension for Firefox, with show whitespace enabled, to observe the extra line).
Then the solution is obvious.
Put the pre and code tags on the same line as your actual code, like this:
<pre><code class="css">@font-face {
font-family: Chunkfive; src: url('Chunkfive.otf');
}
</code></pre>
or use the solution provided by mhulse to make your raw post more readable:
<pre><code
>line of code
Here and ...
Here
</code></pre>
Alternatively, write your own JS code to trim the leading and trailing whitespace of your content before it is put in <pre>.
For more details, check this page.
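The trimming idea can also be applied to the generated HTML as a post-processing step instead of in the browser. The following is a minimal Python sketch, not something Jekyll or highlight.js provides: the function name `strip_leading_newline_in_pre` is made up, and it assumes the blank line comes from a newline sitting directly after the opening `<pre><code ...>` tags.

```python
import re

# Opening <pre><code ...> tags followed by trailing spaces and one newline.
LEADING_NEWLINE = re.compile(r'(<pre[^>]*><code[^>]*>)[ \t]*\r?\n')


def strip_leading_newline_in_pre(html: str) -> str:
    """Remove the newline that directly follows <pre><code ...>."""
    return LEADING_NEWLINE.sub(r'\1', html)


page = '<pre><code class="r">\nx <- rnorm(100)\ny <- 2*x + rnorm(100)\n</code></pre>'
print(strip_leading_newline_in_pre(page))
```

Only the first newline is removed, so indentation and blank lines inside the code itself are kept.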

How to show the string inside a tag verbatim?

What tag can I use to prevent any interpretation? I need this because I want to write down some source code and its result in Blogger. I have this code in Blogspot, but the code inside the <pre> is processed.
The code is as follows:
<pre class='prettyprint'>
$latex \displaystyle S(n)=\sum_{k=1}^{n}{\frac{1}{T_{k}}=\sum_{k=1}^{n}{\frac{6}{k(k+1)(k+2)}$
</pre>
This is the result:
$latex \displaystyle S(n)=\sum_{k=1}^{n}{\frac{1}{T_{k}}=\sum_{k=1}^{n}{\frac{6}{k(k+1)(k+2)}$
If I could replace '$' inside <pre> with something equivalent, I could avoid this issue.
I tried <code> and <pre>, but they all interpret the content.
ADDED
I'm trying to use the javascript code found in this post.
If I understand correctly, you are using Replacemath, and its documentation says: “Should you need to prevent certain $ signs from triggering LaTeX rendering, replace $ with the equivalent HTML <span>$</span> or &#36;, or put the code inside a <pre> or <code> block if appropriate.” Of these, the first method seems to actually work.
That is, replace all occurrences of “$” inside the pre element by <span>$</span>.
I tested this by publishing a test in my blog (which had been dormant for 6 years...). I had to manually break the pre block to fit into the column.
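For posts with many dollar signs, the replacement can be scripted before pasting the HTML into the editor. Here is a small Python sketch of that step. The function name `protect_dollars_in_pre` is hypothetical, and the regular expression assumes simple, non-nested <pre> elements.

```python
import re

# Captures the opening <pre ...> tag, its content, and the closing tag.
PRE_ELEMENT = re.compile(r'(<pre\b[^>]*>)(.*?)(</pre>)', re.DOTALL)


def protect_dollars_in_pre(html: str) -> str:
    """Wrap every '$' inside <pre> elements in <span>$</span>."""

    def repl(match: re.Match) -> str:
        opening, body, closing = match.groups()
        return opening + body.replace('$', '<span>$</span>') + closing

    return PRE_ELEMENT.sub(repl, html)


post = "<pre class='prettyprint'>\n$latex x^2$\n</pre>\n<p>It costs $5.</p>"
print(protect_dollars_in_pre(post))
```

Dollar signs outside <pre> are left untouched, so formulas elsewhere in the post should still be rendered.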