How can we make pandoc produce pretty, human readable HTML from markdown?

Recently I realized that I had made a bad choice by writing my notes in Markdown. I wanted to switch to HTML instead and put it on my website.
I used pandoc to convert the file from Markdown to HTML:
pandoc file.md -o file.html
But everything seems wrong about this (snippet from vim):
Problems:
Code is not readable. It was very, very readable in Markdown, but I'm not sure what's up with character codes showing up instead of something humans can read.
Indentation is weird. I just did indentation with the usual gg=G in vim, and it seems that all the <p> tags are being gradually indented further and further to the right. Is this expected behaviour? It certainly looks ugly.
Specific things about the HTML code are undesirable. This is probably the least vexing, since you can easily replace things with substitute, but since I am planning on using prism for code highlighting, I would like things like class="language-c++" instead of class="sourceCode cpp".
Question: Is there any way to easily fix up this mess, or have pandoc generate better in the first place? Is there a substitute that works better than pandoc? Is there a pandoc option that I am missing?

By "character codes" I assume you mean &gt; and the like. These are necessary in HTML, since a < character has a special meaning. Without these escapes you'd have invalid HTML.
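To see why the escapes are harmless, here is a small sketch using Python's standard html module. Python is only used for illustration here (pandoc does its own escaping), and the C-like snippet is a made-up example:

```python
import html

# A line of source code containing characters that are special in HTML.
code = "if (a < b && b > c) return;"

# Escape it the way it has to appear inside an HTML document.
escaped = html.escape(code, quote=False)
print(escaped)

# Decoding the entities, as a browser does when rendering, gives back
# exactly the original text.
assert html.unescape(escaped) == code
```

The escaped form only looks ugly in the HTML source; the rendered page shows the original characters.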
Indentation: pandoc does not indent its HTML output. So this is the result of something you did in vim. It's not a pandoc issue.
Code formatting: By default, pandoc inserts classes and span tags to create highlighted HTML for code blocks. If you don't want this (e.g. if you want to do your own highlighting with some JavaScript code) then you can disable it using --no-highlight. You may still get some transformation of class names. You can change these using a simple Lua filter: see the documentation for Lua filters.
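The answer points to Lua filters. As a rough sketch of the same idea, here is a JSON filter written in Python with only the standard library instead of Lua. It assumes pandoc's JSON representation, in which a code block is a node with "t" set to "CodeBlock" and whose first content element is the attribute triple [identifier, classes, key-value pairs]. The function names and the "language-" prefix are illustrative choices, not anything pandoc prescribes:

```python
import json
import sys


def prefix_code_classes(node, prefix="language-"):
    """Recursively add `prefix` to the classes of every CodeBlock node."""
    if isinstance(node, list):
        for child in node:
            prefix_code_classes(child, prefix)
    elif isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            attr = node["c"][0]  # [identifier, classes, key-value pairs]
            attr[1] = [c if c.startswith(prefix) else prefix + c
                       for c in attr[1]]
        else:
            for child in node.values():
                prefix_code_classes(child, prefix)
    return node


def run_filter():
    """Entry point for filter use: read the JSON AST on stdin, write it to stdout."""
    json.dump(prefix_code_classes(json.load(sys.stdin)), sys.stdout)
```

If the script calls run_filter() at the end and is saved under a name such as prefix_classes.py (a hypothetical file name), it could be used as: pandoc --no-highlight --filter ./prefix_classes.py file.md -o file.html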

Related

If Markdown is a superset of HTML, then why can't it do everything HTML can?

I'm trying to understand Markdown's relationship to HTML. If I understand correctly both are markup languages (an umbrella term describing languages that add formatting elements to plain-text documents). Markdown converts plain text to HTML.
My understanding is that Markdown is a superset of HTML:
Markdown is a popular markup language that is a superset of HTML.
I'm assuming that it's a strict or proper superset. Drawing a parallel from What does it mean when one language is a parallel superset of another?, I interpret that to mean that every valid HTML program is also a valid Markdown program (e.g. HTML is understood in a Jupyter Notebook Markdown cell), but that the converse is not true.
What seems conflicting to me is that if Markdown is a superset of HTML, then why is it that Markdown can't do everything HTML can? I would think the opposite to be true, since a superset extends the language without removing or changing any of the existing features. Also, I would expect HTML to be a superset of Markdown, since HTML is more expressive and more difficult for most humans to read.
Below is a diagram trying to mimic that in What does “Objective-C is a superset of C more strictly than C++” mean exactly?
That documentation is misleading. Markdown itself is not a superset of HTML. The documentation for the original Markdown project is pretty clear:
Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose. HTML is a publishing format; Markdown is a writing format. Thus, Markdown’s formatting syntax only addresses issues that can be conveyed in plain text.
Today there are several flavours of Markdown, many of which add features that were not present in the original version like tables and syntax-highlighted code blocks. This doesn't change the fundamental fact that Markdown covers a subset of HTML.
(Technically speaking, Markdown isn't a subset of HTML either. *, for example, has no special meaning in HTML. Unconverted Markdown documents might be well-formed HTML but the semantics are very different. But Markdown syntax maps to a subset of HTML tags.)
However, the very next paragraph in the original documentation says:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
Since you can directly use HTML in Markdown it could be considered a superset of HTML. For example, this is valid Markdown:
# My awesome title
I <em>really</em> like coffee
If you pass an HTML document through a conforming Markdown processor it should come out the other side untouched. Being able to directly use HTML in Markdown is very similar to how one can directly use C in C++. This may be what the Jupyter documentation means.

Including HTML in Markdown

Assuming I am in control of the parsing environment and I'm certain it is only to be converted to HTML (and not any of the many other formats possible), is it OK to embed some HTML within one's Markdown in order to sidestep a bug?
Could there be any basic side effects I (as a newbie) couldn't predict but should be aware of?
Non-conventional Markdown example:
_"<strong>This</strong> is an example sentence."_ -**OP**
Which outputs valid HTML:
<em>"<strong>This</strong> is an example sentence."</em> -<strong>OP</strong>
Resulting in successful content:
"This is an example sentence." -OP
Background (don't have to read):
I noticed that if I include HTML in my Markdown, it appears to get skipped during the conversion, resulting in it being seamlessly incorporated in the output HTML.
This appears to be a good thing, at least in my case (Using Hugo to build a website with a template theme) where the Markdown wasn't producing the correct result (leaving a pair of unwanted *s in the HTML: should have been *italic* but asterisks showing).
For those wondering - yes, I confirmed my Markdown was correct using other parsers that handled it fine.
Note: the examples here are simplifications of my specific case
Not only is it okay to do, but it is encouraged. As the rules state:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
And later:
If you want, you can even use HTML tags instead of Markdown formatting; e.g. if you’d prefer to use HTML <a> or <img> tags instead of Markdown’s link or image syntax, go right ahead.
Of course, there are a few things to take into consideration. For example, block-level tags must be at the document root level (they cannot be nested inside blockquotes, lists, etc.) and content inside them does not get parsed as Markdown. However, inline tags can be placed anywhere and do not restrict Markdown parsing.
For people using Markdown in highly modular or user-flexible environments (probably slightly more advanced readers):
One should note that although Markdown is most commonly converted to HTML, it can also be used with other formats[1].
For this reason I think it's important to note that if you (as a publisher of content) are not the one who determines what the Markdown will be parsed with, or how it is converted, it may be 'safer' to not embed HTML in it.
[1] as stated in the Markdown Wikipedia page.

Am I using Latex/Mathjax right for coloring text?

I'm trying to use the latex commands for mathjax in my html code. I have the following commands in my body tag:
\usepackage{xcolor}
\color[Hello world]{ABCDEF}
which can be seen in http://jsfiddle.net/gamea12/e6fna2bs/, but it doesn't seem to be working and is displayed as text for some reason. Can I get some help on how mathjax formatting works?
You might want to try something like
$${\rm\color[rgb]{1,0,0}Some~red~math}$$
in your document body, keeping in mind that MathJax is not a complete LaTeX authoring environment, only a means to display mathematics in HTML.

How do I remove excess whitespace in an HTML file? (And only excess whitespace)

I have a horrible, ugly HTML file that was spat out by a form generator and slightly modified to look nice. This HTML file needs to be translated, so I hooked up some scripts using po4a and csv2po, and that all works fairly well except for one thing: some of the base strings in our translation templates are surrounded by whitespace, and the translators get rather confused.
The other thing is I have this working with a Makefile (because that generated form is updated quite frequently and I'm a nerd). I'd like to keep it that way because it's nice for my workflow. So, I need a command line tool.
I'm really looking for the simplest solution in this case, so I ran the HTML file through HTML Tidy, and that removes the weird whitespace quite competently. However, it does a lot of stuff I don't need. It messes with the doctype (and it doesn't support an html5 doctype), and I've ended up with a really crazy command line just to get it to not mangle things. It is not very pleasant.
All I really want is a command line tool (not an online one) whose single goal in life is to look at my HTML file and format it nicely. Ideally not a "compressor" thing, but if that's the only option, suggestions would be nice :)
Stick it in an IDE or text editor like Notepad++ or NetBeans and hit the "format code" button, which is available in nearly every IDE?
I'm not sure if it is still being developed, but would HTML Tidy do the trick?

How to extract meaningful text from HTML

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms to do this?
I develop my applications on Rails, but I think Ruby is a bit slow for this, so if there is a good library in C for this, it would be appropriate.
Thanks!!
PS: Please do not recommend anything with Java
UPDATE:
I found this link text
Sadly, it is in Python
Use Nokogiri, which is fast and written in C, for Ruby.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.
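As a rough illustration of what "structure is preserved" buys you, here is a sketch that uses Python's standard html.parser as a stand-in for Nokogiri (it is not Nokogiri's API). The class name TextExtractor and its attributes are made up for this example. It records which tag each piece of text came from and ignores script and style content:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text together with the tag that directly encloses it."""

    SKIPPED = {"script", "style"}                 # content that is never visible text
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.pieces = []   # (enclosing tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if text and not self.SKIPPED.intersection(self.stack):
            enclosing = self.stack[-1] if self.stack else None
            self.pieces.append((enclosing, text))


parser = TextExtractor()
parser.feed("<body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
            "<script>var x = 1;</script></body>")
print(parser.pieces)
```

Because every piece of text carries its enclosing tag, a later step can, for example, weight headings more heavily than ordinary paragraphs.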
Solutions integrating with Ruby
Use Nokogiri, as recommended by Amigable Clark Kant
Use Hpricot
External Solutions
If your HTML is well-formed, you could use the Expat XML Parser for this.
For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).
Lynx is able to do this. This is open source if you want to take a look at it.
You should strip all angle-bracketed parts from the text and then collapse whitespace.
In theory, < and > should not appear anywhere else: pages contain &lt; and &gt; instead of them.
Collapsing whitespace: convert all tabs, newlines, etc. to spaces, then replace every sequence of spaces with a single space.
UPDATE: And you should start after finding the <body> tag.
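The steps in this answer can be sketched in a few lines of Python with the standard re and html modules. This is a crude approximation rather than a robust extractor, and the function name crude_text is made up for the example. Note that it leaves the contents of script and style elements in the output, which is one reason the other answers recommend a real parser:

```python
import html
import re


def crude_text(page):
    """Strip tags and collapse whitespace, following the steps described above."""
    # Start after the <body> tag, if there is one.
    match = re.search(r"<body\b[^>]*>", page, flags=re.IGNORECASE)
    if match:
        page = page[match.end():]
    # Remove every angle-bracketed part.
    page = re.sub(r"<[^>]*>", " ", page)
    # Turn entities such as &lt; and &gt; back into plain characters.
    page = html.unescape(page)
    # Collapse tabs, newlines and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", page).strip()
```

Decoding entities only after the tags have been removed matters here: doing it first would turn an escaped &lt; into a real < that the tag-stripping step could then mistake for markup.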