Pandoc ignores Markdown-Headlines - html

Using snap, I wrote a splice creating text from markdown, using this function:
markdownToHTML :: T.Text -> [Node]
markdownToHTML = renderHtmlNodes . (writeHtml writeOpts) . readMarkdown readOpts . T.unpack
  where
    readOpts = defaultParserState
    writeOpts = defaultWriterOptions
      { writerStandalone = False
      , writerHtml5 = True
      , writerStrictMarkdown = False
      }
Now, when I give it this Markdown, for example:
# Hi
Lorem ipsum something somthing
# Stuff
[a link](http://twitter.com/)
It produces this HTML:
<h1 id='hi'>Hi
</h1>
<p>
Lorem ipsum something somthing
# Stuff
<a href='http://twitter.com/'>a link</a></p>
No matter how many newlines I put before the #, it is still just wedged into the paragraph.
Funnily enough, if I paste the same Markdown into pandoc's demo site, it produces the correct HTML output.
The full code of my project can be found here, if necessary.

See the documentation for Text.Pandoc. It says:
Note: all of the readers assume that the input text has '\n' line endings. So if you get your input text from a web form, you should remove '\r' characters using filter (/='\r').
I suspect that's your problem.
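To make the suggested fix concrete, here is a minimal sketch of the same normalization, written in Python purely for illustration (the project in the question is Haskell, where filter (/='\r') does the same job). The function name and the sample input are assumptions, not part of the original code.

```python
def strip_carriage_returns(text: str) -> str:
    # Drop every carriage return so that only "\n" line endings remain,
    # which is what the readers are documented to expect.
    return text.replace("\r", "")


# Hypothetical input as it might arrive from a web form (CRLF line endings).
form_input = "# Hi\r\n\r\nLorem ipsum\r\n\r\n# Stuff\r\n"
print(repr(strip_carriage_returns(form_input)))
```

In the Haskell pipeline above, the equivalent step would sit between T.unpack and readMarkdown.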

Related

Converting HTML with equations pages to docx

I am trying to convert an html document to docx using pandoc.
pandoc -s Template.html --mathjax -o Test.docx
The conversion to docx goes smoothly for everything except the equations.
In the HTML file an equation looks like this:
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>
After running the pandoc command the result in the docx document is:
\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}
Do you have any idea how I can overcome this issue?
Thanks
A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses their content as LaTeX. The original div is then replaced with the parse result.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    return pandoc.read(stringify(div), 'latex').blocks
  end
end
Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:
pandoc --lua-filter parse-math.lua ...
As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    local result = pandoc.read(stringify(div), 'latex').blocks
    local first = result[1] and result[1].content or {}
    return (#first == 1 and first[1].t == 'Math')
      and result
      or nil
  end
end
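The "replace only when the div holds nothing but an equation" decision can also be sketched outside Lua. The Python below is an illustration, not the filter from this answer: it works on pandoc's JSON representation of a single block and, instead of calling a LaTeX reader, uses a regular expression that recognizes only a lone equation environment. The function names and the sample block are assumptions made for the example.

```python
import re

# Matches a div whose whole text is one equation environment.
EQUATION = re.compile(r"^\\begin\{equation\}(.*)\\end\{equation\}$", re.S)


def stringify(node):
    # Rough stand-in for pandoc.utils.stringify: join Str text, turn
    # spaces and soft breaks into a single space, ignore everything else.
    if isinstance(node, list):
        return "".join(stringify(child) for child in node)
    if isinstance(node, dict):
        kind = node.get("t")
        if kind == "Str":
            return node["c"]
        if kind in ("Space", "SoftBreak", "LineBreak"):
            return " "
        return stringify(node.get("c", []))
    return ""


def rewrite_div(block):
    # Return a display-math paragraph for a matching Div, else the block itself.
    if block.get("t") != "Div":
        return block
    attr, content = block["c"]
    if dict(attr[2]).get("mime-type") != "text/markdown":
        return block
    match = EQUATION.match(stringify(content).strip())
    if match is None:
        return block  # other Markdown cells are left alone
    tex = match.group(1).strip()
    return {"t": "Para", "c": [{"t": "Math", "c": [{"t": "DisplayMath"}, tex]}]}


# Hypothetical Div, shaped like the JSON pandoc emits for the cell above.
sample = {"t": "Div", "c": [
    ["", ["jp-RenderedMarkdown"], [["mime-type", "text/markdown"]]],
    [{"t": "Plain", "c": [
        {"t": "Str", "c": "\\begin{equation}"}, {"t": "SoftBreak"},
        {"t": "Str", "c": "x=1"}, {"t": "SoftBreak"},
        {"t": "Str", "c": "\\end{equation}"}]}]]}
print(rewrite_div(sample))
```

Turning this into a real JSON filter would additionally require walking the whole document tree and reading and writing JSON on standard input and output; the Lua version above gets that for free.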

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png">, while keeping the first :smile: (the one inside the attribute) unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  next unless x.text?
  x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle situations where the replaced text could appear in the source text multiple times, by completely replacing the text content with an HTML document fragment:
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
  next unless x.text?
  if x.text.match(%r{:(\w+):})
    # Use the \1 backreference rather than "#{$1}": the interpolated form is
    # evaluated once, before gsub runs, so every match would get the same name.
    replace_text = x.text.gsub(%r{:(\w+):}, '<img src="\1.png">')
    x.content = ""
    x.add_next_sibling replace_text
  end
end
I think this might be what you want. It also handles any string between two colons, so :something: produces "something.png" as well.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  if x.text? && x.content =~ /:\w+:/
    # sub removes only the first :name: in this text node and sets $1
    x.content = x.content.sub(/:(\w+):/, '')
    # the image is inserted after the remaining text of the node
    a = Nokogiri::HTML::DocumentFragment.parse('<img src="'+$1+'.png">')
    x.add_next_sibling(a)
  end
end
You are making it much too hard, and using traverse, which is slow because it forces Nokogiri to walk through every node in the document. On a large page that is costly.
Instead take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one, use search instead. search returns a NodeSet, which is like an array, so you'll want to iterate over it. There are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt; or &gt;?
I recommend passing the output through the CGI.unescape_html method.
Try:
require 'cgi'
CGI::unescape_html(doc.to_s)
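The common thread in the answers above is that the replacement must touch text nodes only, never attribute values. As a language-neutral illustration of that idea, here is a small sketch using Python's standard html.parser instead of Nokogiri; the class and function names are made up for the example, and it simply re-emits tags verbatim while rewriting text.

```python
import re
from html.parser import HTMLParser

EMOJI = re.compile(r":(\w+):")


class EmojiReplacer(HTMLParser):
    """Re-emit the markup unchanged, replacing :name: in text nodes only."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The original tag text is copied as-is, so attributes stay untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(EMOJI.sub(r'<img src="\1.png">', data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def replace_emoji(html: str) -> str:
    parser = EmojiReplacer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_emoji('<div should-not-be-replaced=":smile:">\nHello :smile:!\n</div>'))
```

Note that this sketch also rewrites text inside script and style elements, which a production version would probably want to skip.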

Is Pandoc capable of injecting arbitrary HTML attributes to any elements?

Code blocks can define HTML attributes using the fenced_code_attributes extension:
~~~~ {#mycode .haskell .numberLines startFrom="100"}
qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++
qsort (filter (>= x) xs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to use the above syntax, in some way, for regular text blocks? For example, I'd like to convert the following Markdown text:
# My header
~~~ {.text}
This is regular text. This is regular text.
~~~
~~~ {.quote}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {data-id=test-123}
+ Red
+ Green
+ Blue
~~~
into something like this:
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
If there is no such support in Pandoc itself, would it be possible to create a custom writer in Lua that does so?
Edit: Looking at the sample.lua custom writer, does anyone know what the "attributes table" is on line 35? And how does one pass these attributes to specific Pandoc elements? Also, the functionality I'm looking for above is very similar to the header_attributes extension, except it would work for all elements, not just headers.
Pandoc's filters let you operate on Pandoc's internal representation of the document. It's possible to have a chain of filters that do different transformations. I'll share two illustrative examples of filters that should help.
Markdown Code Blocks
Code blocks in Pandoc are usually meant to embed source code listings from programming languages, but here we're trying to extract the body and interpret it as markdown. Rather than using classes from your input document like text and quote, let's use a generic as-markdown class. Pandoc will generate the appropriate tags automatically.
# My header
~~~ {.as-markdown}
This is regular text. This is regular text.
~~~
~~~ {.as-markdown}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {.as-markdown data-id=test-123}
+ Red
+ Green
+ Blue
~~~
~~~ haskell
main :: IO ()
~~~
To ensure code blocks without the as-markdown class are interpreted as usual, I included a haskell code block. Here's the filter implementation:
#!/usr/bin/env runhaskell
import Text.Pandoc.Definition (Pandoc(..), Block(..), Format(..))
import Text.Pandoc.Error (handleError)
import Text.Pandoc.JSON (toJSONFilter)
import Text.Pandoc.Options (def)
import Text.Pandoc.Readers.Markdown (readMarkdown)

asMarkdown :: String -> [Block]
asMarkdown contents =
  case handleError $ readMarkdown def contents of
    Pandoc _ blocks -> blocks

-- | Unwrap each CodeBlock with the "as-markdown" class, interpreting
-- its contents as Markdown.
markdownCodeBlock :: Maybe Format -> Block -> IO [Block]
markdownCodeBlock _ cb@(CodeBlock (_id, classes, _namevals) contents) =
  if "as-markdown" `elem` classes then
    return $ asMarkdown contents
  else
    return [cb]
markdownCodeBlock _ x = return [x]

main :: IO ()
main = toJSONFilter markdownCodeBlock
Running pandoc --filter markdown-code-block.hs index.md produces:
<h1 id="my-header">My header</h1>
<p>This is regular text. This is regular text.</p>
<blockquote>
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul>
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">main ::</span> <span class="dt">IO</span> ()</code></pre></div>
Almost there! The only part that's not quite right is the HTML attributes.
Custom HTML Attributes from Code Block Metadata
The following filter should help you get started. It converts code blocks with the web-script class to an HTML <script> tag when the target format is html or html5.
#!/usr/bin/env runhaskell
import Text.Pandoc.Builder
import Text.Pandoc.JSON

webFormats :: [String]
webFormats =
  [ "html"
  , "html5"
  ]

script :: String -> Block
script src = Para $ toList $ rawInline "html" ("<script type='application/javascript'>" <> src <> "</script>")

injectScript :: Maybe Format -> Block -> IO Block
injectScript (Just (Format format)) cb@(CodeBlock (_id, classes, _namevals) contents) =
  if "web-script" `elem` classes then
    if format `elem` webFormats then
      return $ script contents
    else
      return Null
  else
    return cb
injectScript _ x = return x

main :: IO ()
main = toJSONFilter injectScript
The data-id=test-123 in your last block would come through in the _namevals's key-value pairs with type [(String, String)]. All you'd need to do is refactor script to support arbitrary tags and key-value pairs for HTML attributes, and specify what HTML to generate based on those inputs. To see the native representation of the input document, run pandoc -t native index.md.
[Header 1 ("my-header",[],[]) [Str "My",Space,Str "header"]
,CodeBlock ("",["as-markdown"],[]) "This is regular text. This is regular text."
,CodeBlock ("",["as-markdown"],[]) "> This is the first level of quoting.\n>\n> > This is nested blockquote.\n>\n> Back to the first level."
,CodeBlock ("",["as-markdown"],[("data-id","test-123")]) "+ Red\n+ Green\n+ Blue"
,Para [Str "To",Space,Str "ensure",Space,Str "regular",Space,Str "code",Space,Str "blocks",Space,Str "work",Space,Str "as",Space,Str "usual."]
,CodeBlock ("",["haskell"],[]) "main :: IO ()"]
If you'd like to play around with either of these examples, they're both in my pandoc-experiments repository.
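The refactoring suggested above, turning the (identifier, classes, key-value pairs) triple of a block into HTML attributes on an arbitrary tag, could look roughly like the following. This is a sketch in Python rather than Haskell, kept separate from the filters above; open_tag is a hypothetical helper, not a pandoc function.

```python
from html import escape


def open_tag(tag, identifier="", classes=(), keyvals=()):
    # Build an opening tag from the three parts of a pandoc attribute triple.
    parts = [tag]
    if identifier:
        parts.append(f'id="{escape(identifier, quote=True)}"')
    if classes:
        joined = escape(" ".join(classes), quote=True)
        parts.append(f'class="{joined}"')
    for key, value in keyvals:
        parts.append(f'{key}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


# The attribute triple of the last code block in the native output above.
print(open_tag("ul", "", ["as-markdown"], [("data-id", "test-123")]))
```

A filter built along these lines would emit the opening tag, the converted body, and a closing tag as raw HTML, in the same way script does for the script element.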
This is very doable in kramdown, which will convert the following input
# My header
This is regular text. This is regular text.
{: .text}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
{: .quote}
+ Red
+ Green
+ Blue
{: data-id="test-123"}
to
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
See the attribute list definition section of the syntax for details.

How to write a transformer for Ruby Sanitize Gem to transform <br> into newlines?

I'm using a wrapper for the Sanitize gem's clean method to solve some of our issues:
def remove_markup(html_str)
  # gsub returns a new string, so the result must be assigned; a literal
  # "</p>\n" also avoids interpolating $1 before any match has happened.
  html_str = html_str.gsub(%r{</p>}, "</p>\n")
  marked_up = Sanitize.clean html_str
  ESCAPE_SEQUENCES.each do |esc_seq, ascii_seq|
    marked_up = marked_up.gsub('&' + esc_seq + ';', ascii_seq.chr)
  end
  marked_up
end
I recently added the gsub call on </p> as a quick way to do what I wanted:
insert a newline wherever a paragraph ends.
However, I'm sure this can be accomplished more elegantly with a Sanitize transformer.
Unfortunately, I think I must be misunderstanding a few things. Here is an example of a transformer I wrote for the <br> tag that worked.
s2 = "<p>here is para 1<br> It's a nice paragraph</p><p>Don't forget para 2</p>"
br_to_nl = lambda do |env|
  node = env[:node]
  node_name = env[:node_name]
  return if env[:is_whitelisted] || !node.element?
  return unless node_name == 'br'
  node.replace "\n"
end
Sanitize.clean s2, :transformers => [br_to_nl]
=> " here is para 1\n It's a nice paragraph Don't forget para 2 "
But I couldn't come up with a solution that would work well for <p> tags.
Should I add a text element to the node as a child? How do I make it show up immediately after the element?
related question (answered) How to use RubyGem Sanitize transformers to sanitize an unordered list into a comma seperated list?
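For comparison, the desired behavior (a newline for every <br> and after every closing </p>, with all other markup dropped) can be sketched without Sanitize at all. The Python below uses only the standard library and is an illustration of the target output, not a Sanitize transformer; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextWithNewlines(HTMLParser):
    """Collect text content, adding a newline for <br> and after </p>."""

    def __init__(self):
        super().__init__()  # character references are decoded automatically
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def remove_markup(html: str) -> str:
    parser = TextWithNewlines()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


s2 = "<p>here is para 1<br> It's a nice paragraph</p><p>Don't forget para 2</p>"
print(repr(remove_markup(s2)))
```

In a transformer the same two rules would apply per node: replace a br element with a newline text node, and add a newline text node as the next sibling of each p element.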

Parse html using Perl works for 2 lines but not multiple

I have written the following Perl script:
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
my $source = "foo";
my $time = "10-14-2011";
my $name = $html->find('a')->as_text;
my $comment = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');
This outputs: foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish
That is exactly what I wanted from the test HTML, but
it only works with the snippet above, which I put in for test purposes.
The full HTML file contains multiple <a> and <b> elements, and when I run the script on it the results for these columns are blank.
How can I account for multiple values for specific searches?
Without sight of your real HTML it is hard to help, but in list context $html->find returns a list of all matching <a> elements (in scalar context it returns only the first), so you could write something like
foreach my $anchor ($html->find('a')) {
print $anchor->as_text, "\n";
}
But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
I cannot begin to guess about your problem with comments without seeing what data you are dealing with.
If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo element with a ~text tag name and a text attribute, for instance <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');
OUTPUT
1 h
User
: There are not enough
big
fish
in the lake ;
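The underlying point, collecting every match rather than only the first one, is independent of HTML::TreeBuilder. As a hedged illustration, the Python sketch below gathers the text of all elements with a given tag name using the standard html.parser module; the class and function names are assumptions made for the example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text content of every element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than zero while inside a matching element
        self.texts = []   # one entry per outermost matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def all_text(html: str, tag: str) -> list:
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.texts


html = """<span class=time>1 h </span>
User: There are not enough <b>big</b>
<b>fish</b> in the lake ;"""
print(all_text(html, "b"))
```

Each entry can then be written to its own row or joined into one column value, which is the same choice the foreach loop above leaves open.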