Is Pandoc capable of injecting arbitrary HTML attributes to any elements? - html

So code blocks can define HTML attributes using the fenced_code_blocks extension:
~~~~ {#mycode .haskell .numberLines startFrom="100"}
qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++
qsort (filter (>= x) xs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to use the above syntax, in some way, for regular text blocks? For example, I'd like to convert the following Markdown text:
# My header
~~~ {.text}
This is regular text. This is regular text.
~~~
~~~ {.quote}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {data-id=test-123}
+ Red
+ Green
+ Blue
~~~
into something like this:
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
If there is no such support in Pandoc itself, would it be possible to create a custom writer in Lua that does so?
Edit: Looking at the sample.lua custom writer, anyone know what the "attributes table" is on line 35? And how does one pass these attributes to specific Pandoc elements? Also, the functionality I'm looking for above is very similar to the header_extension extension except it would work for all elements, not just headers.

Pandoc's filters let you operate on Pandoc's internal representation of the document. It's possible to have a chain of filters that do different transformations. I'll share two illustrative examples of filters that should help.
Markdown Code Blocks
Code blocks in Pandoc are usually meant to embed source code listings from programming languages, but here we're trying to extract the body and interpret it as markdown. Rather than using classes from your input document like text and quote, let's use a generic as-markdown class. Pandoc will generate the appropriate tags automatically.
# My header
~~~ {.as-markdown}
This is regular text. This is regular text.
~~~
~~~ {.as-markdown}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
~~~
~~~ {.as-markdown data-id=test-123}
+ Red
+ Green
+ Blue
~~~
~~~ haskell
main :: IO ()
~~~
To ensure code blocks without the as-markdown class are interpreted as usual, I included a haskell code block. Here's the filter implementation:
#!/usr/bin/env runhaskell
import Text.Pandoc.Definition (Pandoc(..), Block(..), Format(..))
import Text.Pandoc.Error (handleError)
import Text.Pandoc.JSON (toJSONFilter)
import Text.Pandoc.Options (def)
import Text.Pandoc.Readers.Markdown (readMarkdown)
asMarkdown :: String -> [Block]
asMarkdown contents =
case handleError $ readMarkdown def contents of
Pandoc _ blocks -> blocks
-- | Unwrap each CodeBlock with the "as-markdown" class, interpreting
-- its contents as Markdown.
markdownCodeBlock :: Maybe Format -> Block -> IO [Block]
markdownCodeBlock _ cb#(CodeBlock (_id, classes, _namevals) contents) =
if "as-markdown" `elem` classes then
return $ asMarkdown contents
else
return [cb]
markdownCodeBlock _ x = return [x]
main :: IO ()
main = toJSONFilter markdownCodeBlock
Running pandoc --filter markdown-code-block.hs index.md produces:
<h1 id="my-header">My header</h1>
<p>This is regular text. This is regular text.</p>
<blockquote>
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul>
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">main ::</span> <span class="dt">IO</span> ()</code></pre></div>
Almost there! The only part that's not quite right is the HTML attributes.
Custom HTML Attributes from Code Block Metadata
The following filter should help you get started. It converts code blocks with the web-script class to an HTML <script> tag when the target format is html or html5.
#!/usr/bin/env runhaskell
import Text.Pandoc.Builder
import Text.Pandoc.JSON
webFormats :: [String]
webFormats =
[ "html"
, "html5"
]
script :: String -> Block
script src = Para $ toList $ rawInline "html" ("<script type='application/javascript'>" <> src <> "</script>")
injectScript :: Maybe Format -> Block -> IO Block
injectScript (Just (Format format)) cb#(CodeBlock (_id, classes, _namevals) contents) =
if "web-script" `elem` classes then
if format `elem` webFormats then
return $ script contents
else
return Null
else
return cb
injectScript _ x = return x
main :: IO ()
main = toJSONFilter injectScript
The data-id=test-123 in your last block would come through in the _namevals's key-value pairs with type [(String, String)]. All you'd need to do is refactor script to support arbitrary tags and key-value pairs for HTML attributes, and specify what HTML to generate based on those inputs. To see the native representation of the input document, run pandoc -t native index.md.
[Header 1 ("my-header",[],[]) [Str "My",Space,Str "header"]
,CodeBlock ("",["as-markdown"],[]) "This is regular text. This is regular text."
,CodeBlock ("",["as-markdown"],[]) "> This is the first level of quoting.\n>\n> > This is nested blockquote.\n>\n> Back to the first level."
,CodeBlock ("",["as-markdown"],[("data-id","test-123")]) "+ Red\n+ Green\n+ Blue"
,Para [Str "To",Space,Str "ensure",Space,Str "regular",Space,Str "code",Space,Str "blocks",Space,Str "work",Space,Str "as",Space,Str "usual."]
,CodeBlock ("",["haskell"],[]) "main :: IO ()"]
If you'd like to play around with either of these examples, they're both in my pandoc-experiments repository.

This is very doable in kramdown, which will convert the following input
# My header
This is regular text. This is regular text.
{: .text}
> This is the first level of quoting.
>
> > This is nested blockquote.
>
> Back to the first level.
{: .quote}
+ Red
+ Green
+ Blue
{: data-id="test-123"}
to
<h1 id="my-header">My header</h1>
<p class="text">This is regular text. This is regular text.</p>
<blockquote class="quote">
<p>This is the first level of quoting.</p>
<blockquote>
<p>This is nested blockquote.</p>
</blockquote>
<p>Back to the first level.</p>
</blockquote>
<ul data-id="test-123">
<li>Red</li>
<li>Green</li>
<li>Blue</li>
</ul>
See the attribute list definition section of the syntax for details.

Related

Converting HTML with equations pages to docx

I am trying to convert an html document to docx using pandoc.
pandoc -s Template.html --mathjax -o Test.docx
During the conversion to docx everything goes smooth less the equations.
In the html file the equation look like this:
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>
After running the pandoc command the result in the docx document is:
\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}
Do you have idea how can I overcome this issue?
Thanks
A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses it context as LaTeX. The original div is then replaced with the parse result.
local stringify = pandoc.utils.stringify
function Div (div)
if div.attributes['mime-type'] == 'text/markdown' then
return pandoc.read(stringify(div), 'latex').blocks
end
end
Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:
pandoc --lua-filter parse-math.lua ...
As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.
local stringify = pandoc.utils.stringify
function Div (div)
if div.attributes['mime-type'] == 'text/markdown' then
local result = pandoc.read(stringify(div), 'latex').blocks
local first = result[1] and result[1].content or {}
return (#first == 1 and first[1].t == 'Math')
and result
or nil
end
end

Formatting divs using Pandoc

I am using Pandoc to convert Pandoc Markdown documents to HTML5 documents. In my md input, I write custom divs using a special Pandoc syntax, for example :
::: Resources
A nice document
Informative website
:::
The resulted HTML is this :
<div class="Resources">
<p>A nice document Informative website</p>
</div>
I would like the output to be something like this instead :
<div class="Resources">
<div>A nice document</div>
<div>Informative website</div>
</div>
Ie. I want the two resources to be in two different containers. I did not find any solution to do that (it is possible that the pandoc filters can, but I don't quite understand how to write them).
Thank you very much for any kind of help. Cheers.
If the main goal is to have separate Resource blocks, I'd suggest to use a list inside the div:
::: Resources
- A nice document
- Informative website
:::
This will give
<div class="Resources">
<ul>
<li>A nice document</li>
<li>Informative website</li>
</ul>
</div>
It's not what you want yet, but get's us half way there. It already marks all resources as separate blocks. This simplifies our task to refine the document structure further through filtering. The following uses pandoc's Lua filter functionality; put the code into a file and pass it to pandoc via the --lua-filter command line parameter.
local list_to_resources = {
BulletList = function (el)
local resources = {}
local resource_attr = pandoc.Attr('', {'Resource'}, {})
for i, item in ipairs(el.content) do
resources[i] = pandoc.Div(item, resource_attr)
end
return resources
end
}
function Div (el)
-- return div unaltered unless it is of class "Resources"
if not el.classes:includes'Resources' then
return nil
end
return pandoc.walk_block(el, list_to_resources)
end
Calling pandoc with this filter will produce your desired output:
<div class="Resources">
<div class="Resource">
A nice document
</div>
<div class="Resource">
Informative website
</div>
</div>
For the sake of completeness, I'll also add a solution to the question when taking it literally. However, I do not recommend using it for various reasons:
It is far less "markdowny". Using only linebreaks to separate items is uncommon in Markdown and goes against its philosophy of having readable text without surprises.
The necessary code is more complex and fragile.
You won't be able to add additional information to the Resources div, as it will always be mangeled-up by the filter. With the previous solution, only bullet lists have a special meaning.
That being said, here's the code:
-- table to collect elements in a line
local elements_in_line = {}
-- produce a span from the collected elements
local function line_as_span()
local span = pandoc.Span(elements_in_line)
elements_in_line = {}
return span
end
local lines_to_blocks = {
Inline = function (el)
print(el.t)
if el.t == 'SoftBreak' then
return line_as_span()
end
table.insert(elements_in_line, el)
return {}
end,
Para = function (el)
local resources = {}
local content = el.content
-- last line is not followed by SoftBreak, add it here
table.insert(content, line_as_span())
local attr = pandoc.Attr('', {'Resource'})
for i, line in ipairs(content) do
resources[i] = pandoc.Div(pandoc.Plain(line.content), attr)
end
return resources
end
}
function Div (el)
-- return div unaltered unless it is of class "Resources"
if not el.classes:includes'Resources' then
return nil
end
return pandoc.walk_block(el, lines_to_blocks)
end

How to extract text as well as hyperlink text in scrapy?

I want to extract from following html code:
<li>
<a test="test" href="abc.html" id="11">Click Here</a>
"for further reference"
</li>
I'm trying to do with following extract command
response.css("article div#section-2 li::text").extract()
But it is giving only "for further reference" line
And Expected output is "Click Here for further reference" as a one string.
How to do this?
How to modify this to do the same if following patterns are there:
Text Hyperlink Text
Hyperlink Text
Text Hyperlink
There are at least a couple of ways to do that:
Let's first build a test selector that mimics your response:
>>> response = scrapy.Selector(text="""<li>
... <a test="test" href="abc.html" id="11">Click Here</a>
... "for further reference"
... </li>""")
First option, with a minor change to your CSS selector.
Look at all text descendants, not only text children (notice the space between li and ::text pseudo element):
# this is your CSS select,
# which only gives direct children text of your selected LI
>>> response.css("li::text").extract()
[u'\n ', u'\n "for further reference"\n']
# notice the extra space
# here
# |
# v
>>> response.css("li ::text").extract()
[u'\n ', u'Click Here', u'\n "for further reference"\n']
# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n Click Here\n "for further reference"\n'
Another option is to chain your .css() call with XPath 1.0 string() or normalize-space() inside a subsequent .xpath() call:
>>> response.css("li").xpath('string()').extract()
[u'\n Click Here\n "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']
# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'
I use xpath if that is the case the selector will be:
response.xpath('//article/div[#id="section-2"]/li/a/text()').extract()#this will give you text of mentioned hyper link >> "Click Here"
response.xpath('//article/div[#id="section-2"]/li/a/#href').extract()#this will give you link of mentioned hyper link >> "abc.html"
response.xpath('//article/div[#id="section-2"]/li/text()').extract()#this will give you text of li >> "for further reference"

How to get html tag text using XMLSlurper in Groovy

I am trying to modify html code in Groovy. I parsed it using XMLSlurper. The problem is i need to edit text of certain tag which contains text and children tags. Html code looks like this:
<ul><li>Text to modify<span>more text</span></li></ul>
In groovy i am trying this code:
def ulDOM = new XmlSlurper().parseText(ul);
def elements = ulDOM.li.findAll{
it.text().equals("text i am looking for");
}
The problem is i got empty array in 'elements' because it.text() returns text from 'it' node together with whole DOM subtree text nodes. In this case "Text to modifymore text". Note that contains() method is not enough for my solution.
My question is how to get exact text from a certain tag and not the text from whole DOM subtree?
.text() evaluate children and appends. Hence it will always include merged line.
Could you consinder localText()? Not exactly what you expect, it returns an array of strings.
import org.testng.Assert
ul='''<ul>
<li>Text to modify<span>more text</span>
</li>
</ul> '''
def ulDOM = new XmlSlurper().parseText(ul);
def elements = ulDOM.li.findAll{
String[] text = it.localText();
text[0].equals("Text to modify");
}
Assert.assertTrue(elements.size()==1)

Pandoc ignores Markdown-Headlines

Using snap, I wrote a splice creating text from markdown, using this function:
markdownToHTML :: T.Text -> [Node]
markdownToHTML = renderHtmlNodes . (writeHtml writeOpts) . readMarkdown readOpts . T.unpack
where
readOpts = defaultParserState
writeOpts = defaultWriterOptions
{ writerStandalone = False
, writerHtml5 = True
, writerStrictMarkdown = False
}
Now, when I, for example, give it this markdown
# Hi
Lorem ipsum something somthing
# Stuff
[a link](http://twitter.com/)
It produces this HTML:
<h1 id='hi'>Hi
</h1>
<p>
Lorem ipsum something somthing
# Stuff
<a href='http://twitter.com/'>a link</a></p>
No matter how many newlines I put before the #, it is still just wedged into the paragraph.
Funnily enough, if I dump the same markdown into pandoc's demo site, it produces the correct Html output.
The full code of my project can be found here, if necessary.
See the documentation for Text.Pandoc. It says:
Note: all of the readers assume that the input text has '\n' line endings. So if you get your input text from a web form, you should remove '\r' characters using filter (/='\r').
I suspect that's your problem.