Pandoc: Include raw `tex` in `markdown` to `html` conversion

Here is a simple markdown code:
$$ \alpha = \beta\label{eqone}$$
Refer equation (\ref{eqone}).
When I convert this to html using
pandoc --mathjax file.markdown -o file.html
the \ref{eqone} is omitted from the HTML output because it is raw TeX. Is there a workaround to include the raw TeX in the HTML output?
I understand that I could have used:
(#eqone) $$\alpha=\beta$$
Refer equation ((#eqone)).
for equation numbering and referencing. However, this puts the number on the left side and does not distinguish between figures, tables and equations.
MathJax numbering, by contrast, appears on the right, like standard TeX output.
Any other workaround for proper equation numbering is also welcome.
Note: The following code needs to be added to the head of the generated HTML file to configure automatic equation numbering in MathJax.
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "all"} } });
</script>

You could try to put the \ref into a math environment, $\ref{eqone}$. And if the command is not defined in math, switch back to text, $\text{\ref{eqone}}$. Ugly, but it might work.
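If editing every reference by hand is impractical, the wrapping suggested above could be applied as a preprocessing step before running pandoc. The following is a minimal sketch, not a pandoc feature: the function name wrap_refs is hypothetical, and it assumes that every \ref{...} occurs in running text rather than inside an existing math span.

```python
import re

# Matches \ref{label}; the label is anything up to the closing brace.
REF_RE = re.compile(r"\\ref\{([^}]*)\}")

def wrap_refs(markdown_text):
    """Wrap each \\ref{...} in inline math so it is no longer treated as raw TeX."""
    return REF_RE.sub(lambda m: "$" + m.group(0) + "$", markdown_text)

# Sample line taken from the question above.
print(wrap_refs(r"Refer equation (\ref{eqone})."))
```

The converted text can then be passed to pandoc --mathjax as before. A reference that already sits inside $...$ would be wrapped a second time, so such cases would need to be excluded first.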

Related

Using Regex to find "<img .../>" and "<script ...> </script>" in HTML string

I am trying to use Regular Expressions for the first time to search for images and scripts in webpages in Scala. The expressions I've come up with are
Images:
/(<img\S+\s+\/>)+/
Scripts:
/(<script\s+\S+><\/script>)+/
I don't really know anything about HTML code or using Regex so I'm not sure what I need in order to specify that it should match <img .../> where the ... could be any amount of characters or whitespace. This is just a small part of a programming assignment I'm writing in Scala and we have to use Regex.

A regex like <img[^>]*> would match <img..........>.
A regex like <script.*?</script> would match a single <script...>...</script> instance. The ? is necessary to prevent it from matching everything from the first <script...> tag to the last </script> tag.
(Feel free to add back in the capturing ( )'s, the \ escapes, and surround with the regex delimiting / / tokens. I removed them to focus on the regular expressions themselves, without the leaning toothpick syndrome and other noise.)
While these are better than the ones you proposed, they will still break in many circumstances. RegEx is not designed to parse HTML.
<script>
<!-- This "</script>" doesn't end the script, but fools the RegEx -->
</script>
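To see how the two patterns above behave, here is a small sketch using Python's re module (the question is about Scala, but the patterns themselves carry over unchanged). The sample HTML strings are made up for illustration. Note that the script pattern needs re.DOTALL, an assumption added here, so that . also matches newlines inside multi-line scripts.

```python
import re

IMG_RE = re.compile(r"<img[^>]*>")
# The non-greedy .*? stops at the first </script>; DOTALL lets . cross newlines.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL)

html = (
    '<p><img src="a.png" alt="A"/><script src="x.js"></script>'
    '<img src="b.png"></p><script>\nvar n = 1;\n</script>'
)

images = IMG_RE.findall(html)
scripts = SCRIPT_RE.findall(html)
print(images)
print(len(scripts))

# The failure case from the answer: the match ends at the quoted "</script>".
tricky = '<script>\n<!-- This "</script>" doesn\'t end the script -->\n</script>'
truncated = SCRIPT_RE.findall(tricky)
print(truncated)
```

The last result contains only the text up to the first </script>, which is exactly the kind of breakage the answer warns about.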

How to split up long HTML into "functions"

When writing a non-trivial static HTML page, the large-scale structure gets very hard to see, with the fine structure and the content all mixed into it. Scrolling several pages to find the closing tag that matches a given opening tag is frustrating. In general it feels messy, awkward, and hard to maintain... like writing a large program with no functions.
Of course, when I write a large program, I break it up hierarchically into smaller functions. Is there any way to do this with a large HTML file?
Basically I'm looking for a template system, where the content to be inserted into the template is just more HTML that's optionally (here's the important part) located in the same file.
That is, I want to be able to do something like what is suggested by this hypothetical syntax:
<html>
<head>{{head}}</head>
<body>
<div class="header">{{header}}</div>
<div class="navbar">{{navbar}}</div>
<div class="content">{{content}}</div>
<div class="footer">{{footer}}</div>
</body>
</html>
{{head}} =
<title>Hello World!</title>
{{styles}}
{{scripts}}
{{styles}} =
<link rel="stylesheet" type="text/css" href="style.css">
{{navbar}} =
...
...
... and so on...
Then presumably there would be a simple way to "compile" this to make a standard HTML file.
Are there any tools out there to allow writing HTML this way?
Most template engines require each include to be a separate file, which isn't useful.
UPDATE: GNU M4 seems to do exactly the sort of thing I'm looking for, but with a few caveats:
The macro definitions have to appear before they are used, when I'd rather they be after.
M4's syntax mixes very awkwardly with HTML. Since the file is no longer HTML, it can't be easily syntax checked for errors. The M4 processor is very forgiving and flexible, making errors in M4 files hard to find sometimes - the parser won't complain, or even notice, when what you wrote means something other than what you probably meant.
There's no way to get properly indented HTML out, making the output an unreadable mess. (Since production HTML might be minified anyway, that's not a major issue, and it can always be run through a formatter if it needs to be readable.)

This will parse your template example and do what you want.
perl -E 'my $pre.=join("",<>); my ($body,%h)=split(/^\{\{(\w+)\}\}\s*=\s*$/m, $pre); while ($body =~ s/\{\{(\w+)\}\}/$h{$1}/ge) { if ($rec++>200) {die("Max recursion (200)!")}};$body =~ s/({{)-/$1/sg; $body =~ s/-(}})/$1/sg; print $body' htmlfiletoparse.html
And here's the script version.
file joshTplEngine ;)
#!/usr/bin/perl
## get/read lines from files. Multiple input files are supported
my $pre.=join("",<>);
## split files to body and variables %h = hash
## variable is [beginning of the line]{{somestring}} = [newline]
my ($body,%h)=split(/^\{\{# split on variable line and
(\w+) ## save name
\}\}
\s*=\s*$/xm, $pre);
## replace recursively all variables defined as {{somestring}}
while ($body =~ s/
\{\{
(\w+) ## name of the variable
\}\}
/ ##
$h{$1} ## all variables have been read into hash %h; $1 contains the variable name from the match
/gxe) {
## check for number of recursions, limit to 200
if ($rec++>200) {
die("Max recursion (200)!")
}
}
## replace {{- to {{ and -}} to }}
$body =~ s/({{)-/$1/sg;
$body =~ s/-(}})/$1/sg;
## end, print
print $body;
Usage:
joshTplEngine.pl /some/html/file.html [/some/other/file] | tee /result/dir/out.html
I hope this little snippet of Perl will enable you to do your templating.
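For readers who prefer Python, the same technique (split the file at the "{{name}} =" definition lines, then substitute repeatedly until nothing is left to expand) might be sketched as follows. This is a rough equivalent under stated assumptions, not a port of the script above: the function name expand is hypothetical, definition bodies are stripped of surrounding whitespace, undefined names expand to an empty string, and the {{- and -}} escapes are not handled.

```python
import re

# A definition line looks like "{{name}} =" on a line of its own.
DEF_RE = re.compile(r"^\{\{(\w+)\}\}[ \t]*=[ \t]*$", re.MULTILINE)
VAR_RE = re.compile(r"\{\{(\w+)\}\}")

def expand(template, max_passes=200):
    """Expand {{name}} placeholders using definitions found later in the same text."""
    parts = DEF_RE.split(template)
    body, rest = parts[0], parts[1:]
    # After the split, names and definition bodies alternate in `rest`.
    defs = {name: text.strip() for name, text in zip(rest[0::2], rest[1::2])}
    for _ in range(max_passes):
        body, count = VAR_RE.subn(lambda m: defs.get(m.group(1), ""), body)
        if count == 0:
            return body
    raise RuntimeError("Max recursion ({}) reached".format(max_passes))

sample = (
    "<head>{{head}}</head>\n"
    "{{head}} =\n"
    "<title>Hi</title>\n"
    "{{styles}}\n"
    "{{styles}} =\n"
    '<link rel="stylesheet" href="style.css">\n'
)
print(expand(sample))
```

As in the Perl version, the pass limit guards against a definition that refers to itself.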

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that compares an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") equal to:
<p a="0 " b="1">a b c</p>
Note how:
text multiple whitespaces were converted to a single whitespace
newlines were converted to whitespaces
text trailing and heading whitespaces were stripped
attributes were put on a standard order
attribute values were unchanged, including trailing whitespaces
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
beautifulsoup. Orders the attributes, but does not deal well with whitespaces?
A function formatter regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python only solution like this would be ideal.
Looking for a JavaScript function similar to isEqualNode() (which does not work because it ignores nodeValue) plus some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.

I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstring tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L74
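The general idea of such a normalizer can be sketched with the standard library's html.parser. This is an illustrative sketch only, not the code at the links above: the class and function names are hypothetical, text is trimmed per text node (which would also drop the space around inline elements), and special characters are not re-escaped on output.

```python
import re
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-serialize HTML with sorted attributes and collapsed text whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Sort by attribute name only; values are kept exactly as parsed.
        rendered = "".join(
            ' {}="{}"'.format(name, "" if value is None else value)
            for name, value in sorted(attrs, key=lambda pair: pair[0])
        )
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Collapse runs of whitespace (including newlines), then trim the ends.
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            self.parts.append(text)

def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

a = normalize('<p b="1" a="0 "> a b\nc </p>')
b = normalize('<p a="0 " b="1">a b c</p>')
print(a == b)
```

Two documents are then considered equal when their normalized strings match, as with the two examples from the question.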

Preserving whitespace in XQuery HTML output

When I run galax-run a.xq, where a.xq is
<html>
<body>
<ul>
{
for $x in doc("books.xml")/bookstore/book
return <li>{data($x/title)}</li>
}
</ul>
</body>
</html>
the output is all on one line. How do I keep the formatting (new lines and other white spaces) as in a.xq?

Your question is about "Boundary Whitespace", which is either stripped or preserved, with an implementation defined default behaviour. You can however override the default by using a boundary-space declaration. For preserving boundary whitespace, use
declare boundary-space preserve;
in the query prolog. See http://www.w3.org/TR/xquery/#id-boundary-space-decls for details.
Note that this governs the layout of constructed nodes. Their external appearance may also be affected by serialization settings. The serializer may have an option to re-introduce boundary space, even if it was stripped at construction time. You would have to consult implementation-specific documentation to find out.

Turns out xmllint --format a.xml will print a neatly formatted version of a.xml. That was all I needed. You can also pipe to xmllint, like this:
galax-run a.xq | xmllint --format -
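Where xmllint is not available, a similar re-indentation of the one-line output could be done with Python's standard library. This is a sketch under the assumption that the query result is well-formed XML (as XQuery element constructors produce); the helper name pretty is hypothetical, and the exact layout differs slightly from what xmllint prints.

```python
from xml.dom import minidom

def pretty(xml_text, indent="  "):
    """Parse well-formed XML and return it re-indented, one element per line."""
    return minidom.parseString(xml_text).toprettyxml(indent=indent)

# Stand-in for the single-line output of the query.
one_line = "<html><body><ul><li>A</li><li>B</li></ul></body></html>"
result = pretty(one_line)
print(result)
```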

How can I replace and multiply dimensions of img tags in Perl or Ruby?

I have a folder full of html files created for a Kindle ebook. The images are coded with width and height, as per the Kindle guidelines:
<img width="328" height="234" src="images/224p_fmt.jpeg" alt="224p.tif"/>
What I need to create/find is a script that will process all the image tags, multiply the width and height attributes by a specified amount (coded into the script) and write them back into the html files.
So, for the above example, say I want to multiply by 1.5, and wind up with
<img width="492" height="351" src="images/224p_fmt.jpeg" alt="224p.tif"/>
Scripts like this are not my forte, so help appreciated. I especially am unclear on how to write a script that I can run on file(s) from the command line and just input/output html.
I assume the meat of the code would be something like
s/<img width="([0-9]+)" height="([0-9]+)" src="(.*?)" alt=".*"/>/'<img width="'.$1*1.5.'" height="'.$2*1.5.'" src="'.$3.'" alt=""/>'/eg;
I realize this is incorrect (the multiplication part), which is why help is appreciated.

You've already got the main regex figured out; you just need to tweak it and decide on a language. Using regexes on HTML is not optimal, but since this is fairly straightforward, it's probably OK.
perl -pi.bak -we 's/<img width="([0-9]+)" height="([0-9]+)"/q(<img width=") .
$1*1.5 . q(" height=") . $2*1.5 . q(")/eg;' yourfile.html
Note the use of the alternate quoting q(...), since using single quotes on the command line will conflict with the shell quoting.
There's no need to touch any parts you're not changing, unless you feel the need to make a stricter match. If you do, you can add a look-ahead assertion:
(?=\s*src=".*?"\s*alt=".*?"\/>)
This part will remain unchanged by the substitution.

In Python I'd do it like this.
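import re
import sys

source = sys.stdin.read()

def multi(by):
    # Build a re.sub callback that scales the captured number by `by`.
    def handler(m):
        # round() keeps the attribute an integer (e.g. 492, not 492.0).
        updated = round(int(m.group(2)) * by)
        return m.group(1) + str(updated)
    return handler

# Scale every width="..." and height="..." value found in the input.
print(re.sub(r'((?:width|height)=["\'])(\d+)', multi(1.5), source), end="")
Then you can handle input and output on the command line using < and >.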
$ python resize.py < index.html > new_file.html

I would look into using the nokogiri gem to parse the HTML, search for image tags, extract the width and height attributes and then output the changed document so you can save it.
More information at the nokogiri tutorial page.

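You're right, it can be done with a small Ruby script. It can look like this:
source = '<img width="328" height="234" src="images/224p_fmt.jpeg" alt="224p.tif"/>'
# Capture width, height and src from the tag.
data = source.scan(/<img width="([0-9]+)" height="([0-9]+)" src="(.*?)" alt=".*"\/>/).flatten
# Replace each captured number with its scaled, rounded value.
source.gsub!(data[0], (data[0].to_i * 1.5).round.to_s)
source.gsub!(data[1], (data[1].to_i * 1.5).round.to_s)
Of course, it's a quick and dirty script, far from perfect, and it has some drawbacks: for example, gsub! replaces every occurrence of the captured number in the string, not just the one inside the attribute.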
