Multiple raw latex commands in the rst_prolog of Sphinx - html

I don't see how to include some raw:: latex substitution commands in the rst_prolog of Sphinx.
This question has an answer for one-line replace command but not for my case.
My attempt in conf.py is:
rst_prolog = """
.. |BeginSmaller| raw:: latex
\begingroup\footnotesize
.. |EndSmaller| raw:: latex
\endgroup
"""
which, not surprisingly, gives bad results:
html compilation:
warning message <rst_prolog>:9: WARNING: Explicit markup ends without a blank line; unexpected unindent.
the produced html has some "ootnotesize" text at the beginning of pages, probably because of a bad escape sequence.
latex compilation:
Only egingroup appears
Does anyone have an idea how to solve this?
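A likely explanation for both symptoms, assuming conf.py is read as ordinary Python: in a normal triple-quoted string, \b is a backspace and \f is a form feed, which would account for the stray "egingroup" and "ootnotesize". The warning suggests that the body of each raw directive should also be indented and preceded by a blank line. The fragment below is a sketch of that idea, not tested against any particular Sphinx version:

```python
# Hypothetical conf.py fragment; only the string handling is demonstrated here.

# In a normal string literal, "\b" is a backspace and "\f" a form feed,
# so the LaTeX command names lose their first letter.
broken = "\begingroup\footnotesize"

# A raw string (r"""...""") keeps every backslash. Each raw:: latex body is
# indented and separated from its directive line by a blank line.
rst_prolog = r"""
.. |BeginSmaller| raw:: latex

   \begingroup\footnotesize

.. |EndSmaller| raw:: latex

   \endgroup
"""

print(repr(broken))
print(rst_prolog)
```

Printing repr(broken) makes the control characters visible, while the raw string still contains the full LaTeX command names.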


Use SED to extract value of all input elements with a certain name

How do I get the value attribute based on a search of some other attribute?
For example:
<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>
How do I get the value of the input element with the name "dummy"?
Since you're looking for a solution using bash and sed, I'm assuming you're looking for a Linux command line option.
Use hxselect html parsing tool to extract element; use sed to extract value from element
I did a Google search for "linux bash parse html tool" and came across this: https://unix.stackexchange.com/questions/6389/how-to-parse-hundred-html-source-code-files-in-shell
The accepted answer suggests using the hxselect tool from the html-xml-utils package which extracts elements based on a css selector.
So after installing (download, unzip, ./configure, make, make install), you can run this command using the given css selector
hxselect "input[name='dummy']" < example.html
(Given that example.html contains your example html from the question.) This will return:
<input name="dummy" value="foo"/>
Almost there. We need to extract the value from that line:
hxselect "input[name='dummy']" < example.html | sed -n -e "s/^.*value=['\"]\(.*\)['\"].*/\1/p"
Which returns "foo".
why you would / would not want to use this approach
using regex to parse out the attributes is complicated, and often the wrong way to go
the hxselect tool (in my other answer) is a pain to install
BUT, this approach accepts malformed html, which is what is argued for in this answer to the question linked above. By the way, that question has very thorough discussion on the regex+html debate.
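The sed step is only a regex capture, so the same idea can be sketched in other languages. Here is a minimal Python version, assuming the element arrives as a single line like the hxselect output shown above; the variable names are illustrative:

```python
import re

# Example element as printed by hxselect in the answer above.
element = '<input name="dummy" value="foo"/>'

# Capture whatever sits between the quotes that follow value=.
# The backreference \1 requires the closing quote to match the opening one,
# and the non-greedy group stops at the first such quote.
match = re.search(r"""value=(['"])(.*?)\1""", element)
value = match.group(2) if match else None
print(value)
```

Like the sed command, this inherits the usual weaknesses of regex-based HTML handling, for example attributes split across lines.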
Since you're asking for SED, I'll assume you want a command line option. However, a tool built for html parsing may be more effective. The problem with my first answer is that I don't know of a way in css to select the value of an attribute (does anyone else?). However, with xml you can select attributes like you could other elements. Here is a command line option for using an xml parsing tool.
Treat it as XML; use XPATH
Install xmlstarlet with your package manager
Run xmlstarlet sel -t -v "//input[@name='dummy']/@value" example.html (where example.html contains your html)
If your html isn't valid xml, follow the warnings from xmlstarlet to make the necessary changes (in this case, <input> must be changed to <input/>)
Run the command again. Returns: foo
why you might/might not use this approach
it is way more simple and robust than hand-rolling a regex html parser, but
it requires well formed html
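The XPath idea can also be sketched with Python's standard library, under the same assumption that the markup is well-formed XML. ElementTree implements only a subset of XPath, so the attribute is read with get() rather than selected with /@value:

```python
import xml.etree.ElementTree as ET

# Well-formed version of the example: <input> is written as <input/>.
document = """
<body>
<input name="dummy" value="foo"/>
<input name="alpha" value="bar"/>
</body>
"""

root = ET.fromstring(document)
# ElementTree supports a limited XPath subset, including [@attr='value'].
node = root.find(".//input[@name='dummy']")
value = node.get("value") if node is not None else None
print(value)
```

As with xmlstarlet, an unclosed <input> would raise a parse error here.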
Parsing HTML with sed is generally a bad idea, since sed works in a line-based manner and HTML does not usually consider newlines syntactically important. It's not good if your HTML-handling tools break when the HTML is reformatted.
Instead, consider using Python, which has an HTML push parser in its standard library. For example:
#!/usr/bin/env python3
from html.parser import HTMLParser
from sys import argv

# Our parser. It inherits the standard HTMLParser that does most of
# the work.
class MyParser(HTMLParser):
    # We just hook into the handling of start tags to extract the
    # attribute
    def handle_starttag(self, tag, attrs):
        # Build a dictionary from the attribute list for easier
        # handling
        attrs_dict = dict(attrs)
        # Then, if the tag matches our criteria
        if tag == 'input' \
                and 'name' in attrs_dict \
                and attrs_dict['name'] == 'dummy':
            # Print the value attribute (or an empty string if it
            # doesn't exist)
            print(attrs_dict['value'] if 'value' in attrs_dict else "")

# After we defined the parser, all that's left is to use it. So,
# build one:
p = MyParser()
# And feed a file to it (here: the first command line argument)
with open(argv[1], 'r') as f:
    p.feed(f.read())
Save this code as, say, foo.py, then run
python3 foo.py foo.html
where foo.html is your HTML file.

PanDoc: How to assign level-one Atx-style header (markdown) to the contents of html title tag

I am using PanDoc to convert a large number of markdown (.md) files to html. I'm using the following Windows command-line:
for %%i in (*.md) do pandoc -f markdown -s %%~ni.md > html/%%~ni.html
On a test run, the html looks OK, except for the title tag - it's empty. Here is an example of the beginning of the .md file:
#Topic Title
- [Anchor 1](#anchor1)
- [Anchor 2](#anchor2)
<a name="anchor1"></a>
## Anchor 1
Is there a way I can tell PanDoc to parse the
#Topic Title
so that, in the html output files, I will get:
<title>Topic Title</title>
?
There are other .md tags I'd like to parse, and I think solving this will help me solve the rest of it.
I don't believe Pandoc supports this out-of-the-box. The relevant part of the Pandoc documentation states:
Templates may contain variables. Variable names are sequences of alphanumerics, -, and _, starting with a letter. A variable name surrounded by $ signs will be replaced by its value. For example, the string $title$ in
<title>$title$</title>
will be replaced by the document title.
It then continues:
Some variables are set automatically by pandoc. These vary somewhat depending on the output format, but include metadata fields (such as title, author, and date) as well as the following:
And proceeds to list a bunch of variables (none of which are relevant to your question). However, the above quote indicates that the title variable is a metadata field. The metadata field can be defined in a pandoc_title_block, a yaml_metadata_block, or passed in as a command line option.
The docs note that:
... you may also keep the metadata in a separate YAML file and pass it to pandoc as an argument, along with your markdown files ...
So you have a couple of options:
Edit each document to add metadata defining the title for each document (this could possibly be scripted).
Write your script to extract the title (perhaps with a regex which looks for #header in the first line) and pass that in to Pandoc as a command line option.
If you intend to start including the metadata in new documents you create going forward, then the first option is probably the way to go. Run a script once to batch edit your documents and then you're done. However, if you have no intention of adding metadata to any documents, I would consider the second option. You are already running a loop, so just get the title before calling Pandoc within your loop (although I'm not sure how to do that in a Windows script).
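The second option could be sketched in Python instead of a Windows batch loop. This is a rough sketch under a few assumptions: the title is always on the first line as a level-one ATX header, the output goes to an html directory next to the .md files, and pandoc is on the PATH. The function names are made up for illustration; -M is pandoc's short form for setting a metadata field on the command line.

```python
import re
import subprocess
from pathlib import Path


def extract_title(markdown_text):
    """Return the text of a leading '#Title' line, or None if there is none."""
    lines = markdown_text.splitlines()
    first_line = lines[0] if lines else ""
    # One '#' not followed by another '#', then the title, then optional
    # closing '#' characters.
    match = re.match(r"#(?!#)\s*(.+?)\s*#*\s*$", first_line)
    return match.group(1) if match else None


def build_command(md_path, markdown_text, out_dir="html"):
    """Build the pandoc argument list for one file (nothing is executed)."""
    md_path = Path(md_path)
    command = ["pandoc", "-f", "markdown", "-s", str(md_path),
               "-o", str(Path(out_dir) / (md_path.stem + ".html"))]
    title = extract_title(markdown_text)
    if title:
        command += ["-M", "title=" + title]
    return command


def convert_all(directory="."):
    """Run pandoc once per .md file in the directory."""
    for md_path in sorted(Path(directory).glob("*.md")):
        text = md_path.read_text(encoding="utf-8")
        subprocess.run(build_command(md_path, text), check=True)
```

Calling convert_all() from the directory holding the .md files would then replace the for loop from the question.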

how to create consecutive backslash in octopress with kramdown

I'm using Kramdown and Octopress to write markdown text, but I don't know how to get \\ in html. I tried \\\\ but got \. According to its docs, \ is used for escaping. Does anyone know how to get \\ in html, not \\? Thanks. I'm also confused about when \\ is translated into \ and when into <br />.
The problem is not with Kramdown, but with a plugin that comes with Octopress called rubypants.rb. Take a look at plugins/rubypants.rb, and you will find a method named process_escapes which does several calls to str.gsub. (Line 335 or so.) One of these replaces the double backslash ("\\") with the escape code you're seeing - fix that line and you'll be good. (You can fix it by moving the 'str.' to the next gsub and deleting the rest of the line.)
I am not seeing the problem here
$ kramdown --version
0.14.2
$ kramdown <<< '\\\\'
<p>\\</p>

How to embed HTML string syntax in CoffeeScript using VIM?

I have looked at how to embed HTML syntax in JavaScript string from HTML syntax highlighting in javascript strings in vim.
However, when I use CoffeeScript I cannot get the same thing working by editing the coffee.vim syntax file in a similar way. I got recursion errors saying that including html.vim makes the syntax too deeply nested.
I have some HTML templates in CoffeeScript like the following:
angular.module('m', [])
  .directive(
    'myDirective'
    [
      ->
        template: """
          <div>
            <div>This is <b>bold</b> text</div>
            <div><i>This should be italic.</i></div>
          </div>
        """
    ]
  )
How do I get the template HTML syntax in CoffeeScript string properly highlighted in VIM?
I would proceed as follows:
Find out the syntax groups that should be highlighted as pure html would be. Add html syntax highlighting to these groups.
To find the valid syntax group under the cursor you can follow the instructions here.
In your example the syntax group of interest is coffeeHereDoc.
To add html highlighting to this group execute the following commands
unlet b:current_syntax
syntax include @HTML syntax/html.vim
syn region HtmlEmbeddedInCoffeeScript start="" end=""
    \ contains=@HTML containedin=coffeeHereDoc
Since vim complains about recursion if you add these lines to coffee.vim, I would go with an autocommand:
function! Coffee_syntax()
    " Only unlet the variable when it actually exists
    if exists('b:current_syntax')
        unlet b:current_syntax
    endif
    syn include @HTML syntax/html.vim
    syn region HtmlEmbeddedInCoffeeScript start="" end="" contains=@HTML
        \ containedin=coffeeHereDoc
endfunction
autocmd BufEnter *.coffee call Coffee_syntax()
I was also running into various issues while trying to get this to work. After some experimentation, here's what I came up with. Just create .vim/after/syntax/coffee.vim with the following contents:
unlet b:current_syntax
syntax include @HTML $VIMRUNTIME/syntax/html.vim
syntax region coffeeHtmlString matchgroup=coffeeHeredoc
\ start=+'''\(\_\s*<\w\)\@=+ end=+\(\w>\_\s*\)\@<='''+
\ contains=@HTML
syn sync minlines=300
The unlet b:current_syntax line disables the current syntax matching and lets the HTML syntax definition take over for matching regions.
Using an absolute path for the html.vim inclusion avoids the recursion problem (described more below).
The region definition matches heredoc strings that look like they contain HTML. Specifically, the start pattern looks for three single quotes followed by something that looks like the beginning of an HTML tag (there can be whitespace between the two), and the end pattern looks for the end of an HTML tag followed by three single quotes. Heredoc strings that don't look like they contain HTML are still matched using the coffeeHeredoc pattern. This works because this syntax file is being loaded after the syntax definitions from the coffeescript plugin, so we get a chance to make the more specific match (a heredoc containing HTML) before the more general match (the coffeeHeredoc region) happens.
The syn sync minlines=300 widens the matching region. My embedded HTML strings sometimes stretched over 50 lines, and Vim's syntax highlighter would get confused about how the string should be highlighted. For complete surety you could use syn sync fromstart, but for large files this could theoretically be slow (I didn't try it).
The recursion problem originally experienced by @heartbreaker was caused by the html.vim script that comes with the vim-coffeescript plugin (I'm assuming that was being used). That plugin's html.vim file includes its coffee.vim syntax file to add coffeescript highlighting to HTML files. Using a relative syntax include, a la
syntax include @HTML syntax/html.vim
you get all the syntax/html.vim files in VIM's runtime path, including the one from the coffeescript plugin (which includes coffee.vim, hence the recursion). Using an absolute path will restrict you to only getting the particular syntax file you specify, but this seems like a reasonable tradeoff since the HTML one would embed in a coffeescript string is likely fairly simple.

How can I decode HTML entities?

Here's a quick Perl question:
How can I convert HTML special characters like &#252; (ü) or &#39; (') to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't want all the unicode characters you can look at the Text::Unidecode module from CPAN to Zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));
# That prints: Bei Jing
Note that there are hex-specified characters too. They look like this: &#xE9; (é).
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (perl interface: Text::Iconv)
with the transliterate option on with some success in the past. But if you are dealing
with a limited set of entities, or you don't actually need it reduced to ASCII equivalents,
you may be better off limiting what decode_entities produces or providing it with custom
conversion maps. See the HTML::Entities doc.
There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard code.
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
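To illustrate the range from easy to impossible, here is a small Python sketch. It does not use Text::Unidecode or iconv; instead it swaps in a cruder technique, Unicode decomposition followed by dropping everything that is not ASCII. The function name is made up for illustration.

```python
import html
import unicodedata


def entities_to_ascii(text):
    """Decode HTML entities, then keep only what survives as plain ASCII."""
    decoded = html.unescape(text)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Characters with no ASCII base (for example CJK) are silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(entities_to_ascii("&amp; &quot; &gt;"))
print(entities_to_ascii("caf&#xE9; &#123;"))
print(repr(entities_to_ascii("&#21271;")))
```

The named entities and &#123; come through unchanged, &#xE9; is reduced to a plain e, and the CJK character disappears entirely, which is the "impossible" end of the range unless a transliteration table is used.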
I use this script. Save it as html2utf.py, make it executable, and use it like echo "$some_html" | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to perl. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character since it doesn't usually affect Bash text parsing. With Perl, and with this particular method, it's important to use echo -n or else perl will interpret the 'newline' \n character as a literal part of the response, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals &amp; Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze
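For comparison, the same decode-then-escape pipeline can be sketched with Python's standard library. This is not the Perl one-liner itself, only an approximation of it; the function name is illustrative, and safe="" is chosen so that every reserved character is percent-encoded.

```python
import html
from urllib.parse import quote


def encode_entity_url(text):
    """Decode HTML entities, then percent-encode the result."""
    return quote(html.unescape(text), safe="")


print(encode_entity_url("Seals &amp; Croft - Summer Breeze"))
# A trailing newline is encoded too, which is the %0A problem noted above.
print(encode_entity_url("Seals &amp; Croft\n"))
```

The first call prints Seals%20%26%20Croft%20-%20Summer%20Breeze, and the second ends in %0A because the newline was left in the input.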