Specify Pandoc HTML numbering to start from <h2>

I want to convert a Markdown file to HTML with header numbering, with the header levels starting from <h2>.
What's the way to achieve it?
pandoc provides the option --number-sections (or -N) so that headers are numbered in the output.
I am trying to convert Markdown to HTML with this option.
By default, the header levels in pandoc's HTML output start from <h1>. That is not ideal, so I want to change it to <h2> (the original Markdown may contain many first-level headers, but the output HTML should contain at most one <h1>).
It is possible to specify --shift-heading-level-by=1; then the output header levels start from <h2> (see the official Pandoc User's Guide and maybe also this question).
However, that messes up the section numbering! The level of the section numbering shifts, too. Now all sections are under "0" (like 0.1, 0.2, 0.2.1, …) and no section 1 exists.
pandoc provides another option, --number-offset=1, but all it does is offset the numbers, like "0.1"→"1.1". Then all section numbers start with 1 and no section is numbered 2. Obviously, that makes no sense. The initial prefix "1." is redundant and should be removed from all the section numbers, like 1.1→1, 1.1.4→1.4, 1.2.3→2.3, etc.
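The effect described above can be reproduced with a small simulation. The Python sketch below is not pandoc's implementation; it only assumes that every heading level has its own counter, that deeper counters reset when a shallower heading appears, and that --number-offset adds to the first-level counter. The function name number_headings and its parameters are made up for illustration.

```python
def number_headings(levels, offset=0):
    """Simulate hierarchical section numbers for a list of heading levels.

    Assumption: `offset` mimics --number-offset for the first level only.
    """
    counters = [offset]  # counters[0] belongs to level 1
    numbers = []
    for level in levels:
        # Levels that were skipped get a counter of 0.
        while len(counters) < level:
            counters.append(0)
        # A new heading resets every counter below its own level.
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


# Unshifted document: levels 1, 2, 1, 2, 3
print(number_headings([1, 2, 1, 2, 3]))
# Same document after shifting every heading down by one level
print(number_headings([2, 3, 2, 3, 4]))
# Shifted document with an offset of 1
print(number_headings([2, 3, 2, 3, 4], offset=1))
```

Under these assumptions the shifted document never increments the first counter, which is why every number carries the same leading "0." (or "1." with the offset).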
For demonstration purposes, here is a sample markdown text file (abc.md)
%Test-md
# First Header (1) #
## Header (1-1) ##
# Second Header (2) #
## Header (2-2) ##
### Header (2-3) ###
and its output HTML (simplified) with
pandoc -N --section-divs --shift-heading-level-by=1 -t html5 abc.md
<section id="first-header-1" data-number="0.1">
<h2 data-number="0.1">0.1 First Header (1)</h2>
<section id="header-1-1" data-number="0.1.1">
<h3 data-number="0.1.1">0.1.1 Header (1-1)</h3>
</section>
</section>
<section id="second-header-2" data-number="0.2">
<h2 data-number="0.2">0.2 Second Header (2)</h2>
<section id="header-2-2" data-number="0.2.1">
<h3 data-number="0.2.1">0.2.1 Header (2-2)</h3>
<section id="header-2-3" data-number="0.2.1.1">
<h4 data-number="0.2.1.1">0.2.1.1 Header (2-3)</h4>
</section>
</section>
</section>
How can one make pandoc do the numbering in the ordinary way (1, 2, 2.1, 2.2, 2.2.1) yet output the HTML with the header level starting from <h2>?

Pandoc first shifts the headings, then does the numbering. That is not what we want here, though: we'd like the numbering to happen first. A pandoc Lua filter can be used to take control of this.
The function pandoc.utils.make_sections performs the action that's triggered by passing --section-divs or --number-sections on the command line. Matching the effect of --shift-heading-level-by=1 is possible by modifying all Header elements manually:
function Pandoc (doc)
  -- Create and number sections. Setting the first parameter to
  -- `true` ensures that headings are numbered.
  doc.blocks = pandoc.utils.make_sections(true, nil, doc.blocks)
  -- Shift the heading levels by 1
  doc.blocks = doc.blocks:walk {
    Header = function (h)
      h.level = h.level + 1
      return h
    end
  }
  -- Return the modified document
  return doc
end
To use the filter, save it to a file such as shifted-numbered-headings.lua and pass it to pandoc via the --lua-filter/-L parameter. The --number-sections/-N option must still be passed for the numbering to become visible, and --section-divs is still required to get <section> elements.
pandoc \
--lua-filter=shifted-numbered-headings.lua \
--number-sections \
--section-divs \
...
The class that pandoc sets on the <section> elements will always reflect the actual tagging level: the <section> that wraps an <h2> heading will have class="level2", even if, conceptually, it is a first level heading. This may be confusing and, unfortunately, cannot be changed with a filter.

So far, the easiest solution I have found is to do it in two steps. First, convert the Markdown to HTML with no shift in the header levels. Then convert that HTML to another HTML in which the header levels are shifted by 1: <h1> → <h2>.
Here is an example:
pandoc -N --section-divs -t html5 /tmp/try1.md |\
pandoc --from=html -t html5 --shift-heading-level-by=1 > output.html
Note the --from=html in the second pandoc call; it is necessary because otherwise pandoc would not know the format of the piped input.
Here is the (simplified) output. There is now no redundant common prefix like "0." or "1." in the section-header numbers.
<section id="first-header-1" data-number="1">
<h2 data-number="1">1 First Header (1)</h2>
<section id="header-1-1" data-number="1.1">
<h3 data-number="1.1">1.1 Header (1-1)</h3>
</section>
</section>
<section id="second-header-2" data-number="2">
<h2 data-number="2">2 Second Header (2)</h2>
<section id="header-2-2" data-number="2.1">
<h3 data-number="2.1">2.1 Header (2-2)</h3>
<section id="header-2-3" data-number="2.1.1">
<h4 data-number="2.1.1">2.1.1 Header (2-3)</h4>
</section>
</section>
</section>
As a note, --number-offset is irrelevant here: it makes the numbering start from a number other than the default and does not change the section-numbering level.
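The idea behind the two-step approach (number first, shift afterwards) can also be illustrated without a second pandoc call. The Python sketch below rewrites heading tags in HTML that has already been numbered. It is a regex-based approximation that assumes simple, well-formed markup; the helper name shift_headings is hypothetical, and clamping at <h6> is a simplification of this sketch, not a description of what pandoc does.

```python
import re


def shift_headings(html, by=1):
    """Rename <hN> and </hN> tags to <h(N+by)>, clamped at h6 (sketch only)."""
    def rename(match):
        slash, level = match.group(1), int(match.group(2))
        return f"<{slash}h{min(level + by, 6)}"

    # \b keeps tags such as <hr> or <header> untouched.
    return re.sub(r"<(/?)h([1-6])\b", rename, html)


numbered = '<h1 data-number="1">1 First Header (1)</h1>'
print(shift_headings(numbered))
```

The numbers inside the headings are left alone because they are ordinary text by the time the tags are renamed.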

Related

How do I preserve identifiers while converting from HTML to Markdown with Pandoc

I can't find the right parameter for this.
In my HTML-document I have a line like this:
<h2 id="seminar-teil-1">Seminar Teil 1</h2>
Now I want to convert this .html-document to a Markdown-Document. The end result should be:
## Seminar Teil 1 {#seminar-teil-1}
How do I get this done?

Ah! No wonder you were struggling. By default, identifiers are in fact preserved.
> pandoc -f html -t markdown
<h1 id="uid">Seminar 1</h1>
Output
Seminar 1 {#uid}
=========
There is one special case, though: when your identifier matches the identifier which pandoc generates automatically, no identifier is emitted. The automatic identifier is basically a lower-case alphanumeric version of the title interspersed with dashes (exactly like your custom identifier!).
To turn this feature off run pandoc as follows
> pandoc -f html -t markdown-auto_identifiers
<h2 id="seminar-teil-1">Seminar Teil 1</h2>
Seminar Teil 1 {#seminar-teil-1}
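To see why the custom identifier in the question collides with the automatic one, here is a rough Python approximation of how such an identifier is derived from a heading. It is a sketch based on the description above and on the rules listed in the Pandoc User's Guide, not pandoc's actual code; the function name auto_identifier is an assumption, and edge cases (formatting, footnotes, duplicate headings) are ignored.

```python
import re


def auto_identifier(title):
    """Approximate an automatically generated heading identifier (sketch)."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in title
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Turn whitespace into hyphens and lower-case everything.
    ident = re.sub(r"\s+", "-", kept.strip()).lower()
    # Drop everything before the first letter.
    for index, ch in enumerate(ident):
        if ch.isalpha():
            return ident[index:]
    return "section"


print(auto_identifier("Seminar Teil 1"))
print(auto_identifier("Seminar 1"))
```

With this approximation, "Seminar Teil 1" maps to the same string as the id in the question, while "Seminar 1" does not map to "uid", which matches the behaviour shown in the two examples.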

Find and replace Heading tags with regex in Notepad++

There's an OCR-scanned book and a tool which converts the OCR'd PDF to XML, but most of the XML tags are wrong, so there's another tool to fix them. I need to break the lines at <h1> to <h5> and at 1., 1.1. and 1.1.1. so it's easy to re-tag using that tool.
The XML code looks like this:
<h1>text</h1><h2>text</h2><h3>text</h3>
and
1.text.2.text.3.text.1.1.text.1.1.1.text
And I need to break the lines like this using a regex in Notepad++:
<h1>text</h1>
<h2>text</h2>
<h3>text</h3>
and
1.text.
2.text.
3.text.
and
1.1.text.
1.1.1.text.
I used </h1>\s* to find an </h1>\n but it only breaks h1 tags. I need to break all "H" tags and 1., 2., 1.1., 1.1.1. tags too.

At the risk of getting downvoted, I think you may be better served by a parser. In the past, when I've had to manage similar tasks, I would write a small script/program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with a regex might be an easier way to accomplish your goal.

You can use this search and replace (if your h1, h2, ... tags don't contain other tags):
search: (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1
note: if you need Windows newlines, you must replace \n with \r\n.
pattern details:
(?<!^) # not preceded by the beginning of the string
( # open the capture group 1
<h[1-6][^<]* # <h, a digit from 1 to 6, all characters until
# the next < (to skip all the content between
# h1, h2... tags)
| # OR
(?<![0-9]\.)[0-9]+\. # one or more digits and a dot not preceded by a digit
# and a dot
) # close the capture group 1
$1 is a reference to the content of the capture group 1
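For anyone who wants to try the pattern outside Notepad++, the following Python sketch applies the same search and replace with the standard re module. The only change is the backreference syntax (\1 instead of $1); the helper name break_lines is made up for this example.

```python
import re

# Same pattern as above: a heading start tag with its text, or a
# "number." that is not itself preceded by "digit.".
PATTERN = re.compile(r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)")


def break_lines(text):
    """Insert a newline before every match of PATTERN."""
    return PATTERN.sub(r"\n\1", text)


print(break_lines("<h1>text</h1><h2>text</h2><h3>text</h3>"))
print(break_lines("1.text.2.text.1.1.text."))
```

Without the re.MULTILINE flag, ^ only matches at the very start of the input, so the first item is never pushed onto a new line.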

Find specific tags in a HTML file

I have some html files and want to extract the contents between some tags:
The title of the page
some tagged content here.
<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright © 2012 </p>
I just want these tags: head, p
As can be seen in the last line, the final tag also starts with p but is not one of the tags I want, and I don't want its content.
I used the following script to extract the text I want, but I can't filter out tags such as the last one in my example. How can I extract just the <p> tags?
grep "<p>" $File | sed -e 's/^[ \t]*//'
I should add that the last tag (which I don't want in the output) comes right after one of my desired tags, on the same line (as in my example), so grep returns the whole content of that line. This is my problem.

Don't. Trying to use regex to parse HTML is going to be painful. Use something like Ruby and Nokogiri, or a similar language + library that you are familiar with.

to extract text between <p> and </p>, try this
perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file
or
perl -n0l012e 'print for m|<p>.*?</p>|gs'

xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"
If you're dealing with broken HTML you might need a different parser. Here's a "one-liner" that does basically the same using lxml. Just pass the script your file or URL as its first argument.
#!/usr/bin/env python3
from lxml import etree
import sys
# sys.argv[1] is the first command-line argument (sys.argv[0] would be the script itself)
print('\n'.join(etree.tostring(x, encoding="utf-8", with_tail=False).decode("utf-8") for x in (lambda i: etree.parse(i, etree.HTMLParser(remove_blank_text=1, remove_comments=1)).xpath("//*[name()='p' or name()='head']"))(sys.argv[1])))
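If installing lxml is not an option, a similar result can be sketched with Python's standard html.parser module. The class below collects the text of <p> elements that carry no attributes, which is one way to skip <p class="en-cpy">. The names PlainParagraphs and extract_plain_paragraphs are hypothetical, and the sketch assumes that paragraphs are not nested.

```python
from html.parser import HTMLParser


class PlainParagraphs(HTMLParser):
    """Collect the text of <p> elements that have no attributes."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a wanted <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not attrs:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_plain_paragraphs(html):
    parser = PlainParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


line = ('<p>A paragraph comes here</p>'
        '<span class="more-about">Some text here</span>'
        '<p class="en-cpy">Copyright © 2012 </p>')
print(extract_plain_paragraphs(line))
```

Because the decision is made per tag rather than per line, it does not matter that the unwanted tags sit on the same line as a wanted one.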

How to fold/unfold HTML tags with Vim

Is there a plugin to fold HTML tags in Vim?
Or is there another way to set up a shortcut to fold or unfold HTML tags?
I would like to fold/unfold HTML tags just like I do with indentation folding.

I have found zfat (or, equally, zfit) works well for folding with HTML documents. za will toggle (open or close) an existing fold. zR opens all the folds in the current document, zM effectively re-enables all existing folds marked in the document.
If you find yourself using folds extensively, you could make some handy keybindings for yourself in your .vimrc.

If you indent your HTML the following should work:
set foldmethod=indent
The problem with this, I find, is there are too many folds. To get around this I use zO and zc to open and close nested folds, respectively.
See :help fold-indent for more information:
The folds are automatically defined by the indent of the lines.
The foldlevel is computed from the indent of the line, divided by the
'shiftwidth' (rounded down). A sequence of lines with the same or higher fold
level form a fold, with the lines with a higher level forming a nested fold.
The nesting of folds is limited with 'foldnestmax'.
Some lines are ignored and get the fold level of the line above or below it,
whichever is lower. These are empty or white lines and lines starting
with a character in 'foldignore'. White space is skipped before checking for
characters in 'foldignore'. For C use "#" to ignore preprocessor lines.
When you want to ignore lines in another way, use the 'expr' method. The
indent() function can be used in 'foldexpr' to get the indent of a line.

Folding HTML with foldmethod=syntax is simpler.
This answer is based on "HTML syntax folding in vim"; the author is @Ingo Karkat.
Set your fold method to syntax with the following:
on the Vim command line, :set foldmethod=syntax
or put the setting in ~/.vim/after/ftplugin/html.vim
setlocal foldmethod=syntax
Also note so far, the default syntax script only folds a multi-line
tag itself, not the text between the opening and closing tag.
So, this gets folded:
<div
class="foo"
id="bar"
>
And this doesn't
<div>
<b>text between here</b>
</div>
To fold the content between tags, you need to extend the syntax script with
the following, best placed in ~/.vim/after/syntax/html.vim.
Syntax folding is then performed for all but void HTML elements
(those which don't have a closing tag, like <br>):
syntax region htmlFold start="<\z(\<\(area\|base\|br\|col\|command\|embed\|hr\|img\|input\|keygen\|link\|meta\|para\|source\|track\|wbr\>\)\@![a-z-]\+\>\)\%(\_s*\_[^/]\?>\|\_s\_[^>]*\_[^>/]>\)" end="</\z1\_s*>" fold transparent keepend extend containedin=htmlHead,htmlH\d

Install the js-beautify command (JavaScript version):
npm -g install js-beautify
wget --no-check-certificate https://www.google.com.hk/ -O google.index.html
js-beautify -f google.index.html -o google.index.bt.html
[Screenshots omitted: the original HTML of http://www.google.com.hk, and the same file after js-beautify with Vim folds applied.]

An add-on to the answer by James Lai.
Initially my foldmethod was syntax, so zfat didn't work.
The solution is to set foldmethod to manual:
:setlocal foldmethod=manual
To check which foldmethod is in use:
:setlocal foldmethod?

First set foldmethod=syntax, then try zfit to fold a tag and zo to unfold it. It works well in my Vim.

How can I remove a variable number of lines from the top and bottom of multiple HTML documents?

I have a large number of HTML documents that need a variable number of lines removed from the top and bottom. The part I want always starts with <div class="someclass"> and the bottom section always starts with <div class="bottomouter">. Something like this:
<html>
[...]
<div class="someclass"><!-- stuff i want to keep --></div>
<div class="bottomouter">[...]</div>
[...]
</html>
How could this be accomplished?
I'm working on a Linux box so I have access to Perl, Sed, Awk, &c. However, I don't know how to approach this (or if this is the right place to ask).
Edit: To clarify, I'm moving a bunch of static documents into a template system and they need the headers and footers removed.

How about a perl script like this:
#!/usr/bin/perl -n
$output_enabled = 1 if (/^<div class="someclass">/);
$output_enabled = 0 if (/^<div class="bottomouter">/);
print if ($output_enabled);
The -n option tells perl to apply the script to each line of input, putting the line in the $_ variable (which is used implicitly in a lot of places in Perl; think of it like the word "it"). I set the $output_enabled variable (which persists across lines since it's a global variable, not declared with my) to 1 (true) if the current line matches the regex /^<div class="someclass">/, that is, if it starts with <div class="someclass">. Similarly, I set $output_enabled to 0 (false) if the line starts with <div class="bottomouter">. Finally, I print out the line if $output_enabled is true (it's initially false because it's undefined).
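The same flag-based logic can be sketched in Python for readers less familiar with Perl. The function name keep_section and its default marker strings are assumptions taken from the example in the question; like the Perl script, it keeps the line that starts the wanted part and drops the line that starts the bottom section.

```python
def keep_section(lines,
                 start='<div class="someclass">',
                 stop='<div class="bottomouter">'):
    """Return the lines from the `start` marker up to (not including) `stop`."""
    kept = []
    output_enabled = False  # nothing is kept until the start marker is seen
    for line in lines:
        if line.startswith(start):
            output_enabled = True
        if line.startswith(stop):
            output_enabled = False
        if output_enabled:
            kept.append(line)
    return kept


document = [
    '<html>',
    '[...]',
    '<div class="someclass"><!-- stuff i want to keep --></div>',
    '<div class="bottomouter">[...]</div>',
    '[...]',
    '</html>',
]
print("\n".join(keep_section(document)))
```

To process many files, the function could be called once per file on the result of reading its lines, with trailing newlines stripped or kept consistently.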

sed -n '/begPattern/,/endPattern/p'

To output the part of the file between the delimiting lines without including them:
sed '1,/<div class="someclass">/d;/<div class="bottomouter">/,$d' inputfile
