Ignore any blank space or line break in git-diff - html

I have the same HTML file rendered in two different ways and want to compare it using git diff, taking care of ignoring every white-space, tab, line-break, carriage-return, or anything that is not strictly the source code of my files.
I'm currently trying this:
git diff --no-index --color --ignore-all-space <file1> <file2>
but when some HTML tags are collapsed onto one line (instead of one per line and indented), git diff reports a difference, although for me there is none.
<html><head><title>TITLE</title><meta ......
is different from
<html>
<head>
<title>TITLE</title>
<meta ......
What option am I missing to make git diff treat these as the same?

git diff can compare files line by line or word by word, and it lets you define what counts as a word. Here you can define every non-space character as a word. That way the comparison ignores all whitespace, including spaces, tabs, line breaks and carriage returns, which is what you need.
To do this, use the --word-diff-regex option and set it to [^[:space:]]. Quote the pattern so the shell does not try to expand it as a glob. Refer to the git diff documentation for details.
git diff --no-index --word-diff-regex='[^[:space:]]' <file1> <file2>
Here's an example. I created two files, with a.html as follows:
<html><head><title>TITLE</title><meta>
With b.html as follows:
<html>
<head>
<title>TI==TLE</title>
<meta>
By running
git diff --no-index --word-diff-regex='[^[:space:]]' a.html b.html
It highlights the difference between TITLE and TI{+==+}TLE in the two files in plain mode, as follows. You can also pass --word-diff=<mode> to display the result differently. The mode can be color, plain, porcelain or none; plain is the default.
diff --git a/d.html b/a.html
index df38a78..306ed3e 100644
--- a/d.html
+++ b/a.html
@@ -1 +1,4 @@
<html>
<head>
<title>TI{+==+}TLE</title>
<meta>
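The same idea, treating every non-space character as its own token, can be sketched outside git. This is only an illustration using Python's difflib, not what git does internally; the names non_space_tokens and word_diff are made up for the example, and str.isspace() only roughly corresponds to [^[:space:]].

```python
import difflib


def non_space_tokens(text: str) -> list[str]:
    """Return every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def word_diff(old: str, new: str) -> str:
    """Render a plain-mode style diff over non-whitespace characters."""
    a, b = non_space_tokens(old), non_space_tokens(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i2 > i1:  # characters only present in the old text
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j2 > j1:  # characters only present in the new text
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


a_html = "<html><head><title>TITLE</title><meta>\n"
b_html = "<html>\n<head>\n<title>TI==TLE</title>\n<meta>\n"
print(word_diff(a_html, b_html))
# prints <html><head><title>TI{+==+}TLE</title><meta>
```

Line breaks and indentation never reach the comparison, so only the inserted == shows up.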

Running git diff --help lists several whitespace-related options, such as:
--ignore-cr-at-eol
Ignore carriage-return at the end of line when doing a comparison.
--ignore-space-at-eol
Ignore changes in whitespace at EOL.
-b, --ignore-space-change
Ignore changes in amount of whitespace. This ignores whitespace at line end, and considers all other sequences of one or more whitespace
characters to be equivalent.
-w, --ignore-all-space
Ignore whitespace when comparing lines. This ignores differences even if one line has whitespace where the other line has none.
--ignore-blank-lines
Ignore changes whose lines are all blank.
You can combine these according to your needs. The following command worked for me:
git diff --ignore-blank-lines --ignore-all-space --ignore-cr-at-eol

This does the trick for me:
git diff --ignore-blank-lines

git diff compares files line by line.
It compares the first line of file1 with the first line of file2 and, since they are not the same, reports a difference.
Ignoring whitespace means that foo bar will match foobar if both are on the same line. Since your content spans multiple lines in one file and a single line in the other, the files will always differ.
If you really want to check that the files contain exactly the same non-whitespace characters, you could try something like this (it uses process substitution, so it needs bash or a similar shell):
diff <(perl -ne 's/\s+//g; print' file1) <(perl -ne 's/\s+//g; print' file2)
Hope it solves your problem!
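If a yes/no answer is enough, the same normalisation is easy to sketch in Python. This is a minimal sketch, assuming both files fit in memory; strip_whitespace and same_ignoring_whitespace are hypothetical helper names, not part of any tool mentioned above.

```python
import re


def strip_whitespace(text: str) -> str:
    """Drop every whitespace character (spaces, tabs, CR, LF, ...)."""
    return re.sub(r"\s+", "", text)


def same_ignoring_whitespace(left: str, right: str) -> bool:
    """True if both texts contain the same non-whitespace characters in the same order."""
    return strip_whitespace(left) == strip_whitespace(right)


collapsed = "<html><head><title>TITLE</title><meta>"
expanded = "<html>\r\n\t<head>\r\n\t\t<title>TITLE</title>\r\n\t\t<meta>\r\n"
print(same_ignoring_whitespace(collapsed, expanded))
# prints True
```

Note that this also treats whitespace inside text content and attribute values as insignificant, just like the perl one-liner.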

Related

Issues with parsing HTML with ragel

In my project I need to extract links from HTML document.
For this purpose I've prepared ragel HTML grammar, primarily based on this work:
https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
(mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript )
Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far:
If I specify this text as an input:
bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
my parser correctly extracts the first link, but not the second one.
The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
In general, if any text other than spaces comes directly before the '<a' tag, the parser treats it as content and does not recognize the tag opening.
This repo, https://github.com/amdei/ragel_html_sample, contains an intentionally simplified sample of the grammar, meant to be built as a C program (ngx_url_html_portion.rl).
There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
To play with it, generate the .c file from the grammar:
ragel ngx_url_html_portion.rl
then compile the resulting .c file and run the program.
The input file should be in the same directory.
I would be sincerely grateful for any clue.
The issue with the defined FSM is that 'content' includes all characters up to the next space. You should exclude the HTML tag opening '<' from the rule. Here is the diff for illustration:
$ git diff
diff --git a/ngx_url_html_portion.rl b/ngx_url_html_portion.rl
index ccef0ca..1f8dcf0 100644
--- a/ngx_url_html_portion.rl
+++ b/ngx_url_html_portion.rl
@@ -145,7 +145,7 @@ void copy2hrefbuf(par_t* par, u_char* p){
);
content = (
- any - (space )
+ any - (space ) - '<'
)+;
html_space = (
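To see why excluding '<' matters, here is a rough regex analogy in Python. It is not Ragel and does not reproduce the grammar; the two patterns merely mimic "any - space" and "any - space - '<'" on the input from the question.

```python
import re

text = 'bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">'

# Analogy of `any - (space)`: content runs until the next whitespace.
old_content = re.compile(r"[^\s]+")
# Analogy of `any - (space) - '<'`: content also stops at a tag opening.
new_content = re.compile(r"[^\s<]+")

start = text.index("cccc")
print(old_content.match(text, start).group())  # prints cccc<a
print(new_content.match(text, start).group())  # prints cccc
```

With the old rule the '<a' is swallowed into the content token, so nothing is left to start a tag; with the new rule the content ends right before '<'.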

Notepad++ RegEx: remove everything between 2 HTML tags (with line breaks between)

In an HTML document, I want to use Notepad++ to remove
everything in the marked area.
The start point to remove is "<imgCRLF" (inclusive), followed by everything in between, including the CRLFs,
up to and including "DetailsCRLF</aCRLF" as the end point.
I started simple with <img.*<a/> and ticked
and I tried to improve this starting point, but either nothing was deleted or too much was :)
Use <img.*?</a>[\r\n]*. The .* is too greedy, whereas .*? stops at the first </a>. The trailing [\r\n]* consumes the line breaks after </a>.
Also, if you only want to match <img when it is immediately followed by a line break, you can use another regex:
<img[\r\n].*?</a>[\r\n]*
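The greedy versus lazy difference can be tried out quickly in Python. This is a sketch with a made-up sample document; Python's re is not the regex engine Notepad++ uses, and re.DOTALL stands in for the ". matches newline" option of the Replace dialog.

```python
import re

# Hypothetical sample: the <img ... </a> block should go, the rest should stay.
html = (
    "<p>keep</p>\r\n"
    "<img\r\nsrc=\"x.png\">\r\nDetails\r\n</a>\r\n"
    "<p>also keep</p>\r\n"
    "<a href=\"#\">later</a>\r\n"
)

lazy = re.compile(r"<img.*?</a>[\r\n]*", re.DOTALL)
greedy = re.compile(r"<img.*</a>[\r\n]*", re.DOTALL)

# Lazy: stops at the first </a>, so only the marked block is removed.
print(repr(lazy.sub("", html)))
# Greedy: runs to the last </a> in the document and removes too much.
print(repr(greedy.sub("", html)))
```

Only the lazy pattern leaves the second paragraph and the later link in place.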

How can I get pandoc multiline table cells to work when outputting to docx?

I have the latest version of pandoc installed. I have created a test.txt file with the following multiline table copied directly from the pandoc user guide section on multiline tables:
-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------
I then use the following pandoc command with the multiline_tables extension to output to docx:
pandoc -f markdown_mmd+multiline_tables -o test.docx test.txt
The test.docx file is created, but it only shows the first line of each multiline cell. No subsequent lines of the multiline cells appear. So it looks like this (but in a Microsoft Word table):
-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
  Second    row                 5.0 Here's another one. Note
-------------------------------------------------------------
Does anyone know how to remedy this?
The multiline_tables extension is enabled by default in Pandoc's Markdown, so there is no need to use the MultiMarkdown flavour. This should work:
pandoc -o test.docx test.txt
P.S. Most people name their files test.md or test.markdown.
P.P.S. On rereading your question, the thought crossed my mind that your problem might have more to do with how Word displays the generated docx file. Have you tried different versions of Word, or tried outputting to HTML instead? (The docx format is quite brittle.)
Step 1:
Run: pandoc --print-default-data-file reference.docx > myref.docx
(On Linux, prefix pandoc with ./ if you run the binary from the current directory.)
You will get a document named myref.docx.
1) Open myref.docx
2) Select (or click anywhere inside) the table
3) Click the "Table Design" tab
4) Click the little down-arrow icon on the Quick Styles list (which annoyingly only appears when you hover your mouse over the styles)
5) Click "Modify Table Style" on the popup menu
6) Change the style so that the subsequent lines of each cell are shown.
7) Save the reference doc to the same folder
Step 2:
Run: pandoc anyTypeOfYourInput.html -s --reference-doc=myref.docx -o resultOutput.docx
(The same ./ note applies on Linux.)
The above answer is based on Mr. Iansco's answer on GitHub: https://github.com/jgm/pandoc/issues/3275.
Search that page for the keyword 'iansco' to find the original answer quickly.

How to fold/unfold HTML tags with Vim

Is there some plugin to fold HTML tags in Vim?
Or there is another way to setup a shortcut to fold or unfold html tags?
I would like to fold/unfold html tags just like I do with indentation folding.
I have found that zfat (or, equally, zfit) works well for folding HTML documents. za toggles (opens or closes) an existing fold. zR opens all the folds in the current document, and zM closes them all again.
If you find yourself using folds extensively, you could make some handy keybindings for yourself in your .vimrc.
If you indent your HTML the following should work:
set foldmethod=indent
The problem with this, I find, is there are too many folds. To get around this I use zO and zc to open and close nested folds, respectively.
See :help fold-indent for more information:
The folds are automatically defined by the indent of the lines.
The foldlevel is computed from the indent of the line, divided by the
'shiftwidth' (rounded down). A sequence of lines with the same or higher fold
level form a fold, with the lines with a higher level forming a nested fold.
The nesting of folds is limited with 'foldnestmax'.
Some lines are ignored and get the fold level of the line above or below it,
whichever is lower. These are empty or white lines and lines starting
with a character in 'foldignore'. White space is skipped before checking for
characters in 'foldignore'. For C use "#" to ignore preprocessor lines.
When you want to ignore lines in another way, use the 'expr' method. The
indent() function can be used in 'foldexpr' to get the indent of a line.
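To make the rule in the help text concrete, here is a simplified model of it in Python. It is only a sketch of the quoted description, not Vim's implementation: it assumes space-only indentation and a hypothetical shiftwidth of 4, and it ignores 'foldignore' and 'foldnestmax'.

```python
def fold_levels(lines: list[str], shiftwidth: int = 4) -> list[int]:
    """Fold level per line: indent // shiftwidth; blank lines take the lower neighbour."""
    raw = []
    for line in lines:
        if line.strip() == "":
            raw.append(None)  # blank line, resolved below
        else:
            indent = len(line) - len(line.lstrip(" "))
            raw.append(indent // shiftwidth)

    def nearest(index: int, step: int) -> int:
        """Level of the closest non-blank line above (step=-1) or below (step=1)."""
        i = index + step
        while 0 <= i < len(raw):
            if raw[i] is not None:
                return raw[i]
            i += step
        return 0

    return [
        level if level is not None else min(nearest(i, -1), nearest(i, 1))
        for i, level in enumerate(raw)
    ]


page = ["<html>", "    <body>", "", "        <p>text</p>", "    </body>", "</html>"]
print(fold_levels(page))
# prints [0, 1, 1, 2, 1, 0]
```

Every increase in level opens a nested fold, which is why deeply indented HTML ends up with many folds.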
Folding HTML with foldmethod=syntax is simpler.
This answer is based on "HTML syntax folding in vim"; the author is @Ingo Karkat.
Set your fold method to syntax, either on the Vim command line:
:set foldmethod=syntax
or by putting the setting in ~/.vim/after/ftplugin/html.vim:
setlocal foldmethod=syntax
Also note that, so far, the default syntax script only folds a multi-line
tag itself, not the text between the opening and closing tags.
So, this gets folded:
<div
class="foo"
id="bar"
>
And this doesn't
<div>
<b>text between here</b>
</div>
To fold the text between tags, you need to extend the syntax script with
the following, best placed in ~/.vim/after/syntax/html.vim.
Syntax folding is then performed for all but void HTML elements
(those which don't have a closing sibling, like <br>):
syntax region htmlFold start="<\z(\<\(area\|base\|br\|col\|command\|embed\|hr\|img\|input\|keygen\|link\|meta\|para\|source\|track\|wbr\>\)\@![a-z-]\+\>\)\%(\_s*\_[^/]\?>\|\_s\_[^>]*\_[^>/]>\)" end="</\z1\_s*>" fold transparent keepend extend containedin=htmlHead,htmlH\d
Install the js-beautify command (JavaScript version):
npm -g install js-beautify
wget --no-check-certificate https://www.google.com.hk/ -O google.index.html
js-beautify -f google.index.html -o google.index.bt.html
Original HTML of http://www.google.com.hk:
After js-beautify, with Vim folds:
An addition to the answer by James Lai.
Initially my foldmethod was syntax, so zfat didn't work.
The solution is to set the foldmethod to manual:
:setlocal foldmethod=manual
To check which foldmethod is in use:
:setlocal foldmethod?
First set foldmethod=syntax, then try zfit to fold the start tag and zo to unfold tags. It works well in my Vim.

How can I remove a variable number of lines from the top and bottom of multiple HTML documents?

I have a large number of HTML documents that need a variable number of lines removed from the top and bottom. The part I want always starts with <div class="someclass"> and the bottom section always starts with <div class="bottomouter">. Something like this:
<html>
[...]
<div class="someclass"><!-- stuff i want to keep --></div>
<div class="bottomouter">[...]</div>
[...]
</html>
How could this be accomplished?
I'm working on a Linux box so I have access to Perl, Sed, Awk, &c. However, I don't know how to approach this (or if this is the right place to ask).
Edit: To clarify, I'm moving a bunch of static documents into a template system and they need the headers and footers removed.
How about a perl script like this:
#!/usr/bin/perl -n
$output_enabled = 1 if (/^<div class="someclass">/);
$output_enabled = 0 if (/^<div class="bottomouter">/);
print if ($output_enabled);
The -n option tells perl to apply the script to each line of input, putting the line in the $_ variable (which is used implicitly in a lot of places in Perl; think of it like the word "it"). I set the $output_enabled variable (which persists across lines since it's a global variable, not declared with my) to 1 (true) if the current line matches the regex /^<div class="someclass">/, that is, if it starts with <div class="someclass">. Similarly, I set $output_enabled to 0 (false) if the line starts with <div class="bottomouter">. Finally, I print out the line if $output_enabled is true (it's initially false because it's undefined).
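For comparison, the same flag-based logic can be sketched in Python. This is a minimal sketch, not a replacement for the script above; extract_body is a hypothetical name, and like the Perl version it assumes both markers appear at the very start of a line.

```python
def extract_body(
    lines: list[str],
    start_marker: str = '<div class="someclass">',
    end_marker: str = '<div class="bottomouter">',
) -> list[str]:
    """Keep lines from the start marker up to, but not including, the end marker."""
    keeping = False
    kept = []
    for line in lines:
        if line.startswith(start_marker):
            keeping = True   # same role as $output_enabled = 1
        if line.startswith(end_marker):
            keeping = False  # same role as $output_enabled = 0
        if keeping:
            kept.append(line)
    return kept


document = [
    "<html>",
    "<head>[...]</head>",
    '<div class="someclass"><!-- stuff i want to keep --></div>',
    "<p>more content</p>",
    '<div class="bottomouter">[...]</div>',
    "</html>",
]
print("\n".join(extract_body(document)))
```

The header and footer lines are dropped, and the line that opens the wanted section is kept.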
sed -n '/begPattern/,/endPattern/p'
To output the part of the file between the delimiting lines without including them:
sed '1,/<div class="someclass">/d;/<div class="bottomouter">/,$d' inputfile