Preserving whitespace in XQuery HTML output - html

When I run galax-run a.xq, where a.xq is
<html>
<body>
<ul>
{
for $x in doc("books.xml")/bookstore/book
return <li>{data($x/title)}</li>
}
</ul>
</body>
</html>
the output is all on one line. How do I keep the formatting (new lines and other white spaces) as in a.xq?

Your question is about "Boundary Whitespace", which is either stripped or preserved, with an implementation defined default behaviour. You can however override the default by using a boundary-space declaration. For preserving boundary whitespace, use
declare boundary-space preserve;
in the query prolog. See http://www.w3.org/TR/xquery/#id-boundary-space-decls for details.
Note that this governs the layout of constructed nodes. Their external appearance may also be affected by serialization settings. The serializer may have an option to re-introduce boundary space, even if it was stripped at construction time. You would have to consult implementation-specific documentation to find out.

Turns out xmllint --format a.xml will print a neatly formatted version of a.xml. That was all I needed. You can also pipe to xmllint, like this:
galax-run a.xq | xmllint --format -

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected in all the input that is possible.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
See the demo and search and replace dialog settings:
Updates
The comment section was full tried to suggest a code here,
Let me know if this is a bit close to your intended result,
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

Why do some strings contain " " and some " ", when my input is the same(" ")?

My problem occurs when I try to use some data/strings in a p-element.
I start of with data like this:
data: function() {
return {
reportText: {
text1: "This is some subject text",
text2: "This is the conclusion",
}
}
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference, one p element uses and the other , even though I started of with both strings only using . I know and technically represent the same thingm, but the problem with the string is that it gets treated as a string with 1 large word instead of multiple separate words. This screws up my layout and I can't solve this by using certain css properties (word-wrap etc.)
Other things I have tried:
Tried sanitizing the strings by using .replace( , ), but that doesn't do anything. I assume this is because it basically is the same, so there is nothing to really replace. Same reason why I have to use blockcode on stackoverflow to make the destinction between and .
Logged the data from vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see string with 's
So I have the following questions:
Why does this happen? I can't seem to find any logical explanation why it sometimes uses 's and sometimes uses 's, it seems random, but I am sure I am missing something.
Any other things I could try to follow the path my string takes, so I can see where the transformation from to happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/[\\u00A0]/g, ' ')
Explanation:
JavaScript typically allows the use of unicode characters in two ways: you can input the rendered character directly, or you can use a unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like ).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that compares an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") equal to:
<p a="0 " b="1">a b c</p>
Note how:
text multiple whitespaces were converted to a single whitespace
newlines were converted to whitespaces
text trailing and heading whitespaces were stripped
attributes were put on a standard order
attribute values were unchanged, including trailing whitespaces
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
beautifulsoup. Orders the attributes, but does not deal well with whitespaces?
A function formatter regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python only solution like this would be ideal.
looking for a Javascript function similar to isEqualNode() (does not work because ignores nodeVaue) + some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.
I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstrig tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L74

How to fold/unfold HTML tags with Vim

Is there some plugin to fold HTML tags in Vim?
Or there is another way to setup a shortcut to fold or unfold html tags?
I would like to fold/unfold html tags just like I do with indentation folding.
I have found zfat (or, equally, zfit) works well for folding with HTML documents. za will toggle (open or close) an existing fold. zR opens all the folds in the current document, zM effectively re-enables all existing folds marked in the document.
If you find yourself using folds extensively, you could make some handy keybindings for yourself in your .vimrc.
If you indent your HTML the following should work:
set foldmethod=indent
The problem with this, I find, is there are too many folds. To get around this I use zO and zc to open and close nested folds, respectively.
See help fold-indent for more information:
The folds are automatically defined by the indent of the lines.
The foldlevel is computed from the indent of the line, divided by the
'shiftwidth' (rounded down). A sequence of lines with the same or higher fold
level form a fold, with the lines with a higher level forming a nested fold.
The nesting of folds is limited with 'foldnestmax'.
Some lines are ignored and get the fold level of the line above or below it,
whichever is lower. These are empty or white lines and lines starting
with a character in 'foldignore'. White space is skipped before checking for
characters in 'foldignore'. For C use "#" to ignore preprocessor lines.
When you want to ignore lines in another way, use the 'expr' method. The
indent() function can be used in 'foldexpr' to get the indent of a line.
Folding html with foldmethod syntax, which is simpler.
This answer is based on HTML syntax folding in vim. author is #Ingo Karcat.
set your fold method to be syntax with the following:
vim command line :set foldmethod=syntax
or put the setting in ~/.vim/after/ftplugin/html.vim
setlocal foldmethod=syntax
Also note so far, the default syntax script only folds a multi-line
tag itself, not the text between the opening and closing tag.
So, this gets folded:
<div
class="foo"
id="bar"
>
And this doesn't
<div>
<b>text between here</b>
</div>
To get folded between tags, you need extend the syntax script, via
the following, best place into ~/.vim/after/syntax/html.vim
The syntax folding is performed between all but void html elements
(those which don't have a closing sibling, like <br>)
syntax region htmlFold start="<\z(\<\(area\|base\|br\|col\|command\|embed\|hr\|img\|input\|keygen\|link\|meta\|para\|source\|track\|wbr\>\)\#![a-z-]\+\>\)\%(\_s*\_[^/]\?>\|\_s\_[^>]*\_[^>/]>\)" end="</\z1\_s*>" fold transparent keepend extend containedin=htmlHead,htmlH\d
Install js-beautify command(JavaScript version)
npm -g install js-beautify
wget --no-check-certificate https://www.google.com.hk/ -O google.index.html
js-beautify -f google.index.html -o google.index.bt.html
http://www.google.com.hk orignal html:
js-beautify and vim fold:
Add on to answer by James Lai.
Initially my foldmethod=syntax so zfat won't work.
Solution is to set the foldemethod to manual
:setlocal foldmethod=manual
to check which foldmethod in use,
:setlocal foldmethod?
Firstly set foldmethod=syntax and try zfit to fold start tag and zo to unfold tags, It works well on my vim.

How can I remove an entire HTML tag (and its contents) by its class using a regex?

I am not very good with Regex but I am learning.
I would like to remove some html tag by the class name. This is what I have so far :
<div class="footer".*?>(.*?)</div>
The first .*? is because it might contain other attribute and the second is it might contain other html stuff.
What am I doing wrong? I have try a lot of set without success.
Update
Inside the DIV it can contain multiple line and I am playing with Perl regex.
As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );
for my $node ( $tree->findnodes( '//*[#class="footer"]' ) ) {
$node->replace_with_content; # delete element, but not the children
}
print $tree->as_HTML;
You will also want to allow for other things before class in the div tag
<div[^>]*class="footer"[^>]*>(.*?)</div>
Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?
Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:
<div>
<div class="footer">
<div>Hi!</div>
</div>
</div>
Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
Pseudocode that should map closely to XML::DOM:
document = //load document
divs = document.getElementsByTagName("div");
for(div in divs) {
if(div.getAttributes["class"] == "footer") {
parent = div.getParent();
for(child in div.getChildren()) {
// filter attribute types?
parent.insertBefore(div, child);
}
parent.removeChild(div);
}
}
Here is a perl library, HTML::DOM, and another, XML::DOM
.NET has built-in libraries to handle dom parsing.
In Perl you need the /s modifier, otherwise the dot won't match a newline.
That said, using a proper HTML or XML parser to remove unwanted parts of a HTML file is much more appropriate.
<div[^>]*class="footer"[^>]*>(.*?)</div>
Worked for me, but needed to use backslashes before special characters
<div[^>]*class=\"footer\"[^>]*>(.*?)<\/div>
Partly depends on the exact regex engine you are using - which language etc. But one possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case insensitive.
<div class=\"footer\".*?>(.*?)<\/div>
Otherwise please say what language/platform you are using - .NET, java, perl ...
Try this:
<([^\s]+).*?class="footer".*?>([.\n]*?)</([^\s]+)>
Your biggest problem is going to be nested tags. For example:
<div class="footer"><b></b></div>
The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.
This will be tricky because of the greediness of regular expressions, (Note that my examples may be specific to perl, but I know that greediness is a general issue with REs.) The second .*? will match as much as possible before the </div>, so if you have the following:
<div class="SomethingElse"><div class="footer"> stuff </div></div>
The expression will match:
<div class="footer"> stuff </div></div>
which is not likely what you want.
why not <div class="footer".*?</div> I'm not a regex guru either, but I don't think you need to specify that last bracket for your open div tag