Regex for no space between attributes html - html

How can I detect a missing space between attributes?
Example:
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
Correct: 1, 3, 4
Incorrect: 2, 5
How can I detect the incorrect ones?
I've tried with this:
<(.*?=(['"]).*?\2)([\S].*)|(^/)>
But it's not working.

You should not use regex to parse HTML, except for learning purposes.
http://regexr.com/3cge1
<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>
This regular expression matches even if you don't have any attribute at all. It works for self-closing tags, and if the attribute has no value.
<\w+ Match opening < and \w characters.
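(\s+[\w-]+(=(['"]?)[^"']*\3)?)* zero or more attributes, each of which must start with white space. Each attribute has two parts:
\s+[\w-]+ attribute name after mandatory white space
(=(['"]?)[^"']*\3)? optional attribute value; \3 requires the closing quote to be the same as the opening one (or absent if there was no opening quote)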
\s*/?> optional white space and optional / followed by closing >.
Here is a test for the strings:
var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
! '<div style="margin:37px;"/></div>'.match(re);
false
! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true
! '<span title="" style="margin:37px;" /></span>'.match(re);
false
! '<a title="u" hghghgh title="j" >'.match(re);
false
! '<a title=""gg ff>'.match(re);
true
Display all incorrect tags:
var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
html.match(tagRegex).forEach(function(m) {
if(!m.match(validRegex)) {
console.log('Incorrect', m);
}
});
This will output:
Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>
Update for the comments
<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
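For comparison, here is a minimal Python sketch of the same two-step check (find every tag, then test it against the "valid" pattern), using the updated pattern above. The names TAG_RE, VALID_RE and find_incorrect_tags are my own, not from the answer, and like the original it assumes attribute values contain no > character.

```python
import re

# Any opening or self-closing tag, whether well-formed or not.
TAG_RE = re.compile(r"<\w+[^>]*>")

# A well-formed tag: every attribute is preceded by white space; a value may be
# double-quoted, single-quoted, or unquoted.
VALID_RE = re.compile(
    r"""<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>"""
)


def find_incorrect_tags(html):
    """Return the tags in html that do not fully match VALID_RE."""
    return [tag for tag in TAG_RE.findall(html) if not VALID_RE.fullmatch(tag)]


sample = "\n".join([
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
])

for tag in find_incorrect_tags(sample):
    print("Incorrect", tag)
```

Using fullmatch instead of an unanchored search means a tag only counts as valid when the whole tag fits the pattern.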

I got this pattern to work, finding incorrect lines 2 and 5 as you requested:
>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
>>> html = """
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
"""
>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg ff>
regex broken down:
p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
< - starting bracket
[a-z]+\s - 1 or more lowercase letters followed by a space
[a-z]+= - 1 or more lowercase letters followed by an equals sign
[\'\"] - match a single or double quote one time
[\w;:]* - match an alphanumeric character (a-zA-Z0-9_), a colon, or a semicolon 0 or more times
[\"\'] - again match a single or double quote one time
[\w]+ - match an alphanumeric character one or more times (this catches the missing space you wanted to detect)
.* - match anything 0 or more times(gets rest of the line)
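Note that the pattern above only inspects the first attribute of each tag and only allows [\w;:] inside the value. As a hedged alternative (a different technique, not the answerer's pattern), the sketch below flags any quoted value that is immediately followed by something other than white space, / or >. MISSING_SPACE_RE and lines_with_missing_space are hypothetical names, and the check assumes one tag per line and no ="..." sequences in ordinary text.

```python
import re

# A quoted attribute value whose closing quote is directly followed by a
# character that cannot legally come next (anything but white space, / or >).
MISSING_SPACE_RE = re.compile(r"""=("[^"]*"|'[^']*')(?=[^\s/>])""")


def lines_with_missing_space(html):
    """Return the lines of html that contain a value with no space after it."""
    return [line for line in html.splitlines() if MISSING_SPACE_RE.search(line)]


sample = "\n".join([
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
])

print("\n".join(lines_with_missing_space(sample)))
```

Because the lookahead is applied after every quoted value, a missing space after the second or third attribute is caught as well.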

Try this regex; I think it will work:
<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>
< - starting bracket
\w* - zero or more alphanumeric characters (the tag name)
[^=]*= - matches every character up to and including the first '='
["'][\w;:]*["'] - this matches two cases:
1. a single-quoted value, possibly empty
2. a double-quoted value, possibly empty
[\s/]+ - matches white space or '/', at least one occurrence
[^>]* - matches every character up to the closing bracket '>'

I'm not sure about this, as I'm not very experienced with regex, but this looks like it is working well:
JS Fiddle
<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/$1>))?
Currently <([a-z]+) will mostly work, but for web components and <ng-* tags a pattern such as [\w-]+ would be better, since \w on its own does not match the hyphen.
---------------
Output:
<div style="margin:37px;">div</div> correct
<span title=" style="margin:37px;" />span1</span> incorrect
<span title="" style="margin:37px;" />span2</span> correct
<a title="u" title="j">link</a> correct
<a title=""href="" alt="" required>test</a> incorrect
<img src="" data-abc="" required> correct
<input type=""style="" /> incorrect

Related

HTML::ELEMENT not finding all elements

I have this snippet of html:
<li class="result-row" data="2">
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2018-12-04 09:21" title="Tue 04 Dec 09:21:50 AM">Dec 4</time>
Link Text
and this perl code (not production, so no quality comments are necessary)
my $root = $tree->elementify();
my @rows = $root->look_down('class', 'result-row');
my $item = $rows[0];
say $item->dump;
my $date = $item->look_down('class', 'result-date');
say $date;
my $title = $item->look_down('class', 'result-title hdrlnk');
All outputs are as I expected except $date isn't defined.
When I look at the $item->dump, it looks like the time element doesn't show up in the output. Here's a snippet of the output from $item->dump where I would expect to see a <time...> element. All it shows is the text from the time element.
<li class="result-row" data="2"> #0.1.9.3.2.0
<a class="result-image gallery empty" href="https://localhost/1.html"> #0.1.9.3.2.0.0
<p class="result-info"> #0.1.9.3.2.0.1
<span class="icon icon-star" role="button"> #0.1.9.3.2.0.1.0
" "
<span class="screen-reader-text"> #0.1.9.3.2.0.1.0.1
"favorite this post"
" "
" Dec 4 "
<a class="result-title hdrlnk" data="2" href="https://localhost/1.html"> #0.1.9.3.2.0.1.2
"Link Text..."
" "
...
I've not used HTML::Element before. I rtfmed and didn't see any tag exclusions and I did a search of the package code for tags white/black lists (which wouldn't make sense, but neither does leaving out the time tag).
Does anyone know why the time element is not showing up in the dump and any search for it turns up nothing?
As an fyi, the rest of the code searches and finds elements without issue, it just appears to be the time tag that's missing.
HTML::TreeBuilder does not support HTML5 tags. Consider Mojo::DOM as an alternative that keeps up with the living HTML standard. I can't show how your whole code would look with Mojo::DOM since you've only shown a piece, but the Mojo::DOM equivalent of look_down is find (returns a Mojo::Collection arrayref) or at (returns the first element found or undef), both taking a CSS selector.

Scraping with Nokogiri::HTML - Can't get text from XPATH

I'm trying to scrape html with Nokogiri.
This is the html source:
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
I need to get the following text: 山东济南
Checked shortest XPATH with firebug:
//*[@id="J-From"]
Here is my ruby code:
doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text
However, it returns nothing.
What am I doing wrong?
xpath() returns an array containing the matches (it's actually called a NodeSet):
require 'nokogiri'
html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
}
doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')
target_tags.each do |target_tag|
puts target_tag.text
end
--output:--
山东济南
Edit: You can actually call text() on the Array, but it will return the concatenated results of the text for each match in the array--which is not something I've ever found useful--but because there is only one match you should have gotten the result 山东济南. There is nothing in your post that indicates why you didn't get that result.
If you only want a single result from your xpath, i.e. the first match, then you can use at_xpath():
target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text
--output:--
山东济南
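The same "all matches versus first match" distinction exists in other libraries. As a rough analogy only (this is Python's standard xml.etree.ElementTree, not Nokogiri, and it assumes the snippet is well-formed XML), findall returns a list of matches while find returns the first match or None:

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
"""

root = ET.fromstring(SNIPPET.strip())

# findall returns a list of every matching element (compare xpath()).
all_matches = root.findall(".//*[@id='J-From']")
for element in all_matches:
    print(element.text)

# find returns only the first match, or None if nothing matches
# (compare at_xpath()).
first_match = root.find(".//*[@id='J-From']")
if first_match is not None:
    print(first_match.text)
```

Checking for None before reading the text avoids an error when the selector matches nothing.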

Hover text is broken if text has special symbols when description is given

Example:
<a title="A web design community.'test'~`!@#$%^&*()-_+=\|][{};:,<.>?/ **"new test"** " href="http://css-tricks.com">CSS-Tricks</a>
In the tooltip, everything from the double quotes around "new test" onward is missing.
Is there any way to show content like this in the tooltip?
ex: testing 'welcome', # 3 $ ^ & * "flow"?
The problem is that the double quotes inside your title end the attribute value early. Escape them by replacing " with &quot;. funkwurm also recommends replacing < and > with &lt; and &gt; respectively to avoid errors in XML:
<a title="A web design community.'test'~`!@#$%^&*()-_+=\|][{};:,&lt;.&gt;?/ **&quot;new test&quot;** " href="http://css-tricks.com">CSS-Tricks</a>
You can also use this:
<a title="Answer to your's question.'Test It' :):)'B Happy' :):)&quot;new test&quot;**" href="http://css-tricks.com">CSS-Tricks</a>
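If the title text is produced by a program rather than typed by hand, the escaping can be left to a library call. A minimal Python sketch using the standard library's html.escape, which replaces &, <, > and both quote characters with entities (for example " becomes &quot;); title_attribute is a hypothetical helper name, not part of any answer here.

```python
import html


def title_attribute(text):
    """Return a complete title="..." attribute with text safely escaped."""
    # quote=True also escapes " and ', which is what an attribute value needs.
    return 'title="' + html.escape(text, quote=True) + '"'


print(title_attribute('testing "flow" <.>'))
# prints: title="testing &quot;flow&quot; &lt;.&gt;"
```

The browser decodes the entities again when it shows the tooltip, so the user sees the original characters.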

Remove <p> tags - Regular Expression (Regex)

I have some HTML and the requirement is to remove only starting <p> tags from the string.
Example:
input: <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text1 Here</span></p><p style="margin: 50pt"><span style="font:XXXX">Text2 Here</span></p> <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text3 Here</span></p>the string goes on like that
desired output: <span style="font:XXXX;"> Text1 Here</span></p><span style="font:XXXX">Text2 Here</span></p><span style="font:XXXX;"> Text3 Here</span></p>
Is it possible using regex? I have tried some combinations, but none of them worked. This is all a single string. Any advice is appreciated.
I'm sure you know the warnings about using regex to match html. With these disclaimers, you can do this:
Option 1: Leaving the closing </p> tags
This first option leaves the closing </p> tags, but that's what your desired output shows. :) Option 2 will remove them as well.
PHP
$replaced = preg_replace('~<p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<p[^>]*>/g, "");
Python
replaced = re.sub("<p[^>]*>", "", yourstring)
<p matches the beginning of the tag
The negative character class [^>]* matches any character that is not a closing >
> closes the match
we replace all this with an empty string
Option 2: Also removing the closing </p> tags
PHP
$replaced = preg_replace('~</?p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<\/?p[^>]*>/g, "");
Python
replaced = re.sub("</?p[^>]*>", "", yourstring)
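One caveat with <p[^>]*>: it also matches other tags whose names start with p, such as <pre> or <param>. Here is a minimal Python sketch that adds a word boundary \b after the p to avoid this; strip_p_tags and its closing_too parameter are names made up for this illustration.

```python
import re


def strip_p_tags(html, closing_too=False):
    """Remove <p ...> tags (and optionally </p>) but leave <pre>, <param>, etc."""
    # \b ensures the tag name is exactly "p", not a longer name starting with p.
    pattern = r"</?p\b[^>]*>" if closing_too else r"<p\b[^>]*>"
    return re.sub(pattern, "", html)


text = '<p style="margin: 50pt"><span>Text</span></p><pre>code</pre>'
print(strip_p_tags(text))
print(strip_p_tags(text, closing_too=True))
```

The same \b can be added to the PHP and JavaScript patterns above, since both regex flavors support it.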
This is a PCRE expression:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*<\/p>)/Ug
Replace each occurrence with $3, or just remove all occurrences of:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>/g
If you want to remove the closing tag as well, replace each occurrence of the following with $3:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*)<\/p>/Ug

Passing double quotes to Jscript

insertText is a JavaScript function that accepts two string parameters.
I need to pass two strings.
First parameter:
<img src="
Second:
">
I just can't figure out how to pass a double quote as a parameter.
This works
<a onClick="insertText('<em>', '</em>'); return false;">Italic</a>
This does not
<a onClick="insertText('<img src=/"', '/">'); return false;">Image</a>
Prints '); return false;">Image
You want to use \ rather than /
The escape character for JavaScript is \, not /. So try this:
<a onClick="insertText('<img src=\"', '\">'); return false;">Image</a>
Update:
The solution above doesn't work, because the double-quotes "belong" to the HTML and not to the JavaScript, so we can't escape them in the JavaScript code.
Use this instead:
<a onClick="insertText('<img src=\'', '\'>'); return false;">Image</a> // --> <img src='...'>
or
<a onClick='insertText("<img src=\"", "\">"); return false;'>Image</a> // --> <img src="...">
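When the link is generated by a program, both layers of quoting (JavaScript string, then HTML attribute) can be applied mechanically instead of by hand. A minimal Python sketch of that idea, assuming the arguments are plain strings: json.dumps produces a double-quoted JavaScript string literal, and html.escape then turns the double quotes into &quot; so they cannot end the attribute. onclick_attribute is a hypothetical helper name.

```python
import html
import json


def onclick_attribute(func_name, *args):
    """Build an HTML-safe onClick value that calls func_name with string args."""
    # Layer 1: each argument becomes a JavaScript string literal.
    js_args = ", ".join(json.dumps(arg) for arg in args)
    js_code = f"{func_name}({js_args}); return false;"
    # Layer 2: escape the whole script for use inside a quoted HTML attribute.
    return html.escape(js_code, quote=True)


attr = onclick_attribute("insertText", '<img src="', '">')
link = f'<a onClick="{attr}">Image</a>'
print(link)
```

The browser undoes layer 2 when it parses the attribute, so the JavaScript engine receives insertText("<img src=\"", "\">"); return false;.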
Since you are using jQuery, why don't you do it the jQuery way?
insertText = function(a, b) {
// your insertText implementation...
};
$('a').click(function() { // use the right selector, $('a') selects all anchor tags
insertText('<img src="', '">');
});
With this solution you can avoid the problems with the quotes.
Here's a working example: http://jsfiddle.net/jcDMN/
The golden rule is to alternate the quotation marks: use single quotes ' inside double quotes " and vice versa.
Also, use the backslash to escape a quote character that would otherwise end the JavaScript string.
For example, the following should work, as they apply the rules mentioned above:
<a onClick="insertText('<em>', '</em>'); return false;">Italic</a>
or
<a onClick='insertText("<em>", "</em>"); return false;'>Italic</a>
or
<a onClick="insertText('<img src=\'', '\'>'); return false;">Image</a>
or
<a onClick='insertText("<img src=\"", "\">"); return false;'>Image</a>
I hope this helps you ...
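You need to escape it. Inside a double-quoted HTML attribute, write the double quote as the HTML entity &quot;, because a backslash does not stop the HTML parser from ending the attribute:
<a onClick="insertText('<img src=&quot;', '&quot;>'); return false;">Image</a>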