Find duplicated id keys in HTML using a regex

Given an html file, how could I find if there's some repeated id value using a regular expression? I need it for searching it on SublimeText.
For example: using the id=("[^"]*").*id=\1 I can find duplicated id keys in the same line
<img id="key"><img id="key">
But what I need is to perform the same in multiple lines and with different pairs of keys. In this case for example key and key2 are repeated ids.
<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">
Note: I'm using the img tag only as an example; the html file is more complex.

For whatever reason, Sublime's . matcher doesn't include line breaks, so you'll need to do something like this: id=("[^"]+")(.|\n)*id=\1
Honestly though, I'd rather use Unix utilities:
grep -Eo 'id="[^"]+"' filename | sort | uniq -c
3 id="key"
2 id="key2"
1 id="key3"
If these are complete HTML documents, you could use the w3's HTML validator to catch dups along with other errors.

If all you're trying to do is find duplicated IDs, then here's a little Perl program I threw together that will do it:
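use strict;
use warnings;

my %ids;

# Count every double-quoted id attribute value across all input lines.
while ( <> ) {
    while ( /id="([^"]+)"/g ) {
        ++$ids{$1};
    }
}

# Report only the ids seen more than once.
while ( my ($id,$count) = each %ids ) {
    print "$id shows up $count times\n" if $count > 1;
}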
Call it "dupes.pl". Then invoke it like this:
perl dupes.pl file.html
If I run it on your sample, it tells me:
key shows up 3 times
key2 shows up 2 times
It has some restrictions, like it won't find id=foo or id='foo', but probably will help you down the road.
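The quoting restriction mentioned above can be sidestepped by letting an HTML parser pull out the attributes instead of a regex. The following is only a minimal Python sketch of that idea, not a port of the Perl script; the names IdCollector and duplicate_ids are made up for illustration, and it assumes the standard library's html.parser is good enough for the markup at hand.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Counts every id attribute value seen in start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed,
        # so id="x", id='x' and id=x all arrive as the value "x".
        for name, value in attrs:
            if name == "id" and value is not None:
                self.counts[value] += 1


def duplicate_ids(html):
    """Return {id: count} for ids that occur more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return {key: n for key, n in collector.counts.items() if n > 1}


sample = """<img id="key">
<img id=key2>
<img id='key'>
<img id="key3">
<img id="key2">
<img id="key">"""

print(duplicate_ids(sample))  # {'key': 3, 'key2': 2}
```

The sample deliberately mixes double-quoted, single-quoted and unquoted ids to show that all three forms are counted together.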

In Sublime Text's regex search, the . does not match line breaks by default. You can switch on single-line (dot-all) mode with a mode modifier so that . matches newlines as well:
(?s)id=("[^"]+").*id=\1
The (?s) is the single line mode modifier.
However, this regex does a poor job of finding all duplicate keys since it will only match from key to key in your sample HTML. You probably need a multi-step process to find all keys, which could be programmed. As others have shown, you'll need to (1) pull all the ids out first, then (2) group them and count them to determine which are dupes.
Alternately, the manual approach would be to change the regex pattern to look-ahead for duplicate ids, then you can find the next match in Sublime Text:
(?s)id=("[^"]+")(?=.*id=\1)
With the above pattern, and your sample HTML, you'll see the following matches highlighted:
<img id="key"> <-- highlighted (dupe found on 3rd line)
<img id="key2"> <-- highlighted (dupe found on 5th line)
<img id="key"> <-- highlighted (next dupe found on last line)
<img id="key3">
<img id="key2">
<img id="key">
Notice that the look-ahead doesn't highlight the actual dupes later in the file. It highlights the earlier occurrences, which only tells you that a dupe exists somewhere further down.
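To check which occurrences the look-ahead pattern flags outside of the editor, here is a small Python sketch. It assumes Python's re module behaves like Sublime Text's engine for this pattern, uses re.DOTALL in place of (?s), and the helper name flagged_lines is invented for the example.

```python
import re

sample = "\n".join([
    '<img id="key">',
    '<img id="key2">',
    '<img id="key">',
    '<img id="key3">',
    '<img id="key2">',
    '<img id="key">',
])

# re.DOTALL plays the role of the (?s) modifier: "." also matches newlines.
pattern = re.compile(r'id=("[^"]+")(?=.*id=\1)', re.DOTALL)


def flagged_lines(text):
    """Return (line_number, id) pairs for ids that occur again later on."""
    hits = []
    for m in pattern.finditer(text):
        line_no = text.count("\n", 0, m.start()) + 1
        hits.append((line_no, m.group(1).strip('"')))
    return hits


print(flagged_lines(sample))  # [(1, 'key'), (2, 'key2'), (3, 'key')]
```

The last occurrence of each id is never flagged, because nothing after it satisfies the look-ahead.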

Here is an AWK script that counts how often each img id value appears:
awk '{
    $2 = tolower($2);
    gsub(/(id|["=>])/, "", $2);
    if (NF == 2)
        imgs[$2]++;
}
END {
    for (img in imgs)
        printf "Img ID: %s\t appears %d times\n", img, imgs[img]
}' file.txt

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected in all the input that is possible.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
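The same keep-or-drop idea can be tried outside Notepad++. The sketch below is a Python approximation: Python's re has no (?1:...) conditional replacement, so a replacement function stands in for it. The function name keep_foo_sections is an assumption made for the example.

```python
import re

# First alternative: a {...} section containing foo. Second: any single char.
KEEP_OR_DROP = re.compile(r'\{[^}]*foo[^}]*\}|(.)', re.DOTALL)


def keep_foo_sections(text):
    def replace(m):
        # Group 1 matched: this is a stray character, so drop it.
        if m.group(1) is not None:
            return ""
        # Otherwise the whole match is a wanted section: keep it plus a newline.
        return m.group(0) + "\n"

    return KEEP_OR_DROP.sub(replace, text)


simplified_input = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>"""

print(keep_foo_sections(simplified_input), end="")
# {example foo1}
# {example foo2}
```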
Update
The comment section was full, so I'm suggesting a pattern here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

Extracting character sequence containing word

I've got an HTML string containing special character sequences looking like this:
[start_tag attr="value"][/end_tag]
I want to be able to extract one of these sequences containing specific attribute e.g:
[my_image_tag image_id="12345" attr2="..." ...]
and from the above example, I want to extract the whole thing with square brackets but using only one of the attributes and its value in this case - image_id="12345"
I tried using regex but it gives me the whole line whereas I need only the part of the line based on specific value as mentioned above.
Something like this should work:
my_string = '<h1>Heading1</h1>some text soem tex some text [some_tag attrs][/some_tag]some text some text [some_tag image_id="12345"] some text'
search_attrs = %w(image_id foo bar)
found = my_string =~ /(\[[^\]]*(#{search_attrs.join('|')})="[^"\]]*"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
For a specific attribute id and value, you can simplify it like so:
found = my_string =~ /(\[[^\]]* image_id="12345"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
It works by expanding the primary capture group to everything you're looking for.
However, this assumes you only need to extract one such attribute.
It also assumes that you don't care if the string crosses through any HTML tag boundaries. If you cared about that, then you'd need to first hash out the legal boundaries using an HTML parser, then search within those results.
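If more than one matching sequence may be present, the same pattern can return all of them. Here is a rough Python sketch of that variation (the answer above is Ruby; this is not a line-by-line port). The helper name find_tags and the second sample tag are made up for illustration.

```python
import re


def find_tags(text, attr, value):
    """Return every [...] sequence that carries attr="value"."""
    pattern = re.compile(
        rf'\[[^\]]*\s{re.escape(attr)}="{re.escape(value)}"[^\]]*\]'
    )
    return pattern.findall(text)


my_string = (
    '<h1>Heading1</h1>some text [some_tag attrs][/some_tag]some text '
    '[some_tag image_id="12345"] some text [other image_id="12345" x="1"]'
)

for tag in find_tags(my_string, "image_id", "12345"):
    print(tag)
# [some_tag image_id="12345"]
# [other image_id="12345" x="1"]
```

re.escape is used so that attribute names or values containing regex metacharacters are matched literally.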

Remove/strip specific Html tag and replace using NotePad++

Here is my text:
<h3>#6</h2>
Is he eating a lemon?
</div>
I have a few of them in my articles the #number is always different also the text is always different.
I want to make this out of it:
<h3>#6 Is he eating a lemon?</h3>
I tried it via regex in notepad++ but I am still very new to this:
My Search:
<h3>.*?</h2>\r\n.*?\r\n\r\n</div>
Now it is always selecting the right part of the text.
What does my replace command need to look like to get the output shown above?
You should modify your original regex to capture the text you want in groups, like this:
<h3>(.*?)</h2>\r\n(.*?)\r\n\r\n</div>
    (   )         (   )
      ^             ^     These are your capture groups
You can then access these groups with the \1 and \2 tokens respectively.
So your replace pattern would look like:
<h3>\1 \2</h3>
Your search could be <h3>(.*)<\/h2>\r\n(.*)\r\n\r\n<\/div>
and the replace is <h3>$1 $2</h3>, where $1 and $2 represent the strings captured in the parentheses.
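The capture-and-replace step can be checked in a few lines of Python. This is only a sketch: it assumes Windows line endings and a blank line before </div>, exactly as the search pattern expects, and note that Python's re.sub writes group references as \1 and \2 rather than $1 and $2.

```python
import re

# Sample text with \r\n line endings and one blank line before </div>.
text = '<h3>#6</h2>\r\nIs he eating a lemon?\r\n\r\n</div>'

pattern = re.compile(r'<h3>(.*?)</h2>\r\n(.*?)\r\n\r\n</div>')

# \1 is the heading number, \2 is the sentence on the following line.
result = pattern.sub(r'<h3>\1 \2</h3>', text)

print(result)  # <h3>#6 Is he eating a lemon?</h3>
```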

regex to ignore duplicate matches

I'm using an application to search this website that I don't have control of right this moment and was wondering if there is a way to ignore duplicate matches using only regex.
Right now I wrote this to get matches for the image source in the pages source code
uses this to retrieve srcs
<span> <img id="imgProduct.*? src="/(.*?)" alt="
from this
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_3" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want3.jpg" alt="woohee"> </span>
the only problem is that the exact same code listed above is duplicated way lower in the source. Is there a way to ignore or delete the duplicates using only regex?
Your pattern's not very good; it's way too specific to your exact source code as it currently exists. As @Truth commented, if that changes, you'll break your pattern. I'd recommend something more like this:
<img[^>]*src=['"]([^'"]*)['"]
That will match the contents of any src attribute inside any <img> tag, no matter how much your source code changes.
To prevent duplicates with regex, you'll need lookahead, and this is likely to be very slow. I do not recommend using regex for this. This is just to show that you could, if you had to. The pattern you would need is something like this (I tested this using Notepad++'s regex search, which is based on PCRE and more robust than JavaScript's, but I'm reasonably sure that JavaScript's regex parser can handle this).
<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])
You'll then get a match for the last instance of every src.
The Breakdown
For illustration, here's how the pattern works:
<img[^>]*src=['"]([^'"]*)['"]
This makes sure that we are inside a <img> tag when src comes up, and then makes sure we match only what is inside the quotes (which can be either single or double quotes; since neither is a legal character in a filename anyway we don't have to worry about mixing quote types or escaped quotes).
(?!
(?:
.
|
\s
)*
<img[^>]*src=['"]\1['"]
)
The (?! starts a negative lookahead: we are requiring that the following pattern cannot be matched after this point.
Then (?:.|\s)* matches any character or any whitespace. This is because JavaScript's . will not match a newline, while \s will. Mostly, I was lazy and didn't want to write out a pattern for any possible line ending, so I just used \s. The *, of course, means we can have any number of these. That means that the following (still part of the negative lookahead) cannot be found anywhere in the rest of the file. The (?: instead of ( means that this parenthetical isn't going to be remembered for backreferences.
That bit is <img[^>]*src=['"]\1['"]. This is very similar to the initial pattern, but instead of capturing the src with ([^'"]*), we're referencing the previously-captured src with \1.
Thus the pattern is saying "match any src in an img that does not have any img with the same src anywhere in the rest of the file," which means you only get the last instance of each src and no duplicates.
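The behaviour described in this breakdown can be reproduced with a short Python sketch. It assumes Python's re module handles the pattern the same way, and it swaps (?:.|\s)* for .* under re.DOTALL, which matches the same text; the sample markup is a shortened, made-up version of the question's HTML.

```python
import re

# Match an img src only if the same src does not appear in a later img tag.
LAST_SRC = re.compile(
    r'''<img[^>]*src=['"]([^'"]*)['"](?!.*<img[^>]*src=['"]\1['"])''',
    re.DOTALL,
)

html = '''<span> <img id="imgProduct_1" src="a.jpg" alt="x"> </span>
<span> <img id="imgProduct_2" src="b.jpg" alt="x"> </span>
<span> <img id="imgProduct_1" src="a.jpg" alt="x"> </span>'''

last_instances = LAST_SRC.findall(html)
print(last_instances)  # ['b.jpg', 'a.jpg']
```

The first a.jpg is skipped because the negative lookahead finds the same src further down; only the last instance of each src survives.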
If you want to remove all instances of any img whose src appears more than once, I think you're out of luck, by the way. JavaScript did not support lookbehind when this was written (it arrived with ES2018), and the overwhelming majority of regex engines that do wouldn't allow such a complicated lookbehind anyway.
I wouldn't work too hard to make them unique, just do that in the PHP following the preg match with array_unique:
$pattern = '~<span> <img id="imgProduct.*? src="/(.*?)" alt="~is';
$match = preg_match_all($pattern, $html, $matches);
if ($match)
{
    $matches = array_unique($matches[1]);
}
If you are using JavaScript, then you'd need to use another function instead of array_unique, check PHPJS:
http://phpjs.org/functions/array_unique:346
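The match-first, deduplicate-afterwards approach looks much the same in other languages. As a hedged illustration in Python (the function name unique_srcs is invented, and the looser img pattern from the earlier answer is assumed), dict.fromkeys drops repeats while keeping the first-seen order:

```python
import re


def unique_srcs(html):
    """Collect img src values, dropping repeats but keeping first-seen order."""
    srcs = re.findall(r'''<img[^>]*src=['"]([^'"]*)['"]''', html)
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(srcs))


page = '''<img id="imgProduct_1" src="a.jpg" alt="x">
<img id="imgProduct_2" src="b.jpg" alt="x">
<img id="imgProduct_1" src="a.jpg" alt="x">'''

print(unique_srcs(page))  # ['a.jpg', 'b.jpg']
```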

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on how many opening brackets ( come before it in the pattern.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
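For comparison, the three modifiers have direct counterparts in other regex flavours. The sketch below is a Python rendering of the script's pattern, given purely as an illustration: re.S, re.I and re.X correspond to s, i and x, and the function name parse_post is made up.

```python
import re

POST_RE = re.compile(
    r'''
    <a\sname="
    ([^"]*)                    # anchor name
    ">
    <strong>
    ([^>]*)                    # post title
    </strong></a></dt>\s*<dd>
    (.*)                       # body of post
    </dd>
    ''',
    re.S | re.I | re.X,  # s: "." matches newlines, i: ignore case, x: verbose
)


def parse_post(chunk):
    """Return (anchor, title, body) for one chunk, or None if it doesn't match."""
    m = POST_RE.search(chunk)
    return m.groups() if m else None


chunk = ('<a name="2004-10-25"><strong>October 25th</strong></a></dt>\n'
         '<dd>\n<p>\n[Content]\n</p>\n</dd>\n')

print(parse_post(chunk))
```

Because of the x flag, the line breaks, indentation and # comments inside the pattern are ignored; only \s matches actual whitespace in the input.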
1. [^"]* means "any string of zero or more characters that doesn't contain a quotation mark". It sits between two literal quotes, so together they match a quoted string, the kind that follows <a name=
2. [^>]* is similar to the above: it means any string that doesn't contain >. Note that you probably mean [^<], to match up to, but not including, the opening < of the next tag.
3. Those are regex modifiers (Perl's here, not PHP-specific, although PHP's preg functions accept the same letters). i means case insensitive; s lets . match newlines; x allows whitespace and comments inside the pattern.
1. The code is an extended regex. It allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise it works like a normal regex.
2. Same.
3. The characters are regex modifiers.