Regex match spaces in html attribute

I have a bunch of html with lines like this:
<a href="#" rel="this is a test">
I need to replace the spaces in the rel-attribute with underscores, but I'm sort of a regex-noob!
I'm using Textmate.
Can anyone help me?
/Jakob

Find: (rel="[^\s"]*)\s([^"]*")
Replace: \1_\2
This replaces only the first whitespace character in each rel value, so click "Replace All" repeatedly until nothing more is replaced. It's not pretty, but it is easy to understand and works in most editors with regex search.
Change rel in the find pattern if you need to clean other attributes.
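
The "Replace All until nothing changes" loop described above can also be scripted. The following is a minimal Python sketch, assuming the HTML is already held in a string; the function name replace_rel_spaces is made up for illustration, and only the find and replace patterns come from the answer.

```python
import re

# Find pattern from the answer: one whitespace character inside rel="...".
FIRST_SPACE = re.compile(r'(rel="[^\s"]*)\s([^"]*")')


def replace_rel_spaces(text):
    """Repeat the replacement until no rel value contains whitespace."""
    while True:
        # subn returns the new string and the number of substitutions made.
        text, count = FIRST_SPACE.subn(r'\1_\2', text)
        if count == 0:
            return text


print(replace_rel_spaces('<a href="#" rel="this is a test">'))
```

Each pass fixes one space per rel attribute, so the loop runs once per space in the longest value, plus one final pass that finds nothing.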

I don't think you can do this properly in a single pass. Though I wonder why you need to do it in one go?
I can think of a really poor way of doing it, and although I don't recommend it, here goes:
You could sort of do it with the regex below. However, you would have to repeat the capture group, and the matching \N_ in the replacement, once for every space that might appear in the rel value. I bet that requirement rules this solution out.
Search:
{\<a *href\=\"[^\"]*" *rel\=\"}{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*
Replace:
\1\2_\3_\4_\5_\6_\7_\8_
This approach has two downsides: first, TextMate may limit the number of captures you can have; second, you'll end up with a large number of _'s at the end of each line.
With your current test, with the regex above, you would end up with:
<a href="#" rel="this_is_a_test">____
PS: This regex uses the syntax of the Visual Studio search/replace box. You'll probably need to change some characters to make it work in TextMate.
{} => capturing group
() => grouping
[^A] => anything but A
( |\")* => space or ", repeated zero or more times
\1 => the first capture

Suppose you have already read the value of rel:
var el = document.getElementById(id);
var value = el.getAttribute("rel");
// Replace every whitespace character with an underscore.
var rel = value.replace(/\s/g, "_");
el.setAttribute("rel", rel);

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
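
As a rough illustration of the parser route, here is a small Python sketch that uses the standard library's html.parser to find the rel values and then rewrites only those. The names RelCollector and underscore_rel_values are hypothetical, and the sketch assumes double-quoted attributes with no character entities inside the rel value.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Collects the value of every rel attribute seen in a start tag."""

    def __init__(self):
        super().__init__()
        self.rel_values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'rel' and value:
                self.rel_values.append(value)


def underscore_rel_values(html):
    collector = RelCollector()
    collector.feed(html)
    collector.close()
    for value in collector.rel_values:
        # Assumption: the attribute is written as rel="..." with double quotes.
        fixed = value.replace(' ', '_')
        html = html.replace('rel="%s"' % value, 'rel="%s"' % fixed)
    return html


print(underscore_rel_values('<a href="a b" rel="this is a test">x</a>'))
```

The parser decides what counts as a rel attribute, so spaces in other attributes such as href are left alone.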

I have to get on board the "you're using the wrong tool for the job" train here. You have TextMate, so that means OS X, which means you have sed, awk, ruby and perl, all of which can do this much better and more easily.
Learning to use one of these tools for text manipulation will pay off many times over in the future. Here is a URL that will ease you into sed: http://www.grymoire.com/Unix/Sed.html

If you're using TextMate, then you're on a Mac, and therefore have Python.
Try this:
#!/usr/bin/env python3
import re
# Match only the rel attribute, not the whole line.
p_rel = re.compile(r'rel="[^"]+"')
with open('test.html') as infile:
    for line in infile:
        for match in p_rel.findall(line):
            # Replace spaces inside the matched attribute only.
            line = line.replace(match, match.replace(' ', '_'))
        print(line, end='')
Sample output:
$ cat test.html
testing, testing, 1, 2, 3
<a href="#" rel="this is a test">
<unrelated line>
Stuff
<a href="#" rel="this is not a test">
<a href="#" rel="this is not a test" rel="this is invalid syntax (two rels)">
aoseuaoeua
$ ./test.py
testing, testing, 1, 2, 3
<a href="#" rel="this_is_a_test">
<unrelated line>
Stuff
<a href="#" rel="this_is_not_a_test">
<a href="#" rel="this_is_not_a_test" rel="this_is_invalid_syntax_(two_rels)">
aoseuaoeua

Related

RegEx matching for HTML and non-HTML URLs

I'm trying to get all URLs from this text, both absolute and relative, but I can't find the right regular expression. My expression matches more than I would like: it picks up HTML tags and other information that I do not want.
Attempt
(\w*.)(\\\/){1,}(.*)(?![^"])
Input
<div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a> <\/div>\n
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" width=\"307\" height=\"224\" \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n <\/div>\n <\/div>\n <div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>

As the comments point out, regex may not be the best tool for this problem. However, if you want to practice or really have to use it, you can match exactly what sits between the quotes "" where your URLs appear. Anchor the match on the left with src, href, or any other fixed component you have; list the alternatives in the first group (), separated by |.
RegEx 1 for HTML URLs
This RegEx may not be the right solution, but it might show how you could approach the problem:
(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
It has four groups, which makes it easier to update, and group $3 should hold the URL. Add any other characters your URLs may contain to the character class in the third group.
RegEx 2 for both HTML and non-HTML URLs
To capture the non-HTML URLs as well, you can extend it like this:
(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(")
where \x22 stands for ", and you can replace it with a literal ". I used \x22 only to make visible the quotes that delimit your target URLs.
The second RegEx also has four groups, and the target group is again $3. You can simplify or DRY it if you wish.
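
To see the second approach in action, here is a hedged Python sketch. It uses a slightly simplified form of the pattern (one capturing group instead of four), and the function name extract_urls is an assumption made for illustration. Because the character class allows backslashes, the captured text can end with the backslash that escapes the closing quote, so the sketch strips it and unescapes \/ afterwards.

```python
import re

# Left anchors (src=\, href=\, or one of the JSON keys), then the quoted URL.
URL_PATTERN = re.compile(
    r'(?:src=\\|href=\\|pageLink":|previousPage":|nextUrl":)'
    r'"([a-zA-Z0-9\\/.:_-]+)"'
)


def extract_urls(text):
    """Return the URLs found in text, with the JSON-style escaping removed."""
    urls = []
    for raw in URL_PATTERN.findall(text):
        # Turn \/ back into / and drop a trailing backslash, if any.
        urls.append(raw.replace('\\/', '/').rstrip('\\'))
    return urls


sample = r'<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>'
print(extract_urls(sample))
```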

Regular expression to add a word between html-tags (newbie)

I can't seem to create a regular expression that would work in this situation:
I have hundreds of lines that look like this:
<a title="Match" href="http://mywebsite.com/category/Match"></a>
I would need to have the title word inserted between the html tags, like so:
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
Here's my feeble attempt at it (using Notepad++):
Find:
title="([A-Za-z][A-Za-z0-9]*?)"([A-Za-z][A-Za-z0-9]*?)><
Replace:
title="\1"\2>\1<
As you can see, I really suck at regular expressions :D
Any help would be appreciated!
EDIT:
I should clarify that this is a one-time operation carried out in Notepad++ with the find and replace panel.
I should also clarify that the word "Match" is going to be different on each line.

This works in Notepad++ 6.3.2
Find what :
(title\=")([^"]+)("[^>]+>)(<)
Replace with :
\1\2\3\2\4
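
For anyone who prefers a script to the find and replace panel, the same pattern and replacement can be tried in Python. This is only a sketch; the function name fill_link_text is made up for illustration.

```python
import re

# Same four groups as the Notepad++ pattern: title=", the word, the rest of the tag, and <.
TITLE_PATTERN = re.compile(r'(title=")([^"]+)("[^>]+>)(<)')


def fill_link_text(line):
    """Insert the title word (group 2) between the opening and closing tags."""
    return TITLE_PATTERN.sub(r'\1\2\3\2\4', line)


print(fill_link_text('<a title="Match" href="http://mywebsite.com/category/Match"></a>'))
```

Because the word is captured rather than hard-coded, this works when it differs on each line.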

Use Capture Groups and Back-References
You can capture parts of your match using capture groups, and then replace them with back-references. The specific syntax may vary by language and implementation. Here are two examples.
Ruby Example
str = %q{<a title="Match" href="http://mywebsite.com/category/Match"></a>}
# Use '\1'-style back-references; "#{$1}" would be interpolated before the match runs.
str.sub(/(Match)(">)</, '\1\2\1<')
# => "<a title=\"Match\" href=\"http://mywebsite.com/category/Match\">Match</a>"
GNU sed Example
$ echo '<a title="Match" href="http://mywebsite.com/category/Match"></a>' |
sed -r 's/(Match)(">)</\1\2\1</'
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>

Cleaning up text: from ALLCAPS to <em>allcaps</em>

I need to clean up some text for html that used ALLCAPS instead of italics. So I'd like to take something that looks like this:
Here is an artificial EXAMPLE of a piece of TEXT that
uses allcaps as a way of EMPHASIZING words.
And convert it into this:
Here is an artificial <em>example</em> of a piece of <em>text</em> that
uses allcaps as a way of <em>emphasizing</em> words.
I'm tagging this with regex and notepad++, but (as you can probably tell) I don't know the first thing about how to use them.

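The Notepad++ regex engine has no such capability.
You can run a script that does the job, in Perl for example:
perl -pi.back -e "s#\b([A-Z]+)\b#'<em>'.lc($1).'</em>'#eg" yourfile.html
The original yourfile.html will be backed up as yourfile.html.back. (The double quotes suit the Windows command prompt; in a Unix shell, swap the outer double quotes and the inner single quotes so that the shell does not expand $1.)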

As far as I know, the regex engine of Notepad++ is not advanced enough to do this.
I would advise using a programming language for this. In PHP, for example, you could do this:
echo preg_replace_callback('/([A-Z]{2,})/', function ($m) { return "<em>" . strtolower($m[0]) . "</em>"; }, $s);
Be sure the regex excludes the legitimate capital first letter of an ordinary word; the {2,} quantifier above does so by requiring at least two capitals in a row.
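
The same callback idea can be sketched in Python. The name emphasize is hypothetical, and requiring at least two capitals between word boundaries is an assumption about what should count as ALLCAPS.

```python
import re

# Two or more consecutive capitals, bounded by word boundaries.
ALLCAPS = re.compile(r'\b[A-Z]{2,}\b')


def emphasize(text):
    """Wrap each all-caps word in <em> tags and lowercase it."""
    return ALLCAPS.sub(lambda m: '<em>' + m.group(0).lower() + '</em>', text)


print(emphasize('Here is an artificial EXAMPLE of a piece of TEXT that'))
```

Single capitals such as "I" and ordinary capitalized words such as "Here" are left untouched.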

AFAIK you cannot change case in the Find/Replace mechanism of Notepad++.
If all you need is the <em> tag insertion, you can do the following:
In the Find box type (\s+)([A-Z]+)(\s+), and in the Replace box type \1<em>\2</em>\3.
For the case change, you can try some of the TextFX tools, perhaps in the TextFX Characters sub-menu.

Here is how to do this using JavaScript's string replace method:
var capfix = function (x) {
    // Wrap each all-caps word (two or more letters) in <em> and lowercase it.
    var emout = function (word) {
        return "<em>" + word.toLowerCase() + "</em>";
    };
    return x.replace(/\b[A-Z]{2,}\b/g, emout);
};
To execute just call:
capfix(yourData);
This assumes that "yourData" is just a variable that represents your data as a string. If you wanted to use a web tool then "yourData" could represent the value from some input control, as in the following:
var yourData = document.getElementById("myinput").value;
alert(capfix(yourData));
To make that work just put an id attribute on your web tool input such as:
<textarea id="myinput"></textarea>

Inserting HTML inside quotes

I want a line break inside the title attribute of a link, but when I put one in, it looks correct in a browser yet returns 7 errors when I validate it.
This is the code.
<a href="images/Bosses/Lord Yarkan Large.jpg" class="hastipz" target="_blank" title="Lord Yarkan, a level 80 Unique from Silkroad Online -- Click for a Larger Image">
<img class="bosspic" src="images/Bosses/Lord Yarkan.jpg" style="float:right; position:relative;" alt="Lord Yarkon; Silkroad Unique"/>
</a>
I need this because the title attribute appears in a tooltip, and I want a line break inside that tooltip. How can I add a line break inside the quotes without causing validation errors?

I found this forum post:
There are two approaches:
1) Use the character entity for a carriage return, which is &#13;. Thus:
<...title="Exemplary&#13;website">
(For a full list of character entities, try Googling "HTML Character Codes".)
2) To do any additional styling of your "tooltips", Google "CSS tooltips".
1) is non-standard, though. It works in IE and Chrome, but not in Firefox. The new spec appears to recommend &#10; (newline) instead.

Do you need to validate for work?
If not, do not worry about the errors if it works as you want it.
Validation is not the goal. It is a tool to help build better Web sites, which is the goal. ;-)
If you must have it validate, you could try using a script to swap a specific keyword or set of characters for a <br /> at DOM ready. This is untested, though, and I am not sure it wouldn't throw errors, too.
EDIT
As requested, a little jQuery to switch out a word:
// Only visit links that actually have a title attribute.
$('a[title]').each(function(){
    var a = $(this).attr('title');
    var b = a.replace('lineBreak','\n');
    $(this).attr('title', b);
});
Example: http://jsfiddle.net/jasongennaro/qRQaq/1/
Notes:
I used "lineBreak" as the keyword because it is unlikely to appear in real text; "br" might.
I replaced it with the \n line break character.
You should also try the \n line break character on its own; it might work without needing to replace anything.

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
              link => $url,
              description => "Telsa's diary of life with a hacker:"
                  . " the current ramblings");
foreach (split ('<dt>', $page))
{
    if (/<a\sname="
         ([^"]*) # Anchor name
         ">
         <strong>
         ([^>]*) # Post title
         <\/strong><\/a><\/dt>\s*<dd>
         (.*) # Body of post
         <\/dd>/six)
    {
        $rss->add_item(title => $2,
                       link => "$url#$1",
                       description => encode_entities($3));
    }
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!

how does the following line work...
([^"]*) # Anchor name
It matches zero or more characters that aren't ", captured as $1, $2, and so on, depending on how many opening parentheses ( come before it.
how does the following line work...
([^>]*) # Post title
It matches zero or more characters that aren't >, captured in the same way.
what does the "six" mean in the following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case-insensitively
x = ignore whitespace in the regex.
x also makes it possible to put comments into the regex itself, so things like # Post title there are just comments.
See perldoc perlre for more and better information. The link is for Perl 5.10. If you don't have Perl 5.10, look at the perlre document for your version of Perl instead.
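
To experiment with the three modifiers outside Perl, here is a rough Python equivalent of the script's pattern. It is a sketch only: Python spells s, i and x as re.DOTALL, re.IGNORECASE and re.VERBOSE, and the sample string is made up from the HTML chunk in the question.

```python
import re

ENTRY = re.compile(r'''
    <a\sname="
    ([^"]*)      # anchor name
    ">
    <strong>
    ([^>]*)      # post title
    </strong></a></dt>\s*<dd>
    (.*)         # body of post
    </dd>
''', re.DOTALL | re.IGNORECASE | re.VERBOSE)

sample = ('<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>\n'
          '<dd>\n<p>\n[Content]\n</p>\n</dd>\n')

m = ENTRY.search(sample)
print(m.group(1))          # anchor name
print(m.group(2))          # post title
print(m.group(3).strip())  # body, which spans several lines thanks to DOTALL
```

Without re.DOTALL the (.*) group could not cross the newlines in the body, and without re.VERBOSE the line breaks and comments would be treated as part of the pattern.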

[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". It is surrounded by quotes, forming a quoted string, the kind that follows <a name=
[^>]* is similar to the above: it means any string that doesn't contain >. Note that you probably mean [^<] here, to match up to, but not including, the opening < of the next tag.
That's a collection of regex flags (Perl modifiers, not PHP-specific ones). i means case-insensitive; s lets . match newlines, and x permits whitespace and comments inside the pattern.

The code is an extended regex, which allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise it works like a normal regex.
The same goes for the second pattern.
The characters are regex modifiers.

We Keep Coding
