Find spaces in anchor links - html

We've got a large amount of static that HTML has links like e.g.
Link
However some of them contain spaces in the anchor e.g.
Link
Any ideas on what kind of regular expression I'd need to use to find the Spaces after the # and replace them with a - or _
Update: Just need to find them using TextMate, hence no need for a HTML parsing lib.

This regex should do it:
#[a-zA-Z]+\s+[a-zA-Z\s]+
Three Caveats.
First, if you are afraid that the page text itself (and not just the links) might contain information like "#hashtag more words", then you could make the regex more restrictive, like this:
#[a-zA-Z]+\s+[a-zA-Z\s]+\">
Second, if you have hash tags that contain characters beyond A-Z, then just add them in between the second set of brackets. So, if you have '-' as well, you would modify to:
#[a-zA-Z]+\s+[a-zA-Z-\s]+\">
Finally, this assumes that all the links you are trying to match start with a letter/word and are followed by a space, so, in the current form, it would not match "Anchor-tags-galore", but would match "Anchor tags galore."

Have you considered using an HTML parsing library like BeautifulSoup? It would make finding all the hrefs much easier!

Here, this regex matches the hash and all the words and spaces in between:
#(\w+\s)+\w+
http://dl.getdropbox.com/u/5912/Jing/2009-08-12_1651.png
When you have some time, you should download "The Regex Coach", which is an awesome tool to develop your own regexes. You get instant feedback and you learn very fast. Plus it comes at no cost!
Visit the homepage

Related

Using Extended Find and Replace + Reg Expressions to find and remove an anchor tag in HTML

I'm trying to use regular expressions to find and remove all anchor tags identical to these -
title name
title name part II
Where ONLY the filename and titles change, and I need to leave the title (unlinked) behind.
Because of my word-processing background I naively tried the following wildcards in my extended find and replace with regular expressions checked:
*
it does not work of course, not even to remove the entire link and text.
After much searching and reading I'm still just guessing at how to do this. Since no example I've found does exactly what I need using extended find and replace. All of this is over my head.
I have searched "How To Use regular expressions in Search and Replace" with HomeSite, Dreamweaver, topsite and other similar software to what I'm using to edit my HTML docs. Without success. I've read several tutorials on using RegExp and I'm learning, but still cannot seem to do what I need. I have read how to use RegExp in php, perl, c++ but cannot transition this over to what I need.
I'm willing to use other text editing software to accomplish this as I need to remove about 4,000 of these wma file links, while leaving the titles and other tags untouched.
I have searched similar questions here on stackoverflow. And read up on using regular expressions in general, but I cannot follow what is explained enough to adapt it to what I need. This is such a big subject.
This is what I have so far:
<a href="mms:\/\/media\.domain\.com\/CME\/ \.wma"> <\/a>
The parts where I've left spaces are what's giving me trouble.
Thanks
With some help on a forum I was able to work out my answer, here it is
Find:
([^<]+)
Replace:
\1
Note the parenthesis around the [^<]+ this creates a sub-expression that can be referenced by the \1 I do not know how this all works exactly, but I found it in an article.
This find and replace finds my anchor tag and title and replaces it with just the title.
I have to give thanks to http://forums.devshed.com for the pieces I was missing.

which is better to add two names (-) or(_)

hi when i write css or html i found that i want add two name like this
web-development
web_development
which one is better according SEO or write style name, file name or image name.
The first one is better. Also see this post by Google employee Matt Cutts: http://www.mattcutts.com/blog/dashes-vs-underscores/
use the dash. Google engines don't really parse underscores. This is maybe for programmers sanity, so that when they search for query_function, they get results they are looking for?
If you have a url like "http://example.com/web-site", google will return results for 'web', 'site' and '"web site"'. This is not the case for underscores: web_site will only return results for web_site.
ps.
I also think that dashes are better than underscores for usability purposes: a dash is a single button on the keyboard, while an underscore requires two buttons to be pressed. This has nothing to do with the technical side of SEO, but everything to do with usability, which is more important than SEO imo.
for css i don't think there is some issues with naming methodology, but for naming HTML pages - is preferred as search engines take - as space, even though good page name is not enough for good s.e.o. you need to have proper meta tag and keywords.
And make sure all your images have proper title tag, this is real essential.
Isn't it common practice to use the - to connect two words, and the _ to replace a space in situations where you can't use a space/+ sign, like CSS classNames?
first one is better in terms of SEO. Because the priority of hiphen is greater than under score
Please list two (2) words in the English language that use underscores ("_") within them.
Now list fifty (50) words that use dashes/hyphens ("-").
My opinion is that the hyphens would be a better solution for SEO.
IMO When it comes down to SEO is that everything makes a difference !
You are dealing with two different problems: URLs and CSS.
For URLs, hyphens would be the better choice because of SEO.
However, depending on your editing program, underscores might work better for mutli-word class names. In TextMate for instance, I can hit Esc to finish (auto-complete) a class I previously entered. It stops completing when it encounters a hyphen, but will fill in the whole class name when you use an underscore. If this is not the case for your editor, then it is really up to your preference.

Find and Replace and a WYSIWYG Editor

My problem is as follows:
I have a column: ProductName. Now, the text entered here is entered from tinyMCE so it has all kinds of tags. The user wants to be able to do a Find-And-Replace on all products, and it has to support coloring.
For example - let's say this is a portion of a ProductName:
other text.. <strong>text text <font color="#ff6600">colortext®</font></strong> ..other text
Now, the user wants to replace the :
<font color="#ff6600">colortext®</font>
The original name has the <strong> tag in it so it appears bold. So the users makes it bold - now the text he is searching for is:
<strong><font color="#ff6600">colortext®</font></strong>
Obviously I'm not going to find it. Plus there's the matter of spaces: in one place it has a space in another it doesn't.
Is there a way to overcome this?
Strip the HTML tags from the search text and do a plain text search first. Then, part by part (i.e., text node by text node), take the element path of the search text's parts, and compare these with their counterparts in the found text. If the paths for all parts match, you're done.
Edit: By path, I meant something similar to XPath, or the path notion of the TinyMCE editor. Example: plain text part of the search text is "colortext®". The path of this text node in the search text is <strong>/<font color="#ff6600">. Search for the same plain text in the text body (trivial), and take it's path, which is also <strong>/<font color="#ff6600">. (Compare this with the path of "other text..", which is /, and of "text text", which is <strong>.) The two paths are the same, so this is a real match. If you have a DOM tree representation, determining the path shouldn't be difficult.
You're asking for several related, but discrete, abilities:
Search and Replace content
Search and Replace formatting
Search and Replace similar (ie, ignore trivial differences in whitespace)
You should take this in steps - otherwise it becomes overwhelming and a single search algorithm won't be able to do all three without intense effort and resulting in difficult to maintain code.
First, look at the similar problem. Make a search that ignores spaces and case. You might want to get into Lucene or another search engine technology if you also need to deal with "bowl" vs "bowls" and "intelligent" vs "smart" - though I expect this is beyond your current needs.
Once you have that working, it becomes one layer in your stack of searches.
Second, look a formatting search. This is typically done using tokens or tags - which you already have in the form of HTML. However, you have to be able to deal with things out of sequence - so <b><i>text</i></b> needs to be caught in a search for <i><b>text</b></i> and the malformed representation where tags aren't nested properly, such as <b><i>text</b></i>.
One method of this is to pre-parse the string and apply the formatting styles to each character. So you'd have a t that's bold and italic, an e that's bold and italic, etc. to make this easier and faster use a hash to represent the style combination - Read the first character, figure out what style it is (keep track of this turning styles on and off and you find tags) and if it already exists in the hash, assign that hash number to the letter. If it doesn't, get the new hash number and assign that.
Now you can compare the letter and its style hash against your search and get format and content matches. Stack that on top of your similar match and you have what you need.
-Adam
If it's valid XML, an XSLT would be trivial for this kind of exercise.
Use an identity template, and then add an XPath to find the specific node you want:
<xsl:template match="//strong/font">
<xsl:copy>
<!-- Insert the replacement text here -->
</xsl:copy>
</xsl:template>
When working with XML, this would be a maintainable, extensible solution.
Not sure to understand everything you said but the use of regular expression seems a good way to overcome the problem you're talking about.

Regex to match attributes in HTML?

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I am interested in getting the text Connectivity Framework to be extraced and written to a separate file.
Like this, there are many such tags each having a different text after the title='some text here which i need to extract '
I want to extract all such instances of the text from the html source/txt file and write to a separate txt file. The text can contain lower case, upper case letters and number only. The length of each text string(in characters) will vary.
I am using PowerGrep for windows. Powergrep allows me to search a text file with regular expression inout.
I tried using the search as
title='[a-zA-Z0-9]
It shows the correct matches, but it matches only first character of the string and writes only the first character of the text string matched to the second txt file, not all string.
I want all string to be matched and written to the second file.
What is the correct regular expression or way to do what i want to do, using powergrep?
-AD.
I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row until it hits a character that is not in that character class (in your case, the space character since you didn't include that in your brackets).
Regexes though can be pretty complex to describe the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle so a simple "everything but apostrophes" character set won't cut it) though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.
I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
Here's the regex you need
title='([a-zA-Z0-9]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.
Try this instead:
title=\'[a-zA-Z0-9]*\'

Regex to read out HTML tags

I'm looking for a regex that matches all used HTML tags in a text consisting of several lines. It should read out "b", "p" and "script" in the following lines:
<b>
<p class="normalText">
<script type="text/javascript">
Is there such thing? The start I have is that it should start with a "<" and read until it hits a space or a ">", but at the same time, it should not include the starting "<" since I just want to match the letter/word itself. Thoughts?
There are many similar questions on SO:
Filter out HTML tags and resolve entities in python
Regex to match all HTML tags except <p> and </p>
Strip all HTML tags except links
etc. The general agreement is that it's best not to use regular expressions to parse HTML instead of doing it properly by applying a DOM parser and traversing the DOM tree.
It's virtually impossible to regex HTML once you start considering all the special cases and malformed HTML that browsers sometimes happilly parse anyway. That said however I thought it might be fun to get the names without using capture groups and thus I present too you with the following sollution:
(?<=<)\w+(?=[^<]*?>)
For the record I hold little faith in it being at all useful in any but the most trivial of cases.
I don't know what system you are using, but it can be done to a certain extent. Look at this online flex-based application. Check out the Published > XML regex examples. You will get an idea.