Regex to match attributes in HTML? - html

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I want the text Connectivity Framework to be extracted and written to a separate file.
Like this, there are many such tags each having a different text after the title='some text here which i need to extract '
I want to extract all such instances of the text from the html source/txt file and write to a separate txt file. The text can contain lower case, upper case letters and number only. The length of each text string(in characters) will vary.
I am using PowerGrep for Windows. PowerGrep allows me to search a text file with a regular expression as input.
I tried using the search as
title='[a-zA-Z0-9]
It finds the correct matches, but it matches only the first character of each string, so only that first character is written to the second txt file instead of the whole string.
I want the whole string to be matched and written to the second file.
What is the correct regular expression, or the right way to do what I want, using PowerGrep?
-AD.

I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
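As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser (Python is my choice here; the question itself is about PowerGrep). The names TitleCollector and extract_titles are made up for this example. The parser deals with single quotes, double quotes, missing quotes and entity unescaping, which are exactly the difficulties listed above.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed the quotes and unescaped entities in the value.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def extract_titles(html_text):
    parser = TitleCollector()
    parser.feed(html_text)
    parser.close()
    return parser.titles


sample = """<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
<span title="Double &amp; quoted">x</span>
<p title=Unquoted>y</p></div>"""

print(extract_titles(sample))
# ['Connectivity Framework', 'Double & quoted', 'Unquoted']
```

Writing the result to a second file is then just a matter of joining the list with newlines.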

I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
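To show what the capture group buys you, here is a small Python sketch (the function name titles_from_text is an assumption for illustration, and this is not PowerGrep syntax). It only handles single-quoted title attributes, like the second pattern above.

```python
import re

# Same pattern as above: capture everything between the single quotes.
TITLE_RE = re.compile(r"title='([^']*)'")


def titles_from_text(text):
    # With one capture group, findall returns just the captured text.
    return TITLE_RE.findall(text)


sample = "<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>"
print(titles_from_text(sample))  # ['Connectivity Framework']
```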

The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can until it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).
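A quick way to see the difference, sketched in Python (any regex engine behaves the same way for these three patterns):

```python
import re

text = "title='Connectivity Framework'"

# No repetition: the class matches exactly one character.
once = re.search(r"title='[a-zA-Z0-9]", text).group()

# With *: matches up to the first character outside the class (the space).
repeated = re.search(r"title='[a-zA-Z0-9]*", text).group()

# With the space added to the class: matches up to the closing quote.
with_space = re.search(r"title='[a-zA-Z0-9 ]*", text).group()

print(once)        # title='C
print(repeated)    # title='Connectivity
print(with_space)  # title='Connectivity Framework
```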
Regexes though can be pretty complex to describe the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle so a simple "everything but apostrophes" character set won't cut it) though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.

I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
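For example, removing the quotes afterwards could look like this in Python. The helper name title_values is made up, and the sketch assumes at most one title attribute per tag.

```python
import re

# The pattern from above; group 1 is the attribute value, quotes included.
TITLE_ATTR = re.compile(
    r"""<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)"""
)


def title_values(html_text):
    values = []
    for raw in TITLE_ATTR.findall(html_text):
        # Strip one pair of matching quotes if the value is quoted.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        values.append(raw)
    return values


print(title_values("<div id='x' title='Connectivity Framework'>"))
# ['Connectivity Framework']
```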

Here's the regex you need
title='([a-zA-Z0-9]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.

Try this instead:
title=\'[a-zA-Z0-9]*\'

Related

nested html regex crashes vscode

I am using the following regex in vscode for a search and replace. It is meant to match an outer tag plus 3 nested tags.
<tag>(((.|\n)*?)(</tag>)){4}
If I add any character to the end of this regex, vscode crashes. In my case I was going to specify a tag after that match.
I'm pretty new to regex so I'm trying to keep it simple.
I know this is a common problem that has something to do with backtracking, and I want to know how to simplify this.
NEVER use (.|\n)*?. It is an unfortunate, widely known pattern that causes so much backtracking that it often leads to situations like this: when the text is long and specific enough, the result is catastrophic backtracking.
Note that even [\w\W]*? (or [\s\S\r]*?, see Multi-line regular expressions in Visual Studio Code) here might already suffice. Although it also involves quite a lot of backtracking, it will be much more efficient.
What can usually be used is an unrolled pattern, like
<tag>(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4}
Instead of (.|\n)*?, a series of patterns are used so that each could only match distinct positions in a string.
Details
<tag> - a literal string
(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4} - four repetitions of
[^<\r]* - 0 or more chars other than <, including line break chars (in VS Code regex, the \r makes every character class in the pattern that can match newlines actually match them, so \r is not needed again in the next character class)
(?:<(?!/tag>)[^<]*)* - 0 or more repetitions of a < not followed with /tag> and then 0 or more chars other than <.
</tag> - a literal </tag> string.
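Outside VS Code the same unrolled idea can be tried directly. Below is a small Python sketch; note that in Python a negated class such as [^<] already matches newlines, so the \r trick is dropped here (that adjustment is mine, not part of the answer). The sample strings are invented.

```python
import re

# The pattern from the question, kept only for comparison on a short input.
NAIVE = re.compile(r"<tag>(((.|\n)*?)(</tag>)){4}")

# Unrolled version: at each position only one branch can match.
UNROLLED = re.compile(r"<tag>(?:[^<]*(?:<(?!/tag>)[^<]*)*</tag>){4}")

nested = "<tag>outer\n<tag>a</tag>\n<tag>b</tag>\n<tag>c</tag>\n</tag>"
too_few = "<tag>outer\n<tag>a</tag>\n<tag>b</tag>\n</tag>"

# Both patterns agree on the short, well-formed input.
print(UNROLLED.search(nested).group() == nested)  # True
print(NAIVE.search(nested).group() == nested)     # True

# Only three closing tags are available, so there is no match.
print(UNROLLED.search(too_few))                   # None
```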
Having said that, you might also be interested in the Emmet: Balance outward command:
A well-known tag balancing: searches for tag or tag's content bounds
from current caret position and selects it. It will expand (outward
balancing) or shrink (inward balancing) selection when called multiple
times. Not every editor supports both inward and outward balancing due
of some implementation issues, most editors have outward balancing
only.
Emmet’s tag balancing is quite unique. Unlike other implementation,
this one will search tag bounds from caret’s position, not the start
of the document. It means you can use tag balancer even in non-HTML
documents.

Escape an apostrophe from a data binding's XML

I have a string from xml with an apostrophe that should be escaped to &apos; and it is not.
<city place="park's place"/>
In html I am grabbing the value.
<span datafld="place"></span>
I need the value in place to be "park&apos;s place" and not park's place. Currently it shows "park's place".
I have spent a good amount of time trying to find an answer and can't seem to find one.
This code example is badly hacked together since I am not allowed to show any original code.
Thanks.
Edit: This is on a xhtml page using javascript.
In the XML "data model" all values are unescaped. So whether your attribute was specified as:
place="park's place"
or:
place="park&apos;s place"
or:
place="park&#39;s place"
when you use an XML parser (or the DOM) you'll get "park's place". (Things like "innerHTML" are an exception to this general rule.)
If you need to compare that to some other string that has a different level of escaping then you either need to escape the string you get from the DOM, or you need to unescape the other string. It's a lot like if you were going to compare a measurement in meters to one in feet: you need to convert to a common unit-of-measurement/level-of-escaping.
I'd go with the unescaping approach if you can. If that isn't possible then you'll need to make sure that you escape in a consistent way everywhere, which can be difficult. Note that I've shown you three different ways of legally escaping that particular string -- and there are many many more.
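To make the "all values are unescaped" point concrete, here is a sketch in Python rather than the asker's JavaScript. It parses three spellings of the same attribute (literal apostrophe, &apos;, and the numeric reference &#39;, which I picked as a third legal spelling) and then re-escapes the parsed value. The helper name escape_apostrophes is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

variants = [
    '<city place="park\'s place"/>',
    '<city place="park&apos;s place"/>',
    '<city place="park&#39;s place"/>',
]

# The parser hands back the same unescaped value for all three spellings.
values = [ET.fromstring(doc).get("place") for doc in variants]
print(values)  # ["park's place", "park's place", "park's place"]


def escape_apostrophes(value):
    # escape() handles &, < and > itself; the dict adds the apostrophe.
    return escape(value, {"'": "&apos;"})


print(escape_apostrophes(values[0]))  # park&apos;s place
```

If the escaped form really is needed for a comparison, escaping the DOM value in one place like this keeps both sides at the same level of escaping.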
In case it helps someone looking for this: you could use the acute accent character (´) in place of the apostrophe.

Why so much HTML input sanitization necessary?

I have implemented a search engine in C for my html website. My entire web is programmed in C.
I understand that html input sanitization is necessary because an attacker can input these 2 html snippets into my search page to trick my search page into downloading and displaying foreign images/scripts (XSS):
<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ? Wouldn't that render both scripts useless since they would not be considered html ? I've seen html filtering that goes way beyond this where they filter absolutely all the JavaScript commands and html markup !
Input sanitisation is not inherently ‘necessary’.
It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that eg. a phone number contains digits).
But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.
HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to &lt;, & to &amp;, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
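A minimal sketch of escaping at the output stage, written in Python for brevity (the asker's site is in C, where the same replacements would be done by hand). The function names render_search_heading and render_search_box are invented for the example.

```python
import html


def render_search_heading(query):
    # The query is stored and searched as plain text; it is escaped only
    # here, at the point where it is templated into HTML.
    return "<h1>Results for " + html.escape(query) + "</h1>"


def render_search_box(query):
    # Inside a double-quoted attribute the quote character must be escaped
    # too; html.escape does that when quote=True (the default).
    return '<input name="q" value="' + html.escape(query, quote=True) + '">'


print(render_search_heading("<script>alert(1)</script>"))
# <h1>Results for &lt;script&gt;alert(1)&lt;/script&gt;</h1>
print(render_search_box('a" onmouseover="x'))
# <input name="q" value="a&quot; onmouseover=&quot;x">
```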
If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.
And there are other forms of escaping. If you try to create an SQL query with the user value in, you need to do SQL string literal escaping at that point, which is completely different to HTML escaping. If you want to put a submitted value in a JavaScript string literal you would have to do JSON-style escaping, which is again completely different. If you wanted to put a value in a URL query string parameter you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
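Here is a small Python sketch of "one plain-text value, a different escaping per output context". The function names are made up. For SQL it uses parameter binding (with sqlite3) in place of hand-written string literal escaping, which is a swapped-in technique rather than what the paragraph above literally describes. The JSON line only covers the string-literal syntax; embedding it inside an inline script block needs further care.

```python
import html
import json
import sqlite3
from urllib.parse import quote

user_value = 'Tom & "Jerry" <3'


def for_html(value):
    return html.escape(value)


def for_url_query(value):
    # safe="" so that characters like / and & are percent-encoded too.
    return "/search?q=" + quote(value, safe="")


def for_js_string(value):
    return "var q = " + json.dumps(value) + ";"


def roundtrip_through_sql(value):
    # The driver passes the value separately from the SQL text.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT ?", (value,)).fetchone()[0]
    finally:
        conn.close()


print(for_html(user_value))       # Tom &amp; &quot;Jerry&quot; &lt;3
print(for_url_query(user_value))  # /search?q=Tom%20%26%20%22Jerry%22%20%3C3
print(for_js_string(user_value))  # var q = "Tom & \"Jerry\" <3";
```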
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ?
Well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!
< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.
My entire web is programmed in C.
I'm so sorry.
Encoding brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain-text.

Find spaces in anchor links

We've got a large amount of static HTML that has links like e.g.
Link
However some of them contain spaces in the anchor e.g.
Link
Any ideas on what kind of regular expression I'd need to use to find the Spaces after the # and replace them with a - or _
Update: Just need to find them using TextMate, hence no need for a HTML parsing lib.
This regex should do it:
#[a-zA-Z]+\s+[a-zA-Z\s]+
Three Caveats.
First, if you are afraid that the page text itself (and not just the links) might contain information like "#hashtag more words", then you could make the regex more restrictive, like this:
#[a-zA-Z]+\s+[a-zA-Z\s]+\">
Second, if you have hash tags that contain characters beyond A-Z, then just add them in between the second set of brackets. So, if you have '-' as well, you would modify to:
#[a-zA-Z]+\s+[a-zA-Z-\s]+\">
Finally, this assumes that all the links you are trying to match start with a letter/word and are followed by a space, so, in the current form, it would not match "Anchor-tags-galore", but would match "Anchor tags galore."
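If a script is an option rather than TextMate's find dialog, the replacement step could be sketched like this in Python. This is a different approach from the pattern above: it assumes the links look like href="...#some anchor" with double quotes (the original examples are not visible in the question), and the name dash_anchor_spaces is made up.

```python
import re

# group 1: href=" up to and including the #, group 2: the anchor text,
# group 3: the closing quote. Assumes double-quoted href attributes.
HREF_WITH_ANCHOR = re.compile(r'(href="[^"#]*#)([^"]*)(")')


def dash_anchor_spaces(html_text, replacement="-"):
    def fix(match):
        prefix, anchor, closing_quote = match.groups()
        # Each run of whitespace inside the anchor becomes one replacement.
        return prefix + re.sub(r"\s+", replacement, anchor) + closing_quote

    return HREF_WITH_ANCHOR.sub(fix, html_text)


page = '<a href="#Anchor tags galore">Link</a> and #hashtag more words'
print(dash_anchor_spaces(page))
# <a href="#Anchor-tags-galore">Link</a> and #hashtag more words
```

Because the match is tied to href="...", page text such as "#hashtag more words" is left alone, which addresses the first caveat.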
Have you considered using an HTML parsing library like BeautifulSoup? It would make finding all the hrefs much easier!
Here, this regex matches the hash and all the words and spaces in between:
#(\w+\s)+\w+
http://dl.getdropbox.com/u/5912/Jing/2009-08-12_1651.png
When you have some time, you should download "The Regex Coach", which is an awesome tool to develop your own regexes. You get instant feedback and you learn very fast. Plus it comes at no cost!
Visit the homepage

HTML Escaping - Reg expressions?

I'd like to HTML escape a specific phrase automatically and logically that is currently a statement with words highlighted with quotation marks. Within the statement, quotation or inch marks could also be used to describe a distance.
The phrase could be:
Paul said "It missed us by about a foot". In fact it was only about 9".
To escape this phrase, it should really be
<pre>Paul said &ldquo;It missed us by about a foot&rdquo;.
In fact it was only about 9&Prime;.</pre>
Which gives
<pre>Paul said “It missed us by about a foot”.
In fact it was only about 9″.</pre>
I can't think of a sample phrase to add in a &quot; escape as well, but that could be there!
I'm looking for some help on how to identify which of the escape values to replace " characters with at runtime. The phrase was just an example and it could be anything but should be correctly formed i.e. an opening and closing quote would be present if we are to correctly escape the text.
Would I use a regular expression to find a quoted phrase in the text, i.e. two " characters before a full stop, and then replace the first with
&ldquo;
and the second with
&rdquo;
If I found a single ", I would replace it with
&quot;
unless it came after a number, in which case I would replace it with
&Prime;
How would I deal with multiple quotes within a sentence?
"It just missed" Paul said "by a foot".
This would really stump me.....
<pre>"It just missed" Paul said "by 9" almost".</pre>
The above should read when escaped correctly. (I'm showing the actual characters this time)
“It just missed” Paul said “by 9″ almost”.
Obviously an edge case but I wondered if it's possible to escape this at runtime without an understanding of the content? If not help on the more obvious phrases would be appreciated.
I would do this in two passes:
The first pass searches for any "s which are immediately preceded by numbers and does that replacement:
s/([0-9])"/\1″/g
Depending on the text you're dealing with, you may want/need to extend this regex to also recognize numbers that are spelled out as words; I've only checked for digits for the sake of simplicity.
With all of those taken care of, a second pass can then easily convert pairs of "s as you've described:
s/"([^"]*)"/“\1”/g
Note the use of [^"]* rather than .* - we want to find two sets of double-quotes with any number of non-double-quote characters between them. By adding that restriction, there won't be any problems handling strings with multiple quoted sections. (This could also be accomplished using the non-greedy .*?, but a negated character class more clearly states your intent and, in most regex implementations, is more efficient.)
A stray, mismatched " somewhere in the string, or an inch marker which is missed by the first pass, can still cause problems, of course, but there's no way to avoid that possibility without implementing understanding of the content.
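The two passes translate directly to Python's re.sub. This sketch emits the actual characters, as the substitutions above do; swap in &Prime;, &ldquo; and &rdquo; if you need the entity names. The function name curl_quotes is invented.

```python
import re


def curl_quotes(text):
    # Pass 1: a " directly after a digit is treated as an inch mark.
    text = re.sub(r'([0-9])"', r"\1″", text)
    # Pass 2: each remaining pair of " becomes an opening/closing pair.
    text = re.sub(r'"([^"]*)"', r"“\1”", text)
    return text


print(curl_quotes('Paul said "It missed us by about a foot". In fact it was only about 9".'))
# Paul said “It missed us by about a foot”. In fact it was only about 9″.
print(curl_quotes('"It just missed" Paul said "by 9" almost".'))
# “It just missed” Paul said “by 9″ almost”.
```

As noted above, a quotation that happens to end in a digit (for example "I am 9") will be misread by the first pass.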
what you've described is basically a hidden markov model,
http://en.wikipedia.org/wiki/Hidden_Markov_model
you have a set of input symbols (your original text and ambiguous punctuation), and a set of output symbols (original text and more fine-grained punctuation) but no good way of really observing the connection between the two in a programmatic way. you could write some rules to cover some of the edge cases, but that will basically never work for the multiple quotes situation. in this case you can't really use a regex for the same reason, but with an hmm and a bunch of training text you could probably make some pretty good guesses.
sorry that's probably not very helpful if you're trying to get something ready for deployment, but the input has greater ambiguity than the output, so your only option is to consider the context, and that basically means either a very lengthy set of rules, or some kind of machine learning approach.
interesting question though - it would be neat to see what kind of performance you could get. maybe someone's already written a paper on it?
I wondered if it's possible to escape
this at runtime without an
understanding of the content?
Considering that you're adding semantic meaning to the punctuation which is currently encoded in the other text... no, not really.
Regular expressions would be the easiest tool for at least part of it. I'd suggest looking for /\d+"/ for the inch number cases. But for quotes delimiters, after you'd looked for any other special cases or phrases, it may be easier to use an algorithm for matching pairs, like with parentheses and brackets: tokenize and count. Then test on real-world input and refine.
But I really have to ask: why?
I am not sure if it is possible at all to do that without understanding the meaning of the sentence. I tend to doubt it.
My first attempt would be the following.
go from left to right through the string
alternate replacing double primes with left and right double quotes, but replace with double primes if there is a number to the left
if the quotation marks are unbalanced at the end of the string go back until you find a number with double primes and change the double primes into left or right double quotes depending on the preceding double quotes.
I am quite sure that you can easily fail this strategy. But it is still the easy case - hard work starts when you have to deal with nested quotation marks.
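One possible reading of those three steps, as a Python sketch. The name smart_quotes and the "re-scan with one more inch mark reinterpreted as a quote" loop are my own interpretation of step three, not something spelled out above. It does not attempt nested quotation marks.

```python
def smart_quotes(text):
    # Indexes of digit-preceded " that must be treated as quotation marks.
    forced = set()
    while True:
        out, is_open, primes = [], False, []
        for i, ch in enumerate(text):
            if ch != '"':
                out.append(ch)
            elif i > 0 and text[i - 1].isdigit() and i not in forced:
                primes.append(i)          # number to the left: inch mark
                out.append("″")
            else:
                out.append("”" if is_open else "“")  # alternate open/close
                is_open = not is_open
        if not is_open or not primes:
            return "".join(out)
        # Unbalanced: go back to the last inch mark, treat it as a quote,
        # and scan again so later quotes flip accordingly.
        forced.add(primes[-1])


print(smart_quotes('"It just missed" Paul said "by 9" almost".'))
# “It just missed” Paul said “by 9″ almost”.
print(smart_quotes('He said "I am 9"'))
# He said “I am 9”
```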
I know this is off the wall, but have you considered Mechanical Turk? This is the sort of problem humans excel at, and computers, currently, are terrible at. Choosing the correct punctuation requires understanding of the meaning of the sentence, so a regex is bound to fail for edge cases.
You could try something like this. First replace the quotations with this regular expression:
"((?:[^"\d]+|\d"?)*)"
And then the inch sign:
(\d+)"
Here’s an example in JavaScript:
'"It just missed" Paul said "by 9" almost"'.replace(/"((?:[^"\d]*|\d["']?)+)"/g, "“$1”").replace(/(\d+)"/g, "$1″");