Escape an apostrophe from a data binding's XML - html

I have a string from XML with an apostrophe that should be escaped to &apos; and it is not.
<city place="park's place"/>
In html I am grabbing the value.
<span datafld="place"></span>
I need the value in place to be "park&apos;s place" and not park's place. Currently it shows "park's place".
I have spent a good amount of time trying to find an answer and can't seem to find one.
This code example is badly hacked together since I am not allowed to show any original code.
Thanks.
Edit: This is on an XHTML page using JavaScript.

In the XML "data model" all values are unescaped. So whether your attribute was specified as:
place="park's place"
or:
place="park&apos;s place"
or:
place="park&#39;s place"
when you use an XML parser (or the DOM) you'll get "park's place". (Things like "innerHTML" are an exception to this general rule.)
If you need to compare that to some other string that has a different level of escaping then you either need to escape the string you get from the DOM, or you need to unescape the other string. It's a lot like if you were going to compare a measurement in meters to one in feet: you need to convert to a common unit-of-measurement/level-of-escaping.
I'd go with the unescaping approach if you can. If that isn't possible then you'll need to make sure that you escape in a consistent way everywhere, which can be difficult. Note that I've shown you three different ways of legally escaping that particular string -- and there are many many more.
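As a minimal sketch of the point above, here is a Python example (the question is about XHTML and JavaScript, so the language and the sample documents are my own assumptions). It parses three spellings of the attribute and shows that the parser hands back the same unescaped value each time, then re-escapes only at the point where the escaped form is wanted.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Three legal spellings of the same attribute value (illustrative samples).
variants = [
    '<city place="park\'s place"/>',
    '<city place="park&apos;s place"/>',
    '<city place="park&#39;s place"/>',
]

# The parser unescapes the value, so all three yield the same string.
values = [ET.fromstring(xml).get("place") for xml in variants]
print(values)

# If the escaped form is needed, escape again at output time.
# escape() handles &, < and > itself; the dict adds the apostrophe.
escaped = escape(values[0], {"'": "&apos;"})
print(escaped)
```

Comparing strings after parsing, rather than comparing raw markup, avoids having to enumerate every legal spelling.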

In case it helps someone looking for this: you can use the acute accent symbol (´) in place of the apostrophe.

Related

Is it actually possible to parse freeform HTML with a regular expression?

Now, before you prepare to write a speech about the perils of HTML parsing with regex, I already know it. This is more a curiosity question than wanting to know the answer for practical usage.
Basically, given a file of HTML in some random, but perfectly valid format, can you parse out the content of <p> tags using a half-sane number of regular expressions? (and also pretending that <p> tags can not be nested or some other minor limitation)
It's certainly possible to extract all the text between {insert character sequence 1 here} and {insert character sequence 2 here} with regular expressions, so long as those sequences aren't overlapping. For example:
/(?<={insert character sequence 1 here}).*?(?={insert character sequence 2 here})/
Of course, it's terribly brittle and will break horribly if what you're running it on is even slightly malformed, or contains either character sequence outside the context where it's meaningful, or any number of other ways. If you oversimplify the problem, then yes you can get away with an oversimplified solution.
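To make the oversimplified case concrete, here is a rough Python sketch. It assumes bare, non-nested <p> tags with no attributes, and the sample input is invented. It uses the lookbehind/lookahead shape described above.

```python
import re

# Invented sample input: bare <p> tags, no attributes, no nesting.
html = "<div><p>first paragraph</p>\n<p>second\nparagraph</p></div>"

# Lookbehind for the opening tag, lazy match, lookahead for the closing tag.
# re.DOTALL lets "." cross line breaks inside a paragraph.
pattern = re.compile(r"(?<=<p>).*?(?=</p>)", re.DOTALL)

paragraphs = pattern.findall(html)
print(paragraphs)
```

The brittleness shows up immediately: an opening tag written as <p class="intro"> no longer satisfies the lookbehind, so that paragraph is silently skipped.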
Yes, under restrictions like valid HTML and non-nesting, you can use regular expressions for certain uses.
It depends on what limitations you'd consider minor. XHTML, for one obvious example, is somewhat more amenable to simple parsing. A great deal depends on whether you're thinking in terms of parsing existing HTML, or generating new HTML that could be parsed relatively easily. For the former case, I'd say the restrictions were major -- i.e., you'd need to know a great deal about the specific HTML in question to parse it. For the latter case, I'd say the restrictions were fairly trivial -- i.e., would only involve how you write the HTML, but would not affect what you could express in HTML.

How can I take an xml string and display it on my page similar to how StackOverflow does it with 'insert code'?

I'm using the DataContractSerializer to convert an object returned from a WCF call to xml. The client would like to see that xml string in a webpage. If I output the string directly to a label, the browser strips out the angle brackets, obviously. My question is how can I do something similar to StackOverflow? Are they doing a find & replace to replace angle brackets with their html entities? I see they are doing a code tag inside a pre tag and then making spans with the appropriate class. Is there an existing utility out there I can use to do this instead of writing some kind of parsing routine? I'm sure something free must be out there. If anyone can direct me to the right place or some code that can easily accomplish this, I would greatly appreciate it. I apologize if this is more of a meta.stackoverflow question. Thanks for any tips.
The basic answer is that to get HTML displayed as typed, special characters and all, you need to replace the special characters (<, > etc.) with their escaped equivalents (&lt;, &gt; etc.). Beyond that, if you want syntax colouring you'll have to parse the input to identify the keywords etc.
A full list of the special characters and their escape codes can be found here, but this is just one site of many.
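As an illustration of the replacement step only (not syntax colouring), here is a short Python sketch. The question is about .NET, so this is a stand-in in a different language, and the sample XML is made up. It uses the standard library's html.escape and wraps the result in pre and code tags.

```python
import html

# Made-up XML standing in for the serializer output.
xml_string = '<city place="park\'s place"><name>A & B</name></city>'

# html.escape replaces &, < and >, and by default both quote characters.
escaped = html.escape(xml_string)

# The escaped text is now safe to place inside the page as literal text.
page_fragment = "<pre><code>" + escaped + "</code></pre>"
print(page_fragment)
```

Most web frameworks ship an equivalent HTML-encoding helper, so hand-written find and replace is rarely needed.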
You're talking about "pretty print". If you want to display source code you could use this link: 16 Free Javascript Code Syntax Highlighters For Better Programming
But if you want to display only xml, there are some functions on the web that could help you with that, like this one: XML PHP Pretty Printer
And don't forget the special characters =)
Good luck

Why is so much HTML input sanitization necessary?

I have implemented a search engine in C for my html website. My entire web is programmed in C.
I understand that html input sanitization is necessary because an attacker can input these 2 html snippets into my search page to trick my search page into downloading and displaying foreign images/scripts (XSS):
<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ? Wouldn't that render both scripts useless since they would not be considered html ? I've seen html filtering that goes way beyond this where they filter absolutely all the JavaScript commands and html markup !
Input sanitisation is not inherently ‘necessary’.
It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that eg. a phone number contains digits).
But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.
HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to &lt;, & to &amp;, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.
And there are other forms of escaping. If you try to create an SQL query with the user value in, you need to do SQL string literal escaping at that point, which is completely different to HTML escaping. If you want to put a submitted value in a JavaScript string literal you would have to do JSON-style escaping, which is again completely different. If you wanted to put a value in a URL query string parameter you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
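A rough sketch of "keep the string as plain text, escape per output context" follows, in Python rather than the asker's C. The variable names, table and URL path are invented for illustration. For SQL it uses parameter binding, which is a substitute for manual string-literal escaping rather than the same technique.

```python
import html
import json
import sqlite3
from urllib.parse import quote

# The submitted value is stored and passed around as plain text.
user_value = 'Tom & "Jerry" <b>'

# HTML context: escape at the moment of templating into the page.
html_out = "<p>You searched for: " + html.escape(user_value) + "</p>"

# JavaScript string literal context: JSON-style escaping.
# (Inside an inline <script> block, sequences such as "</script>" would
# need extra handling; that case is not covered here.)
js_out = "var query = " + json.dumps(user_value) + ";"

# URL query-string context: percent-encoding.
url_out = "/search?q=" + quote(user_value, safe="")

# SQL context: bind the value as a parameter instead of escaping by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searches (q TEXT)")
conn.execute("INSERT INTO searches (q) VALUES (?)", (user_value,))
stored = conn.execute("SELECT q FROM searches").fetchone()[0]

print(html_out, js_out, url_out, stored, sep="\n")
```

The same input yields four different encodings, which is why escaping once at the input stage cannot be right for all of them.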
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ?
Well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!
< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.
My entire web is programmed in C.
I'm so sorry.
Encoding brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain-text.

Regex to match attributes in HTML?

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I am interested in getting the text Connectivity Framework extracted and written to a separate file.
Like this, there are many such tags each having a different text after the title='some text here which i need to extract '
I want to extract all such instances of the text from the html source/txt file and write to a separate txt file. The text can contain lower case, upper case letters and number only. The length of each text string(in characters) will vary.
I am using PowerGrep for Windows. PowerGrep allows me to search a text file with regular expression input.
I tried using the search as
title='[a-zA-Z0-9]
It shows the correct matches, but it matches only first character of the string and writes only the first character of the text string matched to the second txt file, not all string.
I want all string to be matched and written to the second file.
What is the correct regular expression or way to do what i want to do, using powergrep?
-AD.
I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row until it hits a character that is not in that character class (in your case, the space character since you didn't include that in your brackets).
Regexes though can be pretty complex to describe the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle so a simple "everything but apostrophes" character set won't cut it) though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.
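The three behaviours described in this answer can be checked with a small Python sketch. PowerGrep's own regex flavour may differ in details, so treat this as an illustration of the general mechanics rather than of PowerGrep itself.

```python
import re

line = "<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>"

# No repetition: the character class matches exactly one character.
one_char = re.search(r"title='[a-zA-Z0-9]", line).group(0)

# Repetition, but no space in the class: the match stops at the first space.
no_space = re.search(r"title='[a-zA-Z0-9]*", line).group(0)

# "Anything except a quote", with the value captured in a group.
full = re.search(r"title='([^']*)'", line).group(1)

print(one_char)
print(no_space)
print(full)
```

Using a capture group also separates the value from the surrounding title='...' text, which matters when writing only the extracted text to a second file.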
I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
Here's the regex you need
title='([a-zA-Z0-9]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.
Try this instead:
title=\'[a-zA-Z0-9]*\'

Best way to fetch a varying HTML tag

I'm trying to fetch some HTML from various blogs and have noticed that different providers use the same tag in different ways.
For example, here are two major providers that use the meta name generator tag differently:
Blogger: <meta content='blogger' name='generator'/> (content first, name later and, yes, single quotes!)
WordPress: <meta name="generator" content="WordPress.com" /> (name first, content later)
Is there a way to extract the value of content for all cases (single/double quotes, first/last in the row)?
P.S. Although I'm using Java, the answer would probably help more people if it were for regular expressions generally.
The answer is: don't use regular expressions.
Seriously. Use an SGML parser, or an XML parser if you happen to know it's valid XML (probably almost never true). You will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.
Actually, you should probably use some sort of HTML parser where you can inspect each node (and therefore node attributes) in the DOM of the page. I've not used any of these for a while so I don't know the pros and cons but here's a list http://java-source.net/open-source/html-parsers
Those differences are not really important according to the XHTML standard.
In other words, they are exactly the same thing.
Also, if you replace double quotes with single quotes would be the same.
The typical way of 'normalizing' an xml document is to parse it using some API that treats the document as its Infoset representation. Both DOM and SAX style APIs work that way.
If you want to parse them by hand (or with a RegEx) you have to replicate all those things in your code and, in my opinion, that's not practical.
Note: single quotes (even no quotes, if the value doesn't contain a space) are valid according to the W3C HTML spec. Quote:
By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39)... In certain cases, authors may specify the value of an attribute without any quotation marks.
Also, don't forget that the order of attributes can be reversed and that other attributes can appear in the tag.
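To show why a parser makes quote style and attribute order irrelevant, here is a minimal sketch using Python's standard html.parser. The asker is using Java, so this is an illustration in another language, and the class and function names (GeneratorFinder, find_generator) are invented for this example.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remembers the content of <meta name="generator">, if one is seen."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with quotes already removed,
        # so neither attribute order nor quote style matters here.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "generator":
            self.generator = attributes.get("content")


def find_generator(page):
    finder = GeneratorFinder()
    finder.feed(page)
    finder.close()
    return finder.generator


blogger = "<meta content='blogger' name='generator'/>"
wordpress = '<meta name="generator" content="WordPress.com" />'
print(find_generator(blogger), find_generator(wordpress))
```

Extra attributes on the tag are ignored by the dictionary lookup, so they need no special handling either.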
You may want to give Java's HTMLEditorKit a shot. It is easy to experiment with to see if the parsing provides what you are looking for.
Ok, since you are looking for language-agnostic then you can try a REGEX like /<meta\s.*content=.*>/ and take the result from that and parse out the specific values that you are looking for. I'm by no means a REGEX expert so there is probably a better way but in using the tool at http://www.codehouse.com/webmaster_tools/regex/ I matched both of the strings you provided.
If you must use regex, here is a regex to get just the content part:
content\s*=\s*['"].*?['"]
returns
content='blogger'
and
content="WordPress.com"
respectively. I'm no regex expert, but it gets those when given your examples in regexpal.
Once you get that you can get everything between the quotes however you choose, be it another regex (which is just immoral at that point) or just looping over the characters.
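If a regex is used anyway, the second step can be folded into the first by capturing the value and requiring the closing quote to match the opening one. The pattern below is my own variation on the one above, not the answerer's, sketched in Python.

```python
import re

# Group 1 captures the opening quote, group 2 the value; \1 requires the
# same quote character to close the value.
pattern = re.compile(r"""content\s*=\s*(['"])(.*?)\1""")

blogger = "<meta content='blogger' name='generator'/>"
wordpress = '<meta name="generator" content="WordPress.com" />'

values = [pattern.search(tag).group(2) for tag in (blogger, wordpress)]
print(values)
```

This still does not handle unquoted values, or "content=" text that appears outside a meta tag.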
If you're using Java you may want to look at tagsoup, which is a SAX-compliant parser for "[parsing] HTML as it is found in the wild".