I have a simple html page with a div element in it.
The innerHTML property of the div is set through query String.
In query string I pass html strings,i.e.
<p style='font-size:20px;color:green;'> Sun rises in the east </p> etc...
I get the appropriate output.
However, if I pass color code in style attribute say, #00990a, I am not displayed any content.
Can someone help me through this?
if theres a color code that contains a #, everything after that will be treated fragment identifier. to avoid this you have to url-encode your parameter-value (replacing # with %23 an d doing the same with other characters that have a special meaning (#&%=?#...)).
Finally your url should look like this:
PageUrl?Content=%3Cp+style%3D%27color%3A%23009900%27%3EContent%3C%2Fp%3E
Since you haven't shown us any code, I shall guess…
In a URI, # indicates the start of the fragment identifier (as ? indicates the start of the query string). Your colour is terminated the query string and starting the fragment identifier. You need to URL encode any character that has special meaning in URLs. (# is %23).
Do make sure that you sanitise the passed HTML and CSS on the server though. It is very easy to expose yourself to XSS attacks otherwise.
Related
I want to client validate a form input (username + password) before sending it to the server (php).
Therefore I applied the pattern attribute in the input tag.
I came up with a RegEx expression that does the job on the server side:
(preg_match_all('/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/', $_POST['username']) == 0)
thereby the global flag is set using preg_match_all (instead of preg_match).
Now I wanted to implement the same RegEx in my pattern attribute in the HTML form.
HTML standard defines that RegEx in the pattern attribute follows RegEx in JavaScript, which devides the expression into "pattern, flags" divided by a comma. I would translate that into HTML like this:
pattern="^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$,g"
That doesn't work.
All JavaScript RegEx validators I have found enclose the pattern into slashes:
/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/
and say, that the global flag would be behind the last slash:
pattern="/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/g"
That doesn't work either.
Mozilla also states in their developer guide (I also read it elsewhere):
No forward slashes should be specified around the pattern text.
So, how can I get the global flag into the pattern attribute of the input element?
There are a couple of facts you should be aware when using pattern attribute regex:
There is no need to use g flag, the whole string must match the regex, and the regex check will only be performed once, a single match is enough
There is no need wrapping the pattern with regex delimiters, and if you add slashes at the start and end, they will be treated as literal slashes making part of the regex pattern, and in 99.9% of cases that would ruin the regex
You do not even need ^ and $ anchors as the pattern regex must match the entire string input. In fact, the pattern is automatically enclosed with ^(?: and )$, so if you use pattern="^\d+$" (just a quick example), the final regex (in Chrome, e.g.) will look like /^(?:^\d+$)$/u, which looks rather redundant.
So, all you need is
pattern="^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}"
// Or even
pattern="^[\w. äöüßÄÖÜ#-]{1,50}"
Note that [A-Za-z0-9_] = \w in JavaScript regex.
In my jsp page in tag (img src="upload/<%=a.getUrlimmagine()%>") i have this error
Bad value in "upload/ " for attribute "src" on element "img":DOUBLE_WHITE SPACE in PATH
How can I solve it?
Because URLs should not have spaces (see : Are URLs allowed to have a space in them?
) you should encode your url so the unsafe characters will be replaced with some strings representing them (i.e. space becoming %20)
So you either do this in the getUrlimmagine() in your bean or do the encoding in the jsp page.
If you are sure that the only unsafe charcter in your image names is spaces so you can use String.replace() in the backing bean or use the JSTL replace function h in your jsp page
otherwise if you want the cleanest solution, you should definitely read this article : What every web developer must know about URL encoding by Stéphane Épardaud
A cleaner solution to get the same result without using JSP expression is in this answer by BalusC
You would need to use the replaceAll method for String:
<img src="upload/<%=a.getUrlimmagine().replaceAll(" ", "%20")%>" />
I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with and empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving textual contents of HTML documents. Regex cannot handle nested tags. Supposing a document doesn't contain any nested tag, regex still requires every tags are properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there than actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
For Your Information,
Instead of Regex, With JQuery , Its possible to extract text alone from a html markup. For that you can use the following pattern.
$("<div/>").html("#elementId").text()
You can refer this JSFIDDLE
Say we have a form where the user types in various info. We validate the info, and find that something is wrong. A field is missing, invalid email, et cetera.
When displaying the form to the user again I of course don't want him to have to type in everything again so I want to populate the input fields. Is it safe to do this without sanitization? If not, what is the minimum sanitization that should be done first?
And to clearify: It would of course be sanitized before being for example added to a database or displayed elsewhere on the site.
No it isn't. The user might be directed to the form from a third party site, or simply enter data (innocently) that would break the HTML.
Convert any character with special meaning to its HTML entity.
i.e. & to &, < to <, > to > and " to " (assuming you delimit your attribute values using " and not '.
In Perl use HTML::Entities, in TT use the html filter, in PHP use htmlspecialchars. Otherwise look for something similar in the language you are using.
It is not safe, because, if someone can force the user to submit specific data to your form, you will output it and it will be "executed" by the browser. For instance, if the user is forced to submit '/><meta http-equiv="refresh" content="0;http://verybadsite.org" />, as a result an unwanted redirection will occur.
You cannot insert user-provided data into an HTML document without encoding it first. Your goal is to ensure that the structure of the document cannot be changed and that the data is always treated as data-values and never as HTML markup or Javascript code. Attacks against this mechanism are commonly known as "cross-site scripting", or simply "XSS".
If inserting into an HTML attribute value, then you must ensure that the string cannot cause the attribute value to end prematurely. You must also,of course, ensure that the tag itself cannot be ended. You can acheive this by HTML-encoding any chars that are not guaranteed to be safe.
If you write HTML so that the value of the tag's attribute appears inside a pair of double-quote or single-quote characters then you only need to ensure that you html-encode the quote character you chose to use. If you are not correctly quoting your attributes as described above, then you need to worry about many more characters including whitespace, symbols, punctuation and other ascii control chars. Although, to be honest, its arguably safest to encode these non-alphanumeric chars anyway.
Remember that an HTML attribute value may appear in 3 different syntactical contexts:
Double-quoted attribute value
<input type="text" value="**insert-here**" />
You only need to encode the double quote character to a suitable HTML-safe value such as "
Single-quoted attribute value
<input type='text' value='**insert-here**' />
You only need to encode the single quote character to a suitable HTML-safe value such as
Unquoted attribute value
<input type='text' value=**insert-here** />
You shouldn't ever have an html tag attribute value without quotes, but sometimes this is out of your control. In this case, we really need to worry about whitespace, punctuation and other control characters, as these will break us out of the attribute value.
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and | (and more). [para lifted from OWASP]
Please remember that the above rules only apply to control injection when inserting into an HTML attribute value. Within other areas of the page, other rules apply.
Please see the XSS prevention cheat sheet at OWASP for more information
Yes, it's safe, provided of course that you encode the value properly.
A value that is placed inside an attribute in an HTML needs to be HTML encoded. The server side platform that you are using should have methods for this. In ASP.NET for example there is a Server.HtmlEncode method, and the TextBox control will automatically HTML encode the value that you put in the Text property.
When should I HTML-escape data in my code and when should I URL-escape? I am confused about which one when to use...
For example, given a element which asks for an URL:
<input type="text" value="DATA" name="URL">
Should I HTML-Escape DATA here or URL-escape it here?
And what about an element:
NAME
Should URL be URL-escaped or HTML-escaped? What about NAME?
Thanks, Boda Cydo.
URL encoding ensures that special characters such as ? and & don't cause the URL to be misinterpreted on the receiving end. In practice, this means you'll need to URL encode any dynamic query string values that have a chance of containing such characters.
HTML encoding ensures that special characters such as > and " don't cause the browser the misinterpret the markup. Therefore you need to HTML encode any values outputted into the markup that might contain such characters.
So in your example:
DATA needs to be HTML encoded.
Any dynamic segments of URL will need to be URL encoded, then the whole string will need to be HTML encoded.
Name needs to be HTML encoded.
HTML Escape when you're writing anything to a HTML document.
URL Escape when you're constructing a URL to call in-code, or for a browser to call (i.e. in the href tag).
In your examples you'll want to 'Attribute' escape the attributes. (I can't remember the exact function name, but it's in HttpUtility).
In the examples you show, it should be first URL-escaped, then HTML-escaped:
<a href="http://www.example.com?arg1=this%2C+that&arg2=blah">