Conditional HTML Attributes using Razor - html

The variable strCSSClass often has a value but sometimes is empty.
I do not want to include an empty class="" in this input element's HTML, which means if strCSSClass is empty, I don't want the class= attribute at all.
The following is one way to do a conditional HTML attribute:
<input type="text" id="@strElementID" @(strCSSClass.IsEmpty() ? "" : "class=" + strCSSClass) />
Is there a more elegant way of doing this? Specifically, one where I could follow the same syntax as is used in the other parts of the element: class="@strCSSClass"?

You didn't hear it from me, the PM for Razor, but in Razor 2 (Web Pages 2 and MVC 4) we'll have conditional attributes built into Razor (as of MVC 4 RC tested successfully), so you can write things like this:
<input type="text" id="@strElementID" class="@strCSSClass" />
If strCSSClass is null then the class attribute won't render at all.
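The rule can be pictured with a small sketch. The JavaScript below is not Razor: renderAttr and renderInput are hypothetical helpers that only mimic the behavior described above (a null value drops the whole attribute), and value encoding is left out for brevity.

```javascript
// Hypothetical helpers, not part of Razor: they imitate "omit the attribute
// when the value is null". Values are assumed to be safe already; Razor
// encodes them itself.
function renderAttr(name, value) {
  if (value === null || value === undefined) {
    return ""; // no attribute at all, not even class=""
  }
  return " " + name + '="' + value + '"';
}

function renderInput(id, cssClass) {
  return '<input type="text"' + renderAttr("id", id) + renderAttr("class", cssClass) + " />";
}

console.log(renderInput("name", "wide"));
console.log(renderInput("name", null));
```

The first call keeps the class attribute; the second emits the input without it.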
Further Reading
Jon Galloway - ASP.NET MVC 4 Beta Released!
Conditional Attributes in Razor View Engine and ASP.NET MVC 4

Note you can do something like this (at least in MVC 3):
<td align="left" @(isOddRow ? "class=TopBorder" : "style=border:0px") >
What I believed was Razor adding quotes was actually the browser. As Rism pointed out when testing with MVC 4 (I haven't tested with MVC 3, but I assume the behavior hasn't changed), this actually produces class=TopBorder, which browsers are able to parse fine. HTML parsers are somewhat forgiving about missing attribute quotes, but this can break if the value contains spaces or certain other characters. In the browser's DOM inspector the element then appears as:
<td align="left" class="TopBorder" >
OR
<td align="left" style="border:0px" >
What goes wrong with providing your own quotes
If you try to use some of the usual C# conventions for nested quotes, you'll end up with more quotes than you bargained for because Razor is trying to safely escape them. For example:
<button type="button" @(true ? "style=\"border:0px\"" : string.Empty)>
This should evaluate to <button type="button" style="border:0px"> but Razor escapes all output from C# and thus produces:
style=&quot;border:0px&quot;
You will only see this if you view the response over the network. If you use an HTML inspector, you are often seeing the DOM, not the raw HTML. Browsers parse HTML into the DOM, and the parsed DOM representation already has some niceties applied. In this case the browser sees that there are no quotes around the attribute value and adds them:
style="&quot;border:0px&quot;"
But in the DOM inspector HTML character codes display properly so you actually see:
style=""border:0px""
In Chrome, if you right-click and select Edit HTML, it switches back so you can see those nasty HTML character codes, making it clear you have real outer quotes and HTML-encoded inner quotes.
So the problem with trying to do the quoting yourself is Razor escapes these.
If you want complete control of quotes
Use Html.Raw to prevent quote escaping:
<td @Html.Raw( someBoolean ? "rel='tooltip' data-container='.drillDown a'" : "" )>
Renders as:
<td rel='tooltip' data-container='.drillDown a'>
The above is perfectly safe because I'm not outputting any HTML from a variable. The only variable involved is the ternary condition. However, beware that this last technique might expose you to certain security problems if you build strings from user-supplied data. E.g. if you built an attribute from data fields that originated from user-supplied data, use of Html.Raw means that string could contain a premature ending of the attribute and tag, then begin a script tag that does something on behalf of the currently logged-in user (possibly a different user than the one who supplied the data). Maybe you have a page with a list of all users' pictures and you are setting a tooltip to be the username of each person, and one user named himself '/><script>$.post('changepassword.php?password=123')</script> and now any other user who views this page has their password instantly changed to a password that the malicious user knows.
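To make the risk concrete, here is a minimal JavaScript sketch (not Razor) that builds the same kind of tooltip attribute twice, once by raw concatenation and once through an encoder. encodeAttr is a hand-written helper for illustration only, and the user name is a harmless stand-in for the malicious one described above.

```javascript
// Illustrative encoder only; in a real application rely on the framework's
// own encoding rather than a hand-written one.
function encodeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// A user name that closes the attribute and the tag, then opens a script.
const userName = "'/><script>alert(1)</script>";

// Raw concatenation: the script tag ends up in the markup as a real tag.
const raw = "<img title='" + userName + "' />";

// Encoded: the same characters stay inside the attribute value as text.
const encoded = "<img title='" + encodeAttr(userName) + "' />";

console.log(raw);
console.log(encoded);
```

Only the raw version contains a literal script tag; the encoded version keeps the whole name inside the title attribute.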

I think a somewhat more convenient and structured way is to use an Html helper. In your view it could look like this:
@{
    var htmlAttr = new Dictionary<string, object>();
    htmlAttr.Add("id", strElementID);
    // Only add the class attribute when there is a value for it.
    if (!strCSSClass.IsEmpty())
    {
        htmlAttr.Add("class", strCSSClass);
    }
}
@* ... *@
@Html.TextBox("somename", "", htmlAttr)
If this approach is useful to you, I recommend defining the htmlAttr dictionary in your model so that your view doesn't need any @{ } logic blocks (which keeps it cleaner).

Related

StringTemplate: HTML row formatting (odd/even)

I am new to StringTemplate template engine and want to use it for generating an html document with a table. I want to alter the style of the table rows depending on whether it is odd or even. I found a discussion on the stringtemplate-interest mailing list that describes the general approach ([stringtemplate-interest] Odd even row formatting).
But I have an additional requirement which breaks this general approach (I think). I want to render rows depending on the existence of a value. So I am working with a conditional expression $if(expr)$. My template looks like this.
delimiters "$", "$"
htmlTable(valueA, valueB, valueC, valueD) ::= <<
<table>
<tr styleClass='odd'><td>$valueA$</td></tr>
$if(valueB)$
<tr><td>$valueB$</td></tr>
$endif$
<tr styleClass='odd'><td>$valueC$</td></tr>
<tr><td>$valueD$</td></tr>
</table>
>>
In the given template I cannot use the hard-coded styleClass attribute, because it would render the table wrongly if the valueB parameter does not exist.
Is my requirement realizable with a template engine like StringTemplate, which focuses on separation of model and view? Or is there too much model in the requirement to implement it in the view? I know how to do it in other template engines (e.g. FreeMarker or Apache Velocity), or I might use some fancy CSS or JavaScript, but I would rather keep the model-view separation and use StringTemplate's built-in instruments.

Why does the browser automatically unescape html tag attribute values?

Below I have an HTML tag, and use JavaScript to extract the value of the widget attribute. This code will alert <test> instead of &lt;test&gt;, so the browser automatically unescapes attribute values:
alert(document.getElementById("hau").attributes[1].value)
<div id="hau" widget="&lt;test&gt;"></div>
My questions are:
Can this behavior be prevented in any way, besides doing a double escape of the attribute contents? (It would look like this: &amp;lt;test&amp;gt;)
Does anyone know why the browser behaves like this? Is there any place in the HTML specs that this behavior is mentioned explicitly?
1) It can be done without doing a double escape
Looks like yours is closer to htmlEncode().
If you don't mind using jQuery
alert(htmlEncode($('#hau').attr('widget')))
function htmlEncode(value){
    // Create an in-memory div, set its inner text (which jQuery automatically encodes),
    // then grab the encoded contents back out. The div never exists on the page.
    return $('<div/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="hau" widget="&lt;test&gt;"></div>
If you're interested in a pure vanilla js solution
alert(htmlEncode(document.getElementById("hau").attributes[1].value))
function htmlEncode( html ) {
return document.createElement( 'a' ).appendChild(
document.createTextNode( html ) ).parentNode.innerHTML;
};
<div id="hau" widget="&lt;test&gt;"></div>
2) Why does the browser behave like this?
This behaviour is what makes a few specific things possible, such as including quotes inside a pre-filled input field, as shown below. That would not be possible if the only way to insert " were to write it literally, which would in turn require escaping it with another character such as \
<input type='text' value="&quot;You &apos;should&apos; see the double quotes here&quot;" />
The browser unescapes the attribute value as soon as it parses the document (mentioned here). One of the reasons might be that it would otherwise be impossible to include, for example, double quotes in your attribute value (well, technically it would if you put the value in single quotes instead, but then you wouldn't be able to include single quotes in the value).
That said, the behavior cannot be prevented, although if you really must use the value with the HTML entities being part of it, you could simply turn your special characters back into the codes (I recommend Underscore's escape for such task).

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by a visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but would rather not use them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
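A quick way to see the unescaped > problem is to run the question's pattern over two small inputs. The sketch below ports the pattern to JavaScript syntax and adds a "text" alternative as the asker describes; extractText and the sample markup are made up for illustration.

```javascript
// The question's pattern in JavaScript syntax, plus a named "text" group
// for runs of characters that contain neither < nor >.
const markupOrText = /(?:<(?<tag>script|style)[\s\S]*?<\/\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)|(?<text>[^<>]+)/gi;

// Collect every match that came from the "text" alternative.
function extractText(html) {
  const found = [];
  markupOrText.lastIndex = 0; // the regex is global, so reset its position
  let m;
  while ((m = markupOrText.exec(html)) !== null) {
    if (m.groups.text !== undefined) {
      found.push(m.groups.text);
    }
  }
  return found;
}

// Attribute value without ">": only the link text is reported.
console.log(extractText('<a title="a and b">link</a>'));

// Valid attribute value containing ">": the tag match stops early and the
// rest of the attribute is reported as if it were page text.
console.log(extractText('<a title="a > b">link</a>'));
```

In the second call the tail of the attribute value shows up in the results alongside the real link text.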
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url(&#39;link&#39;);"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend using DOM (Document Object Model) to parse HTML documents and extract content from them. A DOM library exists for most programming languages.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    ' The "text" group is only filled when the match is plain page text.
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        ' Markup, script, style or comment: return it untouched.
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
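For readers who do not use VB.Net, the same callback-based replacement can be sketched in JavaScript. This is a port of the idea rather than of the exact code: replaceInText is a made-up name, the pattern uses JavaScript's named-group syntax, and the same limitations of the regex approach apply.

```javascript
// Markup alternatives first, then a named "text" group for everything else.
const markupOrText = /(?:<(?<tag>script|style)[\s\S]*?<\/\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)|(?<text>[^<>]+)/gi;

// Replace `from` with `to` only inside matches of the "text" group;
// tags, comments, scripts and styles are returned unchanged.
function replaceInText(html, from, to) {
  return html.replace(markupOrText, (...args) => {
    const wholeMatch = args[0];
    const groups = args[args.length - 1]; // named groups arrive as the last argument
    if (groups.text === undefined) {
      return wholeMatch;
    }
    return groups.text.split(from).join(to);
  });
}

const page = '<p title="cat">cat</p><script>var cat = 1;</script>';
console.log(replaceInText(page, "cat", "dog"));
```

Only the visible word between the paragraph tags changes; the attribute value and the script body are left alone.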
For your information:
Instead of regex, it is possible with jQuery to extract just the text from HTML markup. For that you can use the following pattern, where htmlMarkup is a string holding the markup:
$("<div/>").html(htmlMarkup).text()
You can refer to this JSFiddle.

HTML rendered incorrectly in .NET

I am trying to take the string "<BR>" in VB.NET and convert it to HTML through XSLT. When the HTML comes out, though, it looks like this:
&lt;BR&gt;
I can only assume it goes ahead and tries to render it. Is there any way I can convert those &lt;/&gt; back into the brackets so I get the line break I'm trying for?
Check the XSLT has:
<xsl:output method="html"/>
edit: explanation from comments
By default XSLT outputs XML (1), which means it will escape any significant characters. You can override this in specific instances with the attribute disable-output-escaping="yes" (intro here), but it is much more powerful to set the output method explicitly to HTML, which confers the same benefit globally. With that setting the processor will do the following:
For script and style elements, replace any escaped characters (such as &amp; and &gt;) with their actual values (& and >, respectively).
For attributes, replace any occurrences of &gt; with >.
Write empty elements such as <br>, <img>, and <input> without closing tags or slashes.
Write attributes that convey information by their presence as opposed to their value, such as checked and selected, in minimized form.
from a solid IBM article covering the subject, more recent coverage from stylusstudio here
If HTML output is what you desire HTML output is what you should specify.
(1) There is actually a corner case where output defaults to HTML, but I don't think it's universal and it's kind of obtuse to depend on it.
Try wrapping it with <xsl:text disable-output-escaping="yes">&lt;br&gt;</xsl:text>
Don't know about XSLT but..
One workaround might be using HttpUtility.HtmlDecode from System.Web namespace.
using System;
using System.Web;
class Program
{
    static void Main()
    {
        // Decodes the escaped entities back into a literal <br> tag.
        Console.WriteLine(HttpUtility.HtmlDecode("&lt;br&gt;"));
        Console.ReadKey();
    }
}
...
Got it! On top of the selected answer, I also did something similar to this on my string:
htmlString = htmlString.Replace("&lt;", "<")
htmlString = htmlString.Replace("&gt;", ">")
I think, though, that in the end, I may just end up using <pre> tags to preserve everything.
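If you do decode by hand like this, the order of the replacements matters once ampersands are involved. The sketch below is in JavaScript rather than VB.NET, decodeBasicEntities is a made-up helper, and it covers only the four entities listed, so treat it as an illustration of the ordering rather than a full decoder.

```javascript
// Hand-rolled decoder for a handful of entities. "&amp;" is handled last so
// that text which was escaped twice is unescaped only one level.
function decodeBasicEntities(s) {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

console.log(decodeBasicEntities("&lt;br&gt;"));
console.log(decodeBasicEntities("&amp;lt;br&amp;gt;"));
```

The first call yields a literal tag; the second yields the singly escaped form, because the ampersands are decoded only after the other entities have been looked for.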
The string "<br>" is already HTML so you can just Response.Write("<br>").
But you mention XSLT, so I imagine there is some transform going on. In that case surely the transform should be inserting it at the correct place as a node. A better question will likely get a better answer.

HtmlAgilityPack Drops Option End Tags

I am using HtmlAgilityPack. I create an HtmlDocument and LoadHtml with the following string:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select>
This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children - two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One<option value="2">Two</select>
So basically it is deciding for me to drop the closing tags on the options. Let's leave aside for a moment whether it is proper and desirable to do that. I am using HtmlAgilityPack to test HTML generation code, so I don't want it to make any decision for me or give any errors unless the HTML is truly malformed. Is there some way to make it behave how I want? I tried setting some of the options for HtmlDocument, specifically:
doc.OptionAutoCloseOnEnd = false;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = false;
This is not working. If HtmlAgilityPack cannot do what I want, can you recommend something that can?
The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.
A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:
// they sometimes contain, and sometimes they don 't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);
(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)
ADD
An equivalent solution is to call HtmlNode.ElementsFlags.Remove("option"); before any use of the library (with no need to modify the library source code).
It seems that there is some reason not to parse the Option tag as a "generic" tag, for XHTML compliance, however this can be a real pain in the neck.
My suggestion is to do a whole-string-replace and change all "option" tags to "my_option" tags, that way you:
Don't have to modify the source of the library (and can upgrade it later).
Can parse as you usually would.
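The whole-string replace can be sketched as follows. This is JavaScript purely for illustration (the question itself is about a .NET library), renameTag is a made-up helper, and it assumes the tag name is a plain word with no regex metacharacters. Note that it would also rewrite a literal "<option" that appears inside text or attribute values.

```javascript
// Rename opening and closing tags by rewriting "<name" and "</name".
// `from` is assumed to be a simple tag name such as "option".
function renameTag(html, from, to) {
  const tagStart = new RegExp("<(/?)" + from + "\\b", "gi");
  return html.replace(tagStart, (match, slash) => "<" + slash + to);
}

const original = '<select><option value="1">One</option><option value="2">Two</option></select>';
const renamed = renameTag(original, "option", "my_option");
console.log(renamed);

// After parsing and serializing, the same helper can map the names back.
console.log(renameTag(renamed, "my_option", "option") === original);
```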
The original post on HtmlAgilityPack forum can be found at:
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=14982