I created a directive that highlights code but it seems browsers are modifying that code before I can get to it and highlight it.
Here's what's happening.
I have a directive called my-compile which basically just spits the passed value into the element's innerHTML and runs a $compile on it.
eg:
<span my-compile="details"></span>
And details would be something like:
here are some details and here's a <code lang="java">first = temp & 0xFF &amp;</code>
here's the directive code that matters (this is in the link function):
element.html(details); $compile(element.children())(scope);
So $compile sees the <code> directive and hands that off to the code directive, except, and here's the problem, the <code> directive does an element.html() to get the contents and this is returned:
first = temp &amp; 0xFF &amp;
The problem is that the code is now wrong, because the first & wasn't escaped.
How can I still use the <code> directive in a similar fashion but preserve the & sign (and I assume this happens with > and < signs too)?
My only idea was a lookup service. That's kind of messy, but it may be my only option, because the second the markup hits the browser's DOM the bare & gets escaped, while the already-escaped &amp; doesn't get double escaped.
I've also tried using element[0].innerHTML, thinking maybe it's an Angular/jQuery sanitization thing, but it isn't.
The problem is that when you first add the HTML to your DOM with element.html(details), the browser parses it as HTML (and in doing so fixes incorrect escaping, adds missing closing tags, and so on). When you read it back later, you get the HTML with all the fixes made during parsing.
The only way to fix it is to properly encode your code content as text (for the provided example, if you need the text first = temp & 0xFF &amp; the encoded version would be first = temp &amp; 0xFF &amp;amp;), and to access it with element.text() rather than element.html() in your code directive.
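A rough sketch of the encode-once, read-as-text idea, written in Python rather than Angular. The variable names are made up for illustration; html.escape stands in for whatever encodes the content before it is placed in the markup, and html.unescape stands in for what reading the element's text content gives back.

```python
import html

# The text the code element should display, exactly as intended.
intended_text = "first = temp & 0xFF & mask"

# Encode once before the text is embedded in markup.
encoded = html.escape(intended_text, quote=False)

# Reading the text content back undoes exactly one level of encoding.
recovered = html.unescape(encoded)

# Text that already looks like an entity must be encoded as well,
# otherwise the parser would decode it and the original is lost.
literal_entity = "&amp;"
encoded_entity = html.escape(literal_entity, quote=False)

print(encoded)
print(encoded_entity)
```

The point of the sketch is that the round trip is lossless only when every & in the intended text is encoded, including those that are part of something that looks like an entity.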
I would like to understand why, when editing an HTML page in the browser with the inspect button, the value attribute of a radio input cannot be changed to:
value='"'
The character is changed to &quot;:
value="&quot;"
Why do Chrome, Firefox and other browsers not allow a bare " character inside the value?
Can you give me an explanation of this?
The inspector favors double quotes to clearly show strings. It may not actually look that way in the DOM, but in the inspector it's the rule. You can't have an unescaped double quote between two others, so value="\"" might work, or value="&quot;" as you said. Not sure how relevant this link is but it shows that they purposely hard-code that double quote in there.
I am working on Freemarker Template to create one form.
I declared two variables using <#local> which will hold anchor tag and button
<#local rightElement = "<a class='set-right' href='${data.url}'>Forgot your password?</a>">
<#local rightButton = "<input type='button' class='js-show-pass btn-toggle-password btn-link' value='Show'>">
I pass these variables to the macro that creates my form structure. But when I load the page, the variables I pass are printed as strings on the form page. I access them as ${rightElement} and ${rightButton} in my macro, but the anchor tag is printed with double quotes, as a string.
I tried multiple ways to eliminate the double quotes with no success. Could you please suggest a way to declare an anchor tag in a variable and use it as an HTML element rather than a string?
I'm not sure what you mean by "printing my anchor tag with double quotes, making it as string", but my guess is that the HTML that you print gets auto-escaped, and so you literally see the HTML source (with the <-s and all) in the browser, instead of a link or button.
So, first, assign the values with <#local someName>content</#local> syntax, instead of <#local someName="content">:
<#local rightElement><a class='set-right' href='${data.url}'>Forgot your password?</a></#local>
<#local rightButton><input type='button' class='js-show-pass btn-toggle-password btn-link' value='Show'></#local>
This ensures that ${data.url} and such are properly escaped (assuming you do use auto-escaping, and you should), also then you won't have to avoid/escape " or ' inside the string literal, you can use #if and such in constructing the value, etc.
Then, where you print the HTML value, if you are using the modern version of auto-escaping (try ${.output_format} to see - it should print HTML then, not undefined), you can now just write ${rightElement}, because FreeMarker knows that it's captured HTML, and so needs no more escaping. If you are using the legacy version of auto-escaping (which utilized #escape directive), then you have to write <#noescape>${rightElement}</#noescape>.
I am using gtk in an application and I make use of the abilities of gtklabel text to be rendered automatically as a clickable url. This works well most of the time, however with a url which contains parentheses "(" and ")" this does not work. The versions I use are the ones available on debian (old)stable, i.e. debian 6 (2.20) and 7 (3.4.2).
For example, I am trying to display the following url:
https://maps.google.com/maps?q=62.1891,+-141.5372+(Example+text+in+here+will+be+rendered+in+the+maps+label)&iwloc=A&hl=en
When I create a gtklabel with this text, for example:
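text="<a href=\"https://maps.google.com/maps?q=62.1891,+-141.5372+(Example+text+in+here+will+be+rendered+in+the+maps+label)&iwloc=A&hl=en\"><b>Click here for Map</b></a>\n"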
Then it will display fine in the label as an underlined link in bold with the text Click here for Map
However when you click the link it will not show correctly and this error appears:
Gtk-WARNING **: Unable to show '(null)': Operation not supported
It looks like the parentheses mess up the rendering of the url by gtk.
Is there a way to escape the parentheses, or use a different character that works in the map url to create the label?
I have tried various methods of escaping it, however none were effective so far. Such as using %28 and %29 to replace the parentheses as well as backslashes as an escape character.
I am using the method described in https://developer.gnome.org/gtk2/2.24/GtkLabel.html and https://developer.gnome.org/gtk3/stable/GtkLabel.html under "Links" which allows automatic rendering of links:
Links
Since 2.18, GTK+ supports markup for clickable hyperlinks in addition
to regular Pango markup. The markup for links is borrowed from HTML,
using the a with href and title attributes. GTK+ renders links similar
to the way they appear in web browsers, with colored, underlined text.
The title attribute is displayed as a tooltip on the link. An example
looks like this:
gtk_label_set_markup (label, "Go to the <a href=\"http://www.gtk.org\" title=\"&lt;i&gt;Our&lt;/i&gt; website\">GTK+ website</a> for more...");
I understand it is working in more recent releases of gtk (2.24 and 3.6), making sure to escape ampersands. But I was wondering if there is a work around for older gtk versions to avoid this problem?
You should be escaping your ampersands with &amp;.
I'm pretty sure GTK prints out a runtime warning telling you this when you call gtk_label_set_markup().
Here's the warning on GTK 3.6.4:
Gtk-WARNING **: Failed to set text from markup due to error parsing markup: Error on line 1: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;
jku is right, the ampersands need to be escaped. Here's an example using the very same string as you, and it works (tested on 3.6.4 and 2.24.17).
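#include <gtk/gtk.h>

int
main (int argc, char **argv)
{
  gtk_init (&argc, &argv);
  GtkWidget *window = gtk_window_new (GTK_WINDOW_TOPLEVEL);
  // This one won't work, needs ampersand escaping
  // GtkWidget *label = gtk_label_new ("<a href=\"https://maps.google.com/maps?q=62.1891,+-141.5372+(Example+text+in+here+will+be+rendered+in+the+maps+label)&iwloc=A&hl=en\"><b>Click here for Map</b></a>\n");
  // Same string with every & in the URL written as &amp;
  GtkWidget *label = gtk_label_new ("<a href=\"https://maps.google.com/maps?q=62.1891,+-141.5372+(Example+text+in+here+will+be+rendered+in+the+maps+label)&amp;iwloc=A&amp;hl=en\"><b>Click here for Map</b></a>\n");
  gtk_label_set_use_markup (GTK_LABEL (label), TRUE);
  gtk_container_add (GTK_CONTAINER (window), label);
  gtk_widget_show_all (GTK_WIDGET (window));
  g_signal_connect (window, "destroy", G_CALLBACK (gtk_main_quit), NULL);
  gtk_main ();
  return 0;
}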
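If the URL is not a fixed string, the escaping can be done programmatically before the markup is built. A minimal sketch in Python using only the standard library; the helper name link_markup is hypothetical, and xml.sax.saxutils.escape is used here as a stand-in for a markup-escaping routine.

```python
from xml.sax.saxutils import escape


def link_markup(url: str, label: str) -> str:
    """Build link markup with the URL and the label escaped (hypothetical helper)."""
    # escape() rewrites &, < and >; the extra mapping also covers the double
    # quote, because the URL sits inside a double-quoted attribute.
    safe_url = escape(url, {'"': "&quot;"})
    safe_label = escape(label)
    return f'<a href="{safe_url}"><b>{safe_label}</b></a>'


url = (
    "https://maps.google.com/maps?q=62.1891,+-141.5372"
    "+(Example+text+in+here+will+be+rendered+in+the+maps+label)"
    "&iwloc=A&hl=en"
)
markup = link_markup(url, "Click here for Map")
print(markup)
```

In C, GLib's g_markup_escape_text() performs the same kind of escaping. Note that the parentheses are left alone; only the ampersands change.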
Original answer:
Have you tried to call gtk_show_uri with that link? You could then see if that's a problem with what handles URI's, or if it's the way your label is formatted/constructed.
I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
&lt;div
Whenever it's printed into an HTML textarea tag, the source code of the page shows the &gt; as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Herg&eacute;'s adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (&lt;) but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
&lt;div
Then the HTML must be:
<textarea>&amp;lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
NB 1: If you were using XHTML[2] (and really using it, it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[&lt;div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
OK, but the question is: why does it decode them at all? Suppose I enter &lt; and save the textarea: it will be saved as &lt; but displayed as <. Saving it again will convert it to < (although it was &lt; in the database), so saving again stores it as < in the database. Why does the textarea decode it?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
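The round trip described above can be sketched in Python. The field name somename and the sample text are placeholders; urlencode and parse_qs stand in for what the browser and the server framework do with a form submission.

```python
import html
from urllib.parse import parse_qs, urlencode

# What the user sees and edits in the textarea.
typed = "<div> & more"

# The browser submits the form as application/x-www-form-urlencoded,
# so special characters are percent-encoded, not written as HTML entities.
body = urlencode({"somename": typed})

# The server decodes the body and gets the raw characters back.
received = parse_qs(body)["somename"][0]

# Before the value goes into an HTML document again, encode it as HTML.
for_html = html.escape(received, quote=False)

print(body)
print(for_html)
```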
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: &trade;";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&trade;".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&amp;lt;div</textarea>
Essentially, this means that the & in &lt; has to be HTML-encoded to &amp;. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode < and >, because otherwise <div> would be a normal HTML element to the browser:
<textarea name="somename">&lt;div&gt;</textarea>
Having said this, if you want to display &lt;div&gt; inside a textarea, you'll have to encode the & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
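As an illustration of the extra level of encoding, here is a small Python sketch. The function render_textarea is hypothetical, and the stored value is assumed to be the literal characters ampersand, l, t, semicolon followed by div.

```python
import html


def render_textarea(name: str, value: str) -> str:
    """Return textarea markup whose visible content equals `value` (hypothetical helper)."""
    safe_name = html.escape(name, quote=True)
    # Encoding & here is what keeps an entity-looking value from being decoded.
    safe_value = html.escape(value, quote=False)
    return f'<textarea name="{safe_name}">{safe_value}</textarea>'


stored = "&lt;div"  # literal text from the database, not markup
page = render_textarea("somename", stored)
print(page)

# The browser decodes one level when it parses the page, which gives back
# the stored characters unchanged.
shown = html.unescape(html.escape(stored, quote=False))
```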
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax
({
type: "GET",
url: "url-with-the-troubled-content.php",
success: function(data)
{
document.getElementById('id-of-text-area').value = data;
}
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text to show from the database before letting it into the text area:
myString = Replace(myString, "&", "&amp;")
myString = Replace(myString, "<", "&lt;")
Replacement no. 1 tricks the textarea into showing the codes.
Replacement no. 2: without this replacement you cannot show the text "</textarea>" inside the textarea (it would end the textarea tag).
(Asp / vbscript code above, translate to a replace method of your language choice)
I found an alternative solution for reading and working with the content in the browser: simply read the element's text() using jQuery. It returns the characters as they are displayed and allows me to write from a textarea to a div's innerHTML via html()...
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
&trade;
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render HTML or entities. By storing text in a script tag, it will remain unadulterated. The problem is that the browser will try to execute it as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent it from executing, we change the MIME type to text/plain instead of its default, which is text/javascript. This will prevent it from running.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.
I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
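The capture-group approach can be sketched in Python, whose regex syntax accepts the question's pattern as written. The function name text_runs and the sample input are illustrative; the only change to the pattern is the added text alternative, written with + instead of * so that it never produces empty matches. This is a sketch of the technique, with all the limitations of regex-based HTML handling.

```python
import re

# The question's markup pattern plus one extra alternative for leftover text.
PATTERN = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole script/style blocks
    r"|(?:<!--[\s\S]*?-->)"                            # comments
    r"|(?:<[\s\S]*?>)"                                 # any other tag
    r"|(?P<text>[^<>]+)",                              # what is left: text
    re.IGNORECASE,
)


def text_runs(source: str) -> list[str]:
    """Collect the non-empty text runs between the markup matches."""
    runs = []
    for match in PATTERN.finditer(source):
        text = match.group("text")
        if text and text.strip():
            runs.append(text.strip())
    return runs


sample = "<p>Hello <b>world</b></p><script>var a = 1 < 2;</script><!-- note -->Bye"
print(text_runs(sample))
```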
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
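For comparison, a parser-based version of the text extraction might look like the following Python sketch, which uses the standard library's html.parser instead of a regex. The class and function names are made up for this example, and it is only one of several parsers that could be used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment instead.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return parser.chunks


# An attribute value containing an unescaped > does not confuse the parser.
sample = '<p title="a > b">Fish &amp; chips</p><style>p { color: red }</style><!-- hidden -->'
print(extract_text(sample))
```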
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
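The same replace-only-in-text idea can be written in Python for comparison, since re.sub accepts a function in the same way Regex.Replace accepts a MatchEvaluator. The function name replace_in_text and the sample markup are illustrative, and the text alternative uses + rather than * to avoid empty matches. The caveats about regex and HTML raised in the other answers still apply.

```python
import re

PATTERN = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
    r"|(?P<text>[^<>]+)",
    re.IGNORECASE,
)


def replace_in_text(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` only in text runs, leaving markup untouched."""

    def evaluate(match):
        text = match.group("text")
        if text:
            return text.replace(old, new)
        # Tags, comments and script/style blocks are returned unchanged.
        return match.group(0)

    return PATTERN.sub(evaluate, source)


sample = '<a href="word.html" title="word">word play</a><script>var word = 1;</script>'
result = replace_in_text(sample, "word", "term")
print(result)
```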
For your information:
instead of regex, with jQuery it's possible to extract the text alone from HTML markup. For that you can use the following pattern.
$("<div/>").html("#elementId").text()
You can refer this JSFIDDLE