Ruby: Including raw HTML in a Nokogiri HTML builder - html

I'm writing code to convert a fixed XML schema to HTML. I'm trying to use Nokogiri, and it works for most tags, e.g.:
# doc is the Nokogiri html builder, text_inline is a TextInlineContent node
def consume_inline_content?(doc, text_inline)
text = text_inline.text
case text_inline.name
when 'text'
doc.text text
when 'emphasized'
doc.em {
doc.text text
}
# ... and so on ...
end
end
The problem is, this schema also includes a rawHTML text node. Here is some of my input:
<rawHTML><![CDATA[<h2>]]></rawHTML>
Stuff
<rawHTML><![CDATA[</h2>]]></rawHTML>
which should ideally be rendered as <h2>Stuff</h2>. But when I try the "obvious" thing:
...
when 'rawHTML'
doc << text
...
Nokogiri produces <h2></h2>Stuff. It seems to be "fixing" the unbalanced open tag before I have a chance to insert its contents or closing tag.
I recognize that I'm asking about a feature that could produce malformed html, and maybe the builder doesn't want to allow that. Is there a right way to handle this situation?

Related

How can I strip HTML tags from a string in the model before I get to the view

Trying to determine how to strip the HTML tags from a string in Ruby. I need this to be done in the model before I get to the view. So using:
ActionView::Helpers::SanitizeHelperstrip_tags()
won't work. I was looking into using Nokogiri, but can't figure out how to do it.
If I have a string:
description = google
I need it to be converted to plain text without including HTML tags so it would just come out as "google".
Right now I have the following which will take care of HTML entities:
def simple_description
simple_description = Nokogiri::HTML.parse(self.description)
simple_description.text
end
You can call the sanitizer directly like this:
Rails::Html::FullSanitizer.new.sanitize('<b>bold</b>')
# => "bold"
There are also other sanitizer classes that may be useful: FullSanitizer, LinkSanitizer, Sanitizer, WhiteListSanitizer.
Nokogiri is a great choice if you don't own the HTML generator and you want to reduce your maintenance load:
require 'nokogiri'
description = 'google'
Nokogiri::HTML::DocumentFragment.parse(description).at('a').text
# => "google"
The good thing about a parser vs. using patterns, is the parser continues work with changes to the tags or format of the document, whereas patterns get tripped up by those things.
While using a parser is a little slower, it more than makes up for that by the ease of use and reduced maintenance.
The code above breaks down to:
Nokogiri::HTML(description).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>google</body></html>\n"
Rather than let Nokogiri add the normal HTML headers, I told it to parse only that one node into a document fragment:
Nokogiri::HTML::DocumentFragment.parse(description).to_html
# => "google"
at finds the first occurrence of that node:
Nokogiri::HTML::DocumentFragment.parse(description).at('a').to_html
# => "google"
text finds the text in the node.
Maybe you could use regular expression in ruby like following
des = 'google'
p des[/<.*>(.*)\<\/.*>/,1]
The result will be "google"
Regular expression is powerful.
You could customize to fit your needs.

In XML, is there a way to turn all of a node's children into a single string including the nested XML tags?

I have bits like the following in an XML file that is a data source for an HTML page that uses CSS and javascript only. The special XML codes are my own, and I want to process them with javascript.
<listitem>regular text could be in here</listitem>
<listitem>possibly with <b>HTML markup</b></listitem>
<listitem>or <special>special xml</special></listitem>
What I dream of is a way to get from .getElementsByTagName("listitem") to the following array.
["regular text could be in here", "possibly with <b>HTML markup</b>", "or <special>special xml</special>"]
That way, I could process each listitem as part of the array. However, the XML parser breaks apart all the XML for each listitem. Other than using CDATA, which gets messy, is there another way?
I think the answer is:
Array.prototype.slice.call(document.getElementsByTagName("listitem")).map(function(x) {return x.innerHTML})
It will return:
["regular text could be in here", "possibly with <b>HTML markup</b>", "or <special>special xml</special>"]

Using ruby variables as html code

I would expect that the following:
<div style="padding-top:90px;"><%= u.one_line %></div>
simply pulls whatever is in u.one_line (which in my case is text from database), and puts it in the html file. The problem I'm having is that sometimes, u.one_line has text with formatted html in it (just line breaks). For example sometimes:
u.one_line is "This is < / b r > awesome"
and I would like the page to process the fact that there's a line break in there... I had to put it with spaces up ^^^ here because the browser would not display it otherwise on stackoverflow. But on my server it's typed correctly, unfortunately instead of the browser processing the line break, it prints out the "< / b r>" part...
I hope you guys understand what I mean :(?
always remember to use raw or html_safe for html output in rails because rails by default auto-escapes html content for protecting against XSS attacks.
for more see
When to use raw() and when to use .html_safe

How can i convert/replace every newline to '<br/>'?

set tabstop=4
set shiftwidth=4
set nu
set ai
syntax on
filetype plugin indent on
I tried this, content.gsub("\r\n","<br/>") but when I click the view/show button to see the contents of these line, I get the output/result=>
set tabstop=4<br/> set shiftwidth=4<br/> set nu<br/> set ai<br/> syntax on<br/> filetype plugin indent on
But I tried to get those lines as a seperate lines. But all become as a single line. Why?
How can I make all those lines with a html break (<br/>) ?
I tried this, that didn't work.
#addpost = Post.new params[:data]
#temptest = #addpost.content.html_safe
#addpost.content = #temptest
#logger.debug(#addpost)
#addpost.save
Also tried without saving into database. Tried only in view layer,<%= t.content.html_safe %> That didn't work too.
Got this from page source
vimrc file <br/>
2011-12-06<br/><br/>
set tabstop=4<br/><br/>set shiftwidth=4<br/><br/>set nu<br/><br/>set ai<br/><br/>syntax on<br/><br/>filetype plugin indent on<br/>
Edit
Delete
<br/><br/>
An alternative to convert every new lines to html tags <br> would be to use css to display the content as it was given :
.wrapped-text {
white-space: pre-wrap;
}
This will wrap the content on a new line, without altering its current form.
You need to use html_safe if you want to render embedded HTML:
<%= #the_string.html_safe %>
If it might be nil, raw(#the_string) won't throw an exception. I'm a bit ambivalent about raw; I almost never try to display a string that might be nil.
With Ruby On Rails 4.0.1 comes the simple_format from TextHelper. It will handle more tags than the OP requested, but will filter malicious tags from the content (sanitize).
simple_format(t.content)
Reference : http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html
http://www.ruby-doc.org/core-1.9.3/String.html
as it says there gsub expects regex and replacement
since "\n\r" is a string you can see in the docs:
if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backlash followed by ‘d’, instead of a digit.
so you are trying to match "\n\r", you probably want a character class containing \n or \r -[\n\r]
a = <<-EOL
set tabstop=4
set shiftwidth=4
set nu
set ai
syntax on
filetype plugin indent on
EOL
print a.gsub(/[\n\r]/,"<br/>\n");
I'm not sure I exactly follow the question - are you seeing the output as e.g. preformatted text, or does the source HTML have those tags? If the source HTML has those tags, they should appear on new lines, even if they aren't on line breaks in the source, right?
Anyway, I'm guessing you're dealing with automatic string escaping. Check out this other Stack Overflow question
Also, this: Katz talking about this feature

How to stop an html TEXTAREA from decoding html entities

I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
<div
whenever its printed into a html textarea tag, the source code of the page shows the > as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Hergé's adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (<) but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
<div
Then the HTML must be:
<textarea>&lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
NB 1: If you were using XHTML[2] (and really using it, it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[<div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
Ok , but the question is . why it decodes them anyway ? assuming i've added & , save the textarea , ti will be saved < , but displayed as < , saving it again will convert it back to < (but it will remain < in the database) , saving again will save it a < in the database , why the textarea decodes it ?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: ™";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "™".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&lt;div</textarea>
Essentially, this means that the & in < has to be html encoded to &. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode <> because otherwise, <div> would be a normal HTML element to the browser:
<textarea name="somename"><div></textarea>
Having said this, if you want to display <div> inside a textarea, you'll have to encode & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax
({
type: "GET",
url: "url-with-the-troubled-content.php",
success: function(data)
{
document.getElementById('id-of-text-area').value = data;
}
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text to show from the database before letting it into the text area:
myString = Replace(myString, "&", "&")
myString = Replace(myString, "<", "<")
Replace n:o 1 to trick the textarea to show the codes.
replace n:o 2: Without this replacement you can not show the word "" inside the textarea (it would end the textarea tag).
(Asp / vbscript code above, translate to a replace method of your language choice)
I found an alternative solution for reading and working with in-browser, simply read the element's text() using jQuery, it returns the characters as display characters and allows me to write from a textarea to a div's innerHTML using the property via html()...
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
™
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render html nor entities. By storing text in a script tag, it will remain unadultered-- problem is it will try to execute as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent that, we change the mime-type to text/plain instead of it's default, which is text/javascript. This will prevent it from running.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.