CL-WHO HTML generator to file - html

I'm trying to write generated HTML to a file. I'm using with-html-output-to-string, but I can't figure out how to send the output to a file. I'm not sure whether I should use a file stream or with-open-file, or what the syntax would be. I've been at this for a day, but my code just doesn't run.

CL-USER> (who:with-html-output-to-string (out nil :prologue t :indent t)
(:html
(:head
(:title "home"))
(:body
(:p "Hello cl."))))
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">
<html>
<head>
<title>home
</title>
</head>
<body>
<p>Hello cl.
</p>
</body>
</html>"

Related

Stop Nokogiri from adding DOCTYPE and meta tags?

I'm trying to use Nokogiri to convert some template files from one format to another. But it keeps adding tags. I'm trying to prevent it from adding Doctype and meta tags, but can't figure it out. I've tried
@doc = Nokogiri::HTML.parse(r)
but that adds the tags. I've also tried
@doc = Nokogiri::HTML.fragment(r)
as suggested in "How to prevent Nokogiri from adding <DOCTYPE> tags?", but that removes any <html>, <head>, or <body> tags that are in the document.
If it matters, my code for reading the file is:
f = File.read(infile)
r = f.gsub(/<tmpl_var ([^>]*)>/, '{{{\1}}}')
@doc = Nokogiri::HTML.fragment(r)
I need to do a gsub beforehand because I need to replace <tmpl_var> tags which aren't proper HTML and cause more problems.
When using HTML.fragment(r), I do get an htmlParseStartTag: misplaced <html> tag error (as well as similar errors for <body> and <head>).
Is there a way to prevent it from making these additions?
An example conversion:
Before:
<html>
<head>
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using Parse:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using HTML.fragment or HTML::DocumentFragment.parse:
<script>
var x = "y";
</script>
<div>
Stuff
</div>
In this case, I want it to just output the before section. (In the real script I make a bunch of changes though).
Nokogiri can be told to not add the standard HTML headers. Consider these:
require 'nokogiri'
doc = Nokogiri::HTML('<p>foo</p>')
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
doc = Nokogiri::HTML.fragment('<p>foo</p>')
doc.to_html # => "<p>foo</p>"
tmpl_var is a bad tag name in HTML, as is {{{\1}}}, so asking Nokogiri to try to parse either will result in problems:
doc = Nokogiri::HTML.fragment('<templ_var p1="baz">foo</templ_var>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag templ_var invalid>]
But you can still munge the DOM:
doc.to_html # => "<templ_var p1=\"baz\">foo</templ_var>"
doc.search('templ_var').each { |t| t.name = 'bar'}
doc.to_html # => "<bar p1=\"baz\">foo</bar>"
Or:
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
Putting that stuff together, plus a bit of chicanery:
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
header = Nokogiri::XML.fragment('<html><body>')
header.at('body').children = doc
header.to_html # => "<html><body><div>{{{\\1}}}</div></body></html>"
So, I'd go after it something like that.
Now, why is Nokogiri stripping the <html> tag when parsing a fragment? I don't know. It leaves <body> alone if <head> or <html> is missing:
Nokogiri::HTML.fragment('<p>foo<p>').to_html
# => "<p>foo</p><p></p>"
Nokogiri::HTML.fragment('<body><p>foo<p></body>').to_html
# => "<body>\n<p>foo</p>\n<p></p>\n</body>"
But it gets funky if <head> or <html> exists:
Nokogiri::HTML.fragment('<head><style></style></head><body><p>foo<p></body>').to_html
# => "<style></style><p>foo</p><p></p>"
Nokogiri::HTML.fragment('<html><head><style></style></head><body><p>foo<p></body></html>').to_html
# => "<style></style><p>foo</p><p></p>"
That smells like a bug in Nokogiri to me as I haven't seen anything to document that behavior.
You can get around this by using Nokogiri::XML::DocumentFragment instead of Nokogiri::HTML::DocumentFragment. The XML version won't remove the html, head, or body tags.

MVC Mailer layout is not rendering any HTML other than the RenderBody method

Hi I am using MVC Mailer to manage creating and sending emails in my application. It will create and send the email fine but any html I insert inside the body in the layout is not in the email.
Mailer
public class Mailer : MailerBase, IMailer
{
public Mailer()
{
MasterName = "_EmailLayout";
}
public virtual MvcMailMessage RequestAccess(RequestAccessViewModel viewmodel)
{
ViewData.Model = viewmodel;
return Populate(x =>
{
x.Subject = "RequestAccess for Data";
x.ViewName = "RequestAccess";
x.To.Add("AppTeam@groups.hp.com");
x.From = new MailAddress(viewmodel.Email);
});
}
}
I am setting it to use _EmailLayout here. I changed the name after seeing that naming it _Layout caused an issue, because it would conflict with any other files named _Layout.
_EmailLayout
<html>
<head>
</head>
<body>
<h1>Mailer</h1>
@RenderBody()
Thanks
</body>
The contents of the H1 tag or "Thanks" are not in the email
Access.cshtml
<h3>"This is a Application email." </h3>
<p>@Model.Message</p>
<br/>
<p>Regards</p>
<p>@Model.Name</p>
<p>Business Area: @Model.BusinessArea</p>
Email Source
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><title></title>
</head>
<body>
<p> Hi jeff test,</p>
<br>
<p>Thank you for your enquiry about the Application.</p>
<br>
</body>
Has anyone come across this issue before? When I debug my application I can see that it is going into the _EmailLayout, but I don't know why the HTML in that file is not rendered.
After I posted this issue on the GitHub page for MVC Mailer, changing the layout code to the following fixed the problem:
<html>
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
Mailer
@RenderBody()
Thanks
</body>
</html>
I'm not sure why this fixed the problem but it did.

libxml2 fails to handle CDATA in HTML correctly

I'm using libxml2 2.7.3 to parse HTML pages and I'm having difficulty getting it to work correctly with CDATA in HTML. Here's the code:
xmlDocPtr doc = htmlReadMemory(data, length, "", NULL, 0);
xmlBufferPtr buffer = xmlBufferCreate();
xmlNodeDump(buffer, doc, doc->children, 0, 0);
printf("%s", (char*)buffer->content);
and the HTML data:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<div>
<script type="text/javascript">
//<![CDATA[
document.write('</div>');
//]]>
</script>
</div>
</body></html>
The parser erroneously recognizes the </div> inside the quotes as a real html tag and prints out error messages as follows:
:8: HTML parser error : Unexpected end tag : script
</script>
^
:9: HTML parser error : Unexpected end tag : div
</div>
^
And the result printed out and debugging also imply that parsing went wrong:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<div>
<script type="text/javascript"><![CDATA[
//<![CDATA[
document.write(']]></script></div>');
//]]>
</body></html>
So the question is, is this a bug in libxml2? Or am I doing something wrong?
Any insight or advice would be greatly appreciated.
Thanks!
In HTML, the <script> element contains CDATA by definition, so <![CDATA[ has no effect.
In short, the source document is broken.
That section would be more properly written as:
<script type="text/javascript">
document.write('<\/div>');
</script>
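If the inline script is generated by a program rather than written by hand, the same escaping can be applied mechanically. Here is a minimal Ruby sketch, assuming a hypothetical helper named escape_script_body (my own name, not part of libxml2 or any library) and assuming that "</" only ever appears inside JavaScript string literals:

```ruby
# Hypothetical helper: rewrite every "</" as "<\/" so that an HTML parser
# cannot mistake text inside an inline script for an end tag.
# Inside a JavaScript string literal "\/" is just "/", so the script
# behaves the same. Assumption: "</" occurs only inside string literals.
def escape_script_body(js)
  # Block form: the block's return value is inserted literally, with no
  # backslash processing by gsub.
  js.gsub('</') { '<\/' }
end

puts escape_script_body("document.write('</div>');")
```

The helper only touches the script text; the surrounding <script> and </script> tags are written as usual.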

Find a C# HTML parser that finds all <script> tags and gives me the line and position info

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>title</title>
</head>
<body>
I want to get this text
<script>
var test=function()
{}
</script>
</body>
</html>
and the result is:
line:7,
position :4
content:
var test=function()
{}
Have you tried the HTML Agility Pack?
This typically works quite well and gives you a nice intuitive interface into parsing HTML content.
You should be able to use it something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load("yourfile.html");
foreach (HtmlNode script in doc.DocumentNode.SelectNodes("//script"))
{
// script.Line and script.LinePosition give the location in the source;
// script.InnerText gives the content of the script node
}

Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document setting inner_html to nil but then really you'd have to do this in reverse as the first element (root) has an inner_html of the entire rest of the document, so ideally I'd have to start at the inner most element and set inner_html to nil whilst moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking perhaps regexes might do it, but probably not as efficiently as an HTML tokenizer/parser might.
This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
You can scan the string to create an array of "tokens", and then only select those that are html tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for html tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
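The tag-only scan can be wrapped in a small helper. This is a sketch using only the Ruby standard library; the name tags_only is my own, and the pattern assumes no literal ">" appears inside attribute values, comments, or script bodies, where a regex will split tags in the wrong place:

```ruby
# Keep only the markup of an HTML string: collect every tag with scan
# and join the tags back together. Text between tags is discarded.
# Assumption: no literal ">" inside attribute values, comments or scripts.
def tags_only(html)
  html.scan(/<\/?[^>]+>/).join
end

some_html = "<div>foo bar</div><p>I like <em>this</em> stuff " \
            "<a href='http://foo.bar'> long time</a></p>"
puts tags_only(some_html)
```

For documents where that assumption does not hold, a real parser is the safer choice.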
To grab everything not in a tag, you can use nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
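For the single-search variant, the per-tag paths can be joined with the XPath union operator "|". The sketch below only builds the expression string, using the standard library; the helper name text_xpath_for is my own, and passing the result to doc.search assumes the Nokogiri document from the examples above. I have not measured whether it is any faster.

```ruby
# Build one XPath expression that selects the text nodes directly under
# any of the given tags, e.g. "//p/text() | //span/text()".
def text_xpath_for(tags)
  tags.map { |tag| "//#{tag}/text()" }.join(' | ')
end

whitelist = ['p', 'span', 'strong', 'i', 'b']
xpath = text_xpath_for(whitelist)
puts xpath

# With a Nokogiri document in `doc`, a single search would then be:
#   doc.search(xpath).text
```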
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
def strip_text doc
Nokogiri(doc).tap { |doc|
doc.traverse do |node|
node.content = nil if node.text?
end
}.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
def test_that_all_text_gets_strippped_from_the_document
dirty, clean = YAML.load DATA
assert_equal clean, strip_text(dirty)
end
end
__END__
---
- |
<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
<head>
<meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
<title>Test HTML Document</title>
<meta http-equiv='content-language' content='en' />
</head>
<body>
<h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
<div class='main'>
<p>
<strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
</p>
</div>
</body>
</html>
- |
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
<meta http-equiv="content-language" content="en">
</head>
<body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
</html>