In an example script that prints HTML, it looks to me as though the body tag is never closed. However, I have no previous experience with Perl. Is this example incorrect, or is there something else that closes the body?
print "Content-type: text/html\n\n";
print "<html>\n<head>\n<title>\nPerl CGI
Example\n</title>\n<body>\n<h1>Hello,
World!</h1>\nYour user agent is: <b>\n";
print $cgi_object->user_agent();
print "<b>.</html>\n";
Where there is a . on the last line it looks to me like it should be </body>
You aren't missing anything; that code simply doesn't generate an end tag for the body element. That tag (unlike the missing Doctype) is optional in HTML anyway, so the browser closes the element when it parses the end tag for the html element.
It would be better written something more like this:
#!/usr/bin/env perl
use strict;
use warnings;
use CGI;
use Template;
my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8');
my $ua = $cgi->user_agent();
my $tt = Template->new();
$tt->process(\*DATA, { ua => $ua });
__END__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Perl CGI Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Your user agent is: <em>[% ua | html %]</em>.</p>
</body>
</html>
And better still if you ditched CGI and used PSGI/Plack.
How do I call a Perl script that returns data of type "text" from inside an HTML file? I am posting the Perl script for reference. The script has to be called from index.html, so how do I do that?
#!/usr/bin/perl -w
use CGI;
my $cgi = CGI->new;
my $remote = $cgi->remote_host();
my $AB = "16.108";
my $CD = "16.214";
my $EF = "10.99";
my $GH = "10.243";
my $XY = "10.179";
my @remote_ip_values = split(/\./, $remote);
my $remote_ip = $remote_ip_values[0] .".". $remote_ip_values[1];
print "Content-type: text/html\n\n";
print "document.write(\"<style>\\n\");\n";
print "document.write(\"font color:red;\\n\");\n";
if ($remote_ip eq $AB)
{
print "document.write(\"SUCCESSFUL\");\n";
}
else
{
print "document.write(\"TEST\");\n";
}
exit;
I am editing my answer as I now understand what you are trying to do. I have set up a simple example which will load the content from "another website" - which for you will be your Perl CGI.
This HTML uses some simple jQuery. I am loading the following HTML page, http://chocksaway.com/tester.html, and copying its contents into the "test" div:
<!DOCTYPE HTML>
<html>
<head>
<title>Loader</title>
</head>
<body>
<div id="test"></div>
<script type="text/javascript" src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
$('#test').load('http://chocksaway.com/tester.html');
});
</script>
</body>
</html>
http://chocksaway.com/tester.html has the following HTML:
<html>
<header></header>
<body>
this is a test so there
</body>
</html>
If you open http://chocksaway.com/tester2.html, you will see:
this is a test so there
Replace "http://chocksaway.com/tester.html" in the load() call with the URL of your Perl CGI.
I want to change the charset in the "http-equiv" content-type tag. Because I'm working with Nokogiri in other parts of my code I'd like to use it for this processing step too.
This is example code:
http_equiv = doc.at('meta[@http-equiv]')
if !http_equiv.nil? && !http_equiv["http-equiv"].nil? && http_equiv["http-equiv"].downcase.eql?("content-type")
http_equiv["content"] = "text/html; charset=utf-8"
end
content = doc.to_html.encode(Encoding::UTF_8)
The problem is that the input content is always the same as the output content. Nokogiri didn't do anything.
Based on an answer, I created a real-world example which, in contrast to the generated example, does not work.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.spiegel.de/politik/deutschland/hooligans-gegen-salafisten-demo-in-koeln-eskaliert-a-999401.html"))
content_type = doc.at('meta[@http-equiv="Content-Type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<meta http-equiv="content-type" content="text/html">
</head>
<body>
foo
</body>
</html>
EOT
content_type = doc.at('meta[@http-equiv="content-type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=US-ASCII">
</head>
<body>
foo
</body>
</html>
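You can also do
content_type['content'] += '; charset=UTF-8'
if you're only appending to the existing value. Use += rather than <<, because the attribute reader returns a copy of the string, so appending to that copy in place would not change the document.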
It doesn't change the content-type.
It changes the content type in the tag. However, there is more to it: it seems you don't really want to change the content-type marker; you want to change the encoding of the document itself at output. Once you do that, Nokogiri will also change the meta tag to match:
doc.to_html(encoding: 'UTF-8')
will tell Nokogiri to output the HTML, trying to convert from ISO-8859-1 to UTF-8. There is no guarantee that will occur correctly though, because there are some incompatibilities.
Your original attempt using:
content = doc.to_html.encode(Encoding::UTF_8)
won't work correctly, because of HTML encoding that occurs on special characters. You have to change the character encoding before they are HTML-encoded, which should happen if you use to_html(encoding: 'UTF-8').
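To illustrate the ordering problem, here is a minimal sketch that uses only Ruby's String API, with no Nokogiri involved. It assumes the serializer has already written é as the numeric entity &#233;, which leaves a pure-ASCII string; the sample strings and variable names are made up for illustration.

```ruby
# Case 1: the non-ASCII character was already turned into a numeric entity,
# so the string is plain ASCII and re-encoding it changes no bytes.
escaped = 'caf&#233;'.dup.force_encoding(Encoding::US_ASCII)
reencoded = escaped.encode(Encoding::UTF_8)
unchanged = (escaped.bytes == reencoded.bytes)

# Case 2: the raw ISO-8859-1 byte (0xE9) is converted before any escaping.
raw = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
utf8 = raw.encode(Encoding::UTF_8)

puts unchanged           # true
puts utf8.bytes.inspect  # [99, 97, 102, 195, 169]
```

In the first case the entity stays an entity, which is why calling encode on the finished HTML string appears to do nothing.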
I'm trying to use Nokogiri to convert some template files from one format to another. But it keeps adding tags. I'm trying to prevent it from adding Doctype and meta tags, but can't figure it out. I've tried
@doc = Nokogiri::HTML.parse(r)
but that adds the tags. I've also tried
@doc = Nokogiri::HTML.fragment(r)
as suggested in "How to prevent Nokogiri from adding <DOCTYPE> tags?", but that removes any <html>, <head>, or <body> tags that are in the document.
If it matters, my code for reading the file is:
f = File.read(infile)
r = f.gsub(/<tmpl_var ([^>]*)>/, '{{{\1}}}')
@doc = Nokogiri::HTML.fragment(r)
I need to do a gsub beforehand because I need to replace <tmpl_var> tags which aren't proper HTML and cause more problems.
When using HTML.fragment(r), I do get an htmlParseStartTag: misplaced <html> tag error (as well as similar errors for <body> and <head>).
Is there a way to prevent it from making these additions?
An example conversion:
Before:
<html>
<head>
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using Parse:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using HTML.fragment or HTML::DocumentFragment.parse:
<script>
var x = "y";
</script>
<div>
Stuff
</div>
In this case, I want it to just output the before section. (In the real script I make a bunch of changes though).
Nokogiri can be told to not add the standard HTML headers. Consider these:
require 'nokogiri'
doc = Nokogiri::HTML('<p>foo</p>')
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
doc = Nokogiri::HTML.fragment('<p>foo</p>')
doc.to_html # => "<p>foo</p>"
tmpl_var is a bad tag name in HTML, as is {{{\1}}}, so asking Nokogiri to try to parse either will result in problems:
doc = Nokogiri::HTML.fragment('<templ_var p1="baz">foo</templ_var>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag templ_var invalid>]
But you can still munge the DOM:
doc.to_html # => "<templ_var p1=\"baz\">foo</templ_var>"
doc.search('templ_var').each { |t| t.name = 'bar'}
doc.to_html # => "<bar p1=\"baz\">foo</bar>"
Or:
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
Putting that stuff together, plus a bit of chicanery:
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
header = Nokogiri::XML.fragment('<html><body>')
header.at('body').children = doc
header.to_html # => "<html><body><div>{{{\\1}}}</div></body></html>"
So, I'd go after it something like that.
Now, why is Nokogiri stripping the <html> tag when parsing a fragment? I don't know. It leaves <body> alone if <head> or <html> is missing:
Nokogiri::HTML.fragment('<p>foo<p>').to_html
# => "<p>foo</p><p></p>"
Nokogiri::HTML.fragment('<body><p>foo<p></body>').to_html
# => "<body>\n<p>foo</p>\n<p></p>\n</body>"
But it gets funky if <head> or <html> exists:
Nokogiri::HTML.fragment('<head><style></style></head><body><p>foo<p></body>').to_html
# => "<style></style><p>foo</p><p></p>"
Nokogiri::HTML.fragment('<html><head><style></style></head><body><p>foo<p></body></html>').to_html
# => "<style></style><p>foo</p><p></p>"
That smells like a bug in Nokogiri to me as I haven't seen anything to document that behavior.
You can get around this by using Nokogiri::XML::DocumentFragment instead of Nokogiri::HTML::DocumentFragment. The XML version won't remove the html, head, or body tags.
I have to display a file in HTML table format.
I tried this but I cannot get any output.
use CGI qw(:standard);
my $line;
print '<HTML>';
print "<head>";
print "</head>";
print "<body>";
print "<p>hello perl am html</p>";
print "</body>";
print "</html>";
A CGI program must output the HTTP headers before it outputs any content. At a minimum, it must supply an HTTP Content-Type header.
Add:
my $q = CGI->new;
print $q->header('text/html; charset=utf-8');
… before you output any HTML.
(You should also write valid HTML, so include a Doctype and <title>).
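The ordering rule is not specific to Perl. As a rough sketch in Ruby, assuming a hypothetical helper named cgi_response (it is not part of any library), a CGI response is just the header lines, one blank line, and then the body:

```ruby
# Build a complete CGI response: header line, a blank line, then the body.
# cgi_response is a made-up helper name used only for this illustration.
def cgi_response(body, content_type: 'text/html; charset=utf-8')
  "Content-Type: #{content_type}\r\n\r\n#{body}"
end

html = "<!DOCTYPE html>\n<html><head><title>Hello</title></head>" \
       "<body><p>hello perl am html</p></body></html>\n"
print cgi_response(html)
```

If the blank line or the Content-Type header is missing, the web server has no complete header block to pass on, which is the usual reason a CGI script shows no output.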
You should use the CGI module once you have loaded it. It makes it much simpler to follow the correct rules for an HTTP page.
As has been observed, you need to print an HTTP header before the HTML body. You can do that with print $cgi->header, which defaults to a content type of text/html and a character set of ISO-8859-1, which is adequate for many simple HTML pages. The start_html call also generates a <meta> element within the HTML that contains the same information.
This short program shows the idea. I have added a trivial table that shows how you could include that in the page. As you can see, the CGI code is much simpler than the corresponding HTML.
use strict;
use warnings;
use CGI qw/ :standard /;
print header;
print
start_html('My Title'),
p('Hello Perl am HTML'),
table(
Tr([
td([1, 2, 3]),
td([4, 5, 6]),
])
),
end_html
;
output
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>My Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<p>Hello Perl am HTML</p><table><tr><td>1</td> <td>2</td> <td>3</td></tr> <tr><td>4</td> <td>5</td> <td>6</td></tr></table>
</body>
</html>
How about this:
use CGI;
use strict;
my $q = CGI->new;
print $q->header.$q->start_html(-title=>'MyTitle');
my $tableSettings = {-border=>1, -cellpadding=>0, -cellspacing=>0};
print $q->table($tableSettings, $q->Tr($q->td(['column1', 'column2', 'column3'])));
print $q->end_html;
Output:
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>MyTitle</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<table border="1" cellspacing="0" cellpadding="0"><tr><td>column1</td> <td>column2</td> <td>column3</td></tr></table>
</body>
</html>
There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document setting inner_html to nil but then really you'd have to do this in reverse as the first element (root) has an inner_html of the entire rest of the document, so ideally I'd have to start at the inner most element and set inner_html to nil whilst moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking perhaps regexes might do it, but probably not as efficiently as an HTML tokenizer/parser might.
This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
You can scan the string to create an array of "tokens", and then only select those that are html tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for html tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
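One caveat with the regex approach, sketched here with a hypothetical tags_only helper and made-up sample strings: the pattern treats the first > it meets as the end of a tag, so an attribute value that contains > gets cut short. A parser-based approach does not have this problem.

```ruby
# Same tag pattern as above, wrapped in a helper for illustration.
TAG = /<\/?[^>]+>/

def tags_only(html)
  html.scan(TAG).join
end

simple = '<p>I like <em>this</em></p>'
tricky = '<a title="a > b">x</a>'   # attribute value contains ">"

puts tags_only(simple)  # <p><em></em></p>
puts tags_only(tricky)  # <a title="a ></a>
```

The second result is no longer well-formed markup, so the regex is only safe when you know the input never has > inside attribute values or comments.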
To grab everything not in a tag, you can use nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
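A minimal sketch of that single-expression idea, assuming two hypothetical helpers (blacklist_xpath and whitelist_xpath are names invented here). They only build XPath 1.0 strings; the result would then be passed to doc.search as in the snippets above.

```ruby
# Text nodes whose parent is none of the given tags.
# Assumes tags is non-empty; an empty list would produce an invalid "not()".
def blacklist_xpath(tags)
  excluded = tags.map { |tag| "parent::#{tag}" }.join(' or ')
  "//text()[not(#{excluded})]"
end

# Text nodes that are direct children of any of the given tags.
def whitelist_xpath(tags)
  tags.map { |tag| "//#{tag}/text()" }.join(' | ')
end

puts blacklist_xpath(%w[title script style])
# //text()[not(parent::title or parent::script or parent::style)]
puts whitelist_xpath(%w[p span strong])
# //p/text() | //span/text() | //strong/text()
```

For example, doc.search(blacklist_xpath(%w[title script style])).text would do the blacklist filtering in one query. One difference to be aware of: a single XPath query returns nodes in document order, whereas the whitelist loop above appends them tag by tag.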
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
def strip_text doc
Nokogiri(doc).tap { |doc|
doc.traverse do |node|
node.content = nil if node.text?
end
}.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
def test_that_all_text_gets_stripped_from_the_document
dirty, clean = YAML.load DATA
assert_equal clean, strip_text(dirty)
end
end
__END__
---
- |
<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
<head>
<meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
<title>Test HTML Document</title>
<meta http-equiv='content-language' content='en' />
</head>
<body>
<h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
<div class='main'>
<p>
<strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
</p>
</div>
</body>
</html>
- |
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
<meta http-equiv="content-language" content="en">
</head>
<body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
</html>