I have to display a file in HTML table format. I tried this, but I cannot get any output.
use CGI qw(:standard);
my $line;
print '<HTML>';
print "<head>";
print "</head>";
print "<body>";
print "<p>hello perl am html</p>";
print "</body>";
print "</html>";
A CGI program must output the HTTP headers before it outputs any content. At a minimum, it must supply an HTTP Content-Type header.
Add:
my $q = CGI->new;
print $q->header('text/html; charset=utf-8');
… before you output any HTML.
(You should also write valid HTML, so include a doctype and a <title>.)
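Putting that together, a minimal corrected version of your script would look something like this (a sketch, untested against your server configuration):
#!/usr/bin/env perl
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# The HTTP header must come before any other output.
print $q->header('text/html; charset=utf-8');

print <<'HTML';
<!DOCTYPE html>
<html>
<head><title>Hello</title></head>
<body>
<p>hello perl am html</p>
</body>
</html>
HTML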
You should actually use the CGI module once you have loaded it: it makes it much simpler to follow the correct rules for an HTTP response.
As has been observed, you need to print an HTTP header before the HTML body. You can do that with print $cgi->header, which defaults to a content type of text/html and a character set of ISO-8859-1; that is adequate for many simple HTML pages. It also generates a <meta> element within the HTML that carries the same information.
This short program shows the idea. I have added a trivial table to show how you could include one in the page. As you can see, the CGI code is much simpler than the corresponding HTML.
use strict;
use warnings;
use CGI qw/ :standard /;
print header;
print
    start_html('My Title'),
    p('Hello Perl am HTML'),
    table(
        Tr([
            td([1, 2, 3]),
            td([4, 5, 6]),
        ])
    ),
    end_html;
Output:
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>My Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<p>Hello Perl am HTML</p><table><tr><td>1</td> <td>2</td> <td>3</td></tr> <tr><td>4</td> <td>5</td> <td>6</td></tr></table>
</body>
</html>
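Since the original goal was to display a file as an HTML table, the same functions will happily render one row per line of the file. Here is a sketch (the file name data.txt and its comma-separated layout are my assumptions):
use strict;
use warnings;
use CGI qw/ :standard /;

print header, start_html('File as Table');

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
my @rows;
while ( my $line = <$fh> ) {
    chomp $line;
    push @rows, td( [ split /,/, $line ] );    # one <td> per comma-separated field
}
close $fh;

print table( { -border => 1 }, Tr( \@rows ) ), end_html;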
How about this:
use CGI;
use strict;
my $q = CGI->new;
print $q->header.$q->start_html(-title=>'MyTitle');
my $tableSettings = {-border=>1, -cellpadding=>0, -cellspacing=>0};
print $q->table($tableSettings, $q->Tr($q->td(['column1', 'column2', 'column3'])));
print $q->end_html;
Output:
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>MyTitle</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<table border="1" cellspacing="0" cellpadding="0"><tr><td>column1</td> <td>column2</td> <td>column3</td></tr></table>
</body>
</html>
Related
Does anyone know how to fix this? The text shows as "Sugest�es" (it should be "Sugestões").
I have an ASP page that gets data from MySQL. The database and tables default to UTF-8, my page is encoded as UTF-8 without a BOM, and it has the following code:
<%@Language="VBScript" CodePage = 65001%>
<%
Response.ContentType = "text/html"
Response.AddHeader "Content-Type", "text/html; charset=utf-8"
Response.AddHeader "Pragma", "no-cache"
response.Charset="utf-8"
Response.CodePage = 65001
Session.LCID = 2070
%>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pt" lang="pt">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
Any ideas?
Thanks
Edit: By the way, other data that I get from the database on this page works fine; only data from this one table is affected, and I have already checked that all the tables are set up the same.
I am trying to retrieve the output from a URL using an XMLHTTP GET.
The output in the browser when I hit the URL directly is the following:
{
  "Titles": {
    "resultCount": 37680,
    "moreResources": true
  }
}
The ASP code on test.asp I am using is:
<%@language=JScript%>
<%
var objSrvHTTP;
objSrvHTTP = Server.CreateObject ("Msxml2.ServerXMLHTTP.6.0");
objSrvHTTP.open ("GET","http://someipaddress:8080/Publisher/Titles/Paging/0,0,tc?output=json", false);
objSrvHTTP.send ();
Response.ContentType = "application/json";
Response.Write (objSrvHTTP.responseText);
%>
The result displayed in the browser when hitting test.asp is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>something</title>
</head>
<body>
{
  "Titles": {
    "resultCount": 37698,
    "moreResources": true
  }
}
</body>
</html>
I am looking to have just the data between the body tags returned, or even better just the value for "resultCount". Any help would be much appreciated.
You need to remove the HTML markup: when returning JSON, the response should contain nothing but valid JSON. Presumably test.asp contains a static HTML template around the <% ... %> block; delete everything outside that block so only the Response.Write output is sent.
I want to change the charset in the "http-equiv" content-type tag. Because I'm working with Nokogiri in other parts of my code, I'd like to use it for this processing step too.
This is example code:
http_equiv = doc.at('meta[@http-equiv]')
if !http_equiv.nil? && !http_equiv["http-equiv"].nil? && http_equiv["http-equiv"].downcase.eql?("content-type")
  http_equiv["content"] = "text/html; charset=utf-8"
end
content = doc.to_html.encode(Encoding::UTF_8)
The problem is that the input content is always the same as the output content; Nokogiri doesn't seem to do anything.
Based on an answer, I created a real-world example, which doesn't work, in contrast to the constructed example.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.spiegel.de/politik/deutschland/hooligans-gegen-salafisten-demo-in-koeln-eskaliert-a-999401.html"))
content_type = doc.at('meta[@http-equiv="Content-Type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<meta http-equiv="content-type" content="text/html">
</head>
<body>
foo
</body>
</html>
EOT
content_type = doc.at('meta[@http-equiv="content-type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=US-ASCII">
</head>
<body>
foo
</body>
</html>
You can also do
content_type['content'] << '; charset=UTF-8'
if you're only appending to the existing value.
It doesn't change the content-type.
It does change the content type in the tag; however, there is more to it. It seems you don't just want to change the content-type marker, you want to change the encoding of the document itself on output. Once you do that, Nokogiri will also change the meta tag to match:
doc.to_html(encoding: 'UTF-8')
will tell Nokogiri to output the HTML, trying to convert it from ISO-8859-1 to UTF-8. There is no guarantee the conversion will be correct, though, because the two encodings have some incompatibilities.
Your original attempt using:
content = doc.to_html.encode(Encoding::UTF_8)
won't work correctly because of the HTML entity encoding applied to special characters. You have to change the character encoding before the entities are written, which is what happens when you use to_html(encoding: 'UTF-8').
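In other words, only the final serialization step of the real-world example above needs to change (a short sketch of the difference):
# Too late: entities for special characters have already been written
# by the time String#encode runs on the serialized HTML.
content = doc.to_html.encode(Encoding::UTF_8)

# Converts the document first, and rewrites the <meta> tag to match.
content = doc.to_html(encoding: 'UTF-8')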
In an example script that prints HTML, it looks to me like the body tag is not closed. However, I have never used Perl before. Is this example incorrect, or is there something else that means body is closed?
print "Content-type: text/html\n\n";
print "<html>\n<head>\n<title>\nPerl CGI
Example\n</title>\n<body>\n<h1>Hello,
World!</h1>\nYour user agent is: <b>\n";
print $cgi_object->user_agent();
print "<b>.</html>\n";
Where there is a . on the last line, it looks to me like it should be </body>.
You aren't missing anything: that code simply doesn't generate an end tag for the body element. But that tag (unlike the missing doctype) is optional in HTML anyway, so the element will be closed by the browser when it parses the end tag for the html element.
It would be better written something more like this:
#!/usr/bin/env perl
use strict;
use warnings;
use CGI;
use Template;
my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8');
my $ua = $cgi->user_agent();
my $tt = Template->new();
$tt->process(\*DATA, { ua => $ua });
__END__
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Perl CGI Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Your user agent is: <em>[% ua | html %]</em>.</p>
</body>
</html>
And it would be better still if you ditched CGI and used PSGI/Plack.
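For comparison, a minimal PSGI version of the same page might look something like this (a sketch, assuming Plack is installed; save it as app.psgi and run it with plackup):
use strict;
use warnings;
use Plack::Request;

my $app = sub {
    my $env = shift;
    my $req = Plack::Request->new($env);

    # Naively HTML-escape the user agent string before embedding it.
    my $ua = $req->user_agent // '';
    $ua =~ s/&/&amp;/g;
    $ua =~ s/</&lt;/g;
    $ua =~ s/>/&gt;/g;

    my $body = <<"HTML";
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Perl PSGI Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Your user agent is: <em>$ua</em>.</p>
</body>
</html>
HTML

    return [ 200, [ 'Content-Type' => 'text/html; charset=utf-8' ], [ $body ] ];
};

$app;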
There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all the HTML for you easily and quickly.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document setting inner_html to nil, but then you would really have to do this in reverse, since the first element (the root) has an inner_html of the entire rest of the document; ideally I'd start at the innermost element and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser would.
This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
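For example, here is a quick sketch of what that leaves behind:
require 'nokogiri'

doc = Nokogiri::HTML("<div>foo bar</div><p>I like <em>this</em></p>")
doc.xpath("//text()").remove
puts doc.to_html   # the tags survive, all text nodes are gone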
You can scan the string to create an array of "tokens", and then select only those that are HTML tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!##\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
Edit:
Or even better, just scan for html tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
To grab everything not in a tag, you can use Nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
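For instance, the whitelist version above as one big XPath union would look something like this (a sketch reusing the same whitelist):
whitelist = ['p', 'span', 'strong', 'i', 'b']
xpath = whitelist.map { |tag| "//#{tag}/text()" }.join('|')
# => "//p/text()|//span/text()|//strong/text()|//i/text()|//b/text()"
doc.search(xpath).text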
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
def strip_text doc
  Nokogiri(doc).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end
require 'test/unit'
require 'yaml'

class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html>
  <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
  <head>
  <meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
  <title>Test HTML Document</title>
  <meta http-equiv='content-language' content='en' />
  </head>
  <body>
  <h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
  <div class='main'>
  <p>
  <strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
  </p>
  </div>
  </body>
  </html>
- |
  <!DOCTYPE html>
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title></title>
  <meta http-equiv="content-language" content="en">
  </head>
  <body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
  </html>