Perl not printing the special characters - html

My scraped content is not displaying the special characters. It shows junk values in place of them (€ is printed as -aA). Thanks in advance.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
my $req = HTTP::Request->new(GET => 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html');
my $res = $ua->request($req);
die("error") unless $res->is_success;
my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);
my @node = $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die("node doesn't exist") if $#node == -1; # Line 18
open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
foreach(<@node>)
{
print HTML "$_";
}
close HTML;

Here are some observations on your code that I hope will help you
You must always check that a call to open succeeded, otherwise your program will just continue to run silently without any input or output. Rather than the idiomatic open ... or die $! you may prefer just to add use autodie at the top of your code
If the HTTP request fails, it is more informative if your program indicates why it failed instead of just saying "error". I suggest you write this instead
$res->is_success or die $res->status_line;
If you don't need any special LWP or parse options, then you can just write
my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
although that doesn't give you any way to specify the user agent string as you do currently
Rather than testing $#node for equality to -1, it is much neater to check for the truth of @node, so
die "node doesn't exist" unless @node; # Line 18
If your data contains UTF-8 characters then your output file handle must be set to the appropriate mode. You can change the mode using binmode, like this
open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
binmode HTML, ':encoding(utf-8)';
But the best way is to use the preferred three-parameter form of open, which would look like this, assuming that you have use autodie in place at the start of your program
open HTML, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';
Lexical file handles are far superior to the old-fashioned global file handles
The loop foreach(<@node>) { ... } is completely wrong because it is equivalent to foreach (glob join ' ', @node) { ... } and only appears to work because, in general, glob will leave a filename untouched if it doesn't contain any wildcards. What you meant was just for (@node) { ... }
In addition, it is bad practice to enclose a variable in quotes unless you specifically want to call its stringification method, so "$_" should be just $_
You may as well write your final output loop as
print HTML @node;
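To see why the foreach(<@node>) loop is risky, here is a minimal sketch. The two strings are hypothetical stand-ins for whatever findnodes_as_strings returns; the point is only that glob splits its argument on whitespace, so any node text containing spaces is broken into separate words.

```perl
use strict;
use warnings;

# Hypothetical strings standing in for the scraped node text
my @node = ("Costi e tempi", "Spedizione");

# <@node> interpolates the array (joined with spaces) and passes the
# result to glob, which treats every whitespace-separated word as a pattern.
my @globbed = <@node>;

# Plain iteration keeps each element intact.
my @plain = @node;

print scalar(@globbed), " items via <\@node>\n";
print scalar(@plain),   " items via plain iteration\n";
```

Text containing wildcard characters such as * or ? would fare even worse, because glob would try to match them against file names in the current directory.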
Putting these changes in place, the result looks like this, which I believe will fix your problem
use strict;
use warnings;
use autodie;
use HTML::TreeBuilder::XPath;
my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
my @node = $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die "node doesn't exist" unless @node;
open my $html_fh, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';
print $html_fh @node;
close $html_fh;
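As a rough illustration of what the encoding layer does, the sketch below writes a string containing a euro sign through an :encoding(UTF-8) handle and then reads the file back as raw bytes. The temporary file and the sample text are assumptions for the demo, not part of the original script.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempfile);

# "\x{20AC}" is the euro sign as a single Perl character.
my $text = "Spedizione: 5 \x{20AC}\n";

# Hypothetical scratch file used instead of the path from the question
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The encoding layer turns characters into UTF-8 bytes on output.
open my $out, '>:encoding(UTF-8)', $path;
print $out $text;
close $out;

# Read the file back without any decoding to inspect the bytes on disk.
open my $in, '<:raw', $path;
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters written, %d bytes on disk\n", length($text), length($bytes);
```

The euro sign occupies three bytes in UTF-8, which is why the byte count is larger than the character count. Without the layer, Perl would emit a "Wide character" warning for this string under use warnings.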

Related

Grepping data from an html file gives some random value

I have an XML file from which I am extracting some values using a regex.
The XML file looks like this-
<Instance>Fuse_Name</Instance>
<Id>8'hed</ID>
<SomeAddr>17'h00baf</SomeAddr>
<PSomeAddr>17'h00baf</PSomeAddr>
I want to retrieve the value 17'h00baf from the "SomeAddr" tag. I match the regex "SomeAddr" to reach that row in the file, and then retrieve the value with the index and substr functions, using the code below
my $i = index($row,">");
my $j = index($row,"<");
$Size_in_bits = substr $row,$i+1,$j-$i-3;
But after doing this I am not getting 17'h00baf. Instead I am getting 17'h01191. With the same approach I am able to extract other values that are decimal numbers or strings; I only face this problem with the hexadecimal values. Can somebody please tell me what is wrong with the approach?
Please don't parse XML with regexes. Use a proper XML parser.
But, ignoring that advice temporarily, I don't get the behaviour you describe when testing your code.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
next unless /<SomeAddr>/;
my $i = index($_, ">");
my $j = index($_, "<");
my $Size_in_bits = substr $_, $i + 1, $j - $i - 3;
say $Size_in_bits;
}
__END__
<Instance>Fuse_Name</Instance>
<Id>8'hed</ID>
<SomeAddr>17'h00baf</SomeAddr>
<PSomeAddr>17'h00baf</PSomeAddr>
And running it:
$ perl parsexml
17'h00baf
Of course, I've had to guess at what a lot of your code looks like because you didn't give us a complete example to test. So it looks likely that your problems are in bits of the code that you haven't shown us.
(My guess would be that there's another <SomeAddr> tag in the file somewhere.)
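One detail of the index/substr code is worth spelling out, as a sketch assuming each row looks like the sample and still carries its trailing newline. index($row, "<") finds the "<" at the very start of the row, so the length passed to substr is negative and the expression only yields the right text because a negative length trims that many characters from the end. Passing a start position to index is a less fragile variation:

```perl
use strict;
use warnings;

# Sample row from the question; the trailing newline is assumed,
# as when reading a file line by line without chomp.
my $row = "<SomeAddr>17'h00baf</SomeAddr>\n";

my $i   = index($row, ">");   # end of the opening tag
my $j   = index($row, "<");   # 0, the first character of the row
my $len = $j - $i - 3;        # negative: substr would trim from the end

# Searching for "<" from position $i onwards finds the closing tag instead.
my $close = index($row, "<", $i);
my $value = substr($row, $i + 1, $close - $i - 1);

print "$value\n";
```

This still assumes one element per line, so a real XML parser remains the safer choice.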
Never, ever use a regex to parse HTML/XML/.... Always use a proper parser and then implement your algorithm in the DOM domain.
My solution shows how to parse the XML and then extract the text content from <SomeAddr> nodes at the top-level of the XML document.
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(IO => \*DATA);
my $xpc = XML::LibXML::XPathContext->new();
# register default NS
$xpc->registerNs('default', 'http://some.domain.com/some/path/to');
foreach my $node ($xpc->findnodes('//default:SomeAddr', $doc)) {
print $node->textContent, "\n";
}
exit 0;
__DATA__
<Root xmlns="http://some.domain.com/some/path/to">
<Instance>Fuse_Name</Instance>
<Id>8'hed</Id>
<SomeAddr>17'h00baf</SomeAddr>
<PSomeAddr>17'h00baf</PSomeAddr>
</Root>
Test run
$ perl dummy.pl
17'h00baf

Creating a comment box html form with perl processing script

I currently have a script named "test.pl" that does a variety of things and prints out HTML code to view on a webpage. One of the things I want it to do, is allow the user to type in a comment, and select which type of comment it is and the processing form of the comment box will append the comment into a file. I am not sure if I am doing this right because it doesn't seem to be working as I'm getting some errors.. here are the snippets of the code:
#!/usr/bin/perl
use warnings;
use CGI qw(:cgi-lib :standard); # Use CGI modules that let people read data passed from a form
#Initiate comment processing
&ReadParse(%in);
if ($in("comment") && $in("type") != "") {
$comment = $in("comment");
$type = $in("type");
WritetoFile($comment,$type);
}
sub WritetoFile {
my $input = shift;
my $type = shift;
my $file = "$type" . "_comment.txt";
open (my $fh, '>>', $file) or die "Could not open file '$file' $!";
print $fh "$input\n";
close $fh
}
The form I am using is this:
<FORM ACTION=test.pl METHOD=POST>
Comment:
<INPUT TYPE=TEXT NAME="comment" LENGTH=60>
<P>
Select Type
<SELECT NAME ="type">
<OPTION SELECTED> Animal
<OPTION> Fruit
<OPTION> Vegetable
<OPTION> Meat
<OPTION> Other
<INPUT TYPE=SUBMIT VALUE="Submit"></FORM>
Any suggestions on how to make this work, or even improve the process, would be greatly appreciated! I would prefer to keep the processing code and the rest of my subs in the same script (test.pl), unless this is something I have to keep separate.
Your code is a bizarre mixture of old- and new-style Perl. You're using the cgi-lib compatibility layer in CGI.pm and calling its ReadParse() function using the (unnecessary since 1994) leading ampersand. On the other hand, you're using three-arg open() and lexical filehandles. I'd be interested to hear how you developed that style.
Your problem comes from your (mis-)handling of the %in hash. Your call to ReadParse() puts all of the CGI parameters into the hash, but you're using the wrong syntax to get the values out of the hash. Hash keys are looked up using braces ({ ... }), not parentheses (( ... )).
You also have some confusion over your boolean equality operators. != is used for numeric comparisons. You want ne for string comparisons.
You probably wanted something like:
ReadParse(%in);
if ($in{comment} ne "" and $in{type} ne "") {
$comment = $in{comment};
$type = $in{type};
WritetoFile($comment,$type);
}
Your $comment and $type variables are unnecessary as you can pass the hash lookups directly into your subroutine.
WritetoFile($in{comment}, $in{type});
Finally, as others have pointed out, learning CGI in 2014 is like learning to use a typewriter - it'll still work, but people will think you're rather old-fashioned. Look at CGI::Alternatives for some more modern approaches.
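The brace lookup and the ne comparison described above can be tried without a web server. In this sketch the %in hash is filled by hand as a hypothetical stand-in for what ReadParse() would provide, and has_comment is an illustrative helper, not part of CGI.pm. The defined-or defaults are an addition of mine to avoid "uninitialized value" warnings when a field is missing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the %in hash that ReadParse() would populate
my %in = (comment => "Ripe and sweet", type => "Fruit");

# Hypothetical helper: true only when both fields are present and non-empty.
sub has_comment {
    my ($params) = @_;
    my $comment = $params->{comment} // "";   # braces, not parentheses
    my $type    = $params->{type}    // "";
    return ($comment ne "" && $type ne "") ? 1 : 0;   # ne compares strings
}

print has_comment(\%in)
    ? "would append to $in{type}_comment.txt\n"
    : "nothing to write\n";
```

Because the type value ends up in a file name, it would be prudent to check it against the list of expected options before opening the file.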

Perl - Print div by class

I need to print a specific div with class productSpecs from a webpage. Here is my code.
use strict;
use LWP::Simple;
use HTML::TreeBuilder::XPath qw();
my $url="http://www.flipkart.com/samsung-b310e-guru-music-2/p/itmdz9am8xehucbx";
my $content = get($url);
my $t = HTML::TreeBuilder::XPath->new;
$t->parse($content);
my $rank = $t->findvalue('//*[@class="productSpecs"]');
print $rank;
But I am not getting the content I want. What is wrong with my code?
Inspecting the HTML code you are trying to parse, the required div node has this declaration:
<div class="productSpecs specSection">
so your code should be:
my $rank = $t->findnodes('//div[@class="productSpecs specSection"]');
Just for comparison I tried this with Mojolicious using the ojo tool (great for oneliners) and it seems Mojo::DOM returns the HTML by default unless you ask for the text with a ->text() method. e.g. this seems to do what you want:
perl -Mojo -E 'g("http://www.flipkart.com/samsung-b310e-guru-music-2/p/itmdz9am8xehucbx")
->dom->find("div.productSpecs")->each(sub{say $_})'
cheers,
Hi user2186465 and welcome to Stack Exchange :-)
When you assign and print the output from HTML::TreeBuilder::XPath's findnodes() method, it seems to default to rendering the <div> node and returning the content as text. Behind that it returns an XML::XPathEngine::NodeSet object (which HTML::TreeBuilder::XPath uses), which acts as an array holding a reference to an HTML::Element object that has what you want. You need to assign that array element to your $rank variable or else you'll just get the text:
my $rank = $t->findnodes('//div[@class="productSpecs specSection"]')->[0];
(NB: this appears somewhere in the documentation as an example, but it is not prominent). Once you have the HTML::Element object you can use one of its methods with your print statement to get at the contents.
Without the ->[0] you get the rendered text and print $rank just shows that; but with ->[0] you get access to the object and its methods, so print $rank->as_HTML can show the raw HTML content from the node (->as_XML works as well). HTML::TreeBuilder::XPath also has an as_XML_indented convenience method to make the output easier to read. So:
use strict;
use LWP::Simple;
use HTML::TreeBuilder::XPath qw();
my $url="http://www.flipkart.com/samsung-b310e-guru-music-2/p/itmdz9am8xehucbx";
my $content = get($url);
my $t = HTML::TreeBuilder::XPath->new;
$t->parse($content);
my $rank = $t->findnodes('//div[@class="productSpecs specSection"]')->[0];
print $rank->as_XML_indented ;
should do what you want.
HTH

Undo mysql_real_escape_string

I have the following code at the top of every one of my PHP pages:
<?php
function name_format($str)
{
return trim(mysql_real_escape_string(htmlspecialchars($str, ENT_QUOTES)));
}
foreach ($_POST as $key => $value) {
if (!is_array($value))
{
$_POST[$key] = name_format($value);
}
}
?>
This was pretty useful until now. I noticed that if I want to display text from a <textarea> before writing it into a database, it shows "\r\n" instead of normal line breaks.
Even if I try to do the following, it doesn't work:
$str = str_replace("\r\n", "<br>", $str);
The mistake you're making here is over-writing $_POST with a version of the string which you are hoping will be appropriate for all contexts (using mysqli_real_escape_string and htmlspecialchars at the same time).
You should leave the original value untouched, and escape it where it is used, using the appropriate function for that context. (This is one reason why the "magic quotes" feature of early versions of PHP is universally acknowledged to have been a bad idea.)
So in your database code, you would prepare a variable for use with SQL (specifically, MySQL; note that mysqli_real_escape_string takes the connection as its first argument, shown here as a placeholder $link):
$comment = mysqli_real_escape_string($link, trim($_POST['comment']));
And in your template, you would prepare a variable for use with HTML:
$comment = htmlspecialchars(trim($_POST['comment']));
Possibly adding a call to nl2br() in the HTML context, as desired.

Problems parsing Reddit's JSON

I'm working on a perl script that parses reddit's JSON using the JSON module.
However I do have the problem of being very new to both perl and json.
I managed to parse the front page and subreddits successfully, but the comments have a different structure and I can't figure out how to access the data I need.
Here's the code that successfully finds the "data" hash for the front page and subreddits:
foreach my $children(@{$json_text->{"data"}->{"children"}}) #For values of children.
{
my $data = $children->{"data"}; #accessing each data hash.
my %phsh = (); #my hash to collect and print.
$phsh{author} = $data->{"author"};#Here I get the "author" value from "data"
*Etc....
This successfully gets what I need from http://www.reddit.com/.json
But when I go to the json of a comment, this one for example, it has a different format and I can't figure out how to parse it. If I try the same thing as before my parser crashes, saying it is not a HASH reference.
So my question is: How do I access the "children" in the second JSON? I need to get both the data for the post and the data for the comments. Can anybody help?
Thanks in advance!
(I know it may be obvious, but I'm running on very little sleep XD)
You need to either look at the JSON data or dump the decoded data to see what form it takes. The comment data, for example is an array at the top level.
Here is some code that prints the body field of all top-level comments. Note that a comment may have an array of replies in its replies field, and each reply may also have replies in turn.
Depending on what you want to do you may need to check whether a reference is to an array or a hash by checking the value returned by the ref operator.
use strict;
use warnings;
binmode STDOUT, ':utf8';
use JSON;
use LWP;
use Data::Dump;
my $ua = LWP::UserAgent->new;
my $resp = $ua->get('http://www.reddit.com/r/funny/comments/wx3n5/caption_win.json');
die $resp->status_line unless $resp->is_success;
my $json = $resp->decoded_content;
my $data = decode_json($json);
die "Error: $data->{error}" if ref $data eq 'HASH' and exists $data->{error};
dd $data->[1]{data}{children}[0];
print "\n\n";
my $children = $data->[1]{data}{children};
print scalar @$children, " comments:\n\n";
for my $child (@$children) {
print $child->{data}{body}, "\n";
}
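To go beyond the top level, the ref check mentioned above can guard a recursive walk over the replies. The following is only a sketch: the JSON is a hypothetical, heavily trimmed stand-in for a real comment page, collect_bodies is an illustrative name, and it assumes that a comment without replies carries a non-hash value (here an empty string) in its replies field. It uses the core JSON::PP module so that it runs without a network connection.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical, trimmed sample: element 0 is the post, element 1 the comments.
my $json = <<'JSON';
[
  {"data": {"children": [{"data": {"title": "Post title"}}]}},
  {"data": {"children": [
    {"data": {"body": "top-level comment",
              "replies": {"data": {"children": [
                {"data": {"body": "a reply", "replies": ""}}
              ]}}}},
    {"data": {"body": "another comment", "replies": ""}}
  ]}}
]
JSON

my $data = decode_json($json);

# Collect comment bodies depth-first, indenting each level of replies.
sub collect_bodies {
    my ($listing, $depth, $out) = @_;
    return unless ref($listing) eq 'HASH';   # skip empty or missing replies
    for my $child (@{ $listing->{data}{children} }) {
        my $comment = $child->{data};
        push @$out, ('  ' x $depth) . $comment->{body} if defined $comment->{body};
        collect_bodies($comment->{replies}, $depth + 1, $out);
    }
    return;
}

my @bodies;
collect_bodies($data->[1], 0, \@bodies);
print "$_\n" for @bodies;
```

The post itself lives in $data->[0]{data}{children}[0]{data} in this sample layout, so the same kind of lookup gives access to its fields.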