Is there a way to extract an HTML page title using Perl? I know it can be passed as a hidden variable during a form submit and then retrieved in Perl that way, but I was wondering if there is a way to do this without the submit.
Say I have an HTML page like this:
<html><head><title>TEST</title></head></html>
and then in Perl I want to do:
$q->h1('something');
How can I replace 'something' dynamically with what is contained in the <title> tags?
I would use pQuery. It works just like jQuery.
You can say:
use feature 'say';
use pQuery;

my $page  = pQuery("http://google.com/");
my $title = $page->find('title');
say "The title is: ", $title->html;
Replacing stuff is similar:
$title->html('New Title');
say "The entirety of google.com with my new title is: ", $page->html;
You can pass an HTML string to the pQuery constructor, which it sounds like you want to do.
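For example, a minimal sketch using the question's own markup (assuming the constructor accepts raw HTML the same way it accepts a URL):
use pQuery;

my $html = '<html><head><title>TEST</title></head></html>';   # the question's example page
my $page = pQuery($html);
print "The title is: ", $page->find('title')->html, "\n";     # prints TEST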
Finally, if you want to use arbitrary HTML as a "template", and then "refine" that with Perl commands, you want to use Template::Refine.
HTML::HeadParser does this for you.
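For example, a short sketch: HTML::HeadParser only reads the <head> section and exposes the title as the pseudo-header 'Title'.
use HTML::HeadParser;

my $html   = '<html><head><title>TEST</title></head></html>';
my $parser = HTML::HeadParser->new;
$parser->parse($html);
print $parser->header('Title'), "\n";   # prints TEST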
It's not clear to me what you are asking. You seem to be talking about something that could run in the user's browser, or at least something that already has an HTML page loaded.
If that's not the case, the answer is URI::Title.
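Roughly like this (a minimal sketch; URI::Title fetches the page and returns just the title):
use URI::Title qw(title);

my $title = title('http://www.google.com');
print "$title\n";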
use strict;
use warnings;
use LWP::Simple;

my $url  = shift || die "Specify URL on the cmd line";
my $html = get($url) or die "Could not fetch $url";
if ($html =~ m{<title>(.*?)</title>}is) {
    print "$1\n";
}
The previous answer breaks if the HTML title tag appears more than once; this can easily be overcome by checking that the match is a valid title tag (with no tags in between):
my ($title) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;
Get the title from the file:
# <title> / </title> markers; $absPath and the WFL output handle are
# assumed to be set up earlier in the script
my $startstring = '<title>';
my $endstring   = '</title>';

my $spool = 0;
open my $fh, "<", $absPath or die $!;

# write the opening brace
print WFL "[";

while (<$fh>) {
    # remove the newline from the line just read
    chomp;
    # remove leading and trailing spaces
    s/^\s+|\s+$//g;

    # case where <title> and </title> occur on the same line:
    # print the title and exit immediately
    if (/$startstring/i && /$endstring/i) {
        print WFL "'";
        my ($title) = m/$startstring(.+)$endstring/si;
        print WFL $title;
        print WFL "',";
        last;
    }
    # case where <title> is on one line and </title> on a later line:
    # the opening <title> is found on this line
    elsif (/$startstring/i) {
        print WFL "'";
        # extract everything after <title>, nothing before it
        my ($title) = m/$startstring(.+)/si;
        print WFL $title;
        $spool = 1;
    }
    # the closing </title> is found on this line
    elsif (/$endstring/i) {
        # take everything before </title>, nothing after it
        my ($title) = m/(.+)$endstring/si;
        print WFL " ";
        print WFL $title;
        print WFL "',";
        $spool = 0;
        last;
    }
    # spool every line that lies between <title> and </title>
    elsif ($spool == 1) {
        print WFL " ";
        print WFL $_;
    }
}
close $fh;
# end of getting the title name
If you just want to extract the page title you can use a regular expression. I believe that would be something like:
my ($title) = $html =~ m/<title>(.+)<\/title>/si;
where your HTML page is stored in the string $html. In si, the s stands for single-line mode (i.e., the dot also matches a newline) and i for ignore case.
Related
I need to extract a captcha image from a URL and recognise it with Tesseract.
My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
# $img1 = $1;
#}
#else
#{
# $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
    my $takeImg = $1;
    my @dirs = split('/', $takeImg);
    my $img = $dirs[2];
}
else
{
    print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = `cat ocr_result.txt`;
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
As you can see, I'm trying to extract the img src tag. This solution did not work for me ($img1): "use shell command tesseract in perl script to print a text output". I also used an adapted version of the solution from ($img2): "How can I extract URL and link text from HTML in Perl?".
If you need the HTML code of that page, here it is:
<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>
I got the error that the image was not found. I think my problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.
Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.
if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}
If you want to extract part of the text using a regular expression, you need to put that part inside brackets (a capture group). For example:
$example = "hello world";
$example =~ /(hello) world/;
This will set $1 to "hello".
The regular expression itself doesn't make that much sense: where you have ". *?", that'll match any single character followed by zero or more spaces. Is that a typo for ".*?", which matches any number of characters but isn't greedy like ".*", so it stops as soon as the next part of the regex can match?
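To illustrate the difference with a made-up string:
my $s = "<b>one</b><b>two</b>";
my ($greedy)     = $s =~ m/<b>(.*)<\/b>/;    # captures "one</b><b>two"
my ($non_greedy) = $s =~ m/<b>(.*?)<\/b>/;   # captures "one"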
This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute starting with "/captcha/" and store the image URL in $1:
$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;
To break down how it works: the "m%...%" is just a different way of writing "/.../" that allows you to put slashes in the regex without escaping them. "[^>]*" matches zero or more of any character except ">", so it won't run past the end of the tag. "(/captcha/[^"]*)" is a capture group that grabs whatever is inside the double quotes, which will be the URL. Finally, the "/s" modifier on the end treats $html as one long line of text, ignoring any \n in it; that probably isn't needed, but on the off chance the img tag is split over multiple lines it will still work.
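Applied to the question's code, a hedged sketch (reusing $html and $home from the question, and the same GET download step):
if ($html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s) {
    my $img = "$home$1";   # the src is relative, so prepend the site root
    print "GET '$img' > ocr_me.img\n";
    system "GET '$img' > ocr_me.img";
}
else {
    die "<img> not found\n";
}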
The code below is able to read the content of a file and print the body of an email with the file's content.
use strict;
my $filename = '.../text.txt';
open (my $ifh, '<', $filename)
or die "Could not open file '$filename' $!";
local $/ = undef;
my @row = (<$ifh>)[0..9];
close ($ifh);
print "@row\n";
my ($body) = @_;
my ($html_body) = @_;
.
.
.
print(MAIL "Subject: Important Announcement \n");
.
.
.
push(@$html_body, "<h1><b><font color= red ><u>ATTENTION!</u></b></h1></font><br>");
push(@$html_body, "@row");
.
.
.
print(MAIL "$body", "#$html_body");
close(MAIL);
Unfortunately, I am having a problem producing the email body in the same format as the text.txt file. The output email has only a single line instead of three paragraphs.
The problem you're facing is that plain text contains no formatting information when placed inside an HTML document. End-of-line characters are ignored and treated just like ordinary white space. You need to add HTML tags to the text to convey the formatting you want, or you could wrap it up in a pre tag, as that will display it "as is".
As mentioned by others in the comments above, your use of @_ doesn't make sense. And it doesn't really make sense for $html_body to be treated like an array either when all you're doing is appending HTML to it. So I've rewritten that chunk of code to use it as a scalar and append the HTML to it instead. And also fixed some mistakes in the HTML as you need to close tags in the same order as you open them.
print MAIL "Subject: Important Announcement \n";
print MAIL "\n"; # Need a blank line after the header to show it's finished
my $html_body = "<html><body>";
$html_body .= "<h1><b><font color="red"><u>ATTENTION!</u></font></b></h1>";
$html_body .= "<pre>";
$html_body .= join("", @row);
$html_body .= "</pre>";
$html_body .= "</body></html>";
print MAIL $html_body;
close(MAIL);
First of all, @_ is the array of arguments passed to a subroutine, and it looks like you're not in one. So, doing:
my ($body) = @_;
my ($html_body) = @_;
is setting $body & $html_body to $_[0], which is undef.
How to fix?
There are two ways if you wrap it in a subroutine:
Use shift, which will make the above code look like:
my ($body) = shift;
my ($html_body)= shift;
Or,
my ($body, $html_body) = @_;
I would recommend the last one because it is less code and is more readable than the first one.
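For example, with a hypothetical build_mail() subroutine (the name and body are just for illustration):
sub build_mail {
    my ($body, $html_body) = @_;   # unpack both arguments at once
    return "$body\n$html_body";
}

print build_mail("plain text part", "<p>HTML part</p>");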
I'm trying to automate hotmail login. How can I find what the appropriate fields are? When I print the forms I just get a bunch of hex information.
What's the correct method, and how is it used?
use WWW::Mechanize;
use LWP::UserAgent;
my $mech = WWW::Mechanize->new();
my $url = "http://hotmail.com";
$mech->get($url);
print "Forms: $mech->forms";
if ($mech->success()){
    print "Successful Connection\n";
} else {
    print "Not a successful connection\n";
}
This may help you:
use WWW::Mechanize;
use Data::Dumper;
my $mech = WWW::Mechanize->new();
my $url = "http://yoururl.com";
$mech->get($url);
my @forms = $mech->forms;
foreach my $form (@forms) {
    my @inputfields = $form->param;
    print Dumper \@inputfields;
}
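Once you know the field names from the dump, something along these lines should fill in and submit the form. This is only a sketch: the field names 'login' and 'passwd' are placeholders, not necessarily the ones Hotmail actually uses.
$mech->submit_form(
    form_number => 1,
    fields      => {
        login  => 'someone@example.com',   # placeholder field name
        passwd => 'secret',                # placeholder field name
    },
);
print $mech->success() ? "Form submitted\n" : "Submit failed\n";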
Sometimes it is useful to look at what the web site is asking in advance of coding up a reader or interface to it.
I wrote this bookmarklet that you save in your browser bookmarks; when you click it while visiting any HTML web page, it will show in a pop-up all the form actions and fields with their values, even hidden ones. Simply copy the text below, paste it into a new bookmark's location field, name it and save.
javascript:t=%22<TABLE%20BORDER='1'%20BGCOLOR='#B5D1E8'>%22;for(i=0;i<document.forms.length;i++){t+=%22<TR><TH%20colspan='4'%20align='left'%20BGCOLOR='#336699'>%22;t+=%22<FONT%20color='#FFFFFF'>%20Form%20Name:%20%22;t+=document.forms[i].name;t+=%22</FONT></TH></TR>%22;t+=%22<TR><TH%20colspan='4'%20align='left'%20BGCOLOR='#99BADD'>%22;t+=%22<FONT%20color='#FFFFFF'>%20Form%20Action:%20%22;t+=document.forms[i].action;t+=%22</FONT></TH></TR>%22;t+=%22<TR><TH%20colspan='4'%20align='left'%20BGCOLOR='#99BADD'>%22;t+=%22<FONT%20color='#FFFFFF'>%20Form%20onSubmit:%20%22;t+=document.forms[i].onSubmit;t+=%22</FONT></TH></TR>%22;t+=%22<TR><TH>ID:</TH><TH>Element%20Name:</TH><TH>Type:</TH><TH>Value:</TH></TR>%22;for(j=0;j<document.forms[i].elements.length;j++){t+=%22<TR%20BGCOLOR='#FFFFFF'><TD%20align='right'>%22;t+=document.forms[i].elements[j].id;t+=%22</TD><TD%20align='right'>%22;t+=document.forms[i].elements[j].name;t+=%22</TD><TD%20align='left'>%20%22;t+=document.forms[i].elements[j].type;t+=%22</TD><TD%20align='left'>%20%22;if((document.forms[i].elements[j].type==%22select-one%22)%20||%20(document.forms[i].elements[j].type==%22select-multiple%22)){t_b=%22%22;for(k=0;k<document.forms[i].elements[j].options.length;k++){if(document.forms[i].elements[j].options[k].selected){t_b+=document.forms[i].elements[j].options[k].value;t_b%20+=%20%22%20/%20%22;t_b+=document.forms[i].elements[j].options[k].text;t_b+=%22%20%22;}}t+=t_b;}else%20if%20(document.forms[i].elements[j].type==%22checkbox%22){if(document.forms[i].elements[j].checked==true){t+=%22True%22;}else{t+=%22False%22;}}else%20if(document.forms[i].elements[j].type%20==%20%22radio%22){if(document.forms[i].elements[j].checked%20==%20true){t+=document.forms[i].elements[j].value%20+%20%22%20-%20CHECKED%22;}else{t+=document.forms[i].elements[j].value;}}else{t+=document.forms[i].elements[j].value;}t+=%22</TD></TR>%22;}}t+=%22</TABLE>%22;mA='menubar=yes,scrollbars=yes,resizable=yes,height=800,width=600,alwaysRaised=yes';nW=window.open(%22/empty.html%22,%22Display_Vars%22,%20mA);nW.document.write(t);
I tried to mimic the POST request that sends your login info, but the web site seems to dynamically add a bunch of IDs (long generated strings etc.) to the URL and I couldn't figure out how to imitate them. So I wrote the hacky work-around below.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Curl::Easy;
use Data::Dumper;
my $curl = WWW::Curl::Easy->new;
#this is the name and complete path to the new html file we will create
my $new_html_file = 'XXXXXXXXX';
my $password = 'XXXXXXXX';
my $login = 'XXXXXXXXX';
#escape the .
$login =~ s/\./\\./g;
my $html_to_insert = qq(<script src="//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js"></script><script type="text/javascript">setTimeout('testme()', 3400);function testme(){document.getElementById('res_box').innerHTML = '<h3 class="auto_click_login_np">Logging in...</h3>';document.f1.passwd.value = '$password';document.f1.login.value = '$login';\$("#idSIButton9").trigger("click");}var counter = 5;setInterval('countdown()', 1000);function countdown(){document.getElementById('res_box').innerHTML = '<h3 class="auto_click_login_np">You should be logged in within ' + counter + ' seconds</h3>';counter--;}</script><h2 style="background-color:#004c00; color: #fff; padding: 4px;" id="res_box" onclick="testme()" class="auto_click_login">If you are not logged in after a few seconds, click here.</h2>);
$curl->setopt(CURLOPT_HEADER,1);
my $url = 'https://login.live.com';
$curl->setopt(CURLOPT_URL, $url);
# A filehandle, reference to a scalar or reference to a typeglob can be used here.
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA, \$response_body);
open( my $fresh_html_handle, '+>', 'fresh_html_from_login_page.html');
# Starts the actual request
my $curl_return_code = $curl->perform;
# Looking at the results...
if ($curl_return_code == 0) {
    print("Transfer went ok\n");
    my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
    # judge result and next action based on $response_code
    print $fresh_html_handle $response_body;
} else {
    # Error code, type of error, error message
    print("An error happened: $curl_return_code ".$curl->strerror($curl_return_code)." ".$curl->errbuf."\n");
}
close($fresh_html_handle);
# truncate any pre-existing edited file (opening with '>' empties it)
open my $erase_html_handle, ">", $new_html_file or die "Hork! $!\n";
close $erase_html_handle;
#open the file with the login page html
open( FH, '<', 'fresh_html_from_login_page.html');
open( my $new_html_handle, '>>', $new_html_file);
my $tracker=0;
while( <FH> ){
    if( $_ =~ /DOCTYPE/){
        $tracker = 1;
        print $new_html_handle $_;
    } elsif($_ =~ /<\/body><\/html>/){
        # now add the javascript and html to automatically log the user in
        print $new_html_handle "$html_to_insert\n$_";
    } elsif( $tracker == 1){
        print $new_html_handle $_;
    }
}
close(FH);
close($new_html_handle);
my $sys_call_res = system("firefox file:///usr/bin/outlook_auto_login.html");
print "\n\nresult: $sys_call_res\n\n";
I know that HTML::Parser is a thing, and from reading around I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things, but for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, words per sentence and hopefully the average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". ", just to see what, if anything, it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else, or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
    my @sentences;
    while ($html =~ />[^<]\. /gis) {
        push @sentences, $1;
    }
    # for debugging, comment out when running
    print join("\n", @sentences);
}
print "</pre>";
Your regex should be />[^<]*?./gis
The *? means match zero or more non greedy. As it stood your regex would match only a single non < character followed by a period and a space. This way it will match all non < until the first period.
There may be other problems.
Now read this
A first improvement would be to write $html =~ />([^<.]+)\. /gs; you need to capture the match with the parens, and to allow more than one letter per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html =~ m{>([^<]*)<}g ) { push @text_content, $1 }
foreach (@text_content) { while( m{([^.]*)\.}gs ) { push @sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.
I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
    $contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
$dom->find('p')->each( sub{
    my $p_tag = $_->text;

    # count words (runs of non-whitespace)
    ++$totalWordCount while $p_tag =~ /\S+/g;

    # count sentences by their terminating punctuation
    ++$sentenceCount while $p_tag =~ /[.!?]+/g;

    # count characters, ignoring whitespace (work on a copy so the
    # match position on $p_tag is not reset mid-loop)
    (my $stripped = $p_tag) =~ s/\s//g;
    $char_count += length($stripped);
});
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = sprintf("%.2f", $totalWordCount / $sentenceCount);
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average words per sentence: $rounded2.\n\n";
I'm trying to access the .html files and extract the text in <p> tags. Logically, my code below should work: using HTML::TreeBuilder, I parse the HTML and then extract the text in <p> using find_by_attribute("p"). But my script came out with empty directories. Did I leave out anything?
#!/usr/bin/perl
use strict;
use HTML::TreeBuilder 3;
use FileHandle;
my @task = ('ar','cn','en','id','vn');
foreach my $lang (@task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my @files = grep (/\.html/, readdir ($dir));
closedir ($dir);
foreach my $file (#files) {
open (my $fh, '<', "./$lang/$file") or die "$!";
my $root = HTML::TreeBuilder->new;
$root->parse_file("./$lang/$file");
my @all_p = $root->find_by_attribute("p");
foreach my $p (@all_p) {
my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
my $filewrite = substr($file, 0, -5);
open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
print $outwrite $ptag->as_text . "\n";
my $pcontents = $ptag->as_text;
print $pcontents . "\n";
close ($outwrite);
}
close ($fh);
}
}
My .html files are plain HTML pages saved from .asp websites, e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp
My .html files are saved in:
./ar/*
./cn/*
./en/*
./id/*
./vn/*
You are confusing an element with an attribute. The program can be written much more concisely:
#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);
foreach my $lang (qw(ar cn en id vn)) {
    mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
    foreach my $file (bsd_glob "./$lang/*.html") {
        my $basename = file($file)->basename;
        $basename =~ s/[.]html$/.txt/;
        open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
        $out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
        close $out;
    }
}
Use find_by_tag_name to search for tag names, not find_by_attribute.
You want find_by_tag_name, not find_by_attribute:
my #all_p = $root->find_by_tag_name("p");
From the docs:
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or under $h that have
any of the specified tag names. In scalar context, returns the first
(in pre-order traversal of the tree) such element found, or undef if
none.
You might want to take a look at Mojo::DOM which lets you use CSS selectors.
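For example, a minimal sketch (assuming the Mojolicious distribution is installed; the file path is a placeholder for one of your saved pages):
use Mojo::DOM;
use Mojo::File qw(path);

my $file = './en/sample.html';                  # placeholder path
my $dom  = Mojo::DOM->new(path($file)->slurp);  # parse the saved HTML
print $_->all_text, "\n" for $dom->find('p')->each;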