How to Enable HTML::TableExtract to Recognize Special Characters - html

I was trying to parse a page that contain scientific notation (Greek, etc).
This is the page. Note that there are other pages with more notations to be parsed.
For example it contain the following HTML
<td> human Interleukin 1β </td>
where &beta encode the Greek alphabet.
However after parsing with HTML::TableExtract it became:
human Interleukin 1\x{3b2}
Is there a way to make the code below capture the original HTML as it is,
i.e. maintaning 1&beta.
use HTML::TableExtract;
use Data::Dumper;
# Local file for http://www.violinet.org/vaxjo/vaxjo_detail.php?c_vaxjo_id=55
my $file = "vaxjo_detail.php\?c_vaxjo_id\=50.html";
my $te = HTML::TableExtract->new();
$te->parse_file($file);
my ($table) = $te->tables;
print Dumper $table ;

It did not return
human Interleukin 1\x{3b2}
It returned
human Interleukin 1β
Dumper simply prints that out as Perl string literal
"human Interleukin 1\x{3b2}"
Anyway, if you want the raw HTML instead of the text it represents, I believe passing keep_html => 1 to the constructor will do the trick.

Related

How to stop at the next specific character in regex

I have many links in a large variable, and am using regex to extract links. The most ideal link would look like
View Stock
And my regex works perfectly looking for two matches: The full Link and the vendornum.
/<a href="\/search\/\product/(.*?)\/.*?>(.*?)<\/a>/igm
But occasionally, the link will include other info such as a class, which has it's own quotes
<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
And the extra "s throw me off. I cannot figure out the first match, which would be the first two "s
<a href="([^"]+)".*[^>].*?>View Stock</a>
I know regex can be very challenging, and I am using RegEx101.com, a real life saver.
But I just can't seem to figure out how to match the first pattern, the full href link, but excluding any other classes with their own before I reach the closing >
Any experts in regex the can guide me?
There is generally no reason to build an HTML parser by hand, from scratch, while there's usually trouble awaiting down the road; regex are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of a few great HTML libraries?
An example with HTML::TreeBuilder (also extracting links, need stated in a comment)
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $links_string =
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );
my $dom = HTML::TreeBuilder->new_from_content($links_string);
my #links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) {
push #links_html, $tag->as_HTML; # the whole link, as is
my $href = $tag->attr("href");
my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/; #/
say "$name = $value";
say $tag->as_trimmed_text; # or: ->as_text, keep some spaces
# Or:
# say for $tag->content_list; # all children, and/or text
};
#say for #links_html;
I use a string with a newline between links for your "many links in a large variable", perhaps with some spaces around as well. This doesn't affect parsing done by the library.
A few commments
The workhorse here is HTML::Element class, with its powerful and flexible look_down method. If the string indeed has just links then you can probably use that class directly, but when done as above a full HTML document would parse just as well
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust if there can be more pairs, or let me know. Above all, use URI if there's more to it
The as_trimmed_text returns text parts of element's children, which in this case is presumably just the text of the link. The content_list returns all child nodes (same here)
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986
This prints
vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock
Another option is Mojo::DOM, which is a part of a whole ecosystem
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $links_string = q( ... ); # as above
my $dom = Mojo::DOM->new($links_string);
my #links_html;
foreach my $node ( $dom->find('a')->each ) {
push #links_html, $node->to_string; # or $node, gets stringified to HTML
my $href = $node->attr('href');
my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/; #/
say "$name = $value";
say $node->text;
}
#say for #links_html;
I use the same approach as above, and this prints the same. But note that Mojolicious provides for yet other, convenient ways. Often, calls are chained using a range of its useful methods, and very fine navigation through HTML is easily done using CSS selectors.
While it is probably useful here to loop as above, as an example we can also do
my $v = $dom -> find('a')
-> map(
sub {
my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;
say "$name = $value";
say $_->text;
}
);
what prints the same as above. See Mojo::Collection to better play with this.
The parameters in the URL can be parsed using Mojo::URL if you really know the name
my $value = Mojo::URL->new($href)
-> query
-> param('vendornum');
If these aren't fixed then Mojo::Parameters is useful
my $param_names = Mojo::Parameters
-> new( Mojo::URL->new($href)->query )
-> names
where $param_names is an arrayref with names of all parameters in the query, or use
my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query ) -> pairs;
# Or
# my %pairs = #{ Mojo::Parameters->new(Mojo::URL->new($href)->query) -> pairs };
which returns an arrayref with all name,value pairs listed in succession (what can be directly assigned to a hash, for instance).
An HTML document can be nicely parsed using XML::LibXML as well.
If I read correctly, you'd like to extract the vendornum value from the URL, and the link text. Best to use an html parser.
If you want to live dangerously with code that can break you can use a regex to parse html:
my $html = '<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>';
if($html =~ /<a href="[^\?]*\?vendornum=([^"]*)[^>]*>([^<]*).*$/) {
print "vendornum: $1, link text: $2\n";
} else {
print "no match";
}
Output:
vendornum: StaplesA03, link text: View Stock
Explanation:
vendornum=([^"]*) - scan for vendornum=, and capture everything after that until just before "
[^>]*> - scan over remaining attributes, such as class="", up to closing angle bracket
([^<]*) - capture link text
.*$ - scan up to end of text
First of all you should consider using HTML::TreeBuilder for things like this. Once you get the hang of it it can be easier than coming up with regexes. However for quick and dirty tasks, a regex is fine.
$text =
'<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';
$regex =
qr{<a\s[^>]*?href="(?<link>[^"]*?\?vendornum=(?<vendornum>\w+)[^"]*)"[^>]*?>(?<desc>(?:(?!</a>).)*)</a>}i;
while($text =~ m/$regex/g){ Data:Dump::pp1 %+; }
Returns
{
# tied Tie::Hash::NamedCapture
desc => "View Stock",
link => "/search/title/?vendornum=StaplesA03",
vendornum => "StaplesA03",
}
{
# tied Tie::Hash::NamedCapture
desc => "View Stockings",
link => "/search/product/?Vendornum=651687",
vendornum => 651687,
}
HTH

Using Perl LibXML to read textContent that contains html tags

If I have the following XML:
<File id="MyTestApp/app/src/main/res/values/strings.xml">
<Identifier id="page_title" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">My First App</EngTranslation>
<Description index="0">Home page title</Description>
<LangTranslation index="0">My First App</LangTranslation>
</Identifier>
<Identifier id="count" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
<Description index="0">Number of page views</Description>
<LangTranslation index="0">You have <b>%1$d</b> view(s)</LangTranslation>
</Identifier>
</File>
I'm trying to read the 'EngTranslation' text value, and want to return the full value including any HTML tags. For example, I have the following:
my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("test.xml") or die;
foreach my $file ($dom->findnodes('/File')) {
print $file->getAttribute("id")."\n";
foreach my $identifier ($file->findnodes('./Identifier')) {
print $identifier->getAttribute("id")."\n";
print encode('UTF-8',$identifier->findnodes('./EngTranslation')->get_node(1)->textContent."\n");
print encode('UTF-8',$identifier->findnodes('./Description')->get_node(1)->textContent."\n");
print encode('UTF-8',$identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n");
}
}
The output I get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have %1$d view(s)
Number of page views
You have %1$d views
What I'm hoping to get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have <b>%1$d</b> view(s)
Number of page views
You have <b>%1$d</b> views
I'm just using this as an example for a more complicated situation, hopefully it makes sense.
Thanks!
Here's a rather monkey patching solution, but it works:
sub XML::LibXML::Node::innerXML{
my ($self) = shift;
join '', $self->childNodes();
}
…
say $identifier->findnodes('./Description')->get_node(1)->innerXML;
Oh, and if the encoding becomes a problem, use the toString method, it's first argument handles encoding. (I did use open, but there were no out of range characters in the xml).
If you don't like the monkey patching. you can change the sub to a normal one and supply the argument, like this:
sub myInnerXML{
my ($self) = shift;
join '', map{$_->toString(1)} $self->childNodes();
}
…
say myInnerXML($identifier->findnodes('./Description')->get_node(1));
In your source XML, you either need to encode the tags as entities or wrap that content in a CDATA section.
One problem with embedding HTML in XML is that HTML is not necessarily 'well formed'. For example the <br> tag and the <img> tag are not usually followed by matching closing tags and without the closing tags, it would not be valid in an XML document unless you XML-escape the whole string of HTML, e.g.:
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
Or use a CDATA section:
<EngTranslation eng_indx="0" goesWith="-1" index="0"><![CDATA[You have <b>%1$d</b> view(s)]]></EngTranslation>
However, if you restrict your HTML to always be well-formed, you can achieve what you want with the toString() method.
If you called toString() on the <EngTranslation> element node, the output would include the <EngTranslation>...</EngTranslation> wrapper tags. So instead, you would need to call toString() on each of the child nodes and concatenate the results together:
binmode(STDOUT, ':utf8');
foreach my $file ($dom->findnodes('/File')) {
print $file->getAttribute("id")."\n";
foreach my $identifier ($file->findnodes('./Identifier')) {
print $identifier->getAttribute("id")."\n";
my $html = join '', map { $_->toString }
$identifier->findnodes('./EngTranslation')->get_node(1)->childNodes;
print $html."\n";
print $identifier->findnodes('./Description')->get_node(1)->textContent."\n";
print $identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n";
}
}
Note I took the liberty of using binmode to set UTF8 encoding on the output filehandle so it was not necessary to call encode for every print.

How to construct json text using string?

I'm trying to construct json text as show below. But the variables such as $token, $state, $failedServers are not been replaced with its value. Note- I don't want to use any module specifically for this to work, I just want some plain string to work. Can anyone help me ?
my $json = '{"serverToken":"$token", "state":"$state","parameters" :"$failedServers"}';
current output was:
{"serverToken":"$token", "state":"$state","parameters" :"$failedServers"}
needed output format:
{"serverToken":"1213", "state":"failed","parameters" :"oracleapps.veeralab.com,suntrust.com"}
Your variables are not being replaced, because they are inside of a single-quoted string--that is, they are inside a string quoted by ' characters. This prevents variable substitution.
You will also be much better off creating JSON using a JSON library, such as this one. Simply using a quoted string is very dangerous. Suppose your one of your variables ends up containing a special character; you will end up with invalid JSON.
{"serverToken":"123"ABC", "state":"offline", "paramameters":"bugs"}
If your variables come from user input, really bad things could happen. Imagine that $token is set to equal foo", "state":"online", "foo":"bar. Your resulting JSON structure would be:
{"serverToken":"foo", "state":"online", "foo":"bar", "state":"offline" ...
Certainly not what you want.
Possible solutions:
The most blatantly obvious solution is simply not to the ' quote character. This has the drawback of requiring you to escape your double quote (") characters, though, but it's easy:
my $json = "{\"serverToken\":\"$token\", \"state\":\"$state\",\"parameters\" :\"$failedServers\"}";
Another option is to use sprintf:
my $json = sprintf('{"serverToken":"%s", "state":"%s", "parameters":"%s"}', $token, $state, $failedServers);
But by far, the best solution, because it won't break with wonky input, is to use a library:
use JSON;
my $json = encode_json( {
serverToken => $token,
state => $state,
paramaters => $failedServers
} );

Retrieve text in HTML with powershell

In this html code :
<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre>
Creating a new ZIP of IP Phone files from HTTP/PhoneBackup
and HTTPS/PhoneBackup
</pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
Download the new ZIP of IP Phone files
</a>
</div>
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip or just the date and hour between IP_PHONE_BACKUP- and .zip
How can I do that ?
What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmably palatable due to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real-world is emphatically not XML. If you feed the example in the question into any XML parser it will balk on a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
Select-NodeContent $doc.DocumentNode "//a/#href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/#href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
Install PowerShell Community Extensions if you want to parse a live web page.
Understand XPath to be able to construct a navigable path to your target node.
Understand regular expressions to be able to extract a substring from your target node.
With those requirements satisfied you can add the HTMLAgilityPath type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how you assign a value to the $doc variable used in the above one-liners. I show how to load HTML from a file or from the web, depending on your needs.
Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath
function Select-NodeContent(
[HtmlAgilityPack.HtmlNode]$node,
[string] $xpath,
[string] $regex,
[Object] $default = "")
{
if ($xpath -match "(.*)/#(\w+)$") {
# If standard XPath to retrieve an attribute is given,
# map to supported operations to retrieve the attribute's text.
($xpath, $attribute) = $matches[1], $matches[2]
$resultNode = $node.SelectSingleNode($xpath)
$text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
}
else { # retrieve an element's text
$resultNode = $node.SelectSingleNode($xpath)
$text = ?: { $resultNode } { $resultNode.InnerText } { $default }
}
# If a regex is given, use it to extract a substring from the text
if ($regex) {
if ($text -match $regex) { $text = $matches[1] }
else { $text = $default }
}
return $text
}
$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
Actually, the HTML surrounding your file name is irrelevant here. You can extract the date just fine with the following regex (which doesn't even care whether you're extracting it from an e-mail an HTML page or a CSV file):
(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)
Quick test:
PS> [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')
Groups : {2012-Jul-25_15:47:47}
Success : True
Captures : {2012-Jul-25_15:47:47}
Index : 391
Length : 20
Value : 2012-Jul-25_15:47:47
The group(2) and group(3) of the following regex receptively contains the date and time:
/IP_PHONE_BACKUP-((.*)_(.*)).zip/
Here is a link to extract the value from a regex in powershell.
Is there a shorter way to pull groups out of a Powershell regex?
HIH
Without regex:
$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)
Substring gets you a part of the original string. The first parameter is the start position of the substring while the second part is the length of the desiered substring. So now all you have to do is to calculate the start and the length using a little IndexOf- and Length-magic.

How to get a matched substring from a string with regular expression in perl [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How can I extract URL and link text from HTML in Perl?
I am trying to get the substring in a string .There could be more than one matched string with that name in the string.
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
From the above string i want to get the file name "65378161_12011_Q.pdf".
if($line=~ m/((.*)Q\.pdf)/i ){
my $inside=$2;
print " file name:$inside \n";
}
This is what i tried but it does not get the right sub string.
Can some one help on this.
I really appreciate if some one can answer to my question.
Use a HTML parser.
use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
HTML
$w->find('a')->attr('href');
# expression returns '65378161_12011_Q.pdf'
$w->find('a')->text;
# expression returns ' 65378161_12011_Q.pdf '
See the following script :
#!/usr/bin/env perl
use strict;
use warnings;
my $string = "65378161_12011_Q.pdf";
if($string =~ m/((.*)?Q\.pdf)/i ){
my $inside=$2;
print " file name:$inside \n";
}
Your code just lack the '?' character to tell the regex to be not greedy.
Another way is to match all of the characters that is not a 'Q' before itself :
m/(^[^Q]+)?Q\.pdf/i
Edit:
Because you had edited your post with a different spec :
If you need to parse HTML, I recommend to use a proper module :
Don't parse or modify html with regular expressions! See one of
HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple,
HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response
begins "that's overkill. i only want to..." you are wrong.
http://en.wikipedia.org/wiki/Chomsky_hierarchy and
here for why not to use regex on HTML
(This is a reminder about using regex to parse HTML from #perl channel on irc.freenode.org)
Edit 2:
Here a complete working example :
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content('
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
');
$tree->look_down("_tag", "a")->as_text =~ m/(^[^Q]+)Q\.pdf/i && print "$1\n";
Since . will match everything, simply remove the parenthesis around it.
#!/usr/bin/perl
my $line = "65378161_12011_Q.pdf";
if ($line =~ m/(.*Q\.pdf)/i )
{
my $inside = $1;
print "filename = $inside\n";
}
Produces the correct output.
Hope it helps.
Manny