Retrieve text in HTML with PowerShell

In this html code :
<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre>
Creating a new ZIP of IP Phone files from HTTP/PhoneBackup
and HTTPS/PhoneBackup
</pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
Download the new ZIP of IP Phone files
</a>
</div>
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and hour between IP_PHONE_BACKUP- and .zip.
How can I do that?

What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmatically palatable due to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
Install PowerShell Community Extensions if you want to parse a live web page.
Understand XPath to be able to construct a navigable path to your target node.
Understand regular expressions to be able to extract a substring from your target node.
With those requirements satisfied you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how you assign a value to the $doc variable used in the above one-liners. I show how to load HTML from a file or from the web, depending on your needs.
Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath
function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode] $node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}
$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page

Actually, the HTML surrounding your file name is irrelevant here. You can extract the date just fine with the following regex (which doesn't even care whether you're extracting it from an e-mail, an HTML page, or a CSV file):
(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)
Quick test:
PS> [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')
Groups : {2012-Jul-25_15:47:47}
Success : True
Captures : {2012-Jul-25_15:47:47}
Index : 391
Length : 20
Value : 2012-Jul-25_15:47:47

Groups 2 and 3 of the following regex respectively contain the date and the time:
/IP_PHONE_BACKUP-((.*)_(.*)).zip/
Here is a link about extracting group values from a regex in PowerShell:
Is there a shorter way to pull groups out of a Powershell regex?
HIH

Without regex:
$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)
Substring gets you a part of the original string. The first parameter is the start position of the substring, while the second parameter is the length of the desired substring. So now all you have to do is calculate the start and the length using a little IndexOf and Length magic.

Related

How to stop at the next specific character in regex

I have many links in a large variable, and am using regex to extract links. The most ideal link would look like
<a href="/search/title/?vendornum=StaplesA03">View Stock</a>
And my regex works perfectly looking for two matches: the full link and the vendornum.
/<a href="\/search\/product\/(.*?)\/.*?>(.*?)<\/a>/igm
But occasionally, the link will include other info such as a class, which has its own quotes
<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
And the extra "s throw me off. I cannot figure out the first match, which would be the first two "s
<a href="([^"]+)".*[^>].*?>View Stock</a>
I know regex can be very challenging, and I am using RegEx101.com, a real life saver.
But I just can't seem to figure out how to match the first pattern, the full href link, while excluding any other attributes (such as a class) with their own quotes before I reach the closing >.
Any experts in regex that can guide me?
There is generally no reason to build an HTML parser by hand, from scratch, and there's usually trouble awaiting down the road: regexes are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of a few great HTML libraries?
An example with HTML::TreeBuilder (also extracting links, need stated in a comment)
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $links_string =
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );
my $dom = HTML::TreeBuilder->new_from_content($links_string);
my @links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) {
    push @links_html, $tag->as_HTML;  # the whole link, as is
    my $href = $tag->attr("href");
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;  #/
    say "$name = $value";
    say $tag->as_trimmed_text;  # or: ->as_text, keep some spaces
    # Or:
    # say for $tag->content_list;  # all children, and/or text
};
#say for @links_html;
I use a string with a newline between links for your "many links in a large variable", perhaps with some spaces around as well. This doesn't affect parsing done by the library.
A few comments:
The workhorse here is the HTML::Element class, with its powerful and flexible look_down method. If the string indeed has just links then you can probably use that class directly, but when done as above a full HTML document would parse just as well.
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust if there can be more pairs, or let me know. Above all, use URI if there's more to it (see the sketch after the printed output below).
The as_trimmed_text returns the text parts of the element's children, which in this case is presumably just the text of the link. The content_list returns all child nodes (same here).
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986.
This prints
vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock
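As a side note on the URI suggestion above, here is a minimal sketch of pulling out all name-value pairs with that module (the href value is just one taken from the question; adjust as needed):
use strict;
use warnings;
use feature 'say';
use URI;

my $href = '/search/title/?vendornum=StaplesA03';

# query_form returns all name => value pairs from the query string
my %params = URI->new($href)->query_form;
say "$_ = $params{$_}" for sort keys %params;   # vendornum = StaplesA03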
Another option is Mojo::DOM, which is a part of a whole ecosystem
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $links_string = q( ... ); # as above
my $dom = Mojo::DOM->new($links_string);
my @links_html;
foreach my $node ( $dom->find('a')->each ) {
    push @links_html, $node->to_string;  # or $node, gets stringified to HTML
    my $href = $node->attr('href');
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;  #/
    say "$name = $value";
    say $node->text;
}
#say for @links_html;
I use the same approach as above, and this prints the same. But note that Mojolicious provides for yet other, convenient ways. Often, calls are chained using a range of its useful methods, and very fine navigation through HTML is easily done using CSS selectors.
While it is probably useful here to loop as above, as an example we can also do
my $v = $dom -> find('a')
    -> map(
        sub {
            my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;
            say "$name = $value";
            say $_->text;
        }
    );
which prints the same as above. See Mojo::Collection to better play with this.
The parameters in the URL can be parsed using Mojo::URL if you really know the name
my $value = Mojo::URL->new($href)
-> query
-> param('vendornum');
If these aren't fixed then Mojo::Parameters is useful
my $param_names = Mojo::Parameters
-> new( Mojo::URL->new($href)->query )
-> names
where $param_names is an arrayref with names of all parameters in the query, or use
my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query ) -> pairs;
# Or
# my %pairs = @{ Mojo::Parameters->new(Mojo::URL->new($href)->query) -> pairs };
which returns an arrayref with all name,value pairs listed in succession (which can be directly assigned to a hash, for instance).
An HTML document can be nicely parsed using XML::LibXML as well.
If I read correctly, you'd like to extract the vendornum value from the URL, and the link text. Best to use an HTML parser.
If you want to live dangerously with code that can break, you can use a regex to parse HTML:
my $html = '<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>';
if($html =~ /<a href="[^\?]*\?vendornum=([^"]*)[^>]*>([^<]*).*$/) {
    print "vendornum: $1, link text: $2\n";
} else {
    print "no match";
}
Output:
vendornum: StaplesA03, link text: View Stock
Explanation:
vendornum=([^"]*) - scan for vendornum=, and capture everything after that until just before "
[^>]*> - scan over remaining attributes, such as class="", up to closing angle bracket
([^<]*) - capture link text
.*$ - scan up to end of text
First of all, you should consider using HTML::TreeBuilder for things like this. Once you get the hang of it, it can be easier than coming up with regexes. However, for quick and dirty tasks, a regex is fine.
use Data::Dump;

my $text =
'<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';

my $regex =
qr{<a\s[^>]*?href="(?<link>[^"]*?\?vendornum=(?<vendornum>\w+)[^"]*)"[^>]*?>(?<desc>(?:(?!</a>).)*)</a>}i;

while($text =~ m/$regex/g){ Data::Dump::pp(\%+); }
Returns
{
# tied Tie::Hash::NamedCapture
desc => "View Stock",
link => "/search/title/?vendornum=StaplesA03",
vendornum => "StaplesA03",
}
{
# tied Tie::Hash::NamedCapture
desc => "View Stockings",
link => "/search/product/?Vendornum=651687",
vendornum => 651687,
}
HTH

Using Perl LibXML to read textContent that contains html tags

If I have the following XML:
<File id="MyTestApp/app/src/main/res/values/strings.xml">
  <Identifier id="page_title" isArray="0" isPlural="0">
    <EngTranslation eng_indx="0" goesWith="-1" index="0">My First App</EngTranslation>
    <Description index="0">Home page title</Description>
    <LangTranslation index="0">My First App</LangTranslation>
  </Identifier>
  <Identifier id="count" isArray="0" isPlural="0">
    <EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
    <Description index="0">Number of page views</Description>
    <LangTranslation index="0">You have <b>%1$d</b> view(s)</LangTranslation>
  </Identifier>
</File>
I'm trying to read the 'EngTranslation' text value, and want to return the full value including any HTML tags. For example, I have the following:
my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("test.xml") or die;
foreach my $file ($dom->findnodes('/File')) {
    print $file->getAttribute("id")."\n";
    foreach my $identifier ($file->findnodes('./Identifier')) {
        print $identifier->getAttribute("id")."\n";
        print encode('UTF-8', $identifier->findnodes('./EngTranslation')->get_node(1)->textContent."\n");
        print encode('UTF-8', $identifier->findnodes('./Description')->get_node(1)->textContent."\n");
        print encode('UTF-8', $identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n");
    }
}
The output I get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have %1$d view(s)
Number of page views
You have %1$d views
What I'm hoping to get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have <b>%1$d</b> view(s)
Number of page views
You have <b>%1$d</b> views
I'm just using this as an example for a more complicated situation, hopefully it makes sense.
Thanks!
Here's a rather monkey patching solution, but it works:
sub XML::LibXML::Node::innerXML{
my ($self) = shift;
join '', $self->childNodes();
}
…
say $identifier->findnodes('./Description')->get_node(1)->innerXML;
Oh, and if the encoding becomes a problem, use the toString method; its first argument handles encoding. (I did use open, but there were no out-of-range characters in the XML.)
If you don't like the monkey patching, you can change the sub to a normal one and supply the argument, like this:
sub myInnerXML{
my ($self) = shift;
join '', map{$_->toString(1)} $self->childNodes();
}
…
say myInnerXML($identifier->findnodes('./Description')->get_node(1));
In your source XML, you either need to encode the tags as entities or wrap that content in a CDATA section.
One problem with embedding HTML in XML is that HTML is not necessarily 'well formed'. For example, the <br> tag and the <img> tag are not usually followed by matching closing tags, and without the closing tags it would not be valid in an XML document unless you XML-escape the whole string of HTML, e.g.:
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have &lt;b&gt;%1$d&lt;/b&gt; view(s)</EngTranslation>
Or use a CDATA section:
<EngTranslation eng_indx="0" goesWith="-1" index="0"><![CDATA[You have <b>%1$d</b> view(s)]]></EngTranslation>
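Either way, the OP's existing textContent call would then return the tags literally. A quick sketch of that, using the entity-escaped form above in a small inline document (a hypothetical stand-in for test.xml):
use strict;
use warnings;
use XML::LibXML;

my $dom = XML::LibXML->load_xml( string => <<'XML' );
<Identifier id="count">
  <EngTranslation eng_indx="0" goesWith="-1" index="0">You have &lt;b&gt;%1$d&lt;/b&gt; view(s)</EngTranslation>
</Identifier>
XML

my ($node) = $dom->findnodes('//EngTranslation');
print $node->textContent, "\n";   # You have <b>%1$d</b> view(s)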
However, if you restrict your HTML to always be well-formed, you can achieve what you want with the toString() method.
If you called toString() on the <EngTranslation> element node, the output would include the <EngTranslation>...</EngTranslation> wrapper tags. So instead, you would need to call toString() on each of the child nodes and concatenate the results together:
binmode(STDOUT, ':utf8');
foreach my $file ($dom->findnodes('/File')) {
    print $file->getAttribute("id")."\n";
    foreach my $identifier ($file->findnodes('./Identifier')) {
        print $identifier->getAttribute("id")."\n";
        my $html = join '', map { $_->toString }
            $identifier->findnodes('./EngTranslation')->get_node(1)->childNodes;
        print $html."\n";
        print $identifier->findnodes('./Description')->get_node(1)->textContent."\n";
        print $identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n";
    }
}
Note I took the liberty of using binmode to set UTF8 encoding on the output filehandle so it was not necessary to call encode for every print.

How to construct json text using string?

I'm trying to construct JSON text as shown below, but variables such as $token, $state, and $failedServers are not being replaced with their values. Note: I don't want to use any module specifically for this to work; I just want plain strings to work. Can anyone help me?
my $json = '{"serverToken":"$token", "state":"$state","parameters" :"$failedServers"}';
current output was:
{"serverToken":"$token", "state":"$state","parameters" :"$failedServers"}
needed output format:
{"serverToken":"1213", "state":"failed","parameters" :"oracleapps.veeralab.com,suntrust.com"}
Your variables are not being replaced, because they are inside of a single-quoted string--that is, they are inside a string quoted by ' characters. This prevents variable substitution.
You will also be much better off creating JSON using a JSON library, such as the JSON module shown below. Simply using a quoted string is very dangerous. Suppose one of your variables ends up containing a special character; you will end up with invalid JSON:
{"serverToken":"123"ABC", "state":"offline", "parameters":"bugs"}
If your variables come from user input, really bad things could happen. Imagine that $token is set to equal foo", "state":"online", "foo":"bar. Your resulting JSON structure would be:
{"serverToken":"foo", "state":"online", "foo":"bar", "state":"offline" ...
Certainly not what you want.
Possible solutions:
The most blatantly obvious solution is simply not to use the ' quote character. This has the drawback of requiring you to escape your double quote (") characters, but it's easy:
my $json = "{\"serverToken\":\"$token\", \"state\":\"$state\",\"parameters\" :\"$failedServers\"}";
Another option is to use sprintf:
my $json = sprintf('{"serverToken":"%s", "state":"%s", "parameters":"%s"}', $token, $state, $failedServers);
But by far, the best solution, because it won't break with wonky input, is to use a library:
use JSON;
my $json = encode_json( {
    serverToken => $token,
    state       => $state,
    parameters  => $failedServers
} );
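As a quick illustration of why the library approach holds up with wonky input, here is a minimal sketch with a hypothetical token containing characters that would have broken the hand-built string (the module escapes them for you):
use strict;
use warnings;
use JSON;

my $token = 'foo", "state":"online';
print encode_json( { serverToken => $token, state => 'failed' } ), "\n";
# prints something like (hash key order may vary):
# {"serverToken":"foo\", \"state\":\"online","state":"failed"}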

How to Enable HTML::TableExtract to Recognize Special Characters

I was trying to parse a page that contains scientific notation (Greek letters, etc.).
This is the page (the URL is in the comment in the code below). Note that there are other pages with more notations to be parsed.
For example it contains the following HTML:
<td> human Interleukin 1&beta; </td>
where &beta; encodes the Greek letter beta.
However, after parsing with HTML::TableExtract it became:
human Interleukin 1\x{3b2}
Is there a way to make the code below capture the original HTML as it is,
i.e. maintaining 1&beta;?
use HTML::TableExtract;
use Data::Dumper;
# Local file for http://www.violinet.org/vaxjo/vaxjo_detail.php?c_vaxjo_id=55
my $file = "vaxjo_detail.php?c_vaxjo_id=50.html";
my $te = HTML::TableExtract->new();
$te->parse_file($file);
my ($table) = $te->tables;
print Dumper $table;
It did not return human Interleukin 1\x{3b2}. It returned human Interleukin 1β; Dumper simply prints that out as the Perl string literal
"human Interleukin 1\x{3b2}"
Anyway, if you want the raw HTML instead of the text it represents, I believe passing keep_html => 1 to the constructor will do the trick.
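For what it's worth, a minimal sketch of where that option goes, reusing the question's own code and local file name:
use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $file = "vaxjo_detail.php?c_vaxjo_id=50.html";

# keep_html => 1 keeps the raw markup of each cell instead of the decoded text
my $te = HTML::TableExtract->new( keep_html => 1 );
$te->parse_file($file);

my ($table) = $te->tables;
print Dumper $table;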

How can I modify HTML files in Perl?

I have a bunch of HTML files, and what I want to do is to look in each HTML file for the keyword 'From Argumbay' and change this with some href that I have.
I thought it was very simple at first, so what I did is open each HTML file and load its content into an array (list), then look for each keyword, replace it with s///, and dump the contents back to the file. What's the problem? Sometimes the keyword can also appear in an href, in which case I don't want it to be replaced, or it can appear inside some tags and such.
An EXAMPLE: http://www.astrosociety.org/education/surf.html
I would like my script to replace each occurrence of the word 'here' with some href that I have in $href, but as you can see, there is another 'here' which is already href'ed; I don't want it to href this one again.
In this case there aren't additional 'here's except the one in the href, but let's assume that there are.
I want to replace the keyword only if it's just text. Any idea?
BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems like it erases all the comments found in the HTML/SHTML file (the main issue is that it erases SSIs in SHTML files). I tried using the store_comments(1) method on the $html before calling the recursive function, but to no avail. Any idea what I am missing here?
To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.
A common idiom with HTML-Tree is to use a recursive function that modifies the tree:
use strict;
use warnings;
use 5.008;
use File::Slurp 'read_file';
use HTML::TreeBuilder;
sub replace_keyword
{
    my $elt = shift;
    return if $elt->is_empty;
    $elt->normalize_content;  # Make sure text is contiguous
    my $content = $elt->content_array_ref;
    for (my $i = 0; $i < @$content; ++$i) {
        if (ref $content->[$i]) {
            # It's a child element, process it recursively:
            replace_keyword($content->[$i])
                unless $content->[$i]->tag eq 'a';  # Don't descend into <a>
        } else {
            # It's text:
            if ($content->[$i] =~ /here/) {  # your keyword or regexp here
                $elt->splice_content(
                    $i, 1,  # Replace this text element with...
                    substr($content->[$i], 0, $-[0]),  # the pre-match text
                    # A hyperlink with the keyword itself:
                    [ a => { href => 'http://example.com' },
                      substr($content->[$i], $-[0], $+[0] - $-[0]) ],
                    substr($content->[$i], $+[0])  # the post-match text
                );
            } # end if text contains keyword
        } # end else text
    } # end for $i in content index
} # end replace_keyword
my $content = read_file('foo.shtml');
# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
my $body = $html->look_down(qw(_tag body));
replace_keyword($body);
# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;
print STDOUT $content; # Replace STDOUT with a suitable filehandle
The output from as_HTML will be syntactically correct HTML, but not necessarily nicely-formatted HTML for people to view the source of. You can use HTML::PrettyPrinter to write out the file if you want that.
If tags matter in your search and replace, you'll need to use HTML::Parser.
This tutorial looks a bit easier to understand than the documentation with the module.
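For reference, here is a minimal sketch of that approach (not the tutorial's code): an HTML::Parser event handler that rewrites text events only, skips anything inside <a>...</a>, and passes comments (including SSI directives) through untouched. The keyword, link, and file name are placeholders.
use strict;
use warnings;
use HTML::Parser;

my $keyword = qr/From Argumbay/;        # placeholder keyword
my $href    = 'http://example.com';     # placeholder replacement link
my $in_a    = 0;
my $out     = '';

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { $in_a++ if $_[0] eq 'a'; $out .= $_[1] }, 'tagname, text' ],
    end_h       => [ sub { $in_a-- if $_[0] eq 'a'; $out .= $_[1] }, 'tagname, text' ],
    text_h      => [ sub {
                         my $text = shift;
                         # only link the keyword when we are not already inside <a>...</a>
                         $text =~ s{($keyword)}{<a href="$href">$1</a>}g unless $in_a;
                         $out .= $text;
                     }, 'text' ],
    # comments (including SSI), declarations, etc. fall through here unchanged
    default_h   => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
);

$p->parse_file('page.shtml');           # placeholder input file
print $out;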
If you want to go with a regular-expression-only method and you're prepared to accept the following provisos:
this will not work correctly within HTML comments
this will not work where the < or > character is used within a tag
this will not work where the < or > character is used and not part of a tag
this will not work where a tag spans multiple lines (if you're processing one line at a time)
If any of the above conditions do exist then you will have to use one of the HTML/XML parsing strategies outlined by other answers.
Otherwise:
my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";
1 while $html =~ s/
    \A              # beginning of string
    (               # group all non-searchfor text
        (           # sub group non-tag followed by tag
            [^<]*?  # non-tags (non-greedy)
            <[^>]*> # whole tags
        )*?         # zero or more (non-greedy)
    )
    \Q$searchfor\E  # search text
/$1$replacewith/sx;
Note that this will NOT work if $searchfor matches $replacewith (so don't put "From Argumbay" back into the replacement text).