How to grab info from HTML page?

Please, help me grab information from this structure:
<table id="id1" class="class1">
<tbody>
<tr id="id2">
<td>
<span class="class2">
"header text"
</span>
</td>
<td id="d" style="width:10px;">
<img style="width:10px;" src="/images/img1.gif">
</td>
<td id="r" style="width:40%;">
<span class="class2">
<nobr>Headings:</nobr>
</span>
</td>
</tr>
<tr>
<td>
<table class="class1" style="width:100%;">
<tbody>
<tr>
<td width="300" valign="top"></td>
</tr>
<tr>
<td style="padding:0px;">
<div>
<b>Address: </b>
Address text
</div>
<div>
<b>Tel.: </b>
250-1729
</div>
<br>
</td>
</tr>
</tbody>
</table>
</td>
<td>
<img src="/images/img.gif">
</td>
<td>
heading1
<br>
heading2
<br>
heading3
<br>
</td>
</tr>
</tbody>
</table>
I want to get:
header text
Address text
Tel. number
but I don't understand how to get it with PowerShell.
First, I get the table:
$address = "http://address.com"
$page = Invoke-WebRequest $address
$table = $($page.parsedhtml.getElementsByTagName("table") | Where { $_.id -eq 'id1' })
What's next?
How can I access the table's child elements and get their text?

This is my solution:
$address = "http://address.com"
$page = Invoke-WebRequest $address
$table = $page.ParsedHtml.getElementById("id1")
$tr = $table.getElementsByTagName('tr') | Where { $_.id -eq 'id2' }
$name=($tr.getElementsByTagName('a') | select -First 1).innertext
$divs=$table.getElementsByTagName('div')
foreach ($div in $divs) {
if ($div.innertext -match "address: ") {$adr=$div.innertext -replace "Address: ",""}
if ($div.innertext -match "Tel.: ") {$tel=$div.innertext -replace "Tel.: ",""}
}
$resultmassive+=[string]::Join(";",$name,$adr,$tel)
P.S. It may be possible to use PowerShell's switch construct instead of foreach, but it doesn't work for me.

First and foremost: if your elements have an ID use getElementById() instead of getElementsByTagName() with an additional filter. That will give you the correct table (or other element) right away.
When you have the (parent) element you can get nested elements by calling getElementById(), getElementsByTagName(), etc. on the parent:
$nestedTables = $table.getElementsByTagName('table')
In your case you want to get:
the child element with the ID id2 and then its (grand)child <a> element (for the header text):
$tr = $table.getElementById('id2')
$tr.getElementsByTagName('a')
the <div> elements in the nested table (for the address and phone number):
$table.getElementsByTagName('div')
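
The same "find the parent, then search inside it" idea can be sketched outside PowerShell. The following is only an illustration in Python's standard library, not the PowerShell API: it assumes a trimmed, well-formed copy of the question's markup (the `<img>` and `<br>` tags are left out so the XML parser accepts it), and `SAMPLE` and `extract` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table (assumption: void tags such as
# <img> and <br> are omitted so that the XML parser accepts the markup).
SAMPLE = """
<table id="id1" class="class1">
  <tbody>
    <tr id="id2">
      <td><span class="class2">header text</span></td>
    </tr>
    <tr>
      <td>
        <table class="class1">
          <tbody>
            <tr>
              <td>
                <div><b>Address: </b>Address text</div>
                <div><b>Tel.: </b>250-1729</div>
              </td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
"""


def extract(markup):
    """Return the header text and a label -> value dict built from the divs."""
    table = ET.fromstring(markup)            # the outer table is the root
    row = table.find(".//tr[@id='id2']")     # nested lookup on the parent
    header = "".join(row.find(".//span").itertext()).strip()

    fields = {}
    for div in table.iter("div"):            # every <div> below the table
        text = "".join(div.itertext()).strip()
        label, _, value = text.partition(":")
        fields[label.strip()] = value.strip()
    return header, fields


header, fields = extract(SAMPLE)
print(header, fields["Address"], fields["Tel."], sep=";")
# prints: header text;Address text;250-1729
```

Splitting each div's text at the first colon avoids hard-coding one regular expression per label, which is what the foreach loop in the question does by hand.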

Related

Use HTML::TreeBuilder in Perl to extract all instances of a specific span class

Trying to make a Perl script to open an HTML file and extract anything contained within <span class="postertrip"> tags.
Sample HTML:
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply2">
<a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername">Test1</span><span class="postertrip">!AAAAAAAA</span> 08/01/03(Thu)02:06</label> <span class="reflink"> No.2 </span> <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br /> <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>
<blockquote>
<p>Test message 1</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply5">
<a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span> <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> No.5 </span>
<blockquote>
<p>Test message 2</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply7">
<a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span> <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> No.7 </span>
<blockquote>
<p>Test message 3</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
Desired output:
!AAAAAAAA
!BBBBBBBB
!CCCCCCCC
Current script:
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use HTML::TreeBuilder;
open(my $html, "<", "temp.html")
or die "Can't open";
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);
foreach my $e ($tree->look_down('class', 'reply')) {
my $e = $tree->look_down('class', 'postertrip');
say $e->as_text;
}
Bad output of script:
!AAAAAAAA
!AAAAAAAA
!AAAAAAAA

In your foreach loop you have to look down from the element you found, not from the root of the tree. So the correct code is:
foreach my $parent ($tree->look_down('class', 'reply')) {
my $e = $parent->look_down('class', 'postertrip');
say $e->as_text;
}
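
The difference between searching from the root and searching from each parent is not specific to HTML::TreeBuilder. As a rough illustration only, here is the same mistake and its fix in Python's standard library, assuming a simplified, well-formed version of the sample markup (`DOC`, `from_root` and `from_parent` are made-up names):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the sample page: three reply cells, one trip each.
DOC = """
<div>
  <table><tr><td class="reply"><span class="postertrip">!AAAAAAAA</span></td></tr></table>
  <table><tr><td class="reply"><span class="postertrip">!BBBBBBBB</span></td></tr></table>
  <table><tr><td class="reply"><span class="postertrip">!CCCCCCCC</span></td></tr></table>
</div>
"""

root = ET.fromstring(DOC)
replies = root.findall(".//td[@class='reply']")

# Bug: each iteration searches from the root, so the first span in the
# whole tree is returned every time.
from_root = [root.find(".//span[@class='postertrip']").text for _ in replies]

# Fix: each iteration searches from the reply cell it is visiting.
from_parent = [td.find(".//span[@class='postertrip']").text for td in replies]

print(from_root)
print(from_parent)
```

The first list repeats the first trip three times, matching the bad output in the question; the second list contains one trip per reply.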

I've never liked HTML::TreeBuilder. It's a bit of a complicated mess, and it hasn't been updated in three years. Using CSS selectors with Mojo::DOM is pretty easy though. Its find does all that work that the various look_downs do:
use v5.10;
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my @values = Mojo::DOM->new( $html )
->find( 'td.reply span.postertrip' )
->map( 'all_text' )
->each;
say join "\n", @values;
Note that in your HTML::TreeBuilder code, you don't have the logic to select the tags you care about. You can do it but you need extra work. The CSS selectors take care of that for you.

Hide empty table (with whitespace) with css

I'm using Blade to fill some tables with content but in some cases a table might end up empty when there is nothing to fill.
Here is part of the php / blade template:
<table class="table">
@isset($content->client)
<tr>
<td>
Client:
</td>
<td class="text-right">
{{ $content->client }}
</td>
</tr>
@endisset
@isset($content->published)
<tr>
<td>
Published:
</td>
<td class="text-right">
{{ $content->published }}
</td>
</tr>
@endisset
</table>
In case $content->client and $content->published are not set the result is something like:
<table class="table">
</table>
Is there a simple css way to remove the table entirely in these cases?
I'm familiar with the :empty selector, but apparently that doesn't work if there is whitespace inside the tag :(

I would suggest not printing the table at all when neither of the variables is set.
<?php
if( isset($content->client) || isset($content->published))
{
// echo table
}
?>

Did you try :blank? Unlike :empty, it was specified to also match elements that contain only whitespace, but check browser support before relying on it.

How to select child element with skipping some of his parents in XPath [duplicate]

I have a complex HTML structure with a lot of tables and divs, and the structure might change. How can I write an XPath that skips the elements in between?
For example:
<table>
<tr>
<td>
<span>First Name</span>
</td>
<td>
<div>
<table>
<tbody>
<tr>
<td>
<div>
<table>
<tbody>
<tr>
<td>
<img src="1401-2ATd8" alt="" align="middle">
</td>
<td><span><input atabindex="2" id=
"MainLimitLimit" type="text"></span></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
I have to get the input element relative to the "First Name" span, e.g.:
By.xpath("//span[contains(text(), 'First Name')]/../../td[2]/div/table/tbody/tr/td/table/tbody/tr/td[2]/input")
But can I skip the intermediate elements and access the input element directly, with something like this?
By.xpath("//span[contains(text(), 'First Name')]/../../td[2]//input[contains(@id,'MainLimitLimit')]")

You can try this XPath:
//td[contains(span,'First Name')]/following-sibling::td[1]//input[contains(@id, 'MainLimitLimit')]
Explanation:
Select the <td><span>First Name</span></td> element:
//td[contains(span,'First Name')]
Then get the <td> element that follows it:
/following-sibling::td[1]
Then get the <input> element anywhere inside the <td> selected in the second step:
//input[contains(@id, 'MainLimitLimit')]
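
The three steps can also be traced by hand. The sketch below uses Python's standard library purely as an illustration; ElementTree does not support the following-sibling axis or contains(), so steps 1 and 2 are done with ordinary loops. It assumes a well-formed copy of the question's markup, and `MARKUP` and `input_after_label` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's structure (assumption: <img> and
# <input> are self-closed so that the XML parser accepts the markup).
MARKUP = """
<table>
  <tr>
    <td><span>First Name</span></td>
    <td>
      <div><table><tbody><tr>
        <td><div><table><tbody><tr>
          <td><img src="1401-2ATd8" alt="" /></td>
          <td><span><input id="MainLimitLimit" type="text" /></span></td>
        </tr></tbody></table></div></td>
      </tr></tbody></table></div>
    </td>
  </tr>
</table>
"""


def input_after_label(root, label):
    """Return the first <input> in the cell that follows the labelled cell."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Pair every cell with the cell that follows it in the same row.
        for label_cell, next_cell in zip(cells, cells[1:]):
            span = label_cell.find("span")
            # Step 1: the cell whose <span> carries the label text.
            if span is not None and (span.text or "").strip() == label:
                # Step 2: next_cell is its following sibling.
                # Step 3: ".//input" matches at any depth inside that cell.
                return next_cell.find(".//input")
    return None


root = ET.fromstring(MARKUP)
found = input_after_label(root, "First Name")
print(found.get("id"))
```

Because step 3 searches at any depth, the lookup keeps working if more wrapper tables or divs are added between the cell and the input.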

You can use //, which matches at any level:
By.xpath("//span[contains(text(), 'First Name')]//td[2]/input[contains(@id,'MainLimitLimit')]")

You can use the "First Name" span as a predicate. Try the code below:
//td[preceding-sibling::td[span[contains(text(), 'First Name')]]]//input[contains(@id,'MainLimitLimit')]

Scraping and divs

I'm new to PHP and am trying to scrape data from a website using regular expressions, but I'm having trouble locating the rental and details content inside the divs. Here is my code. Could someone help me out?
<?php
header('content-type: text/plain');
$contents= file_get_contents('http://www.hassconsult.co.ke/index.php?option=com_content&view=article&id=22&Itemid=29');
$contents = preg_replace('/\s(1,)/','',$contents);
$contents = preg_replace('/ /','',$contents);
//print $contents."\n";
$records = preg_split('/<span class="style8"/',$contents);
for ($ix=1; $ix < count($records); $ix++){
$tmp = $records[$ix];
preg_match('/href="(.*?)"/',$tmp, $match_url);
preg_match('/>(.*?)<\/span>/',$tmp,$match_name);
preg_match('/<div[^>]+class ?= ?"style10"[^>]*>(\s*(<div.*(?2).*<\/div>\s*)*)<\/div>/Us',$tmp,$match_rental);//error is here
print_r($match_url);
print_r($match_name);
print_r($match_rental);
print $tmp."\n";
exit ();
}
//print count($records)."\n";
//print_r($records);
//if ($contents===false)
//print 'FALSE';
//print_r(htmlentities($contents));
?>
Here is a sample of the content
>HILLVIEW CROSSROADS4 BED HOUSE</span></div></td>
</tr>
<tr>
<td width="57%" style="padding-left:20px;"><div align="left" class="style10" style="color:#007AC7;">
<div align="left">
Rental;
USD 4,500
</div>
</div></td>
<td width="43%" align="right"><div align="right" class="style10" style="color:#007AC7;">
<div align="right">
No.
834
</div>
</div></td>
</tr>
<tr>
<td colspan="2" style="padding-left:20px;color:#000000;">
<div align="justify" style="font-family:Arial, Helvetica, sans-serif;font-size:11px;color:333300;">Artistically designed 4 bed (all
ensuite) house on half acre of well-tended gardens. Lounge with fireplace opening to terrace, opulent master suite, family room, study. Good finishes, SQ, carport, extra water storage
and generator. ....Details </div></td>
</tr>
</table></td>
</tr>
</table>
<br>

That website doesn't have good CSS selectors, but it's still not too hard to get the data with XPath:
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.hassconsult.co.ke/index.php?option=com_content&view=article&id=22&Itemid=29');
$xpath = new DOMXPath($dom);
foreach($xpath->query("//div[@id='ad']/table") as $table) {
// title
echo $xpath->query(".//span[@class='style8']", $table)->item(0)->nodeValue . "\n";
// price
echo $xpath->query(".//div[@class='style10']/div", $table)->item(0)->nodeValue . "\n";
// description
echo $xpath->query(".//div[@align='justify']", $table)->item(0)->nodeValue . "\n";
}
