How to extract certain data from HTML using RegEx?

I've got the following code:
<tr class="even">
<td>
Title1
</td>
<td>
Name1
</td>
<td>
Email1
</td>
<td>
Postcode1
</td>
I want to use a regex to output the data between the tags, like so:
Title1
Name1
Email1
Postcode1
Title2
Name2
Email2
Postcode2
...

You shouldn't use a regex to parse html, use an HTML parser instead.
Anyway, if you really want a regex you can use this one:
>\s+<|>\s*(.*?)\s*<
Working demo
Match information:
MATCH 1
1. [51-57] `Title1`
MATCH 2
1. [109-114] `Name1`
MATCH 3
1. [166-172] `Email1`
MATCH 4
1. [224-233] `Postcode1`
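For comparison, here is a minimal sketch of how that pattern behaves in Python. It assumes Python's standard `re` module rather than the engine used in the demo, and the sample string simply mirrors the row from the question. The first alternative (`>\s+<`) matches the whitespace between adjacent tags without capturing anything, so those matches carry no group and are filtered out.

```python
import re

# Pattern from the answer above: the first alternative swallows whitespace
# between adjacent tags; the second captures the trimmed text between tags.
PATTERN = re.compile(r">\s+<|>\s*(.*?)\s*<")

SAMPLE = """<tr class="even">
<td>
Title1
</td>
<td>
Name1
</td>
<td>
Email1
</td>
<td>
Postcode1
</td>
"""


def cell_texts(html):
    """Return the non-empty captures of group 1, in document order."""
    return [m.group(1) for m in PATTERN.finditer(html) if m.group(1)]


if __name__ == "__main__":
    for text in cell_texts(SAMPLE):
        print(text)
```

This only works because each cell's text sits on a single line between two tags; nested markup or a `>` inside an attribute value would break it, which is the usual argument for a real parser.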

This should get rid of everything between the tags, and output the rest space separated:
$text = @'
<tr class="even">
<td>
Title1
</td>
<td>
Name1
</td>
<td>
Email1
</td>
<td>
Postcode1
</td>
'@
$text -split '\s*<.+?>\s*' -match '\S' -as [string]
Title1 Name1 Email1 Postcode1

Don't use a regex. HTML isn't a regular language, so it can't be properly parsed with a regex. It will succeed most of the time, but other times will fail. Spectacularly.
Use the Internet Explorer COM object to read your HTML from a file:
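$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $false
$ie.Navigate("F:\BuildOutput\rt.html")
# Wait for the page to finish loading before touching the document
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
$document = $ie.Document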
# This will return all the tables
$document.getElementsByTagName('table')
# This will return a table with a specific ID
$document.getElementById('employees')
Here's the MSDN reference for the document class.
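The parser-based route is not tied to Internet Explorer. As a rough illustration, assuming Python and its standard-library `html.parser` module (the `CellCollector` class and `extract_cells` function are hypothetical names, not part of the answer above), the cell text from the question's row could be collected like this:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text content of every <td> element it sees."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data  # append text to the current cell


def extract_cells(html):
    """Return the stripped text of each <td> in document order."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    sample = '<tr class="even">\n<td>\nTitle1\n</td>\n<td>\nName1\n</td>\n</tr>'
    for text in extract_cells(sample):
        print(text)
```

Unlike the regex, this keeps working when attributes or extra whitespace are added to the tags, because the parser tracks element boundaries rather than characters.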

Related

Insert Data from CSV to HTML table using powershell

I am new to PowerShell and want to insert the data from a CSV file into an HTML table that I created separately. This is my CSV:
Sitename EmailAddress
Test example@gmail.com
How should I insert this data into my HTML table so that, when I add data to the CSV, it is also automatically added to the HTML table?
test.ps1 script
$kfxteam = Get-Content ('.\template\teamnotif.html')
$notifteam = '' #result html
$teamlist = Import-Csv ".\list\teamlist.csv" | select 'SiteName','EmailAddress'
For($a=0; $a -lt $kfxteam.Length; $a++) {
# If "<table class=content >" matches what is written in $kfxteam, show the result
if($kfxteam -match "<table class=content >"){
# should replace the placeholders with the data from the CSV and also add a new row
write-host $teamlist[$a].SiteName
}
}
html format
<table class=content >
<td class=c1 nowrap>
Remote Sitenames
</td>
<td class=c1 nowrap >
Users Email Address
</td>
</tr>
<tr>
<td class=c1 nowrap>
usitename
</td>
<td class=c1 nowrap>
[uemail]
</td>
</tr>
</table>
The output HTML table should be:
Remote Sitenames Email Address
Test example@gmail.com

If I were you, I'd change the HTML template file regarding the table to become something like this:
<table class=content>
<tr>
<td class=c1 nowrap>Remote Sitenames</td>
<td class=c1 nowrap>Users Email Address</td>
</tr>
##TABLEROWSHERE##
</table>
Now, you have a placeholder which you can replace with the table rows you create using the CSV file like:
# create a template for each of the rows to insert
# with two placeholders to fill in using the -f Format operator
$row = @"
<tr>
<td class=c1 nowrap>{0}</td>
<td class=c1 nowrap>{1}</td>
</tr>
"@
# import the csv, loop over the records and use the $row template to create the table rows
$tableRows = Import-Csv -Path '.\list\teamlist.csv' | ForEach-Object {
$row -f $_.Sitename, $_.EmailAddress
}
# then combine it all in the html
$result = (Get-Content -Path '.\template\teamnotif.html' -Raw) -replace '##TABLEROWSHERE##', ($tableRows -join [Environment]::NewLine)
# save the completed HTML
$result | Set-Content -Path '.\list\teamlist.html'
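The placeholder technique carries over to other languages more or less unchanged. The following is only a sketch in Python, assuming a comma-separated file with `Sitename` and `EmailAddress` columns; `fill_template` and `ROW` are hypothetical names, and the HTML escaping is an addition that the PowerShell version above does not perform.

```python
import csv
import io
from html import escape

# Row template with two positional placeholders, mirroring $row above.
ROW = "<tr>\n<td class=c1 nowrap>{0}</td>\n<td class=c1 nowrap>{1}</td>\n</tr>"


def fill_template(template, csv_text, placeholder="##TABLEROWSHERE##"):
    """Replace the placeholder in the template with one <tr> per CSV record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        # escape() guards against characters such as < or & in the data
        ROW.format(escape(record["Sitename"]), escape(record["EmailAddress"]))
        for record in reader
    ]
    return template.replace(placeholder, "\n".join(rows))


if __name__ == "__main__":
    template = "<table class=content>\n##TABLEROWSHERE##\n</table>"
    data = "Sitename,EmailAddress\nTest,example@gmail.com\n"
    print(fill_template(template, data))
```

Because the rows are regenerated from the CSV on every run, adding a record to the file is enough to make it appear in the table the next time the script runs.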

Ruby and Nokogiri parsing table?

This is my HTML:
<tbody><tr><th>SHOES</th></tr>
<tr>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>
This is my code:
nodes = page.css("tr").select do |el|
el.css('th').text =~ /SHOES/
end
nodes.each do |value|
puts value.css("td").text
end
I wish to get the values shoe 1, shoe 2 and shoe 3, but there is no output. I suspect there is an extra <tr></tr> in between <tr><th>SHOES</th></tr>. Or are the <br> the culprit?
There are other structures like:
<tr>
<th>SHOES</th>
<td>NBA</td>
</tr>
and I got the desired output "NBA".
What did I do wrong?
I have two kinds of structures:
Name1: value
Name1: value2
The above would give:
<tr>
<th>Name1</th>
<td>Value</td>
</tr>
but sometimes it's:
Name:
value
value2
value3
So the HTML is:
<tbody><tr><th>Name</th></tr>
<tr>
<td>value<br>value2<br> ....</td>

In HTML, tables are composed of rows. When you iterate over those rows, only one of them is the header. Although you see a logical relation between the body rows and the header row, for HTML (and therefore for Nokogiri) there is none.
If you want the value of every cell under a specific header, you can find the index of that header's column and then take the values at that index.
Using this HTML as source
html = '<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>'
We then find the position of the right <th> in the first row of the table:
page = Nokogiri::HTML(html)
shoes_position = page.css("tr")[0].css('th').find_index do |el|
el.text =~ /SHOES/
end
With that, we find the <td>s at that position in every other row and get their text:
shoes_tds = page.css('tr').map {|row| row.css('td')[shoes_position] }.compact
shoes_names = shoes_tds.map { |td| td.text }
I use compact to remove the nil values: the first row (the one with the headers) has no td, so it returns nil.
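The same column-index idea can be sketched outside Ruby. The version below assumes Python's standard-library `html.parser` instead of Nokogiri; `RowCollector` and `column_values` are hypothetical names, and treating each `<br>` as a value separator is an assumption based on the markup in the question.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects table rows as lists of [tag, text] cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None  # the cell currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.cell = [tag, ""]
            self.rows[-1].append(self.cell)
        elif tag == "br" and self.cell is not None:
            self.cell[1] += "\n"  # a <br> separates two values

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell[1] += data


def column_values(html, header):
    """Return the values listed under the column whose <th> equals header."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    headers = [text.strip() for tag, text in parser.rows[0] if tag == "th"]
    position = headers.index(header)  # raises ValueError if the header is absent
    values = []
    for row in parser.rows[1:]:
        cells = [text for tag, text in row if tag == "td"]
        if position < len(cells):
            values.extend(p.strip() for p in cells[position].split("\n") if p.strip())
    return values


HTML = """<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>"""

if __name__ == "__main__":
    print(column_values(HTML, "SHOES"))
```

As in the Ruby version, rows that have no cell at the header's position are simply skipped.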

You can get there with css:
td = doc.at('tr:has(th[text()=SHOES]) + tr td')
td.children.map{|x| x.text.strip}.reject(&:empty?)
#=> ["Shoe 1", "shoe 2", "shoe3"]
but maybe mixing it up with xpath is better:
td.search('./text()').map{|x| x.text.strip}
#=> ["Shoe 1", "shoe 2", "shoe3"]

How to read the values of a table in HTML file and Store them in Perl?

I read many questions and answers but couldn't find a straight answer to my question. All the answers were either very general or different from what I want to do. I got as far as learning that I need to use HTML::TableExtract or HTML::TreeBuilder::XPath, but I couldn't really use them to store the values. I could somehow get the table row values and show them with Dumper.
Something like this:
foreach my $ts ($tree->table_states) {
foreach my $row ($ts->rows) {
push (@fir , (Dumper $row));
} }
print @sec;
But this is not really doing what I'm looking for. I will add the structure of the HTML table that I want to store the values:
<table><caption><b>Table 1 </b>bla bla bla</caption>
<tbody>
<tr>
<th ><p>Foo</p>
</th>
<td ><p>Bar</p>
</td>
</tr>
<tr>
<th ><p>Foo-1</p>
</th>
<td ><p>Bar-1</p>
</td>
</tr>
<tr>
<th ><p>Formula</p>
</th>
<td><p>Formula1-1</p>
<p>Formula1-2</p>
<p>Formula1-3</p>
<p>Formula1-4</p>
<p>Formula1-5</p>
</td>
</tr>
<tr>
<th><p>Foo-2</p>
</th>
<td ><p>Bar-2</p>
</td>
</tr>
<tr>
<th ><p>Foo-3</p>
</th>
<td ><p>Bar-3</p>
<p>Bar-3-1</p>
</td>
</tr>
</tbody>
</table>
It would be convenient if I can store the row values as pairs together.
expected output would be something like an array with values of:
(Foo , Bar , Foo-1 , Bar-1 , Formula , Formula-1 Formula-2 Formula-3 Formula-4 Formula-5 , ....)
The important thing for me is to learn how to store the values of each tag and how to move around in the tag tree.

Learn XPath and DOM manipulation.
use strictures;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new;
$dom->parse_file('10280979.html');
my %extract;
@extract{$dom->findnodes_as_strings('//th')} =
map {[$_->findvalues('p')]} $dom->findnodes('//td');
__END__
# %extract = (
# Foo => [qw(Bar)],
# 'Foo-1' => [qw(Bar-1)],
# 'Foo-2' => [qw(Bar-2)],
# 'Foo-3' => [qw(Bar-3 Bar-3-1)],
# Formula => [qw(Formula1-1 Formula1-2 Formula1-3 Formula1-4 Formula1-5)],
# )
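To illustrate the same "walk the tree, pair each header with its cell" idea in another language, here is a sketch assuming Python's standard `xml.etree.ElementTree`. It only works when the table is well-formed XML, as the sample in the question happens to be; `table_to_dict` is a hypothetical name. Unlike the hash slice above, it pairs `th` and `td` row by row, so a row missing either cell is skipped rather than shifting later pairs.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table from the question.
SAMPLE = """<table><caption><b>Table 1 </b>bla bla bla</caption>
<tbody>
<tr>
<th ><p>Foo</p>
</th>
<td ><p>Bar</p>
</td>
</tr>
<tr>
<th ><p>Foo-3</p>
</th>
<td ><p>Bar-3</p>
<p>Bar-3-1</p>
</td>
</tr>
</tbody>
</table>"""


def table_to_dict(markup):
    """Map each row's <th> text to the list of <p> texts in its <td>."""
    root = ET.fromstring(markup)
    extract = {}
    for row in root.iter("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # not a header/value row
        key = "".join(th.itertext()).strip()
        extract[key] = [(p.text or "").strip() for p in td.findall("p")]
    return extract


if __name__ == "__main__":
    for name, values in table_to_dict(SAMPLE).items():
        print(name, values)
```

For HTML that is not well-formed XML, a tolerant HTML parser would be needed instead of `ElementTree`.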

Finding an XPATH expression

For the following html:
<tr>
<td class="first">AUD</td>
<td> 0.00 </td>
<td> 1,305.01 </td>
<td> 1,305.01 </td>
<td> -65.20 </td>
<td> 0.00 </td>
<td> 0.00 </td>
<td> 1,239.81 </td>
<td class="fx-rate"> 0.98542 </td>
</tr>
I am trying to grab the value of the fx-rate, given the type of currency. For example, the function would be something like get_fx_rate(currency). This is the XPath expression I have so far, but it results in an empty list, []. What am I doing wrong here, and what would be the correct expression?
"//td[@class='first']/text()[normalize-space()='AUD']/parent::td[@class='fx-rate']"

Use this:
//td[@class = 'first' and normalize-space() = 'AUD']/parent::tr/td[@class = 'fx-rate']
or clearer:
//tr[td[@class="first" and normalize-space()="AUD"]]/td[@class="fx-rate"]
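As a sketch of the get_fx_rate(currency) function the question asks for, the row-based lookup can be written with Python's standard `xml.etree.ElementTree`. That module supports only a subset of XPath (no `normalize-space()` or `and`), so the currency comparison is done in Python here. The sketch assumes the markup is well-formed and passed in as a string; the extra `markup` parameter and the wrapping `<table>` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<table>
<tr>
<td class="first">AUD</td>
<td> 0.00 </td>
<td> 1,239.81 </td>
<td class="fx-rate"> 0.98542 </td>
</tr>
</table>"""


def get_fx_rate(markup, currency):
    """Return the fx-rate of the row whose first cell names the currency."""
    root = ET.fromstring(markup)
    for row in root.iter("tr"):
        first = row.find("td[@class='first']")
        rate = row.find("td[@class='fx-rate']")
        if first is None or rate is None:
            continue
        # strip() plays the role of normalize-space() for these simple cells
        if "".join(first.itertext()).strip() == currency:
            return float("".join(rate.itertext()).strip())
    return None  # currency not present in the table


if __name__ == "__main__":
    print(get_fx_rate(SAMPLE, "AUD"))
```

The key point is the same as in the XPath answers: select the row first, then look for the fx-rate cell inside that row, because the two cells are siblings rather than parent and child.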

This is the way I managed to solve it, using partial xpaths:
### get all the elements via xpath
currencies = driver.find_elements_by_xpath("//td[@class='first']")
fx_rates = driver.find_elements_by_xpath("//td[@class='fx-rate']")
### build the two lists and zip them to get the k,v pairs
fx_values = [fx.text for fx in fx_rates if fx.text]
currency_text = [currency.text for currency in currencies if currency.text]
### list() is needed on Python 3, where zip() returns an iterator
list(zip(currency_text, fx_values))[1:]

Webscraping In powershell monitor page

I want to be able to monitor my printer's status web page and have a script email me when the ink level falls below 25%. I'm pretty sure this can be done in PowerShell, but I'm at a loss as to how to do it.
This is the page HTML in question:
<h2>Supply Status</h2>
<table class="matrix">
<thead>
<tr>
<th>Supply Information</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>Black Toner</td>
<td>End of life</td>
</tr>
<tr>
<td>Cyan Toner</td>
<td>Under 25%</td>
</tr>
<tr>
<td>Magenta Toner</td>
<td>Under 25%</td>
</tr>
<tr>
<td>Yellow Toner</td>
<td>Under 25%</td>
</tr>
</tbody>
</table>
<p>
Thanks.
Adam

Building on @Joey's answer, give this a whirl with the HTML Agility Pack.
$html = new-object HtmlAgilityPack.HtmlDocument
$result = $html.Load("http://full/path/to/file.htm")
$colors = $html.DocumentNode.SelectNodes("//table[@class='matrix']//tbody/tr")
$result = $colors | % {
$color = $_.SelectSingleNode("td[1]").InnerText
$level = $_.SelectSingleNode("td[2]").InnerText
new-object PsObject -Property @{ Color = $color; Level = $level; } |
Select Color,Level
}
$result | Sort Level | ft -a
This assumes you already have the HTML Agility Pack loaded into PowerShell. Mine is loaded in my profile as:
[System.Reflection.Assembly]::LoadFrom(
(join-path $profileDirectory HtmlAgilityPack)
+ "\HtmlAgilityPack.dll" ) | Out-Null
Using the example HTML provided, your output looks like:
At this point, you have the output and can email it out.
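The filtering step that decides whether an email is needed can be sketched as follows. This is an illustration in Python using the standard `xml.etree.ElementTree`, assuming the status table has already been fetched and is well-formed. `supplies_needing_attention` is a hypothetical name, and the set of statuses that count as "low" is an assumption based on the sample page.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the status strings that should trigger an alert.
ALERT_STATUSES = {"Under 25%", "End of life"}

SAMPLE = """<table class="matrix">
<thead>
<tr>
<th>Supply Information</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>Black Toner</td>
<td>End of life</td>
</tr>
<tr>
<td>Cyan Toner</td>
<td>Under 25%</td>
</tr>
</tbody>
</table>"""


def supplies_needing_attention(markup):
    """Return (supply, status) pairs whose status is in ALERT_STATUSES."""
    table = ET.fromstring(markup)
    alerts = []
    for row in table.findall("./tbody/tr"):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if len(cells) >= 2 and cells[1] in ALERT_STATUSES:
            alerts.append((cells[0], cells[1]))
    return alerts


if __name__ == "__main__":
    for supply, status in supplies_needing_attention(SAMPLE):
        print(f"{supply}: {status}")
```

A non-empty result is the condition for sending the notification; the sending itself is left out of this sketch.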

The easiest way would probably be the HTML Agility Pack which you can import in PowerShell. Lee Holmes has a short article demonstrating a simple example with it. Essentially you're using an XML-like API to access the HTML DOM.

We Keep Coding
