Regex: Starting from a specific point on each line - html

I have an HTML file that displays software installed on a machine, and I'd like to remove some of the cells in the table in the HTML file.
Below is a sample of the code:
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
...and so on.
What I'm trying to accomplish is to delete everything starting from the 4th instance of the td tag and stop just before the closing /tr tag on each line, so essentially eliminating...
<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>
<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>
...so that I'm left with...
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td></tr>
The regex that I'm using is
(?<=<td>)(.*)(?=<\/tr>)
The issue I'm having is that the above regex selects the entire line. How can I change it so that the match starts at the 4th instance of the td tag on each line?
Please see the following link with a full example of the HTML file I'm using and the regex applied: https://regex101.com/r/C9lkMc/3
EDIT 1: This HTML is generated from a PowerShell script to fetch installed software on remote machines. The code for that is:
Invoke-Command -ComputerName $hostname -ScriptBlock {
if (!([Diagnostics.Process]::GetCurrentProcess().Path -match '\\syswow64\\')) {
$unistallPath = "\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\"
$unistallWow6432Path = "\SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\"
@(
if (Test-Path "HKLM:$unistallWow6432Path" ) { Get-ChildItem "HKLM:$unistallWow6432Path"}
if (Test-Path "HKLM:$unistallPath" ) { Get-ChildItem "HKLM:$unistallPath" }
if (Test-Path "HKCU:$unistallWow6432Path") { Get-ChildItem "HKCU:$unistallWow6432Path"}
if (Test-Path "HKCU:$unistallPath" ) { Get-ChildItem "HKCU:$unistallPath" }
) |
ForEach-Object { Get-ItemProperty $_.PSPath } |
Where-Object {
$_.DisplayName -and !$_.SystemComponent -and !$_.ReleaseType -and !$_.ParentKeyName -and ($_.UninstallString -or $_.NoRemove)
} |
Sort-Object DisplayName | Select-Object -Property DisplayName, DisplayVersion, InstallDate | ft
}
}

Regex isn't great for parsing HTML; there can be a lot of odd scenarios; e.g. what happens if you have a node <td /> or <td colspan="2"> where you'd expected to have <td>? Equally, HTML (annoyingly) doesn't always follow XML rules; so an XML parser won't work (e.g. <hr> has no end tag / <hr /> is considered invalid).
As such, if parsing HTML you ideally need to use an HTML parser. For that, PowerShell has access to the HtmlFile com object, documented here: https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
Here are some examples...
This code finds all TR elements then strips all TDs after the first 4 and returns the row's outer HTML.
$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@
$Parser = New-Object -ComObject 'HTMLFile' #see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) #if this call fails on your system, use instead: $Parser.write([Text.Encoding]::Unicode.GetBytes($html))
$parser.documentElement.getElementsByTagName('tr') | %{
$tr = $_
$tr.getElementsByTagName('td') | select-object -skip 4 | %{$tr.removeChild($_)} | out-null
$tr.OuterHtml
}
This works in a similar way; but just pulls back the values of the first 4 cells in each row:
$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@
$Parser = New-Object -ComObject 'HTMLFile' #see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) #if this call fails on your system, use instead: $Parser.write([Text.Encoding]::Unicode.GetBytes($html))
$parser.documentElement.getElementsByTagName('tr') | %{
$tr = $_
$a,$b,$c,$d = $tr.getElementsByTagName('td') | select-object -first 4 | %{"$($_.innerText)"} #we do this instead of `select -expand innerText` to ensure nulls are returned as blanks; not ignored
(New-Object -TypeName 'PSObject' -Property ([ordered]@{
AppName = $a
Version = $b
InstallDate = $c
ComputerName = $d
}))
}
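If the HTML really is as regular as the sample in the question (one row per line, only plain <td>...</td> cells, no attributes or self-closing cells), a regex can still do the job. The following is only a minimal sketch under those assumptions, and it keeps the first three cells as the question asks; the variable names are made up for illustration.

```powershell
# Assumption: each row sits on a single line and uses only plain <td>...</td> cells.
$rows = @(
    '<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>'
    '<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>'
)

# Group 1 captures <tr> plus the first three cells; the rest of the line
# up to (but not including) </tr> is matched and thrown away.
$pattern = '^(<tr>(?:<td>[^<]*</td>){3}).*(?=</tr>)'
$trimmed = $rows | ForEach-Object { $_ -replace $pattern, '$1' }
$trimmed
```

A row containing <td /> or <td colspan="2"> would not match this pattern and would pass through unchanged, which is exactly the kind of edge case the parser-based approach handles better.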

Related

How to parse HTML table with Powershell Core 7?

I have the following code:
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content -Path $FilePath -Raw
try
{
$html.IHTMLDocument2_write($source) 2> $null
}
catch
{
$encoded = [Text.Encoding]::Unicode.GetBytes($source)
$html.write($encoded)
}
$t = $html.getElementsByTagName("table") | Where-Object {
$cells = $_.tBodies[0].rows[0].cells
$cells[0].innerText -eq "Name" -and
$cells[1].innerText -eq "Description" -and
$cells[2].innerText -eq "Default Value" -and
$cells[3].innerText -eq "Release"
}
The code works fine on Windows Powershell 5.1, but on Powershell Core 7 $_.tBodies[0].rows returns null.
So, how does one access the rows of an HTML table in PS 7?
PowerShell (Core), as of 7.3.1, does not come with a built-in HTML parser - and this may never change.
You must rely on a third-party solution, such as the PowerHTML module that wraps the HTML Agility Pack.
The object model works differently than the Internet Explorer-based one available in Windows PowerShell; it is similar to the XML DOM provided by the standard System.Xml.XmlDocument type ([xml])[1]; see the documentation and the sample code below.
# Install the module on demand
If (-not (Get-Module -ErrorAction Ignore -ListAvailable PowerHTML)) {
Write-Verbose "Installing PowerHTML module for the current user..."
Install-Module PowerHTML -ErrorAction Stop
}
Import-Module -ErrorAction Stop PowerHTML
# Create a sample HTML file with a table with 2 columns.
Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html
# Parse the HTML file into an HTML DOM.
$htmlDom = ConvertFrom-Html -Path sample.html
# Find a specific table by its column names, using an XPath
# query to iterate over all tables.
$table = $htmlDom.SelectNodes('//table') | Where-Object {
$headerRow = $_.Element('tr') # or @($_.Elements('tr'))[0]
# Filter by column names
$headerRow.ChildNodes[0].InnerText -eq 'Name' -and
$headerRow.ChildNodes[1].InnerText -eq 'Mode'
}
# Print the table's HTML text.
$table.InnerHtml
# Extract the first data row's first column value.
# Note: @(...) is required around .Elements() for indexing to work.
@($table.Elements('tr'))[1].ChildNodes[0].InnerText
A Windows-only alternative is to use the HTMLFile COM object, as shown in this answer, and as used in your own attempt - I'm unclear on why it didn't work in your specific case.
[1] Notably with respect to supporting XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposing child nodes via a .ChildNodes collection, and providing .InnerHtml / .OuterHtml / .InnerText properties. Instead of an indexer that supports child element names, methods .Element(<name>) and .Elements(<name>) are provided.
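To illustrate the analogy in the footnote, here is a small sketch that uses the built-in [xml] type rather than PowerHTML. It assumes the fragment is well-formed XML, which real-world HTML often is not; the sample table is made up for illustration.

```powershell
# Assumption: the markup is well-formed XML, so the built-in [xml] type can load it.
[xml]$doc = @'
<table>
  <tr><th>Name</th><th>Mode</th></tr>
  <tr><td>home</td><td>d----</td></tr>
</table>
'@

# XPath query over all rows, as with $htmlDom.SelectNodes('//table') above.
$rows = $doc.SelectNodes('//tr')

# Child nodes and their text are exposed via .ChildNodes and .InnerText.
$header = $rows[0].ChildNodes | ForEach-Object { $_.InnerText }

# Single-node XPath lookup relative to a row.
$firstCell = $rows[1].SelectSingleNode('td[1]').InnerText
```

The same SelectNodes / ChildNodes / InnerText calls are what the PowerHTML sample relies on; only the way the document is loaded differs.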
I used the answer above for my solution. I installed PowerHTML.
I wanted to extract the datatable from https://www.dicomlibrary.com/dicom/dicom-tags/ and convert them.
From this:
<tr><td>(0002,0000)</td><td>UL</td><td>File Meta Information Group Length</td><td></td></tr>
To this:
{"00020000", "ULFile Meta Information Group Length"}
$page = Invoke-WebRequest https://www.dicomlibrary.com/dicom/dicom-tags/
$htmldom = ConvertFrom-Html $page
$table = $htmlDom.SelectNodes('//table') | Where-Object {
$headerRow = $_.Element('tr') # or @($_.Elements('tr'))[0]
# Filter by column names
$headerRow.ChildNodes[0].InnerText -eq 'Tag'
}
foreach ($row in $table.SelectNodes('tr'))
{
# Tag column: "(0002,0000)" becomes '{"00020000",'
$a = $row.SelectSingleNode('td[1]').innerText.Trim() -replace "`n|`r|\s+", " " -replace "\(",'{"' -replace ",","" -replace "\)",'",'
# VR column (second cell), with all whitespace removed
$b = $row.SelectSingleNode('td[2]').innerText.Trim() -replace "`n|`r|\s+", ""
# Description column (third cell), with whitespace collapsed
$c = $row.SelectSingleNode('td[3]').innerText.Trim() -replace "`n|`r|\s+", " "
$c = '"'+$b+$c+'"},'
$row = New-Object -TypeName psobject
$row | Add-Member -MemberType NoteProperty -Name Tag -Value $a
$row | Add-Member -MemberType NoteProperty -Name Value -Value $c
[array]$data += $row
}
$data | Out-File c:\scripts\dd.txt

Remove p tags with PowerShell within specified Div class

I am trying to clean up the HTML code in a lot of HTML files; basically, we want to remove all paragraphs from within specific div classes.
I am trying to achieve this in PowerShell. I got as far as finding the block of text to replace and removing the opening and closing p tags within it, but I'm having trouble getting the updated text back into the HTML file.
I have hundreds of files that contain one or more of these blocks
<div class="SomeClass">
<p various attributes>
HTML formatted content
</p>
</div>
What is the easiest way to update all .htm files so that the <p> tags inside the "SomeClass" divs are removed?
What I have now is
$htmlCode = Get-Content $Testfile
$firstString = '<div class="SomeClass">'
$secondString = '</div>'
$pattern = "$firstString(.*?)$secondString"
$result = [regex]::Match($htmlCode, $pattern).Groups[1].Value
$cleanedHtml = $result -replace '<p[^>]+>',''
$cleanedHtml = $cleanedHtml -replace '</p>',''
$newHtmlCode = $htmlCode -replace $result, $cleaned
When I run this, $newHtmlCode contains the original code. I'm having trouble getting the old block replaced by the new block.
In the end I achieved what I wanted with this piece of PowerShell code.
I'm sure this could be done better, or shorter, but it worked for me.
$htmlCode = Get-Content $fileName -raw
$firstString = '<div class="SomeClass">'
$secondString = '</div>'
$pattern = "(?s)$firstString(.*?)$secondString"
$regex = [regex]($pattern)
$matches = $regex.Matches($htmlCode)
$updated = $false
foreach ($result in $matches)
{
if ($result.Value -like '*<p*')
{
[string]$original = $result.Value
$cleaned = $original -replace '<p[^>]+>',''
$cleaned = $cleaned -replace '<p>',''
$cleaned = $cleaned -replace '</p>',''
$htmlCode = $htmlCode.Replace($original, $cleaned)
$updated = $true
}
}
if ($updated)
{
Set-Content -path $fileName -value $htmlCode -Encoding UTF8
Write-Host "Updated $($fileName.FullName)"
}
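The loop above can probably be condensed by letting the regex engine call back for each matched div. This is only a sketch: the sample HTML is made up, and like the code above it assumes the divs are not nested (the lazy .*? stops at the first closing div tag).

```powershell
# Hypothetical sample input; only the class name comes from the question.
$htmlCode = @'
<div class="SomeClass">
<p style="x">
Keep <b>this</b>
</p>
</div>
<p>Untouched</p>
'@

# (?s) lets . match newlines so a div spanning several lines is one match.
$pattern = '(?s)<div class="SomeClass">.*?</div>'

# The script block is converted to a MatchEvaluator and runs once per matched div.
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    # Remove <p>, <p ...attributes> and </p>, but leave tags such as <pre> alone.
    $match.Value -replace '</?p(\s[^>]*)?>', ''
}

$cleanedHtml = [regex]::Replace($htmlCode, $pattern, $evaluator)
```

Because the replacement happens inside the matched text, there is no need for a separate .Replace($original, $cleaned) step, and paragraphs outside the target divs are left as they are.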
I've been working on a similar problem, but took a different approach. I've been using MSHTML to parse the document and manipulate the individual elements.
$htmlCode = Get-Content $Testfile -raw # -raw will read the file as a single string
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($htmlCode) # Write HTML content according to DOM Level2
$divNodes=$html.GetElementsByTagName("div")
foreach ($node in $divNodes) {
if ($node.className -eq "SomeClass") # -eq compares; a single = would assign the class name instead
{
$pNodes=$node.GetElementsByTagName("p")
foreach ($pNode in $pNodes) {
$pNode.removeNode($false) > $null # remove the p Tag, leaving the inside intact. Get rid of the return value
}
}
}
$html.documentElement.innerHTML | Set-Content $TestFile # write the result back
I found it a bit easier to use than regular expression matching.
It looks like HTMLAgilityPack would make this even simpler, but I haven't tried that yet.

Escape powershell Variable in HTML

I am trying to write a report in PowerShell and then send the output to an HTML page. I am trying to figure out how to format a variable from my PowerShell script as HTML.
If I get output that is not correct, I would like it to have a blue font in my HTML page.
Here is my powershell script
$os = (get-wmiobject -class win32_operatingsystem).caption
function checkosvers {
if($os -contains "*Server*" ){
write-output "This is a server"}
else{
write-output "Its a $os"}
}
$osvers = checkosvers | foreach {$PSItem -replace ("Its a $os","<font color='blue'>Its a $os</font>")}
$osvers | ConvertTo-Html -Fragment -as List | Out-File -FilePath "C:\Users\XX\Desktop\mypage.html"
If I put a string in place of a variable, it appears blue in my HTML page:
{$PSItem -replace ("Its a $os","Its a $os")}
You could make a filter of it, and just have it output what you want. I would have it accept values from the pipeline for ease of use, so something like this:
filter checkosvers {
param([Parameter(ValueFromPipeline)]$osVer)
if($osVer -match "Server" ){
"This is a server"
}else{
"<font color='blue'>Its a $osVer</font>"}
}
Then you can just pipe things to it like:
PS C:\Windows\system32> 'Microsoft Windows Server 2012 R2 Datacenter'|checkosvers
This is a server
or...
PS C:\WINDOWS\system32> (get-wmiobject -class win32_operatingsystem).caption|checkosvers
<font color='blue'>Its a Microsoft Windows 10 Enterprise</font>
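Note that the filter uses -match where the function in the question used -contains. The short sketch below shows how the three operators behave on a single string; the caption is the sample value from above and the variable names are made up.

```powershell
# Sample value taken from the example above.
$caption = 'Microsoft Windows Server 2012 R2 Datacenter'

# -contains tests collection membership and does not understand wildcards.
$viaContains = $caption -contains '*Server*'

# -like does wildcard matching.
$viaLike = $caption -like '*Server*'

# -match does regular-expression matching.
$viaMatch = $caption -match 'Server'

"contains: $viaContains, like: $viaLike, match: $viaMatch"
```

Only the -like and -match comparisons succeed here, which would explain why the original function always fell through to the else branch.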

powershell ConvertTo-Html add class

My script pulls information from a server, converts it to HTML, and sends the report by email.
Snippet:
$sourceFile = "log.log"
$targetFile = "log.html"
$file = Get-Content $sourceFile
$fileLine = @()
foreach ($Line in $file) {
$MyObject = New-Object -TypeName PSObject
Add-Member -InputObject $MyObject -Type NoteProperty -Name Load -Value $Line
$fileLine += $MyObject
}
$fileLine | ConvertTo-Html -Property Load -head '<style> .tdclass{color:red;} </style>' | Out-File $targetFile
Current HTML report snippet:
<table>
<colgroup><col/></colgroup>
<tr><th>Load on servers</th></tr>
<tr><td>Server1 load is 2442</td></tr>
<tr><td>Server2 load is 6126</td></tr>
<tr><td>Server3 load is 6443</td></tr>
<tr><td> </td></tr>
<tr><td>Higher than 4000:</td></tr>
<tr><td>6126</td></tr>
<tr><td>6443</td></tr>
</table>
This will generate an HTML report containing a table with tr and td.
Is there any method to make it generate td with classes, so I can insert the class name into the -head property with styles and make it red for the Higher than 4000: tds ?
I know this is an old post, but I stumbled across it while looking to do something similar.
I was able to add CSS styling by doing a replace.
Here is an example:
$Report = $Report -replace '<td>PASS</td>','<td class="GreenStatus">PASS ✔</td>'
You can then output $report to a file as normal, with the relevant css code in the header.
You would need some additional logic to find values over 4000
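One way that additional logic might look is sketched below. It assumes the number of interest is the last thing in the cell, as in the report snippet from the question; the .tdclass name comes from the question's -head style, while the variable names are made up.

```powershell
# Sample rows in the shape ConvertTo-Html produced in the question.
$report = @(
    '<tr><td>Server1 load is 2442</td></tr>'
    '<tr><td>Server2 load is 6126</td></tr>'
    '<tr><td>Higher than 4000:</td></tr>'
)

# Group 1 is any leading text, group 2 the trailing number of the cell.
$pattern = '<td>([^<]*?)(\d+)</td>'

$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    $value = [int]$match.Groups[2].Value
    if ($value -gt 4000) {
        # Rebuild the cell with the class attribute added.
        '<td class="tdclass">{0}{1}</td>' -f $match.Groups[1].Value, $value
    } else {
        # Leave cells at or below the threshold unchanged.
        $match.Value
    }
}

$styled = $report | ForEach-Object { [regex]::Replace($_, $pattern, $evaluator) }
$styled
```

Cells that do not end in a number, such as the "Higher than 4000:" heading, do not match the pattern and are passed through as they are.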
Running Get-Help ConvertTo-Html lists all parameters of the ConvertTo-Html command. Below is the output:
ConvertTo-Html [[-Property] <Object[]>] [[-Head] <String[]>] [[-Title] <String>] [[-Body] <String[]>] [-As <String>] [-CssUri <Uri>] [-InputObject <PSObject>] [-PostContent <String[]>] [-PreContent <String[]>] [<CommonParameters>]
You can create an external CSS file and pass its path to the -CssUri parameter.

PowerShell match names with user email addresses and format as mailto

I have the script below, which scans a drive for folders, imports a CSV of folder names and folder owners, matches them, and outputs the result to HTML.
I am looking for a way to use PowerShell to look up the user names from the CSV, grab their email addresses from AD, and put them in the HTML output as mailto links.
function name($filename, $folderowners, $directory, $output){
$server = hostname
$date = Get-Date -format "dd-MMM-yyyy HH:mm"
$a = "<style>"
$a = $a + "TABLE{border-width: 1px;border-style: solid;border-color:black;}"
$a = $a + "Table{background-color:#ffffff;border-collapse: collapse;}"
$a = $a + "TH{border-width:1px;padding:0px;border-style:solid;border-color:black;}"
$a = $a + "TR{border-width:1px;padding-left:5px;border-style:solid;border-color:black;}"
$a = $a + "TD{border-width:1px;padding-left:5px;border-style:solid;border-color:black;}"
$a = $a + "body{ font-family:Calibri; font-size:11pt;}"
$a = $a + "</style>"
$c = " <br></br> Content"
$b = Import-Csv $folderowners
$mappings = @{}
$b | % { $mappings.Add($_.FolderName, $_.Owner) }
Get-ChildItem $directory | where {$_.PSIsContainer -eq $True} | select Name,
@{n="Owner";e={$mappings[$_.Name]}} | sort -property Name |
ConvertTo-Html -head $a -PostContent $c |
Out-File $output
}
name "gdrive" "\\server\location\gdrive.csv" "\\server\location$" "\\server\location\gdrive.html"
Try adding something like this to the select:
@{n="email";e={"mailto:"+((Get-ADUser $mappings[$_.Name] -Properties mail).mail)}}
You need to load the ActiveDirectory module before you can use the Get-ADUser cmdlet:
Import-Module ActiveDirectory
On server versions this module can be installed via Server Manager or dism. On client versions you have to install the Remote Server Administration Tools before you can add the module under "Programs and Features".
Edit: I would have expected ConvertTo-Html to automatically create clickable links from mailto:user@example.com URIs, but apparently it doesn't. Since ConvertTo-Html automatically encodes angle brackets as HTML entities and I haven't found a way to prevent that, you also can't just pre-create the property as an HTML snippet. Something like this should work, though:
ConvertTo-Html -head $a -PostContent $c | % {
$_ -replace '(mailto:)([^<]*)', '<a href="$1$2">$2</a>'
} | Out-File $output
Here's how I would do it (avoiding the use of the AD Module, only because it's not on all of my workstations and this works just the same), and assuming you know the user name already:
#Setup Connection to Active Directory
$de = [ADSI]"LDAP://example.org:389/OU=Users,dc=example,dc=org"
$sr = New-Object System.DirectoryServices.DirectorySearcher($de)
After I set up a connection to AD, I set my LDAP search filter. This takes standard LDAP query syntax.
#Set Properties of Search
$sr.SearchScope = [System.DirectoryServices.SearchScope]"Subtree"
$sr.Filter = "(&(ObjectClass=user)(samaccountname=$Username))"
I then execute the search.
#Grab user's information from OU. If search returns nothing, they are not a user and the script exits.
$SearchResults = $sr.FindAll()
if($SearchResults.Count -gt 0){
$emailAddr = $SearchResults[0].Properties["mail"]
$mailto = "<a href='mailto:$emailAddr'>Contact User</a>"
}
You can of course send the $mailto variable anywhere you want and change its HTML, but hopefully this gets you started.