PowerShell removing UTF-8/HTML formatting when combining files

This code is meant to combine three pieces of HTML; the main part is generated by PowerShell. When the resulting document is opened it doesn't appear right, and when opened in an HTML editor the format is flagged as improper. It looks like this:
��<!DOCTYPE HTML>
Here is the code used to combine the files; the head and tail are HTML files:
$main += $tile
$html = $head + $main + $tail
$html > .\Report.html

I was able to figure it out: I used Set-Content to enforce UTF-8.
$main += $tile
$html = $head + $main + $tail
$html | Set-Content -Encoding UTF8 -Path test2.html

Try this:
$html | out-file "Report.html" -Encoding UTF8
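For what it's worth, the `��` at the start of the file is the classic sign of a UTF-16 byte-order mark being read as a different encoding; in Windows PowerShell, the `>` redirection (Out-File) writes UTF-16LE by default. Reading the fragments with an explicit encoding and writing with Set-Content -Encoding UTF8 avoids the problem. A minimal self-contained sketch (the fragment file names here are placeholders, not the asker's actual files):

```powershell
# Sample fragments standing in for the real head/main/tail files
Set-Content .\head.html -Value '<!DOCTYPE HTML><html><body>' -Encoding UTF8
Set-Content .\main.html -Value '<p>generated report body</p>' -Encoding UTF8
Set-Content .\tail.html -Value '</body></html>' -Encoding UTF8

# Read each piece with an explicit encoding so nothing is re-interpreted
$head = Get-Content .\head.html -Raw -Encoding UTF8
$main = Get-Content .\main.html -Raw -Encoding UTF8
$tail = Get-Content .\tail.html -Raw -Encoding UTF8

# Write the combined file as UTF-8 instead of redirecting with >
Set-Content -Path .\Report.html -Value ($head + $main + $tail) -Encoding UTF8
```
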

Related

What is the most efficient way to replace all \ with \\, within a huge JSON File?

I have to replace all occurrences of \ with \\ within a huge JSON Lines file. I wanted to use PowerShell, but there might be other options too.
The source file is 4,000,000 lines and is about 6 GB.
The PowerShell script I was using took too much time; I let it run for 2 hours and it wasn't done yet. A runtime of half an hour would be acceptable.
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file2.jsonl"
(Get-Content -Path $Importfile) -replace "[\\]", "\\" | Set-Content -Path $Exportfile
If the replacement is simply a conversion of a single backslash to a double backslash, the file can be processed row by row.
Using a StringBuilder puts data into a memory buffer, which is flushed to disk every now and then. Like so:
$src = "c:\path\MyBigFile.json"
$dst = "c:\path\MyOtherFile.json"
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText($src)
$i = 0
$MaxRows = 10000
while ($null -ne ($line = $reader.ReadLine())) {
    # Replace single backslashes with doubled ones
    $line = $line.Replace('\', '\\')
    [void]$sb.AppendLine($line)
    ++$i
    # Write builder contents into the file every now and then
    if ($i -ge $MaxRows) {
        Add-Content $dst $sb.ToString() -NoNewline
        [void]$sb.Clear()
        $i = 0
    }
}
# Flush the builder after the while loop if there's data left
if ($sb.Length -gt 0) {
    Add-Content $dst $sb.ToString() -NoNewline
}
$reader.Close()
Use the -ReadCount parameter of the Get-Content cmdlet (and set it to 0).
-ReadCount
Specifies how many lines of content are sent through the pipeline at a
time. The default value is 1. A value of 0 (zero) sends all of the
content at one time.
This parameter does not change the content displayed, but it does
affect the time it takes to display the content. As the value of
ReadCount increases, the time it takes to return the first line
increases, but the total time for the operation decreases. This can
make a perceptible difference in large items.
Example (runs roughly 17× faster for a file of roughly 20 MB):
$file = 'D:\bat\files\FileTreeLista.txt'
(Measure-Command {
    $xType = (Get-Content -Path $file) -replace "[\\]", "\\"
}).TotalSeconds, $xType.Count -join ', '
(Measure-Command {
    $yType = (Get-Content -Path $file -ReadCount 0) -replace "[\\]", "\\"
}).TotalSeconds, $yType.Count -join ', '
Get-Item $file | Select-Object FullName, Length
13,3288848, 338070
0,7557814, 338070
FullName Length
-------- ------
D:\bat\files\FileTreeLista.txt 20723656
Based on your earlier question How can I optimize this Powershell script, converting JSON to CSV?, you should try to use the PowerShell pipeline for this, especially as it concerns large input and output files.
The point is that you shouldn't focus on single parts of the solution to judge performance, because this usually leaves a wrong impression: the performance of a complete (PowerShell) pipeline solution is supposed to be better than the sum of its parts. Besides, it saves a lot of memory and results in lean PowerShell syntax.
In your specific case, if correctly set up, the CPU will be replacing the slashes, rebuilding the JSON strings and converting them to objects while the hard disk is busy reading and writing the data.
To implement the replacement of the slashes in the PowerShell pipeline together with the ConvertFrom-JsonLines cmdlet:
Get-Content .\file.jsonl | ForEach-Object { $_.replace('\', '\\') } |
ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8

Remove p tags with PowerShell within specified Div class

I am trying to clean up the HTML code in a lot of HTML files; basically, what we want is to remove all paragraph tags from within specific div classes.
I am trying to achieve this within PowerShell and got as far as finding the block of text to replace and removing the opening and closing p tags within this text, but I'm having trouble getting the updated text back into the HTML file.
I have hundreds of files that contain one or more of these blocks
<div class="SomeClass">
<p various attributes>
HTML formatted content
</p>
</div>
What is the easiest way to update all .htm files such that the <p> tags within the "SomeClass" divs have been cleaned?
What I have now is
$htmlCode = Get-Content $Testfile
$firstString = '<div class="SomeClass">'
$secondString = '</div>'
$pattern = "$firstString(.*?)$secondString"
$result = [regex]::Match($htmlCode, $pattern).Groups[1].Value
$cleanedHtml = $result -replace '<p[^>]+>',''
$cleanedHtml = $cleanedHtml -replace '</p>',''
$newHtmlCode = $htmlCode -replace $result, $cleaned
When I run this the $newHtmlCode contains the original code. I'm having troubles getting the old block replaced by the new block.
In the end I achieved what I wanted with this piece of PowerShell code.
I'm sure this could be done better, or shorter, but it worked for me.
$htmlCode = Get-Content $fileName -raw
$firstString = '<div class="SomeClass">'
$secondString = '</div>'
$pattern = "(?s)$firstString(.*?)$secondString"
$regex = [regex]($pattern)
$matches = $regex.Matches($htmlCode)
$updated = $false
foreach ($result in $matches)
{
    if ($result.Value -like '*<p*')
    {
        [string]$original = $result.Value
        $cleaned = $original -replace '<p[^>]+>',''
        $cleaned = $cleaned -replace '<p>',''
        $cleaned = $cleaned -replace '</p>',''
        $htmlCode = $htmlCode.Replace($original, $cleaned)
        $updated = $true
    }
}
if ($updated)
{
    Set-Content -Path $fileName -Value $htmlCode -Encoding UTF8
    Write-Host "Updated $($fileName.FullName)"
}
I've been working on a similar problem, but took a different approach. I've been using MSHTML to parse the document and manipulate the individual elements.
$htmlCode = Get-Content $Testfile -raw # -raw will read the file as a single string
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($htmlCode) # Write HTML content according to DOM Level2
$divNodes=$html.GetElementsByTagName("div")
foreach ($node in $divNodes) {
    if ($node.className -eq "SomeClass") # note: -eq, not =, or the property gets assigned instead of compared
    {
        $pNodes = $node.GetElementsByTagName("p")
        foreach ($pNode in $pNodes) {
            $pNode.removeNode($false) > $null # remove the p tag, leaving the contents intact; discard the return value
        }
    }
}
$html.documentElement.innerHTML | Set-Content $TestFile # write the result back
I found it a bit easier to use than regular expression matching.
It looks like HTMLAgilityPack would make this even simpler, but I haven't tried that yet.

Regex: Starting from a specific point on each line

I have an HTML file that displays software installed on a machine, and I'd like to remove some of the cells in the table in the HTML file.
Below is a sample of the code:
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
...and so on.
What I'm trying to accomplish is to delete everything starting from the 4th instance of the td tag and stop just before the closing /tr tag on each line, so essentially eliminating...
<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>
<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>
...so that I'm left with...
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td></tr>
The regex that I'm using is
(?<=<td>)(.*)(?=<\/tr>)
The issue I'm having is that the above regex is selecting the entire line of code. How can I change this so that it starts from the 4th instance of the td tag on each line?
Please see the following link with a full example of the HTML file I'm using and the regex applied: https://regex101.com/r/C9lkMc/3
EDIT 1: This HTML is generated from a PowerShell script to fetch installed software on remote machines. The code for that is:
Invoke-Command -ComputerName $hostname -ScriptBlock {
    if (!([Diagnostics.Process]::GetCurrentProcess().Path -match '\\syswow64\\')) {
        $unistallPath = "\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\"
        $unistallWow6432Path = "\SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\"
        @(
            if (Test-Path "HKLM:$unistallWow6432Path") { Get-ChildItem "HKLM:$unistallWow6432Path" }
            if (Test-Path "HKLM:$unistallPath") { Get-ChildItem "HKLM:$unistallPath" }
            if (Test-Path "HKCU:$unistallWow6432Path") { Get-ChildItem "HKCU:$unistallWow6432Path" }
            if (Test-Path "HKCU:$unistallPath") { Get-ChildItem "HKCU:$unistallPath" }
        ) |
        ForEach-Object { Get-ItemProperty $_.PSPath } |
        Where-Object {
            $_.DisplayName -and !$_.SystemComponent -and !$_.ReleaseType -and !$_.ParentKeyName -and ($_.UninstallString -or $_.NoRemove)
        } |
        Sort-Object DisplayName | Select-Object -Property DisplayName, DisplayVersion, InstallDate | ft
    }
}
Regex isn't great for parsing HTML; there can be a lot of odd scenarios; e.g. what happens if you have a node <td /> or <td colspan="2"> where you'd expected to have <td>? Equally, HTML (annoyingly) doesn't always follow XML rules; so an XML parser won't work (e.g. <hr> has no end tag / <hr /> is considered invalid).
As such, if parsing HTML you ideally need to use an HTML parser. For that, PowerShell has access to the HtmlFile com object, documented here: https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
Here are some examples...
This code finds all TR elements then strips all TDs after the first 4 and returns the row's outer HTML.
$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@
$Parser = New-Object -ComObject 'HTMLFile' # see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) # on PS4 or below you can use: $Parser.write($html)
$Parser.documentElement.getElementsByTagName('tr') | %{
    $tr = $_
    $tr.getElementsByTagName('td') | Select-Object -Skip 4 | %{ $tr.removeChild($_) } | Out-Null
    $tr.OuterHtml
}
This works in a similar way; but just pulls back the values of the first 4 cells in each row:
$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@
$Parser = New-Object -ComObject 'HTMLFile' # see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) # on PS4 or below you can use: $Parser.write($html)
$Parser.documentElement.getElementsByTagName('tr') | %{
    $tr = $_
    $a,$b,$c,$d = $tr.getElementsByTagName('td') | Select-Object -First 4 | %{ "$($_.innerText)" } # we do this instead of `select -expand innerText` to ensure nulls are returned as blanks, not ignored
    (New-Object -TypeName 'PSObject' -Property ([ordered]@{
        AppName      = $a
        Version      = $b
        InstallDate  = $c
        ComputerName = $d
    }))
}
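If the rows really are one per line with simple, non-nested cells (as in the sample), a regex fallback is also possible, with all the caveats above. This is a sketch, not a robust solution: it keeps the first three <td> cells of a row and drops the rest, matching the desired output in the question.

```powershell
$line = '<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>'

# Keep the first three <td>...</td> cells, drop everything up to </tr>.
# Fragile by design: assumes one <tr> per line and no nested tags.
$trimmed = $line -replace '((?:<td[^>]*>.*?</td>){3})<td.*(?=</tr>)', '$1'
$trimmed  # <tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td></tr>
```
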

powershell ConvertTo-Html add class

My script pulls information from a server, then converts it to HTML and sends the report by email.
Snippet:
$sourceFile = "log.log"
$targetFile = "log.html"
$file = Get-Content $sourceFile
$fileLine = @()
foreach ($Line in $file) {
    $MyObject = New-Object -TypeName PSObject
    Add-Member -InputObject $MyObject -Type NoteProperty -Name Load -Value $Line
    $fileLine += $MyObject
}
$fileLine | ConvertTo-Html -Property Load -Head '<style> .tdclass{color:red;} </style>' | Out-File $targetFile
Current HTML report snippet:
<table>
<colgroup><col/></colgroup>
<tr><th>Load on servers</th></tr>
<tr><td>Server1 load is 2442</td></tr>
<tr><td>Server2 load is 6126</td></tr>
<tr><td>Server3 load is 6443</td></tr>
<tr><td> </td></tr>
<tr><td>Higher than 4000:</td></tr>
<tr><td>6126</td></tr>
<tr><td>6443</td></tr>
</table>
This will generate an HTML report containing a table with tr and td.
Is there any method to make it generate td elements with classes, so I can put the class name into the -Head parameter with styles and make the Higher than 4000: tds red?
I know this is an old post, but I stumbled across it looking to do something similar.
I was able to add CSS styling by doing a replace.
Here is an example:
$Report = $Report -replace '<td>PASS</td>','<td class="GreenStatus">PASS ✔</td>'
You can then output $Report to a file as normal, with the relevant CSS code in the header.
You would need some additional logic to find values over 4000.
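That additional logic could look something like the following sketch. The sample rows mirror the report in the question; the RedStatus class name is made up here and would need a matching rule in the -Head CSS:

```powershell
# Tag numeric-only cells over 4000 with a (made-up) CSS class
$Report = @(
    '<tr><td>Server1 load is 2442</td></tr>'
    '<tr><td>6126</td></tr>'
    '<tr><td>6443</td></tr>'
)
$Report = $Report | ForEach-Object {
    if ($_ -match '<td>(\d+)</td>' -and [int]$Matches[1] -gt 4000) {
        $_ -replace '<td>', '<td class="RedStatus">'
    } else {
        $_
    }
}
$Report
```
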
If you run Get-Help ConvertTo-Html you will get all parameters for the ConvertTo-Html command. Below is the output:
ConvertTo-Html [[-Property] <Object[]>] [[-Head] <String[]>] [[-Title] <String>] [[-Body] <String[]>] [-As <String>] [-CssUri <Uri>] [-InputObject <PSObject>] [-PostContent <String[]>] [-PreContent <String[]>] [<CommonParameters>]
You can create an external CSS file and give the CSS file path in the [-CssUri] parameter.
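For example, a minimal sketch (style.css is a placeholder path, resolved relative to wherever the report is opened):

```powershell
# style.css is a placeholder; it would hold rules such as td { color: red; }
# -CssUri emits a <link rel="stylesheet"> tag in the generated head.
[pscustomobject]@{ Load = 'Server1 load is 2442' } |
    ConvertTo-Html -Property Load -CssUri 'style.css' |
    Set-Content report.html
```
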

Find and Replace many Items with Powershell from Data within a CSV, XLS or two txt documents

So I recently found the need to do a find and replace of multiple items within an XML document. Currently I have found the code below, which will allow me to do multiple find-and-replaces, but these are hard-coded within the PowerShell.
(get-content c:\temp\report2.xml) | foreach-object {$_ -replace "192.168.1.1", "Server1"} | foreach-object {$_ -replace "192.168.1.20", "RandomServername"} | set-content c:\temp\report3.xml
Ideally, instead of hard coding the values, I would like to find and replace from a list, ideally in a CSV or an XLSX. Maybe two txt files would be easier.
If it was from a CSV, it could grab the value to find from A1 and the value to replace it with from B1, and keep looping down until the values are empty.
I understand I would have to use Get-Content and a foreach loop; I was just wondering if this is possible and how to go about it, or if anybody could help me.
Thanks in advance.
SG
# next line is to clear the output file
$null > c:\temp\report3.xml
$replacers = Import-Csv c:\temp\replaceSource.csv
gc c:\temp\aip.xml | ForEach-Object {
    $output = $_
    foreach ($r in $replacers) {
        $output = $output -replace $r.ReplaceWhat, $r.ReplaceTo
    }
    # the output has to be appended, not rewritten from scratch
    $output | Out-File c:\temp\report3.xml -Append
}
Content of replaceSource.csv looks like:
ReplaceWhat,ReplaceTo
192.168.1.1,server1
192.168.1.20,SERVER2
Note the headers.
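One caveat worth noting: -replace treats ReplaceWhat as a regular expression, so the dots in 192.168.1.1 match any character (the pattern would also hit strings like 192x168x1x1). If the CSV holds literal strings rather than deliberate patterns, escape them first. A sketch, with the replacer list inlined for illustration:

```powershell
# Escape the search text so -replace matches it literally, not as a regex
$replacers = @(
    [pscustomobject]@{ ReplaceWhat = '192.168.1.1'; ReplaceTo = 'Server1' }
)
$line = 'ping 192.168.1.1 but not 192x168x1x1'
foreach ($r in $replacers) {
    $line = $line -replace [regex]::Escape($r.ReplaceWhat), $r.ReplaceTo
}
$line  # ping Server1 but not 192x168x1x1
```
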