How to parse the HTML of a website with PowerShell - html

I am trying to retrieve some information from a website. I want to look for a specific tag/class and then return the contained text value (innerHTML). This is what I have so far:
$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)
foreach ($obj in $HTML.all) {
$obj.getElementsByClassName('some-class-name')
}
I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I'm trying to "Select-Object" them.
So after spending two days, how am I supposed to parse HTML with Powershell?
I can't use IHTMLDocument2 methods, since I don't have Office installed (Unable to use IHTMLDocument2)
I can't use Invoke-WebRequest without -UseBasicParsing, since PowerShell hangs and spawns additional windows while accessing the ParsedHTML property (parsedhtml doesnt respond anymore and Using Invoke-Webrequest in PowerShell 3.0 spawns a Windows Security Warning)
So since parsing HTML with regex is such a big no-no, how do I do it otherwise? Nothing seems to work.

Since no one else has posted an answer, I managed to get a working solution with the following code:
$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)
$filter = $HTML.getElementsByClassName($htmlClassName)
With some URLs the $filter variable was empty, while it was populated for others. All in all, this might work for your situation, but it seems like PowerShell isn't the way to go for more complex parsing.

In 2020 with PowerShell 5+ you do it like this:
$searchClass = "banana" <# in this example we parse all elements of class "banana" but you can use any class name you wish #>
$myURI = "url.com" <# replace url.com with any website you want to scrape from #>
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 <# using TLS 1.2 is vitally important #>
$req = Invoke-Webrequest -URI $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | %{Write-Host $_.innerhtml}
#for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | %{Write-Host $_.href} #outputs all the links

Related

Parsing a JSON Byte Stream Returned from a Web Request using Powershell without writing to a file

I'm using the following Powershell to call a URL that returns a JSON file. (If you were to paste the URL into the browser it would download a file, rather than display the raw JSON):
$fileName = "output.json"
$url = "https://some-url-that-returns-a-json-file-byte-stream"
$jsonFileResponse = Invoke-WebRequest -Method GET -Uri $url -UseDefaultCredentials -OutFile "output.json"
$parsedData = Get-Content -Path $fileName | ConvertFrom-Json
This works fine, and I get my parsed data at the end.
I was wondering, is there a way of doing this without involving the file system, i.e. without having the step of outputting to output.json first, and instead handling this in-memory and without any disk i/o?
The question is less one of converting to 'JSON' and more of just converting to text.
Using [System.Text.Encoding]::ASCII.GetString seems to do the trick.
$url = "https://some-url-that-returns-a-json-file-byte-stream"
$jsonFileResponse = Invoke-WebRequest -Method GET -Uri $url -UseDefaultCredentials
$parsedData = [System.Text.Encoding]::ASCII.GetString($jsonFileResponse.Content) | ConvertFrom-Json
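One caveat with the ASCII approach is that any non-ASCII character in the payload is replaced by "?". The following is a minimal sketch, assuming the response body is UTF-8 encoded (the usual case for JSON, but not something the original question confirms). ConvertFrom-JsonBytes is a hypothetical helper name, and the sample bytes stand in for $jsonFileResponse.Content so the example runs without a network call.

```powershell
# Hypothetical helper: decode a byte array as UTF-8 text and parse it as JSON.
function ConvertFrom-JsonBytes {
    param(
        [Parameter(Mandatory = $true)]
        [byte[]]$Bytes
    )
    # Assumption: the server sends UTF-8. Adjust the encoding if it does not.
    $text = [System.Text.Encoding]::UTF8.GetString($Bytes)
    return ($text | ConvertFrom-Json)
}

# Stand-in for $jsonFileResponse.Content (made-up data containing a non-ASCII character).
$sampleText  = '{"city":"Z' + [char]0xFC + 'rich","visits":2}'
$sampleBytes = [System.Text.Encoding]::UTF8.GetBytes($sampleText)

$parsedData = ConvertFrom-JsonBytes -Bytes $sampleBytes

# For comparison: ASCII decoding turns each byte above 127 into "?".
$asciiText = [System.Text.Encoding]::ASCII.GetString($sampleBytes)
```

If the payload is known to be pure ASCII, both decoders give the same result, so switching to UTF-8 should not change behavior for the original case.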

How do I fix a "cannot bind parameter 'uri'. Cannot convert the...." when converting json in powershell?

I am doing a Get-Weather project in powershell, where I pull data down from weatherapi.com . I am able to successfully connect to the website using an API key but, when I try to convert it from json in the script it doesn't work. The error I get is:
"Cannot bind parameter 'Uri'. Cannot convert the..."
I have tried so many different ways to write this:
$response = Invoke-RestMethod -uri $url -Method Get -ResponseHeadersVariable r -StatusCodeVariable s
$weatherobject = ConvertFrom-Json $url
The request for the website is:
$url = Invoke-WebRequest "http://api.weatherapi.com/v1/forecast.json?key=$key&q=$location&days=$Days"
Any help would be very much appreciated, thank you!
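The input of the ConvertFrom-Json cmdlet is a JSON string. In your script, $url holds the response object returned by Invoke-WebRequest rather than a URL string, which is why it cannot be bound to the -Uri parameter. Look at the document below for more information: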
https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/convertfrom-json?view=powershell-7.1#description
$url = "http://api.weatherapi.com/v1/forecast.json?key=$key&q=$location&days=$Days"
$response = Invoke-RestMethod -uri $url -Method Get -ResponseHeadersVariable r -StatusCodeVariable s
$weatherobject = $response # Invoke-RestMethod already converts a JSON response into objects, so ConvertFrom-Json is not needed here
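As a rough offline illustration of what ConvertFrom-Json expects, the sketch below parses a hand-written JSON string. The field names and values are invented for the example and are not claimed to match the real weatherapi.com response.

```powershell
# Made-up JSON text standing in for a raw response body (an assumption, not real API output).
$jsonText = '{"location":{"name":"London"},"current":{"temp_c":11.5}}'

# ConvertFrom-Json takes JSON *text* and returns an object with matching properties.
$weather = $jsonText | ConvertFrom-Json

# Nested JSON objects become nested properties.
$cityName = $weather.location.name
$tempC    = $weather.current.temp_c
```

Passing a URL or a web response object instead of JSON text is what leads to conversion errors like the one in the question.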

Powershell script not parsing correctly

So I'm trying to pull the number of hours worked and the date worked from a table in my company's database to make a chart in Power BI through a streaming data set. I'm using PowerShell to parse a JSON file.
Here's a JSON sample:
{"COUNT":"334","DISPLAY_LIST_START":"1","DISPLAY_LIST_STOP":"334","STOP":"334","RECORD":[{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/23/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/24/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/26/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/30/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/01/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["4",["05/02/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/03/2018"]]},{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/07/2018"]]},
I know it's not the best in terms of organization, but it's all I have to work with.
Here's the powershell code I have so far:
Invoke-WebRequest -Uri "http://wya.works/rta_develop/xmlServlet?&command=retrieve&sql=select%20%5B%24Hours%5D%2C%20%5B%24Date%20Worked%5D%20from%20%5B%21HOURS%5D%20&attributesOnly=Date%20Worked%2C%20Hours&contentType=JSON&referer=&time=1595443368507&key=696a6768"
$endpoint = "https://api.powerbi.com/beta/d6cdaa23-930e-49c1-9d2a-0fbe648551b2/datasets/91466553-d719-420c-9e3e-73e748379263/rows?noSignUpCheck=1&key=SU5GRBBWuuEIDSjqHW5hdgJzSMCQ3qUQ9mGrBDanjgpExv6woY1Sa1c3PC1Wk3WHHn1N%2FEpIuVgzHHcw0JXwYw%3D%3D"
$json.RECORD | Foreach-Object {
Write-Output "Checking Records"
$hours = 0
$date = ""
$json.FIELD | Foreach-Object{
Write-Output "Checking Field"
if ($_ -match '\d{1,2}/\d{1,2}\/d{4}'){
$date = $_
}
else {
$hours = $_
}
}
$payload = @{
"Hours" = $hours
"Date Worked" =$date
}
}
Invoke-RestMethod -Method Post -Uri "$endpoint" jlk-Body (ConvertTo-Json @($payload))
I need to parse through each record and pull the values of the hours (the numeric value in the JSON) and the Date (the date value).
When I run the code I don't get any errors, but it doesn't seem to be reaching the -match or the else statements. I tried logging the output on both and it returns nothing.
Is there something wrong with my loops?
I'm brand new to PowerShell and got most of this code with help from other people, but I understand what it's doing for the most part.
Also, for anyone who knows about streaming datasets: will pulling the data this way even give me what I want?
Store the result of Invoke-WebRequest in $json first. You missed that step, which is why you are getting null. I don't know whether it is a typo or an oversight.
Secondly, $json.RECORD is wrong because the response object doesn't have a RECORD property. What you are looking for is the content: $json.Content gives you the body of the response.
$json=Invoke-WebRequest -Uri "http://wya.works/rta_develop/xmlServlet?&command=retrieve&sql=select%20%5B%24Hours%5D%2C%20%5B%24Date%20Worked%5D%20from%20%5B%21HOURS%5D%20&attributesOnly=Date%20Worked%2C%20Hours&contentType=JSON&referer=&time=1595443368507&key=696a6768"
Your endpoint and the Invoke-RestMethod call have nothing to do with your JSON parsing. First handle the response in the loop and see what output you get. I have structured it below, but I have not tested it against the JSON sample data as of now:
$json=Invoke-WebRequest -Uri "http://wya.works/rta_develop/xmlServlet?&command=retrieve&sql=select%20%5B%24Hours%5D%2C%20%5B%24Date%20Worked%5D%20from%20%5B%21HOURS%5D%20&attributesOnly=Date%20Worked%2C%20Hours&contentType=JSON&referer=&time=1595443368507&key=696a6768"
$json.content | Foreach-Object {
Write-Output "Checking Records"
$hours = 0
$date = ""
$json.FIELD | Foreach-Object{
Write-Output "Checking Field"
if ($_ -match '\d{1,2}/\d{1,2}/\d{4}'){
$date = $_
}
else {
$hours = $_
}
}
$payload = @{
"Hours" = $hours
"Date Worked" =$date
}
}
$endpoint = "https://api.powerbi.com/beta/d6cdaa23-930e-49c1-9d2a-0fbe648551b2/datasets/91466553-d719-420c-9e3e-73e748379263/rows?noSignUpCheck=1&key=SU5GRBBWuuEIDSjqHW5hdgJzSMCQ3qUQ9mGrBDanjgpExv6woY1Sa1c3PC1Wk3WHHn1N%2FEpIuVgzHHcw0JXwYw%3D%3D"
Invoke-RestMethod -Method Post -Uri "$endpoint" jlk-Body (ConvertTo-Json @($payload))
The root of your problem is this:
$json.RECORD | Foreach-Object {
$json.FIELD | Foreach-Object{
...
}
}
Your outer foreach-object is looping over each item in the RECORD array, but the inner foreach-object is trying to loop over top-level FIELD properties that don't actually exist!
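The difference between the two loops can be seen in a small sketch. The two-record JSON string below is a cut-down stand-in for the real response, and the variable names $topLevel and $rows are only for illustration.

```powershell
# Cut-down stand-in for the real response (two records only).
$json = '{"RECORD":[{"FIELD":["6",["04/23/2018"]]},{"FIELD":["4",["05/02/2018"]]}]}' | ConvertFrom-Json

# There is no FIELD property at the top level, so this is $null.
$topLevel = $json.FIELD

# Inside the loop, $_ is the current RECORD item, and FIELD lives on that item.
$rows = $json.RECORD | ForEach-Object {
    [pscustomobject]@{
        Hours = $_.FIELD[0]      # first value in the FIELD array
        Date  = $_.FIELD[1][0]   # first value of the nested array
    }
}
```

Reading FIELD from $_ rather than from $json is the key change; everything else in the loop can stay as it was.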
However, I think you'll hit more problems further along your code if you try to troubleshoot what you've already got, and there's an easier way to do what you're trying to do...
First, your sample json isn't quite valid - there's a trailing comma and some unclosed brackets so I'm going to reformat it with some line breaks so it's easier to read, and then fix it.
Note - you'll get the text from your call to Invoke-WebRequest, or as @Lee_Dailey suggested, Invoke-RestMethod.
# reformat the json, fix it, and assign it to a variable using a "Here-String"
# (see https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_quoting_rules?view=powershell-7#here-strings)
$text = @"
{
"COUNT":"334",
"DISPLAY_LIST_START":"1",
"DISPLAY_LIST_STOP":"334",
"STOP":"334",
"RECORD":[
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/23/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/24/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/26/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["04/30/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/01/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["4",["05/02/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/03/2018"]]},
{"SESSION_ID":"c_a7FdTFicmxBJh9kln4V6gKxz_QErcufE7URF9m","FIELD":["6",["05/07/2018"]]}
]
}
"@
Then, we'll parse it - i.e. convert it from a string into an object model with well-defined properties:
# parse the json text
$json = $text | ConvertFrom-Json
And then process each record in turn:
$json.RECORD | ForEach-Object {
# get the number of hours worked. this is the first value in the FIELD array
# (note - the array starts at index 0 because it's zero-indexed)
$hours = $_.FIELD[0] # e.g. "6"
# get the "date worked". we need to get the second value (index 1) in
# the FIELD array, but this is a nested array, so once we've got the
# inner array we need to get the first value (index 0 again) from that
$date = $_.FIELD[1][0] # e.g. "04/23/2018"
# now we can build the payload...
$payload = @{
"Hours" = $hours
"Date Worked" = $date
}
# ...and invoke the api for this record
$endpoint = "https://api.powerbi.com/beta/d6cdaa23-930e-49c1-9d2a-0fbe648551b2/datasets/91466553-d719-420c-9e3e-73e748379263/rows?noSignUpCheck=1&key=SU5GRBBWuuEIDSjqHW5hdgJzSMCQ3qUQ9mGrBDanjgpExv6woY1Sa1c3PC1Wk3WHHn1N%2FEpIuVgzHHcw0JXwYw%3D%3D"
Invoke-RestMethod -Method Post -Uri "$endpoint" -Body (ConvertTo-Json @($payload))
}
Note the api call is inside the foreach loop, otherwise you end up calculating the $payload for each RECORD but only ever actually calling the api for the last one.
(I've also removed a spurious "jlk" from your final line, which is probably a typo).

Check certain text from a webpage via powershell

I am trying to get the HTML code from an intranet webpage and check whether certain texts or titles exist. This PowerShell code will be used by my monitoring program to trigger alerts when the webpage is down and those texts or titles can no longer be found.
For now, I'm just using Write-Host to see if my piece of code works. I can now extract the HTML source to $output, and I am sure 'Create!' can be found inside. However, I'm not getting a 'YES'.
May I know if $output can be checked by using -contains?
Thank you very much for your help!
$targetUrl = 'https://myUrl/'
$ie = New-Object -com InternetExplorer.Application
$ie.visible=$true
$ie.navigate($targetUrl)
while($ie.Busy) {
Start-Sleep -m 2000
}
$output = $ie.Document.body.innerHTML
if($output -contains '*Create!*')
{Write-Host 'YES'}
else
{Write-Host 'NO'}
The operator -contains is used to search collections. The IE's innerHTML is just a string:
$output = $ie.Document.body.innerHTML
$output.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True String System.Object
Use pattern matching operators like, well, -like and -match.
By the way, if IE is not mandatory, try Invoke-WebRequest cmdlet.
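To make the difference between the operators concrete, here is a small sketch that runs without IE. The HTML fragment is a made-up stand-in for $ie.Document.body.innerHTML.

```powershell
# Made-up fragment standing in for the page's innerHTML.
$output = '<div><button>Create!</button></div>'

# -contains tests whether a collection has an element equal to the right-hand value.
# A single string is treated as a one-element collection, so this is False.
$viaContains = $output -contains '*Create!*'

# -like does wildcard matching against the whole string.
$viaLike = $output -like '*Create!*'

# -match does a regular-expression search; Escape guards against regex metacharacters.
$viaMatch = $output -match [regex]::Escape('Create!')

# The .NET String.Contains method does a plain, case-sensitive substring test.
$viaMethod = $output.Contains('Create!')

if ($viaLike) { Write-Host 'YES' } else { Write-Host 'NO' }
```

Note that -like and -match are case-insensitive by default, while String.Contains is case-sensitive, so pick whichever matches how strict the check should be.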

Jira API POST and Invoke-RestMethod

Using Jira API 2 and PowerShell 5 Invoke-RestMethod I can successfully execute GET, but I keep getting a (400) Bad Request when attempting POST method to create an issue in my project.
$user = [System.Text.Encoding]::UTF8.GetBytes("me:mypassword")
$headers = @{Authorization = "Basic " + [System.Convert]::ToBase64String($user)}
$data = Get-Content D:\scripts\powershell\issue.txt
Invoke-RestMethod -Uri "https://agile.mycompany.com/rest/api/2/issue/" -Method POST -Headers $headers -ContentType "application/json" -Body $data
$data variable is well-formed JSON for Jira:
{
"fields":
{
"project":{"Key": "ITS"},
"summary":"Rest Test 1",
"issuetype":{"name": "Task"},
"assignee":{"key": "myusername"},
"priority":{"id": "3"},
"description":
"||Host Name||IP Address||Comments||
|some-pc|192.168.1.1| |",
"duedate": "2016-09-11"
}
}
I am the project owner, so this isn't a permissions issue.
Get-Content is tricky, because it actually returns an array of strings, where each line in your text file is one element of that array. The best way to get around that is probably to use .NET's file read methods instead:
$data = [System.IO.File]::ReadAllText("D:\scripts\powershell\issue.txt")
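The array behavior is easy to check with a throwaway file. The sketch below writes a small sample to the temp directory (the file name and contents are made up) and compares three ways of reading it; Get-Content -Raw, available since PowerShell 3.0, is an alternative to the .NET call that also returns a single string.

```powershell
# Write a three-line sample file to the temp directory (hypothetical name and content).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'issue-sample.json'
Set-Content -Path $path -Value @('{', '  "fields": { "summary": "Rest Test 1" }', '}')

# Default Get-Content: one array element per line.
$lines = Get-Content -Path $path

# -Raw: the whole file as a single string.
$raw = Get-Content -Path $path -Raw

# .NET equivalent, as suggested above.
$viaDotNet = [System.IO.File]::ReadAllText($path)

# Parse the single-string form to confirm it is usable JSON text.
$summary = ($raw | ConvertFrom-Json).fields.summary

Remove-Item -Path $path
```

Either single-string form can be passed to -Body as-is.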
By the way, you can use a regular PowerShell credential object instead of manually building your request header.
On a side note, it's always a good idea to test your api out using a tool such as postman. That will let you verify that you're posting valid json while not having to worry about your code doing strange things.
Problem solved. The issue was that I was CAPITALIZING the first letter of the field names. Apparently, Jira is very sensitive to CASE. Trondh - thank you for the recommendation to use postman. The errors that were getting generated by postman from the failed API calls were very concise.
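One way to keep field-name casing under control is to build the body from a hashtable and let ConvertTo-Json emit the names exactly as typed. This is only a sketch: the values are taken from the sample above, and which fields a given Jira project requires is an assumption.

```powershell
# Build the request body with lowercase field names, exactly as they should be sent.
$issue = [ordered]@{
    fields = [ordered]@{
        project   = @{ key = 'ITS' }      # lowercase "key", not "Key"
        summary   = 'Rest Test 1'
        issuetype = @{ name = 'Task' }
    }
}

# -Depth is set explicitly so nested objects are never truncated.
$body = $issue | ConvertTo-Json -Depth 5
```

$body can then be passed to Invoke-RestMethod via -Body together with -ContentType "application/json".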