scrape html source vba - html

I need to scrape some info from the web using VBA. This is an extract of my code. It works, but the site has two elements with the same class name, so my code writes only the last value. I want this line:
Sheets("01").Range("DW" & number) = source.getAttribute("data-id")
to write only the first value of the class "sample" found on the site.
How can I do that?
Thanks
With http
.Open "GET", site, False
.send
html.body.innerHTML = .responseText
End With
For Each source In html.getElementsByClassName("sample")
Sheets("01").Range("DW" & number) = source.getAttribute("data-id")
Next source
Next number

You can be more efficient by using querySelector, which returns only the first match rather than an entire collection (or nodeList):
Sheets("01").Range("DW" & Number) = html.querySelector(".sample").getAttribute("data-id")
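One caveat with the one-liner above: if no element matches, querySelector returns Nothing and the chained getAttribute call raises a run-time error. A minimal sketch of a guarded version follows. The helper name FirstDataId and its parameters are hypothetical, and the code assumes a project reference to the Microsoft HTML Object Library.

```vba
' Sketch only. Assumes a reference to "Microsoft HTML Object Library" (MSHTML).
' FirstDataId is a hypothetical helper name, not part of the original code.
Public Function FirstDataId(ByVal pageSource As String, ByVal cssSelector As String) As String
    Dim html As New MSHTML.HTMLDocument
    Dim node As Object
    Dim attrValue As Variant

    html.body.innerHTML = pageSource

    ' querySelector returns Nothing when no element matches the selector
    Set node = html.querySelector(cssSelector)
    If node Is Nothing Then Exit Function   ' result stays ""

    ' getAttribute returns Null when the attribute is missing
    attrValue = node.getAttribute("data-id")
    If Not IsNull(attrValue) Then FirstDataId = CStr(attrValue)
End Function
```

Inside the original With http block it could be called as Sheets("01").Range("DW" & Number) = FirstDataId(.responseText, ".sample"), which writes an empty string instead of failing when the class is absent.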

To refer to the first element of a class collection, you can use the Item property, for which the index is 0-based. So you can replace your For Each/Next with the following line...
Sheets("01").Range("DW" & Number) = html.getElementsByClassName("sample").Item(0).getAttribute("data-id")
Hope this helps!

Related

ProperCase in SSIS where name has hyphen

I am trying to transform the 'firstname' column from its initial state: all upper case (e.g. FLORIN, FLORIN-MIHAI) to proper case (e.g. Florin, Florin-Mihai).
I've used the expression below
REPLACE(LEFT(FIRSTNAME,1) + LOWER(SUBSTRING(FIRSTNAME,2,100))," ","")
and it works for names without a hyphen (e.g. FLORIN = Florin), but not for names with a hyphen (FLORIN-MIHAI = Florin-mihai, whereas I am looking for Florin-Mihai).
Is there an easy way to do this?
The Original Poster answered his own question with the following script component (transformation):
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Row.FIRSTNAME = StrConv(Row.FIRSTNAME, VbStrConv.ProperCase)
End Sub

Get URL parameters in VBA

I need to get the ID parameter in a URL, for example I have
http://apps/inventory/others.aspx?ID=8678
How do I extract the 8678? I've looked at the methods of the WinHttp.WinHttpRequest.5.1 object but haven't found anything. Could this be done with a simple substring? The URL is always the same and there is always exactly one GET parameter.
Thanks
Try it like this:
Option Explicit
Public Sub TestMe()
Debug.Print ExtractAfter("http://apps/inventory/others.aspx?ID=8678", "ID=")
Debug.Print ExtractAfter("http://apps/inventory/others.aspx?ID=867843", "ID=")
End Sub
Public Function ExtractAfter(strInput As String, strAfter As String) As String
ExtractAfter = Mid(strInput, InStr(strInput, strAfter) + Len(strAfter))
End Function
This is what you get in the immediate window:
8678
867843
In VBA, assuming you have the url in a variable url:
Debug.Print Mid(url, InStr(url, "ID=") + 3)
However, this only works correctly if the ID parameter is always present and is always the only parameter; otherwise you need more sophisticated string handling.
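As a rough illustration of what that more sophisticated handling might look like, here is a sketch that splits the query string into name=value pairs and returns the value for one name. GetQueryParam is a hypothetical helper, not an existing VBA or WinHttp function, and it does not URL-decode the value.

```vba
' Hypothetical helper: returns the value of one query-string parameter,
' or an empty string when the URL has no such parameter.
Public Function GetQueryParam(ByVal url As String, ByVal paramName As String) As String
    Dim queryStart As Long
    Dim query As String
    Dim pairs() As String
    Dim i As Long
    Dim eqPos As Long

    queryStart = InStr(url, "?")
    If queryStart = 0 Then Exit Function        ' no query string at all

    query = Mid$(url, queryStart + 1)

    ' Drop a trailing fragment such as "#top", if present
    If InStr(query, "#") > 0 Then query = Left$(query, InStr(query, "#") - 1)
    If Len(query) = 0 Then Exit Function

    pairs = Split(query, "&")
    For i = LBound(pairs) To UBound(pairs)
        eqPos = InStr(pairs(i), "=")
        If eqPos > 0 Then
            ' Compare parameter names case-insensitively
            If StrComp(Left$(pairs(i), eqPos - 1), paramName, vbTextCompare) = 0 Then
                GetQueryParam = Mid$(pairs(i), eqPos + 1)
                Exit Function
            End If
        End If
    Next i
End Function
```

For example, Debug.Print GetQueryParam("http://apps/inventory/others.aspx?ID=8678&page=2", "ID") would print only the ID value, regardless of where the parameter sits in the query string.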

VBS to get an Element from a web page is not working properly

I want to get the value '24' in my VBS, which is set in the div with id 'test'. My HTML is:
<html><body>
Welcome <br /> Value: = <div id="test">24</div>
<br> Name: <p id="name">Someone</p><br>
</body></html>
And my VBS is:
on error resume next
set ie=createobject("internetExplorer.Application")
ie.navigate "http://localhost/h/getid.html"
ie.visible = false
wscript.sleep 2000
dim val
set val =ie.document.getElementsById("test").item(1).value
wscript.echo "value is= "& val
But the output does not show the value "24", it is just echoing
value is=
How can I get that value?
You should not ask a question here that concerns a script with an active "On Error
Resume Next". That is a waste of everybody's time. By not hiding errors and by paying attention to
error messages, you can solve the problem(s) on your own (most of the time).
Delete/deactivate the OERN and you get
set val =ie.document.getElementsById("test").item(1).value
==>
... runtime error: Object doesn't support this property or method: 'ie.document.getElementsById'
Even if you don't recognize the typo, a Google search for "html dom getelementsbyid"
will re-route you to "Ergebnisse für [i.e. results for] html dom getelementbyid".
Follow one of the first links to refresh your knowledge about that method.
That way the next error:
set val =ie.document.getElementById("test").item(1).value
==>
... runtime error: Object doesn't support this property or method: 'ie.document.getElementById(...).item'
won't surprise you. An element isn't a collection of items/elements. [BTW: You shouldn't
post answers here without at least basic tests].
The next version
set val =ie.document.getElementById("test").value
should raise a red alert: an assignment with Set, but a right-hand value that wants to be a
property of an object. That is blatantly wrong. So try:
set elm =ie.document.getElementById("test") ' at least a decent assignment
val = elm.value
==>
... runtime error: Object doesn't support this property or method: 'elm.value'
A Google query like "html dom div text" will point you to "innerText"
and its features.
At last:
set elm =ie.document.getElementById("test") ' at least a decent assignment
val = elm.innerText
success!
cscript 23971918.vbs
value is= 24
It looks like you need to change getElementsById to getElementById.
Someone mentioned you should change the getElementsById to getElementById (no 's'), but I also think you should lose the .item(1) in this line:
set val =ie.document.getElementsById("test").item(1).value
If I'm remembering correctly, using .item(#) would be appropriate if your object was a collection, like something returned by using .getElementsByTagName but .getElementById should only return one item.
Hope that helps!

Regular Expression Pattern Matching to HTML content

I am trying to do a regular expression search on a string holding the HTML content of a web search. The pattern I am trying to match has the format HQ 12345; the second fragment could also start with a letter, so HQ A12345 is also a possibility. As shown in the code below, the regex pattern I am using is "HQ .*[0-9]".
The problem is that when I run the regex search, the match is not just HQ 959693 but also includes the rest of the HTML file content, as shown in the snapshot of the message box below.
Sub Test()
Dim mystring As String
mystring = getHTMLData("loratadine")
Dim rx As New RegExp
rx.IgnoreCase = True
rx.MultiLine = False
rx.Global = True
rx.Pattern = "HQ .*[0-9]"
Dim mtch As Variant
For Each mtch In rx.Execute(mystring)
Debug.Print mtch
MsgBox(mtch)
Next
End Sub
Public Function getHTMLData (ByVal name As String) As String
Dim XMLhttp: Set XMLhttp = CreateObject("MSXML2.ServerXMLHTTP")
XMLhttp.setTimeouts 2000, 2000, 2000, 2000
XMLhttp.Open "GET", "http://rulings.cbp.gov/results.asp?qu=" & name & "&p=1", False
XMLhttp.send
If XMLhttp.Status = 200 Then
getHTMLData = XMLhttp.responsetext
Else
getHTMLData = ""
End If
End Function
Use ? to specify non-greedy, otherwise the match will consume up until the last digit of the entire string. Also, you are only matching one digit occurrence. Add a + to specify "one or more" so it will match your goal:
HQ .*?[0-9]+
Alternatively, you can try to use a negated character class like so:
HQ [^0-9]*[0-9]+
Or you can even simplify it further:
HQ [^\d]*\d+
Regex matching is greedy by default. Unfortunately I didn't manage to reproduce your issue precisely, but I am pretty sure it happens because you have a long string in which '.*' matches everything up to a number at the end.
I find this link useful; see the explanation near the bottom about the greediness of *
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
I suggest changing your Regex to:
HQ .*?[0-9]+
That will match "HQ " and any number of characters, followed by one or more digits. Because of the "?", the ".*" will consume as few characters as possible.
Please respond if this does not work and I will try getting your regex running in Excel.
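For what it's worth, here is a minimal sketch of a tighter pattern in VBA. It assumes, based only on the two examples in the question, that a ruling number is "HQ ", an optional single uppercase letter, then digits. FindRulingNumbers is a hypothetical helper name; the regex object is created late-bound, so no library reference is needed.

```vba
' Sketch only: collects every "HQ 12345" / "HQ A12345" style token in a string.
' The pattern is an assumption drawn from the two formats named in the question.
Public Function FindRulingNumbers(ByVal source As String) As Collection
    Dim rx As Object
    Dim m As Object
    Dim results As New Collection

    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True          ' find all matches, not just the first
    rx.IgnoreCase = False
    ' No ".*" here, so there is nothing that can run on through the rest of the HTML
    rx.Pattern = "HQ [A-Z]?[0-9]+"

    For Each m In rx.Execute(source)
        results.Add m.Value
    Next m

    Set FindRulingNumbers = results
End Function
```

Because each part of the pattern is restricted to the characters it is expected to contain, greediness stops being an issue: the match cannot extend past the last digit of the number itself.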

SSIS Convert Blank or other values to Zeros

After applying the unpivot procedure, I have an Amount column that has blanks and other characters (like "-"). I would like to convert those non-numeric values to zero. I used the replace procedure, but it only converts one value at a time.
Also, I tried to use the following script
/**
Public Overrides Sub Input()_ProcessInputRows(ByVal Row As Input()Buffer)
If Row.ColumnName_IsNull = False Or Row.ColumnName = "" Then
Dim pattern As String = String.Empty
Dim r As Regex = Nothing
pattern = "[^0-9]"
r = New Regex(pattern, RegexOptions.Compiled)
Row.ColumnName = Regex.Replace(Row.ColumnName, pattern, "")
End If
End Sub
**/
but I'm getting an error. I don't know much about scripts, so maybe I placed it in the wrong place. The bottom line is that I need to convert those non-numeric values.
Thank you in advance for your help.
I generally look at regular expressions as a great way to introduce another problem into an existing one.
What I did to simulate your problem was to write a select statement that added 5 rows. 2 with valid numbers, the rest were an empty string, string with spaces and one with a hyphen.
I then wired it up to a Script Component and set the column as read/write
The script I used is as follows. I verified there was a value there and if so, I attempted to convert the value to an integer. If that failed, then I assigned it zero. VB is not my strong suit so if this could have been done more elegantly, please edit my script.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Ensure we have data to work with
    If Not Row.ColumnName_IsNull Then
        ' Test whether it's a number or not
        ' TryCast doesn't work with value types so I'm going the lazy route
        Try
            ' Cast to an integer and then back to string because
            ' my vb is weak
            Row.ColumnName = CStr(CType(Row.ColumnName, Integer))
        Catch ex As Exception
            ' The column is a string, so assign the string "0" rather than the number 0
            Row.ColumnName = "0"
        End Try
    End If
End Sub