Extracting information from web through Excel or VBA - html

I am a beginner of coding. I am now learning how to get information from web through Excel or VBA.
My question can be ask through below example:
In below link (https://www.schooland.hk/ss/tsuen-wan),
when you click those individual blue wordings in the red circle which I shown, it will leads to individual pages.
In all those individual pages, for example (https://www.schooland.hk/ss/twgss), they also have a part like this showing same kind of information, like phone number etc.(As circled below).
My work is to use a spreadsheet, like excel, show a table which listed all those individual wordings' informations in the red circled without using copy and paste.
How could I do it?

The following using XHR to achieve your task.
Notes:
GetSchoolInfo is the main sub.
It sends the initial request to "https://www.schooland.hk/ss/tsuen-wan" .It uses the function GetHTMLDoc to return an HTML document for the webpage URL passed in.
The links to individual schools you show in the first red circled area can be retrieved using a CSS selector of
.school-table a
"." means className, so .school-table means all elements with className school-table.
" a" means all a tags contained within that.
Sample results returned by this CSS query:
The actual HTML returned for each link is like the following:
<A title="Tsuen Wan Government Secondary School" href="about:twgss">??????</A>
We can make use of the fact that each linked to page uses the short string after about: i.e. twggs concatentated with a general base string of "https://www.schooland.hk/ss/", to give each school specific URL i.e.
"https://www.schooland.hk/ss/twggs".
Function GethRefSubString obtains this short string which is concantenated with the general base string BASEURL. This school specific link is the added to the schoolLinks collection.
The schoolLinks collection is looped using GetHTMLDoc to process the links into school specific new HTML documents.
The contact info on each school page, the second red circled area in your question, resides with an HTMLDivElement with className contact. The appropriate index is 0 i.e. the first matching className in the collection (also the only!).
All the required contact info can be accessed via the .innerText property of the HTMLDivElement.
Example webpage content:
Example code output:
VBA Code:
Option Explicit
Public Sub GetSchoolInfo()
Application.ScreenUpdating = False
Dim xmlHttp As Object, html As HTMLDocument, links As Object
Const BASEURL As String = "https://www.schooland.hk/ss/"
Set xmlHttp = CreateObject("MSXML2.XMLHTTP.6.0")
Set html = GetHTMLDoc("https://www.schooland.hk/ss/tsuen-wan", xmlHttp)
Set links = html.querySelectorAll(".school-table a[href]") 'get all
Dim link As Long, schoolLinks As Collection
Set schoolLinks = New Collection
For link = 0 To links.Length - 1
schoolLinks.Add BASEURL & GethRefSubString(links(link).outerHTML)
Next link
Dim currentLink, counter As Long
With ActiveSheet
For Each currentLink In schoolLinks
counter = counter + 1
Set html = GetHTMLDoc(currentLink, xmlHttp)
Dim contactInfo As Object '<HTMLDivElement
Set contactInfo = html.getElementsByClassName("contact")(0)
.Cells(counter, 1) = contactInfo.innerText
Next currentLink
End With
Application.ScreenUpdating = True
End Sub
Public Function GetHTMLDoc(ByVal url As String, ByRef xmlHttp As Object) As HTMLDocument
With xmlHttp
.Open "GET", url, False
.setRequestHeader "Content-Type", "text/xml"
.send
Dim html As HTMLDocument
Set html = New HTMLDocument
html.body.innerHTML = .responseText
End With
Set GetHTMLDoc = html
End Function
Public Function GethRefSubString(ByVal aString As String) As String
GethRefSubString = Split(Split(aString, "href=""about:")(1), Chr$(34))(0)
End Function
References required:
VBE > Tools > References > HTML Object library

Related

Extract data from HTML Element - VBA

I'm new to web scraping and the HTML language.
I'm trying to write a code in VBA to extract data from the following website:
https://companies.govmu.org:4343/MNSOnlineSearch/
I have an Excel sheet with over 5000 company names and their respective "File No" in columns A and B respectively, and I need to input their "Status" (either "Live" or "Defunct") in column C. This will be done after searching for each company by "File No" and then extracting their status to the Excel sheet.
The issue is that I can't seem to get the element containing the data that I need.
I've already written the bit of code which will extract the "File No" from my Excel sheet, paste it on the webpage in the "File No" search box, and run the search. (You can try searching C5113, as an example).
However, on the resulting webpage, I've tried getting the element containing the data that I need, but it does not work.
For example, I tried to MsgBox (MsgBox is my personal way to check whether my variable contains the data I need) the inner HTML of the tag fieldset (fs) with ID "CompanyList" as shown in the code below, but it returns an error.
I've also tried with another variable named div, of data type HTMLDivElement, and then getting the element by ID "companies".
And finally, I've also tried looping through a variable of type IHTMLElementCollection to look for the element that I need, but it still does not show the element that I need (it shows other elements that I don't need).
Option Explicit
Sub ExtractStatusDetails()
Dim ie As InternetExplorer
Dim html As HTMLDocument
Dim resultHtml As HTMLDocument
Dim fs As IHTMLElement
Dim searchBoxes As IHTMLElementCollection
Dim searchButton As Object
Dim homePage As String
homePage = "https://companies.govmu.org:4343/MNSOnlineSearch/"
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate homePage
Do While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set html = ie.document
Set searchBoxes = html.getElementsByClassName("col-md-6 col-lg-4")
searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "placeholder", "value")
searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "Search company by File No...", "C63")
Set searchButton = searchBoxes(0).getElementsByClassName("btn btn-large btn-primary btn-raised")
searchButton(0).Click
Do While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set resultHtml = ie.document
Set fs = resultHtml.getElementById("CompanyList")
MsgBox fs.innerHTML
ie.Quit
End Sub
The page does an xmlhttp POST request which retrieves data from a backend data store (likely Oracle GlassFish > JDBC API > data repository e.g. MySQL) . It returns all similar matches, possibly including exact.
You can find the POST request in the network traffic of browser dev tools after you enter the fileNo and press the search button.
Below is a function you can call in a loop over your fileNos to retrieve the company status
Option Explicit
Public Sub test()
Dim fileNo As String, xmlhttp As Object
fileNo = "C5113"
Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
MsgBox GetCompanyStatus(fileNo, xmlhttp)
End Sub
Public Function GetCompanyStatus(ByVal fileNo As String, ByVal xmlhttp As Object) As String
Dim html As HTMLDocument, body As String, fileNos As Object, i As Long
Set html = New HTMLDocument
body = "tabs=tab-1&searchByName=&searchByFileNo=PLACEHOLDER&submitCompanies=&searchByBusName=&searchByBRN=&searchByIncDateFrom=&searchByIncDateTo=&doAction=search"
With xmlhttp
.Open "POST", "https://companies.govmu.org:4343/MNSOnlineSearch/GetCompanies", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.send Replace$(body, "PLACEHOLDER", fileNo)
html.body.innerHTML = .responseText
Set fileNos = html.querySelectorAll("td.tdFileNo")
If fileNos.Length > 0 Then
For i = 0 To fileNos.Length - 1
If fileNos.item(i).innerText = fileNo Then
GetCompanyStatus = html.querySelectorAll("td.tdStatus").item(i).innerText
Exit Function
End If
Next i
End If
GetCompanyStatus = "Not found"
End With
End Function
I would instead consider how you can group your requests. As you can post partial file numbers you could cut down on the number of requests considerably by doing in batches with partial file numbers e.g. search for C5 or C51; the backend then does something like "C5%" to return all matches starting with the specified string, and then loop those results searching for your file numbers of interest that fall within that range.
You could have a dictionary with fileNo as key and status as value and update this as you loop the results returned by a request. I think the number of keys is constrained by Long, so no problem I think for storing all your fileNos at the start, in a dictionary, and updating later during requests. You could even have multiple dictionaries that host ranges of fileNos, like the volumes of the good old Encyclopædia Britannica. This would limit loops to dicts you hope to populate from the same request, for example. Is something to play around with an explore.

How to set dropdown box on website so that select option and scrape data

A website has changed so I can't scrape data from it anymore. Just need to change the set line below I believe but have tried a number of things and none have worked. I'm not very knowledgeable about this area I'm afraid but rest of code is working. Eg url is https://www.gurufocus.com/stock/CFWFF/insider and it is the table of insider transactions I am trying to press the dropdown for and change it to 100 instead of 10:
t = Timer
noTable = 0
Do
Set DropDown = doc.querySelectorAll(".el-dropdown-menu__item")
lastDropDrownItemIndex = DropDown.Length - 1
If Timer - t > MAX_WAIT_SEC Then
noTable = 1
Exit Do
End If
Loop While lastDropDrownItemIndex < 1
If noTable = 1 Then GoTo noTableEscape
DropDown.Item(lastDropDrownItemIndex).Click
Thanks
Ok so, not sure what you are after exactly, but the website you're scraping offers an API which in my opinion could probably make your life a lot easier. To put it simply, this means that it provides an easy way to request for data with the use of some parameters embedded in a URL. It returns the requested data in JSON format.
In the following code I will be using the XMLHTTP request method and a JSON Parser. For this you will need the following:
JSON parser , it helps you parse the downloaded data. Follow the installation instructions to import it in your project
A reference to the Microsoft Scripting Runtime library. The JSON parser needs it.
A reference to the Microsoft WinHTTP Services, Version 5.1 library. It lets you use an HTTP request object.
For demonstration purposes, the following code will only print in the immediate window the first entry's name and position. You can modify the code to fit your needs.
Sub test()
Dim req As New WinHttpRequest
Dim jsonResponse As String
Dim jsonParsed As Object
Dim url As String
Dim pageNum As Integer
Dim numPerPage As Integer
pageNum = 1 'You can change this parameter to navigate in different pages
numPerPage = 100 'You can change this parameter to control the number of entries
url = "https://www.gurufocus.com/reader/_api/stocks/OTCPK:CFWFF/insider?page=" & pageNum & "&per_page=" & numPerPage & "&sort=date%7Cdesc"
With req
.Open "GET", url, False
.setRequestHeader "Accept", "application/json, text/plain, */*"
.setRequestHeader "Authorization", ThisWorkbook.Worksheets("The name of your Worksheet").Range("A1").Value 'I have stored a string that is essential to the request in cell A1
.send
jsonResponse = .responseText
End With
Set jsonParsed = JsonConverter.ParseJson(jsonResponse)
Debug.Print jsonParsed("data")(1)("name") 'get the name parameter of the first entry
Debug.Print jsonParsed("data")(1)("position") 'get the position parameter of the first entry
End Sub
Please note that there's a very long string which is essential to the request, which I have stored in cell A1. This string looks like so:
Bearer
eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImp0aSI6ImUxYjAwMmYxMjczMGRiMTBmMmZkYjJkNDk0YTU4NjRmZDZjOWY3ZGI4ZmI1NDY1NTQ2MzZlMGJhNzkxODUxNmY4NTM2ZWIzZDNhODhmN2VmIn0.eyJhdWQiOiIyIiwianRpIjoiZTFiMDAyZjEyNzMwZGIxMGYyZmRiMmQ0OTRhNTg2NGZkNmM5ZjdkYjhmYjU0NjU1NDYzNmUwYmE3OTE4NTE2Zjg1MzZlYjNkM2E4OGY3ZWYiLCJpYXQiOjE1NTkwNzA3OTcsIm5iZiI6MTU1OTA3MDc5NywiZXhwIjoxNTY5NDM4NzkzLCJzdWIiOiIiLCJzY29wZXMiOltdfQ.mZ4DqhUk9YAU6JYDBScF8MJ_zHPyL94bAec7LxZTaWipcWf9uesdGDMDC9v_7W-6zrtXAUWhk4YAL70E5rpPjM7gusYH0RfO48O2PnaV8gsqXoNCFwFBOHuxh109q7X0YsNkfX2wX8m3XigtK9A_YAGID7wxgX96lwzBevsDJ3borHMcJlQtxidF_Bq2D5WPASsuy3jdY80HkOCR1y4eaSIswBEtK5rPj_xy7VXRbYGhLklqw4wgHgq4blfaHnVVmPXf6k8mx45ye8vPecS-w9kjuDOHVn2mvU6mpBzqEpbH4lqpiqmYG7M-CvB1joEAcMQtcilCvsdfKOusoC2MU4_vPtF3Q4ZFVaEcXIQgomdKtFa_XGpCudit45b2rEFacKMUENqLj_sPwYkgM1IPl1lQfR-VpigqnCHPAxVQAPzqwJvS6CxuYOPmvnrx23fBAillP7LtDHwHtlMpgZUjdB5y6IWsia76crM4kbkrKn3zc8xoAGb1fIrgJlY-9hOzrwsmrchantEdYOFZjcMJvhCnlfvnEm6kT2Sdcu4o6TndTZJjrVmD4mb-jNGy4kw_mAx1DfyqR7GLtCVSzcSLKgrrwCJEL22K2bfXH2HExXvgLFbPXivVZJc70TnF9lJmx_dx79cxAm7szFGIdrs56bAC4mdKpvKL3BNmVY-J-G0
The same string should work for you as well.
The result looks like so:
Brown, James Michael
Senior Officer
Each one of the 100 data entries has the following structure:
It's fairly easy to loop through all the entries as well. For example, to print the name of all the entries you would have to do this:
Dim item As Object
For Each item In jsonParsed("data")
Debug.Print item("name")
Next item
Finally, you can also loop through all the parameters of each entry. For example, the following code prints all the parameters and their corresponding values for the first entry:
Dim key As Variant
For Each key In jsonParsed("data")(1).Keys
Debug.Print key & ": " & jsonParsed("data")(1)(key)
Next key
So this way you can basically access any parameter you want for each entry.

Scrape data from xmlhttp

I'm trying to scrape elements from xmlhttp.
I'm not too bad with vba, but relatively new to data scraping.
I have previously been using ie.
I can import the html into a cell, but would like to import specifically, the name, id, price and stock level.
The code I'm using to import the data is
Private Sub HTML_VBA_Excel()
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
'Change the URL before executing the code
sURL = "https://www.superdrug.com/Make-Up/Lips/Lip-Kits/Flower-Beauty-Mix-N%27-Matte-Lipstick-Duo-Tickled-Pink-687/p/769466"
'Extract data from website to Excel using VBA
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
sPageHTML = oXMLHTTP.responseText
'Get webpage data into Excel
sh02.Cells(1, 1) = sPageHTML
End Sub
Thanks in advance for any help received.
Ian
You cannot extract the information reliably from an xmlhttp request issued against the url you show as the content is javascript loaded and will not have run.
Not sure how sustainable the token is (doesn't seem to matter the value used) but you can join the productid, which is the end of your url, with the ajax token present in the page and issue and xmlhttp request using querystring parameters and parse a json response for the items of interest. I use jsonconverter.bas. After downloading and installing the .bas you need to go VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
Some testing seems to indicate any number can be added after the hyphen in place of the token so you could randomly generate a number on the fly to use.
It's worth noting you can comma separate multiple products in the query string and thus do a bulk request. You would need then do a For Each Loop over the collection of dictionaries returned.
Option Explicit
Public Sub GetInfo()
Const URL As String = "https://www.superdrug.com/micrositeProduct/bulk/769466-1548702898380"
Dim json As Object, title As String, price As String, stocking As String, id As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
Set json = jsonconverter.ParseJson(.responsetext)(1)
End With
title = json("name")
price = json("price")("formattedValue") 'json("price")("value")
stocking = json("stockLevel")
id = json("code")
End Sub
If you use a browser then the json string is present within one the script tags as the .innerHTML and you can easily extract from there.

Unable To start download

I am automating the downloading of a report but am struggling getting the report to run. When manually carrying out process I navigate to report area and right click "run as excel"
When I inspect the element I have the below lines which I cannot get to run in my code.
Elements
I have tried .click on a few of the different ID's ("ext-gen152") or ("ext-gen153") but I think I need to approach executing this line different due to it not being a visible button but a hidden menu.
If this is simple and been covered before accept my apologies but a point in the right direction would be handed. Is there a really good resource vba with IE?
I have also tried looping through the elements in an object but do not know the name of the object which "ext-gen152" sits
Sub AP_Login()
Dim Site As Object
Set Site = CreateObject("InternetExplorer.application")
Dim AP_ID As String
Dim AP_PW As String
Dim URL As String
Dim ControlSht As Worksheet
Dim Obj As Object
Site.Visible = True
Set ControlSht = Sheet2
AP_ID = ControlSht.Range("user").Value
AP_PW = ControlSht.Range("Password").Value
URL = ControlSht.Range("LogPage").Value
Site.Navigate URL
While Site.Busy
Wend
Dim oHTMLDoc As Object
Set oHTMLDoc = Site.Document
oHTMLDoc.getElementById("inputEmail").Value = AP_ID
oHTMLDoc.getElementById("inputPassword").Value = AP_PW
oHTMLDoc.getElementById("submit").Click
While Site.Busy
Wend
URL = ControlSht.Range("ReportPage").Value
Site.Navigate URL
While Site.Busy
Wend
oHTMLDoc.getElementById("ext-gen153").Click
Site.Visible = True
End Sub

Parse HTML in VBA to extract information from description list?

I want to extract information from a website with Excel XP.
I found some example code (http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm) and tried the following:
Function strHtmlElementValue(htmldoc As HTMLDocument, id As String) As String
Dim HtmlElement As IHTMLElement
Set HtmlElement = htmldoc.getElementById(id)
strHtmlElementValue = id & ": " & HtmlElement.innerText
End Function
I tried it with the following URL (loaded as the htmldoc): http://www.immobilienscout24.de/expose/73940554
If I use the string "expose-title" for the id, the function returns the title of the page, which is fine.
But how can I access e. g. information like the price?
In the Html code, it looks like that. There is no ID and if I try to use the class-name "is24qa-kaufpreis" for getelementbyid, I get an error message.
<dl>
<dt>
<strong class="is24qa-kaufpreis-label">
Kaufpreis:
</strong>
</dt>
<dd class="is24qa-kaufpreis">
2.190.000,00 EUR
</dd>
</dl>
So, is there a way to access fields like this "is24qa-kaufpreis" directly and read out the inner text (in this case the 2.190.000,00 EUR?
This worked for me. IE11, but should work with IE9+.
Sub TestGEBCN()
Dim doc As New MSHTML.HTMLDocument, html, els
html = "<dl><dt><strong class=""is24qa-kaufpreis-label"">Kaufpreis:" & _
"</strong></dt><dd class=""is24qa-kaufpreis"">" & _
"2.190.000,00 EUR</dd></dl>"
doc.body.innerHTML = html
Set els = doc.getElementsByClassName("is24qa-kaufpreis")
Debug.Print els(0).innerText
End Sub
There are a number of different ways you could go about it. The following code shows two approaches based on "getElementsByTagName." In the source code for the web page, if you can count which instance of div "dd" kaufpreis is in, then you could use the first method. A more general approach is shown following it.
Sub test()
my_url = "http://www.immobilienscout24.de/expose/73940554"
Set html_doc = CreateObject("htmlfile")
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", my_url, False
xml_obj.send
html_doc.body.innerhtml = xml_obj.responseText
Set xml_obj = Nothing
k_pice = html_doc.body.getElementsByTagName("dd")(0).innertext
' Or
Set Results = html_doc.body.getElementsByTagName("dd")
For Each itm In Results
If InStr(1, itm.outerhtml, "EUR", vbTextCompare) > 0 Then
k_price = itm.innertext
Exit For
Else
End If
Next
End Sub
Use
getElementsByTagName("strong")(0).InnerText
for Kaufpreis;
Use
getElementsByTagName("dd")(0).InnerText
for 2.190.000,00 EUR.
(0) is the number of the same tag element, there can be many entries with the same tag name in the code, to retrieve them use ("tag")(0), ("tag")(1),...,("tag")(n).
I suggest researching the topics regarding child or sub elements for automation purposes.
Also remember that Excel can do it's own web queries. On the Data - Import External Data - New Web Query menu (Alt + D, D, W). Then you would refer to it as sheet2!a22 or whatever. It no good for a page that constantly changes it's layout.
CSS selector:
.querySelector method of HTMLDocument to apply a CSS selector of dd[class='is24qa-kaufpreis']
This says get first element with tag name dd having class attribute of is24qa-kaufpreis'. "[]" means attribute.
CSS query:
VBA:
htmldocument.querySelector("dd[class='is24qa-kaufpreis']").innerText
You need to obtain the HTMLDocument object but the other answers already show meothds for this.