Using VBA to Copy Multiple HTML tables from browser to Excel - html

Sounds simple, right?
I am working on a class project and need a lot of data from several .gov websites, and I am struggling to get this to work reliably.
I was able to get this to work perfectly on tables where there was only text in the cells. I did this with the .innerText property of an HTML object (and of course I left my flash drive at school so I can't see what I used, figures).
Anyway, my issue is that I cannot get the information from some cells when they contain links to other pages. For example, on this OPM site, https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/2017/general-schedule/, there is a web option to view the form, and I cannot get it to copy the web addresses into Excel. Copying the text in the cells works fine, but I cannot figure out how to make it copy the href text.
Does anyone happen to have any experience with this and could hopefully point me in the right direction?
Best,
Clueless Guy who is lost as usual.
Edit: Here is my code so far. I removed the bits about href because they were causing it not to work and were almost certainly incorrect.
Sub GetTables()
Dim doc As HTMLDocument
Dim htmTable As HTMLTable
Dim hpLink As IHTMLElement
Dim data
Dim x As Long, y As Long
Dim oRow As Object, oCell As Object
Dim oDom As Object: Set oDom = CreateObject("htmlFile")
x = 1
y = 1
Set doc = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/2017/general-schedule/"
.send
Do: DoEvents: Loop Until .readyState = 4
doc.body.innerHTML = .responseText
.abort
End With
Set htmTable = doc.getElementsByClassName("DataTable")(0)
With htmTable
Debug.Print .Rows(0).Cells(1).innerText
Debug.Print .Rows(6).Cells(1).innerText
Debug.Print .Rows(7).Cells(1).innerText
ReDim data(1 To .Rows.Length, 1 To .Rows(1).Cells.Length)
For Each oRow In .Rows
For Each oCell In oRow.Cells
data(x, y) = oCell.innerText
'Previously, I had attempted to use oCell.href to get the value
'but that did not work.
y = y + 1
Next oCell
y = 1
x = x + 1
Next oRow
End With
Sheets(1).Cells(1, 1).Resize(UBound(data), UBound(data, 2)).Value = data
End Sub
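One possible way to get at the link targets, offered as an untested sketch rather than a confirmed fix for the OPM page: a table cell has no href of its own, because the href belongs to the <a> element inside the cell. The helper names CellValue and FixRelative are made up for illustration. The "about:" handling rests on the assumption that a document filled from a string has no base URL, so relative links may come back with that prefix; it is worth checking in the Immediate window.

```vba
'Hypothetical helper: returns the first link's href if the cell contains
'an <a> element, otherwise the cell's plain text.
Function CellValue(ByVal oCell As Object, ByVal baseUrl As String) As String
    Dim links As Object
    Set links = oCell.getElementsByTagName("a")
    If links.Length > 0 Then
        CellValue = FixRelative(links(0).href, baseUrl)
    Else
        CellValue = oCell.innerText
    End If
End Function

'Assumption: relative links may be reported as "about:/path" when the
'document was built from a string. Swap that prefix for the site root.
Function FixRelative(ByVal href As String, ByVal baseUrl As String) As String
    If Left$(href, 6) = "about:" Then
        FixRelative = baseUrl & Mid$(href, 7)
    Else
        FixRelative = href
    End If
End Function
```

In the loop above, data(x, y) = oCell.innerText would then become data(x, y) = CellValue(oCell, "https://www.opm.gov").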

Related

Search a website using excel vba with excel data and extract the active state in flowchart of search result and mapping it into column

I am hoping someone can help....
I have around 7,000 values in an Excel spreadsheet that I need to search for on a website, recording the active state shown in the result flowchart back in the spreadsheet. Since I am new to web scraping with macros, I took an existing web automation macro and changed the input IDs to match the website I want to extract information from (https://nacionalidade.justica.gov.pt/). I am confused about how to write the If condition that picks out the active state, given that the flowchart has seven classes. Here is the flowchart.
I have the access codes, and each one will be at a different stage. I only want to pick up the state and put it in column E next to the access code (I currently do this manually).
I am unclear how to extract that information, being new to this type of web data extraction. Any help would be incredible!
Here is my code (I was not able to adapt it to this website beyond the following line):
objIE.document.getElementById("btnPesquisa").Click
Code:
'start a new subroutine called SearchBot
Sub SearchBot()
'dimension (declare or set aside memory for) our variables
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim aEle As HTMLLinkElement 'special object variable for an <a> (link) element
Dim y As Integer 'integer variable we'll use as a counter
Dim result As String 'string variable that will hold our result link
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://nacionalidade.justica.gov.pt/"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'put the access code from cell D2 into the search box
objIE.document.getElementById("SenhaAcesso").Value = _
Sheets("Guy Touti").Range("D2").Value
'click the 'go' button
objIE.document.getElementById("btnPesquisa").Click
'wait again for the browser
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'the first search result will go in row 2
y = 2
'for each <a> element in the collection of objects with class of 'result__a'...
For Each aEle In objIE.document.getElementsByClassName("result__a")
'...get the href link and print it to the sheet in col C, row y
result = aEle
Sheets("Guy Touti").Range("E" & y).Value = result
'...get the text within the element and print it to the sheet in col D
Sheets("Guy Touti").Range("D" & y).Value = aEle.innerText
Debug.Print aEle.innerText
'is it a yellowpages link?
If InStr(result, "yellowpages.com") > 0 Or InStr(result, "yp.com") > 0 Then
'make the result red
Sheets("Guy Touti").Range("C" & y).Interior.ColorIndex = 3
'place a 1 to the left
Sheets("Guy Touti").Range("B" & y).Value = 1
End If
'increment our row counter, so the next result goes below
y = y + 1
'repeat times the # of ele's we have in the collection
Next
'add up the yellowpages listings
Sheets("Guy Touti").Range("B1").Value = _
Application.WorksheetFunction.Sum(Sheets("Guy Touti").Range("B2:B100"))
'close the browser
objIE.Quit
'exit our SearchBot subroutine
End Sub
I did try this first but after a while started searching for a better way. Can you help????
You can simplify the POST XHR request the page makes to get its data, and use the class names to limit the results to nodes with either active1 or active3. Take the last node in that node list, extract the step number, and convert the class to a colour via a lookup (if wanted). With 7,000 requests it would be considerate to add a delay of 1-2 seconds every 50 requests or fewer. You can test i Mod 50 in the loop to decide when to pause, and use Application.Wait Now + TimeSerial(0, 0, 2) for the pause itself.
Option Explicit
Public Sub GetStatus()
Dim html As MSHTML.HTMLDocument, xhr As Object, colourLkup As Object
Dim ws As Worksheet, senhas(), i As Long, results()
Set ws = ThisWorkbook.Worksheets("Sheet1")
senhas = Application.Transpose(ws.Range("D2:D" & ws.Cells(ws.Rows.Count, "D").End(xlUp).Row))
ReDim results(1 To UBound(senhas))
Set colourLkup = CreateObject("Scripting.Dictionary")
colourLkup.Add "active1", "green"
colourLkup.Add "active3", "orange"
Set html = New MSHTML.HTMLDocument
Set xhr = CreateObject("MSXML2.XMLHTTP")
For i = LBound(senhas) To UBound(senhas)
If senhas(i) <> vbNullString Then
With xhr
.Open "POST", "https://nacionalidade.justica.gov.pt/Home/GetEstadoProcessoAjax", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send "SenhaAcesso=" & senhas(i)
html.body.innerHTML = .responseText
End With
Dim nodes As Object, classinfo() As String
Set nodes = html.querySelectorAll(".active1, .active3")
If nodes.Length > 0 Then 'skip access codes that return no active step
classinfo = Split(nodes(nodes.Length - 1).className, Chr$(32))
results(i) = Replace$(classinfo(1), "step", vbNullString) & "-" & colourLkup(classinfo(2))
End If
End If
Set nodes = Nothing
Next
ws.Cells(2, 5).Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
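The delay suggested above is not part of the posted code. A minimal sketch of how it might look follows; the helper name ShouldPause and the demo loop are illustrative only, and the batch size of 50 and the 2-second pause are simply the figures from the answer.

```vba
'Hypothetical helper: True on every batchSize-th request (50, 100, 150, ...)
'when i counts up from 1.
Function ShouldPause(ByVal i As Long, Optional ByVal batchSize As Long = 50) As Boolean
    ShouldPause = (i Mod batchSize = 0)
End Function

Sub ThrottleDemo()
    Dim i As Long
    For i = 1 To 120
        '... send request number i here ...
        If ShouldPause(i) Then
            Application.Wait Now + TimeSerial(0, 0, 2) 'pause for 2 seconds
        End If
    Next i
End Sub
```

In GetStatus, the same If ShouldPause(i) check could sit just before the Next line.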

Excel Macros - Using Excel Data to scrape HTML page

As with many other questions, I'm really new to using Excel macros and HTML.
I'm building a table of items from Dungeons and Dragons (DnD), and I want to get their descriptions from a particular web page: http://www.5esrd.com/gamemastering/magic-items/ . For example, an entry in my table would be Adamantine Armor, so I would want the macro to search the web page for that listing, scrape the description shown there and save it to the Excel document. In this case, that specific description (as per the 5esrd webpage) is as follows:
Armor (medium or heavy, but not hide), uncommon
This suit of armor is reinforced with adamantine, one of the hardest substances in existence. While you’re wearing it, any critical hit against you becomes a normal hit.
On this page, all the items are links to pages that group them by type (sword, wand, etc.). I would like to iterate through my row of items in Excel, search for each item on the page and scrape its description into my Excel table.
I was following along this page: https://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm, which scrapes data from stackoverflow's home page. It directly targets some of the tables that the web page uses to organize its entries. The DnD page I'm trying to use doesn't organize its entries into tables like this, so I'm a little lost with how to proceed.
If anyone could help point me in the right direction, I would be very grateful!
This is the code I use for scraping a web page on our company site. I put all my data in column B, so make adjustments accordingly. This should get you started.
Sub TestScrape()
' SCRAPE A WEB PAGE
Dim ieDoc As New HTMLDocument 'ieDocDocument
Dim tdCollection As Object 'table that has the javascript attributes and contains the element I want to click
Dim AnchorLinks As Object
Dim tdElements As Object
Dim tdElement As Object
Dim AnchorLink As Object
Dim lRow As Long
Dim ie As InternetExplorer
Dim cls As IHTMLElementCollection
Set ie = New SHDocVw.InternetExplorer
With ie
.Visible = True
.Left = 0
.TheaterMode = True '<<-- Comment this out if you don't want Theater Mode
.Navigate url:="Enter your URL here"
While .ReadyState < 4 Or .Busy: DoEvents: Wend
End With
Application.Wait Now + TimeSerial(0, 0, 4)
ieDoc.body.innerHTML = ie.Document.body.innerHTML
With ieDoc.body
Set AnchorLinks = .getElementsByTagName("table")
Set tdElements = .getElementsByTagName("td") '
For Each AnchorLink In AnchorLinks
Debug.Print AnchorLink.innerText
Next AnchorLink
End With
lRow = 1
For Each tdElement In tdElements
Debug.Print tdElement.innerText
Cells(lRow, 2).Value = tdElement.innerText
lRow = lRow + 1
Next
Application.Wait (Now + TimeValue("0:00:1"))
With ie
.TheaterMode = False
.Quit
End With
Set ie = Nothing
End Sub

Excel VB Searching for text on a webpage and Copying information in the same Element

I am relatively new to VBA.
I am trying to use this code to grab a bit of information from a website. When I do it by element, I have to search for the tag name, which is tr, and use a number next to it to define which one I want. The problem is that the position on the website changes frequently. Currently the keyword I want to search for and the information it contains look like this:
<tr>
<td class="nt">Operations</td>
<td>Windows</td>
</tr>
So if I could search for the cell containing "Operations" (class "nt") and get the value "Windows" from the cell next to it, that would help. Also, I am currently getting this error:
Next without For
If possible, is there a way I can do multiple searches before I close the page? I want to look for several specific words and put that data into different cells before moving on to the next row, repeating until the end of the x range. I currently only have it set to x = 2 To 5, but I would like to increase that to 10 or higher in the future.
The current code looks like this.
Private Sub Worksheet_Change(ByVal Target As Range)
For x = 2 To 5
If Target.Row = Cells(x, 35).Row And _
Target.Column = Cells(x, 35).Column Then
'If Target.Row = Range("ManufacturerPartNumber").Row And _
'Target.Column = Range("ManufacturerPartNumber").Column Then
Dim IE As New InternetExplorer
'IE.Visible = True
'For x = 2 To 5
'IE.navigate "" & Range("Website_1").Value
IE.navigate "" & Cells(x, 35).Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim sDD As String
sDD = InStr(1, IE.document.body.innerHTML, "Processor Model")
'sDD = Trim(Doc.getElementsByTagName("Processor Model")(1).innerText) 'Use this with tag like dd and number for which it appears like 0 or 1
IE.Quit
Dim aDD As Variant
aDD = Split(sDD, ",")
Cells(x, 44).Value = aDD(0)
'Range("ProcessorNumberCd").Value = aDD(0)
'Range("OSProvided").Value = aDD(0)
Next x
End If
'MsgBox "Complete"
End Sub
I think you want to grab the 'inner text'. Take a look at the example below.
Sub Scraper()
Dim item As Long
Dim priceStr As String
Dim priceTag As Object
Dim priceTable As Object
item = "10011" 'this will eventually be placed in a loop for multiple searches
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True
' navigate and download the web page
objIE.Navigate "www.google.com"
Do While objIE.ReadyState <> 4 Or objIE.Busy
DoEvents
Loop
'objIE.Document.getElementsByTagName("input")(0).Value = item
'objIE.Document.getElementByID("FDI").Click
Set priceTable = objIE.Document.getElementByID("price_FGC")
Set priceTag = priceTable.getElementsByTagName("u")(3)
priceStr = priceTag.innerText
Sheet1.Range("A1").Value = priceStr
objIE.Quit
End Sub
Also, check out this link for several other ways to do similar things.
http://www.tushar-mehta.com/publish_train/xl_vba_cases/vba_web_pages_services/index.htm
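For the specific markup in the question, one possible approach (a sketch, not tested against the asker's site) is to look for the td with class "nt" whose text matches the keyword and then read the cell that follows it in the same row. The function name ValueForLabel is made up for illustration.

```vba
'Hypothetical helper: returns the text of the cell that follows the
'<td class="nt"> whose text equals label, or "" if there is no such cell.
Function ValueForLabel(ByVal doc As Object, ByVal label As String) As String
    Dim cell As Object, rowCells As Object
    For Each cell In doc.getElementsByTagName("td")
        If cell.className = "nt" And Trim$(cell.innerText) = label Then
            Set rowCells = cell.parentElement.Cells 'all cells in this row
            If cell.cellIndex + 1 < rowCells.Length Then
                ValueForLabel = Trim$(rowCells(cell.cellIndex + 1).innerText)
            End If
            Exit Function
        End If
    Next cell
End Function
```

It could be called several times with different keywords before IE.Quit, for example Cells(x, 44).Value = ValueForLabel(IE.document, "Operations"). As for the "Next without For" error, the End If has to come before Next x, because the If block was opened inside the For loop.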

VBA Excel Cell Content Spacing Messed up when Parsed from HTML

I am having a problem with the spacing between strings when I parse data from an HTML page. It's hard to explain, but I'll try my best. My code runs through an HTML file and copies every table of data into a single cell in column A of Excel. So when the code runs, I have cells A1 to A10 (for example), each filled with a huge amount of data. What I want to do next is scan through the cells and look for certain strings. The sample code is:
sot = Application.WorksheetFunction.CountIf(Range("A:A"), "eggs and bacon")
The string I am looking for consists of multiple words separated by spaces. However, I have noticed that when the HTML data is parsed and transferred to Excel, the spacing between the letters is messed up. For example, if I click into the cell and move through it with the left and right arrow keys, the blinking cursor sometimes ends up right in the middle of a letter, not BETWEEN two letters as it should. This stops my code from working, because the number of spaces between the words cannot be determined correctly and so the string cannot be found.
Therefore, sot returns a value of 0 when there is indeed an "eggs and bacon" string in the data.
Please let me know if there is anything I can do, or if you've seen or had this problem before! I am getting very frustrated because I spent so much time writing code that parses the HTML and code that scans through my data looking for keywords, but it won't work just because the spacing gets messed up when going from HTML to Excel. Thank you so much.
Here is my code that parses the data:
Private Sub HTMLParser()
'This code will go to the html page and parse the page for the relevant data and put it into excel
'Select Cell A1 So Code Works As Intended
Range("A1").Select
'Declare variables
Dim ie As Object, i As Long, strText As String
Dim doc As Object, hTable As Object, hBody As Object, hTR As Object, hTD As Object
Dim tb As Object, bb As Object, tr As Object, td As Object
Dim y As Long, z As Long, wb As Excel.Workbook, ws As Excel.Worksheet
Set wb = Excel.ActiveWorkbook
Set ws = wb.ActiveSheet
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
y = 1 'Column A in Excel
z = 1 'Row 1 in Excel
'Enter Desired URL
ie.navigate "MY URL HERE", , , , "Content-Type: application/x-www-form-urlencoded" & vbCrLf
'Wait until page fully loads
Do While ie.Busy: DoEvents: Loop
Do While ie.readyState <> 4: DoEvents: Loop
Set doc = ie.document
Set hTable = doc.getElementsByTagName("table")
'Loop through HTML Tags and paste every HTML table cell into excel
For Each tb In hTable
Set hBody = tb.getElementsByTagName("tbody")
For Each bb In hBody
Set hTR = bb.getElementsByTagName("tr")
For Each tr In hTR
Set hTD = tr.getElementsByTagName("td")
y = 1 ' Resets back to column A
For Each td In hTD
ws.Cells(z, y).Value = td.innerText
y = y + 1
Next td
DoEvents
z = z + 1
Next tr
Exit For
Next bb
Exit For
Next tb
End Sub
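One possible cause, which is a guess rather than something the post confirms, is that the HTML contains non-breaking spaces (character 160) or line breaks that look like ordinary spaces in the cell but do not match a typed space. A minimal sketch of a clean-up step follows; the function name NormalizeSpaces is made up for illustration.

```vba
'Hypothetical helper: turns non-breaking spaces, tabs and line breaks into
'ordinary spaces, then collapses runs of spaces into a single space.
Function NormalizeSpaces(ByVal s As String) As String
    s = Replace(s, ChrW$(160), " ")
    s = Replace(s, vbTab, " ")
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop
    NormalizeSpaces = Trim$(s)
End Function
```

The parser could then write ws.Cells(z, y).Value = NormalizeSpaces(td.innerText). Separately, CountIf with a plain criterion only matches cells whose entire content equals it, so a cell holding more than the phrase would need wildcards, as in "*eggs and bacon*".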

Scraping data from website using vba

I'm trying to scrape data from the website http://uk.investing.com/rates-bonds/financial-futures via VBA, such as real-time prices for the German 5 YR Bobl or the US 30Y T-Bond. I have tried an Excel web query, but it scrapes the whole page and I would like to scrape the rate only. Is there a way of doing this?
There are several ways of doing this. I am writing this answer in the hope that anyone searching for the keywords "scraping data from website" will find all the basics of Internet Explorer automation here, but remember that nothing is worth as much as your own research (if you don't want to be stuck with pre-written code that you're not able to customize).
Please note that this is one way, which I don't prefer in terms of performance (since it depends on the browser's speed) but which is good for understanding the rationale behind Internet automation.
1) If I need to browse the web, I need a browser! So I create an Internet Explorer browser:
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
2) I ask the browser to go to the target webpage. Through the ".Visible" property, I decide whether I want to see the browser doing its job or not. While building the code it is nice to have Visible = True, but once the code is working it is nicer not to see the browser every time, so Visible = False.
With appIE
.Navigate "http://uk.investing.com/rates-bonds/financial-futures"
.Visible = True
End With
3) The webpage will need some time to load. So, I will wait while the browser is busy...
Do While appIE.Busy
DoEvents
Loop
4) Well, now the page is loaded. Let's say that I want to scrape the change of the US30Y T-Bond:
I just press F12 in Internet Explorer to see the webpage's code and then, using the pointer (in the red circle), click on the element that I want to scrape to see how I can reach it.
5) What I should do next is straightforward. First of all, I use the ID property to get the tr element that contains the value:
Set allRowOfData = appIE.document.getElementById("pair_8907")
This gives me a tr element, which is a row of data whose td elements are its cells. We are looking for the 8th cell, so I will write:
Dim myValue As String: myValue = allRowOfData.Cells(7).innerHTML
Why did I write 7 instead of 8? Because the collection of cells starts from 0, so the index of the 8th element is 7 (8-1). A brief analysis of this line of code:
.Cells() makes me access the td elements;
innerHTML is the property of the cell containing the value we look for.
Once we have our value, which is now stored in the myValue variable, we can close the IE browser and release the memory by setting it to Nothing:
appIE.Quit
Set appIE = Nothing
Well, now you have your value and you can do whatever you want with it: put it into a cell (Range("A1").Value = myValue), or into a label on a form (Me.Label1.Caption = myValue).
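Putting the steps above together, a single procedure might look like the sketch below. The procedure name ScrapeOneValue is made up, and two things are additions that the steps above do not include: the readyState check in the wait loop and the Nothing check for the case where the row ID no longer exists on the page.

```vba
Sub ScrapeOneValue()
    Dim appIE As Object, allRowOfData As Object, myValue As String

    'Steps 1 and 2: create the browser and navigate to the page
    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures"
        .Visible = False
    End With

    'Step 3: wait while the browser is busy (readyState check is an addition)
    Do While appIE.Busy Or appIE.readyState <> 4
        DoEvents
    Loop

    'Step 5: get the row by its ID, then read the eighth cell (index 7)
    Set allRowOfData = appIE.document.getElementById("pair_8907")
    If allRowOfData Is Nothing Then
        MsgBox "Row pair_8907 was not found; the page layout may have changed."
    Else
        myValue = allRowOfData.Cells(7).innerHTML
        Range("A1").Value = myValue
    End If

    'Close the browser and release the object
    appIE.Quit
    Set appIE = Nothing
End Sub
```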
I'd just like to point out that this is not how StackOverflow works: here you post questions about specific coding problems, but you should do your own research first. The reason I'm answering a question that doesn't show much research effort is that I see it asked again and again and, back when I was learning how to do this, I remember wishing I had better support to get started. So I hope that this answer, which is just a "study input" and not at all the best or most complete solution, can help the next user who has the same problem. I learned how to program thanks to this community, and I like to think that you and other beginners might use my input to discover the beautiful world of programming.
Enjoy your practice ;)
Other methods were mentioned so let us please acknowledge that, at the time of writing, we are in the 21st century. Let's park the local bus browser opening, and fly with an XMLHTTP GET request (XHR GET for short).
Wiki moment:
XHR is an API in the form of an object whose methods transfer data
between a web browser and a web server. The object is provided by the
browser's JavaScript environment
It's a fast method for retrieving data that doesn't require opening a browser. The server response can be read into an HTMLDocument and the process of grabbing the table continued from there.
Note that javascript rendered/dynamically added content will not be retrieved as there is no javascript engine running (which there is in a browser).
In the below code, the table is grabbed by its id cr1.
In the helper sub, WriteTable, we first loop over the header cells (th tags), then over the table rows (tr tags), and finally traverse each table row cell by cell (td tags). As we only want data from two columns (cell positions 2 and 8 in the code), a Select Case statement is used to specify what is written out to the sheet.
VBA:
Option Explicit
Public Sub GetRates()
Dim html As HTMLDocument, hTable As HTMLTable '<== Tools > References > Microsoft HTML Object Library
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://uk.investing.com/rates-bonds/financial-futures", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Application.ScreenUpdating = False
Set hTable = html.getElementById("cr1")
WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
Application.ScreenUpdating = True
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, C As Long, tBody As Object
r = startRow: If ws Is Nothing Then Set ws = ActiveSheet
With ws
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
Select Case columnCounter
Case 2
.Cells(startRow, 1) = header.innerText
Case 8
.Cells(startRow, 2) = header.innerText
End Select
Next header
startRow = startRow + 1
Set tBody = hTable.getElementsByTagName("tbody")
For Each tSection In tBody
Set tRow = tSection.getElementsByTagName("tr")
For Each tr In tRow
r = r + 1
Set tCell = tr.getElementsByTagName("td")
C = 1
For Each td In tCell
Select Case C
Case 2
.Cells(r, 1).Value = td.innerText
Case 8
.Cells(r, 2).Value = td.innerText
End Select
C = C + 1
Next td
Next tr
Next tSection
End With
End Sub
You can use the WinHttpRequest object instead of Internet Explorer. It fetches only the page's HTML and does not download pictures and advertisements, which makes it lighter than the Internet Explorer object.
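A minimal sketch of that suggestion, assuming the data is present in the raw HTML and is not added later by JavaScript. The names FetchHtml and WinHttpDemo are made up for illustration, and the User-Agent header is an optional assumption about what the server expects.

```vba
'Hypothetical helper: downloads a page's HTML as text without opening a browser.
'Returns an empty string if the server does not answer with status 200.
Function FetchHtml(ByVal url As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
    req.Open "GET", url, False 'False = wait for the response
    req.setRequestHeader "User-Agent", "Mozilla/5.0"
    req.send
    If req.Status = 200 Then FetchHtml = req.responseText
End Function

Sub WinHttpDemo()
    Dim html As Object
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = FetchHtml("https://uk.investing.com/rates-bonds/financial-futures")
    'Print how many tables were found in the downloaded HTML
    Debug.Print html.getElementsByTagName("table").Length
End Sub
```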
This question was asked a long time ago, but I thought the following information would be useful for newbies. You can easily get the values by class name like this.
Sub ExtractLastValue()
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
objIE.Navigate ("https://uk.investing.com/rates-bonds/financial-futures/")
Do
DoEvents
Loop Until objIE.readystate = 4
MsgBox objIE.document.getElementsByClassName("pid-8907-last")(0).innerText
End Sub
And if you are new to web scraping please read this blog post.
Web Scraping - Basics
There are also various techniques for extracting data from web pages. This article explains a few of them with examples.
Web Scraping - Collecting Data From a Webpage
I modified some things that were throwing errors for me and ended up with this, which worked great for extracting the data I needed:
Sub get_data_web()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.navigate "https://finance.yahoo.com/quote/NQ%3DF/futures?p=NQ%3DF"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set allRowofData = appIE.document.getElementsByClassName("Ta(end) BdT Bdc($c-fuji-grey-c) H(36px)")
Dim i As Long
Dim myValue As String
Count = 1
For Each itm In allRowofData
For i = 0 To 4
myValue = itm.Cells(i).innerText
ActiveSheet.Cells(Count, i + 1).Value = myValue
Next
Count = Count + 1
Next
appIE.Quit
Set appIE = Nothing
End Sub