Excel Macros - Using Excel Data to scrape HTML page - html

As with many other questions, I'm really new to using Excel macros and HTML.
I'm building a table of items from Dungeons and Dragons (DnD), and I want to pull each item's description from a particular web page: http://www.5esrd.com/gamemastering/magic-items/ . For example, an entry in my table would be Adamantine Armor, so I would want the macro to search the web page for that listing, scrape the description shown there, and save it to the Excel document. In this case, the description (as per the 5esrd web page) is as follows:
Armor (medium or heavy, but not hide), uncommon
This suit of armor is reinforced with adamantine, one of the hardest substances in existence. While you’re wearing it, any critical hit against you becomes a normal hit.
On this page, all the items are links to pages that group them by type (sword, wand, etc.). I would like to iterate through my row of items in Excel, search for each item on the page, and scrape its description into my Excel table.
I was following along with this page: https://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm, which scrapes data from Stack Overflow's home page. It directly targets some of the tables that the web page uses to organize its entries. The DnD page I'm trying to use doesn't organize its entries into tables like this, so I'm a little lost as to how to proceed.
If anyone could help point me in the right direction, I would be very grateful!

This is the code I use for scraping a web page on our company site. I put all my data in column B, so make adjustments accordingly. This should get you started.
Sub TestScrape()
    ' Scrape a web page.
    ' Needs references to "Microsoft HTML Object Library" and "Microsoft Internet Controls".
    Dim ieDoc As New HTMLDocument   ' working copy of the page's HTML
    Dim AnchorLinks As Object       ' all <table> elements on the page
    Dim tdElements As Object        ' all <td> cells on the page
    Dim tdElement As Object
    Dim AnchorLink As Object
    Dim lRow As Long
    Dim ie As InternetExplorer

    Set ie = New SHDocVw.InternetExplorer
    With ie
        .Visible = True
        .Left = 0
        .TheaterMode = True '<<-- Comment this out if you don't want Theater Mode
        .Navigate url:="Enter your URL here"
        While .ReadyState < 4 Or .Busy: DoEvents: Wend
    End With
    Application.Wait Now + TimeSerial(0, 0, 4)   ' extra 4-second pause after the page reports ready

    ieDoc.body.innerHTML = ie.Document.body.innerHTML
    With ieDoc.body
        Set AnchorLinks = .getElementsByTagName("table")
        Set tdElements = .getElementsByTagName("td")
        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next AnchorLink
    End With

    ' Write each cell's text to column B, one row per cell
    lRow = 1
    For Each tdElement In tdElements
        Debug.Print tdElement.innerText
        Cells(lRow, 2).Value = tdElement.innerText
        lRow = lRow + 1
    Next
    Application.Wait (Now + TimeValue("0:00:01"))

    With ie
        .TheaterMode = False
        .Quit
    End With
    Set ie = Nothing
End Sub
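The question notes that the items on the 5esrd page are links rather than table cells, so the td loop above will likely need adapting. A minimal sketch of one way to locate an item's link by its visible text, assuming the link text matches the item name stored in the sheet (for example "Adamantine Armor"). FindLinkByText is a hypothetical helper name, and the page's actual markup has not been checked here:

```vba
Option Explicit

' Hypothetical helper: return the href of the first link whose visible
' text equals itemName (case-insensitive), or "" if no link matches.
' doc is expected to be a loaded HTML document such as ie.Document.
Public Function FindLinkByText(ByVal doc As Object, ByVal itemName As String) As String
    Dim links As Object
    Dim i As Long

    Set links = doc.getElementsByTagName("a")
    For i = 0 To links.Length - 1
        If StrComp(Trim$(links(i).innerText), Trim$(itemName), vbTextCompare) = 0 Then
            FindLinkByText = links(i).href
            Exit Function
        End If
    Next i
End Function
```

It could be called once per item row, for example Debug.Print FindLinkByText(ie.Document, Cells(1, 1).Value). The returned address would then still have to be opened and the description read from that second page.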

Related

Trying to only extract a piece of innertext

Option Explicit
Sub VBAWebscraping2()
    Dim IEObject As Object
    Set IEObject = New InternetExplorer
    IEObject.Visible = True
    IEObject.navigate url:="https://streeteasy.com/building/" & Cells(2, 4).Value
    Do While IEObject.Busy = True Or IEObject.readyState <> READYSTATE_COMPLETE
        Application.Wait Now + TimeValue("00:00:01")
    Loop
    Dim IEDocument As HTMLDocument
    Set IEDocument = IEObject.document
    'GRAB by classname'
    Dim IEElements As IHTMLElementCollection
    Dim IEElement As IHTMLElement
    Set IEElements = IEDocument.getElementsByClassName("details")
    For Each IEElement In IEElements
        If IEElement.innerText = "price" Then
            Debug.Print (IEElement.innerText)
        End If
        Exit For
    Next
    'Dim lastRow As Long
    'lastRow = Range("A" & Rows.count).End(xlUp).row
End Sub
It goes to the desired location, but when it tries to extract the price it either prints nothing in the Immediate window or grabs too much. I'm trying to find a more precise way to grab only the price. A lot of the tutorials mention id attributes, but there aren't many on this website.
This is the page I'm trying to scrape: https://streeteasy.com/building/the-cambridge-500-east-85-street-new_york/15l?card=1
I'm also trying to retrieve the number of rooms, the number of baths, and the neighborhood.
For the price, you can select by class:
IEDocument.querySelector(".price").innerText
The same works for rooms:
IEDocument.querySelector(".first_detail_cell").innerText
Beds:
IEDocument.querySelector("[class='detail_cell ']").innerText
Baths:
IEDocument.querySelector(".last_detail_cell").innerText
Looking at how the classes are named, your mileage with different pages will almost certainly vary if any of these items are missing or in a different order. You will then need to loop the element list returned by selecting for class .detail_cell and test the .innerText for the presence of the text "room", "bed" or "bath", and assign accordingly.
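The fallback described above (loop over every .detail_cell element and decide from its text what it holds) could be sketched as follows. DetailKind and PrintDetails are hypothetical names, and the sketch assumes the cell text contains the words "bed", "bath" or "room" as the answer suggests. The live page's markup has not been verified here:

```vba
Option Explicit

' Hypothetical helper: classify a detail cell by the words in its text.
' "bed" and "bath" are tested before "room" so that text such as
' "bedroom" or "bathroom" is not filed under rooms.
Private Function DetailKind(ByVal cellText As String) As String
    If InStr(1, cellText, "bed", vbTextCompare) > 0 Then
        DetailKind = "beds"
    ElseIf InStr(1, cellText, "bath", vbTextCompare) > 0 Then
        DetailKind = "baths"
    ElseIf InStr(1, cellText, "room", vbTextCompare) > 0 Then
        DetailKind = "rooms"
    Else
        DetailKind = vbNullString
    End If
End Function

' Print every recognised detail cell. doc is a loaded document
' such as IEObject.document.
Public Sub PrintDetails(ByVal doc As Object)
    Dim cells As Object
    Dim i As Long
    Dim kind As String

    Set cells = doc.querySelectorAll(".detail_cell")
    ' Index loop rather than For Each over the querySelectorAll result
    For i = 0 To cells.Length - 1
        kind = DetailKind(cells.Item(i).innerText)
        If Len(kind) > 0 Then
            Debug.Print kind & ": " & Trim$(cells.Item(i).innerText)
        End If
    Next i
End Sub
```

Because the classification depends only on the text, it should keep working if a cell is missing or the cells appear in a different order, which is the situation the answer warns about.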

Excel VBA browser scraping loop: Reload page for each loop

I am having a bit of trouble with web scraping. I have a loop that goes to a website (https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx), searches for a specific company, clicks Search, then opens the company page and extracts the company's founding date. My real table is large, but I am using a smaller one here for simplicity. The loop goes from the top to the bottom of the table in Excel and writes each founding date into column B, next to the company name.
{Table example}. My code is as follows:
Option Explicit
Sub click_search()
    Dim i As SHDocVw.InternetExplorer
    Dim idoc As MSHTML.HTMLDocument
    Dim output As Range
    Dim txt_input As Range
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim n As Integer
    Set wb = ActiveWorkbook
    Set ws = wb.Sheets(1)
    Set i = New InternetExplorer
    i.Visible = True
    i.Navigate "https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx"
    Do While i.ReadyState <> READYSTATE_COMPLETE
    Loop
    Set idoc = i.Document
    On Error Resume Next
    For n = 1 To 3
        idoc.getElementById("ctl00_ContentPlaceHolder1_frmEntityName").Value = Cells(n, 1).Value
        Application.Wait Now + #12:00:01 AM#
        idoc.getElementById("ctl00_ContentPlaceHolder1_btnSubmit").click
        Application.Wait Now + #12:00:15 AM#
        idoc.getElementById("ctl00_ContentPlaceHolder1_rptSearchResults_ctl00_lnkbtnEntityName").click
        Application.Wait Now + #12:00:10 AM#
        Dim V As Variant
        Set V = idoc.getElementById("ctl00_ContentPlaceHolder1_lblIncDate")
        Cells(n, 2) = V.innerHTML
        i.Quit
        Set i = New InternetExplorer
        i.Visible = True
        i.Navigate "https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx"
        Application.Wait Now + #12:00:10 AM#
    Next n
End Sub
The problem is that I have to quit Internet Explorer and restart it in order to search for the next company, and this is where my loop fails: it works as intended and gets the correct date for the first company, but on the second iteration it gets stuck on the start page and won't search at all. Just reloading the page won't work either, because the browser then asks you to manually confirm the reload and gets stuck there. Any help on how to fix this issue is greatly appreciated.
Thanks in advance!
I removed the last big chunk of the loop, starting at i.Quit, and replaced it with:
i.Navigate "https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx"
Application.Wait Now + #12:00:10 AM#
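Put together, the revised loop might look like the sketch below. It reuses one browser instance and navigates back to the search page at the end of each pass, as described above. The element ids and wait times are taken from the question; the macro name click_search_reuse is hypothetical, and re-reading i.Document after each wait is an added precaution (an assumption, not part of the original fix):

```vba
Option Explicit

' Sketch only: needs a reference to "Microsoft Internet Controls",
' like the code in the question.
Sub click_search_reuse()   ' hypothetical name
    Const SEARCH_URL As String = "https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx"
    Dim i As SHDocVw.InternetExplorer
    Dim dateLabel As Object
    Dim n As Long

    Set i = New SHDocVw.InternetExplorer
    i.Visible = True
    i.Navigate SEARCH_URL
    Do While i.Busy Or i.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    For n = 1 To 3
        i.Document.getElementById("ctl00_ContentPlaceHolder1_frmEntityName").Value = Cells(n, 1).Value
        i.Document.getElementById("ctl00_ContentPlaceHolder1_btnSubmit").Click
        Application.Wait Now + TimeSerial(0, 0, 15)
        i.Document.getElementById("ctl00_ContentPlaceHolder1_rptSearchResults_ctl00_lnkbtnEntityName").Click
        Application.Wait Now + TimeSerial(0, 0, 10)

        ' Re-read the document after each page change (added precaution)
        Set dateLabel = i.Document.getElementById("ctl00_ContentPlaceHolder1_lblIncDate")
        If Not dateLabel Is Nothing Then Cells(n, 2).Value = dateLabel.innerHTML

        ' Go back to the search page instead of quitting the browser
        i.Navigate SEARCH_URL
        Application.Wait Now + TimeSerial(0, 0, 10)
    Next n

    i.Quit
End Sub
```

Unlike the original, this sketch does not use On Error Resume Next, so a search that returns no result link will stop with a run-time error instead of silently writing nothing.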

Why is my VBA copy/paste of web page table to Excel truncated at the top of the table

I am trying to copy a table from a web page into an Excel worksheet with VBA. The web page has only one table. See below for my code. I am using a slightly modified version of an answer posted by QHarr in July of 2018.
My code runs without error, however there is nothing in Cell(1, 1) in my target worksheet. I was expecting a copy of the table from the web page. (Manual select, copy, paste works just fine.) Once I get this working, it will be part of a loop to fetch and process a lot of pages/tables.
In the Immediate Window, "Debug.Print hTable.outerHTML" shows that the table tag, the title row, the headings row, and the beginning (tr and the first 2 td) of the first data row are missing. The rest of the table rows, through the end of the table, appear to be present.
I'm a bit surprised that the top of the table is missing. To me, it suggests that something ran out of space and the "oldest" data got pushed off the top of the stack.
I'm assuming that my empty worksheet is due to the incomplete table definition.
My question is, how do I get past this problem? I am very new to VBA.
=====
My code so far...
Option Explicit
Private Sub CommandButton1_Click()
    Dim IE As InternetExplorer
    Dim html As HTMLDocument, hTable As HTMLTable, clipboard As Object
    Dim travURL As String
    Set IE = New InternetExplorer
    Application.ScreenUpdating = False
    'Point to hyperlink
    travURL = Cells(5, 18).Hyperlinks(1).Address
    With IE
        .Visible = True
        .Navigate (travURL)
        While .Busy Or .readyState < 4: DoEvents: Wend
        'Set html = .Document
        'With html
        With .Document
            Set hTable = .getElementsByTagName("table")(0)
            Debug.Print hTable.outerHTML
            Set clipboard = New MSForms.DataObject
            clipboard.SetText hTable.outerHTML
            clipboard.PutInClipboard
            ActiveSheet.Cells(1, 1).PasteSpecial
        End With
        .Quit
    End With
    Application.ScreenUpdating = True
End Sub

How to access the Web using VBA? Please check my code

To automate some repetitive work, I am trying to use VBA to access a website that my company uses.
I wrote some VBA code and confirmed that it can access ordinary sites such as Google and YouTube.
However, I don't know why it cannot access the company site.
VBA stops at this line:
Set HTMLDoc = IE_ctrl.document
Thank you in advance.
I also noticed one difference (in the VBA Locals values and types) between the normal site and the company site.
Please see the two pictures below.
Sub a()
    Dim IE_ctrl As InternetExplorer
    Dim HTMLDoc As HTMLDocument
    Dim input_Data As IHTMLElement
    Dim URL As String
    URL = "https://www.google.com"
    Set IE_ctrl = New InternetExplorer
    IE_ctrl.Silent = True
    IE_ctrl.Visible = True
    IE_ctrl.navigate URL
    Wait_Browser IE_ctrl
    Set HTMLDoc = IE_ctrl.document
    Wait_Browser IE_ctrl
    Set input_Data = HTMLDoc.getElementsByClassName("text").Item
    input_Data.Click
End Sub
Sub Wait_Browser(Browser As InternetExplorer, Optional t As Integer = 1)
    While Browser.Busy
        DoEvents
    Wend
    Application.Wait DateAdd("s", t, Now)
End Sub
[Image: normal site (works correctly)]
[Image: company site (raises the error)]
You can try the following code. Please read the comments. I can't say any more because I don't know the page or its HTML.
Sub a()
    'Use late binding for what you need
    Dim ie As Object
    Dim nodeInputData As Object
    Dim url As String

    url = "https://www.google.com"

    'Use the Windows GUID to initialize Internet Explorer if you
    'want to access a company page. This helps if there are
    'security rules that block the other ways of initializing IE.
    'In most cases this doesn't work for pages on the "real" web.
    'Read here for more info:
    'https://blogs.msdn.microsoft.com/ieinternals/2011/08/03/default-integrity-level-and-automation/
    Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
    ie.Visible = True
    ie.navigate url

    'Wait for the document to load
    Do Until ie.readyState = 4: DoEvents: Loop
    'If necessary, wait a little longer for dynamic content that is
    'still loading after IE reports that the page is ready
    '(The three arguments are: hours, minutes, seconds)
    Application.Wait (Now + TimeSerial(0, 0, 1))

    'I don't know your HTML. If you only want to click a button,
    'you don't need a variable:
    'ie.document.getElementsByClassName("text")(0).Click
    'does the same as
    Set nodeInputData = ie.document.getElementsByClassName("text")(0)
    nodeInputData.Click

    'A short explanation of getElementsByClassName() and getElementsByTagName():
    'Both methods create a node collection of all HTML elements that match
    'the criteria in the brackets. This is because there can be any number of
    'HTML elements with the specified class name or tag name. If, for example,
    '3 HTML elements with the class name "Text" are found, a node collection
    'with three elements is created by getElementsByClassName("Text").
    'These have the indices 0 to 2, as in an array. The individual elements are
    'addressed via these indices, which are given in round brackets.
End Sub
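Since the collection described in the comments above can also be empty, indexing it with (0) fails when nothing matches. A minimal sketch of a guard, assuming a loaded document is passed in; ClickFirstByClass is a hypothetical helper name:

```vba
Option Explicit

' Hypothetical helper: click the first element with the given class name.
' Returns True if an element was found and clicked, False otherwise.
' doc is a loaded document such as ie.document.
Public Function ClickFirstByClass(ByVal doc As Object, ByVal className As String) As Boolean
    Dim nodes As Object

    Set nodes = doc.getElementsByClassName(className)
    If nodes.Length = 0 Then
        ' Empty collection: index 0 does not exist
        Debug.Print "No element with class '" & className & "' was found."
    Else
        ' Indices run from 0 to Length - 1
        nodes(0).Click
        ClickFirstByClass = True
    End If
End Function
```

For example, If Not ClickFirstByClass(ie.document, "text") Then Exit Sub lets the macro stop cleanly when the page does not contain the expected element.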

Using VBA to Copy Multiple HTML tables from browser to Excel

Sounds simple right?
I am working on a class project and I need a bunch of data from a bunch of .gov websites and I am struggling to get this to work reliably.
I was able to get this to work perfectly on tables where the cells contained only text, using the .innerText property of an HTML object (and of course I left my flash drive at school, so I can't see exactly what I used, figures).
Anyway, my issue is that I cannot get the information from some cells when they contain links to other pages. For example, on this OPM site, https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/2017/general-schedule/, each row offers a web option for viewing the form, and I cannot get my code to copy those web addresses into Excel. Copying the text in the cells works fine, but I cannot figure out how to copy the href text.
Does anyone happen to have any experience with this and could hopefully point me in the right direction?
Best,
Clueless Guy who is lost as usual.
Edit: Here is my code so far. I removed the bits about href because they were causing it not to work and were almost certainly incorrect.
Sub GetTables()
    Dim doc As HTMLDocument
    Dim htmTable As HTMLTable
    Dim hpLink As IHTMLElement
    Dim data
    Dim x As Long, y As Long
    Dim oRow As Object, oCell As Object
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    x = 1
    y = 1
    Set doc = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/2017/general-schedule/"
        .send
        Do: DoEvents: Loop Until .readyState = 4
        doc.body.innerHTML = .responseText
        .abort
    End With
    Set htmTable = doc.getElementsByClassName("DataTable")(0)
    With htmTable
        Debug.Print .Rows(0).Cells(1).innerText
        Debug.Print .Rows(6).Cells(1).innerText
        Debug.Print .Rows(7).Cells(1).innerText
        ReDim data(1 To .Rows.Length, 1 To .Rows(1).Cells.Length)
        For Each oRow In .Rows
            For Each oCell In oRow.Cells
                data(x, y) = oCell.innerText
                'Previously, I had attempted to use oCell.href to get the value
                'but that did not work.
                y = y + 1
            Next oCell
            y = 1
            x = x + 1
        Next oRow
    End With
    Sheets(1).Cells(1, 1).Resize(UBound(data), UBound(data, 2)).Value = data
End Sub
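One likely reason oCell.href did not work is that a table cell has no href of its own; the address belongs to the a element inside the cell. A minimal sketch, assuming each cell of interest contains at most one link that matters. FirstLinkInCell is a hypothetical helper name:

```vba
Option Explicit

' Hypothetical helper: return the href of the first <a> element inside
' a table cell, or "" if the cell contains no link.
Public Function FirstLinkInCell(ByVal cell As Object) As String
    Dim anchors As Object

    Set anchors = cell.getElementsByTagName("a")
    If anchors.Length > 0 Then
        FirstLinkInCell = anchors(0).href
    End If
End Function
```

Inside the inner loop it could be used as data(x, y) = FirstLinkInCell(oCell) for the columns that hold links. Because the document here is built from responseText rather than loaded in a browser, relative links may come back with an about: prefix, in which case the site's base address would need to be substituted.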