How can I scrape multiple pages/links at once using VBA?

I'm currently trying to scrape info from this Reddit page. My goal is to have Excel open all the posts in new tabs and then scrape information from each of those pages, since the starting page doesn't have as much information.
I've been trying to figure this out for the last few hours, but I'm admittedly pretty confused and unsure what to do next, so any pointers would be greatly appreciated!
Here is my current code. It works decently enough, but as I said, I'm not sure what to do next to open the links it finds one by one and scrape each page for data.
Right now the links are scraped off that first page and added to my spreadsheet, but if possible I'd like to skip that step and scrape them all at once.
Thanks! :)
Sub GetData()
Dim objIE As InternetExplorer
Dim itemEle As Object
Dim upvote As Integer, awards As Integer, animated As Integer
Dim postdate As String, upvotepercent As String, oc As String, filetype As String, linkurl As String, myhtmldata As String, visiComments As String, totalComments As String, removedComments As String
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = False
objIE.navigate (ActiveCell.Value)
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
y = 1
For Each itemEle In objIE.document.getElementsByClassName("flat-list buttons")
visiComments = itemEle.getElementsByTagName("a")(0).innerText
linkurl = itemEle.getElementsByTagName("a")(0).href
Sheets("Sheet1").Range("A" & y).Value = visiComments
Sheets("Sheet1").Range("B" & y).Value = linkurl
y = y + 1
Next
End Sub

You should be able to gather the URLs, visit each one in a loop, write the results from each visited page to an array, and then write the array to the sheet. After your existing line
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
add:
Dim nodeList As Object, i As Long, urls(), results()
Set nodeList = objIE.document.querySelectorAll(".comments")
ReDim urls(0 To nodeList.Length - 1)
ReDim results(1 To nodeList.Length, 1 To 3)
'Store all urls in an array to later loop
For i = 0 To nodeList.Length - 1
urls(i) = nodeList.item(i).href
Next
For i = LBound(urls) To UBound(urls)
objIE.Navigate2 urls(i)
While objIE.Busy Or objIE.readyState <> 4: DoEvents: Wend
'may need a pause here
results(i + 1, 1) = objIE.document.querySelector("a.title").innerText 'title
results(i + 1, 2) = objIE.document.querySelector(".number").innerText 'upvotes
results(i + 1, 3) = objIE.document.querySelector(".word").NextSibling.nodeValue '%
Next
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
Note: Opening all the posts in separate tabs would only potentially save time on the page loads, as VBA is single threaded. To do that you would need to store a reference to each tab, or open all the tabs first and then loop through the relevant open windows to do the scrape. My preference would be to stay in the same tab, to be honest.
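The "may need a pause here" comment in the answer above could be handled with a timed polling loop instead of a fixed wait. What follows is only a sketch under assumptions: WaitForSelector is a hypothetical helper name, the 10-second default is arbitrary, and ie is assumed to be an InternetExplorer object that has already been told to navigate.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Polls the current document until an element matching the CSS selector
' exists, or until maxWaitSec seconds have elapsed.
' Returns the matching element, or Nothing on timeout.
Public Function WaitForSelector(ByVal ie As Object, ByVal selector As String, _
                                Optional ByVal maxWaitSec As Long = 10) As Object
    Dim startTime As Single, node As Object
    startTime = Timer
    Do
        DoEvents
        ' The document may not be available yet, so errors are ignored here
        On Error Resume Next
        Set node = ie.document.querySelector(selector)
        On Error GoTo 0
        If Timer - startTime > maxWaitSec Then Exit Do
    Loop While node Is Nothing
    Set WaitForSelector = node
End Function

' Possible use inside the loop from the answer (assumes objIE, results and i exist):
'   Dim titleNode As Object
'   Set titleNode = WaitForSelector(objIE, "a.title")
'   If Not titleNode Is Nothing Then results(i + 1, 1) = titleNode.innerText
```

Note that Timer resets at midnight, so a run that crosses midnight could wait longer than intended.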

Related

Search a website using excel vba with excel data and extract the active state in flowchart of search result and mapping it into column

I am hoping someone can help....
I have around 7000 values in an Excel spreadsheet that I need to search for on a website, and I then need to record the active state shown in the result flowchart back into the spreadsheet. Since I am new to web scraping with macros, I took an existing web-automation macro and changed the input IDs to match the website I want to extract information from (https://nacionalidade.justica.gov.pt/). I am a bit confused about how to apply an If condition to get the active state, given that the flowchart has seven classes. Here is the flowchart.
Each access code I have will be at a different stage. I only want to pick up the state and put it in column E next to the access code (I am currently doing this manually).
Being new to this type of web data extraction, I am unclear how to extract that info. Any help would be incredible!
Here is my code. I couldn't work out how to adapt it for this website beyond the following line:
objIE.document.getElementById("btnPesquisa").Click
Code:
'start a new subroutine called SearchBot
Sub SearchBot()
'dimension (declare or set aside memory for) our variables
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim aEle As HTMLLinkElement 'special object variable for an <a> (link) element
Dim y As Integer 'integer variable we'll use as a counter
Dim result As String 'string variable that will hold our result link
'initiating a new instance of Internet Explorer and assigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://nacionalidade.justica.gov.pt/"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'in the access code box (SenhaAcesso) put the value of cell D2
objIE.document.getElementById("SenhaAcesso").Value = _
Sheets("Guy Touti").Range("D2").Value
'click the 'go' button
objIE.document.getElementById("btnPesquisa").Click
'wait again for the browser
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'the first search result will go in row 2
y = 2
'for each <a> element in the collection of objects with class of 'result__a'...
For Each aEle In objIE.document.getElementsByClassName("result__a")
'...get the href link and print it to the sheet in col C, row y
result = aEle
Sheets("Guy Touti").Range("E" & y).Value = result
'...get the text within the element and print it to the sheet in col D
Sheets("Guy Touti").Range("D" & y).Value = aEle.innerText
Debug.Print aEle.innerText
'is it a yellowpages link?
If InStr(result, "yellowpages.com") > 0 Or InStr(result, "yp.com") > 0 Then
'make the result red
Sheets("Guy Touti").Range("C" & y).Interior.ColorIndex = 3
'place a 1 to the left
Sheets("Guy Touti").Range("B" & y).Value = 1
End If
'increment our row counter, so the next result goes below
y = y + 1
'repeat times the # of ele's we have in the collection
Next
'add up the yellowpages listings
Sheets("Guy Touti").Range("B1").Value = _
Application.WorksheetFunction.Sum(Sheets("Guy Touti").Range("B2:B100"))
'close the browser
objIE.Quit
'exit our SearchBot subroutine
End Sub
I did try this first but after a while started searching for a better way. Can you help????
You can simplify this by making the same POST XHR request the page makes to get its data, and using the class names to limit the match to nodes with either active1 or active3. Take the last node in that nodeList, extract the step number, and convert the class to a colour via a lookup (if wanted). With 7,000 requests it would be considerate to add a delay of 1-2 seconds every 50 requests, or fewer. You can test i Mod 50 in the loop to decide when to pause, and use Application.Wait Now + TimeSerial(0, 0, 2).
Option Explicit
Public Sub GetStatus()
Dim html As MSHTML.HTMLDocument, xhr As Object, colourLkup As Object
Dim ws As Worksheet, senhas(), i As Long, results()
Set ws = ThisWorkbook.Worksheets("Sheet1")
senhas = Application.Transpose(ws.Range("D2:D" & ws.Cells(ws.Rows.Count, "D").End(xlUp).Row))
ReDim results(1 To UBound(senhas))
Set colourLkup = CreateObject("Scripting.Dictionary")
colourLkup.Add "active1", "green"
colourLkup.Add "active3", "orange"
Set html = New MSHTML.HTMLDocument
Set xhr = CreateObject("MSXML2.XMLHTTP")
For i = LBound(senhas) To UBound(senhas)
If senhas(i) <> vbNullString Then
With xhr
.Open "POST", "https://nacionalidade.justica.gov.pt/Home/GetEstadoProcessoAjax", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send "SenhaAcesso=" & senhas(i)
html.body.innerHTML = .responseText
End With
Dim nodes As Object, classinfo() As String
Set nodes = html.querySelectorAll(".active1, .active3")
'skip access codes that return no active step (e.g. an invalid code)
If nodes.Length > 0 Then
classinfo = Split(nodes.item(nodes.Length - 1).className, Chr$(32))
results(i) = Replace$(classinfo(1), "step", vbNullString) & "-" & colourLkup(classinfo(2))
End If
End If
Set nodes = Nothing
Next
ws.Cells(2, 5).Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
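The delay suggested in the answer above (a short pause every 50 requests) might be sketched as follows. ShouldPause and PauseIfNeeded are hypothetical helper names, and the defaults of 50 requests and 2 seconds simply mirror the numbers mentioned in the answer.

```vba
Option Explicit

' Hypothetical helper: True on every batchSize-th request (50th, 100th, ...).
Public Function ShouldPause(ByVal requestIndex As Long, _
                            Optional ByVal batchSize As Long = 50) As Boolean
    ' A non-positive batch size means "never pause"; this also avoids Mod 0
    If batchSize <= 0 Then Exit Function
    ShouldPause = (requestIndex > 0) And (requestIndex Mod batchSize = 0)
End Function

' Hypothetical helper: call once per loop iteration, after each request.
Public Sub PauseIfNeeded(ByVal requestIndex As Long, _
                         Optional ByVal seconds As Long = 2)
    If ShouldPause(requestIndex) Then
        Application.Wait Now + TimeSerial(0, 0, seconds)
    End If
End Sub
```

In GetStatus, a call such as PauseIfNeeded i placed just before Next would pause after the 50th, 100th, 150th request and so on, since i starts at 1 there.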

VBA HTML Navigate Through Listings

I've got the following code that WORKS, and pulls all of the links for listings in the webpage below. I am now looking to expand this to pull the next page of results (up to n). I took a stab at doing this (second half of this code), but nothing is being displayed.
NOTE: In this sample of code, I was attempting to place the second page of links in Column B, but in an ideal world, I'd like to add the links to the bottom of the results of Page 1 (in Column A).
UPDATE: This code now moves to each page of results, but it pastes the same links into Col A, Col B, Col C, etc. I am not sure how this is happening, as I can watch the browser changing URLs as it goes.
Also, if you have any better ways of doing this (rather than copy/pasting this 10x to get the amount of results I am looking for), please let me know!
Option Explicit
Public Sub GetLinks()
Dim ie As New InternetExplorer, ws As Worksheet, t As Date
Dim k As Integer
Const MAX_WAIT_SEC As Long = 10
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
k = 0
Do While k < 10
.Navigate2 "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=" & k
While .Busy Or .readyState < 4: DoEvents: Wend
Dim Links As Object, i As Long, count As Long
t = Timer
Do
On Error Resume Next
Set Links = .Document.querySelectorAll(".s-item__link[href]")
count = Links.Length
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While count = 0
For i = 0 To Links.Length - 1
ws.Cells(i + 1, k + 1) = Links.item(i)
Next
k = k + 1
Loop
.Quit
End With
End Sub
I would add a test to ensure that the number of pages you request is not greater than the number available. Modularize the code a little by pulling out the info extraction step. Use arrays and some basic optimization (ScreenUpdating) to speed up the whole process. Also, get rid of the ie object as soon as possible.
The code below was run with the listings results count set to 200 (which in fact gives 211 results per page with the given selector). I am not sure whether this is an eBay setting that is remembered or the default.
Option Explicit
Public Sub GetInfo()
Dim ie As InternetExplorer, nodeList As Object, page As Long, totalResults As Long, ws As Worksheet
Const RESULTS_PER_PAGE = 211
Const DESIRED_PAGES = 3
Const BASE = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn="
Dim results(), url As String, maxPages As Long
ReDim results(1 To DESIRED_PAGES)
Application.ScreenUpdating = False
Set ie = New InternetExplorer
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
For page = 1 To DESIRED_PAGES
url = BASE & page
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
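If page = 1 Then
totalResults = Replace$(.document.querySelector(".srp-controls__count-heading").innerText, " results", vbNullString)
'round up so a partly filled last page still counts as a page
maxPages = Application.WorksheetFunction.RoundUp(totalResults / RESULTS_PER_PAGE, 0)
End If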
Set nodeList = .document.querySelectorAll("#srp-river-results .s-item__link[href]")
results(page) = GetLinks(nodeList)
Set nodeList = Nothing
If page >= maxPages Then Exit For 'stop once the last available page has been read
Next
.Quit
End With
If maxPages < DESIRED_PAGES Then ReDim Preserve results(1 To maxPages)
For page = LBound(results) To UBound(results)
If page = 1 Then
ws.Cells(1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
Else
ws.Cells(GetLastRow(ws, 1) + 1, 1).Resize(UBound(results(page), 1)) = Application.Transpose(results(page))
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetLinks(ByVal nodeList As Object) As Variant
Dim results(), i As Long
ReDim results(1 To nodeList.Length)
For i = 0 To nodeList.Length - 1
results(i + 1) = nodeList.item(i)
Next
GetLinks = results
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Untested (and I might be missing something), but seems like you can just specify the page you want to access with URL query parameter _pgn.
So for example, navigating to the URL below:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0&_ipg=200&_pgn=2
means you're requesting page 2 (also, the _ipg parameter seems to dictate how many results are shown on a page, so increasing to 200 might mean you need to make fewer requests overall).
So if you create some variable pageNumber in your code and increment it inside a loop (one that terminates once you've reached the last page), you should be able to get all pages, or even any page at some arbitrary index, without copy-pasting or repeating yourself in your code.
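The pageNumber idea described above could look something like this. It is a minimal sketch: BuildPageUrl and ListPageUrls are hypothetical names, the query string is copied from the example URL above, and the upper bound of 3 pages is an arbitrary assumption.

```vba
Option Explicit

' Hypothetical helper: builds the search URL for one results page.
' _ipg is the results-per-page parameter and _pgn the page number,
' as described in the answer above.
Public Function BuildPageUrl(ByVal pageNumber As Long, _
                             Optional ByVal resultsPerPage As Long = 200) As String
    BuildPageUrl = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=iPhone&_sacat=0" & _
                   "&_ipg=" & resultsPerPage & "&_pgn=" & pageNumber
End Function

' Prints the URL of each page to the Immediate window.
' In real code the Debug.Print line would be replaced by the
' navigate-and-scrape step.
Public Sub ListPageUrls()
    Dim pageNumber As Long
    For pageNumber = 1 To 3
        Debug.Print BuildPageUrl(pageNumber)
    Next pageNumber
End Sub
```

Keeping the URL construction in one function means the loop body stays the same however many pages are requested.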

Cannot Read New Page HTML Excel VBA

I am trying to scrape data from a results page after entering search variables on the previous page. I cannot get the document's HTML (.Document) to reflect the new page consistently (it reads correctly in only 5-10% of cases). I have looked at other solutions for this problem, and the majority come down to not waiting for the new page's HTML to appear, but I have put in waits of 5, 10 and even 20 seconds and that doesn't seem to do the trick.
Any advice / pointers would be welcome.
Sub ParseInternet()
Dim post_code As String
Dim house_num As String
post_code = CStr(Sheet1.Cells(9, 2).Value) 'get search data from worksheet
house_num = CStr(Sheet1.Cells(9, 1).Value)
If post_code = "" Then
MsgBox ("House Name /Number and postcode MUST be entered")
Exit Sub
End If
Set site = CreateObject("InternetExplorer.application")
Dim url As String
url = "http://landregistry.data.gov.uk/app/ppd"
site.Navigate url
While site.ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
site.Visible = True
site.Document.getElementById("paon").Value = house_num
site.Document.getElementById("postcode").Value = post_code
Set my_classes = site.Document.getElementsByClassName("btn btn-primary")
For Each my_class In my_classes
my_class.Click
Next my_class
While site.ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Application.Wait (Now + TimeValue("0:00:10"))
Dim a As String
Dim b As String
b = "property by searching for"
'''STILL PREVIOUS PAGE HTML IN .DOC
Set HTMLLI = site.Document.getElementsByTagName("div")
For xli = 0 To HTMLLI.Length - 1
a = site.Document.getElementsByTagName("div")(xli).innerText
If InStr(site.Document.getElementsByTagName("div")(xli).innerText, b) Then
Sheet1.Cells(9, 4).Value = site.Document.getElementsByTagName("div")(xli).innerText
Exit For
End If
Next xli
site.Quit
Set site = Nothing
End Sub

Excel VB Searching for text on a webpage and Copying information in the same Element

I am relatively new to VBA.
I am trying to use this code to grab a bit of information from a website. When I look it up by element, I have to search for the tag name, which is tr, and use an index number to say which one I want. The problem is that its position on the page changes frequently. Currently the keyword I want to search for and the information that goes with it look like this:
<tr>
<td class="nt">Operations</td>
<td>Windows</td>
</tr>
So if I could search for the label "Operations" (in the cell with class "nt") and get the information "Windows", that would help. Also, I am currently getting this error:
Next without For
If possible, is there a way to do multiple searches before I close the page? I would look for several specific words and put that data into different cells before moving on to the next column, repeating until the end of the x range. I currently only have it set to x = 2 To 5, but I would like to increase that to 10 or higher in the future.
The current code looks like this.
Private Sub Worksheet_Change(ByVal Target As Range)
For x = 2 To 5
If Target.Row = Cells(x, 35).Row And _
Target.Column = Cells(x, 35).Column Then
'If Target.Row = Range("ManufacturerPartNumber").Row And _
'Target.Column = Range("ManufacturerPartNumber").Column Then
Dim IE As New InternetExplorer
'IE.Visible = True
'For x = 2 To 5
'IE.navigate "" & Range("Website_1").Value
IE.navigate "" & Cells(x, 35).Value
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim sDD As String
sDD = InStr(1, IE.document.body.innerHTML, "Processor Model")
'sDD = Trim(Doc.getElementsByTagName("Processor Model")(1).innerText) 'Use this with tag like dd and number for which it appears like 0 or 1
IE.Quit
Dim aDD As Variant
aDD = Split(sDD, ",")
Cells(x, 44).Value = aDD(0)
'Range("ProcessorNumberCd").Value = aDD(0)
'Range("OSProvided").Value = aDD(0)
Next x
End If
'MsgBox "Complete"
End Sub
I think you want to grab the 'inner text'. Take a look at the example below.
Sub Scraper()
Dim objIE As Object
Dim item As Long
Dim priceStr As String
Dim priceTag As Object
Dim priceTable As Object
item = "10011" 'this will eventually be placed in a loop for multiple searches
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True
' navigate and download the web page
objIE.Navigate "www.google.com"
Do While objIE.ReadyState <> 4 Or objIE.Busy
DoEvents
Loop
'objIE.Document.getElementsByTagName("input")(0).Value = item
'objIE.Document.getElementByID("FDI").Click
Set priceTable = objIE.Document.getElementByID("price_FGC")
Set priceTag = priceTable.getElementsByTagName("u")(3)
priceStr = priceTag.innerText
Sheet1.Range("A1").Value = priceStr
objIE.Quit
End Sub
Also, check out this link for several other ways to do similar things.
http://www.tushar-mehta.com/publish_train/xl_vba_cases/vba_web_pages_services/index.htm
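For the specific markup in the question (a td with class "nt" holding the label, followed by a td holding the value), a lookup by label text rather than by position might be sketched like this. GetValueByLabel is a hypothetical helper, and the sketch assumes the value is always the td immediately after the label td in document order, as in the HTML fragment shown in the question.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Returns the text of the cell that follows the <td class="nt"> whose text
' equals labelText, or an empty string if no such label is found.
Public Function GetValueByLabel(ByVal doc As Object, ByVal labelText As String) As String
    Dim cells As Object, i As Long
    Set cells = doc.getElementsByTagName("td")
    For i = 0 To cells.Length - 1
        If cells.Item(i).className = "nt" Then
            If Trim$(cells.Item(i).innerText) = labelText Then
                ' Assumption: the value cell is the next td in document order
                If i + 1 < cells.Length Then
                    GetValueByLabel = Trim$(cells.Item(i + 1).innerText)
                End If
                Exit Function
            End If
        End If
    Next i
End Function
```

Called before IE.Quit in the question's code, something like Cells(x, 44).Value = GetValueByLabel(IE.document, "Operations") could be repeated with a different label and target column for each value, so several lookups run against one page load.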

Copying columns of values from Excel worksheet to an HTML table in a website

I'm hoping to copy values from an Excel table into an HTML table. If anyone has experience doing this between Excel 2013 and Internet Explorer 11, please let me know and I can give details. Thank you.
The following is a small section of the code for the website whose fill-in process I am trying to automate:
[Screenshot: Table and Source Code for Charge Numbers for the Site]
This shows just one part of the process (the charge numbers). We have anywhere from 10 to 25 of them each week, with the corresponding hours charged to them each day (as seen in the table matrix).
There are 7 charge numbers on each page. The charge number boxes all have name="chargeNo", and their ids increase with each box going down the page, like so:
'1) id="chargeNo0"
'2) id="chargeNo1"
'3) id="chargeNo2"
'4) id="chargeNo3"
'5) id="chargeNo4"
'6) id="chargeNo5"
'7) id="chargeNo6"
Each day of the week also has a set name, as follows:
[Screenshot: Table and Source Code for Hours Charged Each Day]
'Saturday: name="hrs0"
'Sunday: name="hrs1"
'Monday: name="hrs2"
'Tuesday: name="hrs3"
'Wed: name="hrs4"
'Thurs: name="hrs5"
'Fri: name="hrs6"
I'm trying to automate the process of pulling the arrays of charge times I have from Excel and inputting them into their respective textboxes.
The first script below converts the values in column A into an array; the second joins them into a comma-delimited string.
Sub MakeArray()
Dim arr As Variant
With ActiveSheet
arr = WorksheetFunction.Transpose(.Range(.[A1], .Cells(Rows.Count, "A").End(xlUp)))
End With
End Sub
Sub MakeString()
Dim s As String
Const DELIMITER = ","
With ActiveSheet
s = Join(WorksheetFunction.Transpose(.Range(.[A1], .Cells(Rows.Count, "A").End(xlUp))), DELIMITER)
End With
End Sub
If you want to pass the array to an HTML TextBox, try this.
Sub CreateArrayAndPassToHTMLTextbox()
Dim ie As Object
Dim arr As Variant
Const DELIMITER = ","
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate "http://home.arcor.de/peter.schleif/SearchForTDelement.html"
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
With ActiveSheet
arr = WorksheetFunction.Transpose(.Range(.[A1], .Cells(Rows.Count, "A").End(xlUp)))
If Not IsArray(arr) Then arr = Array(arr)
ie.document.getElementById("trackField").Value = Join(arr, DELIMITER)
End With
End Sub
If, for instance, you want to pass data to a ComboBox, try this.
Sub passValueToComboBox1()
Dim ie As Object
Dim oHTML_Element As IHTMLElement
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate "http://peterschleif.bplaced.net/excel/combobox/index.php"
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
Set oHTML_Element = ie.document.getElementsByName("selectedReportClass")(0)
If Not oHTML_Element Is Nothing Then oHTML_Element.Value = "com.db.moap.report.FUBU7"
For Each oHTML_Element In ie.document.getElementsByTagName("input")
If oHTML_Element.Type = "submit" Then oHTML_Element.Click: Exit For
Next
End Sub
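Applied to the charge number boxes described in the question (ids chargeNo0 to chargeNo6), the same idea could be sketched as a loop over the ids. FillChargeNumbers is a hypothetical helper, and it assumes the page is already loaded, that doc is its document object, and that the values come from a one-dimensional array such as the one produced by MakeArray.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Writes up to seven values from a one-dimensional array into the inputs
' with ids chargeNo0 .. chargeNo6 on an already-loaded page.
Public Sub FillChargeNumbers(ByVal doc As Object, ByVal chargeNumbers As Variant)
    Const BOXES_PER_PAGE As Long = 7   ' per the question: 7 charge numbers per page
    Dim i As Long, boxIndex As Long, box As Object
    For i = LBound(chargeNumbers) To UBound(chargeNumbers)
        boxIndex = i - LBound(chargeNumbers)
        If boxIndex >= BOXES_PER_PAGE Then Exit For
        Set box = doc.getElementById("chargeNo" & boxIndex)
        ' Skip silently if the page has no box with this id
        If Not box Is Nothing Then box.Value = CStr(chargeNumbers(i))
    Next i
End Sub
```

A call such as FillChargeNumbers ie.document, arr would fill the first page; the hours inputs (name="hrs0" to name="hrs6") would need a similar loop, probably using getElementsByName, whose exact layout on this site is not known from the question.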