I'm trying to figure out a way to extract information from an HTML source using a Visual Basic based application I made in Visual Studio 2010. What I'd like to do is have the system load a webpage (based on an order number) in the background and search for a value that is assigned to an HTML tag and return that value. For example, in the href string below:
href="#" class="lnk11blue" onClick="parent.gotoPage('/oe/cllctrshp_.html?dlvyId=26130700&hdrId=7205902&lineId=21188936&ordLnTyp=FEL SVC LINE','dWnd')">26130700
I want the tool to return the 26130700, either after the "dlvyId=" or at the end of the href. The issue I have is the dlvyId changes with every order, as probably does the hdrId and lineId values, so is there a way to have my program read the value after "dlvyId=" after it locates this string? Or is there a way for the program to read the text after the greater than carat after locating the string?
I'll mess around and see what I can find out (and hopefully post code of some attempts), but any ideas/help in the meantime would be greatly appreciated.
Edit: Thanks to Steve, I've got the function to search a string. However, now I'm having trouble loading the page source. I tried this code below but it doesn't seems to work:
Dim objHttp as object
Dim strURL As String
Dim strText As String
objHttp = CreateObject("MSXML2.ServerXMLHTTP")
strURL = "http://companywebprd.tc.company.com/oe/cllctrord_.html?order_nbr=" &
RMA_Number.Text & "&customer_id=&po_nbr=&ord_date=&ord_cond=0&ord_kind=0&serial_nbr=&item_id=
&i_ord_type=&instance=&svcChk=1&custSiteUseId=0&custSiteUseCd=0"
objHttp.Open("GET", strURL, False)
objHttp.Send("")
strText = objHttp.responsetext
Advice? I'll keep searching around as well
This should get you started down the right path...
Sub test()
Const mystring = "href='#' class='lnk11blue' onClick='parent.gotoPage('/oe/cllctrshp_.html?dlvyId=26130700&hdrId=7205902&lineId=21188936&ordLnTyp=FEL SVC LINE','dWnd')'>26130700'"
If InStr(1, mystring, "dlvyId=") <> 0 Then
For i = InStr(1, mystring, "dlvyId=") To Len(mystring)
If Mid(mystring, i, 1) = "&" Then
Exit For
End If
Next i
MsgBox Mid(mystring, InStr(1, mystring, "dlvyId="), (i - InStr(1, mystring, "dlvyId=")))
End If
End Sub
Related
I don't have internet explorer on any of the computers at work, therefore creating a object of internet explorer and using ie.navigate to parse the html and search for the tags isn't possible. My question is, how can I pull certain data with a tag automatically from a frame source to my spreadsheet without using IE? Example of code in answers would be very useful :) Thanks
You could use XMLHTTP to retrieve the HTML source of a web page:
Function GetHTML(url As String) As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.Send
GetHTML = .ResponseText
End With
End Function
I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.
Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.
This basic function will return the value between the <tag> and </tag>:
Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
Dim html As String, pStart As Long, pEnd As Long, o As Integer
html = GetHTML(url)
'remove <> if they exist so we can add our own
If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
End If
' default to Occurrence #1
If occurNum = 0 Then occurNum = 1
pEnd = 1
For o = 1 To occurNum
' find start <tag> beginning at 1 (or after previous Occurence)
pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
If pStart = 0 Then
getTag = "{Not Found}"
Exit Function
End If
pStart = pStart + Len("<" & tag & ">")
' find first end </tag> after start <tag>
pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
Next o
'return string between start <tag> & end </tag>
getTag = Mid(html, pStart, pEnd - pStart)
End Function
This will find only basic <tag>'s but you could add/remove/change the text functions to suit your needs.
Example Usage:
Sub findTagExample()
Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"
'search for 2nd occurence of tag: <h2> which is "Contents" :
Debug.Print getTag(testURL, "<h2>", 2)
'...this returns the 8th occurence, "Navigation Menu" :
Debug.Print getTag(testURL, "<h2>", 8)
'...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
End Sub
Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.
I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument independent of IE but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work (EDIT: actually one can put a webby url in as well!) with but we have a nice tidy Windows API function called URLDownloadToFile to download a file.
This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.
Option Explicit
'* Tools->Refernces Microsoft HTML Object Library
'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Sub Test()
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim sLocalFilename As String
sLocalFilename = Environ$("TMP") & "\urlmon.html"
Dim sURL As String
sURL = "https://stackoverflow.com/users/3607273/s-meaden"
Dim bOk As Boolean
bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
If bOk Then
If fso.FileExists(sLocalFilename) Then
'* Tools->Refernces Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
'* need to wait a little whilst the document parses
'* because it is multithreaded
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim sTest As String
sTest = Left$(oHtml.body.outerHTML, 100)
Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete
'* page specific logic goes here
Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
Dim lAnswerLoop As Long
For lAnswerLoop = 0 To htmlAnswers.Length - 1
Dim vAnswerLoop
Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
Debug.Print vAnswerLoop.outerText
Next
End If
End If
End Sub
Thanks for asking this.
P.S. I have used TaskList to verify that IExplore.exe is not created under the hoods when this code runs.
P.P.S If you liked this then see more at my Excel Development Platform blog
Using VBA, I am attempting to find the href attribute for weekly share price data of a given stock in the HTML of Yahoo finance. I need this to automate the download of the csv file containing the data. I have done my research, and have written the following code which I believe should find the URL. However, the code continues to end with an error. Is there a small error in my code or is there a bigger problem at hand.
I have searched for the parent of the hyperlink by classname to narrow the search which is successful. But when I search within this parent the code returns the error "The remote server machine does not exist or is unavailable".
I am new HTML scraping so any help would be much appreciated. Thanks in advance for any help.
Sub webpage()
Dim internet As Object
Dim internetdata As Object
Dim Find_Parent As Object
Dim Stock_Links As Object
Dim Link As Object
Dim URL As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = True
URL = "https://finance.yahoo.com/quote/GTE.AX/history?period1=1341266400&period2=1499032800&interval=1wk&filter=history&frequency=1wk"
internet.Navigate URL
Do While internet.ReadyState <> READYSTATE_COMPLETE
Loop
Set internetdata = internet.Document
Set Find_Parent = internetdata.getElementsByClassName("Fl(end) Pos(r) T(-6px)")
Set Stock_Links = Find_Parent.getElementsByTagName("a")
NextRow = 1
For Each Link In Stock_Links
Sheet1.Range("A" & NextRow) = Link.getAttribute("href")
NextRow = NextRow + 1
Next
MsgBox "Link Found"
End Sub
The first issue is that you're using the InternetExplorer constant READYSTATE_COMPLETE, while using late binding. Therefore, it sees it as an empty variable instead of its value, which is 4. This means that the Do While/Loop will always evaluate to True. So you'll need to replace the constant with its value...
Do While internet.ReadyState <> 4
Loop
The other issue is with getElementsByClass. It returns a collection of elements, where the indexing starts at 0. So you'll need to refer to one of the elements from that collection. In this example, it looks like only the one element that you're looking for is returned. So you'll need to refer to the first element, which has an index of 0...
Set Stock_Links = Find_Parent(0).getElementsByTagName("a")
I want to extract a U.S. Patent title from a url like
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874
(Update: as pointed out the comments, the patent title is not labeled "Title;" however, it consistently appears by itself above "Abstract" on the web page.) In most cases it is in the 7th child element of "body" or the 3rd "font" element in the document, but occasionally a notice at the top of the page to "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )" messes up both methods of extraction by inserting one additional child of "body" and three additional "font" elements before you get to the title.
However, the title seems to be consistently the first "font" element with the attribute "size" having a value of "+1". Unfortunately other elements have size="-1", including the aforementioned elements that are not always present, so it has to be specifically with that attribute and value. I have searched but can't figure out how to get elements by attribute and value. Here is my code:
Function Test_UpdateTitle(url As String)
Dim title As String
Dim pageSource As String
Dim xml_obj As XMLHTTP60
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Dim html_doc As HTMLDocument
Set html_doc = CreateObject("HTMLFile")
html_doc.body.innerHTML = pageSource
Dim fontElement As IHTMLElement
'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate
'Method 1
' Dim body As IHTMLElement
' Set body = html_doc.getElementsByTagName("body").Item(0)
' Set fontElement = body.Children(6)
'Method 2
' Set fontElement = html_doc.getElementsByTagName("font").Item(3)
'Method 3
Dim n As Integer
For n = 3 To html_doc.getElementsByTagName("font").Length - 1
Set fontElement = html_doc.getElementsByTagName("font").Item(n)
If InStr(fontElement.innerText, "Please see") = 0 And _
InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
InStr(fontElement.innerText, " **") = 0 Then
Test_UpdateTitle = fontElement.innerText
Exit Function
End If
Next n
End Function
I should add that the " **" is not working to skip the the last element <b> **</b> and I am getting " **" as the title where there is a notice to please see images. Is asterisk a wildcard character in this context?
You can try this. As long as its the first font tag with the size attribute and a value of "+1" this should work. I only tested with 3 different pages but they all returned the correct results.
Function Test_UpdateTitle(url)
title = "Title Not Found!"
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Set document = CreateObject("HTMLFile")
document.write pageSource
For i = 0 To document.getElementsByTagName("font").length - 1
If document.getElementsByTagName("font")(i).size = "+1" Then
title = document.getElementsByTagName("font")(i).innerText
Exit For
End If
Next
Test_UpdateTitle = title
End Function
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=fight.TI.&OS=TTL/fight&RS=TTL/fight")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=14&f=G&l=50&d=PTXT&p=1&S1=search&OS=search&RS=search")
This answer is somewhat incomplete because my Excel won't do these lines:
Dim xml_obj As XMLHTTP60
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
But I think this might be a preferred approach.
Instead of using USPTO's site, what about using Google's?
Hit this URL: http://www.google.com/patents/US6293874
Note that the patent number is apparent in that URL.
Then, in your function, just pull that tag named invention-title.
Set titleElement = html_doc.getElementsByTagName("invention-title").Item(0)
title = titleElement.innerText
MsgBox(title)
If you check the source on that page, there's only one of those.
If you're open to this alternative approach, it would be relatively easy to parse patent numbers from the URLs you have, and I think the extraction of invention-title would be much more reliable.
See if this answer is working as intended. Make sure you have references to the following libraries in your workbook:
Microsoft XML, v6.0
Microsoft HTML Object Library
If you are not sure how to add these to Excel just give a read to this link Link to reference adding
Option Explicit
Sub Test()
Debug.Print Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
End Sub
Function Test_UpdateTitle(ByVal strURL As String) As String
Dim oHTTP As MSXML2.XMLHTTP60
Dim oDoc As MSHTML.HTMLDocument
Dim oFontTags As Variant
Dim oFontTag As HTMLFontElement
Dim strInnerText As String
Dim strSize As String
' Create the http object and send it.
Set oHTTP = New MSXML2.XMLHTTP60
oHTTP.Open "GET", strURL, False
oHTTP.send
' Make sure that get the a reponse back
If oHTTP.Status = 200 Then
Set oDoc = New HTMLDocument
oDoc.body.innerHTML = oHTTP.responseText
Set oFontTags = oDoc.getElementsByTagName("font")
' Go through all the tags.
For Each oFontTag In oFontTags
'Get the inner text and size of each tag.
strInnerText = oFontTag.innerText
strSize = oFontTag.getAttributeNode("size").Value
'Compare to make sure you have what's needed
If InStr(strInnertText, "Please see") = 0 And _
InStr(strInnertText, "( Certificate of Correction )") = 0 And _
InStr(strInnertText, "( Reexamination Certificate )") = 0 And _
InStr(strInnertText, " **") = 0 Then
If strSize = "+1" Then
Test_UpdateTitle = strInnerText
Exit Function
End If
End If
Next oFontTag
End If
End Function
I hope this helps. :)
In short:
Debug.Print html_doc.querySelector("font[size=+1]").innerText
tl;dr;
① CSS selector:
No need for any of the long winded methods. You have stated the styling pattern. Use a CSS selector to grab it.
font[size=+1]
The reads as font tag with attribute size who value is +1. You may need font[size='+1'] when using VBA to try both.
② CSS query:
③ VBA:
As it is the first match you want you can use the querySelector method of document to apply the selector and retrieve a single element.
html_doc.querySelector("font[size=+1]")
You may need to add a reference to HTML Object Library and use an early bound call of Dim html_doc As HTMLDocument to access the method. The late bound method may expose the querySelector method but if the interface doesn't then use early binding.
We are in the process of implementing code to read/create 2D bar codes that are starting to show up on our supplier's parts.
We have a need to create the 2D bar codes in MS Access reports and forms. Has anyone had success with the font (IDAutomation) or Active X (dlSoft) solutions out there.
For C#, the open source library "http://barcoderender.codeplex.com/" was suggested. Any thoughts on how successful this was or if anyone has other open-source and/or pay for options.
Thanks,
Anton
Depending on the volume of codes you need to generate, you could use the Google Charts API to generate QR Codes.
Simply add a "Microsoft Web Browser" ActiveX component and the following code to your Form:
Dim Size As Integer
Dim Text As String
Dim URL As String
Size = 200
Text = "This is my test"
' Better to actually use a URL encoding function like those described here:
' http://stackoverflow.com/questions/218181/how-can-i-url-encode-a-string-in-excel-vba
Text = Replace(Text, " ", "%20")
URL = "http://chart.apis.google.com/chart?chs=" & Size & "x" & Size & "&cht=qr&chld=H|0&chl=" & Text
WebBrowser.Navigate (URL)
You can of course change the Size and the Text depending on your need. The Text can also be a value directly from your Form, therefore your data.
I would advise you to check Googles Terms and Services before using it.
I completely escaped the web browser control and the Google API since that functionality is now deprecated from what I can tell. I went with a different free API but the Google API or any other API could be used instead.
In my example I am creating an .png image in the same directory as the application. I have a text box on my form named txtToCode in which I type in any text I want to code. I also have an image control so that the image can be viewed from the form, but you can modify it how you wish:
Private Sub btnCode2_Click()
Call GetQRCode(Me.txtToCode, 150, 150)
End Sub
Sub GetQRCode(Content As String, Width As Integer, Height As Integer)
Dim ByteData() As Byte
Dim XmlHttp As Object
Dim HttpReq As String
Dim ReturnContent As String
Dim EncContent As String
Dim QRImage As String
EncContent = EncodeURL(Content)
HttpReq = "https://api.qrserver.com/v1/create-qr-code/?data=" & EncContent & "&size=" & Width & "x" & Height & ""
Set XmlHttp = CreateObject("MSXML2.XmlHttp")
XmlHttp.Open "GET", HttpReq, False
XmlHttp.Send
ByteData = XmlHttp.responseBody
Set XmlHttp = Nothing
ReturnContent = StrConv(ByteData, vbUnicode)
Call ExportImage(ReturnContent)
End Sub
Private Sub ExportImage(image As String)
Dim FilePath As String
On Error GoTo NoSave
' Build Export Path
FilePath = Application.CurrentProject.Path & "\qr.png"
Open FilePath For Binary As #1
Put #1, 1, image
Close #1
Me.Image3.Picture = FilePath
' Save File Path
Exit Sub
NoSave:
MsgBox "Could not save the QR Code Image! Reason: " & Err.Description, vbCritical, "File Save Error"
End Sub
Private Function EncodeURL(str As String)
Dim ScriptEngine As Object
Dim encoded As String
Dim Temp As String
Temp = Replace(str, " ", "%20")
Temp = Replace(Temp, "#", "%23")
EncodeURL = Temp
End Function
How do I do a search and replace of text within a module in Access from another module in access? I could not find this on Google.
FYI, I figured out how to delete a module programatically:
Call DoCmd.DeleteObject(acModule, modBase64)
I assume you mean how to do this programatically (otherwise it's just ctrl-h). Unless this is being done in the context of a VBE Add-In, it is rarely (if ever) a good idea. Self modifying code is often flagged by AV software an although access will let you do it, it's not really robust enough to handle it, and can lead to corruption problems etc. In addition, if you go with self modifying code you are preventing yourself from ever being able to use an MDE or even a project password. In other words, you will never be able to protect your code. It might be better if you let us know what problem you are trying to solve with self modifying code and see if a more reliable solution could be found.
After a lot of searching I found this code:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Function to Search for a String in a Code Module. It will return True if it is found and
'False if it is not. It has an optional parameter (NewString) that will allow you to
'replace the found text with the NewString. If NewString is not included in the call
'to the function, the function will only find the string not replace it.
'
'Created by Joe Kendall 02/07/2003
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Public Function SearchOrReplace(ByVal ModuleName As String, ByVal StringToFind As String, _
Optional ByVal NewString, Optional ByVal FindWholeWord = False, _
Optional ByVal MatchCase = False, Optional ByVal PatternSearch = False) As Boolean
Dim mdl As Module
Dim lSLine As Long
Dim lELine As Long
Dim lSCol As Long
Dim lECol As Long
Dim sLine As String
Dim lLineLen As Long
Dim lBefore As Long
Dim lAfter As Long
Dim sLeft As String
Dim sRight As String
Dim sNewLine As String
Set mdl = Modules(ModuleName)
If mdl.Find(StringToFind, lSLine, lSCol, lELine, lECol, FindWholeWord, _
MatchCase, PatternSearch) = True Then
If IsMissing(NewString) = False Then
' Store text of line containing string.
sLine = mdl.Lines(lSLine, Abs(lELine - lSLine) + 1)
' Determine length of line.
lLineLen = Len(sLine)
' Determine number of characters preceding search text.
lBefore = lSCol - 1
' Determine number of characters following search text.
lAfter = lLineLen - CInt(lECol - 1)
' Store characters to left of search text.
sLeft = Left$(sLine, lBefore)
' Store characters to right of search text.
sRight = Right$(sLine, lAfter)
' Construct string with replacement text.
sNewLine = sLeft & NewString & sRight
' Replace original line.
mdl.ReplaceLine lSLine, sNewLine
End If
SearchOrReplace = True
Else
SearchOrReplace = False
End If
Set mdl = Nothing
End Function
Check out the VBA object browser for the Access library. Under the Module object you can search the Module text as well as make replacements. Here is an simple example:
In Module1
Sub MyFirstSub()
MsgBox "This is a test"
End Sub
In Module2
Sub ChangeTextSub()
Dim i As Integer
With Application.Modules("Module1")
For i = 1 To .CountOfLines
If InStr(.Lines(i, 1), "This is a Test") > 0 Then
.ReplaceLine i, "Msgbox ""It worked!"""
End If
Next i
End With
End Sub
After running ChangeTextSub, MyFirstSub should read
Sub MyFirstSub()
MsgBox "It worked!"
End Sub
It's a pretty simple search but hopefully that can get you going.
additional for the function (looping through all the lines)
Public Function ReplaceWithLine(modulename As String, StringToFind As String, NewString As String)
Dim mdl As Module
Set mdl = Modules(modulename)
For x = 0 To mdl.CountOfLines
Call SearchOrReplace(modulename, StringToFind, NewString)
Next x
Set mdl = Nothing
End Function
Enjoy ^^