Web scraping using Excel VBA - HTML

I am looking at the HTML snippet below:
<h1 class="wer wer">
<a href="http://somelink.com" rel="bookmark" title="Permanent Link to Title of this page that covers some random topic">
Short title of this page...</a>
</h1>
I am currently using the below code to pull out innertext ("Short title of this page...")
For Each ele In .document.all
    Select Case ele.classname
        Case "wer wer"
            RowCount = RowCount + 1
            sht.Range("A" & RowCount) = ele.innertext
    End Select
Next ele
How can I modify this code to pull out title ("Permanent Link to Title of this page that covers some random topic") and href ("http://somelink.com")?
Any help would be much appreciated. Thanks.

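Select the element with a CSS selector. The attribute value has to be quoted because it contains characters such as : and /.
.document.querySelector("a[href='http://somelink.com']").innerText
a[href='http://somelink.com'] is a CSS attribute selector; querySelector returns the first a element whose href attribute equals 'http://somelink.com'.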
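To pull the title and href rather than the inner text, one option is to keep a loop like the one in the question and read the attributes of the a element nested in each matching h1. This is only a minimal sketch: it assumes a late-bound htmlfile document loaded from a hard-coded copy of the snippet in place of the live IE page, and it prints to the Immediate window instead of writing to sht. The procedure name and variable names are made up for illustration.

```vba
Option Explicit

Public Sub PrintTitleAndHref()
    Dim doc As Object       ' assumption: stands in for ie.document
    Dim ele As Object
    Dim anchors As Object
    Dim anchor As Object

    ' Build an in-memory document from a copy of the question's HTML
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = _
        "<h1 class=""wer wer"">" & _
        "<a href=""http://somelink.com"" rel=""bookmark"" " & _
        "title=""Permanent Link to Title of this page that covers some random topic"">" & _
        "Short title of this page...</a></h1>"

    For Each ele In doc.getElementsByTagName("h1")
        If ele.className = "wer wer" Then
            ' The attributes live on the nested <a>, not on the <h1>
            Set anchors = ele.getElementsByTagName("a")
            If anchors.Length > 0 Then
                Set anchor = anchors.Item(0)
                Debug.Print anchor.getAttribute("title")
                Debug.Print anchor.getAttribute("href")
            End If
        End If
    Next ele
End Sub
```

In the original loop the two Debug.Print lines would become assignments to further columns, for example column B for the title and column C for the href.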

Related

How to get the text of img alt inside <a> tag

I have a url with the following html part
<div class="shop cf">
<a class="shop-logo js-shop-logo" href="/m/3870/GMobile">
<noscript>
<img alt="GMobile" class="js-lazy" data-src="//a.scdn.gr/ds/shops/logos/3870/mid_20160920155600_71ff515d.jpeg" src="//a.scdn.gr/ds/shops/logos/3870/mid_20160920155600_71ff515d.jpeg" />
</noscript>
<img alt="GMobile" class="js-lazy" data-src="//a.scdn.gr/ds/shops/logos/3870/mid_20160920155600_71ff515d.jpeg" src="//c.scdn.gr/assets/transparent-325472601571f31e1bf00674c368d335.gif" />
</a>
</div>
I want to get the alt of the first img inside the div with class "shop cf", so I do:
Set seller = Doc.querySelectorAll("img")
wks.Cells(i, "D").Value = seller.getAttribute("alt").Content(0)
I get nothing. What did I forget to include?
Can I get it from the <noscript> tag?
I also tried the following:
Set seller = Doc.getElementsByClassName("js-lazy")
wks.Cells(i, "D").Value = seller.getAttribute("alt")
Use an element selector combined with an attribute selector.
CSS:
img[alt]
VBA:
ie.document.querySelector("img[alt]")
You may need to read the attribute explicitly:
ie.document.querySelector("img[alt]").getAttribute("alt")
To include the class use
ie.document.querySelector("img.js-lazy[alt]")
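If there is more than one matching element, use querySelectorAll and index into the returned NodeList. Note that VBA string literals need double quotes; a single quote starts a comment. For example:
Set list = ie.document.querySelectorAll("img.js-lazy[alt]")
Debug.Print list.item(0).getAttribute("alt")
Debug.Print list.item(1).getAttribute("alt")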
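For the markup in the question, the same idea can be sketched without querySelector by walking the div elements and taking the first img inside the one with class "shop cf". This is a hedged sketch rather than a drop-in fix: it assumes a late-bound htmlfile document built from a trimmed copy of the HTML (the noscript block and the src attributes are left out) in place of Doc, and the procedure and variable names are illustrative.

```vba
Option Explicit

Public Sub PrintFirstShopImageAlt()
    Dim doc As Object       ' assumption: stands in for the page document
    Dim div As Object
    Dim imgs As Object

    ' Trimmed copy of the question's markup
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = _
        "<div class=""shop cf"">" & _
        "<a class=""shop-logo js-shop-logo"" href=""/m/3870/GMobile"">" & _
        "<img alt=""GMobile"" class=""js-lazy"" />" & _
        "</a></div>"

    For Each div In doc.getElementsByTagName("div")
        If div.className = "shop cf" Then
            Set imgs = div.getElementsByTagName("img")
            If imgs.Length > 0 Then
                ' getAttribute belongs to a single element, so pick
                ' one item out of the collection before calling it
                Debug.Print imgs.Item(0).getAttribute("alt")
            End If
            Exit For
        End If
    Next div
End Sub
```

The point of the sketch is the order of operations: the question's code calls getAttribute on the whole collection, whereas it has to be called on one element taken from it.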
Have you tried it this way? Note that this is JavaScript, not VBA:
let lazy1 = document.querySelectorAll(".js-lazy")[0]
let lazyalt = lazy1.getAttribute("alt");
let shop = document.querySelector('.shop');
shop.classList.add(lazyalt);
console.log(lazyalt)

How to extract something between <!-- --> using VBA?

I'm trying to scrape a page using VBA. I know how to get elements by id, class, and tag name, but now I have come across this:
<!-- <b>IE CODE : 3407004044</b> -->
After searching the internet, I know that this is an HTML comment, but I can't find out what its tag name is, if it qualifies as a tag at all. Should I use
document.getElementsByTagName("!")?
If not, how else can I extract these comments?
EDIT:
I have a bunch of these td elements within tr elements and I want to extract IE Code : 3407004044
Below is a larger set of HTML code:
<tr align="left">
<td width="50%" class="subhead1">
' this is the part that I want to extract
<!-- <b>IE CODE : 3108011111</b> -->
</td>
<td rowspan="9" valign="top">
<span id="datalist1_ctl00_lbl_p"></span>
</td>
</tr>
Thanks!
Try something like this. It works as a starting point, though you will need to adapt it to your page:
Option Explicit
Public Sub TestMe()
    Dim myString As String
    Dim cnt As Long
    Dim myArr As Variant
    myString = "<!-- <b>IE CODE : Koj sega e</b> -->blas<hr>My Website " & _
        "is here<B><B><B><!-- <b>IE CODE : nomer </b> -->" & _
        "is here<B><B><B><!-- <b>IE CODE : 1? </b> -->"
    myString = Replace(myString, "-->", "<!--")
    myArr = Split(myString, "<!--")
    For cnt = LBound(myArr) To UBound(myArr)
        If cnt Mod 2 = 1 Then Debug.Print myArr(cnt)
    Next cnt
End Sub
This is what you get:
<b>IE CODE : Koj sega e</b>
<b>IE CODE : nomer </b>
<b>IE CODE : 1? </b>
The idea is the following:
Replace the --> with <!--
Split the input by <!--
Take every second value from the array
There are some possible scenarios, where it will not work, e.g. if you have --> or <!-- written somewhere within the text, but in the general case it should be ok.
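If only the number is needed, a regular expression is an alternative to the Replace/Split approach above. The sketch below swaps in a different technique, so treat it as an illustration only: it assumes the comment always has the form shown in the question, it uses the late-bound VBScript.RegExp object, and the input string is a hard-coded copy of the td from the question rather than the live page source. The procedure name is made up.

```vba
Option Explicit

Public Sub PrintIeCodes()
    Dim re As Object
    Dim matches As Object
    Dim m As Object
    Dim html As String

    ' Hard-coded stand-in for the page source (e.g. ie.document.body.innerHTML)
    html = "<td width=""50%"" class=""subhead1"">" & _
           "<!-- <b>IE CODE : 3108011111</b> -->" & _
           "</td>"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find every comment, not just the first
    re.IgnoreCase = True
    ' Capture the digits between "IE CODE :" and "</b>" inside a comment
    re.Pattern = "<!--\s*<b>\s*IE CODE\s*:\s*(\d+)\s*</b>\s*-->"

    Set matches = re.Execute(html)
    For Each m In matches
        Debug.Print m.SubMatches(0)   ' prints 3108011111 for the sample above
    Next m
End Sub
```

Because the pattern anchors on both <!-- and -->, it ignores text outside comments, but like the Split version it will miss comments whose layout differs from the sample.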
You can use XPath:
substring-before(substring-after(//tr//comment(), "<b>"), "</b>")
to get the required data.

VBA does not click <a> within <li>

I am navigating to a webpage with an unordered list. I need to click an anchor (a) tag within a specific li tag.
The relevant part of the source is:
<UL class="x-tab-strip x-tab-strip-top" id=ext-gen151>
<LI id=infoPageinfoPanelID__infoPage_myTab_pubst_pubstructStructureGWT _nodup="30817">
<A class=x-tab-strip-close></A>
<A class=x-tab-right href="#">
<EM class=x-tab-left>
<SPAN class=x-tab-strip-inner>
<SPAN class="x-tab-strip-text ">Structure</SPAN>
</SPAN>
</EM>
</A>
</LI>
</UL>
The anchor tag does not have a name or ID, only a class name ("x-tab-right").
I tried the following VBA code to simulate a click on that tag:
Dim targetSpan As HTMLObjectElement
Set targetSpan = doc.getElementById("infoPageinfoPanelID__infoPage_myTab_pubst_pubstructStructureGWT").getElementsByTagName("a")(1)
targetSpan.click
I also tried this second snippet:
Dim AllSpanElements As IHTMLElementCollection
Dim spanCounter As Long
Set AllSpanElements = doc.getElementsByTagName("li")
For spanCounter = 0 To AllSpanElements.Length - 1
    With AllSpanElements(spanCounter)
        If (.innerText) = "Structure" Then
            .ParentElement.ParentElement.ParentElement.Click
            Exit For
        End If
    End With
Next
I got the second snippet from Stack Overflow.
Neither snippet does anything. What am I doing wrong?
Thanks in advance.

Image Src url extraction using ms access vba ie navigation

I am using MS Access VBA to drive IE navigation. I want to extract the image link (the src value) from the HTML below, but I have not been able to.
HTML code is given below:
<div class="product-image-vp-sub">
<div class="js-media-zoom-icons hide-content">
<div class="zoom product-zoom product-zoom-in js-zoom-in wmicon wmicon-zoom"></div>
</div>
<img itemprop="image" src="http://ll-us-i5.wal.co/dfw/dce07b8c-3eca/k2-_f47f1f48-69bb-4277-9009-6f7d3a63697a.v2.jpg" class="product-image js-product-image js-product-primary-image" data-asset-id="2A5BB9FEFACA4EA09290025ED003ACAE" data-zoom-image="" alt="...And The Earth Did Not Swallow Him">
</div>
VBA code is given below.
Set html = ie.Document
my_data1 = html.getElementsByClassName("product-image-vp-sub")
For Each Item In my_data1
href1 = Item.getElementsByTagName("img")(0)
href2 = href1.src
Next
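One thing that stands out in the code above is that the object references are assigned without Set, which VBA requires for objects. The following is a hedged sketch of how the extraction might look with Set in place. It is not a confirmed fix for the live page: it assumes a late-bound htmlfile document built from a shortened copy of the HTML instead of ie.Document, it matches the div by comparing className rather than calling getElementsByClassName, and the procedure and variable names are illustrative.

```vba
Option Explicit

Public Sub PrintProductImageSrc()
    Dim doc As Object       ' assumption: stands in for ie.Document
    Dim div As Object
    Dim imgs As Object
    Dim img As Object

    ' Shortened copy of the question's markup
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = _
        "<div class=""product-image-vp-sub"">" & _
        "<img itemprop=""image"" " & _
        "src=""http://ll-us-i5.wal.co/dfw/dce07b8c-3eca/k2-_f47f1f48-69bb-4277-9009-6f7d3a63697a.v2.jpg"" " & _
        "class=""product-image js-product-image js-product-primary-image"">" & _
        "</div>"

    For Each div In doc.getElementsByTagName("div")
        If div.className = "product-image-vp-sub" Then
            ' Set is needed because the collection and the element are objects
            Set imgs = div.getElementsByTagName("img")
            If imgs.Length > 0 Then
                Set img = imgs.Item(0)
                Debug.Print img.getAttribute("src")
            End If
        End If
    Next div
End Sub
```

In the original code the equivalent change would be Set my_data1 = ... and Set href1 = ..., after which href1.src can be read as a plain string.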

xpath find specific link in page

I'm trying to get the "email to a friend" link from this page using XPath.
http://www.guardian.co.uk/education/2009/oct/14/30000-miss-university-place
The link itself is wrapped up in tags like this
<li><a class="rollover sendlink" href="http://www.guardian.co.uk/email/354237257" title="Opens an email form" name="&lid={pageToolbox}{Email a friend}&lpos={pageToolbox}{2}"><img src="http://static.guim.co.uk/static/80163/common/images/icon_email-friend.gif" alt="" class="trail-icon" /><span>Send to a friend</span></a></li>
I'm using this for my query, but it's not quite right.
$links = $xpath->query("//a/span[text()='Send to a friend']/@href");
You're trying to get the href of the span there. I think you want:
$links = $xpath->query("//a[span/text()='Send to a friend']/@href");
You need to use something like this (since href is an attribute of a):
$links = $xpath->query("//a[span/text()='Send to a friend']/@href");
The href is an attribute of the anchor, hence you need:
$links = $xpath->query("//a[span[text()='Send to a friend']]/@href");
Try this:
$links = $xpath->query("//a[span='Send to a friend']/@href");
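To see why the predicate has to sit on the a element, here is a small sketch that runs both paths against a cut-down copy of the markup. It is written in VBA with late-bound MSXML 6.0 rather than the PHP used in the question, purely to keep the thread's examples in one language; the markup is reduced to a well-formed fragment (the name attribute and the img are dropped), and the procedure and variable names are made up.

```vba
Option Explicit

Public Sub CompareXPathQueries()
    Dim xml As Object
    Dim node As Object

    Set xml = CreateObject("MSXML2.DOMDocument.6.0")
    xml.async = False
    ' Cut-down, well-formed copy of the link from the question
    xml.LoadXML "<ul><li>" & _
        "<a class=""rollover sendlink"" href=""http://www.guardian.co.uk/email/354237257"">" & _
        "<span>Send to a friend</span></a>" & _
        "</li></ul>"

    ' Path from the question: looks for an href attribute on the span
    Set node = xml.SelectSingleNode("//a/span[text()='Send to a friend']/@href")
    Debug.Print "span/@href found: " & Not (node Is Nothing)

    ' Path from the answers: filters the a by its span, then takes a's href
    Set node = xml.SelectSingleNode("//a[span='Send to a friend']/@href")
    If Not node Is Nothing Then Debug.Print node.Text
End Sub
```

The first query finds nothing because the span has no href; the second selects the attribute on the anchor that contains the matching span.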