Scraping HTML by Class in VBA

I have the following HTML:
<div class="property-title visible-xs">
<a href="/property/473902/Office-Lot">
<h2><b> 2nd Floor, Block D5, Solaris Dutamas, No. 1, Jalan Dutamas 1, 50480, Kuala Lumpur</b></h2>
</a>
</div>
<p style="color: #0071ee;">Office Lot</p>
<h4><b>RM 880,000</b></h4>
<div>
<table>
<!-- <tr><td>Office Lot</td></tr> -->
<tr>
<td>Property Code</td><td>:</td><td>PB473902</td>
</tr>
<tr>
<td>Auction Date</td><td>:</td><td>2016-02-26</td>
</tr>
<tr>
<td>Built up </td><td>:</td><td>754 sq.ft </td>
</tr>
<tr>
<td>Tenure</td><td>:</td><td>Freehold</td>
</tr>
I used the following code to extract the address text ("2nd Floor, Block D5, ..."):
objIE1.Document.getElementsByClassName("property-title visible-xs").getElementsByTagName ("a")
but it doesn't return the result I need. Please help.
The HTML block shown above appears multiple times on the page.

This will work:
extract1 = objIE1.Document.getElementsByClassName("property-title visible-xs")(0).getElementsByTagName("a")(0).innerText
Cells(1,1).Value = extract1
Methods whose names start with getElementsBy (plural, "Elements"), such as getElementsByClassName or getElementsByTagName, return a collection of elements, so you need to specify which one you want. Here it is the first one, which has index 0 because the collection is zero-based. Methods whose names start with getElementBy (singular, "Element"), such as getElementById, return a single element and therefore need no index.
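The collection-versus-single-element distinction is not specific to VBA. As a rough illustration only, here is the same extraction sketched in Python with the standard library's xml.etree.ElementTree instead of Internet Explorer automation. The second listing and the wrapping body element are hypothetical, added so that the collection has more than one member.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the markup from the question.
# The second <div> is a hypothetical extra listing.
page = (
    '<body>'
    '<div class="property-title visible-xs"><a href="/property/473902/Office-Lot">'
    '<h2><b> 2nd Floor, Block D5, Solaris Dutamas, No. 1, Jalan Dutamas 1, 50480, Kuala Lumpur</b></h2>'
    '</a></div>'
    '<div class="property-title visible-xs"><a href="/property/000000/Example">'
    '<h2><b> Hypothetical second listing</b></h2>'
    '</a></div>'
    '</body>'
)

root = ET.fromstring(page)

# findall() returns a list (a collection), so an index is needed to pick one element.
divs = root.findall(".//div[@class='property-title visible-xs']")
first_link = divs[0].findall("a")[0]
title = "".join(first_link.itertext()).strip()

# Because the block repeats on the page, looping over the collection gets every title.
all_titles = ["".join(d.findall("a")[0].itertext()).strip() for d in divs]

print(title)
print(all_titles)
```

Indexing with 0 picks the first match, just as (0) does in the VBA line above; looping over the collection handles the repeated blocks.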

Related

lxml relative xPath doesn't return result relative to the given HtmlElement

I apply a relative XPath (./) to an HtmlElement and it doesn't return any results. When I try using double dots (../), it returns all results matching from root HTML instead of descendant results of that specific HtmlElement. I am not sure what is wrong here.
The version of lxml is 4.5.2
Example:
<html>
<h3>
<p>
<table>
<tr>
<td>Sample</td>
<td>Sample</td>
</tr>
</table>
</p>
<h3>
<p>
<table>
<tr>
<td>Sample 2</td>
<td>Sample 2</td>
</tr>
</table>
</p>
</html>
Code
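import requests
from lxml import html

r = requests.get('http://website.com')
tree = html.fromstring(r.content)
tables = tree.xpath("(//p/table)")
for table in tables:
    # xpath() returns a list of elements, so read the text of each cell
    result = table.xpath('.//td')
    text = [td.text_content() for td in result]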
The first iteration in the loop should return "Sample" texts and the second iteration should return "Sample 2" texts.
The problem was with the HTML itself. When I inspected the document in a browser, <p> appeared to be the parent of each <table> element. However, the HTML actually returned by the request showed that <p></p> is a sibling element preceding <table>, not its parent.
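The effect described above can be reproduced with a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, and two hand-written, well-formed snippets stand in for the "inspected" and "received" documents; both snippets and the helper name cell_texts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Structure as it appeared in the browser inspector: <table> nested inside <p>.
inspected = "<html><body><p><table><tr><td>Sample</td></tr></table></p></body></html>"
# Structure as described in the answer: an empty <p></p> followed by a sibling <table>.
received = "<html><body><p></p><table><tr><td>Sample</td></tr></table></body></html>"

def cell_texts(markup, table_path):
    """Find tables with table_path, then read each table's own cells via a relative path."""
    root = ET.fromstring(markup)
    return [[td.text for td in table.findall(".//td")]
            for table in root.findall(table_path)]

nested = cell_texts(inspected, ".//p/table")   # the table is found under <p>
missing = cell_texts(received, ".//p/table")   # nothing: <table> is not a child of <p>
sibling = cell_texts(received, ".//table")     # found once the path matches the real tree

print(nested, missing, sibling)
```

The relative path .//td behaves as expected in both cases; only the outer path has to match the structure of the document that was actually received.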

Use XPath in nodeset repeater (XForms)

I have a question about XPath and the nodeset repeater (XForms).
As you can see in the following code snippet, I want a trigger to change an attribute of a specific entry in a list and, additionally, an attribute of the following entry in the nodeset.
The first <xf:action> works fine but the second does not. What I want here is to leave the current processinstance node, go to the following one, and change its state. How can I do that with XPath?
<div>
<xf:model>
<xf:instance xmlns="" id="template">
<project id="">
<name/>
...
<processinstance>
<name/>
<state/>
</processinstance>
</project>
</xf:instance>
</xf:model>
....
<!-- Process repeat table -->
<div>
<table class="table table-hover">
<thead>
<th width="50%">Processname</th>
<th width="50%">State</th>
<th width="50%">Action</th>
</thead>
<tbody id="process-repeat" xf:repeat-nodeset="//project[index('project-repeat')]/processinstance">
<tr>
<td>
<xf:output ref="name"/>
</td>
<td>
<xf:output ref="state"/>
</td>
<td>
<xf:group ref=".[state eq 'in processing']">
<xf:trigger appearance="minimal">
<xf:label>finish process</xf:label>
<xf:action>
<xf:setvalue ref="state">finished</xf:setvalue>
</xf:action>
<!-- THE FOLLOWING DOES NOT WORK AS I WANT! -->
<xf:action>
<xf:setvalue ref="//project[index('project-repeat')]/processinstance[index(process-repeat)+1]">in process</xf:setvalue>
</xf:action>
</xf:trigger>
</xf:group>
</td>
</tr>
</tbody>
</table>
</div>
</div>
Best regards,
Felix
One thing is that you have a typo:
index(process-repeat)
vs.
index('process-repeat')
In addition, the index() function returns the repeat iteration that is currently selected in the UI. It does not return the iteration currently being evaluated in XPath.
The bottom line is that you cannot use index('process-repeat') to identify the current repeat iteration. It is a common misunderstanding of the index() function.
Some implementations have functions to identify the current repeat iteration. I assume you are using betterFORM, and I don't know if it has such a function. With Orbeon Forms you could write:
//project[index('project-repeat')]/processinstance[xxf:repeat-position()]
Or better, if betterFORM supports variables, you could use that to avoid repeating yourself with:
<tbody id="process-repeat" xf:repeat-nodeset="//project[index('project-repeat')]/processinstance">
<xf:var name="current-process" value="."/>
<tr>
<td>
<xf:output ref="name"/>
</td>
<td>
<xf:output ref="state"/>
</td>
<td>
<xf:group ref=".[state eq 'in processing']">
<xf:trigger appearance="minimal">
<xf:label>finish process</xf:label>
<xf:action>
<xf:setvalue ref="state">finished</xf:setvalue>
</xf:action>
<xf:action>
<xf:setvalue ref="$current-process">in process</xf:setvalue>
</xf:action>
</xf:trigger>
</xf:group>
</td>
</tr>
</tbody>
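The difference between the selected index and the iteration that owns the trigger can be modeled outside XForms. The following Python sketch is only an analogy: the process names, the "waiting" state, and the finish helper are hypothetical, and the two position variables stand in for what index('process-repeat') reports versus the row whose trigger was activated.

```python
# Hypothetical data standing in for the processinstance nodes of one project.
processes = [
    {"name": "design", "state": "finished"},
    {"name": "build", "state": "in processing"},
    {"name": "test", "state": "waiting"},
]

def finish(items, position):
    """Finish the item at 1-based position and start the following one, if any."""
    items[position - 1]["state"] = "finished"
    if position < len(items):
        items[position]["state"] = "in process"

selected_index = 1    # stands in for index('process-repeat'): the row selected in the UI
trigger_position = 2  # the row whose trigger was actually activated

# Using the UI selection updates the wrong rows.
wrong = [dict(p) for p in processes]
finish(wrong, selected_index)
wrong_states = [p["state"] for p in wrong]

# Using the position of the iteration that owns the trigger gives the intended result.
finish(processes, trigger_position)
states = [p["state"] for p in processes]

print(wrong_states)
print(states)
```

The two results differ whenever the selected row is not the row containing the trigger, which is why the answer recommends an iteration-aware function or a variable bound inside the repeat.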

VBA scraping with same class name but different innertext

I am scraping a value from a website, but it turns out that the value I need shares its class name with other elements.
HTML code
<tr class="table_bdrow1_style">
<td></td>
<td style="text-align:center" class="table_bdtext_style">1.</td>
<td style="text-align:center" class="table_bdtext_style">
<div id="a">
"0.8948"
</div>
</td>
<td style="text-align:center" class="table_bdtext_style">December 19, 2016</td>
</tr>
I need the value in the second cell (0.8948) and the date in the third cell (December 19, 2016), but the code I am using only gives me the first value (1).
extract1 = IE.Document.getElementsByClassName("table_bdtext_style")(1).innerText
Cells(4, "A").Value = extract1
I am not sure how to extract the second and third values but not the first. Can anyone help? Thanks a lot!
Just assign the respective index in your extract call:
' for second tag
IE.Document.getElementsByClassName("table_bdtext_style")(2).innerText
' for third tag
IE.Document.getElementsByClassName("table_bdtext_style")(3).innerText
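Which index is correct depends on how many elements with that class appear earlier on the page, since the collection covers the whole document. One way to make the indexes predictable is to scope the search to a single row first. The sketch below shows the idea in Python with the standard library's xml.etree.ElementTree, not IE automation; it assumes the quotation marks around 0.8948 in the question are a display artifact and omits them.

```python
import xml.etree.ElementTree as ET

# The row from the question, with the quotes around 0.8948 left out (assumption).
row = ET.fromstring(
    '<tr class="table_bdrow1_style">'
    '<td></td>'
    '<td style="text-align:center" class="table_bdtext_style">1.</td>'
    '<td style="text-align:center" class="table_bdtext_style"><div id="a">0.8948</div></td>'
    '<td style="text-align:center" class="table_bdtext_style">December 19, 2016</td>'
    '</tr>'
)

# Only the cells of this row that carry the shared class; the list is zero-based.
cells = row.findall("td[@class='table_bdtext_style']")
texts = ["".join(cell.itertext()).strip() for cell in cells]

value = texts[1]  # second matching cell
date = texts[2]   # third matching cell

print(texts)
```

Within the row, index 0 is the running number, index 1 the value, and index 2 the date, regardless of what precedes the row on the page.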

Find specific element position in XPath after checking a condition

I have the following html I am working with: (a chunk of it here)
<table class="detailTable">
<tbody>
<tr>
<td class="detailTitle" align="top">
<h3>Credit Limit:</h3>
<h3>Current Balance:</h3>
<h3>Pending Balance:</h3>
<h3>Available Credit:</h3>
</td>
<td align="top">
<p>$677.77</p>
<p>$7.77</p>
<p>$7.77</p>
<p>$677.77</p>
</td>
<td class="detailTitle">
<h3>Last Statement Date:</h3>
<h4>Payment Address</h4>
</td>
<td>
<p> 05/19/2015 </p>
<p class="attribution">
</td>
</tr>
</tbody>
</table>
I need to first check whether "Statement Date" exists, and then find its position. Then I need to get its value, which is in a corresponding <p> tag. I need to do this using XPath. Any suggestions?
So far I tried using //table[@class='detailTable'][1]//td[2]//p[position(td[contains(.,'Statement Date')])] but it doesn't work.
This is one possible way (formatted for readability):
//table[@class='detailTable']
//tr
/td[*[contains(.,'Statement Date')]]
/following-sibling::td[1]
/*[position()
=
count(
parent::td
/preceding-sibling::td[1]
/*[contains(.,'Statement Date')]/preceding-sibling::*
)+1
]
Explanation:
..../td[*[contains(.,'Statement Date')]] : from the beginning up to this part, the XPath finds the td element that has at least one child containing the text "Statement Date".
/following-sibling::td[1] : from the previously matched td, navigate to the nearest following sibling td...
/*[position() = count(parent::td/preceding-sibling::td[1]/*[contains(.,'Statement Date')]/preceding-sibling::*)+1] : ...and return the child element whose position equals the position of the element containing "Statement Date" in the previous td. Note that count(preceding-sibling::*)+1 is used here to get the position index of the element containing "Statement Date".
You can do it this way:
//table[@class='detailTable'][1]//td[@class="detailTitle" and contains(./h3, 'Statement Date')]/following-sibling::td[1]/p[1]/text()
This will find the <td> that contains the Statement Date heading, and get the <td> immediately after it. Then it gets the text content of the first p in that <td>.
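For comparison, the position-matching logic from the first answer can be written out step by step. This is a sketch in plain Python using the standard library's xml.etree.ElementTree, which does not support contains() or following-sibling, so the traversal is done by hand. The markup is shortened and the unclosed <p class="attribution"> is self-closed to keep it well-formed; the helper name value_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the table from the question.
table = ET.fromstring(
    '<table class="detailTable"><tbody><tr>'
    '<td class="detailTitle"><h3>Credit Limit:</h3><h3>Current Balance:</h3></td>'
    '<td><p>$677.77</p><p>$7.77</p></td>'
    '<td class="detailTitle"><h3>Last Statement Date:</h3><h4>Payment Address</h4></td>'
    '<td><p> 05/19/2015 </p><p class="attribution"/></td>'
    '</tr></tbody></table>'
)

def value_for(table, label):
    """Return the text at the same child position in the next cell, or None if absent."""
    for row in table.iter("tr"):
        cells = list(row)
        for i, cell in enumerate(cells[:-1]):      # the last cell has no following sibling
            for pos, child in enumerate(cell):     # pos plays the role of count(preceding-sibling::*)
                if label in "".join(child.itertext()):
                    values = list(cells[i + 1])    # following-sibling::td[1]
                    if pos < len(values):
                        return "".join(values[pos].itertext()).strip()
    return None

statement_date = value_for(table, "Statement Date")
balance = value_for(table, "Current Balance")
absent = value_for(table, "No Such Label")

print(statement_date, balance, absent)
```

A None result covers the "check if it exists" part of the question: the label was not found, so there is no value to read.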

How to embed links (anchor tag) into HTML context from UIBINDER in gwt

I have a HTML widget in my ui.xml which I am using in Uibinder to populate data as given below:
ui.xml ->
<g:HTML ui:field="operationsDetailTableTemplate" visible="false">
<table class="{style.LAYOUT_STYLE}" width="100%" border="1">
<tr>
<td><img src="images/indent-blue.gif"/></td>
<td>
<table class="{style.DEFAULT_STYLE}">
<thead>
<tr>
<th>OperationUuid</th>
....
</tr>
</thead>
<tbody>
<tr>
<td>%s</td>
...
</tr>
</tbody>
</table>
</td>
</tr>
....
</g:HTML>
Uibinder.java--->
String htmlText = operationsDetailTableTemplate.getHTML()
.replaceFirst("%s", toSafeString(operation.getOperationUuid()))
....
HTML html = new HTML(htmlText);
operationsDetail.add(html);
The above is done in a for loop for each operation retrieved from the database.
My question is how I can embed a hyperlink or an anchor tag in one of the cells (e.g. the operation ID) for each operation retrieved. I also wish to have a listener attached to it.
P.S. It does not allow me to have an anchor tag in the HTML in ui.xml.
You'd better use the tools in the way they've been designed to be used: use ui:field="foo" on the <td> and @UiField Element foo + foo.setInnerHTML(toSafeString(...)) instead of extracting the HTML, modifying it and reinjecting it elsewhere. You could also use a <g:Anchor> and attach a @UiHandler to handle ClickEvents.
Your way of using UiBinder makes me think of SafeHtmlTemplates, or the new UiRenderer aka UiBinder for Cells: https://developers.google.com/web-toolkit/doc/latest/DevGuideUiBinder#Rendering_HTML_for_Cells