I am learning to use XPath queries. I am having an issue where I am unable to extract a value that changes whenever the page is refreshed.
For example, I am trying to extract the value '62804' from the following HTML code: "canvas.strokeText('Answer: 62804',90,112);". Any ideas how this can be done? Thanks
<html>
<div id="content" class="large-12 columns">
<div class="example">
<h3>Challenging DOM</h3>
<p>The hardest part in automated web testing is finding the best locators (e.g., ones that well named, unique, and unlikely to change). It's more often than not that the application you're testing was not built with this concept in mind. This example demonstrates that with unique IDs, a table with no helpful locators, and a canvas element.</p>
<hr>
<div class="row">
<div class="large-12 columns large-centered">
<div class="large-2 columns">
<a id="debcda40-b692-0137-457b-2213fbd48497" href="" class="button">qux</a><br>
<a id="debce410-b692-0137-457c-2213fbd48497" href="" class="button alert">baz</a><br>
<a id="debd03d0-b692-0137-457d-2213fbd48497" href="" class="button success">foo</a><br>
</div>
<div class="large-10 columns">
<table>
<thead>
<tr>
<th>Lorem</th>
<th>Ipsum</th>
<th>Dolor</th>
<th>Sit</th>
<th>Amet</th>
<th>Diceret</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iuvaret0</td>
<td>Apeirian0</td>
<td>Adipisci0</td>
<td>Definiebas0</td>
<td>Consequuntur0</td>
<td>Phaedrum0</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret1</td>
<td>Apeirian1</td>
<td>Adipisci1</td>
<td>Definiebas1</td>
<td>Consequuntur1</td>
<td>Phaedrum1</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret2</td>
<td>Apeirian2</td>
<td>Adipisci2</td>
<td>Definiebas2</td>
<td>Consequuntur2</td>
<td>Phaedrum2</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret3</td>
<td>Apeirian3</td>
<td>Adipisci3</td>
<td>Definiebas3</td>
<td>Consequuntur3</td>
<td>Phaedrum3</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret4</td>
<td>Apeirian4</td>
<td>Adipisci4</td>
<td>Definiebas4</td>
<td>Consequuntur4</td>
<td>Phaedrum4</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret5</td>
<td>Apeirian5</td>
<td>Adipisci5</td>
<td>Definiebas5</td>
<td>Consequuntur5</td>
<td>Phaedrum5</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret6</td>
<td>Apeirian6</td>
<td>Adipisci6</td>
<td>Definiebas6</td>
<td>Consequuntur6</td>
<td>Phaedrum6</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret7</td>
<td>Apeirian7</td>
<td>Adipisci7</td>
<td>Definiebas7</td>
<td>Consequuntur7</td>
<td>Phaedrum7</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret8</td>
<td>Apeirian8</td>
<td>Adipisci8</td>
<td>Definiebas8</td>
<td>Consequuntur8</td>
<td>Phaedrum8</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret9</td>
<td>Apeirian9</td>
<td>Adipisci9</td>
<td>Definiebas9</td>
<td>Consequuntur9</td>
<td>Phaedrum9</td>
<td>
edit
delete
</td>
</tr>
</tbody></table>
<div class="row">
<div class="large-10 columns">
<canvas id="canvas" width="599" height="200" style="border:1px dotted;"></canvas>
</div>
</div>
</div>
</div>
</div>
<hr>
</div>
<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>
</div>
</html>
To run an XPath query with a strict XML tool, the input document must be well-formed XML.
In your case it isn't, because some tags are not properly closed (you can verify this with a tool such as xmllint).
E.g.
<hr> and <br> should be replaced with <hr/> and <br/>.
Once the markup is well-formed, you can use an XPath query.
The first step is to select the script element:
//script
The output is:
Element='<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>'
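Then you have to convert the element node to a string and do some parsing (the string literals are double-quoted so that the embedded single quotes need no escaping, which also works in XPath 1.0):
substring-before(substring-after(//script/text(), "canvas.strokeText('Answer: "), "',90,112)")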
The result is the following:
String='62804'
Note: You can do the same operation in a more flexible way using JavaScript, for example.
XPath is very good at querying XML (the first operation above) but awkward for string parsing (the second operation above).
Hope this helps.
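As an illustration of doing the string parsing outside XPath, here is a minimal Python sketch using only the standard library. It is not part of the original answer: the names `ScriptCollector`, `extract_answer`, and `SAMPLE` are made up for this example, and it assumes the page always draws the value with a call of the form strokeText('Answer: NNNNN', ...). The lenient html.parser module is used so the unclosed <hr> does not matter.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


def extract_answer(page_source):
    """Return the digits drawn by strokeText('Answer: ...'), or None."""
    collector = ScriptCollector()
    collector.feed(page_source)
    collector.close()
    for script in collector.scripts:
        match = re.search(r"strokeText\('Answer: (\d+)'", script)
        if match:
            return match.group(1)
    return None


# Trimmed-down copy of the page from the question (assumed structure).
SAMPLE = """<html><body><hr><canvas id="canvas"></canvas>
<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script></body></html>"""

print(extract_answer(SAMPLE))
```

Because the pattern only captures digits after 'Answer: ', it keeps working when the number changes on refresh.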
I know it might be a duplicate but I am not able to extract a value from this HTML source. Any help would be greatly appreciated.
What I am trying to do is get the pid of a project from the page.
The project names are read from a CSV file, and I need to get the pid for each one.
For example, if the project is "AA project" (the project key "AA" can also be used), the pid that needs to be extracted is 10441.
Since the values are not in a label, I cannot figure out how to extract them.
Update: just using pid=(\d....) returns all the pids without any reference to the project name or key.
<table id="project-list" class="aui">
<thead>
<tr>
<th></th>
<th>Name</th>
<th>Key</th>
<th class="project-list-type">Project Type</th>
<th>URL</th>
<th>Project Lead</th>
<th>Default Assignee</th>
<th>Operations</th>
</tr>
</thead>
<tbody>
<tr data-project-key="AA">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10441&avatarId=10011&size=small" alt="Project Avatar for 10441" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
</td>
<td data-cell-type="key">AA</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AA_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
<li><a class="edit-project" id="edit-project-10441" href="/secure/project/EditProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Edit</a></li>
<li><a id="change_project_type_10441" class="change-project-type-link" data-project-id="10441" href="#">Change project type</a></li>
<li><a id="delete_project_10441" href="/secure/project/DeleteProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Delete</a></li>
</ul>
</td>
</tr>
<tr data-project-key="AAL">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10442&avatarId=10011&size=small" alt="Project Avatar for 10442" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
</td>
<td data-cell-type="key">AAL</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AAL_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
I wouldn't recommend using regular expressions to parse HTML data as it will be a headache to develop and maintain and it will be very sensitive to markup changes hence very fragile, see https://stackoverflow.com/a/1732454/2897748 for details.
Go for XPath Extractor instead, the relevant configuration would be:
Reference Name: anything meaningful, e.g. id
XPath Query: substring-after(//tr[@data-project-key='AA']/td[@data-cell-type='name']/a/@id,'view-project-')
Check Use Tidy if your response is not XHTML-compliant
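To show what that query does outside JMeter, here is a small Python sketch using the standard-library ElementTree. It is an illustration only: `project_pid` and `SAMPLE` are hypothetical names, the sample is a trimmed, well-formed copy of the table, and because ElementTree's XPath subset has no substring-after(), the prefix is stripped in Python instead.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the project table (assumed structure).
SAMPLE = """<table id="project-list">
  <tbody>
    <tr data-project-key="AA">
      <td data-cell-type="name">
        <a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
      </td>
      <td data-cell-type="key">AA</td>
    </tr>
    <tr data-project-key="AAL">
      <td data-cell-type="name">
        <a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
      </td>
      <td data-cell-type="key">AAL</td>
    </tr>
  </tbody>
</table>"""

PREFIX = "view-project-"


def project_pid(table_xml, project_key):
    """Return the pid for the row whose data-project-key equals project_key."""
    root = ET.fromstring(table_xml)
    # Same path as the XPath query above, minus the substring-after() call.
    # Assumes project_key contains no quote characters.
    link = root.find(
        f".//tr[@data-project-key='{project_key}']/td[@data-cell-type='name']/a"
    )
    if link is None:
        return None
    link_id = link.get("id", "")
    # Emulate substring-after(@id, 'view-project-').
    return link_id[len(PREFIX):] if link_id.startswith(PREFIX) else None


print(project_pid(SAMPLE, "AA"))
print(project_pid(SAMPLE, "AAL"))
```

Matching on the exact data-project-key value is what keeps "AA" from also matching the "AAL" row.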
Here is the html code:
<table>
<tr class="WhiteRow">
<td align="center">
<input id="SelectedDelivery1" type="checkbox" onclick="HandleClick(this.name,this.checked,"")" value="Y" name="SelectedDelivery1">
</td>
<td valign="top">
<span></span>
<span class="bold">Instrument Search</span>
<br>
abc (TRANSFER)
</td>
<td align="center">5 minutes</td>
<td class="noborder" align="right">
<td class="noborder" align="right">
<td class="noborder" align="right">
<td class="noborder" align="right">
</tr>
<tr>
<td align="center">
<input id="SelectedDelivery2" type="checkbox" onclick="HandleClick(this.name,this.checked,"")" value="Y" name="SelectedDelivery1">
</td>
<td valign="top">
<span></span>
<span class="bold">Instrument Search</span>
<br>
abc (CAVEAT)
</td>
...
</tr>
</table>
I would like to target the <tr> containing <span class="bold">Instrument Search</span> and abc (TRANSFER). That tr may not be the first element in the table.
So far I tried
//td/span[text()="Instrument Search"]/ancestor::tr
which only satisfies one of the conditions, and there are several tr elements that match the selector.
Could you please advise me how to target both conditions?
Use the following XPath expression:
//tr[contains(., 'abc (TRANSFER)') and contains(td/span[@class = 'bold'], 'Instrument Search')]
Where possible, prefer expressions that only move forward through the tree, because a reverse axis such as ancestor:: can be costly to evaluate. That is the advantage over the solution you have already found.
If the span[@class = 'bold'] cannot contain anything other than "Instrument Search", you should modify the expression above to:
//tr[contains(., 'abc (TRANSFER)') and td/span[@class = 'bold'] = 'Instrument Search']
The location of "abc (TRANSFER)" is still not very precise; if it is required in a certain place (e.g. always inside a td element), you would have to restrict the expression further.
EDIT: Responding to your comment:
abc (TRANSFER) is inside td tag, it's just a text field
Then test each td in a predicate (contains(td, ...) would only look at the first td of the row in XPath 1.0):
//tr[td[contains(., 'abc (TRANSFER)')] and td/span[@class = 'bold'] = 'Instrument Search']
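The two conditions can also be checked in ordinary code, which is sometimes easier to debug than a long predicate. The sketch below uses Python's standard-library ElementTree and is only an illustration: `matching_rows` and `SAMPLE` are hypothetical names, the sample is a simplified, well-formed version of the table, and the contains() test is done in Python because ElementTree's XPath subset does not provide it.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the table from the question.
SAMPLE = """<table>
  <tr class="WhiteRow">
    <td><input id="SelectedDelivery1" type="checkbox"/></td>
    <td><span></span><span class="bold">Instrument Search</span><br/>abc (TRANSFER)</td>
    <td>5 minutes</td>
  </tr>
  <tr>
    <td><input id="SelectedDelivery2" type="checkbox"/></td>
    <td><span></span><span class="bold">Instrument Search</span><br/>abc (CAVEAT)</td>
  </tr>
</table>"""


def matching_rows(table_xml, label, detail):
    """Return every tr that has a bold span equal to label and a td containing detail."""
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        # Condition 1: td/span[@class = 'bold'] = label
        has_label = any(
            (span.text or "").strip() == label
            for span in tr.findall("td/span[@class='bold']")
        )
        # Condition 2: some td whose full text contains detail
        has_detail = any(
            detail in "".join(td.itertext()) for td in tr.findall("td")
        )
        if has_label and has_detail:
            rows.append(tr)
    return rows


rows = matching_rows(SAMPLE, "Instrument Search", "abc (TRANSFER)")
print([row.get("class") for row in rows])
```

Both conditions are evaluated per row, so the row is found wherever it sits in the table.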
I found an answer myself after working through the syntax.
Please let me know if there is a better way:
//td/span[text()="Instrument Search"]/ancestor::td/text()[contains(., "TRANSFER")]/ancestor::tr
I have a complex HTML structure with a lot of tables and divs, and the structure might change. How can I write an XPath that skips the elements in between?
For example:
<table>
<tr>
<td>
<span>First Name</span>
</td>
<td>
<div>
<table>
<tbody>
<tr>
<td>
<div>
<table>
<tbody>
<tr>
<td>
<img src="1401-2ATd8" alt="" align="middle">
</td>
<td><span><input atabindex="2" id="MainLimitLimit" type="text"></span></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
I have to get the input element relative to the "First Name" span,
e.g.:
By.xpath("//span[contains(text(), 'First Name')]/../../td[2]/div/table/tbody/tr/td/table/tbody/tr/td[2]/input")
But can I skip the HTML in between and access the input element directly? Something like this?
By.xpath("//span[contains(text(), 'First Name')]/../../td[2]//input[contains(@id,'MainLimitLimit')]")
You can try this XPath:
//td[contains(span,'First Name')]/following-sibling::td[1]//input[contains(@id, 'MainLimitLimit')]
Explanation :
select <td><span>First Name</span></td> element :
//td[contains(span,'First Name')]
then get <td> element next to above <td> element :
/following-sibling::td[1]
then get <input> element within <td> element selected in the 2nd step above :
//input[contains(@id, 'MainLimitLimit')]
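The three steps can be mirrored in plain code to see why the intermediate tables do not matter. This is a sketch in Python with the standard-library ElementTree, not Selenium: `input_next_to_label` and `SAMPLE` are hypothetical names, the sample is a well-formed reduction of the markup in the question, and the sibling step is done by indexing because ElementTree has no following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Well-formed reduction of the nested markup from the question.
SAMPLE = """<table>
  <tr>
    <td><span>First Name</span></td>
    <td>
      <div><table><tbody><tr>
        <td><img src="1401-2ATd8" alt=""/></td>
        <td><span><input id="MainLimitLimit" type="text"/></span></td>
      </tr></tbody></table></div>
    </td>
  </tr>
</table>"""


def input_next_to_label(table_xml, label, id_fragment):
    """Find the input inside the td that directly follows the td labelled with label."""
    root = ET.fromstring(table_xml)
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        for index, td in enumerate(cells[:-1]):
            span = td.find("span")
            # Step 1: td[contains(span, label)]
            if span is None or label not in (span.text or ""):
                continue
            # Step 2: following-sibling::td[1]
            next_cell = cells[index + 1]
            # Step 3: //input[contains(@id, id_fragment)] at any depth
            for candidate in next_cell.iter("input"):
                if id_fragment in candidate.get("id", ""):
                    return candidate
    return None


found = input_next_to_label(SAMPLE, "First Name", "MainLimitLimit")
print(found.get("id") if found is not None else None)
```

The descendant search in step 3 is what lets the lookup survive changes to the nesting between the cell and the input.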
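You can use //, which matches descendants at any level. Go up from the span to the row first, because the td elements are not descendants of the span:
By.xpath("//span[contains(text(), 'First Name')]/../..//input[contains(@id,'MainLimitLimit')]")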
You can use the "First Name" span as a predicate. Try the code below:
//td[preceding-sibling::td[span[contains(text(), 'First Name')]]]//input[contains(@id,'MainLimitLimit')]
I need to repeat an HTML snippet several times on a page, but the elements it contains have id attributes, which must be unique within the page. What is the proper way to repeat this snippet without removing the id attributes from the elements?
Could I use an iframe as a container for the snippet and let the duplicate ids exist on the page?
You can use JavaScript to clone the snippet and at the same time change the IDs.
For example, if your snippet is:
<div id="snippet">
<table id="list">
<tr id="0">
<td id="firstName0">John</td>
<td id="lastName0">Smith</td>
</tr>
<tr id="1">
<td id="firstName2">John</td>
<td id="lastName2">Doe</td>
</tr>
</table>
</div>
Using (with jQuery 1.8 or later; older versions spell addBack() as andSelf())
$("#snippet").clone(false).find("*[id]").addBack().each(function() { $(this).attr("id", $(this).attr("id") + "_1"); });
will produce
<div id="snippet_1">
<table id="list_1">
<tr id="0_1">
<td id="firstName0_1">John</td>
<td id="lastName0_1">Smith</td>
</tr>
<tr id="1_1">
<td id="firstName2_1">John</td>
<td id="lastName2_1">Doe</td>
</tr>
</table>
</div>
Which will solve the duplicate ID problem.
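If the page is generated on the server, the same clone-and-rename idea can be applied before the HTML is sent. The following Python sketch is an assumption-laden illustration, not a replacement for the jQuery call: `clone_with_suffix` and `SNIPPET` are hypothetical names, and the snippet is assumed to be well-formed so that the standard-library ElementTree can parse it.

```python
import copy
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the snippet above.
SNIPPET = """<div id="snippet">
  <table id="list">
    <tr id="0">
      <td id="firstName0">John</td>
      <td id="lastName0">Smith</td>
    </tr>
  </table>
</div>"""


def clone_with_suffix(element, suffix):
    """Deep-copy element and append suffix to every id in the copy."""
    clone = copy.deepcopy(element)
    # iter() visits the root of the clone as well as all descendants.
    for node in clone.iter():
        if "id" in node.attrib:
            node.set("id", node.get("id") + suffix)
    return clone


original = ET.fromstring(SNIPPET)
copy_1 = clone_with_suffix(original, "_1")

print([node.get("id") for node in copy_1.iter() if node.get("id")])
print(ET.tostring(copy_1, encoding="unicode"))
```

Calling it again with "_2", "_3", and so on gives as many non-clashing copies as needed, while the original element keeps its ids.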
How could I use Ruby to extract information from a table consisting of these rows? Is it possible to detect the comments using Nokogiri?
<!-- Begin Topic Entry 4134 -->
<tr>
<td align="center" class="row2"><image src='style_images/ip.boardpr/f_norm.gif' border='0' alt='New Posts' /></td>
<td align="center" width="3%" class="row1"> </td>
<td class="row2">
<table class='ipbtable' cellspacing="0">
<tr>
<td valign="middle"><alink href='http://www.xxx.com/index.php?showtopic=4134&view=getnewpost'><image src='style_images/ip.boardpr/newpost.gif' border='0' alt='Goto last unread' title='Goto last unread' hspace=2></a></td>
<td width="100%">
<div style='float:right'></div>
<div> <alink href="http://www.xxx.com/index.php?showtopic=4134&hl=">EXTRACT LINK 1</a> </div>
</td>
</tr>
</table>
<span class="desc">EXTRACT DESCRIPTION</span>
</td>
<td class="row2" width="15%"><span class="forumdesc"><alink href="http://www.xxx.com/index.php?showforum=19" title="Living">EXTRACT LINK 2</a></span></td>
<td align="center" class="row1" width='10%'><alink href='http://www.xxx.com/index.php?showuser=1642'>Mr P</a></td>
<td align="center" class="row2"><alink href="javascript:who_posted(4134);">1</a></td>
<td align="center" class="row1">46</td>
<td class="row1"><span class="desc">Today, 12:04 AM<br /><alink href="http://www.xxx.com/index.php?showtopic=4134&view=getlastpost">Last post by:</a> <b><alink href='http://www.xxx.com/index.php?showuser=1649'>underft</a></b></span></td>
</tr>
<!-- End Topic Entry 4134 -->
Try using XPath instead:
require 'nokogiri'
html_doc = Nokogiri::HTML("<html><body><!-- Begin Topic Entry 4134 --></body></html>")
# comment() selects comment nodes, so this returns every comment in the document
html_doc.xpath('//comment()')
You could implement a Nokogiri SAX parser. This is quicker to do than it might seem at first sight. You get events for elements, attributes, and comments.
Within your parser, you should remember the state, e.g. @currently_interested = true, to know which parts to keep and which to skip.
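To make the state-flag idea concrete, here is a sketch of the same event-driven approach. It is written in Python with the standard-library html.parser as a stand-in, not with Nokogiri's SAX API; `TopicEntryParser`, `currently_interested`, `entries`, and `SAMPLE` are names invented for the illustration, and the sample row is a simplified version of the one in the question.

```python
from html.parser import HTMLParser


class TopicEntryParser(HTMLParser):
    """Collects the text found between 'Begin/End Topic Entry N' comments."""

    def __init__(self):
        super().__init__()
        self.currently_interested = False  # the state flag described above
        self.current_id = None
        self.entries = {}

    def handle_comment(self, data):
        text = data.strip()
        if text.startswith("Begin Topic Entry"):
            # The entry number is the last word of the comment.
            self.current_id = text.split()[-1]
            self.entries[self.current_id] = []
            self.currently_interested = True
        elif text.startswith("End Topic Entry"):
            self.currently_interested = False
            self.current_id = None

    def handle_data(self, data):
        # Keep non-blank text only while inside a topic entry.
        if self.currently_interested and data.strip():
            self.entries[self.current_id].append(data.strip())


# Simplified version of the row from the question.
SAMPLE = """<table>
<!-- Begin Topic Entry 4134 -->
<tr>
  <td><a href="http://www.xxx.com/index.php?showtopic=4134">EXTRACT LINK 1</a></td>
  <td><span class="desc">EXTRACT DESCRIPTION</span></td>
</tr>
<!-- End Topic Entry 4134 -->
<tr><td>ignored</td></tr>
</table>"""

parser = TopicEntryParser()
parser.feed(SAMPLE)
parser.close()
print(parser.entries)
```

The comment handler only flips the flag; all collection happens in the data handler, which is the division of work a SAX-style parser encourages.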