Generating all data HTML

I want to crawl this link using the XML package. The problem is that the data are not automatically generated. This piece of HTML generates the table:
<table width="1280px" id="maintable">
<tr id="tabletoggles">
<td> </td>
<td id="tablelabel"> </td>
<td id="abovestats" class="abovestats" align="right">
<span class="revscore likelink"></span>
<b>Stats:</b>
<span class="statso stattab">Serve</span> | <span class="statsr stattab likelink">Return</span> | <span class="statsw stattab likelink">Raw</span>
</td></tr>
<tr>
<td id="footer" class="footer"> </td>
<td colspan="2" id="stats" class="stats"><table id="matches"></table></td>
</tr>
<tr>
<td id="belowmenus"> <br/> <br/> <br/> <br/> </td>
<td colspan="2" id="belowmatches"> </td>
</tr>
</table></div>
</div>
When using the function readHTMLTable in XML on this piece of HTML I just get nonsensical values:
readHTMLTable("http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqq",which = 3)
V1 V2
1 Â
2 Â Â Â Â Â Â
How can I retrieve the "full link" containing all data? I can do it manually for each page using Firebug but I'd like to have a solution which can retrieve multiple urls at the same time.

I believe this is due to a missing UTF-8 encoding declaration.
Which language are you using to get this data?
If you are using PHP to fetch the data, I recommend calling
header('Content-Type: text/html; charset=utf-8');
at the very top of your script, before any other output.
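To illustrate the encoding point: one plausible explanation for the stray Â characters, assuming the blank cells hold non-breaking spaces served as UTF-8, is that the bytes were decoded as Latin-1. A minimal Python sketch of that effect (Python is used only for illustration; the question itself uses R):

```python
# A non-breaking space (U+00A0) is encoded as the two bytes C2 A0 in UTF-8.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")

# Decoding those bytes with the wrong charset produces a stray "Â"
# followed by a non-breaking space.
garbled = utf8_bytes.decode("latin-1")

# Undoing the wrong decoding step recovers the original character.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired == nbsp)
```

If this assumption holds, declaring or forcing UTF-8 when reading the page should make the Â disappear, although it would not by itself fill in table rows that are missing from the page source.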

Related

Unable to extract value using xpath query

I am learning to use XPath queries. I am having an issue where I am unable to extract a value that changes whenever the page is refreshed.
For example, I am trying to extract the value '62804' from the following HTML code: "canvas.strokeText('Answer: 62804',90,112);". Any ideas how this can be done? Thanks.
<html>
<div id="content" class="large-12 columns">
<div class="example">
<h3>Challenging DOM</h3>
<p>The hardest part in automated web testing is finding the best locators (e.g., ones that well named, unique, and unlikely to change). It's more often than not that the application you're testing was not built with this concept in mind. This example demonstrates that with unique IDs, a table with no helpful locators, and a canvas element.</p>
<hr>
<div class="row">
<div class="large-12 columns large-centered">
<div class="large-2 columns">
<a id="debcda40-b692-0137-457b-2213fbd48497" href="" class="button">qux</a><br>
<a id="debce410-b692-0137-457c-2213fbd48497" href="" class="button alert">baz</a><br>
<a id="debd03d0-b692-0137-457d-2213fbd48497" href="" class="button success">foo</a><br>
</div>
<div class="large-10 columns">
<table>
<thead>
<tr>
<th>Lorem</th>
<th>Ipsum</th>
<th>Dolor</th>
<th>Sit</th>
<th>Amet</th>
<th>Diceret</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iuvaret0</td>
<td>Apeirian0</td>
<td>Adipisci0</td>
<td>Definiebas0</td>
<td>Consequuntur0</td>
<td>Phaedrum0</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret1</td>
<td>Apeirian1</td>
<td>Adipisci1</td>
<td>Definiebas1</td>
<td>Consequuntur1</td>
<td>Phaedrum1</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret2</td>
<td>Apeirian2</td>
<td>Adipisci2</td>
<td>Definiebas2</td>
<td>Consequuntur2</td>
<td>Phaedrum2</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret3</td>
<td>Apeirian3</td>
<td>Adipisci3</td>
<td>Definiebas3</td>
<td>Consequuntur3</td>
<td>Phaedrum3</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret4</td>
<td>Apeirian4</td>
<td>Adipisci4</td>
<td>Definiebas4</td>
<td>Consequuntur4</td>
<td>Phaedrum4</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret5</td>
<td>Apeirian5</td>
<td>Adipisci5</td>
<td>Definiebas5</td>
<td>Consequuntur5</td>
<td>Phaedrum5</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret6</td>
<td>Apeirian6</td>
<td>Adipisci6</td>
<td>Definiebas6</td>
<td>Consequuntur6</td>
<td>Phaedrum6</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret7</td>
<td>Apeirian7</td>
<td>Adipisci7</td>
<td>Definiebas7</td>
<td>Consequuntur7</td>
<td>Phaedrum7</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret8</td>
<td>Apeirian8</td>
<td>Adipisci8</td>
<td>Definiebas8</td>
<td>Consequuntur8</td>
<td>Phaedrum8</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret9</td>
<td>Apeirian9</td>
<td>Adipisci9</td>
<td>Definiebas9</td>
<td>Consequuntur9</td>
<td>Phaedrum9</td>
<td>
edit
delete
</td>
</tr>
</tbody></table>
<div class="row">
<div class="large-10 columns">
<canvas id="canvas" width="599" height="200" style="border:1px dotted;"></canvas>
</div>
</div>
</div>
</div>
</div>
<hr>
</div>
<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>
</div>
</html>
To use an XPath query with a strict XML processor, the input document must be well-formed XML.
In your case it isn't, because some tags are not properly closed (you can verify this with a tool such as xmllint).
E.g.
<hr> and <br> should be replaced with <hr/> and <br/>.
Once the XML is corrected, you can use an XPath query.
The first step is to select the script element:
//script
The output is:
Element='<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>'
Then you have to convert the element node into a string and perform some parsing:
substring-before(substring-after(//script/text(), 'canvas.strokeText(''Answer: ') , ''',90,112)')
The result is the following:
String='62804'
Note: you can do the same operation in a more flexible way using JavaScript, for example.
XPath is very good at querying XML (like the first operation I mentioned) but quite cumbersome for string parsing (like the second operation).
Hope this helps.
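As a rough illustration of the "select the script, then parse the string" idea in a general-purpose language, here is a minimal Python sketch. It uses only the standard library (html.parser and re) instead of an XPath engine, so it tolerates the unclosed <hr> and <br> tags. The names ScriptText and extract_answer are hypothetical, and the regular expression assumes the script always contains strokeText('Answer: <digits>'.

```python
import re
from html.parser import HTMLParser

SAMPLE = """<html><div><hr><canvas id="canvas"></canvas>
<script>
var canvas = document.getElementById('canvas').getContext('2d');
canvas.strokeText('Answer: 62804',90,112);
</script>
</div></html>"""

class ScriptText(HTMLParser):
    """Collect the text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)

def extract_answer(html):
    """Return the digits after 'Answer: ' in the script text, or None."""
    parser = ScriptText()
    parser.feed(html)
    match = re.search(r"strokeText\('Answer: (\d+)'", "".join(parser.chunks))
    return match.group(1) if match else None

print(extract_answer(SAMPLE))
```

Because the value is matched by pattern rather than by fixed surrounding text, the sketch keeps working when the number changes on refresh.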

Xpath grep elements

I'm using Scrapy (Python) to try to grep data from a site.
How can I grep this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
</table>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
</table>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
</table>
</div>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I write the proper XPath selector to get this structure?
I can select all the "h3" elements separately, then all the "a" elements, then all the tags with a class, but how can I match them up?
Grep, you say?! LOL. Well, you would be entirely wrong to call it that, but for the sake of keeping the jargon clean: what you are doing is parsing/extracting. So, new to Scrapy, or to the web dev side of things? No matter. There's no way I could expect to teach you in one answer how to XPath/regex like a pro; the only way is for you to keep at it, but I'll throw in my input.
First of all, XPath is amazingly useful when it comes to websites that aren't necessarily built to standard, which doesn't make them bad per se. But the HTML snippet you gave is structured all right, so I'd recommend CSS selectors. These are the values:
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
Pro tip: whenever you deem it necessary, you can save a web page as an HTML file and open it in scrapy shell by giving the file path directly. So I saved your HTML snippet to a file on my desktop and then ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
Anyway, as far as XPath goes, since you asked: piece of cake. Let's compare the XPath with the CSS and see if you can spot the correspondence.
response.css('td.tmp_outcome b::text').extract()
So it is a td tag whose class name is tmp_outcome, then the next node is a bold tag, which is where the text is, and ::text declares that we want the text.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
So the XPath is basically saying: start with a pattern anywhere in the page, a td tag with class="tmp_outcome", then the bold tag. In XPath, /text() selects the text, and /@href selects the href attribute.
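The selectors above return three flat lists, which leaves the "how do I match them up" part open. One possible approach, sketched here with Python's standard html.parser rather than Scrapy, is to walk the document in order and attach every row to the most recent <h3>. The names SectionParser and parse_sections are hypothetical, and the sketch assumes each row contains a td.tmp_year, then a <b>, then a span.tmp_category, as in the question's snippet.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Group (year, outcome, category) rows under the preceding <h3> text."""

    def __init__(self):
        super().__init__()
        self.sections = {}   # heading text -> list of (year, outcome, category)
        self.heading = None  # text of the most recent <h3>
        self.field = None    # field that the next non-blank text belongs to
        self.row = None      # row currently being filled

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "h3":
            self.field = "heading"
        elif tag == "td" and cls == "tmp_year":
            self.row = {}
            self.field = "year"
        elif tag == "b" and self.row is not None:
            self.field = "outcome"
        elif tag == "span" and cls == "tmp_category" and self.row is not None:
            self.field = "category"

    def handle_data(self, data):
        text = data.strip()
        if not text or self.field is None:
            return
        if self.field == "heading":
            self.heading = text
            self.sections.setdefault(text, [])
        elif self.row is not None:
            self.row[self.field] = text
            if self.field == "category":
                # The category is the last field, so the row is complete.
                done = (self.row.get("year"), self.row.get("outcome"), text)
                self.sections.setdefault(self.heading, []).append(done)
                self.row = None
        self.field = None

def parse_sections(html):
    """Return a dict mapping each heading to its list of row tuples."""
    parser = SectionParser()
    parser.feed(html)
    return parser.sections
```

In Scrapy itself the same grouping idea would mean iterating over the headings and selecting the rows relative to each one, instead of extracting every column for the whole page at once.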

How to match and replace multiline html file with sed

I have a text file something like this.
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want.
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my BUGGY code.
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string (I can filter those strings later using that as the delimiter character). I used the solution given in this URL. Output does not come as I expected.
This is the actual Code
<tbody>
<tr>
<td>
120.52.72.58:80
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>
Your sed code does not work because the pattern <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td>, due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with ungreedy quantifiers; thus, some other tool would be more appropriate.
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string …
If those tags are always (as in your examples) on a separate line, we can do with a simple sed command:
sed '/<\/*td.*>/d'
All the strings are thereafter delimited by some string which is \n followed by spaces.
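As an example of the "some other tool" route, here is a minimal Python sketch that uses a non-greedy quantifier to pull out each cell and join the contents with a semicolon. The function name join_cells is hypothetical. Any markup inside a cell (such as the <span> and <abbr> tags in the actual data) is kept as-is and would need a further clean-up step.

```python
import re

SAMPLE = """<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td class="small">
String3
</td>
</tr>
</tbody>"""

def join_cells(html, sep=";"):
    """Return the contents of every <td>...</td> cell joined by sep."""
    # (.*?) is non-greedy, so each match stops at the nearest </td>;
    # re.DOTALL lets "." span the line breaks inside a cell.
    cells = re.findall(r"<td[^>]*>(.*?)</td>", html, flags=re.DOTALL)
    return sep.join(cell.strip() for cell in cells)

print(join_cells(SAMPLE))
```

The [^>]* part keeps the opening-tag match from running past the first >, which is the other place where the original <td.*> pattern was too greedy.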

Extract a value from HTML source code using jmeter

I know this might be a duplicate, but I am not able to extract a value from this HTML source. Any help would be greatly appreciated.
What I am trying to do is get the pid of a project from the page.
The project names are read from a CSV file, and I need to get the pid for each.
For example, if the project is "AA project" (the project key "AA" can also be used), the pid that needs to be extracted is 10441.
Since the values are not labelled, I cannot figure out how to extract them.
Update: just using pid=(\d....) gives all the pids without any reference to the project name or key.
<table id="project-list" class="aui">
<thead>
<tr>
<th></th>
<th>Name</th>
<th>Key</th>
<th class="project-list-type">Project Type</th>
<th>URL</th>
<th>Project Lead</th>
<th>Default Assignee</th>
<th>Operations</th>
</tr>
</thead>
<tbody>
<tr data-project-key="AA">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10441&amp;avatarId=10011&amp;size=small" alt="Project Avatar for 10441" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
</td>
<td data-cell-type="key">AA</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AA_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
<li><a class="edit-project" id="edit-project-10441" href="/secure/project/EditProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Edit</a></li>
<li><a id="change_project_type_10441" class="change-project-type-link" data-project-id="10441" href="#">Change project type</a></li>
<li><a id="delete_project_10441" href="/secure/project/DeleteProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Delete</a></li>
</ul>
</td>
</tr>
<tr data-project-key="AAL">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10442&amp;avatarId=10011&amp;size=small" alt="Project Avatar for 10442" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
</td>
<td data-cell-type="key">AAL</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AAL_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
I wouldn't recommend using regular expressions to parse HTML: they are a headache to develop and maintain, and they are very sensitive to markup changes and hence fragile. See https://stackoverflow.com/a/1732454/2897748 for details.
Go for XPath Extractor instead, the relevant configuration would be:
Reference Name: anything meaningful, e.g. id
XPath Query: substring-after(//tr[@data-project-key='AA']/td[@data-cell-type='name']/a/@id,'view-project-')
Check Use Tidy if your response is not XHTML-compliant
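To check the logic of that query outside JMeter, here is a minimal Python sketch using the standard xml.etree.ElementTree module, which supports the attribute predicates used above but not XPath functions such as substring-after, so the prefix is stripped in Python instead. The function name project_id and the trimmed SAMPLE fragment are illustrative assumptions, and the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<table id="project-list" class="aui">
  <tbody>
    <tr data-project-key="AA">
      <td data-cell-type="name">
        <a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
      </td>
      <td data-cell-type="key">AA</td>
    </tr>
    <tr data-project-key="AAL">
      <td data-cell-type="name">
        <a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
      </td>
      <td data-cell-type="key">AAL</td>
    </tr>
  </tbody>
</table>"""

PREFIX = "view-project-"

def project_id(xhtml, key):
    """Return the pid for the project with the given key, or None."""
    root = ET.fromstring(xhtml)
    # Same path as the XPath query: the row for this key, then its name link.
    link = root.find(
        f".//tr[@data-project-key='{key}']/td[@data-cell-type='name']/a"
    )
    if link is None:
        return None
    element_id = link.get("id", "")
    # Equivalent of substring-after(..., 'view-project-').
    return element_id[len(PREFIX):] if element_id.startswith(PREFIX) else None

print(project_id(SAMPLE, "AA"))
```

Matching on the exact key keeps "AA" from also matching the "AAL" row, which is the weakness of the plain pid=(\d....) regular expression mentioned in the question.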

C++: How to recursively/iteratively search HTML file (using Boost C++)?

I'm working on an application where I need to fetch an HTML file (from the web) and obtain a piece of information by searching for a string.
I reckon it is more effective and easier to treat the HTML file as an XML file, iterate over the tags in the HTML file, and match their content against a string.
Here is the HTML table I'm interested in:
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr>
<td>
</td>
<td width='30px'>
</td>
<td width='220px'>
</td>
<td width='50px'>
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Bandbredd (upp/ned) [kbps/kbps]:
</td>
<td colspan='3'>
1.058 / 21.373
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
</td>
<td colspan='3'>
1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
</td>
</tr>
</table>
So I will search for the <td> tags containing either of the following strings:
Aktiv tid:
Överförda data (skickade/mottagna) [GB/GB]:
After that I need to select the next <td> tag containing the piece of information I want (in the same <tr>).
I successfully fetched the HTML file using cURL but need a little help with the XML search algorithm.
Thank you in advance!
(EDIT: Here is the pseudocode for my desired application (should be very self-explanatory):
extern "C" {
#include "url.h"
}
#include <string>
#include <iostream>
std::string xmlSearch(std::string fn, std::string str);
int main(void)
{
/* download HTML file from URL to file */
url("http://myurl.com/","page.html");
/* search page.html for "Aktiv tid:" and return the content of the next <td> tag. */
std::string data0 = xmlSearch("page.html","Aktiv tid:");
/* search page.html for "Överförda data (skickade/mottagna) [GB/GB]:" and return the content of the next <td> tag. */
std::string data1 = xmlSearch("page.html","Överförda data (skickade/mottagna) [GB/GB]:");
/* process results */
}
std::string xmlSearch(std::string fn, std::string str){
/* perform search algorithm */
/* return content of the next <td> tag. */
return std::string(); /* placeholder until the search is implemented */
}
)
I could see myself doing this with a quick-and-dirty script, not with C++, really.
In one line:
(tidy -asxml input.xml | xmllint --xpath 'descendant-or-self::*[starts-with(text(), "Aktiv tid:")]/following-sibling::*/text()' -) 2>/dev/null
Here
tidy converts quirky html to xml
xmllint queries it:
from * (any element) which [starts-with(text(), "Aktiv tid:")]
select the text() from the following sibling
2>/dev/null is there to suppress any warning from tidy and xmllint
Presto, it prints:
1 dag, 17:03:46
For the precise input from your question.
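If a script is an option, the same "find the label cell, take the next cell in that row" logic can be sketched in Python with only the standard library. This is an illustration under assumptions, not a port of the one-liner: RowCollector and value_after are hypothetical names, and html.parser is used because it tolerates the unclosed <img> tags without a tidy step.

```python
from html.parser import HTMLParser

SAMPLE = """<table class='datatable'>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' alt=''></td></tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]:
</td>
<td colspan='3'>
1,67 / 42,95
</td>
</tr>
</table>"""

class RowCollector(HTMLParser):
    """Collect the whitespace-normalised text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell texts per <tr>
        self._cell = None   # text fragments of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

def value_after(html, label):
    """Return the cell that follows the cell starting with label, same row."""
    collector = RowCollector()
    collector.feed(html)
    for row in collector.rows:
        for cell, following in zip(row, row[1:]):
            if cell.startswith(label):
                return following
    return None

print(value_after(SAMPLE, "Aktiv tid:"))
```

HTML comments are skipped by the parser by default, so the annotations inside the cells do not end up in the returned value.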