Scraping td elements in a table in HTML

I have to get the text from the td elements of an HTML table that looks like this:
<table id="gvrslt" >
<tbody><tr style="font-size:10pt;">
<th scope="col">Sem</th><th scope="col" style="font-size:X-Small;">Total Obtained Marks</th><th scope="col" style="font-size:X-Small;">Max Total Marks</th><th scope="col">Result</th>
</tr>
<tr>
<td align="center">VI</td>
<td align="center">458</td>
<td align="center">550</td>
<td align="center">PASSED</td>
</tr>
</tbody></table>
I want to grab the 458 from the table, which has more td elements like it. The problem is that before reaching the results page and getting the above HTML, I have to enter some credentials, and the results page is then shown with right-click disabled. I can get the source of the results page via driver.page_source, but when I try to find the table elements via webdriver, it searches the page where I entered the credentials and not the actual results page. Is there a way to search driver.page_source for table and td elements?
Here is my code:
from bs4 import BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table', id='gvrslt')
print(table)

If you want the text directly, you can use a CSS locator that goes straight to the 2nd td instead of locating the table first:
table[id='gvrslt'] td:nth-of-type(2)
nth-of-type(2) selects the second td within each row.
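If the live driver keeps searching the wrong page, the same second-cell lookup can be run offline on the string returned by driver.page_source. Below is a minimal sketch using only Python's standard library. The names TableCells and cells_of are made up for illustration, SAMPLE stands in for driver.page_source, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td> inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif tag == "td" and self.in_table:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False  # assumption: no nested tables
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data


def cells_of(html, table_id):
    """Return the stripped text of each td in the table, in document order."""
    parser = TableCells(table_id)
    parser.feed(html)
    return [cell.strip() for cell in parser.cells]


# Stand-in for driver.page_source (assumption: the results page is loaded).
SAMPLE = """
<table id="gvrslt" >
<tbody><tr style="font-size:10pt;">
<th scope="col">Sem</th><th scope="col">Total Obtained Marks</th>
</tr>
<tr>
<td align="center">VI</td>
<td align="center">458</td>
<td align="center">550</td>
<td align="center">PASSED</td>
</tr>
</tbody></table>
"""

cells = cells_of(SAMPLE, "gvrslt")
print(cells[1])  # second td of the results row
```

Indexing the flat list works here because the sample has a single data row; with several rows you would group the cells per tr first.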

Try using XPath in this case:
//table[@id='gvrslt']//td[index]
replacing index with the position of the td you want.
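To try this kind of expression without a browser, here is a small sketch with Python's xml.etree.ElementTree, which supports a limited subset of XPath. It assumes the markup is well-formed XML, which real pages often are not; PAGE is a simplified stand-in for the results page, and nth_cells is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the results page (assumption).
PAGE = """<html><body>
<table id="gvrslt">
<tbody>
<tr><th>Sem</th><th>Total Obtained Marks</th><th>Max Total Marks</th><th>Result</th></tr>
<tr><td>VI</td><td>458</td><td>550</td><td>PASSED</td></tr>
</tbody>
</table>
</body></html>"""


def nth_cells(page, table_id, index):
    """Text of every td that is the index-th td of its row (1-based)."""
    root = ET.fromstring(page)
    # ElementTree paths are relative to the root, hence the leading ".".
    path = f".//table[@id='{table_id}']//td[{index}]"
    return [td.text for td in root.findall(path)]


marks = nth_cells(PAGE, "gvrslt", 2)
print(marks)
```

As in full XPath, td[2] means "the second td among its siblings", so the result has one entry per row that contains at least two td cells.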

I'm not familiar with Selenium in Python, but what you are trying to do is find the value using XPath.
Below is the code in C#, with the table's id left for you to fill in. See if it helps.
IWebElement tdCell = driver.FindElement(By.XPath("//table[@id='']/tbody/tr[2]/td[2]"));
string valueOfTd = tdCell.Text;

Related

My XPath is not working for all the scenario

I need to assert on the total of a table, but the XPath I wrote for the total in the image table does not work in every scenario. I know the reason: the table is dynamic, so sometimes it has only 1 image and sometimes it has 5, and I am having difficulty writing an XPath that works in all cases.
For the table in the attached image I have written:
WebElement totalElement = driver.findElement(By.xpath("//*[@id='image_table']/tbody/tr[5]/td[4]"));
It runs fine, but it fails when the table size changes.
For reference, the page markup is as follows:
<table class="table table-bordered table-striped table-hover display" id="image_table">
<tr>
<td colspan="2" class="text-right">Total</td>
<td class="text-center">1,650</td>
<td class="text-center">19,936</td>
<td class="text-center">21,586</td> (trying to write xpath for this total)
</tr>
Try css:
#image_table>tr>td:last-of-type
It always selects the last cell.
Update based on comment. This css selector is more robust:
#image_table tr:last-of-type>td:last-of-type
//td[contains(string(),"Total")]/following-sibling::td[last()]
This finds the td that contains the text Total and then selects the last td among its following siblings.
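The idea of anchoring on the Total label rather than on a row index can be checked offline. Here is a sketch with Python's xml.etree.ElementTree, assuming well-formed markup. The helpers grand_total and make_table are hypothetical, and the image rows are invented filler to vary the table size.

```python
import xml.etree.ElementTree as ET


def grand_total(table_xml):
    """Last cell of the row that has a td whose text is exactly 'Total'."""
    table = ET.fromstring(table_xml)
    cell = table.find(".//tr[td='Total']/td[last()]")
    return None if cell is None else cell.text


def make_table(image_rows):
    """Build a test table with a variable number of (invented) image rows."""
    body = "".join(
        f"<tr><td>{i}</td><td>img{i}</td><td>1</td><td>2</td><td>3</td></tr>"
        for i in range(1, image_rows + 1)
    )
    total = ('<tr><td colspan="2">Total</td><td>1,650</td>'
             "<td>19,936</td><td>21,586</td></tr>")
    return f'<table id="image_table">{body}{total}</table>'


# The lookup does not depend on how many image rows precede the total.
for size in (1, 5):
    print(size, grand_total(make_table(size)))
```

Because the path never mentions a row number, it keeps working when rows are added or removed, as long as the label text stays the same.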

HTML Scraping with HTMLReader Search for Table Row Content and Return href

I'm using the Objective-C library HTMLReader for my first (simple, I think) HTML scraping task. There is little documentation for it, and after a lot of experimentation I can't quite get what I need.
I'm scraping an old HTML page whose largest feature is one table with three columns and many rows. Here's a sample of the table with one row:
<table border="1" cellspacing="2" cellpadding="6" bordercolor="#000000" bgcolor="#999999" style="margin-top:50px;width:100%;">
<tr height=30>
<td bgcolor="#34003C" align="left" valign="middle" background="background.gif"><span class="cls_TableHeader">Bands</span></td>
<td bgcolor="#34003C" align="left" valign="middle" background="background.gif"><span class="cls_TableHeader">Style</span></td>
<td bgcolor="#34003C" align="left" valign="middle" background="background.gif"><span class="cls_TableHeader">Country</span></td>
</tr>
<tr>
<td class="cls_tdDisco0" align="left" valign="middle">
<strong>THE BEATLES</strong>
</td>
<td class="cls_tdDisco0" align="left" valign="middle">
<span class="cls_DiscoText">Rock</span></td>
<td class="cls_tdDisco0" align="left" valign="middle"><span class="cls_DiscoText">England</span></td>
</tr>
There are, of course, many rows.
What I'm trying to accomplish:
I need to search for the td that contains "THE BEATLES" and extract the href attached to it (even when it sits in the middle of a lot of other rows).
What I've tried:
I can get the table itself with
HTMLDocument *home = [HTMLDocument documentWithData:data contentTypeHeader:nil];
HTMLElement *table = [home firstNodeMatchingSelector:@"TABLE"];
HTMLNode *theActualTable = [table childAtIndex:1];
But I can't really use the method "nodesMatchingSelector" to search rows, since what I'm looking for isn't a selector. I've tried getting the rows (via children), but then I'm iterating through each row's children of children until I drill down to the tag that contains THE BEATLES, and then using that index to get the a tag attached to it. It seems there should be a much easier way to do this with HTMLReader. I feel like I'm missing something simple.
Thanks in advance!
Here is some pseudocode that might work for you:
Use nodesMatchingSelector to get all the tr elements in the table.
Then loop through the tr elements and get the first td of each one.
Then use nodesMatchingSelector again to get the strong tag.
Then use node.textContent to get the text content of the strong tag.
https://github.com/nolanw/HTMLReader has an example in the readme that shows how to use the textContent method.
Feel free to post follow-up questions as comments if any of this doesn't make sense.
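The steps above are library-agnostic, so here is a rough Python sketch of the same loop (rows, then cell text, then the link) using only the standard library. It is not HTMLReader code. It also assumes each band name is wrapped in an a tag, which the sample markup in the question does not show; the href values and the names RowLinkFinder and find_href are invented for illustration.

```python
from html.parser import HTMLParser


class RowLinkFinder(HTMLParser):
    """Remember the first href of the row whose text contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.in_row = False
        self.row_text = []    # stripped text pieces of the current row
        self.row_href = None  # first href seen in the current row
        self.match = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row, self.row_text, self.row_href = True, [], None
        elif tag == "a" and self.in_row and self.row_href is None:
            self.row_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_row:
            self.row_text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_row:
            if self.match is None and self.wanted in self.row_text:
                self.match = self.row_href
            self.in_row = False


def find_href(html, wanted):
    finder = RowLinkFinder(wanted)
    finder.feed(html)
    return finder.match


# Hypothetical rows: the <a href=...> wrappers are an assumption.
SAMPLE = """<table>
<tr><td><span>Bands</span></td><td><span>Style</span></td></tr>
<tr><td><a href="stones.html"><strong>THE ROLLING STONES</strong></a></td>
<td><span>Rock</span></td></tr>
<tr><td><a href="beatles.html"><strong>THE BEATLES</strong></a></td>
<td><span>Rock</span></td></tr>
</table>"""

print(find_href(SAMPLE, "THE BEATLES"))
```

The sketch matches against every text piece in the row rather than only the strong tag of the first td, which keeps it short at the cost of being less strict than the steps describe.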

Excel VBA web scraping from table

I am trying to extract some info from the table below into Excel using VBA, without any success. The values I need do not seem to have any element ID, tag name or class name assigned to them. I'm after the Fuel Usage value (89218) and the time value in the same row (01:15). Can anyone point me in the right direction on how to scrape values from a table, or how to extract data from a specific TR or TD?
HTML source of the table:
<h3>Airbus A300-600-PW4158 Fuel Planner</h3>
<p>London to Chicago EGKK-KORD (3441 NM)<br /></p>
<h2>Total Fuel: 101901 POUNDS</h2>
<table width="100%" border=1>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:left;">Fuel</td>
<th style="text-align:left;">Time</th>
</tr>
<tr>
<td>Fuel Usage</td>
<td>89218</td>
<td>08:47</td>
</tr>
<tr>
<td>Reserve Fuel</td>
<td>12682</td>
<td>01:15</td>
</tr>
<tr>
<td>Fuel on Board</td>
<td>101901</td>
<td>10:02</td>
</tr>
</table>
Much appreciated.
CSS Selectors:
Without seeing more of the HTML, you can use the following CSS selectors for the snippet shown:
tr td:nth-child(2)
tr td:nth-child(3)
These selectors return nodeLists of every td that is the 2nd (or 3rd) child of a tr.
You can access individual items from a nodeList by index.
VBA:
The overall syntax in VBA will be something like:
.document.querySelectorAll("tr td:nth-child(2)")(0).innerText
or possibly
.document.querySelectorAll("tr td:nth-child(2)").Item(0).innerText
The 0 is hypothetical. You would need to inspect your full HTML to ascertain the correct index to use.
The .document can come from IE after navigating to the page, for example, or its innerHTML can be populated from a .responseText.
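To see which index lands on which value, it can help to lay the table out by row first. Here is a minimal Python sketch (standard library only, not VBA) that collects each row's cells and keys the 2nd and 3rd cells by the label in the 1st. RowCollector and fuel_table are illustrative names, and the HTML is copied from the question, including its mismatched th/td close tag.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect each <tr> as a list of stripped cell texts (td and th)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def fuel_table(html):
    """Map the first cell of each row to its (2nd, 3rd) cells."""
    collector = RowCollector()
    collector.feed(html)
    return {r[0]: (r[1], r[2]) for r in collector.rows if len(r) >= 3}


SAMPLE = """<table width="100%" border=1>
<tr><th style="text-align:left;"> </th>
<th style="text-align:left;">Fuel</td>
<th style="text-align:left;">Time</th></tr>
<tr><td>Fuel Usage</td><td>89218</td><td>08:47</td></tr>
<tr><td>Reserve Fuel</td><td>12682</td><td>01:15</td></tr>
<tr><td>Fuel on Board</td><td>101901</td><td>10:02</td></tr>
</table>"""

table = fuel_table(SAMPLE)
print(table["Fuel Usage"])
```

Looking values up by row label avoids depending on a position in the flat nodeList, which shifts if the page gains another table above this one.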

Only parsing outer element

I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.
Currently, I am scraping a large table; here is a small fragment:
<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>
Airfield Lighting
<div>
<div>
<table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
"black">
<tr>
<td bgcolor="white">Abstract:<br />
This project is for the purchase and delivery, of various airfield
lighting, for a period of 36 months, with two optional 1 year renewals,
in accordance with the specifications, terms and conditions specified in
the solicitation.</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
And here is the Ruby code I am using to scrape:
document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
document[1..-1].each do |v|
cells = v.search 'td'
if cells.inner_html.length > 0
data = {
number: cells[0].text,
}
end
ScraperWiki::save_sqlite(['number'], data)
end
Unfortunately this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I only extract the content of the first td?
Keep in mind that under this table, there are many table rows following the same format.
In CSS, table tr means tr anywhere underneath the table, including nested tables. But table > tr means the tr must be a direct child of the table.
Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):
doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)
The content of the first td would be:
doc.at("table#rptBidTypes__ctl0_dgResults td").text
The problem is that your search is matching two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1] you're actually selecting the second <tr> tag rather than the first one.
To select just the direct child <tr> tag, use:
document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")
Then you can get the text for the <td> tag with:
document.css('td')[0].text #=> "S24327"
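The difference between a descendant search and a direct-child search can be reproduced outside Nokogiri. Here is a small Python sketch with xml.etree.ElementTree, assuming a well-formed fragment; FRAGMENT is an abbreviated copy of the table from the question with the abstract text shortened.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the fragment above; the abstract text is shortened.
FRAGMENT = """<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>Airfield Lighting
<div><div>
<table><tr><td bgcolor="white">Abstract: (text omitted)</td></tr></table>
</div></div>
</td>
</tr>
</table>"""

table = ET.fromstring(FRAGMENT)

# ".//tr" walks all descendants, so the nested table's row is included.
all_rows = table.findall(".//tr")

# "tr" only looks at direct children of the outer table.
direct_rows = table.findall("tr")

# First cell of each outer row.
numbers = [row.find("td").text for row in direct_rows]

print(len(all_rows), len(direct_rows), numbers)
```

This mirrors the CSS distinction between "table tr" and "table > tr" described in the answers.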

XPath:: Get following Sibling

I have the following HTML structure. I am trying to build a robust method to extract the second Color Digest element, since there will be many of these tags within the DOM.
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<td>Color Digest </td>
<td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
</tr>
<tr>
<td>Color Digest </td>
<td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
I am trying to extract the second "Color Digest" td element, the one that has the decoded value.
I wrote the following XPath, but it does not return the second td element:
//td[text() = ' Color Digest ']/following-sibling::td[2]
And when I change td[2] to td[1], I get both elements.
You should be looking for the second tr that has the td that equals ' Color Digest ', then you need to look at either the following sibling of the first td in the tr, or the second td.
Try the following:
//tr[td='Color Digest'][2]/td/following-sibling::td[1]
or
//tr[td='Color Digest'][2]/td[2]
http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89
You can also identify a list of elements with XPath:
//td[text() = ' Color Digest ']/following-sibling::td[1]
This will give you a list of two elements, and then you can use the 2nd element as your intended one. For example:
List<WebElement> elements = driver.findElements(By.xpath("//td[text() = ' Color Digest ']/following-sibling::td[1]"));
Now you can use the 2nd element as your intended element, which is elements.get(1).
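The "collect all matches, then take the second" approach can be sketched offline in Python with xml.etree.ElementTree. This assumes well-formed markup; the digest values are shortened placeholders and digest_values is a hypothetical helper name. The label is compared after stripping whitespace, because the cell text in the question carries a trailing space.

```python
import xml.etree.ElementTree as ET


def digest_values(xml_text, label="Color Digest"):
    """Value cell of every row whose first cell matches the label."""
    root = ET.fromstring(xml_text)
    values = []
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Strip before comparing: the label cell has trailing whitespace.
        if len(cells) >= 2 and (cells[0].text or "").strip() == label:
            values.append((cells[1].text or "").strip())
    return values


# Shortened placeholder digests (assumption), same row layout as above.
SAMPLE = """<table><tbody>
<tr bgcolor="#AAAAAA"><td>Header</td></tr>
<tr><td>Color Digest </td><td>AgArAQIC </td></tr>
<tr><td>Color Digest </td><td>2,43,2,25, </td></tr>
</tbody></table>"""

values = digest_values(SAMPLE)
print(values[1])  # the second Color Digest value
```

Indexing the list after the search is the same move as elements.get(1): the position applies to the matched rows, not to siblings within one row.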
/html/body/table/tbody/tr[9]/td[1]
In Chrome (and possibly Safari too) you can inspect an element, right-click the tag you want, and copy its XPath to select that element.