Get elements from table using XPath


I am trying to get information from this website:
https://www.realtypro.co.za/property_detail.php?ref=1736
It has this table, from which I want to extract the number of bedrooms:
<div class="panel panel-primary">
<div class="panel-heading">Property Details</div>
<div class="panel-body">
<table width="100%" cellpadding="0" cellspacing="0" border="0" class="table table-striped table-condensed table-tweak">
<tbody><tr>
<td class="xh-highlight">3</td><td style="width: 140px" class="">Bedrooms</td>
</tr>
<tr>
<td>Bathrooms</td>
<td>3</td>
</tr>
I am using this xpath expression:
bedrooms = response.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]/text()").extract_first()
However, I only get 'None' as output.
I have tried several combinations and I only get None as output. Any suggestions on what I am doing wrong?
Thanks in advance!

I would use bs4 4.7.1+, where you can search with :contains for the td cell containing the text "Bedrooms" and then take the adjacent sibling td. You can add an is None test for error handling. This is less fragile than a long XPath.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.realtypro.co.za/property_detail.php?ref=1736')
soup = bs(r.content, 'lxml')
print(int(soup.select_one('td:contains(Bedrooms) + td').text))
If position was fixed you could use
.table-tweak td + td
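The same idea of finding the cell by its label and then taking its neighbour can also be written as an XPath. The sketch below is only an illustration: it uses the standard library's ElementTree, which supports a limited XPath subset and needs well-formed markup, and the SAMPLE fragment and the value_for helper are hypothetical, simplified stand-ins for the real page (a real HTML page would normally go through an HTML parser first).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed fragment of the details table
# (label cell first, value cell second). Not the real page.
SAMPLE = """
<table class="table table-striped table-condensed table-tweak">
  <tr><td>Bedrooms</td><td>3</td></tr>
  <tr><td>Bathrooms</td><td>2</td></tr>
</table>
"""


def value_for(root, label):
    """Return the text of the second cell in the row that has a cell equal to label."""
    # [td='...'] keeps rows with a td child whose text equals the label;
    # td[2] then picks the second td of that row.
    cell = root.find(f".//tr[td='{label}']/td[2]")
    return None if cell is None else cell.text


root = ET.fromstring(SAMPLE)
bedrooms = value_for(root, "Bedrooms")
print(bedrooms)
```

Because the row is matched by its label rather than by position, the lookup keeps working if rows are reordered, and a missing label yields None instead of the wrong cell.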

Try this and let me know if it works:
import lxml.html
response = [your code above]
beds = lxml.html.fromstring(response)
bedrooms = beds.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
bedrooms
Output:
['3']
EDIT:
Or possibly:
for bed in beds:
    num_rooms = bed.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
    print(num_rooms)

Related

Scraping ID attribute using rvest

I am trying to check whether the Polish elections were fair and whether candidates from the opposition did not get an abnormally low number of votes in districts with a higher number of invalid votes. To do so, I need to scrape the results of each district.
On the page with the official election results for my city, each row of the bottom table is a different district, and clicking a row redirects you to that district. The link is not in the usual <a ... href=...> format; instead, the variable part of the district link is encoded in the data-id=... attribute.
My question is: how can I extract the data-id= attribute from a table on a webpage using R?
Sample data: in this example I would like to extract 697773 from the data row.
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
html_nodes('[class="table-responsive"]') %>%
html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
html_nodes('tr') %>%
html_attrs()
But I get named character(0) as a result

I found a solution that is not very optimal. I bet there is a better way!
I downloaded the webpage, saved it as a txt file, and read it from there:
txt_webpage <- readChar(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"),
                        file.info(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"))$size)
posiotions <- gregexpr(pattern ='<tr data', txt_webpage)
districts_numbers <- c()
for (i in posiotions[[1]]) {
  print (i)
  tmp <- substr(txt_webpage, i + 10, i + 22)
  tmp <- gsub('\\D+','', tmp)
  districts_numbers <- c(districts_numbers, tmp)
}
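For comparison, the same extraction can be done by parsing the saved text instead of counting character offsets. This is a rough sketch in Python rather than R, using only the standard library's HTMLParser; the DataIdCollector class and the shortened SAMPLE markup are hypothetical, and it assumes, as in the answer above, that the saved page text already contains the table rows.

```python
from html.parser import HTMLParser


class DataIdCollector(HTMLParser):
    """Collect the data-id attribute of every <tr> that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "tr":
            data_id = dict(attrs).get("data-id")
            if data_id is not None:
                self.ids.append(data_id)


# Shortened, hypothetical version of the sample table from the question.
SAMPLE = """
<table id="DataTables_Table_16">
<thead><tr role="row"><th>Numer</th><th>Siedziba</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td>1</td><td>Szkoła Podstawowa nr 63</td></tr>
</tbody>
</table>
"""

collector = DataIdCollector()
collector.feed(SAMPLE)
district_ids = collector.ids
print(district_ids)
```

Reading the attribute by name avoids depending on the fixed offsets (i + 10, i + 22) used in the substring approach, so ids of a different length are still captured whole.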

Beautiful soup finding the first sibling of a known object with a known attribute

I have the following code to select a certain cell in a table element:
tag = soup.find_all('td', attrs={'class': 'I'})
As shown in the HTML below, I would like to find the first sibling of that cell within the same row (class "even_row"). Ideally, the selection would output only the contents of data-seconds, in this case "58". Not every "even_row" row has an element with class "I", and some have more than one, so I need the data-seconds value only for the "even_row" rows that contain an element with class "I".
Any help would be appreciated, as I've been banging my head against the wall looking through the documentation to no avail.
The HTML looks like this:
<tr class='even_row'>
<td class='row_labels' data-seconds="58">
<div class='celldiv slots1'></div>
</td>
<td class='new'>...</td>
<td class='I'>...</td>
<td class='new'>...</td>
<td class='new'>...</td>

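One way to get around that issue is to pass True as the attribute value, which matches any td that has a data-seconds attribute, whatever its value: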
from bs4 import BeautifulSoup
html = """
<tr class='even_row'>
<td class='row_labels' data-seconds="58">
<div class='celldiv slots1'></div>
</td>
<td class='new'>...</td>
<td class='I'>...</td>
<td class='new'>...</td>
<td class='new'>...</td>
</tr>
<tr class='even_row'>
<td class='row_labels' >
<div class='celldiv slots1'></div>
</td>
<td class='new'>...</td>
<td class='I'>...</td>
<td class='new'>...</td>
<td class='new'>...</td>
</tr>
"""
soup = BeautifulSoup(html,'html.parser')
even_rows = soup.find_all('tr', attrs={'class': 'even_row'})
for row in even_rows:
    tag = row.find("td", {"data-seconds" : True})
    if tag is not None:
        print(tag.get('data-seconds'))
Output :
58
Another way to do it is to use a regular expression:
import re
tds = [tag.get('data-seconds') for tag in soup.findAll("td", {"data-seconds" : re.compile(r".*")})]
print(tds)
Output :
['58']

I cannot test this properly without the HTML, but it sounds like bs4 4.7.1+ lets you use :has to satisfy your requirements: .even_row:has(.I) selects an element with class even_row that contains an element with class I, and adding [data-seconds] then picks up every element inside that row that carries a data-seconds attribute:
print([i['data-seconds'] for i in soup.select('.even_row:has(.I) [data-seconds]')])
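If CSS selectors are not an option, the same filter can be sketched with plain find calls. This is only an illustration: the HTML below is a made-up, shortened variant of the question's markup, and the second row (with data-seconds="120" and no class "I" cell) is hypothetical, added to show that such rows are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: only the first row contains a cell with class "I".
HTML = """
<table>
<tr class='even_row'>
  <td class='row_labels' data-seconds="58"><div class='celldiv slots1'></div></td>
  <td class='new'>...</td>
  <td class='I'>...</td>
</tr>
<tr class='even_row'>
  <td class='row_labels' data-seconds="120"><div class='celldiv slots1'></div></td>
  <td class='new'>...</td>
  <td class='new'>...</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

seconds = []
for row in soup.find_all("tr", class_="even_row"):
    # Skip rows that contain no cell with class "I".
    if row.find("td", class_="I") is None:
        continue
    # Take the first cell in the row that has a data-seconds attribute.
    label = row.find("td", attrs={"data-seconds": True})
    if label is not None:
        seconds.append(label["data-seconds"])

print(seconds)
```

A row with several class "I" cells still contributes its data-seconds value only once, because the check happens per row rather than per matching cell.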

Unwanted Padding in QTextEdit HTML Subset <img>

I am trying to insert an image into an HTML table inside a QTextEdit, using its supported HTML subset.
I would like the image to fit the table width exactly.
Unfortunately, nothing goes smoothly with Qt, and an annoying padding is left to the right of and under the image.
Here is the simplified code (use any image to test). Does anyone have any idea how to avoid it?
import sys
from PyQt5.QtWidgets import *
from PyQt5.QtGui import *
from PyQt5.QtCore import QCoreApplication, QRect, Qt
class Labhtml(QTextEdit):
    def __init__(self):
        super().__init__()
        html= '''
<table border="1" cellspacing="0" cellpadding="0">
<tr>
<td>
<img src="bar.png">
</td>
</tr>
</table>
'''
        self.setText(html)

class Example(QScrollArea):
    def __init__(self):
        super().__init__()
        widget = QWidget()
        layout = QVBoxLayout(widget)
        layout.setAlignment(Qt.AlignTop)
        layout.addWidget(Labhtml())
        self.setWidget(widget)
        self.setWidgetResizable(True)
        self.show()

if __name__ == '__main__':
    app = QApplication(sys.argv)
    ex = Example()
    sys.exit(app.exec_())
Qt is full of these glitches when it comes to the HTML subset; I am close to giving up.

The problem is caused by the whitespace you've added within the td tag. So one way to fix the problem is like this:
<table border="1" cellspacing="0" cellpadding="0">
<tr>
<td><img src="image.png"></td>
</tr>
</table>
Alternatively, use fixed dimensions that are smaller than the image:
<table border="1" cellspacing="0" cellpadding="0">
<tr>
<td width="0" height="0">
<img src="image.png">
</td>
</tr>
</table>

Extract text from nested tags inside another nested tags using beautifulsoup in python3

I have an HTML page that repeats the same set of HTML with different data, and I need to get the value "709". I am able to get all the text inside the tr tag, but I don't know how to go inside the tr tag and get only the data in the td tag. Please help me. Below is the HTML code.
<table class="readonlydisplaytable">
<tbody>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Payer Phone #</th>
<td class="readonlydisplayfielddata">1234</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Name</th>
<td class="readonlydisplayfielddata">ABC SERVICES</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Package #</th>
<td class="readonlydisplayfielddata">709</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Case #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Date</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster Phone #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster Fax #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Body Part</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Deadline</th>
<td class="readonlydisplayfielddata">11/22/2014</td>
</tr>
</tbody>
</table>
Below is the code i used.
from selenium import webdriver
import os, time, csv, datetime
from selenium.webdriver.common.keys import Keys
import threading
import multiprocessing
from selenium.webdriver.support.select import Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import openpyxl
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
soup = BeautifulSoup(open("C:\\Users\\mapraveenkumar\\Desktop\\phonepayor.htm"), "html5lib")
a = soup.find_all("table", class_="readonlydisplaytable")
for b in a:
    c = b.find_all("tr", class_="readonlydisplayfield")
    for d in c:
        if "Package #" in d.get_text():
            print(d.get_text())

You want the text inside the td element adjacent to the th element that contains 'Package #'. I begin by looking for that string, then I find its parent and the parent's siblings. As usual, I find it easiest to work in an interactive environment when I'm trying to elucidate how to capture what I want. I suspect that the main point is to use find_all with string=.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> target = soup.find_all(string='Package #')
>>> target
['Package #']
>>> target[0].findParent()
<th class="readonlydisplayfieldlabel">Package #</th>
>>> target[0].findParent().fetchNextSiblings()
[<td class="readonlydisplayfielddata">709</td>]
>>> tds = target[0].findParent().fetchNextSiblings()
>>> tds[0].text
'709'

from bs4 import BeautifulSoup as bs
html = '''(HTML from the question above)'''
soup = bs(html,'lxml')
find_tr = soup.find_all('tr') # finds all 'tr' tags
for i in find_tr:
    for j in i.find_all('th'): # iterates through 'th' tags in the 'tr'
        print(j)
    for k in i.find_all('td'): # iterates through 'td' tags in the 'tr'
        print(k)
This should do the job. The outer for loop goes through each tr tag,
and for each tr, such as the example below, two inner loops find all of its th and td tags:
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Payer Phone #</th>
<td class="readonlydisplayfielddata">1234</td>
</tr>
This also works if a row has more than one td or th tag.
If each row has only one th and one td, we can do the following instead:
find_tr = soup.find_all('tr') # finds all tr
for i in find_tr: # goes through all tr
    print(i.th.text) # .th gives us the first th tag in this tr
    print(i.td.text) # .td gives us the first td tag; .text is its text
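Building on the per-row loop above, one possible way to get just "709" is to collect each row's th text and td text into a dictionary and then look up the label. This is a minimal sketch, assuming each row has one th and one td as in the question; the HTML is a shortened copy of the question's table and the name fields is made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
HTML = """
<table class="readonlydisplaytable">
<tbody>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Name</th>
<td class="readonlydisplayfielddata">ABC SERVICES</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Package #</th>
<td class="readonlydisplayfielddata">709</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Map each label (th text) to its value (td text).
fields = {}
for row in soup.find_all("tr", class_="readonlydisplayfield"):
    label = row.find("th")
    value = row.find("td")
    if label is not None and value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Package #"])
```

The dictionary makes every other field ("Name", "Deadline", and so on) available by label as well, at the cost of assuming labels are unique within the table.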

How to Display records in div table using AngularJs Like 1-10 out of 100,11-20 out of 100

I'm new to AngularJS. Can you please help me show records like "1-10 out of 100", and then "11-20 out of 100" when I click to the next page, and so on?
<b style="color:red">Items Search is : {{TotalRec.length}}</b>
<b> Toal Records Available {{GetDb.length}}</b>
<table class="table table-hover table-bordered">
<tr>
<th>Id</th>
<th>Name</th>
</tr>
<tr dir-paginate="ee in GetDb|orderBy:sortKey:reverse|filter:search|itemsPerPage:2|filter:query as TotalRec">
<td>{{ee.id}}</td>
<td>{{ee.Name}}</td>
</tr>
</table>

Have a look at this example :
https://github.com/rahil471/search-sort-and-pagination-angularjs
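Independently of the linked project, the label itself is plain arithmetic on the current page number, the page size and the total count. The sketch below shows that arithmetic in Python rather than AngularJS; the function name page_range_label is hypothetical, and page numbers are assumed to start at 1. The same expressions could be evaluated in a template from whatever variables hold the current page and page size.

```python
def page_range_label(page, per_page, total):
    """Return a label such as '11-20 out of 100' for a 1-based page number."""
    if total == 0:
        return "0 out of 0"
    start = (page - 1) * per_page + 1      # first record shown on this page
    end = min(page * per_page, total)      # last record, capped on the final page
    return f"{start}-{end} out of {total}"


print(page_range_label(1, 10, 100))   # 1-10 out of 100
print(page_range_label(2, 10, 100))   # 11-20 out of 100
print(page_range_label(10, 10, 95))   # 91-95 out of 95
```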
