Scraping ID attribute using rvest

Scraping ID attribute using rvest - html

I am trying to check if Polish elections are fair and candidates form opposition did not get abnormal low number of votes in districts with higher amount of invalid votes. To do so I need to scrape results of each district.
Link to official results of elections for my city - in the bottom table, each row is different district and by clicking you get redirected to district. The link is not usual <a ... hef = ...> format, but in the data-id=... is encoded the variable part of the link to districts.
My question is how to extract the data-id= attribute table on a webpage using R?
Sample data - in this example I would like to extract 697773 from row data
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
html_nodes('[class="table-responsive"]') %>%
html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
html_nodes('tr') %>%
html_attrs()
But I get named character(0) as a result

I found not very optimal solution. I bet there is better way!
I have downloaded webpage, saved it as txt file and read from there:
txt_webpage <- readChar(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"),
file.info(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"))$size)
posiotions <- gregexpr(pattern ='<tr data', txt_webpage)
districts_numbers <- c()
for (i in posiotions[[1]]) {
print (i)
tmp <- substr(txt_webpage, i + 10, i + 22)
tmp <- gsub('\\D+','', tmp)
districts_numbers <- c(districts_numbers, tmp)
}

Related

How to get HTML element that is before a certain class?

I'm scraping and having trouble getting the element of the “th” tag that comes before the other “th” element that contains the “type2” class. I prefer to take it by identifying that it is the element "th" before the "th" with class "type2" because my HTML has a lot of "th" and that was the only difference I found between the tables.
Using rvest or xml2 (or other R package), can I get this parent?
The content which I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>

The formal and more generalizable way to navigate xpath relative to a given node is via ancestor preceding-sibling:
read_html(htmldoc) %>%
html_nodes(xpath = "//th[#class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
html_text()
#> [1] "text_that_I_want"

We can look for the "type2" string in all <th>s, get the index of the first occurrence and substract 1 to get the index we want:
library(dplyr)
library(rvest)
location <- test%>%
html_nodes('th') %>%
str_detect("type2")
index_want <- min(which(location == TRUE) - 1)
test%>%
html_nodes('th') %>%
.[[index_want]] %>%
html_text()
[1] "text_that_I_want"

Get elements from table using XPath

I am trying to get information from this website
https://www.realtypro.co.za/property_detail.php?ref=1736
I have this table from which I want to take the number of bedrooms
<div class="panel panel-primary">
<div class="panel-heading">Property Details</div>
<div class="panel-body">
<table width="100%" cellpadding="0" cellspacing="0" border="0" class="table table-striped table-condensed table-tweak">
<tbody><tr>
<td class="xh-highlight">3</td><td style="width: 140px" class="">Bedrooms</td>
</tr>
<tr>
<td>Bathrooms</td>
<td>3</td>
</tr>
I am using this xpath expression:
bedrooms = response.xpath("//div[#class='panel panel-primary']/div[#class='panel-body']/table[#class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]/text()").extract_first()
However, I only get 'None' as output.
I have tried several combinations and I only get None as output. Any suggestions on what I am doing wrong?
Thanks in advance!

I would use bs4 4.7.1. where you can search with :contains for the td cell having the text "Bedrooms" then take the adjacent sibling td. You can add a test for is None for error handling. Less fragile than long xpath.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.realtypro.co.za/property_detail.php?ref=1736')
soup = bs(r.content, 'lxml')
print(int(soup.select_one('td:contains(Bedrooms) + td').text)
If position was fixed you could use
.table-tweak td + td

Try this and let me know if it works:
import lxml.html
response = [your code above]
beds = lxml.html.fromstring(response)
bedrooms = beds.xpath("//div[#class='panel panel-primary']/div[#class='panel-body']/table[#class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
bedrooms
Output:
['3']
EDIT:
Or possibly:
for bed in beds:
num_rooms = bed.xpath("//div[#class='panel panel-primary']/div[#class='panel-body']/table[#class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
print(num_rooms)

How to remove a hyperlink on a number on iphone, ipad?

Does anyone know how I can stop only some sets of numbers from being linked eg numéros - but also let the iphone link the real phone numbers?
<h2>Table</h2>
<table class="table table-sm table-striped table-hover">
<caption class="sr-only">Résultat(s) de recherche</caption>
<thead>
<tr>
<th scope="col">Nom</th>
<th scope="col">Numéro</th>
</tr>
</thead>
<tbody>
<tr tabindex="0" role="button" onclick="window.location='http://www.google.com'">
<td>123 Plomberie</td>
<td class="no_underline">25458-5172-53</td>
</tr>
</tbody>
</table>
This solution worked for all type of number, but how let the iphone link the real phone numbers : https://webmasters.stackexchange.com/questions/4459/keep-iphone-browser-from-turning-numbers-into-links

Not a solution but temporary workaround, its worked for all number. Add this in header of html
<meta name = "format-detection" content = "telephone=no">

How to find and bold a series of four letters in an html table

I'm using the R programming language.
I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
The end result I want is this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>
I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?
This is my monster-sized regex to find the sequence I want, but it doesn't seem to work. I now realize, of course, that I was going at it from the wrong direction.
`regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
`

Maybe try this approach instead of regular expressions:
library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>'
needles <- c("RSSI", "KMSV")
doc <- read_html(txt)
doc %>%
xml_find_all("//tr") %>%
keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%
xml_find_all("td/tt/text()") %>%
xml_add_parent("b")
write_html(doc, tf <- tempfile(fileext = ".html"))
shell.exec(tf) # open temp file on windows
This wraps each column text into <b>...</b> (and saves the result to a temporary file).
cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...

How to Display records in div table using AngularJs Like 1-10 out of 100,11-20 out of 100

Here i'm new for AngularJs can you please helpme how to show records Like 1-10 out of 100 if I click on next paging 11-20 out of 100.soo o to countinuous
<b style="color:red">Items Search is : {{TotalRec.length}}</b>
<b> Toal Records Available {{GetDb.length}}</b>
<table class="table table-hover table-bordered">
<tr>
<th>Id</th>
<th>Name</th>
</tr>
<tr dir-paginate="ee in GetDb|orderBy:sortKey:reverse|filter:search|itemsPerPage:2|filter:query as TotalRec">
<td>{{ee.id}}</td>
<td>{{ee.Name}}</td>
</tr>
</table>

Have a look at this example :
https://github.com/rahil471/search-sort-and-pagination-angularjs

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Scraping ID attribute using rvest - html

Related

How to get HTML element that is before a certain class?

Get elements from table using XPath

How to remove a hyperlink on a number on iphone, ipad?

How to find and bold a series of four letters in an html table

How to Display records in div table using AngularJs Like 1-10 out of 100,11-20 out of 100

Categories

Resources