I am trying to check whether the Polish elections were fair, specifically whether opposition candidates received an abnormally low number of votes in districts with a higher number of invalid votes. To do so I need to scrape the results for each district.
Link to the official election results for my city: in the bottom table, each row is a different district, and clicking a row redirects you to that district's page. The link is not in the usual <a ... href=...> format; instead, the variable part of the district URL is encoded in the data-id=... attribute.
My question is: how do I extract the data-id attribute from a table on a webpage using R?
Sample data (in this example I would like to extract 697773 from the row):
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
html_nodes('[class="table-responsive"]') %>%
html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
html_nodes('tr') %>%
html_attrs()
But I get named character(0) as a result.
I found a solution, though not a very good one. I bet there is a better way!
I downloaded the webpage, saved it as a .txt file, and read it from there:
# Read the whole saved page into a single string
path <- file.path(getwd(), "Wyniki pierwszego głosowania _ Wrocław.txt")
txt_webpage <- readChar(path, file.info(path)$size)
# Find the start position of every '<tr data' occurrence
positions <- gregexpr(pattern = '<tr data', txt_webpage)
districts_numbers <- c()
for (i in positions[[1]]) {
  print(i)
  # Take the characters just after '<tr data-id=' and keep only the digits
  tmp <- substr(txt_webpage, i + 10, i + 22)
  tmp <- gsub('\\D+', '', tmp)
  districts_numbers <- c(districts_numbers, tmp)
}
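The fixed-offset substring above depends on the exact character positions in the saved file. As an alternative, here is a minimal sketch of reading the data-id attribute through an HTML parser. It is written in Python with the standard library's html.parser rather than R, and the names RowIdParser and extract_row_ids are made up for illustration. It assumes the rows are present in the HTML you feed it, which may not hold if the page builds the table with JavaScript.

```python
from html.parser import HTMLParser


class RowIdParser(HTMLParser):
    """Collect the data-id attribute of every <tr> that has one."""

    def __init__(self):
        super().__init__()
        self.row_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "tr":
            data_id = dict(attrs).get("data-id")
            if data_id is not None:
                self.row_ids.append(data_id)


def extract_row_ids(html_text):
    """Return the data-id values of all table rows, in document order."""
    parser = RowIdParser()
    parser.feed(html_text)
    parser.close()
    return parser.row_ids


# Shortened version of the sample markup from the question
sample = (
    '<table><thead><tr role="row"><th>Numer</th></tr></thead>'
    '<tbody><tr data-id="697773" role="row" class="odd">'
    "<td>1</td><td>Szkoła Podstawowa nr 63</td></tr></tbody></table>"
)
district_ids = extract_row_ids(sample)
print(district_ids)
```

Header rows are skipped automatically because they carry no data-id attribute.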
I'm interested in learning about web scraping. Right now I'm learning how to scrape a table from a website, using BeautifulSoup.
I have a simple HTML table to parse, but when I try to get the rows in tbody I always get the text from the thead ones instead. I'm wondering if anyone could take a look and see what's wrong. Here is the HTML table I'm working with:
<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
<tr role="row">
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
</tr>
</thead>
<tbody>
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
<tr role="row" class="even">
<td class="text-center">2</td>
<td class="text-center">ABBA</td>
<td>Mahaka Media Tbk</td>
<td>03 Apr 2002</td>
</tr>
I'm really sorry; I've already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still don't get it and get '[ ]' as output.
Here's the link to the page I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/
I want to get this row:
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
I try to get it but always get the text from the thead instead.
Here's my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id = "companyTable")
table = table[0]
for row in table.findAll('tr'):
for cell in row.findAll('th'):
print(cell.text)
You just need the first tr in the tbody tag. So I'd use this:
first_row = s.find('tbody').find('tr')
Where s is the soup in my case. Here's an example:
>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
... <tr role="row">
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
... </tr>
... </thead>
... <tbody>
... <tr role="row" class="odd">
... <td class="text-center">1</td>
... <td class="text-center">AALI</td>
... <td>Astra Agro Lestari Tbk</td>
... <td>09 Des 1997</td>
... </tr>
... <tr role="row" class="even">
... <td class="text-center">2</td>
... <td class="text-center">ABBA</td>
... <td>Mahaka Media Tbk</td>
... <td>03 Apr 2002</td>
... </tr>
... """
>>> s = BeautifulSoup(html)
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
It works because find only returns the first element that matches.
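If you want every body row rather than just the first, the same idea extends to collecting all matches. Below is a rough, dependency-free sketch that uses the standard library's html.parser (swapped in for BeautifulSoup so that it runs without extra packages). BodyRowParser is a made-up name, and the HTML is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser


class BodyRowParser(HTMLParser):
    """Collect the cell text of every <tr> inside <tbody>, skipping <thead>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows.append([])  # start a new body row
        elif tag == "td" and self.in_tbody and self.rows:
            self.in_cell = True
            self.rows[-1].append("")  # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False
        elif tag == "td" and self.in_cell:
            self.in_cell = False
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        # Only text inside a body cell is kept
        if self.in_cell:
            self.rows[-1][-1] += data


html = """<table id="companyTable">
<thead><tr role="row"><th>No</th><th>Kode/Nama Perusahaan</th><th>Nama</th><th>Tanggal Pencatatan</th></tr></thead>
<tbody>
<tr role="row" class="odd"><td class="text-center">1</td><td class="text-center">AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Des 1997</td></tr>
<tr role="row" class="even"><td class="text-center">2</td><td class="text-center">ABBA</td><td>Mahaka Media Tbk</td><td>03 Apr 2002</td></tr>
</tbody>
</table>"""

parser = BodyRowParser()
parser.feed(html)
parser.close()
body_rows = parser.rows
print(body_rows[0])
```

The th cells never reach the result because cells are only recorded while the parser is inside tbody.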
Solving the problem
If I understood correctly, you just want to get the table data from this site. However, after inspecting the site and analyzing the requests and responses in the Network tab of the browser's developer tools, I found that the site uses DataTables and fills the table with JavaScript, using the response from a separate request.
In other words, you could simply have done:
import requests
url = "https://www.idx.co.id/umbraco/Surface/Helper/GetEmiten?emitenType=s"
response = requests.get(url)
print(response.json())
What you should learn from this
Inspect the page elements and the requests/responses to find the easiest way to get the data. The tool I suggest is Chrome DevTools, but you may use whichever browser suits you best.
I'm using the R programming language.
I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
The end result I want is this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>
I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?
This is my monster-sized regex to find the sequence I want, but it doesn't seem to work. I now realize, of course, that I was going at it from the wrong direction.
regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
Maybe try this approach instead of regular expressions:
library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>'
needles <- c("RSSI", "KMSV")
doc <- read_html(txt)
doc %>%
xml_find_all("//tr") %>%
keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%
xml_find_all("td/tt/text()") %>%
xml_add_parent("b")
write_html(doc, tf <- tempfile(fileext = ".html"))
shell.exec(tf) # open temp file on windows
This wraps each column text into <b>...</b> (and saves the result to a temporary file).
cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...
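A different way to think about the problem is to read the letters out of the cells, run the short motif regex on the resulting plain string, and then map the matched positions back to the cells. Below is a minimal sketch of that idea in Python (not R, and not the xml2 approach above). It assumes every residue sits in a cell of the exact form <tt>X</tt> with a single uppercase letter, and bold_motifs is a made-up name. Note that re.finditer reports non-overlapping matches only.

```python
import re

MOTIF = re.compile(r"[KR].[ST][ILV]")   # the four-letter pattern from the question
CELL = re.compile(r"<tt>([A-Z])</tt>")  # one residue per table cell (assumed form)


def bold_motifs(html_text):
    """Wrap in <b>...</b> every cell letter that is part of a motif match."""
    cells = list(CELL.finditer(html_text))
    # Plain sequence: one character per cell, in document order
    sequence = "".join(cell.group(1) for cell in cells)

    # Indexes (into the sequence, hence into cells) covered by a match
    hit = set()
    for match in MOTIF.finditer(sequence):
        hit.update(range(match.start(), match.end()))

    # Rebuild the HTML, replacing only the cells that were hit
    out = []
    last = 0
    for i, cell in enumerate(cells):
        out.append(html_text[last:cell.start()])
        if i in hit:
            out.append("<tt><b>" + cell.group(1) + "</b></tt>")
        else:
            out.append(cell.group(0))
        last = cell.end()
    out.append(html_text[last:])
    return "".join(out)


sample = (
    "<tr>\n"
    '<td bgcolor=""><tt>A</tt></td>\n'
    '<td bgcolor="lightgreen"><tt>R</tt></td>\n'
    '<td bgcolor=""><tt>S</tt></td>\n'
    '<td bgcolor="pink"><tt>S</tt></td>\n'
    '<td bgcolor=""><tt>I</tt></td>\n'
    '<td bgcolor=""><tt>G</tt></td>\n'
    "</tr>"
)
bolded = bold_motifs(sample)
print(bolded)
```

Because the regex only ever sees the bare letters, the bgcolor attributes and other markup cannot interfere with the match.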
I am new to PowerShell, and I'm bad at HTML.
There's a page with a table in which each cell contains an <a href> link. The link value is dynamic, but the link I want to click automatically is always in the first cell.
I know there's cellIndex in HTML/JS; is it usable in PowerShell?
For example, let's say I have this table on a website.
<table>
<tr>
<td>
<a href="http://example1.com">
<div style="height:100%;width:100%">
hello world1
</div>
</a>
</td>
</tr>
<tr>
<td>
<a href="http://example2.com">
<div style="height:100%;width:100%">
hello world2
</div>
</a>
</td>
</tr>
<tr>
<td>
<a href="http://example3.com">
<div style="height:100%;width:100%">
hello world3
</div>
</a>
</td>
</tr>
</table>
And I want PowerShell to always click the first link, even though the link itself is dynamic.
Any ideas? Hints?
The result of Invoke-WebRequest returns a property named Links that is a collection of all the hyperlinks on a web page.
For example:
$Web = Invoke-WebRequest -Uri 'http://wragg.io'
$Web.Links | Select innertext,href
Returns:
innerText  href
---------  ----
Mark Wragg http://wragg.io
Twitter    https://twitter.com/markwragg
Github     https://github.com/markwragg
LinkedIn   https://uk.linkedin.com/in/mwragg
If the link you want to capture is always the first in this list you could get it by doing:
$Web.Links[0].href
If it's the second link, use [1]; for the third, [2]; and so on.
I don't think there is an equivalent of "cellindex", although there is a property named AllElements that you can access via an array index. For example, if you wanted the third element on the page (indexes start at 0) you could do:
$Web.AllElements[2]
If you need to get to a specific table in the page and then access links inside of that table you'd probably need to iterate through the AllElements property until you reached the table you wanted. For example if you know the links were in the third table on the page:
$Links = @()
$TableCount = 0
$Web.AllElements | ForEach-Object {
If ($_.tagname -eq 'table'){ $TableCount++ }
If ($TableCount -eq 3){
If ($_.tagname -eq 'a') {
$Links += $_
}
}
}
$Links | Select -First 1
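For comparison, the same "count the tables, then collect the links" logic can be sketched outside PowerShell. The version below uses Python's standard html.parser. TableLinkParser is a made-up name, nested tables are not handled, and the markup is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect href values of <a> tags inside the n-th <table> (1-based).

    Simplification: nested tables are not handled.
    """

    def __init__(self, table_number):
        super().__init__()
        self.table_number = table_number
        self.tables_seen = 0
        self.inside_target = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_seen += 1
            self.inside_target = self.tables_seen == self.table_number
        elif tag == "a" and self.inside_target:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table":
            self.inside_target = False


page = """
<table>
<tr><td><a href="http://example1.com"><div>hello world1</div></a></td></tr>
<tr><td><a href="http://example2.com"><div>hello world2</div></a></td></tr>
<tr><td><a href="http://example3.com"><div>hello world3</div></a></td></tr>
</table>
"""

parser = TableLinkParser(table_number=1)
parser.feed(page)
parser.close()
first_link = parser.links[0] if parser.links else None
print(first_link)
```

Unlike the loop above, this stops collecting as soon as the target table closes, so links that appear between two tables are not picked up.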
OK, the Invoke-WebRequest method works with Mark's link but not with my page. However, I noticed a pattern that might be usable:
<table id="row" class="simple">
<thead>
<tr>
<th></th>
<th class="centerjustify">File Name</th>
<th class="centerjustify">File ID</th>
<th class="datetime">Creation Date</th>
<th class="datetime">Upload Date</th>
<th class="centerjustify">Processing Status</th>
<th class="centerjustify">Exceptions</th>
<th class="centerjustify">Unprocessed Count</th>
<th class="centerjustify">Discarded Count</th>
<th class="centerjustify">Rejected Count</th>
<th class="centerjustify">Void Count</th>
<th class="centerjustify">PO Total Count</th>
<th class="centerjustify">PO Total Amount</th>
<th class="centerjustify">CM Total Count</th>
<th class="centerjustify">CM Total Amount</th>
<th class="centerjustify">PO Processed Count</th>
<th class="centerjustify">PO Processed Amount</th>
<th class="centerjustify">CM Processed Count</th>
<th class="centerjustify">CM Processed Amount</th>
<th class="centerjustify">Counts At Upload</th></tr></thead>
<tbody>
<tr class="odd">
<td><input type="radio" disabled="disabled" name="checkedValue" value="12047" /></td>
<td class="leftjustify textColorBlack">
520170123000000_520170123000000_20170327_01.txt</td>
<td class="centerjustify textColorBlack">1</td>
<td class="datetime textColorBlack">Mar 27, 2017 0:00</td>
<td class="datetime textColorBlack">Mar 27, 2017 10:33:24 PM +03:00</td>
<td class="centerjustify textColorBlack">
The fId part in "loadConfirmationDetails.htm?fId=12047" is dynamic, and it forms the last part of the next page's URL.
For example: "https://aaa.xxxxxxx.com/aaa/community/loadConfirmationDetails.htm?fId=12047"
The table's ID is unique ("row"). I wonder if I can take a completely different approach: instead of invoking the web page, automatically copy this id info from its source HTML and concatenate it with the main link?
I am really out of ideas beyond that.
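The "copy the id and concatenate it" idea can be sketched as follows. This is an illustration in Python (standard library only), not PowerShell. It assumes that the value of the checkedValue radio button is the same number as fId, which the snippet above suggests (both are 12047) but does not prove. BASE_URL is copied from the example link, and RadioValueParser is a made-up name.

```python
from html.parser import HTMLParser

# Fixed part of the link, taken from the example URL in the question
BASE_URL = "https://aaa.xxxxxxx.com/aaa/community/loadConfirmationDetails.htm?fId="


class RadioValueParser(HTMLParser):
    """Collect the value of named <input> elements inside one table."""

    def __init__(self, table_id, input_name):
        super().__init__()
        self.table_id = table_id
        self.input_name = input_name
        self.in_table = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.in_table = attrs.get("id") == self.table_id
        elif tag == "input" and self.in_table:
            if attrs.get("name") == self.input_name and attrs.get("value"):
                self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False


source = """
<table id="row" class="simple">
<tbody>
<tr class="odd">
<td><input type="radio" disabled="disabled" name="checkedValue" value="12047" /></td>
<td class="centerjustify textColorBlack">1</td>
</tr>
</tbody>
</table>
"""

parser = RadioValueParser(table_id="row", input_name="checkedValue")
parser.feed(source)
parser.close()
# Assumption: the first radio value is the fId of the row we want
detail_url = BASE_URL + parser.values[0] if parser.values else None
print(detail_url)
```

The page source still has to be fetched somehow (for example with an authenticated session), which this sketch does not cover.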
I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck at what to do after parsing, i.e. how to retrieve the specific nodes where the data is stored.
<thead>
<tr>
<th></th>
<th data-field="position"><a>Rank</a></th>
<th data-field="name"><a>Brand</a></th>
<th data-field="brandValue"><a>Brand Value</a></th>
<th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th>
<th data-field="revenue"><a>Brand Revenue</a></th>
<th data-field="advertising"><a>Company Advertising</a></th>
<th data-field="industry"><a>Industry</a></th>
</tr>
</thead>
This is the first part of the HTML I want to retrieve; it is the header of my tabular data.
<tbody id="list-table-body">
<tr class="data">
<td class="image"><img src="./Forbes_files/apple_100x100.jpg" alt=""></td>
<td class="rank">#1 </td>
<td class="name">Apple</td>
<td>$145.3 B</td>
<td>17%</td>
<td>$182.3 B</td>
<td>$1.2 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></td>
<td class="rank">#2 </td>
<td class="name">Microsoft</td>
<td>$69.3 B</td>
<td>10%</td>
<td>$93.3 B</td>
<td>$2.3 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><img src="./Forbes_files/google_100x100.jpg" alt=""></td>
<td class="rank">#3 </td>
<td class="name">Google</td>
<td>$65.6 B</td>
<td>16%</td>
<td>$61.8 B</td>
<td>$3 B</td>
<td>Technology</td>
</tr>
This portion of the HTML contains the data, i.e. rank, name, and the other statistics.
How can I retrieve both the header and the data shown above into a data frame? Is it possible to retrieve the images as well if I want to?
Edit: I looked a little harder and retrieved the data using xpathSApply on the rows whose class contains "data". I then removed "\t" and replaced "\n" with commas, which left me with a character vector:
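library(XML)
# Parse the local HTML file
fb1 <- htmlParse("forbes.html")
# Extract the text of every <tr> whose class attribute contains 'data'
fb2 <- xpathSApply(fb1, "//tr[contains(@class,'data')]", xmlValue)
# Drop tabs and turn newlines into commas
k3 <- gsub('\\t', '', fb2)
k3 <- gsub('\\n', ',', k3)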
Now k3 is a character vector with my data:
> k3[1:5]
[1] ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,"
[2] ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,"
[3] ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,"
[4] ",#4 ,Coca-Cola,$56 B,0%,$23.1 B,$3.5 B,Beverages,"
[5] ",#5 ,IBM,$49.8 B,4%,$92.8 B,$1.3 B,Technology,"
How do I convert it to a data frame?
I also wanted the header at the top, but in this k3 character vector the header is at the bottom.
> tail(k3)
[1] ",#96 ,Lancome,$6.2 B,-2%,$4.5 B,-,Consumer Packaged Goods,"
[2] ",#97 ,KIA Motors,$6.2 B,-11%,$42.9 B,$992 M,Automotive,"
[3] ",#98 ,Sprite,$6.2 B,2%,$3.7 B,$3.5 B,Beverages,"
[4] ",#99 ,MTV,$6.2 B,6%,$3.4 B,$1 B,Media,"
[5] ",#100 ,Estee Lauder,$6.1 B,4%,$4.5 B,$2.8 B,Consumer Packaged Goods,"
[6] ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],
The rank, name, etc. part was supposed to be the header.
I would welcome any suggestions to improve my code, or alternatives.
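To illustrate the reshaping step, here is a small sketch in Python (not R) that turns strings shaped like k3 into a header plus records. It assumes that no field contains a comma and that the bracketed template row comes last, as in the output above. rows_to_table is a made-up name, and the sample lines are copied from the printed k3 values.

```python
def rows_to_table(lines):
    """Split comma-joined row strings into (header, records).

    Assumes the last line is the bracketed header template
    and that no field contains a comma itself.
    """
    records = []
    for line in lines:
        # Drop the leading/trailing commas left over from the newline replacement
        fields = line.strip(",").split(",")
        records.append([field.strip() for field in fields])
    # Move the template row to the top and remove the surrounding brackets
    header = [name.strip("[]") for name in records[-1]]
    return header, records[:-1]


k3 = [
    ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,",
    ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,",
    ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,",
    ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],",
]

header, data = rows_to_table(k3)
# One dictionary per brand, keyed by column name
table = [dict(zip(header, record)) for record in data]
print(header)
print(table[0]["NAME"])
```

The image cell contributes only an empty leading field, which is why stripping the outer commas lines the remaining seven fields up with the seven header names.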