How to retrieve data from the following HTML document structure in R - html
I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck at what to do after parsing, i.e. how to retrieve the specific nodes where the data is stored.
<thead>
<tr>
<th></th>
<th data-field="position"><a>Rank</a></th>
<th data-field="name"><a>Brand</a></th>
<th data-field="brandValue"><a>Brand Value</a></th>
<th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th>
<th data-field="revenue"><a>Brand Revenue</a></th>
<th data-field="advertising"><a>Company Advertising</a></th>
<th data-field="industry"><a>Industry</a></th>
</tr>
</thead>
This is the first part of the HTML I want to retrieve; it is the header for my tabular data.
<tbody id="list-table-body">
<tr class="data">
<td class="image"><img src="./Forbes_files/apple_100x100.jpg" alt=""></td>
<td class="rank">#1 </td>
<td class="name">Apple</td>
<td>$145.3 B</td>
<td>17%</td>
<td>$182.3 B</td>
<td>$1.2 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></td>
<td class="rank">#2 </td>
<td class="name">Microsoft</td>
<td>$69.3 B</td>
<td>10%</td>
<td>$93.3 B</td>
<td>$2.3 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><img src="./Forbes_files/google_100x100.jpg" alt=""></td>
<td class="rank">#3 </td>
<td class="name">Google</td>
<td>$65.6 B</td>
<td>16%</td>
<td>$61.8 B</td>
<td>$3 B</td>
<td>Technology</td>
</tr>
This portion of the HTML contains the data, i.e. rank, name, and the other statistics.
How can I retrieve both the header and the data shown above into a data frame? Is it possible to retrieve the images as well if I want to?
Edit: I looked a little harder and retrieved the data using xpathSApply on the rows whose class contains "data". I then removed "\t" and replaced "\n" with ",", which left me with a character vector:
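library(XML)
# Parse the local file and take the text of every <tr> whose class contains 'data'
fb1 <- htmlParse("forbes.html")
fb2 <- xpathSApply(fb1, "//tr[contains(@class,'data')]", xmlValue)
# Drop tabs, then turn newlines into commas
k3 <- gsub('\\t', '', fb2)
k3 <- gsub('\\n', ',', k3)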
Now k3 is a character vector holding my data:
> k3[1:5]
[1] ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,"
[2] ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,"
[3] ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,"
[4] ",#4 ,Coca-Cola,$56 B,0%,$23.1 B,$3.5 B,Beverages,"
[5] ",#5 ,IBM,$49.8 B,4%,$92.8 B,$1.3 B,Technology,"
How do I convert it to a data frame?
I also wanted the header at the top, but in this k3 character vector the header ends up at the bottom:
> tail(k3)
[1] ",#96 ,Lancome,$6.2 B,-2%,$4.5 B,-,Consumer Packaged Goods,"
[2] ",#97 ,KIA Motors,$6.2 B,-11%,$42.9 B,$992 M,Automotive,"
[3] ",#98 ,Sprite,$6.2 B,2%,$3.7 B,$3.5 B,Beverages,"
[4] ",#99 ,MTV,$6.2 B,6%,$3.4 B,$1 B,Media,"
[5] ",#100 ,Estee Lauder,$6.1 B,4%,$4.5 B,$2.8 B,Consumer Packaged Goods,"
[6] ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],
The Rank, Name, etc. entries were supposed to be the header.
I would appreciate any suggestions to improve my code, or alternatives.
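The question is about R, but the same extraction can be sketched in Python (the language the answers further down this page use) with only the standard library. This is a minimal sketch, not a definitive implementation: the class name `TableGrabber` is hypothetical, the sample markup is a trimmed copy of the HTML above, and the parser assumes every cell sits inside a `<tr>`. It collects the `<th>` texts as the header, the `<td>` texts as rows, and the `src` of each `<img>`.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect <th> texts as the header, <td> texts as rows, and <img> src values."""

    def __init__(self):
        super().__init__()
        self.header, self.rows, self.images = [], [], []
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the current <th>/<td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []
        elif tag == "img":
            self.images.append(dict(attrs).get("src"))

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            # Assumption: cells always appear inside a <tr>.
            (self.header if tag == "th" else self._row).append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            # Header rows leave _row empty, so only data rows are stored.
            self.rows.append(self._row)
            self._row = None


# Trimmed copy of the markup from the question.
sample = """
<table>
<thead><tr>
<th></th>
<th data-field="position"><a>Rank</a></th>
<th data-field="name"><a>Brand</a></th>
<th data-field="brandValue"><a>Brand Value</a></th>
</tr></thead>
<tbody id="list-table-body">
<tr class="data">
<td class="image"><img src="./Forbes_files/apple_100x100.jpg" alt=""></td>
<td class="rank">#1 </td>
<td class="name">Apple</td>
<td>$145.3 B</td>
</tr>
</tbody>
</table>
"""

grabber = TableGrabber()
grabber.feed(sample)
print(grabber.header)
print(grabber.rows)
print(grabber.images)
```

For the real file, the text passed to `feed` would come from reading forbes.html instead of the inline sample. Because the empty first `<th>` and the image `<td>` both produce an empty string, the header and each row keep the same number of columns.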
Related
Extract weather values from app.weathercloud.net
Hi all, I would like to extract the value 25.8 from this HTML block using XPath. The HTML is from a weather website, https://app.weathercloud.net/
<div id="gauge-rainrate"><h3>Intensidad de lluvia</h3><canvas id="rainrate" width="200" height="200"></canvas><div class="summary">
<table> <tbody>
<tr> <th> mm/h</th> <th class="max"><i class="icon-chevron-up icon-white"></i> Máx </th> </tr>
<tr> <td class="grey">Diaria</td> <td><a id="gauge-rainrate-max-day" rel="tooltip" title="" data-original-title="22/04/2022 00:00">0.0</a></td> </tr>
<tr> <td class="grey">Mensual</td> <td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td> </tr>
<tr> <td class="grey">Anual</td> <td><a id="gauge-rainrate-max-year" rel="tooltip" title="21/04/2022 02:15">25.8</a></td> </tr>
</tbody></table>
</div></div>
I use this expression to extract it in a Google Sheets cell:
=IMPORTXML("https://app.weathercloud.net/d5044837546#current";"//a[@id='gauge-rainrate-max-month']")
The expression appears to be correct, but my output is always "-" and I don't understand why...
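XPath attribute tests are written with `@` (as in `[@id='...']`). As a quick sanity check, the expression can be run against a saved copy of the markup. The sketch below uses Python's `xml.etree.ElementTree`, which supports a limited XPath subset, on a trimmed copy of the snippet. If the expression matches the static markup but the live page still yields "-", one possible explanation (an assumption, not verified here) is that the page fills in the value with JavaScript after loading, which a plain HTML fetch would not see.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the snippet from the question (static markup only).
snippet = """
<div id="gauge-rainrate">
  <table><tbody>
    <tr><td class="grey">Diaria</td>
        <td><a id="gauge-rainrate-max-day">0.0</a></td></tr>
    <tr><td class="grey">Mensual</td>
        <td><a id="gauge-rainrate-max-month">25.8</a></td></tr>
  </tbody></table>
</div>
"""

root = ET.fromstring(snippet)
# ".//a[@id='...']" selects the <a> element with that id anywhere below the root.
node = root.find(".//a[@id='gauge-rainrate-max-month']")
month_max = node.text
print(month_max)
```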
Python web scraping unstructured table
I am trying to extract some information from a table that appears on a webpage, but the table is unstructured, with the headers in rows and the content in columns, like this (my apologies for not disclosing the webpage):
<table class="table-detail">
<tbody>
<tr> <td colspan="4" class="noborder">General Information </td> </tr>
<tr> <th>Full name</th> <td> James Smith </td> <th>Year of birth</th> <td>1992</td> </tr>
<tr> <th>Gender</th> <td>Male</td> </tr>
<tr> <th>Place of birth</th> <td>TTexas, USA</td> <td> </td> <td> </td> </tr>
<tr> <th>Address</th> <td>Texas, USA</td> <td> </td> <td></td> </tr>
At the moment, I am able to extract the table using this script:
import pandas as pd
import requests
url = "example.com"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
df.to_csv('myfile.csv',encoding='utf-8-sig')
And the table essentially looks like the following:
However, I am a little stuck on how to achieve this in Python. I cannot seem to get my head around getting the data. The result I want is as below:
Any help would be appreciated. Thank you so much in advance.
You can use BeautifulSoup to parse the HTML. For example:
import pandas as pd
from bs4 import BeautifulSoup

txt = '''<table class="table-detail">
<tbody>
<tr> <td colspan="4" class="noborder">General Information </td> </tr>
<tr> <th>Full name</th> <td> James Smith </td> <th>Year of birth</th> <td>1992</td> </tr>
<tr> <th>Gender</th> <td>Male</td> </tr>
<tr> <th>Place of birth</th> <td>TTexas, USA</td> <td> </td> <td> </td> </tr>
<tr> <th>Address</th> <td>Texas, USA</td> <td> </td> <td></td> </tr>'''

soup = BeautifulSoup(txt, 'html.parser')
row = {}
for h in soup.select('th:has(+td)'):
    row[h.text] = h.find_next('td').get_text(strip=True)
df = pd.DataFrame([row])
print(df)
Prints:
     Full name Year of birth Gender Place of birth     Address
0  James Smith          1992   Male    TTexas, USA  Texas, USA
BeautifulSoup HTML scraping, how to get row after thead in tbody
I'm interested in learning web scraping, and right now I'm learning how to scrape a table from a website using BeautifulSoup. I have a simple HTML table to parse, but when I try to get a row from the tbody I always get the text from the thead instead. I'm wondering if anyone could take a look and see what's wrong. This is the HTML table:
<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
<tr role="row">
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
</tr>
</thead>
<tbody>
<tr role="row" class="odd"> <td class="text-center">1</td> <td class="text-center">AALI</td> <td>Astra Agro Lestari Tbk</td> <td>09 Des 1997</td> </tr>
<tr role="row" class="even"> <td class="text-center">2</td> <td class="text-center">ABBA</td> <td>Mahaka Media Tbk</td> <td>03 Apr 2002</td> </tr>
I'm really sorry; I've already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still don't get it, and I get '[ ]' as output. Here's the link I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/
I want to get this row:
<tr role="row" class="odd"> <td class="text-center">1</td> <td class="text-center">AALI</td> <td>Astra Agro Lestari Tbk</td> <td>09 Des 1997</td> </tr>
Here's my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id = "companyTable")
table = table[0]
for row in table.findAll('tr'):
    for cell in row.findAll('th'):
        print(cell.text)
You just need the first tr in the tbody tag. So I'd use this:
first_row = s.find('tbody').find('tr')
where s is the soup in my case. Here's an example:
>>> from bs4 import BeautifulSoup
>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
... <tr role="row">
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
... </tr>
... </thead>
... <tbody>
... <tr role="row" class="odd">
... <td class="text-center">1</td>
... <td class="text-center">AALI</td>
... <td>Astra Agro Lestari Tbk</td>
... <td>09 Des 1997</td>
... </tr>
... <tr role="row" class="even">
... <td class="text-center">2</td>
... <td class="text-center">ABBA</td>
... <td>Mahaka Media Tbk</td>
... <td>03 Apr 2002</td>
... </tr>
... """
>>> s = BeautifulSoup(html, 'html.parser')
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
It works because find returns only the first element that matches.
Solving the problem
If I understood correctly, you just want to get the table data from this site. However, after inspecting the site and analyzing the requests and responses in the browser's network tools, I found that the site uses DataTables and fills the table using JS, with the response from a separate request. In other words, you could simply have done:
import requests
url = "https://www.idx.co.id/umbraco/Surface/Helper/GetEmiten?emitenType=s"
response = requests.get(url)
print(response.json())
What you should learn from this
Inspect the page elements and the requests/responses to find the easiest way to get the data. The tool I suggest is Chrome DevTools, but you may use whichever browser suits you best.
table with one header and data from Array inside a Map with thymeleaf
I want to create a table whose data is a Map< String, List < Object> >. The table should have a single header, and each row should contain the data laid out as:
Map.key Object.item1 Object.item2 Object.item3
Since the value is a List of Object, I want one row for every Object in the List, with the Map.key repeated on each row. So I need to iterate through the keys, like this:
<table>
<thead>
<tr> <th>Code</th> <th>Status</th> <th>Flag</th> <th>Message</th> </tr>
</thead>
<tbody>
<tr th:each= "result : ${myMap}">
<td th:text="${result.key}"></td>
<td><table>
<tr th:each="obj: ${result.value}">
<td th:text="${not #lists.isEmpty(obj.errorList)}?'Error':'Warning'"></td>
<td th:text="${obj.flag}==true?'YES':'NO'"></td>
<td th:text="${not #lists.isEmpty(obj.errorList)}?${obj.warningList}:${obj.errorList}"></td>
</tr>
</table></td>
</tr>
</tbody>
</table>
But this solution places a table inside a table. I want to use one header, iterate over the lists, and place the values in the main table.
I think you're looking for a structure like this:
<table>
<thead>
<tr> <th>Code</th> <th>Status</th> <th>Flag</th> <th>Message</th> </tr>
</thead>
<tbody>
<th:block th:each= "result : ${myMap}">
<tr th:each="obj: ${result.value}">
<td th:text="${result.key}" />
<td th:text="${not #lists.isEmpty(obj.errorList)}?'Error':'Warning'" />
<td th:text="${obj.flag}==true?'YES':'NO'" />
<td th:text="${not #lists.isEmpty(obj.errorList)}?${obj.warningList}:${obj.errorList}" />
</tr>
</th:block>
</tbody>
</table>
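The nested iteration in that template (an outer loop over the map entries that emits no element of its own, and an inner loop that emits one row per list item) can be illustrated outside Thymeleaf. The Python sketch below is only an illustration of the flattening logic: the `Result` class is a hypothetical stand-in for the Object in the question, modelling just the fields the template reads, and the Message column is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    # Hypothetical stand-in for the Object in the question.
    flag: bool
    errorList: list = field(default_factory=list)
    warningList: list = field(default_factory=list)


def flatten(my_map):
    """Yield one (code, status, flag) row per object, repeating the map key."""
    for key, objects in my_map.items():   # outer th:each over the map entries
        for obj in objects:               # inner th:each over each entry's list
            status = "Error" if obj.errorList else "Warning"
            yield (key, status, "YES" if obj.flag else "NO")


my_map = {
    "A1": [Result(True, errorList=["bad"]), Result(False)],
    "B2": [Result(True)],
}
rows = list(flatten(my_map))
for row in rows:
    print(row)
```

Each key is repeated once per object in its list, which is the single flat table the question asks for.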
PowerShell / Html - questions
I am new to PowerShell, and I am not good at HTML. There's a page with a table in which each cell has an a href link. The value of the link is dynamic, but the link I want to click automatically is always in the first cell. I know there's cellIndex in HTML/JS; is it usable in PS? For example, let's say I have this table on a website:
<table>
<tr> <td> <a href="http://example1.com"> <div style="height:100%;width:100%"> hello world1 </div> </a> </td> </tr>
<tr> <td> <a href="http://example2.com"> <div style="height:100%;width:100%"> hello world2 </div> </a> </td> </tr>
<tr> <td> <a href="http://example3.com"> <div style="height:100%;width:100%"> hello world3 </div> </a> </td> </tr>
</table>
I want PowerShell to always click on the first link, even though the link inside is dynamic. Any ideas? Hints?
The result of Invoke-WebRequest has a property named Links that is a collection of all the hyperlinks on a web page. For example:
$Web = Invoke-WebRequest -Uri 'http://wragg.io'
$Web.Links | Select innertext,href
Returns:
innerText  href
---------  ----
Mark Wragg http://wragg.io
Twitter    https://twitter.com/markwragg
Github     https://github.com/markwragg
LinkedIn   https://uk.linkedin.com/in/mwragg
If the link you want to capture is always the first in this list, you could get it by doing:
$Web.Links[0].href
If it's the second, use [1]; the third, [2]; and so on. I don't think there is an equivalent of "cellindex", although there is a property named AllElements that you can access via an array index. E.g. if you wanted the third element on the page you could do:
$Web.AllElements[2]
If you need to get to a specific table in the page and then access links inside that table, you'd probably need to iterate through the AllElements property until you reach the table you want. For example, if you know the links are in the third table on the page:
$Links = @()
$TableCount = 0
$Web.AllElements | ForEach-Object {
    If ($_.tagname -eq 'table'){ $TableCount++ }
    If ($TableCount -eq 3){
        If ($_.tagname -eq 'a') { $Links += $_ }
    }
}
$Links | Select -First 1
OK, the Invoke-WebRequest method works with Mark's link but not with my page. However, I noticed a pattern that might be usable. I noticed the following:
<table id="row" class="simple">
<thead>
<tr>
<th></th>
<th class="centerjustify">File Name</th>
<th class="centerjustify">File ID</th>
<th class="datetime">Creation Date</th>
<th class="datetime">Upload Date</th>
<th class="centerjustify">Processing Status</th>
<th class="centerjustify">Exceptions</th>
<th class="centerjustify">Unprocessed Count</th>
<th class="centerjustify">Discarded Count</th>
<th class="centerjustify">Rejected Count</th>
<th class="centerjustify">Void Count</th>
<th class="centerjustify">PO Total Count</th>
<th class="centerjustify">PO Total Amount</th>
<th class="centerjustify">CM Total Count</th>
<th class="centerjustify">CM Total Amount</th>
<th class="centerjustify">PO Processed Count</th>
<th class="centerjustify">PO Processed Amount</th>
<th class="centerjustify">CM Processed Count</th>
<th class="centerjustify">CM Processed Amount</th>
<th class="centerjustify">Counts At Upload</th></tr></thead>
<tbody>
<tr class="odd">
<td><input type="radio" disabled="disabled" name="checkedValue" value="12047" /></td>
<td class="leftjustify textColorBlack"> 520170123000000_520170123000000_20170327_01.txt</td>
<td class="centerjustify textColorBlack">1</td>
<td class="datetime textColorBlack">Mar 27, 2017 0:00</td>
<td class="datetime textColorBlack">Mar 27, 2017 10:33:24 PM +03:00</td>
<td class="centerjustify textColorBlack">
The fId part in "loadConfirmationDetails.htm?fId=12047" is dynamic, and it is the last part of the next page's URL. For example: https://aaa.xxxxxxx.com/aaa/community/loadConfirmationDetails.htm?fId=12047
The table's ID is unique; it is called "row". I wonder if I can take a completely different approach: instead of clicking through the page, automatically copy this id from the page's source HTML and concatenate it with the main link? I am really out of ideas beyond that.
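The "copy the id from the source and append it to the link" idea can be sketched as plain string handling. The example below is in Python rather than PowerShell and rests on two assumptions that the post does not confirm: that the `value` of the first `checkedValue` radio input is always the same number as the fId, and that `name` directly precedes `value` in the tag, as it does in the snippet above. A regular expression is used only because the fragment is small; it is not a general HTML parser.

```python
import re

# Base URL taken from the example in the post.
BASE = "https://aaa.xxxxxxx.com/aaa/community/loadConfirmationDetails.htm?fId="

# Trimmed copy of the first table row from the post.
html = '''<table id="row" class="simple"> <tbody> <tr class="odd">
<td><input type="radio" disabled="disabled" name="checkedValue" value="12047" /></td>
</tr></tbody></table>'''

# First radio input named checkedValue; its value is ASSUMED to be the fId.
match = re.search(r'name="checkedValue"\s+value="(\d+)"', html)
detail_url = BASE + match.group(1) if match else None
print(detail_url)
```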