Python web scraping unstructured table

I am trying to extract some information from a table on a webpage, but the table is unstructured: each row holds header cells (th) followed by their values (td), instead of the headers sitting in one row with the values in columns beneath them. (My apologies for not disclosing the webpage.) The HTML looks like this:
<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>
At the moment, I am able to extract the table by using this script:
import pandas as pd
import requests
url = "https://example.com"  # placeholder; requests needs the full URL, scheme included
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
df.to_csv('myfile.csv',encoding='utf-8-sig')
And the table essentially looks like the following:
However, I am stuck on how to reshape this in Python, and I cannot get my head around pulling the data out in the form I need. The result I want is as below:
Any help would be appreciated. Thank you so much in advance.

You can use BeautifulSoup to parse the HTML. For example:
import pandas as pd
from bs4 import BeautifulSoup
txt = '''<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>'''
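soup = BeautifulSoup(txt, 'html.parser')
# Map each header cell to the text of the <td> that directly follows it
row = {}
for h in soup.select('th:has(+td)'):
    row[h.get_text(strip=True)] = h.find_next('td').get_text(strip=True)
# A single dict becomes a one-row DataFrame with the headers as columns
df = pd.DataFrame([row])
print(df)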
Prints:
     Full name  Year of birth  Gender  Place of birth     Address
0  James Smith           1992    Male     TTexas, USA  Texas, USA
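If BeautifulSoup is not available, the same idea (pair each th with the first td that follows it) can be sketched with the standard library alone. This is only a rough alternative, not the method used above: the class name HeaderValueParser and the shortened sample markup are made up for illustration, and it assumes cells are not nested inside each other.

```python
from html.parser import HTMLParser


class HeaderValueParser(HTMLParser):
    """Collect {th text: text of the next td} pairs from a table fragment."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None      # "th" or "td" while inside such a cell
        self._header = None   # most recent header still waiting for a value
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "th":
            self._header = text
        elif self._header is not None:
            # The first <td> after a header is taken as that header's value
            self.pairs[self._header] = text
            self._header = None
        self._tag = None


# Shortened copy of the table from the question (illustrative only)
sample = """<table class="table-detail"><tbody>
<tr><td colspan="4" class="noborder">General Information</td></tr>
<tr><th>Full name</th><td>
James Smith
</td><th>Year of birth</th><td>1992</td></tr>
<tr><th>Gender</th><td>Male</td></tr>
<tr><th>Address</th><td>Texas, USA</td><td> </td><td></td></tr>
</tbody></table>"""

parser = HeaderValueParser()
parser.feed(sample)
parser.close()
record = parser.pairs
print(record)
```

The resulting dict can be passed to pd.DataFrame([record]) in the same way as the row dict above.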

Related

Extract weather values from app.weathercloud.net

Hi all, I would like to extract the value 25.8 from this HTML block using XPath.
The HTML is from a weather website, https://app.weathercloud.net/:
<div id="gauge-rainrate"><h3>Intensidad de lluvia</h3><canvas id="rainrate" width="200" height="200"></canvas><div class="summary">
<table>
<tbody><tr>
<th> mm/h</th>
<th class="max"><i class="icon-chevron-up icon-white"></i> Máx </th>
</tr>
<tr>
<td class="grey">Diaria</td>
<td><a id="gauge-rainrate-max-day" rel="tooltip" title="" data-original-title="22/04/2022 00:00">0.0</a></td>
</tr>
<tr>
<td class="grey">Mensual</td>
<td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
<tr>
<td class="grey">Anual</td>
<td><a id="gauge-rainrate-max-year" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
</tbody></table>
</div></div>
I use this expression to extract the value into a Google Sheets cell:
=IMPORTXML("https://app.weathercloud.net/d5044837546#current";"//a[@id='gauge-rainrate-max-month']")
The expression appears to be correct, but my output is always
-
I don't understand why...
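One way to narrow this down is to check the XPath predicate against a static copy of the markup. The sketch below does that with Python's standard library on a trimmed version of the snippet (trimmed by hand for illustration; it happens to be well-formed, so an XML parser can read it). If the predicate matches here, the expression itself is probably fine, and one possible explanation, which is an assumption and not verified against the live site, is that the page fills in the numbers with JavaScript after loading, so a fetch of the raw HTML would only see the "-" placeholder.

```python
import xml.etree.ElementTree as ET

# Trimmed, static copy of the snippet from the question
snippet = """<div id="gauge-rainrate"><div class="summary">
<table>
<tbody>
<tr><td class="grey">Diaria</td>
<td><a id="gauge-rainrate-max-day" rel="tooltip" title="">0.0</a></td></tr>
<tr><td class="grey">Mensual</td>
<td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td></tr>
</tbody></table>
</div></div>"""

root = ET.fromstring(snippet)
# ElementTree's limited XPath supports [@attribute='value'] predicates
link = root.find(".//a[@id='gauge-rainrate-max-month']")
value = link.text
print(value)
```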

Simple HTML contact details form with table

These are my first steps in front-end development, so please don't hate me. I want to create a page with user details.
What I have so far looks like this:
First name: Last name: User status: Date joined: Policies accepted:
John Doe active 12.12.2021 true
I used a table for this, but below that table I need to display the user's contact details:
Contact details
Email: Phone number:
example@example.com +681234123412
Is there a better way to display this than a table, or is a table the common approach in such situations?

You should use forms instead of a table for the user details and the contact details; it's much more effective, productive, and easier.

<table border="1">
<thead>
<tr>
<th>First Name:</th>
<th>Last Name:</th>
<th>User Status:</th>
<th>Date Joined:</th>
<th>Policies Accepted:</th>
</tr>
</thead>
<tbody>
<tr>
<td>John</td>
<td>Doe</td>
<td>Active</td>
<td>12.12.2021</td>
<td>True</td>
</tr>
</tbody>
<tfoot>
<tr>
<td colspan="5">Contact Details</td>
</tr>
<tr>
<td colspan="3">Email:</td>
<td colspan="2">Phone Number:</td>
</tr>
<tr>
<td colspan="3">example@example.com</td>
<td colspan="2">+681234123412</td>
</tr>
</tfoot>
</table>
Here's a screenshot of the example:
https://i.stack.imgur.com/zJSqN.jpg

Data Extract from the Html String

I want to extract a few pieces of information from the HTML I receive in an email body. Before extracting the data, I sanitize the HTML down to the minimum base markup, with no attributes, styles, or empty lines.
I have seen that some mail parsers offer a GUI for selecting the fields to extract when creating a new template. I also found that they cope with minor changes in the HTML and still extract the data as before.
My question is: how are these websites able to create a template through a GUI (by selecting the text I need)? Also, is there any open-source project or library that can help me?
Example: I need to extract the booking no., PNR, date, and so on. I would prefer a GUI for creating the template.
<table>
<tbody>
<tr>
<td>Booking No.: 5903154789</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Download the Trip.com app to track your flight status and check booking details on the move.</td>
</tr>
<tr>
<td>FAQs</td>
</tr>
<tr>
<td>How can I refund my flight ticket?</td>
</tr>
<tr>
<td>If you need to refund your flight ticket after the ticket has been issued, please sign in and select My bookings, then Flights, click on the relevant order number to open the booking details page, click the Refund button to apply for a ticket refund according to the website instructions. A cancellation fee might apply which depends on the policy of the airline. If you reserved your flight ticket as a guest, you can search your booking through the email address which you used for your booking and apply for a ticket refund according to the website instructions.</td>
</tr>
<tr>
<td>How can I change my flight ticket?</td>
</tr>
<tr>
<td>If you need to modify your ticket after it has been issued, please contact one of our Trip.com customer service representatives. A change fee might apply which is dependent on the policy of the airline.</td>
</tr>
<tr>
<td>How can I check the flight status?</td>
</tr>
<tr>
<td>You can check the flight status through "Get Flight Status" in the "Flights tools" at the bottom of the homepage of Flights. You can also download our Trip.com App to check your flight's status by clicking the button "Flight Status" on the homepage.</td>
</tr>
<tr>
<td>Contact Us</td>
</tr>
<tr>
<td>United States : 833 896 0077 24/7</td>
</tr>
<tr>
<td>China : 400 828 8966 24/7</td>
</tr>
<tr>
<td>Other Locations : +86 21 3210 4669 24/7</td>
</tr>
<tr>
<td>Great deals with reliable service</td>
</tr>
<tr>
<td>Thank you for choosing Trip.com Customer Service Department</td>
</tr>
<tr>
<td>Do not forward this mail as it contains your personal information and booking details.</td>
</tr>
<tr>
<td>Copyright © 1999-2018 Trip.com All rights reserved</td>
</tr>
<tr>
<td>Using Trip.com’s website means that you agree with Trip.com’s Privacy Policy.</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Flight Booking Confirmed</td>
</tr>
<tr>
<td>
<strong>Dear Customer</strong>,
<p>Your flight booking has been confirmed and your tickets have been issued.</p>
<p>If you'd like to change or cancel your booking, the Trip.com app makes it easy.</p>
<p>You will find your itinerary and e-receipt attached. We advise you print out your itinerary and take it with you to ensure your trip goes as smoothly as possible.</p>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Booking No.</td>
<td>5903154789</td>
</tr>
<tr>
<td>Booked On</td>
<td>25 Mar 2018 12:32</td>
</tr>
<tr>
<td>Airline Booking Reference</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<strong>Flight Details</strong>(DPS - SIN)
</td>
</tr>
<tr>
<td>Bali - Singapore Scoot · TR281</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>3 May 2018 10:50</td>
<td>DPS</td>
<td>Ngurah Rai Airport I</td>
</tr>
<tr>
<td>3 May 2018 13:25</td>
<td>SIN</td>
<td>Changi Airport T2</td>
</tr>
<tr>
<td>
<strong>Baggage Allowance</strong>
<p>
<strong>[FREE]</strong>No free baggage allowance.Please contact airline for detailed baggage regulations.
</p>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>Passenger</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Name</td>
<td>Ticket Number</td>
</tr>
<tr>
<td>SOMANATH/MAMATHA</td>
<td>C9LHJQ</td>
</tr>
<tr>
<td>YADARANGI/SOMANATH</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>Click here to view date change and cancellation policies.</td>
</tr>
<tr>
<td>For more information, please check the attachments or view your booking in more detail on the Trip.com website or app.</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Important information</td>
</tr>
<tr>
<td>•</td>
<td>All departure/arrival times and dates are in local time.</td>
</tr>
<tr>
<td>•</td>
<td>Tickets must be used in the sequence set out in the itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>Please arrive at the airport at least 2 hours before departure to ensure you have enough time to check in.</td>
</tr>
<tr>
<td>•</td>
<td>Your ID must be valid for at least 6 months beyond the date you complete your itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>A transit visa may be required if you need to transfer in a third country. We recommend you confirm visa details with the embassy of the relevant country.</td>
</tr>
<tr>
<td>•</td>
<td>If you have only booked a one-way ticket and are travelling on a short-term business/tourism visa, we recommend you purchase a return ticket as soon as possible. Failure to do so may result in denial of check-in, entry, or exit.</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Booking No.</td>
<td>5903154789</td>
</tr>
<tr>
<td>Booked On</td>
<td>25 Mar 2018 12:32</td>
</tr>
<tr>
<td>Airline Booking Reference</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>3 May 2018 10:50</td>
<td>DPS</td>
<td>Ngurah Rai Airport I</td>
</tr>
<tr>
<td>3 May 2018 13:25</td>
<td>SIN</td>
<td>Changi Airport T2</td>
</tr>
<tr>
<td>
<strong>Baggage Allowance</strong>
<p>
<strong>[FREE]</strong>No free baggage allowance.Please contact airline for detailed baggage regulations.
</p>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Name</td>
<td>Ticket Number</td>
</tr>
<tr>
<td>SOMANATH/MAMATHA</td>
<td>C9LHJQ</td>
</tr>
<tr>
<td>YADARANGI/SOMANATH</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Important information</td>
</tr>
<tr>
<td>•</td>
<td>All departure/arrival times and dates are in local time.</td>
</tr>
<tr>
<td>•</td>
<td>Tickets must be used in the sequence set out in the itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>Please arrive at the airport at least 2 hours before departure to ensure you have enough time to check in.</td>
</tr>
<tr>
<td>•</td>
<td>Your ID must be valid for at least 6 months beyond the date you complete your itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>A transit visa may be required if you need to transfer in a third country. We recommend you confirm visa details with the embassy of the relevant country.</td>
</tr>
<tr>
<td>•</td>
<td>If you have only booked a one-way ticket and are travelling on a short-term business/tourism visa, we recommend you purchase a return ticket as soon as possible. Failure to do so may result in denial of check-in, entry, or exit.</td>
</tr>
</tbody>
</table>
The online parsers I looked at are:
https://mailparser.io/
https://parser.zapier.com/
https://parseur.com/
Edit:
Currently I have created a manual template using imangazaliev/didom (PHP), pointing at the exact node element to get the data, but that is too hard to maintain for so many templates, so I am looking for other options.
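One possible direction, sketched here under assumptions and not a description of how the online parsers above work internally, is to key the template on label text rather than on node position: find every two-cell row, treat the first cell as a label and the second as its value, and look up the labels you care about. The TEMPLATE mapping, its field names, and the RowCollector class are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self.rows:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


# Hypothetical template: output field name -> label text shown in the email
TEMPLATE = {
    "booking_no": "Booking No.",
    "booked_on": "Booked On",
    "pnr": "Airline Booking Reference",
}


def extract(html, template):
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    # Only rows with exactly two cells are treated as label/value pairs
    lookup = {row[0]: row[1] for row in collector.rows if len(row) == 2}
    return {field: lookup.get(label) for field, label in template.items()}


email_html = """<table><tbody>
<tr><td>Booking No.</td><td>5903154789</td></tr>
<tr><td>Booked On</td><td>25 Mar 2018 12:32</td></tr>
<tr><td>Airline Booking Reference</td><td>C9LHJQ</td></tr>
</tbody></table>"""

fields = extract(email_html, TEMPLATE)
print(fields)
```

Because the lookup depends on the label text and not on where the row sits in the document, small changes to the surrounding markup should not break it, although a renamed label would.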

How do I awk a unix text file to a predefined html code?

I don't know HTML (HORRIBLY EMBARRASSED, but I never had the need to). I am pretty perspicacious when it comes to UNIX; however, I am horribly confused by this assignment. I know what I need to do but am having the hardest time ever getting started.
I have the following files in my hwk12 directory:
roster.html
roster.txt
sample.html
sample.txt
The following is the content of the roster.html file:
<html>
<body>
<table border=2>
<tr><th>Name</th><th>Username</th><th>Email</th></tr>
<tr>
<td>Nikhil Banerjee</td>
<td>nbanerje</td>
<td>zetapsi796@hotmail.com</td>
</tr>
<tr>
<td>Jeff Nazarian</td>
<td>jnazaria</td>
<td>jeff.nazarian@asu.edu</td>
</tr>
<tr>
<td>Anna Melzer</td>
<td>amelzer</td>
<td>anna.melzer@asu.edu</td>
</tr>
<tr>
<td>Jose Garcia</td>
<td>jgarcia</td>
<td>garcia-j@msn.com</td>
</tr>
<tr>
<td>Jillian Testa</td>
<td>jtesta</td>
<td>jillian.testa@asu.edu</td>
</tr>
<tr>
<td>Clayton Lengelzigich</td>
<td>clengelz</td>
<td><a href="mailto:clayton.lengel-zigich@asu.edu">clayton.lengel-zigich@asu.edu</a></td>
</tr>
<tr>
<td>Ashley Bennett</td>
<td>abennett</td>
<td>ashley.bennett@asu.edu</td>
</tr>
<tr>
<td>Ann Frost</td>
<td>afrost</td>
<td>ann.frost@asu.edu</td>
</tr>
<tr>
<td>Timothy Whipple</td>
<td>twhipple</td>
<td>tweed@asu.edu</td>
</tr>
<tr>
<td>Wei Shen</td>
<td>wshen</td>
<td>shenwei58@hotmail.com</td>
</tr>
<tr>
<td>Cari Mahon</td>
<td>cmahon</td>
<td>cari.mahon@asu.edu</td>
</tr>
<tr>
<td>Alberto Salas</td>
<td>asalas</td>
<td>alberto2504@msn.com</td>
</tr>
<tr>
<td>Dorothy Haskett</td>
<td>dhaskett</td>
<td>dorothy.haskett@asu.edu</td>
</tr>
<tr>
<td>Criss Bradbury</td>
<td>cbradbur</td>
<td>crissbradbury@hotmaiil.com</td>
</tr>
<tr>
<td>Steve Ellermann</td>
<td>sellerma</td>
<td>cis494@ellermann.com</td>
</tr>
<tr>
<td>Zewdie Bekele</td>
<td>zbekele</td>
<td>zewdiea@aol.com</td>
</tr>
<tr>
<td>Frederic Diziere</td>
<td>fdiziere</td>
<td>fsd@asu.edu</td>
</tr>
<tr>
<td>Matt Bowes</td>
<td>mbowes</td>
<td>matt.bowes@asu.edu</td>
</tr>
<tr>
<td>Jasen Meece</td>
<td>jmeece</td>
<td>jasen.meece@sun.com</td>
</tr>
<tr>
<td>Aaron Carpenter</td>
<td>acarpent</td>
<td>aaron.carpenter@asu.edu</td>
</tr>
<tr>
<td>Binqin Xi</td>
<td>bxi</td>
<td>binqin.xi@asu.edu</td>
</tr>
<tr>
<td>Yinting Chan</td>
<td>ychan</td>
<td>yin.chen@asu.edu</td>
</tr>
<tr>
<td>Michael Evans</td>
<td>mevans</td>
<td>michael.evans@asu.edu</td>
</tr>
<tr>
<td>Herman Beringer</td>
<td>hberinge</td>
<td>jber@cox.net</td>
</tr>
<tr>
<td>Andrew Jolley</td>
<td>ajolley</td>
<td>andrew@andrewjolley.com</td>
</tr>
<tr>
<td>Michael Raby</td>
<td>mraby</td>
<td>mike1071@yahoo.com</td>
</tr>
<tr>
<td>Hajar Alaoui</td>
<td>halaoui</td>
<td>hajar6@hotmail.com</td>
</tr>
<tr>
<td>Anne Lemar</td>
<td>alemar</td>
<td>anne.lemar@asu.edu</td>
</tr>
<tr>
<td>Russell Crotts</td>
<td>rcrotts</td>
<td>Russell.Crotts@asu.edu</td>
</tr>
<tr>
<td>Dan Mazzola</td>
<td>dmazzola</td>
<td>dan.mazzola@sun.com</td>
</tr>
<tr>
<td>Bill Boyton</td>
<td>bboyton</td>
<td>boytonb@earthlink.net</td>
</tr>
</table>
</body>
</html>
The following is the content of the roster.txt file:
Whipple Timothy tweed@asu.edu Shen Wei shenwei58@hotmail.com
Mahon Cari cari.mahon@asu.edu Salas Alberto alberto2504@msn.com
Haskett Dorothy dorothy.haskett@asu.edu Bradbury Criss
crissbradbury@hotmaiil.com Ellermann Steve
cis494@ellermann.com Bekele Zewdie zewdiea@aol.com Diziere Frederic
fsd@asu.edu Bowes Matt matt.bowes@asu.edu Meece Jasen
jasen.meece@sun.com Carpenter Aaron aaron.carpenter@asu.edu
Xi Binqin binqin.xi@asu.edu Chan Yinting yin.chen@asu.edu
Evans Michael michael.evans@asu.edu Beringer Herman
jber@cox.net Jolley Andrew andrew@andrewjolley.com Raby Michael
mike1071@yahoo.com Alaoui Hajar hajar6@hotmail.com Lemar Anne
anne.lemar@asu.edu Crotts Russell Russell.Crotts@asu.edu Mazzola Dan
dan.mazzola@sun.com Boyton Bill boytonb@earthlink.net
The following is the content of the sample.html file:
<html>
<body>
<table border=2>
<tr><th>Name</th><th>Username</th><th>Email</th></tr>
<tr>
<td>Michael Raby</td>
<td>mraby</td>
<td>mike1071@yahoo.com</td>
</tr>
<tr>
<td>Hajar Alaoui</td>
<td>halaoui</td>
<td>hajar6@hotmail.com</td>
</tr>
<tr>
<td>Anne Lemar</td>
<td>alemar</td>
<td>anne.lemar@asu.edu</td>
</tr>
<tr>
<td>Russell Crotts</td>
<td>rcrotts</td>
<td>Russell.Crotts@asu.edu</td>
</tr>
<tr>
<td>Dan Mazzola</td>
<td>dmazzola</td>
<td>dan.mazzola@sun.com</td>
</tr>
<tr>
<td>Bill Boyton</td>
<td>bboyton</td>
<td>boytonb@earthlink.net</td>
</tr>
</table>
</body>
</html>
The following is the content of the sample.txt file:
Raby Michael mike1071@yahoo.com
Alaoui Hajar hajar6@hotmail.com
Lemar Anne anne.lemar@asu.edu
Crotts Russell Russell.Crotts@asu.edu
Mazzola Dan dan.mazzola@sun.com
Boyton Bill boytonb@earthlink.net
I'm not asking for someone to do this for me because I LOVE UNIX and I want to learn it myself. Every time I look at this HTML code I confuse the #$$#& out of myself. I need help getting started.
The homework prompt is the following:
You are to write a nawk(1) script called ~/hwk12/mk_html.awk that converts a text file (sample.txt and roster.txt) to an html page that a web browser can read. I have given you the output in the file sample.html which is reproduced below (notice how each level of indentation is two spaces deep):
Again, I don't want someone to do this for me. I'm just confused as to how the data in the text file ends up in an HTML table when the text file itself contains no HTML code. Can someone please help me get started?

Looks like you'll need to define the necessary HTML tags within your script. The meat of the html file will be these lines:
<tr>
<td>$first $last</td>
<td>$username</td>
<td>$email</td>
</tr>
These tags define a table row. You can parse the variables from the text files with awk and use them to fill in the html. The other html markup can be copy-pasted as static text into the output html file.
Edit: You can do this to grab the first and last name and print them to the HTML file:
{
    # Each record holds: last name, first name, email
    last = $1
    first = $2
    print " <tr>"
    print " <td>" first " " last "</td>"
    print " </tr>"
}
You just need to expand that to get the email and username.
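To make the data flow concrete without giving away the awk script, here is the same idea sketched in Python. It rests on two assumptions that are inferred from the sample files rather than stated in the assignment: the input is a stream of whitespace-separated "Last First email" triples (records may wrap across lines, as in roster.txt), and the username is the first initial plus the last name, lowercased and cut to eight characters. The function names are made up, and indentation of the output is left out.

```python
def username(first, last):
    # Rule inferred from the sample data (an assumption): first initial plus
    # last name, lowercased and truncated to eight characters.
    return (first[0] + last).lower()[:8]


def to_rows(text):
    """Turn whitespace-separated 'Last First email' triples into <tr> blocks."""
    tokens = text.split()
    rows = []
    for i in range(0, len(tokens) - 2, 3):
        last, first, email = tokens[i:i + 3]
        rows.append(
            "<tr>\n"
            f"<td>{first} {last}</td>\n"
            f"<td>{username(first, last)}</td>\n"
            f"<td>{email}</td>\n"
            "</tr>"
        )
    return rows


# Three records, one of them wrapped across a line break
sample = """Raby Michael mike1071@yahoo.com
Crotts Russell
Russell.Crotts@asu.edu Xi Binqin binqin.xi@asu.edu"""

rows = to_rows(sample)
print("\n".join(rows))
```

The static parts of the page (html, body, table, and the header row) never change, so they can simply be printed once before and once after the rows.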

Parse html table using Nokogiri and Mechanize

Using the following code I am trying to scrape a call log from our phone provider's web application to enter the info into my Ruby on Rails application.
desc "Import incoming calls"
task :fetch_incomingcalls => :environment do
  # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
  require 'rubygems'
  require 'mechanize'
  require 'logger'
  # Create a new mechanize object
  agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
  # Load the Phone Provider website
  page = agent.get("https://manage.phoneprovider.co.uk/login")
  # Select the first form
  form = agent.page.forms.first
  form.username = 'username'
  form.password = 'password'
  # Submit the form
  page = form.submit form.buttons.first
  # Click on link called Call Logs
  page = agent.page.link_with(:text => "Call Logs").click
  # Click on link called Incoming Calls
  page = agent.page.link_with(:text => "Incoming Calls").click
  # Prints out table rows
  # puts doc.css('table > tr')
  # Print out the body as a test
  # puts page.body
end
As you can see from the last five lines, I have tested that 'puts page.body' works and that the code above works: it logs in successfully and then navigates to Call Logs followed by Incoming Calls. The incoming call table looks like this:
| Timestamp | Source | Destination | Duration |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
Which is generated from the following code:
<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
I'm trying to work out how to select just the cells I want (Timestamp, Source, Destination and Duration) and output those. I can then worry about writing them to the database rather than to the terminal.
I have tried using SelectorGadget, but it just shows either 'td' or 'tr:nth-child(6) td , tr:nth-child(2) td' if I select multiple cells.
Any help or pointers would be appreciated!

There is a pattern in the table that is easy to leverage using XPath. The <tr> tags of the rows with the required information lack the class attribute. Fortunately, XPath provides some simple logical operations, including not(), which is just the functionality we need.
Once we've reduced the number of rows we're dealing with, we can iterate over them and extract the text of the necessary columns using XPath's element[n] selector. One important note: XPath counts elements starting from 1, so the first column of a table row is td[1].
Example code using Nokogiri (and specs):
require "rspec"
require "nokogiri"
HTML = <<HTML
<table>
<thead>
<tr>
<td>
Timestamp
</td>
<td>
Source
</td>
<td>
Destination
</td>
<td>
Duration
</td>
<td>
Cost
</td>
<td class='centre'>
Recording
</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
03 Jan 13:40
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:01:14
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
30 Dec 20:31
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:02:52
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
24 Dec 00:03
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:09
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
23 Dec 14:56
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:07
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
21 Dec 13:26
</td>
<td>
07793770851
</td>
<td>
12345679
</td>
<td>
00:00:26
</td>
<td></td>
<td class='opt recording'></td>
</tr>
</tbody>
</table>
HTML
class TableExtractor
  # Returns one hash per data row, i.e. per <tr> that has no class attribute
  def extract_data html
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      timestamp = row.at("td[1]").text.strip
      source = row.at("td[2]").text.strip
      destination = row.at("td[3]").text.strip
      duration = row.at("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end
describe TableExtractor do
  before(:all) do
    @html = HTML
  end
  it "should extract the timestamp properly" do
    subject.extract_data(@html)[0][:timestamp].should eq "03 Jan 13:40"
  end
  it "should extract the source properly" do
    subject.extract_data(@html)[0][:source].should eq "12345678"
  end
  it "should extract the destination properly" do
    subject.extract_data(@html)[0][:destination].should eq "12345679"
  end
  it "should extract the duration properly" do
    subject.extract_data(@html)[0][:duration].should eq "00:01:14"
  end
  it "should extract all informational rows" do
    subject.extract_data(@html).count.should eq 5
  end
end
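For readers who want to try the two ideas (skip rows that carry a class attribute, then pick cells by 1-based position) outside Ruby, here is a rough Python equivalent using only the standard library. It is a sketch under assumptions: the table markup is a shortened, hand-made copy of the one above, and because ElementTree's XPath subset has no not() function, the class check is done in Python instead of in the path expression.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the call-log table (illustrative only)
table = """<table>
<tbody>
<tr class="o"><td></td></tr>
<tr><td> 03 Jan 13:40 </td><td>12345678</td><td>12345679</td><td>00:01:14</td><td></td></tr>
<tr class="e"><td></td></tr>
<tr><td> 30 Dec 20:31 </td><td>12345678</td><td>12345679</td><td>00:02:52</td><td></td></tr>
</tbody>
</table>"""

root = ET.fromstring(table)
calls = []
for tr in root.findall("./tbody/tr"):
    if "class" in tr.attrib:
        continue  # stands in for the not(@class) predicate
    # td[n] is 1-based here too, as in full XPath
    timestamp, source, destination, duration = (
        tr.find(f"td[{n}]").text.strip() for n in range(1, 5)
    )
    calls.append({
        "timestamp": timestamp,
        "source": source,
        "destination": destination,
        "duration": duration,
    })
print(calls)
```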

Your answer lies in this Railscasts episode:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
This may help too:
How do I parse an HTML table with Nokogiri?

You should be able to reach the exact node you require from the root (in the worst case) using XPath selectors. Using XPath with Nokogiri is listed here.
For details on how to reach all your elements using XPath, look here.
