Parse html table using Nokogiri and Mechanize

Using the following code I am trying to scrape a call log from our phone provider's web application to enter the info into my Ruby on Rails application.
desc "Import incoming calls"
task :fetch_incomingcalls => :environment do
  # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
  require 'rubygems'
  require 'mechanize'
  require 'logger'
  # Create a new Mechanize object that logs to STDERR
  agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
  # Load the Phone Provider website
  page = agent.get("https://manage.phoneprovider.co.uk/login")
  # Select the first form and fill in the credentials
  form = page.forms.first
  form.username = 'username'
  form.password = 'password'
  # Submit the form
  page = form.submit(form.buttons.first)
  # Click on the link called Call Logs
  page = page.link_with(:text => "Call Logs").click
  # Click on the link called Incoming Calls
  page = page.link_with(:text => "Incoming Calls").click
  # Prints out table rows
  # puts page.search('table tr')
  # Print out the body as a test
  # puts page.body
end
As you can see from the commented-out lines at the end, I have tested that 'puts page.body' works and that the code above runs. It successfully logs in and then navigates to Call Logs followed by Incoming Calls. The incoming call table looks like this:
| Timestamp | Source | Destination | Duration |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
Which is generated from the following code:
<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
I'm trying to work out how to select just the cells I want (Timestamp, Source, Destination and Duration) and output those. I can then worry about writing them to the database rather than to the terminal.
I have tried using Selector Gadget, but it just shows either 'td' or, if I select multiple cells, 'tr:nth-child(6) td , tr:nth-child(2) td'.
Any help or pointers would be appreciated!

There is a pattern in the table that is easy to leverage using XPath: the <tr> tags of the rows that hold the required information lack the class attribute. Fortunately, XPath provides some simple logical operations, including not(), which gives us exactly the filter we need.
Once we've reduced the number of rows we're dealing with, we can iterate over them and extract the text of the necessary columns using XPath's element[n] selector. One important note here is that XPath counts elements starting from 1, so the first column of a table row is td[1].
Example code using Nokogiri (and specs):
require "rspec"
require "nokogiri"
HTML = <<HTML
<table>
<thead>
<tr>
<td>
Timestamp
</td>
<td>
Source
</td>
<td>
Destination
</td>
<td>
Duration
</td>
<td>
Cost
</td>
<td class='centre'>
Recording
</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
03 Jan 13:40
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:01:14
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
30 Dec 20:31
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:02:52
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
24 Dec 00:03
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:09
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
23 Dec 14:56
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:07
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
21 Dec 13:26
</td>
<td>
07793770851
</td>
<td>
12345679
</td>
<td>
00:00:26
</td>
<td></td>
<td class='opt recording'></td>
</tr>
</tbody>
</table>
HTML
class TableExtractor
  # Returns one hash per data row, i.e. per <tr> that has no class attribute.
  def extract_data(html)
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      # XPath positions are 1-based, so td[1] is the first cell of the row.
      timestamp = row.at_xpath("td[1]").text.strip
      source = row.at_xpath("td[2]").text.strip
      destination = row.at_xpath("td[3]").text.strip
      duration = row.at_xpath("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end
describe TableExtractor do
  before(:all) do
    @html = HTML
  end
  it "should extract the timestamp properly" do
    expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40"
  end
  it "should extract the source properly" do
    expect(subject.extract_data(@html)[0][:source]).to eq "12345678"
  end
  it "should extract the destination properly" do
    expect(subject.extract_data(@html)[0][:destination]).to eq "12345679"
  end
  it "should extract the duration properly" do
    expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14"
  end
  it "should extract all informational rows" do
    expect(subject.extract_data(@html).count).to eq 5
  end
end

Your answer lies in this Railscast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
This question may help too:
How do I parse an HTML table with Nokogiri?

You should be able to reach the exact node you need from the root (in the worst case) using XPath selectors. Using XPath with Nokogiri is listed here.
For details on how to reach all your elements using XPath, look here.

Related

Python web scraping unstructured table

I am trying to extract some information from a table that appears on a webpage, but the table is unstructured: the headers sit in the rows next to their values instead of in a header row, like this (my apologies for not disclosing the webpage):
<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>
At the moment, I am able to extract the table by using this script:
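import pandas as pd
import requests
# Placeholder URL; requests needs the scheme (https://) to be present
url = "https://example.com"
r = requests.get(url)
# read_html returns one DataFrame per <table> found in the page
df_list = pd.read_html(r.text)
df = df_list[0]
print(df.head())
df.to_csv('myfile.csv', encoding='utf-8-sig')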
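This reads the table as-is, with the header labels and their values mixed together in the same rows.
However, I am a little stuck on how to reshape this in Python and cannot get my head around extracting the data. The result I want is a single row with the header labels (Full name, Year of birth, Gender, Place of birth, Address) as columns.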
Any help would be appreciated. Thank you so much in advance.

You can use BeautifulSoup to parse the HTML. For example:
import pandas as pd
from bs4 import BeautifulSoup
txt = '''<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>'''
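soup = BeautifulSoup(txt, 'html.parser')
row = {}
# Select every <th> that is immediately followed by a <td> sibling
for h in soup.select('th:has(+td)'):
    # Use the header text as the column name and the next cell as its value
    row[h.text] = h.find_next('td').get_text(strip=True)
df = pd.DataFrame([row])
print(df)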
Prints:
     Full name  Year of birth  Gender  Place of birth     Address
0  James Smith           1992    Male     TTexas, USA  Texas, USA

Data Extract from the Html String

I am trying to extract a few pieces of information from the HTML that I receive in an email body. Before extracting the data I sanitize the HTML so that only the minimum base HTML remains, with no attributes, styles, or empty lines.
I saw that some mail parsers use a GUI to select the fields to be extracted by creating a new template. I also found that they still extract the data as before when the HTML changes slightly.
My question is: how are these websites able to create a template through a GUI (by selecting the text I need)? Also, is there any open-source project or library that can help me?
Example: I need to extract the booking no., PNR, date, etc. I would prefer a GUI to create the template.
<table>
<tbody>
<tr>
<td>Booking No.: 5903154789</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Download the Trip.com app to track your flight status and check booking details on the move.</td>
</tr>
<tr>
<td>FAQs</td>
</tr>
<tr>
<td>How can I refund my flight ticket?</td>
</tr>
<tr>
<td>If you need to refund your flight ticket after the ticket has been issued, please sign in and select My bookings, then Flights, click on the relevant order number to open the booking details page, click the Refund button to apply for a ticket refund according to the website instructions. A cancellation fee might apply which depends on the policy of the airline. If you reserved your flight ticket as a guest, you can search your booking through the email address which you used for your booking and apply for a ticket refund according to the website instructions.</td>
</tr>
<tr>
<td>How can I change my flight ticket?</td>
</tr>
<tr>
<td>If you need to modify your ticket after it has been issued, please contact one of our Trip.com customer service representatives. A change fee might apply which is dependent on the policy of the airline.</td>
</tr>
<tr>
<td>How can I check the flight status?</td>
</tr>
<tr>
<td>You can check the flight status through "Get Flight Status" in the "Flights tools" at the bottom of the homepage of Flights. You can also download our Trip.com App to check your flight's status by clicking the button "Flight Status" on the homepage.</td>
</tr>
<tr>
<td>Contact Us</td>
</tr>
<tr>
<td>United States : 833 896 0077 24/7</td>
</tr>
<tr>
<td>China : 400 828 8966 24/7</td>
</tr>
<tr>
<td>Other Locations : +86 21 3210 4669 24/7</td>
</tr>
<tr>
<td>Great deals with reliable service</td>
</tr>
<tr>
<td>Thank you for choosing Trip.comCustomer Service Department</td>
</tr>
<tr>
<td>Do not forward this mail as it contains your personal information and booking details.</td>
</tr>
<tr>
<td>Copyright © 1999-2018 Trip.com All rights reserved</td>
</tr>
<tr>
<td>Using Trip.com’s website means that you agree with Trip.com’s Privacy Policy.</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Flight Booking Confirmed</td>
</tr>
<tr>
<td>
<strong>Dear Customer</strong>,
<p>Your flight booking has been confirmed and your tickets have been issued.</p>
<p>If you'd like to change or cancel your booking, the Trip.com app makes it easy.</p>
<p>You will find your itinerary and e-receipt attached. We advise you print out your itinerary and take it with you to ensure your trip goes as smoothly as possible.</p>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Booking No.</td>
<td>5903154789</td>
</tr>
<tr>
<td>Booked On</td>
<td>25 Mar 2018 12:32</td>
</tr>
<tr>
<td>Airline Booking Reference</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<strong>Flight Details</strong>(DPS - SIN)
</td>
</tr>
<tr>
<td>Bali - SingaporeScoot · TR281</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>3 May 2018 10:50</td>
<td>DPS</td>
<td>Ngurah Rai Airport I</td>
</tr>
<tr>
<td>3 May 2018 13:25</td>
<td>SIN</td>
<td>Changi Airport T2</td>
</tr>
<tr>
<td>
<strong>Baggage Allowance</strong>
<p>
<strong>[FREE]</strong>No free baggage allowance.Please contact airline for detailed baggage regulations.
</p>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>Passenger</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Name</td>
<td>Ticket Number</td>
</tr>
<tr>
<td>SOMANATH/MAMATHA</td>
<td>C9LHJQ</td>
</tr>
<tr>
<td>YADARANGI/SOMANATH</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>Click here to view date change and cancellation policies.</td>
</tr>
<tr>
<td>For more information, please check the attachments or view your booking in more detail on the Trip.com website or app.</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Important information</td>
</tr>
<tr>
<td>•</td>
<td>All departure/arrival times and dates are in local time.</td>
</tr>
<tr>
<td>•</td>
<td>Tickets must be used in the sequence set out in the itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>Please arrive at the airport at least 2 hours before departure to ensure you have enough time to check in.</td>
</tr>
<tr>
<td>•</td>
<td>Your ID must be valid for at least 6 months beyond the date you complete your itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>A transit visa may be required if you need to transfer in a third country. We recommend you confirm visa details with the embassy of the relevant country.</td>
</tr>
<tr>
<td>•</td>
<td>If you have only booked a one-way ticket and are travelling on a short-term business/tourism visa, we recommend you purchase a return ticket as soon as possible. Failure to do so may result in denial of check-in, entry, or exit.</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Booking No.</td>
<td>5903154789</td>
</tr>
<tr>
<td>Booked On</td>
<td>25 Mar 2018 12:32</td>
</tr>
<tr>
<td>Airline Booking Reference</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>3 May 2018 10:50</td>
<td>DPS</td>
<td>Ngurah Rai Airport I</td>
</tr>
<tr>
<td>3 May 2018 13:25</td>
<td>SIN</td>
<td>Changi Airport T2</td>
</tr>
<tr>
<td>
<strong>Baggage Allowance</strong>
<p>
<strong>[FREE]</strong>No free baggage allowance.Please contact airline for detailed baggage regulations.
</p>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Name</td>
<td>Ticket Number</td>
</tr>
<tr>
<td>SOMANATH/MAMATHA</td>
<td>C9LHJQ</td>
</tr>
<tr>
<td>YADARANGI/SOMANATH</td>
<td>C9LHJQ</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td>Important information</td>
</tr>
<tr>
<td>•</td>
<td>All departure/arrival times and dates are in local time.</td>
</tr>
<tr>
<td>•</td>
<td>Tickets must be used in the sequence set out in the itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>Please arrive at the airport at least 2 hours before departure to ensure you have enough time to check in.</td>
</tr>
<tr>
<td>•</td>
<td>Your ID must be valid for at least 6 months beyond the date you complete your itinerary.</td>
</tr>
<tr>
<td>•</td>
<td>A transit visa may be required if you need to transfer in a third country. We recommend you confirm visa details with the embassy of the relevant country.</td>
</tr>
<tr>
<td>•</td>
<td>If you have only booked a one-way ticket and are travelling on a short-term business/tourism visa, we recommend you purchase a return ticket as soon as possible. Failure to do so may result in denial of check-in, entry, or exit.</td>
</tr>
</tbody>
</table>
Online Parsers are below:
https://mailparser.io/
https://parser.zapier.com/
https://parseur.com/
Edit:
Currently I have created a manual template using imangazaliev/didom (PHP),
pointing at the exact node element to get the data, but that is too hard to do for so many templates, so I am looking for other approaches.

Print Headers on each page Print.CSS

I have a view that takes a List of objects.
So, for example, say I had a List of people that should be ordered by where they are located and the division they are in, like so:
| ID | Name   | Location  | Division        | Age |
--------------------------------------------------------------
| 1  | John   | Building1 | Finance         | 25  |
| 2  | Alex   | Building1 | Finance         | 30  |
| 3  | Chris  | Building2 | ISS             | 22  |
| 4  | Justin | Building1 | Human Resources | 41  |
| 5  | Mary   | Building2 | Accounting      | 43  |
| 6  | Ian    | Building1 | Human Resources | 27  |
| 7  | John   | Building1 | Finance         | 35  |
So my action return statement looks like this:
lstOfPersonnel = lstOfPersonnel.OrderBy(x => x.Location).ThenBy(x => x.Division).ThenBy(x => x.Name).ToList();
return View(lstOfPersonnel);
In my View I have this:
<table class="table table-bordered no-border">
    @foreach (var item in Model)
    {
        if ((Model.IndexOf(item) == 0) || ((Model.IndexOf(item) != 0) && (!item.Building.Equals(Model.ElementAt(Model.IndexOf(item) - 1).Building) || !item.Division.Equals(Model.ElementAt(Model.IndexOf(item) - 1).Division))))
        {
            <thead>
                <tr>
                    <th><b>@item.Building</b></th>
                    <th><b>@item.Division</b></th>
                </tr>
                <tr class="no-display"></tr>
                <tr>
                    <th>Name</th>
                </tr>
                <tr>
                    <th>Age</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>
                        @Html.DisplayFor(modelItem => item.Name)
                    </td>
                    <td>
                        @Html.DisplayFor(modelItem => item.Age)
                    </td>
                </tr>
            </tbody>
        }
        else
        {
            <tbody>
                <tr>
                    <td>
                        @Html.DisplayFor(modelItem => item.Name)
                    </td>
                    <td>
                        @Html.DisplayFor(modelItem => item.Age)
                    </td>
                </tr>
            </tbody>
        }
    }
</table>
Now, when I print preview this, everybody in the same building and division appears under their respective header. However, the very first <thead> element (in this example Building1 and Finance, because of the OrderBy) is shown on every page, above the header of the next group.
So for a visual this is what it looks like when I print preview:
Page 1:
// Perfect Render
Building1 | Finance
Name | Age
Alex 30
John 35
John 25
Page 2:
// Repeat of Page 1's headers
Building1 | Finance
Name | Age
Building1 | Human Resources
Name | Age
Ian 27
Justin 41
Page 3:
// Repeat of Page 1's headers
Building1 | Finance
Name | Age
Building2 | Accounting
Name | Age
Mary 43
Page 4:
// Repeat of Page 1's headers
Building1 | Finance
Name | Age
Building2 | ISS
Name | Age
Chris 22

Try creating the table inside the foreach loop, so that each iteration creates a new table instead of just a head and body. I think this is what is happening:
<table>
<thead>
...
</thead>
<tbody>
...
</tbody>
<thead>
...
</thead>
<tbody>
...
</tbody>
</table>
instead of this:
<table>
<thead>
...
</thead>
<tbody>
...
</tbody>
</table>
<table>
<thead>
...
</thead>
<tbody>
...
</tbody>
</table>
So what is happening is that you have only one table, and the browser prints that table's head at the top of each page because the table continues onto the next page. With multiple tables, each table contains only one head, so that head is printed at the top of a new page only when its own table flows over onto it.
You can then just do your if before every tbody, because it looks like the head stays the same in every table.
Maybe try this:
@foreach (var item in Model)
{
    <table class="table table-bordered no-border">
        <thead>
            <tr>
                <th><b>@item.Building</b></th>
                <th><b>@item.Division</b></th>
            </tr>
            <tr class="no-display"></tr>
            <tr>
                <th>Name</th>
            </tr>
            <tr>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            @if ((Model.IndexOf(item) == 0) || ((Model.IndexOf(item) != 0) && (!item.Building.Equals(Model.ElementAt(Model.IndexOf(item) - 1).Building) || !item.Division.Equals(Model.ElementAt(Model.IndexOf(item) - 1).Division))))
            {
                <tr>
                    <td>
                        @Html.DisplayFor(modelItem => item.Name)
                    </td>
                    <td>
                        @Html.DisplayFor(modelItem => item.Age)
                    </td>
                </tr>
            }
            else
            {
                <tr>
                    <td>
                        @Html.DisplayFor(modelItem => item.Name)
                    </td>
                    <td>
                        @Html.DisplayFor(modelItem => item.Age)
                    </td>
                </tr>
            }
        </tbody>
    </table>
}

HTML allows at most one thead per table, so multiple thead elements in a single table are not viable. That said, you will have to create a new table for each thead that you wish to create.
Accomplishing this in one loop would look something like this:
@{
    // Assumes Building and Division are strings ("var x = null" does not compile in C#)
    string lastBuilding = null, lastDivision = null;
}
@foreach (var item in Model)
{
    if (!item.Building.Equals(lastBuilding) || !item.Division.Equals(lastDivision))
    {
        if (lastBuilding != null && lastDivision != null)
        {
            // Close the previous group's table before starting a new one
            @:</tbody></table>
        }
        @:<table>
        <thead>
            <tr>
                <th><b>@item.Building</b></th>
                <th><b>@item.Division</b></th>
            </tr>
            <tr class="no-display"></tr>
            <tr>
                <th>Name</th>
            </tr>
            <tr>
                <th>Age</th>
            </tr>
        </thead>
        @:<tbody>
    }
    <tr>
        <td>
            @Html.DisplayFor(modelItem => item.Name)
        </td>
        <td>
            @Html.DisplayFor(modelItem => item.Age)
        </td>
    </tr>
    lastBuilding = item.Building;
    lastDivision = item.Division;
}
@if (lastBuilding != null)
{
    // Close the table of the last group
    @:</tbody></table>
}
The key to the logic here is:
1. Each item ultimately equates to a row in a table so the item outputs each iteration. The rest is just checking for whether or not a new table should be started (and the last one ended).
2. Setting lastBuilding and lastDivision to null for the first iteration avoids the first table being ended immediately.
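The grouping logic described above can also be sketched outside Razor. The following is a minimal, language-neutral illustration in Python, not the view code itself: the function name render_group_tables and the dictionary keys are hypothetical, and the sample rows are taken from the question's table. It emits one complete table per (building, division) group, so each table carries exactly one thead.

```python
from html import escape
from itertools import groupby

# Sample rows copied from the question's table (keys are illustrative names).
PEOPLE = [
    {"name": "John", "building": "Building1", "division": "Finance", "age": 25},
    {"name": "Alex", "building": "Building1", "division": "Finance", "age": 30},
    {"name": "Chris", "building": "Building2", "division": "ISS", "age": 22},
    {"name": "Justin", "building": "Building1", "division": "Human Resources", "age": 41},
    {"name": "Mary", "building": "Building2", "division": "Accounting", "age": 43},
    {"name": "Ian", "building": "Building1", "division": "Human Resources", "age": 27},
    {"name": "John", "building": "Building1", "division": "Finance", "age": 35},
]


def group_key(person):
    return (person["building"], person["division"])


def render_group_tables(people):
    """Return a list of HTML strings, one <table> per (building, division)."""
    # groupby only merges adjacent items, so sort by the group key first
    # (then by name, mirroring OrderBy/ThenBy in the question).
    ordered = sorted(people, key=lambda p: (*group_key(p), p["name"]))
    tables = []
    for (building, division), members in groupby(ordered, key=group_key):
        rows = "".join(
            f"<tr><td>{escape(m['name'])}</td><td>{m['age']}</td></tr>"
            for m in members
        )
        tables.append(
            "<table>"
            f"<thead><tr><th>{escape(building)}</th><th>{escape(division)}</th></tr>"
            "<tr><th>Name</th><th>Age</th></tr></thead>"
            f"<tbody>{rows}</tbody>"
            "</table>"
        )
    return tables


if __name__ == "__main__":
    for table in render_group_tables(PEOPLE):
        print(table)
```

Because every group is a separate table, a print engine that repeats thead on each page can only repeat the header belonging to the table that is actually being continued.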

YII2 Activedataprovider custom template

Let's assume that we have three tables: tbl_student, tbl_lesson and tbl_score(student_id, lesson_id, score).
For ActiveDataProvider, we have:
$query = \app\models\Score::find()
    ->joinWith('student', false)
    ->joinWith('lesson', false);
$dataProvider = new \yii\data\ActiveDataProvider([
    'query' => $query,
]);
echo GridView::widget([
    'dataProvider' => $dataProvider,
]);
So it gives an output with the columns student_id, lesson_id and score:
<table border="1">
<tr>
<td> student_id </td>
<td> lesson_id </td>
<td> score </td>
</tr>
<tr>
<td> 1 </td>
<td> 1 </td>
<td> 99.5 </td>
</tr>
<tr>
<td> 1 </td>
<td> 2 </td>
<td> 54 </td>
</tr>
<tr>
<td> 1 </td>
<td> 3 </td>
<td> 87 </td>
</tr>
<tr>
<td> 2 </td>
<td> 1 </td>
<td> 76 </td>
</tr>
<tr>
<td> 2 </td>
<td> 2 </td>
<td> 84 </td>
</tr>
<tr>
<td> 2 </td>
<td> 3 </td>
<td> 69 </td>
</tr>
</table>
But what I want is to display students in the first COLUMN, and lessons in the first ROW and then display the associated scores in the body of the table:
<table border="1">
<tr>
<td>student_id</td>
<td>lesson_1</td>
<td>lesson_2</td>
<td>lesson_3</td>
</tr>
<tr>
<td>1</td>
<td>99.5</td>
<td>54</td>
<td>87</td>
</tr>
<tr>
<td>2</td>
<td>76</td>
<td>84</td>
<td>69</td>
</tr>
</table>
How can I do that? Thanks in advance.
I know that I can use ArrayDataProvider, but as the Yii2 data providers guide says:
Note: Compared to Active Data Provider and SQL Data Provider, array data provider is less efficient because it requires loading all data into the memory.
So I want to do it using ActiveDataProvider.
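The reshaping asked for here is a pivot: one output row per student and one column per lesson. As a rough illustration of the transformation only, here is a sketch in plain Python. It is not Yii2 code, it does not address the ActiveDataProvider constraint, and it holds all rows in memory; the function name pivot_scores is hypothetical, and the sample rows are copied from the table above.

```python
def pivot_scores(rows):
    """Pivot a list of (student_id, lesson_id, score) tuples.

    Returns (header, body): header is the list of column names and body
    holds one list per student, with None where a score is missing.
    """
    # Collect the distinct lessons; they become the columns.
    lessons = sorted({lesson for _, lesson, _ in rows})
    # Collect the scores per student: {student_id: {lesson_id: score}}.
    by_student = {}
    for student, lesson, score in rows:
        by_student.setdefault(student, {})[lesson] = score
    header = ["student_id"] + [f"lesson_{lesson}" for lesson in lessons]
    body = [
        [student] + [scores.get(lesson) for lesson in lessons]
        for student, scores in sorted(by_student.items())
    ]
    return header, body


if __name__ == "__main__":
    rows = [
        (1, 1, 99.5), (1, 2, 54), (1, 3, 87),
        (2, 1, 76), (2, 2, 84), (2, 3, 69),
    ]
    header, body = pivot_scores(rows)
    print(header)
    for line in body:
        print(line)
```

In a database-backed setup the same shape is usually produced in the query itself (for example with one conditional aggregate per lesson), so that the data provider can still paginate by student; that variant is not shown here.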

string replace in phpmyadmin while keeping inner text

I need to change quite a few HTML entries in a MySQL database. My problem is that some tags need to be replaced while the surrounding code needs to stay the same. In detail: all td tags in tr tags with the class "kopf" need to be changed to th tags (along with the corresponding closing tags).
It would not be a problem without the closing tags.
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
This would work.
From what I found, the % sign is used, but how exactly?
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>');
I guess this would replace all the code within the old td tags with a % sign? How can I achieve the needed replacement?
Edit: just to clarify things, here is a possible entry in the DB:
<table class="techDat" > <tbody> <tr class="kopf"> <td colspan="2"> <p><strong>Technical data:</strong></p> </td> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>
After the MySQL replacement it should look like this:
<table class="techDat" > <tbody> <tr> <th colspan="2"> <p><strong>Technical data:</strong></p> </th> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>

Try two nested replaces:
update `tt_content` set `bodytext` =
replace(replace(`bodytext`,
'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">'),
'</td></tr>','</th></tr>');
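One caveat with the nested replace is that REPLACE() substitutes literal substrings only (% is a wildcard for LIKE, not for REPLACE), so the outer call rewrites every '</td></tr>' in the field, not just the one that closes the "kopf" row, and neither literal matches when there is whitespace between the tags, as in the sample entry. The sketch below illustrates both effects in Python and shows a scoped alternative using a regular expression. It is an illustration rather than a drop-in SQL fix: the function names are hypothetical, and the pattern assumes each "kopf" row contains a single td, as in the sample.

```python
import re


def naive_replace(html):
    """Mirror of the two nested literal REPLACE() calls."""
    html = html.replace('<tr class="kopf"><td colspan="2">', '<tr><th colspan="2">')
    return html.replace('</td></tr>', '</th></tr>')


# Matches one "kopf" row: optional whitespace, a single <td ...> cell, and
# the closing tags. Assumes exactly one <td> per "kopf" row.
KOPF_ROW = re.compile(
    r'<tr class="kopf">(\s*)<td([^>]*)>(.*?)</td>(\s*)</tr>',
    re.DOTALL,
)


def promote_header_cells(html):
    """Turn the <td> of each "kopf" row into a <th>, leaving other rows alone."""
    return KOPF_ROW.sub(r'<tr>\1<th\2>\3</th>\4</tr>', html)


if __name__ == "__main__":
    compact = '<tr class="kopf"><td colspan="2">A</td></tr><tr><td>B</td></tr>'
    print(naive_replace(compact))         # the second row's closing tag changes too
    print(promote_header_cells(compact))  # only the "kopf" row changes
```

The same scoping idea could be applied by running such a script over the rows outside the database, or with a regular-expression replace function if the MySQL version in use provides one.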

Try updating your records with two queries:
1) Without the % sign:
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
2) With the % sign:
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>')
where instr(`bodytext`,'%') > 0;
