Is there a way to remove an entire row (html tags 'n all) from an HTML Table with HTML::TableExtract?
Mucking around with the sample code from CPAN, this is what I've tried so far:
use HTML::TableExtract qw(tree);
my $te = HTML::TableExtract->new( headers => [qw(name type members)] );
# get $html_string out of a file...
$te->parse($html_string);
my $table = $te->first_table_found();
my $table_tree = $table->tree;
$table_tree->row(4)->replace_content('');
my $document_tree = $te->tree;
my $document_html = $document_tree->as_HTML;
# write $document_html to a file ...
Now, as the name suggests, 'replace_content()' in the line $table_tree->row(4)->replace_content(''); removes the content of row 4, but the row itself remains in the markup. I need the tags and everything in between removed as well.
Any ideas?
What you want are the parent and delete methods.
See the docs for HTML::Element and for HTML::Element::delete.
UPDATE
Ok, click that checkmark and mark this one as answered....Here it is:
my($p) = $table_tree->row(4)->parent();
$p->delete;
Also, NOTE: you need the () parens around $p! Without the parens you don't get back a reference.
For me, with the above Perl code working on this HTML,
<table>
<tr><td>name</td><td>type</td><td>members</td></tr>
<tr><td>row1</td><td>row1</td> <td>row1</td></tr>
<tr><td>row2</td><td>row2</td> <td>row2</td></tr>
<tr><td>row3</td><td>row3</td> <td>row3</td></tr>
<tr><td>row4</td><td>row4</td> <td>row4</td></tr>
</table>
I get this as a result of printing $document_html
<table>
<tr><td>name</td><td>type</td><td>members</td></tr>
<tr><td>row1</td><td>row1</td><td>row1</td></tr>
<tr><td>row2</td><td>row2</td><td>row2</td></tr>
<tr><td>row3</td><td>row3</td><td>row3</td></tr>
</table>
Notice that there is no empty <tr></tr>
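For comparison, the same idea (detach the row element from its parent rather than emptying it) can be sketched in Python with the standard library's xml.etree.ElementTree. This is not HTML::TableExtract, and it assumes well-formed markup; the sample string and the helper name remove_row are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample mirroring the table above.
HTML = """<table>
<tr><td>name</td><td>type</td><td>members</td></tr>
<tr><td>row1</td><td>row1</td><td>row1</td></tr>
<tr><td>row2</td><td>row2</td><td>row2</td></tr>
<tr><td>row3</td><td>row3</td><td>row3</td></tr>
<tr><td>row4</td><td>row4</td><td>row4</td></tr>
</table>"""


def remove_row(table, index):
    """Detach the index-th <tr> (0-based) from the table, tags and all."""
    rows = table.findall("tr")
    table.remove(rows[index])  # removes the element itself, not just its children
    return table


table = remove_row(ET.fromstring(HTML), 4)
remaining = [tr[0].text for tr in table.findall("tr")]
serialized = ET.tostring(table, encoding="unicode")
print(remaining)
print(serialized)
```

Emptying the row's children instead would leave an empty tr element behind, which is the symptom described in the question.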
Related
I am working on my second Python scraper and keep running into the same problem. I would like to scrape the website shown in the code below. I would like to be able to input parcel numbers and see whether their Property Use Code matches. However, I am not sure whether my scraper is finding the correct row in the table. I am also not sure how to write the if statement for the case where the use code is not 3730.
Any help would be appreciated.
from bs4 import BeautifulSoup
import requests
parcel = input("Parcel Number: ")
web = "https://mcassessor.maricopa.gov/mcs.php?q="
web_page = web+parcel
web_header={'User-Agent':'Mozilla/5.0(Macintosh;IntelMacOSX10_13_2)AppleWebKit/537.36(KHTML,likeGecko)Chrome/63.0.3239.132Safari/537.36'}
response=requests.get(web_page,headers=web_header,timeout=100)
soup=BeautifulSoup(response.content,'html.parser')
table=soup.find("td", class_="Property Use Code" )
first_row=table.find_all("td")[1]
if first_row is '3730':
    print (parcel)
else:
    print ('N/A')
There's no td with class "Property Use Code" in the html you're looking at - that is the text of a td. If you want to find that row, you can use
td = soup.find('td', text="Property Use Code")
and then, to get the next td in that row, you can use:
otherTd = td.find_next_sibling()
or, if you want them all:
otherTds = td.find_next_siblings()
It's not clear to me what you want to do with the values of these tds, but you'll want to use the text attribute to access them: your first_row is '3730' will always be False, because first_row is a bs4.element.Tag object here and '3730' is a str. You can, however, get useful information from otherTd.text == '3730'.
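The lookup described above (find the label cell, take the cell next to it, compare its text with ==) can be sketched without any third-party package. This is a standard-library stand-in for the BeautifulSoup calls, not a replacement for them; the sample markup, the class RowCollector and the helper value_after are assumptions for illustration, and the real page layout may differ.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def value_after(rows, label):
    """Return the stripped text of the cell following `label`, or None."""
    for cells in rows:
        stripped = [c.strip() for c in cells]
        if label in stripped[:-1]:
            return stripped[stripped.index(label) + 1]
    return None


# Hypothetical markup; the real page may be structured differently.
SAMPLE = """<table>
<tr><td>Parcel</td><td>123-45-678</td></tr>
<tr><td>Property Use Code</td><td> 3730 </td></tr>
</table>"""

collector = RowCollector()
collector.feed(SAMPLE)
use_code = value_after(collector.rows, "Property Use Code")
# Compare string values with ==, never with `is`.
print("match" if use_code == "3730" else "N/A")
```

Returning None when the label is missing keeps the "N/A" branch reachable without raising an exception.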
In this example I am trying to get the text from within the <td> tag of a table. First, the html code.
<table>
<tbody>
<tr>
<td>Single line of text</td>
</tr>
<tr>
<td>Text here<p>First line</p><p>Second line</p></td>
</tr>
</tbody>
</table>
Then the Ruby code here.
require 'nokogiri'
require 'pp'
html = File.open('test.html').read
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table[1]/tbody/tr')
data = rows.collect do |row|
  row.at_xpath('td[1]/text()').to_s
end
pp data
And the result that I get is.
["Single line of text", "Text here"]
How can I get all of the text in the second <td> tag?
There are two changes you will need to make to get all the text nodes. First at_xpath will only ever return a single node, so to get multiple nodes you’ll need to use xpath.
Second, to get all descendant nodes, not just child nodes, use // instead of /.
Combining these, the line of code would be:
row.xpath('td[1]//text()').to_s
This will concatenate all the text nodes together, giving the result:
["Single line of text", "Text hereFirst lineSecond line"]
which may not be what you want. Rather than just calling to_s on the resulting node set, you will need to process it to fit your needs.
How about this?
pp doc.search("//tr[2]//td//text()").map { |item| item.text }
As matt says, you can get all descendants using //.
You can also index the second tr if you want that one specifically. Just leave out the indexing to get all the trs.
And you can filter the resulting text objects to get only those that have a td upstream.
Finally, map over each Nokogiri object, plucking out the text into the final array, which looks like this:
["Text here", "First line", "Second line"]
You want the text method of Nokogiri::XML::Node if you want to get all the text for any element:
p doc.xpath('//table[1]/tbody/tr').map{ |tr| tr.text.strip }
#=> ["Single line of text", "Text hereFirst lineSecond line"]
(The strip method just gets rid of leading and trailing whitespace.)
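The three results discussed in this thread (direct text only, every descendant text node as separate pieces, and all text concatenated) have close counterparts in Python's xml.etree.ElementTree. The sketch below is an analogy rather than Nokogiri itself, and it assumes the markup is well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

HTML = """<table>
<tbody>
<tr>
<td>Single line of text</td>
</tr>
<tr>
<td>Text here<p>First line</p><p>Second line</p></td>
</tr>
</tbody>
</table>"""

cells = ET.fromstring(HTML).findall("./tbody/tr/td")

# Direct text only: stops at the first child element.
direct = [td.text for td in cells]

# Every descendant text node, kept as separate pieces.
pieces = [list(td.itertext()) for td in cells]

# Every descendant text node, concatenated into one string per cell.
joined = ["".join(td.itertext()) for td in cells]

print(direct)
print(pieces)
print(joined)
```

Keeping the pieces separate leaves the choice of separator (space, newline, none) to the caller.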
I want to perform an HTML table search on one column of a table. The table in this example has 2 columns. I have added classes to the tags so that only column "Title 1" is filtered; however, the code is still looking at the "Title 2" column for the filter.
var $rows = $('#table tbody tr td[class = "col1"]');
$('#search').keyup(function() {
    var val = '^(?=.*\\b' + $.trim($(this).val()).split(/\s+/).join('\\b)(?=.*\\b') + ').*$',
        reg = RegExp(val, 'i'),
        text;
    $rows.show().filter(function() {
        text = $(this).text().replace(/\s+/g, ' ');
        return !reg.test(text);
    }).hide();
});
Could anyone provide some advice on the mistake I am making?
The main problem with your code is that hide() is applied to the <td> instead of the <tr>, so you end up searching col2 when the <td> of col1 is hidden. To fix that, select the parent <tr> and hide it:
}).parent("tr").hide();
Another problem I saw is that $rows is fine initially, but once at least one row has been hidden you end up with fewer rows to work with. You need to preserve the original set of rows and use it both to hide the rows that should be hidden and to show the rows that should be shown.
See my updated JSFiddle Demo
As per the OP's comments:
I think you're running into trouble because your selector is too detailed. Try:
var $rows = $(".col1", "table");
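The filtering logic itself (every search word must appear as a whole word, tested against one column only, with the decision applied to the whole row) can be sketched independently of jQuery. The Python below is an illustration under assumptions: the row data and the helper names build_pattern and visible_rows are hypothetical, and re.escape is an addition that the original pattern does not have.

```python
import re


def build_pattern(query):
    """All whitespace-separated words must appear as whole words, in any order."""
    words = query.split()
    lookaheads = "".join(r"(?=.*\b{}\b)".format(re.escape(w)) for w in words)
    return re.compile("^" + lookaheads + ".*$", re.IGNORECASE)


def visible_rows(rows, query, column=0):
    """Return the rows whose `column` cell matches; other columns are ignored."""
    pattern = build_pattern(query)
    return [row for row in rows
            if pattern.search(" ".join(row[column].split()))]


# Hypothetical table data: (Title 1, Title 2)
ROWS = [
    ("apple pie", "banana"),
    ("banana split", "apple"),
    ("cherry tart", "apple pie"),
]

print(visible_rows(ROWS, "apple"))
print(visible_rows(ROWS, "pie apple"))
print(visible_rows(ROWS, ""))
```

Because the function always starts from the full ROWS list, rows hidden by an earlier query come back when the query changes, which is the second issue raised in the answer above.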
I'm trying to get links from a table in HTML. Using HTML::TableExtract, I'm able to parse the table and get the text (i.e. Ability, Abnormal in the example below), but I cannot get the links contained in the table. For example,
<table id="AlphabetTable">
<tr>
<td>
Ability <span class="count">2650</span>
</td>
<td>
Abnormal <span class="count">26</span>
</td>
</table>
Is there a way to get the links using HTML::TableExtract, or is there another module that could be used in this situation? Thanks
Part of my code:
$mech->get($link->url());
$te->parse($mech->content);
foreach $ts ($te->tables){
    foreach $row ($ts->rows){
        print @$row[0]; # it only prints the text part
        # but I want its link
    }
}
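Use HTML::LinkExtor, passing the extracted cell content to its parse method. Note that this only finds links if the cell still contains its HTML markup, which I believe requires constructing HTML::TableExtract with keep_html => 1.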
my $le = HTML::LinkExtor->new();
foreach $ts ($te->tables){
    foreach $row ($ts->rows){
        $le->parse($row->[0]);
        for my $link_tag ( $le->links ) {
            my ($tag, %links) = @$link_tag;
            # next if $tag ne 'a'; # exclude other kinds of links?
            print for values %links;
        }
    }
}
Use the keep_html option in the constructor.
keep_html
Return the raw HTML contained in the cell, rather than just the visible text. Embedded tables are not retained in the HTML extracted from a cell. Patterns for header matches must take into account HTML in the string if this option is enabled. This option has no effect if extracting into an element tree structure.
$te = HTML::TableExtract->new( keep_html => 1, headers => [qw(field1 ... fieldN)]);
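As a language-neutral illustration of what both answers rely on (the cell's markup, not just its visible text, has to be available before links can be pulled out), here is a small Python sketch using only the standard library. The href values in the sample are invented, since the links in the question's markup are not shown, and the class name CellLinkParser is hypothetical.

```python
from html.parser import HTMLParser


class CellLinkParser(HTMLParser):
    """Record the visible text and the href values found inside each <td>."""

    def __init__(self):
        super().__init__()
        self.cells = []         # one dict per cell: {"text": ..., "links": [...]}
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append({"text": "", "links": []})
        elif tag == "a" and self._in_cell:
            href = dict(attrs).get("href")
            if href:
                self.cells[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1]["text"] += data


# Hypothetical markup: the href values are made up for this sketch.
SAMPLE = """<table id="AlphabetTable">
<tr>
<td><a href="/words/ability">Ability</a> <span class="count">2650</span></td>
<td><a href="/words/abnormal">Abnormal</a> <span class="count">26</span></td>
</tr>
</table>"""

parser = CellLinkParser()
parser.feed(SAMPLE)
summary = [(" ".join(c["text"].split()), c["links"]) for c in parser.cells]
for text, links in summary:
    print(text, links)
```

This simple version does not handle nested tables; a cell containing another table would need a depth counter instead of a boolean flag.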
I'm using QtWebKit and I would like to insert each new piece of incoming data into an HTML table as the first record (<tr><td>...my data ...</td></tr>) of the table.
Here is my code (this is only an example):
ui.webView->page()->mainFrame()->setHtml("<html><body><p>HTML Table Test</p>"
"<table id=\"mainTable\" name=\"mainTable\" BORDER=1 BORDERCOLOR=RED></table>"
"</body></html>");
QWebElement body = ui.webView->page()->mainFrame()->documentElement();
QWebElement mainTable = ui.webView->page()->mainFrame()->findFirstElement("#mainTable");
mainTable.appendInside ("<tr><td>1111111</td></tr>"); // <-- I would like this one to end up last
mainTable.appendInside ("<tr><td>2222222</td></tr>"); // <-- I would like this one in the middle
mainTable.appendInside ("<tr><td>3333333</td></tr>"); // <-- I would like this one to be first
The content of the records arrives dynamically, not as shown here, so I can't hard-code it; in short, I need a LIFO algorithm here.
How should I do that?
The QWebElement::appendInside method adds its argument at the end of the web element.
The QWebElement::prependInside method adds its argument at the beginning of the web element.
If we have a QWebElement *elt containing an empty table such as:
<table></table>
to create the following table,
<table>
<tr><td>A</td></tr>
<tr><td>B</td></tr>
<tr><td>C</td></tr>
</table>
you can use either of the following two methods; they are equivalent.
Method 1, with appendInside
elt->appendInside("<tr><td>A</td></tr>");
elt->appendInside("<tr><td>B</td></tr>");
elt->appendInside("<tr><td>C</td></tr>");
or method 2, with prependInside
elt->prependInside("<tr><td>C</td></tr>");
elt->prependInside("<tr><td>B</td></tr>");
elt->prependInside("<tr><td>A</td></tr>");
Using prependInside or appendInside gives you the control over the FIFO or LIFO behaviour of your algorithm.
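The ordering effect is easy to check outside Qt. The sketch below uses Python's xml.etree.ElementTree as a stand-in for QWebElement, with insert(0, ...) playing the role of prependInside and append(...) the role of appendInside; the helper add_row is hypothetical and none of this is Qt API.

```python
import xml.etree.ElementTree as ET


def add_row(table, text, newest_first=True):
    """Add <tr><td>text</td></tr> at the top (newest first) or at the bottom."""
    row = ET.Element("tr")
    ET.SubElement(row, "td").text = text
    if newest_first:
        table.insert(0, row)    # counterpart of prependInside
    else:
        table.append(row)       # counterpart of appendInside
    return row


lifo = ET.Element("table")
fifo = ET.Element("table")
for value in ["1111111", "2222222", "3333333"]:     # arrival order
    add_row(lifo, value, newest_first=True)
    add_row(fifo, value, newest_first=False)

lifo_order = [tr[0].text for tr in lifo]
fifo_order = [tr[0].text for tr in fifo]
print(lifo_order)
print(fifo_order)
```

With records arriving one at a time, prepending each new row is what gives the newest-first display asked for in the question.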