Is there a better way to get this element/node with Nokogiri?

I think the best way to explain this is via some code. Basically, the only way to identify the TR I need inside the table (I've already reached the table itself and named it annual_income_statement) is by the text of the first TD in the TR, like this:
this may be helpful to know, too:
actual html:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
html snippet:
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
original xpath
irb(main):161:0> annual_income_statement = doc.xpath("//div[@id='incannualdiv']/table[@id='fs-table']/tbody")
irb(main):121:0> a = nil
=> nil
irb(main):122:0> annual_income_statement.children.each { |e| if e.text.include? "Net Income" and e.text.exclude? "Ex"
irb(main):123:2> a = e.text
irb(main):124:2> end }
=> 0
irb(main):125:0> a
=> "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"
irb(main):127:0> a.split "\n"
=> ["Net Income", "", "191.00", "611.00", "254.00", "-1,151.00"]
but is there a better way?
more details:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
div = doc.at "div[@id='incannualdiv']" #div containing the table i want
table = div.at 'table' #table containing tbody i want
tbody = table.at 'tbody' #tbody containing tr's I want
trs = tbody.at 'tr' #SHOULD be all tr's of that table/tbody - but it's only the first TR?
I expect that last bit to give me ALL the TR's (which would include the TD i'm looking for)
but in fact it only gives me the first TR

Best is probably:
table.at 'tr:has(td[1][text()="Net Income"])'
Edit
More info:
doc = Nokogiri::HTML <<EOF
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
EOF
table = doc.at 'table'
table.at('tr:has(td[1][text()="Net Income"])').to_s
#=> "<tr>\n<td>Net Income</td>\n <td>100</td>\n </tr>\n"

Related

How filter out content from html element with the same name

HTML Code:
<table>
<tbody>
<tr>
<th>Position</th>
<th>City & Province</th>
<th>Country</th>
<th>Salary</th>
</tr>
<tr>...</tr>
<tr>...</tr>
<tr valign="top"><td>content1</td></tr>
<tr>...</tr>
<tr>...</tr>
<tr valign="top"><td>content2</td></tr>
</tbody>
</table>
I am trying to filter an array of 'tr' elements and only access the content that falls under the 'tr' element that has a 'valign' attribute, I want all the other elements to be ignored.
Puppeteer code:
const options = await page.$$eval('table tbody tr', (trArray) =>
trArray.map((tr) => tr.valign)
);
//current results = ,,,,,,,,,,,,,,
//expected results = content1,content2
Any help would be greatly appreciated.
You can select only the rows you need with the selector 'table tbody tr[valign]'. To get the content, try tr.innerText:
const options = await page.$$eval('table tbody tr[valign]', (trArray) =>
trArray.map((tr) => tr.innerText)
);
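If the selector has to stay as 'table tbody tr', the same filtering can be done inside the callback instead. This is only a sketch: the function name textOfRowsWithValign is made up for illustration, and it assumes the callback receives real <tr> elements, as it does when passed to page.$$eval.

```javascript
// Hypothetical callback for page.$$eval; it runs in the browser against <tr> elements.
function textOfRowsWithValign(trArray) {
  return trArray
    // Keep only rows that carry a valign attribute, whatever its value.
    .filter((tr) => tr.hasAttribute("valign"))
    // Read the visible text of each remaining row.
    .map((tr) => tr.innerText.trim());
}

// Assumed usage, given an already opened Puppeteer `page`:
// const options = await page.$$eval("table tbody tr", textOfRowsWithValign);
```

The attribute selector in the answer above is shorter; filtering in the callback is mainly useful when the same list of rows is needed for something else as well.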

Match and replace every 7th instance of a <td> tag

I'm struggling to wrap my head around how to get this regex working in Visual Studio Code.
I'm trying to match every 7th instance of <td> tag to then replace it with <td data-order="">.
Original
<tr>
<td>1</td>
<td>Name</td>
<td>Owner</td>
<td>Value</td>
<td>Total</td>
<td>Percent</td>
<td>Ratio</td>
<td>Final</td>
</tr>
What I want
<tr>
<td>1</td>
<td>Name</td>
<td>Owner</td>
<td>Value</td>
<td>Total</td>
<td>Percent</td>
<td data-order="">Ratio</td>
<td>Final</td>
</tr>
I've tried variations on ((?:.*<td>){1}), but any number greater than 1 just gives me a "No results" message.
You say "match every 7th instance", but I think you mean the seventh instance only, not the 7th, 14th, 21st, etc. Assuming you mean the 7th only:
If your data is really as regular and structured as you showed, you could use this as the regex in a Find
Find: (?<=<tr>\n(?:<td>.*<\/td>\n){6})(<td)
Replace: <td data-order=""
If you might have newlines within a <td>...\n...</td> tag, use this
Find: (?<=<tr>\n(?:<td>[^/]*<\/td>\n){6})(<td)
Replace: <td data-order=""
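To check the first Find/Replace pair outside the editor, here is a minimal JavaScript sketch. It assumes an engine with lookbehind support (ES2018, for example a current Node.js) and a row laid out exactly like the question's sample, one cell per line with no indentation. The variable names are only for illustration.

```javascript
// Sample row from the question: each cell on its own line, no indentation.
const row = [
  "<tr>",
  "<td>1</td>",
  "<td>Name</td>",
  "<td>Owner</td>",
  "<td>Value</td>",
  "<td>Total</td>",
  "<td>Percent</td>",
  "<td>Ratio</td>",
  "<td>Final</td>",
  "</tr>",
].join("\n");

// Same pattern as the first Find above: the lookbehind requires "<tr>\n"
// followed by six complete single-line cells directly before the match.
const seventhCell = /(?<=<tr>\n(?:<td>.*<\/td>\n){6})<td/g;

// Only the opening "<td" of the seventh cell is consumed and rewritten.
const updated = row.replace(seventhCell, '<td data-order=""');
console.log(updated);
```

Because the dot does not match a newline, the lookbehind cannot stretch across extra lines, so the eighth cell is left alone.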
VS Code's find/replace (in one file) can use a non-fixed-length lookbehind like
(?<=<tr>\n(?:<td>.*<\/td>\n){6})
The search across files cannot do that, so this regex would not work there. Sites like regex101.com can't use it either, so I'll show a demo in vscode itself:
You can use the extension Select By and its command moveby.regex.
In your keybindings.json, define a keybinding that searches for the next <td> tag.
{
"key": "ctrl+i ctrl+u", // or any other key combo
"when": "editorTextFocus",
"command": "moveby.regex",
"args": {
"regex": "<td[^>]*>",
"properties": ["next", "end"]
}
}
Select the first <tr> tag of where you want to start
Select every following <tr> tag with:
command: Add Selection to Next Find Match(Ctrl+D - editor.action.addSelectionToNextFindMatch)
menu option: Selection > Select All Occurrences
Apply the key binding as many times as you want
Your cursors are now after the n-th <td> tag
Make any edits you want
Press Esc to leave Multi Cursor mode
In Select By v1.2.0 you can specify a repeat count. The count can be fixed or asked from the user.
{
"key": "ctrl+i ctrl+u", // or any other key combo
"when": "editorTextFocus",
"command": "moveby.regex",
"args": {
"regex": "<td[^>]*>",
"properties": ["next", "end"],
"repeat": "ask"
}
}
If you leave out the property "regex" you are asked for a regular expression too.
Edit
Using a regular expression takes quite some time to get it correct
let testStr =`<tr>
<td>1</td>
<td>Name</td>
<td>Owner</td>
<td>Value</td>
<td>Total</td>
<td>Percent</td>
<td>Ratio</td>
<td>Final</td>
</tr>`;
var replace = '$1<td class="red">$2';
// Regex literal instead of a string: inside "..." the escape \s collapses to a plain "s" before RegExp sees it.
var regex = /(<tr>\s*(?:<td[^>]*>[\s\S]*?<\/td>\s*){6})<td>([\s\S]*<\/tr>)/;
var newstr=testStr.replace(regex,replace);
console.log(newstr);
document.getElementById("test").innerHTML=newstr
.red {
color: red
}
<table>
<tbody >
<tr id="test">
</tr>
</tbody>
</table>
I was interested in this because I don't know much regex and need to learn, but I managed to do it in two passes.
I hope someone will correct me and help with a correct one-pass way.
I tried to follow this but can't make it work: Find Nth Occurrence of Character
let testStr = '<td>1</td><td>Name</td><td>Owner</td><td>Value</td><td>Total</td><td>Percent</td><td>Ratio</td><td>Final</td>';
var replace = '<td class="red">';
var regex = new RegExp("((<td>.*?)){7}");
// tried with a lot of (?:...) combinations here but none works :(
var newstr=testStr.replace(regex,replace);
var regex2 = new RegExp("((</td>.*?)){6}");
var newstr2=testStr.match(regex2);
console.log(newstr);
console.log(newstr2[0]);
document.getElementById("test").innerHTML=newstr2[0]+newstr
.red {
color: red
}
<table>
<tbody >
<tr id="test">
</tr>
</tbody>
</table>
EDIT:
Got something :)
let testStr = '<td>1</td><td>Name</td><td>Owner</td><td>Value</td><td>Total</td><td>Percent</td><td>Ratio</td><td>Final</td>';
var replace = '<td class="red">';
var regex = new RegExp("(?:[</td>]){6}(<td>)");
var newstr=testStr.replace(regex,replace);
console.log(newstr);
document.getElementById("test").innerHTML=newstr
.red {
color: red
}
<table>
<tbody >
<tr id="test">
</tr>
</tbody>
</table>
And with @rioV8's help:
let testStr = '<td>1</td><td>Name</td><td>Owner</td><td>Value</td><td>Total</td><td>Percent</td><td>Ratio</td><td>Final</td>';
var replace = '$1<td class="red">';
var regex = new RegExp("((?:.*?</td>){6})<td>");
var newstr=testStr.replace(regex,replace);
console.log(newstr);
document.getElementById("test").innerHTML=newstr
.red {
color: red
}
<table>
<tbody >
<tr id="test">
</tr>
</tbody>
</table>
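The last pattern hard-codes the count 6. If the literal reading of the question is wanted (the 7th, 14th, 21st, ... cell), the same capture-group idea can be parameterised and run with the g flag. This is a sketch only: markEveryNthTd is a made-up helper name, and it assumes cells contain plain text with no nested tags, so that [^<]* cannot run past a closing </td>.

```javascript
// Hypothetical helper: add `attrs` to every n-th plain <td> in `html`.
// Assumes cell contents are text only (no nested tags).
function markEveryNthTd(html, n, attrs) {
  // Group 1 captures the (n - 1) complete cells in front of the target cell.
  // Backslashes are doubled because the pattern is built from a string.
  const pattern = new RegExp(
    "((?:<td[^>]*>[^<]*</td>\\s*){" + (n - 1) + "})<td>",
    "g"
  );
  // A replacer function puts the captured cells back unchanged, so `attrs`
  // is never interpreted as a replacement pattern such as $1.
  return html.replace(pattern, (match, before) => before + "<td " + attrs + ">");
}

// Fourteen cells in one string: the 7th and the 14th get the attribute.
const cells = Array.from({ length: 14 }, (_, i) => "<td>c" + (i + 1) + "</td>").join("");
console.log(markEveryNthTd(cells, 7, 'data-order=""'));
```

When the cells are wrapped in rows, a </tr> between cells breaks the run of consecutive cells, so counting restarts in each row.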

Is there any way to clear/hide the first td in a two td table, without access to the first td?

Is there any way to clear or hide the contents of the first td, from the second td in a two column table, without any edit access to the actual td's?
So I'd like to hide the numbers in the table below
<table>
<tr>
<td>1.</td>
<td>Content</td>
</tr>
<tr>
<td>2.</td>
<td>More content</td>
</tr>
<tr>
<td>3.</td>
<td>Even more content</td>
</tr>
</table>
This is in a vendor-supplied application that spits out the coded page. The only access is the ability to add code in the Content section (second td in each row).
I've tried to use a div tag with some absolute positioning and just cover the first td with the second, but I could never get it to work consistently.
With CSS Selectors
If your page has only one table you could use CSS selectors. In your case you need to add a style that targets <td> tags that don't have a previous <td> sibling.
td {
/* hide the first td element */
display: none;
}
td + td {
/* display all td elements that have a previous td sibling */
display: block;
}
If you are only able to add content within the second <td> of each row, then adding a whitespace-stripped version of the above code within style tags to the second <td> of the first row will probably work, but it could have messy side effects if there is more than one table on your page.
<table>
<tr>
<td>1.</td>
<td><style>td{display:none;}td+td{display:block;}</style>Content</td>
</tr>
<tr>
<td>2.</td>
<td>More content</td>
</tr>
<tr>
<td>3.</td>
<td>Even more content</td>
</tr>
</table>
With JavaScript
If you have more than one table on your page, try inserting an empty <div> with a unique ID into the content cell (second <td>) of the first row. Immediately after it, place a script that finds the closest <table> ancestor of that ID, from which you can extract the <td>s to hide. Additionally, make sure the code only runs once the page has loaded; otherwise it may not pick up any <tr>s beyond the point where the script is placed.
The easiest way to find the nearest ancestor that is a <table> is by using closest, but this isn't supported in Internet Explorer. This post has a good solution (parent only) that I'll use.
The complete script:
window.onload = function() {
function getClosest( el, tag ) {
tag = tag.toUpperCase();
do {
if ( el.nodeName === tag ) {
return el;
}
} while ( el = el.parentNode );
return null;
}
var table = getClosest( document.getElementById( 'unique-id' ), 'table' );
var trs = table.getElementsByTagName( 'tr' );
for ( var i = 0; i < trs.length; i++ ) {
trs[ i ].getElementsByTagName( 'td' )[ 0 ].style.display = 'none';
}
}
Including the <div> with a unique ID, stripping whitespace and adding the <script> tags, your table would look something like:
<table>
<tr>
<td>1.</td>
<td><div id="unique-id"></div><script>window.onload=function(){function getClosest(el,tag){tag=tag.toUpperCase();do{if(el.nodeName===tag){return el;}}while(el=el.parentNode);return null;}var table=getClosest(document.getElementById('unique-id'),'table'),trs = table.getElementsByTagName('tr');for(var i=0;i<trs.length;i++){trs[ i ].getElementsByTagName('td')[0].style.display='none';}}</script>Content</td>
</tr>
<tr>
<td>2.</td>
<td>More content</td>
</tr>
<tr>
<td>3.</td>
<td>Even more content</td>
</tr>
</table>

Ruby and Nokogiri parsing table?

This is my HTML:
<tbody><tr><th>SHOES</th></tr>
<tr>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>
This is my code:
nodes = page.css("tr").select do |el|
el.css('th').text =~ /SHOES/
end
nodes.each do |value|
puts value.css("td").text
end
I wish to get the values shoe 1, shoe 2 and shoe 3, but there is no output. I suspect there is an extra <tr></tr> in between <tr><th>SHOES</th></tr>. Or are the <br> the culprit?
There are other structures like:
<tr>
<th>SHOES</th>
<td>NBA</td>
</tr>
and I got the desired output "NBA".
What did I do wrong?
I have two kinds of structures:
Name1: value
Name1: value2
The above would give:
<tr>
<th>Name1</th>
<td>Value</td>
</tr>
but sometimes it's:
Name:
value
value2
value3
So the HTML is:
<tbody><tr><th>Name</th></tr>
<tr>
<td>value<br>value2<br> ....</td>
In HTML, tables are composed of rows. When you iterate over those rows, only one of them is the header. Although logically you see a relation between the body rows and the header row, for HTML (and therefore for Nokogiri) there is none.
If what you want is every value of the cells under a specific header, you can find the index of that column and then get the values from there.
Using this HTML as source
html = '<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>'
We then get the position of the right <th> in the first row of the table
page = Nokogiri::HTML(html)
shoes_position = page.css("tr")[0].css('th').find_index do |el|
el.text =~ /SHOES/
end
And with that, we find the <td>s in that position in every other row, and get the text from each
shoes_tds = page.css('tr').map {|row| row.css('td')[shoes_position] }.compact
shoes_names = shoes_tds.map { |td| td.text }
I use compact to remove the nil values, as the first row (the one with the headers) has no td and therefore returns nil
You can get there with css:
td = doc.at('tr:has(th[text()=SHOES]) + tr td')
td.children.map{|x| x.text.strip}.reject(&:empty?)
#=> ["Shoe 1", "shoe 2", "shoe3"]
but maybe mixing it up with xpath is better:
td.search('./text()').map{|x| x.text.strip}
#=> ["Shoe 1", "shoe 2", "shoe3"]

CodeIgniter HTML table generator with custom layout inside

CI table->generate($data1, $data2, $data3) will output my data in the form of a simple table like:
<table>
<tr>
<td>data1</td>
<td>data2</td>
<td>data3</td>
</tr>
</table>
What if I need a complex cell layout with multiple $vars within each cell:
$data1 = array('one', 'two', 'three');
and I want something like this:
<table>
<tr>
<td>
<div class="caption">$data1[0]</div>
<span class="span1">$data1[1] and here goes <strong>$data1[2]</strong></span>
</td>
<td>...</td>
<td>...</td>
</tr>
</table>
How should I code that piece?
For now I just generate the content of td in a model and then call generate(). But this means that my HTML for the cell is in the model but I would like to keep it in views.
What I would suggest is to have a view that you pass the data to and that generates the td structure. Capture the output of that view and pass it to the table generator. This keeps your structure in a view, albeit a different one.
Hailwood's answer isn't the best way to do it.
The HTML table class accepts a 'data' key in the cell arrays passed to add_row, so the code would be:
$row = array();
$row[] = array('data' => "<div class='caption'>{$data1[0]}</div><span class='span1'>{$data1[1]} and here goes <strong>{$data1[2]}</strong></span>");
$row[] = $col2;
$row[] = $col3;
$this->table->add_row($row);
echo $this->table->generate();
As an aside, having a class named caption in a table is semantically confusing because a table has its own caption tag.