Cheerio weird parsing

I am testing with very simple code:
const $ = await cheerio.load("<body> <table> kokoko </table> </body>");
await console.log($("body").first().html());
But when I run it, I get this in the console:
kokoko<table></table>
Why is the kokoko outside of the <table> tag?

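A <table> element cannot directly contain text content; the text would need to be inside a <td> or <th> cell within a <tr>.
Cheerio's default HTML parser follows the HTML parsing rules, which move misplaced text out in front of the table ("foster parenting"). The text therefore ends up directly in the <body>, the closest location where it is allowed.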

Related

How to use Nokogiri to insert a block of HTML with proper indentation?

I have a block of HTML that I want to insert into an HTML document in Nokogiri. The issue is when inserting the block anywhere in the HTML it doesn't take the indentation of where it is inserted. Here is an example:
HTML document (shortened for example):
<div>
    <div id="insertHere">
    </div>
</div>
HTML to insert:
<table>
    <tbody>
        <tr>
            <td>Hi</td>
        </tr>
    </tbody>
</table>
Result after inserting. This happens because it doesn't take into account the indentation. I want to be able to take into account the indentation where it is being inserted and pad the left of every line that is being inserted with that indentation:
<div>
    <table>
    <tbody>
        <tr>
            <td>Hi</td>
        </tr>
    </tbody>
</table>
</div>
It gets inserted using Nokogiri's node.replace('<table>....</table>').
What I want it to look like:
<div>
    <table>
        <tbody>
            <tr>
                <td>Hi</td>
            </tr>
        </tbody>
    </table>
</div>
Is there a way to get the indentation left of a block where I am inserting or replacing it?
Edit: If not using Nokogiri, is there another way I could accomplish this? Maybe set a unique ID on each element, like the data-react-id set on React elements, and then, once I have a place where I need to insert an element, use a regex to find it and match the whitespace indentation to its left? I am open to other approaches outside of Nokogiri and am trying to brainstorm other options.
I know Nokogiri can't "pretty print", but is there a way to get the whitespace "to the left of" the current element, that is, the whitespace "before the current element and after a line break", so I can count the indentation and pad what is being inserted manually? Maybe there is a way in Nokogiri to get the node's parent, then somehow use the contents of the parent to get the whitespace to the left of the current node.

Turns out Nokogiri is really not meant for pretty-printing, as @Casper mentioned. Instead I run the whole output through an HTML pretty printer; I am using the https://github.com/threedaymonk/htmlbeautifier gem for that.
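
If a pretty printer is not an option, the manual padding idea from the question can be sketched with plain Ruby string handling, before the block is handed to Nokogiri. This is only a rough sketch: indent_like is a hypothetical helper (not part of Nokogiri), and it assumes the raw document text is available and that the marker string appears on a line of its own.

```ruby
# Hypothetical helper (not a Nokogiri API): pad every line of `block` with the
# leading whitespace of the first line in `document` that contains `marker`.
def indent_like(document, marker, block)
  line = document.each_line.find { |l| l.include?(marker) }
  raise ArgumentError, "marker not found" unless line

  padding = line[/\A[ \t]*/] # leading spaces/tabs of the marker line
  # Leave blank lines alone so no trailing whitespace is introduced.
  block.each_line.map { |l| l.strip.empty? ? l : padding + l }.join
end

document = <<~HTML
  <div>
      <div id="insertHere">
      </div>
  </div>
HTML

block = <<~HTML
  <table>
      <tbody>
          <tr>
              <td>Hi</td>
          </tr>
      </tbody>
  </table>
HTML

padded = indent_like(document, 'id="insertHere"', block)
puts padded
```

The padded string carries the indentation of the insertion point on every line. Whether the final document still looks right also depends on the whitespace text nodes already around the replaced node, so running a pretty printer afterwards remains the more robust route.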

Selecting variations in Nokogiri

I'm scraping these two sites:
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Law
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=BSL.
Unfortunately, they have variations. One has the level name (e.g. Level 2) inside an <a href> link, while the other has it as plain text. How can I select one or the other, depending on which one is there?
I tried this to no avail:
level.css(/"a[href]"|".left"/).text
Here are shortened versions of the 2 HTML sections:
<table class="chart">
<tr valign="middle">
<td class="left">Level 2</td> <!-- the problem -->
<td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
</tr>
</table>
<table class="chart">
<tr valign="middle">
<td class="left">Level 1</td>
<td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
</tr>
</table>
My Code (edited from section of code to whole method)
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
  end
end

If what you want to do is grab the text that is within the innermost div, you should be able to dive all the way down just by calling #text on the parsed td element. No need to account for and walk extra tags that might be present inside, e.g. the link tag. Given your code as written:
levels = details_page.css("table.chart tr").collect do |level|
  level.text
end
For each row, that pulls the inner text (level label and percentage value) as a string and collects the results into the levels variable.
Edit: also, if all you care about is getting the level label, you can just filter the elements by class up front:
levels = details_page.css("table.chart tr td.left").collect do |level|
  level.text
end

The answer by jk_ should work in this particular case.
In the more general case, if you're going to use a CSS selector, you need to use CSS syntax for "or" (a comma). So if you were going to use the selectors you originally asked about, it'd be
level.css('a[href], .left').text

Thanks to inspiration from @jk_, I fixed it using .css(".left").text. That selects all the text in the left td inside the tr, whether or not it is wrapped in a link.
The working code:
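require "nokogiri"
require "open-uri"

def self.scrape_details_page(library_url)
  # URI.open comes from open-uri; Kernel#open no longer accepts URLs in Ruby 3.
  details_page = Nokogiri::HTML(URI.open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    # .left covers both variants: plain text and text wrapped in a link.
    { level: level.css(".left").text, available: right[0], out_of_available: right[3] }
  end
end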

Is td allowed inside thead?

I have a couple of <th> elements within a <thead> element. One of them (usually the first) is an empty <th> used as a placeholder and does not contain any text.
The WAVE tool reports an error that a <th> cannot be empty and suggests I change it to a <td>.
If I put a <td> within the <thead>, the error goes away and the page passes HTML validation too.
Is there any reason I should not have a <td> within a <thead>?

From HTML view:
<td> is allowed inside a <thead>. The permitted content of a <thead> is zero or more <tr> elements, and a <tr> element may contain <td> and/or <th> elements. It doesn't matter which.
From WCAG view:
A table must not have empty table headers; they can be really confusing for screen reader users. There is one special case: layout tables. Tables that are used only for layout can have empty <td> cells as "column headers". But if I understand your case correctly, you have regular table content, so you must provide a column header for every column.
So in your case it is not OK to have an empty <td> as a column header.

XPath:: Get following Sibling

I have the following HTML structure. I am trying to build a robust method to extract the second Color Digest element, since there will be many of these tags within the DOM.
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<td>Color Digest </td>
<td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
</tr>
<tr>
<td>Color Digest </td>
<td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
I am trying to extract the second "Color Digest" td element, the one that has the decoded value.
I wrote the following XPath, but it does not return the second td element:
//td[text() = ' Color Digest ']/following-sibling::td[2]
When I change td[2] to td[1], I get both elements.

You should be looking for the second tr that has the td that equals ' Color Digest ', then you need to look at either the following sibling of the first td in the tr, or the second td.
Try the following:
//tr[td='Color Digest'][2]/td/following-sibling::td[1]
or
//tr[td='Color Digest'][2]/td[2]
http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89

You can select a list of elements with XPath:
//td[text() = ' Color Digest ']/following-sibling::td[1]
This gives you a list of two elements, and the second one is the element you want. For example:
List<WebElement> elements = driver.findElements(By.xpath("//td[text() = ' Color Digest ']/following-sibling::td[1]"));
The intended element is then elements.get(1).
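
The same "collect the matches, then take the second one" idea can be tried without a browser. The following is only a sketch under assumptions: it uses Ruby's REXML library instead of Selenium, the sample markup is a well-formed stand-in for the question's table, and the two digest values are shortened. Trimming the label before comparing avoids depending on the exact whitespace around "Color Digest".

```ruby
require "rexml/document"

# Well-formed stand-in for the table in the question (values shortened).
xml = <<~XML
  <table>
    <tbody>
      <tr><td>Other </td><td>x </td></tr>
      <tr><td>Color Digest </td><td>AgArAQIC </td></tr>
      <tr><td>Color Digest </td><td>2,43,2,25 </td></tr>
    </tbody>
  </table>
XML

doc = REXML::Document.new(xml)

# Collect every row whose first cell reads "Color Digest" once trimmed.
rows = REXML::XPath.match(doc, "//tr")
digest_rows = rows.select { |tr| tr.elements[1].text.to_s.strip == "Color Digest" }

# elements[n] is 1-based; the second matching row holds the decoded value.
second_value = digest_rows[1].elements[2].text.to_s.strip
puts second_value
```

Filtering in Ruby rather than inside the XPath expression keeps the query trivial and makes the whitespace handling explicit.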

/html/body/table/tbody/tr[9]/td[1]
In Chrome (possibly Safari too) you can inspect an element, right-click the tag you want, and copy its XPath to select that element.

How to get access to no-existing tag?

If you create the following HTML:
<table>
<tr>
<td></td>
<td>Last1</td>
<SPAN><FONT>test1</FONT></SPAN>
</tr>
</table>
and check the DOM, you will see that IE (and maybe other browsers) creates a tag with no name.
Is there any way to get access to this tag?

Your HTML is invalid.
You can only put <td> and <th> elements directly inside <tr>s.
The browser is creating this tag in an attempt to fix your broken markup.
You should correct your HTML.
