Finding a <div> block with an 'id' and 'class' using Nokogiri - html

How can I search for the following block using Nokogiri:
<div id="live_list_cat_16" class="football-block sport-block" style="display:block;">
</div>

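Try this:
doc.search('div#foo.bar')
How does this work?
The search and at methods both accept CSS queries.
div#foo finds a div with id foo.
div.bar finds a div with class bar.
div#foo.bar combines the two, so for the block in the question the selector would be div#live_list_cat_16.football-block.sport-block.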

You can use #some_id as the CSS selector.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo" class="bar">text</div>
<div id="foo2" class="bar">more_text</div>
</body>
</html>
EOT
doc.search('#foo').to_html # => "<div id=\"foo\" class=\"bar\">text</div>"
doc.search('div.bar').to_html # => "<div id=\"foo\" class=\"bar\">text</div><div id=\"foo2\" class=\"bar\">more_text</div>"
Remember, a particular ID is only allowed to exist once in the document.

Related

How to wrap <a> tags with <H1> tag?

I have the following:
<a class="clickable_element" href="https://www.mywebsite.com">
<div class="content">My Website</div>
</a>
I want to wrap all of this in an "h1" tag.
How would I go about doing this using JavaScript?
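You would first have to create the h1 tag and insert it into the DOM, then move the <a> tag into the <h1> tag.
Like this:
const aTag = document.querySelector('a.clickable_element')
const h1Tag = document.createElement('h1')
// Put the <h1> where the <a> currently sits, then move the <a> inside it
aTag.parentNode.insertBefore(h1Tag, aTag)
// Pass the element itself, not the string 'aTag'
h1Tag.appendChild(aTag)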
I am not sure why you want to wrap this with h1 if you can just create a class for the div, but here's the code you can try:
HTML
<h1 id="wrap">
</h1>
JS
document.getElementById("wrap").innerHTML = "<a href='test.html'><div>My Website</div></a>"

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = <<~HTML
  <html>
    <body>
      <div class="item">
        <div class="profile">
          <span class="itemize">
            <div class="r12321">Plains</div>
            <div class="as124223">Trains</div>
            <div class="qwss12311232">Automobiles</div>
          </span>
        </div>
        <div class="profile">
          <span class="itemize">
            <div class="lknoijojkljl98799999">Love</div>
            <div class="vssdfsd0809809">First</div>
            <div class="awefsaf98098">Sight</div>
          </span>
        </div>
      </div>
    </body>
  </html>
HTML
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children into a hash. Here is what I have tried:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children

how to get all HTML elements with specific text by jsoup

How can I get all elements with specific text (inner HTML) in an HTML document using jsoup?
For example, all elements with the text "test":
<html><head><title>for example></title></head>
<body>
<div id="div1" class='test'>
test
<p id='p1'>test<a id='a1'>test</a></p>
<a id='a2'>test</a>
<img src='' id='img1' alt='test'>
<p id='p2'>example</p>
</div>
</body></html>
Note that I don't want to use the tags' ids or names to select the elements!
If I understand you correctly:
String html = "<html><head><title>for example></title></head><body><div id=\"div1\" class='test'>test<p id='p1'>test<a id='a1'>test</a></p><a id='a2'>test</a><img src='' id='img1' alt='test'><p id='p2'>example</p></div></body></html>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("*:containsOwn(test)");
for(Element element:elements)
{
System.out.println(element.toString()+"\n");
}
This prints the elements with ids div1, p1, a1, and a2. The img element is not matched because its alt value is an attribute, not text.

Perl: Changing html element Label by ID

How does one change an HTML element's value in Perl, using the element's ID?
For example:
<div id="container">
<form id="form" action="../code.cgi" method="post">
<label id="lblMessage" class="text">Message</label>
</form>
</div>
How would I change the label's text in my perl script?
Get an HTML or XML parser, find the element (e.g., by XPath expression), remove all its child nodes, and add a new text node with the text you want.
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_html(location => 'myfile.html');
my $xpath = XML::LibXML::XPathContext->new($dom);
foreach my $label ($xpath->findnodes('//label[@id="lblMessage"]')) {
$label->removeChildNodes();
$label->addChild($dom->createTextNode("new text"));
}
Caveat: if there are other nodes (elements, like <b> or <span>) in your label, these get removed as well.
You probably need to add some code to write the modified html back to a file.
Personally, I like HTML::TreeBuilder for this sort of task.
use HTML::TreeBuilder;
my $html = <<END;
<div id="container">
<form id="form" action="../code.cgi" method="post">
<label id="lblMessage" class="text">Message</label>
</form>
</div>
END
my $root = HTML::TreeBuilder->new_from_content( $html );
$root->elementify(); # Become a tree of HTML::Element objects.
my $message = $root->find_by_attribute( 'id', 'lblMessage' )
or die "No such element";
$message->delete_content();
$message->push_content('I am new stuff');
print $root->as_HTML();

Changing href attributes with nokogiri and ruby on rails

I have an HTML document with links, for example:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
Using Ruby on Rails, with Nokogiri or some other method, I want to end up with a final document like this:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
What's the best strategy to achieve this?
If you choose to use Nokogiri, I think this should work:
require 'cgi'
require 'nokogiri'
file_path = "your_page.html"
doc = Nokogiri::HTML(File.read(file_path))
# Only visit anchors that actually have an href attribute
doc.css("a[href]").each do |link|
  link["href"] = "http://myproxy.com/?url=#{CGI.escape(link["href"])}"
end
# Write the modified document back to the same file
File.write(file_path, doc.to_html)
If I'm not mistaken, Rails loads REXML by default, so depending on what you're trying to do you could use that as well.
Here is what I did for replacing images src attributes:
doc = Nokogiri::HTML(html)
doc.xpath("//img").each do |img|
img.attributes["src"].value = Absolute_asset_path(img.attributes["src"].value)
end
doc.to_html # simply use .to_html to convert back to HTML