How to get all HTML elements with specific text using jsoup

How can I get all elements with specific text (inner HTML) in an HTML document using jsoup?
For example, all elements with the text test:
<html><head><title>for example></title></head>
<body>
<div id="div1" class='test'>
test
<p id='p1'>test<a id='a1'>test</a></p>
<a id='a2'>test</a>
<img src='' id='img1' alt='test'>
<p id='p2'>example</p>
</div>
</body></html>
Note that I don't want to select elements by their id or tag name!

If I understand you correctly:
String html = "<html><head><title>for example></title></head><body><div id=\"div1\" class='test'>test<p id='p1'>test<a id='a1'>test</a></p><a id='a2'>test</a><img src='' id='img1' alt='test'><p id='p2'>example</p></div></body></html>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("*:containsOwn(test)");
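for (Element element : elements) {
    System.out.println(element + "\n");
}
This prints the elements with ids div1, p1, a1 and a2, because :containsOwn matches only the text an element holds directly, not text inside its children. The img is not matched, since its alt value is an attribute rather than text.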
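To make the "own text" idea concrete, here is a minimal sketch in plain JavaScript rather than jsoup. It is only a model of what :containsOwn does, not jsoup's implementation. The helper names (ownText, elementsContainingOwn, text, el) are hypothetical, and the tree is built from plain objects that mimic DOM nodes so the snippet runs without a browser.

```javascript
// Minimal sketch: "own text" matching, modelled on jsoup's :containsOwn.
// Works on anything shaped like a DOM node (nodeType, childNodes, nodeValue).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Text held directly by the element, ignoring text inside child elements.
function ownText(element) {
  return Array.from(element.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.nodeValue)
    .join("");
}

// Collect every element at or below root whose own text contains needle
// (case-insensitive, like the jsoup selector).
function elementsContainingOwn(root, needle) {
  const matches = [];
  const lowered = needle.toLowerCase();
  (function visit(element) {
    if (ownText(element).toLowerCase().includes(lowered)) matches.push(element);
    for (const child of element.childNodes) {
      if (child.nodeType === ELEMENT_NODE) visit(child);
    }
  })(root);
  return matches;
}

// Hypothetical helpers that build plain objects mimicking the question's markup.
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const el = (id, ...childNodes) => ({ nodeType: ELEMENT_NODE, id, childNodes });

const body = el("body",
  el("div1", text("test"),
    el("p1", text("test"), el("a1", text("test"))),
    el("a2", text("test")),
    el("img1"), // alt='test' is an attribute, not text, so it never matches
    el("p2", text("example"))));

const ids = elementsContainingOwn(body, "test").map((element) => element.id);
console.log(ids); // [ 'div1', 'p1', 'a1', 'a2' ]
```

Matching on descendant text instead (what jsoup's :contains does) would also return every ancestor of a match, such as body, which is usually not what the question wants.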

Related

How to wrap <a> tags with <H1> tag?

I have the following:
<a class="clickable_element" href="https://www.mywebsite.com">
<div class="content">My Website</div>
</a>
I want to wrap all of this in an "h1" tag.
How would I go about doing this using JavaScript?
You first have to create the h1 element and add it to the DOM, then append the a element to the h1 element.
Like this:
const aTag = document.querySelector('a')
const h1Tag = document.createElement('h1')
document.body.appendChild(h1Tag)
// pass the element itself, not the string 'aTag'; appending moves the link into the h1
h1Tag.appendChild(aTag)
I am not sure why you want to wrap this with h1 if you can just create a class for the div, but here's the code you can try:
HTML
<h1 id="wrap">
</h1>
JS
document.getElementById("wrap").innerHTML = "<a href='test.html'><div>My Website</div></a>"

jQuery - Insert text inside element of a HTML string

I store an html string into var HTML, which I get using the following:
var HTML = $('.group').get(0).outerHTML;
The output of HTML using console.log(HTML) is:
<div class="group">
<div class="class1">
Data123...
</div>
<div class="class2">
<!--I want to insert text here -->
</div>
</div>
Now, I want to insert some text inside the div class="class2". I am using the following code:
$(HTML).find('.class2').text("Hello!");
But now the output of HTML using console.log(HTML) is the same old HTML as before. The text "Hello!" did not get inserted. Can anyone help with the solution.
Here is the complete code:
<div class="group">
<div class="class1">
Data123...
</div>
<div class="class2">
</div>
</div>
<script type="text/javascript">
var HTML = $('.group').get(0).outerHTML;
$(HTML).find('.class2').text("Hello!");
console.log(HTML);
</script>
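You're updating a temporary DOM element, but that doesn't change the HTML string. You need to keep the parsed elements in a variable, modify them, and then serialize them back to a string.
var new_div = $(HTML);
new_div.find('.class2').text("Hello!");
// outerHTML includes the wrapping .group div; .html() would return only its contents
HTML = new_div.get(0).outerHTML;
console.log(HTML);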

How to replace contents of p tag to div tag?

I need to copy the content of my P tag into a DIV tag in almost 600 HTML pages. Each page has a different name and topic.
<div id = "topicname"></div>
<P><A NAME="4u_uvt"></A><B>FiBu-Übergabe</B></P>
I wrote this JavaScript and included it in my HTML page, but it is not working:
<script>
var mydivpchange=document.querySelectorAll("p");
document.getElementByID("topicname").innerHTML=mydivpchange[1].innerHTML;
</script>
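Note that the method is spelled getElementById, so calling getElementByID throws a TypeError. If you want to turn a <p> into a <div>, you can replace the element itself:
JS
// get the first <p> element in the document
let test=document.getElementsByTagName('p')[0];
// create a div element
let y=document.createElement('div');
// copy the paragraph's content into the div
y.innerHTML=test.innerHTML;
// replace the paragraph with the div
test.parentNode.replaceChild(y, test);
CSS
div{ background:red;}
HTML
<p>test</p>
<br/>
<p>test2</p>
<br/>
<p>test3</p>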

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = '<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>'
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children
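The same distinction between "all child nodes" and "child elements only" exists in the browser DOM (childNodes versus children), so the idea can be sketched in plain JavaScript. This is not Nokogiri code: classTextMap and the mock node builders below are hypothetical names, and the mock span only imitates the whitespace-separated divs from the question.

```javascript
// Minimal sketch: build a class => text map from element children only,
// skipping the whitespace text nodes that sit between the tags.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

function classTextMap(parent) {
  const result = {};
  for (const node of parent.childNodes) {
    if (node.nodeType !== ELEMENT_NODE) continue; // ignore text nodes
    result[node.getAttribute("class")] = node.textContent;
  }
  return result;
}

// Hypothetical mock nodes shaped like DOM nodes, so this runs without a browser.
const textNode = (value) => ({ nodeType: TEXT_NODE, textContent: value });
const div = (className, value) => ({
  nodeType: ELEMENT_NODE,
  textContent: value,
  getAttribute: (name) => (name === "class" ? className : null),
});

const span = {
  childNodes: [
    textNode("\n\t\t\t\t\t\t"),
    div("r12321", "Plains"),
    textNode(" "),
    div("as124223", "Trains"),
    div("qwss12311232", "Automobiles"),
    textNode("\n\t\t\t\t\t\t"),
  ],
};

const itemHash = classTextMap(span);
console.log(itemHash);
// { r12321: 'Plains', as124223: 'Trains', qwss12311232: 'Automobiles' }
```

Filtering by node type, rather than stripping whitespace from the markup, leaves the source HTML untouched, which matters when you do not control how it is generated.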

Using Nokogiri's CSS method to get all elements within an alt tag

I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting all elements within the alt="" tag (the name of the person), as it is nested under a few divs.
Currently, using div.consultant, it gets all the names + the roles, i.e. Person 1Founder, Chairman; CTO, instead of just the person's name in alt=.
How could I simply get the element within alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet, and calling text on a NodeSet concatenates the text of every node in it. That often mangles the content and forces you to figure out how to pull it apart again, which leads to ugly code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
"Get the inner text of all contained Node objects"
Instead, use text (AKA inner_text or content) against the individual nodes, which is documented as:
"Returns the content for this Node"
That gives you the exact text of each node, which you can then join as you want:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.