Wrap all nodes in a DocumentFragment with a DIV - html

Given a simple DocumentFragment:
html = "<h1>Three's Company</h1><p>A love triangle.</p>"
doc = Nokogiri::HTML::DocumentFragment.parse html
Is there an elegant way to wrap everything the DocumentFragment holds with a DIV? Please note that I have to do this inside a method which is supposed to return a DocumentFragment instance doc which has been parsed elsewhere. I'd like doc.to_html to look something like this:
<div class="wrapper"><h1>Three's Company</h1><p>A love triangle.</p></div>
Thanks for your hints!

Here is what I found:
require 'nokogiri'
string = "<h1>Three's Company</h1><p>A love triangle.</p>"
doc = Nokogiri::HTML::DocumentFragment.parse "<div class='foo'>"
doc.at(".//div").inner_html = string
puts doc.to_html
output:
<div class="foo">
<h1>Three's Company</h1>
<p>A love triangle.</p>
</div>

Related

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = <<~HTML
<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>
HTML
Notice that the class names are random. Notice also that the HTML contains whitespace and tabs.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes, even those that contain only whitespace.
To get only the child elements you could use an explicit XPath query (or the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which is closer to what you are already doing and returns only the children that are elements:
children = divs.element_children

How can I get list of elements or data which are on same level with same attributes?

I have a web application with one HTML page.
The page structure looks like this:
<div class = 'abc'>
<div class = 'pqr'>test1</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>test2</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
Here I want to take the data from test1 through test2.
I have tried XPath with [node number], but all the nodes turn out to be at level [1].
Is there any way to get all the data, or a list of elements, from test1 to test2, including the "-" entries?
I have seen this kind of issue before.
You have to use following-sibling here.
First, I use this kind of XPath:
//div[text()='test1']/..//following-sibling::div[@class='pqr' and not(contains(text(),'test'))]
Then you need to change your script. (Note: I have written the code in Java.)
Logic:
while(element found text = '-')
{
//get data here
}
Please try this approach.
I guess you want the following XPath:
(//div[@class='pqr'])[position()<=4]
Notice the parentheses () before the position() predicate.
Output in an XPath tester:
Element='<div class="pqr">test1</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">test2</div>'
I think you can't use the test1 and test2 elements as identifiers because they are at the same level as the nodes you want to collect. Otherwise, I think you can use findElements(By.xpath("pattern_to_search")), which returns a collection of the elements matching your pattern.
One more way, without using XPath:
List<WebElement> element = driver.findElements(By.className("pqr"));
for(int i=0;i<element.size()-1;i++){
System.out.println(element.get(i).getText());
}

how to parse nested html tag using xpath

This is my sample HTML code.
I need to parse the HTML file using HtmlXPathSelector.
def parse(self, response):
    edxData = HtmlXPathSelector(response)
First I need to get all the tags matching:
edxData.xpath('//h2[@class = "title course-title"]')
Inside that tag I need to check the a tag value.
Then I need to parse the div tag with the class name subtitle course-subtitle copy-detail.
How can I parse this value? Kindly give some suggestions.
sample html response data:
<html>
<body>
<h2 class="title course-title">
<a href="https://www.edx.org/course/mitx/mitx-14-73x-challenges-global-poverty-1350">The Challenges of Global Poverty
</a>
</h2>
<div class="subtitle course-subtitle copy-detail">A course for those who are interested in the challenge posed by massive and persistent world poverty.
</div>
</body>
</html>
One way to loop over the inner tag could be:
>>> for h2 in sel.xpath('//h2[@class = "title course-title"]'):
...     print h2.xpath('a')
...
[<Selector xpath='a' data=u'<a href="https://www.edx.org/course/mitx'>]
Or even simply:
>>> sel.xpath('//h2[@class = "title course-title"]/a')
[<Selector xpath='//h2[@class = "title course-title"]/a' data=u'<a href="https://www.edx.org/course/mitx'>]
To query another element, simply do:
>>> sel.xpath('//div[@class="subtitle course-subtitle copy-detail"]')
[<Selector xpath='//div[@class="subtitle course-subtitle copy-detail"]' data=u'<div class="subtitle course-subtitle cop'>]
It seems like you're using Scrapy; please also tag the question as such.

Indenting generated markup in Jekyll/Ruby

Well this is probably kind of a silly question but I'm wondering if there's any way to have the generated markup in Jekyll to preserve the indentation of the Liquid-tag. World doesn't end if it isn't solvable. I'm just curious since I like my code to look tidy, even if compiled. :)
For example I have these two:
base.html:
<body>
<div id="page">
{{content}}
</div>
</body>
index.md:
---
layout: base
---
<div id="recent_articles">
{% for post in site.posts %}
<div class="article_puff">
<img src="/resources/images/fancyi.jpg" alt="" />
<h2>{{post.title}}</h2>
<p>{{post.description}}</p>
Read more
</div>
{% endfor %}
</div>
The problem is that the imported {{content}} tag is rendered without the indentation used above.
So instead of
<body>
  <div id="page">
    <div id="recent_articles">
      <div class="article_puff">
        <img src="/resources/images/fancyimage.jpg" alt="" />
        <h2>Gettin' down with responsive web design</h2>
        <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
        Read more
      </div>
    </div>
  </div>
</body>
I get
<body>
  <div id="page">
    <div id="recent_articles">
  <div class="article_puff">
    <img src="/resources/images/fancyimage.jpg" alt="" />
    <h2>Gettin' down with responsive web design</h2>
    <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
    Read more
  </div>
</div>
  </div>
</body>
Seems like only the first line is indented correctly. The rest starts at the beginning of the line... So, multiline liquid-templating import? :)
Using a Liquid Filter
I managed to make this work using a liquid filter. There are a few caveats:
Your input must be clean. I had some curly quotes and non-printable chars that looked like whitespace in a few files (copypasta from Word or some such) and was seeing "Invalid byte sequence in UTF-8" as a Jekyll error.
It could break some things. I was using <i class="icon-file"></i> icons from twitter bootstrap. It replaced the empty tag with <i class="icon-file"/> and bootstrap did not like that. Additionally, it screws up the octopress {% codeblock %}s in my content. I didn't really look into why.
While this will clean the output of a liquid variable such as {{ content }} it does not actually solve the problem in the original post, which is to indent the html in context of the surrounding html. This will provide well formatted html, but as a fragment that will not be indented relative to tags above the fragment. If you want to format everything in context, use the Rake task instead of the filter.
require 'rubygems'
require 'json'
require 'nokogiri'
require 'nokogiri-pretty'
module Jekyll
module PrettyPrintFilter
def pretty_print(input)
#seeing some ASCII-8 come in
input = input.encode("UTF-8")
#Parsing with nokogiri first cleans up some things the XSLT can't handle
content = Nokogiri::HTML::DocumentFragment.parse input
parsed_content = content.to_html
#Unfortunately nokogiri-pretty can't use DocumentFragments...
html = Nokogiri::HTML parsed_content
pretty = html.human
#...so now we need to remove the stuff it added to make valid HTML
output = PrettyPrintFilter.strip_extra_html(pretty)
output
end
def PrettyPrintFilter.strip_extra_html(html)
#type declaration
html = html.sub('<?xml version="1.0" encoding="ISO-8859-1"?>','')
#second <html> tag
first = true
html = html.gsub('<html>') do |match|
if first == true
first = false
next
else
''
end
end
#first </html> tag
html = html.sub('</html>','')
#second <head> tag
first = true
html = html.gsub('<head>') do |match|
if first == true
first = false
next
else
''
end
end
#first </head> tag
html = html.sub('</head>','')
#second <body> tag
first = true
html = html.gsub('<body>') do |match|
if first == true
first = false
next
else
''
end
end
#first </body> tag
html = html.sub('</body>','')
html
end
end
end
Liquid::Template.register_filter(Jekyll::PrettyPrintFilter)
Using a Rake task
I use a task in my rakefile to pretty print the output after the jekyll site has been generated.
require 'nokogiri'
require 'nokogiri-pretty'
desc "Pretty print HTML output from Jekyll"
task :pretty_print do
  #change public to _site or wherever your output goes
  html_files = File.join("**", "public", "**", "*.html")
  Dir.glob html_files do |html_file|
    puts "Cleaning #{html_file}"
    contents = File.read(html_file)
    begin
      #we're gonna parse it as XML so we can apply an XSLT
      html = Nokogiri::XML(contents)
      #the human() method is from nokogiri-pretty. Just an XSL transform on the XML.
      pretty_html = html.human
    rescue StandardError => msg
      puts "Failed to pretty print #{html_file}: #{msg}"
      #skip this file so it is not overwritten with empty or stale output
      next
    end
    #Yep, we're overwriting the file. Potentially destructive.
    File.write(html_file, pretty_html)
  end
end
We can accomplish this by writing a custom Liquid filter to tidy the html, and then doing {{content | tidy }} to include the html.
A quick search suggests that the ruby tidy gem may not be maintained but that nokogiri is the way to go. This will of course mean installing the nokogiri gem.
See advice on writing liquid filters, and Jekyll example filters.
An example might look something like this: in _plugins, add a script called tidy-html.rb containing:
require 'nokogiri'
module TextFilter
def tidy(input)
desired = Nokogiri::HTML::DocumentFragment.parse(input).to_html
end
end
Liquid::Template.register_filter(TextFilter)
(Untested)
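Neither filter above indents the fragment relative to the layout, which is what the question actually asks for. One possible direction, sketched below, is a filter that pads every line after the first. The module name IndentFilter, the filter name indent and its width argument are hypothetical (they are not part of Jekyll or Liquid), and the sketch has not been tried inside a real Jekyll build.

```ruby
# Hypothetical Liquid filter module (the names are assumptions, not Jekyll API)
module IndentFilter
  # Pads every line after the first with `width` spaces. The first line
  # already inherits the indentation of the tag in the layout.
  # Blank lines are left empty.
  def indent(input, width = 2)
    pad = ' ' * width.to_i
    input.to_s.gsub(/\n(?=.)/, "\n#{pad}")
  end
end

# Register only when Liquid is loaded, so the file also runs standalone
Liquid::Template.register_filter(IndentFilter) if defined?(Liquid)

# Standalone demonstration
helper = Object.new.extend(IndentFilter)
puts helper.indent("<div>\n<p>Hi</p>\n</div>", 4)
```

In the layout this would be used as {{ content | indent: 4 }}, with the width matching the column of the tag. Note that it also pads lines inside pre blocks, which changes how they render.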

Changing href attributes with nokogiri and ruby on rails

I have an HTML document with links, for example:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
I want with Ruby on Rails, with nokogiri or some other method, to have a final doc like this:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
What's the best strategy to achieve this?
If you choose to use Nokogiri, I think this should work:
require 'cgi'
require 'rubygems' rescue nil
require 'nokogiri'
file_path = "your_page.html"
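doc = Nokogiri::HTML(File.read(file_path))
# only visit links that actually have an href attribute
doc.css("a[href]").each do |link|
  link["href"] = "http://myproxy.com/?url=#{CGI.escape(link["href"])}"
end
# the block form closes the file once the document has been written
File.open(file_path, 'w') { |file| doc.write_to(file) }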
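To see what the rewritten href looks like, here is a tiny standalone sketch of just the escaping step. It needs only the standard library; the proxy prefix is the one from the answer above, while the method name proxied and the example URL are made up for illustration.

```ruby
require 'cgi'

# Proxy prefix taken from the answer above
PROXY_PREFIX = 'http://myproxy.com/?url='

# Hypothetical helper: builds the proxied form of a single URL.
# CGI.escape percent-encodes characters such as ":", "/", "?", "=" and "&",
# so the original URL survives as one query parameter value.
def proxied(url)
  "#{PROXY_PREFIX}#{CGI.escape(url)}"
end

puts proxied('http://example.com/page?a=1&b=2')
```

Without the escaping, the & in the original URL would be read by the proxy as the start of a second parameter.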
If I'm not mistaken, Rails loads REXML by default; depending on what you're trying to do, you could use that as well.
Here is what I did for replacing images src attributes:
doc = Nokogiri::HTML(html)
doc.xpath("//img").each do |img|
img.attributes["src"].value = Absolute_asset_path(img.attributes["src"].value)
end
doc.to_html # simply use .to_html to convert back to HTML