Nokogiri prevent converting entities - html

require "nokogiri"
def wrap(content)
  doc = Nokogiri::HTML.fragment("<div>" + content + "</div>")
  doc.at("div").traverse do |p|
    if p.is_a?(Nokogiri::XML::Text)
      input = p.content
      # "\u00AD" is the soft hyphen character
      p.content = input.scan(/.{1,5}/).join("\u00AD")
    end
  end
  doc.at("div").inner_html
end
wrap("aaaaaaaaaa")
gives me
"aaaaa&shy;aaaaa"
instead of
"aaaaa­aaaaa"
How can I get the second result?

Return
doc.at("div").text
instead of
doc.at("div").inner_html
This, however, strips all HTML from the result. If you need to retain other markup, you can probably get away with using CGI.unescapeHTML:
CGI.unescapeHTML(doc.at("div").inner_html)
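One caveat worth checking before relying on this: as far as I know, CGI.unescapeHTML only decodes the basic named entities (&amp;, &lt;, &gt;, &quot;, &apos;) and numeric references, so &shy; may come through unchanged. A minimal sketch that swaps the entity back explicitly is below; restore_soft_hyphens and SOFT_HYPHEN are hypothetical names, and the input string stands in for the inner_html output.

```ruby
require 'cgi'

# U+00AD, the character that Nokogiri serializes as &shy;
SOFT_HYPHEN = "\u00AD"

# Hypothetical helper: put literal soft hyphens back into serialized HTML,
# leaving all other markup and entities untouched.
def restore_soft_hyphens(html)
  html.gsub('&shy;', SOFT_HYPHEN)
end

serialized = 'aaaaa&shy;aaaaa <b>x &amp; y</b>'
puts restore_soft_hyphens(serialized).inspect
# For comparison, what CGI.unescapeHTML does with the same string:
puts CGI.unescapeHTML(serialized).inspect
```

Unlike unescaping the whole string, this leaves &amp; and &lt; escaped, so the result is still valid HTML.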

nokogiri scrape all children divs from a selected div

I was playing around with Nokogiri in my free time, and I am afraid I got really stuck. I have been trying to solve this problem since this morning (almost 8 hours now :( ) and it looks like I haven't made any progress at all. I want to scrape all the threads on the page. So far I have worked out that the parent of all threads is
<div id="threads" class="extended-small">
Each thread consists of 3 elements:
link to the image
div#title that contains value of replies(R) and images(I)
div#teaser that contains the name of the thread
My question is: how can I select the children of id='threads' and push each child, with its 3 elements, into the array?
As you can see from this code, I don't really know what I am doing, and I would very much appreciate any help.
require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'
page = HTTParty.get('https://boards.4chan.org/g/catalog')
parse_page = Nokogiri::HTML(page)
threads_array = []
threads = parse_page.search('.//*[@id="threads"]/div') do |a|
  post_id = a.text
  post_pic = a.text
  post_title = a.text
  post_teaser = a.text
  threads_array.push(post_id,post_pic,post_title,post_teaser)
end
CSV.open('sample.csv','w') do |csv|
csv << threads_array
end
Pry.start(binding)
Doesn't look like the raw HTML source contains those fields which is why you're not seeing it when parsing with HTTParty and Nokogiri. It looks like they put the data in a JS variable farther up. Try this:
require 'rubygems'
require 'httparty'
require 'json'
page = HTTParty.get('https://boards.4chan.org/g/catalog')
m = page.match(/var catalog = ({.*?});var/)
json_str = m.captures.first
catalog = JSON.parse(json_str)
pp catalog
Whether that is robust enough I'll let you decide :)
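To get from the parsed catalog to one row per thread, something like the following might work. It is only a sketch run against a made-up page body: SAMPLE_PAGE and the field names "threads", "r", "i" and "teaser" are assumptions for illustration, so inspect the real output of pp catalog before relying on them.

```ruby
require 'json'

# Made-up stand-in for the downloaded page body (hypothetical field names).
SAMPLE_PAGE = '<script>var catalog = {"threads":{"101":{"r":5,"i":2,"teaser":"First thread"}}};var style = "x";</script>'

# Pull the JSON object assigned to `var catalog` out of the page source.
# Returns nil when the variable is not found.
def extract_catalog(html)
  m = html.match(/var catalog = (\{.*?\});var/m)
  return nil unless m
  JSON.parse(m[1])
end

# Flatten the catalog into [id, replies, images, teaser] rows.
def thread_rows(catalog)
  catalog.fetch('threads', {}).map do |id, thread|
    [id, thread['r'], thread['i'], thread['teaser']]
  end
end

catalog = extract_catalog(SAMPLE_PAGE)
thread_rows(catalog).each { |row| puts row.join(', ') }
```

Each row can then be appended to a CSV with csv << row, one thread per line.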

Apache Tika Document Content Extraction Per Page

I am using Apache Tika 1.9 and content extraction works well.
The problem I am facing is with pages. I can extract the total page count from the document metadata, but I can't find any way to extract content per page from the document.
I have searched a lot and tried some solutions suggested by other users, but none worked for me, maybe because of the newer Tika version.
Please suggest a solution or a direction for further research.
I will be thankful.
NOTE: I am using JRuby for the implementation.
Here is the class for custom content handler that I created and which solved my issue.
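# JRuby; assumes the Tika jars are already on the classpath.
require 'java'
java_import 'org.apache.tika.sax.ToXMLContentHandler'
java_import 'java.lang.StringBuilder'
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class
  attr_accessor :page_map
  def initialize
    super() # initialize the Java superclass
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = Hash.new
  end
  # A new page starts at every <div class="page"> element.
  def startElement(uri, local_name, q_name, atts)
    start_page() if @page_tag == q_name && atts.getValue('class') == @page_class
  end
  def endElement(uri, local_name, q_name)
    end_page() if @page_tag == q_name
  end
  # Append text to the current page; text before the first page is ignored.
  def characters(ch, start, length)
    if length > 0
      builder = StringBuilder.new(length)
      # Copy only the reported slice of the SAX buffer, not the whole array.
      builder.append(ch, start, length)
      @page_map[@page_number] << builder.to_s if @page_number > 0
    end
  end
  def start_page
    @page_number = @page_number + 1
    @page_map[@page_number] = String.new
  end
  def end_page
    return
  end
end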
And to use this content handler, here is the code:
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map
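The bookkeeping in that handler can be tried without Tika. The sketch below is a plain-Ruby model, not Tika's API: PageCollector and the hand-written event list are hypothetical, and it assumes, as the handler above does, that every page arrives wrapped in a div with class "page".

```ruby
# Plain-Ruby model of the per-page bookkeeping; no Tika classes involved.
class PageCollector
  attr_reader :page_map

  def initialize(page_tag: 'div', page_class: 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # Open a new page bucket whenever a page marker element starts.
  def start_element(name, attrs = {})
    return unless name == @page_tag && attrs['class'] == @page_class
    @page_number += 1
    @page_map[@page_number] = +''
  end

  # Append text to the current page; text before the first page is dropped.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

collector = PageCollector.new
events = [
  [:text, 'ignored preamble'],
  [:start, 'div', { 'class' => 'page' }],
  [:text, 'First page text'],
  [:start, 'p', {}],
  [:text, ' continues'],
  [:start, 'div', { 'class' => 'page' }],
  [:text, 'Second page text']
]
events.each do |kind, *args|
  kind == :start ? collector.start_element(*args) : collector.characters(*args)
end
p collector.page_map
```

Note that nothing happens when an element ends: a page simply runs until the next page marker, which is why end_page in the handler above can stay empty.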

insert html into string at several positions at the same time

So I have a Peptide, which is a string of letters, corresponding to aminoacids
Say the peptide is
peptide_sequence = "VEILANDQGNR"
And it has a modification on L at position 4 and R at position 11,
I would like to insert <span class="modified_aa"> and </span> before and after the letters at those positions, all at the same time.
Here is what I tried:
My modifications are stored in an array pep_mods of objects modification containing an attribute location with the position, in this case 4 and 11
pep_mods.each do |m|
  peptide_sequence.gsub(peptide_sequence[m.position.to_i-1], "<span class=\"mod\">#{@peptide_sequence[m.location.to_i-1]}</span>" )
end
But since there are two modifications, after the first span tag is inserted all the later positions in the string shift.
How could I achieve what I intend to do? I hope that was clear.
You should work backwards: make the modifications starting with the last one. That way the indices of the earlier modifications are unchanged.
You might need to sort the array of indices in reverse order - then you can use the code you currently have.
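A minimal sketch of the backwards approach, assuming 1-based locations as in the question. Modification is a hypothetical Struct standing in for the real modification objects, and mark_modifications is an illustrative name.

```ruby
# Hypothetical stand-in for the question's modification objects.
Modification = Struct.new(:location)

# Wrap the residue at each 1-based location in a span, working from the
# highest location down so earlier indices are never shifted.
def mark_modifications(sequence, mods)
  result = sequence.dup
  mods.map { |m| m.location.to_i }.sort.reverse.each do |pos|
    idx = pos - 1 # string indices are 0-based
    result[idx] = %(<span class="mod">#{result[idx]}</span>)
  end
  result
end

peptide_sequence = "VEILANDQGNR"
pep_mods = [Modification.new(4), Modification.new(11)]
puts mark_modifications(peptide_sequence, pep_mods)
```

Because position 11 is handled before position 4, the text inserted around R never moves the L.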
Floris's answer is correct, but if you want to do it the hard way (O(n^2) instead of O(n log n)), here is the basic idea.
Instead of relying on gsub you can iterate over the characters checking if each has an index corresponding to one of the modifications. If the index matches, perform the modification. Otherwise, keep the original character.
modified = peptide_sequence.each_char.with_index.map do |c, i|
  # Locations are 1-based, string indices are 0-based.
  if pep_mods.any? { |m| m.location.to_i - 1 == i }
    %Q{<span class="mod">#{c}</span>}
  else
    c
  end
end.join('')
Ok, just in case this is helpful for anyone else, this is how I finally did it:
I first converted the peptide sequence to an array :
pep_seq_arr = peptide_sequence.split("")
then used each_with_index as Casey mentioned
pep_seq_arr.each_with_index do |aa, i|
  pep_mods.each do |m|
    pep_seq_arr[i] = "<span class='mod'>#{aa}</span>" if i == m.location.to_i-1
  end
end
and finally joined the array:
pep_seq_arr.join
It was easier than I first thought

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
  Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
  puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
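require 'open-uri'
# URI.open is the open-uri entry point on Ruby 2.5+; plain open(url) no longer fetches URLs on Ruby 3.0+.
page_content = URI.open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end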
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the section "Following Redirection" in the Net::HTTP documentation.
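The "Object Moved" body in the question is what a redirect response looks like when it is not followed. A sketch along the lines of the pattern in the Net::HTTP documentation is below; fetch and its limit parameter are illustrative names, and the sketch assumes the Location header holds an absolute URL.

```ruby
require 'net/http'
require 'uri'

# Fetch a URL, following up to `limit` redirects.
# Assumes each Location header is an absolute URL.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch(response['location'], limit - 1)
  else
    response.value # raises an error for non-2xx responses
  end
end

# Usage (performs a network request, so it is left commented out):
# page_content = fetch('http://www.stackoverflow.com')
```

The limit guards against redirect loops; ten hops is an arbitrary default.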
Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
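Putting the two answers together, a letter histogram over the tag-stripped text might look like the sketch below. letter_histogram is a hypothetical helper, the sample string is the response body from the question, and the regex is the simple tag stripper above, which will not cope with a ">" inside an attribute value or with script and style contents.

```ruby
# Strip tags with the simple regex, then count only the letters a-z,
# ignoring case.
def letter_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, '')
  text.downcase.each_char.with_object(Hash.new(0)) do |c, counts|
    counts[c] += 1 if c =~ /[a-z]/
  end
end

sample = '<body><h1>Object Moved</h1>This document may be found here</body>'
letter_histogram(sample).sort.each { |letter, n| puts "#{letter}: #{n}" }
```

Hash.new(0) gives every missing key a count of zero, so no ||= initialization is needed.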

How to wrap words in HTML document without attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or its attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "nokogiri"
class String
  def wrap()
    doc = Nokogiri::HTML(self)
    doc.at("body").traverse do |p|
      if p.is_a?(Nokogiri::XML::Text)
        input = p.content
        p.content = input.scan(/.{1,25}/).join(" ")
      end
    end
    doc.to_s # I want only the wrapped string, without the head/body stuff
  end
end
I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.
This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
doc = Nokogiri::HTML.fragment(html)
doc.search('text()').each do |n|
  n.content = n.content.split(' ').map { |l|
    if (l.size > max_word_length)
      l = l.scan(/.{1,#{ max_word_length }}/).join("\n")
    end
    l
  }.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>
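If Nokogiri is not an option, or the input is often plain text with no tags at all, a rough regex-based alternative is sketched below. This is a different technique from the Nokogiri answer, not a replacement for a real parser: wrap_long_words is a hypothetical helper, and it assumes tags are well formed and contain no ">" inside attribute values. Entities such as &amp; inside a long word could also be split.

```ruby
# Split the input into tags and text, then break only the text parts.
# Words longer than `max` characters are cut into chunks joined by "\n".
def wrap_long_words(html, max = 30)
  html.split(/(<[^>]*>)/).map do |part|
    next part if part.start_with?('<') # leave tags and attributes alone
    part.gsub(/\S{#{max + 1},}/) { |word| word.scan(/.{1,#{max}}/).join("\n") }
  end.join
end

html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
puts wrap_long_words(html, 30)
puts wrap_long_words('plain text without any tags')
```

Because nothing is parsed into a document, tagless input comes back without any added paragraph or html/body wrapper.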