Nokogiri help without spaces - html

I have the following code:
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cora'
require 'eat'
#require 'timeout'
doc = Nokogiri::HTML(open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
doc = doc.xpath('//div').each do |node|
  puts node.content
end
How can I remove the p-tags and spaces?

Here's a guess at what you might want:
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
# Drop every <p> nested inside a <div>
doc.xpath('//div//p').remove
doc.xpath('//div').each do |node|
  # Collapse runs of blank lines, then trim leading/trailing whitespace on each line
  text = node.text.gsub(/\n([ \t]*\n)+/,"\n").gsub(/^\s+|\s+$/,'')
  puts text unless text.empty?
end
This removes all <p> elements from the document and then removes all blank lines and leading and trailing whitespace from the text. In the end, it does not print the text if the result was an empty string.
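To see what the two gsub calls do on their own, here is a minimal sketch that needs no network access or Nokogiri. The sample string is made up for illustration; it only imitates the kind of whitespace-heavy text that node.text tends to return.

```ruby
# Hypothetical sample text with blank lines and padded lines
raw = "  Abfahrt \n\n \t\n  20:41 U4  \n\n"

# Step 1: collapse any run of blank (or whitespace-only) lines into one newline
collapsed = raw.gsub(/\n([ \t]*\n)+/, "\n")

# Step 2: strip whitespace at the start and end of every line
clean = collapsed.gsub(/^\s+|\s+$/, '')

puts clean.inspect
```

Because ^ and $ match at line boundaries in Ruby, the second gsub trims each line rather than only the string as a whole.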
Edit: To make a variable for the date, wrap the above in a function and use string interpolation to construct your URL. For example:
require 'nokogiri'
require 'open-uri'
def get_data( date )
  # The URL above uses dots in the date (27.01.12), hence '%d.%m.%y'
  date_string = date.strftime('%d.%m.%y')
  url = "http://mobile.bahn.de/…more…#{date_string}…more…"
  doc = Nokogiri::HTML(URI.open(url))
  # more code from above
end
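As a self-contained illustration of building the query string from a date, the sketch below uses only the standard library. The host in BASE_URL and the helper name departure_url are placeholders invented for this example, not the real endpoint; only the parameter names date, time and boardType come from the URL in the question.

```ruby
require 'date'
require 'uri'

# Placeholder endpoint (assumption); substitute the real base URL
BASE_URL = 'http://example.invalid/departures'

# Hypothetical helper: formats the date as dd.mm.yy and URL-encodes the parameters
def departure_url(date, time)
  query = URI.encode_www_form(
    date: date.strftime('%d.%m.%y'),
    time: time,
    boardType: 'Abfahrt'
  )
  "#{BASE_URL}?#{query}"
end

url = departure_url(Date.new(2012, 1, 27), '20:41')
puts url
```

Letting URI.encode_www_form do the escaping avoids hand-writing sequences such as %3A for the colon in the time.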

Related

nokogiri scrape all children divs from a selected div

I was playing around with Nokogiri in my free time, and I am afraid I got really stuck. I have been trying to solve this problem since this morning (almost 8 hours now :( ) and it looks like I haven't made any progress at all. I want to scrape all the threads on the page. So far I have worked out that the parent of all threads is
<div id="threads" class="extended-small">
Each thread consists of 3 elements:
link to the image
div#title that contains value of replies(R) and images(I)
div#teaser that contains the name of the thread
My question is: how can I select the children of id='threads' and push each child, with its 3 elements, to the array?
As you can see in this code, I don't really know what I am doing, and I would very, very much appreciate any help.
require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'
page = HTTParty.get('https://boards.4chan.org/g/catalog')
parse_page = Nokogiri::HTML(page)
threads_array = []
threads = parse_page.search('.//*[#id="threads"]/div') do |a|
post_id = a.text
post_pic = a.text
post_title = a.text
post_teaser = a.text
threads_array.push(post_id,post_pic,post_title,post_teaser)
end
CSV.open('sample.csv','w') do |csv|
csv << threads_array
end
Pry.start(binding)
It doesn't look like the raw HTML source contains those fields, which is why you're not seeing them when parsing with HTTParty and Nokogiri. It looks like they put the data in a JS variable farther up the page. Try this:
require 'rubygems'
require 'httparty'
require 'json'
page = HTTParty.get('https://boards.4chan.org/g/catalog')
m = page.match(/var catalog = ({.*?});var/)
json_str = m.captures.first
catalog = JSON.parse(json_str)
pp catalog
Whether that is robust enough I'll let you decide :)
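For reference, here is an offline sketch of the same idea applied to a tiny inline string, so the regex and the JSON step can be tried without hitting the site. The JSON shape and the field names r, i and teaser are assumptions made for this example; the real catalog structure may differ.

```ruby
require 'json'

# Hypothetical page fragment with the data embedded in a JS variable
html = '<script>var catalog = {"threads":{"101":{"r":5,"i":2,"teaser":"Hello"}},"count":1};var style_group = "ws";</script>'

# Lazily capture everything between "var catalog = " and the next "};var"
m = html.match(/var catalog = (\{.*?\});var/)
catalog = JSON.parse(m[1])

# Flatten each thread into [id, replies, images, teaser]
rows = catalog['threads'].map do |id, thread|
  [id, thread['r'], thread['i'], thread['teaser']]
end

p rows
```

Each element of rows is one thread, which is the shape you would want when writing one CSV row per thread.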

How to extract text from a tag using Nokogiri

Example:
<p><a href="…">http://localhost:3000/replies/279</a><br><p>
Currently using Nokogiri to grab the href from the <a>:
doc.search('a').each do |node|
  href = node.attributes['href'].try(:value)
end
I need to make sure what's in the text part is what's in the href and I'm not sure how to extract that.
Here are the basics for checking:
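require 'nokogiri'
# The href below is a hypothetical value that differs from the link text
doc = Nokogiri::HTML(<<EOT)
<p><a href="http://localhost:3000/replies/1">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => false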
Modifying the HTML so the HREF and text match:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p><a href="http://localhost:3000/replies/279">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => true
at returns only the first node that matches the selector, so if you're looking to check multiple nodes you'll want to use search and iterate over the NodeSet it returns.

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'
# On Ruby 3.0+ use URI.open; Kernel#open no longer opens URLs
page_content = URI.open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the "Following Redirection" section of the Net::HTTP documentation.
Stripping HTML tags without Nokogiri:
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
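Putting the pieces from these answers together, a rough sketch of the whole task could look like the following. The helper name letter_histogram is made up for this example, and the regex-based tag stripping is only an approximation that will misfire on things like a literal > inside an attribute value.

```ruby
# Hypothetical helper: strip tags with a regex, then count letters case-insensitively
def letter_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, '')   # crude tag removal, no HTML parser involved
  histogram = Hash.new(0)              # missing keys start at 0
  text.downcase.each_char do |c|
    histogram[c] += 1 if c.match?(/[a-z]/)
  end
  histogram
end

sample = '<body><h1>Object Moved</h1>This document may be found here</body>'
counts = letter_histogram(sample)
puts counts.sort.map { |letter, n| "#{letter}: #{n}" }
```

Using Hash.new(0) removes the need for the ||= 0 initialisation step on every character.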

Nokogiri prevent converting entities

def wrap(content)
require "Nokogiri"
doc = Nokogiri::HTML.fragment("<div>"+content+"</div>")
chunks = doc.at("div").traverse do |p|
if p.is_a?(Nokogiri::XML::Text)
input = p.content
p.content = input.scan(/.{1,5}/).join("­")
end
end
doc.at("div").inner_html
end
wrap("aaaaaaaaaa")
gives me
"aaaaa&shy;aaaaa"
instead of
"aaaaa­aaaaa"
How can I get the second result?
Return
doc.at("div").text
instead of
doc.at("div").inner_html
This, however, strips all HTML from the result. If you need to retain other markup, you can probably get away with using CGI.unescapeHTML:
CGI.unescapeHTML(doc.at("div").inner_html)
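One caveat worth checking before relying on this: as far as I know, CGI.unescapeHTML only decodes &amp;, &lt;, &gt;, &quot;, &apos; and numeric references, so a named entity such as &shy; may pass through unchanged. The sketch below works on a plain string standing in for the inner_html output (an assumption, no Nokogiri involved) and shows a targeted replacement as a fallback.

```ruby
require 'cgi'

# Stand-in for the markup returned by inner_html (assumed for illustration)
html = 'aaaaa&shy;aaaaa <b>&amp;</b>'

# Decodes &amp; here, but is not expected to touch &shy;
partly = CGI.unescapeHTML(html)

# Targeted fallback: swap only the soft-hyphen entity for the real character (U+00AD)
restored = html.gsub('&shy;', "\u00AD")

puts partly
puts restored.inspect
```

The targeted gsub also leaves entities such as &amp; escaped, which keeps the remaining markup valid HTML.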

How to wrap words in HTML document without attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or its attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "Nokogiri"
class String
  def wrap()
    doc = Nokogiri::HTML(self)
    doc.at("body").traverse do |p|
      if p.is_a?(Nokogiri::XML::Text)
        input = p.content
        p.content = input.scan(/.{1,25}/).join(" ")
      end
    end
    doc.to_s # I want only the wrapped string, without the head/body stuff
  end
end
I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.
This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
doc = Nokogiri::HTML.fragment(html)
doc.search('text()').each do |n|
  n.content = n.content.split(' ').map { |l|
    if (l.size > max_word_length)
      l = l.scan(/.{1,#{ max_word_length }}/).join("\n")
    end
    l
  }.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>
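The word-splitting step can be tried in isolation on a plain string, which also covers the case where the input has no HTML tags at all. The helper name break_long_words and the sample input are made up for this sketch; it joins the pieces with a space rather than a newline.

```ruby
# Hypothetical helper: break any word longer than max into chunks of at most max characters
def break_long_words(text, max)
  text.split(' ').map { |word|
    word.size > max ? word.scan(/.{1,#{max}}/).join(' ') : word
  }.join(' ')
end

wrapped = break_long_words('a abcdefghij b', 4)
puts wrapped
```

In the Nokogiri version above, the same transformation is applied to each text node, so tag names and attribute values are never touched.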