Download HTML Text with Ruby - html

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'

# This will be the hash used to store the histogram.
histogram = Hash.new(0)

def open(url)
  Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')
page_content.each do |i|  # String#each iterates lines on Ruby 1.8; use each_line on 1.9+
  puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the HTML tags. I'm just trying to get "Object Moved" and "This document may be found here".
Is there any reasonably easy way to do this?

When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'

page_content = open('http://www.stackoverflow.com').read

histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).

See the section "Following Redirection" in the Net::HTTP documentation.
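The pattern in the docs boils down to something like this (a sketch along those lines; check the linked docs for the canonical version):

require 'net/http'
require 'uri'

# Follow redirects up to a fixed limit, raising if we loop too long.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess     then response.body
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else                           response.value  # raises for other response classes
  end
end

page_content = fetch('http://www.stackoverflow.com')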

Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
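Combining that regex with the histogram answer above gives a minimal end-to-end sketch (note that on Ruby 3+, open-uri's open must be called as URI.open):

require 'open-uri'

page_content = open('http://www.stackoverflow.com').read  # URI.open on Ruby 3+
text_only = page_content.gsub(/<\/?[^>]*>/, "")           # crude tag stripping, as above

# Count only letters, since the goal was a histogram of a, b, c, etc.
histogram = Hash.new(0)
text_only.each_char { |c| histogram[c.downcase] += 1 if c =~ /[a-z]/i }
puts histogram.sort_by { |_, n| -n }.inspect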


Replacing <a> with Nokogiri

I am using Nokogiri to scan a document and remove specific files that are stored as attachments. However, I want a note left in-line showing that the value was removed.
E.g.
File Download
converted to:
File Removed
Here is what I tried:
@doc = Nokogiri::HTML(html)
@doc.search('a').each do |attachment|
  attachment.remove
  # ALSO TRIED:
  attachment.content = "REMOVED"
end
The second one does replace the anchor text but keeps the href and the user can still download the value.
How can I replace the anchor value and change it to a <p> with a new string?
Use a combination of create_element and replace to achieve that. See the comments inline below.
html = '<a href="file.pdf">File Download</a>'  # sample anchor; the href is a stand-in
dom = Nokogiri::HTML(html)  # parse with Nokogiri
dom.to_s  # original content
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"file.pdf\">File Download</a></body></html>\n"

# scan the dom for hyperlinks
dom.css('a').each do |a|
  node = dom.create_element 'p'  # create a paragraph element
  node.inner_html = "REMOVED"    # add the content you want
  a.replace node                 # replace the found link with the paragraph
end

dom.to_s  # modified html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>REMOVED</p></body></html>\n"
Hope this helps
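As a variation on the above (not from the original answer), you can also mutate each anchor in place instead of building a replacement node:

dom.css('a').each do |a|
  a.name = 'p'               # rename the <a> element to <p> in place
  a.remove_attribute('href') # drop the link target so nothing is downloadable
  a.content = 'File Removed' # swap in the new text
end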

nokogiri scrape all children divs from a selected div

I was playing around with Nokogiri in my free time, and I am afraid I got really stuck. I have been trying to solve this problem since this morning (almost 8h now :( ) and it looks like I haven't progressed at all. On the website I want to scrape all the threads on the page. So far I have realized that the parent for all threads is
<div id="threads" class="extended-small">
each thread consists of 3 elements:
link to the image
div#title that contains value of replies(R) and images(I)
div#teaser that contains the name of the thread
My question is: how can I select the children of id="threads" and push each child, with its 3 elements, to the array?
As you can see in this code, I don't really know what I am doing, and I would very, very much appreciate any help.
require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

page = HTTParty.get('https://boards.4chan.org/g/catalog')
parse_page = Nokogiri::HTML(page)
threads_array = []

parse_page.search('.//*[@id="threads"]/div').each do |a|
  post_id = a.text
  post_pic = a.text
  post_title = a.text
  post_teaser = a.text
  threads_array.push(post_id, post_pic, post_title, post_teaser)
end

CSV.open('sample.csv', 'w') do |csv|
  csv << threads_array
end

Pry.start(binding)
It doesn't look like the raw HTML source contains those fields, which is why you're not seeing them when parsing with HTTParty and Nokogiri. It looks like they put the data in a JS variable further up. Try this:
require 'rubygems'
require 'httparty'
require 'json'
page = HTTParty.get('https://boards.4chan.org/g/catalog')
m = page.match(/var catalog = ({.*?});var/)
json_str = m.captures.first
catalog = JSON.parse(json_str)
pp catalog
Whether that is robust enough I'll let you decide :)
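From there you can push selected fields into the CSV the question was building. A sketch, assuming the parsed catalog keys threads by ID under 'threads' (inspect the hash first; the field names below are guesses to verify):

require 'csv'

CSV.open('sample.csv', 'w') do |csv|
  csv << ['post_id', 'title', 'teaser']
  (catalog['threads'] || {}).each do |post_id, data|
    csv << [post_id, data['sub'], data['teaser']]  # 'sub' and 'teaser' are assumed field names
  end
end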

Putting hyperlinks into an HTML table in R

I am a biologist trying to do computer science for research, so I may be a bit naïve. I would like to make a table containing information from a data frame, with a hyperlink in one of the columns. I imagine this needs to be an HTML document (?). I found this post describing how to put a hyperlink into a data frame and write it as an HTML file using googleVis. I would like to use this approach (it is the only one I know and it seems to work well), except I would like to replace the actual URL with a description. The real motivation is that I would like to include many of these hyperlinks, and the links have long addresses which are difficult to read.
To be verbose, I essentially want to do what I did here where we read 'here' but 'here' points to
http://stackoverflow.com/questions/8030208/exporting-table-in-r-to-html-with-hyperlinks
From your previous question, you can have another list which contains the titles of the URLs:
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x = gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
Building on Jack's answer but consolidating from different threads:
library(googleVis)
library(R2HTML)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a=c(1,2,3), b=c(4,5,6), url=url)
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)

How to write a transformer for Ruby Sanitize Gem to transform <br> into newlines?

I'm using a wrapper for the Sanitize gem's clean method to solve some of our issues:
def remove_markup(html_str)
  html_str = html_str.gsub(/(<\/p>)/, "\\1\n")  # append a newline after each closing </p>
  marked_up = Sanitize.clean html_str
  ESCAPE_SEQUENCES.each do |esc_seq, ascii_seq|
    marked_up = marked_up.gsub('&' + esc_seq + ';', ascii_seq.chr)
  end
  marked_up
end
I recently added the gsub line as a quick way to do what I wanted: insert a newline wherever a paragraph ends.
However, I'm sure this can be accomplished more elegantly with a Sanitize transformer.
Unfortunately, I think I must be misunderstanding a few things. Here is an example of a transformer I wrote for the <br> tag that worked:
s2 = "<p>here is para 1<br> It's a nice paragraph</p><p>Don't forget para 2</p>"

br_to_nl = lambda do |env|
  node = env[:node]
  node_name = env[:node_name]
  return if env[:is_whitelisted] || !node.element?
  return unless node_name == 'br'
  node.replace "\n"
end

Sanitize.clean s2, :transformers => [br_to_nl]
#=> " here is para 1\n It's a nice paragraph Don't forget para 2 "
But I couldn't come up with a solution that would work well for <p> tags.
Should I add a text element to the node as a child? How can I make it show up immediately after the element?
Related question (answered): How to use RubyGem Sanitize transformers to sanitize an unordered list into a comma separated list?
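For the <p> case, one approach that follows the same pattern as br_to_nl (a sketch, not verified against every Sanitize version): append a newline text node as the last child of each <p>; when Sanitize strips the unwhitelisted <p> wrapper, the text children, including the added newline, survive.

p_to_nl = lambda do |env|
  node = env[:node]
  return if env[:is_whitelisted] || !node.element?
  return unless env[:node_name] == 'p'
  node.add_child "\n"  # the text node remains after the <p> tag itself is stripped
end

Sanitize.clean s2, :transformers => [br_to_nl, p_to_nl]
# roughly: "here is para 1\n It's a nice paragraph\nDon't forget para 2\n"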

Nokogiri prevent converting entities

require "nokogiri"

def wrap(content)
  doc = Nokogiri::HTML.fragment("<div>" + content + "</div>")
  doc.at("div").traverse do |p|
    if p.is_a?(Nokogiri::XML::Text)
      input = p.content
      p.content = input.scan(/.{1,5}/).join("­")  # join 5-char chunks with a soft hyphen (U+00AD)
    end
  end
  doc.at("div").inner_html
end
wrap("aaaaaaaaaa")
gives me
"aaaaa&shy;aaaaa"
instead of
"aaaaa­aaaaa"
How do I get the second result?
Return
doc.at("div").text
instead of
doc.at("div").inner_html
This, however, strips all HTML from the result. If you need to retain other markup, you can probably get away with using CGI.unescapeHTML:
CGI.unescapeHTML(doc.at("div").inner_html)
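A quick usage sketch (unescapeHTML needs require 'cgi'):

require 'cgi'

html = wrap("aaaaaaaaaa")
puts CGI.unescapeHTML(html)  #=> "aaaaa­aaaaa" with a literal soft hyphen instead of &shy;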