Ruby - Scraping HTML : If url does not exist then skip to next - html

I am currently working on an HTML scraper that takes a list of anime-planet URLs from a text file, loops through them, parses each page, and stores the data in a database.
The scraper works nicely, but with a large list the chances are quite high that some URL does not link to a series properly and throws an error. I want it so that if a URL does not work, the script notes the URL in an array named 'error-urls' and skips that record.
The end result would be that the script finishes all working URLs and returns a list of non-working URLs I can deal with later (maybe in a text file, or just displayed in the console).
I am currently using a rake task for this, which is working quite nicely. If anyone could help me implement the error handling, it would be much appreciated. Cheers!
scrape.rake:
task :scrape => :environment do
require 'nokogiri'
require 'open-uri'
text = []
File.read("text.txt").each_line do |line|
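text << line.chomp # chomp strips only a trailing newline; chop would also eat the last character of an unterminated final line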
end
text.each do |series|
url = "http://www.anime-planet.com/anime/" + series
data = Nokogiri::HTML(open(url))
title = data.at_css('.theme').text
synopsis = data.at_css('.synopsis').text.strip
synopsis.slice! "Synopsis:\r\n\t\t\t\t\t"
eps = data.at_css('.type').text
year = data.at_css('.year').text
rating = data.at_css('.avgRating').text
categories = data.at_css('.categories')
genre = categories.css('li').text.to_s
image = data.at_css('#screenshots img')
imagePath = "http://www.anime-planet.com" + image['src']
anime = Series.create({:title => title, :image => imagePath, :description => synopsis, :eps => eps, :year => year, :rating => rating})
anime.tag_list = genre
anime.save()
end
end
Small example of text.txt:
5-Centimeters-Per-Second
11Eyes
A-Channel
Air
Air-Gear
Aishiteru-Ze-Baby

You can use open-uri's error handling: rescue OpenURI::HTTPError around the request, record the URL, and move on to the next one.
url = "http://www.anime-planet.com/anime/" + series
begin
  doc = open(url)
rescue OpenURI::HTTPError => http_error
  # Bad status code returned; http_error.message holds the numeric code and text.
  status = http_error.io.status[0].to_i # => 3xx, 4xx, or 5xx
  puts "Got a bad status code #{status} for #{url}"
  error_urls << url # assumes error_urls = [] was set up before the loop
  next              # inside text.each: skip to the next series
end
data = Nokogiri::HTML(doc)
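As a rough sketch of the whole pattern, the loop below collects the pages that loaded and the URLs that failed, then hands both back. The names `fetch_all` and `fetcher` are made up for illustration and are not part of open-uri. The default fetcher uses `URI.open`, which is what newer Ruby versions expect in place of the bare `open(url)` shown above.

```ruby
require 'open-uri'

# Hypothetical helper (not part of open-uri): fetches every URL and
# separates the pages that loaded from the URLs that failed.
# `fetcher` is any callable that returns the page body or raises.
def fetch_all(urls, fetcher: ->(url) { URI.open(url).read })
  pages = {}
  error_urls = []
  urls.each do |url|
    begin
      pages[url] = fetcher.call(url)
    rescue OpenURI::HTTPError
      error_urls << url # remember the bad URL
      next              # and carry on with the rest of the list
    end
  end
  [pages, error_urls]
end
```

The second return value could then be printed to the console or written to a text file once the rake task has finished.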

Related

Scraping HTML table with Ruby and Nokogiri

so I'm working on a project that scrapes data from a website that has gun accident/death data. Here's what the website looks like: http://www.gunviolencearchive.org/officer-involved-shootings
I'm trying to grab each table row, make an object (instance? sorry, I'm new to Ruby) with the data from that row, and print it to the console. Right now, the @occurances array contains the same data 26 times. Clearly it is being overwritten with the first row each time. How would you suggest that I store each of these instances?
Here is my code, the (choice) is the website address.
def self.data_from_choice(choice)
  doc = Nokogiri::HTML(open(choice))
  @occurances = []
  doc.xpath("//tr").each do |x|
    date = doc.css("td")[0].text
    state = doc.css("td")[1].text
    city = doc.css("td")[2].text
    deaths = doc.css("td")[4].text
    injured = doc.css("td")[5].text
    source = doc.search(".links li.last a").attr("href").value
    @occurances << {:date => date, :state => state, :city => city, :deaths => deaths, :injured => injured, :source => source}
  end
  puts @occurances
end
In the loop for each row you are calling doc.css(...). This causes a search from the top of the document each time (i.e. from doc). What I think you want is to make the search relative to the row, which you have in the x variable.
So change this:
date = doc.css("td")[0].text
to this
date = x.css("td")[0].text
and similarly for state, city etc.

rails 2.3 convert hash into mysql query

I'm trying to find out how Rails converts a hash such as the following. (This is only an example, so please do not take it literally; I threw it together to get the concept across, and I know this query is the same as User.find(1).)
{
:select => "users.*",
:conditions => "users.id = 1",
:order => "username"
}
Into:
SELECT users.* FROM users where users.id = 1 ORDER BY username
The closest thing I can find is ActiveRecord::Base#find_every
def find_every(options)
begin
case from = options[:from]
when Symbol
instantiate_collection(get(from, options[:params]))
when String
path = "#{from}#{query_string(options[:params])}"
instantiate_collection(format.decode(connection.get(path, headers).body) || [])
else
prefix_options, query_options = split_options(options[:params])
path = collection_path(prefix_options, query_options)
instantiate_collection( (format.decode(connection.get(path, headers).body) || []), prefix_options )
end
rescue ActiveResource::ResourceNotFound
# Swallowing ResourceNotFound exceptions and return nil - as per
# ActiveRecord.
nil
end
end
I'm unsure as to how to modify this to just return what the raw mysql statement would be.
So after a few hours of digging I came up with an answer, although it's not great.
class ActiveRecord::Base
def self._get_finder_options options
_get_construct_finder_sql(options)
end
private
def self._get_construct_finder_sql(options)
return (construct_finder_sql(options).inspect)
end
end
Adding this as an extension gives you a publicly accessible method, _get_finder_options, which returns the raw SQL statement.
In my case this is for a complex query that gets wrapped like so:
SELECT COUNT(*) as count FROM (INSERT_QUERY) as count_table
So that I could still use this with the will_paginate gem. This has only been tested in my current project so if you are trying to replicate please keep that in mind.
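To illustrate the wrapping step on its own, here is a minimal sketch that takes a SQL string (such as the one returned by the extension above) and builds the count query around it. The helper name `wrap_in_count` is hypothetical and is not part of Rails or will_paginate.

```ruby
# Hypothetical helper: wraps an arbitrary SELECT statement in a COUNT(*) query.
# A trailing semicolon is dropped so the statement is valid as a subquery.
def wrap_in_count(inner_sql)
  "SELECT COUNT(*) AS count FROM (#{inner_sql.strip.chomp(';')}) AS count_table"
end

puts wrap_in_count("SELECT users.* FROM users WHERE users.id = 1 ORDER BY username")
```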

why pageObject based on Cheezy does not work?

I'm new to ruby (1.9.3)
I have intermediate experience with Selenium WebDriver plus C#. I want to move to Watir-Webdriver.
I'd be grateful to find out why the first block of IRB code works, but the second block simply loads the correct page, then does nothing. The page is active and responds to manual input.
The second block of code is based on the PageObject example here:
https://github.com/cheezy/page-object/wiki/Get-me-started-right-now%21
require 'watir-webdriver'
browser = Watir::Browser.start 'http://x.com/'
browser.select_list(:id, "ddlInterestType").select("Deferred")
browser.select_list(:id, "ddlCompanyName").select("XYZ")
browser.button(:value,"Enter Transactions").click
Second block
require 'watir-webdriver'
browser = Watir::Browser.new :firefox
browser.goto "http://x.com/"
deferredPage = DeferredPage.new(browser)
deferredPage.interestType.select = 'Deferred'
deferredPage.company.select = 'XYZ'
deferredPage.enterTransactions
class DeferredPage
include PageObject
select_list(:interestType, :id => 'ddlInterestType')
select_list(:company, :id => 'ddlCompanyName')
button(:enterTransactions, :id => 'btnEnterTransactions')
end
In your page-object code example, after loading the page, an exception is likely being thrown (which makes it seem like nothing happens). That code should raise a NoMethodError:
undefined method `select=' for "stuff":String
When you declare a select list there are three methods created:
your_select= - this is for setting the select list
your_select - this is for getting the select list value
your_select_element - this is for getting the page-object gem element
When you call deferredPage.interestType, it returns a string that is the value of the select list. Strings do not have a select= method, which is why you get the exception (and why nothing happens).
The two selections should be done without the .select:
deferredPage.interestType = 'Deferred'
deferredPage.company = 'XYZ'
As you can see, the page-object API is slightly different from the Watir API.
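The three generated methods can be mimicked with a toy stand-in to show why the getter hands back a plain String. This is only an illustration under made-up names (`ToyPageObject`, `ToyPage`); it is not how the page-object gem is implemented, and it needs no browser.

```ruby
# Toy stand-in for the accessor generation described above. NOT the gem's code.
module ToyPageObject
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Defines name, name= and name_element, mirroring the three methods listed above.
    def select_list(name, _locator)
      define_method(name) { @values[name] }                        # getter: returns the value (a String)
      define_method("#{name}=") { |value| @values[name] = value }  # setter: picks an option
      define_method("#{name}_element") { [:select_list, name] }    # placeholder for the element object
    end
  end

  def initialize
    @values = {}
  end
end

class ToyPage
  include ToyPageObject
  select_list(:interestType, :id => 'ddlInterestType')
end

page = ToyPage.new
page.interestType = 'Deferred' # works: calls the generated setter
puts page.interestType         # the getter returns a String, which has no select= method
```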
While googling for info on page objects, I found this page by Alister Scott:
http://watirmelon.com/2012/06/04/roll-your-own-page-objects/
For an idiot++ such as me, I think I'll use his method until I know more about Watir-Webdriver. Based on @justinko's comment, I'll stick to one API for the present. I tried rolling my own, and it works fine:
require 'watir-webdriver'
browser = Watir::Browser.new :ie
class DeferredPage
  def initialize( browser )
    @browser = browser
  end
  def enterIntType(intType)
    @browser.select_list(:id, "ddlInterestType").select(intType)
  end
  def clickEnter()
    @browser.button(:value,"Enter Transactions").click
  end
end
dp = DeferredPage.new(browser)
browser.goto "http://x.com"
dp.enterIntType( "Deferred" )
dp.clickEnter
Could you please let us know what error you are getting? I suspect the problem you are seeing is related to the way the Ruby interpreter reads the code. It reads the file from top to bottom and you are using the DeferredPage class before it is defined. What would happen if you changed your code to this:
require 'watir-webdriver'
require 'page-object'
browser = Watir::Browser.new :firefox
class DeferredPage
include PageObject
select_list(:interestType, :id => 'ddlInterestType')
select_list(:company, :id => 'ddlCompanyName')
button(:enterTransactions, :id => 'btnEnterTransactions')
end
deferredPage = DeferredPage.new(browser)
deferredPage.navigate_to "http://x.com/"
deferredPage.interestType = 'Deferred'
deferredPage.company = 'XYZ'
deferredPage.enterTransactions
In this case I am declaring the class prior to using it.
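The top-to-bottom behaviour can be seen in isolation with a small sketch. `LaterPage` is a made-up class name; the point is only that a constant cannot be referenced before the line that defines it has run.

```ruby
# Referencing a class before its definition has executed raises NameError.
error_before = begin
  LaterPage.new
  nil
rescue NameError => e
  e.class
end

class LaterPage
  def title
    "defined"
  end
end

puts "before definition: #{error_before}"
puts "after definition: #{LaterPage.new.title}"
```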
Another thing I might suggest is creating a higher level method to perform the data entry. For example, you could change your code to this:
require 'watir-webdriver'
require 'page-object'
browser = Watir::Browser.new :firefox
class DeferredPage
include PageObject
select_list(:interestType, :id => 'ddlInterestType')
select_list(:company, :id => 'ddlCompanyName')
button(:enterTransactions, :id => 'btnEnterTransactions')
def do_something(interest, company)
self.interestType = interest
self.company = company
enterTransactions
end
end
deferredPage = DeferredPage.new(browser)
deferredPage.navigate_to "http://x.com/"
deferredPage.do_something('Deferred', 'XYZ')
This is cleaner - the access to the page is abstracted behind a method that should add some business value.
-Cheezy

Soundcloud API doesn't explicitly support pagination with json

Specific example I was working with:
http://api.soundcloud.com/users/dubstep/tracks.json?client_id=YOUR_CLIENT_ID
You'll get their first 50 tracks, but there is no next-href object like the one you see in the XML version.
However, you can use offset and limit, and it works as expected, but then I would need to "blindly" crawl through tracks until there are no more, unlike with the XML version, which gives you the "next page" of results. I wouldn't even have noticed it was paginated except that, by chance, while searching the JSON object I noticed there were exactly 50 tracks (which is suspiciously even).
Is there a plan to support the next-href tag in JSON? Am I missing something? Is it a bug that it's missing?
There is an undocumented parameter you can use, linked_partitioning=1, which adds next_href to the response:
http://api.soundcloud.com/users/dubstep/tracks.json?client_id=YOUR_CLIENT_ID&linked_partitioning=1
For example, in PHP, using offset and limit:
// build our API URL
$clientid = "Your API Client ID"; // Your API Client ID
$userid = "/ IDuser"; // ID of the user you are fetching the information for
// Grab the contents of the URL
//more php get
$number="1483";
$offset=1300;
$limit=200;
$soundcloud_url = "http://api.soundcloud.com/users/{$userid}/tracks.json?client_id={$clientid}&offset={$offset}&limit={$limit}";
$tracks_json = file_get_contents($soundcloud_url);
$tracks = json_decode($tracks_json);
foreach ($tracks as $track) {
echo "<pre>";
echo $track->title . ":";
echo $track->permalink_url . "";
echo "</pre>";
}
I've seen that this code is supposed to help (this is in Ruby):
# start paging through results, 100 at a time
page_size = 100
tracks = client.get('/tracks', :order => 'created_at', :limit => page_size,
                    :linked_partitioning => 1)
tracks.each { |t| puts t.title }
However, the first set of results shows, and I even see the "next_href" at the end of the response, but what are you supposed to do to make the next set of results show?
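One possible approach is to keep requesting whatever URL next_href points at until a response arrives without one. The sketch below assumes each response is a JSON object with a "collection" array and an optional "next_href" string; that shape, the helper name `collect_tracks`, and the `fetch` callable are assumptions for illustration, not confirmed details of the SoundCloud client.

```ruby
require 'json'

# Hypothetical helper: follows next_href links until none is left.
# `fetch` is any callable that takes a URL and returns the JSON body as a String.
def collect_tracks(first_url, fetch)
  url = first_url
  tracks = []
  while url
    page = JSON.parse(fetch.call(url))
    tracks.concat(page.fetch('collection', [])) # assumed key holding this page's tracks
    url = page['next_href']                     # nil on the last page, which ends the loop
  end
  tracks
end
```

In real use, `fetch` could wrap an HTTP GET; passing it in as a callable keeps the paging logic separate from the network code.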

Ruby on Rails: Decompression (Zlib::Deflate) doesn't work after certain amount of time

I need to compress a large chunk of text before saving it to the database and decompress it again when a client requests it.
The method I am using right now seems to work fine when I insert new records using the Rails console and query for the newly inserted record right away, i.e. I can decompress the compressed description successfully.
But I am not able to decompress the compressed description for any of my other records added before that. It is really confusing for me, especially as a beginner in the Rails world.
I am using MySQL as a database.
See my Model below to better understand it.
require "base64"
class Video < ActiveRecord::Base
before_save :compress_description
def desc
unless description.blank?
return decompress(description)
end
end
private
def compress_description
unless description.blank?
self.description = compress(description)
end
end
def compress(text)
Base64.encode64(Zlib::Deflate.new(nil, -Zlib::MAX_WBITS).deflate(text, Zlib::FINISH))
end
def decompress(text)
Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(Base64.decode64(text))
end
end
OK, it's actually very easy to reproduce your problem. In the Rails console, do the following:
Video.create(:description => "This is a test")
Video.last.description
=> "C8nILFYAokSFktTiEgA=\n"
Video.last.desc
=> "This is a test"
Video.last.save #This update corrupts the description
Video.last.desc
=> "C8nILFYAokSFktTiEgA=\n"
The corruption happens because you are compressing an already compressed string: every save runs before_save again on the stored (compressed) description.
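The effect can be reproduced outside Rails with the same two helper methods. This is a minimal sketch using only the standard zlib and base64 libraries; the variable names are made up for illustration.

```ruby
require 'zlib'
require 'base64'

# Same helpers as in the model, lifted out so they run without Rails.
def compress(text)
  Base64.encode64(Zlib::Deflate.new(nil, -Zlib::MAX_WBITS).deflate(text, Zlib::FINISH))
end

def decompress(text)
  Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(Base64.decode64(text))
end

original = "This is a test"
once  = compress(original) # what the first save stores
twice = compress(once)     # what a second save stores

puts decompress(once) == original # one layer comes off cleanly
puts decompress(twice) == once    # after two saves, one decompress still leaves compressed text
```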
You should probably modify your class as follows, and you should be fine:
require 'base64'
class Video < ActiveRecord::Base
  before_save :compress_description
  after_find :decompress_description
  attr_accessor :uncompressed_description
  private
  def compress_description
    unless @uncompressed_description.blank?
      self.description = compress(@uncompressed_description)
    end
  end
  def decompress_description
    unless description.blank?
      @uncompressed_description = decompress(description)
    end
  end
  def compress(text)
    Base64.encode64(Zlib::Deflate.new(nil, -Zlib::MAX_WBITS).deflate(text, Zlib::FINISH))
  end
  def decompress(text)
    Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(Base64.decode64(text))
  end
end
Now use your class as follows:
Video.create(:uncompressed_description => "This is a test")
Video.last.description
=> "C8nILFYAokSFktTiEgA=\n"
Video.last.uncompressed_description
=> "This is a test"
Video.last.save
Video.last.uncompressed_description
=> "This is a test"