Scraping HTML table with Ruby and Nokogiri - html

so I'm working on a project that scrapes data from a website that has gun accident/death data. Here's what the website looks like: http://www.gunviolencearchive.org/officer-involved-shootings
I'm trying to grab each table row and make an object(instance?, sorry I'm new to ruby) with the data from that row and print it out into the console. Right now, the #occurances array returns an array of the same data 26 times. Clearly it is overwriting with the first row. How would you suggest that I store each of these instances?
Here is my code, the (choice) is the website address.
def self.data_from_choice(choice)
doc = Nokogiri::HTML(open(choice))
#occurances = []
doc.xpath("//tr").each do |x|
date = doc.css("td")[0].text
state = doc.css("td")[1].text
city = doc.css("td")[2].text
deaths = doc.css("td")[4].text
injured = doc.css("td")[5].text
source = doc.search(".links li.last a").attr("href").value
#occurances << {:date => date, :state => state, :city => city, :deaths => deaths, :injured => injured, :source => source}
end
puts #occurances
end

In the loop for each row you are calling doc.css(...). This causes a search from the top of the document each time (i.e. from doc). What I think you want is to make the search relative to the row, which you have in the x variable.
So change this:
date = doc.css("td")[0].text
to this
date = x.css("td")[0].text
and similarly for state, city etc.

Related

Sequel Left Join

My aim is to build a relationship between rooms, people and work shift. So my sequel string looks like this :
x = DB[:raum].join_table(:left, DB[:platz], :rid => :id)
.join_table(:left, DB[:patient_behandlungs_link], :platz_id => :id)
.join_table(:left, DB[:patienten], :id => :patienten_id)
.join_table(:left, DB[:behandlungsverfahren], :id => :t2__behandlungsverfahren_id)
.join_table(:left, DB[:dialysezeit], :id => :t2__dialysezeit_id)
.join_table(:left, DB[:nadeln], :id => :t2__dialysenadel_id)
.join_table(:left, DB[:dialysatorzugang], :id => :t2__dialysatorzugang_id)
.where("raumnummer = ?", raumid.to_i)
It's working like this but in the resulting table there is also a field for the shift id. In this state it does not differentiate in which workshift the person is working. if i do a foreach and push the values out I get my empty nil fields with no one inside, which I want to, and I get the people which are in room raumid from all workshifts.
If I make a .filter(:schicht_id => 1) for example, then I loose my nil values. I need them to assign new people to the empty slots, so I tried (:schicht_id => 1).or(:schicht_id => nil) and similar things but I don't get my result, I want
I think i blamed sequel for something that is not related to sequel.
In the image my select box starts showing options from the second value
This behavior made me think, something was wrong with my joins and group by instructions.... i now have to figure out why the select box shows me from values form 2-9 and not from 1-9. In the HTML Site Source code all 9 Options are given.
This is strange for me.
sorry for blaming sequel.

Number Stored as Text when Writing Query to a Worksheet

I am creating a report using the following gems:
require "mysql2"
require "watir"
require "io/console"
require "writeexcel"
After I query a database with mysql2 and convert the query into a multidimensional array like so:
Mysql2::Client.default_query_options.merge!(:as => :array)
mysql = Mysql2::Client.new(:host => "01.02.03.405", :username => "user", :password => "pass123", :database => "db")
report = mysql.query("SELECT ... ASC;")
arr = []
report.each {|row| arr << row}
and then finally write the data to an Excel spreadsheet like so:
workbook = WriteExcel.new("File.xls")
worksheet = workbook.add_worksheet(sheetname = "Report")
header = ["Column A Title", ... , "Column N Title"]
worksheet.write_row("A1", header)
worksheet.write_col("A2", arr)
workbook.close
when I open the file in the latest edition of Excel for OSX (Office 365) I get the following error for every cell containing mostly numerals:
This report has a target audience that may become distracted with such an error.
I have attempted all the .set_num_format enumerable methods found in the documentation for writeexcel here.
How can I create a report with columns that contain special characters and numerals, such as currency, with write excel?
Should I look into utilizing another gem entirely?
Define the format after you create the worksheet.
format01 = workbook.add_format
format01.set_num_format('#,##0.00')
then write the column with the format.
worksheet.write_col("A2", arr, format01)
Since I'm not a Ruby user, this is just a S.W.A.G.

How to lazy-load or eager-load the sum of associated table?

We build some object in our controller:
#sites = Site.find(:all, :conditions => conditions, :order => 'group_id ASC')
And then in the view (currently), we are doing something like:
#sites.each do |site|
%p Site Name
= site.name
- actual = site.estimates.where('date<?', Date.today).sum(:actual_cost)
%p
= actual
end
Basically like that.
And of course this fires off a query for the Sites and then a query for N sites returned. I know about eager-loading associations with #sites = Site.find(:all, :include => :estimates) but in this case it didn't matter because we're doing the SUM on the query which is special it seems.
How would I eager load the SUMs in such that I don't get all the crazy queries? Currently it's over 600...
provide your conditions & ordering in this query only, which will push the result into a Hash.
sites = Site.includes(:estimates).where('estimates.date < ? ', Date.today)
.order('sites.id ASC')
actual = 0
sites.map { |site|
site.estimates.map { |estimate| actual = actual + estimate.actual_cost }
}
From your explanation, I am assuming actual_cost is a column in estimate table.

Rails custom query based on params

I have zero or many filter params being sent from a json request.
the params may contain:
params[:category_ids"]
params[:npo_ids"]
etc.
I am trying to retreive all Projects from my database with the selected ids. Here is what I have currently:
def index
if params[:category_ids].present? || params[:npo_ids].present?
#conditions = []
#ids = []
if params["category_ids"].present?
#conditions << '"category_id => ?"'
#ids << params["category_ids"].collect{|x| x.to_i}
end
if params["npo_ids"].present?
#conditions << '"npo_id => ?"'
#ids << params["npo_ids"].collect{|x| x.to_i}
end
#conditions = #ids.unshift(#conditions.join(" AND "))
#projects = Project.find(:all, :conditions => #conditions)
else ...
This really isn't working, but hopefully it gives you an idea of what I'm trying to do.
How do I filter down my activerecord query based on params that I'm unsure will be there.
Maybe I can do multiple queries and then join them... Or maybe I should put a filter_by_params method in the Model...?
What do you think is a good way to do this?
In rails 3 and above you build queries using ActiveRelation objects, no sql is executed until you try to access the results, i.e.
query = Project.where(is_active: true)
# no sql has been executed
query.each { |project| puts project.id }
# sql executed when the first item is accessed
The syntax you are using looks like rails 2 style; hopefully you are using 3 or above and if so you should be able to do something like
query = Project.order(:name)
query = query.where("category_id IN (?)", params[:category_ids]) if params[:category_ids].present?
query = query.where("npo_ids IN (?)", params[:npo_ids]) if params[:npo_ids].present?
#projects = query
I solved this. here's my code
def index
if params[:category_ids].present? || params[:npo_ids].present?
#conditions = {}
if params["category_ids"].present?
#conditions["categories"] = {:id => params["category_ids"].collect{|x| x.to_i}}
end
if params["npo_ids"].present?
#conditions["npo_id"] = params["npo_ids"].collect{|x| x.to_i}
end
#projects = Project.joins(:categories).where(#conditions)
else
basically it stored the .where conditions in #conditions, which looks something like this when there's both categories and npos:
{:categories => {:id => [1,2,3]}, :npo_id => [1,2,3]}
Then inserting this into
Project.joins(:categories).where(#conditions)
seems to work.
If you're filtering on a has_many relationship, you have to join. Then after joining, make sure to call the specific table you're referring to by doing something like this:
:categories => {:id => [1,2,3]}

Ruby - Scraping HTML : If url does not exist then skip to next

I am currently working on a html scraper that takes a list of anime-planet url's from a text file and then loops through them, parses and stores the data in a database.
The scraper is working nicely however if I put in a large list then the chances of the url not linking to a series properly and throwing an error is quite high. I want to try make it so that IF the url does not work then it notes down the url in an array named 'error-urls' and just skips the record.
The end result being that the script finishes all working url's and returns a list of non working urls i can work with later (maybe in a text file, or just display in console).
I am currently using a rake task for this which is working quite nicely. If anyone could help me with implementing the error handling functionality it would be much appreciated. Cheers!
scrape.rake:
task :scrape => :environment do
require 'nokogiri'
require 'open-uri'
text = []
File.read("text.txt").each_line do |line|
text << line.chop
end
text.each do |series|
url = "http://www.anime-planet.com/anime/" + series
data = Nokogiri::HTML(open(url))
title = data.at_css('.theme').text
synopsis = data.at_css('.synopsis').text.strip
synopsis.slice! "Synopsis:\r\n\t\t\t\t\t"
eps = data.at_css('.type').text
year = data.at_css('.year').text
rating = data.at_css('.avgRating').text
categories = data.at_css('.categories')
genre = categories.css('li').text.to_s
image = data.at_css('#screenshots img')
imagePath = "http://www.anime-planet.com" + image['src']
anime = Series.create({:title => title, :image => imagePath, :description => synopsis, :eps => eps, :year => year, :rating => rating})
anime.tag_list = genre
anime.save()
end
end
Small example of list.txt
5-Centimeters-Per-Second
11Eyes
A-Channel
Air
Air-Gear
Aishiteru-Ze-Baby
You can use open-uri's error handling. See this for more details.
url = "http://www.anime-planet.com/anime/" + series
begin
doc = open(url)
rescue OpenURI::HTTPError => http_error
# bad status code returned
// do something here
status = http_error.io.status[0].to_i # => 3xx, 4xx, or 5xx
puts "Got a bad status code #{status}"
# http_error.message is the numeric code and text in a string
end
data = Nokogiri::HTML(doc)