Working with large data sets and Ruby - MySQL

Could REALLY use some help here. Struggling with displaying a dashboard with large data.
When working with ~2k records it averages ~2 sec.
The query in the MySQL console takes less than 3.5 seconds to return 150k rows. The same query in Ruby takes over 4+ minutes from the time the query is performed until all objects are ready.
Goal: optimize the data even further before adding a cache server. Working with Ruby 1.9.2, Rails 3.0 and MySQL (mysql2 gem).
Questions:
Does working with Hashes hurt performance?
Should I first put everything in one primary hash then manipulate the data I need afterwards?
Is there anything else I can do to help with performance?
Rows in DB:
GasStations and US Census have ~150,000 records
Person has ~100,000 records
Cars has ~200,000 records
FillUps has ~2.3 million records
Required for the dashboard (queries based on time periods: last 24 hours, last week, etc.). All data is returned in JSON format for JS.
Gas Stations, with FillUps and US Census data (zip code, Name, City, Population)
Top 20 cities with the most fill ups
Top 10 cars with Fill Ups
Cars grouped by how many times they filled up their tank
Code (sample of 6 months; returns ~100k+ records):
# For simplicity I removed the select clause I had; it strips out data I don't need
# (updated_at, gas_station.created_at, etc.) instead of returning all the columns for each table.
#primary_data = FillUp.includes(:car, :gas_station => :uscensus).where('fill_ups.created_at >= ?', 6.months.ago) # This took ~4+ minutes
# then tried
#primary_data = FillUp.find_by_sql('some long sql query...') # took longer than before.
# Note for others: the SQL query did some pre-processing for me, which added attributes to the result.
# The query in the DB console took < 4 seconds; because of those extra attributes the Ruby call took longer,
# as if Ruby was checking each row to map the attributes.
# then tried
MY_MAP = Hash[ActiveRecord::Base.connection.select_all('SELECT thingone, thingtwo from table').map{|one| [one['thingone'], one['thingtwo']]}] # as seen at http://stackoverflow.com/questions/4456834/ruby-on-rails-storing-and-accessing-large-data-sets
# That took 23 seconds and gained the mapping of additional data I had been processing later, so it was much faster.
# Currently using the code below, which takes ~10 seconds.
# Although this is faster, the query itself still only takes 3.5 seconds; parsing it into the hashes adds the overhead.
cars = {}
gasstations = {}
cities = {}
filled = {}
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("SELECT sum(fill_ups_grouped_by_car_id) as filled, fillups.car_id, cars.make as make, gasstations.name as name, ....", :stream => true, :as => :json).each do |row|
  # this returns fill ups grouped by car, fill_ups.car_id, car make, gas station name, gas station zip, gas station city, city population
  # total fill-ups per city, keeping the population alongside
  if cities[row['city']]
    cities[row['city']]['fill_ups'] += row['filled']
  else
    cities[row['city']] = {'fill_ups' => row['filled'], 'population' => row['population']}
  end
  # total fill-ups per gas station
  if gasstations[row['name']]
    gasstations[row['name']]['fill_ups'] += row['filled']
  else
    gasstations[row['name']] = {'city' => row['city'], 'zip' => row['zip'], 'fill_ups' => row['filled']}
  end
  # total fill-ups per car make
  if cars[row['make']]
    cars[row['make']] += row['filled']
  else
    cars[row['make']] = row['filled']
  end
  # distribution: how many cars filled up N times
  if filled[row['filled']]
    filled[row['filled']] += 1
  else
    filled[row['filled']] = 1
  end
end
I have the following models:
class Person < ActiveRecord::Base
  has_many :cars
end
class Car < ActiveRecord::Base
  belongs_to :person
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :gas_stations, :through => :fill_ups
end
class GasStation < ActiveRecord::Base
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :cars, :through => :fill_ups
end
class FillUp < ActiveRecord::Base
  # log of every time a person fills up their gas
  belongs_to :car
  belongs_to :gas_station
end
class Uscensus < ActiveRecord::Base
  # basic data about an area, based on zip code
end

I don't use RoR, but returning 100k rows for a dashboard is never going to be very fast. I strongly suggest building or maintaining summary tables and running GROUP BYs in the database to summarize your dataset before presentation.
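For example, the "top 20 cities with the most fill ups" number can be computed entirely inside MySQL instead of in Ruby. A rough sketch using the mysql2 client you already have (the join conditions, the uscensuses table name and the database name are guesses based on your models, so adjust them to your actual schema):
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root", :database => "your_db")
# Let MySQL do the grouping and summing; Ruby only receives 20 small rows
# instead of 100k+ rows that it has to aggregate by hand.
sql = <<-SQL
  SELECT uscensuses.city, uscensuses.population, COUNT(*) AS fill_ups
  FROM fill_ups
  JOIN gas_stations ON gas_stations.id = fill_ups.gas_station_id
  JOIN uscensuses   ON uscensuses.zipcode = gas_stations.zipcode
  WHERE fill_ups.created_at >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
  GROUP BY uscensuses.city, uscensuses.population
  ORDER BY fill_ups DESC
  LIMIT 20
SQL
client.query(sql).each { |row| puts row.inspect }
The same idea extends to a summary table that a background job refreshes periodically and the dashboard simply reads.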


Rails first_or_create! creating null values in database table

I'm using first_or_create to populate a table with a list of email subscribers (called members). The code is as follows:
def community_members=(members)
self.members = members.split(",").map do |member|
Member.where(email: member.strip, community_id: self.id).first_or_create! unless member.strip == nil
end
end
Everything works fine, except that when I add additional emails to the same community, the table turns the "community_id" column for all previous rows to NULL.
Here's the server log:
Member Load (0.2ms) SELECT "members".* FROM "members" WHERE "members"."email" = $1 AND "members"."community_id" = $2 ORDER BY "members"."id" ASC LIMIT 1 [["email", "lisa#holy.com"], ["community_id", 1]]
SQL (0.3ms) INSERT INTO "members" ("email", "community_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["email", "lisa#holy.com"], ["community_id", 1], ["created_at", "2015-04-30 16:14:25.930012"], ["updated_at", "2015-04-30 16:14:25.930012"]]
Member Load (0.2ms) SELECT "members".* FROM "members" WHERE "members"."community_id" = $1 [["community_id", 1]]
SQL (0.4ms) UPDATE "members" SET "community_id" = NULL WHERE "members"."community_id" = $1 AND "members"."id" = 30 [["community_id", 1]]
(0.3ms) COMMIT
The first "Member" load does exactly what it's supposed to do. But for some reason it always ends with the second Member load that goes in and sets all "community_id" fields to NULL.
Right now I call :community_members from a form on a community page:
<%= form_for(@community) do |f| %>
  <%= f.text_area :community_members, class:"form-control input-lg", placeholder:"Please add your list of comma separated member email addresses here" %>
  <%= f.submit "Save", class: "btn btn-lg btn-green btn-block pad-top" %>
<% end %>
Seems like I'm missing something obvious here. Any ideas? Thank you.
You're going to want to find by the unique attribute, email, and create by community name, I think.
If that's the case, you'll have to do something like:
Member.where(email: member.strip).first_or_create(community: self) unless...
If you have records with non-unique emails, you'll have to redesign your associations.
class Subscriber < ActiveRecord::Base
  # this should have the email attribute
  has_many :community_subscribers
  has_many :communities, through: :community_subscribers
end

class CommunitySubscriber < ActiveRecord::Base
  # this is a 'bridge' (join) table
  belongs_to :subscriber
  belongs_to :community
end

class Community < ActiveRecord::Base
  has_many :community_subscribers
  has_many :subscribers, through: :community_subscribers

  # I suggest new method and arg names.
  # Using self keeps the query within the scope of the community you are working on.
  # This also allows for creation of Subscriber records if you insist on placing that here.
  # Are you sure you want to map instead of just iterating over the split list of emails?
  def new_subscribers(emails)
    emails.split(",").map do |email|
      clean_email = email.strip
      subscriber = Subscriber.where(email: clean_email).first_or_create unless clean_email.blank?
      self.community_subscribers.where(subscriber: subscriber).first_or_create unless subscriber.blank?
    end
  end
end
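A usage sketch, assuming a Community record already exists (the emails below are placeholders):
community = Community.find(1)
community.new_subscribers("lisa@example.com, bob@example.com")
community.subscribers.pluck(:email)
# => ["lisa@example.com", "bob@example.com"]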
Docs:
http://apidock.com/rails/v3.2.1/ActiveRecord/Relation/first_or_create
http://guides.rubyonrails.org/v3.2.13/active_record_querying.html#first_or_create

Memory leak with large number of SQL Logging strings in memory

My app is running on Puma (2.4) cluster mode with 4 workers.
Initially, they use less than 2GB RAM in total but grow continuously and finally take up to 7GB after 20 hours of running.
Using ObjectSpace, I found that the number of string objects increases very fast, from ~300k to 4-5 million objects in each worker.
Then I used the following script to group those strings by their first 60 characters and count them:
counts = Hash.new(0)
ObjectSpace.each_object do |o|
  next unless o.class == String
  counts[o[0, 60]] += 1
end
counts = counts.to_a.sort_by(&:last)
puts counts[-10..-1]
It turns out that most of those strings are SQL logging strings from ActiveRecord:
ObjectSpace.count_objects
# result
{
:TOTAL => 2439593,
:FREE => 209200,
:T_OBJECT => 65944,
:T_CLASS => 11343,
:T_MODULE => 2003,
:T_FLOAT => 13,
:T_STRING => 1821445,
:T_REGEXP => 6570,
:T_ARRAY => 157012,
:T_HASH => 27477,
:T_STRUCT => 1406,
:T_BIGNUM => 1393,
:T_FILE => 142,
:T_DATA => 75081,
:T_MATCH => 1334,
:T_COMPLEX => 1,
:T_RATIONAL => 2809,
:T_NODE => 51890,
:T_ICLASS => 4530
}
# top 10 strings
["PricingRule Exists: SELECT" , 74632]
[": SELECT COUNT(*) FROM `re" , 85454]
["CACHE: SELECT `companies`" , 93045]
["PricingRule Load: SELECT " , 114169]
["Page Load: SELECT `pages`" , 140245]
[": SELECT COUNT(*) FROM `pa" , 182274]
["Customer Load: SELECT `cu" , 191972]
["Company Load: SELECT `com" , 253025]
["Page Load: SELECT `pages`." , 320267]
["DestinationCountry Load: S" , 413299]
I use Rails 4, Ruby 2 and mysql2 (v0.3.13) with the log level set to warn, but those SQL strings are still stored and keep growing in memory.
Does anyone have any idea or experience with this problem? I'd really appreciate any help.
Thanks!
These strings can come from the 'sql.active_record' event. The reason may be that you subscribe to the 'sql.active_record' event and keep these strings in your objects, so the GC cannot release them.
ActiveSupport::Notifications.subscribed(callback, "sql.active_record") do
end
Make sure you unsubscribe after using it.
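For example, if you register the listener manually somewhere (an assumption about your setup), keep the subscriber object around and unsubscribe it explicitly when you are done:
subscriber = ActiveSupport::Notifications.subscribe("sql.active_record") do |name, start, finish, id, payload|
  # inspect payload[:sql] here, but avoid storing it in a long-lived collection,
  # otherwise these strings can never be garbage collected
end
# ... later, when the listener is no longer needed:
ActiveSupport::Notifications.unsubscribe(subscriber)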

How to return MySQL query results from EventMachine?

I'm trying to use EM::Synchrony to speed up my queries by making them async. Following along with the examples from the github page here I'm making 2 asynchronous queries:
EventMachine.synchrony do
  db = EventMachine::Synchrony::ConnectionPool.new(size: 2) do
    Mysql2::EM::Client.new(
      :host => config[:server],
      :username => config[:user],
      :password => config[:pwd],
      :database => config[:db_name]
    )
  end
  multi = EventMachine::Synchrony::Multi.new
  multi.add :a, db.aquery("
    select count(distinct userid) from my_table
    where date = '2013-09-28'
  ")
  multi.add :b, db.aquery("
    select count(distinct userid) from my_table
    where date = '2013-09-27'
  ")
  res = multi.perform
  puts res
  # p "Look ma, no callbacks, and parallel MySQL requests!"
  # p res.responses[:callback][0]
  EventMachine.stop
end
> #<EventMachine::Synchrony::Multi:0x00000001eb8da8>
My question is how do I setup a callback to actually get the values returned by the queries? What I would like to do is once the queries are finished, aggregate them back together and write to another table or csv or whatever. Thanks.
Maybe you don't need Synchrony? They are async anyway. https://github.com/brianmario/mysql2#eventmachine
But if you do need it, then probably the answer is:
res.responses[:callback][:a]
res.responses[:callback][:b]
https://github.com/igrigorik/em-synchrony/blob/master/lib/em-synchrony/em-multi.rb#L17
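If you go the plain mysql2 route from that first link, a minimal sketch (reusing the config hash and one of the queries from your question) could look roughly like this:
require 'mysql2/em'
EM.run do
  client = Mysql2::EM::Client.new(
    :host     => config[:server],
    :username => config[:user],
    :password => config[:pwd],
    :database => config[:db_name]
  )
  # query returns a deferrable; attach callbacks instead of blocking
  defer = client.query("select count(distinct userid) from my_table where date = '2013-09-28'")
  defer.callback do |result|
    result.each { |row| puts row.inspect } # aggregate or write to CSV here
    EM.stop
  end
  defer.errback do |err|
    puts "query failed: #{err}"
    EM.stop
  end
end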
What I found is that the following code will give me the results from the queries:
res.responses[:callback].each do |obj|
  obj[1].callback do |rows|
    rows.each do |row|
      puts row.inspect
    end
  end
end
$ ruby async_mysql.rb
{"count(distinct ui)"=>159}
{"count(distinct ui)"=>168}

The best way to insert large amount of messages into TorqueBox message queue

What's the fastest way to push millions of messages into HornetQ? I have these two approaches:
1.) My current code, which reuses one session for all messages: approx. 2,200 messages per second
time = Benchmark.realtime do
  queue.with_session(:tx => false) do |session|
    1_000_000.times do
      payload = Array.new(32) { rand(36).to_s(36) }.join
      session.publish(queue, payload, queue.normalize_options(:persistent => false))
    end
  end
end
puts "Time elapsed #{time} seconds ..."
2.) The optimized way, which hangs after the first 30k messages; until then it does approx. 9,000 messages per second
time = Benchmark.realtime do
  queue.with_session(:tx => false) do |session|
    options = queue.normalize_options(:persistent => false)
    producer = session.instance_variable_get('@jms_session').create_producer(session.java_destination(queue))
    1_000_000.times do
      payload = Array.new(32) { rand(36).to_s(36) }.join
      message = TorqueBox::Messaging::Message.new(session.instance_variable_get('@jms_session'), payload, options[:encoding])
      message.populate_message_headers(options)
      message.populate_message_properties(options[:properties])
      producer.disable_message_id = true
      producer.disable_message_timestamp = true
      producer.send( message.jms_message,
                     options.fetch(:delivery_mode, producer.delivery_mode),
                     options.fetch(:priority, producer.priority),
                     options.fetch(:ttl, producer.time_to_live)
                   )
    end
  end
end
puts "Time elapsed #{time} seconds ..."
The question is, why does the second snippet hang after the 30k? What would you recommend for massive inserts into HornetQ?
Regards,
Chris

Ruby On Rails: Testing deletes tables

I'm creating an application in RoR and I'm implementing unit testing in all my models.
When I run each test on its own (by running ruby test/unit/some_test.rb) all tests are successful.
But when I run all tests together (by running rake test:units) some tables from both databases (development and test) are deleted.
I'm using raw SQL (MySQL) to create the tables because I need composite primary keys and physical constraints, so I figured it would be the best approach. Maybe this is the cause?
All my tests are in this form:
require File.dirname(__FILE__) + '/../test_helper'
require File.dirname(__FILE__) + '/../../app/models/order'

class OrderTestCase < Test::Unit::TestCase
  def setup
    @order = Order.new(
      :user_id => 1,
      :total => 10.23,
      :date => Date.today,
      :status => 'processing',
      :date_concluded => Date.today,
      :user_address_user_id => 3,
      :user_address_address_id => 5,
      :creation_date => Date.today,
      :update_date => Date.today
    )
  end

  ################ Happy Path
  def test_happy_path
    assert @order.valid?, @order.errors.full_messages
  end
...
The errors I get when running the tests are something like this:
3) Error:
test_empty_is_primary(AddressTestCase):
ActiveRecord::StatementInvalid: Mysql::Error: Table 'shopshop_enterprise_test.addresses' doesn't exist: SHOW FIELDS FROM addresses
/test/unit/address_test.rb:9:in `new'
/test/unit/address_test.rb:9:in `setup'
Any guesses?
Thanks!
PS: When using postgres as the database engine, everything works fine with rake test:units! (of course, with the correct changes so the sql statements can work with postgres)