Rails query:
Detail.created_at_gt(15.days.ago.to_datetime).find_each do |d|
  # some code
end
Equivalent MySQL query:
SELECT * FROM `details` WHERE (details.id >= 0) AND
(details.created_at > '2012-07-01 12:22:32')
ORDER BY details.id ASC LIMIT 1000
By using find_each, Rails checks for details.id >= 0 and orders the details in ascending order by id.
I want to avoid those two behaviors, because in my case they cause a whole-table scan when there is a lot of data to process; the index on created_at is not used, so this is inefficient. Can anyone help?
Here is the source of find_in_batches, which find_each uses:
http://apidock.com/rails/ActiveRecord/Batches/find_in_batches
Click the Show source link.
Essential lines are:
relation = relation.reorder(batch_order).limit(batch_size)
records = relation.where(table[primary_key].gteq(start)).all
and
records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
You must order records by the primary key, or some other unique index, to process them in batches and to select the next batch.
You can't batch by created_at alone because it is not unique. But you can mix ordering by created_at with selecting by the unique id:
relation = relation.reorder('created_at ASC, id ASC').limit(batch_size)
records = relation.where(table[primary_key].gteq(start)).all
# ....
while records.any?
  records_size = records.size
  primary_key_offset = records.last.id
  created_at_key = records.last.created_at
  yield records
  break if records_size < batch_size
  if primary_key_offset
    records = relation.where('created_at > :ca OR (created_at = :ca AND id > :id)',
                             :ca => created_at_key, :id => primary_key_offset).to_a
  else
    raise "Primary key not included in the custom select clause"
  end
end
If you are absolutely sure that no single created_at value repeats more than batch_size times, you could use created_at as the only key in batch processing.
Either way, you need an index on created_at for this to be efficient.
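For reference, this is roughly the SQL such a batch loop emits (a sketch assuming MySQL and the details table from the question; the :last_* placeholders stand for values remembered from the previous batch, and the index name is arbitrary):

-- first batch
SELECT * FROM details
WHERE created_at > '2012-07-01 12:22:32'
ORDER BY created_at ASC, id ASC
LIMIT 1000;

-- subsequent batches seek past the last (created_at, id) pair seen
SELECT * FROM details
WHERE created_at > '2012-07-01 12:22:32'
  AND (created_at > :last_created_at
       OR (created_at = :last_created_at AND id > :last_id))
ORDER BY created_at ASC, id ASC
LIMIT 1000;

-- a composite index satisfies both scans
CREATE INDEX index_details_on_created_at_and_id ON details (created_at, id);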
Detail.where('created_at > ? AND id < ?', 15.days.ago.to_datetime, 1000).order('details.id ASC')
You don't have to explicitly check details.id >= 0 as Rails does it for you by default.
It would be better to use scopes and Arel-style querying:
class Detail < ActiveRecord::Base
  table = self.arel_table
  scope :created_after, lambda { |date| where(table[:created_at].gt(date)).limit(1000) }
end
Then you can find the 1000 records that were created after some date:
@details = Detail.created_after(15.days.ago.to_datetime)
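For what it's worth, that scope should generate SQL along these lines (a sketch, reusing the timestamp from the example above):

SELECT * FROM details WHERE details.created_at > '2012-07-01 12:22:32' LIMIT 1000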
The following query takes a whopping 6 seconds to execute and I can't seem to figure out why. I have an index on the table, but it doesn't seem to do much to speed up the query.
Query:
SELECT `AD`.`id`, `CAM`.`cam_name`, `CUI`.`cui_id`, `CAM`.`cam_id`, `AD`.`api_json_response_data` AS `refused_by_api`
FROM `tbl_api_data` AS `AD`
LEFT JOIN `tbl_camp_user_info` AS `CUI` ON `AD`.`cui_id` = `CUI`.`cui_id`
JOIN `tbl_campaign` AS `CAM` ON `CAM`.`cam_id` = `CUI`.`cui_campaign_id`
JOIN `tbl_usr_lead_setting` AS `ULS` ON `CUI`.`cui_id` = `ULS`.`cui_id`
WHERE `CUI`.`cui_status` = 'active'
AND `CAM`.`cam_status` = 'active'
AND `ULS`.`uls_status` = 'active'
AND `AD`.`status` = 'error'
AND `CUI`.`cui_cron_status` = '1'
AND `CUI`.`cui_created_date` >= '2021-07-01 00:00:00'
GROUP BY `AD`.`cui_id`
I have indexes on the tables as below:
tbl_api_data - id,cui_id
tbl_camp_user_info - cui_id, cui_campaign_id, cui_cron_status (cui_status is not indexed)
tbl_campaign - cam_id, cam_status
tbl_usr_lead_setting - cui_id,uls_status
(screenshot of the indexes)
Total number of records in each table:
tbl_api_data - 297,297 rows
tbl_camp_user_info - 843,390 rows
tbl_campaign - 334 rows
tbl_usr_lead_setting - 879,390 rows
And the query result has 376 rows.
If I use a LIMIT on the above query, as below, the result is 10 rows, but it still takes 8.278 sec. That's also too much.
SELECT `AD`.`id`, `CAM`.`cam_name`, `CUI`.`cui_id`, `CAM`.`cam_id`, `AD`.`api_json_response_data` AS `refused_by_api`
FROM `tbl_api_data` AS `AD`
LEFT JOIN `tbl_camp_user_info` AS `CUI` ON `AD`.`cui_id` = `CUI`.`cui_id`
JOIN `tbl_campaign` AS `CAM` ON `CAM`.`cam_id` = `CUI`.`cui_campaign_id`
JOIN `tbl_usr_lead_setting` AS `ULS` ON `CUI`.`cui_id` = `ULS`.`cui_id`
WHERE `CUI`.`cui_status` = 'active'
AND `CAM`.`cam_status` = 'active'
AND `ULS`.`uls_status` = 'active'
AND `AD`.`status` = 'error'
AND `CUI`.`cui_cron_status` = '1'
AND `CUI`.`cui_created_date` >= '2021-07-01 00:00:00'
GROUP BY `AD`.`cui_id`
LIMIT 10
I've been stuck on this for the last week. I really need to optimize the above query; please help me if possible.
Any help would be appreciated. Thank you.
Going by what you posted, you have a composite index on tbl_api_data (id, cui_id). In the SQL you are joining this table to another table on the cui_id field, and you are also using that field in the GROUP BY. However, you haven't added an index on this field. That can be a reason.
Remember that the composite index you posted can't be used for this join and GROUP BY, because cui_id is not the leftmost field (the first field in the composite index).
So try adding a separate index on cui_id.
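For example (a sketch assuming MySQL; the index name is arbitrary):

ALTER TABLE tbl_api_data ADD INDEX idx_api_data_cui_id (cui_id);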
That LEFT JOIN seems to be effectively a plain JOIN: the WHERE clause filters on CUI columns, which discards the NULL-extended rows a LEFT JOIN would otherwise preserve.
These composite indexes may help:
AD: INDEX(status, cui_id, id, api_json_response_data)
CUI: INDEX(cui_status, cui_cron_status, cui_created_date, cui_id, cui_campaign_id)
CAM: INDEX(cam_status, cam_id, cam_name)
ULS: INDEX(uls_status, cui_id)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
A LIMIT without an ORDER BY can lead to any arbitrary subset of the rows being returned.
"tbl_api_data - id,cui_id" -- Assuming that id is the PRIMARY KEY, this index is likely to be useless. That is, don't start a secondary index with the PK's column(s).
Suppose I have a table A with its Active Record model in Yii2. What is the best way to load the record with the max created date into the model?
This is the query:
select *
from A
where created_date = (
select max(created_date) from A
)
Right now I am getting the max date first, then using it in a second trip to the database, i.e.:
$max = A::find()->max('created_date');
$model = A::find()->where("created_date = :date", [":date" => $max])->one();
I am sure this can be done with a single database access, but I don't know how.
Any help, please?
Your query is the equivalent of:
SELECT * FROM A ORDER BY created_date DESC LIMIT 1;
You can order your records by created_date in descending order and get the first record i.e:
$model = A::find()->orderBy('created_date DESC')->limit(1)->one();
Why limit(1)? As pointed out by nicolascolman, according to the official Yii documentation:
Neither yii\db\ActiveRecord::findOne() nor yii\db\ActiveQuery::one() will add LIMIT 1 to the generated SQL statement. If your query may return many rows of data, you should call limit(1) explicitly to improve the performance, e.g., Customer::find()->limit(1)->one().
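Concretely, the only difference is the LIMIT clause in the generated SQL (a sketch assuming MySQL and a table named A):

SELECT * FROM A ORDER BY created_date DESC;          -- one() without limit(1)
SELECT * FROM A ORDER BY created_date DESC LIMIT 1;  -- one() with limit(1)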
$maxdate=A::find()->max('created_date');
Try this:
$model = A::find()->orderBy("created_date DESC")->one();
I need some SQL help. I have a SELECT statement that references several tables and is hanging up the MySQL database. Is there a better way to write this statement so that it runs efficiently and does not hang up the DB? Any help/direction would be appreciated. Thanks.
Here is the code:
Select Max(b.BurID) As BurID
From My.AppTable a,
My.AddressTable c,
My.BurTable b
Where a.AppID = c.AppID
And c.AppID = b.AppID
And (a.Forename = 'Bugs'
And a.Surname = 'Bunny'
And a.DOB = '1936-01-16'
And c.PostcodeAnywhereBuildingNumber = '999'
And c.PostcodeAnywherePostcode = 'SK99 9Q9'
And c.isPrimary = 1
And b.ErrorInd <> 1
And DateDiff(CurDate(), a.ApplicationDate) <= 30)
There is NO mysql error in the log. Sorry.
Pro tip: use explicit JOINs rather than a comma-separated list of tables. It makes the logic you're using to JOIN much easier to see. Rewriting your query that way gives us this:
select Max(b.BurID) As BurID
From My.AppTable AS a
JOIN My.AddressTable AS c ON a.AppID = c.AppID
JOIN My.BurTable AS b ON c.AppID = b.AppID
WHERE (a.Forename = 'Bugs'
And a.Surname = 'Bunny'
And a.DOB = '1936-01-16'
And c.PostcodeAnywhereBuildingNumber = '999'
And c.PostcodeAnywherePostcode = 'SK99 9Q9'
And c.isPrimary = 1
And b.ErrorInd <> 1
And DateDiff(CurDate(), a.ApplicationDate) <= 30)
Next pro tip: don't use functions (like DateDiff()) on columns in WHERE clauses, because they defeat the use of indexes for searching. That means you should change the last line of your query to:
AND a.ApplicationDate >= CurDate() - INTERVAL 30 DAY
This has the same logic as in your query, but it leaves a naked (and therefore index-searchable) column name in the search expression.
Next, we need to look at your columns to see how you are searching, and cook up appropriate indexes.
Let's start with AppTable. You're screening by specific values of Forename, Surname, and DOB, and by a range of ApplicationDate values. Finally, you need AppID to manage your join. So this compound index should help: its columns are in the correct order for a range scan to satisfy your query, and it contains all the needed columns.
CREATE INDEX search1 USING BTREE
ON AppTable
(Forename, Surname, DOB, ApplicationDate, AppID)
Next, we can look at your AddressTable. Similar logic applies. You'll enter this table via the JOINed AppID, and then screen by specific values of three columns. So, try this index
CREATE INDEX search2 USING BTREE
ON AddressTable
(AppID, PostcodeAnywherePostcode, PostcodeAnywhereBuildingNumber, isPrimary)
Finally, we're on to your BurTable. Use similar logic as the other two, and try this index.
CREATE INDEX search3 USING BTREE
ON BurTable
(AppID, ErrorInd, BurID)
This kind of index is called a compound covering index, and can vastly speed up the sort of summary query you have asked about.
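To confirm the optimizer actually picks these indexes up, you can prefix the rewritten query with EXPLAIN (a sketch; look for search1/search2/search3 in the key column and "Using index" under Extra):

EXPLAIN
SELECT Max(b.BurID) AS BurID
FROM My.AppTable AS a
JOIN My.AddressTable AS c ON a.AppID = c.AppID
JOIN My.BurTable AS b ON c.AppID = b.AppID
WHERE a.Forename = 'Bugs'
  AND a.Surname = 'Bunny'
  AND a.DOB = '1936-01-16'
  AND c.PostcodeAnywhereBuildingNumber = '999'
  AND c.PostcodeAnywherePostcode = 'SK99 9Q9'
  AND c.isPrimary = 1
  AND b.ErrorInd <> 1
  AND a.ApplicationDate >= CurDate() - INTERVAL 30 DAY;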
I have a model called HeroStatus with the following attributes:
id
user_id
recordable_type
hero_type (can be NULL!)
recordable_id
created_at
There are over 100 hero_statuses, and a user can have many hero_statuses, but can't have the same hero_status more than once.
A user's hero_status is uniquely identified by the combination of recordable_type + hero_type + recordable_id. What I'm trying to say, essentially, is that there can't be a duplicate hero_status for a given user.
Unfortunately, I didn't have a validation in place to assure this, so I got some duplicate hero_statuses for users after I made some code changes. For example:
user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2010-05-03 18:30:30'
user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2009-03-03 15:30:00'
user_id = 18
recordable_type = 'Good'
hero_type = 'Hugs'
recordable_id = 1
created_at = '2009-02-03 12:30:00'
user_id = 18
recordable_type = 'Good'
hero_type = NULL
recordable_id = 2
created_at = '2009-12-03 08:30:00'
(The last two are obviously not dups. The first two are.) So what I want to do is get rid of the duplicate hero_status. Which one? The one with the most recent date.
I have three questions:
How do I remove the duplicates using a SQL-only approach?
How do I remove the duplicates using a pure Ruby solution? Something similar to this: Removing "duplicate objects".
How do I put a validation in place to prevent duplicate entries in the future?
For a SQL-only approach, I would use this query (I'm assuming the ids are unique):
DELETE FROM HeroStatus WHERE id IN
  (SELECT id FROM
    (SELECT hs.id
     FROM (SELECT user_id, recordable_type, hero_type, recordable_id,
                  MAX(created_at) AS created_at
           FROM HeroStatus
           GROUP BY user_id, recordable_type, hero_type, recordable_id
           HAVING COUNT(id) > 1) AS del
     INNER JOIN HeroStatus AS hs
        ON hs.user_id = del.user_id
       AND hs.recordable_type = del.recordable_type
       AND hs.hero_type <=> del.hero_type -- NULL-safe equals, so NULL hero_types still match
       AND hs.recordable_id = del.recordable_id
       AND hs.created_at = del.created_at) AS doomed)
-- the extra derived table (doomed) works around MySQL's refusal to delete from
-- a table that is also selected from in a subquery (error 1093)
A bit of a monster! The query finds all duplicates sharing the natural key (user_id, recordable_type, hero_type, recordable_id) and, for each group, picks the row with the largest created_at value (the most recently created). It then finds the ids of those rows (by joining back to the main table) and deletes them.
(Please try this on a copy of the table first and verify you get the results you want! :-)
To prevent this happening in future, add a unique index or constraint over the columns user_id, recordable_type, hero_type, recordable_id. E.g.
ALTER TABLE HeroStatus
ADD UNIQUE (user_id, recordable_type, hero_type, recordable_id)
EDIT:
You add (and remove) this index within a migration like this:
add_index(:hero_statuses, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)
remove_index(:hero_statuses, :column => [:user_id, :recordable_type, :hero_type, :recordable_id])
Or, if you want to explicitly name it:
add_index(:hero_statuses, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true, :name => :my_unique_index)
remove_index(:hero_statuses, :name => :my_unique_index)
Sometimes you need to just roll up your sleeves and do some serious SQL to kill off all the ones you don't want. This is easy if it's a one shot thing, and not too hard to roll into a Rake task you can fire on demand.
For instance, to select one surviving id from each set of duplicates, it is reasonable to use something like the following:
SELECT MIN(id) AS id FROM hero_statuses GROUP BY user_id, recordable_type, hero_type, recordable_id
Given that these identify the sufficiently unique records in your set, you can go about removing all the ones you don't want:
DELETE FROM hero_statuses WHERE id NOT IN
  (SELECT id FROM
    (SELECT MIN(id) AS id FROM hero_statuses
     GROUP BY user_id, recordable_type, hero_type, recordable_id) AS keepers)
As with any operation that involves DELETE FROM, I hope you don't just fire this off on your production data without the usual precautions of backing things up.
As to how to prevent this in the future, if these are unique constraints, create a unique index on them:
add_index :hero_statuses, [ :user_id, :recordable_type, :hero_type, :recordable_id ], :unique => true
This will raise ActiveRecord exceptions when you attempt to introduce a duplicate record. One benefit of a unique index is that you can make use of MySQL's "INSERT IGNORE INTO ..." or "INSERT ... ON DUPLICATE KEY ..." features to recover from potential duplications.
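For example, with the unique index in place (a sketch using the sample values from the question):

-- silently skip an insert that would violate the unique index
INSERT IGNORE INTO hero_statuses (user_id, recordable_type, hero_type, recordable_id, created_at)
VALUES (18, 'Evil', 'Halitosis', 1, NOW());

-- or refresh the timestamp on the existing row instead
INSERT INTO hero_statuses (user_id, recordable_type, hero_type, recordable_id, created_at)
VALUES (18, 'Evil', 'Halitosis', 1, NOW())
ON DUPLICATE KEY UPDATE created_at = VALUES(created_at);

One caveat: in MySQL a UNIQUE index treats NULLs as distinct, so rows with hero_type = NULL can still be duplicated.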
I'm having a little issue with Doctrine using symfony 1.4 (I think it's using Doctrine 1.2). I have two queries; run as raw SQL in the mysql console, they produce the same resultset. The queries can be generated using this code:
$dates = Doctrine::getTable('Picture')
    ->createQuery('a')
    ->select('substr(a.created_at, 1, 10) as date')
    ->leftJoin('a.PictureTag pt ON a.id = pt.picture_id')
    ->leftJoin('pt.Tag t ON t.id = pt.tag_id')
    ->where('a.created_at <= ?', date('Y-m-d 23:59:59'))
    ->orderBy('date DESC')
    ->groupBy('date')
    ->limit(ITEMS_PER_PAGE)
    ->offset(ITEMS_PER_PAGE * $this->page)
    ->execute();
If I remove the two joins, it changes the generated query, but the resultset is the same.
But run through Doctrine's execute(), one of them produces only one row.
Does anybody have an idea what's going on here?
PS: the Picture table has id, title, file, and created_at (format 'Y-m-d h:i:s'); the Tag table has id and name; and PictureTag is a relationship table with an id and the two foreign keys.
PS 2: here are the two SQL queries produced (the first without the joins):
SELECT substr(l.created_at, 1, 10) AS l__0 FROM lupa_picture l WHERE (l.created_at <= '2010-03-19 23:59:59') GROUP BY l__0 ORDER BY l__0 DESC LIMIT 4
SELECT substr(l.created_at, 1, 10) AS l__0 FROM lupa_picture l LEFT JOIN lupa_picture_tag l2 ON (l.id = l2.picture_id) LEFT JOIN lupa_tag l3 ON (l3.id = l2.tag_id) WHERE (l.created_at <= '2010-03-19 23:59:59') GROUP BY l__0 ORDER BY l__0 DESC LIMIT 4
I had something similar this week. Doctrine's generated SQL (taken from the symfony debug toolbar) worked fine in phpMyAdmin, but failed when run through Doctrine as in your question. Try adding the following into your query:
->setHydrationMode(Doctrine::HYDRATE_SCALAR)
and see if it gives you the expected result. If so, it's down to the Doctrine_Collection using the Picture primary key as the index in the collection. If you have more than one result with the same index, Doctrine will refuse to add it to the collection, so you only end up with one result. I ended up running the query through a different table rather than the one I wanted, which resulted in a unique primary key, and then the results I wanted appeared.
Well, the solution was that, besides substr(), the query needs another column of the table in the select. Using select() with both the substr() expression and a.created_at made it work.