I need some help implementing eager loading - mysql

I've recently been tipped off to eager loading and its necessity in improving performance. I've managed to cut a few queries from loading this page, but I suspect that I can trim them down significantly more if I can eager-load the needed records correctly.
This controller needs to load all of the following to fill the view:
A Student
The seminar (class) page that the student is viewing
All of the objectives included in that seminar
The objective_seminars, the join table between objectives and seminars. This includes the column "priority" which is set by the teacher and used in ordering the objectives.
The objective_students, another join table. Includes a column "points" for the student's score on that objective.
The seminar_students, one last join table. Includes some settings that the student can adjust.
Controller:
def student_view
  @student = Student.includes(:objective_students).find(params[:student])
  @seminar = Seminar.includes(:objective_seminars).find(params[:id])
  @oss = @seminar.objective_seminars.includes(:objective).order(:priority)
  @objectives = @seminar.objectives.order(:name)
  objective_ids = @objectives.map(&:id)
  @student_scores = @student.objective_students.where(:objective_id => objective_ids)
  @ss = @student.seminar_students.find_by(:seminar => @seminar)
  @teacher = @seminar.user
  @teach_options = teach_options(@student, @seminar, 5)
  @learn_options = learn_options(@student, @seminar, 5)
end
The method below is where a lot of the duplicate queries are occurring that I thought eager loading was supposed to eliminate. This method gives the student six options so she can choose one objective to teach her classmates. The method looks first at objectives where the student has scored between 75% and 99%. Within that bracket, they are also sorted by "priority" (from the objective_seminars join table; this value is set by the teacher). If there is room for more, the method then looks at objectives where the student has scored 100%, sorted by priority. (The learn_options method is practically the same as this one, but with different bracket numbers.)
teach_options method:
def teach_options(student, seminar, list_limit)
  teach_opt_array = []
  [[70, 99], [100, 100]].each do |n|
    @oss.each do |os|
      obj = os.objective
      this_score = @student_scores.find_by(:objective => obj)
      if this_score
        this_points = this_score.points
        teach_opt_array.push(obj) if (this_points >= n[0] && this_points <= n[1])
      end
    end
    break if teach_opt_array.length > list_limit
  end
  return teach_opt_array
end
Thank you in advance for any insight!

@jeff - Regarding your question, I don't see where a lot of queries would be happening outside of @student_scores.find_by(:objective => obj).
Your @student_scores object is already an ActiveRecord relation, correct? So you can use .where() on it, or .select{}, without hitting the db again. Note that .select{} will leave you with an array rather than an AR Relation, so be careful there.
this_score = @student_scores.where(objective: obj)
this_score = @student_scores.select { |score| score.objective == obj }
Those should work.
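If you want to go one step further and avoid the per-objective lookups altogether, here is a minimal sketch (my own suggestion, assuming @student_scores and @oss are set up as in your controller) that indexes the scores by objective_id once and then does the bracket check purely in memory:
# Build the lookup once: one query for the scores, then no further DB hits.
scores_by_objective_id = @student_scores.index_by(&:objective_id)

@oss.each do |os|
  score = scores_by_objective_id[os.objective_id]
  next unless score
  # score.points is now available without calling find_by per objective
end
index_by comes from ActiveSupport and works on any loaded relation; the hash lookup replaces the find_by call that was generating one query per objective.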
Just some other suggestions on your top controller method - I don't see any guards or defensive coding, so if any of those objects are nil, your .order(:blah) is probably going to error out. Additionally, if they return nil, your subsequent queries which rely on their data could error out. I'd opt for some try()s or rescues.
Last, just being nitpicky, but those first two lines are a little hard to read, in that you could mistakenly interpret the params as being applied to the includes as well as the main object:
@student = Student.includes(:objective_students).find(params[:student])
@seminar = Seminar.includes(:objective_seminars).find(params[:id])
I'd put the find with your main object, followed by the includes:
@student = Student.find(params[:student]).includes(:objective_students)
@seminar = Seminar.find(params[:id]).includes(:objective_seminars)

Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy writer so I can make use of
# partition_filename_cb - that way I can have one file per date_id, e.g.
# container/dataset/date_id=20210127/20210127.parquet
# Our visualization tool connects to these files directly.
pq.write_to_dataset(table, final, filesystem=filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
//end edit
I saw your question because I had a similar one, and I solved it for my work. (Due to IP issues I can't post the whole code, but I'll try to answer as well as I can. I've never done this before.)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np

array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff
    pass
I used value_counts to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or the last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
edit/extra: you can make it a bit faster and less memory-intensive by maintaining a single numpy boolean mask while iterating over the dictionary, and then returning table.filter(mask=boolean_mask) at the end.
I don't know how to calculate the speed of that, though...
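For what it's worth, here is a minimal sketch of that single-mask idea (hypothetical helper name; it keeps the last occurrence of each value, like the pandas keep="last" approach in the question):
import numpy as np
import pyarrow as pa

def drop_duplicates_keep_last(table: pa.Table, column_name: str) -> pa.Table:
    # remember the last row index seen for each value in the column
    last_index = {}
    for i, value in enumerate(table[column_name].to_pylist()):
        last_index[value] = i
    # build one boolean mask and filter the table a single time
    mask = np.full(len(table), False)
    mask[list(last_index.values())] = True
    return table.filter(mask=mask)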
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't call combine_chunks(): there is a bug where Arrow seems unable to combine chunked arrays if their sizes are too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# length of each chunk of the ID column
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
# local row indices within each chunk
c = np.array([np.arange(x) for x in a])
# starting offset of each chunk
a = ([0] + a)[:-1]
# global row index, stored as a chunked array matching the table's chunking
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# keep only the first occurrence (minimum index) of each ID
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb can give better performance on the group-by. Changing the last 2 lines above into the following gives a 2x speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

Django save() does not work

I'm writing an elections application. In the process, I've defined an Election model and a Candidate model.
Note: I'm using Django version 1.3.7, Python 2.7.1.
One of Election's methods,
Election.count_first_place(self)
is intended to count the number of first place votes each candidate receives and update the candidates' numVotes attribute. But for some reason they all stay at zero, no matter what the ballots contain.
Note: I'm implementing STV so each ballot contains an array(ballot.voteArray) of Candidates in order of most preferred (position zero) to least preferred (position n). I've implemented this list with a PickledObjectField (see link).
models.py
class Candidate(models.Model):
    election = models.ForeignKey("Election")
    numVotes = models.FloatField(blank=True)

class Ballot(models.Model):
    election = models.ForeignKey("Election", related_name="ballot_set")
    voteArray = PickledObjectField(null=True, blank=True)

class Election(models.Model):
    position = models.CharField(max_length=50)
    candidates = models.ManyToManyField(Candidate, related_name="elections_in", null=True, blank=True)

    def count_first_place(self):
        # retrieve all of the ballots cast in this election
        ballots = Ballot.objects.filter(election=self)
        for ballot in ballots.all():
            # the first element of a ballot's voteArray is a Candidate object
            first_place_choice = ballot.voteArray[0]
            first_place_choice.numVotes += 1
            first_place_choice.save()
            ballot.save()
        self.save()
Here is what happens when I run a test:
Note: I realize that I am saving way more often than is necessary. Just being absolutely sure while I test this thing that it saves when it needs to.
elec = Election(position="Student Body President")
elec.save()
j = Candidate(election=elec,numVotes=0)
j.save()
e = Candidate(election=elec,numVotes=0)
e.save()
b = Candidate(election=elec,numVotes=0)
b.save()
elec.candidates.add(j,e,b)
elec.save()
ballot1 = Ballot(election=elec,voteArray=[j,e,b])
ballot1.save()
ballot2 = Ballot(election=elec,voteArray=[j,b,e])
ballot2.save()
ballot3 = Ballot(election=elec,voteArray=[e,b,j])
ballot3.save()
So after this bit, j has 2 first place votes, and e has 1. But when I run
elec.count_first_place()
j still has zero votes, as do e and b.
What's up with that????
This is a very strange table structure. Pickling other model instances is a very bad idea: the pickled versions will not update when their database rows do. Really you should be storing an array of candidate IDs, or better yet, create a many-to-many relationship from Ballot to Candidate with a through table indicating position.
But I think your problem is simpler than that. You say that the objects still have zero votes: that is because you have not updated those particular instances. Again, there is no direct relationship between a Django instance and the database row, other than on loading and saving. You'll need to reload the objects from the database to see any updates.
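For illustration, here is a rough sketch of that second point, adapted to the models above (it still reads the candidate out of the pickled voteArray, which remains the bigger design problem): push the increment into the database with an F() expression keyed on the primary key, rather than saving a stale pickled copy.
from django.db.models import F

class Election(models.Model):
    # ... fields as above ...

    def count_first_place(self):
        for ballot in self.ballot_set.all():
            # only the pk of the pickled candidate is used; the UPDATE runs in the DB
            first_place_pk = ballot.voteArray[0].pk
            Candidate.objects.filter(pk=first_place_pk).update(numVotes=F('numVotes') + 1)
After running it, reload the instances you are inspecting, e.g. j = Candidate.objects.get(pk=j.pk); the j you created earlier in the shell will not pick up the new value by itself.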

Creating complex XPQuery - LINQ to SQL with nested lists

Any hint on what's wrong with the query below?
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>()
              select new ItemPriceViewModel()
              {
                  ID = o.ITEM_ID.ITEM_ID,
                  ItemCod = o.ITEM_ID.ITEM_COD,
                  ItemModifier = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_COD,
                  ItemName = o.ITEM_ID.ITEM_COD,
                  ItemID = o.ITEM_ID.ITEM_ID,
                  ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
                  ItemPrices = (from d in o
                                where d.ITEM_ID.ITEM_ID == o.ITEM_ID.ITEM_ID && d.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID == o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID
                                select new Price()
                                {
                                    ID = o.PRICE_ID,
                                    PriceList = o.PRICELIST_ID.PRICELIST_,
                                    Price = o.PRICE_
                                }).ToList()
              }).ToList()
};
The o in the subquery is shown in red, and I get the message "Could not find an implementation of the query pattern for source type . 'Where' not found."
I would like to have distinct ItemID, ItemModifier: should I create a custom IEqualityComparer to do it?
Thank you!
It seems like XPO is not able to handle this scenario. For reference, this is what you could do with DbContext.
It sounds like maybe you want a GroupBy. Try something like this.
var result = dbContext.Prices
    .GroupBy(p => new { p.ItemName, p.ItemTypeName })
    .Select(g => new Item
    {
        ItemName = g.Key.ItemName,
        ItemTypeName = g.Key.ItemTypeName,
        Prices = g.Select(p => new Price
        {
            Price = p.Price
        }).ToList()
    })
    .Skip(x)
    .Take(y)
    .ToList();
Probable cause
In general, XPO does not support "free joins" in most cases. It was explicitly stated somewhere in their knowledge base or Q/A site; if I hit that article again, I'll include a link to it.
In your original code example, you were trying to perform a "free join" in the INNER query. The WHERE clause was doing a join-by-key, probably navigational, but it also contained an extra filter by "modifier", which is probably not part of the definition of the relation.
Moreover, the query tried to reuse the IQueryable<PRICE> o in the inner query - which actually seems somewhat supported by XPO - but if you ever add any prefiltering ('where') to the top-level o, it has high odds of breaking again.
The docs state that XPO supports only navigational joins, along paths formed by properties and/or XPCollections defined in your XPObjects. This applies to XPO as a whole, so to XPQuery too. All other kinds of joins are called "free joins" and either:
- are silently emulated by XPO by fetching the related objects, extracting key values from them, and rewriting the query into multiple roundtrips with a series of partial queries that fetch full objects with WHERE-id-IN-(#p0,#p1,#p2,...) - but this happens only in some of the simplest cases
- or are "not fully supported", meaning they throw exceptions and require you to manually split or rephrase the query
Possible direct solution schemes
If ITEM_ID is a relation and an XPCollection in the PRICE class, then you could rewrite your query so that it fetches a PRICE object, then builds up a result object and initializes its fields from the PRICE object's properties. Something like:
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>().AsEnumerable()
              select new ItemPriceViewModel()
              {
                  ID = o.ITEM_ID.ITEM_ID,
                  ItemCod = o.ITEM_ID.ITEM_COD,
                  ....
                  ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
                  ItemPrices = (from d in o
                                where d.ITEM_ID.ITEM_ID == ....
                                select new Price()
                                .... .... ....
};
Note the AsEnumerable that breaks the query and ensures that the PRICE objects are fetched first instead of the whole query being translated. It is very probable that this would "just work".
Also, splitting the query into explicit stages sometimes helps XPO analyze it:
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>()
              select new
              {
                  id = o.ITEM_ID.ITEM_ID,
                  itemcod = o.ITEM_ID.ITEM_COD,
                  ....
              }
             ).AsEnumerable()
              .Select(temp => new ItemPriceViewModel()
              {
                  ID = temp.id,
                  ItemCod = temp.itemcod,
                  ....
                  ItemPrices = (from d in XpoSession.Query<PRICE>()
                                where d.ITEM_ID.ITEM_ID == ....
                                select new Price()
                                .... .... ....
};
Here, note that I first fetch the item data from the server, then construct the items on the 'client', and then build the required groupings. Note that I could not refer to the variable o anymore. In this precise case, unsurprisingly, the second (split) example would probably be even slower than the first one, since it would fetch all PRICEs and then refetch the groupings through additional queries, while the first one would just fetch all PRICEs and then calculate the groups in memory based on the PRICEs already fetched. This is not a side effect of my laziness; it is a common pitfall when rewriting LINQ queries, so I included it as a warning :)
Both of these code examples are NOT RECOMMENDED for your case, as they would probably have very poor performance, especially if you have many PRICEs in the table, which is highly likely. I included them only as an example of how you could rewrite the query to simplify its structure so that XPO can eat it without choking. However, you have to be really careful and pay attention to the little details, as you can very easily spoil the performance.
observations and real solution
However, it is worth noting that they are not that much worse than your original query. It was itself quite poor, since it tried to perform something near O(N^2) row-fetches from the table just to group the rows by "ITEM_ID" and then format the results as separate objects. Properly done, it would be something like O(N lg N) + O(N), so regardless of whether it is supported or not, your alternate attempt with GroupBy is surely a much better approach, and I'm really glad you found it yourself.
Very often, when you are trying to split/simplify XPQuery expressions as I did above, you implicitly rethink the problem and find an easier and simpler way to express a query that initially was not supported or just crashed.
Unfortunately, your query was in fact quite simple. For really complex queries that cannot be "just rephrased", splitting them into stages and doing some of the join-filter work on the 'client' is unavoidable. But then again, doing that with XPCollections or XPViews and CriteriaOperators is impossible too, so either we have to bear with it or use plain handcrafted SQL.
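For completeness, here is roughly what that in-memory grouping could look like with the classes from the question (a sketch only - it still fetches every PRICE once and then groups on the client with plain LINQ to Objects):
// Fetch once, then group in memory so XPO never has to translate the "free join".
var prices = XpoSession.Query<PRICE>().AsEnumerable();

var source = prices
    .GroupBy(p => new { ItemId = p.ITEM_ID.ITEM_ID, ModifierId = p.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID })
    .Select(g => new ItemPriceViewModel()
    {
        ID = g.Key.ItemId,
        ItemID = g.Key.ItemId,
        ItemModifierID = g.Key.ModifierId,
        // (fill ItemCod, ItemName, etc. from g.First() as needed)
        ItemPrices = g.Select(p => new Price()
        {
            ID = p.PRICE_ID,
            PriceList = p.PRICELIST_ID.PRICELIST_,
            Price = p.PRICE_
        }).ToList()
    })
    .ToList();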
Sidenote:
XPO as a whole has problems with "free joins"; they are "not fully supported" not only in XPQuery, but there is also not much support for them in XPCollection, XPView, CriteriaOperators, etc. It is also worth noting that, at least in "my version" of DX11, XPQuery has very poor LINQ support overall.
I've hit many cases where a proper LINQ query was:
throwing "NotSupportedException", mostly in FreeJoins, but also very often with complex GroupBy or Select-with-Projection, GroupJoin, and many others - sometimes even Distinct(!) seemed to malfunction
throwing "NullReferenceExceptions" at some proper type conversions (XPO tried to interprete a column that held INT/NULL as an object..), often I had to write some completely odd and artificial expressions like foo!=null && foo.bar!=123 instead of foo = 123 despite the 'foo' being an public int Foo {get;set;}, all because the DX could not cope properly with NULLs in the database (because XPO created nullable-INT column for this property.. but that's another story)
throwing other random ArgumentException/InvalidOperation exceptions from other constructs
or even analyzing the query structure improperly, for example this one is usually valid:
session.Query<ABC>()
    .Where( abc => abc.foo == "somefilter" )
    .Select( abc => new { first = abc, b = abc } )
    .ToArray();
but things like this one usually throw:
session.Query<ABC>()
    .Select( abc => new { first = abc, b = abc } )
    .Where ( temp => temp.first.foo == "somefilter" )
    .ToArray();
but this one is valid:
session.Query<ABC>()
    .Select( abc => new { first = abc, b = abc } )
    .ToArray()
    .Where ( temp => temp.first.foo == "somefilter" )
    .ToArray();
The middle code example usually throws with an error revealing that the XPO layer was trying to find the ".first.foo" path inside the ABC class, which is obviously wrong, since at that point the element type isn't ABC anymore but an anonymous class.
disclaimer
I've already noted this, but let me repeat: these observations relate to DX11 and most probably also earlier versions. I do not know which of these issues have been fixed in DX12 and above (if any have been at all!).

Rails select random record

I don't know if I'm just looking in the wrong places here or what, but does Active Record have a method for retrieving a random object?
Something like?
@user = User.random
Or... well, since that method doesn't exist, is there some amazing "Rails Way" of doing this? I always seem to be too verbose. I'm using MySQL as well.
Most of the examples I've seen that do this end up counting the rows in the table, then generating a random number to choose one. This is because alternatives such as RAND() are inefficient in that they actually fetch every row and assign it a random number, or so I've read (and they are database-specific, I think).
You can add a method like the one I found here.
module ActiveRecord
  class Base
    def self.random
      if (c = count) != 0
        find(:first, :offset => rand(c))
      end
    end
  end
end
This will make it so that any model you use has a method called random, which works in the way I described above: it generates a random number within the count of the rows in the table, then fetches the row associated with that random number. So basically, you're only doing one fetch, which is what you probably prefer :)
You can also take a look at this rails plugin.
We found that offsets ran very slowly on MySQL for a large table. Instead of using an offset like:
model.find(:first, :offset => rand(c))
...we found the following technique ran more than 10x faster (fixed off-by-1):
max_id = Model.maximum("id")
min_id = Model.minimum("id")
id_range = max_id - min_id + 1
random_id = min_id + rand(id_range).to_i
Model.find(:first, :conditions => "id >= #{random_id}", :limit => 1, :order => "id")
Try using Array's sample method:
@user = User.all.sample(1)
In Rails 4 I would extend ActiveRecord::Relation:
class ActiveRecord::Relation
  def random
    offset(rand(count))
  end
end
This way you can use scopes:
SomeModel.all.random.first # Return one random record
SomeModel.some_scope.another_scope.random.first
I'd use a named scope. Just throw this into your User model.
named_scope :random, :order=>'RAND()', :limit=>1
The random function isn't the same in every database, though. SQLite and others use RANDOM(), but you'll need to use RAND() for MySQL.
If you'd like to be able to grab more than one random row you can try this.
named_scope :random, lambda { |*args| { :order=>'RAND()', :limit=>args[0] || 1 } }
If you call User.random it will default to 1 but you can also call User.random(3) if you want more than one.
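If you need the scope to work across databases, here is a small sketch (my own suggestion, not part of the original scope) that picks the order function based on the current adapter:
class User < ActiveRecord::Base
  # RAND() for MySQL, RANDOM() for SQLite/PostgreSQL
  def self.random_order_function
    connection.adapter_name =~ /mysql/i ? 'RAND()' : 'RANDOM()'
  end

  named_scope :random, lambda { |*args|
    { :order => random_order_function, :limit => args[0] || 1 }
  }
end
The lambda is evaluated at call time, so the right function is chosen for whatever connection the model is using.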
If you need a random record but only within certain criteria, you could use "random_where" from this code:
module ActiveRecord
  class Base
    def self.random
      if (c = count) != 0
        find(:first, :offset => rand(c))
      end
    end

    def self.random_where(*params)
      if (c = where(*params).count) != 0
        where(*params).find(:first, :offset => rand(c))
      end
    end
  end
end
For example:
@user = User.random_where("active = 1")
This function is very useful for displaying random products based on some additional criteria.
I strongly recommend this gem for random records; it is specially designed for tables with lots of rows:
https://github.com/haopingfan/quick_random_records
Simple Usage:
@user = User.random_records(1).take
All the other answers perform badly with a large database, except this gem:
- quick_random_records takes only 4.6ms in total.
- the accepted answer's User.order('RAND()').limit(10) takes 733.0ms.
- the offset approach takes 245.4ms in total.
- the User.all.sample(10) approach takes 573.4ms.
Note: my table only has 120,000 users. The more records you have, the more enormous the difference in performance will be.
UPDATE:
Performance on a table with 550,000 rows:
- Model.where(id: Model.pluck(:id).sample(10)) takes 1384.0ms
- gem: quick_random_records takes only 6.4ms in total
Here is the best solution for getting random records from the database.
RoR provides everything for ease of use.
To get random records from the DB, use sample; below is the description with examples.
Backport of Array#sample based on Marc-Andre Lafortune's github.com/marcandre/backports/. Returns a random element or n random elements from the array. If the array is empty and n is nil, returns nil. If n is passed and its value is less than 0, it raises an ArgumentError exception. If the value of n is equal to or greater than 0, it returns [].
[1,2,3,4,5,6].sample # => 4
[1,2,3,4,5,6].sample(3) # => [2, 4, 5]
[1,2,3,4,5,6].sample(-3) # => ArgumentError: negative array size
[].sample # => nil
[].sample(3) # => []
You can use a condition as per your requirement, as in the example below.
User.where(active: true).sample(5)
It will return 5 random active users from the User table.
For more help please visit : http://apidock.com/rails/Array/sample

Recalculate Counter Cache of 120k Records [Rails / ActiveRecord]

The following situation:
I have a poi model, which has many pictures (1:n). I want to recalculate the counter_cache column, because the values are inconsistent.
I've tried iterating over each record in Ruby, but this takes much too long and sometimes quits with "segmentation fault" errors.
So I wonder: is it possible to do this with a raw SQL query?
If, for example, you have Post and Picture models, and Post has_many :pictures, you can do it with update_all:
Post.update_all("pictures_count=(Select count(*) from pictures where pictures.post_id=posts.id)")
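If you'd rather stay in Ruby and let Rails do the counting (slower - one query pair per post - but it reuses the association definition), here is a minimal sketch using the built-in reset_counters:
# Recount pictures_count for every post, one record at a time.
Post.find_each do |post|
  Post.reset_counters(post.id, :pictures)
end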
I found a nice solution on krautcomputing.
It uses reflection to find all counter caches of a project, SQL queries to find only the objects that are inconsistent, and Rails' reset_counters to clean things up.
Unfortunately, it only works with "conventional" counter caches (no class name, no custom counter cache names), so I refined it:
Rails.application.eager_load!
ActiveRecord::Base.descendants.each do |many_class|
  many_class.reflections.each do |name, reflection|
    if reflection.options[:counter_cache]
      one_class = reflection.class_name.constantize
      one_table, many_table = [one_class, many_class].map(&:table_name)
      # more reflections, use :inverse_of, :counter_cache etc.
      inverse_of = reflection.options[:inverse_of]
      counter_cache = reflection.options[:counter_cache]
      if counter_cache === true
        counter_cache = "#{many_table}_count"
        inverse_of ||= many_table.to_sym
      else
        inverse_of ||= counter_cache.to_s.sub(/_count$/, '').to_sym
      end
      ids = one_class
        .joins(inverse_of)
        .group("#{one_table}.id")
        .having("MAX(#{one_table}.#{counter_cache}) != COUNT(#{many_table}.id)")
        .pluck("#{one_table}.id")
      ids.each do |id|
        puts "reset #{id} on #{many_table}"
        one_class.reset_counters id, inverse_of
      end
    end
  end
end