Let's say I have this Django model:
class Question(models.Model):
    question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classical problem and has been talked about in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code from the second link (copied below, slightly modified) and noticed that it takes up to 3 seconds to sort the data. To check the function's performance, I wrote a test that creates a list of 100k random alphanumeric strings. Sorting that list takes only 0.76s. So what's happening?
This is what I think: the function needs the question_code of each question for comparison, so sorting 15k values means querying MySQL 15k separate times, and that is why it takes so long. Any ideas? And is there a general solution for natural sort in Django? Thanks a lot!
import re

def natural_sort(l, ascending, key=lambda s: s):
    def get_alphanum_key_func(key):
        # digit runs compare as integers, everything else as text
        convert = lambda text: int(text) if text.isdigit() else text
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    # reverse must be the opposite of ascending
    return sorted(l, key=sort_key, reverse=not ascending)
As far as I'm aware, there isn't a generic Django solution to this. You can reduce your memory usage and limit your database queries by building an id/question_code lookup structure:
from natsort import natsorted
question_code_lookup = Question.objects.values('id','question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results, you can then slice ordered_question_codes, run another query to retrieve only the questions you need, and order them according to their position in that slice:
# get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)
# put them back into question code order
id_to_pos = {qid: pos for pos, qid in enumerate(question_ids)}
questions = sorted(questions, key=lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory or takes too long to sort, you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset.
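To make the paging idea above concrete, here is a minimal, self-contained sketch that uses plain dicts in place of the ORM. The helper names natural_key, page_ids and reorder are made up for illustration, and the regex-based key is a simple stand-in for natsort, not its actual implementation.

```python
import re

def natural_key(text):
    """Split into text and digit runs so that 'Q10' sorts after 'Q2'."""
    parts = re.split(r"([0-9]+)", text)
    # re.split with one capture group puts the digit runs at odd indices
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

# Stand-in for Question.objects.values('id', 'question_code')
rows = [
    {"id": 1, "question_code": "Q10"},
    {"id": 2, "question_code": "Q2"},
    {"id": 3, "question_code": "Q1"},
    {"id": 4, "question_code": "A7"},
]

def page_ids(rows, page, page_size):
    """Return the ids of one page, in natural question_code order."""
    ordered = sorted(rows, key=lambda r: natural_key(r["question_code"]))
    start = page * page_size
    return [r["id"] for r in ordered[start:start + page_size]]

def reorder(records, ids):
    """Put fetched records back into the order given by ids."""
    position = {record_id: pos for pos, record_id in enumerate(ids)}
    return sorted(records, key=lambda r: position[r["id"]])

first_page = page_ids(rows, page=0, page_size=3)
# Stand-in for Question.objects.filter(id__in=first_page),
# which makes no promise about the order of the returned rows
fetched = [r for r in rows if r["id"] in first_page]
print([r["question_code"] for r in reorder(fetched, first_page)])
```

Only the lightweight id/code pairs are sorted in Python; the full records are fetched for a single page at a time.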
In PyArrow you can now do:
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every batch? It seems like this could be something in FragmentScanOptions, but that isn't documented at all.
No, there is no way to do that today. I'm not sure what you're after, but if you are trying to sample your data there are a few options, though none of them achieves quite this effect.
To load only a fraction of your data from disk, you can use pyarrow.dataset.Dataset.head.
There is an open request for randomly sampling a dataset, although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset consists only of Parquet files, there are some rather custom parts and pieces that you can cobble together to achieve what you want:
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
# Keep every second row-group fragment
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
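If you only need to process every Nth batch and can accept that all batches are still read, a plain iterator wrapper is a possible workaround. This is a sketch with a generator standing in for to_batches(); every_nth is a hypothetical helper, not part of PyArrow, and it saves no I/O because the skipped batches are still produced and then discarded.

```python
from itertools import islice

def every_nth(batches, n, start=0):
    """Yield items start, start + n, start + 2n, ... from any iterator."""
    return islice(batches, start, None, n)

# Stand-in for dataset.to_batches(); any iterable of batches works here
fake_batches = (f"batch-{i}" for i in range(10))

for batch in every_nth(fake_batches, 3):
    print(batch)
```

The row-group approach above avoids reading the skipped data at all, so it is the better fit when the goal is to reduce I/O.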
I describe the outcome of a strategy by numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest) and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()
for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe with faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; that operation is known to be slow. Note that StrategyRow.objects.create() already issues an INSERT, so with the extra save() each row currently costs two queries.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a few rows and create them in batches. Try 10, 20, 100, etc. at a time; your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to insert in batches instead of running N queries.
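As a rough sketch of the batching idea, the snippet below builds plain dicts and groups them into fixed-size chunks without touching Django. The helpers nan_to_none, build_rows and chunked and the toy data are made up for illustration, and it assumes the price and weight fields accept NULL. The bulk_create call is shown only as a comment.

```python
import math
from itertools import islice

def nan_to_none(value):
    """Map NaN to None so the column is stored as NULL."""
    return None if math.isnan(value) else value

def build_rows(symbols, times, prices, weights):
    """Yield one plain dict per (symbol, time) pair."""
    for symbol in symbols:
        for t in times:
            yield {
                "symbol": symbol,
                "time": t,
                "price": nan_to_none(prices[symbol][t]),
                "weight": nan_to_none(weights[symbol][t]),
            }

def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Toy data standing in for the real price and weight frames
prices = {"AAA": {0: 1.0, 1: float("nan")}, "BBB": {0: 2.0, 1: 2.5}}
weights = {"AAA": {0: 0.5, 1: 0.5}, "BBB": {0: float("nan"), 1: 0.5}}

for batch in chunked(build_rows(prices.keys(), [0, 1], prices, weights), 3):
    # With Django, one multi-row INSERT per batch would go here, e.g.:
    # StrategyRow.objects.bulk_create(
    #     [StrategyRow(strategy=strategy, **row) for row in batch])
    print(len(batch), "rows in this batch")
```

bulk_create also accepts a batch_size argument, which may make the manual chunking unnecessary if all instances fit in memory at once.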
I am working with serialized array fields in one of my models, specifically in counting how many members of each array are shared.
Now, by the nature of my project, I am having to do a HUGE number of these overlap counts, so I was wondering if there is a super quick, clever way to do this.
At the moment, I am using the '&' method, so my code looks like this:
(user1.follower_names & user2.follower_names).count
which works fine... but I was hoping there might be a faster way to do it.
Sets are faster for this.
require 'benchmark'
require 'set'
alphabet = ('a'..'z').to_a
user1_followers = 100.times.map{ alphabet.sample(3) }
user2_followers = 100.times.map{ alphabet.sample(3) }
user1_followers_set = user1_followers.to_set
user2_followers_set = user2_followers.to_set
n = 1000
Benchmark.bm(7) do |x|
  x.report('arrays'){ n.times{ (user1_followers & user2_followers).size } }
  x.report('set'){ n.times{ (user1_followers_set & user2_followers_set).size } }
end
Output:
              user     system      total        real
arrays    0.910000   0.000000   0.910000 (  0.926098)
set       0.350000   0.000000   0.350000 (  0.359571)
An alternative to the above is to use the '-' operator on arrays:
user1.follower_names.size - (user1.follower_names - user2.follower_names).size
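Essentially this takes the size of the first list and subtracts the size of that list with the shared elements removed. This isn't as fast as using sets, but it is much quicker than using intersection alone on arrays. Note that '&' removes duplicates while '-' does not, so the two counts only agree when follower_names contains no duplicate entries.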
I have a pretty expensive method in my model that compares the text items of arrays for many, many items.
It runs really slowly. If I use a relational database table and compare only the IDs, will my method run a lot faster?
/EDIT
I'm attempting to benchmark the below:
@matches = @location_matches.sort do |l1, l2|
  l1.compute_score(current_user) <=> l2.compute_score(current_user)
end
@matches.reverse!
In short, I'd guess number comparison will be faster, because comparing strings means comparing them character by character (advice: in Ruby, use symbols when you can; their comparison is much faster).
In any case, Ruby's Benchmark module gives you everything you need to benchmark and get detailed results.
A code sample:
require 'benchmark'
n = 50000
Benchmark.bm do |x|
  x.report("for:") { for i in 1..n; a = "1"; end }
  x.report("times:") { n.times do ; a = "1"; end }
  x.report("upto:") { 1.upto(n) do ; a = "1"; end }
end
The result:
             user     system      total        real
for:     1.050000   0.000000   1.050000 (  0.503462)
times:   1.533333   0.016667   1.550000 (  0.735473)
upto:    1.500000   0.016667   1.516667 (  0.711239)
Your first task is to replace that sort with a sort_by. You can also skip the reverse! by sorting things in the desired order in the first place:
@matches = @location_matches.sort_by { |loc| -loc.compute_score(current_user) }
The sort method has to do many comparisons while sorting, and each comparison requires two compute_score calls. The sort_by method does a Schwartzian transform internally, so your expensive compute_score is called only once per entry. The negation inside the block is just an easy way to reverse the sort order (I'm assuming that your scores are numeric).
Fix the obvious performance problem first, and then move on to benchmarking the various solutions (but be sure to benchmark sort versus sort_by, just to confirm that "obvious" matches reality).
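The difference in call counts is easy to see in a small experiment. The sketch below is in Python rather than Ruby and only illustrates the general idea (decorate, sort, undecorate); compute_score here is a made-up stand-in that counts its own calls, not the method from the question.

```python
from functools import cmp_to_key

calls = {"count": 0}

def compute_score(item):
    """Stand-in for an expensive scoring method; counts its own calls."""
    calls["count"] += 1
    return len(item)

items = ["pear", "fig", "banana", "kiwi", "apple", "plum", "cherry", "date"]

# Comparator style: the score is recomputed for both sides of every comparison
calls["count"] = 0
by_cmp = sorted(
    items,
    key=cmp_to_key(lambda a, b: compute_score(b) - compute_score(a)),
)
cmp_calls = calls["count"]

# Decorate, sort, undecorate: each score is computed exactly once.
# The index keeps ties in their original order and avoids comparing items.
calls["count"] = 0
decorated = [(-compute_score(item), index, item) for index, item in enumerate(items)]
decorated.sort()
by_key = [item for _, _, item in decorated]
key_calls = calls["count"]

print("comparator calls:", cmp_calls)
print("decorated calls:", key_calls)
```

Both approaches produce the same descending order; only the number of score computations differs, and the gap grows with the size of the list.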
I'm used to EF because it usually works just fine as long as you get to know it better, so you know how to optimize your queries. But.
What would you choose when you know you'll be working with large quantities of data? I know I wouldn't want to use EF in the first place and cripple my application. I would write highly optimised stored procedures and call those to get certain very narrow results (with many joins so they probably won't just return certain entities anyway).
So I'm a bit confused which DAL technology/library I should use? I don't want to use SqlConnection/SqlCommand way of doing it, since I would have to write much more code that's likely to hide some obscure bugs.
I would like to keep the bug surface as small as possible and use a technology that accommodates my process, not vice versa...
Is there any library that gives me the possibility to:
provide the means of simple SP execution by name
provide automatic materialisation of returned data so I could just provide certain materialisers by means of lambda functions?
like:
List<Person> result = Context.Execute("StoredProcName", record => new Person{
    Name = record.GetData<string>("PersonName"),
    UserName = record.GetData<string>("UserName"),
    Age = record.GetData<int>("Age"),
    Gender = record.GetEnum<PersonGender>("Gender")
    ...
});
or even calling stored procedure that returns multiple result sets etc.
List<Question> result = Context.ExecuteMulti("SPMultipleResults", q => new Question {
    Id = q.GetData<int>("QuestionID"),
    Title = q.GetData<string>("Title"),
    Content = q.GetData<string>("Content"),
    Comments = new List<Comment>()
}, c => new Comment {
    Id = c.GetData<int>("CommentID"),
    Content = c.GetData<string>("Content")
});
Basically, this last one wouldn't work, since it has no knowledge of how to bind the two result sets together... but you get the point.
So to put it all down to a single question: Is there a DAL library that's optimised for stored procedure execution and data materialisation?
Business Logic Toolkit (BLToolkit) might be exactly what's needed here. It's a lightweight ORM tool that supports lots of scenarios, including multiple result sets, although those seem very complicated to set up.