I have a pretty expensive method in my model that compares the text items of arrays for many, many items.
It runs really slowly. If I use a relational database table and compare only the IDs, will my method run a lot faster?
EDIT:
I'm attempting to benchmark the below:
@matches = @location_matches.sort do |l1, l2|
  l1.compute_score(current_user) <=> l2.compute_score(current_user)
end
@matches.reverse!
In short, I'd guess that number comparison will be faster, because comparing strings means comparing character after character (tip: in Ruby, use symbols when you can; comparing them is much faster).
In any case, the Benchmark module gives you everything you need to benchmark and get detailed results.
A code sample:
require 'benchmark'
n = 50000
Benchmark.bm do |x|
  x.report("for:")   { for i in 1..n; a = "1"; end }
  x.report("times:") { n.times do ; a = "1"; end }
  x.report("upto:")  { 1.upto(n) do ; a = "1"; end }
end
The result:
user system total real
for: 1.050000 0.000000 1.050000 ( 0.503462)
times: 1.533333 0.016667 1.550000 ( 0.735473)
upto: 1.500000 0.016667 1.516667 ( 0.711239)
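To illustrate the point about comparison cost, here is a rough sketch of my own (not part of the original benchmark) comparing string, symbol, and integer equality checks; the values are arbitrary:
require 'benchmark'

# Arbitrary sample values; symbol and integer comparisons are cheap value/identity
# checks, while string comparison has to walk the characters.
str_a, str_b = "location_123", "location_124"
sym_a, sym_b = :location_123, :location_124
int_a, int_b = 123, 124

n = 1_000_000
Benchmark.bm(8) do |x|
  x.report("string:")  { n.times { str_a == str_b } }
  x.report("symbol:")  { n.times { sym_a == sym_b } }
  x.report("integer:") { n.times { int_a == int_b } }
end
The string report should come out slowest, which is the point of the advice above.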
Your first task is to replace that sort with a sort_by; you can also skip the reverse! by sorting things in the desired order in the first place:
@matches = @location_matches.sort_by { |loc| -loc.compute_score(current_user) }
The sort method has to do many comparisons while sorting, and each comparison requires two compute_score calls. The sort_by method does a Schwartzian transform internally, so your expensive compute_score will only be called once per entry. The negation inside the block is just an easy way to reverse the sort order (I'm assuming that your scores are numeric).
Fix up the obvious performance problem and then move on to benchmarking various solutions (but be sure to benchmark sort versus sort_by just to be sure that "obvious" matches reality).
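A minimal sketch of such a benchmark, using a stand-in class with a deliberately expensive compute_score (the class, sizes, and nil user are illustrative, not your real model):
require 'benchmark'

# Stand-in for the real model; compute_score is made artificially expensive
# so that the number of calls dominates the timing.
Item = Struct.new(:seed) do
  def compute_score(_user)
    (1..1000).reduce(seed) { |acc, i| acc + Math.sin(i) }
  end
end

items = Array.new(500) { |i| Item.new(i) }
user  = nil

Benchmark.bm(9) do |x|
  x.report("sort:")    { items.sort    { |a, b| b.compute_score(user) <=> a.compute_score(user) } }
  x.report("sort_by:") { items.sort_by { |item| -item.compute_score(user) } }
end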
Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are / are not non-empty, i.e. I want to get: [3,2]
The number of nested lists which do / do not contain the number 3, i.e. I want to get: [1,4]
The number of nested lists for which the sum of the elements is / isn't less than 4, i.e. I want to get: [3,2]
In other words, basic examples of partitioning nested data.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists
"which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
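For example, running the whole program against the sample input from the question (shown here as one possible shell invocation):
$ jq -c 'def tabulate(stream): reduce stream as $s ({}; .[$s] += 1);
         tabulate(.[] | length==0 | tostring) | [.["true", "false"]]' \
      <<< '[[1,2], [3,9], [4,2], [], []]'
[2,3]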
QED
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.
I am trying to do a comparison using an if condition:
xorg != "t8405" or "t9405" or "t7805" or "t8605" or "t8705"
I want to check that xorg is not equal to any of these values on the right side, and then perform Y.
I am trying to figure out how I can do this comparison more smartly, or whether I should compare xorg with each value one by one?
Regards
I think the in and ni (not in) operators are what you should look at. They test for membership (or non-membership) of a list. In this case:
if {$xorg ni {"t8405" "t9405" "t7805" "t8605" "t8705"}} {
    puts "it wasn't in there!"
}
If you've got a lot of these things and are testing frequently, you're actually better off putting the values into the keys of an array and using info exists:
foreach key {"t8405" "t9405" "t7805" "t8605" "t8705"} {
    set ary($key) 1
}
if {![info exists ary($xorg)]} {
    puts "it wasn't in there!"
}
It takes more setup doing it this way, but it's actually faster per test after that (especially from 8.5 onwards). The speedup is because arrays are internally implemented using fast hash tables; hash lookups are quicker than linear table scans. You can also use dictionaries (approximately dict set instead of set and dict exists instead of info exists) but the speed is similar.
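A minimal sketch of the dictionary variant (purely illustrative, reusing the same values):
# Same membership test, but with a dict instead of an array
set d [dict create]
foreach key {"t8405" "t9405" "t7805" "t8605" "t8705"} {
    dict set d $key 1
}
if {![dict exists $d $xorg]} {
    puts "it wasn't in there!"
}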
The final option is to use lsearch -sorted if you put that list of things in order, since that switches from linear scanning to binary search. This can also be very quick and has potentially no setup cost (if you store the list sorted in the first place) but it's the option that is least clear in my experience. (The in operator uses a very simplified lsearch internally, but just in linear-scanning mode.)
# Note; I've pre-sorted this list
set items {"t7805" "t8405" "t8605" "t8705" "t9405"}
if {[lsearch -sorted -exact $items $xorg] < 0} {
    puts "it wasn't in there!"
}
I usually use either the membership operators (because they're easy) or info exists if I've got a convenient set of array keys. I often have the latter around in practice...
I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price and a weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()
for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe by using faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a number of rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, e.g., the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
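A rough sketch of what that could look like with your loop (field names copied from your code; the batch size of 500 is just a starting point to tune):
import math

# Build unsaved model instances in memory instead of creating/saving one by one.
rows = []
for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        rows.append(row)

# One batched INSERT path instead of N separate queries.
StrategyRow.objects.bulk_create(rows, batch_size=500)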
Let's say I have this Django model:
class Question(models.Model):
    question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classic problem and has been discussed in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code in the 2nd link (which is copied below, changed a bit) and noticed it takes up to 3 seconds to sort the data. To check the function's performance, I wrote a test which creates a list of 100k random alphanumeric strings. It takes only 0.76 s to sort that list. So what's happening?
This is what I think: the function needs to get the question_code of each question for comparison, so calling this function to sort 15k values means querying MySQL 15k separate times. And this is why it takes so long. Any idea? And is there any solution for natural sort in Django in general? Thanks a lot!
def natural_sort(l, ascending, key=lambda s: s):
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    return sorted(l, key=sort_key, reverse=ascending)
As far as I'm aware, there isn't a generic Django solution to this. You can reduce your memory usage and limit your db queries by building an id/question_code lookup structure:
from natsort import natsorted
question_code_lookup = Question.objects.values('id','question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results, you can then slice up ordered_question_codes, perform another query to retrieve the questions you need, and order them according to their position in that slice:
#get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)
#put them back into question code order
id_to_pos = dict(zip(question_ids, range(len(question_ids))))
questions = sorted(questions, key=lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory, or takes too long to sort, then you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset.
I am working with serialized array fields in one of my models, specifically counting how many members two arrays share.
Now, by the nature of my project, I am having to do a HUGE number of these overlap counts... so I was wondering if there is a super quick, clever way to do this.
At the moment I am using the '&' method, so my code looks like this:
(user1.follower_names & user2.follower_names).count
which works fine... but I was hoping there might be a faster way to do it.
Sets are faster for this.
require 'benchmark'
require 'set'
alphabet = ('a'..'z').to_a
user1_followers = 100.times.map{ alphabet.sample(3) }
user2_followers = 100.times.map{ alphabet.sample(3) }
user1_followers_set = user1_followers.to_set
user2_followers_set = user2_followers.to_set
n = 1000
Benchmark.bm(7) do |x|
  x.report('arrays'){ n.times{ (user1_followers & user2_followers).size } }
  x.report('set')   { n.times{ (user1_followers_set & user2_followers_set).size } }
end
Output:
user system total real
arrays 0.910000 0.000000 0.910000 ( 0.926098)
set 0.350000 0.000000 0.350000 ( 0.359571)
An alternative to the above is to use the '-' operator on arrays:
user1.follower_names.size - (user1.follower_names - user2.follower_names).size
Essentially this takes the size of the first list and subtracts the size of the first list with the shared elements removed, which leaves the number of elements in the intersection. This isn't as fast as using sets, but it is much quicker than using intersection alone with arrays.
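As a quick sketch, you could add a report like this to the benchmark above (reusing the same variables) to see where it lands:
x.report('minus'){ n.times{ user1_followers.size - (user1_followers - user2_followers).size } }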