I am working with serialized array fields in one of my models; specifically, I need to count how many members two of these arrays share.
Now, by the nature of my project, I am having to do a HUGE number of these overlap counts, so I was wondering if there is a super quick, clever way to do this.
At the moment I am using the & method, so my code looks like this:
(user1.follower_names & user2.follower_names).count
which works fine... but I was hoping there might be a faster way to do it.
Sets are faster for this.
require 'benchmark'
require 'set'

# build two random follower lists of 100 entries each
alphabet = ('a'..'z').to_a
user1_followers = 100.times.map { alphabet.sample(3) }
user2_followers = 100.times.map { alphabet.sample(3) }

# build the Set versions once, outside the timed loop
user1_followers_set = user1_followers.to_set
user2_followers_set = user2_followers.to_set

n = 1000
Benchmark.bm(7) do |x|
  x.report('arrays') { n.times { (user1_followers & user2_followers).size } }
  x.report('set')    { n.times { (user1_followers_set & user2_followers_set).size } }
end
Output:
user system total real
arrays 0.910000 0.000000 0.910000 ( 0.926098)
set 0.350000 0.000000 0.350000 ( 0.359571)
An alternative to the above is to use the '-' operator on arrays:
user1.follower_names.size - (user1.follower_names - user2.follower_names).size
Essentially this takes the size of the first list and subtracts the size of the difference, i.e. the first list with the intersection removed. This isn't as fast as using sets, but it is much quicker than using Array intersection alone.
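If you want to verify that on your own data, the benchmark above can be extended with the '-' approach; here is a self-contained sketch using the same setup:

require 'benchmark'

alphabet = ('a'..'z').to_a
user1_followers = 100.times.map { alphabet.sample(3) }
user2_followers = 100.times.map { alphabet.sample(3) }

n = 1000
Benchmark.bm(12) do |x|
  x.report('intersection') { n.times { (user1_followers & user2_followers).size } }
  x.report('difference')   { n.times { user1_followers.size - (user1_followers - user2_followers).size } }
end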
Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are/are not non-empty, i.e. I want to get: [3,2]
The number of nested lists which do/do not contain the number 3, i.e. I want to get: [1,4]
The number of nested lists for which the sum of the elements is/isn't less than 4, i.e. I want to get: [3,2]
i.e. basic examples of partitioning nested data.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists
"which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
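For instance, running this against the sample input from the question:

echo '[[1,2], [3,9], [4,2], [], []]' | jq -c 'group_by(length > 0) | map(length)'
# => [2,3]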
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);
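As a quick sanity check of the def (run with null input, since the stream is given as an argument):

jq -n 'def tabulate(stream): reduce stream as $s ({}; .[$s] += 1); tabulate("a", "b", "a")'
# => {"a": 2, "b": 1}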
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
QED
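Putting the two pieces together and running them against the sample input confirms the refined answer:

echo '[[1,2], [3,9], [4,2], [], []]' |
  jq -c 'def tabulate(stream): reduce stream as $s ({}; .[$s] += 1);
         tabulate(.[] | length==0 | tostring) | [.["true", "false"]]'
# => [2,3]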
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.
I have a database with Lab models. I want to be able to search them using multiple different methods.
I chose to use one input field and to split the query into an array of words:
search = search.split(/[^[[:word:]]]+/).map{|val| val.downcase}
I use the Acts-as-taggable gem, so it would be nice to include those tags in the search too:
tag_results = self.tagged_with(search, any: true, wild: true)
For the methods below it seemed necessary to use:
search = search.map{|val| "%#{val}%"}
Sunspot also seemed like a great way to go for full-text search, so:
full_text_search = self.search {fulltext search}
full_text_results = full_text_search.results
I also decided to go with a simple database query searching by Lab name:
name_results = self.where("LOWER(name) ILIKE ANY ( array[?] )", search)
Lastly, I need all of the results in one array, so:
result = (tag_results + name_results + full_text_results).uniq
It works perfectly (what I mean is that the result is what I expect), but it returns a plain array and not an ActiveRecord::Relation, so there is no way for me to use methods like .select() or .order() on the results.
I want to ask: is there a better way to implement such a search? I have looked for search engines, but it seems like nothing would fit my idea.
If there is not, is there a way to convert an array into an ActiveRecord::Relation? (SO says there is no way.)
Answering this one:
is there a way to convert an array into ActiveRecord::Relation? (SO says there is no way)
You can convert an array of ActiveRecord objects into an ActiveRecord::Relation by collecting the ids from the array and querying your AR model for objects with those ids:
Model.where(id: result.map(&:id)) # returns an AR relation, as expected
It is the only way I am aware of.
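Applied to the result array from the question (assuming the model is Lab), that could look like this sketch:

# rebuild a relation from the mixed search results
labs = Lab.where(id: result.map(&:id))
# now relation methods chain as usual
labs.order(:name).select(:id, :name)

Note that the database decides the row order of the new relation, so the ordering of the original result array is lost unless you reapply it (e.g. with an explicit .order).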
I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest) and a price + weight.
Before a strategy runs, I delete all previous results of this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think raw SQL is the first thing you should try (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; saving each row individually like that is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of your StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a few rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, for example, the Python code creating the StrategyRow instances were the bottleneck (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
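A minimal sketch of what that could look like with the code from the question; batch_size is an assumption to tune, and price/weight need to be nullable for the None values to work:

import math

# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

# build all instances in memory without touching the database
rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        rows.append(StrategyRow(
            strategy=strategy, symbol=s, time=t,
            price=None if math.isnan(p) else p,    # assumes null=True on the field
            weight=None if math.isnan(w) else w,   # assumes null=True on the field
        ))

# one INSERT per batch instead of one (or two) queries per row
StrategyRow.objects.bulk_create(rows, batch_size=500)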
Let's say I have this Django model:
class Question(models.Model):
    question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classical problem and has been talked about in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code in the 2nd link (copied below, changed a bit), and noticed it takes up to 3 seconds to sort the data. To check the function's performance, I wrote a test which creates a list of 100k random alphanumeric strings. It takes only 0.76s to sort that list. So what's happening?
This is what I think: the function needs to get the question_code of each question for comparison, so calling it to sort 15k values means querying MySQL 15k separate times. And this is the reason why it takes so long. Any ideas? And is there any solution for natural sort in Django in general? Thanks a lot!
import re

def natural_sort(l, ascending, key=lambda s: s):
    def get_alphanum_key_func(key):
        # split into digit and non-digit runs so that "q10" sorts after "q2"
        convert = lambda text: int(text) if text.isdigit() else text
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    return sorted(l, key=sort_key, reverse=ascending)
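For reference, a quick check of this helper on a small list (note that, as written, ascending=True sets reverse=True, so pass False to get ascending output):

codes = ['q10', 'q2', 'q1']
print(natural_sort(codes, False))  # ['q1', 'q2', 'q10']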
As far as I'm aware, there isn't a generic Django solution to this. You can reduce your memory usage and limit your db queries by building an id/question_code lookup structure:
from natsort import natsorted
question_code_lookup = Question.objects.values('id','question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results, you can then slice up ordered_question_codes, perform another query to retrieve all the questions you need, and order them according to their position in that slice:
#get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)
#put them back into question code order
id_to_pos = dict(zip(question_ids, range(len(question_ids))))
questions = sorted(questions, key=lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory, or takes too long to sort, then you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset
I have a pretty expensive method in my model that compares the text items of arrays for many, many items.
It runs really slowly. If I use a relational database table and compare the IDs only, will my method run a lot faster?
EDIT:
I'm attempting to benchmark the code below:
@matches = @location_matches.sort do |l1, l2|
  l1.compute_score(current_user) <=> l2.compute_score(current_user)
end
@matches.reverse!
In short, I would guess that comparing numbers is faster, because comparing strings means comparing them character after character (advice: in Ruby, use symbols when you can; comparing them is much faster).
In any case, the Benchmark module gives you everything you need to benchmark and get detailed results.
A code sample:
require 'benchmark'
n = 50000
Benchmark.bm do |x|
x.report("for:") { for i in 1..n; a = "1"; end }
x.report("times:") { n.times do ; a = "1"; end }
x.report("upto:") { 1.upto(n) do ; a = "1"; end }
end
The result:
user system total real
for: 1.050000 0.000000 1.050000 ( 0.503462)
times: 1.533333 0.016667 1.550000 ( 0.735473)
upto: 1.500000 0.016667 1.516667 ( 0.711239)
Your first task is to replace that sort with sort_by; you can also skip the reverse! by sorting things in the desired order in the first place:
@matches = @location_matches.sort_by { |loc| -loc.compute_score(current_user) }
The sort method has to do many comparisons while sorting, and each comparison requires two compute_score calls; sort_by does a Schwartzian transform internally, so your expensive compute_score will only be called once per entry. The negation inside the block is just an easy way to reverse the sort order (I'm assuming your scores are numeric).
Fix up the obvious performance problem and then move on to benchmarking various solutions (but be sure to benchmark sort versus sort_by just to be sure that "obvious" matches reality).
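To compare the two, a benchmark along these lines should do; Item and its deliberately expensive compute_score are stand-ins for the real model:

require 'benchmark'

# stand-in for a location whose score is expensive to compute
Item = Struct.new(:value) do
  def compute_score
    1_000.times { Math.sqrt(value + 1) } # simulate expensive work
    value
  end
end

items = Array.new(200) { Item.new(rand(1000)) }

Benchmark.bm(8) do |x|
  x.report('sort')    { items.sort { |a, b| b.compute_score <=> a.compute_score } }
  x.report('sort_by') { items.sort_by { |item| -item.compute_score } }
end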