Is it okay to do binary search on an indexed column to get data from a non-indexed column? - mysql

I have a large table users(id, inserttime, ...) with an index only on id. I would like to find the users who were inserted between a given start_date and finish_date.
User.where(inserttime: start_date..finish_date).find_each
This query takes a long time because the inserttime column is not indexed.
The solution I came up with is to find the user id for start_date and for finish_date separately, by running a binary search twice over the table using the indexed id column.
Then I fetch all the users between start_id and finish_id:
User.where(id: start_id..finish_id).find_each
The binary search function I am using is something like this:
def find_user_id_by_date(date)
  low = User.select(:id, :inserttime).first
  high = User.select(:id, :inserttime).last
  low_id = low.id
  high_id = high.id
  low_date = low.inserttime
  high_date = high.inserttime
  while low_id <= high_id
    mid_id = low_id + ((high_id - low_id) / 2)
    mid = User.select(:id, :inserttime).find_by(id: mid_id)
    # sometimes there can be missing users. Ex: [1,2,8,9,10,16,17,..]
    while mid.nil?
      mid_id = mid_id + 1
      mid = User.select(:id, :inserttime).find_by(id: mid_id)
    end
    if mid.inserttime < date
      low_id = mid.id + 1
    elsif mid.inserttime > date
      high_id = mid.id - 1
    else
      return mid.id
    end
  end
  # when date = start_date
  return (low_id < high_id) ? low_id + 1 : high_id + 1
  # when date = finish_date
  return (low_id < high_id) ? low_id : high_id + 1
end
I am not sure if what I am doing is the right way to deal with this problem or even if my binary search function covers all the cases.
I think the best solution would be to add an index on the inserttime column, but that is sadly not possible.

This might not be the best way to do it, but if the IDs are numeric and sequential you could write a query to find the users in between the minimum and maximum user ID:
SELECT id
FROM users
WHERE id BETWEEN [low_id_here] AND [high_id_here];
In ActiveRecord:
low = User.select(:id, :inserttime).first
high = User.select(:id, :inserttime).last
low_id = low.id
high_id = high.id
User.where('id BETWEEN ? AND ?', low_id, high_id)
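On the binary search in the question: one way to sidestep the missing-id and two-return problems is to search for a lower bound instead of an exact match. The following is only a sketch in Python, not the asker's Ruby. It assumes that inserttime never decreases as id increases, and it uses a hypothetical in-memory ROWS list in place of the users table, with first_row_at_or_after standing in for a query such as WHERE id >= ? ORDER BY id LIMIT 1 on the indexed id column.

```python
from bisect import bisect_left

# Hypothetical stand-in for the users table: (id, inserttime) pairs sorted by id.
# Assumption: inserttime never decreases as id increases.
ROWS = [(1, 10), (2, 20), (8, 30), (9, 30), (10, 40), (16, 50), (17, 60)]
IDS = [row[0] for row in ROWS]


def first_row_at_or_after(row_id):
    """Mimic `WHERE id >= row_id ORDER BY id LIMIT 1`; None if no such row."""
    pos = bisect_left(IDS, row_id)
    return ROWS[pos] if pos < len(ROWS) else None


def lower_bound_id(date):
    """Smallest id x such that every row with id >= x has inserttime >= date."""
    low, high = IDS[0], IDS[-1] + 1  # the answer always lies in [low, high]
    while low < high:
        mid = low + (high - low) // 2
        row = first_row_at_or_after(mid)  # skips gaps in the id sequence
        if row is None or row[1] >= date:
            high = mid        # everything from mid onward is late enough
        else:
            low = row[0] + 1  # rows up to and including row[0] are too early
    return low


def ids_between(start_date, finish_date):
    """Ids of rows with start_date <= inserttime <= finish_date."""
    start_id = lower_bound_id(start_date)
    end_id = lower_bound_id(finish_date + 1)  # exclusive; integer times assumed
    return [i for i in IDS if start_id <= i < end_id]


print(ids_between(30, 40))
```

The returned bound may fall on an id that does not exist (a gap), which is harmless when it is only used as the edge of an id range. The same function serves both start_date and finish_date, so no special-cased return is needed.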

Related

optimize sql query inside foreach

I need help optimizing the queries below for a recurring calendar I've built.
if user fail to accomplish all task where date
This is the query I use inside a foreach that fetches all dates on which the current activity is active.
This is my current setup, which works but is very slow.
Other variables explained:
$today=date("Y-m-d");
$parts = explode($sepparator, $datespan);
$dayForDate2 = date("l", mktime(0, 0, 0, $parts[1], $parts[2], $parts[0]));
$week2 = strtotime($datespan);
$week2 = date("W", $week2);
if($week2&1) { $weektype2 = "3"; } # Odd week 1, 3, 5 ...
else { $weektype2 = "2"; } # Even week 2, 4, 6 ...
Query1:
$query1 = "SELECT date_from, date_to, bok_id, kommentar
FROM bokningar
WHERE bokningar.typ='2'
and date_from<'".$today."'";
Function that makes the foreach move ahead one day at a time:
function date_range($first, $last, $step = '+1 day', $output_format = 'Y-m-d' )
{
    $dates = array();
    $current = strtotime($first);
    $last = strtotime($last);
    while( $current <= $last ) {
        $dates[] = date($output_format, $current);
        $current = strtotime($step, $current);
    }
    return $dates;
}
foreach:
foreach (date_range($row['date_from'], $row['date_to'], "+1 day", "Y-m-d")
as $datespan)
if ($datespan < $today)
Query 2:
$query2 = "
SELECT bok_id, kommentar
FROM bokningar b
WHERE b.typ='2'
AND b.bok_id='".$row['bok_id']."'
AND b.weektype = '1'
AND b.".$dayForDate2." = '1'
AND NOT EXISTS
(SELECT t.tilldelad, t.bok_id
FROM tilldelade t
WHERE t.tilldelad = '".$datespan."'
AND t.bok_id='".$row['bok_id']."')
OR b.typ='2'
AND b.bok_id='".$row['bok_id']."'
AND b.weektype = '".$weektype2."'
AND b.".$dayForDate2." = '1'
AND NOT EXISTS
(SELECT t.tilldelad, t.bok_id
FROM tilldelade t
WHERE t.tilldelad = '".$datespan."'
AND t.bok_id='".$row['bok_id']."')";
b.weektype is either 1, 2 or 3 (every week, every even week, every odd week)
bokningar needs INDEX(typ, date_from)
Instead of computing $today, you can do
and date_from < CURDATE()
Are you running $query2 for each date? How many days is that? You may be able to build a table of dates, then JOIN it to bokningar to do all the SELECTs in a single SELECT.
When doing x AND y OR x AND z, first add parentheses to make it clear which comes first, AND or OR: (x AND y) OR (x AND z). Then use a simple rule in Boolean arithmetic to transform it into a more efficient expression: x AND (y OR z) (where the parens are necessary).
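The equivalence can be checked exhaustively. This is a small Python sketch (the function names are made up for illustration) that compares the two forms over every combination of truth values; Python's and/or have the same relative precedence as SQL's AND/OR.

```python
from itertools import product


def original(x, y, z):
    # AND binds tighter than OR, so this is (x and y) or (x and z).
    return x and y or x and z


def factored(x, y, z):
    # The parentheses are required here; without them the meaning changes.
    return x and (y or z)


mismatches = [
    combo
    for combo in product([False, True], repeat=3)
    if original(*combo) != factored(*combo)
]
print(mismatches)
```

An empty mismatches list means the two forms agree on all eight inputs.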
The usual pattern for EXISTS is EXISTS ( SELECT 1 FROM ... ); there is no need to list columns.
If I am reading it correctly, the only difference is in testing b.weektype. So the WHERE can be simply
WHERE b.weektype IN ('".$weektype2."', '1')
AND ...
There is no need for OR, since IN() already expresses it.
tilldelade needs INDEX(tilldelad, bok_id), in either order. This should make the EXISTS(...) run faster.
Finally, bokningar needs INDEX(typ, bok_id, weektype) in any order.
That is a lot to change and test. See if you can get those things done. If it still does not run fast enough, start a new Question with the new code. Please include SHOW CREATE TABLE for both tables.

MySql to Python Date

I am trying to convert a MySQL datetime to the equivalent in Python. When debugging I get:
ValueError: time data '2001-06-04T11:30:35' doesn't match format %Y-%m-%dT%H:%M:%S .
In MySQL there is no 'T' in the data. I tried the format with 'T' and without.
I saw the question "How to convert the following string to python date?".
This is code:
query = QSqlQuery ()
query.exec_("SELECT birthday FROM vista.user ")

def countAge(birthday):
    birthday = datetime.strptime(str(birthday), "%Y-%m-%dT%H:%M:%S.%f")
    today = date.today()
    age = today.year - birthday.year
    if today.month < birthday.month:
        age -= 1
    elif today.month == birthday.month and today.day < birthday.day:
        age -= 1
    if age >= 0 :
        return age

ages = []
index = 0
while (query.next()):
    print(query.value(index).toString())
    ages.append(countAge(query.value(index).toString()))
    index = index + 1
What is the problem?
If an example date-string is 2001-06-04T11:30:35, then you need:
birthday = datetime.strptime(str(birthday), "%Y-%m-%dT%H:%M:%S")
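A short sketch of why the format matters: strptime requires the format to match the whole string, so a trailing .%f fails when the string carries no fractional seconds. The helper names parse_birthday and count_age, and the optional today parameter, are introduced here only for illustration.

```python
from datetime import date, datetime

ISO_NO_FRACTION = "%Y-%m-%dT%H:%M:%S"


def parse_birthday(text):
    """Parse strings such as '2001-06-04T11:30:35' (no fractional seconds)."""
    return datetime.strptime(text, ISO_NO_FRACTION)


def count_age(birthday, today=None):
    """Whole years between birthday and today (today defaults to the current date)."""
    today = today or date.today()
    age = today.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        age -= 1
    return age


birthday = parse_birthday("2001-06-04T11:30:35")

# The same string does not match a format that expects '.%f' at the end.
try:
    datetime.strptime("2001-06-04T11:30:35", "%Y-%m-%dT%H:%M:%S.%f")
    fraction_format_matches = True
except ValueError:
    fraction_format_matches = False

print(birthday, fraction_format_matches)
```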

More efficient active record grouping

I'm trying to check 2.5-second intervals for records and add an object to an array based on the count. This approach works, but it is far too slow. Thanks.
@tweets = Tweet.last(3000)
first_time = @tweets.first.created_at
last_time = @tweets.last.created_at
while first_time < last_time
  group = @tweets.where(created_at: (first_time)..(first_time + 2.5.seconds)).count
  if group == 0 || nil
    puts "0 or nil"
    first_id + 1
    array << {tweets: 0}
  else
    first_id += group
    array << {tweets: group}
  end
  first_time += 2.5.seconds
end
return array.to_json
end
What you really need is the group_by method on the records you've retrieved:
grouped = @tweets.group_by do |tweet|
  # Convert from timestamp to 2.5s interval number
  (tweet.created_at.to_f / 2.5).to_i
end
That returns a hash with the key being the time interval, and the values being an array of tweets.
What you're doing in your example probably has the effect of making thousands of queries. Always watch log/development.log to see what's going on in the background.
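The same bucketing idea, sketched in Python on plain epoch timestamps rather than ActiveRecord objects. The function name counts_per_interval is hypothetical. Unlike the loop in the question, the buckets here are aligned to multiples of 2.5 seconds since the epoch (as in the group_by answer), not to the time of the first tweet, and empty intervals are filled in with a count of 0.

```python
from collections import Counter

INTERVAL = 2.5  # seconds


def counts_per_interval(timestamps, interval=INTERVAL):
    """Count timestamps per interval, including empty intervals in between."""
    if not timestamps:
        return []
    # Map each timestamp to its interval number, then count per interval.
    buckets = Counter(int(t // interval) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [{"tweets": buckets.get(i, 0)} for i in range(first, last + 1)]


sample = [0.0, 1.0, 2.4, 2.5, 8.0]
print(counts_per_interval(sample))
```

All rows are loaded once and grouped in memory, so there is a single query instead of one per interval.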

mysql select count(column) where sum(column) > value

I'm trying to query $wpdb to get back an int value of the number of users in a custom table who have recorded a number of hours volunteer work above a set target - these hours need to have been moderated ( value set to = 1 ) - I have this so far:
EDIT - updated to use consistent {} around php variables in query --
$target = get_post_meta($post->ID, 'target', true) ? (int)get_post_meta($post->ID, 'target', true) : 100;
$awards = $wpdb->get_var("
SELECT user_id
FROM {$this->options['rewards_logging']}
WHERE moderated = 1 AND reward_id = {$post->ID}
GROUP BY user_id
HAVING sum(hours) > {$target}
");
This returns the correct value of '0' if none of the hours are approved (moderated = 0). However, as soon as one of those users' hours are approved, the query returns the count of all the users who have logged more than the target hours (whether they have been approved or not).
Any pointers!
Cheers
Ray
Seems I was trying to get back a single variable using $wpdb->get_var, when I really needed the whole result set:
$awards = $wpdb->get_results("
SELECT user_id
FROM {$this->options['rewards_logging']}
WHERE moderated = 1 AND reward_id = {$post->ID}
GROUP BY user_id
HAVING sum(hours) > {$target}
");
Then I can check over the data and display a result - etc...:
if ( count($awards) > 0 ) {
#var_dump($awards);
echo '<span class="awards-notice">'.count($awards).'</span>';
} else {
echo '-';
}
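To illustrate the difference between reading one value and counting the whole result set, here is a sketch using Python's sqlite3 module as a stand-in for MySQL and $wpdb. The table contents and the literal table name rewards_logging are made up for the example. Reading only the first column of the first row yields a user id, not a count; the count is either the number of rows returned or a COUNT(*) wrapped around the grouped query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rewards_logging "
    "(user_id INTEGER, reward_id INTEGER, hours REAL, moderated INTEGER)"
)
conn.executemany(
    "INSERT INTO rewards_logging VALUES (?, ?, ?, ?)",
    [
        (1, 7, 60, 1), (1, 7, 50, 1),  # user 1: 110 approved hours
        (2, 7, 120, 0),                # user 2: 120 hours, none approved
        (3, 7, 70, 1), (3, 7, 70, 0),  # user 3: only 70 of 140 hours approved
        (4, 7, 150, 1),                # user 4: 150 approved hours
    ],
)

QUERY = """
SELECT user_id
FROM rewards_logging
WHERE moderated = 1 AND reward_id = ?
GROUP BY user_id
HAVING SUM(hours) > ?
"""
reward_id, target = 7, 100

# Option 1: fetch every qualifying user and count the rows in application code.
rows = sorted(conn.execute(QUERY, (reward_id, target)).fetchall())
awards = len(rows)

# Option 2: let the database count by wrapping the grouped query.
count_in_sql = conn.execute(
    "SELECT COUNT(*) FROM (" + QUERY + ") AS qualifying", (reward_id, target)
).fetchone()[0]

print(rows, awards, count_in_sql)
```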

How can I refer to the main query's table in a nested subquery?

I have a table named passive that contains a list of timestamped events per user. I want to fill the attribute duration, which corresponds to the time between the current row's event and the next event done by this user.
I tried the following query:
UPDATE passive as passive1
SET passive1.duration = (
SELECT min(UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive1.event_time) )
FROM passive as passive2
WHERE passive1.user_id = passive2.user_id
AND UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive1.event_time) > 0
);
This returns the error message Error 1093 - You can't specify target table for update in FROM.
In order to circumvent this limitation, I tried to follow the structure given in https://stackoverflow.com/a/45498/395857, which uses a nested subquery in the FROM clause to create an implicit temporary table, so that it doesn't count as the same table we're updating:
UPDATE passive
SET passive.duration = (
SELECT *
FROM (SELECT min(UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive.event_time))
FROM passive, passive as passive2
WHERE passive.user_id = passive2.user_id
AND UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive1.event_time) > 0
)
AS X
);
However, the passive table in the nested subquery doesn't refer to the same passive as in the main query. Because of that, all rows have the same passive.duration value. How can I refer to the main query's passive in the nested subquery? (or maybe are there some alternative ways to structure such a query?)
Try it like this:
UPDATE passive AS passive1
SET passive1.duration = (
    SELECT MIN(UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive1.event_time))
    FROM (SELECT * FROM passive) AS passive2
    WHERE passive1.user_id = passive2.user_id
    AND UNIX_TIMESTAMP(passive2.event_time) - UNIX_TIMESTAMP(passive1.event_time) > 0
);
We can use a Python script to circumvent the issue:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# An index on (user_id, observed_event_timestamp) is needed to speed up the lookups.
# Download it at http://sourceforge.net/projects/mysql-python/?source=dlp
# Tutorials: http://mysql-python.sourceforge.net/MySQLdb.html
# http://zetcode.com/db/mysqlpython/
import datetime

import MySQLdb


def main():
    start = datetime.datetime.now()
    # One connection streams the events; the other runs the lookups and updates.
    db = MySQLdb.connect(user="root", passwd="password", db="db_name")
    db2 = MySQLdb.connect(user="root", passwd="password", db="db_name")
    cursor = db.cursor()
    cursor2 = db2.cursor()
    cursor.execute("SELECT observed_event_id, user_id, observed_event_timestamp "
                   "FROM observed_events ORDER BY observed_event_timestamp ASC")
    count = 0
    for primary_key, user_id, timestamp in cursor:
        count += 1
        # Find the next event by the same user. Values are passed as
        # parameters so that the driver quotes them.
        cursor2.execute("SELECT observed_event_timestamp FROM observed_events "
                        "WHERE observed_event_timestamp > %s AND user_id = %s "
                        "ORDER BY observed_event_timestamp ASC LIMIT 1",
                        (timestamp, user_id))
        duration = 0
        for row2 in cursor2:
            duration = (row2[0] - timestamp).total_seconds()
            if duration > 60 * 60:  # gaps longer than one hour are stored as 0
                duration = 0
            break
        cursor2.execute("UPDATE observed_events SET observed_event_duration = %s "
                        "WHERE observed_event_id = %s", (duration, primary_key))
        if count % 1000 == 0:
            db2.commit()
            percent = 100.0 * count / cursor.rowcount
            elapsed = (datetime.datetime.now() - start).total_seconds()
            print("Percent done: %s%% in %s seconds." % (percent, elapsed))
    db2.commit()  # commit the last partial batch
    db.close()
    db2.close()
    diff = (datetime.datetime.now() - start).total_seconds()
    print('finished in %s seconds' % diff)


if __name__ == "__main__":
    main()
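If the events fit in memory, the one-lookup-per-row pattern can be avoided by sorting once and comparing each event with its successor for the same user. The sketch below is not the script above: it works on plain (event_id, user_id, event_time) tuples, the function name durations is hypothetical, and it assumes a user's events have distinct timestamps. Like the script, it stores 0 for the last event of a user and for gaps longer than one hour.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

MAX_GAP = timedelta(hours=1)


def durations(events):
    """Map event_id -> seconds until the same user's next event (0 if none)."""
    result = {}
    # Sort by user, then by time, so each user's events are adjacent and ordered.
    ordered = sorted(events, key=itemgetter(1, 2))
    for _, group in groupby(ordered, key=itemgetter(1)):
        user_events = list(group)
        # Pair every event with the one that follows it for the same user.
        for current, following in zip(user_events, user_events[1:]):
            gap = following[2] - current[2]
            result[current[0]] = gap.total_seconds() if gap <= MAX_GAP else 0
        result[user_events[-1][0]] = 0  # no later event for this user
    return result


t0 = datetime(2020, 1, 1, 12, 0, 0)
sample = [
    (1, "a", t0),
    (2, "b", t0 + timedelta(seconds=10)),
    (3, "a", t0 + timedelta(seconds=90)),
    (4, "a", t0 + timedelta(hours=3)),
    (5, "b", t0 + timedelta(seconds=40)),
]
print(durations(sample))
```

The resulting mapping could then be written back with batched UPDATE statements keyed on the event id.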