SQL where clause performance - mysql

I have to create a table that assigns a user id and a product id to some data (it models two one-to-many relationships). I will make a lot of queries like
select * from table where userid = x;
The first thing I am interested in is how big the table can get before the query time becomes noticeable (let's say it takes more than 1 second).
Also, how can this be optimised?
I know that this might depend on the implementation. I will use MySQL for this specific project, but I am interested in more general answers as well.

It all depends on the horsepower of your machine. To make that query more efficient, create an index on userid.
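For illustration, a minimal sketch of what that could look like - the table and column names (user_products, productid, data) are placeholders, not from the original question:

CREATE TABLE user_products (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    userid    BIGINT UNSIGNED NOT NULL,
    productid BIGINT UNSIGNED NOT NULL,
    data      VARCHAR(255),
    PRIMARY KEY (id)
);

-- the secondary index that turns "WHERE userid = x" into an index lookup
CREATE INDEX idx_userid ON user_products (userid);

-- example query that can now use idx_userid
SELECT * FROM user_products WHERE userid = 42;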

how big the table can get before the query time becomes noticeable (let's say it takes more than 1 second)
There are too many factors to deterministically measure run time. CPU speed, memory, I/O speed, etc. are just some of the external factors.
how can this be optimised?
That's more straightforward. If there is an index on userid, then the query will likely do an index seek, which is about as fast as you can get as far as finding the record. If userid is the clustered index, it will be faster still, because the engine won't have to follow the position stored in the index to find the record in the data pages - the data is physically organized as part of the index.
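In MySQL's InnoDB engine the clustered index is the primary key, so one way to get that layout is to lead the primary key with userid. A hedged sketch, reusing the hypothetical column names from the sketch above:

-- InnoDB clusters rows by primary key, so putting userid first in the key
-- physically groups each user's rows together on the same data pages.
CREATE TABLE user_products_clustered (
    userid    BIGINT UNSIGNED NOT NULL,
    productid BIGINT UNSIGNED NOT NULL,
    data      VARCHAR(255),
    PRIMARY KEY (userid, productid)
) ENGINE=InnoDB;

-- all rows for one user are now a short contiguous range in the clustered index
SELECT * FROM user_products_clustered WHERE userid = 42;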

let's say it takes more than 1 second
With an index on userid, MySQL will find the correct row in O(log n) in the worst case. How many seconds that is depends on the performance of your machine.
It is impossible to give you an exact number without knowing how long a single operation takes.
As an example: assume your table has 4 records. Finding one requires 2 operations in the worst case. Every time you double your data, one more operation is required.
For example:
# records | # operations to find entry (worst case)
        2 | 1
        4 | 2
        8 | 3
       16 | 4
      ... | ...
    4,096 | 12
      ... | ...
     ~1 B | 30
     ~2 B | 31
So, even with a huge number of records, the time remains almost constant: for 1 billion records you would need to perform ~30 operations.
And it continues like that: 2 billion records, 31 operations.
So, let's say your query executes in 0.001 seconds for 4,096 entries (12 operations);
it would then take around (0.001 / 12 × 30 =) 0.0025 seconds for 1 billion records.
Important side note: this only considers the runtime complexity of the index lookup (essentially a binary search), but it shows how the cost scales.
In a nutshell: your database will be unimpressed by a single query on an indexed value. However, if you run a large number of those queries at the same time, response time will of course increase.
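If you want to confirm that a given query really takes the indexed path rather than a full scan, EXPLAIN is the usual tool; a small sketch against the hypothetical user_products table from earlier:

EXPLAIN SELECT * FROM user_products WHERE userid = 42;
-- type = ref, key = idx_userid  ->  index lookup, as described above
-- type = ALL                    ->  full table scan; the index is missing or unusable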

Related

Calculating frequency of password hashes efficiently in MySQL

For my bachelor thesis I have to analyze a password leak, and I have a table with 2 columns, MEMBER_EMAIL and MEMBER_HASH.
I want to calculate the frequency of each hash efficiently, so that the output looks like:
Hash | Amount
----------------
2e3f.. | 345
2f2e.. | 288
b2be.. | 189
My query so far was straightforward:
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC
While it works fine for smaller tables, I have problems running the query on the whole list (112 million entries): it takes over 2 days and ends in a strange connection timeout error, even though my timeout settings are fine.
So I wonder if there is a better way to calculate this (I can't really think of one). I would appreciate any help!
Your query itself can't really be optimized, as it's quite simple. The only way I can think of to improve how it executes is to index MEMBER_HASH.
This is how you can do it:
ALTER TABLE `table` ADD INDEX `hashed` (`MEMBER_HASH`);
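As a sanity check after adding the index, it may be worth looking at the query plan; a hedged sketch using the table and query from the question:

EXPLAIN
SELECT MEMBER_HASH AS hashed, COUNT(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC;
-- With the index in place, the Extra column should include "Using index",
-- meaning MySQL can read just the index (already in hash order) rather than
-- the whole table, which is usually much cheaper for a GROUP BY like this.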

mysql bit-operations performance

I need to compare very long bit sequences (up to 1000 bits) with each other. Currently I've got a table with 16 columns of 64 bits each (16 × 64 = 1024):
userid block_1 block_2 block_3 ...
1 1001... 1100... 0010...
2 1101... 1011... 0111...
3 1011... 0111... 1100...
my statement:
select sh.userid,
sum(bit_count(se.block_1 & sh.block_1)+bit_count(se.block_2 & sh.block_2)+...)
from my_table se, my_table sh
where se.userid = 1
and se.userId != sh.userId
group by sh.userId
With about 5 million entries, the query time is ~1.5 seconds, which is already pretty good, I guess. Most of the time is spent in the BIT_COUNT and & part, so I'm wondering whether there's any room for improvement.
Is there a better way to compare very long binary sequences?
EDIT
I expect se.block_X to contain many more 0s than sh.block_X; does it make a difference whether I write
se.block_X & sh.block_X
or
sh.block_X & se.block_X
? I would expect the first one to be faster.
The 1s and 0s represent yes and no for a lot of categories, and I'm only interested in the positions where BOTH entries said yes. So
10101011100101011
&
10010101101001010
=================
10000001100001010
and now I know where both of them said yes. For my use case I have to compare, for example, my answers with the answers of ~5 million other entries, and there are about 1000 questions. So I've got a table with 5 million entries, each with 16 × 64-bit columns, and I basically have to compare one entry against all the others.
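As an aside, the AND + BIT_COUNT step can be tried in isolation with small made-up numbers before touching the 5-million-row table:

-- 11 = binary 1011, 13 = binary 1101
SELECT 11 & 13            AS common_bits,  -- 9, i.e. binary 1001
       BIT_COUNT(11 & 13) AS common_yes;   -- 2 positions where both said yes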

MySQL- Counting rows VS Setting up a counter

I have 2 tables, posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here the vote column can be either 1 (upvote) or -1 (downvote). Now if I need to fetch the total votes (upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from the votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it every time a user upvotes or downvotes. Then simply read that votes_counter.
My question is which one is better, and under what conditions. By conditions I mean factors like scalability, peak time, et cetera.
As far as I know, with method 1, count(*) could be a heavy operation on a table with millions of rows. But if I use a counter to avoid that, then during peak time the votes_counter row might become a point of lock contention or even deadlock, with too many users trying to update the counter!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table.
The first approach can be sped up by well-designed indexes. Rather than searching through the whole table, your RDBMS could retrieve a few records from the index and do the counts using them.
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
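If you do get to the trigger stage, a minimal sketch of the insert case, assuming the posts/votes schema exactly as described in the question (updates and deletes of votes would need their own triggers), might look like:

DELIMITER $$

CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
BEGIN
    -- NEW.vote is +1 or -1, so adding it keeps the running total correct
    UPDATE posts
    SET votes_counter = votes_counter + NEW.vote
    WHERE id = NEW.post_id;
END$$

DELIMITER ;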
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so, based on that, it sounds like the way it would be run is (with @PostId standing in for the post being scored):
SELECT VoteTypeId FROM Votes WHERE PostId = @PostId AND VoteTypeId IN (2, 3)
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults:
    if (vote == 2) score++;
    else if (vote == 3) score--;
Even with millions of results, this is probably going to be a very fast operation as it's so simple.
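A hedged variation on the same idea is to fold that loop into the query itself, so only a single number is sent to the application; here again @PostId is a placeholder for the post being scored:

SELECT COALESCE(SUM(CASE VoteTypeId
                        WHEN 2 THEN 1    -- UpMod
                        WHEN 3 THEN -1   -- DownMod
                    END), 0) AS score
FROM Votes
WHERE PostId = @PostId
  AND VoteTypeId IN (2, 3);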

Slow MySQL Queries on new Server, switching from Joyent to rimuhosting

I recently switched servers because Joyent is ending their service soon. But the queries on rimuhosting seem to take significantly longer (2-6 times). And there's huge variance in behavior: most queries run in .02 seconds or less, and then sometimes those same exact queries take .5 seconds or more. Both servers were running MySQL and PHP (similar versions, but not exactly the same).
The CPU is 20-40% idle. Most of the memory is being used, but they tell me that's normal. Tech support tells me it's not swapping:
Here's what it looks like right now: (though memory usage will increase to near max eventually, like last time)
Mem: 1513548k total, 1229316k used, 284232k free, 63540k buffers
Swap: 131064k total, 0k used, 131064k free, 981420k cached
MySQL max_connections is set to 400.
So why am I sometimes getting these super slow queries?
Here is an example of a query that sometimes takes 0.01 seconds and sometimes more than 1 second:
SELECT (!attacked AND (firstLoginDate > 1348703469 )) as protected,
id, universe.uid, universe.name AS obj_name,top,left, guilds.name as alliance,
rotate,what, player_data.first, player_data.last,
aid AS gid, (aid=1892 AND aid>0) as am,
fleet LIKE '%Turret%' AS turret,
startLeft, startTop, endLeft, endTop, duration, startTime, movetype,
moving,speed, defend, hp, lastAttack>1349740269 AS ra FROM universe LEFT JOIN player_data ON universe.uid=player_data.uid
LEFT JOIN guilds ON aid=guilds.gid
WHERE ( sector='21_82' OR sector='22_82' OR sector='21_83' OR sector='22_83' ) OR
( universe.uid=1568425485 AND ( upgrading=1 OR building=1 ))
Yes, I do have indexes on all the appropriate columns. And all 3 tables featured above are InnoDB tables, which means they use row-level locking, not table-level locking.
But this is interesting (new server):
Innodb_row_lock_time_avg 400 The average time to acquire a row lock, in milliseconds.
Innodb_row_lock_time_max 4,010 The maximum time to acquire a row lock, in milliseconds.
Innodb_row_lock_waits 31 The number of times a row lock had to be waited for.
Why does it take so long to acquire a row lock?
My old server was able to acquire row locks faster:
Innodb_row_lock_time_avg 26 The average time to acquire a row lock, in milliseconds.
Here's the new server:
Opened_tables 5,500 (in just 2 hours) The number of tables that have been opened. If opened tables is big, your table cache value is probably too small.
table cache 256
table locks waited 3,302 (in just 2 hours)
Here is the old server:
Opened_tables 420
table cache 64
Does that make sense? If I increase the table cache, will that alleviate things?
Note: I have 1.5 GB of RAM on this server.
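For reference, raising the cache is just a variable change; a hedged sketch (the variable is table_open_cache on MySQL 5.1.3 and later, table_cache on older versions, and 1024 is only an example value to weigh against the 1.5 GB of RAM):

-- check current values first
SHOW GLOBAL STATUS LIKE 'Opened_tables';
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';

-- raise it at runtime...
SET GLOBAL table_open_cache = 1024;

-- ...and persist it in my.cnf so it survives a restart:
-- [mysqld]
-- table_open_cache = 1024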
Here is the EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE universe index_merge uidwhat,uid,uidtopleft,upgrading,building,sector sector,uid 38,8 NULL 116 Using sort_union(sector,uid); Using where
1 SIMPLE player_data ref mainIndex mainIndex 8 jill_sp.universe.uid 1
1 SIMPLE guilds eq_ref PRIMARY PRIMARY 8 jill_sp.player_data.aid 1

MySQL (HeidiSQL): Using localhost but 10 seconds of query time is "network"?

I am running MySQL queries using the HeidiSQL editor. When it tells me the query time, it will sometimes also include a network time:
Duration for 1 query: 1.194 sec. (+ 10.078 sec. network)
But it can't really be the network, can it, since everything is on my own computer? Is that extra time something that would disappear with a different setup, or do I need to improve my query performance the usual way (rewriting/reworking)? It's hard to improve a query's performance when I'm not even sure what's causing it to be slow.
EDIT: Profiling info
I used this neat profiling sql: http://www.mysqlperformanceblog.com/2012/02/20/how-to-convert-show-profiles-into-a-real-profile/
Query 1:
Select count(*) from my_table_with_100_thousand_rows;
"Duration for 1 query: 0.390 sec."
(This one did not show any network time, but almost .4 seconds for a simple count(*) seems a lot.)
STATE Total_R Pct_R Calls R/Call
Sending data 0.392060 35.84 1 0.3920600000
freeing items 0.000214 0.02 1 0.0002140000
starting 0.000070 0.01 1 0.0000700000
Opening tables 0.000031 0.00 1 0.0000310000
statistics 0.000024 0.00 1 0.0000240000
init 0.000020 0.00 1 0.0000200000
(shorter times not included)
Query 2:
select * from 4 tables with many rows, joined by primary_key-foreign_key or indexed column.
"Duration for 1 query: 0.156 sec. (+ 10.140 sec. network)" (the times below add up to more than the total?)
STATE Total_R Pct_R Calls R/Call
Sending data 16.424433 NULL 1 16.4244330000
freeing items 0.000390 NULL 1 0.0003900000
starting 0.000116 NULL 1 0.0001160000
statistics 0.000054 NULL 1 0.0000540000
Opening tables 0.000050 NULL 1 0.0000500000
init 0.000046 NULL 1 0.0000460000
preparing 0.000033 NULL 1 0.0000330000
optimizing 0.000028 NULL 1 0.0000280000
(shorter times not included)
Query 3:
Same as query 2, but with count(*) instead of select *.
"Duration for 1 query: 10.047 sec."
STATE Total_R Pct_R Calls R/Call
Sending data 10.050007 NULL 1 10.0500070000
(shorter times not included)
It seems to me that the time gets reported as "network" when a lot of rows have to be displayed, but that does NOT mean I can simply subtract it: when there are no rows to display (query 3), the same time shows up as real query time. Does this seem right?
Old question!
I'm pretty sure Heidi counts as "network time" the elapsed time from receipt of the first response packet over the network to receipt of the last response packet in the result set.
So, for your SELECT COUNT(*) FROM big _f_table query the first packet comes back right away, and declares that there's a single column containing an integer.
The rest of that result set comes when the query engine is done counting the rows. So Heidi's so-called "network time" is the time to count the rows. That's practically instantaneous for MyISAM, and takes a while for InnoDB.
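A quick, hedged way to see which case applies is to check the table's storage engine (using the my_table_with_100_thousand_rows name from the question):

SHOW TABLE STATUS LIKE 'my_table_with_100_thousand_rows';

-- or, equivalently, via information_schema (table_rows is only an estimate for InnoDB):
SELECT engine, table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'my_table_with_100_thousand_rows';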
For your SELECT tons of columns FROM complex join the same thing applies. The first packet arrives when the query planner has figured out what columns will be in the result set. The last packet arrives when all that data has finally been transferred to Heidi over your computer's internal loopback (localhost) network.
It's like what you see in your browser devtools. The query time is analogous to the "time to first byte", and the "network time" is the time to deliver the rest of the result. Time to first byte is the query parsing / planning time PLUS the time to get enough information to send something for the result set metadata. Network time is the time to get the rest. If the query planner can stream the rows to you directly from the table storage, you'll have a high proportion of network time. If, on the other hand, it has to crunch the data (for example with ORDER BY), you'll have a higher proportion of query time.

But don't try to overthink this stuff. MariaDB and MySQL are very complex, with layers of caching and fetching. The way they satisfy queries is sometimes hard to figure out.