Calculating frequency of password hashes efficiently in MySQL - mysql

For my bachelor thesis I have to analyze a password leak. I have a table with two columns, MEMBER_EMAIL and MEMBER_HASH, and I want to calculate the frequency of each hash efficiently, so that the output looks like:
Hash | Amount
----------------
2e3f.. | 345
2f2e.. | 288
b2be.. | 189
My query until now was straight forward:
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC
While it works fine for smaller tables, I have problems computing the query on the whole list (112 million entries): it takes over two days and ends in a strange connection timeout error, even though my settings regarding that are fine.
So I wonder if there is a better way to calculate this (I can't really think of one); I would appreciate any help!

Your query can't be optimized much, as it's quite simple. The only way I can think of to improve how it executes is to index MEMBER_HASH. This is how you can do it:
ALTER TABLE `table` ADD INDEX `hashed` (`MEMBER_HASH`);
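To see the effect of the index on a small sample, the same computation can be sketched with Python's built-in sqlite3 module standing in for MySQL. The table and column names below come from the question; the sample data is invented for illustration.

```python
# Sketch of the hash-frequency query, using Python's built-in sqlite3
# as a stand-in for MySQL. Table and column names are taken from the
# question; the sample data is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fulllist (MEMBER_EMAIL TEXT, MEMBER_HASH TEXT)")
# The index lets the engine read MEMBER_HASH values in order and count
# groups without re-sorting the whole table.
con.execute("CREATE INDEX hashed ON fulllist (MEMBER_HASH)")
con.executemany(
    "INSERT INTO fulllist VALUES (?, ?)",
    [("a@x.com", "2e3f"), ("b@x.com", "2e3f"), ("c@x.com", "b2be")],
)

rows = con.execute(
    "SELECT MEMBER_HASH AS hashed, COUNT(*) AS amount "
    "FROM fulllist GROUP BY hashed ORDER BY amount DESC"
).fetchall()
print(rows)  # [('2e3f', 2), ('b2be', 1)]
```

For 112 million rows, building the index is itself a long one-off operation, but afterwards the GROUP BY no longer needs a full sort of the table.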

Related

MySQL query performance with reference tables

For the following two table structures, assuming the data volume is really high:
cars table
Id | brand name | make year | purchase year | owner name
Is there any query performance benefit with structuring it this way and joining the 2 tables instead?
cars table
Id | brand_id | make year | purchase year | owner name
brands table
Id | name
Also, if all 4 columns fall in my where clause, does it make sense indexing any?
I would at least have INDEX(owner_name) since that is very selective. Having INDEX(owner_name, model_year) won't help enough to matter for this type of data. There are other cases where I would recommend a 4-column composite index.
"data volume is really high". If you are saying there are 100K rows, then it does not matter much. If you are saying a billion rows, then we need to get into a lot more details.
"data volume is really high". 10 queries/second -- Yawn. 1000/second -- more details, please.
2 tables vs 1.
Data integrity - someone could mess up the data either way
Speed -- a 1-byte TINYINT UNSIGNED (range 0..255) is smaller than the average of about 7 bytes for a VARCHAR(55) brand. But it is hardly enough smaller to matter for space or speed. (And if you goof and make brand_id a BIGINT, which is 8 bytes; well, oops!)
Indexing all columns is different than having no indexes. But "indexing all" is ambiguous:
INDEX(user), INDEX(brand), INDEX(year), ... is likely to make it efficient to search or sort by any of those columns.
INDEX(user, brand, year), ... makes it especially efficient to search by all those columns (with =), or certain ORDER BYs.
No index implies scanning the entire table for any SELECT.
Another interpretation of what you said (plus a little reading between the lines): Might you be searching by any combination of columns? Perhaps non-= things like year >= 2016? Or make IN ('Toyota', 'Nissan')?
Study http://mysql.rjweb.org/doc.php/index_cookbook_mysql
An argument for 1 table
If you need to do
WHERE brand = 'Toyota'
AND year = 2017
Then INDEX(brand, year) (in either order) is possible and beneficial.
But... If those two columns are in different tables (as with your 2-table example), then you cannot have such an index, and performance will suffer.
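The point that the composite index is only possible in the one-table layout can be sketched with Python's sqlite3 as a stand-in for MySQL (the table and index names below are invented for illustration):

```python
# Sketch: with brand and year in ONE table, a single composite index can
# serve a WHERE on both columns. sqlite3 stands in for MySQL here; the
# table and index names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cars (id INTEGER PRIMARY KEY, brand TEXT, year INT, owner_name TEXT)"
)
con.execute("CREATE INDEX idx_brand_year ON cars (brand, year)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM cars WHERE brand = 'Toyota' AND year = 2017"
).fetchall()
# The plan's detail column reports an index search, not a table scan.
print(plan[0][3])
```

Move brand into its own table and no single index can cover both predicates; the engine must join first and filter afterwards.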

MYSQL slow duration or fetch time depending on "distinct" command

I have a pretty small, simple MYSQL table for holding precalculated financial data. The table looks like:
refDate | instrument | rate | startDate | maturityDate | carry1 | carry2 | carry3
with 3 indices defined as:
unique unique_ID(refDate,instrument)
refDate (refDate)
instrument (instrument)
The table currently has about 10 million rows, though for each refDate there are only about 5000 distinct instruments.
I have a query that self joins on this table to generate an output like:
refDate|rate instrument=X | rate instrument = Y| rate instrument=Z|....
basically returning time series data which I can then do my own analytics in.
Here is the problem: my original query looked like:
Select distinct AUDSpot1yFq.refDate,AUDSpot1yFq.rate as 'AUDSpot1yFq',
AUD1y1yFq.rate as AUD1y1yFq
from audratedb AUDSpot1yFq inner join audratedb AUD1y1yFq on
AUDSpot1yFq.refDate=AUD1y1yFq.refDate
where AUDSpot1yFq.instrument = 'AUDSpot1yFq' and
AUD1y1yFq.instrument = 'AUD1y1yFq'
order by AUDSpot1yFq.refDate
Note, in this particular query for timing below, I was actually getting 10 different instruments, which means the query was much longer but followed this same pattern of naming, inner joins, and where statements.
This was slow: in Workbench I timed it at a 7-8 second duration (but near-zero fetch time, as I have Workbench on the machine running the server). When I stripped the DISTINCT, the duration dropped to 0.25-0.5 seconds (far more manageable), and when I stripped the ORDER BY it got even faster (<0.1 seconds, at which point I don't care). But my fetch time exploded to ~7 seconds. So in total I gained nothing; it has all become a fetch-time issue. When I run this query from the Python scripts that will be doing the heavy lifting, I get roughly the same timing whether I include DISTINCT or not.
When I run an EXPLAIN on my cut-down query (which has the horrid fetch time) I get:
1 SIMPLE AUDSpot1yFq ref unique_ID,refDate,instrument instrument 39 const 1432 100.00 Using where
1 SIMPLE AUD1y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD2y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD3y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD4y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD5y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD6y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD7y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD8y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD9y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
I now realize DISTINCT is not required, and ORDER BY is something I can throw out and sort in pandas once the output is in a DataFrame. That is great, but I don't know how to get the fetch time down. I'm not going to win any competency competitions on this website, but I have searched as much as I can and can't find a solution for this issue. Any help is greatly appreciated.
~cocoa
(I had to simplify the table aliases in order to read it:)
Select distinct
s.refDate,
s.rate as AUDSpot1yFq,
y.rate as AUD1y1yFq
from audratedb AS s
join audratedb AS y on s.refDate = y.refDate
where s.instrument = 'AUDSpot1yFq'
and y.instrument = 'AUD1y1yFq'
order by s.refDate
Index needed:
INDEX(instrument, refDate) -- To filter and sort, or
INDEX(instrument, refDate, rate) -- to also "cover" the query.
That assumes the query is not more complex than you said. I see that the EXPLAIN already has many more tables. Please provide SHOW CREATE TABLE audratedb and the entire SELECT.
Back to your questions...
DISTINCT is done one of two ways: (1) sort the table, then dedup, or (2) dedup in a hash in memory. Keep in mind that you are dedupping all 3 columns (refDate, s.rate, y.rate).
ORDER BY is a sort after gathering all the data. However, with the suggested index (not the indexes you had), the sort is not needed, since the index will get the rows in the desired order.
But... Having both DISTINCT and ORDER BY may confuse the Optimizer to the point where it does something 'dumb'.
You say that (refDate,instrument) is UNIQUE, but you do not mention a PRIMARY KEY, nor have you mentioned which Engine you are using. If you are using InnoDB, then PRIMARY KEY(instrument, refDate), in that order, would further speed things up, and avoid the need for any new index.
Furthermore, it is redundant to have an index on (a,b) and also one on (a). With your current schema, that makes INDEX(refDate) unnecessary; after changing the PK as suggested, it is INDEX(instrument) that becomes unnecessary instead.
Bottom line: Only
PRIMARY KEY(instrument, refDate),
INDEX(refDate)
and no other indexes (unless you can show some query that needs it).
More on the EXPLAIN. Notice how the Rows column says 1432, 1, 1, ... That means that it scanned an estimated 1432 rows of the first table. This is probably far more than necessary because of lack of a proper index. Then it needed to look at only 1 row in each of the other tables. (Can't get better than that.)
How many rows in the SELECT without the DISTINCT or ORDER BY? That tells you how much work was needed after doing the fetching and JOINing. I suspect it is only a few. A "few" is really cheap for DISTINCT and ORDER BY; hence I think you were barking up the wrong tree. Even 1432 rows would be very fast to process.
As for the buffer_pool... How big is the table? Do SHOW TABLE STATUS. I suspect the table is more than 1GB, hence it cannot fit in the buffer_pool. Hence raising that cache size would let the query run in RAM, not hitting the disk (at least after it gets cached). Keep in mind that running a query on a cold cache will have lots of I/O. As the cache warms up, queries will run faster. But if the cache is too small, you will continue to need I/O. I/O is the slowest part of the processing.
I hope you have at least 6GB of RAM; otherwise, 2G could be dangerously large. Swapping is really bad for performance.
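A sketch of how to check this (standard MySQL/InnoDB statements; the 2GB figure is just the size discussed above, not a recommendation):

```sql
-- How big is the table, and how big is the cache?
SHOW TABLE STATUS LIKE 'audratedb';            -- sum Data_length + Index_length
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Example: give InnoDB 2GB -- only if the machine has RAM to spare.
-- (Dynamically resizable in MySQL 5.7+; older versions need a
-- config-file change and a server restart.)
SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;
```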
The question doesn't mention existing indexes, or show the output from an EXPLAIN for any of the queries.
The quick answer to improve performance is to add an index:
... ON audratedb (instrument,refdate,rate)
To answer why we'd want to add that index, we'd need to understand how MySQL processes SQL statements, what operations are possible, and which are required. To see how MySQL is actually processing your statement, you need to use EXPLAIN to see the query plan.
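As a sanity check of the suggested (instrument, refDate, rate) index, here is a sketch with Python's sqlite3 standing in for MySQL. The schema is cut down to the three columns the query touches, and the rate values are invented:

```python
# Sketch of the covering index suggested above, with sqlite3 standing in
# for MySQL. Schema simplified from the question; the data is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE audratedb (refDate TEXT, instrument TEXT, rate REAL)")
# (instrument, refDate, rate): filters on instrument, delivers rows in
# refDate order, and "covers" the query (no table lookup needed).
con.execute("CREATE INDEX idx_instr ON audratedb (instrument, refDate, rate)")
con.executemany(
    "INSERT INTO audratedb VALUES (?, ?, ?)",
    [
        ("2017-01-02", "AUDSpot1yFq", 1.90),
        ("2017-01-02", "AUD1y1yFq", 2.10),
        ("2017-01-03", "AUDSpot1yFq", 1.95),
        ("2017-01-03", "AUD1y1yFq", 2.15),
    ],
)

rows = con.execute(
    "SELECT s.refDate, s.rate, y.rate "
    "FROM audratedb AS s JOIN audratedb AS y ON s.refDate = y.refDate "
    "WHERE s.instrument = 'AUDSpot1yFq' AND y.instrument = 'AUD1y1yFq' "
    "ORDER BY s.refDate"
).fetchall()
print(rows)
```

Because the index already delivers rows in refDate order per instrument, neither the ORDER BY nor the DISTINCT forces an extra sort pass.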

SQL where clause performance

I have to create a table that assigns an user id and a product id to some data (models 2 one to many relationships). I will make a lot of queries like
select * from table where userid = x;
The first thing that I am interested is how big should the table get before the query starts to be observable (let's say it takes more than 1 second).
Also, how this can be optimised?
I know that this might depend on the implementation. I will use mysql for this specific project, but I am interested in more general answers as well.
It all depends on the horsepower of your machine. To make that query more efficient, create an index on userid.
how big should the table get before the query starts to be observable (let's say it takes more than 1 second)
There are too many factors to deterministically measure run time. CPU speed, memory, I/O speed, etc. are just some of the external factors.
how this can be optimized?
That's more straightforward. If there is an index on userid then the query will likely do an index seek, which is about as fast as you can get as far as finding the record. If userid is a clustered index then it will be faster still, because the engine won't have to follow the position from the index into the data pages - the data is physically organized as part of the index.
let's say it takes more than 1 second
With an index on userid, MySQL will manage to find the correct row in (worst case) O(log n) operations. How many seconds that takes depends on the performance of your machine.
It is impossible to give you an exact number, without considering how long one operation takes.
As an example: assume a database with 4 records; finding one requires 2 operations in the worst case. Each time you double your data, one more operation is required.
for example:
# records | # operations to find entry in worst case
2 1
4 2
8 3
16 4
...
4096 12
...
~1 B 30
~2 B 31
So, with a huge amount of records - time almost remains constant. For 1 Billion records, you would need to perform ~ 30 operations.
And like that it continues: 2 Billion records, 31 operations.
so, let's say your query executes in 0.001 second for 4096 entries (12 ops)
it would take around (0.001 / 12 * 30) 0.0025 seconds for 1 billion records.
An important side note: this considers only the runtime complexity of the binary search, but it shows how the cost would scale.
In a nutshell: your database would be unimpressed by a single query on an indexed value. However, if you run a heavy number of those queries at the same time, the response time of course increases.
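The lookup-count table above is just log2 of the record count, rounded up; a few lines of Python reproduce it:

```python
# Worst-case binary-search lookups grow with log2 of the record count,
# reproducing the operations table above.
import math

def worst_case_ops(records: int) -> int:
    """Worst-case comparisons to find one entry among `records` rows."""
    return math.ceil(math.log2(records))

for n in (4, 4096, 1_000_000_000):
    print(n, "->", worst_case_ops(n))  # 2, 12 and 30 operations
```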

mysql optimize data content: multi column or simple column hash data

I actually have a table with 30 columns. In one day this table can gain around 3,000 new records!
The column data looks like:
IMG Name Phone etc..
http://www.site.com/images/image.jpg John Smith 123456789 etc..
http://www.site.com/images/image.jpg Smith John 987654321 etc..
I'm looking for a way to optimize both the size of the table and the response time of the SQL queries. I was thinking of doing something like:
Column1
http://www.site.com/images/image.jpg|John Smith|123456789|etc..
And then via PHP I would split each row back into an array.
Would it be faster?
Edit
So to take an example of the structure, let's say I have two tables:
package
package_content
Here is the structure of the table package :
id | user_id | package_name | date
Here is the structure of the table package_content :
id | package_id | content_name | content_description | content_price | content_color | etc.. > 30columns
The thing is, for each package I can get up to 16 rows of content. For example:
id | user_id | package_name | date
260 11 Package 260 2013-7-30 10:05:00
id | package_id | content_name | content_description | content_price | content_color | etc.. > 30columns
1 260 Content 1 Content 1 desc 58 white etc..
2 260 Content 2 Content 2 desc 75 black etc..
3 260 Content 3 Content 3 desc 32 blue etc..
etc...
Then with PHP I do something like this:
select * from package
while not EOF {
show package name, date etc..
select * from package_content where package_content.package_id = package.id
while not EOF{
show package_content name, desc, price, color etc...
}
}
Would it be faster? Definitely not. If you needed to search by Name or Phone or etc... you'd have to pull those values out of Column1 every time. You'd never be able to optimize those queries, ever.
If you want to make the table smaller it's best to look at splitting some columns off into another table. If you'd like to pursue that option, post the entire structure. But note that the number of columns doesn't affect speed that much. I mean it can, but it's way down on the list of things that will slow you down.
Finally, 3,000 rows per day is about 1 million rows per year. If the database is tolerably well designed, MySQL can handle this easily.
Addendum: partial table structures plus sample query and pseudocode added to question.
The pseudocode shows the package table being queried all at once, then matching package_content rows being queried one at a time. This is a very slow way to go about things; better to use a JOIN:
SELECT
package.id,
user_id,
package_name,
date,
package_content.*
FROM package
INNER JOIN package_content on package.id = package_content.package_id
WHERE whatever
ORDER BY whatever
That will speed things up right away.
If you're displaying on a web page, be sure to limit results with a WHERE clause - nobody will want to see 1,000 or 3,000 or 1,000,000 packages on a single web page :)
Finally, as I mentioned before, the number of columns isn't a huge worry for query optimization, but...
Having a really wide result row means more data has to go across the wire from MySQL to PHP, and
It isn't likely you'll be able to display 30+ columns of information on a web page without it looking terrible, especially if you're reading lots of rows.
With that in mind, you'll be better off picking specific package_content columns in your query instead of selecting them all with SELECT *.
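Put together, a narrowed version of the JOIN might look like this (column names are taken from the question's table structures; the WHERE and LIMIT values are placeholders):

```sql
-- Pick only the columns the page displays, instead of SELECT *.
SELECT
    package.id,
    package.package_name,
    package.date,
    package_content.content_name,
    package_content.content_price
FROM package
INNER JOIN package_content
        ON package.id = package_content.package_id
WHERE package.user_id = 11        -- placeholder filter
ORDER BY package.date DESC
LIMIT 50;                         -- keep web pages small
```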
Don't combine any columns; it is of no use and might even be slower in the end.
You should use indexes on the columns you query on. I have a website with a table of about 30 columns and currently around 600,000 rows. If you put EXPLAIN before a query, you can see whether it uses any indexes. If you have a JOIN on two columns plus a WHERE on the same table, make a combined index over the three columns, ordered JOIN columns first, then the WHERE column. If you join the same table again, treat that as a separate index.
For example:
SELECT p.name, p.id, c.name, c2.name
FROM product p
JOIN category c ON p.cat_id=c.id
JOIN category c2 ON c.parent_id=c2.id AND name='Niels'
WHERE p.filterX='blaat'
You should have an combined index at category
parent_id,name
AND
id (probably the AI)
A index on product
cat_id
filterX
With this easy solution you can optimize queries from NOT DOABLE to 0.10 seconds, or even faster.
If you use MySQL 5.6 you should switch to InnoDB, because that version is better at optimizing JOINs and subqueries. MySQL will also try to keep the working set in memory, which makes things a lot faster as well. Please keep in mind that backing up InnoDB tables may need some extra attention.
You might also think about making MEMORY tables for super fast querieing (you do still need indexes).
You can also optimize by using appropriately sized integers (an INT is always 4 bytes; the number in INT(11) is only a display width) and by not reflexively using VARCHAR(255).

How to query random records in a MYSQL table while factoring in a priority system

My table will look something like this
ID | Priority
---------------
#1 | 25
#2 | 50
#3 | 125
#4 | 300
#5 | 500
For every 1000 queries I would like to (on average) retrieve ID #1 25 times, #2 50 times, #3 125 times, etc.
My table will have 1000s and eventually 100,000+ records, would it be possible to scale this?
This query will be run very often, so it needs to run very fast on a large table as well.
I'm definitely willing to reconsider table structure if theres a more efficient method - any advice?
I think you're going to struggle to find a query that is going to scale particularly well on very large data sets.
There are effectively two paths that you can go down:
Using a weighting table like you have, then multiplying this weighting by a random number for each row.
Having the count of the records with each ID in your table reflect your weighting, e.g. #2 is twice as likely as #1, so #1 has one record and #2 has two records. If #3 is four times as likely as #2, then it would have eight records, and so on. This method has a major, major drawback: if #4 is half as likely as #1, the only solution is to double the number of records every other type has and then insert one record for #4. Very, very messy to keep track of.
With that in mind, here's a solution using approach 1:
SELECT ID
FROM tablename
ORDER BY (RAND() * Priority) DESC
LIMIT 1;
(I'm not 100% sure of the syntax, as I'm a SQL Server / Oracle head, as opposed to MySQL, but I think this is right.)
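Two hedged notes on that query. First, ORDER BY (RAND() * Priority) leans toward high priorities but does not reproduce the stated proportions exactly (for weights 1 and 2 it picks the heavier row about 75% of the time, not 2/3); ORDER BY POW(RAND(), 1/Priority) DESC is the variant that is exactly proportional to the weights. Second, the proportions are easy to verify in the application layer; a sketch in Python using the priorities from the question:

```python
# Sketch: exact weighted selection in the application layer, using the
# priorities from the question. random.choices samples IDs with
# probability proportional to their weights.
import random
from collections import Counter

ids = ["#1", "#2", "#3", "#4", "#5"]
priorities = [25, 50, 125, 300, 500]   # sums to 1000

random.seed(42)                        # deterministic demo run
draws = random.choices(ids, weights=priorities, k=10_000)
counts = Counter(draws)
for i, p in zip(ids, priorities):
    # each ID should appear roughly (priority / 1000) * 10_000 times
    print(i, counts[i], "expected ~", p * 10)
```

For 100,000+ rows, either SQL form still scans and sorts the whole table per pick; a precomputed cumulative-priority column plus one indexed lookup of a random number in [0, total) scales far better.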