MySQL: does it matter which column I COUNT() by?

I've been wondering about this for quite some time. Is it better to do this where the primary key ticket_id is counted:
SELECT COUNT(ticket_id)
FROM tickets
WHERE ticket_country_id = 238
Or to do this:
SELECT COUNT(ticket_country_id)
FROM tickets
WHERE ticket_country_id = 238
In this case ticket_country_id is an indexed foreign key, but we could also assume it's just a non-indexed column (perhaps the answer would be different for non-indexed columns)
In other words, does it matter that I am calling on another column for the COUNT()?
Obviously the performance saving would probably be small, but I like to do things the best way.

Yes, it can matter. SELECT COUNT(*) allows the DB to use whatever resources make sense and are most efficient: it can do a table scan, or use the primary key or another index to answer your question.
COUNT(something-else) means count the non-NULL values. Again, the DB can use several methods, such as indexes, if such things are available, but you are then asking a different question.
As is often the case with SQL, it's better to ask the question you actually want answered than to play silly games trying to game the system for a few milliseconds here and there.
That also helps your future colleagues, by clearly stating the thing you are trying to do.
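If it helps to see the difference, here is a minimal sketch with a made-up table (names are illustrative, not from the question):

CREATE TABLE demo (id INT PRIMARY KEY, nickname VARCHAR(20) NULL);
INSERT INTO demo VALUES (1, 'alice'), (2, NULL), (3, 'carol');

SELECT COUNT(*) FROM demo;         -- 3: counts rows
SELECT COUNT(nickname) FROM demo;  -- 2: counts only non-NULL values
SELECT COUNT(id) FROM demo;        -- 3: a primary key is never NULL

In the original query both forms return the same number, because WHERE ticket_country_id = 238 already excludes NULL values and the primary key cannot be NULL; the remaining difference is only which index the optimizer chooses to scan.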

Related

Faster counts with mysql by sampling table

I'm looking for a way I can get a count for records meeting a condition but my problem is the table is billions of records long and a basic count(*) is not possible as it times out.
I thought that maybe it would be possible to sample the table by doing something like selecting 1/4th of the records. I believe that older records will be more likely to match so I'd need a method which accounts for this (perhaps random sorting).
Is it possible or reasonable to query a certain percent of rows in mysql? And is this the smartest way to go about solving this problem?
The query I currently have which doesn't work is pretty simple:
SELECT count(*) FROM table_name WHERE deleted_at IS NOT NULL
SHOW TABLE STATUS will 'instantly' give an approximate Row count. (There is an equivalent SELECT ... FROM information_schema.tables.) However, this may be significantly far off.
A count(*) on an index on any column in the PRIMARY KEY will be faster because it will be smaller. But this still may not be fast enough.
There is no way to "sample". Or at least no way that is reliably better than SHOW TABLE STATUS. EXPLAIN SELECT ... with some simple query will do an estimate; again, not necessarily any better.
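For reference, a sketch of those approximate options (the database and table names are placeholders):

SHOW TABLE STATUS LIKE 'table_name';  -- the Rows column is an estimate; can be far off for InnoDB

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db' AND TABLE_NAME = 'table_name';

EXPLAIN SELECT COUNT(*) FROM table_name WHERE deleted_at IS NOT NULL;  -- the rows column is only an optimizer estimate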
Please describe what kind of data you have; there may be some other tricks we can use.
See also Random . There may be a technique that will help you "sample". Be aware that all techniques are subject to various factors of how the data was generated and whether there has been "churn" on the table.
Can you periodically run the full COUNT(*) and save it somewhere? And then maintain the count after that?
I assume you don't have this case. (Else the solution is trivial.)
AUTO_INCREMENT id
Never DELETEd or REPLACEd or INSERT IGNOREd or ROLLBACKd any rows
Add an index on the deleted_at column to improve execution time, and count id where it is set.
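A minimal sketch of that suggestion (assuming deleted_at is a nullable column; building the index on a table with billions of rows is itself a long operation):

ALTER TABLE table_name ADD INDEX idx_deleted_at (deleted_at);

SELECT COUNT(*) FROM table_name WHERE deleted_at IS NOT NULL;  -- can now be resolved by scanning only the index

Even with the index, counting billions of matching rows will still take time, so combining this with a periodically saved count may be necessary.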

MySQL self join performance: fact or just bad indexing?

As an example: I have a database to detect visitors (bots, etc.), and since not every visitor has the same set of 'credentials' I made a 'dynamic' table, like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type I query the table with a different name clause (name='something') (e.g.: hostname, ipAddr, userAgent, HumanId, etc.).
I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad self-join performance, but the answers tend to point to missing indexes as the cause.
So the final question is: does self-joining a table make it more prone to bad performance, assuming that everything is indexed properly?
On a side note, here is more information about the table; it might be irrelevant to the question but gives context for my particular situation:
The flag column is used to mark records for deletion, as the user I connect with from PHP doesn't have DELETE permission on this database. Sorry, security is more important than performance.
I added the 'type' that will go with info I get from the user agent (i.e. if anything is, or at least seems to be, a bot, we will only search for type 5000).
Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
I tried to use as much INT and filtering (WHERE) in the SELECT query as possible to reduce any eventual loss of performance (if that even matters).
I'm willing to study and tweak the thing if needed, unless someone with a strong MySQL background tells me it's really not a good thing to do.
This is a big project I have in development, so I cannot test it with millions of records for now, but I wonder whether performance will be an issue as it grows. Any input, links, references, documentation or test procedures (maybe in comments) will be appreciated.
A self-join is no different from joining two different tables. The optimizer will pick one 'table', usually based on the WHERE clause, then do a Nested Loop Join into the other. In your case you have implied, via LEFT, that it should work only one way. (The optimizer will ignore that if it sees no need for it.)
Your keys are fine for that fiddle.
The real problem is "Entity-Attribute-Value" (EAV), which is a messy way to lay out data in tables. Your query seems to be saying "find a (LIMIT 1) profile (entity) that has a certain pair of attributes (name = 'Googlebot' AND addr = ...)".
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then putting the rest into a single column as a JSON string. See here.
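A rough sketch of that layout, with illustrative names rather than the ones from the fiddle (the JSON column assumes MySQL 5.7+; a TEXT column holding a JSON string works on older versions, and the address value below is just an example):

CREATE TABLE profile (
  profile_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,       -- e.g. user-agent name such as 'Googlebot'
  addr VARCHAR(45) NOT NULL,        -- e.g. IP address
  extra JSON NULL,                  -- rarely-used attributes kept together as JSON
  PRIMARY KEY (profile_id),
  INDEX idx_name_addr (name, addr)  -- composite index serving lookups on both columns
) ENGINE=InnoDB;

SELECT profile_id FROM profile
WHERE name = 'Googlebot' AND addr = '203.0.113.5'
LIMIT 1;  -- a single indexed probe instead of a self-join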

MySQL: Table optimization, word 'guest' or memberid in the column

This question is for MySQL (it allows many NULLs in a column which is UNIQUE, so the solution to my question could be slightly different).
There are two tables: members and Table2.
Table members has:
memberid char(20), it's a primary key. (Please do not recommend using int(11) instead of char(20) for memberid; I can't change it, it contains exactly 20 symbols.)
Table2 has:
CREATE TABLE IF NOT EXISTS `Table2` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `memberid` varchar(20) NOT NULL,
  `Time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `status` tinyint(4) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Table2.memberid is either the word 'guest' (which can be repeated many times) or a value from members.memberid (which can also be repeated many times). Any value in the Table2.memberid column (if not 'guest') exists in the members.memberid column. Again, members.memberid is unique; Table2.memberid, even excluding the word 'guest', is not.
So, Table2.memberid column looks like:
'guest'
'lkjhasd3lkjhlkjg8sd9'
'kjhgbkhgboi7sauyg674'
'guest'
'guest'
'guest'
'lkjhasd3lkjhlkjg8sd9'
Table2 has INSERTs and UPDATEs only. It updates only status. The criterion for updating status is: set status=0 WHERE memberid='' and status=1. So a row could be updated once or not at all. As a result, the number of UPDATEs is less than or equal to (by statistics, about half) the number of INSERTs.
The question is only about optimization.
The question can be split into:
1) Do you HIGHLY recommend replacing the word 'guest' with NULL, or with a special 'xxxxxyyyyyzzzzz00000' (20 symbols, like a 'very special and reserved' string), so that char(20) can be used for Table2.memberid, because all values would then be char(20)?
2) What about using a foreign key? I can't use it because of the value 'guest'. That value can NOT be in members.memberid column.
In other words, I need some help to decide:
whether I can use 'guest' (I like that word) -vs- choosing a 20-char reserved string so I can use char(20) instead of varchar(20) -vs- keeping NULLs instead of 'guest',
all values except 'guest' are actually foreign keys. Is there any possible way to use this information to increase performance?
That table is used pretty often so I have to build Table2 as good as I can. Any idea is highly appreciated.
Thank you.
Added:
Well... I think I have found a good solution that allows me to treat memberid as a foreign key.
1) Do you HIGHLY recommend to replace the word 'guest' to NULL or to a
special 'xxxxxyyyyyzzzzz00000' (20 symbols like a 'very special and
reserved' string) so you can use chars(20) for Table2.memberid,
because all values are char(20)?
Mixing values from different domains always causes trouble. The best thing to do is fix the underlying structural problem. Bad design can be really expensive to work around, and it can be really expensive to fix.
Here's the issue in a nutshell. The simplest data integrity constraint for this kind of issue is a foreign key constraint. You can't use one, because "guest" isn't a memberid. (Member ids are from one domain; "guest" isn't part of that domain; you're mixing values from two domains.) Using NULL to identify a guest doesn't help much; you can't distinguish guests from members whose memberid is missing. (Using NULL to identify anything is usually a bad idea.)
If you can use a special 20-character member id to identify all guests, it might be wise to do so. You might be lucky, in that "guest" is five letters. If you can use "guestguestguestguest" for the guests without totally screwing up your application logic, I'd really consider that first. (But you said that seems to treat guests as logged-in users, which I think makes things break.)
Retrofitting a "users" supertype is possible, I think, and this might prove to be the best overall solution. The supertype would let you treat members and guests as the same sometimes (because they're not utterly different), and as different at other times (because they're not entirely the same). A supertype also allows both individuals (members) and aggregate users (guests all lumped together) without undue strain. And it would unify the two domains, so you could use foreign key constraints for members. But it would require changing the program logic.
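A minimal sketch of such a supertype, with illustrative table names not taken from the question:

CREATE TABLE users (
  userid CHAR(20) NOT NULL,
  PRIMARY KEY (userid)
) ENGINE=InnoDB;

CREATE TABLE members (
  memberid CHAR(20) NOT NULL,  -- every member is also a user
  PRIMARY KEY (memberid),
  FOREIGN KEY (memberid) REFERENCES users (userid)
) ENGINE=InnoDB;

-- One reserved row in users (e.g. 'guestguestguestguest') represents all guests,
-- so Table2.memberid can reference users and cover members and guests alike:
ALTER TABLE Table2
  MODIFY memberid CHAR(20) NOT NULL,
  ADD FOREIGN KEY (memberid) REFERENCES users (userid);

The change is mostly in the application logic: creating a member has to insert into users as well as members.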
In Table2 (and do find a better name than that, please), an index on memberid or a composite index on memberid and status will perform just about as well as you can expect. I'm not sure whether a composite index will help; "status" only has two values, so it's not very selective.
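For example (the index name is arbitrary, and whether including status actually helps is worth measuring on real data):

ALTER TABLE Table2 ADD INDEX idx_memberid_status (memberid, status);

-- serves the UPDATE criterion described in the question (the memberid value is one of the sample ids above)
UPDATE Table2 SET status = 0 WHERE memberid = 'lkjhasd3lkjhlkjg8sd9' AND status = 1;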
all values, except 'guest' are actually foreign keys. Is there any
possible way to use this information for increasing the performance?
No, they're not foreign keys. (See above.) True foreign keys would help with data integrity, but not with SELECT performance.
"Increasing the performance" is pretty much meaningless. Performance is a balancing act. If you want to increase performance, you need to specify which part you want to improve. If you want faster inserts, drop indexes and integrity constraints. (Don't do that.) If you want faster SELECT statements, build more indexes. (But more indexes slows the INSERTS.)
You can speed up all database performance by moving to hardware that speeds up all database performance. (ahem) Faster processor, faster disks, faster disk subsystem, more memory (usually). Moving critical tables or indexes to a solid-state disk might blow your socks off.
Tuning your server can help. But keep an eye on overall performance. Don't get so caught up in speeding up one query that you degrade performance in all the others. Ideally, write a test suite and decide what speed is good enough before you start testing. For example, say you have one query that takes 30 seconds. What's an acceptable improvement? 20 seconds? 15 seconds? 2 milliseconds sounds good, but is an unlikely target for a query that takes 30 seconds. (Although I've seen that kind of performance increase by moving to better table and index structures.)

Should I create an Index on my table?

I have a table with 7 columns, with a primary key on the first column and another index (a foreign key).
My app does:
SELECT `comment_vote`.`ip`, `comment_vote`.`comment_id`, COUNT(*) AS `nb` FROM `comment_vote`
SELECT `comment_vote`.`type` FROM `comment_vote` WHERE (comment_id = 123) AND (ip = "127.0.0.1")
Is it worth adding an index on the ip column? It is often used in my select queries.
By the way, is there anything I can do to speed up those queries? Sometimes they take a long time and lock the table, preventing other queries from running.
If you are searching by ip quite often then yes, you can create an index. However, your inserts/updates might take a bit longer because of it. Not sure how your data is structured, but if the data is collected by ip then you might consider partitioning the table by ip.
A good rule of thumb: If a column appears in the WHERE clause, there should be an index for it. If a query is slow, there's a good chance an index could help, particularly one that contains all fields in the WHERE clause.
In MySQL, you can use the EXPLAIN keyword to see an approximate query plan for your query, including indexes used. This should help you find out where your queries spend their time.
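For instance, a rough sketch against the second query (the exact output columns vary a little by MySQL version):

EXPLAIN SELECT type FROM comment_vote WHERE comment_id = 123 AND ip = '127.0.0.1';
-- key shows which index was chosen (NULL means none),
-- type = ALL means a full table scan,
-- rows is the estimated number of rows examined.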
Yes, do create an index on ip if you're using it in other queries.
The second one uses the columns comment_id and ip, so I'd create an index on the combination. An index on ip alone won't help that query.
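A sketch of that combined index (the index name is arbitrary):

ALTER TABLE comment_vote ADD INDEX idx_comment_ip (comment_id, ip);

With that in place, the second query can locate its rows through the index; adding type as a third column would make it a covering index for that exact query.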
YES! Almost always add an INDEX, or two or three (multi-column indexes?), to every column.
If it is not in a WHERE clause today, you can bet it will be tomorrow.
Most data is WORM (written once, read many times), so making reads as effective as possible is where you will get the most value. And, as many have pointed out, the argument about having to maintain the index during a write is just plain silly.

MySQL: a huge table. Can't query, even a simple select!

I have a table with about 200,000 records.
It takes a long time to do a simple select query. I am confused, because I am running on a 4-core CPU with 4GB of RAM.
How should I write my query?
Or is there anything to do with INDEXING?
Important note: my table is static (its data won't change).
What are your solutions?
PS
1 - my table has a primary key id
2 - my table has a unique key serial
3 - I want to query over the other fields, like where param_12 not like '%I.S%'
or where param_13 = '1'
4 - 200,000 is not big, and this is exactly why I am surprised.
5 - I even have a problem when adding a simple field: my question
6 - can I create an INDEX for BOOL fields? (or is it useful?)
PS: and thanks for the answers
7 - my select should return the rows that contain the specified 'I.S' (or that do not).
select * from `table` where `param_12` like '%I.S%'
this is all I want. It seems no index helps here, hm?
Indexing will help. Please post the table definition and the select query.
Add an index for all "=" columns in the WHERE clause.
Yes, you'll want/need to index this table, and partitioning would be helpful as well. Doing this properly is something you will need to provide more information for. You'll want to use EXPLAIN and look over your queries to determine which columns to index and how.
Another aspect to consider is whether or not your table is normalized. Normalized tables tend to give better performance due to lowered I/O.
I realize this is vague, but without more information that's about as specific as we can be.
BTW: a table of 200,000 rows is relatively small.
Here is another SO question you may find useful
1 - my table has a primary key id: Not really useful unless you use some scheme which requires a numeric primary key.
2 - my table has a unique key serial: The id is also unique by definition; why not use serial as the primary key? This one is automatically indexed because you defined it as unique.
3 - I want to query over the other fields like where param_12 not like '%I.S%' or where param_13 = '1': A LIKE '%something%' query cannot really use an index; is there some way you could split param_12 so that the part you search for sits at the start of its own column (e.g. a param_12b that could be matched with 'I.S%')? An index can be used with LIKE when the start of the string is known.
4 - 200,000 is not big and this is exactly why I am surprised: Yep, 200,000 is not that much. But without good indexes, good queries and/or a sufficient cache size, MySQL will need to read all the data from disk for comparison, which is slow.
5 - I even have a problem when adding a simple field: my question
6 - can I create an INDEX for BOOL fields? Yes you can, but an index which matches half of the rows is fairly useless. An index is used to limit, as much as possible, the number of records MySQL has to load fully; if an index does not dramatically limit that number, as is often the case with booleans (in a 50-50 distribution), using the index only requires more disk I/O and can slow searching down. So unless you expect something like an 80-20 distribution or better, creating the index will cost time rather than save it.
An index on param_13 might be used, but not one on param_12 in this example, since the leading % in the LIKE negates the use of the index.
If you're querying data with LIKE '%asdasdasd%' then no index can help you; it will have to do a full scan every time. The problem here is the leading %, because it means the substring you are looking for can be anywhere in the field, so every row has to be checked.
Possibly you might look into full-text indexing, but depending on your needs that might not be appropriate.
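As a rough sketch of the full-text route (FULLTEXT needs MyISAM or, for InnoDB, MySQL 5.6+; note that full-text search works on whole words, and short or punctuated terms like 'I.S' depend on the tokenizer settings, so test it against your data):

ALTER TABLE `table` ADD FULLTEXT INDEX ft_param_12 (param_12);

SELECT * FROM `table`
WHERE MATCH(param_12) AGAINST('I.S' IN BOOLEAN MODE);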
Firstly, ensure your table has a primary key.
To answer in any more detail than that you'll need to provide more information about the structure of the table and the types of queries you are running.
I don't believe that the keys you have will help. You have to index on the columns used in WHERE clauses.
I'd also wonder if the LIKE requires table scans regardless of indexes. The minute you use a function like that you lose the value of the index, because you have to check each and every row.
You're right: 200K isn't a huge table. EXPLAIN will help here. If you see a full table scan (type = ALL), redesign.