Add index on column or not? - mysql

I have a table looking like this:
CREATE TABLE `item` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(255),
`my_number` int(10) unsigned default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
There are hundreds of thousands of items, and I usually order them by 'my_number'.
Will adding an index on 'my_number' increase performance on queries when I order by this field?
I use MySQL, the table is InnoDB.

I know the answer but it doesn't matter. What you should do in all cases of potential database optimisation is to add the index and test it. And, by that, I mean test all of your affected queries, inserts and updates to ensure you have a performance boost that outdoes any loss of performance elsewhere.
You shouldn't trust my knowledge or anyone else's if the alternative is concrete data that you can gather yourself. Measure, don't guess!
The answer is, by the way: "it should, unless your DBMS is truly brain-dead". Generally, adding indexes will increase the speed of selects but slow down inserts and updates. But I mean "generally" - it's not always the case. A good DBA continuously examines the performance of the DBMS and tunes it where necessary. Databases are not set-and-forget objects, they need to be cared for and nurtured :-)
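For instance, a minimal sketch of such a test, using the table above (the index name and the LIMIT are my own choices):
-- Add the candidate index (the name is arbitrary).
ALTER TABLE item ADD INDEX idx_my_number (my_number);

-- Compare the plan before and after. With the index in place, EXPLAIN
-- will typically show the index being used for the ORDER BY instead of
-- a full scan followed by "Using filesort".
EXPLAIN SELECT id, title, my_number
FROM item
ORDER BY my_number
LIMIT 50;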

Related

Optimizing MySQL Table Structure and impact of row size

One of my database tables has grown quite large, to the point where I think it is impacting the performance on my site (it is definitely making backups a lot slower).
It has ~13,000,000 rows and is 4.2 GiB in size, of which 1.2 GiB is data.
The structure looks like this:
CREATE TABLE IF NOT EXISTS `t1` (
`id` int(10) unsigned NOT NULL,
`int2` int(10) unsigned NOT NULL,
`int3` int(10) unsigned NOT NULL,
`int4` int(10) unsigned NOT NULL,
`char1` varchar(255) NOT NULL,
`int5` int(10) NOT NULL,
`char2` varchar(1024) DEFAULT NULL,
`char3` varchar(1024) NOT NULL,
PRIMARY KEY (`id`,`int2`,`int3`,`int4`),
KEY `key1` (`id`,`int2`,`char1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Common operations on this table are inserts and selects; rows are never updated and rarely deleted. int2 is a running version number, which means usually only the rows with the highest value of int2 for that id are selected.
I have been thinking of several ways of optimizing this and I was wondering which one would be the best one to pursue:
char1 (which is in the index) actually only contains about 40,000 different strings. I could move the strings into a second table (idchar -> char) and then just save the id in my main table, at the cost of an additional id lookup step during inserts and selects.
char2 and char3 are often empty. I could move them to a separate table that I would then do a LEFT JOIN on in selects.
Even if char2 and char3 contain data they are usually shorter than 1024 chars. I could probably shorten these to ~200.
Which one of these do you think is the most promising? Does decreasing the row size (either by making char1 into an integer or by removing/resizing columns) in MySQL InnoDB tables actually have a big impact on performance?
Thanks
There are several options. From what you say, moving char1 to another table seems quite reasonable. The additional lookup could, under some circumstances, even be faster than storing the raw data in the tables. (This occurs when the repeated values cause the table to be larger than necessary, especially when the larger table might be larger than available memory.) And, this would save space both in the data table and the corresponding index.
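For illustration, a minimal sketch of that lookup-table approach; the table and column names (and the example id) are hypothetical:
-- Hypothetical lookup table holding the ~40,000 distinct char1 strings.
CREATE TABLE char1_lookup (
char1_id int(10) unsigned NOT NULL AUTO_INCREMENT,
char1 varchar(255) NOT NULL,
PRIMARY KEY (char1_id),
UNIQUE KEY uq_char1 (char1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- t1 would then carry a char1_id column instead of char1, and selects
-- would join back to recover the string (42 is just an example id):
SELECT t.id, t.int2, l.char1
FROM t1 t
JOIN char1_lookup l ON l.char1_id = t.char1_id
WHERE t.id = 42;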
The exact impact on performance is hard to say, without understanding much more about your system and the query load.
Moving char2 and char3 to another table will have minimal impact. The overhead of the link to the other table would eat up any gains in space. You could save a couple of bytes per record by storing them as varchar(255) rather than varchar(1024).
If you have a natural partitioning key, then partitioning is definitely an option, particularly for reducing the time for backups. This is very handy for a transaction-style table, where records are inserted and never or very rarely modified. If, on the other hand, the records contain customer records and any could be modified at any time, then you would still need to back up all the partitions.
There are several factors that could affect the performance of your DB. Partitioning is definitely the best option, but it cannot always be done. If you are searching char1 before an insert, then partitioning can be a problem because you have to search all the partitions for the key. You must analyze how the data is generated and, most importantly, how you query this table. This is the key, so you should post your queries against this table. In the case of char2 and char3, moving them to another table won't make any difference. You should also mention the physical distribution of your data. Are you using a single data file? Are the data files on the same physical disk as the OS? Give more details so we can give you more help.

Partitioning a mySQL table

I am considering partitioning a mySQL table that has the potential to grow very big. The table as it stands goes like this
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii;
where
uid is a sequence of 9 character id strings starting with a lowercase letter
chcs is a checksum that is used internally.
I suspect that the best way to partition this table would be based on the first letter of the uid field. This would give
Partition 1
abcd1234,acbd1234,adbc1234...
Partition 2
bacd1234,bcad1234,bdac1234...
However, I have never done partitioning before, so I have no idea how to go about it. Is the partitioning scheme I have outlined possible? If so, how do I go about implementing it?
I would much appreciate any help with this.
Check out the manual for a start :)
http://dev.mysql.com/tech-resources/articles/partitioning.html
MySQL is pretty feature-rich when it comes to partitioning and choosing the correct strategy depends on your use case (can partitioning help your sequential scans?) and the way your data grows since you don't want any single partition to become too large to handle.
If your data will tend to grow over time somewhat steadily, you might want to use a create-date based partitioning scheme so that (for example) all records generated in a single year end up in the last partition and the previous partitions are never written to - for this to happen you may have to introduce another column to regulate this, see http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html.
Added optimization benefit of this approach would be that you can have the most recent partition on a disk with fast writes (a solid state for example) and you can keep the older partitions on a cheaper disk with decent read speed.
Anyway, knowing more about your use case would help people give you more concrete answers (possibly including sql code)
EDIT, also, check out http://www.tokutek.com/products/tokudb-for-mysql/
The main question you need to ask yourself before partitioning is "why". What is the goal you are trying to achieve by partitioning the table?
Since all the table's data will still exist on a single MySQL server and, I assume, new rows will be arriving in "random" order (with respect to which partition they'll be inserted into), you won't gain much by partitioning. Your point-select queries might be slightly faster, but not likely by much.
The main benefit I've seen using MySQL partitioning is for data that needs to be purged according to a set retention policy. Partitioning data by week or month makes it very easy to delete old data quickly.
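For illustration, a minimal sketch of that kind of retention scheme; the table, columns, and date boundaries here are hypothetical:
-- Hypothetical log-style table partitioned by month for easy purging.
-- Note the partitioning column must be part of every unique key.
CREATE TABLE events (
id bigint unsigned NOT NULL AUTO_INCREMENT,
created_at datetime NOT NULL,
payload varchar(255) NOT NULL,
PRIMARY KEY (id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
PARTITION p2014_01 VALUES LESS THAN (TO_DAYS('2014-02-01')),
PARTITION p2014_02 VALUES LESS THAN (TO_DAYS('2014-03-01')),
PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Purging a month of old data is a quick metadata operation
-- rather than a slow DELETE:
ALTER TABLE events DROP PARTITION p2014_01;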
It sounds more likely to me that you want to be sharding your data (spreading it across many servers), and since your data design as shown is really just key-value then I'd recommend looking at database solutions that include sharding as a feature.
I have upvoted both of the answers here since they both make useful points. @bbozo - a move to TokuDB is planned but there are constraints that stop it from being made right now.
I am going off the idea of partitioning the uidlist table as I had originally wanted to do. However, for the benefit of anyone who finds this thread whilst trying to do something similar, here is the "how to":
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL ,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii
PARTITION BY RANGE COLUMNS(uid)
(
PARTITION p0 VALUES LESS THAN ('f'),
PARTITION p1 VALUES LESS THAN ('k'),
PARTITION p2 VALUES LESS THAN ('p'),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
which creates four partitions.
I suspect that the long term solution here is to use a key-value store as suggested by @tmcallaghan rather than just stuffing everything into a MySQL table. I will probably post back in due course once I have established what would be the right way to accomplish that.

Would it help to add index to BIGINT column in MySQL?

I have a table that will have millions of entries, and a column that has BIGINT(20) values that are unique to each row. They are not the primary key, but during certain operations, there are thousands of SELECTs using this column in the WHERE clause.
Q: Would adding an index to this column help when the amount of entries grows to the millions? I know it would for a text value, but I'm unfamiliar with what an index would do for INT or BIGINT.
A sample SELECT that would happen thousands of times is similar to this:
SELECT * FROM table1 WHERE my_big_number=19287319283784
If you have a very large table, then searching against values that aren't indexed can be extremely slow. In MySQL terms this kind of query ends up being a "table scan" which is a way of saying it must test against each row in the table sequentially. This is obviously not the best way to do it.
Adding an index will help with read speeds, but the price you pay is slightly slower write speeds. There's always a trade-off when making an optimization, but in your case the reduction in read time would be immense while the increase in write time would be marginal.
Keep in mind that adding an index to a large table can take a considerable amount of time so do test this against production data before applying it to your production system. The table will likely be locked for the duration of the ALTER TABLE statement.
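A minimal sketch of that statement, using the column name from the example (the index name is my own):
-- Plain secondary index on the BIGINT column used in the WHERE clause.
-- Since the question says the values are unique per row, UNIQUE could
-- be added to enforce that as well.
ALTER TABLE table1 ADD INDEX idx_my_big_number (my_big_number);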
As always, use EXPLAIN on your queries to determine their execution strategy. In your case it'd be something like:
EXPLAIN SELECT * FROM table1 WHERE my_big_number=19287319283784
It will improve your look up (SELECT) performance (based on your example queries), but it will also make your inserts/updates slower. Your DB size will also increase. You need to look at how often you make these SELECT calls vs. INSERT calls. If you make a lot of SELECT calls, then this should improve your overall performance.
I have a 22 million row table on an Amazon EC2 small instance. So it is not the fastest server environment by a long shot. I have this create:
CREATE TABLE huge
(
myid int not null AUTO_INCREMENT PRIMARY KEY,
version int not null,
mykey char(40) not null,
myvalue char(40) not null,
productid int not null
);
CREATE INDEX prod_ver_index ON huge(productid,version);
This query finishes instantly:
select * from huge where productid=3333 and version=1988210878;
As for inserts, I can do 100/sec in PHP, but if I cram 1,000 rows into an array and use implode to build one multi-row INSERT against this same table, I get 3,400 inserts per second. Naturally your data is not coming in that way. I am just saying the server is relatively snappy. But as tadman suggests, put EXPLAIN in front of a typical statement to see whether the key column shows an index that would be used were you to run it.
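For reference, the implode trick simply builds one multi-row INSERT statement; a sketch of what ends up being sent (the values are made up):
-- One statement inserting many rows is far cheaper than many single-row
-- statements, because the per-statement overhead is paid only once.
INSERT INTO huge (version, mykey, myvalue, productid) VALUES
(1, 'key-0001', 'value-0001', 3333),
(1, 'key-0002', 'value-0002', 3333),
(1, 'key-0003', 'value-0003', 3333);
-- ... and so on, up to roughly 1,000 value tuples per statement.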
General Comments
For slow query debugging, place the word EXPLAIN in front of the word SELECT (no matter how complicated the select/join may be) and run it. Though the query will not actually be run and no result set resolved, the db engine will produce (almost immediately) the execution plan it would attempt. This plan may be abandoned when the real query is run (the one without EXPLAIN in front of it), but it is a major clue to schema shortcomings.
The output of EXPLAIN appears cryptic to those reading one for the first time. Not for long though. After reading a few articles about it, such as Using EXPLAIN to Write Better MySQL Queries, one can usually determine which parts of the query are using which indexes, which are using none and doing slow table scans, and where slow WHERE clauses, derived tables, and temp tables come into play.
Using the output of EXPLAIN sized up against your schema, you can gain insight into strategies for index creation (such as composite and covering indexes) to gain substantial query performance.
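For instance, a covering-index sketch against the huge table above (the index name is my own): when every column the query touches is in the index, EXPLAIN reports "Using index" and the row data itself is never read.
-- Composite index that also covers the SELECT list of the query below.
CREATE INDEX prod_ver_key_idx ON huge (productid, version, mykey);

-- EXPLAIN should show "Using index" in the Extra column:
EXPLAIN SELECT productid, version, mykey
FROM huge
WHERE productid = 3333 AND version = 1988210878;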
Sharing
Sharing this EXPLAIN output and schema output with others (such as in Stack Overflow questions) hastens better answers concerning performance. Schema output is rendered with statements such as SHOW CREATE TABLE myTableName. Thank you for sharing.

If threads are getting 3000 posts each is it maybe better to make a new table per thread?

There are 12 million posts already and people seem to be using things as a chat. I don't know whether it's more efficient to have a bunch of little tables than to have the database scan for the last 10 messages in a table with so many entries. I know I'd have to benchmark, but I'm just asking if anyone has any observations or anecdotes from similar situations.
edit add schema:
create table reply(
id int(11) unsigned not null auto_increment,
thread_id int(10) unsigned not null default 0,
ownerId int(9) unsigned not null default 0,
ownerName varchar(20),
profileId int(9) unsigned,
profileName varchar(50),
creationDate dateTime,
ip int unsigned,
pic varchar(255) default '',
reply text,
index(thread_id),
primary key(id)) ENGINE=MyISAM;
It's not a good idea to use variable table names. If you've indexed the columns that would be turned into separate tables, the database will do a better job using the index than you can do by creating separate tables. That's what the database was designed for.
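A minimal sketch of what that looks like with the reply schema above; the composite index is my suggestion (the existing index is on thread_id alone), and the thread id is made up:
-- Composite index so MySQL can jump to one thread's replies and read
-- them already ordered by id, instead of scanning millions of rows.
ALTER TABLE reply ADD INDEX idx_thread_id_id (thread_id, id);

-- Last 10 posts in a thread:
SELECT id, ownerName, creationDate, reply
FROM reply
WHERE thread_id = 12345
ORDER BY id DESC
LIMIT 10;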
I assume that "thread" here means thread in a pool of postings.
The way you are going to get long-term scalability here is to develop an architecture in which you can have multiple database instances, and avoid having queries that need to be performed across all instances.
Creating multiple tables on the same DB won't really help in terms of scalability. (In fact, it might even reduce throughput ... due to increasing the load on the DB's caches.) But it sounds like in your application you could partition into "pools" of messages in different databases, provided that you can arrange that a reply to a message goes into the same pool as the message it replies to.
The problem that arises is that certain things will involve querying across data in all DB instances. In this case, it might be listing all of a user's messages, or doing a keyword search. So you really have to look at the entire picture to figure out how best to achieve a partitioning. You need to analyze all of the queries, taking account of their relative frequencies. And at the end of the day, the solution might involve denormalizing the schema so that the database can be partitioned.
Dynamic tables are typically a very bad idea in relational schema. Key/value stores make different trade-offs, so some are better at things like dynamic tables but at the cost of things like weak data integrity/consistency guarantees. You don't appear to have defined any foreign key references and you're using MyISAM so data integrity/reliability probably isn't a priority; the important thing to understand is that different designs have different things they're good at so what's good design for one DB can be bad design for another DB.
I can't help with much else as I focus on Pg and this is a MySQL question. Untagging.
(Note that in PostgreSQL at least, many operations on the relation set are O(n), so huge numbers of relations can be quite harmful.)

MySQL : Table optimization word 'guest' or memberid in the column

This question is for MySQL (it allows many NULLs in the column which is UNIQUE, so the solution for my question could be slightly different).
There are two tables: members and Table2.
Table members has:
memberid char(20), it's the primary key. (Please do not recommend using int(11) instead of char(20) for memberid; I can't change it, it contains exactly 20 characters.)
Table2 has:
CREATE TABLE IF NOT EXISTS `Table2` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`memberid` varchar(20) NOT NULL,
`Time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`status` tinyint(4) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Table2.memberid is either the word 'guest' (which can be repeated many times) or a value from members.memberid (which can also be repeated many times). Any value in the Table2.memberid column (if not 'guest') exists in the members.memberid column. Again, members.memberid is unique. Table2.memberid, even excluding the word 'guest', is not unique.
So, Table2.memberid column looks like:
'guest'
'lkjhasd3lkjhlkjg8sd9'
'kjhgbkhgboi7sauyg674'
'guest'
'guest'
'guest'
'lkjhasd3lkjhlkjg8sd9'
Table2 has INSERTs and UPDATEs only. It updates only status. The criterion for updating status: set status=0 WHERE memberid='' and status=1. So a row could be updated once or not updated at all. As a result, the number of UPDATEs is less than or equal to (by statistics, about half) the number of INSERTs.
The question is only about optimization.
The question can be split as follows:
1) Do you HIGHLY recommend replacing the word 'guest' with NULL or with a special 'xxxxxyyyyyzzzzz00000' (20 characters, like a 'very special and reserved' string) so you can use char(20) for Table2.memberid, because all values are char(20)?
2) What about using a foreign key? I can't use it because of the value 'guest'. That value can NOT be in members.memberid column.
In other words, I need some help deciding:
whether to use 'guest' (I like that word) -vs- a reserved 20-character string so I can use char(20) instead of varchar(20) -vs- NULLs instead of 'guest';
all values except 'guest' are actually foreign keys. Is there any way to use this information to increase performance?
That table is used pretty often so I have to build Table2 as good as I can. Any idea is highly appreciated.
Thank you.
Added:
Well... I think I have found a good solution, that allows me to treat memberid as a foreign key.
1) Do you HIGHLY recommend replacing the word 'guest' with NULL or with a special 'xxxxxyyyyyzzzzz00000' (20 characters, like a 'very special and reserved' string) so you can use char(20) for Table2.memberid, because all values are char(20)?
Mixing values from different domains always causes trouble. The best thing to do is fix the underlying structural problem. Bad design can be really expensive to work around, and it can be really expensive to fix.
Here's the issue in a nutshell. The simplest data integrity constraint for this kind of issue is a foreign key constraint. You can't use one, because "guest" isn't a memberid. (Member ids are from one domain; "guest" isn't part of that domain; you're mixing values from two domains.) Using NULL to identify a guest doesn't help much; you can't distinguish guests from members whose memberid is missing. (Using NULL to identify anything is usually a bad idea.)
If you can use a special 20-character member id to identify all guests, it might be wise to do so. You might be lucky, in that "guest" is five letters. If you can use "guestguestguestguest" for the guests without totally screwing up your application logic, I'd really consider that first. (But you said that it seems to treat guests as logged-in users, which I think breaks things.)
Retrofitting a "users" supertype is possible, I think, and this might prove to the the best overall solution. The supertype would let you treat members and guests as the same sometimes (because they're not utterly different), and different at other times (because they're not entirely the same). A supertype also allows both individuals (members) and aggregate users (guests all lumped together) without undue strain. And it would unify the two domains, so you could use foreign key constraints for members. But it would require changing the program logic.
In Table2 (and do find a better name than that, please), an index on memberid or a composite index on memberid and status will perform just about as well as you can expect. I'm not sure whether a composite index will help; "status" only has two values, so it's not very selective.
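Under the current design that advice is simply (the index names are mine):
-- Either of these; the composite form only adds value if the low
-- selectivity of status helps the optimizer at all.
CREATE INDEX idx_memberid ON Table2 (memberid);
-- or
CREATE INDEX idx_memberid_status ON Table2 (memberid, status);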
all values except 'guest' are actually foreign keys. Is there any way to use this information to increase performance?
No, they're not foreign keys. (See above.) True foreign keys would help with data integrity, but not with SELECT performance.
"Increasing the performance" is pretty much meaningless. Performance is a balancing act. If you want to increase performance, you need to specify which part you want to improve. If you want faster inserts, drop indexes and integrity constraints. (Don't do that.) If you want faster SELECT statements, build more indexes. (But more indexes slows the INSERTS.)
You can speed up all database performance by moving to hardware that speeds up all database performance. (ahem) Faster processor, faster disks, faster disk subsystem, more memory (usually). Moving critical tables or indexes to a solid-state disk might blow your socks off.
Tuning your server can help. But keep an eye on overall performance. Don't get so caught up in speeding up one query that you degrade performance in all the others. Ideally, write a test suite and decide what speed is good enough before you start testing. For example, say you have one query that takes 30 seconds. What's an acceptable improvement? 20 seconds? 15 seconds? 2 milliseconds sounds good, but is an unlikely target for a query that takes 30 seconds. (Although I've seen that kind of performance increase by moving to better table and index structures.)