I have a table that will have millions of entries, and a column that has BIGINT(20) values that are unique to each row. They are not the primary key, but during certain operations, there are thousands of SELECTs using this column in the WHERE clause.
Q: Would adding an index to this column help when the amount of entries grows to the millions? I know it would for a text value, but I'm unfamiliar with what an index would do for INT or BIGINT.
A sample SELECT that would happen thousands of times is similar to this:
`SELECT * FROM table1 WHERE my_big_number=19287319283784`
If you have a very large table, then searching against values that aren't indexed can be extremely slow. In MySQL terms this kind of query ends up being a "table scan" which is a way of saying it must test against each row in the table sequentially. This is obviously not the best way to do it.
Adding an index will help with read speeds, but the price you pay is slightly slower write speeds. There's always a trade-off when making an optimization, but in your case the reduction in read time would be immense while the increase in write time would be marginal.
Keep in mind that adding an index to a large table can take a considerable amount of time, so do test this against a copy of your production data before applying it to your production system. The table will likely be locked for the duration of the ALTER TABLE statement.
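For reference, a minimal sketch of adding such an index, using the table and column names from the sample query above (the index name is arbitrary; since the values are unique to each row, a UNIQUE index would also work and would enforce that uniqueness):
ALTER TABLE table1 ADD INDEX idx_my_big_number (my_big_number);
-- or, to also enforce uniqueness:
-- ALTER TABLE table1 ADD UNIQUE INDEX idx_my_big_number (my_big_number);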
As always, use EXPLAIN on your queries to determine their execution strategy. In your case it'd be something like:
EXPLAIN SELECT * FROM table1 WHERE my_big_number=19287319283784
It will improve your look up (SELECT) performance (based on your example queries), but it will also make your inserts/updates slower. Your DB size will also increase. You need to look at how often you make these SELECT calls vs. INSERT calls. If you make a lot of SELECT calls, then this should improve your overall performance.
I have a 22 million row table on an Amazon EC2 small instance, so it is not the fastest server environment by a long shot. I have this create:
CREATE TABLE huge
(
myid int not null AUTO_INCREMENT PRIMARY KEY,
version int not null,
mykey char(40) not null,
myvalue char(40) not null,
productid int not null
);
CREATE INDEX prod_ver_index ON huge(productid,version);
This query finishes instantly:
select * from huge where productid=3333 and version=1988210878;
As for inserts, I can do 100/sec in PHP, but if I cram 1000 inserts into an array and use implode() to build a single statement against this same table, I get 3400 inserts per second. Naturally your data is not coming in that way; I'm just saying the server is relatively snappy. But as tadman suggests (he meant to say EXPLAIN, not examine), put it in front of a typical statement to see whether the key column shows an index that would be used were you to run it.
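For the curious, the implode() trick simply builds one multi-row INSERT statement, roughly like the following sketch (the values are made up; only the column names come from the table above):
INSERT INTO huge (version, mykey, myvalue, productid) VALUES
(1, 'key0001', 'value0001', 3333),
(1, 'key0002', 'value0002', 3333),
(1, 'key0003', 'value0003', 3334);
-- ... and so on, up to ~1000 rows per statement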
General Comments
For slow query debugging, place the word EXPLAIN in front of the word SELECT (no matter how complicated the select/join may be), and run it. Though the query will not be run in the normal fashion of resolving the result set, the db engine will produce (almost immediately) the execution plan it would attempt. That plan may be abandoned when the real query is run (the one without EXPLAIN in front of it), but it is a major clue-in to schema shortcomings.
The output of EXPLAIN appears cryptic to those reading one for the first time. Not for long, though. After reading a few articles about it, such as Using EXPLAIN to Write Better MySQL Queries, one will usually be able to determine which sections of the query are using which indexes, which are using none and doing slow table scans, and where the slow WHERE clauses, derived tables, and temp tables are.
Using the output of EXPLAIN sized up against your schema, you can gain insight into strategies for index creation (such as composite and covering indexes) to gain substantial query performance.
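As a small, hedged illustration using the huge table above: if a frequent query only needs myvalue, a covering index lets MySQL answer it from the index alone, and EXPLAIN would show "Using index" in the Extra column:
CREATE INDEX prod_ver_val_index ON huge (productid, version, myvalue);
SELECT myvalue FROM huge WHERE productid = 3333 AND version = 1988210878;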
Sharing
Sharing this EXPLAIN output and schema output with others (such as in Stack Overflow questions) hastens better answers concerning performance. Schema output is rendered with statements such as SHOW CREATE TABLE myTableName. Thank you for sharing.
Related
I have a Postgres table with several columns; one column is the datetime at which the row was last updated. My query is to get all the updated rows between a start and end time. It is my understanding that this query should use WHERE conditions rather than BETWEEN. The basic query is as follows:
SELECT * FROM contact_tbl contact
WHERE contact."UpdateTime" >= '20150610' and contact."UpdateTime" < '20150618'
I am new at creating SQL queries, and I believe this query is doing a full table scan; I would like to optimize it if possible. I have placed a normal index on the UpdateTime column, which takes a long time to create, but with this index the query is faster. One thing I am not sure about is whether I have to keep recalculating this index if the table gets bigger or columns get changed. Also, I am considering a clustered index on the UpdateTime column, but I wanted to ask first whether there is a canonical way of optimizing this, and whether I am on the right track.
Placing an index on UpdateTime is correct. It will allow the index to be used instead of full table scans.
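A minimal sketch of that index in PostgreSQL, using the table and the (case-sensitive, quoted) column name from your query; the index name is arbitrary:
CREATE INDEX contact_updatetime_idx ON contact_tbl ("UpdateTime");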
Two WHERE conditions like the above and the BETWEEN keyword amount to the same thing:
http://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#operator_between
BETWEEN is just "syntactic sugar" for those who like that syntax better.
Indexes allow for faster reads, but slow down writes (because like you mention, the new data has to be inserted into the index as well). The entire index does not need to be recalculated. Indexes are smart data structures, so the extra data can be added without a lot of extra work, but it does take some.
You're probably doing many more reads than writes, so using an index is a good idea.
If you're doing lots of writes and few reads, then you'd want to think about it a bit more. It then comes down to business requirements: even if overall throughput is slowed, read latency may not be a requirement while write latency is, in which case you wouldn't want the index.
For instance, think of this lottery example: every time someone buys a ticket, you have to record their name and the ticket number. However, the only time you ever have to read that data is after the one and only drawing, to see who had that ticket number. In this database, you wouldn't want to index the ticket number, since there will be so many writes and very few reads.
I have a huge number of tables for each country. I want multiple comment-related fields for each so that users can make comments on my website. I might have a few more fields like: the date the comment was created, the user_id of the commenter. I might also need to add other fields in the future, for example company_support_comment/support_rating, company_professionalism_comment.
Let's say I have 1 million companies in one table and 100 comments per company. Then I get lots of comments just for one country; it will easily exceed 2 billion.
An unsigned BIGINT can support 18,446,744,073,709,551,615, so we can have that many comments in one table. An unsigned INT will give us 4.2+ billion, which won't be enough in one table.
However, imagine querying a table with 4 billion records. How long would that take? I might not be able to retrieve the comments efficiently, and it would put a huge load on the database. Given that, in practice one table probably can't be done.
Multiple tables might also be bad, unless we just use JSON data...
Actually, I'm not sure now. I need a proper solution for my database design. I am using MySQL now.
Your question goes in the wrong direction, in my view.
Start with your database design. That means go with bigints to start with if you are concerned about it (because converting from int to bigint is a pain if you get that wrong). Build a good, normalized schema. Then figure out how to make it fast.
In your case, PostgreSQL may be a better option than MySQL because your query is going to likely be against secondary indexes. These are more expensive on MySQL with InnoDB than PostgreSQL, because with MySQL, you have to traverse the primary key index to retrieve the row. This means, effectively, traversing two btree indexes to get the rows you are looking for. Probably not the end of the world, but if performance is your primary concern that may be a cost you don't want to pay. While MySQL covering indexes are a little more useful in some cases, I don't think they help you here since you are interested, really, in text fields which you probably are not directly indexing.
In PostgreSQL, you have a btree index which then gives you a series of page/tuple tuples, which then allow you to look up the data effectively with random access. This would be a win with such a large table, and my experience is that PostgreSQL can perform very well on large tables (tables spanning, say, 2-3TB in size with their indexes).
However, assuming you stick with MySQL, careful attention to indexing will likely get you where you need to go. Remember you are only pulling up 100 comments for a company and traversing an index has O(log n) complexity so it isn't really that bad. The biggest issue is traversing the pkey index for each of the rows retrieved but even that should be manageable.
4 billion records in one table is not a big deal for a NoSQL database. Even for a traditional database like MySQL, if you build your secondary indexes correctly, searching them will be quick (traversing a B-tree-like data structure takes O(log n) disk visits).
And for faster access, you need a front-end cache system, like Redis or memcached, to serve your hot data.
Given your current situation, where you are not sure which fields will be needed, the only choice is a NoSQL solution, since fields (columns) can be added in the future as they become needed.
(From a MySQL perspective...)
One table for companies; INT UNSIGNED will do. One table for comments; BIGINT UNSIGNED may be necessary. You won't fetch hundreds of comments for display at once, will you? Unless you take care of the data layout, 100 comments could easily be 100 random disk hits, which (on a cheap disk) would take 1 second.
You must have indexes (this mostly rules out NoSQL); otherwise searching for records would be painfully slow.
CREATE TABLE Comments (
comment_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
company_id INT UNSIGNED NOT NULL,
ts TIMESTAMP,
...
PRIMARY KEY(company_id, comment_id, ts), -- to get clustering and ordering
INDEX(comment_id) -- to keep AUTO_INCREMENT happy
...
) ENGINE=InnoDB;
If you paginate the display of the comments, use the tips in "remember where you left off". That will make fetching comments about as efficient as possible.
As for Log(n) -- With about 100 items per node, a billion rows will have only 5 levels of BTree. This is small enough to essentially ignore when worrying about timing. Comments will be a terabyte or more? And your RAM will be significantly less than that? Then, you will generally have non-leaf nodes cached, but leaf nodes (where the data is) not cached. There might be several comment rows per leaf node consecutively stored. Hence, less than 100 disk hits to get 100 comments for display.
(Note: When the data is much bigger than RAM, 'performance' degenerates into 'counting the disk hits'.)
Well, you mentioned comments. What about other queries?
As for "company_support_comment/support_rating..." -- The simplest would be to add a new table(s) when you need to add those 'columns'. The basic Company data is relatively bulky and static; ratings are relative small but frequently changing. (Again, I am 'counting the disk hits'.)
Here is my situation. I have a MySQL MyISAM table containing about 4 million records with a total of 13.3 GB of data. The table contains messages received from an external system. Two of the columns in the table keep track of a timestamp and a boolean indicating whether the message has been handled or not.
When using this query:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message
The result shows up almost instantly.
However, I need to find the earliest timestamp of unhandled messages, like this:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message WHERE handled = 0
The results of this query show up after about 3 minutes, which is way too slow for the script I'm writing.
Both columns are individually indexed, not together. However, adding an index to the table would take an incredibly long time considering the amount of data that is in there already.
Does my problem originate from the fact that both columns are separately indexed, and if so, does anyone have a solution to my issue other than adding another index?
It is commonly recommended that if the selectivity of an index is over 20%, then a full table scan is preferable to an index access. This means it is likely that, given its selectivity, your index on handled won't actually be used and you'll get a full table scan instead.
A composite index on (handled, timestampCB) may actually improve the performance. Even if the selectivity isn't great, MySQL would most likely still use it given it's a composite index, and even if it didn't, you could force its use.
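A sketch of that composite index and the query, using the table and column names from the question (the index name is arbitrary; FORCE INDEX is only needed if the optimizer ignores the new index):
ALTER TABLE webshop_cb_onx_message ADD INDEX idx_handled_ts (handled, timestampCB);
SELECT MIN(timestampCB)
FROM webshop_cb_onx_message FORCE INDEX (idx_handled_ts)
WHERE handled = 0;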
We have a MySQL table with about 3.5 million IP entries.
The structure:
CREATE TABLE IF NOT EXISTS `geoip_blocks` (
`uid` int(11) NOT NULL auto_increment,
`pid` int(11) NOT NULL,
`startipnum` int(12) unsigned NOT NULL,
`endipnum` int(12) unsigned NOT NULL,
`locid` int(11) NOT NULL,
PRIMARY KEY (`uid`),
KEY `startipnum` (`startipnum`),
KEY `endipnum` (`endipnum`)
) TYPE=MyISAM AUTO_INCREMENT=3538967 ;
The problem: A query takes more than 3 seconds.
SELECT uid FROM `geoip_blocks` WHERE 1406658569 BETWEEN geoip_blocks.startipnum AND geoip_blocks.endipnum LIMIT 1
- about 3 seconds
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
- no gain, about 3 seconds
How can this be improved?
The solution to this is to grab a BTREE/ISAM library and use that (like BerkeleyDB). Using ISAM this is a trivial task.
Using ISAM, you would set your start key to the number and do a "find next" (to find the block GREATER than or equal to your number); if it wasn't equal, you'd "find previous" and check that block. 3-4 disk hits, shazam, done in a blink.
Well, it's A solution.
The problem that is happening here is that SQL, without a "sufficiently smart optimizer", does horrible on this kind of query.
For example, your query:
SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1
It's going to "look at" ALL of the rows that are "less than" 1406658569. ALL of them. Then it's going to scan those looking for ALL of the rows that match the second criterion.
With a 3.5m row table, assuming "average" (i.e. it hits the middle), welcome to a 1.75m row scan. Even worse, an index scan. Ideally MySQL will "give up" and "just" table scan, as that's faster.
Clearly, this is not what you want.
Andomar's solution basically forces you to "block" the data space via the "network" criterion, effectively breaking your table into 255 pieces. So, instead of scanning 1.75m rows, you get to scan 6800 rows, a marked improvement, at the cost of breaking your blocks up on the network boundary.
There is nothing wrong with range queries in SQL.
SELECT * FROM table WHERE id between X and Y
is typically a fast query, as the optimizer can readily delimit the range of rows using the index.
But that's not your query, because you are not ranging over your main ID (startipnum) in this case.
If you "know" that your block sizes are within a certain range (i.e. none of your blocks, EVER, have more than, say, 1000 ips), then you can block your query by adding "WHERE startipnum between {ipnum - 1000} and {ipnum + 1000}". That's not really different than the network blocking that was proposed, but here you don't have to maintain it as much. Of course, you can learn this with:
SELECT max(endipnum - startipnum) FROM table
to get an idea what your largest range is.
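Assuming that query reported a maximum block width of no more than 1000 addresses, the bounded query would look roughly like this:
SELECT uid
FROM geoip_blocks
WHERE startipnum BETWEEN 1406658569 - 1000 AND 1406658569 + 1000
  AND startipnum <= 1406658569
  AND endipnum   >= 1406658569
LIMIT 1;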
Another option, which I've seen but never used, and which is actually, well, perfect for this, is to look at MySQL's Spatial Extensions, since that's what this really is.
This is designed more for GIS applications, but you ARE searching for something in ranges, and that's a lot of what GIS apps do. So, that may well be a solution for you as well.
Your startipnum and endipnum should be in a combined index. MySQL can't utilize multiple indexes on the same table in one query.
I'm not sure about the syntax, but something like
KEY (startipnum, endipnum)
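For what it's worth, the syntax would be roughly the following (the index name is arbitrary):
ALTER TABLE geoip_blocks ADD INDEX idx_start_end (startipnum, endipnum);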
It looks like you're trying to find the range that an IP address belongs to. The problem is that MySQL can't make the best use of an index for the BETWEEN operation. Indexes work better with an = operation.
One way you can add an = operation to your query is by adding the network part of the address to the table. For your example:
numeric 1406658569
ip address 83.215.232.9
class A with 8 bit network part
network part = 83
With an index on (networkpart, startipnum, endipnum, uid) a query like this will become very fast:
SELECT uid
FROM `geoip_blocks`
WHERE networkpart = 83
AND 1406658569 BETWEEN startipnum AND endipnum
In case one geoip block spans multiple network classes, you'd have to split it in one row per network class.
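A hedged sketch of how that could be bolted onto the existing table; the column and index names are illustrative, and the first octet of an IPv4 address is simply the stored integer shifted right by 24 bits:
ALTER TABLE geoip_blocks ADD COLUMN networkpart TINYINT UNSIGNED NOT NULL DEFAULT 0;
UPDATE geoip_blocks SET networkpart = startipnum >> 24;  -- 1406658569 >> 24 = 83
ALTER TABLE geoip_blocks ADD INDEX idx_net_range (networkpart, startipnum, endipnum, uid);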
Based on information from your question I am assuming that what you are doing is an implementation of the GeoIP® product from MaxMind®. I downloaded the free version of the GeoIP® data, loaded it into a MySQL database and did a couple of quick experiments.
With an index on startipnum the query execution time ranged from 0.15 to 0.25 seconds. Creating a composite index on startipnum and endipnum did not change the query performance. This leads me to believe that your performance issues are due to insufficient hardware, improper MySQL tuning, or both. The server I ran my tests on had 8G of RAM which is considerably more than would be needed to get this same performance as the index file was only 28M.
My recommendation is one of the two following options.
Spend some time tuning your MySQL server. The MySQL online documentation would be a reasonable starting point. http://dev.mysql.com/doc/refman/5.0/en/optimizing-the-server.html An internet search will turn up a large volume of books, forums, articles, etc. if the MySQL documentation is not sufficient.
If my assumption is correct that you are using the GeoIP® product, then a second option would be to use the binary file format provided by MaxMind®. The custom file format has been optimized for speed, memory usage, and database size. APIs to access the data are provided for a number of languages. http://www.maxmind.com/app/api
As an aside, the two queries you presented are not equivalent. The between operator is inclusive. The second query would need to use <= >= operators to be equivalent to the query which used the between operator.
Maybe you would like to have a look at partitioning the table. This feature has been available since MySQL 5.1; since you do not specify which version you are using, this might not work for you if you are stuck with an older version.
As the possible value range for IP addresses is known - at least for IPv4 - you could break down the table into multiple partitions of equal size (or maybe even not equal if your data is not evenly distributed). With that MySQL could very easily skip large parts of the table, speeding up the scan if it is still required.
See MySQL manual on partitioning for the available options and syntax.
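For illustration only, a range partitioning of the existing table might look something like the sketch below. The boundary values are arbitrary, and note that MySQL requires the partitioning column to appear in every unique key, so the primary key would first have to be extended to include startipnum:
ALTER TABLE geoip_blocks DROP PRIMARY KEY, ADD PRIMARY KEY (uid, startipnum);
ALTER TABLE geoip_blocks PARTITION BY RANGE (startipnum) (
  PARTITION p0 VALUES LESS THAN (1073741824),  -- up to  63.255.255.255
  PARTITION p1 VALUES LESS THAN (2147483648),  -- up to 127.255.255.255
  PARTITION p2 VALUES LESS THAN (3221225472),  -- up to 191.255.255.255
  PARTITION p3 VALUES LESS THAN MAXVALUE
);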
Thanks for all your comments, I really appreciate it.
For now we ended up using a caching mechanism, and we have reduced those expensive queries.
I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the MBeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so that these queries can be performed in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from ver. 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
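As a hedged sketch of the summary-table idea; the table and column names below are invented for the example, since the actual JMX schema isn't shown:
-- hypothetical raw table: jmx_stats(server_id, metric, recorded_at, value)
CREATE TABLE jmx_stats_hourly (
  server_id  INT          NOT NULL,
  metric     VARCHAR(64)  NOT NULL,
  hour_start DATETIME     NOT NULL,
  avg_value  DOUBLE       NOT NULL,
  max_value  DOUBLE       NOT NULL,
  PRIMARY KEY (server_id, metric, hour_start)
);
-- run periodically (e.g. once per hour) to roll up the previous hour
INSERT INTO jmx_stats_hourly
SELECT server_id,
       metric,
       DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00'),
       AVG(value),
       MAX(value)
FROM jmx_stats
WHERE recorded_at >= NOW() - INTERVAL 1 HOUR
GROUP BY server_id, metric, DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00');
The reporting graphs then read from jmx_stats_hourly instead of the 388-million-row raw table.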
3 suggestions:
index
index
index
P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (# secs since 1970 or whatever).
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features, but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables by using partitioning.
If that also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data: for instance, pre-compute totals by hour, or by user, or by week, whatever, you get the idea, and store that in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, then good for you!
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions, and locking is dumb, but the tables are much smaller than with InnoDB, which means they can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there anyway to optimize the table or query so you can perform these queries
in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash aggregates, or anything else useful, really; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL    Postgres
58 s     100 s      INSERT
75 s     51 s       CREATE INDEX ON (category, id)   (useless)
9.3 s    5 s        SELECT category, sum(counter) FROM t GROUP BY category;
1.7 s    0.5 s      SELECT category, sum(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama'  -- wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets worse and worse as your data grows. The problem is that MySQL has to perform the RAND() operation (which takes processing power) on every single row in the table before sorting them and giving you just one row.
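One common alternative, sketched here under the assumption of a reasonably dense AUTO_INCREMENT id column on the user table from the earlier example, is to pick a random id first and then fetch one row with an indexed lookup:
SELECT u.*
FROM user AS u
JOIN (SELECT FLOOR(RAND() * MAX(id)) AS rand_id FROM user) AS r
  ON u.id >= r.rand_id
ORDER BY u.id
LIMIT 1;
-- note: slightly biased toward ids that follow gaps, but it avoids sorting the whole table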
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
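For example (the column here is hypothetical; the value list must be short and known in advance for ENUM to make sense):
ALTER TABLE user ADD COLUMN membership_type ENUM('free', 'trial', 'paid') NOT NULL DEFAULT 'free';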
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
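For example, reusing the IP address from the geoip question above (the visitors table and its ip column are hypothetical; only the functions themselves come from MySQL):
-- hypothetical table: visitors(ip INT UNSIGNED)
INSERT INTO visitors (ip) VALUES (INET_ATON('83.215.232.9'));  -- stores 1406658569
SELECT INET_NTOA(ip) FROM visitors;                            -- returns '83.215.232.9'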