Attempting to speed up MySQL queries on large tables - mysql

I am hitting some performance issues on a MySQL server.
I am trying to query a large table (~500k rows) for a subset of data:
SELECT * FROM `my_table` WHERE `subset_id` = id_value;
This query takes ~80ms to execute, but I need to run it for ~20k different "id_value"s, which brings the total execution time to almost 1h.
I was hoping that adding an index on subset_id would help, but it doesn't change anything (understanding how indexes work, that makes sense).
What I am trying to figure out is whether there is any way to "index" the table so that this query takes something more reasonable than 80ms.
Or, in other words, is ~80ms "normal" for querying a 500k-row table?
Note: In the larger picture, I am using parallel queries and multiple connections to speed up the process, and I have tried various optimizations such as changing innodb_buffer_pool_size. I am also considering querying the DB once for all 500k rows instead of 20k smaller queries, but since my code is designed in a multiprocessed/coroutine/scalable way, I was trying to avoid that and to focus on optimizing the query/MySQL server at the lowest level.
Thanks!

Use a single query with IN rather than a zillion queries:
SELECT *
FROM `my_table`
WHERE `subset_id` IN (id1, id2, . . .);
If your ids are already in a table -- or you can put them in one -- then use the table instead. You can still use IN:
SELECT *
FROM `my_table`
WHERE `subset_id` IN (SELECT id FROM idtable);
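If the ids come from application code, another option along the same lines is to load them into a temporary table once and join against it. A minimal sketch, assuming subset_id is an integer (the temp table name here is made up):

CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (2), (3); -- ... load all 20k ids, in batches

SELECT t.*
FROM `my_table` t
JOIN wanted_ids w ON t.subset_id = w.id;

Either way, the point is one round trip over the full id set instead of 20k separate queries.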

Related

MySQL temporary indexes for user-defined queries

I am building an analytics platform where users can create reports and such against a MySQL database. Some of the tables in this database are pretty huge (billions of rows), so for all of the features so far I have indexes built to speed up each query.
However, the next feature is to add the ability for a user to define their own query so that they can analyze data in ways that we haven't pre-defined. They have full read permission to the relevant database, so basically any SELECT query is a valid query for them to enter. This creates problems, however, if a query is defined that filters or joins on a column we haven't currently indexed - sometimes to the point of taking over a minute for a simple query to execute - something as basic as:
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM
tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE
tbl1.d > 0
GROUP BY
tbl1.a, tbl1.b, tbl3.c, tbl1.d
Now, assume that we've only created indexes on columns not appearing in this query so far. Also, we don't want too many indexes slowing down inserts, updates, and deletes (otherwise the simple solution would be to build an index on every column accessible by the users).
My question is, what is the best way to handle this? Currently, I'm thinking that we should scan the query, build indexes on anything appearing in a WHERE or JOIN that isn't already indexed, execute the query, and then drop the indexes that were built afterwards. However, the main things I'm unsure about are a) is there already some best practice for this sort of use case that I don't know about? and b) would the overhead of building these indexes be enough that it would negate any performance gains the indexes provide?
If this strategy doesn't work, the next option I can see working is to collect statistics on what types of queries the users run, and have some regular job periodically check what commonly used columns are missing indexes and create them.
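To illustrate the first idea, the flow I'm considering would be roughly this (index and column names are just placeholders):

-- before executing the user-defined query:
CREATE INDEX idx_tmp_tbl1_d ON tbl1 (d);
-- ... execute the user-defined query ...
-- afterwards:
DROP INDEX idx_tmp_tbl1_d ON tbl1;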
If using MyISAM, then performing an ALTER statement on large tables (billions of rows) in order to add an index will take a considerable amount of time, probably far longer than the one minute you've cited for the statement above (and you'll need another ALTER to drop the index afterwards). During that time, the table will be locked, meaning other users can't execute their own queries.
If your tables use the InnoDB engine and you're running MySQL 5.1+, then CREATE / DROP index statements shouldn't lock the table, but it still may take some time to execute.
There's a good rundown of the history of ALTER TABLE [here][1].
I'd also suggest that automated query analysis to identify and build indexes would be quite difficult to get right. For example, what about cases such as selecting by foo.a but ordering by foo.b? This kind of query often needs a covering index over multiple columns; otherwise you may find your server attempts a filesort on a huge result set, which can cause big problems.
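For that foo.a / foo.b case, the fix is usually a composite index, sketched here with assumed names:

ALTER TABLE foo ADD INDEX idx_foo_a_b (a, b);
-- an equality filter on a plus ORDER BY b can now be satisfied
-- directly from the index, avoiding the filesort
SELECT a, b FROM foo WHERE a = 42 ORDER BY b;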
Giving your users an "explain query" option would be a good first step. If they know enough SQL to write custom queries, then they should be able to analyse EXPLAIN output in order to tune their query (or at least realise that a given query will take ages).
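Exposing something as simple as this would already tell a user whether their query will hit an index or scan billions of rows (query shown is the one from the question, trimmed):

EXPLAIN
SELECT tbl1.a, tbl2.b
FROM tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
WHERE tbl1.d > 0;
-- type = ALL with a large `rows` estimate means a full table scan,
-- i.e. a missing index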
So, going further with my idea, I propose you segment your data into well-identified views. You used abstract names so I can't reuse your business model, but I'll use a made-up example.
Say you have 3 tables:
customer (gender, social category, date of birth, ...)
invoice (date, amount, ...)
product (price, date of creation, ...)
You would create a sort of materialized view for each specific segment. It's like adding a business layer on top of the raw data representation layer.
For example, we could identify the following segments:
seniors having at least 2 invoices
invoices of 2013 with more than 1 product
How to do that? And how to do that efficiently? Regular views won't help your problem because they will have poor explain plans on random queries. What we need is a real physical representation of these segments. We could do something like this:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT ... /* select from the existing tables */
;
/* add indexes: */
ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD CONSTRAINT...
... etc.
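Filled in for the example tables above (all column names, join keys, and the seniority cutoff are assumptions), that could look like:

CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT c.id, c.gender, c.social_category, c.date_of_birth
FROM customer c
JOIN invoice i ON i.customer_id = c.id
WHERE c.date_of_birth <= DATE_SUB(CURDATE(), INTERVAL 65 YEAR)
GROUP BY c.id, c.gender, c.social_category, c.date_of_birth
HAVING COUNT(i.id) >= 2;

ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD PRIMARY KEY (id);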
So now, your guys just have to query MV_SENIORS_WITH_2_INVOICES instead of the original tables. Since there are fewer records, and probably more indexes, performance will be better.
We're done! Oh wait, no :-)
We need to refresh this data, a bit like a FAST REFRESH in Oracle. MySQL does not have (not that I know of... someone correct me?) a similar mechanism, so we have to create triggers for that.
CREATE TRIGGER ... AFTER INSERT ON `invoice`
... /* insert the data into MV_SENIORS_WITH_2_INVOICES if it matches the segment */
END;
Now we're done!
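For illustration, a concrete (if simplified) version of such a trigger, using the assumed schema from above; a complete solution would also need UPDATE and DELETE triggers:

DELIMITER $$
CREATE TRIGGER trg_invoice_ai AFTER INSERT ON invoice
FOR EACH ROW
BEGIN
  -- if the customer now has at least 2 invoices and is a senior,
  -- add them to the segment (INSERT IGNORE relies on the primary key)
  IF (SELECT COUNT(*) FROM invoice WHERE customer_id = NEW.customer_id) >= 2 THEN
    INSERT IGNORE INTO MV_SENIORS_WITH_2_INVOICES (id, gender, social_category, date_of_birth)
    SELECT id, gender, social_category, date_of_birth
    FROM customer
    WHERE id = NEW.customer_id
      AND date_of_birth <= DATE_SUB(CURDATE(), INTERVAL 65 YEAR);
  END IF;
END$$
DELIMITER ;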

Retrieve min and max values from different tables with same structure

I have some log tables with the same structure. Each table is related to a site and contains billions of entries. The reason for this split is to keep queries quick and efficient, because 99.99% of queries relate to a single site.
But now I would like to retrieve the min and max values of a column across these tables.
I can't manage to write the SQL query. Should I use UNION?
I am just looking for the concept, not the final SQL.
You could use a UNION, yes. Something like this should do:
SELECT MAX(PartialMax) AS TotalMax
FROM
(
    SELECT MAX(YourColumn) AS PartialMax FROM FirstTable
    UNION ALL
    SELECT MAX(YourColumn) AS PartialMax FROM SecondTable
) AS X;
If you have an index over the column you want to find a MAX inside, you should have very good performance as the query should seek to the end of the index on that column to find the maximum value very rapidly. Without an index on that column, the query has to scan the whole table to find the maximum value since nothing inherently orders it.
Added some details to address a concern about "enormous queries".
I'm not sure what you mean by "enormous". You could create a VIEW that does the UNIONs for you; then, you use the view and it will make the query very small:
SELECT MAX(YourColumn) FROM YourView;
but that just optimizes for the size of your query's text. Why do you believe it is important to optimize for that? The VIEW can be helpful for maintenance -- if you add or remove a partition, just fix the view appropriately. But a long query text shouldn't really be a problem.
Or by "enormous", are you worried about the amount of I/O the query will do? Nothing can help that much, aside from making sure each table has an index on YourColumn so that maximum value on each partition can be found very quickly.

IN Criteria performance for update operation in MySQL

I have read that creating a temporary table is best if the number of parameters passed in the IN criteria is large. This is for select queries. Does this hold true for update queries as well? I have an update query which uses 3 table joins (inner joins) and passes 1000 parameters in the IN criteria, and this query runs in a loop 200 or more times. What is the best approach to execute this query?
IN operations are usually slow. Passing 1000 parameters to any query sounds awful. If you can avoid that, do it. Now, I'd really have a go with the temp table. You can even play with the indexing of the table. I mean, instead of just putting values in it, play with the indexes that would help you optimize your searches.
On the other hand, inserting into a table with indexes is slower than inserting into one without, so run an empirical test there. Now, here is something I think is a must: bear in mind that when using the other table you don't need the IN clause, because you can use EXISTS, which usually gives better performance. E.g.:
select * from yourTable yt
where exists (
    select * from yourTempTable ytt
    where yt.id = ytt.id
);
I don't know your query, nor data, but that would give you an idea about how to do it. Note the inner select * is as fast as select aSingleField, as the database engine optimizes it.
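To make the UPDATE case concrete, the temp-table approach might look like this (the table and column names and the SET clause are assumptions, since I don't know your schema):

CREATE TEMPORARY TABLE tmp_params (id INT PRIMARY KEY);
INSERT INTO tmp_params (id) VALUES (1), (2), (3); -- ... your 1000 parameters

UPDATE yourTable yt
JOIN tmp_params t ON t.id = yt.id
SET yt.some_flag = 1;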
Those are all my thoughts. But remember, to be 100% sure of what is best for your problem, there is nothing like performing both tests and timing them :) Hope this helps.

MySQL Performance

We have a data warehouse with denormalized tables ranging from 500K to 6+ million rows. I am developing a reporting solution, so we are utilizing database paging for performance reasons. Our reports have search criteria and we have created the necessary indexes; however, performance is poor when dealing with the million(s)-row tables. The client is set on always knowing the total record count, so I have to fetch the data as well as the record count.
Are there any other things I can do to help with performance? I'm not the MySQL dba and he has not really offered anything up, so I'm not sure what he can do configuration wise.
Thanks!
You should use "Partitioning"
Its main goal is to reduce the amount of data read for particular SQL operations so that the overall response time is reduced.
Refer:
http://dev.mysql.com/tech-resources/articles/performance-partitioning.html
If you partition the large tables and store the parts on different servers, then your queries will run faster.
see: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Also note that with NDB tables you can use HASH keys, which are looked up in O(1) time.
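As a sketch, local range partitioning on a date column would look like this (table and column names assumed; note MySQL requires the partitioning column to be part of every unique key on the table):

ALTER TABLE big_table
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);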
For the record count, you can keep a running total in a separate table and update it, for example in AFTER INSERT and AFTER DELETE triggers.
Although the triggers will slow down inserts/deletes, the cost is spread over time. Note that you don't have to keep all totals in one row; you can store totals per condition. Something like:
table     field     condition   row_count
------------------------------------------
table1    field1    cond_x      10
table1    field1    cond_y      20
select sum(row_count) as count_cond_xy
from totals
where `table` = 'table1' and field = 'field1'
  and `condition` like 'cond_%';
-- just a silly example; you can come up with more efficient code,
-- but I hope you get the gist of it.
If you find yourself always counting along the same conditions, this can speed your redesigned SELECT COUNT(x) FROM bigtable WHERE ... up from minutes to nearly instant.
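A minimal version of that counter maintenance, with assumed names (a real trigger would inspect NEW.* to decide which counter row to bump, and a matching AFTER DELETE trigger would decrement it):

CREATE TRIGGER bigtable_ai AFTER INSERT ON bigtable
FOR EACH ROW
  UPDATE totals
  SET row_count = row_count + 1
  WHERE `table` = 'table1' AND field = 'field1' AND `condition` = 'cond_x';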

MySQL SELECT efficiency

I am using PHP to interact with a MySQL database, and I was wondering if querying MySQL with a "SELECT * FROM..." is more or less efficient than a "SELECT id FROM...".
Less efficient.
If you think about it, SQL is going to send all the data for each row you select.
Imagine that you have a column called MassiveTextBlock - this column will be included when you SELECT * and so SQL will have to send all of that data over when you may not actually require it. In contrast, SELECT id is just grabbing a collection of numbers.
It is less efficient because you are fetching a lot more information than just SELECT id. Additionally, the second query is much more likely to be served using just an index.
It would depend on why you are selecting the data. If you are writing a raw database interface like phpMySQL, then it may make sense.
If you are running multiple queries against the same table with the same conditions, then a SELECT col1, col2 FROM table may make more sense than two independent queries, SELECT col1 FROM table and SELECT col2 FROM table, to fetch different portions of the same data, since it performs both operations in one query.
But, in general, you should select only the columns you need from the table, in order to minimize the unnecessary data dredged up by the DBMS. The benefits of this increase greatly if your database is on a different server from the client, or if your database server is old and/or slow.
There is almost NO situation in which a SELECT * is unavoidable; if you think you've found one, your data model probably has some serious design flaws.
It depends on your indexes.
Selecting fewer columns can sometimes save a lot of time, if you select only columns that exist in the index that MySQL has used to fetch the results.
For example, if you have an index on the column id, and you perform this query:
SELECT id FROM mytable WHERE id>5
Then MySQL only needs to read the index, and does not even need to read the table row. If, on the other hand, you select additional columns, as in:
SELECT id, name FROM mytable WHERE id>5
Then MySQL will need to read the id from the index, then go to the table row to read the name column.
If, however, you are already reading columns that aren't in the index, then selecting more of them won't make nearly as much difference, though it will still make a small one. If some columns contain a large amount of data, such as large TEXT or BLOB columns, it'd be wise to exclude them if you don't need them.
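If that id-plus-name lookup is frequent, one option is a covering composite index so it too can be served from the index alone (index name assumed):

ALTER TABLE mytable ADD INDEX idx_id_name (id, name);
-- EXPLAIN should now show "Using index" (index-only access) for:
SELECT id, name FROM mytable WHERE id > 5;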