MySQL huge IN set for huge table

Consider this a theoretical question as much as practical.
One has a table with, say, 1,000,000+ user records and needs to pull data for, say, 50,000 of them from that table, using user_id only. How would you expect IN to behave? If not well, is it the only option, or is there anything else one could try?

You could insert your search values into a single-column temporary table and join on that. I have seen other databases do Bad Things when presented with very large IN clauses.
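If you go the temporary-table route, a minimal sketch could look like this (users and wanted_users are illustrative names, not taken from the question):
CREATE TEMPORARY TABLE wanted_users (user_id INT UNSIGNED NOT NULL PRIMARY KEY);
-- load the 50,000 ids, e.g. with multi-row INSERTs or LOAD DATA
INSERT INTO wanted_users (user_id) VALUES (1), (2), (3);
-- then join instead of passing a huge IN (...) list
SELECT u.*
FROM users AS u
JOIN wanted_users AS w ON w.user_id = u.user_id;
Because the temporary table's column is a primary key, the join can do an indexed lookup per row instead of scanning a giant literal list.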

IN with a very long value list actually has pretty poor performance, so this is something I would avoid. Most of the time you can get by using a joined query, so depending on your database structure you should definitely favor a join over an IN statement.

If IN starts to prove troublesome (as other answerers have suggested it might), you could try rewriting your query using EXISTS instead.
SELECT *
FROM MYTAB
WHERE MYKEY IN (SELECT KEYVAL
                FROM MYOTHERTAB
                WHERE some condition)
could become
SELECT *
FROM MYTAB
WHERE EXISTS (SELECT *
              FROM MYOTHERTAB
              WHERE some condition
                AND MYTAB.MYKEY = MYOTHERTAB.KEYVAL)
I have often found this speeds things up quite a bit.

Use a JOIN to select the data you need.

Related

Searching in all the tables of a mysql database

I have a MySQL (MariaDB) database with numerous tables, and all the tables have the same structure.
For the sake of simplicity, let's assume the structure is as below.
UserID - Varchar (primary)
Email - Varchar (indexed)
Is it possible to query all the tables together for the Email field?
Edit: I have not finalized the db design yet; I could put all the data in a single table. But I am afraid that a large table will slow down operations, and if it crashes, it will be painful to restore. Thoughts?
I have read some answers that suggested dumping all data together in a temporary table, but that is not an option for me.
MySQL Workbench or phpMyAdmin is not useful either; I am looking for a SQL query, not a front-end search technique.
There's no concise way in SQL to say this sort of thing.
SELECT a,b,c FROM <<<all tables>>> WHERE b LIKE 'whatever%'
If you know all your table names in advance, you can write a query like this.
SELECT a,b,c FROM table1 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table2 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table3 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table4 WHERE b LIKE 'whatever%'
...
Or you can create a view like this.
CREATE VIEW everything AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
UNION ALL
SELECT * FROM table3
UNION ALL
SELECT * FROM table4
...
Then use
SELECT a,b,c FROM everything WHERE b LIKE 'whatever%'
If you don't know the names of all the tables in advance, you can retrieve them from MySQL's information_schema and write a program to create a query like one of my suggestions. If you decide to do that and need help, please ask another question.
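As a sketch, the table names can be pulled from information_schema like this (assuming all the per-user tables live in the current schema; adjust the WHERE clause to your naming convention):
SELECT table_name
FROM information_schema.tables
WHERE table_schema = DATABASE();
Your program can then loop over those names and glue together the UNION ALL query or the CREATE VIEW statement shown above.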
These sorts of queries will, unfortunately, always be significantly slower than querying just one table. Why? MySQL must repeat the overhead of running the query on each table, and a single index is faster to use than multiple indexes on different tables.
Pro tip: Try to design your databases so you don't add tables when you add users (or customers or whatever).
Edit: You may be tempted to use multiple tables for query-performance reasons. With respect, please don't do that. Correct indexing will almost always give you better query performance than searching multiple tables. For what it's worth, a "huge" table for MySQL, one which challenges its capabilities, usually has at least a hundred million rows. Truly. Hundreds of thousands of rows are in its performance sweet spot, as long as they're indexed correctly. Here's a good reference about that, one of many: https://use-the-index-luke.com/
Another reason to avoid a design where you routinely create new tables in production: it's a pain in the neck to maintain and optimize databases with large numbers of tables. Six months from now, as your database scales up, you'll almost certainly need to add indexes to help speed up some slow queries. If you have to add many indexes, you, or your successor, won't like it.
You may also be tempted to use multiple tables to make your database more resilient to crashes. With respect, it doesn't work that way. Crashes are rare, and catastrophic unrecoverable crashes are vanishingly rare on reliable hardware. And crashes can corrupt multiple tables. (Crash resilience: decent backups).
Keep in mind that MySQL has been in development for over a quarter-century (as have the other RDBMSs). Thousands of programmer years have gone into making it fast and resilient. You may as well leverage all that work, because you can't outsmart it. I know this because I've tried and failed.
Keep your database simple. Spend your time (your only irreplaceable asset) making your application excellent so you actually get millions of users.

How to speed up SQL queries? Indexes?

I have the following database structure :
create table Accounting
(
Channel,
Account
)
create table ChannelMapper
(
AccountingChannel,
ShipmentsMarketPlace,
ShipmentsChannel
)
create table AccountMapper
(
AccountingAccount,
ShipmentsComponent
)
create table Shipments
(
MarketPlace,
Component,
ProductGroup,
ShipmentChannel,
Amount
)
I have the following query running on these tables and I'm trying to optimize the query to run as fast as possible:
select Accounting.Channel, Accounting.Account, Shipments.MarketPlace
from Accounting
join ChannelMapper on Accounting.Channel = ChannelMapper.AccountingChannel
join AccountMapper on Accounting.Account = AccountMapper.AccountingAccount
join Shipments on
(
    ChannelMapper.ShipmentsMarketPlace = Shipments.MarketPlace
    and ChannelMapper.AccountingChannel = Shipments.ShipmentChannel
    and AccountMapper.ShipmentsComponent = Shipments.Component
)
join (select Component, sum(Amount) from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
How do I make this query run as fast as possible ? Should I use indexes ? If so, which columns of which tables should I index ?
Here is a picture of my query plan:
Thanks,
Indexes are essential to any database.
Speaking in layman's terms, indexes are... well, precisely that. You can think of an index as a second, hidden table that stores two things: the sorted data and a pointer to its position in the table.
Some rules of thumb on creating indexes:
Create indexes on every field that is (or will be) used in joins.
Create indexes on every field on which you want to perform frequent where conditions.
Avoid creating indexes on everything. Create index on the relevant fields of every table, and use relations to retrieve the desired data.
Avoid creating indexes on double fields, unless it is absolutely necessary.
Avoid creating indexes on varchar fields, unless it is absolutely necessary.
I recommend you read this: http://dev.mysql.com/doc/refman/5.5/en/using-explain.html
Your JOINS should be the first place to look. The two most obvious candidates for indexes are AccountMapper.AccountingAccount and ChannelMapper.AccountingChannel.
You should consider indexing Shipments.MarketPlace, Shipments.ShipmentChannel and Shipments.Component as well.
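As a sketch, those suggestions translate into statements like the following (the index names are arbitrary, and the composite index on Shipments is just one reasonable choice):
CREATE INDEX idx_accountmapper_account ON AccountMapper (AccountingAccount);
CREATE INDEX idx_channelmapper_channel ON ChannelMapper (AccountingChannel);
CREATE INDEX idx_shipments_join ON Shipments (MarketPlace, ShipmentChannel, Component);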
However, adding indexes increases the workload in maintaining them. While they might give you a performance boost on this query, you might find that updating the tables becomes unacceptably slow. In any case, the MySQL optimiser might decide that a full scan of the table is quicker than accessing it by index.
Really the only way to do this is to set up the indexes that would appear to give you the best result and then benchmark the system to make sure you're getting the results you want here, whilst not compromising the performance elsewhere. Make good use of the EXPLAIN statement to find out what's going on, and remember that optimisations made by yourself or the optimiser on small tables may not be the same optimisations you'd need on larger ones.
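For example, simply prefix the query above with EXPLAIN to see what the optimiser actually chooses:
EXPLAIN select Accounting.Channel, Accounting.Account, Shipments.MarketPlace
from Accounting
join ChannelMapper on Accounting.Channel = ChannelMapper.AccountingChannel
...
The key and rows columns of the output show which index (if any) is used for each table and roughly how many rows are examined.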
The other three answers seem to have indexes covered, so this is in addition to indexes. You have no WHERE clause, which means you are always selecting the whole darn database. In fact, your database design doesn't have anything useful in this regard, such as a shipping date. Think about that.
You also have this:
join (select Component, sum(Amount) from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
That's all well and good, but you don't select anything from this subquery, so why do you have it? If you did want to select something, such as the sum(Amount), you would have to give that column an alias to make it available in the select clause.
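A sketch of what that could look like (the TotalAmount alias is illustrative):
join (select Component, sum(Amount) as TotalAmount from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
and then Totals.TotalAmount can be added to the outer select list.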

Should you always do a COUNT(*) before a SELECT * to determine if there are any rows?

In MySQL, is it generally a good idea to always do a COUNT(*) first to determine if you should do a SELECT * to actually fetch the rows, or is it better to just do the SELECT * directly and then check if it returned any rows?
Unless you lock the table/s in question, doing a select count(*) is useless. Consider:
Process 1:
SELECT COUNT(*) FROM T;
Process 2:
INSERT INTO T
Process 1:
...now doing something based on the obsolete count retrieved before...
Of course, locking a table is not a very good idea in a server environment.
It depends on whether you need the number, but in MySQL in particular there's SQL_CALC_FOUND_ROWS (with FOUND_ROWS()), IIRC. Look up the docs.
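A minimal sketch of that feature (assuming a MySQL version where SQL_CALC_FOUND_ROWS is still available; it has been deprecated in recent releases):
SELECT SQL_CALC_FOUND_ROWS * FROM T WHERE some condition LIMIT 10;
SELECT FOUND_ROWS();
The second statement returns how many rows the first one would have matched without the LIMIT, so you get the data and the count without issuing a separate COUNT(*).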
Always do the SELECT [field1, field2 | *] FROM ... directly. The SELECT COUNT(*) will just bloat your code, add additional transport and data overhead, and generally be unmaintainable.
The former is 2 queries, the latter is 1 query. Each query needs a round trip to the database server. Do the math.
The answer, as in many questions of this kind, is "it depends". What you shouldn't do is perform those two queries when you do not have an index on the table. In general, performing just a COUNT is a waste of IO time, so if this operation will help you save some time spent on IO in most cases, then it might be an option.
In some cases, some DB driver implementations may not return the count of actually selected rows for a select statement that returns the records themselves. The COUNT(*) issued beforehand is useful when you need to know the precise size of the resulting recordset before you fetch the actual data.

IN Criteria performance for update operation in MySQL

I have read that creating a temporary table is best if the number of parameters passed in the IN criteria is large. This is for select queries. Does this hold true for update queries as well? I have an update query which uses 3 table joins (inner joins) and passes 1000 parameters in the IN criteria, and this query runs in a loop 200 or more times. Which is the best approach to execute this query?
IN operations over long literal lists are usually slow. Passing 1000 parameters to any query sounds awful; if you can avoid that, do it. Now, I'd really have a go with the temp table. You can even play with the indexing of that table: instead of just putting values in it, add the indexes that would help you optimize your searches.
On the other hand, inserting with indexes is slower than inserting without indexes, so run an empirical test there. Also, one thing I think is a must: bear in mind that when using the other table you don't need the IN clause, because you can use the EXISTS clause, which usually results in better performance, e.g.:
select * from yourTable yt
where exists (
    select * from yourTempTable ytt
    where yt.id = ytt.id
)
I don't know your query, nor data, but that would give you an idea about how to do it. Note the inner select * is as fast as select aSingleField, as the database engine optimizes it.
Those are all my thoughts. But remember, to be 100% sure of what is best for your problem, there is nothing like performing both tests and timing them :) Hope this helps.
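Since the question is about an UPDATE, here is a sketch of the same idea as a multi-table update (yourTable, yourTempTable and the column names are illustrative):
update yourTable yt
join yourTempTable ytt on yt.id = ytt.id
set yt.someColumn = 'new value';
MySQL allows joins in UPDATE statements, so the 1000 ids can live in the indexed temp table instead of being repeated in an IN list on every iteration of the loop.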

MySql Query, Database design,query efficiency

I have been working with PHP and MySQL for 1 year. I have come across some code in PHP where the programmer builds a query inside a for or foreach loop, and the generated query ends up looking like this (it takes 65.134 seconds):
SELECT * from tbl_name where PNumber IN('p1','p2'.......,'p998','p999','p1000')
In my mind I believe that query is horrible, but the people who worked here before me say there is no other way around it and we have to deal with it. Then I thought it must be a wrong database design, but I cannot come up with a solution. So any suggestions or opinions are welcome.
A couple of things to add: the column is indexed and the table has almost 3 million records. I have given you the example as p1, p2, etc., but those are really phone numbers like 3371234567, 5028129456, etc. The worst part, I feel, is that this column is varchar instead of int or bigint, which makes the comparison even worse. My question is: can we call this a good query, or is it wrong to generalize and it depends on the requirement?
I'm posting this with hopes that we can decrease the turnaround time by providing an example.
Check with your developers and see how they are producing the SQL command.
All those numbers must be coming from somewhere. Ask them where.
If the numbers are coming from a table, then we should simply JOIN those two tables.
EXAMPLE:
Granting that the phone numbers are stored in a table named PHONE_NUMBERS under a column named Phone -- and using your example tbl_name, in which we match against the column PNumber:
SELECT t1.*
FROM tbl_name AS t1
INNER JOIN PHONE_NUMBERS AS t2
ON t1.PNumber = t2.Phone
However, even as an example, this is not enough. Like #George said, you'll have to give us more information about the data structure and the data source of the phone numbers. In fact, depending on the kind of data you show us and the results you need, your SQL query might need to keep using an IN statement instead of an INNER JOIN.
Please give us more information...
What can help us help you is if you could tell us the result of explain on your query.
ie:
explain SELECT * from tbl_name where PNumber IN('p1','p2'.......,'p998','p999','p1000');
This will give us some information on what the database is trying to do.