Improving MySQL efficiency - mysql

I have a table that now contains over 43 million records. My SELECT queries usually filter on the same field, say A. Would it be more efficient to split the table into several tables by the different values of A and store those in the database instead? How much could I gain?
I have one table named entry: {entryid (PK), B}, containing 6 thousand records, and several other tables with a similar structure, T1: {id (PK), entryid, C, ...}, each containing millions of records. Do the following two approaches have the same efficiency?
SELECT id FROM T1, entry WHERE T1.entryid = entry.entryid AND entry.B = XXX
and
SELECT entryid FROM entry WHERE B = XXX
//format a string S as (entryid1, entryid2, ... )
//then run
SELECT id FROM T1 WHERE entryid IN S

You will get a performance improvement, but you don't have to do it manually: use MySQL's built-in partitioning. How much you gain really depends on your configuration, and it would be best for you to test it. For example, if you have a monster server, 43M records is nothing and you will not gain that much from partitioning (but you should get some improvement anyway).
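As a sketch of what that looks like, assuming the table is named big_table (a hypothetical name) and A is an integer column, native HASH partitioning by A could be declared like this:

```sql
-- Split big_table into 8 partitions by the value of A, so equality
-- lookups on A only touch one partition (partition pruning).
-- Caveat: MySQL requires every unique key, including the primary key,
-- to include the partitioning column.
ALTER TABLE big_table
    PARTITION BY HASH (A)
    PARTITIONS 8;
```

If A has a small set of known values, PARTITION BY LIST or RANGE on A is the more explicit alternative.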
As for the second question, I would say the first query will be a lot faster.
But it would be best to measure the results, because it may depend on your hardware configuration, your indexes (use EXPLAIN to check that you have the correct indexes), MySQL settings such as the query cache size, and the engine you are using (MyISAM, InnoDB)...

Use the EXPLAIN Command to check your queries.
dev.mysql.com/doc/refman/5.0/en/explain.html
Here is an explanation:
http://www.slideshare.net/phpcodemonkey/mysql-explain-explained
First and foremost, you need to make sure you have the right indexes for a table of that size, especially for queries that join with other tables.
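For the two queries above, a minimal check could look like this (the index names are illustrative):

```sql
-- Inspect the join: look at the "key" and "rows" columns of the output
-- to see whether indexes are used and how many rows are examined.
EXPLAIN SELECT id
FROM T1 JOIN entry ON T1.entryid = entry.entryid
WHERE entry.B = 'XXX';

-- Likely candidates if EXPLAIN shows full table scans:
CREATE INDEX idx_entry_b    ON entry (B);
CREATE INDEX idx_t1_entryid ON T1 (entryid);
```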

Related

Searching in all the tables of a mysql database

I have a mysql (mariadb) database with numerous tables and all the tables have the same structure.
For the sake of simplicity, let's assume the structure is as below.
UserID - Varchar (primary)
Email - Varchar (indexed)
Is it possible to query all the tables together for the Email field?
Edit: I have not finalized the DB design yet, so I could put all the data in a single table. But I am afraid that a large table will slow down operations, and that if it crashes it will be painful to restore. Thoughts?
I have read some answers that suggested dumping all data together in a temporary table, but that is not an option for me.
MySQL Workbench or phpMyAdmin is not useful here either; I am looking for a SQL query, not a frontend search technique.
There's no concise way in SQL to say this sort of thing.
SELECT a,b,c FROM <<<all tables>>> WHERE b LIKE 'whatever%'
If you know all your table names in advance, you can write a query like this.
SELECT a,b,c FROM table1 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table2 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table3 WHERE b LIKE 'whatever%'
UNION ALL
SELECT a,b,c FROM table4 WHERE b LIKE 'whatever%'
...
Or you can create a view like this.
CREATE VIEW everything AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
UNION ALL
SELECT * FROM table3
UNION ALL
SELECT * FROM table4
...
Then use
SELECT a,b,c FROM everything WHERE b LIKE 'whatever%'
If you don't know the names of all the tables in advance, you can retrieve them from MySQL's information_schema and write a program to create a query like one of my suggestions above. If you decide to do that and need help, please ask another question.
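One way to generate such a query, sketched for the UserID/Email schema above (the email literal is just a placeholder), is to let MySQL itself build the UNION text from information_schema:

```sql
-- Emits the text of a UNION ALL query covering every base table in the
-- current schema; you then run the generated statement as a second step.
-- Caveat: long schemas may need group_concat_max_len raised.
SELECT GROUP_CONCAT(
         CONCAT('SELECT UserID, Email FROM `', table_name,
                '` WHERE Email = ''someone@example.com''')
         SEPARATOR '\nUNION ALL\n') AS generated_sql
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_type = 'BASE TABLE';
```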
These sorts of queries will, unfortunately, always be significantly slower than querying just one table. Why? MySQL must repeat the overhead of running the query on each table, and a single index is faster to use than multiple indexes on different tables.
Pro tip Try to design your databases so you don't add tables when you add users (or customers or whatever).
Edit You may be tempted to use multiple tables for query-performance reasons. With respect, please don't do that. Correct indexing will almost always give you better query performance than searching multiple tables. For what it's worth, a "huge" table for MySQL, one which challenges its capabilities, usually has at least a hundred million rows. Truly. Hundreds of thousands of rows are in its performance sweet spot, as long as they're indexed correctly. Here's a good reference about that, one of many. https://use-the-index-luke.com/
Another reason to avoid a design where you routinely create new tables in production: it's a pain in the neck to maintain and optimize databases with large numbers of tables. Six months from now, as your database scales up, you'll almost certainly need to add indexes to help speed up some slow queries. If you have to add many indexes to many tables, you, or your successor, won't like it.
You may also be tempted to use multiple tables to make your database more resilient to crashes. With respect, it doesn't work that way. Crashes are rare, and catastrophic unrecoverable crashes are vanishingly rare on reliable hardware. And a crash can corrupt multiple tables at once. (The real route to crash resilience is decent backups.)
Keep in mind that MySQL has been in development for over a quarter-century (as have the other RDBMSs). Thousands of programmer years have gone into making it fast and resilient. You may as well leverage all that work, because you can't outsmart it. I know this because I've tried and failed.
Keep your database simple. Spend your time (your only irreplaceable asset) making your application excellent so you actually get millions of users.

Mysql Performance: Which of the query will take more time?

I have two tables:
1. user table with around 10 million data
columns: token_type, cust_id(Primary)
2. pm_tmp table with 200k data
columns: id(Primary | AutoIncrement), user_id
user_id is a foreign key referencing cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: here we run the query below individually for each of the 60,000 cust_id values:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically, the first query will take less time, as it involves fewer commits and in turn fewer index rebuilds. But I would recommend the second option, since it is more controlled, will appear to take less time, and you can even think about executing two separate sets in parallel.
Note: the first query needs sufficient memory provisioned for the MySQL buffers to execute quickly. The second, being a set of independent single-transaction queries, needs comparatively less memory, and hence will appear faster in limited-memory environments.
Well, you may rewrite the first query this way too.
update user u, pm_tmp p set u.token_type = 'PRIME' where u.cust_id = p.user_id and p.id < 60000;
Some versions of MySQL have trouble optimizing IN. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_temp. If that is possible, you will want a select distinct subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
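Putting the chunking and per-chunk COMMIT together, a sketch (assuming autocommit is off; the chunk boundaries are illustrative) of Gordon's update in 1,000-row pieces:

```sql
-- Process 1,000 pm_tmp ids per transaction instead of 60,000 at once,
-- keeping the undo log small.
UPDATE user u
JOIN pm_tmp pt ON u.cust_id = pt.user_id
SET u.token_type = 'PRIME'
WHERE pt.id BETWEEN 1 AND 1000;
COMMIT;

-- ...then BETWEEN 1001 AND 2000, and so on, committing after each chunk.
```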
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.
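The idea of deriving prime-ness at read time instead of updating, sketched with the tables from the question (assuming at most one pm_tmp row per user):

```sql
-- No bulk UPDATE needed: a user is 'PRIME' exactly when a pm_tmp row
-- points at them, so compute it in the SELECT.
SELECT u.cust_id,
       IF(pt.user_id IS NULL, u.token_type, 'PRIME') AS effective_token_type
FROM user u
LEFT JOIN pm_tmp pt ON pt.user_id = u.cust_id;
```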

MySQL: SELECT millions of rows

I have a dataset of about 32 million rows that I'm trying to export to provide some data for an analytics project.
Since my final data query will be large, I'm trying to limit the number of rows I have to work with initially. I'm doing this by running a CREATE TABLE ... SELECT on the main table (32 million records) with a join on another table of about 5k records. I made indexes on the columns where the JOIN takes place, but not on the other WHERE conditions. This query has been running for over 4 hours now.
What could I have done to speed this up and if there is something, would it be worth it to stop this query, do it, and start over? The data set is static and I'm not worried about preserving anything or proper database design long-term. I just need to get the data out and will discard the schema.
A simplified version of the query is below
CREATE TABLE RELEVANT_ALERTS
SELECT a.time, s.name,s.class, ...
FROM alerts a, sig s
WHERE a.IP <> 0
AND a.IP not between x and y
AND s.class in ('c1','c2','c3')
Try EXPLAIN SELECT first of all to see what is going on. Are your indexes properly set up?
Also, you are not joining the two tables on their primary keys; is that on purpose? Where are your primary key and foreign key?
Can you also provide us with the table schema?
Also, could your hardware be the problem? How much RAM and processing power does it have? I hope you are not running this on a single-core processor, as that is bound to take a long time.
I have a table with 2,000,000,000 rows (2 billion, 219 GB), and a query similar to yours takes no more than 0.3 seconds with properly set up indexes. That is on an 8-core (2 GHz) processor with 64 GB RAM, so not the beefiest setup for a database of that size, but the indexes are held in memory, which is what makes the queries fast.
It should not take that long. Can you please make sure you have indexes on a.IP and s.class?
Also, can't you put the a.IP <> 0 comparison after a.IP not between x and y, so you already have a filtered set for the zero comparison (which I believe will otherwise be checked against every single record)?
You can also move s.class up to be the first comparison, depending on how many rows the s table has, to really speed up the filtering.
Your join seems to be a full cross join. That will take really, really long in any case. Is there no common field in both tables? Why do you need this join? If you really want to do this, you should first create two tables from alerts and sig that fulfill your WHERE conditions, and then join the resulting tables if you must.
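That two-step approach might be sketched like this (x and y stand for the bounds from the question; the join column sig_id is hypothetical, since the question shows no common field):

```sql
-- Step 1: materialize only the rows that pass each WHERE condition.
CREATE TABLE alerts_filtered AS
SELECT * FROM alerts
WHERE IP <> 0 AND IP NOT BETWEEN x AND y;

CREATE TABLE sig_filtered AS
SELECT * FROM sig
WHERE class IN ('c1', 'c2', 'c3');

-- Step 2: join the much smaller intermediate tables.
SELECT a.time, s.name, s.class
FROM alerts_filtered a
JOIN sig_filtered s ON a.sig_id = s.sig_id;
```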
Agree with Vish.
In addition, depending on your query workload, you could change the storage engine to MyISAM if it is currently InnoDB, since MyISAM is more optimized for read-only queries.
ALTER TABLE my_table ENGINE = MyISAM;
Also, you could change the isolation level of your database. For example, to set isolation level to read uncommitted:
SET tx_isolation = 'READ-UNCOMMITTED';
First try EXPLAIN SELECT to see what is slowing it down, then try to add some indexes if you don't have any.
Trust me, 4 hours is quite normal here: you have a table of 32 million rows, and with a cross join you multiply 32 million by 5,000, so your query has on the order of 160 billion row combinations to evaluate...
To avoid that, I suggest you use an ETL workflow, like Microsoft SSIS.
With SSIS you can reduce the query time a lot.

MySQL Performance

We have a data warehouse with denormalized tables ranging from 500K to 6+ million rows. I am developing a reporting solution, so we are utilizing database paging for performance reasons. Our reports have search criteria and we have created the necessary indexes, however, performance is poor when dealing with the million(s) row tables. The client is set on always knowing the total records, so I have to fetch the data as well as the record count.
Are there any other things I can do to help with performance? I'm not the MySQL dba and he has not really offered anything up, so I'm not sure what he can do configuration wise.
Thanks!
You should use partitioning.
Its main goal is to reduce the amount of data read for particular SQL operations, so that the overall response time is reduced.
Refer:
http://dev.mysql.com/tech-resources/articles/performance-partitioning.html
If you partition the large tables and store the parts on different servers, your queries will run faster.
see: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Also note that using NDB tables you can use HASH keys that get looked up in O(1) time.
For the number of rows you can keep a running total in a separate table and update it, for example in AFTER INSERT and AFTER DELETE triggers.
Although the triggers will slow down deletes/inserts, the cost is spread over time. Note that you don't have to keep all totals in one row; you can store totals per condition. Something like:
table    field   condition  row_count
-------------------------------------
table1   field1  cond_x     10
table1   field1  cond_y     20

-- just a silly example; you can come up with more efficient code,
-- but I hope you get the gist of it
select sum(row_count) as count_cond_xy
from totals
where field = 'field1' and `table` = 'table1'
  and condition like 'cond_%';
If you find yourself always counting along the same conditions, this can speed your select count(x) from bigtable where ... up from minutes to instant.
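A minimal version of such a trigger, assuming a bigtable and the totals table sketched above (the condition mapping is purely illustrative), might be:

```sql
-- Keep totals.row_count in sync as rows arrive; a matching AFTER DELETE
-- trigger would decrement the same counter.
CREATE TRIGGER bigtable_ai AFTER INSERT ON bigtable
FOR EACH ROW
  UPDATE totals
  SET row_count = row_count + 1
  WHERE `table` = 'bigtable'
    AND field = 'field1'
    AND condition = IF(NEW.field1 < 100, 'cond_x', 'cond_y');
```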

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is, Is there anyway to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports it natively from version 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
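A summary-table sketch for the JMX case (the table and column names are assumptions, not from the question):

```sql
-- Pre-aggregate raw samples into one row per server per hour, so graph
-- queries read thousands of rows instead of hundreds of millions.
CREATE TABLE jmx_hourly AS
SELECT server_id,
       DATE_FORMAT(sampled_at, '%Y-%m-%d %H:00:00') AS hour_bucket,
       AVG(heap_used)    AS avg_heap,
       MAX(thread_count) AS max_threads
FROM jmx_samples
GROUP BY server_id, hour_bucket;
```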
3 suggestions:
index
index
index
p.s. for timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (number of seconds since 1970, or whatever).
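The integer-timestamp idea might look like this (the column names are illustrative):

```sql
-- Store seconds since the epoch in an INT UNSIGNED column...
INSERT INTO stats (sampled_at_epoch, heap_used)
VALUES (UNIX_TIMESTAMP('2010-01-01 00:00:00'), 123456);

-- ...and convert back to a readable timestamp only for display.
SELECT FROM_UNIXTIME(sampled_at_epoch), heap_used FROM stats;
```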
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features, but be warned that they contain a lot of bugs.
First you should use indexes.
If this is not enough, you can try to split the tables using partitioning.
If that also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data: for instance, pre-compute totals by hour, or by user, or by week (whatever, you get the idea) and store them in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude that way, good for you!
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
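Dropping a day of old data then becomes a near-instant metadata operation, along these lines (the table and column names are assumptions):

```sql
-- Range-partition the stats table by day of the timestamp column...
ALTER TABLE stats
PARTITION BY RANGE (TO_DAYS(sampled_at)) (
    PARTITION p20100101 VALUES LESS THAN (TO_DAYS('2010-01-02')),
    PARTITION p20100102 VALUES LESS THAN (TO_DAYS('2010-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- ...so expiring a day is a partition drop, not a huge DELETE.
ALTER TABLE stats DROP PARTITION p20100101;
```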
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there anyway to optimize the table or query so you can perform these queries
in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy join algorithms, or hash aggregates, or anything else really useful here; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases, if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
Operation                                                    MySQL   Postgres
INSERT                                                       58 s    100 s
CREATE INDEX on (category, id)   (useless)                   75 s    51 s
SELECT category, sum(counter) FROM t GROUP BY category;      9.3 s   5 s
SELECT category, sum(counter) FROM t
  WHERE id > 15000000 GROUP BY category;                     1.7 s   0.5 s
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' -- wrong: keeps scanning after the first hit
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1 -- stops as soon as one row matches
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is that MySQL has to evaluate RAND() (which takes processing power) for every single row in the table and then sort them all, just to give you one row.
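One common alternative, assuming a numeric auto-increment id without large gaps, is to pick a random starting point once instead of once per row:

```sql
-- RAND() is evaluated a single time in the derived table, and the
-- final row is found via the primary-key index.
SELECT u.*
FROM user u
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM user)) AS rand_id) r
  ON u.id >= r.rand_id
ORDER BY u.id
LIMIT 1;
```

Note the trade-off: rows following a gap in the id sequence are slightly more likely to be chosen.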
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
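For example (the login_log table is hypothetical):

```sql
CREATE TABLE login_log (
    user_id INT UNSIGNED NOT NULL,
    ip      INT UNSIGNED NOT NULL   -- 4 bytes, fits any IPv4 address
);

-- Convert on the way in...
INSERT INTO login_log (user_id, ip)
VALUES (1, INET_ATON('192.168.0.1'));   -- stores 3232235521

-- ...and back on the way out.
SELECT user_id, INET_NTOA(ip) FROM login_log;
```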