SELECT AVG(table1.column1) as a,
table2.column2
FROM table1
LEFT OUTER JOIN table2
ON table2.column2 = table1.column2
GROUP BY table2.column2 ORDER BY a DESC LIMIT 10
This is MySQL. I have 1.5 million rows in table1 and 200,000 rows in table2, and I am still waiting for the query to finish.
Does anybody know a way to make it run in a shorter time?
Lots of comments in the same vein, but I thought I'd give a thorough answer. I'm gonna use one of my own tables/databases here for the explanation. Let's take this query:
SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%"
This query returns about 851 rows and takes 0.5 seconds. If we add the word EXPLAIN to the query, MySQL tells us what this query is doing.
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL | 1183 | Using where |
|  1 | SIMPLE      | B     | ALL  | NULL          | NULL | NULL    | NULL | 6594 |             |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
2 rows in set (0.00 sec)
The important column to look at here is rows, as this is the number of records MySQL has to examine. In this case, for both tables A and B it has to look at every row, even though only 851 of them fit the condition. This is how tables can get out of hand quickly: this one only has 6594 records to search through, but left alone it could easily reach your 1.5 million rows.
So we can cut this down by adding an index to the table, allowing MySQL to store a reference for each record.
ALTER TABLE AmazonWishlistItemPrices ADD INDEX idx_asin (asin)
This simply says: create an index called idx_asin, using the column asin for the indexing. If we re-run our EXPLAIN...
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
|  1 | SIMPLE      | A     | ALL  | NULL          | NULL     | NULL    | NULL                | 1183 | Using where |
|  1 | SIMPLE      | B     | ref  | idx_asin      | idx_asin | 12      | mah_database.A.asin | 6    | Using index |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
2 rows in set (0.00 sec)
We're down to six rows, and you can see in possible_keys that it's found our index. You may find that with certain joins and WHERE clauses your indexes are ignored; that's simply MySQL saying "I'm going to have to get all the data anyway" because of the conditions you've provided in the WHERE clause.
It's best to use numeric keys for indexing; you can get away with some varchars, but they do take up disk space. You should have a PRIMARY KEY on each table where possible. So look at your database structure and consider adding some indexes.
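If you do end up indexing a varchar, a prefix index can keep the disk footprint down. A minimal sketch (the index name and the 10-character prefix are assumptions; check that the prefix is still selective for your data):
ALTER TABLE AmazonWishlistItemPrices ADD INDEX idx_asin_prefix (asin(10));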
Final thing: to check whether a table already has indexes, you can use SHOW CREATE TABLE followed by the table name.
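For example, for the table we indexed above:
SHOW CREATE TABLE AmazonWishlistItemPrices;
Applied back to the original query, the first thing I'd try (assuming these indexes don't already exist) is indexing the join/grouping column on both sides:
ALTER TABLE table1 ADD INDEX idx_column2 (column2);
ALTER TABLE table2 ADD INDEX idx_column2 (column2);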
I have a table with the following structure, with almost 120,000 rows:
desc user_group_report
+--------------+----------------------+------+-----+-------------------+-------+
| Field        | Type                 | Null | Key | Default           | Extra |
+--------------+----------------------+------+-----+-------------------+-------+
| user_id      | int(11)              | YES  | MUL | NULL              |       |
| group_id     | int(11)              | YES  | MUL | NULL              |       |
| type_id      | int(11)              | YES  |     | NULL              |       |
| group_desc   | varchar(128)         | NO   |     | NULL              |       |
| status       | enum('open','close') | NO   |     | NULL              |       |
| last_updated | datetime             | NO   |     | CURRENT_TIMESTAMP |       |
+--------------+----------------------+------+-----+-------------------+-------+
I have indexes on the following keys:
user_group_type(user_id,group_id,type_id)
group_type(group_id,type_id)
user_type(user_id,type_id)
user_group(user_id,group_id)
My issue is that I am running a count(*) aggregation on the above table, grouped by group_id, with a WHERE clause on type_id.
Here is the query :
select count(*) user_count, group_id
from user_group_report
where type_id = 1
group by group_id;
and here is the explain plan (the query takes 0.3 secs on average):
+----+-------------+-------------------+-------+---------------------------------------+------------+---------+------+--------+--------------------------+
| id | select_type | table             | type  | possible_keys                         | key        | key_len | ref  | rows   | Extra                    |
+----+-------------+-------------------+-------+---------------------------------------+------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | user_group_report | index | user_group_type,group_type,user_group | group_type | 10      | NULL | 119811 | Using where; Using index |
+----+-------------+-------------------+-------+---------------------------------------+------------+---------+------+--------+--------------------------+
As I understand it, the query almost does a full table scan because of the complex indices. When I try adding an index on group_id, the rows column in the explain plan shows a smaller number (almost half the rows), but the query execution time increases to 0.4-0.5 secs.
I have tried different ways to add/remove indices, but none of them reduces the time taken.
Assuming the table structure cannot be changed and the query is independent of other tables, can someone suggest a better way to optimize the above query, or point out anything I am missing here?
PS:
I have already tried to modify the query to the following but couldn't find any improvement.
select count(user_id) user_count, group_id
from user_group_report
where type_id = 1
group by group_id;
Any little help is appreciated.
Edit:
As per the suggestions, I added a new index
type_group on (type_id,group_id)
This is the new explain plan. The number of rows in the explain output is reduced, but the query execution time is still the same:
+----+-------------+-------------------+------+---------------------------------------+------------+---------+-------+-------+--------------------------+
| id | select_type | table             | type | possible_keys                         | key        | key_len | ref   | rows  | Extra                    |
+----+-------------+-------------------+------+---------------------------------------+------------+---------+-------+-------+--------------------------+
|  1 | SIMPLE      | user_group_report | ref  | user_group_type,type_group,user_group | type_group | 5       | const | 59846 | Using where; Using index |
+----+-------------+-------------------+------+---------------------------------------+------------+---------+-------+-------+--------------------------+
EDIT 2:
Adding details as suggested in answers/comments
select count(*)
from user_group_report
where type_id = 1
This query itself is taking 0.25 secs to execute.
and here is the explain plan:
+----+-------------+-------------------+------+---------------+------------+---------+-------+-------+-------------+
| id | select_type | table             | type | possible_keys | key        | key_len | ref   | rows  | Extra       |
+----+-------------+-------------------+------+---------------+------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | user_group_report | ref  | type_group    | type_group | 5       | const | 59866 | Using index |
+----+-------------+-------------------+------+---------------+------------+---------+-------+-------+-------------+
I believe that your group_type index has its attributes in the wrong order. Try switching them:
create index ix_type_group on user_group_report(type_id,group_id)
This index is better for your query because you specify type_id = 1 in the WHERE clause. The query processor finds the first record with type_id = 1 in the index, then scans only the records with this type_id while performing the aggregation. With such an index, only the relevant records in the index are accessed, which is not possible with the group_type index.
If type_id is selective (i.e. it reduces the search space significantly), creating an index on (type_id, group_id) should help.
This is because it first reduces the number of records that need to be grouped (removing everything where type_id != 1), and only then does the grouping/summing.
EDIT:
Following on from the comments, it seems we need to figure out more about where the bottleneck is - finding the records, or grouping/summing.
The first step would be to measure the performance of:
select count(*)
from user_group_report
where type_id = 1
If that is significantly faster, the bottleneck is more likely in the grouping than in finding the records. If it's just as slow, it's in finding the records in the first place.
Do most of the columns really need to be NULLable? Change to NOT NULL where applicable.
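A minimal sketch of that change for one column of the table above (this assumes the column no longer needs to store NULL, and the ALTER may rebuild the table, so run it in a quiet window):
ALTER TABLE user_group_report MODIFY type_id int(11) NOT NULL;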
What percentage of the table has type_id = 1? If it is most of the table, then that would explain why you don't see much improvement. Meanwhile, the EXPLAIN seems to be thinking there are only two distinct values for type_id, hence it says only half the table will be scanned -- this number cannot be trusted.
To get more insight into what is going on, please do these:
EXPLAIN FORMAT=JSON SELECT...;
And
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
We can help interpret the data you get from those.
Is there a difference in performance (in MySQL) between
Select * from Table1 T1
Inner Join Table2 T2 On T1.ID = T2.ID
And
Select * from Table1 T1, Table2 T2
Where T1.ID = T2.ID
?
As pulled from the accepted answer in question 44917:
Performance wise, they are exactly the same (at least in SQL Server) but be aware that they are deprecating the implicit outer join syntax.
In MySQL the results are the same.
I would personally stick with joining tables explicitly... that is the "socially acceptable" way of doing it.
They are the same. This can be seen by running the EXPLAIN command:
mysql> explain Select * from Table1 T1
-> Inner Join Table2 T2 On T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
mysql> explain Select * from Table1 T1, Table2 T2
-> Where T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
Well, one late answer from me, as I have been analyzing the performance of an older application which uses comma-based joins instead of the INNER JOIN clause.
So here are two tables which are joined (both have more than 100,000 records). Executing the query with the comma-based join takes a lot longer than the INNER JOIN case.
When I analyzed the EXPLAIN output, I found that the comma-join query was using the join buffer, whereas the INNER JOIN query had 'Using where'.
The two plans also differ significantly, as shown in the rows column of the EXPLAIN output.
These are my queries and their respective explain results.
explain select count(*) FROM mbst a , his_moneypv2 b
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND a.mb_no = b.mb_no
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | ALL | PRIMARY | NULL | NULL | NULL | 134472 | Using where; Using join buffer |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
vs.
explain select count(*) FROM mbst a inner join his_moneypv2 b
on a.mb_no = b.mb_no
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 62 | shaklee_my.b.mb_no | 1 | Using where |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
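One caveat about these particular queries: AND binds tighter than OR, so the b.grade_class IN (4,7) branch is not constrained by the date filter, and in the comma version not even by the join predicate, whereas in the INNER JOIN version the ON clause always applies. The two statements are therefore not strictly equivalent, which may itself contribute to the different plans. If a single combined filter was intended, it would need parentheses along these lines:
AND (b.true_grade_class IN (3,6) OR b.grade_class IN (4,7));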
Actually they are virtually the same: JOIN / ON is the newer ANSI syntax, and WHERE is the older ANSI syntax. Both are recognized by query engines.
The comma in a FROM clause is a CROSS JOIN. We can imagine that the SQL server has a SELECT query execution procedure which looks something like this:
1. iterate through every table
2. find rows that meet the join predicate and put them into the result table
3. from the result table, keep only the rows that meet the WHERE condition
If it really works like that, then a CROSS JOIN between tables with a few thousand rows each could allocate a lot of memory, since every row is combined with every other row before the WHERE condition is examined. Your SQL server could be quite busy then.
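To make the two notations concrete (hypothetical tables t1 and t2 joined on id; as the EXPLAIN outputs above show, MySQL plans both identically when it can push the predicate down):
SELECT * FROM t1, t2 WHERE t1.id = t2.id;          -- comma join, filtered by WHERE
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id;   -- explicit join predicate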
I would think so, because the first example explicitly tells MySQL which columns to join and how to join them, whereas in the second one MySQL has to figure out where you want to join.
The second query is just another notation for an inner join, so if there is a difference in performance it's only because one query can be parsed faster than the other - and that difference, if it exists, will be so tiny that you won't notice it.
For more information you could take a look at this question (and use the search on SO next time before asking a question that has already been answered).
The first query is easier for MySQL to understand, so it is likely that the execution plan will be better and that the query will run faster.
The second query, without the where clause, is a cross join. If MySQL is able to understand the where clause well enough, it will do its best to avoid cross joining all the rows, but nothing guarantees that.
In a case as simple as yours, the performance will be strictly identical.
Performance wise, the first query will always be better or identical to the second one. And from my point of view it is also a lot easier to understand when rereading.
Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much longer, with more table joins; the point is, I can't get the table to use the datetime index. This is going to be hard for me if I want to select all data up to a certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date I put in?
There are two things in play here -
The index is not selective enough - if the index covers more than approx. 30% of the rows, MySQL decides a full table scan is more efficient. When you contract the range, the index kicks in.
One index per table in a join
"The real query is much longer, with more table joins; the point is ..."
The point is exactly because it has joins that it probably can't use that index. MySQL can use one index per table in a join (unless it qualifies for an index-merge optimization). If the primary key is already used for the join, thedatetime won't be used. In order to use it, you need to create a multi-column index on the join key + thedatetime, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
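A minimal sketch, assuming a hypothetical join column join_col (substitute the real join key your EXPLAIN shows), with backticks because OUT is a reserved word in MySQL:
ALTER TABLE `out` ADD INDEX idx_join_datetime (join_col, thedatetime);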
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In your first query the index is not used because you are retrieving ALL rows, and in this case using the index is slower (lookup index, get row, lookup index, get row... repeated for the number of rows is slower than just getting all the rows, i.e. a table scan).
In the second query you are retrieving only a portion of the data, and in this case a table scan is much slower.
The job of the optimizer is to use the statistics that the RDBMS keeps on the index to determine the best plan. In the first case the index was considered, but the planner (correctly) threw it away.
EDIT
You might want to read something like this to get some concepts and keywords regarding the MySQL query planner.
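If you suspect those statistics are stale, you can also refresh the key distribution they are based on (again with backticks around the reserved table name):
ANALYZE TABLE `out`;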
I have two tables (characteristic_list and measure_list) that are related to each other by a column called 'm_id'. I want to retrieve records using filters (columns from characteristic_list) within a date range (columns from measure_list). When I run the following SQL using INNER JOIN, it takes a while to retrieve the records. What am I doing wrong?
mysql> explain select c.power_set_point, m.value, m.uut_id, m.m_id, m.measurement_status, m.step_name from measure_list as m INNER JOIN characteristic_list as c ON (m.m_id=c.m_id) WHERE (m.sequence_end_time BETWEEN '2010-06-18' AND '2010-06-20');
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | c | ALL | NULL | NULL | NULL | NULL | 82952 | |
| 1 | SIMPLE | m | ALL | NULL | NULL | NULL | NULL | 85321 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
2 rows in set (0.00 sec)
mysql> select count(*) from measure_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.18 sec)
mysql> select count(*) from characteristic_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.10 sec)
The reason this query takes a while to execute is that it has to scan both tables in their entirety. You never want to see "ALL" as the type in the EXPLAIN output. To speed things up, you need to make smart decisions about what to index.
See the following documents at the MySQL site:
http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html
http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
As an add-on to the previous answer by Dan, you should consider indexing the join columns and the where columns. In this case, that means the m_id columns in both tables and sequence_end_time in the measure_list table. The tables are small enough that you could add an index, run the explain plan and time it, then change the index and compare. It should be relatively quick to solve.
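A sketch of those indexes (the index names are my own invention; verify each change with EXPLAIN and a timer, as suggested):
ALTER TABLE characteristic_list ADD INDEX idx_m_id (m_id);
ALTER TABLE measure_list ADD INDEX idx_m_id (m_id);
ALTER TABLE measure_list ADD INDEX idx_seq_end (sequence_end_time);
With an index on characteristic_list.m_id the join lookup stops being a full scan, and the sequence_end_time index lets the BETWEEN filter run as a range scan.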
Is there any tangible difference (speed/efficiency) between these statements? Assume the column is indexed.
SELECT MAX(someIntColumn) AS someIntColumn
or
SELECT someIntColumn ORDER BY someIntColumn DESC LIMIT 1
This depends largely on the query optimizer in your SQL implementation. At best, they will have the same performance. Typically, however, the first query is potentially much faster.
The first query essentially asks for the DBMS to inspect every value in someIntColumn and pick the largest one.
The second query asks the DBMS to sort all the values in someIntColumn from largest to smallest and pick the first one. Depending on the number of rows in the table and the existence (or lack thereof) of an index on the column, this could be significantly slower.
If the query optimizer is sophisticated enough to realize that the second query is equivalent to the first one, you are in luck. But if you retarget your app to another DBMS, you might get unexpectedly poor performance.
EDIT based on explain plan:
The explain plan shows that max(column) is more efficient. The explain plan says "Select tables optimized away".
EXPLAIN SELECT version from schema_migrations order by version desc limit 1;
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| 1 | SIMPLE | schema_migrations | index | NULL | unique_schema_migrations | 767 | NULL | 1 | Using index |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
EXPLAIN SELECT max(version) FROM schema_migrations ;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)