Left join between a small and a very large table in MySQL

I have two tables; one has 50 records and contains some city names, and the other has 3,173,958 records and maps cities to country codes:
+---------+-----------+
| country | city      |
+---------+-----------+
| gb      | sapiston  |
| gb      | sapperton |
| gb      | sarclet   |
| gb      | sarnau    |
| gb      | sarnau    |
+---------+-----------+
The large table is indexed on the city field, but this query takes about 5 minutes to execute:
SELECT small.* , c2c.country FROM small LEFT JOIN c2c ON ( lower( small.city ) = lower( c2c.city ) );
What is the problem?
How can I make it faster?

To make use of the index, you should store the city values in lowercase, either in the same column or in a separate indexed column, because applying the LOWER() function in the query prevents the index from being used.
SELECT small.* , c2c.country
FROM small
LEFT JOIN c2c
ON small.city = c2c.city;
Also add the following index and covering index on the tables for better performance:
ALTER TABLE small ADD KEY ix1(city);
ALTER TABLE c2c ADD KEY ix1(city, country);
After adding the above indexes, check the query execution plan using EXPLAIN.
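If you go the separate-indexed-column route, here is a minimal sketch, assuming MySQL 5.7+ generated columns (the column name and length are assumptions):

ALTER TABLE c2c
  ADD COLUMN city_lower VARCHAR(64)
    GENERATED ALWAYS AS (LOWER(city)) STORED,
  ADD KEY ix_city_lower (city_lower, country);

-- LOWER() on the small table's side is harmless: with only 50 rows it
-- needs no index, while c2c can now be probed through ix_city_lower.
SELECT small.*, c2c.country
FROM small
LEFT JOIN c2c ON LOWER(small.city) = c2c.city_lower;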

When you use a function on a column inside the WHERE clause, indexes cannot be used, because MySQL has to compute the value for all rows before it can do a comparison. Equality comparisons are usually case-insensitive (depending on the column collations), so you can safely omit the LOWER function. Here is the revised query:
SELECT small.*, c2c.country
FROM small
LEFT JOIN c2c ON small.city = c2c.city
Next, you should add a covering index on c2c, made on (city, country). This way MySQL won't have to look at the table to retrieve the country names: it will use the index for joining on city and at the same time fetch the country column from that same index.
Next, change small.* to only the columns you need.
Next, create an index on small.city if you have not done so already, or, if you find that you only need two or three columns from the small table, create a covering index instead. E.g. if you are selecting small.somecolumn (and using small.city in the WHERE/ON clause), create an index on (city, somecolumn).
Last, make sure that the city column in both tables has the same data type, length, and, most importantly, collation. When the collations differ, MySQL has to convert between them before comparing, which can slow down your query.
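A quick way to compare the two columns, as a sketch (the ALTER shown to align them is an example; the type and collation are assumptions):

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE() AND column_name = 'city';

-- if they differ, align them, for example:
ALTER TABLE c2c
  MODIFY city VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL;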

Related

MySQL - is a query with WHERE by timestamp more efficient?

I have a table with a primary key, an indexed field, and an unindexed timestamp field.
Is it more efficient to query by the timestamp too? Let's say a 12-hour period?
Is it enough to query by the primary key, or is it better to use the indexed fields too? Let's say that querying by the indexed field is not a must.
example:
p_key | project | name | timestamp
----------------------------------
    1 |       1 | a    | 18:00
    2 |       1 | b    | 19:00
I want to get record 1.
Should I ask:
SELECT *
FROM tbl
WHERE p_key = 1 AND project = 1 AND timestamp BETWEEN 16:30 AND 18:30
or:
SELECT *
FROM tbl
WHERE p_key = 1
Let's say that I have many records.
In your example it doesn't matter which query is more efficient in terms of execution time. The important piece to note is that a primary key is unique.
Your query:
SELECT * FROM tbl WHERE p_key = 1
Will return the same row as your other query:
SELECT * FROM tbl WHERE p_key = 1 AND project = 1 AND timestamp BETWEEN 16:30 AND 18:30
because both filter on p_key = 1. The worst-case scenario here is that the entry does not actually fall within your time span in the second query and you get no results at all.
I am assuming you have an index on the primary key here. This means there is absolutely no need to run the second query instead of the first, unless it is possible that the row does not fall within the requested timespan.
So the efficiency gain for your database is that you do not need to create and maintain a new index for the second query. If you have "many" rows, as you stated, this efficiency can become quite important.
A filter on an indexed integer field will be the fastest way to get your data under normal circumstances.
That is assuming your data looks like your example (I mean, the timestamp is not significant in your query, and by filtering on the primary key you already get a single record...).
In addition, a primary key generates an index by default, so you don't need to create one yourself on this field.
The second option, obviously!
SELECT * FROM tbl WHERE p_key = 1
Filtering by the primary key is clearly more efficient than by any other field (in your example), since it is the only one to be indexed.
Furthermore, the primary key is enough to get the record you expect. No need to add complexity, bug risk... and computing time (yes, the conditions in the WHERE clause need to be processed; the more you add, the longer it can take).
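A quick way to confirm this is to compare the two execution plans; a sketch (the literal timestamps are placeholders):

EXPLAIN SELECT * FROM tbl WHERE p_key = 1;

EXPLAIN SELECT * FROM tbl
WHERE p_key = 1 AND project = 1
  AND `timestamp` BETWEEN '2014-01-01 16:30' AND '2014-01-01 18:30';

-- Both should show type=const on the PRIMARY key: MySQL finds the one
-- row by primary key either way, and the extra predicates are merely
-- checked against that single row.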

Why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?

First, I will describe a simplified version of the problem domain.
There is table strings:
CREATE TABLE strings (
  value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
  INDEX(value)
) ENGINE=InnoDB;
As you can see, it has a non-unique index on a CHAR(3) column.
The table is populated using the following script:
CREATE TABLE a_variants (
  letter CHAR(1) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MEMORY;
INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');
INSERT INTO strings
SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
FROM a_variants a, a_variants b, a_variants c
UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'
So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":
SELECT value, COUNT(*) FROM strings GROUP BY value;
+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA   |   216000 |
| BBB   |        1 |
+-------+----------+
As value is indexed, I expect the following two queries to have similar performance:
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';
But in practice the first one is more than 300 times slower than the second! See:
+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration   | Query                                                         |
+----------+------------+---------------------------------------------------------------+
|        1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
|        2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
|        3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+
-- I ran the 'AAA' query twice here just to be sure.
If I change the size of the indexed column or change its type to VARCHAR, the performance problem still manifests itself. Meanwhile, in analogous situations where the non-unique index is not on a CHAR/VARCHAR column (e.g. INT), queries are as fast as expected.
So, the question is: why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?
I have a strong feeling that MySQL performs a full linear scan of all the values matched by the index key. But why does it do so when it could just return the count of the matched rows? Am I missing something, and is that scan really needed? Or is that a sad shortcoming of the MySQL optimizer?
Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.
The rest of this answer is speculation.
The reason the performance is "only" 300 times slower, rather than 200,000 times slower, is because of overhead in reading the index. Actually scanning the entries is quite fast compared to other operations that are needed.
There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.
It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.
One of the things you may want to check is the logical I/O of each query. I'm sure you'll see quite a difference. To count the number of 'BBB's in the table, probably only 3 or 4 LIOs are needed (depending on things like bucket size). To count the number of 'AAA's, essentially the entire table must be scanned, index or not. With 216k rows, that can add up to significantly more LIOs, not to mention physical I/Os. Logical I/Os are faster than physical I/Os, but any I/O is a performance killer.
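As a sketch, one way to observe this in MySQL is through the session Handler counters, which are a rough proxy for logical reads:

FLUSH STATUS;
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SHOW SESSION STATUS LIKE 'Handler_read%';
-- expect Handler_read_next in the hundreds of thousands here, versus
-- roughly 1 after running the 'BBB' query.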
As for text vs numbers, it is always easier and faster for software (any software, not just database engines) to compare numbers than text.

Optimizing the performance of MySQL regarding aggregation

I'm trying to optimize a report query; like most report queries, this one incorporates aggregation. Since the size of the table is considerable and growing, I need to attend to its performance.
For example, I have a table with three columns: id, name, action. And I would like to count the number of actions each name has done:
SELECT name, COUNT(id) AS count
FROM tbl
GROUP BY name;
As simple as it gets, I can't run it in acceptable time. It takes about 30 seconds, and there's no index whatsoever I can add that is taken into account, let alone improves it.
When I run EXPLAIN on the above query, it never uses any of the table's indexes, e.g. the index on name.
Is there any way to improve the performance of the aggregation? Why is the index not used?
[UPDATE]
Here's the EXPLAIN's output:
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
|  1 | SIMPLE      | tbl   | ALL  | NULL          | NULL | NULL    | NULL | 4025567 |   100.00 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
And here is the table's schema:
CREATE TABLE `tbl` (
  `id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `name` varchar(1000) NOT NULL,
  `action` int unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `inx` (`name`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The problem with your query and its use of indexes is that you refer to two different columns in your SELECT statement yet only have one column in your index, plus the index uses a prefix.
Try this (refer to just the name column):
SELECT name, COUNT(*) AS count
FROM tbl
GROUP BY name;
With the following index (no prefix):
tbl (name)
Don't use a prefix on the index for this query, because if you do, MySQL won't be able to use it as a covering index (it will still have to hit the table).
If you use the above, MySQL will scan through the index on the name column but won't have to scan the actual table data. You should see Using index in the EXPLAIN result.
This is as fast as MySQL will be able to accomplish such a task. The alternative is to store the aggregate result separately and keep it updated as your data changes.
Also, consider reducing the size of the name column, especially if you're hitting index size limits, which you most likely are, hence the prefix. Save some room by not using UTF8 if you don't need it (UTF8 takes 3 bytes per character in the index).
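A hedged sketch of that suggestion (the new length and character set are assumptions; check the longest stored value first so nothing is truncated):

SELECT MAX(CHAR_LENGTH(name)) FROM tbl;

ALTER TABLE tbl
  MODIFY name VARCHAR(250) CHARACTER SET latin1 NOT NULL,
  DROP KEY inx,
  ADD KEY name_idx (name);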
It's a very common question, and the key to the solution lies in the fact that your table is growing.
So, the first way would be to create an index on the name column, if it isn't created yet. But this will only solve your issue for a time.
The more proper approach would be to create a separate statistics table like:
tbl_counts
+------+-------+
| name | count |
+------+-------+
and store your counts separately. When changing (insert/update/delete) your data in the tbl table, you'll need to adjust the corresponding row in the tbl_counts table. This allows you to get rid of the COUNT query entirely, but you will need to add some logic around changes to tbl.
To maintain the integrity of your statistics table you can either use triggers or do it inside the application. This method is good if the performance of the COUNT query is much more important to you than that of your data-changing queries (and the overhead of updating the tbl_counts table won't be too high).
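A minimal sketch of the trigger route, assuming tbl_counts has a unique key on name (the trigger name is an assumption):

CREATE TRIGGER tbl_counts_ai AFTER INSERT ON tbl
FOR EACH ROW
  INSERT INTO tbl_counts (name, `count`) VALUES (NEW.name, 1)
  ON DUPLICATE KEY UPDATE `count` = `count` + 1;

A matching AFTER DELETE trigger would decrement the count.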

MySQL Hash Indexes for Optimization

So maybe this is a noob question, but I'm messing with a couple of tables.
I have TABLE A with roughly 45,000 records.
I have TABLE B with roughly 1.5 million records.
I have a query:
UPDATE schema1.tablea a
INNER JOIN (
    SELECT DISTINCT ID, Lookup, IDpart1, IDpart2
    FROM schema1.tableb
    WHERE IDpart1 IS NOT NULL
      AND Lookup IS NOT NULL
    ORDER BY ID, Lookup
) b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
    a.Elg_IDpart2 = b.IDpart2
WHERE a.ID IS NOT NULL
  AND a.Elg_IDpart1 IS NULL
So I am forcing the index on ID, Lookup. Each table does have an index on those columns as well, but because of the sub-query I forced it.
It is taking forever to run, and it really should take, I'd imagine, under 5 minutes...
My questions are in regard to the indexes, not the query.
I know that you can't use a hash index for ordered lookups.
I currently have indexes on ID and Lookup separately, and as one combined index, and they are B-tree indexes. Based on my WHERE clause, does a hash index fit as an optimization technique?
Can I have a single hash index, and the rest of the indexes be B-tree indexes?
This is not a primary key field.
I would post my EXPLAIN, but I changed the names of these tables. Basically it is using the index only for ID... instead of ID, Lookup. I would like to force it to use both, or at least turn it into a different kind of index and see if that helps.
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
The Lookup field maps the first and second part of the ID...
Any help or insight on this is appreciated.
UPDATE:
An EXPLAIN on the UPDATE after I took out the sub-query:
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
| id | select_type | table | type | possible_keys               | key          | key_len | ref               | rows  | Extra       |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
|  1 | SIMPLE      | m     | ALL  | Lookup_Idx,ID_Idx,ID_Lookup |              |         |                   | 44023 | Using where |
|  1 | SIMPLE      | c     | ref  | ID_LookupIdx                | ID_LookupIdx | 5       | schema1.tableb.ID | 4     | Using where |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
tablea relevant indexes:
ID_LookupIdx (ID, Lookup)
tableb relevant indexes:
ID (ID)
Lookup_Idx (Lookup)
ID_Lookup_Idx (ID, Lookup)
All of the indexes are normal B-trees.
Firstly, to deal with the specific questions that you raise:
I currently have indexes on ID and Lookup separately, and as one combined index, and they are B-tree indexes. Based on my WHERE clause, does a hash index fit as an optimization technique?
As documented under CREATE INDEX Syntax:
+----------------+--------------------------------+
| Storage Engine | Permissible Index Types        |
+----------------+--------------------------------+
| MyISAM         | BTREE                          |
| InnoDB         | BTREE                          |
| MEMORY/HEAP    | HASH, BTREE                    |
| NDB            | BTREE, HASH (see note in text) |
+----------------+--------------------------------+
Therefore, before even considering HASH indexing, one should be aware that it is only available in the MEMORY and NDB storage engines, so it may not even be an option for you.
Furthermore, be aware that indexes on combinations of ID and Lookup alone may not be optimal, as your WHERE predicate also filters on tablea.Elg_IDpart1 and tableb.IDpart1; you may benefit from indexing those columns too.
Can I have a single hash index, and the rest of the indexes be B-tree indexes?
Provided that the desired index types are supported by the storage engine, you can mix them as you see fit.
instead of using ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps.
You could use an index hint to force MySQL to use different indexes to those that the optimiser would otherwise have selected.
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
It is usually smart enough, but not always. In this case, however, it has probably determined that the cardinality of the indexes is such that it is better to use those that it has chosen.
Now, depending on the version of MySQL that you are using, tables derived from subqueries may not have any indexes upon them that can be used for further processing: consequently the join with b may require a full scan of that derived table (there's insufficient information in your question to determine exactly how much of a problem this might be, but schema1.tableb having 1.5 million records suggests it could be a significant factor).
See Subquery Optimization for more information.
One should therefore try to avoid derived tables if at all possible. In this case, there does not appear to be any purpose to your derived table, as one could simply join schema1.tablea and schema1.tableb directly:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND a.ID IS NOT NULL
AND b.IDpart1 IS NOT NULL
AND b.Lookup IS NOT NULL
ORDER BY ID, Lookup
The only thing that has been lost is the filter for DISTINCT records, but duplicate records will simply (attempt to) overwrite updated values with those same values again, which has no effect but may have proved very costly (especially with so many records in that table).
The use of ORDER BY in the derived table was pointless, as it could not be relied upon to achieve any particular order for the UPDATE, whereas in this revised version it will ensure that any updates that overwrite previous ones take place in the specified order: but is that necessary? Perhaps it can be removed to save a sorting operation.
One should also check the predicates in the WHERE clause: are they all necessary? (The NOT NULL checks on a.ID and b.Lookup, for example, are superfluous, given that any such NULL records will be eliminated by the JOIN predicate.)
Altogether, this leaves us with:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND b.IDpart1 IS NOT NULL
Only if performance is still unsatisfactory should one look further at the indexing. Are all relevant columns (i.e. those used in the JOIN and WHERE predicates) indexed? Are the indexes being selected for use by MySQL? Bear in mind that it can only use one index per table for lookups, covering both the JOIN predicate and the filter predicates, so perhaps you need an appropriate composite index. Check the query execution plan using EXPLAIN to investigate such issues further.
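As a hedged sketch of such composite indexes (the index names and column orders are assumptions to be validated with EXPLAIN):

ALTER TABLE schema1.tablea ADD KEY ix_a_elg (Elg_IDpart1, ID, Lookup);
ALTER TABLE schema1.tableb ADD KEY ix_b_part (ID, Lookup, IDpart1);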

MySQL query takes too long -- what should be the index?

Here is my query:
CREATE TEMPORARY TABLE temptbl (
  pibn INT UNSIGNED NOT NULL,
  page SMALLINT UNSIGNED NOT NULL
) ENGINE=MEMORY;

INSERT INTO temptbl (
  SELECT pibn, page FROM mytable
  WHERE word1=429907 AND word2=0);

ALTER TABLE temptbl ADD INDEX (pibn, page);

SELECT word1, COUNT(*) AS aaa
FROM mytable a
INNER JOIN temptbl b
  ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1 ORDER BY aaa DESC LIMIT 10;

DROP TABLE temptbl;
The issue is the SELECT word1,COUNT(*) AS aaa, specifically the count. That select statement takes 16 seconds.
EXPLAIN says:
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys                   | key         | key_len | ref                       | rows  | Extra                           |
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
|  1 | SIMPLE      | b     | ALL  | pibn                            | NULL        | NULL    | NULL                      | 26778 | Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref  | w2pibnpage1,word21pibn,pibnpage | w2pibnpage1 | 9       | const,db.b.pibn,db.b.page | 4     | Using index                     |
+----+-------------+-------+------+---------------------------------+-------------+---------+---------------------------+-------+---------------------------------+
The index used (w2pibnpage1) is on:
word2,pibn,page,word1,id
I've been struggling with this for days, trying different combinations of columns for the index (which is annoying as it takes an hour to rebuild - millions of rows).
What should my indexes be, or what should I do to get this query to run in a fraction of a second (as it should)?
Here is a suggestion.
Presumably the temporary table is small. You can remove the index on that table, because a full table scan is fine there. In fact, that is what you want.
You then want indexes to be used on the big table. The index first needs to match the join condition, then the where condition, and finally the group by column. So, the suggestion is:
mytable(pibn, page, word2, word1)
(Note that aaa, the ORDER BY column, is the computed count itself, so it cannot be part of the index; the index above is covering, so the values don't have to be fetched from the original data.)
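A sketch of that suggestion (the index name is an assumption):

ALTER TABLE mytable ADD KEY ix_join_group (pibn, page, word2, word1);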
The query is taking a long time, but the expensive part seems to be accessing mytable (you've not provided its structure). However, the optimizer seems to think it only needs to fetch 4 rows from it using an index, which should be very fast; i.e. the data appears to be very skewed. How many rows does the last query examine (tally of counts)?
Without having a look at the exact distribution of data, it's hard to be definitive; certainly you may need to hint the query to get it to work efficiently. The problem with designing indexes is that they should make all the queries faster, or at least give a reasonable tradeoff.
Looking at the predicates in the queries you've provided...
WHERE word1=429907 AND word2=0
Would be best served by an index on (word1, word2, ...) or (word2, word1, ...).
ON a.pibn=b.pibn AND a.page=b.page
WHERE a.word2=0
Would be best served by an index on mytable with word2+pibn+page in the leading columns.
How many distinct values are there for mytable.word1 and for mytable.word2? If word2 has a low number of distinct values (less than 20 or so) then it's not adding much selectivity to the index and can be omitted.
An index on word2,pibn,page,word1 gives you a covering index for the second query.
If your temptbl is small, you want to first restrict the bigger table (mytable) and then join it (ideally by index) to your temptbl.
Right now, MySQL thinks it is better off using the index of the bigger table for the join.
You can get around this by doing a straight join:
SELECT word1,COUNT(*) AS aaa
FROM mytable a
STRAIGHT_JOIN temptbl b
ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1
ORDER BY aaa DESC LIMIT 10;
This should use your index on mytable for the WHERE clause and join mytable to temptbl via the index on temptbl.
If MySQL still wants to do it differently, you can use FORCE INDEX to make it use the index.
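For example, a sketch combining both hints (w2pibnpage1 is the index named in the question):

SELECT word1, COUNT(*) AS aaa
FROM mytable a FORCE INDEX (w2pibnpage1)
STRAIGHT_JOIN temptbl b
  ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1
ORDER BY aaa DESC LIMIT 10;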
With your data volumes it is not going to be fast no matter what you do, not without changing the schema.
If I understand you right, you're looking for the top words which go along with 429907 on the same pages.
Your model as it is now would require counting all those words over and over again each time you run the query.
To speed it up, you would need to create an additional stats table:
CREATE TABLE word_pairs
(
  word1_1 INT NOT NULL,
  word1_2 INT NOT NULL,
  cnt BIGINT NOT NULL,
  PRIMARY KEY (word1_1, word1_2),
  INDEX (word1_1, cnt),
  INDEX (word1_2, cnt)
)
and update it each time you insert a record into the large table (increase the cnt for the newly inserted word paired with every word it shares a page with).
This would probably be too slow for a single server, as such updates would take some time, so you would also need to shard that table across multiple servers.
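A hedged sketch of that maintenance step for a single new row (@w on page @pibn/@page; assumes each pair is stored in both orders to serve both indexes):

INSERT INTO word_pairs (word1_1, word1_2, cnt)
SELECT @w, m.word1, 1
FROM mytable m
WHERE m.pibn = @pibn AND m.page = @page AND m.word1 <> @w
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
-- plus the mirrored INSERT with word1_1 and word1_2 swapped.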
If you had such a table you could just run:
SELECT *
FROM word_pairs
WHERE word1_1 = 429907
ORDER BY cnt DESC
LIMIT 10
which would be instant.
I came up with this:
CREATE TEMPORARY TABLE temp1 (
  pibn INT UNSIGNED NOT NULL,
  page SMALLINT UNSIGNED NOT NULL
) ENGINE=MEMORY;

INSERT INTO temp1 (
  SELECT pibn, page FROM mytable
  WHERE word1=429907 AND word2=0);

CREATE TEMPORARY TABLE temp2 (
  word1 MEDIUMINT UNSIGNED NOT NULL
) ENGINE=MEMORY;

INSERT INTO temp2 (
  SELECT a.word1
  FROM mytable a, temp1 b
  WHERE a.word2=0 AND a.pibn=b.pibn AND a.page=b.page);

DROP TABLE temp1;
CREATE INDEX index1 ON temp2 (word1) USING BTREE;

CREATE TEMPORARY TABLE temp3 (
  word1 MEDIUMINT UNSIGNED NOT NULL,
  num INT UNSIGNED NOT NULL
) ENGINE=MEMORY;

INSERT INTO temp3 (
  SELECT word1, COUNT(*) AS aaa FROM temp2 USE INDEX (index1) GROUP BY word1);

DROP TABLE temp2;
CREATE INDEX index1 ON temp3 (num) USING BTREE;

SELECT word1, num FROM temp3 USE INDEX (index1) ORDER BY num DESC LIMIT 10;
DROP TABLE temp3;
Takes 5 seconds.