MySQL index for non-empty string query


I have a table posts with a column descr in my application. I need to query the posts whose description is not empty, but the table has too many rows for a full scan, so I need to add an index. What is the best way to add this index?
Table structure (simplified):
CREATE TABLE posts (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
title VARCHAR(255),
descr VARCHAR(1024),
PRIMARY KEY(id)
) engine=InnoDB DEFAULT CHARSET=utf8;
Example of query:
SELECT * FROM posts WHERE descr <> '';
I don't want to create an index on the whole descr column, because that would be a huge overhead.
I also know the variant of adding another column is_empty_descr BOOLEAN and indexing it. I will use that solution if no other option turns up.
I tried adding INDEX( descr(1) ), but I couldn't find a way to make a query use it:
descr <> '' - index is not used
LEFT(descr, 1) = '' - index is not used
SUBSTR(descr, 0, 1) = '' - index is not used
descr LIKE 'a%' - index is used! But this is a totally different case
In all my examples I see something like this:
mysql> EXPLAIN SELECT * FROM posts WHERE descr <> '';
+------+---------------+------+---------+------+------+-------------+
| type | possible_keys | key | key_len | ref | rows | Extra |
+------+---------------+------+---------+------+------+-------------+
| ALL | descr_1 | NULL | NULL | NULL | 42 | Using where |
+------+---------------+------+---------+------+------+-------------+
(I omitted some result columns, because the table is too wide for this site.)
Even if I pass FORCE INDEX (descr_1), the result is the same.

In MySQL, you can add an index on a prefix of a string column using this syntax:
create index idx_posts_descr1 on posts(descr(1));
You should test whether the index is actually used for that particular WHERE clause, though.
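If the prefix index turns out not to be used, the flag-column fallback mentioned in the question can be sketched with a generated column, so that the flag never drifts out of sync with descr. This assumes MySQL 5.7 or later (older versions have no generated columns); the names has_descr and idx_posts_has_descr are hypothetical.

```sql
-- Assumes MySQL 5.7+; has_descr and idx_posts_has_descr are made-up names.
ALTER TABLE posts
  ADD COLUMN has_descr TINYINT(1)
    GENERATED ALWAYS AS (descr IS NOT NULL AND descr <> '') STORED,
  ADD INDEX idx_posts_has_descr (has_descr);

-- Filter on the flag instead of on descr itself.
SELECT * FROM posts WHERE has_descr = 1;

-- Check whether the optimizer actually picks the new index.
EXPLAIN SELECT * FROM posts WHERE has_descr = 1;
```

Whether the index is chosen depends on how selective the flag is: if most rows have a description, a full scan may still be cheaper.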

Related

How to Optimized performance of JOIN query on large table

I am using server version 5.5.28-log MySQL Community Server (GPL).
I have a big table, called table A, that consists of 279703655 records. I have to join this table with one of my changelog tables, B, and then insert the matching records into a new temporary table C.
Table B has an index on the column type.
Table A consists of prod_id, his_id and other columns. It has an index on both prod_id and history_id.
When I run the following query
INSERT INTO C(prod,his_id,comm)
SELECT DISTINCT a.product_id,a.history_id,comm
FROM B as b INNER JOIN A as a ON a.his_id = b.his_id AND b.type="applications"
GROUP BY prod_id
ON DUPLICATE KEY UPDATE
`his_id` = VALUES(`his_id`);
it takes 7 to 8 minutes to insert the records.
Even a simple count on table A took 15 minutes.
I have also tried a procedure that inserts the records in batches using LIMIT, but because the count query alone takes 15 minutes, it is even slower than before.
BEGIN
DECLARE n INT DEFAULT 0;
DECLARE i INT DEFAULT 0;
SELECT COUNT(*) FROM A INTO n;
SET i=5000000;
WHILE i<n DO
INSERT INTO C(product_id,history_id,comments)
SELECT a.product_id,a.history_id,a.comments FROM B as b
INNER JOIN (SELECT * FROM A LIMIT i,1) as a ON a.history_id=b.history_id;
SET i = i + 5000000;
END WHILE;
End
But the above code also takes 15 to 20 minutes to execute.
Please suggest how I can make it faster.
Below is the EXPLAIN result:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| 1 | SIMPLE | a | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 279703655 | |
| 1 | SIMPLE | b | eq_ref | PRIMARY | PRIMARY | 8 | DB.a.history_id | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
(from Comment)
CREATE TABLE B (
history_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
history_hash char(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
type enum('products','brands','partnames','mc_partnames','applications') NOT NULL,
stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (history_id),
UNIQUE KEY history_hash (history_hash),
KEY type (type),
KEY stamp (stamp)
);

Let's first look at the tables.
What you call table B is really a history table. Its primary key is the history_id.
What you call table A is really a product table with one product per row and product_id its primary key. Each product also has a history_id. Thus you have created a 1:n relation. A product has one history row; one history row relates to multiple products.
You are selecting the product table rows that have a history entry of type 'applications'. This should be written as:
select product_id, history_id, comm
from product
where history_id in
(
select history_id
from history
where type = 'applications'
);
(A join would work just as well, but isn't as clear. As there is only one history row per product, you can't get duplicates. Both GROUP BY and DISTINCT are completely superfluous in your query and should be removed so as not to give the DBMS unnecessary work. But as mentioned, it is better not to join at all. If you want rows from table A, select from table A. If you want to look up rows in table B, look them up in the WHERE clause, where all criteria belong.)
Now, we would have to know how many rows may be affected. If only 1% of all history rows are 'applications', then an index should be used. Preferably
create index idx1 on history (type, history_id);
… which finds rows by type and gets their history_id right away.
If, say, 20% of all history rows are 'applications', then reading the table sequentially might be more efficient.
Then, how many product rows may we get? Even with a single history row, we might get millions of related product rows. Or vice versa, with millions of history rows we might get no product row at all. Again, we can provide an index, which may or may not be used by the DBMS:
create index idx2 on product (history_id, product_id, comm);
This is about as fast as it gets: two indexes offered and a properly written query without an unnecessary join. There were times when MySQL had performance problems with IN, and people rewrote the clause with EXISTS. I don't think this is still necessary.
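For reference, the EXISTS form that people used to switch to would look roughly like this. It uses the product and history names from this answer (tables A and B in the question) and is a sketch to compare with EXPLAIN, not a guaranteed improvement.

```sql
-- Same result as the IN version: products whose history row is of type 'applications'.
SELECT p.product_id, p.history_id, p.comm
FROM product AS p
WHERE EXISTS (
  SELECT 1
  FROM history AS h
  WHERE h.history_id = p.history_id
    AND h.type = 'applications'
);
```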
As of MySQL 8.0.3, you can create histogram statistics for tables.
analyze table history update histogram on type;
analyze table product update histogram on history_id;
This is an important step to help the optimizer to find the optimal way to select the data.

Indexes needed (assuming it is history_id, not his_id):
B: INDEX(type, history_id) -- in this order. Note: "covering"
A: INDEX(history_id, product_id, comm)
What column or combination of columns provides the uniqueness constraint that IODKU (INSERT ... ON DUPLICATE KEY UPDATE) needs?
Really, please provide SHOW CREATE TABLE.
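To illustrate why the uniqueness constraint matters: ON DUPLICATE KEY UPDATE only fires when an insert collides with a PRIMARY KEY or UNIQUE index; without one, every row is simply inserted. A minimal sketch on a hypothetical table c_demo (the real definition of C is not shown in the question, so the unique key on prod is an assumption):

```sql
-- Hypothetical stand-in for table C; the unique key on prod is an assumption.
CREATE TABLE c_demo (
  prod   INT UNSIGNED NOT NULL,
  his_id BIGINT UNSIGNED NOT NULL,
  comm   VARCHAR(255),
  UNIQUE KEY uq_c_demo_prod (prod)
) ENGINE=InnoDB;

INSERT INTO c_demo (prod, his_id, comm) VALUES (1, 10, 'first');

-- prod = 1 already exists, so this updates his_id instead of adding a row.
INSERT INTO c_demo (prod, his_id, comm) VALUES (1, 20, 'second')
ON DUPLICATE KEY UPDATE his_id = VALUES(his_id);

SELECT prod, his_id, comm FROM c_demo;
```

After these statements the table holds a single row for prod = 1, with his_id changed to 20 and comm left as 'first', because only his_id is listed in the UPDATE part.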

Does index work with view?

Assume that I have two tables:
table1(ID, attribute1, attribute2) and
table2(ID, attribute1, attribute2), where ID is the primary key of both tables,
and I have a view:
create view myview as
select ID, attribute1, attribute2 from table1
union
select ID, attribute1, attribute2 from table2
Can I take advantage of the primary key index (in SQL in general, and in MySQL in my case) when I execute a query like the following?
select * from myview where ID = 100

It depends on your query. Using a view may limit the indexes that can be used efficiently.
For example using a table I have handy I can create a view using 2 UNIONed selects each with a WHERE clause.
CREATE VIEW fred AS
SELECT *
FROM item
WHERE code LIKE 'a%'
UNION SELECT *
FROM item
WHERE mmg_code LIKE '01%'
Both the code and the mmg_code fields have indexes. The table also has id as a primary key (highest value is about 59500).
As a query I can select from the view, run a query similar to the view's definition, or use an OR (all 3 should give the same results). I get 3 quite different EXPLAINs:
SELECT *
FROM item
WHERE id > 59000
AND code LIKE 'a%'
UNION SELECT *
FROM item
WHERE id > 59000
AND mmg_code LIKE '01%';
gives an EXPLAIN of
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | item | range | PRIMARY,code,id,id_mmg_code,id_code,code_id | PRIMARY | 4 | NULL | 508 | Using where |
| 2 | UNION | item | range | PRIMARY,id,mmg_code,id_mmg_code,id_code,mmg_code_id | PRIMARY | 4 | NULL | 508 | Using where |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using temporary |
while the following
SELECT *
FROM item
WHERE id > 59000
AND (code LIKE 'a%'
OR mmg_code LIKE '01%');
gives an EXPLAIN of
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | item | range | PRIMARY,code,id,mmg_code,id_mmg_code,id_code,code_id,mmg_code_id | PRIMARY | 4 | NULL | 508 | Using where |
and the following
SELECT *
FROM fred
WHERE id > 59000;
gives an EXPLAIN of
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4684 | Using where |
| 2 | DERIVED | item | range | code,code_id | code | 34 | NULL | 1175 | Using index condition |
| 3 | UNION | item | range | mmg_code,mmg_code_id | mmg_code | 27 | NULL | 3509 | Using index condition |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | Using temporary |
As you can see, because indexes have already been used inside the view, the choice of indexes available when selecting from the view is affected.
The best index is potentially the primary key, but the query against the view doesn't use it.
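Applied to the tables from the question, the same idea means putting the filter into each branch by hand rather than filtering the view from outside. A sketch, assuming the second branch of the view is meant to read table2:

```sql
-- Each branch carries its own ID filter, so each can be resolved
-- through the primary key of its table.
SELECT ID, attribute1, attribute2 FROM table1 WHERE ID = 100
UNION
SELECT ID, attribute1, attribute2 FROM table2 WHERE ID = 100;

-- Compare the plan with that of: EXPLAIN SELECT * FROM myview WHERE ID = 100;
EXPLAIN
SELECT ID, attribute1, attribute2 FROM table1 WHERE ID = 100
UNION
SELECT ID, attribute1, attribute2 FROM table2 WHERE ID = 100;
```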

"Can I use advantage of index of primary key (in sql in general and for mysql in my case), when I execute query like following query?"
MySQL will consider using the indexes that have been defined on the underlying tables. However, you cannot create an index on the view itself. See "Restrictions on Views" in the MySQL manual for further explanation.
Running EXPLAIN on a query that uses the view will show the keys being considered in the possible_keys column.
EXPLAIN select * from myview where ID = 100;

Increment an integer to be used as an index for new MySQL entries. Is there a better way?

I have a MySQL table of unique usernames and other properties. I read it's faster to look things up by an integer index (instead of a string) so I want each username to have a unique integer.
Currently, I have an extra table in my database which contains a single value: the "free" index. When a new user is created, it takes this value and pairs it with the username, then increments the value in the table.
It seems awkward having this extra table and performing commands on it. Is there a better way to do this?

You might want to look at the AUTO_INCREMENT option, which can be used like this:
CREATE TABLE users (
id INT NOT NULL AUTO_INCREMENT,
name CHAR(30) NOT NULL,
email CHAR(30) NOT NULL,
PRIMARY KEY (id)
);
Afterwards, you may add new users without mentioning the id field:
INSERT INTO users (name, email) VALUES ('John Dow', 'jdow@email.com'), ('Alice Dow', 'adow@gmail.com');
Nevertheless, when you SELECT from users, you will see id autoincrementing:
SELECT * from users;
will print out:
+----+-----------+----------------+
| id | name | email |
+----+-----------+----------------+
| 1 | John Dow | jdow@email.com |
| 2 | Alice Dow | adow@gmail.com |
+----+-----------+----------------+
However, most likely you won't see any significant performance improvement with numeric ids.
Non-username ids are convenient if users are allowed to change their usernames. Otherwise, I would stick to the username as the id and primary index.

Why don't you store the integer index in the same table? So it'd be like another column which is auto-increasing and you could use that for lookup.
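A minimal sketch of that suggestion, assuming an existing table keyed by username (the accounts table and its columns are hypothetical, not taken from the question):

```sql
-- Hypothetical existing table keyed by username.
CREATE TABLE accounts (
  username VARCHAR(30) NOT NULL,
  email    VARCHAR(255) NOT NULL,
  PRIMARY KEY (username)
) ENGINE=InnoDB;

-- Add the integer id to the same table. An AUTO_INCREMENT column
-- must be indexed, hence the unique key in the same statement.
ALTER TABLE accounts
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD UNIQUE KEY uq_accounts_id (id);

INSERT INTO accounts (username, email) VALUES ('alice', 'alice@example.com');

-- The id MySQL generated for the row just inserted on this connection.
SELECT LAST_INSERT_ID();

-- Look the user up by the integer id.
SELECT username FROM accounts WHERE id = LAST_INSERT_ID();
```

This removes the need for the separate "free index" table, since MySQL hands out the next value itself.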

Slow execution of a subquery when no matches

Please note that I have asked this question on dba.stackexchange.com, but I thought I'd post it here too:
In MySQL, I have two basic tables - Posts and Followers:
CREATE TABLE Posts (
id int(11) NOT NULL AUTO_INCREMENT,
posted int(11) NOT NULL,
body varchar(512) NOT NULL,
authorId int(11) NOT NULL,
PRIMARY KEY (id),
KEY posted (posted),
KEY authorId (authorId,posted)
) ENGINE=InnoDB;
CREATE TABLE Followers (
userId int(11) NOT NULL,
followerId int(11) NOT NULL,
PRIMARY KEY (userId,followerId),
KEY followerId (followerId)
) ENGINE=InnoDB;
I have the following query, which seems to be optimized enough:
SELECT p.*
FROM Posts p
WHERE p.authorId IN (SELECT f.userId
FROM Followers f
WHERE f.followerId = 9
ORDER BY authorId)
ORDER BY posted
LIMIT 0, 20
EXPLAIN output:
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| 1 | PRIMARY | p | index | NULL | posted | 4 | NULL | 20 | Using where |
| 2 | DEPENDENT SUBQUERY | f | unique_subquery | PRIMARY,followerId | PRIMARY | 8 | func,const | 1 | Using index; Using where |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
When followerId is a valid id (meaning, it actually exists in both tables), the query execution is almost immediate. However, when the id is not present in the tables, the query only returns results (empty set) after a 7 second delay.
Why is this happening? Is there some way to speed up this query for cases where there are no matches (without having to do a check ahead of time)?

Is there some way to speed up this query ...???
Yes. You should do two things.
First, you should use EXISTS instead of IN (cross reference: SQL Server IN vs. EXISTS Performance). It'll speed up the cases where there is a match, which will come in handy as your data set grows (it may be fast enough now, but that doesn't mean you shouldn't follow best practices, and in this case EXISTS is a better practice than IN).
Second, you should modify the keys on your second table just a little bit. You were off to a good start using the compound key on (userId,followerId), but in terms of optimizing this particular query, you need to keep in mind the "leftmost prefix" rule of MySQL indexes:
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. http://dev.mysql.com/doc/refman/5.6/en/multiple-column-indexes.html
What your query execution plan from EXPLAIN is telling you is that MySQL reads Posts in posted order and, for each post, probes Followers through its primary key (userId,followerId) to check the given followerId. Think of it like saying "Show me all the possible matches, then reduce that down to just the ones that match followerId = {}". When nothing matches, that means walking the whole of Posts before an empty set can be returned.
If you replace your followerId key with a compound key (followerId,userId), you should be able to quickly zoom in to just the user ids associated with a given followerId and do the existence check against those.
I wish I knew how to explain this better... it's kind of a tough concept to grasp until you have an "Aha!" moment and it clicks. But if you look into the leftmost prefix rules on indexes, and also change the key on followerId to be a key on (followerId,userId), I think it'll speed it up quite a bit. And if you use EXISTS instead of IN, that'll help you maintain that speed even as your data set grows.
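Put together, the two suggestions could look like the sketch below. The index name followerId_userId is made up, and whether the plan actually improves should be checked with EXPLAIN on real data.

```sql
-- Replace the single-column key with the compound key described above.
ALTER TABLE Followers
  DROP KEY followerId,
  ADD KEY followerId_userId (followerId, userId);

-- EXISTS form of the original query.
SELECT p.*
FROM Posts p
WHERE EXISTS (
  SELECT 1
  FROM Followers f
  WHERE f.followerId = 9
    AND f.userId = p.authorId
)
ORDER BY p.posted
LIMIT 0, 20;
```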

Try this one:
SELECT p.*
FROM Posts p
inner join Followers f On f.userId = p.authorId
WHERE f.followerId = 9
ORDER BY posted
LIMIT 0, 20

Is there any gain from storing comments and replies on separate tables instead of a unary relation on a comment table?

Choice 1:
comments {commentid,replyto,comment}
//replyto will be null on many posts
Choice 2:
comments {commentid,comment}
replies {replyid, replyto, reply}
It looks like a matter of choice rather than linear benefit analysis at the moment.

The first option looks like the simple one, but the problem is that you're building a tree structure in SQL, and SQL handles hierarchical data poorly (MySQL only gained recursive queries in 8.0).
Not recommended - ever
CREATE TABLE comment (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  reply_to INT UNSIGNED NULL,  -- NULL for a top-level comment
  comment TEXT,
  CONSTRAINT FK_comment_reply_to FOREIGN KEY (reply_to) REFERENCES comment (id)
    ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;
Recommended - if you want a tree 2 levels deep
If you build it using 2 tables
CREATE TABLE main_post (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB;
CREATE TABLE reply (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  reply_to INT UNSIGNED,  -- the main_post this reply belongs to
  body TEXT,
  CONSTRAINT FK_reply_reply_to FOREIGN KEY (reply_to) REFERENCES main_post (id)
    ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB;
Then you are building a much simpler structure that can be easily queried in SQL because the tree is only 1 level deep.
For this reason I'd recommend choice number 2.
Alternatives for deeper trees
If you want a hierarchical structure, I'd look at nested sets instead; see:
http://www.pure-performance.com/2009/03/managing-hierarchical-data-in-sql/
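As a rough idea of what nested sets look like (a sketch only; the comment_tree table, its column names and the sample rows are hypothetical): every node stores a left/right pair, and a node's interval encloses the intervals of all its descendants, so a whole subtree can be read with one range condition.

```sql
-- Hypothetical nested-set table.
CREATE TABLE comment_tree (
  id   INT UNSIGNED NOT NULL PRIMARY KEY,
  lft  INT NOT NULL,
  rgt  INT NOT NULL,
  body TEXT,
  KEY idx_lft_rgt (lft, rgt)
) ENGINE=InnoDB;

-- Tree: 1 -> 2 -> 3, and 1 -> 4
INSERT INTO comment_tree (id, lft, rgt, body) VALUES
  (1, 1, 8, 'main post'),
  (2, 2, 5, 'first reply'),
  (3, 3, 4, 'reply to first reply'),
  (4, 6, 7, 'second reply');

-- All descendants of node 1, at any depth, in display order.
SELECT child.id, child.body
FROM comment_tree AS parent
JOIN comment_tree AS child
  ON child.lft > parent.lft AND child.rgt < parent.rgt
WHERE parent.id = 1
ORDER BY child.lft;
```

The trade-off is on the write side: inserting a node means renumbering the lft and rgt values of the nodes that come after it.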

In fact this is not only a matter of choice, but a deliberate decision. Relational databases are not good at solving problems of a hierarchical nature. There have been tons of discussions, articles, and even books about that, so let's narrow the problem to your case.
The second choice would work fine ONLY if you allowed replies to comments but not replies to replies, so that the tree has at most 2 levels. That might be OK, but in that case a better solution would be to keep everything in the COMMENTS table and add two columns: THREAD_ID (all comments with the same THREAD_ID belong to the same thread) and SEQ_NUM (or simply a DATE, which tells us which comment came first). A similar way of organising comments is implemented here on SO.
The first choice is quite simple and generic, but it implements recursion with all its drawbacks. Let's stop for a moment and think: we are actually NOT building a tree, but a 'forest'. We will have many comment threads, and every single thread will be a separate tree, which is a relatively small amount of data to organise. In that case I would add a THREAD_ID column to the COMMENTS table and use only that table (it would also be good to put a composite index on the COMMENTS table containing the THREAD_ID and COMMENTID columns, in exactly that order).
So, based on the above, I would choose "choice 1".
The next decision is where to do the processing and build the comment tree. I would just fetch all the comments of a thread from the table and organise them on the controller (MVC) side, e.g. in Java or C++. Traversing the list of comments and building the tree in main memory (using objects and pointers, or hash tables) is easy. It is also a good option because a single thread contains only a small number of nodes (comments and replies).
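The single-table layout with a thread column might be sketched as follows. Column types and the index name are assumptions; only the column names COMMENTID, THREAD_ID and REPLYTO come from the discussion above.

```sql
CREATE TABLE comments (
  commentid INT UNSIGNED NOT NULL AUTO_INCREMENT,
  thread_id INT UNSIGNED NOT NULL,
  replyto   INT UNSIGNED NULL,      -- NULL for the first comment of a thread
  comment   TEXT,
  PRIMARY KEY (commentid),
  KEY idx_thread_comment (thread_id, commentid)  -- THREAD_ID first, then COMMENTID
) ENGINE=InnoDB;

-- Fetch one whole thread in a single indexed query; the application
-- then links each row to its parent via replyto to build the tree.
SELECT commentid, replyto, comment
FROM comments
WHERE thread_id = 42
ORDER BY commentid;
```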

I would say it depends very much on what you're trying to achieve. From what I understand, if you want a tree of at most 2 levels you should go with choice 2; if you want a deeper tree, go with choice 1 with the following modification:
Choice 1: comments {commentid, toplevelcommentid | thread | (whatever parent this comment and possibly other comments are linked to, so you can easily recreate the structure afterwards), replyto, comment}
When displaying results, select everything whose commentid or toplevelcommentid equals a given value and order by commentid, so you can recreate the structure with a single SELECT query.

1) Queries against the TEXT table were always 3 times slower than those against the VARCHAR table (averages: 0.10 seconds for the VARCHAR table, 0.29 seconds for the TEXT table). The difference is 100% repeatable.
CREATE TABLE varcharTable (a varchar(255) NOT NULL, PRIMARY KEY (a)) ENGINE=MyISAM;
CREATE TABLE textTable (a text NOT NULL, PRIMARY KEY (a(255))) ENGINE=MyISAM;
mysql> explain SELECT SQL_NO_CACHE count(*) from varcharTable where a LIKE "n%";
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | varcharTable | range | PRIMARY | PRIMARY | 257 | NULL | 5882 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
mysql> explain SELECT SQL_NO_CACHE count(*) from T where a LIKE "n%";
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | T | range | PRIMARY | PRIMARY | 257 | NULL | 5882 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
Both queries use the PRIMARY key for a range scan, but only the VARCHAR table shows "Using index" in the Extra column, meaning the query is answered from the index alone. The TEXT table's key covers only a 255-character prefix, so the rows themselves must be read as well.
2) No search is required on the comments table, so it does not need to be queried by content, and since comments are long, TEXT is the preferred type for them.
Because TEXT columns are slower to query, put the comments (not searched, and a drag on performance) and the replies in separate tables. That way the replies table performs well, and the comments table is kept purely for storage, with no searches performed on it.
Conclusion: put the comments in a separate table.
