Recommended MySQL INDEX for storing domain names - mysql

I'm trying to store about 100 Million domain names in a MySQL database, but I can't figure out the right INDEX method to use on the domain names.
The issue being that LIKE queries will also be executed:
SELECT id FROM domains WHERE domain LIKE '%.example.com'
or
SELECT id FROM domains WHERE domain LIKE 'example.%'
If it makes things easier, '%example%' is not a requirement, just a nice-to-have.
What would be the proper index to use? Left to right (example.%) should be relatively straightforward, but right to left (%.example.com) is problematic, and it is the most common query.
I'm using MariaDB 10.3 on Linux. The DB runs on a PCIe SSD; lookup times longer than 10 seconds should be considered "unacceptable".

You can add one virtual (generated) column (rdomain) to your table whose expression stores the domain name in reverse order, i.e. REVERSE(domain). A suffix search then becomes a search from the start of the string, which an index can serve: a search for '%.mydomain.com' becomes WHERE rdomain LIKE REVERSE('%.mydomain.com').
the table
CREATE TABLE `myreverse` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`domain` varchar(64) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_domain` (`domain`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
add the column
ALTER TABLE myreverse
ADD COLUMN rdomain VARCHAR(64) AS (REVERSE(domain)),
ADD KEY idx_rdomain (rdomain);
insert some data
INSERT INTO `myreverse` (`id`, `domain`)
VALUES
(2, 'img.google.com'),
(3, 'w3.google.com'),
(1, 'www.coogle.com'),
(4, 'www.google.de'),
(5, 'www.mydomain.com');
see the data
mysql> SELECT * from myreverse;
+----+------------------+------------------+
| id | domain | rdomain |
+----+------------------+------------------+
| 1 | www.coogle.com | moc.elgooc.www |
| 2 | img.google.com | moc.elgoog.gmi |
| 3 | w3.google.com | moc.elgoog.3w |
| 4 | www.google.de | ed.elgoog.www |
| 5 | www.mydomain.com | moc.niamodym.www |
+----+------------------+------------------+
5 rows in set (0.01 sec)
mysql>
now you can query with reverse order and MySQL can use the index.
query
mysql> select * from myreverse WHERE rdomain like REVERSE('%.google.com');
+----+----------------+----------------+
| id | domain | rdomain |
+----+----------------+----------------+
| 3 | w3.google.com | moc.elgoog.3w |
| 2 | img.google.com | moc.elgoog.gmi |
+----+----------------+----------------+
2 rows in set (0.00 sec)
mysql>
Here you can see that the optimizer uses the index.
mysql> EXPLAIN select * from myreverse WHERE rdomain like REVERSE('%.google.com');
+----+-------------+-----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | myreverse | NULL | range | idx_rdomain | idx_rdomain | 195 | NULL | 2 | 100.00 | Using where |
+----+-------------+-----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql>
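To see why the reversed column helps, here is a minimal Python sketch (not MySQL code) that models idx_rdomain as a sorted list of reversed names. It assumes a plain code-point ordering rather than a real collation, and the names rindex, rkeys, and suffix_search are made up for illustration. The point is that a suffix match on domain turns into a contiguous range of keys in the reversed index.

```python
import bisect

domains = [
    "www.coogle.com",
    "img.google.com",
    "w3.google.com",
    "www.google.de",
    "www.mydomain.com",
]

# Model of idx_rdomain: reversed names kept in sorted order, like B-tree keys.
# (Assumption: simple code-point ordering, not a MySQL collation.)
rindex = sorted((d[::-1], d) for d in domains)
rkeys = [key for key, _ in rindex]


def suffix_search(suffix):
    """Return the domains ending in `suffix` via a prefix range scan."""
    prefix = suffix[::-1]  # '.google.com' -> 'moc.elgoog.'
    lo = bisect.bisect_left(rkeys, prefix)  # first key >= prefix
    # Upper bound: the prefix followed by the largest code point.
    hi = bisect.bisect_left(rkeys, prefix + "\U0010ffff")
    return [rindex[i][1] for i in range(lo, hi)]


print(suffix_search(".google.com"))
```

Only the keys between lo and hi are touched, which is the same reason the EXPLAIN output reports a range access on idx_rdomain instead of a full scan.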

I'm not sure an index would help you here. If you can't change the database, your options seem limited. One thing you could do, if you're running a subdomain query and a domain query back to back, is run the subdomain query first. That should help reduce the number of rows the domain query has to cover.
It would definitely help to split the name into subdomain and domain and store them in separate columns in the database, with an index on each. Then you could query only the subdomains or only the domains, which should speed things up. And if there are a lot of repeating values, you should normalize those fields to remove the repetition and speed up queries even more.
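A naive sketch of such a split, in Python, might look like the following. The function name split_domain and the registrable_labels parameter are hypothetical, and the fixed "last two labels" rule is an assumption that breaks for suffixes such as co.uk; a real implementation would likely need a public-suffix list.

```python
def split_domain(name, registrable_labels=2):
    """Split 'img.google.com' into ('img', 'google.com').

    Assumption: the registrable domain is always the last
    `registrable_labels` labels, which is wrong for e.g. 'example.co.uk'.
    """
    labels = name.split(".")
    domain = ".".join(labels[-registrable_labels:])
    subdomain = ".".join(labels[:-registrable_labels])
    return subdomain, domain


print(split_domain("img.google.com"))
print(split_domain("google.com"))
```

The two parts could then be stored in separate indexed columns, so that both kinds of lookup become equality or prefix matches.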

Related

How to optimize a query in a one to many scenario when i have a fulltext index on a column?

My paintings table looks like this
| id | artist_id | name
| 1 | 7 | landscape painting
| 2 | 15 | flowers painting
| 3 | 15 | scuffed painting
The artist_id is indexed and the name has a fulltext index on it. The table contains about 10M records.
Queries that match the name against some keywords are OK:
select * from `paintings` where match (`name`) against ('+scuffed*' in boolean mode) limit 10;
10 rows in set (0.04 sec)
But sometimes I want to check only for paintings done by a certain artist:
select * from `paintings` where `artist_id` = 15 and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
7 rows in set (0.40 sec)
As you can see it takes 10x longer to run the query when I include the artist_id. I also tried running a nested query in order to get only paintings that have specific ids:
select * from `paintings` where id in (SELECT id from paintings where artist_id = 15) and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
7 rows in set (0.44 sec)
This ended up being even slower.
How can this query be optimized to work well with and without a where clause on the artist_id?
Thank you!
You need to create a COMPOSITE INDEX KEY consisting of columns (id and artist_id) to speed up your query:
mysql> ALTER TABLE paintings ADD INDEX cmp_id_artist_id (id, artist_id);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW INDEXES FROM paintings;
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| paintings | 1 | cmp_id_artist_id | 1 | id | A | 3 | NULL | NULL | | BTREE | | | YES | NULL |
| paintings | 1 | cmp_id_artist_id | 2 | artist_id | A | 3 | NULL | NULL | YES | BTREE | | | YES | NULL |
| paintings | 1 | ftindex_name | 1 | name | NULL | 3 | NULL | NULL | YES | FULLTEXT | | | YES | NULL |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.00 sec)
And now you can test again your 2nd query:
mysql> select * from `paintings` where `artist_id` = 15 and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
+----+-----------+------------------+
| id | artist_id | name |
+----+-----------+------------------+
| 3 | 15 | scuffed painting |
+----+-----------+------------------+
1 row in set (0.00 sec)
Run EXPLAIN SELECT ... to see how the query is performed.
I think your second query is as optimal as it can be. MySQL will perform the MATCH first, then check any other conditions.
You could add INDEX(artist_id), but I don't think that will help.
More
Let me provide another approach, then talk through the pros and cons.
PRIMARY KEY(artist_id, id),
INDEX(id),
FULLTEXT(name),
-- When searching _only_ by `name`:
WHERE MATCH(name) AGAINST('+string' IN BOOLEAN MODE)
-- When searching by artist _and_ name:
WHERE artist_id = 123
AND name LIKE '%string%';
Comments:
Without the artist, you get the full speed of FULLTEXT.
With artist, you abandon FULLTEXT (because it seems not to work for your dataset) and switch to the composite index, plus need to look only at those rows with the given artist_id.
id needs to be indexed to keep AUTO_INCREMENT happy; a simple INDEX(id) suffices.
The PRIMARY KEY must be unique; including id suffices.
Starting the PK with artist_id clusters the rows for a given artist together, thereby speeding the query up (some for small tables, a lot for big tables).
This requires that you create two different queries, but that is probably not a big problem.
For an artist that has a lot of works, it may be 'too' slow -- because it is failing to use FULLTEXT.
FULLTEXT and LIKE have different rules for what will/won't match. So you may get different answers.
Another recent Question discussed why AGAINST('... +the ...' IN BOOLEAN MODE) never finds any rows.
AGAINST("+color +purple") will find purple color, but LIKE '%color purple%' won't.
Be careful about benchmarking any approaches -- the performance will depend on how many works an artist has, and many other un-obvious differences.
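The difference in matching rules between the two predicates can be illustrated with a rough Python model. This is only a sketch: it ignores stopwords, minimum token length, and the other boolean-mode operators, and the function names boolean_all_words and like_contains are invented for the example.

```python
import re


def boolean_all_words(text, required):
    """Rough model of AGAINST('+w1 +w2' IN BOOLEAN MODE):
    every required word must occur as a whole word, in any order."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(w.lower() in words for w in required)


def like_contains(text, needle):
    """Rough model of LIKE '%needle%' under a case-insensitive
    collation: the needle must appear as one contiguous substring."""
    return needle.lower() in text.lower()


title = "purple color"
print(boolean_all_words(title, ["color", "purple"]))  # True
print(like_contains(title, "color purple"))           # False
```

So switching between the two forms of the query can change which rows come back, not only how fast they come back.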

Why the primary key is not the clustered index if another non clustered index is added in MariaDB

Hello, I have a table created by the following query (MariaDB version 10.5.9):
CREATE TABLE `test` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`status` varchar(60) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `test_status_IDX` (`status`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4
I always thought that the primary key is by default the clustered index which also defines the order of the rows in the table but here it seems that the index on the status is picked as the clustered. Why is this happening and how can I change it?
MariaDB [test]> select * from test;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
6 rows in set (0.001 sec)
It is not safe to assume that the results of a SELECT will be ordered by any column, in any DB engine. You should always use ORDER BY col [ASC|DESC] if you expect sorting to happen. I see records being displayed in the order they were added, but that can change after deletions, insertions, etc., and should not be relied on.
(I am going to cite MySQL docs in my answer but in the context of this question, the information applies to MariaDB as well.)
First of all, let's talk about index extensions. The InnoDB engine automatically creates an additional (composite) index behind the scenes whenever you define a secondary index (i.e. any index that is not the clustered index). That is called an index extension.
This extra index contains the columns you defined in your original secondary index (in the same order) with the columns of the primary key added after them. So, in your example, InnoDB creates an index extension for test_status_IDX (let's call it X), with columns (status, id).
Now let's look at the query select * from test;. There is no WHERE clause here, so all the optimizer needs to do to satisfy this query is fetch all columns for all rows of the table. This boils down to fetching status & id since there are no other columns in the table. These exact fields happen to be stored within the extended index X. This makes index X a covering index for this query. A covering index is an index that, given a query, can fully produce the results of the query without having to read any actual data rows.
Therefore, the optimizer reads & returns the values needed for the result of the query from index X, in the order that they appear there, which is by status, hence the order you observed.
To further demonstrate and extend (pun intended) this point, let's reproduce the example (tested with MariaDB 10.4):
1. First create the table & add the rows
CREATE TABLE foo (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
status varchar(60) DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO foo VALUES
(1, 'or'),
(2, 'cfrc'),
(3, 'test'),
(4, 'yes'),
(5, 'hjr'),
(6, 'verve');
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
2. Now let's add the secondary index and check the order again
CREATE INDEX secondary_idx ON foo (status);
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
As described above, the rows are returned in the order they appear in the (extended) secondary_idx
3. Now let's drop the index and re-add it with a prefix length of 2 bytes. This means that the index will not store the full value of the column but only its first two bytes, which means the extended index is no longer a covering index because it cannot fully produce the results of the query. Thus the clustered index will be used
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status(2));
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
4. Let's showcase this behaviour in another way. Here we will retain the original secondary index (without a prefix length) but we will add a 3rd column to the table. This will once again render the secondary index a non covering index (because it does not contain the 3rd column), therefore, the clustered index will be used here as well.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status);
ALTER TABLE foo ADD bar integer NOT NULL;
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 1 | or | 0 |
| 2 | cfrc | 0 |
| 3 | test | 0 |
| 4 | yes | 0 |
| 5 | hjr | 0 |
| 6 | verve | 0 |
+----+--------+-----+
Adding bar to the index (or dropping it from the table) will again make the query use the secondary index.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status, bar);
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 2 | cfrc | 0 |
| 5 | hjr | 0 |
| 1 | or | 0 |
| 3 | test | 0 |
| 6 | verve | 0 |
| 4 | yes | 0 |
+----+--------+-----+
You can also use EXPLAIN on all of the SELECT statements above to see which index is used at each stage.
@aprsa is right: I falsely assumed that the results would be in the same order as the clustered index, but in this case (using InnoDB) the status index is used to evaluate the query, which is why the output appears to be 'sorted' by status. If I select only the id, the primary index is used and the results appear to be 'sorted' by id. In another engine this might not be true.
That particular table is composed of 2 BTrees:
The data, sorted by the PRIMARY KEY. Yes, it is clustered and is ordered 1,2,3,...
The secondary index, sorted by status. Each secondary index contains a copy of the PK so that it can reach into the other BTree to get the rest of the columns (not that there are any more!). That is, this BTree is equivalent to a 2-column table with PRIMARY KEY(status) plus an id.
Note how the output is in status order. I have to assume it decided to simply read the secondary index in its order to provide the results.
Yes, you must specify an ORDER BY if you want a particular ordering. You must not assume the details I just discussed. Who knows, tomorrow there may be something else going on, such as an in-memory "hash" that has the information scrambled in some other way!
(This Answer applies to both MySQL and MariaDB. However, MariaDB is already playing a game with hashing that MySQL has not yet picked up. Be warned! Or simply add an ORDER BY.)
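The two-BTree picture from the answers above can be modeled in a few lines of Python. This is only an illustration of the ordering, not of InnoDB internals; the names clustered, secondary, and scan_secondary are made up for the sketch.

```python
rows = [(1, "or"), (2, "cfrc"), (3, "test"), (4, "yes"), (5, "hjr"), (6, "verve")]

# Clustered index: the full rows, ordered by the primary key (id).
clustered = sorted(rows, key=lambda r: r[0])

# Secondary index on status: (status, id) entries, ordered by status.
# The id is carried along so the row can be found in the clustered index.
secondary = sorted((status, id_) for id_, status in rows)


def scan_secondary():
    """Answer SELECT id, status by walking the secondary index only.
    Possible because (status, id) covers every column of the table."""
    return [(id_, status) for status, id_ in secondary]


print(scan_secondary())
print(clustered)
```

Walking the secondary index yields the ids in the order 2, 5, 1, 3, 6, 4, which is the order seen in the question, while walking the clustered index yields 1 through 6. Neither order is guaranteed without an ORDER BY.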

MySQL doesn't use my index while it declares it will in explain statement

I recently encountered a problem involving the MySQL DBMS.
The Table is like this:
CREATE TABLE `orders` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(60) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`sex` enum('男','女') DEFAULT NULL,
`amount` float(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_i` (`name`),
KEY `sex` (`sex`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8
As shown above, I created a single-column index on the column name.
I want to perform a range query on name, and the explain statement is
mysql> explain select * from orders where name like '王%';
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| 1 | SIMPLE | orders | NULL | range | name_i | name_i | 183 | NULL | 20630 | 100.00 | Using index condition; Using MRR |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
1 row in set, 1 warning (0.10 sec)
So it should use the index name_i and finish the query in a flash (my classmate's run took 0.07 sec).
However, this is how it turned out:
| 4998119 | 王缝 | 27 | 男 | 159.21 |
| 4998232 | 王求葬 | 19 | 男 | 335.65 |
| 4998397 | 王倘予 | 49 | 女 | 103.39 |
| 4998482 | 王厚 | 77 | 男 | 960.69 |
| 4998703 | 王啄淋 | 73 | 女 | 458.85 |
| 4999106 | 王般埋 | 70 | 女 | 700.98 |
| 4999359 | 王胆具 | 31 | 女 | 362.83 |
| 4999510 | 王铁脾 | 31 | 女 | 973.09 |
| 4999880 | 王战万 | 59 | 女 | 127.28 |
| 4999928 | 王忆 | 42 | 女 | 72.47 |
+---------+--------+------+------+--------+
11160 rows in set (3.43 sec)
And it seems not to use the index at all, because the data is sorted by the primary key id rather than by the name column (besides, it is far too slow compared to 0.07 sec).
Has anyone encountered the problem too?
What percentage of the table is "Kings" (王) ? If it is more than about 20%, it will choose to do a table scan instead of use the index. (And this may actually be faster.) (Based on Comments, 0.22% of the table is Kings.)
EXPLAIN and the execution of the query are separate things. Although I don't remember proving this, it is possible that the EXPLAIN might say one thing, but the query would work another way.
Do you have 5 million rows in the table? Was the cache 'cold' when you first ran it? And it had to fetch 11,160 rows from disk? Then the second time, all was in cache, so much faster?
Was the table loaded in "alphabetical" (or whatever the Chinese word for that is) order? If so, there is a good chance the ids and the names are in the same order?
Apparently you are using utf8_general_ci COLLATION? Maybe it does not sort Chinese well. (Provide a test case; I'll do some tests.)
I do not understand why it mentioned MRR.
I, too, am baffled by "1 min 32.24sec". The ORDER BY name should have further encouraged the Optimizer to use INDEX(name). Can you turn on the "Optimizer trace"?
To really see whether it used the index, do this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
If the big number(s) look like the number of rows in the table, then it did a table scan. If they look more like 11160, then they used the index.

MySQL indexing columns vs joining tables

I am trying to figure out the most efficient way to extract values from database that has the structure similar to this:
table test:
int id (primary, auto increment)
varchar(50) stuff,
varchar(50) important_stuff;
where I need to do a query like
select * from test where important_stuff like 'prefix%';
The size of the entire table is approximately 10 million rows, however there are only about 500-1000 distinct values for important_stuff. My current solution is indexing important_stuff however the performance is not satisfactory. Will it be better to create a separate table that will match distinct important_stuff to a certain id, which will be stored in the 'test' table and then do
select b.* from (select id from stuff_lookup where important_stuff like 'prefix%') a join test b on b.stuff_id = a.id;
or this:
select * from test where stuff_id in (select id from stuff_lookup where important_stuff like 'prefix%');
What is the best way to optimize things like that?
How big is innodb_buffer_pool_size? How much RAM is available? The former should be about 70% of the latter. You'll see in a minute why I bring up this setting.
Of your 3 suggested SELECTs, the original one will work as well as the two complex ones. In some other case, the complex formulation might work better.
INDEX(important_stuff) is the 'best' index for
select * from test where important_stuff like 'prefix%';
Now, let's study how that query works with that index:
Reach into the BTree index, starting at 'prefix'. (Effort: Virtually instantaneous)
Scan forward for, say, 1000 entries. That will be about 10 InnoDB blocks (16KB each). Each entry will have the PRIMARY KEY (id). (Effort: <= 10 disk hits)
For each entry, look up the row (so you can get "*"). That's 1000 PK lookups in the BTree that contains both the PK and the data. At best, they might all be in 10 blocks. At worst, they could be in 1000 separate blocks. (Effort: 10-1000 blocks)
Total Effort: ~1010 blocks (worst case).
A standard spinning disk can handle ~100 reads/second. So, we are looking at about 10 seconds.
Now, run the query again. Guess what; all those blocks are now in RAM (cached in the "buffer_pool", which is hopefully big enough for all of them). And it runs in less than 1 second.
OPTIMIZE TABLE was not necessary! It was not a statistics refresh, but rather caching that sped up the query.
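The back-of-the-envelope estimate above can be written out as a small Python calculation. The function name estimate_seconds and its parameters are hypothetical; the block counts and the 100 reads/second figure are the rough numbers quoted in this answer, not measurements.

```python
def estimate_seconds(index_blocks, row_blocks, reads_per_second, cached=False):
    """Estimate query time from the number of blocks read from disk.
    If everything is already in the buffer pool, disk time is ~0."""
    if cached:
        return 0.0
    return (index_blocks + row_blocks) / reads_per_second


# Cold cache, worst case: 10 index blocks + 1000 scattered row blocks.
worst = estimate_seconds(10, 1000, 100)
# Cold cache, best case: the 1000 rows happen to sit in only 10 blocks.
best = estimate_seconds(10, 10, 100)
# Second run: all blocks are cached.
warm = estimate_seconds(10, 1000, 100, cached=True)

print(worst, best, warm)
```

The gap between the worst and best cold-cache figures comes entirely from how scattered the matching rows are in the clustered index, and the warm figure shows why a repeated run looks so much faster.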
I'm not a MySQL user, but I ran some tests on my local database. I added 10 million rows as you described, and the distinct values from the third column load quite fast. These are my results.
mysql> describe bigtable;
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| stuff | varchar(50) | NO | | NULL | |
| important_stuff | varchar(50) | NO | MUL | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
3 rows in set (0.03 sec)
mysql> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 10000089 |
+----------+
1 row in set (2.87 sec)
mysql> select count(distinct important_stuff) from bigtable;
+---------------------------------+
| count(distinct important_stuff) |
+---------------------------------+
| 1000 |
+---------------------------------+
1 row in set (0.01 sec)
mysql> select distinct important_stuff from bigtable;
....
| is_987 |
| is_988 |
| is_989 |
| is_99 |
| is_990 |
| is_991 |
| is_992 |
| is_993 |
| is_994 |
| is_995 |
| is_996 |
| is_997 |
| is_998 |
| is_999 |
+-----------------+
1000 rows in set (0.15 sec)
Important information is that I refreshed statistics on this table (before this operation I needed ~10 seconds to load these data).
mysql> optimize table bigtable;

How to delete Index?

Further to my previous question (qv) ...
I have already created the table(s) and populated them with data. How do I set the prefix length to a very large value, or remove it altogether, so that I don't have this problem? There will never be more than a few thousand rows, and only this application is running on a dedicated PC, so performance is not an issue.
Solution, please for either PhpMyAdmin, or just MySQL command line.
Update: Can I just delete this index (or make it infinitely long)?
Hmmm, I would prefer to keep the unique index if I can. So, how to make it infinitely long?
Or should I redefine my text fields as VARCHAR with a length limit? (I do know the maximum possible length of the primary key.)
mysql> describe tagged_chemicals;
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| bar_code | text | NO | | NULL | |
| rfid_tag | text | NO | UNI | NULL | |
| checked_out | char(1) | NO | | N | |
+-------------+---------+------+-----+---------+-------+
3 rows in set (0.04 sec)
It'll probably be something like
CREATE INDEX part_of_name ON customer (name(10));
(from the CREATE INDEX documentation: http://dev.mysql.com/doc/refman/5.0/en/create-index.html)
where in your case the column is rfid_tag and the length is 20.