Fulltext index not matching column list on MySQL

I use Contao 4.9 and have problems with a view that only arise when using MySQL 5.7.35 instead of MariaDB. The query that creates the view is the following:
CREATE
OR REPLACE VIEW `tl_news4ward_articleWithTags` AS
SELECT
tl_news4ward_article.*,
GROUP_CONCAT(tag) AS tags
FROM
tl_news4ward_article
LEFT OUTER JOIN tl_news4ward_tag ON (
tl_news4ward_tag.pid = tl_news4ward_article.id
)
GROUP BY
tl_news4ward_article.id
The interesting part of the tl_news4ward_article table was created as follows:
CREATE TABLE `tl_news4ward_article` (
`title` varchar(255) NOT NULL DEFAULT '',
`keywords` text,
`description` text,
PRIMARY KEY (`id`),
KEY `pid` (`pid`),
KEY `alias` (`alias`)
) ENGINE=MyISAM AUTO_INCREMENT=53 DEFAULT CHARSET=utf8
And tl_news4ward_tag:
CREATE TABLE `tl_news4ward_tag` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`pid` int(10) unsigned NOT NULL DEFAULT '0',
`tstamp` int(10) unsigned NOT NULL DEFAULT '0',
`tag` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `pid` (`pid`),
FULLTEXT KEY `tag` (`tag`)
) ENGINE=MyISAM AUTO_INCREMENT=103 DEFAULT CHARSET=utf8
When I run the query SELECT MATCH (keywords,title,description) AGAINST (' something1 something2' IN BOOLEAN MODE) AS score FROM tl_news4ward_article it just works, but if I run SELECT MATCH (keywords,title,description) AGAINST (' something1 something2' IN BOOLEAN MODE) AS score FROM tl_news4ward_articleWithTags I get an error:
#1191 - Can't find FULLTEXT index matching the column list
I can provide more information if needed. It just works on Mariadb.
DB Fiddle:
https://www.db-fiddle.com/f/3LN1cAM6aoohaB4a1q6oWb/1
EDIT2:
The code comes from this Contao module, which we think is supposed to work with MySQL: https://github.com/psi-4ward/news4ward/issues/106
EDIT3:
Clearer comparison fiddle: https://dbfiddle.uk/?rdbms=mysql_5.7&rdbms2=mariadb_10.3&fiddle=2f1ab88e65a8acb3bb992a1cf6fb4101
EDIT4:
The above query is simplified - as you can see in the Contao module, the tags column should be included in the score and match as well.

The error comes from not having an index on the combination of the 3 columns:
FULLTEXT(keywords,title,description)
You should move to InnoDB, which now has FULLTEXT. The two tables may need to use the same engine. Caution: there are several details of InnoDB's FULLTEXT that are different from MyISAM's. Here is a list: http://mysql.rjweb.org/doc.php/myisam2innodb#fulltext
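A minimal sketch of both changes (the index name ft_kw_title_desc is arbitrary, and this assumes nothing else in the schema depends on MyISAM):
ALTER TABLE tl_news4ward_article ENGINE=InnoDB;
ALTER TABLE tl_news4ward_tag ENGINE=InnoDB;
-- the 3-column index the MATCH() expects:
ALTER TABLE tl_news4ward_article
ADD FULLTEXT ft_kw_title_desc (keywords, title, description);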
MySQL's VIEWs are mostly syntactic sugar. While I have not encountered your specific problem, it seems clear that VIEW is getting in the way.
Did you try explicitly specifying the two 'algorithms' for expanding Views? https://dev.mysql.com/doc/refman/5.7/en/view-algorithms.html
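For reference, a sketch of how the algorithm can be pinned when (re)creating the view. Note that a view containing GROUP BY/GROUP_CONCAT cannot actually be merged, so MySQL warns and falls back to a temporary table, and that materialized result carries no FULLTEXT index:
CREATE OR REPLACE
ALGORITHM = MERGE   -- or TEMPTABLE / UNDEFINED
VIEW tl_news4ward_articleWithTags AS
SELECT tl_news4ward_article.*, GROUP_CONCAT(tag) AS tags
FROM tl_news4ward_article
LEFT OUTER JOIN tl_news4ward_tag ON tl_news4ward_tag.pid = tl_news4ward_article.id
GROUP BY tl_news4ward_article.id;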
If you are having performance problems with the query (once it works), it may be because of doing the JOIN before doing the filtering. If you encounter such, we can discuss that in a separate Question.
(From fiddle)
I prefer this way of getting the record, plus tags:
SELECT a.*,
( SELECT GROUP_CONCAT(tag)
FROM tl_news4ward_tag AS t
WHERE t.pid = a.id
) AS tags
FROM tl_news4ward_article AS a
But there is no way (with the current schema) to include tags in the MATCH.
4-col FT
If you decide that you must have tags in the FULLTEXT index, then put the comma-separated list of tags in the main table and get rid of the tags table. Then you need some stored procedures/functions to handle add-tag, del-tag, etc. Tip: FIND_IN_SET('foo', tags) will come in handy for testing for a particular tag.
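A sketch of that restructuring (the column and index names are illustrative):
ALTER TABLE tl_news4ward_article ADD COLUMN tags text;
ALTER TABLE tl_news4ward_article
ADD FULLTEXT ft_all (keywords, title, description, tags);
-- test for one particular tag without FULLTEXT:
SELECT * FROM tl_news4ward_article WHERE FIND_IN_SET('foo', tags);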

You can't use full-text indexes on a view.
MySQL doesn't allow any form of index on a view, just on its underlying tables. The reason for this is that MySQL only materializes a view when you select from it, since the underlying tables may change data all the time. If you had a view that returned 10 million rows, you'd have to apply a full-text index to it every time you selected from it, and that takes a lot of time.
So your only option is to add the index on the base table:
ALTER TABLE tl_news4ward_article
ADD FULLTEXT(keywords,title,description)
and then run the MATCH against the base table, joining its result back to the view:
SELECT *
FROM tl_news4ward_articleWithTags t1
INNER JOIN ( SELECT MATCH (keywords,title,description)
                    AGAINST (' something1 something2' IN BOOLEAN MODE) AS score,
                    id
             FROM tl_news4ward_article ) t2
        ON t1.id = t2.id;
id | title | keywords | description | tags | score | id
-: | :-------- | :-------- | :---------- | :--- | ----: | -:
1 | something | something | something | null | 0 | 1
db<>fiddle here

Views do not have indexes, so index hints do not apply. Use of index hints when selecting from a view is not permitted.

This actually might be a MySQL 5.7 regression.
The linked Contao Module's code is 8 years (2014) old:
https://github.com/psi-4ward/news4ward_related/blame/master/RelatedHelper.php#L18
MySQL 5.7 was released 6 years ago (2015).
The code was working in 5.6 but no longer works in 5.7:
https://dbfiddle.uk/?rdbms=mysql_5.6&rdbms2=mysql_5.7&fiddle=1d875703c4781dad878e64943057de5a
Not sure if this is a bug in MySQL 5.7 or an intended removal of that functionality.
The lowest-friction solution might really be to update the Contao module and remove the usage of the VIEW in the queries.

Related

How to optimize datetime comparisons in mysql in where clause

CONTEXT
I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I'm having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is approximately the actual query, gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably takes about 15 minutes as well. (this was surprising to me)
THE PROBLEM
How do I perform the updated_at > last_indexed_at in a more
reasonable time?
DETAILS
I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn't do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION(); got me 5.7.27-30-log
And probably most important,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions! 🥳
If you do something like this:
ALTER TABLE documents
ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
ADD INDEX index_documents_on_dirty(dirty);
...and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
...what results do you get?
Hopefully, we're moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE) from Read time to Write time.
Add a covering INDEX
If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)
I am assuming both of those are columns in the same table; if so, the query must read every row, since the comparison is between two columns rather than against a constant. (I don't know if the term "sargable" covers this situation.)
The INDEX I suggest will be faster because it is "covering". By reading only the index, there is less I/O.
The repeated 15 minutes is probably because innodb_buffer_pool_size was not big enough to hold the entire table. So, it was I/O-bound. My INDEX will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So, it will be faster, and even faster on the second run.
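A sketch of that covering index (the index name is arbitrary):
ALTER TABLE documents
ADD INDEX last_indexed_updated (last_indexed_at, updated_at);
-- still a scan, but only of the much smaller index:
SELECT COUNT(*) FROM documents
WHERE last_indexed_at IS NULL OR updated_at > last_indexed_at;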
Slow OR
OR is usually a terrible slowdown. But I don't think it matters here.
If you were to initialize last_indexed_at to some old date (say, '2000-01-01'), you could get rid of the COALESCE or OR.
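A sketch of that approach, assuming '2000-01-01' predates every real update:
-- backfill the NULLs once
UPDATE documents
SET last_indexed_at = '2000-01-01'
WHERE last_indexed_at IS NULL;
-- the filter then needs neither COALESCE nor OR
SELECT COUNT(*) FROM documents
WHERE updated_at > last_indexed_at;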
Another way to clean it up is
SELECT SUM(last_indexed_at IS NULL) +
SUM(updated_at > last_indexed_at) AS "Need indexing"
FROM t;
This still needs the index. SUM(boolean expression) sees the expression as 0 (FALSE or NULL) or 1 (TRUE).
Meanwhile, I don't think the COUNT(DISTINCT id) is any different than COUNT(*). And the pair of SUMs should also give you the value.
Again, I am depending on "covering" being the trick.
"More than .." trick
In some situations, you don't really need the exact number, especially if it is "more than some threshold".
SELECT 1 FROM tbl WHERE ... LIMIT 1000,1;
If it comes back with "1", there are at least 1000 rows. If it comes back empty (no row returned), then not.
That will still have to touch up to 1000 rows (hopefully in an index), but that is better than touching a million.
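Applied to the table in question, that might look like this (a sketch; the threshold of 1000 is arbitrary):
SELECT 1 FROM documents
WHERE last_indexed_at IS NULL OR updated_at > last_indexed_at
LIMIT 1000,1;
-- one row back = at least 1000 documents need indexing; no row = fewer than 1000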
You can, if you're on a recent MySQL version (5.7+), add a generated column to your table containing your search expression, then index that column.
ALTER TABLE t
ADD COLUMN needs_indexing TINYINT
GENERATED ALWAYS AS
(CASE WHEN last_indexed_at IS NULL THEN 1
WHEN updated_at > last_indexed_at THEN 1
ELSE 0 END) VIRTUAL;
ALTER TABLE t
ADD INDEX needs_indexing (needs_indexing);
This uses drive space for the index, but not in your table.
Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items matching your criterion.
But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*) on a large InnoDB table is very expensive as you have discovered. You can do this:
SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;
You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.
And, of course, your indexing job can do
SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;
or whatever makes sense. That will also be fast.

SQL query evaluates COUNT(*) differently if tables are defined as MyISAM or InnoDB

I am running a MySQL database.
I have the following script:
DROP TABLE IF EXISTS `org_apiinteg_assets`;
DROP TABLE IF EXISTS `assessmentinstances`;
CREATE TABLE `org_apiinteg_assets` (
`id` varchar(20) NOT NULL default '0',
`instance_id` varchar(20) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
CREATE TABLE `assessmentinstances` (
`id` varchar(20) NOT NULL default '0',
`title` varchar(180) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
INSERT INTO assessmentinstances(id, title) VALUES ('14026lvplotw6','One radio question survey');
INSERT INTO org_apiinteg_assets(id, instance_id) VALUES ('8kp9wgx43jflrgjfe','14026lvplotw6');
Looks like this
assessmentinstances
+---------------+---------------------------+
| id | title |
+---------------+---------------------------+
| 14026lvplotw6 | One radio question survey |
+---------------+---------------------------+
org_apiinteg_assets
+-------------------+---------------+
| id | instance_id |
+-------------------+---------------+
| 8kp9wgx43jflrgjfe | 14026lvplotw6 |
+-------------------+---------------+
And I then have the following query (I reduced it to the simplest failing query)
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND ai.id = '14026lvplotw6'
AND a.id != '8kp9wgx43jflrgjfe';
When I run the query I get this
null, 0
Until now, all is good. Now, here is my issue, when I recreate both tables with ENGINE=InnoDB instead of ENGINE=MyISAM and run the same query again, I get this:
'14026lvplotw6','0'
So 2 things are confusing me:
Why don't I get the same result?
How can the COUNT(*) return 0 in the second case when it actually returns values for the row, and should therefore be 1?
I am lost, I'd appreciate if anybody could explain this behaviour to me.
EDIT:
Interestingly, if I add GROUP BY ai.id at the end of the query, it works fine in both cases and returns no rows.
This happens because you are using an aggregate function without GROUP BY; in this case the result for the non-aggregated column is unpredictable (typically it shows the first value encountered during the query).
Try adding a GROUP BY
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND a.id != '8kp9wgx43jflrgjfe'
AND ai.id = '14026lvplotw6'
GROUP BY ai.id;
Remember that using aggregation with columns not mentioned in the GROUP BY is deprecated in SQL, is not allowed in most databases, and is rejected by more recent versions of MySQL (starting from 5.7).
EXPLAIN SELECT for MyISAM returns: Impossible WHERE noticed after reading const tables. So MyISAM isn't processing any data at all.
For the InnoDB tables there are two rows of EXPLAIN results: one Using index and one Using where. So InnoDB data is being scanned, and bits of it slip into the output because there is no aggregate function specified for the first column, and AFAIK it's not specified what should happen in such a situation. If you directly specify some aggregate function, then if there are no matching rows, it will return NULL. So, for example, SELECT min(ai.id), COUNT(*) ... would return NULL, 0.
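For instance, a sketch of that explicit-aggregate variant of the query from the question:
SELECT MIN(ai.id), COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND ai.id = '14026lvplotw6'
AND a.id != '8kp9wgx43jflrgjfe';
-- returns NULL, 0 on both engines, since no rows match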

MYSQL GROUP BY returns no results with FULLTEXT SEARCH

I have the following MYSQL schema:
CREATE TABLE IF NOT EXISTS `tag` (
`id` SMALLINT UNSIGNED NOT NULL,
`tag` VARCHAR(15) NOT NULL,
FULLTEXT INDEX(`tag`),
PRIMARY KEY (`id`,`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `tag`(`id`,`tag`) VALUES
('1','motor'),
('2','motor');
I want to select rows from tag table grouped by tag column. I ran the following command:
SELECT COUNT(id),tag FROM tag
GROUP BY tag
The expected result:
COUNT(id) | tag
------------------
2 | motor
The actual result:
no rows returned
If I remove the FULLTEXT index from the table, the results return as expected. I don't know what is going wrong when using FULLTEXT with GROUP BY.
Update: With further research, the problem seems to come from the composite primary key. If I switch to a one-column primary key, the query works again, but I need a composite key for this table as the same id can have multiple tags.
I created an SQL fiddle for you to try:
http://sqlfiddle.com/#!9/1f765d/1/0
Finally discovered the problem. The engine has two indexes to choose from when executing the query, and it ends up using neither and returning no results.
It's very likely a bug. This is a case where FORCE INDEX comes in handy as a workaround.
So the final working command:
SELECT COUNT(id),tag FROM tag
FORCE INDEX(PRIMARY)
GROUP BY tag
And this is a fiddle with the updated code:
http://sqlfiddle.com/#!9/a8568/21/0
Thanks everyone!

Simple Sphinx & mySQL Query

Forgive me for asking what should be a simple question but I am totally new to Sphinx.
I am using Sphinx with a mySQL datastore. The table looks like below with the Title and Content fields indexed by Sphinx.
CREATE TABLE `documents` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`group_id` int(11) NOT NULL,
`group_id2` int(11) NOT NULL,
`date_added` datetime NOT NULL,
`title` varchar(255) NOT NULL,
`content` text NOT NULL,
`url` varchar(255) NOT NULL,
`links` int(11) NOT NULL,
`hosts` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`)
) ENGINE=InnoDB AUTO_INCREMENT=439043 DEFAULT CHARSET=latin1
Now, if I connect through Sphinx with
mysql -h0 -P9306
I can run a simple query like...
SELECT * FROM test1 WHERE MATCH('test document');
And I will get back a result set like...
+--------+----------+------------+
| id | group_id | date_added |
+--------+----------+------------+
| 360625 | 1 | 1499727792 |
| 362257 | 1 | 1499727807 |
| 362777 | 1 | 1499727811 |
| 159717 | 1 | 1499717614 |
| 160557 | 1 | 1499717621 |
----------------------------------
What I actually want is for it to return a result set with column values from the documents table (like the URL, Title, Links, Hosts, etc. columns) and, if at all possible, sorted by the relevancy of the Sphinx match.
Can that be accomplished in a single query? What might it look like?
Thanks in advance!
Two (main) options
Take the ids from the SphinxQL result, and run a MySQL Query to get the full details, see http://sphinxsearch.com/info/faq/#row-storage
eg SELECT * FROM documents WHERE id IN (3,5,7) ORDER BY FIELD(id,3,5,7)
This MySQL query should be VERY quick, because it's a PK lookup retrieving only a few rows (i.e. one page of results); the heavy lifting of searching the whole table has already been done in the first Sphinx query.
Duplicate all the columns you want to retrieve in the result set as attributes. You've already made group_id and date_added attributes; you would need to make more.
sql_field_string is a very convenient shortcut to make BOTH a field and a string attribute from one column. It is not available for other column types, but that matters less, as numeric columns are not typically needed as fields anyway.
Option 1 is good in that it avoids duplicating the data and saves memory (Sphinx typically wants to hold attributes in memory), and it may be most practical on big datasets.
Option 2 is good in that it avoids a second query for each result. But because you keep a copy of the data, it may mean additional complication keeping it in sync.
This doesn't look relevant in your case, but say you had a 'clicks' column that you want to increment often (when users click!) and need in the result set, but don't really need in Sphinx for query purposes: the first option would let you increment it only in the database, and the MySQL query would always get the live value, whereas the second option means having to keep the Sphinx index in sync at all times.

MySQL Query Optimization

I have a web application that uses a table schema similar to the one below. Simply put, I want to optimize the selection of articles. Articles are selected based on the given tag. For example, if the tag is 'iphone', the query should output all open articles about 'iphone' from the last month.
CREATE TABLE `article` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(100) NOT NULL,
`body` varchar(200) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`author_id` int(11) NOT NULL,
`section` varchar(30) NOT NULL,
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
CREATE TABLE `tags` (
`name` varchar(30) NOT NULL,
`article_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`article_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `users` (
`id` int(11) NOT NULL auto_increment,
`username` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
The following is my MySQL query
explain select article.id,users.username,article.title
from article,users,tags
where article.id=tags.article_id and tags.name = 'iphone4'
and article.author_id=users.id and article.status = '1'
and article.section = 'mobile'
and article.date > '2010-02-07 13:25:46'
ORDER BY tags.article_id DESC
the output is
id  select_type  table    type    possible_keys  key      key_len  ref                      rows  Extra
1   SIMPLE       tags     ref     PRIMARY        PRIMARY  92       const                    55    Using where; Using index
1   SIMPLE       article  eq_ref  PRIMARY        PRIMARY  4        test.tags.article_id     1     Using where
1   SIMPLE       users    eq_ref  PRIMARY        PRIMARY  4        test.article.author_id   1
is it possible to optimize it more?
This query may be optimized, depending on which condition is more selective: tags.name = 'iphone4' or article.date > '2010-02-07 13:25:46'
If there are fewer articles tagged iphone than articles posted after Feb 7, then your original query is fine.
If there are many articles tagged iphone, but few posted after Feb 7, then this query will be more efficient:
SELECT article.id, users.username, article.title
FROM tags
JOIN article
ON article.id = tags.article_id
AND article.status = '1'
AND article.section = 'mobile'
AND article.date > '2010-02-07 13:25:46'
JOIN users
ON users.id = article.author_id
WHERE tags.name = 'iphone4'
ORDER BY
article.date DESC, tags.article_id DESC
Note that the ORDER BY condition has changed. This may or may not be what you want, however, generally the orders of id and date correspond to each other.
If you really need your original ORDER BY condition you may leave it but it will add a filesort (or just revert to your original plan).
In either case, create an index on
article (status, section, date, id)
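For example (a sketch; the index name is arbitrary):
ALTER TABLE article
ADD INDEX status_section_date_id (status, section, date, id);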
the query should output all open articles about 'iphone' from the last month.
So the only query you are going to run on this data uses the tag and the date. You've got a index for the tag in the tags table, but the date is stored in a different table (article - you're a bit inconsistent with your naming schema). Adding an index on the article table using date would be no benefit at all. Using id,date (in that order) would help a little - but really the date needs to be denormalised into the tags table to get the query running really fast.
Unless you're regularly moving around bulk data sets - just add a datetime column with a default of the current timestamp to the tags table.
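A sketch of that denormalisation (the column and index names are illustrative; the value would need to be copied from article.date whenever a tag row is written):
ALTER TABLE tags ADD COLUMN `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE tags ADD INDEX name_date (name, `date`);
-- the tag + date filter can then be resolved entirely from the tags table:
SELECT article_id FROM tags
WHERE name = 'iphone4' AND `date` > '2010-02-07 13:25:46';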
I expect that you may be wanting to interact with the data in lots of other ways - really you should set a low (no?) threshold for slow query logging, then analyse the resulting data to identify where your performance problems are (try looking at the queries with the highest values for duration^2*frequency first).
There's a script at the URL below which is useful for this analysis:
http://www.retards.org/projects/mysql/
You could index the additional fields in article that you are referencing in your select statement. In this case, I would suggest you create an index in article like this:
CREATE INDEX article_idx ON article (author_id, status, section, date);
Creating that index should speed up your query, depending on how many overall records you are dealing with. From my understanding, properly creating indexes involves looking at the queries you've written and indexing the columns that are part of your WHERE clause. This helps the query optimizer better process the query in general. That does not mean you should create an index on each individual column, however, as that is both inefficient and ineffective. When possible, create multiple-column indexes that represent your select statement.