MySQL fulltext selection performance - mysql

I have 1.5M rows in a table. Following is the table create code:
CREATE TABLE `jobs` (
`id` INT(8) NOT NULL AUTO_INCREMENT,
`job_id` VARCHAR(50) NOT NULL DEFAULT '',
`title` VARCHAR(255) NOT NULL DEFAULT '',
`company` VARCHAR(255) NOT NULL DEFAULT '',
`city` VARCHAR(50) NOT NULL DEFAULT '',
`state` VARCHAR(50) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE INDEX `job_id` (`job_id`),
FULLTEXT INDEX `search` (`title`, `company`, `city`, `state`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM
The query below takes about 0.5 seconds, which is very high.
SELECT id, title, company, state, city FROM `jobs` WHERE MATCH (title, company, state, city) AGAINST ('software engineer in san fransisco california') LIMIT 0,10
How can I decrease execution time and still provide relevance results? Any suggestions?
So far I tried followings but there is no improvement at all.
-Searching in a single field that contains 4 field of data but it did not matter.
-Using in boolean mode>1 or >2, but then it gives me unrelated results
-Repearing the table, increasing key_buffer_size to 1GB from 16MB, changing table type to Innodb, changing character set to latin1 from utf8.
-Setting ft_max_word_len=1 and ft_stopword_file='' from default values.
-I searched online for many hours but no luck so far.
"Explain select..." output:
id;select_type;table;type ;possible_keys;key ;key_len;ref;rows;Extra
1 ;SIMPLE ;jobs ;fulltext;search ;search;0 ;\N ;1 ;Using where
Edit: Thank you for your suggestions but there is no imprevement at all.

Related

Mysql Fulltext Index with AND condition accross multiple tables is slow

I have two huge tables (55M rows) with the following structure:
CREATE TABLE `chapters` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`toc` varchar(5000) COLLATE utf8mb4_unicode_ci NOT NULL,
`author` varchar(5000) COLLATE utf8mb4_unicode_ci NOT NULL,
`ari_id` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `ari_id` (`ari_id`),
FULLTEXT KEY `toc` (`toc`),
FULLTEXT KEY `author` (`author`)
) ENGINE=InnoDB AUTO_INCREMENT=52251463 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
CREATE TABLE `books` (
`ID` int(15) unsigned NOT NULL AUTO_INCREMENT,
`Title` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT '',
`Author` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT '',
`isOpenAccess` tinyint(1) NOT NULL,
`ari_id` bigint(20) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `ari_id` (`ari_id`),
FULLTEXT KEY `Title` (`Title`),
FULLTEXT KEY `Author` (`Author`),
) ENGINE=InnoDB AUTO_INCREMENT=2627161 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
I am using the following query for searching:
SELECT b.ari_id, b.Title, b.Author, t.toc, t.author
FROM books b
INNER JOIN chapters t
ON b.ari_id = t.ari_id
WHERE MATCH(t.toc) AGAINST('power*' IN BOOLEAN MODE)
AND b.isOpenaccess = 1
LIMIT 300
It is returning the results in about 12 seconds. Is there any chance that I can speed up the response time?
Second when I try to search from two fulltext indexes using "AND" operator, it takes forever to respond (146 seconds). The query I am running is as follows:
SELECT toc, author
FROM tocs
WHERE MATCH(toc) AGAINST('high*' IN BOOLEAN MODE)
AND MATCH(author) AGAINST('max*' IN BOOLEAN MODE)
LIMIT 300
In books, ari_id could be the PRIMARY KEY and you could get rid of id. (This may or may not help performance.)
MySQL likes to run the FULLTEXT part of the WHERE first, then AND with the other tests.
This:
WHERE MATCH(toc) AGAINST('high*' IN BOOLEAN MODE)
AND MATCH(author) AGAINST('max*' IN BOOLEAN MODE)
can be sped up with FULLTEXT(toc, author) and
WHERE MATCH(toc, author) AGAINST('high* max*' IN BOOLEAN MODE)
But it will find extra rows. (Eg, the toc has both high and max, but author has neither.)
A similar trick is not possible when doing FT queries on each of two Joined tables.
OTOH, by having a 3rd table with all of columns, plus ari_id, would let you combine the tests into a single MATCH. Then follow by further refinement.

How to break find duplicate SQL query on a large table in multiple parts

I have a large table with ~3 million records in a MySQL database. I am trying to find duplicate rows in this table using the following query -
SELECT package_id
FROM version
WHERE metadata IS NOT NULL AND metadata <> '{}'
GROUP BY package_id, metadata HAVING COUNT(package_id) > 1
This query takes ~23 seconds to run on the database. Our database host however kill any query taking larger than 3 seconds using pt-kill. So I need to find a way to break this query down, such as each of the subpart would be a separate query and each one takes less than 3 seconds. Adding just a LIMIT constraint doesn't do it for the query, so how do I break a query to work on different parts of the table.
Result of SHOW CREATE TABLE version
CREATE TABLE `version` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`package_id` bigint(20) unsigned NOT NULL,
`version_number` int(11) unsigned NOT NULL,
`current_state_id` tinyint(2) unsigned NOT NULL,
`md5sum` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_cs NOT NULL DEFAULT '',
`uri` varchar(1024) CHARACTER SET utf8 COLLATE utf8_general_cs NOT NULL DEFAULT '',
`filename` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_cs NOT NULL DEFAULT '',
`size` bigint(11) unsigned NOT NULL DEFAULT '0',
`metadata` varchar(1024) CHARACTER SET utf8 COLLATE utf8_general_cs DEFAULT NULL,
`storage_type_id` tinyint(2) unsigned NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_version_package_id_version_number` (`package_id`,`version_number`),
KEY `idx_version_md5sum` (`md5sum`),
KEY `idx_version_metadata` (`metadata`(255)),
KEY `idx_version_current_state_id` (`current_state_id`),
KEY `storage_type_id` (`storage_type_id`),
CONSTRAINT `_fk_version_current_state_id` FOREIGN KEY (`current_state_id`) REFERENCES `state` (`id`),
CONSTRAINT `_fk_version_package_id` FOREIGN KEY (`package_id`) REFERENCES `package` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3248761 DEFAULT CHARSET=utf8
As can be seen there are many indexes on the table including index on Package_id + Version_number combination of field. The problem is that this table is only going to get bigger and I don't think Optimization even if it pulls me back in 3 second range would scale. So I need a way where I can partition this table and run on queries on separate parts.
Steps to improve speed.
Create Table version_small with just columns id and package_id with index on package_id.
insert into version_small select id and package_id from version;
Run your original query on optimised table above - should be much faster on smaller table.
OR
Create Table version_small with just columns id and package_id, and a int counter with unique index on package_id.
insert into version_small select id and package_id from version, on duplicate key increment counter;
The rows with counter>1 are package_id that have more than one entry.

mysql match against not matching text in brackets

I'm trying to use match against to return search results.
I have a problem though where it's not returning results where the matching text is in brackets. Just getting rid of the brackets isn't an option i'm afraid.
So running the sql fiddle below I would expect it to return two results, not one
CREATE TABLE `courses` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`course_code` varchar(100) NOT NULL,
`course_name` varchar(100) DEFAULT NULL,
`startdate` varchar(100) DEFAULT NULL,
`starttimestamp` varchar(45) DEFAULT NULL,
`prospectus_title` varchar(500) DEFAULT NULL,
PRIMARY KEY (`id`),
FULLTEXT KEY `info` (`course_name`,`prospectus_title`,`course_code`)
) ENGINE=MyISAM AUTO_INCREMENT=981074 DEFAULT CHARSET=utf8;
INSERT INTO `courses` (`id`, `course_code`, `course_name`, `startdate`, `starttimestamp`, `prospectus_title`) VALUES
('1', '1234', 'vrqtest', 'time','time', 'vrqtest'),
('2', '5678', '(vrq)test', 'time','time','(vrq)test');
SELECT * FROM courses force index(info)
WHERE starttimestamp IS NOT NULL AND (
MATCH ( course_name ) against ('vrq*' in boolean mode ))
SQL Fiddle
That returns only the first record but should also the second.
Any ideas?
Here's why:
https://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_ft_min_word_len
ft_min_word_len
The minimum length of the word to be included in a
FULLTEXT index. Defaults to 4
(vrq)test is too short to match. Add an extra character (e.g. (vrqa)test ) and it's matched.

MySQL-Query too slow

I am using the following tables in my MySQL-Database:
--
-- Table structure for table `company`
--
CREATE TABLE IF NOT EXISTS `company` (
`numb` varchar(4) NOT NULL,
`cik` varchar(30) NOT NULL,
`sNumber` varchar(30) NOT NULL,
`street1` varchar(255) NOT NULL,
`street2` varchar(255) NOT NULL,
`city` varchar(255) NOT NULL,
`state` varchar(100) NOT NULL,
`zip` varchar(100) NOT NULL,
`phone` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`dateChanged` varchar(30) NOT NULL,
`name2` varchar(255) NOT NULL,
`seriesId` varchar(30) NOT NULL,
`symbol` varchar(10) NOT NULL,
`exchange` varchar(20) NOT NULL,
PRIMARY KEY (`cik`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `company` (`numb`, `cik`, `sNumber`, `street1`, `street2`, `city`, `state`, `zip`, `phone`, `name`, `dateChanged`, `name2`, `seriesId`, `symbol`, `exchange`) VALUES
('6798', 'abc', '953551121', '701 AVENUE', '', 'GLENDALE', 'CA', '91201-2349', '818-244-8080', '', '', 'Public Store', '', 'PSA', 'NYSE')
--
-- Table structure for table `data`
--
CREATE TABLE IF NOT EXISTS `data` (
`id` int(100) NOT NULL AUTO_INCREMENT,
`number` varchar(100) NOT NULL,
`elementname` mediumtext NOT NULL,
`date` varchar(100) NOT NULL,
`elementvalue` longtext NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=18439;
INSERT INTO `data` (`id`, `number`, `elementname`, `date`, `elementvalue`) VALUES
(1, '0001393311-10-000004', 'StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest', '2009-12-31', '3399777000')
--
-- Table structure for table `filing`
--
CREATE TABLE IF NOT EXISTS `filing` (
`number` varchar(100) NOT NULL,
`file_number` varchar(100) NOT NULL,
`type` varchar(100) NOT NULL,
`amendment` tinyint(1) NOT NULL,
`date` varchar(100) NOT NULL,
`cik` varchar(30) NOT NULL,
PRIMARY KEY (`accession_number`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `filing` (`number`, `file_number`, `type`, `amendment`, `date`, `cik`) VALUES
('0001393311-10-000004', '001-33519', '10-K', 0, '2009-12-31', '0000751653'),
('0000751652-10-000006', '001-08796', '10-K', 0, '2009-12-31', '0000751652')
The data table has around 22.000 entries, filing and company tables have around 400 entries each. I want to operate the database with a lot more entries in the future.
I perform the following query, which selects the newest item with a given type:
SELECT data.elementname, data.elementvalue, company.name2 FROM data
JOIN filing ON data.number = filing.number
JOIN company ON filing.cik = company.cik
WHERE elementname IN ('Elem1', 'Elem2', 'Elem3', 'Elem4', 'Elem5', 'ElemN')
AND number IN (
SELECT number
FROM filing
WHERE filing.cik IN ('cik1', 'cik2', 'cikN')
AND filing.type = '1L'
GROUP BY filing.cik
)
It takes between ~0.28 and 0.4 seconds to complete, which appears to be very slow.
When i perform the query without the following line
WHERE filing.cik IN ('cik1', 'cik2', 'cikN')
it takes only ~0.035 seconds.
Any idea how to speed the query up or to optimize the table structure because the table is growing rapidly and it's already too slow.
First off, the table structure you posted for filing is incorrect, as the primary key you specified doesn't. I'll assume you mean number. Additionally, you didn't specify the table definition for company, which makes trying to provide advice for this somewhat difficult.
However, both of the comments are correct. You need some indexes. Based on the query, you should probably some the following indexes.
ALTER TABLE company ADD INDEX ( cik )
ALTER TABLE data ADD INDEX ( number )
I would also recommend taking a look at whether data.elementname actually needs to be a MEDIUMTEXT, which is a pretty huge column. If the rest of the data looks like the example data you provided, you should probably change it into a varchar. TEXT columns can cause some serious performance penalties due to the way they're stored.
Additionally, your PRIMARY KEY number columns, which are currently strings, look as though they could be reformatted into different columns that are actually of type INT. Keep in mind that VARCHAR PRIMARY KEY columns will not be as efficient as INTs, just because they're so much bigger.
Lastly, 22k rows isn't all that much data. You should a take a look at your my.cnf settings. Your key_buffer value may be too small to fit indexes entirely in memory. Additionally, you may want to consider using INNODB for these tables, combined with an innodb_buffer_pool value that'll keep everything in memory.

optimize query (2 simple left joins)

SELECT fcat.id,fcat.title,fcat.description,
count(DISTINCT ftopic.id) as number_topics,
count(DISTINCT fpost.id) as number_posts FROM fcat
LEFT JOIN ftopic ON fcat.id=ftopic.cat_id
LEFT JOIN fpost ON ftopic.id=fpost.topic_id
GROUP BY fcat.id
ORDER BY fcat.ord
LIMIT 100;
index on ftopic_cat_id, fpost.topic_id, fcat.ord
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE fcat ALL PRIMARY NULL NULL NULL 11 Using temporary; Using filesort
1 SIMPLE ftopic ref PRIMARY,cat_id_2 cat_id_2 4 bloki.fcat.id 72
1 SIMPLE fpost ref topic_id_2 topic_id_2 4 bloki.ftopic.id 245
fcat - 11 rows,
ftopic - 1106 rows,
fpost - 363000 rows
Query takes 4,2 sec
TABLES:
CREATE TABLE IF NOT EXISTS `fcat` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(250) collate utf8_unicode_ci NOT NULL,
`description` varchar(250) collate utf8_unicode_ci NOT NULL,
`created` datetime NOT NULL,
`visible` tinyint(4) NOT NULL default '1',
`ord` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ord` (`ord`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=12 ;
CREATE TABLE IF NOT EXISTS `ftopic` (
`id` int(11) NOT NULL auto_increment,
`cat_id` int(11) NOT NULL,
`title` varchar(100) collate utf8_unicode_ci NOT NULL,
`created` datetime NOT NULL,
`updated` timestamp NOT NULL default CURRENT_TIMESTAMP,
`lastname` varchar(200) collate utf8_unicode_ci NOT NULL,
`visible` tinyint(4) NOT NULL default '1',
`closed` tinyint(4) NOT NULL default '0',
`views` int(11) NOT NULL default '1',
PRIMARY KEY (`id`),
KEY `cat_id_2` (`cat_id`,`updated`,`visible`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1116 ;
CREATE TABLE IF NOT EXISTS `fpost` (
`id` int(11) NOT NULL auto_increment,
`topic_id` int(11) NOT NULL,
`pet_id` int(11) NOT NULL,
`content` text collate utf8_unicode_ci NOT NULL,
`imageName` varchar(300) collate utf8_unicode_ci NOT NULL,
`created` datetime NOT NULL,
`reply_id` int(11) NOT NULL,
`visible` tinyint(4) NOT NULL default '1',
`md5` varchar(100) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `topic_id_2` (`topic_id`,`created`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=390971 ;
Thanks,
hamlet
you need to create a key with both fcat.id, fcat.ord
Bold rewrite
This code is not functionally identical, but...
Because you want to know about distinct ftopic.id and fpost.id I'm going to be bold and suggest two INNER JOIN's instead of LEFT JOIN's.
Then because the two id's are autoincrementing they will no longer repeat, so you can drop the distinct.
SELECT
fcat.id
, fcat.title
, fcat.description
, count(ftopic.id) as number_topics
, count(fpost.id) as number_posts
FROM fcat
INNER JOIN ftopic ON fcat.id = ftopic.cat_id
INNER JOIN fpost ON ftopic.id = fpost.topic_id
GROUP BY fcat.id
ORDER BY fcat.ord
LIMIT 100;
It depends on your data if this is what you are looking for, but I'm guessing it will be faster.
All your indexes seem to be in order though.
MySQL does not use indexes for small sample sizes!
Note that the explain list that MySQL only has 11 rows to consider for fcat. This is not enough for MySQL to really start worrying about indexes, so it doesn't.
Because going to the index for small row-counts slows things down.
MySQL is trying to speed things up so it chooses not to use the index, this confuses a lot of people because we are trained so hard on the index. Small sample sizes don't give good explains!
Increase the size of the test data so MySQL has more rows to consider and you should start seeing the index being used.
Common misconceptions about force index
Force index does not force MySQL to use an index as such.
It hints at MySQL to use a different index from the one it might naturally use and it pushes MySQL into using an index by setting a very high cost on a table scan.
(In your case MySQL is not using a table scan, so force index has no effect)
MySQL (same most other DBMS's on the planet) has a very strong urge to use indexes, so if it doesn't (use any) that's because using no index at all is faster.
How does MySQL know which index to use
One of the parameters the query optimizer uses is the stored cardinality of the indexes.
Over time these values change... But studying the table takes time, so MySQL doesn't do that unless you tell it to.
Another parameter that affects index selection is the predicted disk-seek-times that MySQL expects to encounter when performing the query.
Tips to improve index usage
ANALYZE TABLE will instruct MySQL to re-evaluate the indexes and update its key distribution (cardinality). (consider running it daily/weekly in a cron job)
SHOW INDEX FROM table will display the key distribution.
MyISAM tables and indexes fragment over time. Use OPTIMIZE TABLE to unfragment the tables and recreate the indexes.
FORCE/USE/IGNORE INDEX limits the options MySQL's query optimizer has to perform your query. Only consider it on complex queries.
Time the effect of your meddling with indexes on a regular basis. A forced index that speeds up your query today might slow it down tomorrow because the underlying data has changed.