I have a table called documents that has around 30 columns and around 3.5 million rows, at a size of about 10GB. The most important columns are:
system_id, archive_id, content, barcodes, status and notes.
As you can see, this is a multi-tenant application where each tenant is a system and is referenced through system_id.
I have 2 indexes on this table. The first one is a BTREE index on the columns system_id, archive_id and status.
The other one is a FULLTEXT index containing the columns content, barcodes and notes.
I have two different tenants that I want to highlight. The first one (Customer A) has system_id = 1 and has, say, 1000 records in the documents table. The second one (Customer B) has system_id = 2 and, say, 400 000 records in this table.
The LIKE query for Customer A is:
SELECT *
FROM documents
WHERE system_id = 1 AND
CONCAT_WS(' ',content,barcodes,notes) LIKE '%office%' AND
status = 100
The above query will run in about 0.02 seconds. If I run a similar query but with the FULLTEXT search like
SELECT *
FROM documents
WHERE system_id = 1 AND
MATCH(content,barcodes,notes) AGAINST ('office' IN BOOLEAN MODE) AND
status = 100
This operation takes around 4 seconds?! I have read that the FULLTEXT search index should be a lot quicker than LIKE.
If I run the same queries for Customer B (which has 400 000 records in the documents table), the LIKE search is a little bit slower than FULLTEXT, but not by much.
What can the reason for this be?
Should I go with LIKE or FULLTEXT search in above situation (8GB RAM database server)?
I'm a little bit confused about why my queries with FULLTEXT search are taking so long. The text in content is probably not just words that a normal person would use, because it's OCR-read from the document, so there will be a lot of different words that might blow up the index?
The EXPLAINs will show that the fast query is using your index on system_id and status, not the LIKE. It was fast, not because of LIKE, but because of that filtering.
And the slow query decided to use the FULLTEXT index because the Optimizer is too dumb to realize that lots of rows contain "office".
LIKE, especially in conjunction with CONCAT_WS, is not faster than FULLTEXT.
SELECT t1.*
FROM
( SELECT key_a,key_b,MAX(date) as date
FROM large_table
WHERE date <= 20150126
group by key_a,key_b
) AS t2
JOIN large_table AS t1 USING(key_a,key_b ,date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous gaps in the dates for each key_a & key_b pair, and I want the most recent row on or before the "Date" entered.
Is it the MySQL join settings causing it to be slow?
I can copy the entire set of a & b data with an INSERT to a temp table just by selecting all the rows and then run the same query on the temp table, but why run multiple queries (INSERT ... SELECT, then SELECT from the temp table) when only one should be needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and it's not keys. Is it max sort length or join buffer size? I have 128 GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I have just never bulk inserted this large a single table to run JOIN queries on before. If anyone else has dealt with tables this size, any info is greatly appreciated.
Edited the query. Yes, it's late after a long day: I had DISTINCT in there when it wasn't needed, and it isn't in the actual query.
WHERE date <= 20150126
group by key_a,key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, not sparse or dense.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the data.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (16KB each) of index scanning. That's about 300MB.
Next the MAX and GROUP BY are performed. At 300MB, this will involve a disk tmp table. (See max_heap_table_size and tmp_table_size.)
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
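The "find the PKs in a subquery, then JOIN on the PK" pattern can be tried out on a toy table. This is only a sketch: it uses Python's built-in sqlite3 module so that it runs standalone, and the payload column and the sample rows are made up for illustration. SQLite's planner and storage differ from InnoDB, so it shows the shape of the query, not its performance.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for large_table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE large_table (
            key_a   INTEGER NOT NULL,
            key_b   INTEGER NOT NULL,
            date    INTEGER NOT NULL,   -- yyyymmdd, as in the question
            payload TEXT,               -- hypothetical extra column
            PRIMARY KEY (key_a, key_b, date)
        )
        """
    )
    conn.executemany(
        "INSERT INTO large_table VALUES (?, ?, ?, ?)",
        [
            (1, 1, 20150101, "older"),
            (1, 1, 20150120, "latest before cutoff"),
            (1, 1, 20150201, "after cutoff"),
            (1, 2, 20150126, "on cutoff"),
            (2, 1, 20150301, "only after cutoff"),
        ],
    )
    return conn


def latest_rows_on_or_before(conn, cutoff):
    """Return the full row with the greatest date <= cutoff per (key_a, key_b)."""
    sql = """
        SELECT t1.key_a, t1.key_b, t1.date, t1.payload
        FROM (
            -- Step 1: the subquery touches only the key columns and
            -- produces one primary-key value per (key_a, key_b) group.
            SELECT key_a, key_b, MAX(date) AS date
            FROM large_table
            WHERE date <= ?
            GROUP BY key_a, key_b
        ) AS t2
        -- Step 2: join back on the full primary key to fetch whole rows.
        JOIN large_table AS t1 USING (key_a, key_b, date)
        ORDER BY t1.key_a, t1.key_b
    """
    return conn.execute(sql, (cutoff,)).fetchall()


if __name__ == "__main__":
    for row in latest_rows_on_or_before(build_demo_db(), 20150126):
        print(row)
```

The ORDER BY is there only to make the demo output deterministic; the original query does not need it.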
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
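The back-of-the-envelope numbers above can be reproduced with a few lines of Python. The figure of 100 random reads per second is an assumption for a single commodity spinning disk, not a measurement, and treating every block as a random read overstates the cost of the mostly sequential index scan.

```python
BLOCK_BYTES = 16 * 1024        # InnoDB default page size (16KB)
RANDOM_READS_PER_SEC = 100     # assumption: one commodity spinning disk


def blocks_to_mb(blocks):
    """Size of `blocks` 16KB blocks, in MB."""
    return blocks * BLOCK_BYTES / (1024 * 1024)


def cold_read_minutes(blocks, reads_per_sec=RANDOM_READS_PER_SEC):
    """Minutes needed if every block costs one uncached random disk read."""
    return blocks / reads_per_sec / 60


index_blocks = 20_000      # scanning the proposed index
lookup_blocks = 180_000    # one PK lookup per row returned by the subquery
total_blocks = index_blocks + lookup_blocks

if __name__ == "__main__":
    print(f"index scan : {blocks_to_mb(index_blocks):.1f} MB")
    print(f"total      : {blocks_to_mb(total_blocks) / 1024:.2f} GB")
    print(f"cold run   : {cold_read_minutes(total_blocks):.1f} minutes")
```

Under those assumptions the index scan comes to roughly 300MB, the whole working set to roughly 3GB, and an uncached run to a bit over 30 minutes, which matches the estimates given.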
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
What version are you running? Recent versions have a leapfrog technique that works better for MAX + GROUP BY than what I described. I think it is called a loose index scan (EXPLAIN shows "Using index for group-by"). If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).
I wanted to build my first real search function. I've been Googling a while, but wasn't able to really make my mind up and understand everything.
My database exists of three InnoDB tables:
Products: Contains product information. Columns: proID (primary, auto-increment), content (contains up to a few hundred words), title, author, year, and a bunch of others that are not related to the search query. Rows: 100 to 2000.
Categories: Contains category information: Columns: catID (primary, auto-increment), catName. Rows: 5-30
Productscategories: Link between the two above. Each product can be related to multiple categories. Columns: pcID (primary, auto-increment), catID, proID. Rows: 1-5 times amount of products.
My search function offers the following things. They do not have to be filled in. If more than one is filled in, the final query will connect them with the AND-query:
Terms: Searches the content and title fields. Searches on arbitrary terms; multiple words can be added, but each of them is searched for separately. Most likely 1 match with the database should be enough for a hit (OR-query)
Year: Searches on the year-column of products.
Category: Selectable from a list of categories. Multiple possible. The form returns the catID's of the chosen categories. 1 match with the database should be enough for a hit (OR-query)
Author: Searches on the author-column of products
As you may have noticed, when a category is selected, the tables products and productscategories are joined together for the search query. There is also a foreign key set between the two.
To clarify the relations, here is an example of how it should be interpreted (no search for the year!):
Search WHERE (products.content = term 1 OR products.content = term 2 OR products.title = term 1 OR products.title = term 2 ......) AND (products.author = author) AND (productscategories.catID = catID1 OR productscategories.catID= catID2 ......)
Also note that I created a pagination system that only shows 10 results on each 'page'.
The question I am stuck with is the following: I wish to optimize this search query, but can't figure out which way is the best.
Most cases I found Googling used the LIKE %% mysqli-query. However, some used MATCH...AGAINST. I really like the latter because I read it can sort on relevance and because it seems to make the query a lot easier to create (1 match against the term values instead of plenty of LIKE %% combined with OR). It seems I would only use it on the Terms search field though. However, for MATCH...AGAINST I will need a MyISAM table (right?), in which I can't use the foreign key to prevent faults in the database.
MATCH...AGAINST example (without year field, category field and not joining products and productscategories):
SELECT *,MATCH (content,title) AGAINST ('search terms' IN BOOLEAN MODE) AS relevance
FROM products WHERE (MATCH (content,title) AGAINST ('search terms' IN BOOLEAN MODE)) AND
author='author' ORDER BY relevance DESC LIMIT 10
%LIKE% example(without year field, category field and not joining products and productscategories) and sadly no relevance sorting:
SELECT * FROM products WHERE
(content LIKE '%term1%' OR content LIKE '%term2%' OR title LIKE '%term1%' OR title LIKE '%term2%')
AND (author='author') ORDER BY title LIMIT 10
I could make a relevance sorting by using the CASE and add 'points' if a term comes in the title or the content? Or would that make the query too heavy for performance?
So what is the best way to make this kind of query? Go with InnoDB and LIKE, or switch to MyISAM and use MATCH...AGAINST for sorting?
You don't have to switch to MyISAM. InnoDB supports FULLTEXT indexes in MySQL 5.6 and higher.
I usually recommend using fulltext indexes. Create a fulltext index on your columns title,author,year
Then you can run a fulltext query on all 3 at the same time, and apply IN BOOLEAN MODE to really narrow your searches. This is of course something you have to decide for yourself, but fulltext gives you more options.
However, if you are running queries that span a range (a date, for instance) or match a simple string, then a standard index is better. For text searching across different columns, a fulltext index is the way to go!
Read this: http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
This MySQL query is run on a large (about 200 000 records, 41 columns) MyISAM table:
select t1.* from table t1 where 1 and t1.inactive = '0' and (t1.code like '%searchtext%' or t1.name like '%searchtext%' or t1.ext like '%searchtext%' ) order by t1.id desc LIMIT 0, 15
id is the primary index.
I tried adding a multiple-column index on all 3 searched (LIKE) columns. It works OK, but the results are served to an auto-filled AJAX table on a website and the 2-second return delay is a bit too slow.
I also tried adding separate indexes on all 3 columns and a fulltext index on all 3 columns, without significant improvement.
What would be the best way to optimize this type of query? I would like to achieve under 1 sec performance, is it doable?
The best thing you can do is implement paging. No matter what you do, that IO cost is going to be huge. If you only return one page of records (10, 25, or whatever), that will help a lot.
As for the index, you need to check the plan to see if your index is actually being used. A full text index might help, but that depends on how many rows you return and what you pass in. Wildcards such as % really drain performance. An index can still be used if the pattern ends with % but does not start with it. If you put % on both sides of the text you are searching for, indexes can't help much.
You can create a full-text index that covers the three columns: code, name, and ext. Then perform a full-text query using the MATCH() AGAINST () function:
select t1.*
from table t1
where match(code, name, ext) against ('searchtext')
order by t1.id desc
limit 0, 15
If you omit the ORDER BY clause the rows are sorted by default using the MATCH function result relevance value. For more information read the Full-Text Search Functions documentation.
As @Vulcronos notes, the query optimizer is not able to use the index when the LIKE operator is used with a pattern that starts with a wildcard %.
Example:
Table 1 has 100k records, and has a varchar field with a unique index on it.
Table 2 has 1 million records, and relates to table 1 through a table1_id field with a many-to-one relationship, and has three varchar fields, only one of them unique. The engine in question is InnoDB so no fulltext indexes.
For argument's sake, assume these tables will grow to a maximum of 1 million and 10 million records respectively.
When I enter a search term into my form, I want it to search both tables across all four (total) available varchar fields with a LIKE and return only the records from Table1, so I'm grouping by table1.id here. What I'm wondering is: is it more efficient to search the million records table first, since it has only one field that needs to be searched and that one field is unique, and then use the fetched IDs in a table1.id IN ({IDS}) query, or would it be better to join them outright and search them right then and there without making a round trip to the database?
In other words, when doing joins, does MySQL join according to the searched term, or join first and search later? That is, if I do a join and the LIKE on both tables in one query, will it first join them and then look through them for matching records, or will it join only the records it found to be matching?
Edit: I have made two sample tables and faked some data. This example query is a join and a LIKE search across all fields. For demo purposes I used LIKE '%q%' but in reality the q may be anything. The actual search on bogus 100k/1mil records took 0.03 seconds, MySQL says. Here is the explain: http://bit.ly/PsFBxK
Here is the explain query of searching just table2 on its one unique field: http://bit.ly/S06Hug and for this one to actually happen, MySQL says it took it 0.0135 seconds.
I have this query in MySQL:
select *
from alias where
name like '%jandro%';
The results are:
Jandro, Alejandro
The index on name cannot be used to speed this up because the pattern starts with a wildcard. Is there any way of improving the performance of that query?
I have tried with a full-text index, but it only works for complete words.
I also tried with a MEMORY ENGINE table, and it is faster, but I would like a better choice.
EDIT
I think I will just have to accept this for now:
select *
from alias where match(name) against ('jandro*' in boolean mode);
I've done this in the past (not on MySQL, and before full text searching was commonly available on database servers) by creating a lookup table, in which I stored all left-chopped substrings (that is, every suffix of the name) to search on.
In my case, it was merited - a key user journey involved searching for names in much the way you suggest, and performance was key.
It worked with triggers on insert, update and delete.
Translated to your example:
Table alias
ID name
1 Jandro
2 Alejandro
Table name_lookup
alias_id name_substring
1 Jandro
1 andro
1 ndro
1 dro
1 ro
2 Alejandro
2 lejandro
2 ejandro
2 jandro
2 andro
2 ndro
2 dro
2 ro
Your query then becomes
select nl.alias_id, a.name
from alias a,
name_lookup nl
where a.id = nl.alias_id
and nl.name_substring like 'andro%'
That way, you hit the index on the name_substring table.
It's only worth doing for common queries on huge data sets - but it works, and it's quick.
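For anyone who wants to experiment with the lookup-table idea, here is a minimal sketch using Python's built-in sqlite3 module so that it runs standalone. It is not the original implementation: the suffixes are generated in application code rather than by triggers, the helper names (suffixes, build_demo_db, search) are made up for illustration, and the minimum suffix length of 2 simply mirrors the example table above.

```python
import sqlite3


def suffixes(name, min_len=2):
    """Return every suffix of `name` with at least `min_len` characters."""
    return [name[i:] for i in range(len(name) - min_len + 1)]


def build_demo_db(names):
    """Create alias and name_lookup tables and fill them for `names`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE alias (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE name_lookup ("
        " alias_id INTEGER NOT NULL REFERENCES alias(id),"
        " name_substring TEXT NOT NULL)"
    )
    # The index that a prefix pattern such as 'andro%' is meant to hit.
    conn.execute("CREATE INDEX idx_name_substring ON name_lookup (name_substring)")
    for name in names:
        alias_id = conn.execute(
            "INSERT INTO alias (name) VALUES (?)", (name,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO name_lookup (alias_id, name_substring) VALUES (?, ?)",
            [(alias_id, s) for s in suffixes(name)],
        )
    return conn


def search(conn, term):
    """Find names containing `term` via a prefix match on the stored suffixes.

    Wildcard characters inside `term` (% and _) are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT DISTINCT a.name FROM alias AS a"
        " JOIN name_lookup AS nl ON a.id = nl.alias_id"
        " WHERE nl.name_substring LIKE ? || '%'"
        " ORDER BY a.name",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    db = build_demo_db(["Jandro", "Alejandro"])
    print(search(db, "jandro"))
```

DISTINCT is needed because a name in which the term occurs more than once would otherwise be returned once per matching suffix. The trade-off is storage: each name of length n adds up to n - 1 rows to name_lookup.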