Partitioning of a large MySQL table that uses LIKE for search

I have a table with 80 million records. The structure of the table:
id - auto-increment,
code - an alphanumeric code of 5 to 100 characters,
other fields.
The most used query is
SELECT * FROM table
WHERE code LIKE '%{user-defined-value}%'
The number of queries is growing, as is the record count. Very soon I will have performance problems.
Is there any way to split the table into parts? Or maybe some other ways to optimize the table?

The leading % in the search is the killer here. It negates the use of any index.
The only thing I can think of is to partition the table based on length of code.
For example, if the code that is entered is 10 characters long, then first search the table with 10-character codes without the leading percent sign, then search the table with 11-character codes with the leading percent sign, then the table with 12-character codes with the leading percent sign, and so on.
This saves you from searching through all of the codes that are fewer than 10 characters long and could never match. It also lets you utilize an index for one of the searches (the first one, which has no leading wildcard).
This also will help keep the table sizes somewhat smaller.
You can use a UNION to perform all of the queries at once, though you'll probably want to create the query dynamically.
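A hedged sketch of what that dynamically generated query might look like, assuming the table has been split into per-length tables with hypothetical names codes_len10, codes_len11, and so on, for a 10-character search value:

SELECT * FROM codes_len10 WHERE code LIKE 'ABCDE12345%'   -- same length: no leading %, index usable
UNION ALL
SELECT * FROM codes_len11 WHERE code LIKE '%ABCDE12345%'  -- longer codes: full scan, but a much smaller table
UNION ALL
SELECT * FROM codes_len12 WHERE code LIKE '%ABCDE12345%';
-- ... and so on, up to the maximum code length

UNION ALL is enough here (and cheaper than UNION), because the per-length tables are disjoint and no deduplication is needed.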
You should also take a look to see if FULLTEXT indexing might be a better solution.
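For the record, a minimal FULLTEXT sketch (table and column names taken from the question; the ft_code index name is an assumption). Note that FULLTEXT matches whole tokens or token prefixes, not arbitrary substrings, so it only helps if users tend to search for complete codes or code prefixes:

ALTER TABLE `table` ADD FULLTEXT INDEX ft_code (code);

SELECT * FROM `table`
WHERE MATCH(code) AGAINST('+ABCDE12345*' IN BOOLEAN MODE);  -- prefix match on a token

With a 5-character minimum code length, the default minimum token sizes (innodb_ft_min_token_size = 3, ft_min_word_len = 4) should already be small enough.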

Some thoughts:
You can split the table into multiple smaller tables based on a certain condition - for example, on id, or on code, or on any other field. It basically means that you keep a certain type of record in one table and put the other types into other tables.
Try MySQL partitioning (see the sketch after this list).
If possible, purge older entries, or at least consider moving them to an archive table.
Instead of LIKE, consider using REGEXP for regular-expression searches.
Rather than running SELECT *, select only the columns you need: SELECT id, code, ...
I'm not sure whether this query backs a search feature in your application where a user-entered value is compared with the code column and the results are shown to the user. If it does, you could add options to the search, such as asking the user whether they want an exact match or a starts-with match. That way you do not need to run a LIKE with a leading wildcard every time.
This should have been the first point, but I assume you already have the right indexes on the table.
Try to make more use of the query cache. The best way to use it is to avoid frequent updates to the table, because on each update the query cache is cleared. The fewer the updates, the more likely it is that MySQL caches the queries, which means quicker results.
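Picking up the partitioning suggestion from the list, here is a minimal sketch, assuming MySQL 5.7+ and the layout from the question (table name, column names, and partition boundaries are assumptions). Partitioning by code length lets the optimizer skip partitions whose codes are too short to ever match; note that the partition key must be part of the primary key:

CREATE TABLE codes (
  id BIGINT NOT NULL AUTO_INCREMENT,
  code VARCHAR(100) NOT NULL,
  -- stored generated column, so the length can serve as the partition key
  code_len TINYINT UNSIGNED AS (CHAR_LENGTH(code)) STORED,
  PRIMARY KEY (id, code_len)
)
PARTITION BY RANGE (code_len) (
  PARTITION p20  VALUES LESS THAN (21),
  PARTITION p40  VALUES LESS THAN (41),
  PARTITION p60  VALUES LESS THAN (61),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- The extra length predicate is what enables partition pruning:
SELECT * FROM codes
WHERE code LIKE '%ABCDE12345%'
  AND code_len >= CHAR_LENGTH('ABCDE12345');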
Hope the above helps!

Related

Get MySQL full-text match score for strings not in the table (optimally in a mixed result set with matches from the table)?

This must be a niche scenario, since I have not been able to find a similar question around, and in my brief testing in MySQL Workbench, just using the string in place of the column name did not work.
eg:
SELECT MATCH ('fork') AGAINST ('user entered text about forks' IN NATURAL LANGUAGE MODE);
Doesn't work...
I have a query that returns matches on a full-text index with the relevance score as one of the returned columns. In this app, I am looking for "search suggestions" in a suggestions table that is built off the website's search index content. The user side also stores everything they search for in their local browser storage.
Currently, I have front end code that uses regex to pull matches from their local storage search history (up to 5) and then sends what they typed (as they type) to the back end to get the best matches from the suggestions table.
The way it works now is that the (up to 5) history matches are shown first, then the rest are filled in, up to 10 total matches, from the back end. What I would prefer is to send the history matches to the back end and include them in the full-text match query in some way, so that the result set contains all matched suggestions from the table plus the history matches sent from the front end, all sorted by the full-text relevance score to get them in order of relevance. The new way may result in no history matches showing, or it might result in more than 5 history matches showing; it would all boil down to the relevance score.
Is something like this possible? The only other way I could imagine doing this is to somehow create a temporary table with a full-text index on the fly, join that table in my current query, and then remove the temp table when it's done. The problem with that, in my mind, is that this is all happening in real time as the user types, so I don't want to add something like that if it's going to bog down the response time. Is there a fast/optimal way of doing this? Is there a way that would also remove the temporary table when the query ends?
Or is there some other command that can just give me a score for a string value against what the user typed in, like what I tried above?
EDIT:
It looks like my temporary table idea could work:
https://dev.mysql.com/doc/refman/8.0/en/create-temporary-table.html
I'll just have to see what kind of performance impact this has. I'm still interested to hear thoughts on whether this is the best / only way, or if there is a better one.
The CREATE TEMPORARY TABLE route was the way to go here. I tested it out and it's working.
Worthy of note for future travelers: I had to switch my main table from InnoDB to MyISAM for this to work. I was able to mix/match the MyISAM temp table with the InnoDB main table, but the scoring algorithms are different, so the InnoDB matches were taking priority due to higher scores. This was not an issue for me, as I did not really need or use transactions for the primary suggestions table, so I just made both tables use the MyISAM engine.
Another item of note is that I had to split the user's query into words, encapsulate them in "*", and run the match as a boolean search instead of natural language, because in the case of the temp table a user would likely have entered similar searches, which meant most of the words appeared in more than 50% of the rows, so no matches were returned. Boolean search works around this. Again, not a big deal for my particular use case.
Had I needed to stay on InnoDB, it would have been a problem, because from what I can tell there is no way to create a full-text index on an InnoDB temporary table.
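For future travelers, a minimal sketch of the temporary-table approach described above (table and column names are assumptions):

CREATE TEMPORARY TABLE tmp_history (
  phrase VARCHAR(255) NOT NULL,
  FULLTEXT KEY ft_phrase (phrase)
) ENGINE=MyISAM;

INSERT INTO tmp_history (phrase) VALUES
  ('user entered text about forks'),
  ('another history entry');

-- Boolean mode sidesteps the 50% natural-language threshold that bites on tiny tables.
SELECT phrase,
       MATCH(phrase) AGAINST('+fork*' IN BOOLEAN MODE) AS score
FROM tmp_history
WHERE MATCH(phrase) AGAINST('+fork*' IN BOOLEAN MODE)
ORDER BY score DESC;

DROP TEMPORARY TABLE tmp_history;  -- optional; temporary tables are dropped when the session ends

From here, the history scores can be combined with the main suggestions query (e.g. via UNION) and the whole result set sorted by score, as described above.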

How to use index for VARCHAR field in MySQL with LIKE and partial match?

I have a MySQL table with 42 million records in it.
"name" field in this table has index on it (NOT UNIQUE and NOT PK)
If I use
SELECT x FROM table WHERE name='asdef'
it uses index and I get the result quickly.
If I use
SELECT x FROM table WHERE name LIKE '%sd%'
it does not use the index, even if I use FORCE INDEX or USE INDEX.
I am absolutely in need of doing partial matching. How can I do this while keeping my field as VARCHAR?
Well, you have a problem. And, SQL may provide some tools, but they may not solve your problem.
First, is your "partial" search really for a word inside a phrase? If so, you can use MySQL full-text search to look for words. You may need to pay attention to the stop-word list and the minimum search length to make it work for your data.
Second, are the names repeated throughout the table? If so, then normalization will help (see the sketch after this list). For instance, if there are 50 thousand distinct names among the 42 million records, searching through the 50 thousand is much more feasible.
Third, are there a handful of finite terms that you are looking for? If so, then you can add flags into the table that are maintained via triggers.
Fourth, how wide are the rows, independent of name? If the rows are wide, you can make the exhaustive search more efficient by storing name in a separate table that fits and stays in memory.
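A hedged sketch of the normalization idea from the second point, with assumed table and column names (names, big_table, name_id); the wildcard scan then touches ~50 thousand distinct names rather than 42 million rows:

CREATE TABLE names (
  name_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name    VARCHAR(255) NOT NULL,
  UNIQUE KEY uk_name (name)
);

-- the big table stores name_id instead of repeating the name text
SELECT t.x
FROM names n
JOIN big_table t ON t.name_id = n.name_id
WHERE n.name LIKE '%sd%';   -- still a scan, but only over the small names table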

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (I don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
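A minimal sketch of that random-column trick against the table from this question (the rand_val column and index name are assumed additions); note it yields one random row per probe:

ALTER TABLE tbl_mst_users
  ADD COLUMN rand_val DOUBLE NOT NULL DEFAULT 0,
  ADD INDEX idx_rand (rand_val);

UPDATE tbl_mst_users SET rand_val = RAND();  -- regenerate periodically to fix bad distribution

-- One indexed probe, one random row; repeat up to 5 times for 5 names.
SELECT userid, name, pic
FROM tbl_mst_users
WHERE rand_val >= RAND()
ORDER BY rand_val
LIMIT 1;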

Using MySQL to search through large data sets?

Now, I'm a really advanced PHP developer and well versed in small-scale MySQL sets; however, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know the best way to search through large data sets (currently 84.9 million rows, with a database size of 394.4 gigabytes). It is hosted on Amazon RDS, so it does not have any downtime or slowness; I just want to know the best way to access such large data sets internally.
For example, searching through the 84 million rows takes 6 minutes, while a direct request for a specific id or title is served instantly. So how should I search through a large data set?
Just to remind you: it's fast to look up information by a single key, but searching performs VERY slowly.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.*
FROM users u
LEFT JOIN friends f ON u.user_id = (f.friend_from || f.friend_to)
WHERE u.user_name LIKE ('%james%smith%')
GROUP BY u.signed_up
LIMIT 0, 100
That query over 84m rows is significantly slow - specifically, 47.41 seconds to perform standalone. Any ideas, guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets compared to something like Oracle or MSSQL; however, I've been told to rebuild it on MySQL rather than on another database solution at this moment.
LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is algorithmically complex business compared to regular operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, so that an index on that field can be used.
Consider making an index table (in the literature/library sense of the word "index", not the database-index sense) if you search for whole words, or a substring lookup table if you search for random, often-repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time the data is inserted or updated, so that the cost is amortised across all reads. This can be done by adding two new (indexed) columns and using insert/update triggers.
Or, if you want all the words in the column, have the trigger break the data into words, then use an application-level index table to find relevant records, something like:
CREATE TABLE main_table (
  id   INTEGER PRIMARY KEY,
  -- other columns ...
  text VARCHAR(60)
);

CREATE TABLE appl_index (
  id   INTEGER,      -- references main_table.id
  word VARCHAR(20),
  PRIMARY KEY (id, word),
  INDEX (word)
);
Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable LIKE '%...'. You could also break the actual words out into a separate table and use word IDs, but that's a matter of taste - its effect on performance would be questionable.
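A short sketch of that lookup against the appl_index table defined above:

-- ids whose text contains both words; add more words and raise the count as needed
SELECT id
FROM appl_index
WHERE word IN ('james', 'smith')
GROUP BY id
HAVING COUNT(DISTINCT word) = 2;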
You may well have a similar problem with f.friend_from||f.friend_to, but I've not seen that syntax before (it seems the intent is that u.user_id can match either one).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!

How do I search part of a column?

I have a MySQL table containing 40 million records that is populated by a process over which I have no control. Data is added only once a month. This table needs to be searchable by the Name column, but the Name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName, \
    substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname, \
    substring_index(OwnersName,' ',1) as lastname \
    FROM table1
How do I use Sphinx to search by firstname and/or lastname? For example, I would like to be able to search for 'Smith' in only the first name.
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), adding new columns called OwnersFirstName and OwnersLastName, along with an update/insert trigger which extracts the relevant information from OwnersName and populates the new columns appropriately.
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most problems people have with databases is the speed of their queries. Wasting a bit of disk space to gain a large amount of performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database, then query yours. That may involve a fair bit of data transfer every month, however, so the first option is the better one, if allowed.
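A minimal sketch of the update/insert trigger idea, assuming the table1 name from the question, the 'Last First Middle' format, and hypothetical column and index names (an UPDATE trigger would mirror it):

ALTER TABLE table1
  ADD COLUMN OwnersLastName  VARCHAR(100),
  ADD COLUMN OwnersFirstName VARCHAR(100),
  ADD INDEX idx_last  (OwnersLastName),
  ADD INDEX idx_first (OwnersFirstName);

CREATE TRIGGER table1_bi BEFORE INSERT ON table1
FOR EACH ROW
  SET NEW.OwnersLastName  = SUBSTRING_INDEX(NEW.OwnersName, ' ', 1),
      NEW.OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(NEW.OwnersName, ' ', 2), ' ', -1);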
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: #firstname Smith.
You could use substring functions to get the parts of the field that you want to search in, but that will slow down the process: the query cannot use any kind of index for the comparison, so it has to touch each record in the table.
The best option is not to store several values in the same field, but to put the name components into three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem, because full names can contain prefixes, suffixes, middle names or no middle name, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.