Is there an efficient string matching algorithm in MySQL? - mysql

Is there an implementation of a fast string matching algorithm for searching keywords in MySQL? For example Aho-Corasick or any other fast string matching algorithm.
Typically Aho-Corasick is implemented in Java or any other compiled language but it should be possible to write it as a stored procedure in MySQL.
Thanks!

As stored procedures are Turing-complete, and you can use a "cursor" to loop through the records in a table (possibly filtered by an existing "WHERE" clause), you can do it in a stored procedure.
A stored function would also be possible.
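Here is a minimal sketch of that cursor-based skeleton, assuming a hypothetical documents table with a body column; the actual matching logic (e.g. an Aho-Corasick routine) would have to be written inside the loop:
DELIMITER //
CREATE PROCEDURE scan_documents()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_body TEXT;
  DECLARE cur CURSOR FOR SELECT body FROM documents;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_body;
    IF done THEN LEAVE read_loop; END IF;
    -- run the keyword-matching logic against v_body here
  END LOOP;
  CLOSE cur;
END//
DELIMITER ;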
However, the MySQL stored-routine language is so terrible both in terms of programmer-usability and performance, that the result is unlikely to be easy or fast.
So you might be better off writing a MySQL UDF (which you can write in any language, provided you can make it look like a C library) and having that do it instead.
Consider your specific requirements. I am assuming that a query with lots of "OR col LIKE ..." tacked together is too inefficient for you, as you wish to match thousands of patterns at once, right?
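For reference, the naive approach I am assuming is too slow would look roughly like this (table and column names are hypothetical):
SELECT id, body
FROM articles
WHERE body LIKE '%keyword1%'
   OR body LIKE '%keyword2%'
   OR body LIKE '%keyword3%';
-- ...and so on for every pattern, which rescans each row once per pattern.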

Related

MySQL - Best methods to provide fast Dynamic filter support for large-scale database record lists?

I am curious what techniques Database Developers and Architects use to create dynamic filter data response Stored Procedures (or Functions) for large-scale databases.
For example, let's take a database with millions of people in it, and we want to provide a stored procedure "get-person-list" which takes a JSON parameter. Within this JSON parameter, we can define filters such as $.filter.name.first, $.filter.name.last, $.filter.phone.number, $.filter.address.city, etc.
The frontend (web solution) allows the user to define one or more filters, so the front-end can say "Show me everyone with a First name of Ted and last name of Smith in San Diego."
The payload would look like this:
{
  "filter": {
    "name": {
      "last": "smith",
      "first": "ted"
    },
    "address": {
      "city": "san diego"
    }
  }
}
Now, what would the best technique be to write a single stored procedure capable of handling numerous (dozens or more) filter settings (dynamically) and returning the proper result set all with the best optimization/speed?
Is it possible to do this with CTE, or are prepared statements based on IF/THEN logic (building out the SQL to be executed based on filter value) the best/only real method?
How do big companies with huge databases and thousands of users write their calls to return complex dynamic lists of data as quickly as possible?
Everything Bill wrote is true, and good advice.
I'll take it a little further. You're proposing building a search layer into your system, which is fine.
You're proposing an interface in which you pass a JSON object to code inside the DBMS. That's not fine. That code will either have a bunch of canned queries handling the various search scenarios, or will have a mess of string-handling code that reads the JSON, puts together appropriate queries, then uses MySQL's PREPARE statement to run them. From my experience that is, with respect, a really bad idea.
Here's why:
1. The stored-procedure language has very weak string-handling support compared to host languages. No sprintf. No arrays of strings. No join or implode operators. Clunky regex, and not always present on every server. You're going to need string handling to build search queries.
2. Stored procedures are trickier to debug, test, deploy, and maintain than ordinary application code. That work requires special skills and special access.
3. You will need to maintain this code, especially if your system proves successful. You'll add requirements that will require expanding your search capabilities.
4. It's impossible (seriously, impossible) to know what your actual application usage patterns will be at scale. As a consequence of growth, you will surely find usage patterns that surprise you. My point is that you can't design and build a search system and then forget about it. It will evolve along with your app.
5. To keep up with evolving usage patterns, you'll need to refactor some queries and add some indexes. You will be under pressure when you do that work: people will be complaining about performance. See points 1 and 2 above.
6. MySQL / MariaDB's stored procedures aren't compiled with an optimizing compiler, unlike Oracle's and SQL Server's. So there's no compelling performance win.
So don't use a stored procedure for this. Please. Ask me how I know this sometime.
If you need a search module with a JSON interface, implement it in your favorite language (php, C#, nodejs, java, whatever). It will be easier to debug, test, deploy, and maintain.
To write a query that searches a variety of columns, you would have to write dynamic SQL. That is, write code to parse your JSON payload for the filter keys and values, and format SQL expressions in a string that is part of a dynamic SQL statement. Then prepare and execute that string.
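As a rough sketch of that technique (assuming MySQL 5.7+ JSON functions; the person table and its columns are illustrative, not from the question):
SET @p_filter = '{"filter": {"name": {"last": "smith"}, "address": {"city": "san diego"}}}';
SET @sql = CONCAT(
  'SELECT * FROM person WHERE 1=1',
  IF(JSON_EXTRACT(@p_filter, '$.filter.name.last') IS NOT NULL,
     CONCAT(' AND last_name = ', QUOTE(JSON_UNQUOTE(JSON_EXTRACT(@p_filter, '$.filter.name.last')))),
     ''),
  IF(JSON_EXTRACT(@p_filter, '$.filter.address.city') IS NOT NULL,
     CONCAT(' AND city = ', QUOTE(JSON_UNQUOTE(JSON_EXTRACT(@p_filter, '$.filter.address.city')))),
     '')
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;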
In general, you can't "optimize for everything." Trying to optimize when you don't know in advance which queries your users will submit is a nigh-impossible task. There's no perfect solution.
The most common method of optimizing search is to create indexes. But you need to know the types of search in advance to create indexes. You need to know which columns will be included, and which types of search operations will be used, because the column order in an index affects optimization.
For N columns, there are N-factorial permutations of columns to index, which is clearly impractical to cover, and MySQL only allows 64 indexes per table anyway. You simply can't create all the indexes needed to optimize every possible query your users attempt.
The alternative is to optimize queries partially, by indexing a few combinations of columns, and hope that these help the users' most common queries. Use application logs to determine what the most common queries are.
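For example, a single composite index aimed at one common filter combination might look like this (table and column names are illustrative):
CREATE INDEX idx_person_last_first_city ON person (last_name, first_name, city);
-- Helps queries that filter on last_name, on last_name + first_name, or on all
-- three, but does nothing for a query that filters only on city.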
There are other types of indexes. You could use fulltext indexing, either the implementation built into MySQL, or else supplement your MySQL database with Elasticsearch or similar technology. These provide a different type of index that effectively indexes everything with one index, so you can search based on multiple columns.
There's no single product that is "best." Which fulltext indexing technology meets your needs requires you to evaluate different products. This is some of the unglamorous work of software development — testing, benchmarking, and matching product features to your application requirements. There are few types of work that I enjoy less. It's a toss-up between this and resolving git merge conflicts.
It's also more work to manage copies of data in multiple datastores, making sure data changes in your SQL database are also copied into the fulltext search index. This involves techniques like ETL (extract, transform, load) and CDC (change data capture).
But you asked how big companies with huge databases do this, and this is how.
Input
I do that "all the time". The web page has a <form>. When submitted, I look for fields of that form that were filled in, then build
WHERE this = "..."
AND that = "..."
into the suitable SELECT statement.
Note: I leave out any fields that were not specified in the form; I make sure to escape the strings.
I'm walking through $_GET[] instead of JSON, so it is quite easy.
INDEXing
If you have a column for each possible field, then it is a matter of providing indexes only for the columns most likely to be searched on. (There are practical and even hard-coded limits on indexes.)
If you have stored the attributes in an EAV table structure, you have my condolences. Search the [entity-attribute-value] tag for many other poor souls who wandered into that swamp.
If you store the attributes in JSON, well that is likely to be an order of magnitude worse than EAV.
If you throw all the information into a FULLTEXT column and use MATCH, then you can get enough speed for "millions" of rows. But it comes with various caveats (word length, stoplist, endings, surprise matches, etc).
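A sketch of that approach, with hypothetical table and column names:
ALTER TABLE products ADD FULLTEXT INDEX ft_description (description);

SELECT id, description
FROM products
WHERE MATCH(description) AGAINST('+blue +widget' IN BOOLEAN MODE);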
If you would like to discuss further, then scale back your expectations and make a list of likely search keys. We can then discuss what technique might be best.

Rails ordering of collections by proc

I want to sort a collection using a custom proc. I know Rails has the order method, but I don't believe this works with procs, so I'm just using sort_by instead. Can someone go into detail about the speed I'm sacrificing, or suggest alternatives? My understanding is that the exact implementation of order will depend on the adapter (which, in my case, is mysql), but I'm wondering if there are ways to take advantage of this to speed the sort up.
As an example, I want to do this:
Model.order(|m| m.get_priority )
but am forced to do this
Model.all.sort_by{|m| m.get_priority}
sort_by is implemented at the Ruby level; it's part of Ruby, not ActiveRecord. Therefore, the sorting will not be executed by the database, but by the Ruby interpreter.
This is not an optimal solution, as a DBMS is generally more efficient at sorting data because it can use existing indexes.
If get_priority performs some sort of computation outside the database, then you don't have many alternatives to the code you posted here, unless you cache the result of get_priority as a column on the Model table and sort on it using the ActiveRecord order statement, which will result in an ORDER BY SQL statement.
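A sketch of that cached-column approach in SQL terms (the models table and cached_priority column are hypothetical; Model.order(:cached_priority) would then generate the final SELECT):
ALTER TABLE models ADD COLUMN cached_priority INT;
CREATE INDEX index_models_on_cached_priority ON models (cached_priority);

SELECT * FROM models ORDER BY cached_priority ASC;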

Parameters in MySQL stored procedure

I want to create a MySQL stored procedure (SP) with input parameters.
However, the number of parameters cannot be determined at the time of writing the SP.
(The scenario is that the users will have multiple options to choose. The options chosen will form the search criteria:
select ...
where prod_category = option1 && option2 && option3 &&...
So, if someone chooses only option1 and option2, only 2 parameters will be sent. Sometimes it may be 50+ options are chosen and hence 50+ parameters will have to be sent.)
So, I have 3 questions:
1. Can I handle such a scenario using MySQL stored procedures (SP)?
2. Is the SP the professional way to handle such scenario?
3. If SP is not the professional way to handle these scenarios, is there anything else that will handle these searches efficiently? The search is the core functionality of my application.
Thanks in advance for any help!
MySQL stored procedures accept only a fixed number of arguments. You can build a delimited list of parameters and values in a single string argument and then parse it inside your procedure, or use your application language to build the query instead.
From http://forums.mysql.com/read.php?98,154749,155001#msg-155001
No, MySQL sprocs accept only a fixed number of arguments. ISO SQL is somewhat optimised for correct RDBMS logic (unless you were to ask EF Codd, CJ Date or Fabian Pascal), but in many ways SQL is quite literal, so if SQL seems to make what you are trying to do very difficult, it could be that your design needs another look, in this case aspects of the design that require repeated multiple ad hoc deletions.
If such deletions are unavoidable, here are three options. First, in an application language build the delete query to include a comma-delimited string of IDs. Second, pass such a string to an sproc that PREPAREs the query using such a string. Third, populate a temp table with the target IDs and write a simple join query that deletes the joined IDs.
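A sketch of the second option (a procedure that PREPAREs a query from a comma-delimited string; the items table is hypothetical, and the string must be validated, e.g. digits and commas only, before being spliced into SQL):
DELIMITER //
CREATE PROCEDURE delete_by_ids(IN p_ids TEXT)
BEGIN
  SET @sql = CONCAT('DELETE FROM items WHERE id IN (', p_ids, ')');
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END//
DELIMITER ;

CALL delete_by_ids('3,17,42');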
There are lots of great reasons to use stored procedures. Here's an article that lists some. Hopefully that will address the "professionalism" question.
As for the passing of parameters, I don't believe you can have a variable list.
A long time ago, I saw it "done" by writing the values to a table and having the stored procedure read them back in. (Use a session_id in the table and then pass that to the procedure).
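A sketch of that parameter-table pattern (the search_params table and the run_search procedure are hypothetical names):
CREATE TABLE search_params (
  session_id  VARCHAR(64),
  param_name  VARCHAR(64),
  param_value VARCHAR(255)
);

-- The application inserts one row per chosen option, then passes the
-- session_id so the procedure can read its own parameters back:
INSERT INTO search_params VALUES ('abc123', 'prod_category', 'books');
INSERT INTO search_params VALUES ('abc123', 'brand', 'acme');
CALL run_search('abc123');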
As for "efficiency", it depends on your definition. There might be a slight speed benefit to the stored procedures, but I wouldn't worry about that. What did you mean?

select * math operations

Is it possible to select columns and do complex operations on them, for example select factorial(column1) from table1 or select integral_of(something) from table2?
Perhaps there are libraries that support such operations?
Yes, you can call any of your DB's pre-defined functions on the selected columns, and you can use CREATE FUNCTION to define your own.
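For instance, MySQL has no built-in factorial, but a stored function along these lines (a sketch, not tuned for production) could supply one:
DELIMITER //
CREATE FUNCTION factorial(n INT) RETURNS BIGINT DETERMINISTIC
BEGIN
  DECLARE result BIGINT DEFAULT 1;
  WHILE n > 1 DO
    SET result = result * n;
    SET n = n - 1;
  END WHILE;
  RETURN result;
END//
DELIMITER ;

SELECT factorial(column1) FROM table1;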
But DBs are meant to wade through huge amounts of data, not to do complex calculations on them. If you try this, you'll find that many operations are awfully slow (especially the user-defined ones).
Which is why most people fetch the data from the database and then do the complex math on the application side. This also makes it simpler to test and optimize the code, or to replace it with a new version.
Yes, it is. If the function you want is not built into your RDBMS, you can write your own User Defined Functions.
You'll find an example here: http://www.15seconds.com/Issue/000817.htm.

MySQL - Convert Number to English Representation

Is there a particularly easy way to convert a number like "21.08" into "Twenty One and 08/100" using MySQL?
I know in Oracle there was a trick you could use that involved working with Julian dates. It would get the job done in a line or so but it doesn't appear to work in MySQL (since it doesn't support Julian dates).
It's not a particularly hard problem in a "real" programming language but the thought of writing it out as a stored procedure or function is dreadful.
Curious as to why you're doing this at the database layer instead of at the presentation layer...
If you really, really want to do this with MySQL, you could create two lookup tables called e.g. "ones" and "tens" that store the English representations, and then perform a query on each digit. Extract the digits by casting the number to a string and iterating backward from the decimal point, then performing a lookup in the appropriate table. Perhaps a third table could be used to supply strings like "Hundred", "Thousand", etc.
That's the most straightforward solution I can see, but it's going to be painful to write and probably quite brittle when it comes to internationalization. Also, it clutters the schema with lookup tables that don't have anything to do with your data.
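Just to illustrate the idea with the "21.08" example (a sketch only; the cents part is hard-coded here, and teens, hundreds, etc. would need more work):
CREATE TABLE ones (digit INT PRIMARY KEY, word VARCHAR(16));
INSERT INTO ones VALUES (0,'Zero'),(1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),
                        (5,'Five'),(6,'Six'),(7,'Seven'),(8,'Eight'),(9,'Nine');

CREATE TABLE tens (digit INT PRIMARY KEY, word VARCHAR(16));
INSERT INTO tens VALUES (2,'Twenty'),(3,'Thirty'),(4,'Forty'),(5,'Fifty'),
                        (6,'Sixty'),(7,'Seventy'),(8,'Eighty'),(9,'Ninety');

SELECT CONCAT(t.word, ' ', o.word, ' and 08/100') AS amount_in_words
FROM tens t, ones o
WHERE t.digit = FLOOR(21 / 10) AND o.digit = 21 % 10;
-- returns 'Twenty One and 08/100'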
Maybe writing a User-Defined Function (UDF) would be a better solution, though I imagine it will still be pretty time-consuming.