MYSQL Forced pattern match - mysql

I am trying to figure out whether it is possible to force MySQL to treat certain specified strings as identical when matching rows in a SELECT query.
For example, I have a column containing the word "trachiotomy", but due to the nature of the language it is very likely that the search query will be "trahiotomy" (notice the missing c).
Is there any way I can force the query to treat one pattern of letters as equivalent to another?
For example, to match any occurrence of the letter sequence "ach" within a word against "ah" as well, and vice versa. In essence, force the match regardless of how the word was written.
Another example would be the word Archon, which I would like to match with Arhon as well.
So if the user input was Archon, it would match the database value Arhon, and vice versa.
I experimented with soundex a bit and it does match some instances, but because of the way the algorithm works it cannot do so when the differing letters are at the beginning of the word.
For instance, the word "Chorevo" cannot match the word "Horevo" unless I can somehow force it to consider "chor" equal to "hor", and vice versa, in any word.
I am reading up on REGEXP to see whether it can do this somehow (something like
REGEXP 'arch', 'arh').
At this point I am using a full text match query, but could change that if it proves to be a problem.
I am not sure I have made this clear but would appreciate any help possible.

This is known as phonetic matching. MySQL implements a relatively primitive version of this in the SOUNDEX(str) function and the a SOUNDS LIKE b operator (which is just shorthand for SOUNDEX(a) = SOUNDEX(b)). By nature such matching is language-specific, and the MySQL implementation is designed for English words and thus may not work in your situation.
Alternatively you could research/write your own transformation that does what you want and apply it to the data before saving in the database (in a separate column or table).
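A minimal sketch of such a transformation, assuming Python is used to prepare the data. The function name `search_key` and the single "ch" to "h" rule are illustrative assumptions, not a complete rule set for any language. The idea is to store the key in a separate column and to apply the same function to the user's input before comparing.

```python
import re

# Hypothetical rewrite rules: each spelling variant collapses to one canonical form.
# Only the "ch" -> "h" rule from the question is shown; a real rule set would be larger.
RULES = [
    (re.compile(r"ch"), "h"),
]


def search_key(word: str) -> str:
    """Return a canonical form so that variant spellings compare equal."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = pattern.sub(replacement, key)
    return key


if __name__ == "__main__":
    for stored, typed in [("Archon", "Arhon"), ("trachiotomy", "trahiotomy"), ("Chorevo", "Horevo")]:
        print(stored, typed, search_key(stored) == search_key(typed))
```

Because both sides go through the same function, the comparison in the query becomes a plain equality (or a LIKE) on the key column, which can be indexed.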

Related

Is wildcard LIKE more performant than a multiple boolean search in MySQL?

I have been building an API for a website, and the objects I am searching for have a LOT of true/false fields. Instead of creating a huge db structure to manage the options, I thought about serializing them in a string similar to '001001000010101001' where 1 is true and 0 is false (I am talking about 100 different options). The other reason I am doing this is to have a clean database so that all of those fields get grouped in a single field (I already have a serializer/deserializer).
Now in the search function, since not all of the options get searched at the same time, I was thinking about using a LIKE statement with wildcards.
For example I would do something like this:
WHERE options LIKE '1_1__1__1___1%' (The final wildcard is to reduce the number of _ wildcards so that only the beginning of the pattern gets matched. I would stop at the last checked options to check and % wildcard the rest).
Would this (on average because sometimes there might be 2 or 3 parameters selected and many times there might be all of them) be more performant than a multiple series of AND xxx AND XXX AND ....?
Or is there a way more efficient (and clean to maintain) way of doing this that I am completely missing?
Let me discuss some of the items you bring up.
I understand the 0/1 approach. I find it interesting that you chose to go with strings instead of numbers. For up to 64 true/false values, a BIGINT (8 bytes) or something smaller would work.
Did you also want to test for false for some flags? Your string mechanism makes that rather obvious. (Your example should include a 0.)
Searching with LIKE will be efficient only if there is no leading wildcard (_ or %). So, I would expect most of your queries to involve a full table scan. Your serializer, plus the b prefix would work for setting the strings.
The integer approach that I am countering with would involve & and other boolean operations to test. This would probably be noticeably faster. This would also necessitate a full table scan.
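To make the integer approach concrete, here is a small sketch in Python. The function names and the choice that the leftmost character is bit 0 are assumptions made for illustration. It converts a LIKE-style pattern such as '1_0' into two masks: the flags that must be set and the flags that must be clear.

```python
def to_mask(bits: str) -> int:
    """Convert a serialized string such as '1011' to an integer.

    Assumption: the leftmost character is option 0 (the lowest bit).
    """
    mask = 0
    for position, char in enumerate(bits):
        if char == "1":
            mask |= 1 << position
    return mask


def pattern_to_masks(pattern: str) -> tuple[int, int]:
    """Split a pattern of '1', '0' and '_' into (must_be_set, must_be_clear)."""
    must_be_set = 0
    must_be_clear = 0
    for position, char in enumerate(pattern):
        if char == "1":
            must_be_set |= 1 << position
        elif char == "0":
            must_be_clear |= 1 << position
    return must_be_set, must_be_clear


def matches(options: int, must_be_set: int, must_be_clear: int = 0) -> bool:
    """Same test a query could express as:
    (options & must_be_set) = must_be_set AND (options & must_be_clear) = 0
    """
    return (options & must_be_set) == must_be_set and (options & must_be_clear) == 0


if __name__ == "__main__":
    wanted_set, wanted_clear = pattern_to_masks("1_0")
    print(matches(to_mask("110"), wanted_set, wanted_clear))
    print(matches(to_mask("111"), wanted_set, wanted_clear))
```

With roughly 100 options, a single 64-bit column is not enough, so the flags would have to be split across two integer columns and tested separately.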
If there are other columns being tested in the WHERE clause, let's see them. They will be important to performance and indexing.
Using numbers instead of strings would be slightly faster because of smaller size and faster testing. But not a lot.
You can get some of the benefits of numeric by using the SET datatype. (Again limited to 64 independent flags per column.) The advantage is that you talk to the database via strings, not cryptic bit positions.
If this is a real estate app, consider a few columns like: kitchen, bedrooms, unusual_items (second dishwasher, jacuzzi), etc. No matter how it is implemented (string, integer, SET), this suggestion won't impact performance much.
Another performance note, again with a real estate model: since #bedrooms is almost always a criterion, make it a column by itself. This may allow for some use of it in a composite index.

How to get same result as following Mysql query from Solr?

MySQL query: The inner query returns every attribute_value containing "man" along with the position of "man" in that value. The outer query sorts the rows in ascending order of that position, so the results start with values where "man" is in the 1st position and move on to values where it appears later, like:
man
manager
aman
human
hanuman
assistant manager
indian institute of management
This is the SQL query:
SELECT f1.av
FROM (
SELECT `attribute_value` av, LOCATE("man",LOWER(`attribute_value`)) po
FROM db_attributes WHERE `attribute_value` LIKE "%man%"
) f1
ORDER BY f1.po
I want to achieve this using solr. Right now I am clueless about how to achieve this. Solr is loaded with all attribute values. Help is greatly appreciated.
This question is about how to do partial string matching that is NOT left-anchored. This may be some misunderstanding of what Solr (and any index) provides and what it does not provide.
You can do this query in mysql because it is computed at execution time, at the cost of examining every row. But it is unnatural to attempt this query in Solr because the entire point of an index is to minimize cost at execution time and NOT touch every record. I.E., the index wants to precompute a subset for a given potential input.
Consider: your two basic fieldTypes for this are string and text. String only supports exact matching. Text does tokenizing and stemming. Do you want a search for "ignition" to match "ignite"? It appears you do not, since you are not treating the input as a word or word-stem, but rather as a string.
In that case, you probably want to look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory, which can be used to produce all the left-anchored substrings of given tokens. By using a second field, you can also have EdgeNGramFilterFactory produce right anchored substrings (then search both for matches). But this is not the same as producing all possible substrings as your example usage suggests.
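To illustrate what edge n-grams cover and what they miss, here is a plain Python sketch. It is not Solr's implementation; the function name and the `min_gram`, `max_gram` and `side` parameters are assumptions chosen for the example.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 15, side: str = "front") -> list[str]:
    """Return the substrings of `token` anchored at its start ("front") or its end ("back")."""
    grams = []
    shortest = max(1, min_gram)          # a zero-length gram is meaningless
    longest = min(max_gram, len(token))  # cannot be longer than the token itself
    for size in range(shortest, longest + 1):
        grams.append(token[:size] if side == "front" else token[-size:])
    return grams


def indexed_match(term: str, token: str) -> bool:
    """True if `term` equals a front- or back-anchored gram of `token`."""
    return term in edge_ngrams(token) or term in edge_ngrams(token, side="back")


if __name__ == "__main__":
    print(edge_ngrams("human", min_gram=2))
    print(edge_ngrams("human", min_gram=2, side="back"))
    for token in ["manager", "human", "humanity"]:
        print(token, indexed_match("man", token))
```

A term in the middle of a token, as with "man" in "humanity", is found by neither the front nor the back grams, which is the gap between this approach and the `LIKE "%man%"` query.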
As for the resultset order, you would have to define a relevance that sorts the way you want. That probably means a separate string field with high score for exact match and the atomized field for matching at a lower relevance.
In short, you probably should not be thinking of reproducing these particular mysql queries exactly in Solr. I would push for clarification or redefinition of the use case (left or right anchoring).

To keep periods in acronyms or not in a database?

Acronyms are a pain in my database, especially when doing a search. I haven't decided if I should accept periods during search queries. These are the problems I face when searching:
'IRQ' will not find 'I.R.Q.'
'I.R.Q' will not find 'IRQ'
'IRQ.' or 'IR.Q' will not find 'IRQ' or 'I.R.Q.'
etc...
The same problem goes for ellipses (...) or three series of periods.
I just need to know what directions should I take with this issue:
Is it better to remove all periods when inserting the string to the database?
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
My responses for each question:
Is it better to remove all periods when inserting the string to the database?
Yes and no. You want the database to have the original text. If you want, create a separate field that is "cleaned up" to search against. Here, you can remove periods, make everything lowercase, etc.
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
/\.+/
That finds one or more periods in a given spot. But you'll want to integrate it with your search formula.
Note: regex on a database isn't known to have high performance. Be cautious with this.
Other note: you may want to use FullText search in MySQL. This also isn't known to perform well with data sets over 1000+ entries. If you have big data and need fulltext search, use Sphinx (available as a MySQL plug-in and RAM-based indexing system).
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
Yes, by having the 2 fields I described in the first bullet's answer.
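A minimal sketch of how the "cleaned up" search field could be produced, assuming Python on the application side. The function name `clean_for_search` is hypothetical, and turning an ellipsis into a space (rather than deleting it) is a design choice made here so that neighbouring words do not get glued together.

```python
import re


def clean_for_search(text: str) -> str:
    """Build the value for the separate search column; the original text is stored untouched."""
    cleaned = re.sub(r"\.{2,}", " ", text)   # ellipsis or any run of periods -> word boundary
    cleaned = cleaned.replace(".", "")       # single periods (acronyms, sentence ends) -> dropped
    cleaned = re.sub(r"\s+", " ", cleaned)   # collapse whitespace left behind
    return cleaned.strip().lower()


if __name__ == "__main__":
    for sample in ["I.R.Q.", "IRQ", "IR.Q", "wait...what"]:
        print(repr(sample), "->", repr(clean_for_search(sample)))
```

The same function would be applied to the user's search input, so that 'IRQ', 'I.R.Q' and 'IR.Q' all end up compared as the same key.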
You need to consider the sanctity of your input. If it is not yours to alter then don't alter it. Instead you should have a separate system to allow for text searching, and that can alter the text as it sees fit to be able to handle these types of issues.
Have a read up on Lucene, and specifically Lucene's standard analyzer, to see the types of changes that are commonly carried out to allow successful searching of complex text.
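I think you can use the REGEXP operator of MySQL to search for an acronym. Note that MySQL patterns take no delimiters (a # would be matched literally), and the backslash has to be doubled inside a SQL string literal:
SELECT col1, col2...coln FROM yourTable WHERE colWithAcronym REGEXP 'I\\.?R\\.?Q\\.?'
If you use PHP you can build your regexp with this simple loop:
$result = "";
// str_split() turns the acronym string into an array of single characters
foreach (str_split($yourAcronym) as $char) {
    // append the letter followed by an optional literal period: \.?
    $result .= $char . "\\.?";
}
// Pass $result as a bound parameter; if you paste it into a SQL literal instead, double the backslashes again.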
The functionality you are searching for is a fulltext search. MySQL supports this for MyISAM tables, but not for InnoDB in the version documented here (InnoDB gained FULLTEXT indexes in MySQL 5.6). (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html)
Alternatively you could go for an external framework that provides that functionality. Lucene is a popular open-source one. (lucene.apache.org)
There would be 2 methods:
1. Save the data with the symbols removed from the text and match accordingly.
2. Use a regex, for example:
select * from table where acronym regexp '^[A-Z]+[.]?[A-Z]+[.]?[A-Z]+[.]?$';
Please note, however, that this requires the acronym to be stored in uppercase. If you don't want the case to matter, just change [A-Z] to [A-Za-z].

How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?

I was granted with the beautiful task ;-) to design some tables in a MySQL Database which should hold human names.
Criteria:
I have only the full names. (There is no separation for e.g. prename, surname and so on)
The storage should be diacritic sensitive. (The following names stand for different persons)
"Voss" and "Voß".
"Joel" and "Joël".
"franc" and "Franc" and "Fránc".
A search should return all names similar to the search string. E.g.: a search for "franc" should return ["franc", "Franc", "Fránc"] and so on... (It would be awesome if the search returned not only the diacritic-insensitive matches but perhaps also similar-sounding names or names that partially match the search string...)
I thought of using the collation utf8_bin for the column (declared as unique) in which I will store the names. This would satisfy point two, but it would hurt point three. Declaring the column name as unique with collation utf8_unicode_ci satisfies point three, but it hurts point two.
So my question is: is there a way to solve this task while respecting all criteria? And since I don't want to reinvent the wheel: is there an elegant way to handle human names (and searches for them) in databases? (Sadly, I do not have the possibility of splitting the names into first names, surnames and optional middle names...)
Edit:
The number of names is around a million (~1,000,000) entries. And if it matters: I am using Python as the scripting language to populate the database and query the data later on.
What is useful is if you can decompose the full name into component "name words" and store a phonetic encoding (metaphone or one of the many other choices) for each of them. You just need the notion of name words though, not specifically categorizing each as first or middle or last (which is fine, because those categories don't work well across cultures anyway). But you can use positional order information later in ranking if you want, so that searching for "Paul Carl" matches "Paul Karl" better than it matches "Carl Paul". You need to be aware of ambiguous punctuation that may require storing multiple versions of some name words. For instance, Bre-Anna Heim would be broken into the name words "bre", "anna", "breanna" and "heim". Sometimes the dash is irrelevant, as in Bre-Anna, but sometimes not, as in "Sally-June". Bre-Anna never uses just Bre or Anna, but Sally-June may sometimes use just Sally or just June. It's hard to know which, so cover both possibilities.
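The decomposition into name words can be sketched as follows. This is an illustration in Python under the assumption that names are split on whitespace and hyphens only; the function name `name_words` is hypothetical, and the phonetic encoding step (metaphone or similar) is left out because it would come from a separate implementation applied to each returned word.

```python
import re


def name_words(full_name: str) -> list[str]:
    """Split a full name into lowercase name words.

    For a hyphenated chunk, keep the individual parts and also the joined form,
    since it is not known in advance whether the hyphen is significant.
    """
    words = []
    for chunk in full_name.lower().split():
        parts = [part for part in re.split(r"-+", chunk) if part]
        words.extend(parts)
        if len(parts) > 1:
            words.append("".join(parts))
    return words


if __name__ == "__main__":
    print(name_words("Bre-Anna Heim"))
    print(name_words("Maria de los Angeles Gomez-Rodriguez"))
```

Each name word would then be stored in its own row (together with its position in the full name, if ranking by order is wanted) next to its phonetic code.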
You can write your query against this by similarly decomposing and phonetically encoding the full name you're searching for. Your query can return, say, those full names that have two or more component name phonetic matches (or one if there is only one name in the search or the source). This gives you a subset of full names to consider further. You could come up with a simple ranking of them, or even do something like a distance matching algorithm on this subset, which would be too expensive computationally to do against the entire million names. When I say distance matching, I'm talking on-line algorithms like Levenshtein distance and the like.
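As a sketch of the second stage, the following Python computes the Levenshtein distance with the usual two-row dynamic programming table and uses it to order an already reduced candidate list. The `rank` helper and its case-folding are assumptions for the example; in practice the candidates would come from the phonetic pre-filter described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to use less memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def rank(query: str, candidates: list[str]) -> list[str]:
    """Order candidates from closest to farthest, ignoring case."""
    return sorted(candidates, key=lambda name: levenshtein(query.lower(), name.lower()))


if __name__ == "__main__":
    print(rank("Paul Carl", ["Carl Paul", "Paul Karl"]))
```

The cost of one comparison grows with the product of the two string lengths, which is why it is applied to the small candidate subset rather than to all of the names.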
(edit) The reasoning for this is handling cases like the following name: Maria de los Angeles Gomez-Rodriguez. One data entry person may just enter Maria Gomez. Another might enter Maria Gomez Rodriguez. Yet another might enter Maria Angeles Rodrigus.
You can use an algorithm like Metaphone (or Double Metaphone) in another column so that you can try to find names that are "similar" to each other. You will have to look for an international version that knows about the German eszett (ß) character.

MySQL Fulltext search but using LIKE

I have recently been doing some string searches on a table with about 50k strings in it, fairly large I'd say but not that big. I was doing some nested queries for a 'search within results' kind of thing. I was using the LIKE statement to match a searched keyword.
I came across MySQL's Full-Text search, which I tried, so I added a fulltext index to my str column. I'm aware that Full-Text searches don't work on virtually created tables or even with views, so queries with sub-selects will not fit. As I mentioned, I was doing nested queries; an example is:
SELECT s2.id, s2.str
FROM
(
SELECT s1.id, s1.str
FROM
(
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
) AS s1
WHERE s1.str LIKE '%another_term%'
) AS s2
WHERE s2.str LIKE '%a_much_deeper_term%';
This is actually not applied to any code yet, I was just doing some tests. Also, searching strings like this can be easily achieved by using Sphinx (performance wise) but let's consider Sphinx not being available and I want to know how this will work well in pure SQL query. Running this query on a table without Full-text added takes about 2.97 secs. (depends on the search term). However, running this query on a table with Full-text added to the str column finished in like 104ms which is fast (i think?).
My question is simple, is it valid to use LIKE or is it a good practice to use it at all in a table with Full-text added when normally we would use MATCH and AGAINST statements?
Thanks!
In this case you do not necessarily need subselects. You can simply use:
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
AND str LIKE '%another_term%'
AND str LIKE '%a_much_deeper_term%'
... but this also raises a good question: the order in which you are excluding the rows. I guess MySQL is smart enough to assume that the longest term will be the most restrictive, so starting with a_much_deeper_term it will eliminate most of the records and then perform the additional comparisons on only a few rows. In contrast, if you start with term you will probably end up with many possible records, and then you have to compare them against the rest of the terms.
The interesting part is that you can force the order in which the comparison is made by using your original subselect example. This gives you the opportunity to decide which term is the most restrictive based on more than just the length, for example:
the ratio of consonants to vowels
the longest chain of consonants of the word
the most used vowel in the word
...etc. You can also apply some heuristics based on the type of textual information you are handling.
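A small sketch of the flat AND form with the terms ordered by a heuristic, written in Python. Everything here is an assumption for illustration: the function names, the use of length as the restrictiveness heuristic, and the `%s` placeholder style (which depends on the database driver). Whether the server actually evaluates the conditions in the written order is not something this sketch can guarantee. It also escapes `_` and `%` inside the terms, since both are LIKE wildcards and a term such as a_much_deeper_term contains underscores.

```python
def escape_like(term: str) -> str:
    """Escape the LIKE wildcards so the term is matched literally (backslash is MySQL's default escape)."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_like_query(terms: list[str]) -> tuple[str, list[str]]:
    """Return (sql, params) with one LIKE condition per term, longest term first."""
    ordered = sorted(terms, key=len, reverse=True)  # heuristic: longer terms are assumed more restrictive
    conditions = " AND ".join("str LIKE %s" for _ in ordered)
    sql = "SELECT id, str FROM strings WHERE " + conditions
    params = ["%" + escape_like(term) + "%" for term in ordered]
    return sql, params


if __name__ == "__main__":
    sql, params = build_like_query(["term", "another_term", "a_much_deeper_term"])
    print(sql)
    print(params)
```

Swapping `key=len` for another scoring function is where the other heuristics listed above would plug in.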
Edit:
This is just a hunch, but it could be possible to apply the LIKE to the words in the fulltext index itself, then match the rows against the index as if you had searched for full words.
I'm not sure if this is actually done, but it would be a smart thing for the MySQL people to pull off. Also note that this theory can only work if all possible occurrences are in fact in the fulltext index. For this you need that:
Your search pattern must be at least the size of the minimum word length. (If you are searching for, say, %id%, it can be part of a 3-letter word too, which is excluded by default from the FULLTEXT index.)
Your search pattern must not be a substring of any listed excluded word for example: and, of etc.
Your pattern must not contain any special characters.