How to get same result as following Mysql query from Solr? - mysql

MySQL query: The inner query returns every attribute_value containing "man" along with its position in the value. The outer query sorts by that position in ascending order, so results appear in the order in which "man" moves from the 1st position to later positions, like:
man
manager
aman
human
hanuman
assistant manager
indian institute of management
This is the SQL query:
SELECT f1.av
FROM (
SELECT `attribute_value` av, LOCATE("man",LOWER(`attribute_value`)) po
FROM db_attributes WHERE `attribute_value` LIKE "%man%"
) f1
ORDER BY f1.po
I want to achieve this using solr. Right now I am clueless about how to achieve this. Solr is loaded with all attribute values. Help is greatly appreciated.

This question is about how to do partial string matching that is NOT left-anchored. This may be some misunderstanding of what Solr (and any index) provides and what it does not provide.
You can do this query in MySQL because it is computed at execution time, at the cost of examining every row. But it is unnatural to attempt this query in Solr, because the entire point of an index is to minimize cost at execution time and NOT touch every record. That is, the index wants to precompute a subset for a given potential input.
Consider: the two basic field types for this are string and text. String only supports exact matching. Text does tokenizing and stemming. Do you want a search for "ignition" to match "ignite"? It appears you do not, since you are treating the input not as a word or word stem but as a raw string.
In that case, you probably want to look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory, which can be used to produce all the left-anchored substrings of given tokens. By using a second field, you can also have EdgeNGramFilterFactory produce right anchored substrings (then search both for matches). But this is not the same as producing all possible substrings as your example usage suggests.
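To make the left-anchored versus right-anchored distinction concrete, here is a minimal pure-Python sketch of what edge n-grams capture. It is an illustration only, not Solr's implementation: the function names are hypothetical, and plain whitespace tokenization with lowercasing is an assumption.

```python
def edge_ngrams(token, min_len=1, from_end=False):
    """Return the prefixes of a token, or its suffixes if from_end is set."""
    lengths = range(min_len, len(token) + 1)
    if from_end:
        return [token[-n:] for n in lengths]
    return [token[:n] for n in lengths]


def index_value(value):
    """Collect front and back edge n-grams for every whitespace token."""
    front, back = set(), set()
    for token in value.lower().split():
        front.update(edge_ngrams(token))
        back.update(edge_ngrams(token, from_end=True))
    return front, back


def matches(value, query):
    """True if the query equals a prefix or a suffix of some token."""
    front, back = index_value(value)
    q = query.lower()
    return q in front or q in back


values = ["manager", "human", "assistant manager", "humanity"]
hits = [v for v in values if matches(v, "man")]
```

With these sample values, "humanity" is the one that is missed: "man" sits in the middle of the token, so it is neither a prefix nor a suffix.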
As for the resultset order, you would have to define a relevance that sorts the way you want. That probably means a separate string field with high score for exact match and the atomized field for matching at a lower relevance.
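For reference, the ordering the MySQL query produces can be sketched in a few lines of Python. This is not a Solr relevance configuration; it is a swapped-in, application-side re-sort of candidate values, shown only to pin down the target order. The helper names and the sample data are hypothetical.

```python
def locate(needle, haystack):
    """1-based position of needle in haystack, 0 if absent (like MySQL LOCATE)."""
    return haystack.lower().find(needle.lower()) + 1


def rank_by_position(candidates, term):
    """Keep candidates containing term, ordered by where the term first occurs."""
    hits = [c for c in candidates if locate(term, c) > 0]
    # sorted() is stable, so ties keep their incoming order
    return sorted(hits, key=lambda c: locate(term, c))


candidates = ["human", "indian institute of management", "manager", "aman", "solr"]
ranked = rank_by_position(candidates, "man")
```

If the candidate set returned by the index is small, re-sorting like this in the client may be acceptable; for large result sets it brings back the per-row cost described above.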
In short, you probably should not be thinking of reproducing these particular mysql queries exactly in Solr. I would push for clarification or redefinition of the use case (left or right anchoring).

Related

SQL Index on Strings Helpful?

So I have used MySQL a lot in small projects for school; however, I'm now taking over an enterprise-ish scale project, and now speed matters, not just getting the right information back. I have Googled around a lot trying to learn how indexes might make my website faster, and I am hoping to further understand how they work, not just when to use them.
So, I find myself doing a lot of SELECT DISTINCTs in order to get all the distinct values, so I can populate my dropdowns. I have heard that this would be faster if this column was indexed; however, I don't completely understand why. If the values in this column were ints, I would totally understand; basically a data structure like a BST would be created, and search times could be O(log n); however, if my column is strings, how can it put a string in a BST? This doesn't seem possible, since there is no metric to compare a string against another string (like there is with numbers). It seems like an index would just create a list of all the possible values for that column, but it seems as if the search would still require the database to go through every single row, making this search linear, just as if the database scanned a regular table.
My second question is what does the database do once it finds the right value in the index data structure. For example, let's say I'm doing a where age = 42. So, the database goes through the data structure until it finds 42, but how does it map that lookup to the whole row? Does the index have some sort of row number associated with it?
Lastly, if I am doing these frequent SELECT DISTINCT statements, is adding an index going to help? I feel like this must be a common task for websites, as many sites have dropdowns where you can filter results, I'm just trying to figure out if I'm approaching it the right way.
Thanks in advance.
Your logic is good; however, your assumption that there is no metric to compare a string to other strings is incorrect. Strings can simply be compared in alphabetical order, giving them a perfectly usable comparison metric that can be used to build the index.
It takes a tiny bit longer to compare strings than it does ints; however, having an index still speeds things up, regardless of the comparison cost.
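A toy model may help with both the string-comparison point and the question of how an index entry maps back to a row. This is a simplified sketch, assuming a sorted list of (value, row id) pairs searched with binary search; real engines typically use B-trees, and the table contents and function names here are made up.

```python
import bisect

# Hypothetical table rows: (row_id, color)
rows = [(1, "red"), (2, "blue"), (3, "green"), (4, "blue"), (5, "red")]

# The "index": (value, row_id) pairs kept in sorted order. Strings compare
# lexicographically, so sorting and binary search work just as they do for ints.
index = sorted((color, row_id) for row_id, color in rows)


def lookup(value):
    """Row ids whose indexed value equals `value`, located by binary search."""
    start = bisect.bisect_left(index, (value,))
    ids = []
    for key, row_id in index[start:]:
        if key != value:
            break
        ids.append(row_id)
    return ids


def distinct_values():
    """Distinct values read off the sorted index without touching the rows."""
    out = []
    for key, _ in index:
        if not out or out[-1] != key:
            out.append(key)
    return out
```

Each index entry carries a row id, which is how a match in the index leads back to the full row. Because equal values sit next to each other, duplicates are easy to skip when listing distinct values.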
I would like to mention however that if you are using SELECT DISTINCT as much as you say, there are probably problems with your database schema.
You should learn about normalizing your database. I recommend starting with this link: http://databases.about.com/od/specificproducts/a/normalization.htm
Normalization will give you a querying mechanism whose benefits can vastly outweigh those of indexing.
If your strings are something small like categories, then an index will help. If you have large chunks of random text, then you will likely want a full text index. If you are having to use select distinct a lot, your database may not be properly normalized for what you are doing. You could also put the distinct values in a separate table (that only has the distinct values), but this only helps if the content does not change a lot. Indexing strategies are particular to your application's access patterns, the data itself, and how the tables are normalized (or not).
HTH

MYSQL Forced pattern match

I am having trouble figuring out if it's possible to force MySQL to consider certain specified strings as identical when choosing results in a select query.
For example, I have a column containing the word "trachiotomy", but due to the nature of the language it is very likely that the search query will be "trahiotomy" (notice the missing c).
Is there any way I can force the query to treat one pattern of letters as equivalent to another?
For example to match any instance within words of the "ach" sequence of letters to "ah" also - and vice versa. In essence force it regardless of how it was written.
Another example would be the word Archon - which I would like to match with Arhon as well.
So that if a user input was Archon it would match the database data Arhon and vice versa.
I experimented with soundex a bit and it does match some instances, but it seems that due to the way the algorithm works it can't do it in cases where the desired matched string is at the beginning of the word.
For instance, the word "Chorevo" can't match the word "Horevo" unless I can somehow force it to consider that "chor" is equal to "hor" and vice versa in any word.
I am reading into REGEXP to see if it can be matched thus somehow. (something like
REGEXP 'arch', 'arh')
At this point i am using a full text match query, but could change that if that proves to be a problem.
I am not sure I have made this clear but would appreciate any help possible.
This is known as phonetic matching. MySQL implements a relatively primitive version of this in the SOUNDEX(str) function and the a SOUNDS LIKE b operator (which is just shorthand for SOUNDEX(a) = SOUNDEX(b)). By nature such matching is language-specific, and the MySQL implementation is designed for English words and thus may not work in your situation.
Alternatively you could research/write your own transformation that does what you want and apply it to the data before saving in the database (in a separate column or table).
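A minimal sketch of such a transformation, assuming the only rule needed is collapsing "ch" to "h" (the rule list and the function name are hypothetical; a real rule set would depend on the language):

```python
import re

# Hypothetical rule set: each pattern is collapsed to one canonical spelling.
RULES = [(re.compile(r"ch"), "h")]


def search_key(text):
    """Normalize text so that spellings treated as equivalent compare equal."""
    key = text.lower()
    for pattern, replacement in RULES:
        key = pattern.sub(replacement, key)
    return key
```

The same function would be applied to the stored words (saving the result in the extra column) and to the user's input, and the two keys compared. That way "Archon" and "Arhon" both become "arhon", and "Chorevo" and "Horevo" both become "horevo".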

MySQL Fulltext search but using LIKE

I'm recently doing some string searches from a table with about 50k strings in it, fairly large I'd say but not that big. I was doing some nested queries for a 'search within results' kinda thing. I was using LIKE statement to get a match of a searched keyword.
I came across MySQL's Full-Text search, so I tried it and added a fulltext index to my str column. I'm aware that full-text searches don't work on virtually created tables or even with views, so queries with sub-selects will not fit. As I mentioned, I was doing nested queries; an example is:
SELECT s2.id, s2.str
FROM
(
SELECT s1.id, s1.str
FROM
(
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
) AS s1
WHERE s1.str LIKE '%another_term%'
) AS s2
WHERE s2.str LIKE '%a_much_deeper_term%';
This is actually not applied to any code yet, I was just doing some tests. Also, searching strings like this can be easily achieved by using Sphinx (performance wise) but let's consider Sphinx not being available and I want to know how this will work well in pure SQL query. Running this query on a table without Full-text added takes about 2.97 secs. (depends on the search term). However, running this query on a table with Full-text added to the str column finished in like 104ms which is fast (i think?).
My question is simple, is it valid to use LIKE or is it a good practice to use it at all in a table with Full-text added when normally we would use MATCH and AGAINST statements?
Thanks!
In this case you don't necessarily need subselects. You can simply use:
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
AND str LIKE '%another_term%'
AND str LIKE '%a_much_deeper_term%'
... but this also raises a good question: the order in which you exclude the rows. I guess MySQL is smart enough to assume that the longest term will be the most restrictive, so starting with a_much_deeper_term it will eliminate most of the records and then perform additional comparisons on only a few rows. In contrast, if you start with term you will probably end up with many candidate records that you then have to compare against the rest of the terms.
The interesting part is that you can force the order in which the comparisons are made by using your original subselect example. This gives you the opportunity to decide which term is the most restrictive based upon more than just the length, for example:
the ratio of consonants to vowels
the longest chain of consonants of the word
the most used vowel in the word
...etc. You can also apply some heuristics based on the type of textual information you are handling.
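The idea of testing the most restrictive term first can be sketched outside the database. This is only an illustration of the ordering heuristic, not of how MySQL evaluates the query. The function name and sample rows are hypothetical, and length is used as the default measure of restrictiveness.

```python
def filter_by_terms(rows, terms, restrictiveness=len):
    """Keep rows containing every term, testing the most restrictive first."""
    ordered = sorted(terms, key=restrictiveness, reverse=True)
    remaining = list(rows)
    for term in ordered:
        # Each pass only looks at rows that survived the previous terms.
        remaining = [r for r in remaining if term.lower() in r.lower()]
    return remaining


rows = [
    "mysql index",
    "mysql fulltext index",
    "fulltext search",
    "MySQL FULLTEXT index tuning",
]
result = filter_by_terms(rows, ["index", "mysql", "fulltext"])
```

The final result does not depend on the order of the terms; only the amount of work per pass does. Any other heuristic from the list above could be passed in as the `restrictiveness` function.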
Edit:
This is just a hunch, but it could be possible to apply the LIKE to the words in the fulltext index itself, then match the rows against the index as if you had searched for full words.
I'm not sure if this is actually done, but it would be a smart thing for the MySQL people to pull off. Also note that this theory can only be used if all possible occurrences are in fact in the fulltext index. For this you need that:
Your search pattern must be at least the size of the minimal word length. (If you are searching for, say, %id%, it can be part of a 3-letter word too, which is excluded by default from the FULLTEXT index.)
Your search pattern must not be a substring of any listed excluded word for example: and, of etc.
Your pattern must not contain any special characters.

Can MySQL fulltext search return an index(position) instead of a score?

I would like to use the position/index found by the Match...Against fulltext search in mysql to return some text before and after the match in the field. Is this possible? In all the examples I have seen, the Match...Against returns a score in the select instead of a location or position in the text field of which is being searched.
SELECT
random_field,
MATCH ($search_fields)
AGAINST ('".mysql_real_escape_string(trim($keywords))."' IN BOOLEAN MODE)
AS score
FROM indexed_sites
WHERE
MATCH ($search_fields)
AGAINST ('".mysql_real_escape_string($keywords)."' IN BOOLEAN MODE)
ORDER BY score DESC;
This will give me a field and a score...but I would like an index/position instead of (or along side) a score.
Fulltext searching is a scoring function; it's not a search-for-occurrence function. In other words, the highest-scoring result may not have a single starting position for the match, as the score may be a combination of weighted results of different matches within the text. If you include query expansion, the searched-for word(s) may not even appear in the result!
http://dev.mysql.com/doc/refman/5.0/en/fulltext-query-expansion.html
I hope that makes some sense.
Anyway your best bet is to take the results and then use some text searching function to find the first occurrence of the first matching word. My guess is that would be best suited to a text processing language like perl or a more general language like php or what ever language you are using to run the query.
DC
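The post-processing step suggested above could look roughly like this in application code. It is a sketch under assumptions: it takes the earliest occurrence of any keyword (case-insensitively) and returns a fixed number of characters on each side. The function name and the `context` parameter are made up for illustration.

```python
def snippet(text, keywords, context=30):
    """Text around the earliest keyword occurrence, or None if none occurs."""
    lowered = text.lower()
    positions = []
    for word in keywords:
        pos = lowered.find(word.lower())
        if pos != -1:
            positions.append((pos, len(word)))
    if not positions:
        return None
    pos, length = min(positions)  # earliest match wins
    start = max(0, pos - context)
    end = min(len(text), pos + length + context)
    return text[start:end]
```

For example, `snippet("abc search xyz", ["SEARCH"], context=2)` returns `"c search x"`. A row that matched only through query expansion would yield `None`, which is the case described above.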

How to search for rows containing a substring?

If I store an HTML TEXTAREA in my ODBC database each time the user submits a form, what's the SELECT statement to retrieve 1) all rows which contain a given sub-string 2) all rows which don't (and is the search case sensitive?)
Edit: if LIKE "%SUBSTRING%" is going to be slow, would it be better to get everything & sort it out in PHP?
Well, you can always try WHERE textcolumn LIKE "%SUBSTRING%" - but this is guaranteed to be pretty slow, as your query can't do an index match because you are looking for characters on the left side.
It depends on the field type - a textarea usually won't be saved as VARCHAR, but rather as (a kind of) TEXT field, so you can use the MATCH AGAINST operator.
To get the columns that don't match, simply put a NOT in front of the like: WHERE textcolumn NOT LIKE "%SUBSTRING%".
Whether the search is case-sensitive or not depends on how you store the data, especially what COLLATION you use. By default, the search will be case-insensitive.
Updated answer to reflect question update:
I would say that WHERE field LIKE "%value%" is slower than WHERE field LIKE "value%" if the column field has an index, but it is still considerably faster than getting all values and having your application filter them. The two scenarios:
1/ If you do SELECT field FROM table WHERE field LIKE "%value%", MySQL will scan the entire table, and only send the fields containing "value".
2/ If you do SELECT field FROM table and then have your application (in your case PHP) filter only the rows with "value" in it, MySQL will also scan the entire table, but send all the fields to PHP, which then has to do additional work. This is much slower than case #1.
Solution: Please do use the WHERE clause, and use EXPLAIN to see the performance.
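The two scenarios can be compared side by side in a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made purely so the example runs anywhere; SQLite's LIKE is case-insensitive for ASCII by default, much like MySQL's default collations). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [("first substring here",), ("nothing relevant",), ("another SUBSTRING",)],
)

# Case 1: the database filters; only matching rows cross the connection.
db_filtered = conn.execute(
    "SELECT body FROM notes WHERE body LIKE ? ORDER BY id", ("%substring%",)
).fetchall()

# Rows that do NOT contain the substring.
not_matching = conn.execute(
    "SELECT body FROM notes WHERE body NOT LIKE ? ORDER BY id", ("%substring%",)
).fetchall()

# Case 2: every row is transferred, then filtered in application code.
all_rows = conn.execute("SELECT body FROM notes ORDER BY id").fetchall()
app_filtered = [row for row in all_rows if "substring" in row[0].lower()]
```

Both approaches end up with the same rows, but case 2 had to fetch all three rows to keep two of them; on a real table that gap is the extra transfer and processing cost.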
Info on MySQL's full text search. In the MySQL 5.0 documentation linked here, this is restricted to MyISAM tables (InnoDB gained FULLTEXT indexes in MySQL 5.6), so it may not be suitable if you want to use a different table type.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Even if WHERE textcolumn LIKE "%SUBSTRING%" is going to be slow, I think it is probably better to let the Database handle it rather than have PHP handle it. If it is possible to restrict searches by some other criteria (date range, user, etc) then you may find the substring search is OK (ish).
If you are searching for whole words, you could pull out all the individual words into a separate table and use that to restrict the substring search. (So when searching for "my search string", you look up the longest word, "search", and only do the substring search on records containing the word "search".)
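A rough in-memory sketch of that word-table idea, with a dictionary standing in for the separate table (the records, names, and whitespace splitting are all assumptions for illustration):

```python
from collections import defaultdict

# Hypothetical stored records: id -> text
records = {
    1: "my search string appears here",
    2: "a research paper",
    3: "searching is slow",
    4: "enter my search string",
}

# Stand-in for the separate word table: word -> ids of records containing it.
word_table = defaultdict(set)
for record_id, text in records.items():
    for word in text.split():
        word_table[word].add(record_id)


def find_phrase(phrase):
    """Narrow by the longest whole word, then substring-check the survivors."""
    longest = max(phrase.split(), key=len)
    candidates = word_table.get(longest, set())
    return sorted(rid for rid in candidates if phrase in records[rid])
```

Here `find_phrase("my search string")` only runs the substring check on records 1 and 4. Records 2 and 3 are never examined, because "research" and "searching" are different whole words, which is also why this approach only works for whole-word searches.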
I simply use SELECT ColumnName1, ColumnName2,.....WHERE LOCATE(substr, ColumnNameX)<>0
To get rows with ColumnNameX having the substring.
Replace <> with = to get rows NOT having the substring.