Position independent string matching - mysql

I have 2,000,000 strings in my mysql database. Now , when a new string comes as input, I try to find out if the string is already in my database, else, I insert the string.
Definition of String Match
For my case, position of a word in the text doesn't matter. Only all the words should be present in the string and no extra words in either string.
Ex - Ram is a boy AND boy is a Ram will be said to match. Ram is a good boy won't match.
PS - Please ignore the sense
Now, my question is what is the best way to do these matching given the number of strings(2,000,000) I have to match with .
Solution I could think of :
Index all the strings in SOLR/Sphinx
On new search, I will just
hit the search server and have to consider at max top 10 strings
Advantages :-
Faster than mysql full text search
Disadvantages :-
Keeping search server updated with the new queries in mysql
database.
Are there any other better solutions that I can go for ? Any suggestions and approach to tackle this are most welcome :)
Thanks !

You could just compute a second column that has the words in sorted order. THen just a unique index on that column :)
ALTER TABLE table ADD sorted varchar(255) not null, unique index(sorted);
then... (PHP for convenience, but other languages will be similar)
$words = explode(' ',trim($string));
sort($words);
$sorted = mysql_real_escape_string(implode(' ',$words));
$string = mysql_real_escape_string($string);
$sql = "INSERT IGNORE INTO table SET `string`='$string',`sorted`='$sorted'";

I would suggest to create some more tables that stores the information about your existing data.
so that regardless of how much data your table has you will not have to deal with performance issue during "match/check and insert" logic in your query.
please check the schema suggestion I have made for similar requirement in another post on SO.
accommodate fuzzy matching
in above post to achieve your needs you will need just one extra table where I have mentioned data match with 90% accuracy. let me know if that answer is not clear or you have any doubt on that.
EDIT-1
in your case you will have 3 tables. one you already have, where you have your 2,000,000 string messages stored. now another two table i was talking about is as follows.
second table to store all unique Expression (unique word accross all messages)
third table to store link between each Expression(word) and messgae that word appears in.
see the below query results.
Now lets say your input has a string "Is Boy Ram"
first extract Each Expression from string you have 3 in this string. "Is" and "Ram" and "Boy".
now its just matter of completing the Select query to see if these all expression exist in last table
"MyData_ExpressionString" for single StringID. I guess now you have better picture and you know what to do next. and yes, i haven't created Indexes but I guess you already know what indexes you will need.

Calculate a bloom filter for each string by adding all the words to the filter for the given string. On any new string lookup, calculate the bloom filter, and lookup the matching strings in the DB.
You probably can get by with a fairly short bloom filter, some testing on your strings could tell you how long you need.

Related

SQL Query to find duplicates where a String contains a specific id

I have a database table where one field (payload) is a string where a JSON-object is stored. This JSON has multiple attributes. I would like to find a way to query all entries where the payload json-object contains the same value for the attribute id_o to find duplicates.
So for example if there existed multiple entries where id_o of the payload-string is "id_o: 100" I want to get these rows back.
How can I do this?
Thanks in advance!
I have faced similar issue like this before.
I used regexp_substr
SELECT regexp_substr(yourJSONcolumn, '"id_o":"([^,]*)',0,1,'e') end as give_it_a_name
the comma in "([^,])" can be replaced with a "." if after the id_0:100 has a . or something else that you want to remove.
i think storing json in database is not a good experience. Now your db needs a normalization, it will be good, if you create a new row in your db, give it a unique index and store this id_o property there.
UPDATE
here what i find in another question:
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
I guess you JSON looks like this: {..,"id_o":"100",..}
SELECT * FROM your_table WHERE your_column LIKE '%"id_o":"100"%'

MySQL fulltext search on multiple tables - another one

So I checked around about MySQL fulltext search on multiple, made-to-be joint tables. I know, now, that this is not possible because an index cannot be made on joint tables. The given solution is always to do two "match" with and/or – but it doesn't solve my problem.
The situation is as follows. I got :
– A "works" table that contains book titles, short descriptions and texts extracts.
– A "authors" table, with the name of the authors.
My search must be made IN BOOLEAN MODE for some reasons. Also, the default behavior for the words entered in the search field is AND (I preprocess the request by replacing spaces with +).
A user will typically enter in the search field : "NameOfAuthor TitleOfTheBook" or "NameOfAuthor aRandomWord (that he looks for in the extracts)" or "TitleOfTheBook" alone. He expect to find out only the results (and all of them) that matches all the word he entered.
So if I :
– match against the "works" fields OR the "authors" fields, I will have an answer only if the short descriptions in the "works" table mention the name of the author.
If I don't preprocess the query (if I don't transform "NameOfAuthor TitleOfTheBook" into "+NameOfAuthor +TitleOfTheBook"), I will have all the books from one author and all the books that contains some words of the query, which is not suitable.
– match against the "works" fields AND the "authors" fields, I will have nothing. If I don't preprocess the query for the "Match against author" part, it may work in this case, but not in general, because it will not work with any search that doesn't mention the author's name.
It seems to me that the only solution is an index that would mix works fields and author name. But it's not possible to do an index over a joint… The situation seems so typical that I can't believe that this is a real issue. So I'm probably stupid, but I just can't figure a solution. Any idea ? Must I create a specific, virtual table for this search ?
Thank you very much !
Well, writing down the question helped me to figure something… The idea would be to crush the user input into an $wordsArray and do the fulltext search for each of them.
So, the idea would be to :
//Parse the words from the query field
for (there is still a $word to check in the $wordsArray) {
// do a fulltext search on "works" fields against $word OR "authors" fields against $word
// save the results in a multi-dimensional $resultArray
}
// Keep only the results that exists in every row of the $resultArray
// Display
I think that is quite heavy, though… But the only alternative I can imagine is a database pregenerated table for those search purpose with an index on it. It all depends on the scale.
Except if someone else has a better solution !

How to search for rows containing a substring?

If I store an HTML TEXTAREA in my ODBC database each time the user submits a form, what's the SELECT statement to retrieve 1) all rows which contain a given sub-string 2) all rows which don't (and is the search case sensitive?)
Edit: if LIKE "%SUBSTRING%" is going to be slow, would it be better to get everything & sort it out in PHP?
Well, you can always try WHERE textcolumn LIKE "%SUBSTRING%" - but this is guaranteed to be pretty slow, as your query can't do an index match because you are looking for characters on the left side.
It depends on the field type - a textarea usually won't be saved as VARCHAR, but rather as (a kind of) TEXT field, so you can use the MATCH AGAINST operator.
To get the columns that don't match, simply put a NOT in front of the like: WHERE textcolumn NOT LIKE "%SUBSTRING%".
Whether the search is case-sensitive or not depends on how you stock the data, especially what COLLATION you use. By default, the search will be case-insensitive.
Updated answer to reflect question update:
I say that doing a WHERE field LIKE "%value%" is slower than WHERE field LIKE "value%" if the column field has an index, but this is still considerably faster than getting all values and having your application filter. Both scenario's:
1/ If you do SELECT field FROM table WHERE field LIKE "%value%", MySQL will scan the entire table, and only send the fields containing "value".
2/ If you do SELECT field FROM table and then have your application (in your case PHP) filter only the rows with "value" in it, MySQL will also scan the entire table, but send all the fields to PHP, which then has to do additional work. This is much slower than case #1.
Solution: Please do use the WHERE clause, and use EXPLAIN to see the performance.
Info on MySQL's full text search. This is restricted to MyISAM tables, so may not be suitable if you wantto use a different table type.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Even if WHERE textcolumn LIKE "%SUBSTRING%" is going to be slow, I think it is probably better to let the Database handle it rather than have PHP handle it. If it is possible to restrict searches by some other criteria (date range, user, etc) then you may find the substring search is OK (ish).
If you are searching for whole words, you could pull out all the individual words into a separate table and use that to restrict the substring search. (So when searching for "my search string" you look for the the longest word "search" only do the substring search on records containing the word "search")
I simply use SELECT ColumnName1, ColumnName2,.....WHERE LOCATE(subtr, ColumnNameX)<>0
To get rows with ColumnNameX having the substring.
Replace <> with = to get rows NOT having the substring.

How do I search part of a column?

I have a mysql table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be search-able by the Name column. But the name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName,
substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname,
substring_index(OwnersName,' ',2) as lastname
FROM table1
How do I use sphinx search to search by firstname and/or lastname? I would like to be able to search for 'Smith' in only the first name?
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), inserting new columns called OwnersFirstName and OwnersLastName along with an update/insert trigger which extracts the relevant information from OwnersName and populats the new columns appropriately.
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most problems people have with databases is the speed of their queries. Wasting a bit of disk space to gain a large amount of performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month however so the first option is the better one, if allowed.
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: #firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query can not use any kind of index to do the comparison, so it has to touch each record in the table.
The best would be not to store several values in the same field, but put the name components in three separate fields. When you store more than one value in a fields it's almost always some problems accessing the data. I see this over and over in different forums...
This is an intractable problrm because fulll names can contains prefixes, suffixes, middle names and no middle names, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Lets say, we want to search 001, how to do the fast searching process using hash table?
Isn't it the same as we use the "SELECT * from .. " in mysql? I read alot, they say, the "SELECT *" searching from beginning to end, but hash table is not? Why and how?
By using hash table, are we reducing the records we are searching? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc. In mysql i could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item. And lets suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key % 17)].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key % 17)].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since you know that the row will always have the same id because of the input you build the hash value form and you can always recreate this id through the hash function.
However this isn't always true depending on the size of the table and the maximum number of hashvalues (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you should have a deterministic strategy you use each time you get two values with the same id. You should check Wikipedia for more info on this strategy, its called collision handling and should be mentioned in the same article as hash-tables.
MySQL probably uses hashtables somewhere because of the O(1) feature norheim.se (up) mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the persons first name. (A=0, B=1, C=2 etc), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approx one comparison, with the disclaimer of the special handling I just mentioned.