The problem here is that I have multiple columns:
| artist | name | lyrics | content
I want to search these columns with multiple keywords. The problem is that I can't build a good query with LIKE and OR/AND.
The best I can do is search for each keyword in each column, but that way I get results that may contain one keyword in the name but not the second keyword in the artist.
I want everything combined with AND, but that only works if I'm searching a single column; otherwise, for a row to match, every column would have to contain all of the keywords...
Does anyone know what kind of query I have to write so that a search with 3 keywords (e.g. 1 for the artist and 2 for the name) finds the correct result?
The best solution is not to use MySQL for the search, but to use a text-indexing tool like Apache Solr.
You could then run queries against Solr like:
name:"word" AND artist:"otherword"
It's pretty common to use Solr for indexing data even if you keep the original data in MySQL.
Not only would it give you the query flexibility you want, but it would run hundreds of times faster than using LIKE '%word%' queries on MySQL.
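For example, a query against a hypothetical songs core via Solr's standard select handler could look like this (the core name and URL are illustrative):

http://localhost:8983/solr/songs/select?q=name:"word"+AND+artist:"otherword"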
Another alternative is to use MySQL's builtin fulltext indexing feature:
CREATE FULLTEXT INDEX myft ON mytable (artist, name, lyrics, content);
SELECT * FROM mytable
WHERE MATCH(artist, name, lyrics, content)
AGAINST ('+word +otherword' IN BOOLEAN MODE)
But it's not as flexible if you want to search for words that occur in specific columns, unless you create a separate index on each column.
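As a sketch of that per-column approach (reusing the table and column names from above), you would index the columns separately and AND the MATCH clauses together:

CREATE FULLTEXT INDEX ft_artist ON mytable (artist);
CREATE FULLTEXT INDEX ft_name ON mytable (name);

SELECT * FROM mytable
WHERE MATCH(artist) AGAINST ('+word' IN BOOLEAN MODE)
  AND MATCH(name) AGAINST ('+otherword' IN BOOLEAN MODE);

Each MATCH() must name a column list that exactly matches one of the FULLTEXT indexes.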
AND works for displaying multiple rows too. It just depends on the rows you have in your table, which you haven't provided. PS: I'm sorry if my answer is not clear; I don't have the reputation to make it a comment.
I am currently using MySQL but I am willing to migrate if necessary to any solution suggested.
I am looking for an easy way to implement a search on a table.
The table has multiple entries with data similar to what will be found on user accounts, like names, addresses, phone numbers and a text column that contains comments of arbitrary length.
I want to implement a search that goes over all rows and columns and finds the best-matching row, with slight misspellings corrected (not very important). But most important is the ability to cross-search everything.
The table can have as many as 20,000 rows.
The search parameter will be, for example: "Company First Name"
Expected results:
company|Contact First Name|Address|...|...
Example 2, with slightly misspelled search parameters: "Pinaple Street Compani"
Expected results row:
company|pinapple street|..|...
companie|pinapple street|..|...
company|pinaple street|..|...
EDIT:
I forgot to clarify that multiple searches will be done at the same time, so it has to be fast (around 100 concurrent searches). Also, the language of the data is not English, and the database is utf8 with support for non-English characters.
The misspelling problem is hard, if not impossible, to solve well in pure MySQL.
The multiple-column FULLTEXT search isn't so bad.
Your query will look something like this ...
SELECT column, column
FROM table
WHERE MATCH(Company, FirstName, LastName, whatever, whatever)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
It will produce a bunch of results, ordered by what MySQL guesses is the most likely hit first. MySQL's guesses aren't great, but they're usually adequate.
You'll need a FULLTEXT index matching the list of columns in your MATCH() clause. Creating that index looks like this.
ALTER TABLE book
ADD FULLTEXT INDEX Fulltext_search_index_1
(Company, FirstName, LastName, whatever, whatever);
Comments in your question notwithstanding, you just need an index for the group of columns which you will search.
20K rows won't be a big burden on any recent-vintage server hardware.
Misspelling: You could try SOUNDEX(), but it's an early 20th century algorithm designed by the Bell System to look up people's names in American English. It's designed to get many false positive hits, and it really is dumber than a bucket of rocks.
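To illustrate with the misspelling from your example (the table and column names are hypothetical):

SELECT SOUNDEX('pinaple'), SOUNDEX('pinapple');
-- both return 'P514', so the misspelling matches

SELECT * FROM accounts WHERE SOUNDEX(street) = SOUNDEX('pinaple');
-- note: applying SOUNDEX() to a column defeats ordinary indexes, so this scans the table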
If you really do need spell correction you may need to investigate Sphinx.
I think this is a long-shot....
My database has the following fields: title, description, date, price, hash
At present I generate an MD5 hash like so md5($title.$desc.$date.$price) and place it in the hash field for each item, so that when a new item is added to the database I have an easy and fairly reliable way of knowing whether an item with the same details already exists in the database.
What I would like to do is expand this, so the match process is a little more fuzzy. The reason for this is that I'm seeing lots of duplicate items in the database where the description may be only one or two characters different, or the price might be slightly different.
The database is large (3 million rows) and is InnoDB. I also have Sphinx at my disposal, if this offers a way of filtering out similar results when they are returned from searches.
Well, Sphinx (or any other search engine) would need to compute a similar 'hash' to be able to remove duplicates at query time.
Where Sphinx might help you: when you insert an item into the database, use Sphinx to search for similar items. You would get a ranked list of potential duplicates. If the top item has a high enough score, you could say it's sufficiently similar, and then store that fact in the database.
(How I do it: I have a second column on the table called 'grouper'; by default it just duplicates the primary key of the item, but if a duplicate is found, it's changed instead to the PK of the item it duplicates. You can then just run a MySQL (or Sphinx!) GROUP BY on that grouper column.)
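A minimal sketch of that grouper scheme, with the table name assumed:

ALTER TABLE items ADD COLUMN grouper INT UNSIGNED;
UPDATE items SET grouper = id;               -- default: every item is its own group

-- when item 97 is judged a duplicate of item 42:
UPDATE items SET grouper = 42 WHERE id = 97;

-- collapse duplicates at query time, one representative per group:
SELECT grouper, MIN(id) AS representative_id FROM items GROUP BY grouper;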
You could use SOUNDEX on the description (used to cope with slightly different spellings of words).
http://dev.mysql.com/doc/refman/5.6/en/string-functions.html#function_soundex
For the price, if you round it to the nearest 10 (or whatever is reasonable) before creating the MD5, that should cope with small differences.
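Putting both ideas together, the hash could be recomputed in MySQL itself, something like this (the table and column names are assumed to match your fields):

UPDATE items
SET hash = MD5(CONCAT(title, SOUNDEX(description), `date`, ROUND(price, -1)));
-- ROUND(price, -1) rounds to the nearest 10, so 19.99 and 21.50 both hash as 20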
I am trying to make a database of products that can be searched by many facets (like Newegg or Amazon). At first I was going to try to do the whole thing with MySQL, but further research has led me to believe that is a bad idea, so instead I am thinking about using Sphinx.
My question is: how would I set up the MySQL tables for this? Would I just have one table for the products and another one with all the facets, which would just have a couple of large varchar fields and a foreign key to the product?
I am not a huge Sphinx expert, but I'd say that you don't have to stick all your data in one table. Sphinx can handle associations just fine. If you are planning to use Rails for your front-end, then take a look at the thinking_sphinx gem. It definitely allows you to specify attributes based on data spread out into many tables. In my experience I didn't have to change my data structure to accommodate Sphinx.
I'll pipe in.
You don't really need to, actually. Facets in Sphinx are just IDs (at least in 0.9.9, the current stable release). I am going to assume that you have a standard product table, with your different facets stored as foreign keys to other tables.
So, assuming you have this, you can just select over the main product table and set up the facets in Sphinx as per the documentation.
I would really need to see your table structure to comment further. It sounds like you have your products spread over multiple tables; in this case, as you mentioned, I would go with a single table to index on, populated with the contents of all the others.
The great thing about Sphinx is that you can use a MySQL query to get your data into Sphinx. This allows you to structure your database in a way that's optimized for your business logic, without having to worry about how search will perform. As long as you're creative with the query you write for sql_query, you can normalize your database however you'd like, and still be able to grab all the text to be indexed with a single query. For example, if you need to get strings from a many-to-one relationship into your index, you can do so using a subquery.
sql_query = SELECT p.*, \
    (SELECT GROUP_CONCAT(pa.text SEPARATOR ' ') \
    FROM products_attr pa WHERE pa.product_id = p.id) AS attr_text \
    FROM products p
Additionally, if you have drop-downs where you search on attribute IDs, you can use Sphinx's multi-value attribute (MVA). This way, you can search by attribute ID as well as by the text of the attribute.
sql_attr_multi = uint attributes from query; \
    SELECT product_id AS id, id AS attribute FROM product_attributes
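Once indexed, a SphinxQL filter on that MVA might look like this (the index name is illustrative):

SELECT id FROM products_index WHERE attributes = 5;
-- matches documents whose 'attributes' MVA contains the value 5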
I have content of any kind, identified by an ID, and I can assign multiple types to each piece of content.
The question is: should I use multiple rows to store multiple types, or use a single type field, put the types there separated by commas, and parse them in PHP?
Multiple Rows
`content_id` | `type`
1 | 1
1 | 2
1 | 3
VS
Single Row
`content_id` | `type`
1 | 1,2,3
EDIT
I'm looking for the faster option, not the easier one; please consider this. Performance is really important to me, and I'm talking about a really huge database with millions or tens of millions of rows.
I'd almost always recommend the "multiple rows" approach, as it has several advantages:
You can query for, say, WHERE type=3 without any great difficulty, as you don't have to use WHERE type LIKE '%3%', which is less efficient
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when you store them as a comma separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is to store the data and let you obtain subsets of it.
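As a concrete sketch of the multiple-row layout (the table and index names are assumed):

CREATE TABLE content_type (
    content_id INT UNSIGNED NOT NULL,
    type INT UNSIGNED NOT NULL,
    PRIMARY KEY (content_id, type),  -- prevents duplicate pairs
    KEY idx_type (type)              -- makes lookups by type an index read
);

-- all content with type 3, no LIKE '%3%' needed:
SELECT content_id FROM content_type WHERE type = 3;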
I always use multiple rows. If you use a single row, your data is hard to read, and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs. de-normalised design.
Usually I would recommend sticking to the "multiple rows" style (normalised), although sometimes (for performance/storage reasons) people deliberately implement the "single row" style.
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; reporting, which tends to be easier with some denormalization, is the main example. So if your code is cleaner or performs better with the single row, go for that. Otherwise, multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens once there are two-digit types? The above query would also find "12". Okay, you could say "where type like '%,2,%'". Except that doesn't work if 2 is the first or the last in the list. Even if you came up with a way to do it reliably, a LIKE search with a leading % means a sequential read of every record in the table, which is very slow.
How big will you make the crammed field? What if the string of types is too big to fit within your maximum length?
Do you carry any data about the types? If you create a second table with key of "type" and, say, a description of that type, how will you join to that table? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you make it consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5" ? Or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" will not look equal but they are really equivalent. In either case, updating a type field now becomes a program rather than a single SQL statement.
Even if there is some trivial performance gain, it's just not worth it.
I have three tables, all of which have a column with a fulltext index. The user will enter search terms into a single text box, and then all three tables will be searched.
This is better explained with an example:
documents
    doc_id
    name FULLTEXT

table2
    id
    doc_id
    a_field FULLTEXT

table3
    id
    doc_id
    another_field FULLTEXT
(I realise this looks stupid but that's because I've removed all the other fields and tables to simplify it).
So basically I want to do a fulltext search on name, a_field and another_field, and then show the results as a list of documents, preferably with an indication of what caused each document to be found; e.g. if another_field matched, I would display the contents of another_field.
I began working on a system whereby three fulltext search queries are performed and the results inserted into a table with a structure like:
search_results
    table_name
    row_id
    score
(This could later be made to cache results for a few days with e.g. a hash of the search terms).
This idea has two problems. The first is that the same document can appear in the search results up to three times with different scores. Instead, if the search term matches in two tables, the document should appear once, with a higher score.
The second is that parsing the results is difficult. I want to display a list of documents, but I don't immediately know the doc_id without a join of some kind; however, the table to join to is dependent on the table_name column, and I'm not sure how to accomplish that.
Wanting to search multiple related tables like this must be a common thing, so I guess what I'm asking is: am I approaching this in the right way? Can someone tell me the best way of doing it, please?
I would create a denormalized single index. I.e., put all three document types into a single table with fields for doc_id, doc_type, and a single fulltext block. Then you can search all three document types at once.
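A rough sketch of what that could look like, reusing your table and column names (FULLTEXT requires MyISAM before MySQL 5.6):

CREATE TABLE search_index (
    doc_id INT UNSIGNED NOT NULL,
    doc_type VARCHAR(20) NOT NULL,   -- 'documents', 'table2' or 'table3'
    content TEXT NOT NULL,
    FULLTEXT KEY ft_content (content)
) ENGINE=MyISAM;

INSERT INTO search_index (doc_id, doc_type, content)
SELECT doc_id, 'documents', name FROM documents
UNION ALL
SELECT doc_id, 'table2', a_field FROM table2
UNION ALL
SELECT doc_id, 'table3', another_field FROM table3;

-- one row per document, scores summed across matching tables:
SELECT doc_id, SUM(MATCH(content) AGAINST ('search terms')) AS score
FROM search_index
WHERE MATCH(content) AGAINST ('search terms')
GROUP BY doc_id
ORDER BY score DESC;

The GROUP BY/SUM collapses multi-table hits into a single, higher-scoring row, and doc_id comes back directly, which sidesteps both problems you describe.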
You might also find that Lucene would make sense in this situation. It gives you faster searching, as well as much more functionality around how the searching and scoring works.
The downside is that you're keeping a separate denormalized copy of the text for each record. The upside is that searching is much faster.