Building a search engine for a large database - MySQL

I'm building a fairly large database where I will have a lot of tables with various data.
But each table has similar fields, for example video title or track title.
Now the problem I'm facing is how to build a query that looks for a keyword match across five or more tables. Keep in mind that each table can potentially have from 100k to 1 million rows, or in some cases even a couple of million rows.
I think using joins or separate queries for each table would be very slow, so what I thought of is to make one separate table where I would store search data.
For example I think it could have fields like these,
id ---- username ---- title ---- body ---- date ---- belongs_to ---- post_id
This way I think it would perform a lot faster searches, or am I totally wrong?
The only problem I can think of with this approach is that the table would be hard to manage: if the original record is deleted from one of the tables, I would also need to delete the matching record from the 'search' table.

Don't use MySQL for joining lots of tables. I would suggest taking a look at Apache Solr, used alongside your RDBMS.

Take a look at some information retrieval systems. They also require their own indices, so you need to index the data after each update (or in regular intervals) to keep the search index up to date. But they offer the following advantages:
much faster, because they use special algorithms and data structures designed specifically for that purpose
ability to search for documents based on a set of terms (and maybe also a set of negative terms that must not appear in the result)
search for phrases (i.e. terms that appear after each other in a specific order)
automatic stemming (i.e. stripping the endings of words like "s", "ed", "ing" ...)
detection of spelling mistakes (i.e. "Did you mean ...?")
stopwords to avoid indexing really common meaningless words ("a", "the", etc.)
wildcard queries
advanced ranking strategies (i.e. rank by relevance, based on the number and the position of the occurrences of the search terms)
I have used Xapian in the past for my projects and I was quite happy with it. Lucene, Solr and Elasticsearch are some other really popular projects that might fit your needs.
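To make the "own index" idea concrete, here is a minimal sketch of an inverted index with stopwords, required terms and negative terms. It is a toy illustration only, not how Xapian, Lucene or Solr are implemented; the stopword list, the sample documents and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; real engines ship much larger ones.
STOPWORDS = {"a", "an", "the", "of", "for", "and"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, required, excluded=()):
    """Ids containing every required term and none of the excluded ones."""
    required = [t.lower() for t in required]
    if not required:
        return set()
    hits = set.intersection(*(index.get(t, set()) for t in required))
    for term in excluded:
        hits -= index.get(term.lower(), set())
    return hits

# Made-up titles standing in for rows from several source tables.
docs = {
    1: "The history of jazz guitar",
    2: "Jazz piano for beginners",
    3: "Rock guitar lessons",
}
index = build_index(docs)
result = search(index, ["jazz"], excluded=["piano"])
```

The index has to be rebuilt or updated whenever the source rows change, which is the maintenance cost mentioned above.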

Related

Autocomplete with MySQL Fulltext search that proposes words instead of results

There are lots of questions about fulltext searches with MySQL, and I've read many of them without finding what I am looking for (on Google or Stack Overflow).
I am not looking to match rows (or documents) but I am looking to match words contained in the rows.
For example, imagine you have a companies table with an id, a name and a small_description column. You could find rows like:
1 | MyBaker | fine bakery since 1920
2 | Bakery factory | all the materials for a bakery
etc...
Now, when the user types "bak", I would like to suggest the word "bakery". (I do not want to directly suggest MyBaker and Bakery factory, since hundreds of companies will match but only a handful of different words.)
I think the underlying MySQL fulltext engine already has some kind of "word lookup", so I'd like to use that instead of parsing name and small_description myself to build another table with word | nb_occurences
(not to mention that such a table may be hard to keep synchronized if lots of updates are made to the other table and the counters have to be decremented :( ).
The reason behind this is to create an autocomplete search
where word suggestions are correlated to the database content.
For example, Amazon (.fr) does a pretty awful job. If you type "tel", it will suggest a dozen "telephone" matches and 0 "television" or "telescope" or "telemetry" ...!
While this is not really a problem on desktop, where typing the full word is fast, on mobile it really is a problem.
This is amplified by the fact that some words suggested by the smartphone keyboard are not in my database AND that some words in my database are never suggested by the smartphone keyboard.
For example, my database has 0 "telephone" and "television" but lots of "telemetry" and "teleconference".
Finally, I'd also like to forgive bad spelling if possible (e.g. "telme" should match "telemetry").
I hope someone can help me leverage the existing fulltext index to achieve my goal.
FULLTEXT search finds rows of data matching the word or words you present to it. As you know, it is not simply a word search.
You could, in your back-end program, take the results of your FULLTEXT search, break it up into words, and consider the most frequent of those words for autocompletion. This might work well if you modified your searches using WITH QUERY EXPANSION.
(Keep in mind that natural language FULLTEXT searches work strangely with small sets of data to search, so test with a table with many rows, not just a few.)
But, FULLTEXT does not handle stemming (chateau + chateaux - chat) correctly, nor does it offer to correct misspellings.
You could use Apache Lucene for your purpose, but it is a large and complex system.
I think you need the word / nb_appearances table, unpleasant as it is to maintain. It will give you the capability of doing
SELECT word
FROM words
WHERE word LIKE CONCAT(:input,'%')
ORDER BY nb_appearances DESC;
to get partial word matches. FULLTEXT cannot do that. You can also add a second lookup table to correct common misspellings in your application domain, for example, telmetry --> telemetry. It is a pain in the neck, of course.
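As a rough sketch of what the word / nb_appearances table buys you, the following builds the counts in memory, answers prefix lookups ordered by frequency (the same idea as the LIKE query above), and falls back to fuzzy matching for misspellings. The fuzzy step uses Python's difflib as a stand-in for the hand-maintained misspelling table; the sample rows, the function names and the 0.6 cutoff are assumptions for illustration.

```python
import re
from collections import Counter
from difflib import get_close_matches

def build_word_counts(texts):
    """Count how often each word appears across all descriptions."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

def suggest(counts, prefix, limit=5):
    """Words starting with prefix, most frequent first (ties alphabetical)."""
    prefix = prefix.lower()
    matches = [w for w in counts if w.startswith(prefix)]
    matches.sort(key=lambda w: (-counts[w], w))
    return matches[:limit]

def correct(counts, typed, limit=3):
    """Closest known words for a possibly misspelled input."""
    return get_close_matches(typed.lower(), list(counts), n=limit, cutoff=0.6)

# Made-up rows in the spirit of the companies table from the question.
rows = [
    "fine bakery since 1920",
    "all the materials for a bakery",
    "telemetry and teleconference equipment",
    "telemetry sensors",
]
counts = build_word_counts(rows)
```

Here `suggest(counts, "tel")` ranks "telemetry" ahead of "teleconference" because it appears in more rows. In a real deployment the counts would live in the words table and be updated whenever the source rows change.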

SQL query with LIKE ...% on an indexed column

I am using a mysql database.
My website is divided into different elements (PRJ_12 for project 12, TSK_14 for task 14, DOC_18 for document 18, etc). We currently store the references to these elements in our database as VARCHAR. The relation columns are indexed so that selects are faster.
We are thinking of splitting each of these columns into 2 columns (one column "element_type" with PRJ and one "element_id" with 12). We are considering this solution because we run a lot of queries containing LIKE ...% (for example, to retrieve all tasks of one user, no matter the id of the task).
However, splitting these columns in 2 will increase the number of indexed columns.
So, I have two questions:
Is a LIKE ...% query on an indexed column really slower than a simple WHERE query (without LIKE)? (I know that if the column is not indexed, WHERE ... LIKE % queries are not advisable, but I don't really know how indexes work.)
Splitting the reference columns in two will double the number of indexed columns. Is that a problem?
Thanks,
1) A LIKE is always more costly than a full comparison (with =). However, it all comes down to the field data types and the number of records (unless we're talking about a huge table, you shouldn't have issues).
2) Multicolumn indexes are not a problem. Yes, it makes the index bigger, but so what? Data types and the total number of rows matter, but that's what indexes are for.
So go for it
There are a number of factors involved, but in general, adding one more index on a table that has only one index already is unlikely to be a big problem. Some things to consider.
If the table is mostly read-only, then it is almost certainly not a problem. If updates are rare, then the indexes won't need to be modified often, meaning there will be very little extra cost (aside from the additional disk space).
If updates to existing records do not change either of those key values, then no index modification should be needed and so again there would be no additional runtime cost.
DELETES and INSERTS will need to update both indexes. So if that is the majority of the operations (and far exceeding reads), then an additional index might incur measurable performance degradation (but it might not be a lot and not noticeable from a human perspective).
The LIKE operator, used the way you describe, should be fully optimized. In other words, the clause WHERE combinedfield LIKE 'PRJ%' should perform essentially the same as WHERE element_type = 'PRJ' if there is an index existing in both situations. The more expensive situation is if you use the wildcard at the beginning (e.g., LIKE '%abc%'). You can think of a LIKE search as being equivalent to looking up a word in a dictionary. The search for 'overf%' is basically the same as a search for 'overflow'. You can do a "manual" binary search in the dictionary and quickly find the first word beginning with 'overf'. Searching for '%low', though, is much more expensive. You have to scan the entire dictionary in order to find all the words that end with "low".
Having two separate fields to represent two separate values is almost always better in the long run since you can construct more efficient queries, easily perform joins, etc.
So based on the given information, I would recommend splitting it into two fields and index both fields.
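The dictionary analogy can be sketched with a sorted list standing in for the index: a prefix lookup jumps to the right place with a binary search and stops at the first non-match, while a suffix lookup has to visit every entry. This is only an illustration of the access pattern, not of MySQL's actual B-tree code; the reference values and function names are made up.

```python
from bisect import bisect_left

def prefix_search(sorted_refs, prefix):
    """Like LIKE 'prefix%' on an index: binary search, then a short range scan."""
    i = bisect_left(sorted_refs, prefix)
    matches = []
    while i < len(sorted_refs) and sorted_refs[i].startswith(prefix):
        matches.append(sorted_refs[i])
        i += 1
    return matches

def suffix_search(sorted_refs, suffix):
    """Like LIKE '%suffix': the sort order does not help, so scan everything."""
    return [ref for ref in sorted_refs if ref.endswith(suffix)]

# Hypothetical combined references, kept sorted the way an index would be.
refs = sorted(["DOC_18", "PRJ_12", "PRJ_7", "TSK_14", "TSK_3"])
projects = prefix_search(refs, "PRJ_")
ending_in_14 = suffix_search(refs, "_14")
```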

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc., where the number of attributes may increase in the future. The purpose is to display this table to the user, who should be able to quickly filter/sort the data (so the table structure should support fast querying and sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns and have to keep in mind that there might be a few null values and may also need to add more columns in the future)
website_attributes_change_log
same table structure as above with an added column for "change_update_time"
I feel the advantage of this schema is the queries will be easy to write even when some attributes are linked to other tables and also sorting will be simple. The disadvantage I guess will be adding columns later can be problematic with ALTER TABLE taking very long to run on large data tables + there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage out here seems to be the flexibility of this approach, in that I can add columns whenever and also I save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when needing to display the tables [since I will need to display records for multiple sites at a time and there will also be cross referencing of values with other tables for certain attributes] + sorting the data might be difficult [given that this is not a column based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
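To give a feel for why EAV queries get awkward, here is a small sketch that pivots Option B style rows back into one row per site so they can be sorted by a numeric attribute. It runs on SQLite through Python purely so it is self-contained; the schema is simplified (attribute_name is stored inline instead of through website_attribute_fields) and the values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE website_attributes (
    website_id      INTEGER,
    attribute_name  TEXT,   -- simplified: the question uses attribute_id + a lookup table
    attribute_value TEXT    -- every value is stored as text in an EAV layout
);
INSERT INTO website_attributes VALUES
    (1, 'page_size', '232032'), (1, 'word_count', '232'),
    (2, 'page_size', '9500'),   (2, 'word_count', '1200');
""")

# One CASE expression per attribute, plus a cast, just to get columns back.
PIVOT_SQL = """
SELECT website_id,
       MAX(CASE WHEN attribute_name = 'page_size'
                THEN CAST(attribute_value AS INTEGER) END) AS page_size,
       MAX(CASE WHEN attribute_name = 'word_count'
                THEN CAST(attribute_value AS INTEGER) END) AS word_count
FROM website_attributes
GROUP BY website_id
ORDER BY page_size DESC
"""
rows = conn.execute(PIVOT_SQL).fetchall()
```

With Option A the same result is a plain SELECT ... ORDER BY page_size DESC, and every new attribute in Option B means another CASE branch in queries like this one.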
Option A is the better way. Although ALTER TABLE may take a long time when adding an extra column, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE did not take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. With Option A you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want to improve the query time of the database, then you should definitely index your tables to get fast results.
I think Option A is not a good design. With a good data model you should not have to change the tables in the future. If you have a good command of SQL, writing queries against Option B will not be difficult. It is also the solution to your real problem: "you need to store some attributes (an open-ended number, not a final set) of some webpages, so there should be an entity that represents those attributes".
Use Option A, as the attributes are fixed. It will be difficult to query and process data from the second model, as there will be queries based on multiple attributes.

Sphinx vs. MySql - Search through list of friends (efficiency/speed)

I'm porting my application searches over to Sphinx from MySQL and am having a hard time figuring this one out, or if it even needs to be ported at all (I really want to know if it's worth using sphinx for this specific case for efficiency/speed):
users
uid | uname
1 | alex
2 | barry
3 | david
friends
uid | fid
1 | 2
2 | 1
1 | 3
3 | 1
Details are:
- InnoDB
- users: index on uid, index on uname
- friends: combined index on uid,fid
Normally, to search all of alex's friends with mysql:
$uid = 1;
$searchstr = "%$friendSearch%";
$query = "SELECT f.fid, u.uname FROM friends f
JOIN users u ON f.fid=u.uid
WHERE f.uid=:uid AND u.uname LIKE :friendSearch";
$friends = $dbh->prepare($query);
$friends->bindParam(':uid', $uid, PDO::PARAM_INT);
$friends->bindParam(':friendSearch', $searchstr, PDO::PARAM_STR);
$friends->execute();
Is it any more efficient to find alex's friends with sphinx vs mysql or would that be an overkill? If sphinx would be faster for this as the list hits thousands of people,
what would the indexing query look like? How would I delete a friendship that no longer exists with sphinx as well, can I have a detailed example in this case? Should I change this query to use Sphinx?
Ok this is how I see this working.
I have the exact same problem with MongoDB. MongoDB "offers" searching capabilities but just like MySQL you should never use them unless you wanna be choked with IO, CPU and memory problems and be forced to use a lot more servers to cope with your index than you normally would.
The whole idea of using Sphinx (or another search tech) is to lower cost per server by having a performant index searcher.
Sphinx, however, is not a storage engine. It is not as simple to query exact relationships across tables. They have remedied this a little with SphinxQL, but due to the nature of the full text index it still doesn't do an integral join like you would get in MySQL.
Instead I would store the relationships within MySQL but have an index of "users" within Sphinx.
In my website I personally have 2 indexes:
main (houses users,videos,channels and playlists)
help (help system search)
These are delta updated once every minute. Since realtime indexes are still a bit experimental at times, and I personally have seen problems with high insertion/deletion rates, I keep to delta updates. So I would use a delta index to update the main searchable objects of my site, since this is less resource intensive and more performant than realtime indexes (from my own tests).
Do note that in order to process deletions and the like in your Sphinx collection through a delta, you will need a killlist and certain filters for your delta index. Here is an example from my index:
source main_delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query_pre =
    sql_query = \
        SELECT id, deleted, _id, uid, listing, title, description, category, tags, author_name, duration, rating, views, type, adult, videos, UNIX_TIMESTAMP(date_uploaded) AS date_uploaded \
        FROM documents \
        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) OR update_time >( SELECT last_index_time FROM sph_counter WHERE counter_id=1 )
    sql_query_killlist = SELECT id FROM documents WHERE update_time>=( SELECT last_index_time FROM sph_counter WHERE counter_id=1 ) OR deleted = 1
}
This processes deletions and additions once every minute which is pretty much realtime for a real web app.
So now we know how to store our indexes. I need to talk about the relationships. Sphinx (even though it has SphinxQL) won't do integral joins across data so I would personally recommend doing the relationship outside of Sphinx, not only that but as I said this relationship table will get high load so this is something that could impact the Sphinx index.
I would do a query to pick out all ids and using that set of ids use the "filter" method on the sphinx API to filter the main index down to specific document ids. Once this is done you can search in Sphinx as normal. This is the most performant method I have found to date of dealing with this.
The key thing to remember at all times is that Sphinx is a search tech while MySQL is a storage tech. Keep that in mind and you should be ok.
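The "pick out the ids first, then filter the search" flow can be sketched as follows. The search step here is a plain Python stand-in, not the Sphinx API; in a real setup the id set would come from the friends table in MySQL and be passed to the search engine as a filter on document ids. All names and data below are assumptions taken from the question's sample tables.

```python
def friend_ids(friendships, uid):
    """Step 1 (storage side): ids of everyone uid is friends with."""
    return {fid for (owner, fid) in friendships if owner == uid}

def search_users(user_index, term, allowed_ids):
    """Step 2 (search side): match the term, restricted to the allowed ids."""
    term = term.lower()
    return sorted(
        (uid, name)
        for uid, name in user_index.items()
        if uid in allowed_ids and term in name.lower()
    )

# Sample data mirroring the users and friends tables in the question.
friendships = [(1, 2), (2, 1), (1, 3), (3, 1)]
users = {1: "alex", 2: "barry", 3: "david"}

alex_friends = friend_ids(friendships, 1)
matches = search_users(users, "ar", alex_friends)
```

Keeping the relationship lookup in MySQL means a deleted friendship takes effect on the next query without touching the search index at all.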
Edit
As #N.B said (which I overlooked in my answer) Sphinx does have SphinxSE. Although primitive and still in a sort of testing stage of its development (same as realtime indexes), it does provide an actual MyISAM/InnoDB type storage to Sphinx. This is awesome. However there are caveats (as with anything):
The language is primitive
The joins are primitive
However, it could do the job you're looking for, so be sure to look into it.
So I'm going to go ahead and kinda outline what -I- feel the best use cases for Sphinx are, and you can kinda decide if it's more or less in line with what you're looking to do.
If all you're looking to do is a string search on one field, then with MySQL you can do wildcard searches without much trouble, and honestly, with an index on it, unless you're expecting millions of rows you are going to be fine.
Now take Facebook, which is not only indexing names but pages etc., or even any advanced search fields. Sphinx can take in x columns from MySQL, PostgreSQL, MongoDB, (insert your db you want here) and create a searchable full-text index across all of those.
Example:
You have 5 fields (house number, street, city, state, zipcode) and you want to do a full text search across all of those. Now with MySQL you could do searches on every single one, however with sphinx you can glob them all together then sphinx does some awesome statistical findings based on the string you've passed in and the matches which are resulting from it.
This Link: PHP Sphinx Searching does a great job at walking you through what it would look like and how things work together.
So you aren't really replacing a database; you're just adding a special daemon to it (sphinx) which allows you to create specialized indexes and run your full text searches against it.
No index can help you with this query, since you're looking for the string as an infix, not a prefix (you're looking for '%friendname%', not 'friendname%').
Moreover, the LIKE solution will get you into corners: suppose you were looking for a friend called Ann. The LIKE expression will also match Marianne, Danny etc. There's no "complete word" notion in a LIKE expression.
A real solution is to use a text index. A FULLTEXT index is only available on MyISAM, and MySQL 5.6 (not GA at this time) will introduce FULLTEXT on InnoDB.
Otherwise you can indeed use Sphinx to search the text.
With just hundreds or thousands, you will probably not see a big difference, unless you're really going to do many searches per second. With larger numbers, you will eventually realize that a full table scan is inferior to Sphinx search.
I'm using Sphinx a lot, on dozens and sometimes hundreds of millions large texts, and can testify it works like a charm.
The problem with Sphinx is, of course, that it's an external tool. With Sphinx you have to tell it to read data from your database. You can do so (using crontab for example) every 5 minutes, every hour, etc. So if rows are DELETEd, they will only be removed from sphinx the next time it reads the data from table. If you can live with that - that's the simplest solution.
If you can't, there are real time indexes in Sphinx, so you may directly instruct it to remove certain rows. I am unable to explain everything in this post, so here are a couple of links for you:
Index updates
Real time indexes
As a final conclusion, you have three options:
Risk it and use a full table scan, assuming you won't have high load.
Wait for MySQL 5.6 and use FULLTEXT with InnoDB.
Use sphinx
At this point in time, I would certainly use option #3: use sphinx.
Take a look at the solution I propose here:
https://stackoverflow.com/a/22531268/543814
Your friend names are probably short, and your query looks simple enough. You can probably afford to store all suffixes, perhaps in a separate table, pointing back to the original table to get the full name.
This would give you fast infix search at the cost of a little bit more storage space.
Furthermore, to avoid finding 'Marianne' when searching for 'Ann', consider:
Using case-sensitive search. (Fragile; may break if your users enter their names or their search queries with incorrect capitalization.)
After the query, filtering your search results further, requiring word boundaries around the search term (e.g. regex \bAnn\b).
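A minimal sketch of the suffix idea, assuming short names: every suffix of every name goes into a sorted structure (a stand-in for an indexed suffix table), so an infix search becomes a prefix search over suffixes. The optional second step applies the word-boundary filter mentioned above. The names and function names are made up for the example.

```python
import re
from bisect import bisect_left

def build_suffix_table(names):
    """Sorted (suffix, original_name) pairs, one per starting position."""
    table = []
    for name in names:
        lowered = name.lower()
        for i in range(len(lowered)):
            table.append((lowered[i:], name))
    table.sort()
    return table

def infix_search(table, term):
    """Names containing term anywhere: a prefix lookup on the suffix table."""
    term = term.lower()
    i = bisect_left(table, (term,))
    found = set()
    while i < len(table) and table[i][0].startswith(term):
        found.add(table[i][1])
        i += 1
    return sorted(found)

def whole_word_only(names, term):
    """Post-filter: keep names where term appears as a complete word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [n for n in names if pattern.search(n)]

names = ["Ann Lee", "Marianne", "Danny", "Bob"]
table = build_suffix_table(names)
candidates = infix_search(table, "ann")
exact = whole_word_only(candidates, "ann")
```

The storage cost grows with the square of the name length, which is why this only makes sense for short strings such as usernames.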

Decatenate with MySQL?

I have an authors table in my database that lists an author's whole name, e.g. "Charles Dickinson". I would like to sort of "decatenate" at the space, so that I can get "Charles" and "Dickinson" separately. I know there is the explode function in PHP, but is there anything similar for a straight MySQL query? Thanks.
No, don't do that. Seriously. That is a performance killer. If you ever find yourself having to process a sub-column (part of a column) in some way, your DB design is flawed. It may well work okay on a home address book application or any of myriad other small databases but it will not be scalable.
Store the components of the name in separate columns. It's almost invariably a lot faster to join columns together with a simple concatenation (when you need the full name) than it is to split them apart with a character search.
If, for some reason you cannot split the field, at least put in the extra columns and use an insert/update trigger to populate them. While not 3NF, this will guarantee that the data is still consistent and will massively speed up your queries. You could also ensure that the extra columns are lower-cased (and indexed if you're searching on them) at the same time so as to not have to fiddle around with case issues.
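As a rough sketch of the trigger approach, the following keeps lower-cased first_name and last_name columns in step with full_name at insert time. It uses SQLite through Python so it can run anywhere; a MySQL trigger would be written differently (for example with SUBSTRING_INDEX), and the table layout and trigger name are assumptions. It also assumes each name contains exactly one space.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (
    id         INTEGER PRIMARY KEY,
    full_name  TEXT,
    first_name TEXT,   -- derived, lower-cased for case-insensitive lookups
    last_name  TEXT    -- derived, lower-cased for case-insensitive lookups
);
CREATE INDEX authors_last_name ON authors (last_name);

-- Split once at write time instead of on every read.
CREATE TRIGGER authors_split AFTER INSERT ON authors
BEGIN
    UPDATE authors SET
        first_name = lower(substr(NEW.full_name, 1, instr(NEW.full_name, ' ') - 1)),
        last_name  = lower(substr(NEW.full_name, instr(NEW.full_name, ' ') + 1))
    WHERE id = NEW.id;
END;
""")

conn.execute("INSERT INTO authors (full_name) VALUES (?)", ("Charles Dickinson",))
row = conn.execute("SELECT first_name, last_name FROM authors").fetchone()
```

A matching update trigger would be needed as well if full_name can change after insert.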
This is related: MySQL Split String