I use MySQL as my main database and sync some data to Elasticsearch to make use of features like fuzzy search and aggregations. However, this problem applies to any pairing of a relational and a non-relational database.
When a user searches for something, I query Elasticsearch, get the ids (primary keys in MySQL), and then make another query to MySQL, filtering by the ids that Elasticsearch returned. I use this approach because you often need to load additional data from the relational database, and it would be hell to maintain those relations inside document-based Elasticsearch (e.g. loading a user with each comment).
The problem is that the same filters will not be applied to the Elasticsearch query and the MySQL query. In the example above, what if you need to filter comments by some user attribute? That filter will be applied to the MySQL query but not to Elasticsearch. If the filters differ, pagination will mismatch: the 2nd page in MySQL can be the 4th in Elasticsearch. If I take all of the ids from Elasticsearch (no pagination), I am afraid of long response times and the cluster failing. Besides, you can't get more than 10K records from Elasticsearch without the scroll API.
I need a conceptual solution here, not actual query examples. Feel free to suggest a totally different approach altogether. Also, I don't need a perfect pagination match, since MySQL will do the pagination anyway. If Elasticsearch needs to return more records, that's fine; I just don't want to cause too heavy a load.
I'm afraid there is no general solution to the problem you describe. It depends on your response-time expectations, the size of the data, and so on.
For example,
If you can ensure that one side of the join will be much smaller, you could change the join direction: first run the query on MySQL, then do an id-based terms search in ES.
Consider using a database with embedded search, like Postgres, depending on how complex your queries are and which other ES features you rely on.
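The reversed join direction from the first suggestion above can be sketched roughly as follows. This is only an illustration under assumptions: the field names `comment_id` and `body`, the helper name, and the cap on ids are all hypothetical. The function just builds a bool query body with a terms filter, which you would then pass to whichever Elasticsearch client you use.

```python
from typing import Any, Dict, List

# Hypothetical cap: long id lists make the terms filter expensive, so this
# direction only fits when the MySQL side is the small one.
MAX_IDS = 1000


def build_es_query(ids: List[int], text: str, page: int, size: int) -> Dict[str, Any]:
    """Build an Elasticsearch request body restricted to ids pre-filtered by MySQL."""
    if len(ids) > MAX_IDS:
        raise ValueError("MySQL side is too large; narrow it down first")
    return {
        "from": page * size,
        "size": size,
        "query": {
            "bool": {
                # Fuzzy full-text match; "body" is an assumed field name.
                "must": [{"match": {"body": {"query": text, "fuzziness": "AUTO"}}}],
                # Non-scoring filter on the ids that survived the MySQL filters.
                "filter": [{"terms": {"comment_id": ids}}],
            }
        },
    }


body = build_es_query([3, 7], "hello", page=1, size=10)
```

Because every filter has already been applied on the MySQL side, pagination here is done by Elasticsearch alone, so the page mismatch from the question does not arise.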
Related
I have joined a new company where I observed the below use case.
Use case: a table holds around 500 GB of data. The data consists of user action events, one per user activity. The purpose is to analyse the activity count
for different permutations and combinations over any given date range. To that end, the data is also fed to Elasticsearch (and to Lucene in a similar scenario).
My understanding is that for this kind of scenario the DB by itself should be sufficient. But when I query the DB for a specific combination over a given date range, it is extremely slow and times out most of the time.
But when I fetch the same combination with Elasticsearch (or Lucene), it is much faster. No full-text search support is required here.
I'm not sure what makes Elasticsearch (or Lucene) so much faster than a SQL-based DB even for regular (not full-text) search.
What could be the probable reason? I can think of two:
Elasticsearch (or Lucene) keeps the data in compressed form, so maybe it is quicker to search?
Elasticsearch may achieve parallelism by keeping data in multiple shards by default. But in the Lucene case I do not even see any parallelism.
When setting up a MySQL / ElasticSearch combo, is it better to:
1. Completely sync all model information to ES (even the non-search data), so that when a result is found, I have all its information handy.
2. Only sync the searchable fields, and then when I get the results back, use the id field to find the actual data in the MySQL database?
The Elasticsearch model of data prefers non-normalized data, usually. Depending on the use case (large amount of data, underpowered machines, too few nodes etc) keeping relationships in ES (parent-child) to mimic the inner joins and the like from the RDB world is expensive.
Your question is very open-ended and the answer depends on the use-case. Generally speaking:
avoid mirroring the exact DB tables and their relationships as ES indices
advantage of keeping everything in ES is that you don't need to update both mechanisms at the same time
if your search-able data is very small compared to the overall amount of data, I don't see why you couldn't synchronize just the search-able data with ES
try to flatten the data in ES and resist any impulse of using parent/child just because this is how it's done in MySQL
I'm not saying you cannot use parent/child. You can, but make sure you test it before adopting this approach and make sure you are OK with the response times. That is valid advice for any approach you choose.
Elasticsearch is a search engine. I would advise you not to use it as a database system. I suggest you index only the search data and a unique id from your database, so that you can retrieve the results from MySQL using the unique key returned by Elasticsearch.
This way you'll be using both applications for what they're intended. Elasticsearch is not the best at querying relations, and you'll have to write a lot more code to operate on related data than you would by simply using MySQL for it.
Also, you don't want to tie up your persistence layer with search layer. These should be as independent as possible, and change in one should not affect the other, as much as possible. Otherwise, you'll have to update both your systems if either has to change.
Querying MySQL on some IDs is very fast, so you can use it and leave the slow part (querying on full text) to elastic search.
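A minimal sketch of that second step, assuming the search engine has already returned a ranked list of ids. SQLite stands in for MySQL here so the snippet is self-contained, and the `products` table and its columns are made up for illustration.

```python
import sqlite3


def fetch_in_search_order(conn, ids):
    """Load rows for the ids returned by the search engine, keeping its ranking."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, price FROM products WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not guarantee any order, so re-apply the search ranking here.
    # Ids missing from the database (stale index entries) are skipped.
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "kettle", 19.9), (2, "teapot", 24.5), (3, "mug", 4.0)],
)

search_hit_ids = [3, 1, 99]  # stand-in for ids from the search engine; 99 is stale
rows = fetch_in_search_order(conn, search_hit_ids)
```

The lookup is a primary-key query, which is the fast part; the only extra work is restoring the relevance order that the database does not know about.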
Although it depends on the situation, I would suggest going with #2:
Faster indexing: we only fetch the searchable data from the DB and index it in ES, compared to fetching and indexing everything.
Smaller storage size: since the indexed data is smaller than with #1, it is easier to back up, restore, recover, and upgrade ES in production. It also keeps storage small as your data grows, and you can consider using SSDs to improve performance at lower cost.
In general, a search app searches on some fields but shows all available data to the user, e.g. searching for products but showing pricing/stock info on the results page, which is only available in the DB. So it is natural to have a second step that queries the DB for the extra info and combines it with the search results for display.
Hope it helps.
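As a rough illustration of option #2, the indexing side can project each database row down to its id plus the searchable fields before sending it to ES. The field names below are assumptions chosen for a product catalogue, not anything prescribed by Elasticsearch.

```python
# Assumed searchable columns; everything else stays only in the database.
SEARCHABLE_FIELDS = ("name", "description")


def to_search_doc(row: dict) -> dict:
    """Keep only the primary key and the searchable fields for indexing."""
    doc = {field: row[field] for field in SEARCHABLE_FIELDS}
    doc["id"] = row["id"]
    return doc


db_row = {"id": 7, "name": "Mug", "description": "blue", "price": 4.0, "stock": 12}
search_doc = to_search_doc(db_row)
```

Pricing and stock never reach the index, so changes to them require no reindexing.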
Is it true that a relational database, like MySQL, performs better than a graph database, like Neo4j, when a query searches for specific data within a specific table and a specific column?
For instance, if the query is: "search for all events that took place in Paris".
Let's assume for simplicity that MySQL would have an Event table with an index on "City" to optimize this kind of query.
What about Neo4j?
One might think that a graph database has to traverse all graphs to retrieve the concerned events...
However, it's possible to create indexes in Neo4j, as its documentation specifies.
Why would an RDBMS be faster than it for this kind of analysis/statistics request?
As you already mentioned: you would create indices for this purpose. The default index provider in Neo4j is Lucene, which is very fast and allows fine-grained indexing and querying.
Indices can be used for nodes or relationships and (normally) keep track which values have been set on certain properties on nodes or relationships.
You normally have to do the indexing in your application code unless you're using neo4j's auto indexing feature that automatically indexes all nodes and/or relationships with given properties.
So queries like "search for all events that took place in Paris" are absolutely no problem and are very performant when indices are used.
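The effect of an index on this kind of lookup is easy to observe on the relational side that the question describes. The sketch below uses SQLite purely as a stand-in (not MySQL and not Neo4j), with a hypothetical `event` table, and inspects the query plan to see that the index is used rather than a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_event_city ON event (city)")
conn.executemany(
    "INSERT INTO event (name, city) VALUES (?, ?)",
    [("Expo", "Paris"), ("Fair", "Berlin"), ("Salon", "Paris")],
)

# The last column of each plan row holds the human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM event WHERE city = ?", ("Paris",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

events = conn.execute(
    "SELECT name FROM event WHERE city = ? ORDER BY name", ("Paris",)
).fetchall()
```

The plan text names `idx_event_city`, meaning the engine jumps straight to the matching entries. An indexed property lookup in a graph database serves the same purpose: it finds the starting nodes without traversing the graph.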
I am trying to decide whether to use a MySQL database or a Mongo database. My decision depends heavily on the following:
I want to select records between two dates (period)
However, is this possible?
My application won't do any complex queries, just basic CRUD. It has Facebook integration, so in the current setup I sometimes have to JOIN the users table.
Either DB will allow you to filter between dates and I wouldn't use that requirement to make the decision. Some questions you should answer:
Do you need to store your data in a relational system, like MySQL? Relational databases are better at cross entity joining.
Will your data be very complicated while your queries stay simple (e.g. lookup by ID)? If so, MongoDB may be a better fit, as storing and retrieving complex data is a cinch.
Who will be querying the data, and from where? MySQL uses SQL for querying, which is a much better known skill than Mongo's JSON query syntax.
These are just three questions to ask. In order to make a recommendation, we'll need to know more about your application.
MySQL (SQL) or MongoDB (NoSQL): both can work for your needs. The choice between an RDBMS and NoSQL comes down to the requirements of your application.
If your application cares about speed, needs no relations between the data, and its schema changes very frequently, you can choose MongoDB; it is faster since no joins are needed and every record is stored as a document.
Otherwise, go for MySQL.
If you are looking for range queries in MongoDB - yes, Mongo supports those. For date-based range queries, have a look at this: http://cookbook.mongodb.org/patterns/date_range/
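For illustration, a date-range filter looks much the same in both systems. The sketch below builds a MongoDB filter document with the `$gte` and `$lt` operators and shows a parameterised SQL equivalent; the field name `created_at` and the table name `records` are assumptions.

```python
from datetime import datetime, timezone


def mongo_date_range(field, start, end):
    """Build a MongoDB filter matching start <= field < end."""
    return {field: {"$gte": start, "$lt": end}}


# Equivalent parameterised SQL for MySQL; "records" is a hypothetical table.
SQL_DATE_RANGE = "SELECT * FROM records WHERE created_at >= %s AND created_at < %s"

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 2, 1, tzinfo=timezone.utc)
january = mongo_date_range("created_at", start, end)
# With pymongo, this dict would be passed as collection.find(january).
```

A half-open range (inclusive start, exclusive end) avoids counting a record twice when consecutive periods are queried.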
What options exist for creating a scalable, full text search with results that need to be sorted on a per user basis? This is for PHP/MySQL (Symfony/Doctrine as well, if relevant).
In our case, we have a database of workouts that have been performed by users. The workouts that the user has done before should appear at the top of the results. The more frequently they've done the workout, the higher it should appear in search matches. If it helps, you can assume we know the number of times a user has done a workout in advance.
Possible Solutions
Sphinx - Use Sphinx to implement full text search, do all the querying and sorting in MySQL. This seems promising (and there's a Symfony Plugin!) but I don't know much about it.
Lucene - Use Lucene to perform full text search and put the users' completions into the query, as suggested in this Stack Overflow thread. Alternatively, use Lucene to retrieve the results, then reorder them in PHP. However, both solutions seem clunky and potentially unscalable, as a user may have completed hundreds of workouts.
MySQL - No native full text support (InnoDB), so we'd have to use LIKE or REGEX, which isn't scalable.
MySQL does have native FULLTEXT support, though before MySQL 5.6 it was available only in MyISAM tables.
For most real-world tasks, Sphinx is the fastest engine. However, it is an external index, so it can only be updated periodically with a cron script.
By using SphinxSE (a pluggable MySQL interface to Sphinx), you can join MySQL tables and Sphinx indexes in one query. Updating, though, will still require an external script.
Since the number of workouts performed seems to change frequently, keeping it in Sphinx would require too much effort on rebuilding the index.
With SphinxSE, you can write a query similar to this:
SELECT *
FROM workouts w
JOIN user_workouts uw
ON uw.workout = w.id
WHERE w.query = 'query query query;filter=user_id,$user_id'
AND uw.user = $user_id
ORDER BY
uw.times_performed DESC
I'm not sure why you're assuming using Lucene would be unscalable. Hundreds of workouts per user is not a lot of data to deal with.
Try using Solr/Lucene for the search backend. It has a JSON/XML interface which will play nicely with your PHP frontend. Store the number of times each user has completed a workout in a database table. When a query is issued, take the results from Solr, select the counts from the database table, and re-sort in PHP code. That should be plenty fast and scalable. With Solr, maintaining your index is dirt simple; just issue add/update/delete requests to your Solr server.
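The re-sorting step could look roughly like this. It is shown in Python rather than PHP for brevity, and the function name and sample data are made up: `hit_ids` stands for the workout ids returned by the search engine in relevance order, and `times_performed` for the per-user counts loaded from the database.

```python
def rank_for_user(hit_ids, times_performed):
    """Order search hits by how often the user did each workout, then by relevance."""
    # sorted() is stable, so hits with equal counts keep the search engine's order.
    return sorted(hit_ids, key=lambda workout_id: -times_performed.get(workout_id, 0))


hits = [10, 11, 12, 13]    # relevance order from the search backend
counts = {12: 5, 10: 2}    # workout id -> times this user performed it
ranked = rank_for_user(hits, counts)
```

Workouts the user has never done keep their original relevance order at the end of the list, and the sort only touches the current result set, not the user's full history.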