I was looking for any results that show the difference between MYSQL FULL TEXT search and ELASTIC SEARCH.
Which one is quicker?
Has someone already used it in huge databases?
MySQL and ES has totally different ideas behind data storages. ES gives you more control
Generally speaking, ES is faster for fulltext
More important, ES was built for scaling, MySQL wasn't. Idk what do you mean by "huge databases", but in modern world, "huge" in most cases mean that you cant fit data into single average hardware node => you need horizontal scaling. Scaling MySQL is nightmare, as it fits the CA points of CAP, ES made for AP
ES is production-ready/proven solution for fulltext search and analytics. Can't say the same about fulltext storages in MySQL
Related
I'm working on a analysis assignment, we got a partial data-set from the university Library containing almost 300.000.000 rows.
Each row contains:
ID
Date
Owner
Deadline
Checkout_date
Checkin_date
I put all this inside a MySQL table, then I started querying that for my analysis assignment, however simple query (SELECT * FROM table WHERE ID = something) where taking 9-10 minutes to complete. So I created an index for all the columns, which made it noticeable faster ~ 30 sec.
So I started reading similar issues, and people recommended switching to a "Wide column store" or "Search engine" instead of "Relational".
So my question is, what would be the best database engine to use for this data?
Using a search engine to search is IMO the best option.
Elasticsearch of course!
Disclaimer: I work at elastic. :)
The answer is, of course, "it depends". In your example, you're counting the number of records in the database with a given ID. I find it hard to believe that it would take 30 seconds in MySQL, unless you're on some sluggish laptop.
MySQL has powered an incredible number of systems because it is full-featured, stable, and has pretty good performance. It's bad (or has been bad) at some things, like text search, clustering, etc.
Systems like Elasticsearch are good with globs of text, but still may not be a good fit for your system, depending on usage. From your schema, you have one text field ("owner"), and you wouldn't need Elasticsearch's text searching capabilities on a field like that (who ever needed to stem a user name?). Elasticsearch is also used widely for log files, which also don't need a text engine. It is, however, good with blocks of text and with with clustering.
If this is a class assignment, I'd stick with MySQL.
When setting up a MySQL / ElasticSearch combo, is it better to:
Completely sync all model information to ES (even the non-search data), so that when a result is found, I have all its information handy.
Only sync the searchable fields, and then when I get the results back, use the id field to find the actual data in the MySQL database?
The Elasticsearch model of data prefers non-normalized data, usually. Depending on the use case (large amount of data, underpowered machines, too few nodes etc) keeping relationships in ES (parent-child) to mimic the inner joins and the like from the RDB world is expensive.
Your question is very open-ended and the answer depends on the use-case. Generally speaking:
avoid mimicking the exact DB Tables - ES indices plus their relationships
advantage of keeping everything in ES is that you don't need to update both mechanisms at the same time
if your search-able data is very small compared to the overall amount of data, I don't see why you couldn't synchronize just the search-able data with ES
try to flatten the data in ES and resist any impulse of using parent/child just because this is how it's done in MySQL
I'm not saying you cannot use parent/child. You can, but make sure you test this before adopting this approach and make sure you are ok with the response times. This is, anyway, a valid advice for any kind of approach you choose.
ElasticSearch is a search engine. I would advise you to not use it as a database system. I suggest you to only index the search data and a unique id from your database so that you can retrieve the results from MySQL using the unique key returned by ElasticSearch.
This way you'll be using both applications for what they're intended. Elastic search is not the best for querying relations and you'll have to write lot more code for operating on related data than simply using MySql for it.
Also, you don't want to tie up your persistence layer with search layer. These should be as independent as possible, and change in one should not affect the other, as much as possible. Otherwise, you'll have to update both your systems if either has to change.
Querying MySQL on some IDs is very fast, so you can use it and leave the slow part (querying on full text) to elastic search.
Although it's depend on situation, I would suggest you to go with #2:
Faster when indexing: we only fetch searchable data from DB and index to ES, compare to fetch all and index all
Smaller storage size: since indexed data is smaller than #1, it's more easier to backup, restore, recover, upgrade your ES in production. It'll also keep your storage size small when your data growing up, and you can also consider to use SSD to enhance performance with lower cost.
In general, a search app will search on some fields and show all possible data to user. E.g searching for products but will show pricing/stock info.. in result page, which only available in DB. So it's nature to have a 2nd step to query for extra info in DB and combine it with search results to display.
Hope it help.
I want to implement a feature where a list of nearby venues can be presented sorted by the distance from user's location. The approach I have right now is to store lat and lon values as floats and to make a query that is looking for +/- values of the location of the user (searching for a square that extends north, south, east and west of the user). Then I do a quick calculation across the resultset determining the distance and sort in my business logic. Now I am approaching this with the perspective of someone who has primarily used relational databases (the app is running MySQL with Hibernate), but is there a better approach (in a different database like Neo4J or with a better column type?)
Also the approach I have has a semi complex workaround for queries at or near 0 lat or 0 lon).
As for my working definition of optimal I'm looking for approaches that are scalable to potentially hundreds of venues in a 10 mile radius and hundreds of thousands of venues in total. To put it another way approximately 1% of SimpleGEO, so if the scale of this problem doesn't require an optimal solution then "you're alright" would also be an interesting answer, though I'd be intested in knowing why)
You could have a look at Lucene/Solr. Lucene supported location-aware search at least since v2.9.
If you're worried about the Lucene complexities, there's Hibernate Search which is meant to replicated all database changes across to Lucene transparently.
MongoDB has native support for geospatial indexes and extensions to the query language to support a lot of different ways of querying your geo spatial documents.
But if you are looking for relation database try PostgreSQL with PostGIS.
Are you look to Hibernate Spatial?
Hibernate Spatial is a generic extension to Hibernate for handling geographic data. And HS have MySQL Provider.
http://www.hibernatespatial.org/
The site currently does mainly range searches (latitude & longitude) with some filtering like WHERE color = "red" type of clauses. However using MySQL with geospatial index is still quite slow and I need to speed it up.
Problem: Will using Solr to do the search be a good idea?
If so, should I only duplicate the range columns from MySQL into Solr, and do the WHERE clauses in MySQL, or do both type of queries in Solr?
I've read that Solr is not for storing data like a database (ie. MySQL). Does this mean that if my search can take place over 10 different columns (or field in Solr terms), and the MySQL table that I replicated Solr's from only has 11 tables, I would still keep the MySQL table even though that will use up almost twice as much storage space half of which is redundant?
It appears that I'm using structured data (because each row has many columns defined?) and storing the entire table in Solr instead having redundant data on MySQL and Solr will save storage space and number of database access operations when writing. Is Solr a good choice here?
In terms of speed, would it be better to use PostGIS or Solr?
Solr has very fast numerical/date range queries. Solr 3 geospatial takes advantage of that, and I wrote a plugin that does even better. I doubt MySQL is faster.
That said, if the sole problem you are trying to solve is slow geospatial queries then bringing in Solr may solve it but will add a lot of overall complexity to your system since it isn't designed to replace relational databases--it works alongside them. Don't get me wrong; Solr is awesome, particularly for faceted navigation and text search. But you didn't state you wanted to take advantage of Solr's primary features.
PostGIS is by far the most mature open-source GIS storage system. I suggest you try it as an experiment to see if it's better. I would try a lat + lon pair of columns approach like what you are doing now with MySQL, and I would also try using the PostGIS native geospatial way to do it, whatever that is exactly.
One thing you could try in either MySQL or PostGIS is to round your latitude and longitude value to the number of decimals to get an appropriate level of precision you need, which is surely far less than the full precision of a double. And if you store them in floats rather than doubles, right there the precision is capped to 2.37 meters. The system you use will probably have a much easier time doing range queries if there are fewer distinct values to scan over.
I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small size of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer to be answered in all aspects. There are very well certain specificas that may make one system superior to another for a special use case, but I want to cover the basics here.
I will deal entirely with Solr as an example for several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There are a list of facts why but they mostly consist of missing recovery options, lack of acid transactions, possible complications etc. If you decide to use solr, you need to populate your index from another source like an SQL table. In fact solr is perfect for storing documents that include data from several tables and relations, that would otherwise requrie complex joins to be constructed.
Solr/Lucene provides mind blowing text-analysis / stemming / full text search scoring / fuzziness functions. Things you just can not do with MySQL. In fact full text search in MySql is limited to MyIsam and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, score results based on phrase proximity, matching accurazy etc is very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process. Well you can of course index the keys of other documents inside a multivalued field of some document so this way you can actually store 1:n relations and do it both ways to get n:n, but its data overhead. Don't get me wrong, its perfectily fine and efficient for a lot of purposes (for example for some product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors or something). But you reach the end of possibilities with HAS / HAS NOT. You can almonst not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice facetting features and post search analysis. For example: After a very broad search that had 40000 hits you can display that you would only get 3 hits if you refined your search to the combination of having this field this value and that field that value. Stuff that need additional queries in MySQL is done efficiently and convinient.
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind blowingly fast because of the reverse index structure. You can really do a lot of post processing and satisfy other needs. Altough it's document oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching you should definitely go for Solr/Lucene if you haven't good reasons, like very complex, multi-dmensional range filter queries, to do otherwise.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
Get really full text search.
Perform lightning fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magick for full text indexing, which is difficult to achieve with Mysql. A mix of Mysql and Solr can be used: Mysql for CRUD operations and Solr for searches. I have previusly worked with one of India's best real estate online classifieds portal which was using Solr for search ( and was previously using Mysql). The migration reduced the search times manifold.
Solr can be easily integrated with Mysql:
Solr Full Dataimport can be used for importing data from Mysql tables into Solr collections.
Solr Delta import can be scheduled at short frequencies to load latest data from Mysql to Solr collections.