Full-text search functionality in a Spring Boot microservices-based application - MySQL

We have a microservice-based application developed in Spring Boot. Let us assume there are 3 microservices: A, B, and C. The front end is written in Angular, and the backend comprises a MySQL database with Hibernate for ORM. We are required to implement full-text search functionality: a search box on the UI where the user can enter any text, and the search must return data from the databases of all 3 microservices. I am facing difficulty finalizing the search technology for this. Some of the technologies I am considering are:
Hibernate Search
Apache Solr
ElasticSearch
Which is the best technology for this problem? If possible, are there any examples?

Hibernate Search depends on an internal search backend to provide full-text search, which can be either plain Apache Lucene or Elasticsearch. I'm not sure its Elasticsearch integration is fully mature yet, as version 6.0 is still in development.
The older/stable version of Hibernate Search, i.e. 5.11, supports Elasticsearch 2.0 to 5.6.
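For reference, here is roughly what the mapping and a keyword query look like with Hibernate Search 5.x. This is only a sketch: the Product entity and its fields are made up for illustration.

    import java.util.List;

    import javax.persistence.Entity;
    import javax.persistence.EntityManager;
    import javax.persistence.Id;

    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.Search;
    import org.hibernate.search.query.dsl.QueryBuilder;

    @Entity
    @Indexed // marks the entity for full-text indexing
    class Product {
        @Id
        Long id;

        @Field // tokenized and indexed with the default analyzer
        String name;

        @Field
        String description;
    }

    class HibernateSearchExample {
        @SuppressWarnings("unchecked")
        static List<Product> search(EntityManager em, String text) {
            FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
            QueryBuilder qb = ftem.getSearchFactory()
                    .buildQueryBuilder().forEntity(Product.class).get();
            org.apache.lucene.search.Query luceneQuery = qb.keyword()
                    .onFields("name", "description").matching(text).createQuery();
            return ftem.createFullTextQuery(luceneQuery, Product.class).getResultList();
        }
    }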
But looking at your questions, it depends on what use cases you have. Perhaps the points below will help you.
What is the size of your data, and what is the expected growth rate of your documents/data?
What would be your write vs read rates for this application?
What type of search use cases do you have? What search features are you looking for, e.g. autocomplete, autosuggestion, highlighting, faceted search?
Are you looking for distributed search, or are you limited in the hardware you can use?
Is there a requirement to support search in multiple languages?
Would text search alone suffice, or would you also be doing analysis on the search logs or click-view data in the future?
What options do you have for ingesting documents into your search engine? If it's Elasticsearch, you can easily make use of Beats or Logstash. Or you can simply dump raw data into ES and then use the Ingest API to do pre-processing/enrichment/filtering before pushing the processed data into a different index in Elasticsearch (see the sketch below).
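For example, ingesting a document with the Java High Level REST Client could look like the sketch below. The index name and fields are invented; the typeless IndexRequest shown here is the ES 7.x style, and 6.x clients need a type argument instead.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.action.index.IndexResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;

    public class ProductIndexer {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                Map<String, Object> doc = new HashMap<>();
                doc.put("name", "Red running shoes");
                doc.put("description", "Lightweight shoes for daily runs");

                // Typeless request keyed by document id (ES 7.x style)
                IndexRequest request = new IndexRequest("products").id("1").source(doc);
                IndexResponse response = client.index(request, RequestOptions.DEFAULT);
                System.out.println(response.getResult()); // CREATED or UPDATED
            }
        }
    }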
Both Solr and Elasticsearch are great technologies, but if you have to pick one of them, I would strongly suggest Elasticsearch: it would help with all of the queries above, has a much more powerful distributed model, comes with its own mature and easy-to-use query DSL, has excellent administrative tools/APIs for data management, and is extremely fast and easy to set up. Not to mention its aggregation queries, which give you analytical information about the documents you have ingested.
You would also have the luxury of setting up your own dashboards via Kibana, which helps you quickly create some great visualizations.
Another plus is that it is completely RESTful by nature, which makes your life easier when it comes to deploying your applications. I'd suggest you start from here and spend some time understanding the technology; a minimal search sketch follows.
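As a taste of the DSL mentioned above, here is a minimal full-text query through the same high-level client, again with assumed index and field names:

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class ProductSearcher {
        public static void printMatches(RestHighLevelClient client, String userText) throws Exception {
            // multi_match runs the analyzed full-text query over several fields
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.multiMatchQuery(userText, "name", "description"));
            SearchRequest request = new SearchRequest("products").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : response.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }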
Hope this helps!

For full-text search, Elasticsearch and Apache Solr are the best choices from the given selection.
However, I strongly disagree that Elasticsearch is better than Solr, or the other way around, without knowing more about your business case. Both technologies will perform equally well for the given problem, since they are both built upon the Apache Lucene search engine.
They both offer great REST clients.
Here you can check out an example implementation of both Apache Solr and Elasticsearch in the same Java project. You can also check what the differences are and decide which one you prefer.
Also, there are six write-ups on how to use Apache Solr and Elasticsearch written here. The last chapter covers research showing that the two engines are almost equal, with differences only in very specific business cases. Both have many supporting tools as well.

Related

Automatic conversion of SQL query to ElasticSearch Query

I have a service which currently stores data in Oracle DB.
I am working on a project where I need to run a set of SQL queries to get some aggregated data. I want to store these queries in one place, iterate over them, and get the required data.
Say I have 10 queries today, but I can keep adding more without touching the code.
But tomorrow we may want to switch to Elasticsearch. Is there a way I can use the same SQL queries to search Elasticsearch as well?
You might want to look at this Elasticsearch plugin, which aims at providing an SQL layer on top of Elasticsearch:
https://github.com/NLPchina/elasticsearch-sql
With Elasticsearch 6.3 released in June 2018, you might not need an "automatic conversion" anymore.
The 6.3 release comes with native SQL support! (still experimental for now)
Have you (or someone you know) ever:
Said “I know how to do this thing in a SQL statement -- how do I do the same thing in Elasticsearch?”
Tried to build out full-text search with tokenization, stemming, synonyms, relevance sorting on top of a SQL engine like a relational database?
Tried to scale out a traditional database to billions of rows?
Tried to connect a 3rd party tool like a BI system to Elasticsearch?
These are all things into which we hope we can make inroads with our new Elasticsearch SQL release.
Our hope is to allow developers, data scientists, and others that are familiar with the SQL language -- but so far unfamiliar with or unable to use the Elasticsearch query language -- to use the speed, scalability, and full-text power that Elasticsearch offers and others have grown to know and love.
If you’re just getting started using this functionality or the power of Elasticsearch that powers it, here are a few things to try:
SELECT … ORDER BY SCORE() DESC to be able to sort by the relevance of the search results
Get all of the full-text magic from tokenization to stemming by using the MATCH operator like SELECT … WHERE MATCH(fieldname, 'some text')
Connect your favorite JDBC-compatible tool to Elasticsearch with our JDBC driver
Learn how to use the full power of the Elasticsearch DSL by translating a SQL query you know via the translate API
Note that this feature is made available in the “default” (non-OSS-only) distribution of Elasticsearch, and the REST API -- including the “translate” functionality -- and the CLI tool are completely free.
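As a quick, hedged illustration, the SQL and translate endpoints can be called from Java with the low-level REST client. The endpoint was _xpack/sql in 6.3 and was later renamed _sql, so check your version's docs; the products index and its columns are invented for this sketch.

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class SqlEndpointDemo {
        public static void main(String[] args) throws Exception {
            try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
                // Tabular results straight from an SQL statement
                Request sql = new Request("POST", "/_xpack/sql");
                sql.addParameter("format", "txt");
                sql.setJsonEntity("{\"query\": \"SELECT name FROM products "
                        + "WHERE MATCH(description, 'running shoes') ORDER BY SCORE() DESC\"}");
                Response response = client.performRequest(sql);
                System.out.println(EntityUtils.toString(response.getEntity()));

                // Translate the same SQL into the native query DSL
                Request translate = new Request("POST", "/_xpack/sql/translate");
                translate.setJsonEntity(
                        "{\"query\": \"SELECT name FROM products WHERE MATCH(description, 'running shoes')\"}");
                System.out.println(EntityUtils.toString(client.performRequest(translate).getEntity()));
            }
        }
    }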
You could probably make some kind of parser, but I don't think that is really a good idea, even if the parser is well written. You have to remember that Elasticsearch uses inverted indexes, since it's based on Lucene. Querying it the way you would query a relational database defeats that design, so it isn't even clear Elasticsearch would be of any use; you would probably be better off sticking to pure SQL queries.
Also, given that you currently have only 10 queries and you already plan on switching to ES, I'd strongly suggest adapting those 10 requests into proper ES queries, switching to ES, and only then creating new requests within the ES logic; see the sketch below for what one such translation can look like.
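For instance, a SQL aggregation like SELECT status, AVG(amount) FROM orders GROUP BY status has a natural counterpart built with the high-level client's aggregation builders. This is a sketch with invented index and field names; note the Avg class moved packages between client versions.

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.bucket.terms.Terms;
    import org.elasticsearch.search.aggregations.metrics.avg.Avg; // metrics.Avg in 7.x clients
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class OrderStats {
        public static void printAvgByStatus(RestHighLevelClient client) throws Exception {
            // Equivalent of: SELECT status, AVG(amount) FROM orders GROUP BY status
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .size(0) // aggregations only, no document hits
                    .aggregation(AggregationBuilders.terms("by_status").field("status")
                            .subAggregation(AggregationBuilders.avg("avg_amount").field("amount")));
            SearchResponse response = client.search(
                    new SearchRequest("orders").source(source), RequestOptions.DEFAULT);
            Terms byStatus = response.getAggregations().get("by_status");
            for (Terms.Bucket bucket : byStatus.getBuckets()) {
                Avg avg = bucket.getAggregations().get("avg_amount");
                System.out.println(bucket.getKeyAsString() + " -> " + avg.getValue());
            }
        }
    }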

Do I need a search engine for a model with 20+ attributes

I am planning to develop an application which will have a Model with lots of attributes on it. These attributes will be one of the most important parts of the application thus users will be firing search queries most of the time in order to find the result they are looking for.
My question is: is it OK to rely on MySQL or Postgres for this, or should I start with something like Solr or Elasticsearch from the beginning?
I don't want this application to consume lots of memory while doing these searches. This is my first concern, since I will start with a basic server setup with 2 cores and 4 GB of RAM.
Both of them (an RDBMS and a full-text search engine) are valid technologies; mainly it depends on:
your access pattern
features you want to offer in your search services
For instance, if you want to do full-text search, or you want things like autocompletion, faceting, or stemming, Solr or ES is your friend. On the other side, if you want to pick up data in real time (and you don't need the things above), I would use an RDBMS; a sketch of how far MySQL's built-in full-text search can take you follows.
In general: you described your non-functional requirements a bit, but the decision definitely involves functional requirements, too.
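If you do start on the RDBMS side, MySQL's built-in full-text search can carry you quite far before Solr/ES becomes necessary. A hedged sketch with Spring Data JPA, assuming an invented items table with a FULLTEXT index (InnoDB supports these since MySQL 5.6):

    import java.util.List;

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    import org.springframework.data.jpa.repository.JpaRepository;
    import org.springframework.data.jpa.repository.Query;
    import org.springframework.data.repository.query.Param;

    @Entity
    @Table(name = "items")
    class Item {
        @Id
        Long id;
        String name;
        String description;
    }

    // Assumes: CREATE FULLTEXT INDEX idx_items_search ON items (name, description);
    interface ItemRepository extends JpaRepository<Item, Long> {

        @Query(value = "SELECT * FROM items "
                + "WHERE MATCH(name, description) AGAINST (:terms IN NATURAL LANGUAGE MODE)",
                nativeQuery = true)
        List<Item> fullTextSearch(@Param("terms") String terms);
    }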

Cassandra equivalent for MySQL's full text search

We are planning to migrate an application from MySQL to Cassandra. The one major issue we're seeing is that the application makes extensive use of MyISAM's full-text search. What can we use as an alternative on Cassandra?
There is an implementation of Solr in Cassandra: Solandra.
Solr (pronounced /soʊlə/ or /soʊlər/, SOH-lər) is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable.
You can find some other information here: http://www.datastax.com/docs/datastax_enterprise2.0/search/dse_search_about
Use Elassandra, which runs Elasticsearch as a plugin for Apache Cassandra.
Some real examples of Elassandra can be found here.

Good search solution for Zend Framework + Doctrine + MySQL?

I've looked into Doctrine's built-in search, MySQL MyISAM full-text search, Zend_Lucene, and Sphinx, but all the nuances and implementation details are making it hard for me to sort out, given that I don't have experience with anything other than the MyISAM search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables and multiple fields, with some basic search criteria (e.g. user.state = CA AND user.active = 1). The database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great), and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in PHP; a Java equivalent is sketched below.
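solr-php-client is just a wrapper over Solr's HTTP API, so any language works. For comparison, the same kind of query in Java with SolrJ might look like this; the core URL and field names are assumptions, and older SolrJ versions used HttpSolrServer instead of the builder.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class UserSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/users").build();
            // Free-text part of the search
            SolrQuery query = new SolrQuery("profile_text:\"john smith\"");
            // Structured criteria such as user.state = CA AND user.active = 1 map to filter queries
            query.addFilterQuery("state:CA", "active:1");
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
            solr.close();
        }
    }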

Which best fits my needs: MongoDB, CouchDB, or MySQL. Criteria defined in question

Our website needs a content-management-type system. For example, admins want to create promotion pages on the fly. They'll supply some text and images for the page and the URL that the page needs to be on. We need a data store for this. The criteria for the data store are simple and defined below. I am not familiar with CouchDB or MongoDB, but I think they may be a better fit for this than MySQL; I am looking for someone with more knowledge of MongoDB and CouchDB to chime in.
On a scale of 1 to 10 how would you rate MongoDB, CouchDB, and MySQL for the following:
Java client
Track web clicks
CMS like system
Store uploaded files
Easy to setup failover
Support
Documentation
Which would you choose under these circumstances?
Each one is suitable for different use cases, but for low-traffic sites MySQL/PostgreSQL is better.
Java client: all of them have one
Track web clicks: Mongo and Cassandra are more suitable for this high-write situation
Store uploaded files: Mongo with GridFS is suitable. Cassandra can store up to 2 GB per column, split into 1 MB pieces. MySQL is not suitable; storing only the file location in the database and keeping the file on the filesystem is preferred for Cassandra and MySQL.
Easy to set up failover: Cassandra is the best, Mongo second
Support: all have good support; MySQL has the largest community, Mongo is second
Documentation: 1st MySQL, 2nd Mongo
I prefer MongoDB for analytics (web clicks, counters, logs; you need a 64-bit system) and MySQL or PostgreSQL for the main data. On the "companies using Mongo" page on the Mongo website, you can see most of them are using Mongo for analytics. Mongo can be suitable for main data after version 1.8. The problem with Cassandra is its poor querying capabilities (not suitable for a CMS), and the problem with MySQL is that it is not as easily scalable and highly available as Cassandra and Mongo, and it is also slower, especially on writes. I don't recommend CouchDB; it's the slowest one.
My best,
Serdar Irmak
Here are some quick answers based on my experience with Mongo.
Java client
Not sure, but it does exist and it is well supported. Lots of docs, even several POJO wrappers to make it easy.
Track web clicks
8 or 9. It's really easy to do both inserts and updates thanks to "fire and forget". MongoDB has built-in tools to map-reduce the data and easy tools to export the data to SQL for analysis (if Mongo isn't good enough).
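For the record, a click counter with the legacy Java driver of that era is just an upsert with $inc; the collection and field names below are invented:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class ClickTracker {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("analytics");
            DBCollection clicks = db.getCollection("page_clicks");

            // Upsert: create the counter document if missing, otherwise increment it
            clicks.update(
                    new BasicDBObject("url", "/promo/spring-sale"),
                    new BasicDBObject("$inc", new BasicDBObject("count", 1)),
                    true,   // upsert
                    false); // multi
            mongo.close();
        }
    }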
CMS like system
8 or 9. It's easy to store the whole web page content. It's really easy to "hook on" extra columns. This is really Mongo's "bread and butter".
Store uploaded files
There's a learning curve here, but Mongo has a GridFS system designed specifically for both saving and serving binary data.
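Storing and reading a file through GridFS with the same legacy driver looks roughly like this (database and file names are made up):

    import java.io.File;

    import com.mongodb.Mongo;
    import com.mongodb.gridfs.GridFS;
    import com.mongodb.gridfs.GridFSDBFile;
    import com.mongodb.gridfs.GridFSInputFile;

    public class FileStore {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            GridFS gridFs = new GridFS(mongo.getDB("cms"), "uploads");

            // Save: GridFS splits the file into chunks behind the scenes
            GridFSInputFile stored = gridFs.createFile(new File("promo-banner.png"));
            stored.setContentType("image/png");
            stored.save();

            // Serve: look the file up by name and write it back out
            GridFSDBFile found = gridFs.findOne("promo-banner.png");
            found.writeTo("copy-of-banner.png");
            mongo.close();
        }
    }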
Easy to set up failover
Start your primary server: ./mongod --bind_ip 1.2.3.4 --dbpath /my/data/files --master
Start your slave: ./mongod --bind_ip 1.2.3.5 --dbpath /my/data/files --slave --source 1.2.3.4
Support
10gen has a mailing list: http://groups.google.com/group/mongodb-user. They also have paid support.
Their response time generally ranks somewhere between excellent and awesome.
Documentation
Average. It's all there, but it is still a little disorganized. Chalk it up to a lot of recent development.
My take on CouchDB:
Java Client: It's great; use Ektorp, which is pretty easy and provides complete object mapping. Anyway, the whole API is just JSON over HTTP, so it is all easy.
Track web clicks: Maybe Redis is a better tool for this; CouchDB is not the best option here.
CMS-like system: It is great, as you can easily combine templates, dynamic forms, data, etc., and collate them using views.
Store uploaded files: Any document in CouchDB can have arbitrary attachments, so it's a natural fit.
Easy to set up failover: Master/master replication makes sure you are always ready to go, and the database never gets corrupted, so in case of failure it's only a matter of starting Couch again; it will take over where it stopped (minimal downtime) and replication will catch up on the changes. A sketch follows after this list.
Support: There is a mailing list and paid support.
Documentation: Use the open book http://guide.couchdb.org and the wiki.
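Setting up that replication from Java via Ektorp is a one-shot command. A sketch with placeholder hosts and database names; the exact builder methods may differ slightly between Ektorp versions.

    import org.ektorp.CouchDbInstance;
    import org.ektorp.ReplicationCommand;
    import org.ektorp.http.HttpClient;
    import org.ektorp.http.StdHttpClient;
    import org.ektorp.impl.StdCouchDbInstance;

    public class ReplicationSetup {
        public static void main(String[] args) throws Exception {
            HttpClient httpClient = new StdHttpClient.Builder()
                    .url("http://primary-host:5984")
                    .build();
            CouchDbInstance couch = new StdCouchDbInstance(httpClient);

            // Continuous push replication to a second node; run the mirror-image
            // command on the other node to get master/master
            ReplicationCommand push = new ReplicationCommand.Builder()
                    .source("appdb")
                    .target("http://backup-host:5984/appdb")
                    .continuous(true)
                    .build();
            couch.replicate(push);
        }
    }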
I think there are plenty of other posts related to this topic. However, I'll chime in since I've moved off MySQL and onto MongoDB. It's fast, very fast, but that doesn't mean it's perfect. My advice: use what you're comfortable with. If it takes you longer to refactor code in order to make it fit Mongo or Couch, then stick to MySQL if that's what you're familiar with. If this is something you want to pick up as a skill set, then by all means learn MongoDB or CouchDB.
For me, I went with MongoDB for a couple of reasons: file storage via GridFS and geolocation. Yeah, I could've used MySQL, but I wanted to see what all the fuss was about. I must say I'm impressed, and I still have a ways to go before I can say I'm comfortable with Mongo.
With what you've listed, I can tell you that mongo will fit most of your needs.
I don't see anything here like "must handle millions of req/s" that would indicate rolling your own would be better than using something off the shelf like Drupal.