Automatic conversion of SQL query to ElasticSearch Query - mysql

I have a service which currently stores data in Oracle DB.
I am working on a project where I need to run a set of SQL queries to get some aggregated data. I want to store these queries in one place, iterate over them, and get the required data.
Say I have 10 queries today, but I can keep adding more without touching the code.
Tomorrow, however, we may want to switch to Elasticsearch. Is there a way I can use the same SQL queries to search Elasticsearch as well?

You might want to look at this Elasticsearch plugin, which aims at providing an SQL layer on top of Elasticsearch:
https://github.com/NLPchina/elasticsearch-sql
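For a rough idea of what such a layer lets you write (the plugin treats an index more or less like a table; the index and field names below are made up for illustration, and the constructs that are supported vary by plugin version):

    SELECT city, COUNT(*)
    FROM order_index          -- hypothetical index name
    WHERE status = 'shipped'
    GROUP BY city
    LIMIT 10;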

With Elasticsearch 6.3 released in June 2018, you might not need an "automatic conversion" anymore.
The 6.3 release comes with native SQL support! (still experimental for now)
Have you (or someone you know) ever:
Said “I know how to do this thing in a SQL statement -- how do I do the same thing in Elasticsearch?”
Tried to build out full-text search with tokenization, stemming, synonyms, relevance sorting on top of a SQL engine like a relational database?
Tried to scale out a traditional database to billions of rows?
Tried to connect a 3rd party tool like a BI system to Elasticsearch?
These are all things we hope to make inroads into with our new Elasticsearch SQL release.
Our hope is to allow developers, data scientists, and others that are familiar with the SQL language -- but so far unfamiliar with or unable to use the Elasticsearch query language -- to use the speed, scalability, and full-text power that Elasticsearch offers and others have grown to know and love.
If you’re just getting started using this functionality or the power of Elasticsearch that powers it, here are a few things to try:
SELECT … ORDER BY SCORE() DESC to be able to sort by the relevance of the search results
Get all of the full-text magic from tokenization to stemming by using the MATCH operator like SELECT … WHERE MATCH(fieldname, 'some text')
Connect your favorite JDBC-compatible tool to Elasticsearch with our JDBC driver
Learn how to use the full power of the Elasticsearch DSL by translating a SQL query you know via the translate API
Note that this feature is made available in the “default” (non-OSS-only) distribution of Elasticsearch, and that the REST API -- including the “translate” functionality -- and the CLI tool are completely free.
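For instance, assuming a hypothetical index named articles with a text field body, the first two suggestions above combine into something like this (runnable through the SQL REST endpoint, the CLI, or the JDBC driver):

    -- index and field names are made up for illustration
    SELECT title, SCORE()
    FROM articles
    WHERE MATCH(body, 'distributed search')
    ORDER BY SCORE() DESC
    LIMIT 10;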

You could probably write some kind of parser, but I don't think that is really a good idea, even if the parser is well written. You have to remember that Elasticsearch uses inverted indexes, since it's based on Lucene. Querying it the way you would query a relational database rather defeats that logic, so it isn't even clear that Elasticsearch would be of any use here; you might be better off sticking to plain SQL queries.
Also, given that you currently have only 10 queries and you already plan on switching to ES, I'd strongly suggest adapting those 10 requests into proper ES queries, switching to ES, and from then on creating new requests directly in ES terms.

Related

Solr-ish Query API on top of relational database

I have a data source sitting in a relational database. I managed to index/store everything into Solr and was thrilled to see the search performance and the awesome API (search/admin, etc.).
However, people say that if your data is truly structured, a relational database should be fast if you index everything. But even if I dump all the data into a relational database like MySQL, what I am missing is all the beautiful query API.
I guess my question is:
Is it possible to use only a Solr-ish query API and rely entirely on a relational database as the backend, instead of using the index at all?
If that is not possible, is there any mature project/product that can build a full-stack query API on a relational database?
Document search engines and relational databases serve different usage patterns. If you're using Solr for anything that involves tokenization and analysis chains, replicating that in an RDBMS means implementing that functionality yourself (or settling for a subset, such as the full-text indexes in certain RDBMSes). I detailed some of these differences and features in Should I just query the database or use a proper search engine solution?.
It's usually better to use the RDBMS as the main storage for your data and then push it into the search index as required. This also lets you get new features from those who care about search and the problem it tries to solve, without having to wait for a niche product to implement them on top of your RDBMS (there are still quite a few new features in each iteration of Lucene, Elasticsearch and Solr).

Only Solr, or with MySQL?

I want to use Solr for my search index. What confuses me is: should I put most of the data fields in Solr, or only search for the ID and then get the data from MySQL? Please help. Which is faster/better?
I had the same question in 2010 and decided to use Solr as a search index only, to get a list of IDs in the first step and read the data from MySQL for those IDs in the second step.
That works fine in an environment with 20 million docs.
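To make the two-step flow concrete, step 2 on the MySQL side looks roughly like this (table and column names are placeholders, and the IDs are whatever the Solr query returned in step 1):

    SELECT id, title, body
    FROM documents
    WHERE id IN (101, 205, 317)            -- IDs from the Solr result
    ORDER BY FIELD(id, 101, 205, 317);     -- preserve Solr's relevance order (MySQL-specific)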
During a reconstruction of the whole application in 2014, we decided to additionally store the data in Solr (not only index it) in order to fetch the whole docs during a search, so that the MySQL connection is no longer necessary.
We are talking about a web application with at most 1-3 thousand parallel users, and there is absolutely no perceived difference in application speed between the 2010 and 2014 versions.
But there are some benefits if you take the documents from Solr rather than MySQL.
The application code is a bit cleaner.
You only need one connection to get the data.
But the main reason why we began to store the documents in Solr is that we needed the highlighting feature. That only works well if you store the docs in Solr and fetch them from Solr too.
By the way, there is no change in search performance whether you store the docs or not.
The disadvantage is that you have to hold the data twice:
1.) in MySQL as the base dataset, and
2.) in Solr for your application.
And if you have very big documents, Solr is probably not the right tool to serve that kind of document.
Putting all the data into Solr will absolutely be faster: you save yourself from having to make two queries, and you also remove the need for a slow piece of code (PHP or whatever) to bridge the gap between the two, where you pull the ID out of Solr and then query MySQL. Alternatively, you could put everything into MySQL, which would be of comparable speed. In other words, choose the technology that suits your needs best, but don't mix them unnecessarily for performance reasons. A good comparison of when you might want to use Solr vs. MySQL can be found here.

Performing a join across multiple heterogeneous databases, e.g. PostgreSQL and MySQL

There's a project I'm working on, a kind of distributed database thing.
I started by creating the conceptual schema, and I've partitioned the tables such that I may need to perform joins between tables in MySQL and PostgreSQL.
I know I can write some sort of middleware that will break down the SQL queries and issue sub-queries targeting individual DBs, and then merge the results, but I'd like to do this using SQL if possible.
My search so far has yielded this (Federated storage engine for MySQL), but it only seems to work between MySQL databases.
If it's possible, I'd appreciate some pointer's on what to look at, preferably in Python.
Thanks.
It might take some time to set up, but PrestoDB is a valid open-source solution to consider.
See https://prestodb.io/
You connect to Presto with JDBC and send it the SQL; it interprets the different connections, dispatches to the different sources, and then does the final work on the Presto node before returning the result.
From the postgres side, you can try using a foreign data wrapper such as mysql_ftw (example). Queries with joins can then be run through various Postgres clients, such as psql, pgAdmin, psycopg2 (for Python), etc.
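To sketch the general shape of an FDW setup on the Postgres side (the exact extension name and options depend on the wrapper you pick, e.g. the linked project or mysql_fdw; all object names below are placeholders):

    CREATE EXTENSION mysql_fdw;  -- or whichever wrapper you install
    CREATE SERVER mysql_srv FOREIGN DATA WRAPPER mysql_fdw
        OPTIONS (host '127.0.0.1', port '3306');
    CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_srv
        OPTIONS (username 'app', password 'secret');
    CREATE FOREIGN TABLE mysql_orders (id int, customer_id int, total numeric)
        SERVER mysql_srv OPTIONS (dbname 'shop', table_name 'orders');

    -- a cross-database join: local Postgres table against the MySQL-backed foreign table
    SELECT c.name, o.total
    FROM customers c
    JOIN mysql_orders o ON o.customer_id = c.id;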
This is not possible with SQL.
One option is to write your own "middleware", as you hinted at. To do that in Python, you would use the standard DB-API drivers for both databases, write the individual queries, and then merge their results. An ORM like SQLAlchemy will go a long way to help with that.
The other option is to use an integration layer. There are many options out there; however, none that I know of are written in Python. Mule ESB, Apache ServiceMix, WSO2 and JBoss MetaMatrix are some of the more popular ones.
You can colocate the data on a single RDBMS node (either PostgreSQL or MySQL for example).
Two main approaches:
Readonly - You might want to use read replicas of both source systems, then use a process to copy the data to a new writeable converged node; OR
Primary - You might choose one of the two databases as the primary and move the data from the other into it using a conversion process (e.g. ETL or off-the-shelf table-level replication).
Then you can just run the query on the one RDBMS with JOINs as usual.
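Once the data is colocated, the formerly cross-database join is just an ordinary query on that one node (table names are placeholders):

    SELECT u.name, o.total
    FROM users u                          -- copied from PostgreSQL
    JOIN orders o ON o.user_id = u.id     -- copied from MySQL
    WHERE o.created_at >= '2015-01-01';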
BONUS: You can also do log reading from an RDBMS that can ship logs through Kafka; you can make it as complex as required.

Good search solution for Zend Framework + Doctrine + MySQL?

I've looked into Doctrine's built-in search, MySQL MyISAM full-text search, Zend_Lucene, and Sphinx - but all the nuances and implementation details are making it hard for me to sort out, given that I don't have experience with anything other than the MyISAM search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables and multiple fields, with some basic search criteria (e.g. user.state = 'CA' AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in PHP.

Geospatial and full text search for Rails app hosted on Heroku

I'm planning out a Rails app that will be hosted on Heroku and will need both geospatial and full text search capabilities.
I know that Heroku offers add-ons like WebSolr and IndexTank that sound like they can do the job, but I was wondering if this could be done in MySQL and/or PostgreSQL without having to pay for any add-ons?
Depending on the scale of your application, you should be able to accomplish both FULLTEXT and SPATIAL indexes in MySQL with ease. Once your application gets massive, i.e. hundreds of millions of rows with high concurrency and many thousands of requests per second, you might need to move to another solution for either FULLTEXT or SPATIAL queries. But I wouldn't recommend optimizing for that early on, since it can be very hard to do properly. For the foreseeable future MySQL should suffice.
You can read about spatial indexes in MySQL here. You can read about fulltext indexes in MySQL here. Finally, I would recommend taking the steps outlined here to make your schema.rb file and rake tasks work with these two index types.
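As a rough sketch of what the two index types look like in MySQL (table and column names are hypothetical; on the MySQL versions of that era both index types required MyISAM, while newer InnoDB versions support them natively):

    CREATE TABLE places (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL,
        description TEXT NOT NULL,
        location POINT NOT NULL,
        FULLTEXT INDEX ft_description (description),
        SPATIAL INDEX sp_location (location)
    ) ENGINE=MyISAM;

    -- full-text query
    SELECT id, name
    FROM places
    WHERE MATCH(description) AGAINST('coffee roaster');

    -- spatial query: rows whose point falls inside a bounding box
    -- (ST_GeomFromText on newer MySQL versions)
    SELECT id, name
    FROM places
    WHERE MBRContains(
        GeomFromText('POLYGON((-122.6 37.6, -122.6 37.9, -122.2 37.9, -122.2 37.6, -122.6 37.6))'),
        location);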
I have only used MySQL for both, but my understanding is that PostgreSQL has a good geo-spatial index solution as well.
If you have a database at Heroku, you can use Postgres's support for Full Text Search: http://www.postgresql.org/docs/8.3/static/textsearch.html. The oldest servers Heroku runs (for shared databases) are on 8.3 and 8.4. The newest are on 9.0.
A blog post noticing this little fact can be seen here: https://tenderlovemaking.com/2009/10/17/full-text-search-on-heroku.html
Apparently, that "texticle" (heh. cute.) addon works...pretty well. It will even create the right indexes for you, as I understand it.
Here's the underlying story: postgres full-text-search is pretty fast and fuss-free (although Rails-integration may not be great), although it does not offer the bells and whistles of Solr or IndexTank. Make sure you read about how to properly set up GIN and/or GiST indexes, and use the tsvector/tsquery types.
The short version:
Create an (in this case, expression-based) index: CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));. In this case "body" is the field being indexed.
Use the @@ operator: SELECT * FROM ... WHERE to_tsvector('english', pgweb.body) @@ to_tsquery('hello & world') LIMIT 30
The hard part may be mapping things back into application land, the blog post previously cited is trying to do that.
The dedicated databases can also be requisitioned with PostGIS, which is a very powerful and fully featured system for indexing and querying geographical data. OpenStreetMap uses the PostgreSQL geometry types (built-in) extensively, and many people combine that with PostGIS to great effect.
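A minimal PostGIS sketch, assuming hypothetical table and column names:

    CREATE EXTENSION postgis;

    CREATE TABLE venues (
        id   serial PRIMARY KEY,
        name text,
        geom geography(Point, 4326)
    );
    CREATE INDEX venues_geom_idx ON venues USING gist (geom);

    -- venues within 5 km of a given longitude/latitude
    SELECT name
    FROM venues
    WHERE ST_DWithin(geom, ST_MakePoint(-122.42, 37.77)::geography, 5000);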
Both of these (full-text search, PostGIS) take advantage of the extensible data type and indexing infrastructure in Postgres, so you should expect them to perform well for many, many records (spend a little time carefully reviewing the situation if things look busted). You might also take advantage of the fact that you can use these features in combination with transactions and structured data. For example:
CREATE TABLE products (pk bigserial, price numeric, quantity integer, description text); can just as easily be used with full-text search... any text field will do, and it can be combined with regular attributes (price, quantity in this case).
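A query mixing the two might look like this (search terms and thresholds are, of course, made up):

    SELECT pk, price
    FROM products
    WHERE to_tsvector('english', description) @@ to_tsquery('english', 'wireless & headphones')
      AND price < 100
      AND quantity > 0
    ORDER BY ts_rank(to_tsvector('english', description),
                     to_tsquery('english', 'wireless & headphones')) DESC
    LIMIT 20;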
I'd use Thinking Sphinx, a full-text search engine that is also deployable on Heroku.
It has geo search built-in: http://freelancing-god.github.com/ts/en/geosearching.html
EDIT:
Sphinx is almost ready for Heroku; see here: http://flying-sphinx.com/
IndexTank is now free up to 100k documents on Heroku; we just haven't updated the documentation. This may not be enough for your needs, but I thought I'd let you know just in case.
For full-text search via Postgres I recommend pg_search; I am using it myself on Heroku at the moment. I have not used texticle, but from what I can see pg_search has had more development activity lately, and it has been built upon texticle (it will not add indexes for you; you have to do that yourself).
I cannot find the thread now, but I saw that Heroku offered an option for Postgres geo search, though it was in beta.
My advice, if you are not able to find a Postgres solution, is to host your own instance of Solr (on an EC2 instance) and use the sunspot solr gem to integrate it with Rails.
I have implemented my own solution and have used WebSolr as well. Basically, what they give you is their own Solr instance, hassle free. Is it worth the money? In my opinion, no. For integration they use the sunspot solr client as well, so the question is just whether you are going to pay somebody $20/$40/... to host Solr for you. I know you also get backups, maintenance, etc., but call me cheap: I prefer my own instance. Also, WebSolr is locked to the 1.4.x version of Solr.