Real-time migration of data from MySQL to Elasticsearch?

I have tons of data in MySQL, spread across different databases and their respective tables, all related to each other. But when I have to analyze the data, I have to write different scripts that combine and merge it before showing me a result, and this takes a lot of time and effort. I love Elasticsearch for its speed and for visualizing data via Kibana, so I have decided to move my entire MySQL data set to Elasticsearch in real time, while keeping the data in MySQL too. But I want a scalable strategy and process for migrating that data to Elasticsearch.
Please suggest the best tools or methods to do the job.
Thank you.

Prior to Elasticsearch 2.x you could write your own _river plugin and install it into Elasticsearch. You could control how often the data you've munged with your scripts is pulled in by the _river (note: this approach is not really recommended, and rivers have since been removed).
You may also use your favourite queuing/message-broker tool, such as ActiveMQ, to push your data into Elasticsearch.
If you want full control to meet your need for real-time migration, you can write a simple app that uses the Elasticsearch REST endpoint and writes to it directly; you can even POST to the bulk API (see the sketch after this list of options).
Make use of the Elasticsearch tools such as Beats and Logstash, which are great at shipping almost any type of data into Elasticsearch.
For other alternatives, such as munging your data into a flat file, or if you want to maintain relationships, see this post here.
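As a rough sketch of the REST/bulk option (the index name orders, the host localhost:9200 and the row layout are made up for illustration; older Elasticsearch versions also need a _type in the action line), a Python script that pushes a batch of MySQL rows through the bulk API could look like this:

import json
import requests

# Hypothetical rows already pulled and merged from MySQL by your own scripts.
rows = [
    {"id": 1, "name": "alice", "total": 42.0},
    {"id": 2, "name": "bob", "total": 13.5},
]

# The bulk API expects newline-delimited JSON: an action line,
# then the document source, one pair per document.
lines = []
for row in rows:
    lines.append(json.dumps({"index": {"_index": "orders", "_id": row["id"]}}))
    lines.append(json.dumps(row))
body = "\n".join(lines) + "\n"

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()
print(resp.json()["errors"])  # False when every document was indexed

Run on a schedule, or triggered by your own scripts, this gives you a simple, controllable pipeline with no extra tooling.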

Related

How can we use streaming in Spark from multiple sources? e.g. first take data from HDFS and then consume a stream from Kafka

The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS.
New data also arrives in the database every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources and do analytics in real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS copy becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL directly over JDBC rather than needing HDFS exports.
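To illustrate the JDBC tip (the connection details and table name are placeholders, and the MySQL Connector/J JAR must be on the Spark classpath), a PySpark read could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-reporting").getOrCreate()

# Read the transactional table straight from MySQL over JDBC,
# so there is no separate HDFS export step to keep fresh.
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")
    .option("dbtable", "transactions")
    .option("user", "report_user")
    .option("password", "secret")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

transactions.createOrReplaceTempView("transactions")
spark.sql("SELECT COUNT(*) FROM transactions").show()

The catch is the one described above: every run re-reads from MySQL, so the freshness/load trade-off is yours to tune.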
Without knowing more about your systems, I'd say keep the MySQL database running, as something else is probably actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (possibly more than a few minutes).
However, if you use Debezium to get data from MySQL into Kafka, you then have the data centralized in one location, and you can ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.

How to store JSON in a DB without a schema

I have a requirement to design an app that stores JSON via a REST API. I don't want to put a limit on the JSON size (number of keys, etc.). I see that MySQL supports storing JSON, but we have to create a table/schema first and then store the records.
Is there any way to store JSON in any type of DB and query the data by keys?
EDIT: I don't want to use an in-memory DB like Redis.
Use Elasticsearch. In addition to schemaless JSON, it supports fast search.
The tagline of Elasticsearch is "You know, for search".
It is built on top of the text-indexing library Apache Lucene.
The advantages of using Elasticsearch are:
Scales to clusters holding petabytes of data.
Fully open source. No cost to pay.
Enterprise support available with a Platinum license.
Comes with additional benefits such as analytics using Kibana.
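A minimal sketch of that schemaless flow against the Elasticsearch REST API (the index name docs and the sample document are invented; on versions before 7.x the URL also needs a mapping type):

import requests

doc = {"user": "alice", "order": {"total": 42.0, "items": 3}}  # arbitrary JSON

# Index the document; Elasticsearch infers a mapping, no schema is defined up front.
requests.post("http://localhost:9200/docs/_doc", json=doc).raise_for_status()

# Query by key: find documents where user == "alice".
# (Newly indexed documents become searchable after the next refresh, ~1s by default.)
resp = requests.post(
    "http://localhost:9200/docs/_search",
    json={"query": {"match": {"user": "alice"}}},
)
print(resp.json()["hits"]["hits"])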
I believe NoSQL is the best solution here, e.g. MongoDB. I have tested MongoDB; it looks good and has a Python module that makes it easy to interact with. For a quick overview of the pros, see https://www.studytonight.com/mongodb/advantages-of-mongodb
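A rough sketch with the pymongo driver (the database and collection names are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["appdb"]["json_docs"]

# Store an arbitrary JSON document; no schema needs to be declared.
collection.insert_one({"user": "alice", "order": {"total": 42.0, "items": 3}})

# Query by key, including nested keys via dot notation.
for doc in collection.find({"order.items": {"$gte": 2}}):
    print(doc)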
I've had great results with Elasticsearch, so I second this approach as well. One question to ask yourself is how you plan to access the JSON data once it is in a repository like Elasticsearch. Will you simply store the JSON doc or will you attempt to flatten out the properties so that they can be individually aggregated? But yes, it is indeed fully scalable by increasing your compute capacity via instance size, expanding your disk space or by implementing index sharding if you have billions of records in a single index.

Implementing a search with Elasticsearch using MySQL data

I am new to Elasticsearch; I have been using MySQL's full-text features until now.
I want to keep MySQL as my primary database and use Elasticsearch alongside it as the search engine for my website. I ran into several problems when thinking about this; the main one is syncing between the MySQL database and Elasticsearch.
Some say to use Logstash. But even if I use it, would I still need to write separate functions in my program for database transactions and Elasticsearch indexing?
You will need to run a periodic job that does a full reindex and/or send individual document updates to Elasticsearch for indexing. Logstash sounds ill-suited for the purpose; you just need the usual ES API to index things (see the sketch below).
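As a rough illustration of the second option, sending an individual update right after the database write (the table, index and connection details are placeholders; mysql-connector-python and requests are assumed to be installed):

import mysql.connector
import requests

db = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="shop"
)
cur = db.cursor()

product = {"id": 17, "title": "Blue mug", "description": "Ceramic, 300 ml"}

# 1) Write to MySQL, the primary store.
cur.execute(
    "UPDATE products SET title=%s, description=%s WHERE id=%s",
    (product["title"], product["description"], product["id"]),
)
db.commit()

# 2) Push the same document to Elasticsearch so the search index stays in sync.
requests.put(
    f"http://localhost:9200/products/_doc/{product['id']}",
    json=product,
).raise_for_status()

A periodic full reindex on top of this catches anything the per-transaction updates miss.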

What would be a preferable approach for rendering time-series data

We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph based on it (say, for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as and when new data is received. The data could be retrieved every minute. Should I use node.js for this?
Ours is a Rails application, and I would like to know what libraries/database I should use to model this capability.
It depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is better. If most of the queries are key->value gets and sets, meaning you are going to fetch the data using one key and you need to support many clients and many sets/gets per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a great review of some of the most popular ones.
There are many ways to do this. If getting an update once a minute is enough, have the client make AJAX calls every minute to fetch the updated data; you can then build your server side using PHP, .NET, Java servlets or node.js, again depending on the expected user concurrency. PHP is very easy to develop on, while node.js can handle many short I/O requests. Another option you might want to consider is server push (Node's socket.io, for example) instead of client AJAX calls. That way the client is notified immediately when there is an update.
Personally, I like both node.js and Redis and have used them in a couple of production applications supporting many concurrent users on a single server. I like Node since it's easy to develop with and supports many users, and Redis for its amazing speed and handling of concurrent requests. Having said that, I also use MySQL for saving relational data, and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are so many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are after low latency and something easy to implement. You can use Pub/Sub to push updates to clients (e.g. via Node's socket.io) in real time, and run a second Redis instance to store the JSON data as a sorted set with the timestamp as the score. I've used the same approach with much success for storing time-based statistical data. The downside of this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
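A minimal sketch of the sorted-set idea with redis-py (the key names and payload shape are made up for illustration):

import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_price(symbol, price, ts=None):
    ts = time.time() if ts is None else ts
    payload = json.dumps({"t": ts, "price": price})
    # One sorted set per symbol, scored by timestamp, so range reads stay cheap.
    r.zadd("prices:%s" % symbol, {payload: ts})
    # Publish the tick so subscribed clients (e.g. a socket.io bridge) get it pushed.
    r.publish("price-updates", payload)

record_price("MSFT", 20)
record_price("MSFT", 10)

# Fetch the last hour of MSFT ticks to plot a graph.
hour_ago = time.time() - 3600
for raw in r.zrangebyscore("prices:MSFT", hour_ago, "+inf"):
    print(json.loads(raw))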
If you are looking to store a lot of data in JSON format and want to pull the data every minute, then using Elasticsearch to store/retrieve the data is another possibility. You can use Elasticsearch's range query to search on a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API. 
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
MongoDB might also be a good fit, since it stores JSON as BSON.
But if it's just serving data, even something as simple as memcached would suffice.
If you have a lot of data to keep updated in real time, like stock ticker prices, the solution should involve the server publishing to the client, not the client continually polling the server for updates. A publish/subscribe (pub/sub) model over WebSockets might be a good choice at the moment, depending on your client requirements.
For plotting data coming in over WebSockets, there is already a question about that here.
The Ruby Toolbox has a category called HTTP Pub Sub which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with it aside from just streaming stock prices; Redis may be the better choice for performance. Note also that websocket-rails assumes Redis, if you were to use that, just as an example.
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from Square.

Fluentd+Mongo vs. Logstash

Our team currently uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and it has been running for a week. Recently we were discussing another solution: Logstash. I want to ask what the differences between them are. In my view, I'd like to use Zabbix as the data-gathering and alert-sending platform, with Fluentd playing the 'data-gathering' role in the whole infrastructure. However, looking at the Logstash website, I found that Logstash is not only a log-gathering system but a whole solution for gathering, presentation, and search.
Could anybody give some advice or share some experience?
Logstash is pretty versatile (disclaimer: I have only been playing with it for a few weeks).
We'd been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but the message-processing functionality in it is based on the Drools engine and is... arcane at best.
I found it was much easier to have Logstash read syslog files from our central server, massage the events, and output them to Graylog2. That gave us much more flexibility and should allow us to add application-level events alongside the OS-level syslog data.
It has a Zabbix output, so you might find it's worth a look.
Logstash is a great fit with Zabbix.
I forked a repo on GitHub to take the Logstash statsd output and send it to Zabbix for trending/alerting. As mentioned by another answer, Logstash also has a Zabbix output plugin, which is great for notifying/sending on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present, Graylog2 uses Elasticsearch as well, but it uses only a single index for all data. If you're periodically cleaning up old data, that means the equivalent of a lot of SQL "delete from table where date < YYYY.MM.DD"-type calls to clear out old data, whereas Logstash defaults to daily indexes (the equivalent of "drop table YYYY.MM.DD"), so clean-up is nicer.
It also results in cleaner searches and requires less heap space, as you can search over a known date range because each index is named for the day's data it contains.
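For illustration, dropping a whole day's index over the Elasticsearch REST API is a single call (the index name follows Logstash's default logstash-YYYY.MM.DD pattern; the host and date here are made up):

import requests

# Delete one day's worth of logs by dropping its daily index,
# instead of running a slow delete-by-query over one big index.
resp = requests.delete("http://localhost:9200/logstash-2014.01.15")
resp.raise_for_status()
print(resp.json())  # {"acknowledged": True} on success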