Is there any difference or recommendation between using the first one or the second one?
Both bring MySQL data to Elasticsearch.
Thanks in advance:
https://github.com/adibendahan/mysqlbeat
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Push vs Pull: Logstash's jdbc input pulls data from remote SQL servers on a cron-like schedule, while the beat will push the results of the query to Elasticsearch.
Location: Another thing to consider is that this beat will be running on every SQL server, vs. having one Logstash with multiple jdbc input plugin blocks. There are pros and cons to this depending on your scale. If you have thousands of databases, it isn't scalable to have Logstash querying each one, particularly if the list of databases and queries is constantly changing. It would be much easier to manage one beat per SQL server. If you have a simple setup with just a few databases, it would probably be faster to use the Logstash input plugin, because then you wouldn't have another service to maintain (the beat).
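For context, both approaches boil down to the same loop: run a SQL query on a schedule and ship the resulting rows to Elasticsearch. Below is a minimal Python sketch of that pull-and-index pattern, purely as an illustration of the idea; the hosts, credentials, query, and index name are placeholders, and neither tool is actually implemented this way.

```python
# Illustrative poll-and-push loop: query MySQL on a schedule, index rows into
# Elasticsearch over its REST API. All hosts, credentials, the query, and the
# index name are placeholders.
import json
import time

import pymysql.cursors
import requests

MYSQL = dict(host="mysql.example.com", user="reader", password="secret",
             database="appdb", cursorclass=pymysql.cursors.DictCursor)
ES_DOC_URL = "http://elasticsearch.example.com:9200/orders/_doc"  # ES 7+ style endpoint

def poll_once():
    conn = pymysql.connect(**MYSQL)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, customer, total, updated_at FROM orders")
            for row in cur.fetchall():
                doc_id = row.pop("id")
                # json.dumps(default=str) copes with DATETIME/DECIMAL columns;
                # using the primary key as the ES id makes re-runs idempotent.
                resp = requests.put(
                    f"{ES_DOC_URL}/{doc_id}",
                    data=json.dumps(row, default=str),
                    headers={"Content-Type": "application/json"},
                    timeout=10,
                )
                resp.raise_for_status()
    finally:
        conn.close()

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # crude stand-in for the cron-like schedule both tools provide
```

The difference the answer describes is mainly where this loop runs: next to each SQL server (the beat) or centrally in one Logstash instance (the jdbc input).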
I am new to Elasticsearch. I have been using MySQL full-text search features till now.
I want to keep MySQL as my primary database and use Elasticsearch alongside it as the search engine for my website. I ran into several problems when thinking about it. The main problem is syncing between the MySQL database and Elasticsearch.
Some say to use Logstash. But even if I use it, would I need to write separate functions in my program for database transactions and for Elasticsearch indexing?
You will need to run a periodic job doing a full reindex and/or send individual document updates to ES for indexing. Logstash sounds ill-suited for this purpose; you just need the usual ES API to index things.
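To make the "individual document updates" part concrete, here is a hedged sketch of what that usually looks like in application code: right after the MySQL transaction commits, push the same record to Elasticsearch through its plain REST API. The host, index name, and fields below are made up for illustration.

```python
# Hypothetical dual-write helper: call it right after the MySQL commit so the
# search index stays in step with the primary database. Host, index name, and
# fields are placeholders.
import json

import requests

ES = "http://localhost:9200"

def index_article(article_id: int, title: str, body: str) -> None:
    """Index (or overwrite) a single document in the 'articles' index."""
    doc = {"title": title, "body": body}
    resp = requests.put(
        f"{ES}/articles/_doc/{article_id}",
        data=json.dumps(doc),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    resp.raise_for_status()

# Usage, somewhere in your save path:
#   cursor.execute("INSERT INTO articles ...")
#   connection.commit()
#   index_article(42, "Hello", "Full-text body goes here")
```

A periodic full reindex on top of this catches anything that slips past the dual write.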
I have tons of data in MySQL, in the form of different databases and their respective tables, all related to each other. But when I have to analyze the data, I have to create different scripts that combine and merge it and show me a result, and this takes a lot of time and effort. I love Elasticsearch for its speed and for the visualization of data via Kibana, so I have decided to move my entire MySQL dataset to Elasticsearch in real time, while keeping the data in MySQL too. But I want a scalable strategy and process for migrating that data to Elasticsearch.
Please suggest the best tools or methods to do the job.
Thank you.
Prior to Elasticsearch 2.x you could write your own Elasticsearch _river plugin and install it into Elasticsearch. You could control how often the data you've munged with your scripts gets pulled in by the _river (note: this is not really recommended).
You may also use your favourite queuing/message-broker tool, such as ActiveMQ, to push your data into Elasticsearch.
If you want full control to meet your need for real-time migration of data, you may also write a simple app that uses the Elasticsearch REST endpoint, simply writing to it via REST. You can even do bulk POSTs (see the sketch after these options).
Make use of the Elasticsearch tools such as Beats or Logstash, which are great at shipping almost any type of data into Elasticsearch.
For other alternatives, such as munging your data into a flat file, or if you want to maintain relationships, see this post here.
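For the REST option above, the bulk endpoint mentioned there looks roughly like the following hedged sketch; the rows, index name, and host are made up.

```python
# Hedged sketch of a bulk POST to the _bulk endpoint. Rows, index name, and the
# host are placeholders; the API expects newline-delimited JSON (an action line,
# then a source line) with a trailing newline.
import json

import requests

ES = "http://localhost:9200"
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

lines = []
for row in rows:
    lines.append(json.dumps({"index": {"_index": "people", "_id": row["id"]}}))
    lines.append(json.dumps({k: v for k, v in row.items() if k != "id"}))
payload = "\n".join(lines) + "\n"  # the trailing newline is required

resp = requests.post(
    f"{ES}/_bulk",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
    timeout=30,
)
resp.raise_for_status()
# _bulk returns 200 even when individual items fail, so check the per-item flag.
assert not resp.json().get("errors"), resp.text
```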
I have a large MySQL database with a heavy load and would like to replicate the data in this database to HBase in order to do analytical work on it.
Edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.).
I've read that this can be done using Flume, with MySQL as a source (possibly the MySQL binlogs) and HBase as a sink, but I haven't found any detail (high- or low-level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how, or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
You are better off using Sqoop for this purpose, IMHO. It was developed for exactly this. Flume was made for rather different purposes, like aggregating log data, data generated from sensors, etc.
See this for more details.
So far there are three options worth considering:
Sqoop: After the initial bulk import, it supports two types of incremental import: append and lastmodified. That being said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every 1 or 2 minutes.
Trigger: This is a quick and dirty solution: add triggers to the source RDBMS and update your HBase accordingly. This one gives you real-time satisfaction, but you have to mess with the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: This one requires the most development effort. It doesn't need to touch the DB, and it doesn't add read traffic to the DB either (it tails the transaction logs; see the log-tailing sketch after this answer).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, but you can also do something with the data while it is streaming through your Flume pipe (e.g. transformation, notification, alerting, etc.).
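To make the log-tailing idea tangible without committing to Flume, here is a hedged Python sketch that streams row-level changes from MySQL's binlog using the python-mysql-replication library. This is a stand-in for the Flume approach, not Flume itself; the hosts, credentials, and server_id are placeholders, and the MySQL server must run with binlog_format=ROW.

```python
# Hedged CDC sketch: stream row-level changes from the MySQL binlog, in the same
# spirit as the "tail the transaction logs" approach described above.
# Requires the python-mysql-replication package and binlog_format=ROW on the server.
# Hosts, credentials, and server_id are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,                 # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,                 # keep waiting for new events
)

try:
    for event in stream:
        for row in event.rows:
            if isinstance(event, WriteRowsEvent):
                action, values = "insert", row["values"]
            elif isinstance(event, UpdateRowsEvent):
                action, values = "update", row["after_values"]
            else:
                action, values = "delete", row["values"]
            # Here you would write the change into HBase with your client of
            # choice instead of printing; row key and column mapping depend on
            # your HBase schema.
            print(action, event.schema, event.table, values)
finally:
    stream.close()
```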
We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as options. However, being very new to this field, I am not sure how well they would scale and how much load they would put on the Cassandra cluster itself while running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
As far as the load on the Cassandra cluster, I am not certain, since I have not personally used the Cassandra connector with Sqoop, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, do it once a day at a time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case Sqoop works too because it can export to Hive directly. And once the data is in Hive, you can use the same cluster Sqoop uses to run your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or CQL queries -- all of these methods time out, irrespective of changing the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.
Disclaimer: I am a newbie w.r.t. Hadoop and Hive.
We have set up a MySQL Cluster (version 7.2.5) which stores huge amounts of data. The rows run into the millions and are partitioned based on MySQL's auto-sharding logic. Even though we are leveraging the Adaptive Query Localization (AQL) of Cluster 7.2, some of our queries have multiple joins and run for quite a few minutes, sometimes hours.
In this scenario, can I use Hive along with Hadoop to query the DB and retrieve the data? Will it make the querying faster? Does it duplicate the data in its file system? What are the pros and cons of this type of approach?
My intent is to use Hive as a layer on top of MySQL Cluster and use it for read/write from and to MySQL Cluster DB. I do not have any transactions in my application. So is this really possible?
I think it is possible. The closest solution in this direction known to me is http://www.hadapt.com/ by Daniel Abadi.
The idea of that solution is to have a local RDBMS on each node and run the usual Hadoop MR, with Hive on top of it, on those nodes.
In principle, if you do a smart Hive integration and push predicates down to the MySQL instances, it can give you some performance gains.
At the same time, you would have to do some serious hacking to make Hadoop aware of your sharding placement in order to preserve data locality.
Summarizing all of the above: it should be possible, but it will require serious development.
At the same time, I am not aware of any out-of-the-box solution to run Hive over MySQL Cluster as-is.