Fluentd+Mongo vs. Logstash - zabbix

Our team currently uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and that setup has been running for about a week. Recently we discussed another solution, Logstash. What are the differences between them? In my view, Zabbix should serve as the data-gathering and alert-sending platform, with Fluentd playing only the data-gathering role in the overall infrastructure. Looking at the Logstash website, though, it seems Logstash is not just a log-gathering system but a whole solution for gathering, presentation, and search.
Could anybody give some advice or share some experience?

Logstash is pretty versatile (disclaimer: I have only been playing with it for a few weeks).
We had been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but its message-processing functionality is based on the Drools engine and is arcane at best.
I found it much easier to have Logstash read syslog files from our central server, massage the events, and output to Graylog2. That gave us much more flexibility and should allow us to add application-level events alongside the OS-level syslog data.
It has a zabbix output, so you might find it's worth a look.

Logstash is a great fit with Zabbix.
I forked a repo on GitHub to take the Logstash statsd output and send it to Zabbix for trending/alerting. As another answer mentioned, Logstash also has a Zabbix output plugin, which is great for sending notifications on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present, Graylog2 also uses Elasticsearch, but it puts all data into a single index. If you periodically clean up old data, that means the equivalent of a lot of SQL "delete from table where date < 'YYYY.MM.DD'" calls, whereas Logstash defaults to daily indexes (clean-up is the equivalent of "drop table YYYY.MM.DD"), which is much nicer.
It also results in cleaner searches that need less heap space, because you can restrict a search to a known date: each index is named for the day of data it contains.
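To make the clean-up difference concrete, here is a minimal sketch of dropping expired daily Logstash indexes over the Elasticsearch HTTP API with Python's requests library; the host, index prefix, and retention window are assumptions, not details from the original setup.

# Sketch: drop Logstash daily indexes older than a retention window.
# The Elasticsearch host, "logstash-YYYY.MM.dd" naming, and 30-day retention are assumptions.
import datetime
import requests

ES_HOST = "http://localhost:9200"
RETENTION_DAYS = 30

today = datetime.date.today()
for offset in range(RETENTION_DAYS, RETENTION_DAYS + 60):   # sweep a window of old days
    day = today - datetime.timedelta(days=offset)
    index = "logstash-" + day.strftime("%Y.%m.%d")
    resp = requests.delete(ES_HOST + "/" + index)
    if resp.status_code not in (200, 404):                  # 404 just means already gone
        print("failed to drop", index, resp.status_code)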

Related

Real time migration of data from MySQL to elasticsearch?

I have tons of data in MySQL, spread across different databases and their respective tables, all related to each other. Whenever I have to analyse the data, I have to write different scripts that combine and merge it and show me the result, which takes a lot of time and effort. I love Elasticsearch for its speed and for data visualization via Kibana, so I have decided to move my entire MySQL data set to Elasticsearch in real time, while keeping the data in MySQL too. But I want a scalable strategy and process for migrating that data to Elasticsearch.
Please suggest the best tools or methods for the job.
Thank you.
Prior to Elasticsearch 2.x you could write your own Elasticsearch _river plugin and install it into Elasticsearch. You could control how often the _river pulls in the data you have munged with your scripts (note: rivers are not really recommended).
You may also use your favourite queuing/message-broker tool, such as ActiveMQ, to push your data into Elasticsearch.
If you want full control over real-time migration, you can also write a simple app that uses the Elasticsearch REST endpoint and simply writes to it over HTTP; you can even use bulk POSTs (a rough sketch follows this list of options).
Make use of any of the Elasticsearch shipping tools, such as Beats or Logstash, which are great at getting almost any type of data into Elasticsearch.
For other alternatives for munging your data into flat files, or if you want to maintain the relationships, see this post here.
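Picking up the REST/bulk suggestion above, here is a rough sketch of pushing pre-merged rows through the _bulk endpoint with Python's requests library; the index name, document fields, and localhost URL are placeholders, not anything from the original question.

# Sketch: bulk-index pre-merged rows via the Elasticsearch REST _bulk endpoint.
# Index name, fields, and URL are illustrative assumptions.
import json
import requests

rows = [
    {"id": 1, "user": "john", "total_votes": 12},
    {"id": 2, "user": "mary", "total_votes": 7},
]

lines = []
for row in rows:
    lines.append(json.dumps({"index": {"_index": "merged-data", "_id": row["id"]}}))
    lines.append(json.dumps(row))
body = "\n".join(lines) + "\n"          # _bulk wants newline-delimited JSON and a trailing newline

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()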

Using MongoDB vs MySQL with lots of JSON fields?

There is a microblogging type of application. The two main candidate data stores are:
MySQL or MongoDB.
I am planning to denormalize a lot of the data, i.e. a vote on a post is stored in a voting table, and a count is also incremented in the main posts table. There are other actions involved with the post too (e.g. like, vote down).
If I use MySQL, some of the data is better suited to JSON than to a fixed schema, for faster lookups.
E.g.
POST_ID | activity_data
213423424 | { 'likes': {'count': 213, 'recent_likers': ['john', 'jack', ...fixed list of recent N users]}, 'smiles': {'count': 345, 'recent_smilers': ['mary', 'jack', ...fixed list of recent N users]} }
There are other components of the application as well, where usage of JSON is being proposed.
So, to update a JSON field, the sequence is:
Read the JSON in a Python script.
Update the JSON.
Store the JSON back into MySQL.
In MongoDB this would have been a single operation, using atomic operators like $push, $inc, and $pull. MongoDB's document structure also suits my data well.
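For illustration only, a minimal pymongo sketch of that single-operation version, using the activity_data shape from the example above; the connection string, database, and collection names are assumptions.

# Sketch: record a "like" as one atomic update instead of read-modify-write.
# Connection string, database, and collection names are assumptions.
from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017")["microblog"]["posts"]

posts.update_one(
    {"_id": 213423424},
    {
        "$inc": {"likes.count": 1},              # bump the counter atomically
        "$push": {
            "likes.recent_likers": {
                "$each": ["john"],
                "$slice": -10,                   # keep only the most recent N likers
            }
        },
    },
)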
My considerations while choosing the data store:
Regarding MySQL:
Stable and familiar.
Backup and restore is easy.
Some future schema changes can be avoided using some fields as schemaless JSON.
May have to add a memcached layer early on.
JSON blobs will be static in some tables, like the main posts table, but will be updated a lot in others, like post votes and likes.
Regarding MongoDB:
Better suited to store schema less data as documents.
Caching might be avoided till a later stage.
The app may become write-intensive at times; MongoDB can perform better there, where unsafe writes are not an issue.
Not sure about stability and reliability.
Not sure how easy it is to back up and restore.
Questions:
Should we choose MongoDB if half of the data is schemaless and would be stored as JSON if we used MySQL?
Some of the data, like the main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on importance of data and write-intensiveness, correct?
How easy is it to monitor, back up, and restore MongoDB as compared to MySQL? We need to plan periodic backups (say daily) and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
Stability, backup, snapshots, restoring, wider adoption, i.e. database durability, are the reasons pointing me to use MySQL as RDBMS+NoSQL, even though a NoSQL document store could serve my purpose better.
Please focus your views on the choice between MySQL and MongoDB considering the database design I have in mind. I know there could be better ways to plan database design with either RDBMS or MongoDB documents. But that is not the current focus of my question.
UPDATE: From MySQL 5.7 onwards, MySQL supports a rich native JSON datatype, which provides data flexibility as well as rich JSON querying.
https://dev.mysql.com/doc/refman/5.7/en/json.html
So, to directly answer the questions...
Should we choose MongoDB if half of the data is schemaless and would be stored as JSON if using MySQL?
Schemaless storage is certainly a compelling reason to go with MongoDB, but as you've pointed out, it's fairly easy to store JSON in a RDBMS as well. The power behind MongoDB is in the rich queries against schemaless storage.
If I might point out a small flaw in the illustration about updating a JSON field: it's not simply a matter of getting the current value, updating the document and then pushing it back to the database. The whole process must be wrapped in a transaction. Transactions tend to be fairly straightforward until you start denormalizing your database; then something as simple as recording an upvote can lock tables all over your schema.
With MongoDB, there are no transactions. But operations can almost always be structured in a way that allows for atomic updates. This usually involves some dramatic shifts from the SQL paradigms, but in my opinion they're fairly obvious once you stop trying to force objects into tables. At the very least, lots of other folks have run into the same problems you'll be facing, and the Mongo community tends to be fairly open and vocal about the challenges they've overcome.
Some of the data, like main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on importance of data and write-intensiveness, correct?
By "safe writes" I assume you mean the option to turn on an automatic "getLastError()" after every write. We have a very thin wrapper over a DBCollection that allows us fine grained control over when getLastError() is called. However, our policy is not based on how "important" data is, but rather whether the code following the query is expecting any modifications to be immediately visible in the following reads.
Generally speaking, this is still a poor indicator, and we have instead migrated to findAndModify() for the same behavior. On the occasion where we still explicitly call getLastError() it is when the database is likely to reject a write, such as when we insert() with an _id that may be a duplicate.
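In pymongo, the findAndModify() pattern mentioned above corresponds to find_one_and_update(); below is a small sketch with invented collection and document names.

# Sketch: atomically bump a counter and read the resulting value in one round trip.
# Collection and document names are invented for illustration.
from pymongo import MongoClient, ReturnDocument

counters = MongoClient()["microblog"]["counters"]

doc = counters.find_one_and_update(
    {"_id": "post:213423424:votes"},
    {"$inc": {"value": 1}},
    upsert=True,
    return_document=ReturnDocument.AFTER,   # return the document as it looks after the update
)
print(doc["value"])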
How easy is it to monitor, back up, and restore MongoDB as compared to MySQL? We need to plan periodic backups (say daily), and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
I'm afraid I can't speak to whether our backup/restore policy is effective, as we have not had to restore yet. We're following the MongoDB recommendations for backing up; @mark-hillick has done a great job of summarizing those. We're using replica sets, and we have migrated MongoDB versions as well as introduced new replica members. So far we've had no downtime, so I'm not sure I can speak well to this point.
Stability, backup, snapshots, restoring, wider adoption, i.e. database durability, are the reasons pointing me to use MySQL as RDBMS+NoSQL even though a NoSQL document store could serve my purpose better.
So, in my experience, MongoDB offers storage of schemaless data with a set of query primitives rich enough that transactions can often be replaced by atomic operations. It's been tough to unlearn 10+ years worth of SQL experience, but every problem I've encountered has been addressed by the community or 10gen directly. We have not lost data or had any downtime that I can recall.
To put it simply, MongoDB is hands down the best data storage ecosystem I have ever used in terms of querying, maintenance, scalability, and reliability. Unless I had an application that was so clearly relational that I could not in good conscience use anything other than SQL, I would make every effort to use MongoDB.
I don't work for 10gen, but I'm very grateful for the folks who do.
I'm not going to comment on the comparisons (I work for 10gen and don't feel it's appropriate for me to do so); however, I will answer the specific MongoDB questions so that you can better make your decision.
Back-Up
Documentation here is very thorough, covering many aspects:
Block-Level Methods (LVM makes it very easy and quite a lot of folk do this)
With/Without Journaling
EBS Snapshots
General Snapshots
Replication (technically not back-up, however, a lot of folk use replica sets for their redundancy and back-up - not recommending this but it is done)
Until recently there was no MongoDB equivalent of mylvmbackup, but a nice guy wrote one :) In his words:
Early days so far: it's just a glorified shell script and needs way more error checking. But already it works for me and I figured I'd share the joy. Bug reports, patches & suggestions welcome.
Get yourself a copy from here.
Restores
Formats etc
mongodump is completely documented here and mongorestore is here.
mongodump will not contain the indexes, but it does contain the system.indexes collection, so mongorestore can rebuild the indexes when you restore the bson file. The bson file is the actual data, whereas mongoexport/mongoimport are not type-safe, so the output could be anything (technically speaking) :)
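For the daily-backup requirement raised in the question, one simple option is to drive mongodump from a scheduled Python script; a sketch follows, where the host, port, and backup directory are assumptions.

# Sketch: nightly mongodump into a dated directory; restore later with mongorestore.
# Host, port, and backup root are assumptions.
import datetime
import subprocess

backup_dir = "/backups/mongodb/" + datetime.date.today().isoformat()
subprocess.check_call([
    "mongodump",
    "--host", "localhost",
    "--port", "27017",
    "--out", backup_dir,
])
# Disaster recovery would then be roughly:
#   mongorestore --host localhost --port 27017 /backups/mongodb/<date>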
Monitoring
Documented here.
I like Cacti, but AFAIK the Cacti templates have not kept up with the changes in MongoDB and so rely on old syntax; post 2.0.4, I believe there are issues.
Nagios works well, but it's Nagios, so you either love it or hate it. A lot of folk use Nagios and it seems to provide them with great visibility.
I've heard of some folk looking at Zabbix, but I've never used it so I can't comment.
Additionally, you can use MMS, which is free and hosted externally. Your MongoDB instances run an agent, and that agent communicates (using Python code) over HTTPS to mms.10gen.com. We use MMS to view all performance statistics on the MongoDB instances, and it is very beneficial for a high-level wide view as well as offering the ability to drill down. It's simple to install, and you don't have to run any hardware for this. Many customers run it, and some complement it with Cacti/Nagios.
Help information on MMS can be found here (it's a very detailed, inclusive document).
One of the disadvantages of a MySQL solution with stored JSON is that you will not be able to search the JSON data efficiently. If you store it all in MongoDB, you can create indexes and/or queries on all of your data, including the JSON.
Mongo's writes work very well, and really the only thing you lose versus MySQL is transaction support, and thus the ability to roll back multi-part saves. However, if you are able to commit your changes in atomic operations, then there isn't a data-safety issue. If you are replicated, Mongo provides an "eventually consistent" promise such that the slaves will eventually mirror the master.
MongoDB doesn't natively enforce or cascade certain DB constructs such as foreign keys, so you have to manage those yourself (either through composition, which is one of Mongo's strengths, or through the use of DBRefs).
If you really need transaction support and robust 'safe' writes, yet still desire the flexibility provided by nosql, you might consider a hybrid solution. This would allow you to use mysql as your main post store, and then use mongodb as your 'schemaless' store. Here is a link to a doc discussing hybrid mongo/rdbms solutions: http://www.10gen.com/events/hybrid-applications The article is from 10gen's site, but you can find other examples simply by doing a quick google search.
Update 5/28/2019
There have been a number of changes to both MySQL and MongoDB since this answer was posted, so the pros and cons between them have become even blurrier. This update doesn't really help with the original question, but I am adding it to make sure any new readers have more recent information.
MongoDB now supports transactions: https://docs.mongodb.com/manual/core/transactions/
MySQL now supports indexing and searching JSON fields:
https://dev.mysql.com/doc/refman/5.7/en/json.html
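As a rough illustration of that second point, querying a native JSON column from Python might look like the sketch below; the table, column names, and credentials are invented, and the mysql-connector-python driver is an assumption.

# Sketch: filtering on a value inside a MySQL 5.7+ JSON column.
# Table, columns, and credentials are illustrative assumptions.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="microblog")
cur = conn.cursor()
cur.execute(
    "SELECT post_id, JSON_EXTRACT(activity_data, '$.likes.count') AS like_count "
    "FROM posts "
    "WHERE JSON_EXTRACT(activity_data, '$.likes.count') > 100"
)
for post_id, like_count in cur:
    print(post_id, like_count)
cur.close()
conn.close()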

Which best fits my needs: MongoDB, CouchDB, or MySQL. Criteria defined in question

Our website needs a content-management type of system. For example, admins want to create promotion pages on the fly. They'll supply some text and images for the page and the URL the page needs to be on. We need a data store for this. The criteria for the data store are simple and defined below. I am not familiar with CouchDB or MongoDB, but I think they may be a better fit for this than MySQL, and am looking for someone with more knowledge of MongoDB and CouchDB to chime in.
On a scale of 1 to 10 how would you rate MongoDB, CouchDB, and MySQL for the following:
Java client
Track web clicks
CMS like system
Store uploaded files
Easy to setup failover
Support
Documentation
Which would you choose under these circumstances?
Each one is suitable for different use cases, but for low-traffic sites MySQL/PostgreSQL is better.
Java client: all of them have clients
Track web clicks: Mongo and Cassandra are more suitable for this high-write situation.
Store uploaded files: Mongo with GridFS is suitable. Cassandra can store up to 2 GB per column, split into 1 MB chunks. MySQL is not suitable; with Cassandra and MySQL it is preferable to store only the file location in the database and keep the file itself on the filesystem.
Easy to set up failover: Cassandra is the best, Mongo second.
Support: all have good support; MySQL has the largest community, Mongo is second.
Documentation: 1st MySQL, 2nd Mongo.
I prefer MongoDB for analytics (web clicks, counters, logs) (you need a 64-bit system) and MySQL or PostgreSQL for the main data. On the "companies using Mongo" page on the Mongo website, you can see that most of them use Mongo for analytics. Mongo can be suitable for the main data after version 1.8. The problem with Cassandra is its poor querying capabilities (not suitable for a CMS), and the problem with MySQL is that it is not as easily scalable and highly available as Cassandra and Mongo, and it is also slower, especially on writes. I don't recommend CouchDB; it's the slowest of the lot.
my best
Serdar Irmak
Here are some quick answers based on my experience with Mongo.
Java client
Not sure how to rate it, but a Java client does exist and it is well supported. There are lots of docs, and even several POJO wrappers to make it easy to use.
Track web clicks
8 or 9. It's really easy to do both inserts and updates thanks to "fire and forget". MongoDB has built-in tools to map-reduce the data and easy tools to export the data to SQL for analysis (if Mongo isn't good enough).
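As an illustration, the "fire and forget" style of click write in current pymongo corresponds to an unacknowledged insert (write concern 0); the database and collection names below are assumptions.

# Sketch: unacknowledged ("fire and forget") click-tracking inserts.
# Database/collection names are illustrative.
import datetime
from pymongo import MongoClient, WriteConcern

db = MongoClient()["analytics"]
clicks = db.get_collection("clicks", write_concern=WriteConcern(w=0))

clicks.insert_one({
    "url": "/promo/spring-sale",
    "user": "john",
    "ts": datetime.datetime.utcnow(),
})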
CMS like system
8 or 9. It's easy to store the whole web page content. It's really easy to "hook on" extra columns. This is really Mongo's "bread and butter".
Store uploaded files
There's a learning curve here, but Mongo has a GridFS system designed specifically for both saving and serving binary data.
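A minimal pymongo GridFS sketch, with the database name and file path as placeholders:

# Sketch: storing and reading back an uploaded file with GridFS.
# Database name and file path are assumptions.
import gridfs
from pymongo import MongoClient

db = MongoClient()["cms"]
fs = gridfs.GridFS(db)

with open("banner.png", "rb") as f:
    file_id = fs.put(f, filename="banner.png", content_type="image/png")

stored = fs.get(file_id)      # later, when serving the file back
data = stored.read()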
Easy to set up failover
Start your primary server: ./mongod --bind_ip 1.2.3.4 --dbpath /my/data/files --master
Start your slave: ./mongod --bind_ip 1.2.3.5 --dbpath /my/data/files --slave --source 1.2.3.4
Support
10gen has a mailing list: http://groups.google.com/group/mongodb-user. They also have paid support.
Their response time generally ranks somewhere between excellent and awesome.
Documentation
Average. It's all there, but it is still a little disorganized. Chalk it up to a lot of recent development.
My take on CouchDB:
Java Client: It is great; use Ektorp, which is pretty easy and provides complete object mapping. In any case, the whole API is just JSON over HTTP, so it is all easy.
Track web clicks: Maybe Redis is a better tool for this; CouchDB is not the best option here.
CMS-like system: It is great, as you can easily combine templates, dynamic forms, and data, and collate them using views.
Store uploaded files: Any document in CouchDB can have arbitrary attachments, so it's a natural fit.
Easy to set up failover: Master/master replication makes sure you are always ready to go, and the database never gets corrupted, so in case of failure it's only a matter of starting CouchDB again; it will take over where it stopped (minimal downtime), and replication will catch up on the changes.
Support: There is a mailing list and paid support.
Documentation: Use the open book at http://guide.couchdb.org and the wiki.
I think there are plenty of other posts related to this topic. However, I'll chime in since I've moved off MySQL and onto MongoDB. It's fast, very fast, but that doesn't mean it's perfect. My advice: use what you're comfortable with. If it takes you longer to refactor code to make it fit Mongo or Couch, then stick with MySQL if that's what you're familiar with. If this is something you want to pick up as a skill set, then by all means learn MongoDB or CouchDB.
For me, I went with MongoDB for a couple of reasons: file storage via GridFS and geolocation. Yeah, I could have used MySQL, but I wanted to see what all the fuss was about. I must say, I'm impressed, and I still have a ways to go before I can say I'm comfortable with Mongo.
With what you've listed, I can tell you that Mongo will fit most of your needs.
I don't see anything here like "must handle millions of req/s" that would indicate rolling your own would be better than using something off the shelf like Drupal.

Spring-Batch for a massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform nightly/hourly data summaries and statistics gathering on a massive amount of data.
What I'd like to achieve is:
Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
The framework must be able to recover from crashes. I guess some persistence would be needed here.
Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics with regards to the performance.
Traceability - I must be able to understand the state of the executions
Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.
The current scripts do the following:
Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/).
Perform Hive summary queries on the data, and insert (overwrite) to new Hive tables / partitions.
Extract the new summary data into files, and load (merge) it into MySQL tables. This is data needed later for online reports (a rough sketch of this merge step appears after the list).
Perform additional joins on the newly added MySQL data (from the MySQL tables), and update the data.
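For reference, the load/merge step above is typically just an upsert from the extracted files into MySQL; here is a sketch with invented file, table, and column names (the mysql-connector-python driver is also an assumption).

# Sketch: merge extracted Hive summary rows into a MySQL reporting table.
# File name, table, columns, and credentials are illustrative assumptions.
import csv
import mysql.connector

conn = mysql.connector.connect(user="etl", password="secret", database="reports")
cur = conn.cursor()

with open("daily_summary.csv") as f:
    for day, metric, value in csv.reader(f):
        cur.execute(
            "INSERT INTO daily_summary (day, metric, value) "
            "VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE value = VALUES(value)",   # merge on the (day, metric) key
            (day, metric, value),
        )

conn.commit()
cur.close()
conn.close()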
My idea is to replace the scripts with spring-batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.
Since I saw some bad vibes about Spring Batch (mostly in old posts), I'm hoping to get some input here. I also haven't seen much about Spring Batch and Hive integration, which is troublesome.
If you want to stay within the Hadoop ecosystem, I'd highly recommend checking out Oozie to automate your workflow. We (Cloudera) provide a packaged version of Oozie that you can use to get started. See our recent blog post for more details.
Why not use JasperETL or Talend? Seems like the right tool for the job.
I've used Cascading quite a bit and found it to be quite impressive:
Cascading
It is an M/R abstraction layer and runs on Hadoop.

How do applications collect statistics?

I need to collect statistics from my server application, which is written in Python. I am looking for some general guidance on how to set up models and exactly how to store the statistics information. I was thinking of storing and organizing all this information in a database, but my implementation is turning out to be too specific.
I need to collect stats like active users, requests processed and things like that over time.
Are there any guides or techniques out there to create some more generic statistics storage systems?
Like most software problems, there is no single solution I can recommend that will solve yours. But I have created a few similar programs, and here are some things I found that worked well.
Create an asynchronous logging service so that logging doesn't adversely affect your code's performance (a minimal sketch appears after this list). You still need to be mindful of where you are storing your data, where it is processed, etc., because you can significantly degrade performance if you're not careful. I have found that creating a web service is often convenient.
Try to save as much information about the request as possible. In the future this will make it easier to add new queries and reports.
Normalize your data.
Always include the time the action was performed. If you can capture run time, that is typically useful too.
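A minimal sketch of that asynchronous-logging idea using only Python's standard library; the log path and record format are assumptions.

# Sketch: keep stats logging off the request path with a queue and a background listener.
# Log file path and record format are assumptions.
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)                      # unbounded, so callers never block
file_handler = logging.FileHandler("/var/log/myapp/stats.log")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))

listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()                                 # a background thread does the actual I/O

stats_log = logging.getLogger("stats")
stats_log.addHandler(logging.handlers.QueueHandler(log_queue))
stats_log.setLevel(logging.INFO)

# In request-handling code:
stats_log.info("request user=%s path=%s duration_ms=%d", "john", "/api/posts", 42)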
One approach is to do this in stages: store activity logs, including requests and users, as text files, and later mine the logs into data points (Python should be able to do this easily). You may want to use Python's logging library for the logging stage. In general, start with high time-resolution logging, which you can later aggregate into hourly, daily, weekly summaries, etc.
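And a small follow-on sketch of the "mine the logs into data points" stage, assuming log lines shaped like the hypothetical record written in the previous sketch:

# Sketch: aggregate high-resolution activity logs into hourly request counts.
# Assumes lines like "2024-01-01 12:34:56,789 request user=john path=/api/posts duration_ms=42".
import collections

hourly_requests = collections.Counter()
with open("/var/log/myapp/stats.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 3 or parts[2] != "request":
            continue
        hour = parts[0] + " " + parts[1][:2]     # "YYYY-MM-DD HH"
        hourly_requests[hour] += 1

for hour, count in sorted(hourly_requests.items()):
    print(hour, count)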