Graylog vs Fluentd - open-source

Of the available open-source log management tools, I have come across these two but couldn't figure out which one to use. I tried searching for articles comparing Graylog and Fluentd but couldn't find any.
Could someone suggest which one would be a good fit for the following criteria?
* a production master-slave architecture on modest hardware, e.g. a single-core processor with 4 GB RAM and enough disk to accommodate logs
* log search via UI
* alerts based on rules
* minimal setup (if possible)
* dynamically add/remove slave hosts behind a VIP
Thanks in advance
Mirza

I tried searching for articles comparing Graylog and Fluentd but couldn't find any.
That's because Graylog and Fluentd are tools at different layers.
Fluentd is a streaming event collector; it doesn't have storage or a visualization UI the way Graylog does.
Graylog is a log management tool built on Elasticsearch, not an event collector.
Fluentd can be a data source for Graylog, so it isn't really a "vs" question.
Here is one example of a Fluentd and Graylog combo: http://www.fluentd.org/guides/recipes/graylog2
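To make the division of labour concrete, here is a minimal sketch of the application side using the fluent-logger Python package: the application only emits structured events to a local Fluentd agent, and a Fluentd output plugin (e.g. GELF, as in the recipe above) ships them on to Graylog for storage, search and alerting. The tag and field names below are placeholders; 24224 is Fluentd's default forward port.

```python
# pip install fluent-logger
from fluent import sender, event

# Point the logger at the local Fluentd agent (24224 is Fluentd's default forward port).
sender.setup("myapp", host="localhost", port=24224)

# Emit a structured event; Fluentd routes it (e.g. via a GELF output plugin) to Graylog,
# which stores it in Elasticsearch and makes it searchable and alertable in the UI.
event.Event("access", {
    "host": "web-01",      # placeholder field names
    "path": "/login",
    "status": 200,
})
```

In this split, Fluentd covers the "minimal setup" and "add/remove hosts" points (one lightweight agent per host), while Graylog covers the search UI and rule-based alerts.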


Filebeat Central Management Alternative

We have an on-premises setup of the free version of the ELK stack - actually just an Elasticsearch cluster and some Kibana nodes (no Logstash).
On the application servers we have installed Filebeat 7.9.0, which ships the logs to the Elasticsearch ingest nodes; Filebeat does only very minimal processing on the log events before sending (e.g. multiline, dissect, drop_fields and JSON decoding).
As of today there are only 3 application servers in the production setup, but the number of machines (application servers) might grow going forward.
I understand that central management of the Filebeat configuration is possible with a licensed version of the ELK stack (and that this feature is also approaching its end of life).
I want to know what alternatives are available for managing the Filebeat configuration, apart from central management through Kibana.
The goal: if the number of application servers grows to, let's say, 20 and the Filebeat configuration has to change, updating the configuration on each server by hand would be a manual activity with its own risks. Ideally we change the configuration in one location and it is somehow propagated to Filebeat on all application servers.
Please let me know if this can be achieved.
Any pointers / thoughts towards a solution are welcome.
Note: we do not have infrastructure as code in the organization yet, so that may not be a suitable solution.
Thanks in advance.
The replacement for Central Management is Elastic Fleet: you install a single agent on each server, and the rest can be done from Kibana. https://www.elastic.co/blog/introducing-elastic-agent-and-ingest-manager gives a good overview of the features along with current screenshots.
Most parts of Fleet are also available for free.
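If licensed features or Fleet are not an option, a small push script already covers the "change it in one place" goal for a handful of hosts: keep one canonical filebeat.yml, copy it to every application server, and restart the service. Below is a rough sketch using the paramiko SSH library; the host list, user, paths and restart command are assumptions about your environment, and a proper configuration-management tool would do this more safely.

```python
# pip install paramiko
import paramiko

HOSTS = ["app-01", "app-02", "app-03"]   # hypothetical application servers
LOCAL_CONFIG = "filebeat.yml"            # the single source-of-truth config
STAGING_PATH = "/tmp/filebeat.yml"

def push_config(host: str) -> None:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="deploy")   # assumes key-based SSH auth
    try:
        # Upload the canonical config to a staging path, then move it into place
        # and restart Filebeat so it picks up the change.
        sftp = client.open_sftp()
        sftp.put(LOCAL_CONFIG, STAGING_PATH)
        sftp.close()
        cmd = (f"sudo cp {STAGING_PATH} /etc/filebeat/filebeat.yml "
               "&& sudo systemctl restart filebeat")
        _, stdout, _ = client.exec_command(cmd)
        stdout.channel.recv_exit_status()      # wait for the command to finish
    finally:
        client.close()

for host in HOSTS:
    push_config(host)
```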

network stream analysis using wireshark and apache storm

I want to make a simple application that analyzes network traffic captured by Wireshark and counts how often I access Google. Through this project I want to introduce myself to network traffic processing.
In order to get traffic, I use Wireshark.
In order to analyze the network traffic streams I want to use Apache Storm.
I am new to stream processing, Apache Storm and Wireshark (this is my very first use of them), so I want to ask whether it is possible to realize this simple project.
I think Apache Storm can easily read data from JSON files.
So the things that puzzle me are:
Can I export Wireshark data to JSON files?
Can I somehow read the data captured by Wireshark in real time,
and process this data using Apache Storm?
Or, if you can suggest a more appropriate tool with a concrete tutorial, I would be grateful.
Sounds like an interesting project to start with.
To read data from Wireshark in real time, I suppose you can use one of the Wireshark APIs and create a client that pushes data to some port on your system (a rough sketch of such a client follows the steps below).
Create a Storm Spout that reads from this port
Create a Storm Bolt to parse these messages (if you want to convert them to JSON)
Create a Storm Bolt to filter the traffic (the Google-related traffic? perhaps based on IP address or hostname?)
Optionally store the results to some datastore.
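To make steps 1 and 2 concrete, here is a rough sketch of such a client in Python using pyshark, a wrapper around tshark (the command-line sibling of Wireshark). It live-captures DNS queries and pushes one JSON line per query to a local port where a Storm spout (or any other consumer) could read them; the interface name and port are placeholders. For the offline case, tshark can also export an existing capture to JSON with tshark -r capture.pcap -T json.

```python
# pip install pyshark   (pyshark drives tshark under the hood)
import json
import socket

import pyshark

# A Storm spout (or any consumer) is assumed to be listening on this port.
SPOUT_HOST, SPOUT_PORT = "localhost", 9999
sink = socket.create_connection((SPOUT_HOST, SPOUT_PORT))

# Live-capture DNS traffic; the interface name is machine-specific.
capture = pyshark.LiveCapture(interface="eth0", bpf_filter="udp port 53")

for packet in capture.sniff_continuously():
    try:
        query = packet.dns.qry_name          # the hostname being looked up
    except AttributeError:
        continue                             # not a DNS query/response packet
    record = {"time": str(packet.sniff_time), "query": query}
    # One JSON document per line; a downstream bolt can count the ones containing "google".
    sink.sendall((json.dumps(record) + "\n").encode("utf-8"))
```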
Alternatives: you can also try Apache Flink or Apache Spark, which also provide stream processing capabilities.

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipeline to warm up caches and update search engines.
Currently we do that with a rudimentary polling architecture, basically plain SQL queries, but the ideal would be a push architecture that reads the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've worked with some customers doing this already, with Oracle. It also seems that LinkedIn makes heavy use of this technique of streaming data from databases to other systems. They created a platform called Databus to accomplish it in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that already streams the binlog from MySQL to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the picture, I don't see why it shouldn't work. As a matter of fact, it was built for exactly this - your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
Hope this helps.
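If you'd rather roll this yourself than adopt one of those projects, the core of it is small. Here is a hedged sketch using the open-source python-mysql-replication library together with boto3; the schema, table, stream and connection details are placeholders, and on RDS MySQL you need row-based binary logging (binlog_format=ROW) enabled so that individual row changes show up in the binlog.

```python
# pip install mysql-replication boto3
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

MYSQL = {"host": "legacy-db.example.com", "port": 3306,
         "user": "repl", "passwd": "secret"}        # placeholder credentials

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=100,                                  # must be unique among replicas
    only_schemas=["legacy"],                        # hypothetical schema/table names
    only_tables=["orders"],
    only_events=[WriteRowsEvent, UpdateRowsEvent],  # inserts and updates only
    blocking=True,                                  # keep tailing the binlog
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Inserts carry "values"; updates carry "before_values"/"after_values".
        payload = row.get("after_values", row.get("values"))
        kinesis.put_record(
            StreamName="table-changes",             # hypothetical Kinesis stream
            Data=json.dumps(payload, default=str),
            PartitionKey=str(payload["id"]),        # assumes an "id" primary key column
        )
```

Consumers on the Kinesis side (cache warmers, search indexers) then react to each record instead of polling the table.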
The AWS DMS service offers data migration from a SQL database to Kinesis.
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/

Fluentd+Mongo vs. Logstash

Our team currently uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and that setup has been running for a week. Recently we discussed another solution - Logstash. I want to ask: what are the differences between them? In my opinion, I'd like to use Zabbix as the data-gathering and alert-sending platform, with Fluentd playing the 'data-gathering' role in the overall infrastructure. However, I've looked at the Logstash website and found that Logstash is not only a log-gathering system but a whole solution for gathering, presentation and search.
Could anybody give some advice or share some experience?
Logstash is pretty versatile (disclaimer: I have only been playing with it for a few weeks).
We'd been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but the message processing functionality in it is based on the Drools engine and is... arcane at best.
I found it was much easier to have Logstash read the syslog files from our central server, massage the events and output them to Graylog2. That gave us much more flexibility and should allow us to add application-level events alongside the OS-level syslog data.
Logstash has a Zabbix output, so you might find it's worth a look.
Logstash is a great fit with Zabbix.
I forked a repo on GitHub to take the Logstash statsd output and send it to Zabbix for trending/alerting. As mentioned in another answer, Logstash also has a Zabbix output plugin, which is great for notifying/sending on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present Graylog2 also uses Elasticsearch, but it keeps all data in a single index. If you periodically clean up old data, that means the equivalent of a lot of SQL "delete from table where date < YYYY.MM.DD" calls, whereas Logstash defaults to daily indexes (the equivalent of "drop table YYYY.MM.DD"), so clean-up is much nicer.
It also results in cleaner searches that require less heap space, since you can restrict a search to a known date range because each index is named for the day of data it contains.
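As a concrete illustration of why daily indexes make clean-up cheap: retiring old data just means deleting whole indexes by name, instead of issuing delete-by-query calls inside one big shared index. Here is a minimal sketch against the Elasticsearch HTTP API, assuming Logstash's default logstash-YYYY.MM.dd naming and a 30-day retention (both of which you would adjust); in practice a tool like Elasticsearch Curator does the same job.

```python
# pip install requests
import datetime

import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
RETENTION_DAYS = 30

today = datetime.date.today()
# Sweep a window of days past the retention cutoff and drop each daily index.
for age in range(RETENTION_DAYS, RETENTION_DAYS + 60):
    day = today - datetime.timedelta(days=age)
    index = "logstash-" + day.strftime("%Y.%m.%d")   # Logstash's default daily index name
    resp = requests.delete(f"{ES_URL}/{index}")
    if resp.status_code == 200:
        print("dropped", index)                      # 404s for missing days are simply ignored
```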

Spring-Batch for a massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform nightly / hourly data summarization and statistics gathering on a massive amount of data.
What I'd like to achieve is:
Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
The framework must be able to recover from crashes. I guess some persistence would be needed here.
Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics with regards to the performance.
Traceability - I must be able to understand the state of the executions
Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.
The current scripts do the following:
Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/).
Perform Hive summary queries on the data, and insert (overwrite) to new Hive tables / partitions.
Extract the new summary data into files, and load (merge) them into MySQL tables. This data is needed later for online reports.
Perform additional joins on the newly added MySQL data (against other MySQL tables), and update the data.
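For reference, steps 2 and 3 in that list boil down to something like the following in the current Python scripts (the queries, table names and paths here are invented placeholders, not the real ones):

```python
import subprocess

# Step 2: run a Hive summary query, overwriting the target partition.
subprocess.check_call([
    "hive", "-e",
    "INSERT OVERWRITE TABLE daily_summary PARTITION (dt='2011-01-15') "
    "SELECT page, COUNT(*) FROM raw_logs WHERE dt='2011-01-15' GROUP BY page",
])

# Step 3: extract the summary to a local file, then merge it into MySQL.
subprocess.check_call(
    "hive -e \"SELECT * FROM daily_summary WHERE dt='2011-01-15'\" > summary.tsv",
    shell=True,
)
subprocess.check_call([
    "mysql", "--local-infile=1", "reports", "-e",
    "LOAD DATA LOCAL INFILE 'summary.tsv' REPLACE INTO TABLE report_daily_summary",
])
```

Each step is a fire-and-forget shell call with no persisted state, which is exactly where the restart, recovery and monitoring requirements above come from.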
My idea is to replace the scripts with Spring Batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.
Since I've seen some bad vibes about Spring Batch (mostly in old posts), I'm hoping to get some input here. I also haven't seen much about Spring Batch and Hive integration, which is troublesome.
If you want to stay within the Hadoop ecosystem, I'd highly recommend checking out Oozie to automate your workflow. We (Cloudera) provide a packaged version of Oozie that you can use to get started. See our recent blog post for more details.
Why not use JasperETL or Talend? Seems like the right tool for the job.
I've used Cascading quite a bit and found it to be quite impressive:
Cascading
It is an M/R (MapReduce) abstraction layer, and runs on Hadoop.