I'm new to logstash but I like how easy it makes shipping logs and aggregating them. Basically it just works. One problem I have is I'm not sure how to go about making my configurations maintainable. Do people usually have one monolithic configuration file with a bunch of conditionals or do they separate them out into different configurations and launch an agent for each one?
We heavily use Logstash to monitor ftbpro.com. I have two notes which you might find useful:
You should run one agent (process) per machine, not more. A Logstash agent requires a fair amount of CPU and memory, especially under high load, so you don't want to run more than one on a single machine.
We manage our Logstash configurations with Chef. We have a separate template for each configuration, and Chef assembles the final configuration according to the machine's roles. So the end result is one large configuration on each machine, but in our repository the configurations are separate and therefore maintainable.
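Purely to illustrate the idea (this isn't our actual Chef code; the role names, fragment files, and paths are all made up), the assembly step amounts to something like this:

import os

# Hypothetical layout: one Logstash config fragment per role, all in one directory.
FRAGMENT_DIR = "templates/logstash"
ROLE_FRAGMENTS = {
    "webserver": ["input_nginx.conf", "filter_nginx.conf"],
    "app":       ["input_app.conf", "filter_app.conf"],
    "base":      ["output_elasticsearch.conf"],
}

def assemble_config(roles, out_path="/etc/logstash/logstash.conf"):
    # Concatenate the fragments for this machine's roles into a single config file.
    parts = []
    for role in roles + ["base"]:
        for fragment in ROLE_FRAGMENTS.get(role, []):
            with open(os.path.join(FRAGMENT_DIR, fragment)) as f:
                parts.append(f.read())
    with open(out_path, "w") as out:
        out.write("\n".join(parts))

# e.g. on a machine with the "webserver" and "app" roles:
# assemble_config(["webserver", "app"])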
Hope this helps you.
I'll offer the following advice:
Send your data to Redis as a "channel" rather than a "list", with channel names based on the date and time, which makes managing Redis a lot easier.
http://www.nightbluefruit.com/blog/2014/03/managing-logstash-with-the-redis-client/
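For illustration, this is roughly the difference between the two styles with redis-py (the key and channel names are only examples; the channel name encodes the date, as suggested above):

import json
import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)
event = json.dumps({"@message": "user logged in", "@source": "app01"})

# "list" style: events pile up under one key until a consumer pops them off.
r.rpush("logstash", event)

# "channel" style: events are published to a date-based channel name,
# so each day's traffic is naturally separated (naming is illustrative only).
r.publish("logstash-2014.03.01", event)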
So I am using Mantl.io for our environment. Things are going very well; we are now past the POC phase and starting to think about how we are going to handle continuous delivery. Obviously automation is key. Maybe my approach or thinking is wrong, but I am trying to figure out a way to manage the JSON I will pass to Marathon to deploy the Docker containers from our registry via a Jenkins job. We have various environments (testing, perf, prod, etc.), and in each of these environments I will have my 30+ microservices needing different values for CPU, memory, environment variables, etc.
So I am just not sure of the best approach for linking my Docker containers with what could be 10 or more different configurations per microservice, depending on the environment.
Are there tools for building, managing, versioning, linking containers to configs to environments? I just can't seem to find anything in this realm and that leads me to believe I am headed down the wrong path.
Thanks
Our team now uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and that setup has been running for a week. Recently we were discussing another solution: Logstash. I want to ask what the difference between them is. In my opinion, I'd like to use Zabbix as the data-gathering and alert-sending platform, with Fluentd playing the 'data-gathering' role in the whole infrastructure. But I've looked at the Logstash website and found that Logstash is not only a log-gathering system but a whole solution for gathering, presentation, and search.
Could anybody give some advice or share some experience?
Logstash is pretty versatile (disclaimer: have only been playing with it for a few weeks).
We'd been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but the message processing functionality in it is based on the Drools engine and is... arcane at best.
I found it was much easier to have logstash read syslog files from our central server, massage the events and output to Graylog2. Gave us much more flexibility and should allow us to add application level events alongside the OS level syslog data.
It has a zabbix output, so you might find it's worth a look.
Logstash is a great fit with Zabbix.
I forked a repo on github to take the logstash statsd output and send it to Zabbix for trending/alerting. As was mentioned by another, logstash also has a Zabbix output plugin which is great for notifying/sending on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present, Graylog2 also uses Elasticsearch, but it keeps all data in a single index. If you're periodically cleaning up old data, that means the equivalent of a lot of SQL "delete from table where date < YYYY.MM.DD" calls, whereas Logstash defaults to daily indexes (the equivalent of "drop table YYYY.MM.DD"), so clean-up is much nicer.
It also results in cleaner searches that require less heap space, since you can search over a known date range and each index is named for the day of data it contains.
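As a rough sketch (assuming the default logstash-YYYY.MM.DD index naming and a local Elasticsearch on port 9200), clean-up then just means deleting whole indexes:

import datetime
import requests

ES_URL = "http://localhost:9200"   # assumed local Elasticsearch
KEEP_DAYS = 30                     # retention window; pick what suits you

# Delete daily Logstash indexes that have aged out of the retention window.
today = datetime.date.today()
for age in range(KEEP_DAYS, KEEP_DAYS + 60):   # sweep a 60-day backlog
    index = "logstash-{:%Y.%m.%d}".format(today - datetime.timedelta(days=age))
    resp = requests.delete("{0}/{1}".format(ES_URL, index))
    if resp.status_code not in (200, 404):     # 404: index already gone
        print("unexpected status for {0}: {1}".format(index, resp.status_code))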
Background info:
We are currently 3 web programmers (good, real-life friends, no distrust issues).
Each programmer SSHes into the single Linux server, where the code resides, under their own username with sudo powers.
We all work on different files at any one time. We sometimes ask, "Are you in file __?" We use Vim, so we know whether a file is already open.
Our development code (no production yet) resides in /var/www/
Our remote repo is hosted on bitbucket.
I am *very* new to Git. I used Subversion before, but I was basically spoon-fed instructions and told exactly what to type to sync up code and commit.
I read about half of Scott Chacon's Pro Git and that's the extent to most of my Git knowledge.
In case it matters, we run Ubuntu 11.04, Apache 2.2.17, and Git 1.7.4.1.
So Jan Hudec gave me some advice in the previous question. He told me that it is good practice to do the following:
Each developer has their own repo on their local computer.
Let /var/www/ be the repo on the server. Set the .git folder to permission 770.
That would mean that each developer's computer needs to have its own LAMP stack (or at least Apache, PHP, MySQL, and Python installed).
The code is mostly JavaScript and PHP files, so it's not a big deal to clone it over. However, how do we manage the database locally?
In this case, we only have two tables, and it would be simple to recreate the entire database locally (at least for testing). But in the future, when the database gets too big, should we just log in remotely to the MySQL database on the server, or should we have "sample" data for development and testing purposes?
What you're doing is transitioning from "everybody works together in one environment" to "everybody has their own development environment". The major benefit is that everybody won't be stepping on each other's toes.
Other benefits include a heterogeneous development environment: if everyone develops on the same machine, the software becomes dependent on that one setup, because developers are lazy. If everyone develops in a different environment, even with slightly different versions of the same stuff, they'll be forced to write more robust code to deal with that.
The main drawback, as you've noticed, is that setting up the environment is harder. In particular, making sure the database works.
First, each developer should have their own database. This doesn't mean they all have to have their own database server (though it's good for heterogeneity purposes), but they should have their own database instance which they control.
Second, you should have a schema, not just whatever happens to be in the database, and it should live in a version-controlled file.
Third, setting up a fresh database should be automatic. This lets developers set up a clean database with no hassle.
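A minimal sketch of what "automatic" can mean here, assuming a versioned schema.sql in the repo and a local MySQL instance (the database and user names are hypothetical; credentials are assumed to come from ~/.my.cnf):

import subprocess

DB_NAME = "myapp_dev"   # hypothetical local database name
DB_USER = "dev"         # hypothetical local MySQL user

def fresh_database(schema_file="schema.sql"):
    # Drop and recreate the local development database from the versioned schema.
    subprocess.check_call([
        "mysql", "-u", DB_USER, "-e",
        "DROP DATABASE IF EXISTS {0}; CREATE DATABASE {0};".format(DB_NAME),
    ])
    with open(schema_file) as f:
        subprocess.check_call(["mysql", "-u", DB_USER, DB_NAME], stdin=f)

if __name__ == "__main__":
    fresh_database()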
Fourth, you'll need to get interesting test data into that database. Here's where things get interesting...
You have several routes to do that.
First is to make a dump of an existing database which contains realistic data, sanitized of course. This is easy, and provides realistic data, but it is very brittle. Developers will have to hunt around to find interesting data to do their testing. That data may change in the next dump, breaking their tests. Or it just might not exist at all.
Second is to write "test fixtures". Basically each test populates the database with the test data it needs. This has the benefit of allowing the developer to get precisely the data they want, and know precisely the state the database is in. The drawbacks are that it can be very time consuming, and often the data is too clean. The data will not contain all the gritty real data that can cause real bugs.
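In Python that pattern looks roughly like this (using an in-memory SQLite database purely for illustration; the same idea applies to a throwaway MySQL schema):

import sqlite3
import unittest

class UserQueryTest(unittest.TestCase):
    def setUp(self):
        # Each test gets a private in-memory database in a known state.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
        # The fixture: exactly the rows this test needs, nothing more.
        self.db.executemany(
            "INSERT INTO users (name, active) VALUES (?, ?)",
            [("alice", 1), ("bob", 0)],
        )

    def test_only_active_users_are_counted(self):
        (count,) = self.db.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()
        self.assertEqual(count, 1)

    def tearDown(self):
        self.db.close()

if __name__ == "__main__":
    unittest.main()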
Third is to not access the database at all and instead "mock" all the database calls. You trick all the methods which normally query a database into instead returning testing data. This is much like writing test fixtures, and has most of the same drawbacks and benefits, but it's FAR more invasive. It will be difficult to do unless your system has been designed to do it. It also never actually tests if your database calls work.
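A sketch of the mocking approach with Python's unittest.mock; the repository class and its methods are invented for the example:

from unittest import mock
import unittest

# Hypothetical production code: a repository object the rest of the app calls.
class UserRepository:
    def find_by_name(self, name):
        raise NotImplementedError("would normally hit the database")

def greeting_for(repo, name):
    user = repo.find_by_name(name)
    return "Hello, {}!".format(user["name"]) if user else "Who are you?"

class GreetingTest(unittest.TestCase):
    def test_known_user_is_greeted(self):
        repo = mock.Mock(spec=UserRepository)
        repo.find_by_name.return_value = {"name": "alice"}  # canned "database" result
        self.assertEqual(greeting_for(repo, "alice"), "Hello, alice!")
        repo.find_by_name.assert_called_once_with("alice")

    def test_unknown_user(self):
        repo = mock.Mock(spec=UserRepository)
        repo.find_by_name.return_value = None
        self.assertEqual(greeting_for(repo, "nobody"), "Who are you?")

if __name__ == "__main__":
    unittest.main()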
Finally, you can build up a set of libraries which generate semi-random data for you. I call this "The Sims Technique" after the video game where you create fake families, torture them and then throw them away. For example, let's say you have a User object which needs a name, an age, a Payment object and a Session object. To test a User you might want users with different names, ages, ability to pay and login status. To control all that you need to generate test data for names, ages, Payments and Sessions. So you write a function to generate names and one to generate ages. These can be as simple as picking randomly from a list. Then you write one that makes a Payment object and one that makes a Session object. By default, all the attributes will be random, but valid... unless you specify otherwise. For example...
# Generate a random login session, but guarantee that it's logged in.
session = Session.sim(logged_in=True)
Then you can use this to put together an interesting User.
# A user who is logged in but has an invalid Visa card.
# Their name and age will be random but valid.
user = User.sim(
    session=Session.sim(logged_in=True),
    payment=Payment.sim(invalid=True, type="Visa"),
)
This has all the advantages of test fixtures, but since some of the data is unpredictable it also has some of the advantages of real data. Adding "interesting" data to your default sim and rand functions will have wide-ranging repercussions. For example, adding a Unicode name to random_name will likely uncover all sorts of interesting bugs! Unfortunately, it is expensive and time-consuming to build up.
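A minimal Python sketch of the idea in the pseudocode above (class and field names are invented; Payment and Session are stripped down to one or two attributes):

import random

class Session:
    def __init__(self, logged_in):
        self.logged_in = logged_in

    @classmethod
    def sim(cls, logged_in=None):
        # Random but valid by default; pin down only what the test cares about.
        if logged_in is None:
            logged_in = random.choice([True, False])
        return cls(logged_in)

class Payment:
    def __init__(self, type, invalid):
        self.type, self.invalid = type, invalid

    @classmethod
    def sim(cls, type=None, invalid=False):
        return cls(type or random.choice(["Visa", "MasterCard", "PayPal"]), invalid)

class User:
    def __init__(self, name, age, payment, session):
        self.name, self.age = name, age
        self.payment, self.session = payment, session

    @classmethod
    def sim(cls, name=None, age=None, payment=None, session=None):
        name = name or random.choice(["Alice", "Bob", "Renée", "李雷"])
        age = age if age is not None else random.randint(18, 90)
        return cls(name, age, payment or Payment.sim(), session or Session.sim())

# A user who is logged in but has an invalid Visa card;
# the name and age are random but valid.
user = User.sim(
    session=Session.sim(logged_in=True),
    payment=Payment.sim(invalid=True, type="Visa"),
)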
There you have it. Unfortunately there's no easy answer to the database problem, but I implore you to not simply copy the production database as it's a losing proposition in the long run. You'll likely do a hybrid of all the choices: copying, fixtures, mocking, semi-random data.
A few options, in order of increasing complexity:
You all connect to the live master DB, read/write permissions. This is risky, but I guess you're already doing it. Make sure you have backups!
Use test fixtures to populate a local test DB and just use it. Not sure what tools there are for this in the PHP world.
Copy (mysqldump) the master database and import it into your local machines' MySQL instances, then set up your dev environments to connect to your local MySQL. Repeat the dump/import as necessary (a rough script for this follows the list).
Set up one-way replication from the master to your local instances.
Optionally, set up a read-only user on the main DB, and configure your app to let you switch to a read-only connection to the real master DB in case you can't wait for that next copy of the master data.
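For the third option, the dump-and-import round trip can be wrapped in a small script, roughly like this (host, user, and database names are hypothetical; credentials are assumed to come from ~/.my.cnf):

import subprocess

MASTER = {"host": "db.example.com", "user": "readonly", "db": "myapp"}   # hypothetical
LOCAL = {"user": "dev", "db": "myapp_dev"}                               # hypothetical

def refresh_local_copy(dump_file="master_dump.sql"):
    # Dump the master database to a file...
    with open(dump_file, "w") as out:
        subprocess.check_call(
            ["mysqldump", "-h", MASTER["host"], "-u", MASTER["user"],
             "--single-transaction", MASTER["db"]],
            stdout=out,
        )
    # ...and load it into the local MySQL instance.
    with open(dump_file) as dump:
        subprocess.check_call(["mysql", "-u", LOCAL["user"], LOCAL["db"]], stdin=dump)

if __name__ == "__main__":
    refresh_local_copy()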
Having your own repo does not mean having your own staging server (that kind of setup is hard to maintain and scales extremely badly to 10-20-100 developers).
It's always better to have, as early as possible, a (semi-)automated build system that converts the repository-stored source data into the live system (less handwork means fewer chances to introduce non-code errors), and perhaps some form of Continuous Integration (test often, find bugs fast). For the database part of the build system you only have to prepare the initial data (table structures, data dumps) as (versioned) text files, which are
easily mergeable between branches
handled, processed, and converted into the final usable objects by code, not by hand: no human errors, no interference between operations
I need to collect statistics from my server application written in Python. I am looking for some general guidance on how to set up models and exactly how to store the statistics information. I was thinking of storing and organizing all this information in a database, but my implementation is turning out to be too specific.
I need to collect stats like active users, requests processed and things like that over time.
Are there any guides or techniques out there to create some more generic statistics storage systems?
As with most software problems, there is no single solution I can recommend that will solve yours. But I have created a few similar programs, and here are some things I found worked well.
Create an asynchronous logging service so the logging doesn't adversely affect your code's performance (a minimal sketch follows this list). You still need to be mindful of where you are storing your data, where it is processed, etc., because you can significantly degrade performance if you're not careful. I have found that creating a web service is often convenient.
Try to save as much information about the request as possible. In the future this will make it easier to add new queries and reports.
Normalize your data.
Always include the time the action was performed. If you can capture the run time, that is typically useful too.
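As a minimal sketch of the asynchronous idea (an in-process queue plus a background writer thread; the file name and event fields are invented):

import json
import queue
import threading
import time

# Callers only pay the cost of putting an event on an in-memory queue;
# a background thread does the slow part (writing to disk, posting to a
# web service, inserting into a database, ...).
_events = queue.Queue()

def record(event_type, **fields):
    fields.update(type=event_type, timestamp=time.time())  # always include the time
    _events.put(fields)

def _writer():
    with open("stats.log", "a") as out:
        while True:
            out.write(json.dumps(_events.get()) + "\n")
            out.flush()

threading.Thread(target=_writer, daemon=True).start()

# Usage from request-handling code:
# record("request", path="/api/users", user_id=42, duration_ms=17)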
One approach is to do this in stages: store activity logs, including requests and users, as text files; later, mine the logs into data points (Python can do this easily). You may want to use Python's logging library for the logging stage. In general, start with high time-resolution logging, which you can later aggregate into hourly, daily, or weekly summaries.
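For example, the mining stage can be as simple as this, assuming each log line is a JSON event with a Unix timestamp (an arbitrary format chosen for illustration):

import json
from collections import Counter
from datetime import datetime

def hourly_request_counts(log_path="stats.log"):
    # Roll high-resolution events up into requests-per-hour.
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            event = json.loads(line)
            if event.get("type") != "request":
                continue
            hour = datetime.fromtimestamp(event["timestamp"]).strftime("%Y-%m-%d %H:00")
            counts[hour] += 1
    return counts

# for hour, n in sorted(hourly_request_counts().items()):
#     print(hour, n)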
I'm not really asking whether I should use an RDBMS or config files for 100% of my application configuration, but rather what kind of configuration is best addressed by each method.
For example, I've heard that "any kind of configuration that is not changeable by the end-user" should be in config files rather than the database. Is this accurate? How do you address configuration?
(I'm primarily concerned with many-user web applications here, but no particular platform.)
I find that during development it is of great benefit to have configuration stored in a file.
It is far easier to check out a file (web.config, app.config, or some custom file) and make changes that are instantly picked up when the code is run. There is a little more friction involved in working with configuration stored in a database. If your team uses a single development database you could easily impact other team members with your change, and if you have individual databases it takes more than a "get latest" to be up and running with the latest configuration. Also, the flexibility of XML makes it more natural to store configuration that is more than just "name-value" pairs in a file than in a relational DB.
The drawback is where you want to reuse the configuration across multiple apps or web site instances. In my own case, we have a single config file in a well-known location that can be referenced by any application.
At least, this is how we store "static" configuration that does not have to be updated by the system at runtime. User settings are probably more suited to storage in the DB.
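To make the split concrete, here is a rough Python sketch (the file name, keys, and table layout are all invented): static configuration comes from a versioned file, while per-user settings live in the application database.

import configparser
import sqlite3

# Static configuration: a versioned file in a well-known location.
config = configparser.ConfigParser()
config.read("app.ini")   # hypothetical file; read() simply skips missing files
smtp_host = config.get("mail", "smtp_host", fallback="localhost")

# Per-user, runtime-changeable settings: a table in the application database.
db = sqlite3.connect("app.db")   # hypothetical; any RDBMS works the same way
db.execute("""CREATE TABLE IF NOT EXISTS user_settings
              (user_id INTEGER, name TEXT, value TEXT,
               PRIMARY KEY (user_id, name))""")

def set_user_setting(user_id, name, value):
    db.execute("INSERT OR REPLACE INTO user_settings VALUES (?, ?, ?)",
               (user_id, name, value))
    db.commit()

def get_user_setting(user_id, name, default=None):
    row = db.execute("SELECT value FROM user_settings WHERE user_id = ? AND name = ?",
                     (user_id, name)).fetchone()
    return row[0] if row else default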
The one-liner: as a general principle, the more likely the config data is to change, the better it is to put it in the database.
The legal disclaimer:
You will almost always need some kind of "bootstrapping" configuration, which must be saved in a file. So if you are using a database to store your configuration, how much goes into that "bootstrapping" config depends on the other great principle:
"Work smarter not harder !!!"
One thing to consider is how much config data there is, and perhaps how often it is likely to change. If the amount of data is small, then saving it in a database (if you're not already using a database for anything else) would be overkill; equally, maintaining a database for something that gets changed once every 6 months would probably be a waste of resources.
That said, if you're already using a database for other parts of your site, then adding a table or two for configuration data is probably not a big issue, and may fit in well with the way you are storing the rest of your data. If you already have a class for saving your data to a database, why write a new one to save to a config file?