Should InfluxDB be used for monitoring networks, server status (like MySQL) and API data (e.g. Yahoo Finance)? What are the main pros versus client software such as Wireshark?
InfluxDB, even in the community edition (single instance only), can handle a huge amount of incoming data: thousands of timeseries and millions of data values, as long as you have sufficient storage for that amount of data. By default InfluxDB retains incoming data forever; you can configure a retention policy per database if you're only interested in, e.g., the last 30 days.
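For example, a rough sketch of creating a 30-day default retention policy with the InfluxDB 1.x Python client (the database name "telegraf" is just a placeholder):
# Sketch only: assumes InfluxDB 1.x on localhost and a database named "telegraf".
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086)
client.create_database("telegraf")  # no-op if it already exists
client.query(
    'CREATE RETENTION POLICY "thirty_days" ON "telegraf" '
    'DURATION 30d REPLICATION 1 DEFAULT'
)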
For monitoring MySQL have a look at Telegraf's MySQL plugin; Telegraf is a data collector that should run on the MySQL server. InfluxDB is "just" a timeseries database, neither a data collector nor a monitoring tool.
With a simple configuration (in /etc/telegraf/telegraf.conf) you can collect some basic metrics:
[[inputs.mysql]]
servers = ["tcp(127.0.0.1:3306)/"]
Besides the database itself, you might want to monitor system status (CPU, memory, disk, network):
[[inputs.cpu]]
fielddrop = ["time_*"]
percpu = false
totalcpu = true
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
interfaces = ["eth0"]
Of course you're not limited to Telegraf for collecting metrics; you could use collectd, statsd, etc., but integration via Telegraf is probably the easiest way.
Wireshark is a tool for packet inspection; it's a completely different category of tool. Wireshark's output could probably be used for monitoring SQL queries on the fly (after a lot of parsing), but that kind of data is not suitable for a timeseries database (you could store it in Elasticsearch or some column database).
A timeseries database typically stores metrics (number of packets, number of queries, number of connections) and aggregates them over time.
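Once Telegraf is writing into InfluxDB, that aggregation is a single query. A rough sketch with the InfluxDB 1.x Python client, assuming the default "telegraf" database and the cpu input from the config above:
# Sketch only: average CPU idle in 5-minute buckets over the last hour.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")
result = client.query(
    'SELECT MEAN("usage_idle") FROM "cpu" '
    'WHERE time > now() - 1h GROUP BY time(5m)'
)
for point in result.get_points():
    print(point["time"], point["mean"])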
Related
I have two MySQL RDS instances (hosted on AWS). One of these instances is my "production" RDS, and the other is my "performance" RDS. Both have the same schema and tables.
Once a year, we take a snapshot of the production RDS and load it into the performance RDS, so that our performance environment has data similar to production. This process takes a while; there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) between our production and performance MySQL RDS's.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case: I could write a "fetcher" program that gets all new data from the prod MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (deciding what to send, add, or obfuscate) and sends it to my destination (the performance RDS, or, if I can't write to it directly, a consumer HTTP endpoint I write that updates the performance RDS).
I'm not sure which of these tools to use. DMS seems to be built for migrating heterogeneous data rather than homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could build something with Kinesis Data Streams, but the fact that I'd have to write a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
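The transform Lambda I have in mind would be roughly like the sketch below (names and record structure are made up; it only shows the email obfuscation and the "insert without overwriting" idea):
# Sketch only: assumes each Kinesis record carries one changed row as JSON.
import base64
import hashlib
import json

def obfuscate_email(email):
    # Replace the address with a stable hash so the value stays unique.
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]
    return digest + "@example.invalid"

def handler(event, context):
    transformed = []
    for record in event["Records"]:
        row = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if "email" in row:
            row["email"] = obfuscate_email(row["email"])
        transformed.append(row)
    # Writing to the performance RDS would happen here, using INSERT IGNORE-style
    # statements so existing rows are never overwritten (omitted in this sketch).
    return {"rows": len(transformed)}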
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
I have installed Zabbix 4.0 for remote monitoring of a Linux server. My first understanding is that the Zabbix agent monitors the server and sends the data to a MySQL database for storage. The Zabbix frontend retrieves the data from the MySQL database and shows the above-said metrics (in the form of graphs), as shown in the attached image.
Now, instead of directly viewing them in the web interface, I want to build an ML model from metrics like CPU utilization/load, memory utilization, hard disk usage, and traffic in/out. I checked all the columns of all the tables in the MySQL database to retrieve these metrics, but I could not find any columns or tables that store them. My second understanding is that the Zabbix frontend constructs these graphs indirectly from the columns stored in the MySQL database tables.
I want to know whether both of my understandings are correct.
Assuming both understandings are true, I also want to know how I can extract metrics like CPU utilization/load, memory utilization, hard disk usage, and traffic in/out from the data stored in the MySQL database to build the ML model.
If my understandings are false, how should I collect these metrics?
Any details or documentation that could help would be appreciated.
Zabbix data is stored in the MySQL database in various tables (history and trends, differentiated by data type).
The difference between history and trend is described here.
I strongly advise against querying MySQL directly, because of the complexity and compatibility concerns.
The best course of action is to extract the data through the API (history.get and trend.get) and feed it to your ML model.
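A rough sketch of that API call with Python and the requests library (URL, credentials, and item ID are placeholders; Zabbix 4.0 logs in with "user"/"password"):
# Sketch only: pull numeric history for one item via the Zabbix JSON-RPC API.
import requests

ZABBIX_URL = "http://zabbix.example.com/api_jsonrpc.php"  # placeholder host

def call(method, params, auth=None):
    payload = {"jsonrpc": "2.0", "method": method, "params": params,
               "auth": auth, "id": 1}
    resp = requests.post(ZABBIX_URL, json=payload)
    resp.raise_for_status()
    return resp.json()["result"]

token = call("user.login", {"user": "api_user", "password": "secret"})

history = call("history.get", {
    "output": "extend",
    "itemids": "12345",    # placeholder item ID (e.g. a CPU utilization item)
    "history": 0,          # 0 = numeric float values
    "sort_field": "clock",
    "sortorder": "ASC",
    "limit": 1000,
}, auth=token)

# Each entry has "clock" (unix timestamp) and "value": ready for the ML pipeline.
rows = [(int(h["clock"]), float(h["value"])) for h in history]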
Zabbix itself supports predictive triggers, but I haven't implemented them yet.
My company would like to use Change Data Capture (CDC) to replace the exchange of interface files between an upstream system and several downstream systems. The upstream system runs on an Oracle database and contains the superset of the data, while the downstream systems run on MySQL and contain subsets of the data that are not totally mutually exclusive. We decided to use CDC because we would like to benefit from:
Data transfer by delta instead of full set
Automatic data synchronization
Automatic re-send if a data transfer is interrupted
However, compared with interface files, we found the following drawbacks of CDC:
Too complex from an architectural point of view
High security-control demands at both ends and on the network in between
Complex data management, as different recipients need different sets of data
Creates a single point of failure
The transferred data is not transparent, compared with a plain-text file
Difficult to control the effective time of data in the downstream systems if synchronization is real-time
Considerably higher cost than file transfer
How can we overcome the above disadvantages?
I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one), and some large graph/chart queries take 20 seconds: so long, in fact, that our daemon often intervenes to kill them.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this: is there some better way of storing and querying time series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on Compute Engine, or use their hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand the query language that way).
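As an illustration of that kind of call, here is a rough sketch of a date_histogram aggregation over an index of events (index and field names are made up; the example uses Python's requests, but the same JSON body works from PHP or curl):
# Sketch only: count events per hour over the last 7 days.
import requests

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-7d/d"}}},
    "aggs": {
        "events_per_hour": {
            "date_histogram": {"field": "timestamp", "interval": "1h"}
        }
    },
}

resp = requests.post("http://localhost:9200/events/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["events_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])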
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a website. I even wrote a statsd backend that stores stat information in Elasticsearch.
After Elasticsearch moved Kibana from 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric per query, so if you want to chart time, average time, and 90th-percentile time you must run three queries instead of one that returns three values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana, a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
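As a rough sketch of how little it takes to get stats flowing, this is the plain-text statsd protocol over UDP (metric names are made up; statsd listens on UDP 8125 by default):
# Sketch only: emit a counter and a timer to a local statsd daemon.
import socket

def statsd_send(metric, addr=("127.0.0.1", 8125)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(metric.encode("ascii"), addr)
    sock.close()

statsd_send("web.page_view:1|c")         # counter: one page view
statsd_send("web.response_time:120|ms")  # timer: 120 ms response time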
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop, Hive, or HBase.
Currently I have 4 physical machines for a POC. Please help me come up with the most efficient architecture.
I will get 5 GB of data per day.
I have to send a daily status report to each customer.
I have to provide analysis reports on request: for example, a 1-week report, or a report for the first 2 weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want to get the best performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or emerging technologies such as Spark. The advantages are a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It's batch processing, and 5 more GB a day shouldn't have a big impact. It does have high overhead latency, but that shouldn't be a problem if you run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with one of them. HBase uses the DFS to store its data, and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from either of these two sources.
For reports based on a request specifying a date range, you should store the data in an efficient way so you don't have to ingest data that's not needed for your report. Hive supports partitioning, and that can be done by date (e.g. /<year>/<month>/<day>/). Using partitioning can significantly reduce your job execution times.
If you go with the NoSQL approach, make sure the rowkeys have some date format as a prefix (e.g. 20140521...) so that you can select the rows that start with the dates you want.
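A rough sketch of both layout ideas (the table path and customer ID are made up for illustration):
# Sketch only: build a Hive-style date partition path and a date-prefixed rowkey.
from datetime import datetime

event_time = datetime(2014, 5, 21, 14, 30)
customer_id = "cust-001"  # placeholder ID

# Hive-style partition directory: a date-range report only scans the matching folders.
partition_path = event_time.strftime("/events/%Y/%m/%d/")

# Date-prefixed rowkey: selecting a date range becomes a simple prefix scan.
rowkey = event_time.strftime("%Y%m%d") + "_" + customer_id

print(partition_path)  # /events/2014/05/21/
print(rowkey)          # 20140521_cust-001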
Some questions you should also consider are:
How much data do you want to store in your cluster, e.g. the last 180 days? This will affect the number of nodes / disks. Beware that data is usually replicated 3 times.
How many files do you have in HDFS? When the number of files is high, the NameNode is hit hard on retrieval of file metadata. Solutions exist, such as a replicated NameNode or the MapR Hadoop distribution, which doesn't rely on a NameNode per se.