Difference between data extractor agents and dms agents in AWS SCT - aws-sct

I am a bit confused about the different kinds of agents present in the AWS SCT archive. I am using version 1.0.624 of SCT.
agents - data extractors tool rpm
dms agents - dms agent rpm
As per my understanding, both of them are used to extract data from a data warehouse (on-premises or an RDS instance) to Amazon Redshift in two steps: first the agent copies the data to S3 or Snowball, and from there it gets copied to Redshift.
Now my question is: what is the difference between the two? Why are there different agents present in the AWS SCT archive?
Thanks in advance !!!

agents - data extractors tool rpm <-- this is used by SCT to perform the one-time extraction and transfer to S3, Redshift, etc. described here https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/agents.html
dms agents - dms agent rpm <-- this is used by the DMS service with Snowball Edge, described here https://docs.aws.amazon.com/dms/latest/userguide/CHAP_LargeDBs.Process.html
I agree the documentation could be improved.

Related

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS instances (hosted on AWS). One of these instances is my "production" RDS, and the other is my "performance" RDS. Both have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one-time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) from our production MySQL RDS to our performance MySQL RDS.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case: I could write a "fetcher" program that reads all new data from the prod MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (and decides what data to send/add/obfuscate) and sends it to my destination (the performance RDS, or, if I can't do that directly, a consumer HTTP endpoint I write that updates the performance RDS); a rough sketch of such a transform is shown below.
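For illustration only, here is a minimal sketch of what that Lambda-side transform could look like, assuming the fetcher publishes binlog rows to Kinesis as JSON objects with an "email" field (all names here are hypothetical, not an existing implementation):

import base64
import hashlib
import json

def obfuscate_email(email):
    # Replace the local part with a stable hash so rows stay distinguishable.
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:12]
    return "user_{}@{}".format(digest, domain or "example.invalid")

def handler(event, context):
    transformed = []
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if "email" in payload:
            payload["email"] = obfuscate_email(payload["email"])
        transformed.append(payload)
    # From here you would upsert `transformed` into the performance RDS
    # (e.g. via a MySQL client library), taking care to only insert new rows.
    return {"processed": len(transformed)}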
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?

Filebeat Central Management Alternative

We have an on-premises setup of the free version of the ELK stack: an Elasticsearch cluster and some Kibana nodes (no Logstash).
On the application servers we have installed Filebeat 7.9.0, which ships the logs to the Elasticsearch ingest nodes; there is very minimal processing done by Filebeat on the log events before sending (e.g. multiline=true, dissect, drop_fields and json_decode).
As of today, there are only 3 application servers in the production set-up, but it might scale to more machines (application servers) going forward.
I understand that central management of the Filebeat configuration is possible with a licensed version of the ELK stack (and that this feature is also approaching its end of life).
I want to know what alternatives are available for managing the Filebeat configuration apart from central management through Kibana.
The goal: if in the future the number of application servers grows to, let's say, 20, and the Filebeat configuration has to change, updating the configuration on each server by hand would be a manual activity with its own associated risks. I'd rather change the configuration in one place and somehow have it updated on Filebeat on all application servers.
Please let me know if this can be achieved.
Any pointers / thoughts towards a solution are welcome.
Note: we do not have infrastructure as code in the organization yet, so that may not be a suitable solution.
Thanks in advance ..
The replacement for Central Management is Elastic Fleet: you install a single Elastic Agent on each server, and the rest can be done from Kibana. https://www.elastic.co/blog/introducing-elastic-agent-and-ingest-manager gives a better overview of the features and current screenshots.
Most parts of Fleet are also available for free.

AWS - How to find all AWS products that interact with a MySQL DB?

Recently I was given responsibility for a product running on AWS. This product includes several web services running on EC2 instances and several databases (MySQL, DynamoDB, etc.).
Since there is barely any documentation, I'm still trying to wrap my head around the architecture of the data flow. Is there a way, given a known MySQL DB for example, to map out its interactions with other AWS resources? For instance: some EC2 instance writes/reads data from the database; a Lambda pushes data from that database to S3, and so on.
I've tried looking at CloudTrail logs and found them pretty insufficient. Is there another way?
I would prefer it if this could be done through the AWS CLI, obviously, but a GUI solution could also work if one exists.

AWS MySQL to GCP BigQuery data migration

I'm planning a data migration from AWS MySQL instances to GCP BigQuery. I don't want to migrate every MySQL database, because ultimately I want to create a data warehouse using BigQuery.
Would exporting the AWS MySQL DBs to S3 buckets as CSV/JSON/Avro, then transferring them to GCP buckets, be a good option? What would be the best practices for this data pipeline?
If this were a MySQL-to-MySQL migration, there would be other possible options, but in this case the option you mentioned works well. Also, remember that your MySQL database will keep getting updated, so your destination DB might miss some records, because this is not a real-time DB transfer.
Your proposal of exporting to S3 files should work OK, and to export the files you can take advantage of the AWS Database Migration Service.
With that service you can do either a once-off export to S3, or an incremental export with Change Data Capture (CDC). Unfortunately, since BigQuery is not really designed for working with changes on its tables, implementing CDC can be a bit cumbersome (although totally doable). You also need to take into account the cost of transferring data across providers.
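If you go that route, a minimal sketch of the final load into BigQuery could look like the following, assuming the exported Avro files have already been copied from S3 into a GCS bucket and the google-cloud-bigquery client library is installed (project, bucket, dataset and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # full reload
)

load_job = client.load_table_from_uri(
    "gs://my-migration-bucket/orders/*.avro",   # files copied over from S3
    "warehouse_dataset.orders",                 # destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("warehouse_dataset.orders").num_rows, "rows loaded")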
Another option, which would be much easier for you, is to use the same AWS Database Migration Service to move data directly to Amazon Redshift.
In this case, you would get change data capture automatically, so you don't need to worry about anything, and Redshift is an excellent tool to build your data warehouse on.
If you don't want to use Redshift for any reason, and you prefer a fully serverless solution, you can use the AWS Glue Data Catalog to catalog the exported files and query them with Amazon Athena.
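As a rough sketch of that serverless path (table, database, bucket and region names below are hypothetical), once a Glue crawler has catalogued the exported files you could query them with Athena through boto3:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a query against the Glue-catalogued table; results land in S3.
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM orders",
    QueryExecutionContext={"Database": "migration_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)
    print(rows["ResultSet"]["Rows"])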
The cool thing about the AWS-based solutions is that everything is tightly integrated: you can use the same account/users for billing, IAM and monitoring, and since you are moving data within a single provider there is no extra charge for networking, lower latency, and potentially fewer security issues.

InfluxDB heavy usage in monitoring

Should InfluxDB be used for monitoring networks, server status (like MySQL) and API data (e.g. Yahoo Finance)? What are the main pros versus client software such as Wireshark?
InfluxDB, even in the community edition (single instance only), can handle a huge amount of incoming data: thousands of timeseries and millions of data values, provided you have sufficient storage for that amount of data. By default InfluxDB retains incoming data forever; you can configure a retention policy per database if you're only interested in, e.g., the last 30 days.
For monitoring MySQL, have a look at Telegraf's MySQL plugin; Telegraf is a data collector that should run on the MySQL server. InfluxDB is "just" a timeseries database, not a data collector or a monitoring tool.
With a simple configuration (in /etc/telegraf/telegraf.conf) you can get some basic MySQL metrics:
# Collect MySQL server metrics over a local TCP connection.
[[inputs.mysql]]
  servers = ["tcp(127.0.0.1:3306)/"]
Besides the database itself, you might want to monitor system status (CPU, memory, disk, network):
# Aggregate CPU usage; drop the raw time_* counter fields and per-CPU series.
[[inputs.cpu]]
  fielddrop = ["time_*"]
  percpu = false
  totalcpu = true

# Disk usage and disk I/O.
[[inputs.disk]]
[[inputs.diskio]]

# Kernel, memory and network statistics (limited to eth0).
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
  interfaces = ["eth0"]
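Once Telegraf is writing into InfluxDB, a quick sanity check from Python might look like this hedged sketch, assuming Telegraf's default output database named "telegraf" and the influxdb (1.x) Python client library:

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")

# Average idle CPU over the last hour, in 5-minute buckets, as collected
# by the [[inputs.cpu]] plugin above.
result = client.query(
    "SELECT mean(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY time(5m)"
)
for point in result.get_points():
    print(point["time"], point["mean"])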
Of course you're not limited to using just Telegraf for collecting metrics (you could use collectd, statsd, etc.), but the integration with Telegraf is probably the easiest way.
Wireshark is a tool for packet inspection; it's a completely different category of tool. Wireshark's output could probably be used for monitoring SQL queries on the fly (after doing a lot of parsing), but that kind of data is not suitable for a timeseries database (you could store it in Elasticsearch or some column database instead).
A timeseries database typically stores metrics (number of packets, number of queries, number of connections) and aggregates them over time.