I am looking to move some MySQL databases to the cloud in Amazon Redshift. Currently I am creating a Python script to convert the tables to CSVs, encrypt them, put them in S3, then COPY the data into Redshift. However, the way it is set up I would have to copy the data one table at a time. I have read that you can split your data into multiple files and upload them in parallel, however I believe this is still only for loading data into one table. Is there a way to use COPY on multiple tables at once? Having to copy data over from each table individually seems very inefficient.
All of your statements are correct.
The COPY command can load from multiple files in parallel (in fact, that is recommended because it spreads the load job across multiple nodes), but it only loads one table per COPY command.
You could connect to Redshift via multiple sessions and run a COPY command in each session to load multiple tables simultaneously (but be careful of the impact on production users).
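For example, a minimal Python sketch of that approach could look like the following: one thread and one Redshift session per table, each issuing its own COPY against an S3 prefix that holds that table's split files. The table names, bucket, cluster endpoint, credentials and IAM role below are all placeholders.

```python
# One thread per table; each thread opens its own Redshift session and runs COPY.
# Table names, bucket, endpoint, credentials and IAM role are placeholders.
from concurrent.futures import ThreadPoolExecutor

import psycopg2  # any Redshift-compatible PostgreSQL driver works

TABLES = ["orders", "customers", "line_items"]            # hypothetical table list
S3_PREFIX = "s3://my-bucket/exports/{table}/"              # each prefix holds that table's split files
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopy"   # placeholder role ARN

def copy_table(table):
    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",
        port=5439, dbname="dw", user="loader", password="...",
    )
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(
            f"COPY {table} "
            f"FROM '{S3_PREFIX.format(table=table)}' "
            f"IAM_ROLE '{IAM_ROLE}' "
            f"FORMAT AS CSV GZIP;"
        )
    conn.close()

# Run several COPY commands simultaneously, one session per table.
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(copy_table, TABLES)
```

Because each prefix contains multiple files, Redshift can still spread the load of every individual table across its slices, while the threads load different tables at the same time.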
If you wish to migrate data from an on-premises database to Amazon Redshift, consider using:
AWS Schema Conversion Tool
AWS Database Migration Service
The Database Migration Service can even perform ongoing updates of Redshift whenever data is updated in the source database.
I want to create a connector (something like Debezium in Kafka Connect) to reflect every change in a MySQL source database in BigQuery tables.
There is one problem: the source database is dropped and re-created every 10 minutes. Some of the rows are the same, some are updated and some are totally new. So I cannot do it via Debezium, because every 10 minutes I would get all the records in Kafka again.
I want to migrate only new or updated values to the BQ tables. The mechanism would "copy" the whole source database but deduplicate records (which will not be exactly the same, because this will be a new database). So, for example, create a hash of every record and check it: if the hash is already in BQ, skip it; if not, add it.
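For illustration, the hash-and-compare mechanism could be sketched roughly like this in Python with the BigQuery client library; the project, dataset, table and row_hash column are made-up names, and pulling every existing hash into memory only works for modestly sized tables.

```python
# Rough sketch of the hash-and-compare idea; project/dataset/table and the
# row_hash column are hypothetical. Loading all existing hashes into memory
# only works for moderately sized tables.
import hashlib

from google.cloud import bigquery

client = bigquery.Client()
TARGET = "my_project.my_dataset.customers"  # target table with an extra row_hash column

def row_hash(row: dict) -> str:
    # Deterministic hash of the whole record (keys are sorted so order never matters).
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def load_new_rows(source_rows):
    existing = {r["row_hash"] for r in client.query(f"SELECT row_hash FROM `{TARGET}`").result()}
    new_rows = []
    for row in source_rows:
        h = row_hash(row)
        if h not in existing:
            new_rows.append({**row, "row_hash": h})
    if new_rows:
        client.insert_rows_json(TARGET, new_rows)  # streaming insert; batch loads also work
```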
I think it should be this:
Best effort de-duplication
but how do I create the whole pipeline with MySQL as a source?
Cloud Data Fusion Replication lets you replicate your data continuously and in real time from operational data stores, such as SQL Server and MySQL, into BigQuery.
To use Replication, you create a new instance of Cloud Data Fusion and add the Replication app.
In short, you do the following:
Set up your MySQL database to enable replication.
Create and run a Cloud Data Fusion Replication pipeline.
View the results in BigQuery.
You can see more at Replicating data from MySQL to BigQuery.
I am using an Oracle database but am open to using other databases, so I am tagging all of them.
I am designing a system in which I have to load all the data of an existing database table into a new database, and whatever changes happen in the existing database should be reflected in the new database on a daily basis. My approach is:
I will copy all the data of the existing database to the new database.
Then I will create a trigger which will record all the changes to the table (all the DML operations) and store them in another table.
Once a day my API will read the data generated by the trigger and copy it into the new system. I don't need live data, so I will schedule the job to copy data into the new database only once a day.
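To make step 3 concrete, here is a minimal sketch of such a daily job, assuming a hypothetical CHANGE_LOG table populated by the trigger; it uses python-oracledb, but any DB-API driver would look the same.

```python
# A minimal sketch of the daily sync, assuming a hypothetical CHANGE_LOG table
# that the trigger fills with (change_id, table_name, operation, pk_value).
import oracledb

src = oracledb.connect(user="app", password="...", dsn="source_db")
dst = oracledb.connect(user="app", password="...", dsn="target_db")

with src.cursor() as log_cur, dst.cursor() as apply_cur:
    log_cur.execute(
        "SELECT change_id, table_name, operation, pk_value "
        "FROM change_log WHERE processed = 'N' ORDER BY change_id"
    )
    for change_id, table_name, operation, pk_value in log_cur.fetchall():
        # Apply each change to the new database; the actual statement depends on
        # your schema (e.g. re-read the row from the source and MERGE it into the target).
        pass

    # Mark this batch as processed so tomorrow's run starts where this one stopped.
    with src.cursor() as mark_cur:
        mark_cur.execute("UPDATE change_log SET processed = 'Y' WHERE processed = 'N'")

src.commit()
dst.commit()
```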
Is this the proper approach? Any suggestions?
Common practice would be to back up your primary instance and restore it on the secondary once a day.
You could schedule the backup and restore in sequence as daily jobs.
If your copy database is SQL Server, then I suggest you use a linked server. From the documentation:
Linked servers enable you to implement distributed databases that can fetch and update data in other databases. They are a good solution in the scenarios where you need to implement database sharding without need to create a custom application code or directly load from remote data sources. Linked servers offer the following advantages:
The ability to access data from outside of SQL Server.
The ability to issue distributed queries, updates, commands, and transactions on heterogeneous data sources across the enterprise.
The ability to address diverse data sources similarly.
You can find more information in the documentation: https://learn.microsoft.com/en-us/sql/relational-databases/linked-servers/linked-servers-database-engine?view=sql-server-ver15
I'm planning a data migration from AWS MySQL instances to GCP BigQuery. I don't want to migrate every MySQL database, because ultimately I want to create a data warehouse using BigQuery.
Would exporting the AWS MySQL DBs to S3 buckets as CSV/JSON/Avro, then transferring them to GCP buckets, be a good option? What would be the best practices for this data pipeline?
If this were a MySQL to MySQL migration, there would be other possible options, but in this case the option you mentioned is perfect. Also, remember that your MySQL database will keep getting updated, so your destination DB might miss some records, because it is not a real-time DB transfer.
Your proposal of exporting to S3 files should work OK, and to export the files you can take advantage of the AWS Database Migration Service.
With that service you can do either a once-off export to S3, or an incremental export with Change Data Capture. Unfortunately, since BigQuery is not really designed for working with changes on its tables, implementing CDC can be a bit cumbersome (although totally doable). You need to take into account the cost of transferring data across providers.
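If you do go the CDC route, one common way to apply the changes is to load each batch into a staging table and run a MERGE into the target. The sketch below shows the idea; all project, dataset, table and column names are placeholders.

```python
# Sketch of the "cumbersome but doable" CDC apply step in BigQuery: load each
# batch of changes into a staging table, then MERGE it into the target table.
# All dataset/table/column names here are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.dw.orders` AS t
USING `my_project.staging.orders_changes` AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (order_id, status, updated_at) VALUES (s.order_id, s.status, s.updated_at)
"""
client.query(merge_sql).result()  # runs the MERGE as a standard SQL job
```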
Another option, which would be much easier for you, is to use the same AWS Database Migration Service to move data directly to Amazon Redshift.
In this case, you would get change data capture automatically, so you don't need to worry about anything. And Redshift is an excellent tool to build your data warehouse.
If you don't want to use Redshift for any reason, and you prefer a fully serverless solution, then you can easily use the AWS Glue Data Catalog to read from your databases and query the data with Amazon Athena.
The cool thing about the AWS-based solutions is that everything is tightly integrated: you can use the same account/users for billing, IAM, monitoring... and since you are moving data within a single provider, there is no extra charge for networking, no latency, and potentially fewer security issues.
How should you go about restoring (and backing) up a MySQL database "safely"? By "safely" I mean: the restore should create/overwrite a desired database, but not risk altering anything outside that database.
I have already read https://dev.mysql.com/doc/refman/5.7/en/backup-types.html.
I have external users. They & I may want to exchange backups for restore. We do not have a commercial MySQL Enterprise Backup, and are not looking for a third-party commercial offering.
In Microsoft SQL Server there are BACKUP and RESTORE commands. BACKUP creates a file containing just the database you want; both its rows and all its schema/structure are included. RESTORE accepts such a file, and creates or overwrites its structure. The user can restore to a same-named database, or specify a different database name. This kind of behaviour is just what I am looking for.
In MySQL I have come across 3 possibilities:
Most people seem to use mysqldump to create a "dump file", and mysql to read that back in. The dump file contains a list of arbitrary MySQL statements, which are simply executed by mysql. This is quite unacceptable: the file could contain any SQL statements. (Limiting access rights of restoring user to try to ensure it cannot do anything "naughty" is not acceptable.) There is also the issue that the user may have created the dump file with the "Include CREATE Schema" option (MySQL Workbench), which hard-codes the original database name for recreation. This "dump" approach is totally unsuitable to me, and I find it surprising that anyone would use it in a production environment.
I have come across MySQL's SELECT ... INTO OUTFILE and LOAD DATA INFILE statements. At least they do not contain SQL code to execute. However, they look like a lot of work, deal with one table at a time rather than the whole database, and don't deal with the structure of the tables; you have to know that yourself when restoring. There is a mysqlimport helper command-line utility, but I don't see an equivalent for the export side, and I don't see a way to restore a complete database with it.
The last is to use what MySQL refers to as "Physical (Raw)" rather than "Logical" backups. This works on the database directories and files themselves. It is the equivalent of SQL Server's detach/attach method for backing up/restoring. But, as per https://dev.mysql.com/doc/refman/5.7/en/backup-types.html, it has all sorts of caveats, e.g. "Backups are portable only to other machines that have identical or similar hardware characteristics." (I have no idea, e.g. whether users are on Windows or another OS, and I have no idea about their architecture) and "Backups can be performed while the MySQL server is not running. If the server is running, it is necessary to perform appropriate locking so that the server does not change database contents during the backup." (let alone restores).
So can anything satisfy (what I regard as) my modest requirements, as outlined above, for MySQL backup/restore? Am I really the only person who finds the above 3 to be the only, yet unacceptable, possible solutions?
1 - mysqldump - I use this quite a bit, usually in environments where I am handling all the details myself. I do have one configuration where I use that to send copies of a development database - to be dumped/restored in its entirety - to other developers. It is probably the fastest solution, has some reasonable configuration options (e.g., to include/exclude specific tables) and generates very functional SQL code (e.g., each INSERT batch is small enough to avoid locking/speed issues). For a "replace entire database" or "replace key tables in a specific database" solution, it works very well. I am not too concerned about the "arbitrary SQL commands" problem - if that is an issue then you likely have other issues with users trying to "do their own thing".
2 - SELECT ... INTO OUTFILE and LOAD DATA INFILE - The problem with these is that if you have any really big tables then the LOAD DATA INFILE statement can cause problems because it is trying to load everything all at once. You also have to add code to create (if needed) or empty the tables before LOAD DATA.
3 - Physical (raw) file transfer. This can work but under limited circumstances. I had one situation with a multi-gigabyte database and decided to compress the raw files, move them to the new machine, uncompress and just tell MySQL "everything is already there". It mostly worked well. But I would not recommend it for any unattended/end-user process due to the MANY possible problems.
What do I recommend?
1 - mysqldump - live with its limitations and risks, set up a script to call mysqldump and compress the file (I am pretty sure there are options in mysqldump to do the compression automatically), include the date in the file name so that there is less confusion as the files are sent around, and make a simple script for users to load the file.
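A minimal sketch of that wrapper script, assuming a placeholder database name and that credentials come from an option file such as ~/.my.cnf, could look like this:

```python
# Dump one database, compress it, and date-stamp the file name.
# The database name is a placeholder; credentials are expected to come from an
# option file such as ~/.my.cnf so they are not hard-coded here.
import gzip
import subprocess
from datetime import date

DB = "mydb"
outfile = f"{DB}-{date.today():%Y%m%d}.sql.gz"

dump = subprocess.run(
    ["mysqldump", "--single-transaction", "--routines", DB],
    check=True, capture_output=True,
)
with gzip.open(outfile, "wb") as f:
    f.write(dump.stdout)
print(f"wrote {outfile}")
```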
2 - Write your own program. I have done this a few times. This is more work initially but allows you to control every aspect of the process and transfer a file that only contains data without any actual SQL code. You can control the specific database, tables, etc. One catch is that if you make any changes to the table structure, indexes, etc., you will need to make sure that information is somehow transmitted to the receiving program so that it can change the structures as needed - that is not a problem with mysqldump, as it normally replaces the tables, creating the new structures, indexes, etc. This can be written in any language that can connect to MySQL - it does not have to be the same language as your application.
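As a rough illustration of option 2, a data-only exporter could be as simple as the following sketch (connection details and the table list are placeholders; the matching loader on the receiving side would read the CSVs back in):

```python
# Dump each table's rows to a plain CSV file, with no SQL statements in the
# transfer files at all. Connection details and table list are placeholders.
import csv

import pymysql

conn = pymysql.connect(host="localhost", user="backup", password="...", database="mydb")
TABLES = ["customers", "orders"]  # whichever tables you agree to exchange

with conn.cursor() as cur:
    for table in TABLES:
        cur.execute(f"SELECT * FROM `{table}`")
        with open(f"{table}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(col[0] for col in cur.description)  # header row
            writer.writerows(cur.fetchall())
conn.close()
```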
If you're not going to use third-party tools (like innobackupex, for example) then you're limited to using ... mysqldump, which is in the mysql package.
I can't understand why it is not acceptable to you, and why you don't like SQL commands in those dumps. Best practice, when restoring a single DB into a server which already contains other databases, is to have a separate user with rights to write only into the restored DB. Then even if the user performing the restore changed the SQL commands and tried to write to another DB, they would not be able to.
When doing a raw backup (a physical copy of the database files) you need to have the instance down, i.e. the MySQL server not running. Similar hardware means you need to have the same directories as the source server (unless you change my.cnf before starting the server and put all the files in the right directories).
When coming to MySQL, try not to compare it to SQL Server - it's a totally different approach and philosophy.
But if you do convince yourself to use a third-party tool, I recommend innobackupex from Percona, which is free, by the way.
The export tool that complements mysqlimport is mysqldump --tab. This outputs CSV files like SELECT...INTO OUTFILE. It also outputs the table structure in much smaller .sql files. So there are two files for each table.
Once you recreate your tables from the .sql files, you can use mysqlimport to import all the data files. You can even use the mysqlimport --use-threads option to make it load multiple data files in parallel.
This way you have more control over which schema to load the data into, and it should run a lot faster than loading a large SQL dump.
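Wrapped in a small script, the round trip described above might look roughly like this; the dump directory and database names are placeholders, and the MySQL server must be allowed to write to that directory (see the secure_file_priv setting):

```python
# Export each table to a .sql (structure) file and a data file, then reload
# with mysqlimport. Dump directory and database names are placeholders.
import glob
import subprocess

DUMP_DIR = "/var/lib/mysql-files/export"

# Export: two files per table (table.sql with the CREATE TABLE, table.txt with the data).
subprocess.run(["mysqldump", f"--tab={DUMP_DIR}", "sourcedb"], check=True)

# Recreate the tables on the target from the .sql files...
for ddl in sorted(glob.glob(f"{DUMP_DIR}/*.sql")):
    with open(ddl) as f:
        subprocess.run(["mysql", "targetdb"], stdin=f, check=True)

# ...then load all the data files, several at a time.
subprocess.run(
    ["mysqlimport", "--use-threads=4", "targetdb", *glob.glob(f"{DUMP_DIR}/*.txt")],
    check=True,
)
```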
I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, rather just a few columns and rows based on which have relevant data and what was last created/updated.
One option would be to use a PHP script that'd read from one DB and insert it into another, but it'd be very inefficient. Unlike SQL Server or ORACLE, MySQL also does not have the ability to write queries across servers, else I'd have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift which will become your data warehouse. This blog article by AWS provides an approach to do it.
For the consolidation operation, you can use AWS Database Migration Service (DMS), which will allow you to migrate data column-wise with the following options (see the sketch after the list):
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
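As a rough sketch of how that column- and row-level selection can be expressed, the boto3 call below creates a replication task with DMS table mappings; all ARNs, schema, table and column names are placeholders, and the exact mapping syntax should be checked against the DMS documentation.

```python
# Sketch of DMS table mappings that include only recent rows of one table and
# drop a column. All ARNs and schema/table/column names are placeholders.
import json

import boto3

table_mappings = {
    "rules": [
        {   # only the rows of sales.orders updated since a given date
            "rule-type": "selection", "rule-id": "1", "rule-name": "orders-recent",
            "object-locator": {"schema-name": "sales", "table-name": "orders"},
            "rule-action": "include",
            "filters": [{
                "filter-type": "source", "column-name": "updated_at",
                "filter-conditions": [{"filter-operator": "gte", "value": "2024-01-01"}],
            }],
        },
        {   # drop a column you do not want in the reporting instance
            "rule-type": "transformation", "rule-id": "2", "rule-name": "drop-notes",
            "rule-target": "column",
            "object-locator": {"schema-name": "sales", "table-name": "orders",
                               "column-name": "internal_notes"},
            "rule-action": "remove-column",
        },
    ]
}

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="consolidate-orders",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:...:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # "Migrate existing data & replicate ongoing changes"
    TableMappings=json.dumps(table_mappings),
)
```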
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
Did you take a look at the RDS migration tool?