MySQL cloning aggregated database from an existing database - mysql

We have a MySQL database based on InnoDB. We are looking to build an Analytics system for this data. We are thinking to create a cloned database that denormalizes the data to prevent join and uses MyIsam for faster querying. This second database will also facilitate avoiding extra load on the main database to which the data will be written.
Apart from this, we are also creating some extra tables that will store aggregated numbers to avoid recalculation.
I am wondering how can I sync these tables once every day to keep them updated. It looks similar to Master-slave config of MySQL which uses binary log. But in our case, the second database is not an exact slave. Are there any open-source reliable tools or any other ideas which I can use to write an 'update mechanism'?
Thanks in advance.

Related

mysql automatic replication of partial data

I have to create a dashboard based on a table in mysql, and only on today datas
This db is used on a service with a massive data quantity, and continous read and write data, so I'd like to replicate in a "slave" instance part of this table (only today data).
Is it possible to do it in Mysql, without scripting?
Thanks
MySQL has no built-in feature to replicate a subset of rows. There are replication filters to replicate a subset of schemas or tables, but not rows.
One workaround could be to replicate fully to the replica, then on the replica delete any data that is more than one day old.
But this would work only for a database that is INSERT-only. If you also have UPDATE and DELETE operations replicated, they might find that they are trying to change rows that are missing. If you use ROW-based binary logs, this would result in a replication error when it can't find the row, and replication would stop.
It might work if you only use STATEMENT-based binary logs, but I've never tried it so I can't predict what other problems might occur. Also, you can't fully prevent ROW-based binary logs from occurring, because individual sessions can change their binary log format.
I think you're going to need a bespoke solution no matter what. Probably not using replication, but just an ETL job to query the current day's data and import it into another MySQL instance (not a replica).

How to fill for the first time a SQL database with multiple tables

I have a general question regarding the method of how to fill a database for the first time. Actually, I work on "raw" datasets within R (dataframes that I've built to work and give insights quickly) but I now need to structure and load everything in a relational Database.
For the DB design, everything is OK (=> Conceptual, logical and 3NF). The result is a quite "complex" (it's all relative) data model with many junction tables and foreign keys within tables.
My question is : Now, what is the easiest way for me to populate this DB ?
My approach would be to generate a .csv for each table starting from my "raw" dataframes in R and then load them table per table in the DB. Is it the good way to do it or do you have any easier method ? . Another point is, how to not struggle with FK constraints while populating ?
Thank you very much for the answers. I realize it's very "methodological" questions but I can't find any tutorial/thread related
Notes : I work with R (dplyr, etc.) and MySQL
A serious relational database, such as Postgres for example, will offer features for populating a large database.
Bulk loading
Look for commands that read in external data to be loaded into a table with a matching field structure. The data moves directly from the OS’s file system file directly into the table. This is vastly faster than loading individual rows with the usual SQL INSERT. Such commands are not standardized, so you must look for the proprietary commands in your particular database engine.
In Postgres that would be the COPY command.
Temporarily disabling referential-integrity
Look for commands that defer enforcing the foreign key relationship rules until after the data is loaded.
In Postgres, use SET CONSTRAINTS … DEFERRED to not check constraints during each statement, and instead wait until the end of the transaction.
Alternatively, if your database lacks such a feature, as part of your mass import routine, you could delete your constraints before and then re-establish them after. But beware, this may affect all other transactions in all other database connections. If you know the database has no other users, then perhaps this is workable.
Other issues
For other issues to consider, see the Populating a Database in the Postgres documentation (whether you use Postgres or not).
Disable Autocommit
Use COPY (for mass import, mentioned above)
Remove Indexes
Remove Foreign Key Constraints (mentioned above)
Increase maintenance_work_mem (changing the memory allocation of your database engine)
Increase max_wal_size (changing the configuration of your database engine’s write-ahead log)
Disable WAL Archival and Streaming Replication (consider moving a copy of your database to replicant server(s) rather than letting replication move the mass data)
Run ANALYZE Afterwards (remind your database engine to survey the new state of the data, for use by its query planner)
Database migration
By the way, you will likely find a database migration tool helpful in creating the tables and columns, and possibly in loading the data. Consider tools such as Flyway or Liquibase.

Copying data from PostgreSQL to MySQL

I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.
In my MySQL database, I have a table which contains an identical schema to the output of the query described above.
What I would like to develop is an hourly cron job which will run the query against the PostgreSQL database, then insert the results into the MySQL database. During the hour period, I don't expect to ever see more than 10,000 new rows (and that's a stretch) which would need to be transferred.
Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.
The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.
To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.
The closest guess I have for an approach is to use some scripting language to make the query, convert results into an internal data structure, then split it back out to MySQL again.
In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or - any documentation that I should look at which might be useful for doing this kind of conversion? I've found plenty of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (hourly run) seems less common and/or less documented.
Open to any suggestions.
Use the same database system on both ends and use replication
If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.
If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.
Sync using an external script
There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below) you probably have to use an external script (or psql) from a cron job to initiate repliation, unless you're going to use PgAgent to do it.
Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.
If the rows to be copied are too big to comfortably fit in memory you can use a cursor and read the rows with FETCH, which can be helpful if the rows to be copied are too big to comfortably fit in memory.
I'd do the work in this order:
Connect to PostgreSQL
Connect to MySQL
Begin a PostgreSQL transaction
Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
Insert them into MySQL
DELETE any rows from the queue table in PostgreSQL if you haven't already.
COMMIT the MySQL transaction.
If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:
PREPARE TRANSACTION in PostgreSQL
COMMIT in MySQL
then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL depending on the outcome of the MySQL commit.
This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.
Use DBI-Link in PostgreSQL
Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that used DBI-link via PL/Perlu to do the job.
When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and insert the data in the queue table.
Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.
Use a trigger to queue changes and DBI-link to sync
If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
Add MySQL to PostgreSQL with a foreign data wrapper
Alternately, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data.
WITH moved_rows AS (
DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
In short you have two scenarios:
1) Make destination pull the data from source into its own structure
2) Make source push out the data from its structure to destination
I'd rather try the second one, look around and find a way to create postgresql trigger or some special "virtual" table, or maybe pl/pgsql function - then instead of external script, you'll be able to execute the procedure by executing some query from cron, or possibly from inside postgres, there are some possibilities of operation scheduling.
I'd choose 2nd scenario, because postgres is much more flexible, and manipulating data some special, DIY ways - you will simply have more possibilities.
External script probably isn't a good solution, e.g. because you will need to treat binary data with special care, or convert dates&times from DATE to VARCHAR and then to DATE again. Inside external script, various text-stored data will be probably just strings, and you will need to quote it too.

One-way database sync to MySQL

I have an VFP based application with a directory full of DBFs. I use ODBC in .NET to connect and perform transactions on this database. I want to mirror this data to mySQL running on my webhost.
Notes:
This will be a one-way mirror only. VFP to mySQL
Only inserts and updates must be supported. Deletes don't matter
Not all tables are required. In fact, I would prefer to use a defined SELECT statement to only mirror psuedo-views of the necessary data
I do not have the luxury of a "timemodified" stamp on any VFP records.
I don't have a ton of data records (maybe a few thousand total) nor do I have a ton of concurrent users on the mySQL side, want to be as efficient as possible though.
Proposed Strategy for Inserts (doesn't seem that bad...):
Build temp table in mySQL, insert all primary keys of the VFP table/view I want to mirror
Run "SELECT primaryKey from tempTable not in (SELECT primaryKey from mirroredTable)" on mySQL side to identify missing records
Generate and run the necessary INSERT sql for those records
Blow away the temp table
Proposed Strategy for Updates (seems really heavyweight, probably breaks open queries on mySQL dropped table):
Build temp table in mySQL and insert ALL records from VFP table/view I want to mirror
Drop existing mySQL table
Alter tempTable name to new table name
These are just the first strategies that come to mind, I'm sure there are more effective ways of doing it (especially the update side).
I'm looking for some alternate strategies here. Any brilliant ideas?
It sounds like you're going for something small, but you might try glancing at some replication design patterns. Microsoft has documented some data replication patterns here and that is a good starting point. My suggestion is to check out the simple Move Copy of Data pattern.
Are your VFP tables in a VFP database (DBC)? If so, you should be able to use triggers on that database to set up the information about what data needs to updated in MySQL.

How do I set up replication from MySQL to MongoDB?

I have a bunch of data from a scientific experiment stored in a MySQL database, but I want to use MongoDB to take advantage of its map/reduce functionality to power some web charts. What is the best way to have new writes to MySQL replicate into Mongo? Some solution where I inspect the binary MySQL log and update accordingly, just like standard MySQL replication?
Thanks!
Alex
MySQL and MongoDB uses very different data and query models, so you can't transfer data directly.
Alas, moving data between the two must be done manually, and doing that efficiently depends very much on your data. Eg. you could transfer each table to a separate collection (roughly a table in MongoDB-lingo), and making the unique attributes in the tables to the _id-attribute. Alternately, you can make the _id to tablename+unique_id.
Basically, as Document databases are essentially free-form, you are free to invent your on schemes ad-infinitum (as long as the _id-attributes are unique within the collection).
Tungsten Replicator is data replication engine for MySQL.
Using heterogeneous replication, you may be able to set up MySQL to MongoDB replication.
I am not familiar with MongoDB but my quick look shows it is incompatible with mySQL so unless someone has written something to import form mySQL you are out of luck.
You could write your own import function.
assuming your mySQL tables use an incrementing unique 'id' field you could track the last row in mysql and then send it to mongodb when it changes.
Alterations and deletions would be much more difficult to deal with. if this is important then inserting the data at the source is probably the best bet.
Do you need to insert the data into mysql at all? could you do it all in mongodb and save all the trouble?
DC