Replicate MySQL Data to ClickHouse [closed]

I want to periodically insert data from a MySQL database into ClickHouse, i.e., when data is added or updated in the MySQL database, I want that data to be added automatically to ClickHouse.
I am thinking of using Change Data Capture (CDC). CDC is a technique that captures changes made to data in MySQL and applies them to the destination ClickHouse table. It imports only the changed data, not the entire database. To use the CDC method with a MySQL database, we must use the binary log (binlog). The binlog lets us capture change data as a stream, enabling near real-time replication.
The binlog captures not only data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are deleted in ClickHouse.
Once I have the changes, how can I insert them into ClickHouse?

[experimental] MaterializedMySQL
Creates a ClickHouse database with all the tables existing in MySQL, and all the data in those tables.
The ClickHouse server works as a MySQL replica: it reads the binlog and performs the corresponding DDL and DML queries. A sketch of creating such a database follows the links below.
https://clickhouse.tech/docs/en/engines/database-engines/materialized-mysql/
https://altinity.com/blog/2018/6/30/realtime-mysql-clickhouse-replication-in-practice
https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-sources/#dicts-external_dicts_dict_sources-mysql
https://altinity.com/blog/dictionaries-explained
https://altinity.com/blog/2020/5/19/clickhouse-dictionaries-reloaded
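As an illustration of how this looks in practice, here is a minimal sketch of creating a MaterializedMySQL database; the host, schema name, and credentials below are placeholders, the engine is experimental, and the exact setting name can differ between ClickHouse versions:
-- Allow the experimental database engine (the setting name may vary by ClickHouse version).
SET allow_experimental_database_materialized_mysql = 1;
-- Create a ClickHouse database that follows the MySQL binlog.
-- 'mysql-host:3306', 'source_db', 'repl_user' and 'password' are placeholders.
CREATE DATABASE mysql_replica
ENGINE = MaterializedMySQL('mysql-host:3306', 'source_db', 'repl_user', 'password');
-- Tables from source_db now appear in mysql_replica and stay in sync with the binlog.
SHOW TABLES FROM mysql_replica;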

Related

How can I go about migrating data from one database to another with a different structure? [closed]

I have about 10,000 records in an old MySQL database used by an application written in PHP. This old database has no structure or relationships defined; it's a completely legacy design. I'm now refactoring the entire system, and the tables and their relationships have now been completely defined.
The issue that remains is how best to move the data from the old database (written in PHP without a framework) to the new one (written in Laravel).
Would Laravel commands be a good option, where I read data from the old database, specify which columns are needed, and then insert the data into the new database?
Off the top of my head, the following comes to mind:
1. Plain raw SQL
You could write a series of raw SQL statements that read the old database and insert records into the new database (see the sketch after this list). This can be done without the help of an ORM like Eloquent.
Advantages:
Nothing beats raw SQL in performance, so the migration will run fast
Disadvantages:
If the database structure is very different it might be hard to write the correct queries
It's easier to forget things like adding primary and foreign keys
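As a rough, hypothetical sketch of the raw SQL approach (the database, table, and column names below are made up for illustration), a migration statement could look like this:
-- Copy rows from a legacy table into the newly structured one.
-- old_db.customer and new_db.customers (and their columns) are hypothetical names.
INSERT INTO new_db.customers (id, name, email, created_at)
SELECT id, customer_name, email_address, created_on
FROM old_db.customer;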
2. Laravel commands
You could write one (or multiple) artisan commands which perform the data migration (in steps). This way you can use the DB facade in Laravel to read the old database and use Eloquent to write the data to the new database.
Advantages:
Easier to write, as you can leverage Eloquent models
Eloquent takes care of things you might otherwise forget, like adding primary and foreign keys
Disadvantages:
Raw SQL will probably outperform the usage of Eloquent.
If you have large amounts of data you'll have to optimize your scripts for memory usage. Otherwise you might run into memory limit issues.
So Laravel commands could surely be a good solution depending on how different your data structures are, how large your datasets are and how important performance is.

How to do audit when using Spring batch as ETL [closed]

I have a requirement to use Spring Batch as an ETL tool to migrate data from one set of tables in a source database (MySQL) to another set of tables in a destination database (MySQL). The schema of the destination tables is different from the schema of the source tables, so I'm using a processor to transform the data to match the destination schema.
I need to do this migration block by block, i.e., a set of records at a time, on demand (not all at once).
I have a few concerns to take care of:
1) Audit (make sure all the data is migrated)
2) Rollback & retry (in case of error)
3) Error handling
4) How to keep up with new data arriving in the source table while the migration is happening (no downtime)
Below are my thoughts on the same.
I will generate a random ID that is unique for each job (maybe a UUID per job) and write it to a column in every row of the destination table while migrating.
1) Audit: my thought is to keep a count of the records I'm reading and then compare it with the row count of the destination table once the migration is done.
2) Rollback & retry: if the record count doesn't match in the audit check, I will delete all the rows with the batch UUID and then initiate the batch job again (a rough sketch of these queries follows the question).
3) Error handling: I'm not sure what other cases I need to be aware of, so I'm thinking of just logging the errors.
4) Delta changes: I'm thinking of running the batch job again and again to look for changes (using the created_at and updated_at column values) until 0 records are found.
I want to understand whether any of the above steps can be done in a better way. Please suggest.
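As a rough sketch of the audit and rollback idea above (destination_table, batch_uuid, and the UUID value are hypothetical names used only for illustration):
-- Audit: count the rows written by this job run and compare with the number of records read.
SELECT COUNT(*) FROM destination_table WHERE batch_uuid = 'a1b2c3d4-0000-0000-0000-000000000000';
-- Rollback: remove everything written by a failed run before retrying the job.
DELETE FROM destination_table WHERE batch_uuid = 'a1b2c3d4-0000-0000-0000-000000000000';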
You might need to spend some more time reviewing Spring Batch as it already takes care of most of these things for you.
You can already run a single Spring Batch job and set it to do the processing in chunks, the size of which you configure when you set up the job.
Auditing is already covered. The batch tables (and the Spring Batch admin interface) keep counts of all the records read and written for each job instance, as well as the failure count (if you configured it to suppress exceptions).
Spring Batch jobs already have retry and auto-recovery logic based on the aforementioned tables, which track how many of the input records have already been completed. This is not an option when using a database as an input source, though. You would need something in your table setup to identify the records already completed, and/or a uniqueness constraint on the destination database so it cannot re-write duplicate records. Another option could be to have your job's first step read the records into a flat file, and then read from that file in the next step. This should let the Spring Batch auto-recovery logic work when the job is restarted.
Error handling is already covered as well. Any failure will stop the job so no data is lost, and it will roll back the data in the chunk it is currently processing. You can set it to ignore (suppress) specific exceptions if there are specific failures you want it to keep running on, and you can set limits on how many of each exception type to allow. And of course you can log failure details so you can look them up later.
As mentioned before, you would need a value or trigger in your source data to identify which records you have already processed, which will let you keep running the chunked queries to pick up new records.
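For example, a delta query driven by the updated_at column mentioned in the question might look something like the sketch below (source_table, the timestamp literal, and the chunk size are hypothetical):
-- Pick up one chunk of records changed since the last successful run.
SELECT *
FROM source_table
WHERE updated_at > '2018-01-01 00:00:00'
ORDER BY updated_at
LIMIT 1000;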

Multiple simultaneous selects from mysql database table [closed]

I'm working on a database that has one table with 21 million records. Data is loaded once when the database is created and there are no more insert, update or delete operations. A web application accesses the database to make select statements.
It currently takes 25 seconds per request for the server to return a response. However, if multiple clients make simultaneous requests, the response time increases significantly. Is there a way of speeding this process up?
I'm using MyISAM instead of InnoDB with fixed max rows and have indexed based on the searched field.
If no data is being updated/inserted/deleted, then this might be a case where you want to tell the database not to lock the table while you are reading it.
For MySQL this seems to be something along the lines of:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
SELECT * FROM TABLE_NAME ;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
(ref: http://itecsoftware.com/with-nolock-table-hint-equivalent-for-mysql)
More reading in the docs, if it helps:
https://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
The T-SQL equivalent, which may help if you need to Google further, is
SELECT * FROM TABLE WITH (nolock)
This may improve performance. As noted in other comments some good indexing may help, and maybe breaking the table out further (if possible) to spread things around so you aren't accessing all the data if you don't need it.
As a note: locking a table prevents other people from changing data while you are using it. Not locking a table that has a lot of inserts/deletes/updates may cause your selects to return multiple rows of the same data (as it gets moved around on the hard drive), rows with missing columns, and so forth.
Since you've got one table that you are selecting against, your requests are all taking turns locking and unlocking it. If you aren't doing updates, inserts or deletes, then your data shouldn't change, so you should be OK to forgo the locks.

When to use the DELAY_KEY_WRITE attribute? [closed]

When should the DELAY_KEY_WRITE attribute be used?
How does it help?
CREATE TABLE (....) DELAY_KEY_WRITE = 1;
Another performance option in MySQL is the DELAY_KEY_WRITE option. According to the MySQL documentation, this option makes index updates faster because they are not flushed to disk until the table is closed.
Note that this option applies only to MyISAM tables.
You can enable it on a table by table basis using the following SQL statement:
ALTER TABLE sometable DELAY_KEY_WRITE = 1;
This can also be set in the advanced table options in the MySQL Query Browser.
This performance option could be handy if you have to do a lot of updates, because you can delay writing the indexes until tables are closed. So if you make frequent updates to large tables, you may want to check out this option.
Ok, so when does MySQL close tables?
That should have been your next question. It looks as though tables are opened when they are needed, but then added to the table cache. This cache can be flushed manually with FLUSH TABLES; but here's how they are closed automatically according to the docs:
1. When the cache is full and a thread tries to open a table that is not in the cache.
2. When the cache contains more than table_cache entries and a thread is no longer using a table.
3. When FLUSH TABLES; is called.
"If DELAY_KEY_WRITE is enabled, this means that the key buffer for tables with this option are not flushed on every index update, but only when a table is closed. This speeds up writes on keys a lot, but if you use this feature, you should add automatic checking of all MyISAM tables by starting the server with the --myisam-recover option (for example, --myisam-recover=BACKUP,FORCE)."
So if you do use this option, you may want to flush your table cache periodically, and make sure you start the server with the myisam-recover option.
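Putting that advice together, a minimal sketch (note the option name differs between MySQL versions: myisam-recover on older servers, myisam-recover-options on newer ones):
-- Force delayed key writes to disk by closing all cached tables.
FLUSH TABLES;
-- In my.cnf (server startup), enable automatic checking of MyISAM tables, e.g.:
-- [mysqld]
-- myisam-recover-options = BACKUP,FORCE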

Comparison of MongoDB, MySQL and PostgreSQL [closed]

If I had to develop a
Core Java application which processes CSV files and stores the output in an open-source DB
Data size would be 10 GB initially (porting from existing sources)
Would grow at 1 GB per month
A typical transaction could fetch 100,000 rows
Could be accessed by 1000 users at a given time
And had choice of
MongoDB
MySQL
PostgreSQL
which would be the best choice of DB?
This compares MongoDB with MySQL
This compares PostgreSQL to MySQL
Security alerts for MongoDB
With increasing data it's better to have a DB that scales easily; SQL databases don't scale smoothly and eventually break trying to do so. In fact, for big data, usually only highly scalable DBs are used.
But you said that entries can be correlated with each other, so in this case it's better to use a relational DB, because NoSQL ones can "lose" some of that correlation.
Like @Craig Ringer said, don't consider only those DBs; there are a lot of different solutions, each with their own pros and cons (for example, Redis is very fast but supports almost no complex logic because it's a simple key-value store; Cassandra is faster than Mongo but works better with schema-defined data; Mongo is a document DB, so it can store any kind of data in the same collection).
IMHO you should try to set up some benchmarking sessions with different DBs and use cases, focus on what you want to be done fast, and then choose the one that performs better in that area.