Periodically verify data stored correctly within Couchbase

How can I run a check on Couchbase to scrub / verify all the data in the database and ensure it is free from errors?
In MS SQL Server we have DBCC CHECKDB, on ZFS we have scrub. Is there anything like this in Couchbase?

No, there is nothing like that, because CouchDB is an append-only database, and for this reason data corruption is not possible.

You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka; from Kafka you can store the data on the file system and validate it as you like. The Couchbase Kafka adapter uses the TAP protocol to push data to Kafka.
Another alternative is to use the existing cbbackup utility to perform a backup and get a data check as a side effect.
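To illustrate the "validate as you like" step, here is a minimal sketch in Python, assuming the adapter publishes documents to a Kafka topic; the topic name and broker address are placeholders, and the check itself (JSON parsing) is just an example:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "couchbase-documents",              # hypothetical topic fed by the adapter
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,          # stop iterating once the topic is drained
)

bad = 0
for message in consumer:
    try:
        json.loads(message.value)       # put whatever validation you need here
    except ValueError:
        bad += 1
        print("Document failed validation at offset", message.offset)

print("Finished, documents failing validation:", bad)
```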

Related

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS's (hosted on AWS). One of these RDS instances is my "production" RDS, and the other is my "performance" RDS. These RDS's have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) between our production and performance MySQL RDS's.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database; I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case - I could write a "fetcher" program that gets all new data from the prod MySQL binlog, sends it to Kinesis Data Streams, then write a Lambda that transforms the data (and decides on what data to send/add/obfuscate) and sends it to my destination (being the performance RDS, or if I can't directly do that, then a consumer HTTP endpoint I write that updates the performance RDS).
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
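For reference, the kind of transformation I have in mind for that Lambda looks roughly like the sketch below. This is only an illustration: the event shape is the standard Kinesis trigger payload, but the field names and the obfuscation rule are placeholders.

```python
import base64
import hashlib
import json

def handler(event, context):
    """Hypothetical Lambda: obfuscate user emails in change records read
    from the Kinesis stream before they are applied to the performance RDS."""
    transformed = []
    for record in event.get("Records", []):
        # Kinesis delivers each payload base64-encoded.
        row = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if "email" in row:
            # Deterministic but non-reversible obfuscation.
            digest = hashlib.sha256(row["email"].encode()).hexdigest()[:16]
            row["email"] = f"user-{digest}@example.com"
        transformed.append(row)
    # These rows would then be applied to the performance RDS, e.g. with
    # INSERT IGNORE so existing performance-only data is not overwritten.
    return transformed
```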

Kafka producer vs Kafka Connect to read a MySQL data source

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also just connect Kafka to the MySQL data source using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and do analysis. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a Kafka service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not to re-invent the wheel and to use Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, can do initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer; it's not clear from the question). This provides many advantages over polling (a configuration sketch follows the list below):
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
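To make that concrete, a Debezium MySQL connector can be registered through the Kafka Connect REST API roughly like the sketch below. This is only a starting point: the host names, credentials and table names are placeholders, and some property names (e.g. table.whitelist vs. table.include.list) vary between Debezium versions.

```python
import requests  # pip install requests

connector = {
    "name": "clicks-connector",                      # hypothetical name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.example.com",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",              # unique replica id
        "database.server.name": "clickstream",       # used as topic prefix
        "table.whitelist": "analytics.website_clicks",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.clickstream",
    },
}

# Kafka Connect's REST API listens on port 8083 by default.
resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
print("Registered connector:", resp.json()["name"])
```

Storm would then consume the resulting change topics (named serverName.databaseName.tableName, e.g. clickstream.analytics.website_clicks) instead of polling MySQL.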

Storing in a JSON file vs. JSON object vs. MySQL database

I've programmed a chat application using Node.js, Express and socket.io.
When I used a MySQL database the application slowed down, so I replaced it with storing the data in JSON objects (on the Node.js server side).
Now everything is OK and the application works well, but if I want to release an update of the Node.js app.js file it has to be restarted, so everything in the JSON objects is lost!
How can I fix this problem? Can I store the data in a JSON file instead, and will the application stay at the same speed?
Storing data in RAM will always be faster than writing to a database, but the problem in your case is that you need to persist it somewhere.
There are many solutions to this problem, but if you are working with JSON, I recommend looking at the MongoDB database.
MongoDB supports multiple storage engines, and the one you are interested in is the in-memory engine.
An interesting architecture for you could be the following replica set configuration:
Primary with the in-memory storage engine
Secondary with the in-memory storage engine
Another secondary with the WiredTiger storage engine
You will have the speed benefit of storing in RAM, and the data will also be persisted in the database.
A simpler possibility would be to use a key-value store like Redis, which is easier to configure.
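For the Redis route, the idea is simply to keep the chat data in Redis instead of a plain in-process object, so restarting app.js no longer wipes it. A minimal sketch of the pattern follows (shown in Python for brevity; the same RPUSH/LRANGE commands are available from any Node Redis client, and the key naming is made up):

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_message(room, message):
    # Append the message to a per-room list; this survives app restarts.
    r.rpush(f"chat:{room}", json.dumps(message))

def load_history(room, last_n=50):
    # Re-load the most recent messages when the app starts up again.
    return [json.loads(m) for m in r.lrange(f"chat:{room}", -last_n, -1)]

save_message("lobby", {"user": "alice", "text": "hello"})
print(load_history("lobby"))
```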

How to update Ignite caches when data is available in the persistent store and when CSVs are available?

I am trying the Apache Ignite data grid to query cached data using SQL.
I could load data into Ignite caches on startup from MySQL and CSV files, and I am able to query them using SQL.
To deploy in production, in addition to loading the caches on startup, I want to keep updating different caches whenever new data is available in MySQL and whenever CSVs are created for some caches.
I cannot use read-through because I will be using SQL queries.
How can this be done in Ignite?
Read-through cannot be configured for SQL queries. You can go through this discussion on the Apache Ignite users forum:
http://apache-ignite-users.70518.x6.nabble.com/quot-Read-through-quot-implementation-for-sql-query-td2735.html
If you elaborate a bit on your use case, I can suggest an alternative.
If you're updating the database directly, the only way to achieve this is to manually reload the data. You can have a trigger on the DB that somehow initiates the reload, or a mechanism that periodically checks whether there were any changes.
However, the preferable way to do this is to never update the DB directly, but to always go through the Ignite API with write-through. This way you guarantee that the cache and the DB are always consistent.
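As an illustration of the reload option for CSVs, a small loader can push the rows through the Ignite API. The sketch below uses the Python thin client; the cache name, CSV columns and port are placeholders, and the puts only propagate to MySQL if a CacheStore with write-through is configured on the server nodes. Whether the entries are then visible to SQL queries depends on the cache's query entity configuration.

```python
import csv
from pyignite import Client  # pip install pyignite

client = Client()
client.connect("127.0.0.1", 10800)   # default thin-client port

# Hypothetical cache; with write-through configured on the server side,
# these puts are also propagated to the underlying MySQL table.
cache = client.get_or_create_cache("PersonCache")

with open("persons.csv", newline="") as f:
    for row in csv.DictReader(f):
        cache.put(int(row["id"]), row["name"])

client.close()
```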

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipeline to warm up caches and update search engines.
Currently we do that with a rudimentary polling architecture, basically using SQL, but the ideal would be a push architecture that reads the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've already worked with some customers doing that, in Oracle. It also seems that LinkedIn makes heavy use of this technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish that in an agnostic way: https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that already streams the binlog from MySQL to Kinesis Streams: https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka: https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for exactly that; your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
Hope this helps.
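For the home-grown variant of that push architecture, a sketch with the python-mysql-replication library and boto3 is shown below; the stream, table and credential values are placeholders, and the source MySQL/RDS instance must have row-based binary logging (binlog_format=ROW) enabled.

```python
import json
import boto3
from pymysqlreplication import BinLogStreamReader   # pip install mysql-replication
from pymysqlreplication.row_event import WriteRowsEvent, UpdateRowsEvent

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = BinLogStreamReader(
    connection_settings={"host": "legacy-db.example.com", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=4242,                      # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    only_tables=["orders"],              # the write-intensive table
    blocking=True,
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Inserts carry "values"; updates carry "after_values".
        values = row.get("values") or row.get("after_values")
        kinesis.put_record(
            StreamName="legacy-orders-changes",
            Data=json.dumps(values, default=str),
            PartitionKey=str(values.get("id", "unknown")),
        )
```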
The AWS DMS service offers data migration from a SQL database to Kinesis:
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/