Couchbase - smart cross data center replication (XDCR)

I have two Couchbase clusters: one for real-time work and one for back-end data queries.
I want to replicate only 10% of the data from the real-time bucket to the back-end cluster, because it is only used for statistical analysis.
Note one: I know this isn't possible through the UI; I'm looking for a way to write some kind of extension that could "sit" in the middle of the XDCR stream and filter it.
Note two: As I understand it, Elasticsearch uses the replication feature to get notified of changes on the cluster and builds its own indexes. If I could "listen" for those notifications myself, I could take it from there, reading and sending the relevant data on my own.
Any ideas on how I can make this work?
==NOTES==
I found the following link: http://blog.couchbase.com/xdcr-aspnet-and-nancy. It gives a basic example of a Sinatra project that XDCR can connect to, but there is no link to documentation of the REST API for someone who doesn't want to work with Sinatra.
As for Cihan's question: replicating 10% of the data is the basic use case I want, and for that filtering on the key alone would be enough. But in general I would probably like to manipulate the data and also be able to merge it into existing data - that would be the case if I had two real-time clusters replicating to one back-end cluster.
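To sketch what I mean by an extension that "sits in the middle": a rough, hypothetical example modeled on the CouchDB-style endpoints the linked Sinatra project implements (_revs_diff and _bulk_docs). The Flask app, bucket names, port, and the keep-10%-of-keys rule below are my own assumptions, not a documented API, and a real shim would need the remaining endpoints from that blog post as well:

    # Hypothetical XDCR filtering shim: the real-time cluster replicates into this
    # endpoint, which forwards only ~10% of the documents to the back-end cluster.
    import hashlib
    import requests
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    BACKEND = "http://backend-cluster:8092/statistics_bucket"  # placeholder

    def keep(key):
        # Deterministically keep roughly 10% of keys.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 10 == 0

    @app.route("/realtime_bucket/_revs_diff", methods=["POST"])
    def revs_diff():
        # Report as "missing" only the revisions of documents we want to keep,
        # so the source only bothers sending those.
        wanted = {k: {"missing": revs}
                  for k, revs in request.get_json().items() if keep(k)}
        return jsonify(wanted)

    @app.route("/realtime_bucket/_bulk_docs", methods=["POST"])
    def bulk_docs():
        docs = [d for d in request.get_json().get("docs", []) if keep(d.get("_id", ""))]
        # Forward the surviving documents to the back-end cluster (an SDK write
        # would work here too, and is where data could be manipulated or merged).
        requests.post(BACKEND + "/_bulk_docs", json={"docs": docs, "new_edits": False})
        return jsonify([]), 201

    if __name__ == "__main__":
        app.run(port=8091)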

We don't have anything built in today to do this. You could set up XDCR and delete the data that you don't need on the destination cluster, but it may reappear as updates happen, so your cleanup would have to run continuously. Would a method like that work?
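For illustration, a continuous cleanup of that kind could be as small as the sketch below (Python SDK 3.x-style API; the bucket name, credentials, and the keep-10% rule are placeholders, and the N1QL query needs a primary index on the bucket):

    # Hypothetical destination-side cleanup: repeatedly delete documents that fall
    # outside the ~10% sample, since XDCR may keep recreating them.
    import hashlib
    import time
    from couchbase.cluster import Cluster, ClusterOptions
    from couchbase.auth import PasswordAuthenticator

    cluster = Cluster("couchbase://backend-host",
                      ClusterOptions(PasswordAuthenticator("user", "password")))  # placeholders
    collection = cluster.bucket("statistics_bucket").default_collection()

    def keep(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 10 == 0

    while True:
        for row in cluster.query("SELECT META().id AS id FROM `statistics_bucket`"):
            if not keep(row["id"]):
                collection.remove(row["id"])
        time.sleep(300)  # sweep every 5 minutes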
By the way, we do plan to add this facility in the future. One comment that would be helpful for me: what type of filtering would suffice in your case? Could we filter on a key prefix only, or would you need a more sophisticated filtering expression?
Thanks,
Cihan Biyikoglu

Related

AWS DMS to replicate transactional data to data warehouse on an ongoing basis

I'm hoping someone can tell me if I'm absolutely crazy before I go too far down this path. I have an application with MySQL as the backend. I needed to create more robust reporting, and I opted to build a data warehouse in pgsql. The challenge is that I don't want the DW to be updated just once or twice a day; I'd like it to be near real time (some lag is expected and not a problem).
I looked at AWS Glue and a few other options and finally settled on DMS as a method of replicating the data from the MySQL source to the pgsql target db for staging. I then set up trigger functions that basically manipulate the inserted/updated data in the pgsql db, landing it in the data warehouse. The application is also connected to the DW and can pull reports and dashboard metrics from it as needed.
I've built a proof of concept and it seems to work, but it's really only me hitting the application at the moment, so I'm not sure if it will hold up if I were to proceed with this idea and put it in production.
I currently have a dms.t2.small replication instance (engine version 2.4.4) running at about 15-20% CPU utilization. I don't have it configured for Multi AZ currently.
I'm seeing combined CDCLatencyTarget/CDCLatencySource values averaging about 9 seconds. I think if that holds true it wouldn't be unbearable, although the less time the better. I'd say if it gets up over a minute we may start to see complaints.
I know that DMS is more meant for migrations, so I'd like to know if I'm just doing this in a really stupid way, or if this is a more or less valid use case? Are there issues with DMS that I am unaware of that will cause me to later regret this decision?
Also, I'd love any ideas you have for how I can put safeguards in place to ensure that source and target stay in sync (or, if they don't, that I'm made aware of it), or something that would allow it to self-heal.
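One low-tech safeguard along those lines is a scheduled script that compares per-table row counts (or checksums) between the MySQL source and the pgsql target and alerts on drift. A minimal sketch, assuming pymysql/psycopg2 and placeholder connection and table names (counts can differ transiently while CDC is catching up, so only persistent drift should alert):

    # Hypothetical sync check: compare per-table row counts between the MySQL
    # source and the PostgreSQL staging target and report any drift.
    import pymysql
    import psycopg2

    TABLES = ["orders", "customers"]  # placeholder table names

    mysql_conn = pymysql.connect(host="mysql-source", user="report",
                                 password="secret", database="app")
    pg_conn = psycopg2.connect(host="pgsql-target", user="report",
                               password="secret", dbname="staging")

    def row_count(conn, table):
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM " + table)
            return cur.fetchone()[0]

    for table in TABLES:
        src, dst = row_count(mysql_conn, table), row_count(pg_conn, table)
        if src != dst:
            # Replace with an SNS notification, CloudWatch metric, email, etc.
            print("DRIFT on %s: source=%d target=%d" % (table, src, dst))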

Business Intelligence: Live reports from MySQL

I wanted to create a (nearly) live dashboard from MySQL databases. I tried Power BI, SSRS, and other similar tools, but they were not as fast as I wanted. What I have in mind is for the data to be updated every minute or even more often. Is that possible? And are there any free (or inexpensive) tools for this?
Edit: I want to build a wallboard to show some data on a big TV screen, and I need it to be real-time. I tried SSRS auto-refresh as well, but it shows a loading sign and is very slow; Power BI relies on Azure, which is very complex to configure and is blocked in my country.
This topic has many more layers than simply asking which tool is best for the job.
You have to consider
Velocity
Veracity
Variety
Kind
Use Case
of the data. Sure, these terms usually only come up when talking about Big Data, but they will give you a feeling for the size and complexity of your data.
Loading
Is the data already being loaded and you "just" use it? Or do you also need to load it in real time or near real time (for clarification, read this answer here)?
Polling/Pushing
Do you want to poll the data every x seconds or minutes? Or do you want to work event-based? What requirements make it necessary to show the data this fast?
Use case
Do you want to show financial data? Do you need to show data about error and system logs of servers and applications? Do you want to generate insights as soon as a visitor of a web page makes a request?
Conclusion
When thinking about these questions, keep in mind that they are only meant to point you in one direction or another. Depending on the data and the use case, you might use an ELK stack (for logs), Power BI (for financial data), or even some scripts (for billing).
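To make the "even some scripts" option concrete for the wallboard case from the question: if polling MySQL once a minute is acceptable, something as small as the sketch below (connection details, query, and output path are all placeholders) can feed a page that the TV simply auto-refreshes:

    # Minimal polling sketch: query MySQL every 60 seconds and dump the result
    # to a JSON file that a wallboard page reads and auto-refreshes from.
    import json
    import time
    import pymysql

    conn = pymysql.connect(host="db-host", user="dash",
                           password="secret", database="sales")  # placeholders

    while True:
        conn.ping(reconnect=True)  # survive dropped connections on long runs
        with conn.cursor() as cur:
            cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")  # placeholder query
            rows = cur.fetchall()
        with open("/var/www/wallboard/metrics.json", "w") as f:
            json.dump({"updated": time.time(), "rows": rows}, f)
        time.sleep(60)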

Best way to report events / read events (also MySQL)

So I'm going to attempt to create a basic monitoring tool in VB.NET. I'd like some advice on how to tackle the logging and reporting side of things, so I'd appreciate responses from users who I'm sure have a better idea than me and can tell me far more efficient ways of doing things.
So my plan is to have a client tool that reads values from a MySQL database and updates its display every x interval - I'm thinking 10 or 15 minutes at the moment. This side of the application is quite easy: I can get something to read the database every so often and then change labels and display alerts based on the values. This is all well documented and I am probably okay with that.
The second part is to have a client that sits in the system tray of the server, gathering the required information. The system tray part will probably be the trickiest bit of this, but that's not really part of my question.
So I assume I can use the normal information-gathering commands, store the results (perhaps as strings), and then connect to the same database and write them to the relevant fields. For example, if I had a MySQL table called "server" with a column titled "Connection", I could check whether the server has an internet connection, store the result as 1 for yes and 0 for no, and then send a MySQL command to update the "Connection" value to 0 or 1.
Then, in the monitoring tool, I assume I can run a MySQL query to check the "Connection" column and, if the value is 0, change a label or flag an error, and if it's 1, report that connectivity is okay?
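To make that round trip concrete (sketched here in Python rather than VB.NET just for brevity; the server table and Connection column are as described above, everything else is a placeholder):

    # Sketch of the write/read round trip described above.
    import pymysql

    conn = pymysql.connect(host="db-host", user="monitor",
                           password="secret", database="monitoring")

    # Gathering client: record whether this server currently has a connection.
    def report_connection(server_name, is_up):
        with conn.cursor() as cur:
            cur.execute("UPDATE server SET `Connection` = %s WHERE name = %s",
                        (1 if is_up else 0, server_name))
        conn.commit()

    # Monitoring client: read the flag back and decide what to display.
    def check_connection(server_name):
        with conn.cursor() as cur:
            cur.execute("SELECT `Connection` FROM server WHERE name = %s", (server_name,))
            (value,) = cur.fetchone()
        return "connectivity OK" if value == 1 else "ALERT: no connection"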
My main questions about the above are listed below.
Is using a MySQL database the most efficient way of doing something like this?
Obviously if my database goes down there's no more reporting; I still think that's a con I'll have to live with, though.
Is storing everything as values within the code the best way to store my data?
Is there any particular type/format I should use for the MySQL column? I was thinking maybe tinyint(9).
Is the above method redundant and pointless?
I assume all these database connections could cause some unwanted server load, however the 15 minute refresh time should combat that.
Is there a way to properly handle delays, e.g. the client not updating in time for the reporter so that it picks up stale data - perhaps a fail-safe such as a column containing the last-updated time?
You probably don't need the tool that gathers information per se. The web app (real time monitor) can do that, since the clients are storing their information in the same database. The web app can access the database every 15 minutes and display the data, without the intermediate step of saving it again. This will provide the web app with the latest information instead of a potential 29-minute delay.
In other words, the clients are saving the connection information once. Don't duplicate it in the database.
MySQL should work just about as well as anything.
It's a bad idea to hard code "everything". You can use application settings or a MySQL table if you need to store IPs, etc.
In an application like this, the conversion overhead will more than offset the storage savings of a tinyint. I would use the most convenient data type.
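Regarding the fail-safe asked about in the question (a last-updated column): a cheap option is to let MySQL maintain the timestamp and have the monitor treat rows older than the refresh interval as stale. A sketch, assuming a column like last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP (all names here are placeholders):

    # Staleness-aware status check: any row whose last_updated timestamp is older
    # than the refresh interval is treated as unreliable rather than trusted.
    import pymysql

    conn = pymysql.connect(host="db-host", user="monitor",
                           password="secret", database="monitoring")

    def connection_status(server_name, max_age_minutes=20):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT `Connection`, last_updated < NOW() - INTERVAL %s MINUTE "
                "FROM server WHERE name = %s",
                (max_age_minutes, server_name))
            value, stale = cur.fetchone()
        if stale:
            return "STALE: client has not reported recently"
        return "connectivity OK" if value == 1 else "ALERT: no connection"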

MySQL push changes

I'd like to be able to replicate a bunch of MySQL tables to a custom service.
Right now, my best idea is to create an AFTER INSERT trigger on each table and have these push rows into a 'cache' table that my custom service polls for updated rows.
The problem with the above is that it means I have to poll at regular intervals. I'm wondering if there is a way to do it where MySQL pushes updates to my service. The best way I can think of would be if triggers supported actions other than updating other tables, such as making an HTTP POST (which, as far as I can tell, is not possible).
I'm pretty sure there's a way to have MySQL push its binary logs to me somehow, but I don't know how to do that.
You can extend the engine to run system code from your function. Here's an overview.
Given this effort (setup and maintenance), a polling script doesn't look too bad.
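That said, the binary-log idea from the question is workable without touching the engine: the service can pose as a replica and stream row changes, for example with the python-mysql-replication package. A sketch (requires binlog_format=ROW and a user with replication privileges; connection details and the handling of each row are placeholders):

    # Sketch of streaming MySQL row changes by acting as a replication client.
    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent)

    stream = BinLogStreamReader(
        connection_settings={"host": "db-host", "port": 3306,
                             "user": "repl", "passwd": "secret"},  # placeholders
        server_id=4242,              # must be unique among replicas
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        blocking=True,               # wait for new events instead of returning
        resume_stream=True)

    for event in stream:
        for row in event.rows:
            # Push each change to the custom service instead of polling a cache table.
            print(event.table, type(event).__name__, row)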

Using combination of MySQL and MongoDB

Does it make sense to use a combination of MySQL and MongoDB? What I'm basically trying to do is use MySQL as a "raw data backup" of sorts, where all the data is stored but never read from.
The data is also stored at the same time in MongoDB, and reads happen only from MongoDB because I don't have to do joins and such.
For example, assume we're building Netflix:
In MySQL I have tables for Comments and Movies. When a comment is made, I just add it to the Comments table in MySQL, and in MongoDB I update the movie's document to hold this new comment.
Then, when I want to get movies and comments, I just grab the document from MongoDB.
My main concern is how "new" MongoDB is compared to MySQL. If something unexpected happens in Mongo, we have a MySQL backup and can quickly fall the app back to MySQL and memcached.
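To make the write path concrete, this is roughly what I have in mind (sketched with pymysql and pymongo; all names are placeholders):

    # Sketch of the dual write described above: the comment goes to MySQL for
    # safekeeping and is also pushed into the movie's document in MongoDB.
    import pymysql
    from pymongo import MongoClient

    mysql_conn = pymysql.connect(host="db-host", user="app",
                                 password="secret", database="netflixish")
    movies = MongoClient("mongodb://mongo-host:27017").netflixish.movies

    def add_comment(movie_id, user_id, text):
        # 1) Raw copy in MySQL (never read back in normal operation).
        with mysql_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO comments (movie_id, user_id, body) VALUES (%s, %s, %s)",
                (movie_id, user_id, text))
        mysql_conn.commit()

        # 2) Read-optimized copy embedded in the MongoDB movie document.
        #    If this call fails after the INSERT above succeeded, the two
        #    stores are already out of sync.
        movies.update_one({"_id": movie_id},
                          {"$push": {"comments": {"user_id": user_id, "body": text}}})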
On paper it may sound like a good idea, but there are a lot of things you will have to take into account. This will make your application way more complex than you may think. I'll give you some examples.
Two different systems
You'll be dealing with two different systems, each with its own behavior. These different behaviors will make it quite hard to keep everything synchronized.
What will happen when a write in MongoDB fails, but succeeds in MySQL?
Or the other way around, when a column constraint in MySQL is violated, for example?
What if a deadlock occurs in MySQL?
What if your schema changes? One migration is painful, but you'll have to do two migrations.
You'd have to deal with some of these scenarios in your application code. Which brings me to the next point.
Two data access layers
Your application needs to interact with two external systems, so you'll need to write two data access layers.
These layers both have to be tested.
Both have to be maintained.
The rest of your application needs to communicate with both layers.
Abstracting away both layers will introduce another layer, which will further increase complexity.
Chance of cascading failure
Should MongoDB fail, the application will fall back to MySQL and memcached. But at this point memcached will be empty. So each request right after MongoDB fails will hit the database. If you have a high-traffic site, this can easily take down MySQL as well.
Word of advice
Identify all the ways in which you think 'something unexpected' could happen with MongoDB. Then use the simplest solution for each individual case. For example, if it's data loss you're worried about, use replication. If it's data corruption, use delayed replication.