Azure Cosmos DB - its use as a message queue

I have been doing some reading on Azure Cosmos DB (the new DocumentDB) and noticed that it allows Azure Functions to be executed when data is written to the database.
Typically I would have written to a Service Bus queue, processed the message with an Azure Function, and stored the message in the document database for history.
I wanted some help on good practice for Cosmos DB.

It depends on your use case: what is your throughput requirement? What processing will you be doing on the data? How transient is your data? Will it be globally distributed? And so on.
Yes, Cosmos DB can ingest data at a very high rate, and storage can scale elastically too. Azure Functions are certainly a viable option for processing the change feed in Cosmos DB.
Here is more information: https://learn.microsoft.com/en-us/azure/cosmos-db/serverless-computing-database
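If you take that route, here is a minimal sketch (not a definitive implementation) of a change-feed-triggered function using the Node.js programming model in TypeScript. The Cosmos DB trigger binding itself (connection string, database, container, lease container) would be configured in function.json and is assumed rather than shown:

```typescript
import { AzureFunction, Context } from "@azure/functions";

// Invoked by the Cosmos DB trigger with the batch of documents that changed.
// The binding details live in function.json and are an assumption here.
const cosmosDbTrigger: AzureFunction = async function (
  context: Context,
  documents: any[]
): Promise<void> {
  if (!documents || documents.length === 0) {
    return;
  }
  context.log(`Processing ${documents.length} changed document(s)`);
  for (const doc of documents) {
    // Hypothetical downstream work: update a cache, search index, or audit trail.
    context.log(`Changed document id: ${doc.id}`);
  }
};

export default cosmosDbTrigger;
```

Compared with the Service Bus pattern described in the question, the document is already persisted before the function runs, so the function only has to handle the downstream processing.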

Related

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS instances (hosted on AWS). One of these instances is my "production" RDS, and the other is my "performance" RDS. They have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) from our production RDS to our performance RDS.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case - I could write a "fetcher" program that reads all new data from the prod MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (deciding what to send/add/obfuscate) and writes it to my destination (the performance RDS, or, if I can't do that directly, a consumer HTTP endpoint I write that updates the performance RDS). A sketch of such a transforming Lambda appears below.
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
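For what it's worth, here is a minimal sketch of the transforming Lambda mentioned in the Kinesis option. The record shape, the email field, and the insert-only write are assumptions, not part of any existing setup:

```typescript
// Hypothetical sketch: each Kinesis record is assumed to carry a JSON-encoded
// row change produced by the binlog "fetcher"; the write to the performance
// RDS is only stubbed out.
import type { KinesisStreamEvent } from "aws-lambda";
import { createHash } from "crypto";

// Replace an email with a stable, non-reversible placeholder.
function obfuscateEmail(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex").slice(0, 12);
  return `user-${digest}@example.invalid`;
}

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    // Kinesis delivers the payload base64-encoded.
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, "base64").toString("utf8")
    );

    if (typeof payload.email === "string") {
      payload.email = obfuscateEmail(payload.email);
    }

    // TODO: insert-only write into the performance RDS (e.g. INSERT IGNORE),
    // so existing performance-specific rows are never overwritten.
    console.log("transformed row", payload);
  }
};
```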

Where to store key-value pairs for runtime retrieval from within Cloud Function?

In a Cloud Function I need to retrieve a bunch of key-value pairs to process. Right now I'm storing them as a JSON file in Cloud Storage.
Is there any better way?
Environment variables don't suit, as (a) there are too many key-value pairs, (b) the same GCF may need different sets of key-value pairs depending on the incoming params, and (c) those key-value pairs could change over time.
BigQuery seems to be overkill, also given that some key-value pairs have a few levels of nesting.
Thanks!
You can use Memorystore, but it's not persistent; see the FAQ.
Cloud Memorystore for Redis provides a fully managed in-memory data store service built on scalable, secure, and highly available infrastructure managed by Google. Use Cloud Memorystore to build application caches that provide sub-millisecond data access. Cloud Memorystore is compatible with the Redis protocol, allowing easy migration with zero code changes.
Serverless VPC Access enables you to connect from the Cloud Functions environment directly to your Memorystore instances.
Note: Some resources, such as Memorystore instances, require connections to come from the same region as the resource.
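As a rough illustration (not from the linked docs), a Cloud Function reading key-value pairs from Memorystore through a Serverless VPC Access connector could look something like this, assuming an ioredis client and REDIS_HOST/REDIS_PORT settings pointing at the instance:

```typescript
import Redis from "ioredis";
import type { Request, Response } from "express";

// Create the client outside the handler so warm instances reuse the connection.
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: Number(process.env.REDIS_PORT ?? 6379),
});

// HTTP-triggered Cloud Function: look up a single key passed as ?key=...
export async function lookup(req: Request, res: Response): Promise<void> {
  const key = String(req.query.key ?? "");
  const value = await redis.get(key); // null if the key does not exist
  res.json({ key, value });
}
```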
Update
For persisted storage you could use Firestore.
See the tutorial Use Cloud Firestore with Cloud Functions.
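And here is a minimal sketch of the Firestore option, assuming a hypothetical configs/default document holding the key-value pairs (nesting is fine, since Firestore documents are maps):

```typescript
import { Firestore } from "@google-cloud/firestore";
import type { Request, Response } from "express";

const firestore = new Firestore();

// Cache per function instance so warm invocations skip the Firestore read.
let cachedConfig: Record<string, unknown> | null = null;

export async function getConfigValue(req: Request, res: Response): Promise<void> {
  if (!cachedConfig) {
    // "configs/default" is a hypothetical collection/document path; you could
    // pick the document based on the incoming params instead.
    const snapshot = await firestore.doc("configs/default").get();
    cachedConfig = (snapshot.data() as Record<string, unknown>) ?? {};
  }
  const key = String(req.query.key ?? "");
  res.json({ key, value: cachedConfig[key] ?? null });
}
```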

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipe to warm up caches and update search engines.
Currently we do that using a rudimentary polling architecture, basically using SQL, but the ideal would be a push architecture reading the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've worked with some customers doing that already, in Oracle. It also seems that LinkedIn makes heavy use of that technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish that in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that is already streaming the binlog from MySQL to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for exactly that - your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
Hope this helps.
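To make that concrete, here is a minimal sketch of the Kinesis side of such a pipe; however the binlog is tailed (plainview, a Databus-style connector, or something custom), each row change ends up as a record on the stream. The stream name and the RowChange shape are assumptions:

```typescript
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";

// Hypothetical shape of a single row change emitted by the binlog reader.
interface RowChange {
  table: string;
  type: "insert" | "update";
  primaryKey: string;
  row: Record<string, unknown>;
}

const kinesis = new KinesisClient({ region: "us-east-1" });

export async function publishRowChange(change: RowChange): Promise<void> {
  await kinesis.send(
    new PutRecordCommand({
      StreamName: "legacy-table-changes",        // hypothetical stream name
      PartitionKey: change.primaryKey,           // keeps a given row's events ordered
      Data: Buffer.from(JSON.stringify(change)), // payload must be bytes
    })
  );
}
```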
AWS DMS offers data migration from a SQL database to Kinesis.
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/

Periodically verify data stored correctly within Couchbase

How can I run a check on Couchbase to scrub / verify all the data in the database and ensure it is free from errors?
In MSSQL we have CHECKDB, and on ZFS we have scrub - is there anything like this in Couchbase?
No, there is nothing like that, because CouchDB is an append-only database, and for this reason data corruption is not possible.
You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store the data in a file system and validate it as you like. The Couchbase Kafka adapter uses the TAP protocol to push data to Kafka.
Another alternative is to use the existing cbbackup utility to perform a backup, with a data check as a side effect.

HTML5 local storage memory architecture?

I have gone through many resources online but could not find the memory architecture used by HTML5 local storage. Is the data from local storage brought into memory while working on it (something like caching)?
Also, in case I want to implement my app working in offline mode (the basic purpose of storing into local storage), is it fine to store data as global JSON objects rather than going for local storage?
In short, I am getting a lot of JSON data when I log in to my app (a cross-platform HTML5 app). Shall I store this data as a global object or rather store it in local storage?
Well, it depends on how sensitive your information is and the approach you want to follow.
Local storage
You can use local storage for "temporary" data, passing parameters and some config values. AFAIK local storage should be used with care in the sense that the stored information is not guaranteed to always be there, as it could be deleted to reclaim device memory or during a cleanup. But you can use it without much fear.
To store JSON in local storage you will have to stringify your object to store it in a local storage key. The JSON.stringify() function will do the trick for you.
So far I haven't found official information, but I think there is a limit in MB on what you can store in local storage; however, I think that is not controlled directly via Cordova. Again, this is not official data, just keep it in mind if your data in JSON notation is extremely big.
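A minimal sketch of that round trip (the key name and data shape are just examples):

```typescript
// localStorage only holds strings, so objects must be serialized on write
// and parsed on read. "appData" is a hypothetical key name.
interface AppData {
  user: string;
  settings: Record<string, unknown>;
}

function saveAppData(data: AppData): void {
  localStorage.setItem("appData", JSON.stringify(data));
}

function loadAppData(): AppData | null {
  const raw = localStorage.getItem("appData");
  return raw ? (JSON.parse(raw) as AppData) : null;
}

saveAppData({ user: "alice", settings: { offline: true } });
console.log(loadAppData());
```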
Store data as global objects
Storing data as global objects could be useful if you have some variables or data shared across functions inside the app, to ease access. However, bear in mind that data stored in global variables will be lost if the app is restarted, stopped, crashes or quits.
If it is not sensitive information or you can recover it later, go ahead and use local storage or global variables.
Permanent storage
For sensitive data or more permanent information I would suggest storing your JSON data in the app file system. That is, write your JSON data to a file and, when required, recover the information from the file and store it in a variable to access it; that way, if your app is offline, or the app is restarted or quit, you can always recover the information from the file system. The only way to lose that data is if the app is deleted from the device.
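A rough sketch of that approach with cordova-plugin-file might look like the following; the file name is hypothetical and error callbacks are omitted for brevity:

```typescript
// Assumes cordova-plugin-file is installed; the Cordova runtime provides
// these globals, so they are only declared here for the TypeScript compiler.
declare const cordova: any;
declare function resolveLocalFileSystemURL(
  url: string,
  success: (entry: any) => void
): void;

function saveToFile(data: object): void {
  resolveLocalFileSystemURL(cordova.file.dataDirectory, (dirEntry) => {
    dirEntry.getFile("appdata.json", { create: true, exclusive: false }, (fileEntry: any) => {
      fileEntry.createWriter((writer: any) => {
        writer.onwriteend = () => console.log("JSON saved to file");
        writer.write(new Blob([JSON.stringify(data)], { type: "application/json" }));
      });
    });
  });
}
```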
In my case I am using the three methods in the app I am developing, so just decide which approach will work the best for you and your needs.