What is the recommended way to get another domain's data in a CQRS workflow? - language-agnostic

I'm working on a microservice architecture with separate databases, but I need to replicate some data for resiliency.
As an example, let's say I'm working on a blog and have two domains, users and articles, each with its own database. If the users microservice goes down, I still need to be able to show an article's author name.
-- in the 'users' domain's database
create table users (
    id uuid primary key,
    name varchar(32)
);
-- in the 'articles' domain's database
create table articles (
    id uuid primary key,
    author uuid,
    author_name varchar(32),
    contents text
);
So when I'm creating an article, I send the user identifier.
My question is, at what point and how am I supposed to get the username?
I can't trust the user to send the real user name; it has to be fetched from somewhere in the system.
I can't fetch it from the controller, since it is in another domain.
I can't fetch it from the event handler, since it is in another domain.
I can't use a saga, since sagas are not supposed to make queries, only commands.
FWIW, my reference for these is this F.A.Q.
Thanks a lot for reading this; I hope you'll have a solution for me! Have a nice day.

My question is, at what point and how am I supposed to get the username?
1) You fetch the username from the local cache of reference data
2) Your reporting logic needs to support the case that the cache doesn't yet have a copy of the reference data.
3) Your reporting logic needs to support the case that the cached copy of the reference data is stale.
Reference data here being shorthand for any information that the service needs, for which it isn't itself the authority.
So in a typical solution, the User service would have the authoritative copy/copies of the username, and all of the logic for determining whether or not a change to that value is allowed. The Articles service would have a local copy of that data, with metadata describing how long that information may be used.
The user database would have a copy of all of the information that it is responsible for. The article database would only have the slice of user information that the article service cares about.
A common way to implement this is to arrange a subscription, pulling the data from the users database to the articles database when the non-authoritative copy is no longer fresh.
You can treat the cache as a fallback position -- if we can't get timely access to the latest username, then use the cached copy.
But there's no magic - it will sometimes happen that the remote data is not available AND the local cache doesn't have a valid copy.
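To make that concrete, here's a minimal sketch of the read path in Python (the cache structure, the TTL, and fetch_from_users_service are all illustrative assumptions, not part of any particular framework):

    import time

    CACHE_TTL_SECONDS = 300  # how long a cached username counts as fresh (assumption)

    local_cache = {}  # user_id -> {"name": ..., "fetched_at": ...}

    def fetch_from_users_service(user_id):
        # Stand-in for a call to the authoritative users service.
        raise ConnectionError("users service unreachable")

    def get_author_name(user_id):
        entry = local_cache.get(user_id)
        if entry and time.time() - entry["fetched_at"] < CACHE_TTL_SECONDS:
            return entry["name"]  # fresh enough; no remote call needed
        try:
            name = fetch_from_users_service(user_id)
            local_cache[user_id] = {"name": name, "fetched_at": time.time()}
            return name
        except ConnectionError:
            if entry:
                return entry["name"]  # a stale copy beats no copy
            return None  # the reporting logic must handle "no copy yet"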
It may help to keep in mind that a lot of your data is already reference data -- copied into your local databases by the real world.
If I may ask: instead of attaching metadata and pulling the data periodically to update the cache, shouldn't I just replicate it once and then listen for the 'username changed' event?
What happens if that event doesn't get delivered?
In distributed systems, it's really important to ask what happens if some process fails or some message is lost right at a critical point. How do you recover?
When I follow through that line of thinking, what I end up with is that client polling is the primary mechanism for retrieving reference data, and push notifications are latency optimizations that indicate we should poll now, rather than waiting for the entire scheduled interval.
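As a sketch of that arrangement (Python; the interval and the event plumbing are illustrative):

    import threading

    POLL_INTERVAL_SECONDS = 600  # scheduled polling interval (assumption)
    poll_now = threading.Event()

    def refresh_reference_data():
        # Pull the latest usernames into the local copy; same
        # fetch-and-store logic as the cache sketch above.
        pass

    def poll_loop():
        while True:
            refresh_reference_data()
            # Wake early if a push notification arrives; otherwise
            # poll on the regular schedule.
            poll_now.wait(timeout=POLL_INTERVAL_SECONDS)
            poll_now.clear()

    def on_username_changed(event):
        # A lost event only costs latency: the next scheduled poll
        # still converges on the correct data.
        poll_now.set()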

Related

Prevent duplicated values on fields that should be unique on PouchDB Sync

I have a system that uses CouchDB as its DB, and the clients connect through PouchDB, keeping a local copy of the database for offline use. The app has no backend API; it connects directly to the DB.
Many databases in the system contain one or more fields that should be unique (no other document should have the same value). Since CouchDB doesn't really have a "unique constraint" for fields, the uniqueness of documents is managed through code on the client side. The issue comes from PouchDB's offline synchronization.
Let's say there is a pages object in the system with two fields that should be unique, name and slug. Through code we make sure that, before posting a new page, those two values do not already exist in the DB. Then let's say one PC goes offline for a day and creates a page with the slug "homepage", while the same day a PC that was online created another page with the slug "homepage", now saved in the remote DB. When PC 1 goes back online, it will sync the local and remote DBs, skipping the validation code and adding a second "homepage" page.
One workaround is to set the must-be-unique field as the _id of the document and manage syncing conflicts, but that is not possible in a reasonable way for more than one unique field. (I would still appreciate a response that only takes a single unique field into account, though.)
Also, in some cases it is less than ideal to use the _id as the unique field. For example, in a POS system, cashiers have a PIN to check in with when taking an order. Using a four-digit PIN as the _id does not seem ideal.
Another option is to ask the user for an action before syncing when a conflict is noticed. But that would require a pre-syncing phase that checks the whole database and interrupts the user, and I'm not sure how to implement that seamlessly from a user-experience standpoint.
Any suggestions on how to handle this massive issue?
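For what it's worth, the post-sync detection step I can imagine looks roughly like this (Python against CouchDB's HTTP API; the database URL and field names are made up for illustration):

    import requests

    COUCH = "http://localhost:5984/pages"  # illustrative database URL

    # A view keyed on the should-be-unique field; _count with group=true
    # yields how many documents share each slug.
    design_doc = {
        "views": {
            "by_slug": {
                "map": "function (doc) { emit(doc.slug, null); }",
                "reduce": "_count",
            }
        }
    }
    requests.put(COUCH + "/_design/uniqueness", json=design_doc)

    def duplicated_slugs():
        rows = requests.get(
            COUCH + "/_design/uniqueness/_view/by_slug",
            params={"group": "true"},
        ).json()["rows"]
        return [r["key"] for r in rows if r["value"] > 1]

That only detects duplicates after they've synced, though; it doesn't decide which document wins.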

If my users are stored in another database, should I duplicate them in my service that uses SQL database?

If my users are stored in some other database, but I am building posts in my SQL database, should I create another users table?
If I did, I would be duplicating all of my users and would have to make sure it stays in sync with the other database; on the other hand, my posts table could save space by referring to a foreign key instead of the full id string each time.
What is the recommendation? Create another users table, or just pass in the user ids to query?
If you have a service that stores and provides information about users then other services that need that information should communicate with the User service to get it. That is, presumably, the reason the User service exists in the first place.
Depending on how volatile the users list is, and how quickly changes to it need to be respected in the Posts service, you might consider some short-term caching in the Posts service, but I certainly wouldn't persist another copy of the user list there.
There are 3 obvious solutions.
The simplest, cleanest and fastest is to use foreign keys and joins between your "posts" database and your "users" database. In this case, when you show a list of posts, you can get both the post and user data in a single query, and there's no need to keep things up to date.
The next option is to store a copy of the user data alongside your posts. This leads to entertaining failure modes - the copy may get out of sync with the user database. However, this is a fairly common strategy when using third-party authentication systems (e.g. logging on with your Google/Facebook/Github/Stack Exchange credentials). The way to make this work is to minimize the amount of data you duplicate, and to make sure it's safe for it to be out of date. For instance, a user's display name is probably okay; a current bank account balance is probably not.
The final option is to store the primary key for users in your posts database, and to retrieve the user data at run time. This is less likely to lead to bugs with data getting out of sync, but it can cause performance problems - retrieving user details for 1000 posts one by one is obviously much slower than retrieving everything through a joined query.
The choice then is "do I have a service which combines post and user data and my UI retrieves everything from that service, or do I let the UI retrieve posts, and then users for each post". That's mostly down to the application usage, and whether you can use asynchronous calls to retrieve user information. If at all possible (assuming you're building a web application), the simplest option might be to return the posts and user IDs and use Ajax requests to retrieve the user data as needed.
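If you go the "retrieve user data at run time" route, the one-by-one problem is avoidable by batching: collect the author ids for the whole page of posts and fetch them in a single IN query. A sketch (Python DB-API; the table and column names are assumptions):

    def fetch_authors(cursor, posts):
        # One IN query for the whole page instead of one query per post.
        user_ids = list({p["user_id"] for p in posts})
        if not user_ids:
            return {}
        placeholders = ", ".join(["%s"] * len(user_ids))
        cursor.execute(
            "SELECT id, display_name FROM users WHERE id IN (%s)" % placeholders,
            user_ids,
        )
        return dict(cursor.fetchall())  # {user_id: display_name}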
The CQRS approach (common to microservice architectures) provides some structure for this.

Move information-resource stored in the database tables with two step using 'reservation'

I need to architect a database and service. I have resources that I need to deliver to users, and the delivery takes some time or requires the user to do some additional work.
These are the tables I store information into.
Table - Description
R - stores resources
RESERVE - reserves requested resources
HACK - tracks requests that couldn't be made with my client application (statistics)
FAIL - tracks requests that can't be resolved, through no fault of the user (statistics)
SUCCESS - tracks successful deliveries (statistics)
The first step, when a user requests a resource:
IF (condition1 is true - the user has the right to request the resource) THEN
    IF (I've successfully RESERVE-d the resource and committed the transaction) THEN
        nothing more to do
    ELSE
        save the request into FAIL
ELSE
    save the request into HACK
Then the second step:
IF (condition2 is true - the user has done their job and requests the reserved resource) THEN
    IF (the resource is delivered successfully) THEN
        save the request into SUCCESS
    ELSE
        save the request into FAIL
    depending on application logic, move the resource from RESERVE back to R or not
ELSE
    save the request into HACK, contact the user,
    and if this really is a hacker, move the resource from RESERVE back to R
This is how I'm thinking of implementing the system. I've put the transactions into stored procedures, but the main application logic, where I decide which procedure to call, lives in the application/service layer.
Am I on the right track? Is this division of code between the DB and service layers normal? Your experienced opinions are very important.
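Roughly, I imagine the first step looking like this in the service layer (a Python sketch; the helper functions and the exact SQL are placeholders for my stored-procedure calls):

    def handle_resource_request(conn, user_id, resource_id):
        # Step one: check condition1, then reserve inside one transaction.
        if not user_may_request(user_id, resource_id):  # placeholder check
            save_to_hack(conn, user_id, resource_id)    # placeholder logging
            return
        cur = conn.cursor()
        try:
            cur.execute(
                "INSERT INTO reserve (resource_id, user_id) VALUES (%s, %s)",
                (resource_id, user_id),
            )
            cur.execute("DELETE FROM r WHERE id = %s", (resource_id,))
            conn.commit()  # the reservation happens fully or not at all
        except Exception:
            conn.rollback()
            save_to_fail(conn, user_id, resource_id)    # placeholder logging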
Clarifying and answering RecentCoin's questions.
1. The difference between the HACK and FAIL tables is that I store more information in the HACK table, like the user's IP and XFF. I'm not going to penalize every user that appears in that table. There are two reasons a request may be tracked as a hack. The first is that I have a bug (mainly in the client app), and this will help me fix it. The second is that someone makes requests manually, trying to bypass the rules. If they try 'harder', I'll be able to take some precautions.
2. The separation of the reserve and success tables has these reasons:
2.1. I use the reserve table in some transactions and queries without the success table, so I can lock them separately.
2.2. The data stored in success will not slow down my queries while I'm querying the reserve table.
2.3. The success table is a kind of log for statistics, which I can delete or move to another database for future analysis.
2.4. I delete rows from reserve after moving them to the success table, so I can estimate the approximate maximum row count of that table, because I have a reservation limit per user.
Points 2.3 and 2.4 could also be achieved by keeping everything in one table.
So, are reasons 2.1 and 2.2 good enough to keep the data separate?
The resource "delivered successfully" mean that the admin and the service are done everything they could do successfully, if they couldn't then the reservation fails
4 and 6. The restrictions and right are simple, they are like city and country restrictions, The users are 'flat', don't have any roles or hierarchy.
I have some tables to store users and their information. I don't have LDAP or AD.
You're going in the right direction, but there are some other things that need to be more clearly thought out.
1. You're going to have to define what constitutes a "hack" vs a "fail". Especially with new systems, users get confused, and it's pretty easy for them to make honest mistakes. This seems like something you want to penalize them for in some fashion, so I'd be extremely careful with this.
2. You will want to consider making "reserve" and "success" equivalent. Why store the same record twice? You should have a really compelling reason to do that.
3. You will need to define "delivered successfully", since that could be anything from an entry in a calendar to getting more pens and Post-it notes.
4. You will want to define your resources as well as which user(s) have rights to them. For example, you may have a conference room that only managers are allowed to book, but you might want to include the managers' administrative assistants in that list, since they would be booking the room for the manager(s).
5. Do you have a database of users? LDAP or Active Directory? Or will you need to create all of that yourself? If you do have LDAP or AD, can you use something like SAML?
6. You are going to want to consider how you assign those rights. Will they be group-based, where group membership confers the right to reserve, request, or use a given thing? For example, you may only want architects printing to the large-format printer.

How did Facebook or Twitter implement their subscribe system

I'm working on an SNS-like mobile app project, where users upload content and can see updates from their subscribed topics or friends on their homepage.
I store user content in MySQL and build the user-specific homepage data by first querying who and what the user is subscribed to, and then querying the content table with a 'where userid IN (....) or topic IN (....)' clause.
I suspect this will become quite slow when the content table piles up, or when a user subscribes to tons of users or topics. Our newly released app is already gaining thousands of new users each week, and more over time, so scalability must be a concern for us right now.
So I wonder how Facebook or Twitter handle this subscription problem with their amazing numbers of users. Do they maintain a list for each user? I tried to search, but all I found was how to interact with Facebook or Twitter, rather than how they actually implement this feature.
I've noticed that on Facebook you see only updates, rather than history, in your feed. That means subscribing to a new user won't dump lots of outdated content into your feed, as it would with my current method.
How did Facebook design their database, and how do they dispatch new content to subscribed users?
My backend is currently PHP+MySQL, and I don't mind introducing other backend technologies such as Redis or JMS and stuff if that's the way it should be done.
Sounds like you guys are still in a pretty early stage. There are N-number of ways to solve this, all depending on which stage of DAUs you think you'll hit in the near term, how much money you have to spend on hardware, time in your hands to build it, etc.
You can try an interim table that queues up newly introduced items, along with metadata about what each entails (which topic, the friend user_id list, etc.). Then use a queue-consumer system like RabbitMQ/Gearman to manage consumption of this growing list and decide who should process each item. Build the queue consumer in Scala or on a J2EE stack like Maven/Tomcat, something that can persist. If you really want to stick with PHP, build a PHP REST API that lives in php5-fpm's memory, managed by the FastCGI process manager, called via a proxy like nginx, and initiated by curl calls from a cron-executed script at an appropriate interval.
[EDIT] - It's probably better not to use a DB for the queueing system; use a cache server like Redis instead. It outperforms a DB in many ways and can persist to disk (look up RDB and AOF). It's not very fault tolerant, though: if a job fails all of a sudden, you might lose the job record. Most likely you won't care about these crash edge cases. Also, look up php-resque!
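A bare-bones version of that queue in Python with the redis-py client (the queue name and helpers are made up; a real setup would also want retries or BRPOPLPUSH for crash safety):

    import json
    import redis  # assumes the redis-py client

    r = redis.Redis()

    def enqueue_fanout(content_id, topic_id):
        # Producer: run when new content is posted.
        r.lpush("fanout_jobs", json.dumps({"content_id": content_id,
                                           "topic_id": topic_id}))

    def worker_loop():
        # Consumer: a worker process pops jobs and notifies subscribers.
        while True:
            _, raw = r.brpop("fanout_jobs")  # blocks until a job arrives
            job = json.loads(raw)
            for user_id in subscribers_of(job["topic_id"]):       # made-up helper
                deliver_notification(user_id, job["content_id"])  # made-up helper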
To prepare for notifications to go out efficiently, I'm assuming you're already de-normalizing the tables. I'd imagine a "user_topic" table mapping each topic to the users who subscribed to it. Create another table, "notification_meta", describing where users prefer to receive notifications (SMS/push/email/in-app notification) and the metadata needed to push to those channels (mobile client approval keys for APNS/GCM, email addresses, user auth tokens). Use JSON blobs for the two fields in notification_meta, so each user has a single row. This saves I/O hits on the DB.
Use user_id as the primary key for "notification_meta", and user_id + topic_id as the PK for "user_topic". DO NOT add an auto-increment "id" field to either; it's pretty useless in this use case (it takes up space, CPU, index memory, etc.). If both fields are in the PK, queries on user_topic can be served entirely from the index in memory, and the only disk hit is on "notification_meta" during the JOIN.
So if a user subscribes to two topics, there will be two entries in "user_topic", and each user will always have a single row in "notification_meta".
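The schema might look like this (Python with MySQL Connector; the column types and connection details are illustrative, and JSON columns need MySQL 5.7+):

    import mysql.connector  # assumes MySQL Connector/Python

    conn = mysql.connector.connect(user="app", database="sns")  # illustrative
    cur = conn.cursor()

    # Composite primary key, no auto-increment id, as described above.
    cur.execute("""
        CREATE TABLE user_topic (
            user_id  BIGINT NOT NULL,
            topic_id BIGINT NOT NULL,
            PRIMARY KEY (user_id, topic_id)
        )
    """)

    # One row per user; channel preferences and push credentials as JSON blobs.
    cur.execute("""
        CREATE TABLE notification_meta (
            user_id   BIGINT NOT NULL,
            channels  JSON,  -- e.g. {"sms": true, "push": true, "email": false}
            push_meta JSON,  -- APNS/GCM keys, email address, auth tokens
            PRIMARY KEY (user_id)
        )
    """)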
There are more ways to scale, like dynamically creating a new table for each new topic, sharding to different MySQL instances based on user_id, partitioning, etc. There's N-ways to scale, especially in MySQL. Good luck!

Secure encrypted database design

I have a web based (perl/MySQL) CRM system, and I need a section for HR to add details about disciplinary actions and salary.
All this information that we store in the database needs to be encrypted so that we developers can't see it.
I was thinking about using AES encryption, but what do I use as the key? If I use the HR Manager's password then if she forgets her password, we lose all HR information. If she changes her password, then we have to decrypt all information and re-encrypt with the new password, which seems inefficient, and dangerous, and could go horrifically wrong if there's an error half way through the process.
I had the idea that I could have an encryption key that encrypts all the information, and use the HR manager's password to encrypt the key. Then she can change her password all she likes and we'll only need to re-encrypt the key. (And without the HR Manager's password, the data is secure)
But then there's still the problem of multi-user access to the encrypted data.
I could keep a 'plaintext' copy of the key off site, and encrypt it with each new HR person's password. But then I know the master key, which doesn't seem ideal.
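In code, the key-wrapping idea might look something like this (a sketch with Python's cryptography library; the KDF parameters and password handling are illustrative only):

    import base64
    import os
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def wrap_key(master_key: bytes, password: bytes, salt: bytes) -> bytes:
        # Derive a key-encryption key from the HR manager's password,
        # then encrypt the master key with it.
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                         salt=salt, iterations=480_000)
        kek = base64.urlsafe_b64encode(kdf.derive(password))
        return Fernet(kek).encrypt(master_key)

    master_key = Fernet.generate_key()  # encrypts the actual HR data
    salt = os.urandom(16)
    stored_blob = wrap_key(master_key, b"hr-manager-password", salt)
    # A password change only re-wraps this small blob, not the whole database.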
Has anyone tried this before, and succeeded?
GnuPG allows documents to be encrypted using multiple public keys, and decrypted using any one of the corresponding private keys. In this way, you could allow data to be encrypted using the public keys of everyone in the HR department, and decryption could be performed by anyone holding one of the private keys. Decryption would require both the private key and the passphrase protecting it to be known to the system. The private keys could be held within the system, and the passphrase solicited from the user.
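A minimal sketch with the python-gnupg wrapper, assuming the HR public keys are already in the system's keyring (the addresses, path, and passphrase are made up):

    import gnupg  # assumes python-gnupg and a gpg installation

    gpg = gnupg.GPG(gnupghome="/var/lib/crm/gnupg")  # illustrative path

    # Encrypt once to every HR key; any matching private key can decrypt.
    encrypted = gpg.encrypt(
        "action: written warning; salary: 50000",
        ["hr-manager@example.com", "hr-assistant@example.com"],
    )

    # Decryption needs one private key plus its passphrase from the user.
    decrypted = gpg.decrypt(str(encrypted), passphrase="hr-manager-passphrase")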
The data would probably get quite bloated by GnuPG using lots of keys: it has to create a session key for the payload and then encrypt that key using each of the public keys. The encrypted keys are stored alongside the data.
The weak parts of the system are that the private keys need to be available to the system (i.e. not under the control of the user), and that the passphrase has to pass through the system, and so could be compromised (i.e. logged, stolen) by dodgy code. Ultimately, the raw data passes through the system too, so dodgy code could compromise it without worrying about the keys. Good code review and release control will be essential to maintain security.
You are best avoiding MySQL's built-in encryption functions: they get written to the replication, slow, and general query logs and can be visible in the process list, so anyone with access to the logs or the process list has access to the data.
Why not just limit access to the database or table in general? That seems much easier. If a developer has access to query production, there is no way to prevent them from seeing the data, because at the end of the day the UI has to decrypt and display the data anyway.
In my experience, the amount of work it takes to achieve "developers cannot see production data at all" is immense, and it is nearly impossible. At the end of the day, if the developers have to support the system, it will be difficult to achieve. If you have to debug a production problem, it's impossible not to give some developers access to production data. The alternative is to create a large number of levels and groups for support, backups, test data, etc.
It can work, but it's not as easy as business owners may think.
Another approach is to use a single system-wide key stored in the database, perhaps with a unique id so that new keys can be added periodically. Using counter (CTR) mode, the standard MySQL AES encryption can be used without directly exposing the cleartext to the database, and the encrypted data will be exactly the same size as the cleartext. A sketch of the algorithm:
1. The application generates a unique initial counter value for the record. This might be based on some unique attribute of the record, or you could generate and store a unique value for this purpose.
2. The application generates a stream of counter blocks for the record based on the initial counter value. The counter stream must be the same size as, or up to one block larger than, the cleartext.
3. The application determines which key to use. If keys are being rotated periodically, the most recent one should be used.
4. The counter stream is sent to the database to be encrypted, with something like:
   select aes_encrypt( 'counter', key ) from hrkeys where key_id = 'id';
5. The resulting encrypted counter value is trimmed to the length of the cleartext and XORed with the cleartext to produce the encrypted text.
6. The encrypted text is stored.
Decryption is exactly the same process applied to the encrypted text.
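The XOR step might look like this in application code (a Python sketch; here the encrypted counter stream is a placeholder where the real value would come back from the aes_encrypt query above):

    def ctr_crypt(data: bytes, encrypted_counter: bytes) -> bytes:
        # XOR with the encrypted counter stream; zip() trims it to the
        # data length. Encryption and decryption are the same operation.
        return bytes(d ^ k for d, k in zip(data, encrypted_counter))

    encrypted_counter = bytes(32)  # placeholder for the DB query result
    ciphertext = ctr_crypt(b"disciplinary note", encrypted_counter)
    assert ctr_crypt(ciphertext, encrypted_counter) == b"disciplinary note"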
The advantages are that the cleartext never goes anywhere near the database, so the administrators cannot see the sensitive data. However, you are then left with the problem of preventing your administrators from accessing the encrypted counter values or the keys. The first can be addressed by using SSL connections between your application and the database for the encryption operations. The second can be mitigated with access control: ensure that the keys never appear in database dumps, and store the keys in in-memory tables so that access control cannot be subverted by restarting the database with "skip-grants". Ultimately, the only way to eliminate this threat is to use a tamper-proof device (an HSM) for performing the encryption. The higher the security you require, the less likely it is that you will be able to store the keys in the database.
See Wikipedia - Counter Mode
I am just thinking out loud.
This seems to call for a public/private key mechanism. The information would be stored encrypted with the HR public key and would only be viewable by someone in possession of the associated private key.
This, to me, seems to rule out a web-based interface for viewing this confidential data (entering it via the web interface is certainly feasible).
Given that individuals come and go, tying the keys to a specific person's account seems infeasible. Instead, key distribution must be handled separately, with a mechanism for changing the keypair in use (and re-encrypting the database, again without the web interface) in case the current HR manager is replaced by someone else. Of course, nothing would prevent the departing HR manager from dumping all the data before the keys are replaced.
I'm not sure how feasible this currently is, or which stable DB systems support it, but alternate authentication mechanisms at the database level may help. For example, Drizzle, a refactoring of the MySQL code base, supports (or aims to support?) completely pluggable authentication, allowing no auth, server-housed auth, or auth through PAM or some other mechanism, meaning you can use LDAP.
If you had different levels of access based on the database connection, and the application login also specified what you could actually access in the database, you could theoretically build a system where the confidential information could not be accessed without an account holding the specific access rights, regardless of privilege-escalation attempts in the application itself.
As long as the people setting user account access rights can be trusted or themselves are OK to see the confidential information, this should be fairly secure.
P.S. It might be useful to use a generic DB connection for "regular" application information, but when an attempt to access confidential information is made, then the specific DB connection is attempted. This allows for a few DB connections to handle most requests, assuming the majority of users aren't viewing confidential info. Otherwise, a separate DB connection per user may become burdensome to the DB.
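A sketch of that split (Python with MySQL Connector; the account names and the clearance check are hypothetical):

    import mysql.connector  # illustrative driver; any per-account grants work

    def connect_for(user):
        if user.needs_confidential_access:  # hypothetical attribute
            # Authenticate as the user's own DB account (e.g. backed by
            # PAM/LDAP), so the database itself decides whether the
            # confidential tables are visible.
            return mysql.connector.connect(user=user.db_account,
                                           password=user.db_password,
                                           database="crm")
        # Generic shared connection for regular application data.
        return mysql.connector.connect(user="app_generic",
                                       password="app-password",
                                       database="crm")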