Prevent duplicate values in fields that should be unique with PouchDB sync

I have a system that uses CouchDB as its database, and the clients connect through PouchDB, keeping a local copy of the database for offline use. The app has no backend API; it connects directly to the DB.
Many databases in the system contain one or more fields that should be unique (no other document should have the same value). Since CouchDB doesn't really have a "unique constraint" for fields, the uniqueness of documents is enforced through code on the client side. The issue comes from PouchDB's offline synchronization.
Let's say there is a pages object in the system with two fields that should be unique, name and slug. Through code we make sure that, before posting a new page, those two values do not already exist in the DB. Now suppose one PC goes offline for a day and creates a page with the slug "homepage", while on the same day a PC that was online created another page with the slug "homepage", now saved in the remote DB. When PC 1 comes back online, it will sync the local and remote DBs, skipping the validation code and adding a second "homepage" page.
One workaround is to set the must-be-unique field as the _id of the document and manage sync conflicts, but that is not feasible in a reasonable way for more than one unique field. (I would still appreciate an answer that only takes a single unique field into account, though.)
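To make that workaround concrete, here is roughly what I mean, sketched in TypeScript (the 'page:' prefix and the function names are just illustrative, not code I actually have):

// A minimal sketch of the "unique value as _id" workaround.
import PouchDB from 'pouchdb';

const localDB = new PouchDB('pages');

// Creating a page: the slug is baked into the _id, so a second document with
// the same slug is rejected with a 409 as long as we are working locally.
async function createPage(slug: string, name: string): Promise<void> {
  try {
    await localDB.put({ _id: `page:${slug}`, slug, name });
  } catch (err: any) {
    if (err.status === 409) {
      throw new Error(`A page with slug "${slug}" already exists locally.`);
    }
    throw err;
  }
}

// After sync, two offline clients that used the same _id end up with a
// document conflict instead of a duplicate, and it has to be resolved.
async function resolveSlugConflict(slug: string): Promise<void> {
  const doc: any = await localDB.get(`page:${slug}`, { conflicts: true });
  for (const rev of doc._conflicts ?? []) {
    // Pick a winner however the app decides (timestamp, user prompt, ...).
    await localDB.remove(doc._id, rev); // discard the losing revision
  }
}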
Also, in some cases it is less than ideal to use the _id as the unique field. For example, in a POS system, cashiers have a PIN to check in with when taking an order. Using a 4-digit PIN as the _id does not seem ideal.
Another option is to prompt the user for an action before syncing when a conflict is detected. But that would require a pre-sync phase that checks the whole database and interrupts the user, and I'm not sure how to implement it seamlessly from a user-experience point of view.
Any suggestions on how to handle this massive issue?

Related

Sync multiple local databases to one remote database

I need to create a system with local webservers on Raspberry Pi 4 running Laravel for API calls, websockets, etc. Each RPi will be installed at multiple customers' places.
For this project I want to have the ability to save/sync the database to a remote server (when the local system is connected to the internet).
Multiple locale databases => one remote database, customer based.
The question is how to synchronize the databases, properly identify each customer's data, and render it in a shared remote dashboard.
My first thought was to set a customer_id or a team_id on each table, but it seems dirty.
The other way is to create multiple databases on the remote server for the synchronization, plus one extra database to store customer ids and database connection information...
Has anyone already experimented with something like this? Is there a reliable and clean way to do it?
You refer to locale but I am assuming you mean local.
From what you have said you have two options at the central site. The central database can either store information from the remote databases in a single table with an additional column that indicates which remote site it's from, or you can set up a separate table (or database) for each remote site.
How do you want to use the data?
If you only ever want to work with the data from one remote site at a time, it doesn't really matter - in both scenarios you need to identify what data you want to work with and either build your SQL statement to filter by the appropriate column, or direct it to the appropriate table(s).
If you want to work on data from multiple remote sites at the same time, then using different tables requires that you use UNION queries to extract the data, and this is unlikely to scale well. In that case you would be better off using a column to mark each record with the remote site it references.
I recommend that you consider using UUIDs as primary keys - it may be that key collisions will not be an issue in your scenario, but if they become one, trying to alter the design retrospectively is likely to be quite a bit of work.
You also asked how to synchronize the databases. That will depend on what type of connection you have between the sites and the capabilities of your software, but typically you would have the local system periodically talk to a webservice at the central site. Assuming you are collecting sensor data or some such, the dialogue would be something like:
Client - Hello Server, my last sensor reading is timestamped xxxx
Server - Hello Client, [ send me sensor readings from yyyy | I don't need any data ]
You can include things like a signature check (for example an MD5 sum of the records within a time period) if you want to, but that may be overkill.
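To make the shape of that dialogue concrete, here is a rough TypeScript sketch of the client side; the endpoint paths, the payload fields, and the customerId value are hypothetical, not a prescription:

// Rough sketch of the periodic sync dialogue described above.
interface SyncHandshakeResponse {
  needData: boolean;
  sendFrom?: string; // ISO timestamp the server wants readings from
}

async function syncWithCentralSite(customerId: string, lastReadingAt: string) {
  // "Hello Server, my last sensor reading is timestamped xxxx"
  const res = await fetch('https://central.example.com/api/sync/handshake', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ customerId, lastReadingAt }),
  });
  const reply: SyncHandshakeResponse = await res.json();

  // "Send me sensor readings from yyyy" | "I don't need any data"
  if (reply.needData && reply.sendFrom) {
    const readings = await loadLocalReadingsSince(reply.sendFrom); // local DB query on the RPi
    await fetch('https://central.example.com/api/sync/readings', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ customerId, readings }),
    });
  }
}

// Placeholder for the local database query on the Raspberry Pi.
declare function loadLocalReadingsSince(since: string): Promise<unknown[]>;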

What is the recommended way to get another domain's data in a CQRS workflow?

I'm working on a microservice architecture with separate databases, but I need to replicate some data for resiliency.
As an example, let's say I'm working on a blog and have two domains, users and articles, each with its own database. In case the users microservice goes down, I still need to be able to show the article author's name.
-- in the 'users' domain's database
create table users (
    id   uuid primary key,
    name varchar(32)
);

-- in the 'articles' domain's database
create table articles (
    id          uuid primary key,
    author      uuid,         -- the user identifier sent when creating the article
    author_name varchar(32),  -- replicated copy, kept for resiliency
    contents    text
);
So when I'm creating an article, I send the user identifier.
My question is, at what point and how am I supposed to get the username?
I can't trust the user to send the real user name; it has to be fetched from somewhere in the system.
I can't fetch it from the controller, since it is in another domain.
I can't fetch it from the event handler, since it is in another domain.
I can't use a saga, since sagas are not supposed to make queries, only commands.
FWIW, my reference for these is this F.A.Q.
Thanks a lot for reading this; I hope you'll have a solution for me! Have a nice day.
My question is, at what point and how am I supposed to get the username?
1) You fetch the username from the local cache of reference data
2) Your reporting logic needs to support the case that the cache doesn't yet have a copy of the reference data
3) Your reporting logic needs to support the case that the cached copy of the reference data is stale.
Reference data here being shorthand for any information that the service needs, for which it isn't itself the authority.
So in a typical solution, the User service would have the authoritative copy/copies of the username, and all of the logic for determining whether or not a change to that value is allowed. The Articles service would have a local copy of that data, with metadata describing how long that information may be used.
The user database would have a copy of all of the information that it is responsible for. The article database would only have the slice of user information that the article service cares about.
A common way to implement this is to arrange a subscription, pulling the data from the users database to the articles database when the non-authoritative copy is no longer fresh.
You can treat the cache as a fallback position -- if we can't get timely access to the latest username, then use the cached copy.
But there's no magic - it will sometimes happen that the remote data is not available AND the local cache doesn't have a valid copy.
It may help to keep in mind that a lot of your data is already reference data -- copied into your local databases by the real world.
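A minimal sketch of that read path might look like this (TypeScript; articlesDb, usersServiceClient, and the freshness window are hypothetical, the point is the fallback order):

// "Cached reference data with fallback" in the Articles service.
interface CachedUserName {
  name: string;
  fetchedAt: Date;
}

const MAX_AGE_MS = 24 * 60 * 60 * 1000; // how long a cached name counts as fresh

async function getAuthorName(authorId: string): Promise<string> {
  const cached = await articlesDb.getCachedUserName(authorId); // local, non-authoritative copy

  // Fresh enough: use the local copy directly.
  if (cached && Date.now() - cached.fetchedAt.getTime() < MAX_AGE_MS) {
    return cached.name;
  }

  try {
    // Try the authoritative source (the users service).
    const fresh = await usersServiceClient.getUserName(authorId);
    await articlesDb.saveCachedUserName(authorId, fresh); // refresh the local copy
    return fresh;
  } catch {
    // Users service is down: fall back to the stale copy if we have one.
    if (cached) return cached.name;
    return '(unknown author)'; // no magic: sometimes there is simply no valid copy
  }
}

// Hypothetical collaborators, declared only to make the sketch self-contained.
declare const articlesDb: {
  getCachedUserName(id: string): Promise<CachedUserName | null>;
  saveCachedUserName(id: string, name: string): Promise<void>;
};
declare const usersServiceClient: {
  getUserName(id: string): Promise<string>;
};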
If I may ask, instead of having metadata then pulling the data periodically to update the cache, shouldn't I just replicate it once then listen for the 'username changed' event?
What happens if that event doesn't get delivered?
In distributed systems, it's really important to ask what happens if some process fails or some message is lost right at a critical point. How do you recover?
When I follow through that line of thinking, what I end up with is that client polling is the primary mechanism for retrieving reference data, and push notifications are latency optimizations that indicate we should poll now, rather than waiting for the entire scheduled interval.
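A rough sketch of that arrangement (TypeScript; pollReferenceData and the interval are placeholders):

// "Polling is primary, push is a latency optimization."
const POLL_INTERVAL_MS = 5 * 60 * 1000;

let timer: ReturnType<typeof setTimeout> | undefined;

function schedulePoll(delayMs: number) {
  if (timer) clearTimeout(timer);
  timer = setTimeout(async () => {
    await pollReferenceData();        // pull the latest usernames from the users service
    schedulePoll(POLL_INTERVAL_MS);   // and keep polling on the regular schedule
  }, delayMs);
}

// A 'username changed' notification doesn't replace polling; it just tells us
// to poll now instead of waiting out the rest of the interval. If the event is
// lost, the next scheduled poll still picks up the change.
function onUserNameChangedEvent() {
  schedulePoll(0);
}

schedulePoll(POLL_INTERVAL_MS);

declare function pollReferenceData(): Promise<void>;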

How did Facebook or Twitter implement their subscribe system

I'm working on an SNS-like mobile app project, where users upload their content and can see updates from their subscribed topics or friends on their homepage.
I store user content in MySQL, and I build each user's homepage by first querying who and what the user has subscribed to, and then querying the content table with a 'where userid IN (....) or topic IN (....)' clause.
I suspect this will become quite slow when the content table piles up or when a user subscribes to tons of users or topics. Our newly released app is already getting thousands of new users each week, and more over time, so scalability has to be a concern for us right now.
So I wonder how Facebook or Twitter handle this subscription problem with their huge number of users. Do they maintain a list for each user? I tried to search, but all I found is how to interact with Facebook or Twitter rather than how they actually implement this feature.
I've noticed that on Facebook you only see new updates rather than history in your feed, which means that subscribing to a new user won't dump lots of outdated content into your feed the way it would with my current method.
How did Facebook design their database, and how do they dispatch new content to subscribed users?
My backend is currently PHP+MySQL, and I don't mind introducing other backend technologies such as Redis or JMS and stuff if that's the way it should be done.
Sounds like you guys are still at a pretty early stage. There are any number of ways to solve this, all depending on how many DAUs you think you'll hit in the near term, how much money you have to spend on hardware, how much time you have to build it, etc.
You can try an interim table that queues up newly introduced items along with metadata about what they entail (which topic, the friend user_id list, etc.). Then use a queue-consumer system like RabbitMQ/Gearman to manage the consumption of this growing list and figure out who should process each item. Build the queue-consumer program in Scala or a J2EE stack like Maven/Tomcat, something that can persist. If you really want to stick with PHP, build a PHP REST API that lives in php5-fpm's memory, managed by the FastCGI process manager and called via a proxy like nginx, initiated by curl calls at an appropriate interval from a cron-executed script.
[EDIT] - It's probably better not to use a DB for a queueing system; use a cache server like Redis instead. It outperforms a DB in many ways and it can persist to disk (look up RDB and AOF). It's not very fault tolerant: if a job fails all of a sudden, you might lose the job record, but most likely you won't care about these crash edge cases. Also look up php-resque!
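This is not the php-resque/RabbitMQ setup itself, just a minimal TypeScript sketch of the same queue-consumer idea using a Redis list via the ioredis client; the queue name, job shape, and helper functions are made up:

import Redis from 'ioredis';

const redis = new Redis(); // assumes Redis on localhost:6379

interface FanoutJob {
  contentId: number;
  topicId: number;
}

// Producer: when new content is posted, enqueue a fan-out job.
async function enqueueFanout(job: FanoutJob): Promise<void> {
  await redis.lpush('fanout:queue', JSON.stringify(job));
}

// Consumer: a long-running worker that pops jobs and notifies subscribers.
async function runWorker(): Promise<void> {
  for (;;) {
    const popped = await redis.brpop('fanout:queue', 0); // block until a job arrives
    if (!popped) continue;
    const job: FanoutJob = JSON.parse(popped[1]);
    const userIds = await findSubscribers(job.topicId);  // e.g. from a user_topic-style table (see below)
    await notify(userIds, job.contentId);                // push/SMS/email per the user's notification metadata
  }
}

// Hypothetical helpers backed by the tables described in the next paragraph.
declare function findSubscribers(topicId: number): Promise<number[]>;
declare function notify(userIds: number[], contentId: number): Promise<void>;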
To prep for the SNS to go out efficiently, I'm assuming you're already de-normalizing the tables. I'd imagine a "user_topic" table with the topic mapped to users who subscribed to them. Create another table "notification_metadata" describing where users prefer receiving notifications (SMS/push/email/in-app notification), and the meta-data needed to push to those channels (mobile client approval keys for APNS/GCM, email addresses, user auth-tokens). Use JSON blobs for the two fields in notification_metadata, so each user will have a single row. This saves I/O hits on the DB.
Use user_id as your primary key for "notification_meta" and user_id + topic_id as PK for "user_topic". DO NOT add an auto-increment "id" field for either, it's pretty useless in this use case (takes up space, CPU, index memory, etc). If both fields are in the PK, queries on user_topic will be all from memory, and the only disk hit is on "notification_meta" during the JOIN.
So if a user subscribes to 2 topics, there'll be two entries in "user_topic", and each user will always have a single row in "notification_meta".
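For illustration, here is one way those two tables could be written, expressed as a knex migration in TypeScript; the exact column names and types are just one possible reading of the description above, not a definitive schema:

import type { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  // One row per (user, topic) subscription; composite PK, no auto-increment id.
  await knex.schema.createTable('user_topic', (t) => {
    t.bigInteger('user_id').notNullable();
    t.bigInteger('topic_id').notNullable();
    t.primary(['user_id', 'topic_id']);
  });

  // One row per user; JSON blobs keep channel preferences and push metadata
  // together in a single row to save I/O hits on the DB.
  await knex.schema.createTable('notification_metadata', (t) => {
    t.bigInteger('user_id').primary();
    t.json('channels');      // e.g. { sms: true, push: true, email: false }
    t.json('push_metadata'); // APNS/GCM keys, email address, auth tokens, ...
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('notification_metadata');
  await knex.schema.dropTable('user_topic');
}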
There are more ways to scale, like dynamically creating a new table for each new topic, sharding to different MySQL instances based on user_id, partitioning, etc. There are N ways to scale, especially in MySQL. Good luck!

Server-side functionality depending on whether a user "likes" a Facebook page with PHP/JS SDKs

I am trying to execute a MySQL database query on my website depending on whether a user has "liked" my Facebook page. I have found a few ways to do this using the PHP and JS SDKs, namely calling the API with /USER_ID/likes/PAGE_ID.
When a user has liked my page, I want to add a value to their data in my database. I thought of adding a function that is called each time the user visits the site: if they like the page, add the value to the database, and also store a boolean flag so it doesn't keep adding to the value. However, I figured this would be a waste of server calls if it happened on every visit, so I am not sure how to go about setting this up. Any ideas?
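For context, the check I had in mind looks roughly like this (TypeScript; it assumes the JS SDK is already loaded, the user granted the user_likes permission, and a hypothetical /api/mark-liked endpoint on my server):

declare const FB: { api: (path: string, cb: (response: any) => void) => void };

const PAGE_ID = 'YOUR_PAGE_ID';

function checkLikeAndRecord(): void {
  FB.api(`/me/likes/${PAGE_ID}`, (response) => {
    const hasLiked = Array.isArray(response?.data) && response.data.length > 0;
    if (hasLiked) {
      // Tell the server once; it adds the value, sets the boolean flag,
      // and ignores the call if the flag is already set.
      fetch('/api/mark-liked', { method: 'POST' });
    }
  });
}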
Unless you are dealing with huge volumes of users, I wouldn't worry about it, because a check like that on one row of an indexed MySQL table should be very quick (200 milliseconds or less for the entire request over a normal internet connection). And if the data you need is stored on the server, how could you possibly avoid the trip to the server? Unless you store the data in a cookie.

Django Log File vs MySql Database

So I am going to be building a website using the Django web framework. In this website, I am going to have an advertising component. Whenever an advertisement is clicked, I need to record it; we charge the customer every time a separate user clicks on the advertisement. So my question is, should I record all the click entries in a log file, or should I just create a Django model and record the data in a MySQL database? I know it's easier to create a model, but I'm worried about what happens if there is a lot of traffic to the website. Please give me some advice. I appreciate you taking the time to read and address my concerns.
Awesome. Thank you. I will definitely use a database.
Traditionally, this sort of interaction is stored in a DB. You could do it in a log, but I see at least two disadvantages:
log rotation
the fact that after logging you'll still have to process the data in a meaningful manner.
IMO, you could do it in a separate DB (see the multiple-database feature in Django). This way, the performance would be somewhat more balanced.
You should save all clicks to a DB. A database is created to handle the kind of data you are trying to save.
Additionally, a database will allow you to analyze your data much more simply than a flat file. If you want to graph traffic by country, by user agent, or by date range, this will be almost trivial with a database, but parsing gigantic log files could be much more involved.
A database will also be easier to extend. Right now you are just tracking clicks, but what happens if you want to start pushing advertisements that require some sort of additional user action or conversion? You will be able to extend this beyond clicks extremely easily in a database.