Do namespaces affect the 1 write per second per entity group limit?

I'm achieving strong consistency by making all the entities in my Google Cloud Datastore schema descendants of a single root entity. The entities are also partitioned by a per-user namespace. So the key for each entity looks like
[per-user namespace] -> ["RootEntityKind", CONSTANT] -> ["ChildEntityKind", Child_UUID]
So for the purposes of the 1 write per second per entity group limit, if I have N namespaces, does this mean that I have N entity groups or just one?

In Cloud Datastore, root entities in different namespaces are in different entity groups. This means for N namespaces, you'll have N entity groups.
In the new Datastore mode of Cloud Firestore, the recently announced upgrade to Cloud Datastore, there are no longer any entity-group-based limitations. You also no longer need to use entity groups to achieve strong consistency.
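To see why the namespace matters here, consider a minimal sketch using the google-cloud-datastore Python client (the project name, namespace names and child id below are hypothetical): the namespace is part of the key, so the same ancestor path under two namespaces yields two distinct root entities, and therefore two entity groups.

from google.cloud import datastore

# One client per per-user namespace (project and namespace names are hypothetical).
client_a = datastore.Client(project="my-project", namespace="user-a")
client_b = datastore.Client(project="my-project", namespace="user-b")

# The same ancestor path as in the question; the root entity defines the entity group.
key_a = client_a.key("RootEntityKind", "CONSTANT", "ChildEntityKind", "child-uuid-1")
key_b = client_b.key("RootEntityKind", "CONSTANT", "ChildEntityKind", "child-uuid-1")

# The namespace is part of the key, so these are two different root entities,
# two different entity groups, and two independent 1 write/sec budgets.
assert key_a != key_b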


Group chats and private chats, separate table or single table with type attribute?

Currently I have a users table and a chats table, but I want to support both "group chats" and "private chats" (DMs).
A group chat needs more data than a private chat, such as a group name, picture, and so on.
What is the best way to approach this?
Do I make one chats table with a type attribute that determines whether it is private, leaving some columns blank for private chats? Or do I make two tables, one for private chats and one for group chats?
This is a similar scenario to the general question "should you split sensitive columns into a new table" and the general answer is the same, it is going to depend largely on your data access code and your security framework.
What about a third option: why not just model a private chat as a group chat that only has 2 members? Splitting the model into these types is sometimes a premature optimisation, especially in the context of a chat-style application. For instance, couldn't a private chat benefit from having an image in the same way that a group chat does? Could there not be some benefit to users being able to give their own private group a name?
You will find the whole development and management of your application a lot simpler if there is just one type of chat and it is up to the user to decide how many people can join or indeed if other people can join the chat.
If you still want to explore the 2 conceptual types, this answer might give you some indirect insights: https://stackoverflow.com/a/74398184/1690217 but ultimately we need additional information to justify selecting one structure over the other. Performance, security and general data governance are some considerations that have implications or impose caveats on implementation.
From a structural point of view, your Group Chats and Private Chats can be both implementations of a common Chat table, conceptually we could say that both forms inherit from Chat.
In relational databases we have 3 general options to model inheritance:
Table Per Hierarchy (TPH)
Use a single table with a discriminator column that determines for each row what the specific type is. Then in your application layer or via views you can query the specific fields that each type and scenario needs.
In TPH the base type is usually an abstract type definition.
Table Per Type (TPT)
The base type and each concrete type exist as their own separate tables. The PK of each inheriting table is also an FK, sharing the same value as the corresponding record in the base table, creating a 1:0-1 relationship. This requires slightly more complicated data access logic, but it makes it harder to accidentally retrieve a Private Chat in a Group Chat context, because the data needs to be queried explicitly from the correct table. (See the sketch after this list for how TPH and TPT look in code.)
In TPT the base type is itself a concrete type and data records do not have to inherit into the extended types at all.
Simple Isolated Tables (No inheritance in the schema)
This is often the simplest approach. If your types do have inheritance in the application logic, then the common properties are replicated in each table. This can result in a lot of redundant data access logic, but OO inheritance in the application layer following the DRY principle solves most of the code redundancy issues.
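For illustration, here is a minimal sketch of the first two options using SQLAlchemy's declarative mapping (class, table and column names are just illustrative). In SQLAlchemy's vocabulary, TPH is "single table inheritance" and TPT is "joined table inheritance":

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# TPH ("single table inheritance"): one table, a discriminator column, and
# nullable columns for the subtype-specific fields.
class Chat(Base):
    __tablename__ = "chat"
    id = Column(Integer, primary_key=True)
    chat_type = Column(String(20), nullable=False)
    __mapper_args__ = {"polymorphic_on": chat_type, "polymorphic_identity": "chat"}

class GroupChat(Chat):
    # No __tablename__: these columns are added to the base "chat" table.
    group_name = Column(String(100))
    picture_url = Column(String(255))
    __mapper_args__ = {"polymorphic_identity": "group"}

# TPT ("joined table inheritance"): the subtype table's PK is also an FK to
# the base table, giving the 1:0-1 relationship described above.
class ChatTpt(Base):
    __tablename__ = "chat_tpt"
    id = Column(Integer, primary_key=True)
    chat_type = Column(String(20), nullable=False)
    __mapper_args__ = {"polymorphic_on": chat_type, "polymorphic_identity": "chat"}

class GroupChatTpt(ChatTpt):
    __tablename__ = "group_chat_tpt"
    id = Column(Integer, ForeignKey("chat_tpt.id"), primary_key=True)
    group_name = Column(String(100))
    __mapper_args__ = {"polymorphic_identity": "group"}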
This answer to How can you represent inheritance in a database? covers DB inheritance from a more academic and researched point of view.
From a performance point of view, there are benefits to isolating workloads if the usage patterns are significantly different. So if Group Chats have a different usage profile, because either the frequency or the type of queries differs significantly, or because the additional fields in Group Chat would benefit from their own index profiles, then splitting the tables will allow your database engine to provide better index management and execution plan optimisations due to more accurate capture of table statistics.
From a security and compliance point of view, a single table implementation (TPH) can reduce the data access logic and therefore the overall attack surface of the code. But a good ORM or code generation strategy usually mitigates any issues that might be raised in this space. Conversely, TPT or simple tables make it easier to define database or schema level security policies and constraints.
Ultimately, which solution is best for you will come down to the effort required to implement and maintain the application logic for your choice.
I will sometimes use a mix of TPT and TPH in the same database, but I often lean towards TPT if I need inheritance within the data schema; this old post explains my reasoning against TPH: Database Design: Discriminator vs Separate Tables with regard to Constraints. My general rule is that if the type needs to be polymorphic, either to be considered of both types or for the type context to somehow change dynamically at application runtime, then TPT or no inheritance is simpler to implement.
I use TPH when the differences between the types are minimal and not expected to diverge too much over the application's lifetime, but also when the usage and implementations are going to be very similar.
TPT provides a way to express inheritance but also to maintain a branch of vastly different behaviours or interactions (on top of the base implementation). Many TPT implementations look as if they might as well have been separate tables; the desire to constrain the 1:1 link between the records is often a strong decider when choosing this architectural pattern. A good way to think about this model, even if you do not use inheritance at the application logic level, is that you can extend the base record to include the metadata and behaviours of any of the inheriting types. In fact, with TPT it is hard to constrain the data records such that a record cannot be extended into multiple types.
Due to this limitation, TPT can often be modelled from the application layer as not using OO inheritance at all.
TPT complements Composition over Inheritance.
TPH is often the default way to model a Domain Model that implements simple inheritance, but it introduces a problem in application logic if you need to change the type, and it is incompatible with the idea that a single record could be both types. There are simple workarounds for this, but historically it causes issues from a code maintenance point of view; it's a clash of concepts really, as TPH aligns with Inheritance more than Composition.
In the context of Chat, TPT can work from a Composition point of view. All chats have the same basic features and interactions, but Group Chat records can have extended metadata and behaviours. Unless you envision Private Chat having a lot of its own specific implementation, there is not really a reason to extend the base concept of Chat into a Private Chat implementation at all.
For that reason too, though, is there a need to differentiate between private and group chats at all from a database perspective? Your application runtime shouldn't be using blind SELECT * style queries to access the data in either case; it should be requesting the specific fields that it needs for the given context. Whether you use a field in the table or the name of the table to discriminate between the different concepts is less important than being able to justify the existence of, or the difference between, those concepts.

Store "extended" metadata on entities stored in Azure Cosmos DB as JSON documents

We are building a REST API in .NET deployed to Azure App Service / Azure API App. From this API, clients can create "products" and query "products". The product entity has a set of common fields that all clients have to provide when creating a product, like the fields below (example):
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product"
}
We store these products currently as self-contained documents in Azure Cosmos DB.
Question 1: Partitioning.
The collection will not store a huge number of documents; we are talking about a maximum of around 2,500,000 documents of 1-5 kB each (estimates). We currently have chosen the id field (our system-generated id, not the internal Cosmos DB document id) as the partition key, which means 2,500,000 logical partitions with one document each. The documents will be used in some low-latency workloads, but these workloads will query by id (the partition key). Clients will also query by e.g. name, and then we have a fan-out query, but those queries will not be latency-critical. In the portal you can't create a single-partition collection anymore, but you can do it from the SDK or use a fixed partition key value. If we put all these documents in one single partition (we are talking about data far below 10 GB here), we will never get any fan-out queries, but will rely more on the index within the one logical partition. So the question: even though we don't have huge amounts of data, is it still wise to partition like we currently have done?
Question 2: Extended metadata.
We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?
Some brainstorming from me below.
1: Just dump everything in one self-contained document.
One option is to allow clients in the API to add a nested "extendedMetadata" field with key-value pairs when creating a product. Cosmos DB is schema agnostic, so in theory this should work fine. Some products can have zero extended metadata, while other products can have a lot of it. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming etc. The document size will then vary. These products will, as mentioned, still be used in latency-critical workloads that query by "id" (the partition key). The extended metadata will never be used in any latency-critical workloads. How much, and in what way, does document size affect performance / throughput in general? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve the document fields of interest? Or will the whole document always be loaded and processed regardless of which fields you want to query?
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product",
  "extendedMetadata": {
    "prop1": "prop1",
    "prop2": "prop2",
    "propN": "propN"
  }
}
The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:
api.org.com/products/{id} -- will always return a product with the basic common fields
api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)
2: Split the document
One option might be to do some kind of splitting. If a client from the API creates a product that contains extended metadata, we can implement some logic that splits the document if extendedMetadata contains data. I guess the split can be done in many ways; brainstorming below. I guess the main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case the document size plays a significant role here (in most cases, the clients will be OK with the basic common fields).
- One basic document that only contains the basic common fields, and one extended document that (with the same id) contains the basic common fields + extended metadata (duplicating the basic common fields). We can add a "type" field that differentiates between the basic and the extended document. If a client asks for extended, we will only query documents of type "extended".
- One basic document that only contains the basic common fields + a reference to an extended document that only contains the extended metadata. This means a read operation where the client asks for a product with extended metadata requires reading two documents.
- Look into splitting across different collections: one collection holds the basic documents, with throughput dedicated to low-latency read scenarios, and one collection holds the extended metadata.
Sorry for the long post. I hope this was understandable; looking forward to your feedback!
Answer 1:
If you can guarantee that the total size of the documents will never be more than 10 GB, then creating a fixed collection is the way to go, for two reasons.
First, there is no need for a cross-partition query. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every single physical partition looking for data.
(Keep in mind, however, that every time people think they can guarantee things like the max size of something, it usually doesn't work out.)
The /id partitioning strategy is only efficient if you can ALWAYS provide the id. This is called a read. If you need to search by any other property, it means that you are performing a query. This is where the system wouldn't do so well.
Ideally you should design your Cosmos DB collection in a way that you never do a cross-partition query as part of your everyday workload. Maybe once in a blue moon for reporting reasons.
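To make the read-versus-query distinction concrete, here is a minimal sketch using the azure-cosmos Python SDK (the account endpoint, key, database and container names are hypothetical):

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<account-key>")
container = client.get_database_client("catalog").get_container_client("products")

product_id = "cbf3f7aa-4743-4198-b307-260f703c42c1"

# Point read: id + partition key (the same value in this design). No query
# engine involved; this is the cheapest, lowest-latency path.
product = container.read_item(item=product_id, partition_key=product_id)

# Lookup by any other property is a query; with /id as the partition key it
# has to fan out across partitions.
matches = list(container.query_items(
    query="SELECT c.id, c.name FROM c WHERE c.name = @name",
    parameters=[{"name": "@name", "value": "Product One"}],
    enable_cross_partition_query=True,
))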
Answer 2:
Cosmos DB is a NoSQL schema-less database for a reason.
The second approach in your brainstorming would fit a traditional RDBMS, but that's not what we have here.
You can simply go with your first approach and either have everything under a single property or just have the extra fields at the top level.
Remember that you can map the response to any object that you want, so you can simply have 2 DTOs, a slim and an extended version, and map to the different versions depending on the endpoint.
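The question's stack is .NET, but the two-DTO idea is language-agnostic; here is a minimal Python sketch of that mapping (the type and field names are hypothetical):

from dataclasses import dataclass, field

@dataclass
class ProductSlim:
    id: str
    name: str
    description: str

@dataclass
class ProductExtended(ProductSlim):
    extendedMetadata: dict = field(default_factory=dict)

def to_dto(doc: dict, extended: bool = False):
    """Map one Cosmos DB document to the DTO the endpoint should return."""
    slim = {k: doc[k] for k in ("id", "name", "description")}
    if extended:
        return ProductExtended(**slim,
                               extendedMetadata=doc.get("extendedMetadata", {}))
    return ProductSlim(**slim)

The /products/{id} endpoint returns to_dto(doc), while /products/{id}/extended returns to_dto(doc, extended=True), both reading the same self-contained document.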
Hope this helps.

Optimal DB structure for creating user segments

I want to create a segmentation engine and can't seem to figure out the optimal DB or DB structure to use for the task.
Currently I use MySQL as my primary DB, but the segmentation engine is a separate software component and thus can have a different DB if applicable.
Basically I have 10 million unique users identified by UserID (integer). The administrator of the segmentation engine dynamically creates segments with some predefined rules (like age range, geolocation, transaction history etc.). The application should execute the rules of each segment periodically (once every 15 minutes) to get the current list of all users that belong to the segment (which can be up to 1 million users each) and store it.
Later application is exposing the API to allow external systems use segmentation functionality, namely:
1. Get list of all segments that a particular UserID belongs to.
2. Get list of all UserIDs that a particular segment contains.
Note that because segments need to be updated very frequently (every 15 min), this causes a massive volume of transactions in the DB to "maintain" the segments, where no-longer-applicable users are removed and new ones added all the time.
I have considered several approaches so far:
1. Plain MySQL, where I have a table of users belonging to segments (SegmentID, UserID). (This approach has 2 drawbacks: storage space, and constant delete/insert/update in MySQL, which will degrade InnoDB performance by introducing page splitting.)
2. Using JSON data types in MySQL, where I can have a table (UserID, Segments), where Segments is a JSON array of SegmentIDs. (Drawbacks here are slow search and slow updates.)
3. Using Redis with Sets (UserID, Segments), where UserID will be the key and Segments will be the set of SegmentIDs. (The drawback here is that there is no simple way to search by SegmentID.)
Has anyone worked on a similar task and can provide any guidance?
Any feedback will be appreciated, so I can be pointed in a direction that I can research further.
I think you can use Elasticsearch for this task.
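For instance (a rough sketch with the elasticsearch Python client; the endpoint, index name and example rule are all hypothetical), each user becomes a document holding the attributes the rules filter on, and a segment is just a stored query that you re-run every 15 minutes:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index each user as a document with the attributes the segment rules use.
es.index(index="users", id="42",
         document={"age": 31, "country": "DE", "transactions_30d": 7})

# A segment is a stored query; re-run it on schedule and persist the hits.
segment_rule = {"bool": {"must": [
    {"range": {"age": {"gte": 25, "lte": 35}}},
    {"term": {"country": "DE"}},
]}}
resp = es.search(index="users", query=segment_rule, size=10000)
user_ids = [hit["_id"] for hit in resp["hits"]["hits"]]
# For segments approaching 1M users you would page with search_after or a
# point-in-time search rather than a single request.

This also answers both API calls: "which users are in segment S" is the stored query itself, and "which segments does user U belong to" can be answered by running the rules against that one user's document (or by also storing the computed segment list on the document).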

How do you organize buckets in Riak?

Since Riak uses buckets as a way of separating keys, is it possible to have buckets within buckets? If not, how would one go about organizing a Riak setup with many buckets for several apps?
The basic problem is how one would go about representing "databases" and "tables" within Riak. Since a bucket translates to a table, what translates to a database?
Namespaces in programming languages usually have hierarchies. It makes sense for Riak buckets to also allow hierarchies, since buckets are essentially namespaces.
You need to think of Riak as one very big key -> value "table", where buckets are only prefixes for keys. Once you know that, you can do anything with buckets as long as they are still binary objects.
You can create linear "tables":
<<"table1">>
<<"table2">>
Or you can create hierarchies:
<<"db1.table1">>
<<"db1.table2">>
<<"db2.table1">>
<<"db2.table2">>
Or you can even use tuples as buckets:
1> term_to_binary({"db1", "table1"}).
<<131,104,2,107,0,3,100,98,49,107,0,6,116,97,98,108,101,49>>
2> term_to_binary({"db1", "table2"}).
<<131,104,2,107,0,3,100,98,49,107,0,6,116,97,98,108,101,50>>
3> term_to_binary({"db2", "table1"}).
<<131,104,2,107,0,3,100,98,50,107,0,6,116,97,98,108,101,49>>
4> term_to_binary({"db2", "table2"}).
<<131,104,2,107,0,3,100,98,50,107,0,6,116,97,98,108,101,50>>

Multiple Mappers for the same class in different databases

I am currently working on a Wikipedia API, which means that we have a database for each language we want to use. The structure of each database is identical; they only differ in their language. The only place where this information is stored is in the name of the database.
When starting with one language, the straightforward approach of mapping the tables to the needed classes (e.g. Page) looked fine. We defined an engine and corresponding metadata. When we added a second database with its own setup for engine and metadata, we ran into the following error:
ArgumentError:
Class '<class 'wp.orm.types.pages.Page'>' already has a primary mapper defined.
Use non_primary=True to create a non primary Mapper. clear_mappers() will remove
*all* current mappers from all classes.
I found an email saying that there must be at least one primary mapper, so using this option for all databases doesn't seem feasible.
The next idea is to use sharding. For that we need a way to distinguish between the databases from the perspective of an instance, as noted in the docs:
"You need a function which can return a single shard id, given an instance to be saved; this is called 'shard_chooser'"
I am stuck here. Is there a way to get the database name given an object it was loaded from? Or a possibility to add a static attribute based on the engine? The alternative would be to add a language column to every table, which is just ugly.
Am I overlooking other possibilities? Any ideas how to define multiple mappers for the same class that map against tables in different databases?
I asked this question on a mailing list and got this answer from Michael Bayer:

if you'd like distinct classes to indicate that they "belong" in a different database, and you have very clear lines as to how this is performed, use the "entity_name" concept described at http://www.sqlalchemy.org/trac/wiki/UsageRecipes/EntityName . this sounds very much like your use case.

(quoting the question:) The next idea is to use sharding. For that we need a way to distinguish between the databases from the perspective of an instance, as noted in the docs: "You need a function which can return a single shard id, given an instance to be saved; this is called 'shard_chooser'"

horizontal sharding is a method of storing many homogeneous instances across multiple databases, with the implication that you're creating one big "virtual" database among partitions - the main concept is that an individual instance gets placed in different partitions based on some ruleset. This is a little like your use case as well, but since you have a very simple delineation i think the "entity name" approach is easier.
So the basic idea is to generate anonymous subclasses for each desired mapping, distinguished by the entity_name. The details can be found in Michael's link.
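Note that the entity_name extension dates from old SQLAlchemy and no longer exists in modern versions. A rough modern equivalent of the same idea is to generate one mapped subclass, with its own Base, metadata and engine, per language database. A minimal sketch, with hypothetical connection strings and names:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# Shared column definitions live on a mixin; each language gets its own Base,
# so the identically named "page" tables never collide in one MetaData.
class PageColumns:
    id = Column(Integer, primary_key=True)
    title = Column(String(255))

def make_page_class(language):
    """Generate a Page class mapped against one language's database."""
    engine = create_engine(f"sqlite:///wiki_{language}.db")  # hypothetical URL
    Base = declarative_base()
    cls = type(f"Page_{language}", (PageColumns, Base), {"__tablename__": "page"})
    Base.metadata.create_all(engine)
    return cls, sessionmaker(bind=engine)

# One mapped class per language; no "already has a primary mapper" clash.
EnPage, EnSession = make_page_class("en")
DePage, DeSession = make_page_class("de")

Because each generated class carries its language in its identity, the database an instance "belongs" to is always recoverable from type(instance), which is exactly what the entity_name recipe provided.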