I have an application with entities like User, Message and MessageFeatures. Each User can have many messages and each message has a MessageFeatures entity. Currently the relational model is expressed as:
User {
  UUID id
  String email
  ...
}
Message {
  UUID id
  UUID userId
  String text
  ...
}
MessageFeatures {
  UUID id
  UUID messageId
  UUID userId
  PrimitiveObject feature1
  ...
  PrimitiveObject featureN
}
The most important queries are:
Get all messages for user
Get all message features for a user
Get message by uuid
Get/Update message feature by uuid
Get message feature by message uuid
Less important (can be slow) queries are:
Get message features where user_id = someuuid and featureX = value
Get all/count user uuids for which featureX = value
update message features set featureX = newValue where featureX = oldValue
While evaluating Couchbase I am unable to arrive at a proper data model. I do not think putting all messages and message features for a user in a single document is a good idea, because the document size would keep growing; based on current data it would easily reach 4-5 MB for two years of data. Also, since atomicity is per document, to maintain consistency I could only update one message feature at a time.
If I do not place them in a single document, they will be scattered around the cluster, and queries like "get all messages/message features of a user" will result in scatter and gather.
I have checked out global secondary indexes and N1QL, but even if I index the user_uuid field of messages it will only help in fetching the message_uuids of that user; loading all the messages will still result in scatter and gather.
Is there a way to force all messages and message features of a user_uuid to be mapped to the same physical node without embedding them in the same document, something like hash tags in Redis?
Translate the relational model above directly to Couchbase and create GSI indexes for the relationship (id) fields. Use EXPLAIN to make sure every query uses an index, and for direct lookups by id use USE KEYS.
Scatter/gather in Couchbase means something different from what you describe: it happens when a single index scan has to visit several nodes and then merge the partial results (a distributed index, as with the older view indexes). A GSI index, by contrast, lives on a single node, so GSI-backed queries avoid scatter/gather.
Finally, Couchbase is fast at key-value fetches even across nodes, so you generally do not need to worry about data locality.
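A minimal sketch with the 2.x-era Couchbase Python SDK (the bucket name, key scheme, and type field are assumptions, not taken from your model):

from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery

bucket = Bucket('couchbase://localhost/app')  # hypothetical bucket

# One-time GSI on the relationship field (normally run once from cbq or the UI).
list(bucket.n1ql_query(
    'CREATE INDEX idx_msg_user ON app(userId) WHERE type = "message"'))

# "Get all messages for user" is served by the index above.
q = N1QLQuery(
    'SELECT m.* FROM app m WHERE m.type = "message" AND m.userId = $uid',
    uid='some-user-uuid')
messages = list(bucket.n1ql_query(q))

# Direct lookup by document key: USE KEYS (or just a plain key-value bucket.get()).
by_key = list(bucket.n1ql_query(
    N1QLQuery('SELECT m.* FROM app m USE KEYS ["message::some-message-uuid"]')))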
After much searching I could not get an answer on how to implement one of the most basic DB inserts.
(The fact that I could not get an answer even addressing this concept makes me think that I am just not thinking in the Kafka way.)
A much used example would be where a customer places an order and the order contains many products.
When the order is inserted, the traditional method would be to insert the order with all the data that it may contain (userID, payment, discount etc.), get back the primary_key, set this key as a foreign key on all the products, and then insert all the products.
This of course later has the effect that you can simply query the order and then, using the foreign key, get all the products that belong to that order.
So I am wondering how this would be implemented in Kafka?
This is what I was thinking:
I would produce an order from the front end via a REST API into a Kafka initial-orders topic. A consumer of this topic would then write it to my ordersDB.
The DB would then, acting as a producer, write the order together with its primary_key into a Kafka final-orders topic.
After inserting the order, my frontend would make repeated fetchRequests (maybe using KSQL) against the Kafka final-orders topic, looking for an order that contains all the same values as the one it just inserted.
When it gets a result, it inserts the primary key into the products and sends them to the Kafka products topic, which then writes them into the products DB.
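To make the produce step concrete, this is roughly what I have in mind (using the kafka-python client; topic and field names are the ones from my description above, and request_id is just something I would add so I can match the reply later):

import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# The raw order as it arrives from the REST endpoint, before any DB key exists.
order = {
    'request_id': str(uuid.uuid4()),  # client-side id to match the reply later
    'userID': 'user-123',
    'payment': 'credit-card',
    'discount': 0.1,
}
producer.send('initial-orders', order)
producer.flush()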
My problem with this approach is:
What happens when my front end does not receive data back from the repeated fetchRequests? Do I time out after, let's say, a second and try to reverse everything? What if the network is slow?
Should I rather send all the data (order and products) to Kafka at once, let Kafka sort it out internally, and only send back a 200 OK if everything went well?
Is there any mechanism available to prevent duplicate data in certain fields of my entities? Something similar to a SQL UNIQUE constraint.
Failing that, what techniques do people normally use to prevent duplicate values?
The only way to do the equivalent of a SQL UNIQUE constraint will not scale very well in a NoSQL storage system like Cloud Datastore, mainly because it requires a read before every write and a transaction surrounding the two operations.
If that's not an issue (i.e., you don't write values very often), the process might look something like:
Begin a serializable transaction
Query across all Kinds for a match of property = value
If the query has matches, abort the transaction
If there are no matches, insert new entity with property = value
Commit the transaction
Using gcloud-python, this might look like...
from gcloud import datastore

client = datastore.Client()

with client.transaction(serializable=True) as txn:
    query = client.query(kind='MyKind')
    query.add_filter('property', '=', 'value')
    if list(query.fetch(limit=1)):
        # A matching entity already exists; raising aborts (rolls back) the transaction.
        raise ValueError("an entity with property == 'value' already exists")
    # No match: stage the new entity; leaving the block commits the transaction.
    entity = datastore.Entity(key=client.key('MyKind'))
    entity['property'] = 'value'
    txn.put(entity)
Note: The serializable flag on Transactions is relatively new in gcloud-python. See https://github.com/GoogleCloudPlatform/gcloud-python/pull/1205/files for details.
The "right way" to do this is to design your data such that the key is your measure of "uniqueness", but without knowing more about what you're trying to do, I can't say much else.
The approach given above will not work in the Datastore, because you cannot do a query across arbitrary entities inside a transaction. If you try, an exception will be thrown.
However, you can do it by using a separate kind for each unique field and doing a get (lookup by key) within the transaction.
For example, say you have a Person kind and you want to ensure that Person.email is unique: you also need a kind such as UniquePersonEmail. It does not need to be referenced by anything; it is there purely to enforce uniqueness.
start transaction
get UniquePersonEmail with id = theNewAccountEmail
if exists abort
put UniquePersonEmail with id = theNewAccountEmail
put Person with all the other details including the email
commit the transaction
So you end up doing one read and two writes to create your account.
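A rough sketch of that pattern with gcloud-python (the kind names come from the example above; the email value is a placeholder):

from gcloud import datastore

client = datastore.Client()
new_email = 'alice@example.com'  # placeholder value

with client.transaction() as txn:
    # A get by key is allowed inside a transaction (unlike an arbitrary query).
    unique_key = client.key('UniquePersonEmail', new_email)
    if client.get(unique_key) is not None:
        # Raising aborts the transaction, so nothing is written.
        raise ValueError('that email is already taken')

    marker = datastore.Entity(key=unique_key)
    person = datastore.Entity(key=client.key('Person'))
    person['email'] = new_email
    txn.put(marker)
    txn.put(person)
    # Leaving the block commits both puts: one read, two writes.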
I am working on a "save" application: basically, a user can go to an article and click save to store it in their profile. Instead of a relational database, the application currently uses DynamoDB. Each article has a specific type. The structure currently used by the application is:
user-id [string][DynamoDBHashKey]
type-of-article [string] [DynamoDBRangeKey]
json [string]
user-id is the unique identifier for the user, type-of-article is, well, the type of the article, and json holds all the articles saved for that type in a JSON format like this:
[{article-id: timestamp}, {article-id: timestamp}]
(the first map is Article #1, the second is Article #2)
article-id is (again) the article's unique identifier, and timestamp is when that article was saved.
Note: this was done before DynamoDB supported JSON documents as Maps and Lists, and the code is not mine; it was already in place.
So when the application needs to remove a saved article, it calls DynamoDB to get the JSON, modifies it, and stores it again; adding a new article works the same way. A problem appeared when I wanted to display all the articles ordered by timestamp: I had to fetch all the types and merge them into a dictionary just to sort them (in the user profile I need to show all saved articles, regardless of type, sorted). Now the application takes 700-900 ms to respond.
Personally I don't think this is the best approach, so I'm thinking of rewriting the code to use the newer DynamoDB features (Lists and Maps). My idea for the structure is:
user-id [string] [DynamoDBHashKey]
saved-articles [List]
    article-type_1
        article_1 [Map] {id: article-id, timestamp: date}
        article_2 [Map] {id: article-id, timestamp: date}
    article-type_2
        article_1 [Map] {id: article-id, timestamp: date}
I'm relatively new to DynamoDB, but I made some test code to store this structure using Lists and Maps, both with the low-level API and with the Object Persistence Model.
Now, my question is: is this a better approach, and if not, why? What would be a better approach?
This way I think I can use the low-level API to fetch only the saved articles of article-type #2, or fetch everything if I need it all.
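My rough test snippet with boto3 looks like this (the table name and key value are placeholders, ExpressionAttributeNames are needed because of the hyphens, and it assumes saved-articles ends up as a Map keyed by article type):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('saved-articles')  # placeholder table name

# Fetch only the article-type_2 entries of the nested saved-articles attribute.
resp = table.get_item(
    Key={'user-id': 'some-user-id'},
    ProjectionExpression='#s.#t',
    ExpressionAttributeNames={'#s': 'saved-articles', '#t': 'article-type_2'})
item = resp.get('Item')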
I would stick with a solution that is a bit more NoSQL-like. For NoSQL databases, nested data models and in-place updates of existing records are often signs that the data model can be optimized. I really see two objects in your application, 'users' and 'articles', and I would avoid the nested data model and in-place updates by doing the following:
'user' table
user id as hash key
'article' table
article id as hash key
timestamp as range key
user id (used in global secondary index described below)
article type and any other attributes would be non-key attributes
You'd also have a global secondary index on the article table that would allow you to search for articles by user id; it would look something like this (assuming you want a user's articles sorted by date):
user id as hash key
timestamp as range key
article id as projected attribute
With this model you never need to go back and edit existing records: you just add 'edited' versions as new records and take the one with the most recent timestamp as the current version.
One thing to remember with NoSQL is that storage space is cheap and reads are cheap, but editing existing records is usually an expensive and undesirable operation.
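A sketch of the article table and its global secondary index using boto3 (table, index, and attribute names are placeholders, timestamps are assumed to be numeric, and the throughput values are arbitrary):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')

article_table = dynamodb.create_table(
    TableName='articles',
    KeySchema=[{'AttributeName': 'article_id', 'KeyType': 'HASH'},
               {'AttributeName': 'timestamp', 'KeyType': 'RANGE'}],
    AttributeDefinitions=[{'AttributeName': 'article_id', 'AttributeType': 'S'},
                          {'AttributeName': 'timestamp', 'AttributeType': 'N'},
                          {'AttributeName': 'user_id', 'AttributeType': 'S'}],
    GlobalSecondaryIndexes=[{
        'IndexName': 'user_id-timestamp-index',
        'KeySchema': [{'AttributeName': 'user_id', 'KeyType': 'HASH'},
                      {'AttributeName': 'timestamp', 'KeyType': 'RANGE'}],
        # KEYS_ONLY still projects the table keys, so article_id comes back.
        'Projection': {'ProjectionType': 'KEYS_ONLY'},
        'ProvisionedThroughput': {'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}}],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5})

# A user's articles, newest first, via the GSI:
resp = dynamodb.Table('articles').query(
    IndexName='user_id-timestamp-index',
    KeyConditionExpression=Key('user_id').eq('some-user-id'),
    ScanIndexForward=False)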
Say my Couchbase DB has millions of user objects, and each user object contains some primitive fields (score, balance, etc.).
And say I read & write most of those fields on every server request.
I see 2 options of storing the User object in Couchbase:
A single JSON object mapped to a user key (e.g. user_555)
Mapping each field into a separate entry (e.g. score_555 and balance_555)
Option 1 - Single CB lookup, JSON parsing
Option 2 - Twice the lookups, less parsing if any
How can I tell which one is better in terms of performance?
What if I had 3 fields? What if 4? Does it make a difference?
Thanks
Eyal
Think about your data structure and access patterns first, before worrying about whether JSON parsing or extra lookups will add overhead to your system.
From my perspective and experience, I would model documents around logical object groupings and store 'user' attributes together. If you stored each field separately, you'd have to do a series of lookups whenever you wanted to give a client or service a full overview of the player profile.
I've used Couchbase as the main data store for a social mobile game: we store 90% of user data in a user document containing all the relevant fields such as score, level, progress, etc. For the majority of operations, such as a new score or an upgrade, we want to work with the whole User object in the application layer, so it makes sense to inflate the user object from the Couchbase document, read or alter what we need, and persist it again if there have been changes.
The only time we have id references to other documents is for player purchases, where we keep an array of ids that each reference a separate purchase document. We do this because we wanted richer information on each purchase (date of transaction, transaction id, product type, etc.) that isn't relevant to the user document: when a purchase is made we verify it's legitimate, add it to the User's inventory, and create the separate purchase document.
So our structure is:
UserDoc:
-Fields specific to a User (score, level, progress, friends, inventory)
-An array of IDs pointing to specific purchase documents
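As a rough illustration (the field values and key scheme are made up), the two document shapes look something like:

# Hypothetical document shapes, keyed by e.g. "user::555" and "purchase::<id>".
user_doc = {
    'type': 'user',
    'score': 1200,
    'level': 7,
    'progress': 0.42,
    'friends': ['user::123', 'user::987'],
    'inventory': ['sword', 'shield'],
    'purchase_ids': ['purchase::a1', 'purchase::b2'],  # references, not embedded docs
}

purchase_doc = {
    'type': 'purchase',
    'user_id': 'user::555',
    'transaction_id': 'txn-001',
    'product_type': 'coin_pack',
    'date': '2015-06-01T12:00:00Z',
}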
The only time I'd consider splitting out specific fields as you outlined would be if your user document got seriously large, and even then I think it's best to divide documents by groupings of data rather than by individual fields.
Hope that helped!
I am in the process of creating a simple activity stream for my app.
The current technology layers and logic are as follows:
All data relating to an activity is stored in MySQL, and an array of activity IDs is kept in Redis for every user.
A user performs an action; the activity is stored directly in an 'activities' table in MySQL and a unique 'activity_id' is returned.
An array of this user's 'followers' is retrieved from the database, and for each follower I push the new activity_id onto their list in Redis.
When a user views their stream, I retrieve the array of activity IDs from Redis based on their user ID, then run a simple MySQL WHERE IN ($ids) query to get the actual activity data for those IDs (see the sketch below).
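Concretely, the write and read paths look roughly like this (using redis-py and a DB-API cursor; the key naming is just illustrative):

import redis

r = redis.StrictRedis()

def fan_out(activity_id, follower_ids):
    # Push the new activity id onto each follower's stream list.
    for follower_id in follower_ids:
        r.lpush('stream:%s' % follower_id, activity_id)

def read_stream(user_id, db_cursor):
    activity_ids = r.lrange('stream:%s' % user_id, 0, -1)
    if not activity_ids:
        return []
    placeholders = ', '.join(['%s'] * len(activity_ids))
    db_cursor.execute(
        'SELECT * FROM activities WHERE id IN (%s)' % placeholders,
        activity_ids)
    return db_cursor.fetchall()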
This setup should, I believe, be quite scalable, as the queries will always be very simple IN queries. However, it presents several problems.
Removing a follower - If a user stops following someone, we need to remove all activity_ids that correspond to that user from the follower's Redis list. This requires looping through all IDs in the Redis list and removing the ones that belong to the unfollowed user. This strikes me as quite inelegant; is there a better way of managing this?
'Archiving' - I would like to cap the Redis lists at, say, 1000 activity_ids, and also frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. The Redis cap can obviously be achieved by removing old IDs from a user's stream list whenever a new one is added. However, I am unsure how to go about archiving this data so that users can still view very old activity if they choose to. What would be the best way to do this? Or am I simply better off enforcing the limit strictly and preventing users from viewing very old activity data?
To summarise: what I would really like to know is whether my current setup/logic is a good or bad idea. Do I need a total rethink? If so, what are your recommended models? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and all answers will be opinion-based, but that is exactly what I am looking for: well-formed opinions.
Many thanks in advance.
Point 1 doesn't seem so difficult to perform (no looping), treating the per-user stream as a joinable table (called Redis below):
DELETE Redis
FROM Redis
JOIN activities
  ON Redis.activity_id = activities.id
 AND activities.user_id = 2   -- the unfollowed user's activities
 AND Redis.user_id = 1        -- the follower whose stream is being pruned
;
Point 2: I'm not really sure about archiving. You could create archive tables per period and move old activities from the main table into them periodically; see the sketch below. That said, a single properly normalized activity table ought to be able to get pretty big. (Make sure any "large" activity stores its payload in a separate table; the main activity table should stay "narrow", since it's expected to have a lot of rows.)
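A minimal sketch of such a periodic job, assuming a pymysql connection, an activities_archive table with the same schema as activities, and a created_at column (none of which appear in the question):

import pymysql

conn = pymysql.connect(host='localhost', user='app', password='secret', db='app')
try:
    with conn.cursor() as cur:
        # Copy activities older than ~90 days into the archive table...
        cur.execute(
            "INSERT INTO activities_archive "
            "SELECT * FROM activities WHERE created_at < NOW() - INTERVAL 90 DAY")
        # ...then remove them from the hot table.
        cur.execute(
            "DELETE FROM activities WHERE created_at < NOW() - INTERVAL 90 DAY")
    conn.commit()
finally:
    conn.close()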