Using Couchbase with thousands of different schemas - couchbase

Consider a multi-tenant application in which tenants are free to model their own schemas. I.e.: backend-as-a-service.
With these requirements a 'table' per bucket is undoable. Instead, I'm thinking of simply having an attribute 'schema-id' define the id of the schema. Each 'schema-id' is a compound key based on tenantId + schemaid.
As far as retrieval goes only 'get by id' should be supported. In that sense I'm only using Couchbase as a k/v store instead of a documents store.
Any caveats to the above? Would the sheer number of entities per bucket be a problem? Any other things to think about?

The key pattern idea sounds great to me. You will have to make sure your cluster is sized correctly and stays sized correctly over time.
If you wanted to really control everything tightly, you could even front the whole thing with a simple REST API. Then you could control access tightly, control that key pattern, etc. Each user of the service would get an API key that would give them a session.

Going with a different bucket per schema will not scale, because I think there is a restriction of only 10 buckets in Couchbase.
Since the key is known by the client, we can map the data from Couchbase to a particular class, because the key tells us which schema the document follows.
For example, if the keys are PRODUCT_1234 and USER_12345, we know that the first document is of type PRODUCT and the second is of type USER.
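A minimal sketch of the compound-key idea discussed above. The `tenant::schema::doc` layout, the `::` separator, and the function names are assumptions made for illustration, not something Couchbase prescribes. A plain dict stands in for the bucket so the example needs no SDK.

```python
SEP = "::"  # hypothetical separator chosen for this sketch


def make_key(tenant_id: str, schema_id: str, doc_id: str) -> str:
    """Build a compound document key from its three parts."""
    parts = (tenant_id, schema_id, doc_id)
    # Reject empty parts and parts containing ':' so the key can be split again.
    if any(not p or ":" in p for p in parts):
        raise ValueError("key parts must be non-empty and must not contain ':'")
    return SEP.join(parts)


def parse_key(key: str) -> tuple[str, str, str]:
    """Recover (tenant_id, schema_id, doc_id) from a compound key."""
    tenant_id, schema_id, doc_id = key.split(SEP)
    return tenant_id, schema_id, doc_id


bucket = {}  # stand-in for a key-value store
key = make_key("tenant42", "PRODUCT", "1234")
bucket[key] = {"name": "widget"}
```

Because the schema is encoded in the key, the client can pick the right class before it even deserializes the value.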

Related

Saving Thales keys to different location than KMDATA/local

I'm generating cryptographic keys using Thales HSMs. The encrypted keys are stored in /opt/nfast/kmdata/local. Since I may need to generate a very large number of keys (over 20,000), I thought storing all the keys in a single directory wouldn't be the best option (I'm mainly afraid of performance issues).
I would like to either split the local directory into subdirectories or, ideally, store the keys in an RDBMS.
Is there any "standard" way to change the default HSM behavior?
Ask Thales support about it (my bet is that it is not possible to change).
There are some ideas how you might deal with this situation:
1. OS level
use OS and filesystem suitable for storing that many files in a single directory
2. Application level
use key diversification instead of key generation (if possible for your use case) -- i.e. if you need to provide keys for thousands of entities use a master key and diversify all keys for your entities using this master key and some diversification data (e.g. entity identity / serial number / etc.). This way you don't need to store thousands of keys as you simply diversify them on-demand. Remember to carefully analyze if you can use key diversification at all (as there are some consequences)
store those keys encrypted outside the security world (if possible for your use case) -- i.e. import/generate each key as a temporary object and immediately wrap it using some persistent wrapping key (you could call this key the LMK). Safely store the wrapped value in an RDBMS (or anywhere else) and delete the temporary object. Later, when you need to access this particular key, you simply unwrap it back into a temporary object and use it. Again, this approach has some consequences and you must analyze thoroughly whether it can be used in your situation
Good luck with your project!
Disclaimer: I am no crypto expert so please do validate my thoughts.
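To make the key diversification idea concrete, here is a minimal software-only sketch. It assumes HMAC-SHA256 as the derivation function, which is my choice for illustration; the answer does not name an algorithm, and in a real deployment the derivation would run inside the HSM through the vendor's own interface, which is not shown here. The function name and the demo key are hypothetical.

```python
import hashlib
import hmac


def diversify(master_key: bytes, diversifier: bytes) -> bytes:
    """Derive a per-entity key from a master key and entity-specific data."""
    return hmac.new(master_key, diversifier, hashlib.sha256).digest()


# Demo value only; a real master key would never leave the HSM.
master = bytes.fromhex("00112233445566778899aabbccddeeff")

key_a = diversify(master, b"device-serial-0001")
key_b = diversify(master, b"device-serial-0002")
```

The same master key and diversifier always give the same derived key, so nothing per entity has to be stored; only the master key needs protection.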

Suitability of AWS Cognito Identity ID for SQL primary key

I am working on a platform where unique user IDs are Identity IDs from an Amazon Cognito identity pool. They look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
My understanding is that this type of userID (practically an expanded UUID, that needs to be stored as a char or varchar) gives poor indexing performance. So using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem...Would moving the favorited list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item ID's, to be used in a SELECT...WHERE IN query, be an acceptable alternative?
Thanks so much!
OK, so here I want to explain why this is not a good idea, what the alternative is, and what the read/write workflow of your application would look like.
Why not: this is not a good architecture, because if something happens to your Cognito user pool, you can't repopulate it with the same IDs for each individual user. Moreover, Cognito is offered in more regions now than it was last year. Say your user base is in Indonesia and Cognito becomes available in Singapore: to reduce latency, you want to move your user pools from Tokyo to Singapore. Now you have to move the users and also repopulate your database. So your approach lacks scalability and maintainability, and it breaks the single responsibility principle (updating Cognito requires you to update the db and vice versa).
Alternative solution: leave the db index to the db domain; and use the username as the link between your db and your Cognito user pool. So:
Read work flow will be:
User authentication: User authenticates and gets the token.
Your app verifies the token and gets the username from its payload.
Your app contacts the db and gets the user's information based on the username.
Your app takes the user to their page and shows the information that was stored in the database.
Write work flow will be:
Your app receives the write request along with the user's token.
It verifies the token.
It writes to the database based on the unique username.
Regarding MySQL, using the UserID and CognitoID composite as the primary key has a negative impact on query performance, so it is not recommended for a large dataset.
However using this or even UserID for NoSQL DynamoDB is more suitable unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control connecting with Cognito Identity Pools.
While cognito itself has some issues, which are discussed in this article, and there are too many to list...
It's a terrible idea to use cognito and then create a completely separate user Id to use as a PK. First of all it is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally now you have extra complexity to deal with an imaginary problem. If you don't like what cognito is giving you then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need and will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
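If key size does turn out to matter, one possible mitigation is to store the UUID part of the identity ID in binary form. The sketch below assumes the ID is always a region, a colon, and a standard UUID, which the example in the question suggests but which is not guaranteed here. The function names are hypothetical.

```python
import uuid


def split_identity_id(identity_id: str) -> tuple[str, bytes]:
    """Split 'region:uuid' into the region and the UUID as 16 raw bytes."""
    region, _, uuid_text = identity_id.partition(":")
    return region, uuid.UUID(uuid_text).bytes


def join_identity_id(region: str, raw: bytes) -> str:
    """Rebuild the textual identity ID from its stored parts."""
    return f"{region}:{uuid.UUID(bytes=raw)}"


identity_id = "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
region, raw = split_identity_id(identity_id)
```

Here the 46-character string shrinks to a short region code plus 16 bytes, which could go into a fixed-width binary column. Whether the extra conversion step is worth it depends on the workload; the answers above disagree on whether the problem is real at all.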

Couchbase: What benefits do I get from using the document ID?

I'm new to the NoSQL world as I've been programming RDBMS for a while now. In an RDBMS, you have the notion of a PRIMARY KEY per table. You reference other tables using FOREIGN KEYs and usually, if denormalized well, you have another table that just basically contains mapping from TABLE A and TABLE B so you can join them.
In Couchbase, there's this concept of a Document ID, where a document has its own unique key that is external to the document itself. What is this document ID good for? The only use I see for it is querying for the object itself (using the USE KEYS clause).
I could just specify an "id" and "type" in my JSON document and just assign random UUIDs for the document key.
What benefits do I get from using it? ELI5 if possible.
And also, why do some developers add "prefixes" to the document ID (e.g. "customer::customername")?
That is an excellent question, and the answer is both historical and technical.
Historical: Couchbase originated from CouchOne/CouchDB and Membase, the latter being a persistent distributed version of the memcached key-value store. Couchbase still operates as a key-value store, and the fastest way to retrieve a document is via a key lookup. You could retrieve a document using an index based on one of the document fields, but that would be slower.
Technically, the ability to retrieve documents extremely quickly given their ID is one advantage that makes Couchbase attractive for many users/applications (along with scalability and reliability).
Why do some developers add "prefixes" to document IDs, such as "customer::{customer name}". For issues related to fast retrieval and data modeling. Let's say you have a small document containing a customer's basic profile, and you use the customer's email address as the document ID. The customer logs in, and your application can retrieve this profile using a very fast k-v lookup using the e-mail as ID. You want to keep this document small so it can be retrieved more quickly.
Maybe the customer sometimes wants to view their entire purchase history. Your application might want to keep that purchase history in a separate document, because it's too big to retrieve unless you really need it. So you would store it with the document id {email}::purchase_history, so you can again use a k-v lookup to retrieve it. Also, you don't need to store the key for the purchase history record anywhere - it is implied. Similarly, the customer's mailing addresses might be stored under document ID {email}::addresses. Etc.
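The implied-key pattern described above can be sketched as follows. A plain dict stands in for the bucket, and the helper names and sample data are made up for the example; only the `{email}::purchase_history` and `{email}::addresses` key shapes come from the text.

```python
def profile_key(email: str) -> str:
    """The small profile document is stored under the e-mail itself."""
    return email


def purchase_history_key(email: str) -> str:
    """Key of the (potentially large) purchase history document."""
    return f"{email}::purchase_history"


def addresses_key(email: str) -> str:
    """Key of the mailing addresses document."""
    return f"{email}::addresses"


store = {}  # stand-in for a key-value bucket
email = "jane@example.com"
store[profile_key(email)] = {"name": "Jane"}
store[purchase_history_key(email)] = {"orders": [101, 102]}
store[addresses_key(email)] = {"shipping": "1 Main St"}

# No lookup table is needed: the key is derived from the e-mail alone.
history = store[purchase_history_key(email)]
```

Every related document is reachable with a single key lookup, and the profile document never has to carry pointers to the others.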
Data modeling in Couchbase is just as important as in a traditional RDBMS, but you go about it differently. There's a nice discussion of this in the free online training: https://training.couchbase.com/online?utm_source=sem&utm_medium=textad&utm_campaign=adwords&utm_term=couchbase%20online%20training&utm_content=&gclid=CMrM66Sgw9MCFYGHfgodSncCGA#
Why does Couchbase still use an external key instead of a primary key field inside the JSON? Because Couchbase still permits non-JSON data (e.g., binary data). In addition, while a relational database could permit multiple fields or combination of fields to be candidate keys, Couchbase uses the document ID for its version of sharding, so the document ID can't be treated like other fields.

Should I obscure primary key values?

I'm building a web application where the front end is a highly-specialized search engine. Searching is handled at the main URL, and the user is passed off to a sub-directory when they click on a search result for a more detailed display. This hand-off is being done as a GET request with the primary key being passed in the query string. I seem to recall reading somewhere that exposing primary keys to the user was not a good idea, so I decided to implement reversible encryption.
I'm starting to wonder if I'm just being paranoid. The reversible encryption (base64) is probably easily broken by anybody who cares to try, makes the URLs very ugly, and also longer than they otherwise would be. Should I just drop the encryption and send my primary keys in the clear?
What you're doing is basically obfuscation. A reversible encrypted (and base64 doesn't really count as encryption) primary key is still a primary key.
What you were reading comes down to this: you generally don't want to have your primary keys have any kind of meaning outside the system. This is called a technical primary key rather than a natural primary key. That's why you might use an auto number field for Patient ID rather than SSN (which is called a natural primary key).
Technical primary keys are generally favoured over natural primary keys because things that seem constant do change and this can cause problems. Even countries can come into existence and cease to exist.
If you do have technical primary keys you don't want to make them de facto natural primary keys by giving them meaning they didn't otherwise have. I think it's fine to put a primary key in a URL but security is a separate topic. If someone can change that URL and get access to something they shouldn't have access to then it's a security problem and needs to be handled by authentication and authorization.
Some will argue they should never be seen by users. I don't think you need to go that far.
On the dangers of exposing your primary key, you'll want to read "autoincrement considered harmful", by Joshua Schachter.
URLs that include an identifier will let you down for three reasons.
The first is that given the URL for some object, you can figure out the URLs for objects that were created around it. This exposes the number of objects in your database to possible competitors or other people you might not want having this information (as famously demonstrated by the Allies guessing German tank production levels by looking at the serial numbers.)
Secondly, at some point some jerk will get the idea to write a shell script with a for-loop and try to fetch every single object from your system; this is definitely no fun.
Finally, in the case of users, it allows people to derive some sort of social hierarchy. Witness the frequent hijacking and/or hacking of high-prestige low-digit ICQ ids.
If you're worried about someone altering the URL to try and look at other values, then perhaps you need to look at token generation.
For instance, instead of giving the user a 'SearchID' value, you give them a SearchToken, which is some long unique pseudo-random value (read: GUID), which you then map to the SearchID internally.
Of course, you'll still need to apply session security and so forth, because even a unique URL with a non-sequential ID isn't protected against sniffing by anything between your server and the user.
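A minimal sketch of the token-to-ID mapping described above. The `TokenMap` class is hypothetical and keeps the mapping in memory; a real application would persist it and would still need proper authorization checks on every request.

```python
import secrets


class TokenMap:
    """Maps opaque, unguessable tokens to internal integer IDs."""

    def __init__(self):
        self._by_token = {}

    def issue(self, internal_id: int) -> str:
        """Create a fresh random token for an internal ID."""
        token = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
        self._by_token[token] = internal_id
        return token

    def resolve(self, token: str):
        """Return the internal ID for a token, or None if it is unknown."""
        return self._by_token.get(token)


tokens = TokenMap()
t1 = tokens.issue(1)
t2 = tokens.issue(2)
```

Consecutive internal IDs give unrelated tokens, so incrementing a value in the URL no longer walks through the records.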
If you're obscuring the primary keys for a security reason, don't do it. That's called security by obscurity and there is a better way. Having said that, there is at least one valid reason to obscure primary keys, and that's to prevent someone from scraping all your content by simply examining a querystring in a URL and determining that they can simply increment an id value and pull down every record. A determined scraper may still be able to discover your means of obscuring and do this despite your best efforts, but at least you haven't made it easy.
PostgreSQL provides multiple solutions for this problem, and they could be adapted for other RDBMSs:
hashids : https://hashids.org/postgresql/
Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.
optimus is similar to hashids but provides only integers as output: https://github.com/jenssegers/optimus
skip32 at https://wiki.postgresql.org/wiki/Skip32_(crypt_32_bits):
It may be used to generate series of unique values that look random, or to obfuscate a SERIAL primary key without losing its uniqueness property.
pseudo_encrypt() at https://wiki.postgresql.org/wiki/Pseudo_encrypt:
pseudo_encrypt(int) can be used as a pseudo-random generator of unique values. It produces an integer output that is uniquely associated to its integer input (by a mathematical permutation), but looks random at the same time, with zero collision. This is useful to communicate numbers generated sequentially without revealing their ordinal position in the sequence (for ticket numbers, URLs shorteners, promo codes...)
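The "mathematical permutation" mentioned above can be illustrated with a small Feistel network, a construction that is a bijection whatever round function is used. This is a generic sketch, not the code from the PostgreSQL wiki: the round function, the round keys, and the function names are all arbitrary choices made for the example, and it provides obfuscation only, not cryptographic security.

```python
ROUND_KEYS = (0x1F3A, 0x7C21, 0x05B9, 0x6E44)  # arbitrary demo constants


def _round(half: int, key: int) -> int:
    """Mix a 16-bit half with a round key; it does not need to be invertible."""
    return ((half * 0x9E37) ^ key ^ (half >> 5)) & 0xFFFF


def scramble(value: int) -> int:
    """Map a 32-bit integer to another 32-bit integer, one-to-one."""
    left, right = (value >> 16) & 0xFFFF, value & 0xFFFF
    for key in ROUND_KEYS:
        left, right = right, left ^ _round(right, key)
    return (left << 16) | right


def unscramble(value: int) -> int:
    """Invert scramble() by running the rounds backwards."""
    left, right = (value >> 16) & 0xFFFF, value & 0xFFFF
    for key in reversed(ROUND_KEYS):
        left, right = right ^ _round(left, key), left
    return (left << 16) | right
```

Sequential inputs such as 1, 2, 3 come out as scattered 32-bit values, yet no two inputs collide and each output can be mapped back to its input.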
this article gives details on how this is done at Instagram: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c and it boils down to:
We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.
Each of our IDs consists of:
41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
13 bits that represent the logical shard ID
10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
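The bit layout quoted above (41 + 13 + 10 = 64 bits) can be sketched with plain shifts and masks. The function names and the sample values are assumptions for illustration; the original runs inside PL/pgSQL, which is not reproduced here.

```python
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10


def compose_id(ms_since_epoch: int, shard_id: int, sequence: int) -> int:
    """Pack time, shard and sequence into one 64-bit integer."""
    if not 0 <= ms_since_epoch < (1 << TIME_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard id does not fit in 13 bits")
    seq = sequence % (1 << SEQ_BITS)  # sequence modulus 1024
    return (ms_since_epoch << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq


def decompose_id(value: int) -> tuple[int, int, int]:
    """Unpack an ID into (ms_since_epoch, shard_id, sequence)."""
    seq = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    ms_since_epoch = value >> (SHARD_BITS + SEQ_BITS)
    return ms_since_epoch, shard_id, seq


example = compose_id(1_000_000, 1341, 5001)
```

Because the timestamp occupies the most significant bits, IDs sort roughly by creation time, while the shard bits tell you where a row lives without any lookup.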
Just send the primary keys. As long as your database operations are sealed off from the user interface, this is no problem.
For your purposes (building a search engine) the security benefit of encrypting database primary keys is negligible. Base64 encoding isn't encryption - it's security through obscurity and won't even be a speedbump to an attacker.
If you're trying to secure database query input just use parametrized queries. There's no reason at all to hide primary keys if they are manipulated by the public.
When you see base64 in the URL, you are pretty much guaranteed the developers of that site don't know what they are doing and the site is vulnerable.
URLs that include an identifier will let you down for three reasons.
Wrong, wrong, wrong.
First - every request has to be validated, regardless of it coming in the form of a HTTP GET with an id, or a POST, or a web service call.
Second - a properly made web-site needs protection against bots which relies on IP address tracking and request frequency analysis; hiding ids might stop some people from writing a shell script to get a sequence of objects, but there are other ways to exploit a web site by using a bruteforce attack of some sort.
Third - ICQ ids are valuable but only because they're related to users and are a user's primary means of identification; it's a one-of-a-kind approach to user authentication, not used by any other service, program or web-site.
So, to conclude.. Yes, you need to worry about scrapers and DDOS attacks and data protection and a whole bunch of other stuff, but hiding ids will not properly solve any of those problems.
When I need a query string parameter to identify a single row in a table, I normally add a GUID column to that table and then pass the GUID in the query string instead of the row's primary key value.

What exactly is GUID? Why and where I should use it?

I've seen references to GUIDs in a lot of places, including Wikipedia,
but it is not very clear about where to use them.
If someone could answer this, it would be nice.
Thanks
GUID technically stands for globally unique identifier. What it is, actually, is a 128 bit structure that is unlikely to ever repeat or create a collision. If you do the maths, the domain of values is in the undecillions.
Use guids when you have multiple independent systems or clients generating ID's that need to be unique.
For example, if I have 5 client apps creating and inserting transactional data into a table that has a unique constraint on the ID, then use guids. This prevents having to force a client to request an issued ID from the server first.
This is also great for object factories and systems that have numerous object types stored in different tables where you don't want any 2 objects to have the same ID. This makes caching and scavenging schemas much easier to implement.
A GUID is a "Globally Unique IDentifier". You use it anywhere that you need an identifier that is guaranteed to be different from every other.
Usually, you only need a value to be "locally unique" -- the Primary Key identity in a database table, for example, needs only be different from the other rows in that table, but can be the same as the ID in other tables. (no need for a GUID here)
GUIDs are generally used when you will be defining an ID that must be different from an ID that someone else (outside of your control) will be defining. One such place is the interface identifier on ActiveX controls. Anyone can create an ActiveX control without knowing which other controls it will be used with, and there's nothing to stop everyone from giving their controls the same name. GUIDs keep them distinct.
GUIDs are a combination of the time (in very small fractions of a second), so each one is assured to be different from any GUID defined before or later, and a number defining your location (sometimes taken from the MAC address of your network card), so it is assured to be different from any other GUID defined right now by someone else.
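The time-plus-location scheme described above corresponds to what RFC 4122 calls a version 1 UUID; other versions are built differently. A minimal sketch using Python's standard `uuid` module, which falls back to a random node number when no MAC address can be read:

```python
import uuid

u = uuid.uuid1()  # version 1: timestamp plus node identifier

timestamp_100ns = u.time  # 60-bit count of 100-nanosecond intervals
node = u.node             # 48-bit node id, typically the MAC address
```

Two calls on the same machine share the node value and differ in the timestamp and clock sequence fields, which is what keeps them distinct.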
They are also sometimes known as UUIDs (universally unique ID).
As addition to all the other answers, here is an online GUID generator:
http://www.guidgenerator.com/
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.
How are GUIDs used?
GUIDs are used in software development as database keys, component identifiers, or just about anywhere else a truly unique identifier is required. GUIDs are also used to identify all interfaces and objects in COM programming.
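The uniqueness figures quoted above can be sanity-checked with the usual birthday approximation. The sketch assumes random (version 4) identifiers with 122 random bits and a world population of about 8 billion; both numbers are my assumptions, not part of the quoted text.

```python
import math

RANDOM_BITS = 122  # assumption: version 4 UUID, 6 of the 128 bits are fixed


def count_for_collision_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Approximate how many random IDs give collision probability p."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


n_half = count_for_collision_probability(0.5)

# Time needed at one billion IDs per second, in years.
years_at_1e9_per_second = n_half / 1e9 / (365.25 * 24 * 3600)

# IDs per person if about 8 billion people generate them (assumed population).
per_person = n_half / 8e9
```

Under these assumptions the 50% point sits near 2.7e18 identifiers. That is in the same range as the per-person figure in the quote, but it corresponds to decades rather than a single year at a billion identifiers per second, so the one-year claim looks optimistic by this estimate.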
A GUID is a "Globally Unique ID". Also called a UUID (Universally Unique ID).
It's basically a 128 bit number that is generated in a way (see RFC 4122 http://www.ietf.org/rfc/rfc4122.txt) that makes it nearly impossible for duplicates to be generated. This way, I can generate GUIDs without some third party organization having to give them to me to ensure they are unique.
One widespread use of GUIDs is as identifiers for COM entities on Windows (classes, typelibs, interfaces, etc.). Using GUIDs, developers could build their COM components without going to Microsoft to get a unique identifier. Even though identifying COM entities is a major use of GUIDs, they are used for many things that need unique identifiers. Some developers will generate GUIDs for database records to provide them an ID that can be used even when they must be unique across many different databases.
Generally, you can think of a GUID as a serial number that can be generated by anyone at anytime and they'll know that the serial number will be unique.
Other ways to get unique identifiers include getting a domain name. To ensure the uniqueness of domain names, you have to get it from some organization (ultimately administered by ICANN).
Because GUIDs can be unwieldy (from a human readable point of view they are a string of hexadecimal numbers, usually grouped like so: aaaaaaaa-bbbb-cccc-dddd-ffffffffffff), some namespaces that need unique names across different organizations use another scheme (often based on Internet domain names).
So, the namespace for Java packages by convention starts with the organization's domain name (reversed) followed by names that are determined in some organization-specific way. For example, a Java package might be named:
com.example.jpackage
This means that dealing with name collisions becomes the responsibility of each organization.
XML namespaces are also made unique in a similar way - by convention, someone creating an XML namespace is supposed to make it 'underneath' a registered domain name under their control. For example:
xmlns="http://www.w3.org/1999/xhtml"
Another way that unique IDs have been managed is for Ethernet MAC addresses. A company that makes Ethernet cards has to get a block of addresses assigned to them by the IEEE (I think it's the IEEE). In this case the scheme has worked pretty well, and even if a manufacturer screws up and issues cards with duplicate MAC addresses, things will still work OK as long as those cards are not on the same subnet, since beyond a subnet, only the IP address is used to route packets. Although there are some other uses of MAC addresses that might be affected - one of the algorithms for generating GUIDs uses the MAC address as one parameter. This GUID generation method is not as widely used anymore because it is considered a privacy threat.
One example of a scheme to come up with unique identifiers that didn't work very well was the Microsoft provided ID's for 'VxD' drivers in Windows 9x. Developers of third party VxD drivers were supposed to ask Microsoft for a set of IDs to use for any drivers the third party wrote. This way, Microsoft could ensure there were not duplicate IDs. Unfortunately, many driver writers never bothered, and simply used whatever ID was in the example VxD they used as a starting point. I'm not sure how much trouble this caused - I don't think VxD ID uniqueness was absolutely necessary, but it probably affected some functionality in some APIs.
GUID or UUID (globally vs Universally) Unique IDentifier is, well, a unique ID :) When you need something really unique machine generated, there are libraries to get you one.
See GUID on wikipedia for details.
As to when you don't need a GUID, it is when a counter that you control (one way or another, like a SERIAL SQL type or a sequence) gets incremented. Indexing a "text" value (GUID in textual form) or a 128 bit binary value (which a GUID is) is far more expensive than an integer.
Someone said they are conceptually 128-bit random values, and that is substantially true, but having done a little reading on UUID (GUID usually refers to Microsoft's implementation of UUID), I see that there are several different UUID versions, and most of them are not actually random. So it is possible to generate a UUID for a machine (or something else) and be able to reliably repeat that process to obtain the same UUID down the road, which is important for some applications.
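The difference between repeatable and random UUIDs mentioned above can be shown with Python's standard `uuid` module. The host name is a made-up example; `uuid5` is the name-based (SHA-1) variant and `uuid4` the random one.

```python
import uuid

# Name-based: the same namespace and name always give the same UUID.
a = uuid.uuid5(uuid.NAMESPACE_DNS, "host-01.example.com")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "host-01.example.com")

# Random: every call gives a new value.
c = uuid.uuid4()
d = uuid.uuid4()
```

Here `a` and `b` are equal, which is what makes it possible to regenerate the identifier for a given machine later, while `c` and `d` differ on every run.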
For me it's easier to think of them as simply "128-bit random values". Which is essentially what they are. There are some algorithms for including a bit of information in a few digits of your GUID (thus the random part gets a bit smaller), but still they are pretty large almost-random values.
Since they are so large, it is extremely unlikely that two GUIDs will ever be generated that are the same. For all practical purposes, every GUID ever generated is unique in the world.
I'll leave it to you to figure out where to use them, but other answers already have some examples. Let your imagination run wild. :)
Can be a hard thing to understand because of all the maths that goes on behind generating them. Think of it as a unique id. You can get Visual Studio to generate one for you, or .NET if you happen to be using C# or one of the many other applications or websites. They are considered unique because there is such a silly small chance you'll see the same one twice that it isn't worth considering.
128-bit Globally Unique ID. You can generate GUIDs from now until sunset and, for practical purposes, you will never generate the same GUID twice, and neither will anyone else. They are used a lot with COM.
As for example of something you would use them for, we use them in one of our products. Our users can generate categories and cards on various devices. We want to make sure that we don't confuse a category made on one device with a category created on a different one, so it's important that IDs are unique no matter who generates them, where they generate them, and when they generate them. So we use GUIDs (actually we use our own scheme using 64-bit numbers but they are similar to GUIDs).
I worked on an ACD call center system a few years back where we wanted to gather call detail records from multiple call processors into a single database. I set up a column in MS SQL to generate a GUID for the database key rather than using a system-generated sequential ID (identity column). Back then, this required setting the default value to NewID (or generating it in the code, but the NewID() function was safer). Of course, having a large value for a key may raise a few eyebrows, but I would rather give up the space than risk a collision.
I didn't see anyone address using a GUID as a database key so I thought it might help to know you could do that too.
GUID stands for "Globally Unique Identifier" and you use it when you want to have, erm, a Globally Unique Identifier.
In RSS feeds, for example, you should have a GUID for each item in the feed. That way, the feed reader software can keep track of whether you have read that item or not. Without a GUID, it would be impossible to tell.
A GUID differs from something like a database ID in that no matter who creates an object -- you, me, the guy down the street -- our GUIDs will always be different. There should be no collisions using a GUID.
You'll also see the term UUID, which stands for "Universally Unique Identifier." There is essentially no difference between the two. UUID is the more appropriate term. GUID is the term used by Microsoft.
If you need to generate an identifier that needs to be unique during the whole lifetime of your application, you use a GUID.
Imagine you have a server with sessions, if you give each session a GUID, you are certain that it will be unique for every session ever created by your server. This is useful for tracing bugs.
One particularly useful application of GUIDs that I've found is using them to track unique visitors in webapps where the visitors are anonymous (i.e. not logged in or registered).
GUID = Globally Unique IDentifier.
Use it when you want to uniquely identify something in a global context.
This generator can be handy.
The Wikipedia article on GUIDs is pretty clear on what they are used for - maybe rephrasing your question would help - what do you need a GUID for?
To actually see what one looks like on a Windows computer, open cmd or PowerShell.
PowerShell => [guid]::NewGuid()
CMD => powershell [guid]::NewGuid()