DynamoDB with unique secondary/tertiary index

I'm getting ready to rebuild a database that has three different tables, all containing the same data; the only difference is the hash key for each: UserId, UserName, and Email. I'm trying to combine them all into one table, as I think the redundancy is bad as well as slow. What I plan on doing is have UserId as the hash key and UserName and Email as secondary indexes. I have not found a way to have DynamoDB enforce uniqueness on the secondary indexes, so I plan on using conditional writes that check for uniqueness in those attributes before writing to the database. With SQL this would be very easy; is there a better way of doing this in DynamoDB? I need to be able to look up a user by any of the three: UserId, UserName, or Email. I'd like to keep this to one table and not use another table that maps Email to UserId or UserName to UserId.

You are correct that DynamoDB does not enforce uniqueness on Global Secondary Indexes.
If you are going to use a single DynamoDB table, the only thing that is enforced to be unique is the primary key (hash key plus optional range key), because an item is uniquely identified by that key. So combining your tables into a single table will require enforcing that uniqueness in your application logic.
Maintaining a Global Secondary Index for a uniquely identified key on a per item basis is the equivalent of maintaining a second table. The Global Secondary Index would require the same provisioned throughput as if you had created a second/third table. The benefit of using a Global Secondary Index is that you don't have to maintain the index yourself.
Just as a warning: Global Secondary Indexes are eventually consistent in DynamoDB. This means that even though you've received a 200 response for a PutItem, it may not show up immediately if you check the Global Secondary Index. This could lead to a race condition where you check for one of the values and it has not yet propagated to the index. You'd have the same issue if you maintain the index yourself - you'd need to lock on something to make sure the writes to all three tables are transactional.

MySQL query and insertion optimisation with varchar(255) UUIDs

I think this question has been asked in some way, shape, or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is not indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc., e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe - but understand that it is certainly not 'unguessable' in theory. The issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that is coming to mind is to make the uuid field unique and also index it. But I've been doing some reading in relation to this, and various blog posts and StackOverflow answers have described putting indexes on UUIDs as being really bad for performance. I also read that it will increase the time it takes to insert a new customer record into the database, as the MySQL database will take time to find the correct location in which to place the record as part of the index.
The above mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer-based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance and in turn the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement - Need the ability to store a customer record which is 'hard' to guess, easy to insert and easy to query in a large data set.
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event that you get a random duplicate UUID value, you can use an index like this.
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid_index (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns and using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
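As a rough sketch of that approach (assuming MySQL 8.0 or later, and a hypothetical customer table shaped like the one implied above):

CREATE TABLE customer (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    uuid      BINARY(16) NOT NULL,
    givenname VARCHAR(64),
    surname   VARCHAR(64),
    email     VARCHAR(255),
    INDEX uuid_index (uuid)
);

-- Store the 36-character text form as 16 bytes.
INSERT INTO customer (uuid, givenname, surname, email)
VALUES (UUID_TO_BIN('0f8fad5b-d9cb-469f-a165-70867728950e'), 'Jane', 'Doe', 'jane@example.com');

-- Convert back to text when reading.
SELECT BIN_TO_UUID(uuid) AS uuid, givenname, surname, email
FROM customer
WHERE uuid = UUID_TO_BIN('0f8fad5b-d9cb-469f-a165-70867728950e');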

Index every column to add foreign keys

I am currently learning about foreign keys and trying to add them as much as I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL is creating an index automatically for every foreign key, so essentially I'll have an index on every column? Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider whether all data items (fields) in your clicks table need a lookup table. For example, the ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table; you do not really need a lookup table, since IP addresses have a wide range and are unique.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
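As a sketch of what a foreign key on one of those columns looks like (the links parent table is hypothetical; InnoDB creates an index on link_id for the foreign key if one does not already exist):

CREATE TABLE links (
    id  INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(2048) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE clicks (
    id        BIGINT AUTO_INCREMENT PRIMARY KEY,
    timestamp DATETIME NOT NULL,
    link_id   INT NOT NULL,   -- one of the roughly twelve _id columns from the question
    CONSTRAINT fk_clicks_link FOREIGN KEY (link_id) REFERENCES links (id)
) ENGINE=InnoDB;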
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique) and then tweak from there (a quick way to check selectivity is sketched after this list).
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
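One rough way to gauge selectivity for the second point, using the hypothetical gender_id column as an example (values close to 1 mean highly selective, values close to 0 mean the index is unlikely to earn its keep):

SELECT COUNT(DISTINCT gender_id) / COUNT(*) AS selectivity
FROM clicks;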
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com

Simple MySQL table design

I think I should be counted as a database newbie, so read the question as a newbie question. I currently have a table which holds environment variables for a number of hosts, created like this:
create table envs (
    host     varchar(255),
    envname  varchar(255),
    envvalue varchar(8192),
    PRIMARY KEY (host, envname)
);
Very simple, one table holding all the data I need. Common operation is to get all the environment variables for a given host, another is to get a given environment variable for a given host, third example operation would be to get a given environment variable for all hosts and list duplicates.
Performance is not expected to be an issue, it's going to be maybe tens of hosts, dozens of variables per host, average max 1 query per second.
Now I've read that having a composite primary key is not necessarily a good idea. Is this true for the above use case? If it is true, how should I change the database design? If not, is the above one-table database fine for the purposes I listed above?
I don't see a problem here with the primary key. The semantics of a primary key is to uniquely identify the non-key attribute values for the key values. As I assume that for one host and one envname there is at most one envvalue, the primary key makes perfect sense.
It could be that some people argue against composite primary keys because they are afraid of performance issues. However, performance considerations should never influence the choice of the primary key. Many database systems automatically create an index structure for the primary key; the choice of this index structure can influence performance. However, this choice can mostly be changed manually and should be done at a later point, if you really have performance issues.
Your one-table design and choice of primary key is fine.
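For reference, the three operations you describe all work naturally against that schema; a rough sketch, reading "list duplicates" as "values shared by more than one host":

-- All environment variables for a given host.
SELECT envname, envvalue FROM envs WHERE host = 'srv01';

-- One variable for a given host: a point lookup on the full primary key.
SELECT envvalue FROM envs WHERE host = 'srv01' AND envname = 'PATH';

-- One variable across all hosts, listing duplicated values.
SELECT envvalue, COUNT(*) AS hosts_sharing_value
FROM envs
WHERE envname = 'PATH'
GROUP BY envvalue
HAVING COUNT(*) > 1;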
Now I've read that having a composite primary key is not necessarily a good idea. Is this true for the above use case?
No. Use a composite primary key on (host, envname).
If it is true, how should I change the database design?
N/A.
If not, is the above one-table database fine for the purposes I listed above?
Yes: it's known as the Entity–Attribute–Value model.
It's a bad idea, because you store values such as the host name and environment name several times.
What if you were to change the hostname from srv01 to srv01_new? You'd have to change every occurrence of srv01 in your table. And what if, some day, you decide you need to create a new table that holds additional information about every single host?
Now, if you change the hostname, you have to change that information as well.
To get to your question: It's not an issue of performance, but of normalization.
Databases should generally be normalized as far as possible. If you are intrigued enough, read on.
You should create one table for your hosts, having a unique id (int) as primary key and a unique (index) name as the hostname.
Your table should then only reference the id of the host, not the name. This way, your hostname is only stored once in your whole database and can be altered to whatever you desire, without breaking other tables.
If your environment names are unique, too, you should create another table for those, having the same layout as the hosts table (id, name).
Your combination table then stores the id of the host and the id of the environment, along with the value. You must of course keep the combined primary key, so every combination of host/environment is unique and easily indexable.
Then, you have a many-to-many relationship with additional attributes and perfect normalization.
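A sketch of that normalized layout (table and column names are illustrative):

CREATE TABLE hosts (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE envnames (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE envs (
    host_id    INT NOT NULL,
    envname_id INT NOT NULL,
    envvalue   VARCHAR(8192),
    PRIMARY KEY (host_id, envname_id),
    FOREIGN KEY (host_id)    REFERENCES hosts (id),
    FOREIGN KEY (envname_id) REFERENCES envnames (id)
);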

Can a database table be without a primary key?

Can anyone tell me if a table in a relational database (such as MySQL / SQL Server) can be without a primary key?
For example, I could have table day_temperature, where I register temperature and time. I don't see the reason to have a primary key for such a table.
Technically, you can declare such a table.
But in your case, the time should be made the PRIMARY KEY, since it's probably wrong to have different temperatures for the same time and probably useless to have the same one more than once.
Logically, each table should have a PRIMARY KEY so that you could distinguish two records.
If you don't have a candidate key in your data, just create a surrogate one (AUTO_INCREMENT, SERIAL, or whatever your database offers).
The only excuse for not having a PRIMARY KEY is a log or similar table which is subject to heavy DML, where having an index on it would impact performance beyond the level of tolerance.
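A minimal sketch of that for the day_temperature example (column names are made up):

CREATE TABLE day_temperature (
    reading_time DATETIME NOT NULL,
    temperature  DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (reading_time)
);
-- If the same time can legitimately appear twice, use a surrogate
-- AUTO_INCREMENT id column as the primary key instead.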
As always, it depends.
A table does not have to have a primary key. It is much more important to have the correct indexes. How a primary key affects indexes depends on the database engine (i.e. whether it creates a unique index for the primary key column/columns).
However, in your case (and 99% of other cases too), I would add a new auto-increment unique column like temp_id and make it a surrogate primary key.
It makes maintaining this table much easier -- for example, finding and removing records (i.e. duplicated records) -- and believe me, for every table the time comes to fix things :(.
If the possibility of having duplicate entries (for example for the same time) is not a problem, and you don't expect to have to query for specific records or range of records, you can do without any kind of key.
You don't need a PK, but it's recommended that you have one. It's the best way to identify unique rows. Sometimes you don't want an auto-incrementing int PK, but rather want to create the PK on something else. For example, in your case, if there's only one unique row per time, you should create the PK on the time. It makes lookups based on time faster, plus it ensures that rows are unique (you can be sure that data integrity isn't violated).
Even if you do not add a primary key to an InnoDB table in MySQL, MySQL adds a hidden clustered index to that table. If you do not define a primary key, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.
If the table has no primary key or suitable UNIQUE index, InnoDB internally generates a clustered index GEN_CLUST_INDEX on a synthetic column containing row ID values.
https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
The time would then become your primary key. It will help index that column so that you can query data based on, say, a date range. The PK is what ultimately makes your row unique, so in your example, the datetime is the PK.
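For instance, with the time column as the primary key (column names are illustrative), a date-range query can be served straight from the primary-key index:

SELECT reading_time, temperature
FROM day_temperature
WHERE reading_time BETWEEN '2024-01-01 00:00:00' AND '2024-01-31 23:59:59';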
I would include a surrogate/auto-increment key, especially if there is any possibility of duplicate time/temperature readings. You would have no other way to uniquely identify a duplicate row.
I ran into the same question on one of the tables I designed.
The problem was that the PK would have had to be composed of all the columns of the table. That works, but it means the table size grows very fast with each row inserted.
I chose not to have a PK, and instead only have an index on the column I do the lookups on.
When you replicate a database on MySQL, a table without a primary key may cause delays in replication.
http://lists.mysql.com/mysql/227217
The most common mistake when using ROW or MIXED is the failure to verify that every table you want to replicate has a PRIMARY KEY on it. This is a mistake because when a ROW event (such as the one documented above) is sent to the slave and neither the master's copy nor the slave's copy of the table has a PRIMARY KEY on the table, there is no way to easily identify which unique row you want replication to change.
According to your answer I would consider three options:
Put a PK on both columns. This way, for each time there can be only one temp and vice versa. This solution allows multiple rows with the same temp or the same time; there just can't be two rows with the same temp AND time.
Don't put a PK at all, but do put a unique index on both columns: one unique index containing both columns. This would allow NULLs in temp and time but incurs more space to maintain the index. (Both of these options are sketched below.)
These two options would be best for retrieval speed if you have heavy reads, but would result in a lower insert rate, as the indexes would have to be updated as well.
Don't put any index at all, nor a PK. This would be best for inserts but very bad for searching. Useful for logging, where retrieval is done by another mechanism or where the inserting device is not required to check for dups.
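A sketch of the first two options (table and column names are illustrative):

-- Option 1: composite primary key on both columns.
ALTER TABLE day_temperature
    ADD PRIMARY KEY (reading_time, temperature);

-- Option 2: no primary key, just a unique index over both columns.
ALTER TABLE day_temperature
    ADD UNIQUE INDEX uq_time_temp (reading_time, temperature);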
Also, it is very important to consider cardinality here and think about the future consequences of using an auto-incremented number. If you're planning to do A LOT of inserts, then even an auto-incremented unsigned BIGINT would be a risk because it would eventually run out. In your example I guess you'll be saving data daily - for how long? This would be problematic if you saved temp every minute... so I'll take this as an extreme example.
I guess it is best to think about what you need from the table. Are you doing "save-and-forget" for the entire year for the temp at every minute? Are you going to use this table frequently in real-time decision making in your business logic? I think it is best to segregate data necessary for real-time use (OLTP) from long-term data that is required seldom and whose retrieval latency is allowed to be high (OLAP). It's even worth duplicating the data into two different tables: one heavily indexed, which gets erased once in a while to control cardinality, and a second which is actually saved on a magnetic disk with almost no indexes at all (it is possible to transfer a schema from your main filesystem onto another filesystem).
I've got a better example of a table that doesn't need a primary key - a joiner table. Say I have a table with something called "capabilities", and another table with something called "groups", and I want a joiner table that tells me all the capabilities that all the groups might have, so it's basically
create table capability_group (
    capability_id varchar(32),
    group_id      varchar(32)
);
There is no reason to have a primary key on that, because you never address a single row - you either want all the capabilities for a given group, or all the groups for a given capability. It would be better to have a unique constraint on (capability_id, group_id), and separate indexes on both fields.
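One way to set that up (index names are made up; the unique index doubles as the capability_id index via its leftmost column):

ALTER TABLE capability_group
    ADD UNIQUE INDEX uq_capability_group (capability_id, group_id),
    ADD INDEX idx_group_id (group_id);
-- Lookups by capability_id alone can use the leftmost prefix of the
-- unique index, so only group_id needs its own index here.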

Understanding keys in databases

This question is geared towards MySQL, since that is what I'm using -- but I think that it's probably the same or similar for almost every major database implementation.
How do keys work in a database? By that I mean, when you set a field to 'primary key', 'unique key' or an 'index' -- what do each of these do, and when should I use each one?
Right now I have a table containing a few fields, one of them being a GUID (minus the { and } around it). I set the GUID field to the primary key and I see that it created a binary tree. So it improves search performance -- but what differentiates that from other types of keys?
I realize this may not really be programming related (although it is development related) - I wasn't sure where exactly to ask this, but SO is what I use the most, so I'll ask here. Migrate as necessary.
There are probably hundreds of references for this elsewhere on the web, so a bit of Googling will help you get deep into understanding DB design. That said, the basic gist is as follows (a combined example follows the list):
primary key: a field or combination of fields which must be unique for each row, and which is/are indexed to provide rapid lookup of a row given a key value; cannot contain NULL, and a table can only have one primary key. Generally indexed in a clustered index, which means that the data in the table is reordered to match the order of the index, a process that greatly improves serial data retrieval. (This is the main reason a table can only have one primary key -- the order of the data can't match the order of more than one index!)
unique key: same as a primary key, but on some DB platforms, can contain NULL values so long as they don't violate the uniqueness constraint. (In other words, if the unique key contains a single column, there can only be one row in the table with NULL in that column; if the key contains more than one column, then the table can only contain rows with NULLs in the columns such that there's no non-unique duplication of NULL values across the columns in the key.) On other platforms (including MySQL), unique constraints can contain multiple NULLs; the uniqueness constraint only applies to non-NULL values of the referenced columns. There can be more than one of these per table. Indexed in a non-clustered index.
index: a field or combination of fields which are pre-indexed for more rapid retrieval given a value for the field(s) in the index. A table can have more than one index.
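A small MySQL illustration of all three on one table (names are made up):

CREATE TABLE users (
    id      INT AUTO_INCREMENT,
    guid    CHAR(36) NOT NULL,
    email   VARCHAR(255),
    surname VARCHAR(64),
    PRIMARY KEY (id),            -- one per table; clustered in InnoDB
    UNIQUE KEY uq_guid (guid),   -- uniqueness enforced; MySQL allows multiple NULLs in unique keys
    KEY idx_surname (surname)    -- plain index: faster lookups, no uniqueness requirement
);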
When you define a primary key, the database creates an index based on that key. It needs to be unique. In general, you can also create an index to speed up access to data based on non-unique query data. The indexed retrieval time for uniquely keyed data should be better than for non-unique indexes, so I try to use unique indexes where possible.
At the most basic level, the primary key determines how the records will be physically stored in memory / on disk; you would want the unique field you're going to search on the most to be the primary key, as it will greatly reduce searching.
Unique keys are fields that can only contain unique values.
An index is a specialized "map" to the database file that queries can reference.
These are extremely simplified answers, but I think that's the gist of it.
One more thing: any key is essentially a separate table, sorted by the indexed value, that points directly to the row(s) that match the key.
A B-tree style index is stored as a balanced tree; a balanced tree is a tree structure where traveling left is smaller and traveling right is larger.

        5
      /   \
     3     7
    / \   / \
   2   4 6   8

This would be an example of a balanced tree. The other major type is a hash, where a mathematical expression turns the key into the relative memory location of the key.
In order to really understand keys, you have to understand them at three levels: conceptual, logical, and physical. I'm going to reverse my habitual order, and discuss physical first.
Most programmers tend to think at the physical level. At the physical level, a key is a surrogate (stand-in) for the address of a row. When a row is to be referenced, a copy of the key can be used to specify the row. When a reference to a row is made in another row, the copy is known as a foreign key.
Most experienced programmers have a thorough understanding of pointers and addresses, and would understand exactly how the data structure worked if only it used pointers and addresses. Before the relational databases became dominant, there were in fact databases that used pointers to records embedded in other records to tie the data together.
A disadvantage to using keys instead of pointers is that the DBMS has to use an index to translate a key reference back to a pointer in order to retrieve the row in question. An advantage is that the level of indirection allows the DBMS to shuffle all the rows in a table for whatever purpose, as long as the DBMS updates all the relevant indexes accordingly.
Viewed at this level, keys might as well be simple, integer, and autoincremented. These work faster than other kinds of keys, and they sidestep certain data management issues that arise when user supplied data is missing or inconsistent. However, sidestepping data management issues at this level can create a minefield at the two higher levels.
At the logical level, a key is a minimal subset of the data in a tuple (row) that allows a single matching tuple to be specified, and when the DBMS retrieves the container for that tuple, all the attributes in the tuple are now available. Every relation has at least one candidate key. In the worst case, the entire tuple is the only candidate key. When multiple candidate keys exist for a single relation (table), common practice is to choose one candidate key as the primary key, and to make all references via this primary key.
(Actually, relation and table are not synonymous, but I'm simplifying here. Likewise, tuple and row are not synonymous, although they look identical at first glance.)
The primary reason to declare a primary key is to rule out duplicate keys or missing keys.
Sometimes database people choose to leave duplicate and missing key avoidance up to the programmers whose applications write to the database. More commonly, a primary key constraint serves to reflect an error back to a program that violates a primary key constraint.
When a DBMS sets up a primary key constraint, it also builds an index on the primary key. This allows the DBMS to find duplicates quickly, and it also speeds up certain queries that use the key column(s).
At the conceptual level, keys are the means by which the user community identifies instances of entities, whether those entities are persons (employees, travellers, etc.), things (bank accounts, hotel rooms, etc.) or whatever. The key is data and the entity identified by the key is not data. The key can thus be seen as a surrogate for the entity in the database.
At the conceptual level, keys are always natural, and never automatically supplied by the system. However, in the real world, keys are often mismanaged, and the consequences of mismanagement are overcome by what is called "common sense". Instilling common sense into an automated system is generally not feasible.
I never really described an index in the above, but it's implicit in what I said. An index is a data structure that serves to map from a key to a pointer. In all the databases you are likely to use, indexes are declared by the database builder (or perhaps a DBA) and managed by the DBMS.