DynamoDB M-M Adjacency List Design Pattern - many-to-many

Referring to amazon. I was wondering if anyone could help me.
The first image is of the table, and the second is the GSI. Here is the table:
On the table, I don't understand how one is to create the sort-key? Is this one attribute that stores both Bill-ID and Invoice-ID? or two separate attributes? I have a feeling it's the one flexible attribute, and if so how do you differentiate one from the other? And how are we meant to construct the query on the sort-key?
Is it just by looking the prefix "Bill-" or "Invoice-"?
The practice of DynamoDB seems to make use of dashes ("-") to separate values in an attribute. If anyone can give me use cases of such things, I would be grateful as well, but I am going off tangent unless it's important in this case.
Now, this is a very relatable and very interesting YouTube video, where the presenter uses ONE product table to store various types of items: Books, Song Albums, and Movies, each with its own attributes.
Again, I have a problem understanding the sort key used there. I understand that productID=1 is a bookID, and productID=2 is an Album. Where it gets confusing is what I circled in red. These are the tracks of Album 2. However, the structure of the sort key is "albumID:trackID". Now, where is the "trackID"? Am I meant to substitute the actual ID for the word "trackID", or is this meant to be the literal text "albumID:trackID"?
What if I wanted to query a specific trackID? What would be the syntax of my query?
Please see the image here from the YouTube video:
Thank you all in advance!!! :-)

In the first picture you posted the items in the base table (primary key) would look like this:
First_id(Partition key) Second_id(Sort Key) Dated
------------- ---------- ------
Invoice-92551 Invoice-92551 2018-02-07
Invoice-92551 Bill-4224663 2017-12-03
Invoice-92551 Bill-4224687 2018-01-09
Invoice-92552 Invoice-92552 2018-03-04
Invoice-92552 Bill-4224687 2018-01-09
Bill-4224663 Bill-4224663 2018-12-03
Bill-4224687 Bill-4224687 2018-01-09
And in the GSI, the same items would look like this:
Second_id(Partition Key) First_id
---------- ---------------
Invoice-92551 Invoice-92551
Bill-4224663 Invoice-92551
Bill-4224687 Invoice-92551
Invoice-92552 Invoice-92552
Bill-4224687 Invoice-92552
Bill-4224663 Bill-4224663
Bill-4224687 Bill-4224687
They have drawn it in quite a confusing way.
They have merged the partition keys into one box, but they are separate items.
They have also tried to show the GSI in the same picture. You can think of the base table and the GSI as two separate tables that are kept in sync; in many ways that's what they are.
They haven't actually provided a name to the key attributes. In my example I have named them First_id and Second_id.
When you do a query on the base table, you can use a query with the partition key Invoice-92551 and you get both the Invoice item plus all the bill items that belong to it.
Imagine you are viewing invoice Invoice-92551 in an application and you can see it has two associated bills (Bill-4224663 and Bill-4224687). If you clicked on the bill, the application would probably do a query on the GSI. The GSI query would have partition key Bill-4224687. If you look at the GSI table I have drawn above, you can see this will return the bill's own item plus two invoice items, showing that Bill-4224687 is part of two invoices (Invoice-92551 and Invoice-92552).
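To make the two lookups above concrete, here is a minimal sketch in plain Python that mimics them on an in-memory list. It does not talk to DynamoDB at all; query_base, query_gsi, bills_of and invoices_of are hypothetical helpers, and First_id / Second_id are the illustrative attribute names from this answer, not names from the AWS documentation.

```python
# In-memory stand-in for the items in the base table.
# First_id is the partition key, Second_id is the sort key (illustrative names).
ITEMS = [
    {"First_id": "Invoice-92551", "Second_id": "Invoice-92551"},
    {"First_id": "Invoice-92551", "Second_id": "Bill-4224663"},
    {"First_id": "Invoice-92551", "Second_id": "Bill-4224687"},
    {"First_id": "Invoice-92552", "Second_id": "Invoice-92552"},
    {"First_id": "Invoice-92552", "Second_id": "Bill-4224687"},
    {"First_id": "Bill-4224663", "Second_id": "Bill-4224663"},
    {"First_id": "Bill-4224687", "Second_id": "Bill-4224687"},
]


def query_base(partition_value):
    """Mimic a query on the base table: match on First_id only."""
    return [item for item in ITEMS if item["First_id"] == partition_value]


def query_gsi(partition_value):
    """Mimic a query on the GSI, where Second_id acts as the partition key."""
    return [item for item in ITEMS if item["Second_id"] == partition_value]


def bills_of(invoice_id):
    """Bills attached to an invoice: one base-table query, filtered by prefix."""
    rows = query_base(invoice_id)
    return sorted(r["Second_id"] for r in rows if r["Second_id"].startswith("Bill-"))


def invoices_of(bill_id):
    """Invoices a bill belongs to: one GSI query, filtered by prefix."""
    rows = query_gsi(bill_id)
    return sorted(r["First_id"] for r in rows if r["First_id"].startswith("Invoice-"))


print(bills_of("Invoice-92551"))
print(invoices_of("Bill-4224687"))
```

Note that the GSI lookup also returns the bill's own item, so the helper filters on the "Invoice-" prefix to keep only the invoices. That prefix check is how you tell the two kinds of ID apart inside the single sort-key attribute.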
In your second picture, the words 'bookID', 'albumID', etc. are supposed to represent actual IDs (let's say 293847 and 3340876).
I would draw his example like this:
ProductID(Partition Key) TypeID(Sort Key) Title Name
--------- ------ ------ ------
Album1 Album1 Dark Side
Album1 Album1:Track1 Speak to me
Album1 Album1:Track2 Breathe
Movie8 Movie8 Idiocracy
Movie8 Movie8:Actor1 Luke Wilson
Movie8 Movie8:Actor2 Maya Rudolph
Here are your queries:
Partition key: Album1
Gives you ALL the information (inc tracks) on Album 1 (Dark Side)
Partition key: Album1 and Sort Key: Album1:Track2
Gives you just the information on Breathe.
Partition key: Movie8
Gives you ALL the information (inc actors) on Movie8 (Idiocracy)
If I was building the table I would make it so the words Movie, Album etc were part of the actual ID (say Movie018274 and Album983745987) but that's not required, it just makes the IDs more human readable.
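For the "specific trackID" part of the question, here is a rough sketch of the three query shapes listed above, again simulated in plain Python rather than run against DynamoDB. DynamoDB lets a query combine an equality test on the partition key with a condition on the sort key, and begins_with is one of the supported sort-key conditions; the sketch imitates it with str.startswith. The query helper and its parameter names are made up for illustration.

```python
# Items from the example table above; ProductID is the partition key and
# TypeID the sort key. Title/Name hold the single descriptive value per row.
PRODUCTS = [
    {"ProductID": "Album1", "TypeID": "Album1", "Title": "Dark Side"},
    {"ProductID": "Album1", "TypeID": "Album1:Track1", "Name": "Speak to me"},
    {"ProductID": "Album1", "TypeID": "Album1:Track2", "Name": "Breathe"},
    {"ProductID": "Movie8", "TypeID": "Movie8", "Title": "Idiocracy"},
    {"ProductID": "Movie8", "TypeID": "Movie8:Actor1", "Name": "Luke Wilson"},
    {"ProductID": "Movie8", "TypeID": "Movie8:Actor2", "Name": "Maya Rudolph"},
]


def query(product_id, sort_key=None, sort_key_prefix=None):
    """Hypothetical helper: partition key equality plus an optional sort-key test."""
    rows = [p for p in PRODUCTS if p["ProductID"] == product_id]
    if sort_key is not None:
        # Exact match on the sort key returns a single item.
        rows = [p for p in rows if p["TypeID"] == sort_key]
    if sort_key_prefix is not None:
        # Stand-in for a begins_with condition on the sort key.
        rows = [p for p in rows if p["TypeID"].startswith(sort_key_prefix)]
    # Query results come back ordered by sort key.
    return sorted(rows, key=lambda p: p["TypeID"])


everything_on_album = query("Album1")
just_breathe = query("Album1", sort_key="Album1:Track2")
only_tracks = query("Album1", sort_key_prefix="Album1:Track")
```

So the text "albumID:trackID" is a template: the stored value is something like Album1:Track2, and a single track is fetched by giving both the partition key and that full sort-key value.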

Stu's answer is not quite correct; the table actually looks the way it is illustrated:
First_id(Partition key) Second_id(Sort Key) Dated
------------- ---------- ------
Invoice-92551 Invoice-92551 2018-02-07
Invoice-92551 Bill-4224663 2017-12-03
Invoice-92551 Bill-4224687 2018-01-09
Invoice-92552 Invoice-92552 2018-03-04
Invoice-92552 Bill-4224687 2018-01-09
Bill-4224663 Bill-4224663 2018-12-03
Bill-4224687 Bill-4224687 2018-01-09
In the table above, the Bill items (i.e. partition key = Bill-xxxxx) hold common information for the bill, whereas the Invoice items with Bill items as sort key hold information for the bill that is specific to the given invoice.
In order to fully reconstruct a bill, a GSI is required that allows you to lookup the complete information for a bill (i.e. the common record + invoice specific records):
Second_id(Partition Key) First_id Data
---------- --------------- -----------
Bill-4224663 Bill-4224663 Common bill data
Bill-4224663 Invoice-92551 Bill data for Invoice-92551
Bill-4224687 Bill-4224687 Common bill data
Bill-4224687 Invoice-92551 Bill data for Invoice-92551
Bill-4224687 Invoice-92552 Bill data for Invoice-92552
Invoice-92551 Invoice-92551 Redundant data!
Invoice-92552 Invoice-92552 Redundant data!
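As a small illustration of what "fully reconstruct a bill" could mean in application code, the sketch below takes the GSI rows from the table above and splits them into the common record and the invoice-specific records. It is plain Python over a list, not a DynamoDB call, and reconstruct_bill is a hypothetical helper.

```python
# The GSI rows from the table above (Second_id is the GSI partition key).
GSI_ITEMS = [
    {"Second_id": "Bill-4224663", "First_id": "Bill-4224663", "Data": "Common bill data"},
    {"Second_id": "Bill-4224663", "First_id": "Invoice-92551", "Data": "Bill data for Invoice-92551"},
    {"Second_id": "Bill-4224687", "First_id": "Bill-4224687", "Data": "Common bill data"},
    {"Second_id": "Bill-4224687", "First_id": "Invoice-92551", "Data": "Bill data for Invoice-92551"},
    {"Second_id": "Bill-4224687", "First_id": "Invoice-92552", "Data": "Bill data for Invoice-92552"},
    {"Second_id": "Invoice-92551", "First_id": "Invoice-92551", "Data": "Redundant data!"},
    {"Second_id": "Invoice-92552", "First_id": "Invoice-92552", "Data": "Redundant data!"},
]


def reconstruct_bill(bill_id):
    """Combine the common bill record with its invoice-specific records."""
    rows = [item for item in GSI_ITEMS if item["Second_id"] == bill_id]
    # The row whose First_id equals the bill ID carries the common data.
    common = next(item["Data"] for item in rows if item["First_id"] == bill_id)
    # Every other row carries data specific to one invoice.
    per_invoice = {
        item["First_id"]: item["Data"] for item in rows if item["First_id"] != bill_id
    }
    return {"bill": bill_id, "common": common, "per_invoice": per_invoice}


bill = reconstruct_bill("Bill-4224687")
```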

Related

Do I need a Primary Key If I'm using 1 to Many Relationship?

I have a table called branch
It looks something like.
+----------------+--------------+
| branch_id | branch_name |
+----------------+--------------+
| 1 | TestBranch1 |
| 2 | TestBranch2 |
+----------------+--------------+
I've set the branch_id as primary key.
Now my question is related to the next table called item
It looks like this.
+----------------+-----------+---------------------------+
| branch_id | item_id | item_name |
+----------------+-----------+---------------------------+
| 1 | 1 | Apple |
| 1 | 2 | Ball |
| 2 | 1 | Totally Difference Apple |
| 2 | 2 | Apple Apple 2 |
+----------------+-----------+---------------------------+
I'd like to know if I need to create a primary key for my item table?
UPDATE
They do not share the same items. Sorry for the confusion.. A branch can create a product that doesn't exist in the other branch. They are like two stores sharing the same database.
UPDATE
Sorry for the incomplete information.
These tables are actually from two local database...
I'm trying to create a database that can exist on its own but would still have no problem when mixed with another. So the system would just append all the item data from another branch without mixing them up. The branches don't take the item_id of the other branches into consideration when generating a unique_id for their items. All the databases, however, may share the same branch table as a reference.
Thank you guys in advance.
I'd like to know if I need to create a primary key for my item table?
You always [1] need a key [2], whether the table is involved in a relationship [3] or not. The only question is what kind of key?
Here are your options in this case:
Make {item_id} alone a key. This makes the relationship "non-identifying" and item a "strong" entity...
Which produces a slimmer key (compared to the second option), therefore any child tables that may reference it are slimmer.
Any ON UPDATE CASCADE actions are cut-off at the level of the item and not propagated to children.
May play better with ORMs.
Make a composite [4] key on {branch_id, item_id}. This makes the relationship "identifying" and item a "weak" entity...
Which makes item itself slimmer (one less index).
Which may be very useful for clustering.
May help you avoid a JOIN in some cases (if there are child tables, branch_id is propagated to them).
May be necessary for correctly modelling "diamond-shaped" dependencies.
So pick your poison ;)
Of course, branch_id is a foreign key (but not key) in both cases.
And orthogonal to all that, if item_name has to be unique per-branch (as opposed to per whole table), you need a composite key on {branch_id, item_name} as well.
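Here is a minimal sketch of the composite-key option, using Python's built-in sqlite3 module as a stand-in for whatever DBMS you actually use (the question does not name one, so that choice is an assumption). The try_insert helper is hypothetical; the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE branch (
    branch_id   INTEGER PRIMARY KEY,
    branch_name TEXT NOT NULL
);
CREATE TABLE item (
    branch_id INTEGER NOT NULL REFERENCES branch(branch_id),
    item_id   INTEGER NOT NULL,
    item_name TEXT NOT NULL,
    PRIMARY KEY (branch_id, item_id)   -- composite key: unique per branch
);
""")
conn.executemany("INSERT INTO branch VALUES (?, ?)",
                 [(1, "TestBranch1"), (2, "TestBranch2")])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    (1, 1, "Apple"),
    (1, 2, "Ball"),
    (2, 1, "Totally Difference Apple"),
    (2, 2, "Apple Apple 2"),
])


def try_insert(branch_id, item_id, item_name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO item VALUES (?, ?, ?)",
                     (branch_id, item_id, item_name))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this key, item_id 1 can exist once in each branch, a second (1, 1) row is rejected, and a row for a branch that does not exist is rejected by the foreign key. That matches the "append items from another branch without mixing them up" requirement, as long as branch_id values are unique across the databases being merged.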
[1] From the logical perspective, you always need a key, otherwise your table would be a multiset, therefore not a relation (which is a set), therefore your database would no longer be "relational". From the physical perspective, there may be some special cases for breaking this rule, but they are rare.
[2] Whether it's primary or not is immaterial from the logical standpoint, although it may be important if the DBMS ascribes a special meaning to it, as is the case with InnoDB, which uses the primary key as the clustering key.
[3] Please make a distinction between "relation" and "relationship".
[4] Aka. "compound".
According to your example data you are using n to m relations and not 1 to m. It should be like this
item table
----------
item_id | item_name
1 | Apple
2 | Ball
branch_item table
-----------------
item_id | branch_id
1 | 1
1 | 2
2 | 1
2 | 2
And your branch_item table should have a compound unique key containing branch_id and item_id to make sure no duplicate entries can be added.
Yes you do. The Primary key is what allows the many to one relationship to exist.
This requirement is already catered for by the branch_id column.
The item_id column is not required for the one-to-many relationship in your example.

MySQL - When should I use a different table for similar data?

Lets say I'm storing play by play info for sports: basketball, football, and baseball. The data basically fits the same model:
| play_id | play_type_id | play_description_id | player1_id | player2_id | player3_id |
Those are the basic columns that each sport would share, but there would be several more. Some columns would only be used by certain sports - like player3_id would be used by football for who made a tackle, but never by basketball - there wouldn't be a lot of these limited-use columns, but some.
Each game can have anywhere from 300 - 1000 rows (high estimate), so this table could grow to the billions eventually.
My questions are:
Should I just start off with different tables for each sport, even though there'd be about a 90% overlap of columns?
At what point should I look into partitioning the table? How would I do this? I'm thinking of archiving all the plays from the 2012 season (whether it be a sports specific table or all-inclusive).
Sorry if my post isn't more concise. This is all a hypothetical case, I'm just trying to figure out what the disadvantages of having one massive table would be, obviously performance is a consideration, but at what point does the table's size warrant being divided. Because this isn't a real project, it's hard to determine what the advantages of having a table like this would be. So again, sorry if this is a stupid question.
EDIT/ADDITIONAL QUESTION:
On a somewhat side-note, I haven't used NoSQL DBs before, but is that something I should consider for a project like this? Let's say that there'd be a high velocity of reads and return time would be crucial, but it also needs to have the ability to run complex queries like "how many ground balls has playerA hit to second base, off playerB, in night games, during 2002 - 2013?"
I would separate it into multiple tables. That way it is more flexible.
And if you want to compute statistics, you will be able to run more complex queries than if you have only one table.
It could look like this:
Table PLAYER
ID | FIRSTNAME | LASTNAME | DATE_OF_BIRTH
-----------------------------------------
1 | michael | Jordan | 12.5.65
Table SPORT
ID | NAME | DESCRIPTION
------------------------------------------
1 | Basketball | Best sport in the world
2 | Golf | Nice sport too
Table PLAYER_SPORT
SPORT_ID | PLAYER_ID | PLAYER_POSITION_ID
--------------------------------------------
1 | 1 | 1 /* Michael Jordan plays Basketball */
2 | 1 | NULL /* Michael Jordan also plays Golf */
Table PLAYER_POSITION
ID | POSITION | DESCRIPTION | SPORT_ID
-------------------------------------------
1 | Middlefield | Any description.. | 1
As far as your table structure is concerned, the best practice is to have another table that maps play_id to player_id. There is no need for the columns player1_id, player2_id, player3_id. Just make a new table which has play_id and player_id columns.
Should I just start off with different tables for each sport, even
though there'd be about a 90% overlap of columns?
I don't think that would help you much. The growth-rate problem of a single table will also occur with segmented tables; this kind of distribution will only delay the problem, not solve it. You will also lose integrity and consistency by violating the normal forms.
At what point should I look into partitioning the table? How would I
do this? I'm thinking of archiving all the plays from the 2012 season
(whether it be a sports specific table or all-inclusive).
You need to use logical database partitioning.
I think a range partition on the match-date field will be helpful.
Documentation on MySQL partitioning can be found here.
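To give a feel for what a range partition on a date does, here is a small Python sketch of the routing rule only: each row goes to the first partition whose upper bound is greater than its date, which is how MySQL's RANGE partitioning with VALUES LESS THAN bounds behaves. This is not MySQL syntax; the partition names, the bounds and the partition_for helper are all invented for illustration.

```python
from bisect import bisect_right
from datetime import date

# Exclusive upper bounds, one per partition except the catch-all at the end.
BOUNDS = [date(2012, 1, 1), date(2013, 1, 1), date(2014, 1, 1)]
NAMES = ["p_before_2012", "p2012", "p2013", "p_future"]


def partition_for(match_date):
    """Return the name of the partition a play with this match date lands in."""
    # bisect_right counts the bounds that are <= match_date, so a date equal
    # to a bound falls into the next partition (bounds are exclusive).
    return NAMES[bisect_right(BOUNDS, match_date)]


print(partition_for(date(2012, 6, 1)))
```

With one partition per season, archiving the 2012 plays becomes an operation on a single partition instead of a large delete on the whole table.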
Recommending NoSQL would require more information about your application; NoSQL comes with its own pros and cons. Having a look at the post may help.

Merging two datasets into a new entity

I am new to all this and struggling to get my head around some of the conundrums thrown up. My area of interest is census data. What I am currently doing is taking the data from the 1901 and 1911 censuses and merging them into a new database. I then ascertain that a particular person is actually the same person on both censuses. Once I am certain that 1901 Jack Thelad (aged 31) with ID 55 is the same as 1911 Jack Thelad (aged 41) with ID 777, what is the best way to deal with the primary key issue?
1901 Jack Thelad ID55
1911 Jack Thelad ID777
MergedCensus Jack Thelad ID???
Should I look on the primary key as being like a social security number, allocate Jack Thelad a number in my MergedCensus and then copy that number back into the 1901 and 1911 data, effectively overwriting ID55 and ID777?
In this new database, which I assume you are designing, could you have a table like this:
newId | name | 1901id | 1911id |
------|-------------|---------|--------|
1234 | Jack Thelad | ID55 | ID777 |
and then you could search
SELECT data, data, data FROM newtable, 1901table, 1911table WHERE newtable.1901id = 1901table.id AND newtable.1911id = 1911table.id
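A runnable sketch of that idea, using Python's sqlite3 module with made-up table and column names (census_1901, census_1911, merged_person, id_1901, id_1911 are all assumptions, chosen so the identifiers do not start with a digit). The original census IDs stay untouched; the merged table gets its own key and simply points at both source rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE census_1901 (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE census_1911 (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);

-- One row per real person, with its own key and a pointer to each census row.
CREATE TABLE merged_person (
    new_id  INTEGER PRIMARY KEY,
    name    TEXT,
    id_1901 INTEGER REFERENCES census_1901(id),
    id_1911 INTEGER REFERENCES census_1911(id)
);

INSERT INTO census_1901 VALUES (55, 'Jack Thelad', 31);
INSERT INTO census_1911 VALUES (777, 'Jack Thelad', 41);
INSERT INTO merged_person VALUES (1234, 'Jack Thelad', 55, 777);
""")

# LEFT JOINs so a person found in only one census still shows up.
row = conn.execute("""
    SELECT m.new_id, m.name, a.age, b.age
    FROM merged_person AS m
    LEFT JOIN census_1901 AS a ON a.id = m.id_1901
    LEFT JOIN census_1911 AS b ON b.id = m.id_1911
    WHERE m.new_id = ?
""", (1234,)).fetchone()
print(row)
```

Keeping the source IDs as they are means the 1901 and 1911 data can always be traced back to the original records, and a wrong match can be undone by editing one row in the merged table.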

What is the best way to handle these MySQL database relationships?

I'm building a small website that lets users recommend their favourite books to each other. So I have two tables, books and groups. A user can have 0 or more books in their library, and a book belongs to 1 or more groups. Currently, my tables look like this:
books table
|---------|------------|---------------|
| book_id | book_title | book_owner_id |
|---------|------------|---------------|
| 22 | something | 12 |
|---------|------------|---------------|
| 23 | something2 | 12 |
|---------|------------|---------------|
groups table
|----------|------------|---------------|---------|
| group_id | group_name | book_owner_id | book_id |
|----------|------------|---------------|---------|
| 231 | random | 12 | 22 |
|----------|------------|---------------|---------|
| 231 | random | 12 | 23 |
|----------|------------|---------------|---------|
As you can see, the relationships between users+books and books+groups are defined in the tables. Should I define the relationships in their own tables instead? Something like this:
books table
|---------|------------|
| book_id | book_title |
|---------|------------|
| 22 | something |
|---------|------------|
| 23 | something2 |
|---------|------------|
books_users_relationsship table
|---------|------------|---------|
| rel_id | user_id | book_id |
|---------|------------|---------|
| 1 | 12 | 22 |
|---------|------------|---------|
| 2 | 12 | 23 |
|---------|------------|---------|
groups table
|----------|------------|
| group_id | group_name |
|----------|------------|
| 231 | random |
|----------|------------|
groups_books_relationsship table
|----------|---------|
| group_id | book_id |
|----------|---------|
| 231 | 22 |
|----------|---------|
| 231 | 23 |
|----------|---------|
Thanks for your time.
The second form with four tables is the correct one. You could delete rel_id from books_users_relationsship, since the primary key can be a composite of user_id and book_id, just like in the groups_books_relationsship table.
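A minimal sketch of the books_users_relationsship table without rel_id, run through Python's sqlite3 module rather than MySQL (an assumption made so the example is self-contained). The add_link and titles_for_user helpers are hypothetical, and the users table is left out because the question does not show it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE books (
    book_id    INTEGER PRIMARY KEY,
    book_title TEXT NOT NULL
);
-- No rel_id: the pair (user_id, book_id) is the primary key.
CREATE TABLE books_users_relationsship (
    user_id INTEGER NOT NULL,
    book_id INTEGER NOT NULL REFERENCES books(book_id),
    PRIMARY KEY (user_id, book_id)
);
INSERT INTO books VALUES (22, 'something'), (23, 'something2');
INSERT INTO books_users_relationsship VALUES (12, 22), (12, 23);
""")


def add_link(user_id, book_id):
    """Return True if the link was stored, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO books_users_relationsship VALUES (?, ?)",
                     (user_id, book_id))
        return True
    except sqlite3.IntegrityError:
        return False


def titles_for_user(user_id):
    """Titles in one user's library, resolved through the link table."""
    rows = conn.execute("""
        SELECT b.book_title
        FROM books_users_relationsship AS r
        JOIN books AS b ON b.book_id = r.book_id
        WHERE r.user_id = ?
        ORDER BY b.book_id
    """, (user_id,)).fetchall()
    return [title for (title,) in rows]
```

The composite key does the job rel_id could not: the same user cannot be linked to the same book twice.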
You do not need a "relationship table" to support a relationship. In Databases, implementing a Foreign Key in a child table defines the Relation between the parent and the child. You need tables only if they contain data, or to resolve a many-to-many relationship (and that has no data other than the Primary Keys of the parents).
The second problem you are facing, the reason the Relations become complex, and even optional, is due to the first two tables not being Normalised. Many problems ensue from that.
if you look closely at book, you may notice that the same book (title) gets repeated
likewise, there is no differentiation between (a) a book in terms of its existence in the world and (b) a copy of a book, that is owned by a member, and available for borrowing
eg. the review is about an existing book, once, and applies to all copies of a book; not to an owned book.
your "relationship" tables also have data in them, and the data is repeated.
all this repeated data needs to be maintained and kept in synch.
all those problems are eliminated if the data is Normalised.
Therefore (since you are seeking the "best way"), the sequence is to normalise the data first, after which (no surprise) the Relations are easy and not complex, and no data is repeated (in either the tables or the relations).
when Normalising, it is best to model the real world (not the entire real world, but whatever parts of it that you are implementing in the database). That insulates your database from the effects of change, and functional extensions to it in future do not require the existing tables to be changed.
It is also important to use accurate names for tables and columns, for the same reason. group is non-specific and will cause a problem in future when you implement some other form of grouping.
The relations can be now defined at the correct "level", between the correct tables.
The need to stick an Id column on everything that moves severely hinders your ability to understand the data and thus the Normalisation process, and robs the database of Relational power.
Notice that the existing keys are already unique and meaningful, short and efficient, no additional surrogate keys (and their additional index) is required.
ReviewerId, OwnerId and BorrowerId are all MemberIds, as Foreign Keys, showing the explicit Role in which they are used.
Note that your problem space is not as simple as you think, it is used as a case study and shipped with tutorials for SQL (eg. MS SQL, Sybase).
Social Library Data Model
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find IDEF1X Notational useful.
I have provided the structure required to support borrowing, to again illustrate how easy it is to implement Relations on Normalised data, and to show the correct tables upon which borrowing depends (it is not between any book and any person; only owned book can be borrowed).
These issues are very important because they define the Referential Integrity of the database.
It is also important to implement that in the database itself, which is the Standard location (rather than in app code all over the place). Declarative Referential Integrity is part of IEC/ISO/ANSI Standard SQL. And the question has a database design tag.
Referential Integrity cannot be defined or enforced in some databases that do not fully implement the SQL Standard (sometimes it can be defined but it is not enforced, which is confusing). Nevertheless, you can design and implement whatever parts of a database your particular database supports.

What is your opinion on using textual identifiers in table columns when approaching the database with normalization and scalability in mind?

Which table structure is considered better normalized?
For example:
Note: idType tells which kind of thing the comment was made on, and subjectid is the id of the item the comment was made on.
Using idType as the textually named identifier for the subjectid:
commentid ---- subjectid ----- idType
--------------------------------------
1 22 post
2 26 photo
3 84 reply
4 36 post
5 22 status
Compared to this.
commentid ---- postid ----- photoid-----replyid
-----------------------------------------------
1 22 NULL NULL
2 NULL 56 NULL
3 23 NULL NULL
4 NULL NULL 55
5 26 NULL NULL
I am looking at both of them, and I don't think I would be able to attach a foreign key constraint to the first table =( (i.e. a comment gets deleted if the post or photo is deleted), whereas in the second one that is possible. How would you approach a similar issue, keeping in mind that the database will need to expand over time and data integrity is also important =)?
Thanks
The first is more normalized, if slightly incomplete. There are a couple of approaches you can take, the simplest (and strictly speaking, the most 'correct') will need two tables, with the obvious FK constraint.
commentid ---- subjectid ----- idType
--------------------------------------
1 22 post
2 26 photo
3 84 reply
4 36 post
5 22 status
idType
------
post
photo
reply
status
If you like, you can use a char(1) or similar to reduce the impact of the varchar on key/index length, or to facilitate use with an ORM if you plan to use one. NULLs are always a bother, and if you start to see them turn up in your design, you will be better off if you can figure out a convenient way to eliminate them.
The second approach is one I prefer when dealing with more than 100 million rows:
commentid ---- subjectid
------------------------
1 22
2 26
3 84
4 36
5 22
postIds ---- subjectid
----------------------
1 22
4 36
photoIds ---- subjectid
-----------------------
2 26
replyIds ---- subjectid
-----------------------
3 84
statusIds ---- subjectid
------------------------
5 22
There is of course also the (slightly denormalized) hybrid approach, which I use extensively with large datasets, as they tend to be dirty. Simply provide the specialization tables for the pre-defined idTypes, but keep an ad hoc idType column on the commentId table.
Note that even the hybrid approach only requires 2x the space of the denormalized table, and provides trivial query restriction by idType. The integrity constraint, however, is not straightforward, being an FK constraint on a derived UNION of the type-tables. My general approach is to use a trigger on either the hybrid table or an equivalent updatable view to propagate updates to the correct sub-type table.
Both the simple approach and the more complex sub-type table approach work; still, for most purposes KISS applies, so I suspect you should probably just introduce an ID_TYPES table, the relevant FK, and be done with it.
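For what the simple approach might look like in practice, here is a rough sketch using Python's sqlite3 module (my choice of engine for a self-contained example; the question does not name one). The snake_case names and the add_comment helper are illustrative, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
-- Lookup table listing the allowed textual identifiers.
CREATE TABLE id_types (
    id_type TEXT PRIMARY KEY
);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL,
    id_type    TEXT NOT NULL REFERENCES id_types(id_type)
);
INSERT INTO id_types VALUES ('post'), ('photo'), ('reply'), ('status');
INSERT INTO comments VALUES
    (1, 22, 'post'), (2, 26, 'photo'), (3, 84, 'reply'),
    (4, 36, 'post'), (5, 22, 'status');
""")


def add_comment(comment_id, subject_id, id_type):
    """Return True if the comment was stored, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO comments VALUES (?, ?, ?)",
                     (comment_id, subject_id, id_type))
        return True
    except sqlite3.IntegrityError:
        return False


def comments_on(id_type):
    """Comment ids restricted to one kind of subject."""
    rows = conn.execute(
        "SELECT comment_id FROM comments WHERE id_type = ? ORDER BY comment_id",
        (id_type,),
    ).fetchall()
    return [cid for (cid,) in rows]
```

This FK only guarantees that id_type holds a known value. It does not check that subject_id 22 really exists among the posts, which is the harder integrity constraint discussed above and the reason for the sub-type tables or a trigger.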