I am trying to create a database that can be used the way Twitter works. That is:
Tree structure: any node can have multiple child nodes.
All nodes have a timestamp
Criteria 1 and 2 suggest a table structure based on basic columns, something like:
NodeID (int)
ParentNodeID (int)
UserID (int)
TS (TimeStamp)
MSG (varchar)
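A minimal sketch of that table in MySQL might be (types and sizes are my assumptions):
CREATE TABLE nodes (
    NodeID       INT NOT NULL AUTO_INCREMENT,
    ParentNodeID INT NULL,                 -- NULL for the root node
    UserID       INT NOT NULL,
    TS           TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    MSG          VARCHAR(280),
    PRIMARY KEY (NodeID),
    KEY idx_parent (ParentNodeID)          -- for walking to/collecting children
);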
When viewing any node (n), all parent nodes up to and including the root should be selected; that is easy using the ParentNodeID pointer.
Here comes the caveat: in addition to the parent nodes, all child nodes of the current node (n) should also be selected from the table in chronological order (based on TS). That means all child nodes, no matter which branch, that belong to the subtree where (n) is the root.
How do I best structure the table for such queries?
You should take a look at how Twitter has been evolving, and check whether your use case is similar enough.
A good start could be this article with database schema examples: https://web.archive.org/web/20161224194257/http://www.cubrid.org/blog/dev-platform/decomposing-twitter-database-perspective/
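If you keep the adjacency-list table from the question and you are on MySQL 8.0+ (recursive CTEs), the subtree part could be sketched like this (column names follow the question; 42 stands in for node (n)):
WITH RECURSIVE subtree AS (
    -- anchor: the node (n) itself
    SELECT NodeID, ParentNodeID, UserID, TS, MSG
    FROM nodes
    WHERE NodeID = 42
    UNION ALL
    -- recurse: children of anything already in the subtree
    SELECT c.NodeID, c.ParentNodeID, c.UserID, c.TS, c.MSG
    FROM nodes c
    JOIN subtree s ON c.ParentNodeID = s.NodeID
)
SELECT * FROM subtree ORDER BY TS;  -- the whole subtree, chronologically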
Structure in MySQL (for compactness I am using a simplified notation)
Notation: table name->[column1(key or index), column2, …]
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index), title, description]
Each document can contain a large number of elements (between 1 and 100k+)
We have two key requirements:
Load all elements for a given doc_id quickly
Update the value of one individual element by its element_id quickly
Structure in Cassandra
1st solution
documents->[doc_id(primary key), title, description, elements] (elements could be a SET or a TEXT; each time new elements are added (they are never removed), we would append them to this column)
elements->[element_id(primary key), title, description]
To load a document we would need:
Load the document with the given id and get all element ids: SELECT * FROM documents WHERE doc_id='id'
Load all elements with the given ids: SELECT * FROM elements WHERE element_id IN (the ids loaded by the previous query)
Updating elements would be done by their primary key.
2nd solution
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(secondary index), title, description]
To load a document we would need:
SELECT * FROM elements WHERE doc_id='id'
Updating elements would be done by their primary key.
Questions regarding our solutions:
1st: Will it be efficient to query 100k+ primary keys in the elements table?
SELECT * FROM elements WHERE element_id IN (element_id1,.... element_id100K+)?
2nd: Will it be efficient to query just by a secondary index?
Could anyone give any advice on how we should model our use case?
With Cassandra it's all about the access pattern (I hope I understood it correctly; if not, please comment).
1st
documents should not use sets, because a set is limited to 65,535 elements and has to be read and updated in its entirety every time a change is made. Since you need 100k+, it's not what you want. You could use frozen collections etc., but then again, reading everything into memory every time is bound to be slow.
2nd
Secondary indexes, well, might be fine for small-cardinality data. From what I understand you have 100k elements per document; this might even be fine, but then again it's not best practice. I would simply try it out in your concrete case.
3rd - the disk-is-cheap approach: always write the data the way you are going to read it. Cassandra's writes are dirt cheap, so prepare the views at write time.
This one satisfies reading all the elements belonging to a doc_id:
documents->[doc_id(primary key), title_doc (static), description_doc(static), element_id(clustering key), title, description]
elements remain pretty much as they were:
elements->[element_id(primary key), doc_id, title, description]
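In CQL, a sketch of that denormalized documents table (the static columns hold the per-document fields once per partition):
CREATE TABLE documents (
    doc_id          uuid,
    title_doc       text STATIC,   -- stored once per doc_id partition
    description_doc text STATIC,
    element_id      uuid,          -- clustering key: one row per element
    title           text,
    description     text,
    PRIMARY KEY (doc_id, element_id)
);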
When doing updates, you update both documents and elements (for consistency you can use a batch operation, should you need it). If you have an element_id, you can quickly issue another query once you get its doc_id.
Depending on your updating needs, the doc_id could also be a set. (I might not have gotten this part right, because I'm not sure what data is available when updating an element: do you also have the doc_id, and can one element belong to more than one doc?)
Also, since having 100k+ elements in a single partition is not the best thing because of retrievals (all requests will go to one node), I would propose a composite partitioning key, i.e. a bucket; I think in your case a simple fixed int would be just fine. So every time you go to retrieve the elements, you just issue selects for doc_id + (1, 2, 3, 4 ...) and then merge the results at the client; this will be significantly faster.
One tricky part is that you don't want to go into every single bucket to find the element_id that is stored in the document ... when I think about it, it would be better to use a power of two for the bucket count. In your case 16 would be ideal ... then when you want to update a specific element, just apply some simple hash function known to you and take the last 4 bits as the bucket.
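A sketch of the bucketed variant, assuming a fixed bucket count of 16:
CREATE TABLE documents_bucketed (
    doc_id      uuid,
    bucket      int,               -- e.g. last 4 bits of a hash of element_id
    element_id  uuid,
    title       text,
    description text,
    PRIMARY KEY ((doc_id, bucket), element_id)
);
-- read path: one select per bucket (ideally in parallel), merged client-side
SELECT * FROM documents_bucketed WHERE doc_id = ? AND bucket = 0;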
Now that I think about it, if the element_id + doc_id is always known to you, you might not even need the elements table at all.
Hope this helps
Based on the suggestion of Marko, our solution is:
CREATE TABLE documents (
doc_id uuid,
description text,
title text,
PRIMARY KEY (doc_id)
);
CREATE TABLE elements (
doc_id uuid,
element_id uuid,
title text,
PRIMARY KEY (doc_id, element_id)
);
We can retrieve all elements with the following query:
SELECT * FROM elements WHERE doc_id='id'
And update the elements:
UPDATE elements SET title='Hello' WHERE doc_id='id' AND element_id='id';
I am using a modified version of an adjacency list to store hierarchical data: instead of a single parent pointer, each row stores separate level columns (three in my case), so a tree of this sort is created, and that is what the MySQL schema represents.
What is the best way to delete a node in between, say C, so that its children F and G become children of A?
At the application level, you'll need to list the children of the node to be deleted and reassign those children's level columns to the parent node.
Alternatively, you can create an ON DELETE trigger on the table which will do the above task.
If you had structured your table correctly, all of your tasks would have been much easier.
The table should just store a parent-child relation where the immediate parent is saved:
Id  Name  ParentId
1   A     NULL
2   B     1
3   D     2
Levels can be determined by the number of times the recursion is done.
You can even do that now, by adding the parent column and writing a script that fills it with the deepest level available in the level columns.
So in this case, deletion would simply be:
1. Get the list of children: WHERE parentid = nodeid.
2. Update their parent column with the deleted node's parentid.
instead of updating and recalculating all 3 level columns.
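As a sketch with the plain parent-child table above (the table name nodes is assumed), deleting C and reattaching its children could look like this in MySQL:
START TRANSACTION;
-- reattach the children of the node being deleted to its parent
UPDATE nodes child
JOIN nodes victim ON child.ParentId = victim.Id
SET child.ParentId = victim.ParentId
WHERE victim.Id = @node_to_delete;
-- then remove the node itself
DELETE FROM nodes WHERE Id = @node_to_delete;
COMMIT;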
In my project I have locations of the product, which may look like this:
- Location 1
-- SubLocation 1
-- SubLocation 2
- Location 2
-- SubLocation 3
-- SubLocation 4
Imagine zones with subzones in a facility.
I need to store that in the DB and then retrieve it some time later, like this: SubLocation 1 at Location 1.
My first guess is to have two tables with a one-to-many relationship, but that won't scale if later I need to have something like this:
- Location 2
-- SubLocation 3
-- SubLocation 4
---- SubLocation 5
---- SubLocation 6
So my question is what's the best way to store such structure in relational database?
You can define a parent_id column as an FK reference to another record's id (roots have a NULL parent_id).
To define the hierarchy and retrieve a whole subtree in one query, you can define an additional field path (VARCHAR). The field should hold the full path of ids separated with '_'.
In your case SubLocation 5 has the path="2_4_5"
To retrieve all the children of SubLocation 4 you can use
select *
from myTable
where path like '2\_4\_%';
(Note that _ is a single-character wildcard in LIKE, so it has to be escaped; the trailing separator also keeps ids like 40 or 41 from matching.)
There is a depth restriction (the size of the path field, in fact), but for most cases it should work.
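For concreteness, a minimal sketch of such a table (MySQL; names and sizes are assumptions):
CREATE TABLE locations (
    id        INT          NOT NULL PRIMARY KEY,
    parent_id INT          NULL,         -- NULL for roots
    name      VARCHAR(100) NOT NULL,
    path      VARCHAR(255) NOT NULL,     -- e.g. '2_4_5'; bounds the tree depth
    KEY idx_path (path)                  -- lets the prefix LIKE query use an index
);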
Dealing with hierarchical data is difficult in MySQL. So, although you might store the data in recursive tables, querying the data is (in general) not easy.
If you have a fixed set of hierarchies, such as three (I'm thinking "city", "state", "country"), then you can have a separate table for each entity. This works and is particularly useful in situations where the elements can change over time.
Alternatively, you can have a single table that flattens out the dimensions. So, "city", "state", and "country" are all stored on a single row. This flattens out the data, so it is no longer normalized. Updates become tedious. But if the data is rarely updated, then that is not an issue. This form is a "dimensional" form and used for OLAP solutions.
There are hybrid approaches, where you store each element in a single table, in a recursive form. However, the table also contains the "full path" to the top. For instance in your last example:
/location2/sublocation3
/location2/sublocation4
/location2/sublocation4/sublocation5
/location2/sublocation4/sublocation6
This facilitates querying the data. But it comes at the cost of maintenance. Changing something such as sublocation4 requires changing many rows. Think triggers.
The easiest solution is to use different tables for different entities, if you can.
You can store it in one table and retrieve sublocations using a self join.
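A sketch of that self join, assuming a single locations(id, name, parent_id) table:
-- each sublocation together with its parent location's name
SELECT child.id, child.name, parent.name AS parent_name
FROM locations child
JOIN locations parent ON parent.id = child.parent_id;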
I want to have folders and documents, where every document lives in a folder. Folders can have infinitely nested child folders. What is the best MySQL schema in your opinion? Do you think this is good?
Table Folders
id
name
parent (if null the root)
auth_user (access control type)
created_date
created_by
Table documents
id
name
type
idFolder (FK id of folders)
auth_user (access control type)
created_date
created_by
Do you think the above is good, or will it cause problems later? Do you think with the above I can get the folder tree quickly and easily (I think ORDER BY parent ASC can get the tree right)?
Adjacency lists are nice for inserts and for moving sub-trees, but if you need to query deeper than one level it's a pain in the a**, because you will end up with n joins if you go n levels deep. An example: show me all descendants/ancestors of folder X.
I suggest using the adjacency list (the parent_id) in combination with one of the following models:
Nested Sets
Materialized Paths
I really like the nested set, but it has a drawback: inserts are slow. Usually, though, you will have more reads (browsing the structure) than inserts of new nodes.
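For illustration, the classic nested-set descendants query; it assumes each row carries lft/rgt boundary numbers:
-- all descendants of folder X, however deep
SELECT child.*
FROM folders parent
JOIN folders child
  ON child.lft > parent.lft AND child.rgt < parent.rgt
WHERE parent.id = 42;   -- folder X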
Another thing:
I usually put folders and documents in the same table and flag them with a boolean is_folder column. I like to think of folders/files as "nodes" in a tree so they're basically the same. Further metadata will be stored in another table.
I am working on a document system and have some logic/architectural problems. In this system there will be many types of documents: incoming, outgoing, etc. Every document type has its own number of columns which must be filled. On paper it's all easy, but in software I need some advice :)
For example:
incoming document type 1 has 16 columns,
outgoing document type 1 has 15 columns,
inner document has 9 columns,
etc...
At first I thought that I would make one table, named "Categories", where the document types (incoming, outgoing, etc.) would be stored in a tree, and one generic table, "Documents", with the maximum possible number of columns (for example 25), where all documents would be stored; any cell a type doesn't use would simply be ignored.
Then I thought that I could make it much simpler, with a separate table per document type, but after some thinking that seemed to be a worse solution.
So I want the best possible solution for this.
Maybe you can help me?
Thanks!
This is a typical example of table inheritance. You'd do something like this:
Document
----------
DocumentId (PK)
DocumentType
... any columns common to the different formats
DocumentIncoming
----------
DocumentId (PK, FK to Document)
... columns specific to Incoming
DocumentOutgoing
----------
DocumentId (PK, FK to Document)
... columns specific to Outgoing
Use a central "Documents" table that contains a category code and only those columns that apply to every single category.
Then, for each category use a table that links back to the appropriate record in Documents and "adds" the additional columns appropriate for that category.
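A minimal sketch of that layout (the specific columns such as Title, Sender, Recipient are placeholders):
CREATE TABLE Document (
    DocumentId   INT NOT NULL PRIMARY KEY,
    DocumentType VARCHAR(20) NOT NULL,  -- e.g. 'incoming', 'outgoing'
    Title        VARCHAR(200)           -- ...columns common to all types
);
CREATE TABLE DocumentIncoming (
    DocumentId INT NOT NULL PRIMARY KEY,
    Sender     VARCHAR(200),            -- ...columns specific to incoming
    FOREIGN KEY (DocumentId) REFERENCES Document (DocumentId)
);
CREATE TABLE DocumentOutgoing (
    DocumentId INT NOT NULL PRIMARY KEY,
    Recipient  VARCHAR(200),            -- ...columns specific to outgoing
    FOREIGN KEY (DocumentId) REFERENCES Document (DocumentId)
);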