Data structure for a set of changes similar to SVN? - mysql

So far we have been storing change information as follows.
Imagine a changeset table for something that gets changed, which we call an object. The object is connected to, say, a foreign element by a foreign key. The object gets created like this:
changesetId (Timestamp) | objectId | foreignKey | name (String) | description (String)
2015-04-29 23:28:52 | 2 | 123 | none | none
Now we change the name; after the name change the table will look like this:
changesetId (Timestamp) | objectId | foreignKey | name (String) | description (String)
2015-04-29 23:28:52 | 2 | 123 | none | none
2015-04-29 23:30:01 | 2 | null | foo | null
This structure is the exact minimum: it contains exactly the change we made. But to get the current version of the object, we have to add up the changes to compute the final version. E.g.
changesetId (Timestamp) | objectId | foreignKey | name (String) | description (String)
2015-04-29 23:28:52 | 2 | 123 | none | none
2015-04-29 23:30:01 | 2 | null | foo | null
*2015-04-29 23:30:01 | 2 | 123 | foo | none
the * marking the final version, which does not exist in the DB.
So if we only store the changes themselves, we have more work to do, especially when coming from a foreign object f. If I have a number of objects f and I want to get all changes to the object from our table, I have to write a bit of ugly SQL. This obviously gets worse the more foreign objects you have.
Basically I have to do:
Select all F that I want and
Select all objects WHERE foreignKey = foreignId
OR Select all objects that have objectId in (Select all objects that have foreignKey = foreignId)
E.g. I have to select the rows that have foreignKey 123, or rows whose foreignKey is null but for which there exists an entry with the same objectId and foreignKey 123 (roughly the query sketched below).
Obviously, the more dependencies there are, the uglier this SQL gets.
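A sketch of that query; the table name object_changes is an assumption, since the question does not name the change table:
-- All change rows belonging to objects referenced by foreign key 123,
-- including the delta rows whose foreignKey column is null.
SELECT *
FROM object_changes
WHERE foreignKey = 123
   OR objectId IN (SELECT objectId
                   FROM object_changes
                   WHERE foreignKey = 123)
ORDER BY objectId, changesetId;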
Did I make myself clear?
Wouldn't it be much easier to always keep all fields in all versions?
E.g. a simple name change becomes:
changesetId (Timestamp) | objectId | foreignKey | name (String) | description (String)
2015-04-29 23:28:52 | 2 | 123 | none | none
2015-04-29 23:30:01 | 2 | 123 | foo | none
Now, to create a diff I have to compare two versions, but I no longer have the extra work of selecting the right rows or of computing the final version at a given timestamp.
What do you consider the proven best solution?
How does SVN do it?

For your use case the method you suggest seems to be better. Key-value stores built on LSM trees do exactly the same: they just write a newer version of the object without deleting the older version. If, at any point in time, you need the change that was made, I think you can just diff two adjacent versions (see the sketch below).
The second method might use more space if you have a lot of variable-length text fields, but that's the trade-off you accept for speed and maintainability.
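The "diff two adjacent versions" step could look roughly like this; a minimal sketch assuming the full-row variant is stored in a table called object_versions (the table name is an assumption, not from the question):
-- For object 2, pair each version with the version immediately before it
-- and show old and new values side by side.
SELECT prev.changesetId AS from_changeset,
       cur.changesetId  AS to_changeset,
       prev.name        AS old_name,
       cur.name         AS new_name,
       prev.description AS old_description,
       cur.description  AS new_description
FROM object_versions AS cur
JOIN object_versions AS prev
  ON prev.objectId    = cur.objectId
 AND prev.changesetId = (SELECT MAX(p.changesetId)
                         FROM object_versions AS p
                         WHERE p.objectId    = cur.objectId
                           AND p.changesetId < cur.changesetId)
WHERE cur.objectId = 2;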

Related

Data model: Having separate table for a field vs having it as a column in the main table

We have a scenario where we want to store the 'status' (say, of a 'user').
We want to restrict the allowed values of the 'user' status.
So we considered two alternatives:
Have 'status' as a column of enumeration type in the 'user' table
Have a separate table for 'status', populate it with the allowed values during DB initialisation, and reference it with a foreign key in the 'user' table.
Can you suggest which is the better approach, and why?
I'd appreciate any references on what the best practice is.
Enum is less preferred; use a separate table for statuses. With a separate table it is easy to change or add statuses, to attach related data (just add a new field to your status table if you ever need it in the future), and to get a list of distinct statuses. You also have the option of letting the status field in the main table be NULL or default to some other value, and you can reuse the statuses in other tables.
If you have only 2 statuses, say 'active' and 'inactive', just use a BOOL (or TINYINT) field in the main table.
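A rough sketch of the separate-table approach described above; the table and column names (user_status, status_id) and the sample values are illustrative, not from the question:
CREATE TABLE user_status (
    id TINYINT UNSIGNED PRIMARY KEY,
    name VARCHAR(30) NOT NULL UNIQUE
);

-- Populated once, during DB initialisation
INSERT INTO user_status (id, name)
VALUES (1, 'active'), (2, 'inactive'), (3, 'suspended');

CREATE TABLE user (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    status_id TINYINT UNSIGNED NULL,  -- may be NULL or given a DEFAULT
    FOREIGN KEY (status_id) REFERENCES user_status(id)
);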
(There are many Q&A debating ENUM vs TINYINT UNSIGNED vs VARCHAR(..).)
If the set of options is not likely to change often, then I vote for ENUM.
Acts and feels like a human-readable string.
1 byte. (I would not make an enum with 256+ options; not even more than, say, a dozen.)
I would consider starting with option "unknown" instead of making the column nullable. This is a crude way to deal with spelling errors in input.
BOOL:
may have some hiccups; I avoid it.
In the grand scheme of things, it usually does not save enough space to matter.
I would consider using SET or *INT for a large number of boolean flags.
Any column (enum/tinyint/bool) with low cardinality will not be useful alone in an index such as INDEX(status). OTOH, a composite index may be useful, such as INDEX(status, create_date); see the sketch below.
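For comparison, a sketch of the ENUM alternative with an 'unknown' starting option and the composite index mentioned above; the table name and status values are illustrative, not from the question:
CREATE TABLE user (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    status ENUM('unknown', 'active', 'inactive', 'suspended')
        NOT NULL DEFAULT 'unknown',      -- 'unknown' instead of a nullable column
    create_date DATETIME NOT NULL,
    INDEX(status, create_date)           -- status alone has too low cardinality to index usefully
);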
ENUM examples
Here are some enums I have encountered that have more than 2 options; judge for yourself whether they are good or bad:
| Database | Column | ENUM |
| mysql | sql_data_access | enum('CONTAINS_SQL','NO_SQL','READS_SQL_DATA','MODIFIES_SQL_DATA') |
| mysql | interval_field | enum('YEAR','QUARTER','MONTH','DAY','HOUR','MINUTE','WEEK','SECOND |
| mysql | ssl_type | enum('','ANY','X509','SPECIFIED') |
| performance_schema | TIMER_NAME | enum('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK') |
| common_schema | hint_type | enum('step_into','step_over','step_out','run') |
| common_schema | statement_type | enum('sql','script','script,sql','unknown') |
| mworld | Continent | enum('Asia','Europe','North America','Africa','Oceania','Antarctic |
| try | priority | enum('LOW','NORMAL','HIGH','UBER') |
| alerts | Stage | enum('DISCOVER','NOTIFY','ACK','CLEAR') |
| todo | stage | enum('unk','load','priming','running','stopping') |
| zip | z_type | enum('STANDARD','UNIQUE','','PO BOX ONLY','Community Post Office', |

MySQL Table structure: Multiple attributes for each item

I wanted to ask which would be the best approach for my MySQL database structure in the following case.
I've got a table with items, which I don't need to describe, as the only important field here is the ID.
Now, I'd like to be able to assign some attributes to each item - by its ID, of course. But I don't know exactly how to do it, as I'd like to keep it dynamic (so that I do not have to modify the table structure when I want to add a new attribute type).
What I think
I think - and, in fact, this is the structure I have right now - that I can make a table items_attributes with the following structure:
+----+---------+----------------+-----------------+
| id | item_id | attribute_name | attribute_value |
+----+---------+----------------+-----------------+
| 1 | 1 | place | Barcelona |
| 2 | 2 | author_name | Matt |
| 3 | 1 | author_name | Kate |
| 4 | 1 | pages | 200 |
| 5 | 1 | author_name | John |
+----+---------+----------------+-----------------+
I put in some example data so you can see that those attributes can be repeated (it's not a 1-to-1 relationship).
The problem with this approach
I need to run some queries, some of them for statistical purposes, and if I have a lot of attributes for a lot of items, this can be a bit slow.
Furthermore - maybe because I'm not an expert in MySQL - every time I want to search for "those items that have 'place' = 'Barcelona' AND 'author_name' = 'John'", I end up having to add a JOIN for every condition.
Repeating the example before, my query would end up like:
SELECT *
FROM items its
JOIN items_attributes attr
ON its.id = attr.item_id
AND attr.attribute_name = 'place'
AND attr.attribute_value = 'Barcelona'
AND attr.attribute_name = 'author_name'
AND attr.attribute_value = 'John';
As you can see, this will return nothing, as an attribute_name cannot have two values at once in the same row, and an OR condition is not what I'm looking for, since the items MUST have both attribute values as stated.
So the only possibility is to join the same table again for every condition to search (roughly the sketch below), which I think is very slow when there are a lot of terms to search for.
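For reference, a sketch of that join-per-condition version of the example query (one extra join for every attribute filter), using the tables from the question:
SELECT its.*
FROM items its
JOIN items_attributes place_attr
  ON place_attr.item_id = its.id
 AND place_attr.attribute_name = 'place'
 AND place_attr.attribute_value = 'Barcelona'
JOIN items_attributes author_attr
  ON author_attr.item_id = its.id
 AND author_attr.attribute_name = 'author_name'
 AND author_attr.attribute_value = 'John';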
What I'd like
As I said, I'd like to keep the attribute types dynamic, so that adding a new value in 'attribute_name' would be enough, without having to add a new column to a table. Also, as it is a 1-N relationship, the attributes cannot be added to the 'items' table as new columns.
If, in your opinion, this structure is the only one that meets my needs, it would also be great if you could offer some ideas so that the search queries are not a ton of JOINs.
I don't know whether this is hard to achieve; I've been racking my brain over it and haven't come up with a solution. Hope you guys can help me with that!
In any case, thank you for your time and attention!
Kind regards.
You're thinking in the right direction: the direction of normalization. The normal form you would like to have in your database is the fifth normal form (or even the sixth). See Stack Overflow on this matter.
Table Attribute:
+----+----------------+
| id | attribute_name |
+----+----------------+
| 1 | place |
| 2 | author name |
| 3 | pages |
+----+----------------+
Table ItemAttribute
+--------+----------------+
| item_id| attribute_id |
+--------+----------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
+--------+----------------+
So for each property of an object (item in this case) you create a new table and name it accordingly. It requires lots of joins, but your database will be highly flexible and organized. Good luck!
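A sketch of a filter query against this layout; it assumes ItemAttribute also keeps the attribute_value column from the original items_attributes table (the answer above does not show it). Each additional attribute filter adds another pair of joins:
SELECT i.*
FROM items i
JOIN ItemAttribute ia ON ia.item_id = i.id
JOIN Attribute a      ON a.id = ia.attribute_id
WHERE a.attribute_name   = 'place'
  AND ia.attribute_value = 'Barcelona';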
In my opinion it should be something like this. I know there are a lot of tables, but it actually normalizes your DB.
Maybe that is because I can't understand where you get your att_value column from, or what this column should contain.

MySQL: Finding existences between values in database and array

I'd like to know how I can make a single query to find which values exist and which do not. Let me explain.
What I have
I've got a database table with a structure as follows:
+----+--------+-----------+-----------+
| id | action | button_id | type |
+----+--------+-----------+-----------+
| 1 | 1 | 1 | button |
| 2 | 2 | 4 | button |
| 3 | 1 | 2 | attribute |
+----+--------+-----------+-----------+
As you can see, an action can have multiple button_id values. For your information, a button_id can be assigned to multiple actions, too, but a button_id can only have one type for a given action.
So button_id 1 can also be present in action 4 with the type "attribute" set on it, but it cannot be duplicated in the same action with another type.
The problem
The problem comes when I want to update the buttons in an action. I receive an action object (in PHP) with an array of the buttons it has, with the structure below (written here as JSON):
"buttons":
[
{
"id":"1",
"type":"button"
},
{
"id":"3",
"type":"attribute"
}
]
As you can see, the button with ID 1 remains the same, but I've got a new button to deal with (the button with ID 3) and the button with ID 2 is not present anymore.
What I'd want
I'd like to be able to make a single MySQL query that returns which of the values I receive exist and which do not, plus which values are present in the database but not in that array.
To sum up: I want to know the differences between the buttons in the array received and those present in the database.
So, as an example with the received data described before and the database as we have it right now, I expect to receive something like this:
+--------+-----------+--------+------------+
| action | button_id | exists | is_present |
+--------+-----------+--------+------------+
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 0 |
| 1 | 3 | 0 | 1 |
+--------+-----------+--------+------------+
With this information, I'd be able to know that the button with ID 2 does not exist anymore (because it's not present in the new array) and that the button with ID 3 is a new button, because it did not exist previously but is present in the new array.
What I've tried
I've tried a few things, but none of them gives me what I need, and not only with pure MySQL queries.
For example, I've tried checking the existence of each button I receive, but that way I cannot find out whether a button has been deleted (i.e. is no longer present in the received array).
Checking the other way around, taking the buttons in the database as the reference, has the same problem: I can see which have been updated or deleted, but it skips those that are new and not yet in the database.
I've also tried writing queries with COUNT and GROUP BY button_id, and so on, but no luck either.
(I won't paste the queries because none of them gave me the expected results, so they won't be of any help to you.)
Any combination of the approaches above will, I think, be much slower than doing it purely with database queries, and that's why I'm asking.
The question
Is there a query that would return something like the result shown in the "What I'd want" section, so that only one call to the MySQL server is needed?
Thank you all for your time, your responses, and your patience with any information I may have left out.
Of course, if you have any doubts or questions, or need more information, leave a comment and I'll try to explain it better or add it.
Kind regards.
Doing that in a single query would be very cumbersome. Here is a solution that is not exactly what you are looking for but should do the job.
Let's say your table looks like this:
CREATE TABLE htmlComponent
(
    id int auto_increment primary key,
    action int,
    button_id int not null,
    type varchar(20),
    dtInserted datetime,
    dtUpdated datetime
);
CREATE UNIQUE INDEX buttonType ON htmlComponent(button_id, type);
Now we need to update the table according to the buttons / attributes you have for a specific action.
-- Reset dtInserted and dtUpdated for action 1
UPDATE htmlComponent SET dtInserted = null, dtUpdated = null WHERE action=1;
-- INSERT or UPDATE according to the data inside the json structure
INSERT INTO htmlComponent (action, button_id, type, dtInserted)
VALUES
    (1, 1, 'button', NOW()),
    (1, 3, 'attribute', NOW())
ON DUPLICATE KEY UPDATE
    button_id = VALUES(button_id),
    type = VALUES(type),
    dtInserted = null,
    dtUpdated = NOW();
-- Getting the result
SELECT * FROM htmlComponent where action=1;
You should end up with this result, which makes it easy to figure out what doesn't exist anymore, what is new, and what was updated.
+----+--------+-----------+-----------+----------------------------+----------------------------+
| ID | ACTION | BUTTON_ID | TYPE | DTINSERTED | DTUPDATED |
+----+--------+-----------+-----------+----------------------------+----------------------------+
| 1 | 1 | 1 | button | (null) | February, 09 2015 16:21:49 |
| 3 | 1 | 2 | attribute | (null) | (null) |
| 4 | 1 | 3 | attribute | February, 09 2015 16:21:49 | (null) |
+----+--------+-----------+-----------+----------------------------+----------------------------+
Here is a fiddle. Please note I had to put the UPDATE and the INSERT in the left panel because DML statements are not allowed in the query panel.
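For completeness, here is a sketch of the single-query exists/is_present comparison the question asks for, run against the same htmlComponent table; the received ids (1 and 3) are inlined as literals, which in practice the PHP code would have to generate:
SELECT 1 AS action,
       ids.button_id,
       (db.button_id  IS NOT NULL) AS `exists`,      -- already stored for action 1
       (rcv.button_id IS NOT NULL) AS is_present     -- present in the received array
FROM (
       SELECT button_id FROM htmlComponent WHERE action = 1
       UNION
       SELECT 1
       UNION
       SELECT 3
     ) AS ids
LEFT JOIN htmlComponent AS db
       ON db.action = 1 AND db.button_id = ids.button_id
LEFT JOIN (SELECT 1 AS button_id UNION SELECT 3) AS rcv
       ON rcv.button_id = ids.button_id
ORDER BY ids.button_id;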

How to split CSVs from one column to rows in a new table in MSSQL 2008 R2

Imagine the following (very bad) table design in MSSQL2008R2:
Table "Posts":
| Id (PK, int) | DatasourceId (PK, int) | QuotedPostIds (nvarchar(255)) | [...]
| 1 | 1 | | [...]
| 2 | 1 | 1 | [...]
| 2 | 2 | 1 | [...]
[...]
| 102322 | 2 | 123;45345;4356;76757 | [...]
So, the column QuotedPostIds contains a semicolon-separated list of self-referencing PostIds (Kids, don't do that at home!). Since this design is ugly as hell, I'd like to extract the values from the QuotedPostIds column into a new n:m relationship table like this:
Desired new table "QuotedPosts":
| QuotingPostId (int) | QuotedPostId (int) | DatasourceId (int) |
| 2 | 1 | 1 |
| 2 | 1 | 2 |
[...]
| 102322 | 123 | 2 |
| 102322 | 45345 | 2 |
| 102322 | 4356 | 2 |
| 102322 | 76757 | 2 |
The primary key for this table could either be a combination of QuotingPostId, QuotedPostId and DatasourceID or an additional artificial key generated by the database.
It is worth noticing that the current Posts table contains about 6,300,000 rows but only about 285,000 of those have a value set in the QuotedPostIds column. Therefore, it might be a good idea to pre-filter those rows. In any case, I'd like to perform the normalization using internal MSSQL functionality only, if possible.
I have already read other posts on this topic, which mostly dealt with split functions, but I could not find out how exactly to create the new table while also copying the appropriate value from the DatasourceId column, nor how to filter the affected rows accordingly.
Thank you!
Edit: I thought it through and finally solved the problem using an external C# program instead of internal MSSQL functionality. Since it seems that it could have been done using Mikael Eriksson's suggestion, I will mark his post as an answer.
From the comments, you say you have a string split function that you don't know how to use with a table.
The answer is to use CROSS APPLY, something like this:
select P.Id,
       S.Value
from Posts as P
  cross apply dbo.Split(';', P.QuotedPostIds) as S
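To also populate the new QuotedPosts table while carrying DatasourceId along and pre-filtering empty rows, a sketch along the same lines (it assumes the same dbo.Split helper returning a Value column, and that the QuotedPosts table already exists):
insert into QuotedPosts (QuotingPostId, QuotedPostId, DatasourceId)
select P.Id,
       cast(S.Value as int),
       P.DatasourceId
from Posts as P
  cross apply dbo.Split(';', P.QuotedPostIds) as S
where P.QuotedPostIds is not null
  and P.QuotedPostIds <> '';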

Database "pointers" to rows?

Is there a way to have "pointers to rows" in a database?
For example, I have X product rows; all these rows represent distinct products, but many have the same field values except that their "id" and "color_id" differ.
I thought of just duplicating the rows, but this could be error-prone; plus, a small change would have to be made on several rows, which is again buggy.
Question: Is there a way to fill some rows fully, then use a special value to "point to" certain field values?
For example:
id | field1 | field2 | field3 | color_id
-----------------------------------------------
1 | value1 | value2 | value3 | blue
2 | point[1] | point[1] | point[1] | red (same as row 1, except id and color)
3 | point[1] | point[1] | point[1] | green (same as row 1, except id and color)
4 | valueA | valueB | valueC | orange
5 | point[4] | point[4] | point[4] | brown (same as row 4, except id and color)
6 | valueX | valueY | valueZ | pink
7 | point[6] | point[6] | point[6] | yellow (same as row 6, except id and color)
I'm using MySQL, but this is more of a general question. Also, if this goes completely against database theory, some explanation of why it is bad would be appreciated.
This does go against database design; look for descriptions of normalization and relational algebra. It is bad mainly for the reason you gave yourself: duplicating the rows is error-prone, and a small change would have to be made on several rows, which is again buggy.
The idea of relational databases is to act on sets of data and find things by matching on primary and foreign keys and absolutely not to use or think of pointers at all.
If you have common data for each product, then create a product table
create table product (
    product_id int primary key,
    field1 ...,
    field2 ...,
    field3 ...
);
The main table would have fields id, color_id and product_id
If the product table looks like
product_id | field1 | field2 | field3
-----------------------------------------------
1 | value1 | value2 | value3
2 | valueA | valueB | valueC
3 | valueX | valueY | valueZ
The main table would look like
id | product_id | color_id
--------------------------------
1 | 1 | blue
2 | 1 | red
3 | 1 | green
4 | 2 | orange
5 | 2 | brown
6 | 3 | pink
7 | 3 | yellow
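A sketch of reconstructing the original wide rows from those two tables; the name item for the main table is an assumption (the answer only calls it "the main table"):
select i.id, p.field1, p.field2, p.field3, i.color_id
from item as i
join product as p on p.product_id = i.product_id;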
Sure there is a way to have pointers to rows in a database. Just don't use a relational DBMS. In the 1960s and 1970s, there were several very successful DBMS products that were based entirely on linking records together by embedding pointers to records inside other records. Perhaps the most well known of these was IMS.
The downside of having pointers to records in other records was that the resulting database was far less flexible than relational databases ended up being. For predetermined access paths, a database built on a network of pointers is actually faster than a relational database. But when you want to combine the data in multiple ways, the lack of flexibility will kill you.
That is why relational DBMSes took over the field in the 1980s and 1990s, although hierarchical and network databases still survive for fairly specialized work.
As others have suggested, you should learn normalization. When you do, you will learn how to decompose tables into smaller tables with fewer columns (fields) in each table. When you need to use the data in combined fashion, you can use a relational join to put the data back together. Relational joins can be almost as fast as navigating by pointers, especially if you have the right indexes built.
Normalization will help you avoid harmful redundancy, which is the problem you highlighted in your question.
One way of doing this is to separate the columns that seem to have repeated data and put that in a separate table. Give each of the rows in this new table a unique id. Add a column to the original table which contains the id in the new table. Then use a FOREIGN KEY relationship between the original table and the new table's id column.
Well, this would be called normalization under normal circumstances; the whole point of it is to deal with that kind of scenario. So no, it can't be done the way you want to do it; you will need to normalize the data properly.
Create separate tables for the field1, field2 and field3 values.
Put the existing values there, and reference them by putting their ids into your current table.
If you're using common string values, it's good to store the strings in a separate table and refer to them with foreign keys. If you're storing anything like an integer, it wouldn't be worth it - the size of the pointer would be comparable to the size of the data itself.
It does go against database theory because you're throwing the relational part of databases out the window.
The way to do it is to make an ObjectID column that contains the key of the row you want to point to.
id | field1 | field2 | field3 | color_id | object_id
------------------------------------------------------
1  | value1 | value2 | value3 | blue     | null
2  | null   | null   | null   | red      | 1
3  | null   | null   | null   | green    | 1
4  | valueA | valueB | valueC | orange   | null
5  | null   | null   | null   | brown    | 4
6  | valueX | valueY | valueZ | pink     | null
7  | null   | null   | null   | yellow   | 6
But remember: This is a bad idea. Don't do it. If you did want to do it, that would be how.
There are instances where it's required; but after dealing with a system where this was pervasive, I'd always try to find another way, even if it means duplicating data and letting your business layer keep everything straight.
I work on a system where this was done throughout, and it's maddening to have to recreate the functionality of relationships because someone wanted to be clever.
The way you would want to implement this in a database would be to create two tables:
object_id | field1 | field2 | field3
and
instance_id | object_id | colour
And then the rows of the second would point to the first, and you could generate the full table of data you want on the fly by
select t1.*, t2.colour from t1 join t2 on (t1.object_id=t2.object_id)
You should probably have two tables with a foreign key relationship.
Example
Products:
Id
field1
field2
field3
ProductColors:
Id
ProductId
Color
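A minimal DDL sketch of those two tables; the column types are assumptions:
CREATE TABLE Products (
    Id INT AUTO_INCREMENT PRIMARY KEY,
    field1 VARCHAR(255),
    field2 VARCHAR(255),
    field3 VARCHAR(255)
);

CREATE TABLE ProductColors (
    Id INT AUTO_INCREMENT PRIMARY KEY,
    ProductId INT NOT NULL,
    Color VARCHAR(50) NOT NULL,
    FOREIGN KEY (ProductId) REFERENCES Products(Id)
);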