Database design for multiple models? - mysql

I have this design.
Table models:
id - primary key
title - varchar(256)
Table model_instances:
id - primary key
model_id - foreign key to models.id
title - varchar(256)
Table model_fields:
id - pk
model_id - foreign key to models.id
instance_id - foreign key to model_instances.id
title - name of the field
type - enum [text, checkbox, radio, select, 'etc']
Table model_field_values:
instance_id - foreign key to model_instances.id
field_id - foreign key to model_fields.id
value - text
Also, there can be many values for a single field (e.g. for a multiple-select dropdown).
The problem: value is always a text field, because I want to store different types of data (text, datetime, integer), and this table contains all values for all instances of all models.
For example, if I have 10 models and every model has 1000 instances with 10 fields, then model_field_values would contain at least 100,000 rows; if some fields are multi-valued, it would contain 120,000-150,000 rows.
SQL SELECTs that filter on the value field would be slow.
Solution 1:
For every model, create a new model_field_values table:
model.id = 1, model_field_values_1
...
model.id = 10, model_field_values_10
Solution 2:
Because model_fields contains all the fields for a model, we can create a per-model model_field_values table like this:
model_fields for model.id=1 (by primary key): 1 - text, 2 - integer, 3 - datetime, 4 - smalltext
Fields for model_field_values_1: field_1 text, field_2 integer, field_3 datetime, field_4 varchar(256)
This solution is not good for fields with multiple values, because every multi-valued field would need another table linked to the row in model_field_values_1, but it is good for searching, because MySQL would use native datatypes in WHERE clauses (not text fields).
Maybe I'm missing something? Maybe there is a better design?
This database would be used in a CRM system, where users can create different models with many instances of those models, so I cannot preconfigure all tables with all columns.

Note: 200,000 rows (two tenths of a megarow) is, in the usual operation of MySQL, a medium sized table. It's generally possible to index such a table fairly efficiently. http://use-the-index-luke.com/
That being said, I think I understand your problem. It is, in the jargon of object-oriented design, polymorphism.
You have this model_field_value table, containing
instance_id
field_id
value
Your problem is, the value's native data type is sometimes VARCHAR(255), sometimes DATETIME or maybe TIMESTAMP, and sometimes INT.
And you'll sometimes need to do queries like this one
SELECT fv.instance_id
FROM model_field_value fv
WHERE fv.field_id = something
AND fv.value >= '2017-01-01'
AND fv.value < '2018-01-01'
to find DATETIME values that happened in calendar year 2017. For example.
This is generally a pain in the neck with key/value storage like what you need. For a query like my example to be sargable, you need to be able to put an index on a DATETIME column. But if you don't have such a column, you can't index it. Duh.
Here's a suggestion. Give your table these columns.
instance_id INT pk fk
field_id INT pk fk
value VARCHAR(255) a text representation of every value.
value_double DOUBLE a numeric representation of every numeric value, or NULL
value_ts TIMESTAMP a timestamp value if possible, or NULL
This table will contain redundant data, and you'll have to be very careful when you're writing it to make sure it's correct. But you will be able to put indexes on the value_ts and value_double columns, so you can make those kinds of queries sargable.
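For concreteness, here is one way that table might look in MySQL DDL (a sketch; the parent tables, exact column widths and index choices are assumptions, not part of the original suggestion):
CREATE TABLE model_field_value (
  instance_id   INT NOT NULL,
  field_id      INT NOT NULL,
  value         VARCHAR(255),     -- text representation of every value
  value_double  DOUBLE NULL,      -- numeric representation of numeric values, or NULL
  value_ts      TIMESTAMP NULL,   -- timestamp representation if possible, or NULL
  PRIMARY KEY (instance_id, field_id),
  KEY idx_field_value_double (field_id, value_double),
  KEY idx_field_value_ts (field_id, value_ts)
  -- plus FOREIGN KEY constraints to model_instances and model_fields
);
With the (field_id, value_ts) index, the date-range query shown earlier can use an index instead of scanning text values.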
Just an idea.

Related

How to represent a non SQL custom schema structure in a SQL database

I have an interesting problem. I have a relational database for which I can use custom scripts to create the tables. It's pseudo-SQL and doesn't use standard CREATE syntax; rather, it's fairly limited. What I want to do is store my schema in a MySQL database.
In my custom relational database I have a table called:
Person with fields
id as NUMBER not nullable PK
name as TEXT (64) characters max
year of birth as DATE
So in order to generate the create scripts, I thought of using a MySQL database to store the schema. For example:
I have a MySQL table called custom_table with id and name
e.g. 1, Person would be the first record in it
I have another MySQL table called custom_fields with the following:
field_id as int, not null, pk
table_name_id, foreign key to custom_table
field_name as varchar(255)
field_type as varchar(255)
is_primary_key as tinyint(1)
is_nullable as tinyint(1)
The data set would look like:
field_id  table_name_id  field_name  field_type  is_primary_key  is_nullable
========  =============  ==========  ==========  ==============  ===========
1         1              id          NUMBER      1               0
2         1              name        TEXT        0               1
3         1              year        DATE        0               1
The part that I am stuck on is how/where do I store the length of the TEXT field. I have other field types, such as decimal, which accept additional parameters or default values as well.
I was thinking of maybe having tables called field_date, field_number and field_text which would relate back to the custom_fields table via foreign keys, but I am unsure how to enforce that each field_id exists at most once across those tables. Any insight or direction for research would be appreciated. My challenge is that I haven't been able to find anything on Stack Overflow or other sites related to something like this.
Yes, databases can and do store their schema in tables. It's called the 'catalog'. Every SQL database should have one, it's maintained by every CREATE TABLE, etc., and you can query it just like any other table.
If your (rather mysterious) "pseudo SQL" DBMS doesn't do that, get a proper DBMS. Don't try to re-invent the wheel, because trying to maintain a 'shadow' of the actual schema will lead to anomalies.
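In MySQL, for example, the catalog is exposed through INFORMATION_SCHEMA, so the Person schema (including the length you asked about) could be read back with a query like this (a sketch; 'your_db' is a placeholder schema name):
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE, COLUMN_KEY
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_db'   -- placeholder schema name
  AND TABLE_NAME = 'Person'
ORDER BY ORDINAL_POSITION;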

How to store data whose type can be numeric, date or string in mysql

We're developing a monitoring system. In our system, values are reported by agents running on different servers. The observations reported can be values like:
A numeric value, e.g. "CPU USAGE" = 55 (meaning 55% of the CPU is in use).
An event that was fired, e.g. "Backup completed".
A status, e.g. SQL Server is offline.
We want to store these observations (which are not known in advance and will be added dynamically to the system without recompiling).
We are considering adding different columns to the observations table like this:
IntMeasure -> INTEGER
FloatMeasure -> FLOAT
Status -> varchar(255)
So if the value we wish to store is a number, we can use IntMeasure or FloatMeasure according to the type. If the value is a status, we can store the status literal string (or a status id if we decide to add a Statuses(id, name) table).
We suppose it's possible to have a more correct design, but it would probably become too slow and obscure due to joins and dynamic table names depending on types. How would a join work if we can't specify the tables in advance in the query?
I haven't done a formal study, but from my own experience I would guess that more than 80% of database design flaws are generated from designing with performance as the most important (if not only) consideration.
If a good design calls for multiple tables, create multiple tables. Don't automatically assume that joins are something to be avoided. They are rarely the true cause of performance problems.
The primary consideration, first and foremost in all stages of database design, is data integrity. "The answer may not always be correct, but we can get it to you very quickly" is not a goal any shop should be working toward. Once data integrity has been locked down, if performance ever becomes an issue, it can be addressed. Don't sacrifice data integrity, especially to solve problems that may not exist.
With that in mind, look at what you need. You have observations you need to store. These observations can vary in the number and types of attributes and can be things like the value of a measurement, the notification of an event and the change of a status, among others and with the possibility of future observations being added.
This would appear to fit into a standard "type/subtype" pattern, with the "Observation" entry being the type and each type or kind of observation being the subtype, and suggests some form of type indicator field such as:
create table Observations(
...,
ObservationKind char( 1 ) check( ObservationKind in( 'M', 'E', 'S' )),
...
);
But hardcoding a list like this in a check constraint has a very low maintainability level. It becomes part of the schema and can be altered only with DDL statements. Not something your DBA is going to look forward to.
So have the kinds of observations in their own lookup table:
ID Name Meaning
== =========== =======
M Measurement The value of some system metric (CPU_Usage).
E Event An event has been detected.
S Status A change in a status has been detected.
(The char field could just as well be int or smallint. I use char here for illustration.)
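For concreteness, that lookup table might be created and populated like this (a sketch; the column names are assumed to match the table above and the foreign key used below):
create table ObservationKinds(
    ID char( 1 ) primary key,
    Name varchar( 32 ) not null,
    Meaning varchar( 255 )
);

insert into ObservationKinds( ID, Name, Meaning ) values
    ( 'M', 'Measurement', 'The value of some system metric (CPU_Usage).' ),
    ( 'E', 'Event', 'An event has been detected.' ),
    ( 'S', 'Status', 'A change in a status has been detected.' );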
Then fill out the Observations table with a PK and the attributes that would be common to all observations.
create table Observations(
ID int identity primary key,
ObservationKind char( 1 ) not null,
DateEntered date not null,
...,
constraint FK_ObservationKind foreign key( ObservationKind )
references ObservationKinds( ID ),
constraint UQ_ObservationIDKind unique( ID, ObservationKind )
);
It may seem strange to create a unique index on the combination of Kind field and the PK, which is unique all by itself, but bear with me a moment.
Now each kind or subtype gets its own table. Note that each kind of observation gets a table, not the data type.
create table Measurements(
ID int not null,
ObservationKind char( 1 ) check( ObservationKind = 'M' ),
Name varchar( 32 ) not null, -- Such as "CPU Usage"
Value double not null, -- such as 55.00
..., -- other attributes of Measurement observations
constraint PK_Measurements primary key( ID, ObservationKind ),
constraint FK_Measurements_Observations foreign key( ID, ObservationKind )
references Observations( ID, ObservationKind )
);
The first two fields will be the same for the other kinds of observations except the check constraint will force the value to the appropriate kind. The other fields may differ in number, name and data type.
Let's examine an example tuple that may exist in the Measurements table:
ID ObservationKind Name Value ...
==== =============== ========= =====
1001 M CPU Usage 55.0 ...
In order for this tuple to exist in this table, a matching entry must first exist in the Observations table with an ID value of 1001 and an observation kind value of 'M'. No other entry with an ID value of 1001 can exist in either the Observations table or the Measurements table and cannot exist at all in any other of the "kind" tables (Events, Status). This works the same way for all the kind tables.
I would further recommend creating a view for each kind of observation which will provide a join of each kind with the main observation table:
create view MeasurementObservations as
select ...
from Observations o
join Measurements m
on m.ID = o.ID;
Any code that works solely with measurements would need to only hit this view instead of the underlying tables. Using views to create a wall of abstraction between the application code and the raw data greatly enhances the maintainability of the database.
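For example, code looking for high CPU usage would query the view rather than the base tables (a sketch, assuming the view exposes the observation's DateEntered along with the measurement's Name and Value):
SELECT ID, DateEntered, Name, Value
FROM MeasurementObservations
WHERE Name = 'CPU Usage'
  AND Value > 50;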
Now the creation of another kind of observation, such as "Error", involves a simple Insert statement to the ObservationKinds table:
F Fault A fault or error has been detected.
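Written out as SQL (column names taken from the ObservationKinds table above), that is simply:
insert into ObservationKinds( ID, Name, Meaning )
values( 'F', 'Fault', 'A fault or error has been detected.' );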
Of course, you need to create a new table and view for these error observations, but doing so will have no impact on existing tables, views or application code (except, of course, to write the new code to work with the new observations).
Just create it as a VARCHAR
This will allow you to store whatever data you require in it. However, it is much more difficult to do queries based on the number in the field, such as:
Select * from observations where MyVARCHARField > 50 -- get CPU > 50
However, if you think you want to do this, then you either need a field per item or a generalised table such as:
CREATE TABLE observations (     -- table name is illustrative
    Description VARCHAR(255),
    ValueType   VARCHAR(10),    -- can be 'String', 'Float' or 'Int'
    ValueString VARCHAR(255),
    ValueFloat  FLOAT,
    ValueInt    INT
);
Then when you are filling the data you can put your value in the correct field and select like this.
SELECT Description, ValueInt FROM observations WHERE Description LIKE '%cpu%' AND ValueInt > 50
I used two columns for a similar problem. The first column was for the data type and the second contained the data as a Varchar.
The first column held codes (e.g. 1 = integer, 2 = string, 3 = date and so on), which could be combined with the value column in comparisons (e.g. find the max integer where type = 1).
I did not have joins, but I think you can use this approach. It will also help you if more data types are introduced tomorrow.
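A sketch of that kind of query, assuming a table with a type code column and a Varchar value column (the table and column names here are illustrative, not from the answer):
SELECT MAX(CAST(value AS SIGNED)) AS max_int
FROM observations
WHERE value_type = 1;   -- 1 = integer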

Use varbinary as foreign key

Is it ok to use data type varbinary for foreign keys?
Why?
I have an EvalAnswer table with a FK to a Score table.
The score is sensitive and should be encrypted. The encrypt/decrypt happens in the asp.net (4.0) project and not in sql server (2008), so the data type needs to be varbinary.
EDIT: more info
Of course.
I have these columns: Id, Score, ScoreText, Description, Index
The Id is an incremental counter. (PK)
The Score is the score as number (such as 1).
The ScoreText is the score as a letter (Score 1 equals letter A).
The Description is a comment for every score.
The reason I have it like this is also that there are special situations, such as one of the questions having a scoring of only 1-4 while the rest have 1-5.
So every question has a score 1, but the description differs from another question's score 1.
So if I have 5 questions, this gives 5*5 rows in the Score table (all with different descriptions).
On page load I get the correct scoring (with description) for every dropdownlist, normally 1-5.
But when the user has saved the scoring, I need to know the earlier saved score for every question on page load.
Therefore I have a relation between EvalAnswer and the scoring.
There are questions whose relation to the Score table is NOT sensitive.
But some are, and for those I need to hide the relation between EvalAnswer and Score.
What might be a bad design is the fact that I use the same table (the Score table) both to show the available scoring for every question and to hold what the user has chosen (this is the FK from EvalAnswer to Score).
Please advise.
I suggest adding an ID int column to the Score table and reference this field from the EvalAnswer table.
This means your table scripts will change to
CREATE TABLE Score (
Id int not null identity primary key
, Code varbinary(max) --> The new field containing the encrypted information
, Score int
, ScoreText varchar(5)
, Description varchar(max)
, [Index] int)
CREATE TABLE EvalAnswer (
Id int not null identity,
ScoreId int not null references Score(Id)
...
)
As you can see, the "old" Id field has now become the Code field. The new Id field is an identity column containing a unique number.
There is nothing against using a varbinary column in a foreign key, but it will make querying and debugging much harder.
Also note there is a 900 byte limit on the width of an index key, which is easier to hit when storing an encrypted blob.
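With this design, lookups join on the plain integer key and the encrypted Code column stays out of join predicates entirely. A sketch (the answer id 42 is just illustrative):
SELECT ea.Id, s.Score, s.ScoreText, s.Description
FROM EvalAnswer ea
JOIN Score s ON s.Id = ea.ScoreId
WHERE ea.Id = 42;   -- some EvalAnswer id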

Sphinx Search, compound key

After my previous question (http://stackoverflow.com/questions/8217522/best-way-to-search-for-partial-words-in-large-mysql-dataset), I've chosen Sphinx as the search engine on top of my MySQL database.
I've done some small tests with it, and it looks great. However, I'm at a point right now where I need some help / opinions.
I have a table articles (structure isn't important), a table properties (structure isn't important either), and a table with values of each property per article (this is what it's all about).
The table where these values are stored, has the following structure:
articleID UNSIGNED INT
propertyID UNSIGNED INT
value VARCHAR(255)
The primary key is a compound key of articleID and propertyID.
I want Sphinx to search through the value column. However, to create an index in Sphinx, I need a unique id, which I don't have here.
Also, when searching, I want to be able to filter on the propertyID column (only search values for propertyID 2, for example, which I can do by defining it as an attribute).
On the Sphinx forum, I found I could create a multi-value attribute, and set this as query for my Sphinx index:
SELECT articleID, value, GROUP_CONCAT(propertyID) FROM t1 GROUP BY articleID
articleID will be unique now; however, I'm now missing values. So I'm pretty sure this isn't the solution, right?
There are a few other options, like:
Add an extra column to the table, which is unique
Create a calculated unique value in the query (like articleID*100000+propertyID)
Are there any other options I could use, and what would you do?
Regarding your suggestions:
Add an extra column to the table, which is unique
This cannot easily be done for an existing table with a large number of records, as adding a new field to a large table takes some time, and during that time the database will not be responsive.
Create a calculated unique value in the query (like articleID*100000+propertyID)
If you do this, you have to be able to get the articleID and propertyID back from the calculated unique id.
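For example (a sketch, assuming propertyID is always smaller than the 100000 multiplier):
-- building the synthetic id for the Sphinx source query
SELECT articleID * 100000 + propertyID AS id, articleID, propertyID, value
FROM t1;

-- recovering the parts from an id returned by Sphinx, e.g. 4200017
SELECT 4200017 DIV 100000 AS articleID,   -- 42
       4200017 MOD 100000 AS propertyID;  -- 17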
Another alternative is to create a new table with a key field for Sphinx and two more fields to hold articleID and propertyID.
new_sphinx_table with the following fields:
id - UNSIGNED INT/ BIGINT
articleID - UNSIGNED INT
propertyID - UNSIGNED INT
Then you can write an indexing query like the one below:
SELECT nt.id, t1.articleID, t1.propertyID, t1.value FROM t1 INNER JOIN new_sphinx_table nt ON t1.articleID = nt.articleID AND t1.propertyID = nt.propertyID;
This is a sample so you can modify it to fit to your requirements.
What Sphinx returns are the matched new_sphinx_table.id values along with the other attribute columns. You can get the full result by taking those new_sphinx_table.id values and joining your table t1 with new_sphinx_table.

opinions and advice on database structure

I'm building this tool for classifying data. Basically I will be regularly receiving rows of data in a flat-file that look like this:
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
And I have a list of categories to break these rows up into, for example:
Original Cat1 Cat2 Cat3 Cat4 Cat5
---------------------------------------
a:b:c:d:e a b c d e
As of right this second, the category names are known, as well as the number of categories to break the data down by. But this might change over time (for instance, categories added/removed... the total number of categories changing).
Okay, so I'm not really looking for help on how to parse the rows or get data into a db or anything... I know how to do all that, and have the core script mostly written already to handle parsing rows of values and separating them into a variable number of categories.
Mostly I'm looking for advice on how to structure my database to store this stuff. So I've been thinking about it, and this is what I came up with:
Table: Generated
generated_id int - unique id for each row generated
generated_timestamp datetime - timestamp of when row was generated
last_updated datetime - timestamp of when row last updated
generated_method varchar(6) - method in which row was generated (manual or auto)
original_string varchar (255) - the original string
Table: Categories
category_id int - unique id for category
category_name varchar(20) - name of category
Table: Category_Values
category_map_id int - unique id for each value (not sure if I actually need this)
category_id int - id value to link to table Categories
generated_id int - id value to link to table Generated
category_value varchar (255) - value for the category
Basically the idea is when I parse a row, I will insert a new entry into table Generated, as well as X entries in table Category_Values, where X is however many categories there currently are. And the category names are stored in another table Categories.
What my script will immediately do is process rows of raw values and output the generated category values to a new file to be sent somewhere. But then I have this db I'm making to store the data generated so that I can make another script, where I can search for and list previously generated values, or update previously generated entries with new values or whatever.
Does this look like an okay database structure? Anything obvious I'm missing or potentially gimping myself on? For example, with this structure... well... I'm not an SQL expert, but I think I should be able to do something like
select * from Generated where original_string = '$string'
// id is put into $id
and then
select * from Category_Values where generated_id = '$id'
...and then I'll have my data to work with for search results or a form to alter data. I'm fairly certain I can even combine this into one query with a join or something, but I'm not that great with SQL so I don't know how to actually do that. But the point is, I know I can do what I need with this db structure. Am I making this harder than it needs to be? Making some obvious noob mistake?
My suggestion:
Table: Generated
id unsigned int autoincrement primary key
generated_timestamp timestamp
last_updated timestamp default '0000-00-00' ON UPDATE CURRENT_TIMESTAMP
generated_method ENUM('manual','auto')
original_string varchar (255)
Table: Categories
id unsigned int autoincrement primary key
category_name varchar(20)
Table: Category_Values
id unsigned int autoincrement primary key
category_id int
generated_id int
category_value varchar (255) - value for the category
FOREIGN KEY `fk_cat` (category_id) REFERENCES Categories(id)
FOREIGN KEY `fk_gen` (generated_id) REFERENCES Generated(id)
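With those foreign keys in place, the lookup described in the question (find a generated row by its original string and pull all of its category values) can be done in a single join. A sketch, assuming the column names above:
SELECT g.id, g.original_string, c.category_name, cv.category_value
FROM Generated g
JOIN Category_Values cv ON cv.generated_id = g.id
JOIN Categories c ON c.id = cv.category_id
WHERE g.original_string = 'a:b:c:d:e';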
Links
Timestamps: http://dev.mysql.com/doc/refman/5.1/en/timestamp.html
Create table syntax: http://dev.mysql.com/doc/refman/5.1/en/create-table.html
Enums: http://dev.mysql.com/doc/refman/5.1/en/enum.html
I think this solution is perfect for what you want to do. The Categories list is now flexible, so you can add new categories or retire old ones (I would recommend thinking long and hard before agreeing to delete a category - would you orphan records or remove them too, etc.).
Basically, I'm saying you are right on target. The structure is simple but it will work well for you. Great job (and great job giving exactly the right amount of information in the question).