Opinions and advice on database structure - MySQL

I'm building this tool for classifying data. Basically I will be regularly receiving rows of data in a flat-file that look like this:
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
And I have a list of categories to break these rows up into, for example:
Original Cat1 Cat2 Cat3 Cat4 Cat5
---------------------------------------
a:b:c:d:e a b c d e
As of right this second, the category names are known, as well as the number of categories to break the data down by. But this might change over time (for instance, categories added/removed...total number of categories changed).
Okay so I'm not really looking for help on how to parse the rows or get data into a db or anything...I know how to do all that, and have the core script mostly written already, to handle parsing rows of values and separating them into a variable number of categories.
Mostly I'm looking for advice on how to structure my database to store this stuff. So I've been thinking about it, and this is what I came up with:
Table: Generated
generated_id int - unique id for each row generated
generated_timestamp datetime - timestamp of when row was generated
last_updated datetime - timestamp of when row last updated
generated_method varchar(6) - method in which row was generated (manual or auto)
original_string varchar(255) - the original string
Table: Categories
category_id int - unique id for category
category_name varchar(20) - name of category
Table: Category_Values
category_map_id int - unique id for each value (not sure if I actually need this)
category_id int - id value to link to table Categories
generated_id int - id value to link to table Generated
category_value varchar(255) - value for the category
Basically the idea is when I parse a row, I will insert a new entry into table Generated, as well as X entries in table Category_Values, where X is however many categories there currently are. And the category names are stored in another table Categories.
What my script will immediately do is process rows of raw values and output the generated category values to a new file to be sent somewhere. But then I have this db I'm making to store the data generated so that I can make another script, where I can search for and list previously generated values, or update previously generated entries with new values or whatever.
Does this look like an okay database structure? Anything obvious I'm missing or potentially gimping myself on? For example, with this structure...well...I'm not a SQL expert, but I think I should be able to do something like
select * from Generated where original_string = '$string'
// id is put into $id
and then
select * from Category_Values where generated_id = '$id'
...and then I'll have my data to work with for search results or a form to alter data. I'm fairly certain I could even combine this into one query with a join or something, but I'm not that great with SQL so I don't know how to actually do that. Point is, I know I can do what I need with this db structure...but am I making this harder than it needs to be? Making some obvious noob mistake?
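For reference, here is one way those two lookups might be combined into a single query (a sketch using the column names above; untested):
SELECT g.generated_id, g.original_string, c.category_name, cv.category_value
FROM Generated g
JOIN Category_Values cv ON cv.generated_id = g.generated_id
JOIN Categories c ON c.category_id = cv.category_id
WHERE g.original_string = '$string';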

My suggestion:
Table: Generated
id int unsigned auto_increment primary key
generated_timestamp timestamp
last_updated timestamp DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP
generated_method ENUM('manual','auto')
original_string varchar(255)
Table: Categories
id int unsigned auto_increment primary key
category_name varchar(20)
Table: Category_Values
id int unsigned auto_increment primary key
category_id int unsigned
generated_id int unsigned
category_value varchar(255) - value for the category
FOREIGN KEY fk_cat (category_id) REFERENCES Categories (id)
FOREIGN KEY fk_gen (generated_id) REFERENCES Generated (id)
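Spelled out as actual CREATE TABLE statements, the above might look like the following sketch. One caveat: before MySQL 5.6, only one TIMESTAMP column per table may reference CURRENT_TIMESTAMP, so generated_timestamp is shown here as a DATETIME set by the application:
CREATE TABLE Generated (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  generated_timestamp DATETIME NOT NULL,  -- set by the application at insert time
  last_updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  generated_method ENUM('manual','auto') NOT NULL,
  original_string VARCHAR(255) NOT NULL
) ENGINE=InnoDB;
CREATE TABLE Categories (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category_name VARCHAR(20) NOT NULL
) ENGINE=InnoDB;
CREATE TABLE Category_Values (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category_id INT UNSIGNED NOT NULL,
  generated_id INT UNSIGNED NOT NULL,
  category_value VARCHAR(255) NOT NULL,
  FOREIGN KEY (category_id) REFERENCES Categories (id),  -- FKs require InnoDB
  FOREIGN KEY (generated_id) REFERENCES Generated (id)
) ENGINE=InnoDB;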
Links
Timestamps: http://dev.mysql.com/doc/refman/5.1/en/timestamp.html
Create table syntax: http://dev.mysql.com/doc/refman/5.1/en/create-table.html
Enums: http://dev.mysql.com/doc/refman/5.1/en/enum.html

I think this solution is perfect for what you want to do. The Categories list is now flexible so that you can add new categories or retire old ones (I would recommend thinking long and hard before agreeing to delete a category - would you orphan the records or remove them too, etc.).
Basically, I'm saying you are right on target. The structure is simple but it will work well for you. Great job (and great job giving exactly the right amount of information in the question).

Related

Database design for multiple models?

I have this design.
Table models:
id - primary key
title - varchar(256)
Table model_instances:
id - primary key
model_id - foreign key to models.id
title - varchar(256)
Table model_fields:
id - pk
model_id - foreign key to models.id
instance_id - foreign key to model_instances.id
title - name of the field
type - enum [text, checkbox, radio, select, 'etc']
Table model_field_values:
instance_id - foreign key to model_instances.id
field_id - foreign key to model_fields.id
value - text
Also, some fields can have multiple values (like a multiple-select dropdown).
The problem is: value is always a text field, because I want to store different types of data (text, datetime, integer), and this table contains all values for all instances of all models.
For example, if I have 10 models and every model has 1000 instances with 10 fields then model_field_values (at minimum) would contain 100000 rows, if some fields are multiple, then it would contain (120000-150000 rows).
A SQL SELECT filtering on the value field would be slow.
Solution 1:
For every model create new model_field_values like:
model.id = 1, model_field_values_1
...
model.id = 10, model_field_values_10
Solution 2:
Because model_fields contains all the fields for a model, we can create model_field_values like this:
model_fields for model.id=1 (by primary key): 1 - text, 2 - integer, 3 - datetime, 4 - smalltext
Fields for model_field_values_1: field_1 text, field_2 integer, field_3 datetime, field_4 varchar(256)
This solution is not good for fields with multiple values, because every multiple value would need another table linking back to the row in model_field_values_1, but it is good for searching through the database because MySQL would use native datatypes in WHERE clauses (not text fields).
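For illustration, the per-model table from Solution 2 might look like this (a hypothetical sketch for model.id = 1, using the field list above):
CREATE TABLE model_field_values_1 (
  instance_id INT NOT NULL PRIMARY KEY,
  field_1 TEXT,          -- model_fields.id = 1, type text
  field_2 INT,           -- model_fields.id = 2, type integer
  field_3 DATETIME,      -- model_fields.id = 3, type datetime
  field_4 VARCHAR(256)   -- model_fields.id = 4, type smalltext
);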
Maybe I'm missing something? Maybe there is a better design?
This database would be used in a CRM system, where users can create different models with many instances of each, so I cannot preconfigure all tables with all columns.
Note: 200,000 rows (two tenths of a megarow) is, in the usual operation of MySQL, a medium sized table. It's generally possible to index such a table fairly efficiently. http://use-the-index-luke.com/
That being said, I think I understand your problem. It is, in the jargon of object-oriented design, polymorphism.
You have this model_field_value table, containing
instance_id
field_id
value
Your problem is, the value's native data type is sometimes VARCHAR(255), sometimes DATETIME or maybe TIMESTAMP, and sometimes INT.
And you'll sometimes need to do queries like this one
SELECT fv.instance_id
FROM model_field_value fv
WHERE fv.field_id = something
AND fv.value >= '2017-01-01'
AND fv.value < '2018-01-01'
to find DATETIME values that happened in calendar year 2017. For example.
This is generally a pain in the neck with key/value storage like what you need. For a query like my example to be sargable, you need to be able to put an index on a DATETIME column. But if you don't have such a column, you can't index it. Duh.
Here's a suggestion. Give your table these columns.
instance_id INT pk fk
field_id INT pk fk
value VARCHAR(255) a text representation of every value.
value_double DOUBLE a numeric representation of every numeric value, or NULL
value_ts TIMESTAMP a timestamp value if possible, or NULL
This table will contain redundant data, and you'll have to be very careful when you're writing it to make sure it's correct. But you will be able to put indexes on the value_ts and value_double columns, so you can make those kinds of queries sargable.
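A minimal sketch of that table, with the indexes that make such queries sargable (column names taken from the list above):
CREATE TABLE model_field_value (
  instance_id  INT NOT NULL,
  field_id     INT NOT NULL,
  value        VARCHAR(255),   -- text representation of every value
  value_double DOUBLE NULL,    -- numeric representation, or NULL
  value_ts     TIMESTAMP NULL, -- timestamp representation, or NULL
  PRIMARY KEY (instance_id, field_id),
  KEY idx_field_ts (field_id, value_ts),
  KEY idx_field_double (field_id, value_double)
);
The calendar-year query above then becomes
SELECT fv.instance_id
FROM model_field_value fv
WHERE fv.field_id = something
AND fv.value_ts >= '2017-01-01'
AND fv.value_ts < '2018-01-01'
and the (field_id, value_ts) index can satisfy it.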
Just an idea.

Best Practice: find row for unique id from multiple tables

Our database contains 5+ tables:
user
----------
user_id (PK) int NOT NULL
name varchar(50) NOT NULL
photo
--------
photo_id (PK) int NOT NULL
user_id (FK) int NOT NULL
title varchar(50) NOT NULL
comment
-------
comment_id (PK) int NOT NULL
photo_id int NOT NULL
user_id int NOT NULL
message varchar(50) NOT NULL
All primary key ids are unique ids.
All data is linked to http://domain.com/{primary_key_id}.
A user visits the link with an id, which is unique across all tables.
How should I find which table this id belongs to?
solution 1
select user_id from user where user_id = {primary_key_id}
// if not found, then move next
select photo_id from photo where photo_id = {primary_key_id}
... continue on, until we find which table this primary key belongs to.
solution 2
create an object table to hold all the unique ids and their data types
create a trigger on all the tables for AFTER INSERT, to create a row in the object table with the data type of whatever was inserted into the selected table
when required, do a select statement to find the table name the id belongs to
The second solution means a double insert: one insert of the row into the actual table with the complete data, and a second insert of the unique id and table name into the object table we created in step 1.
select type from object_table where id = {primary_key_id}
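For illustration, the object table and one of the triggers might look like this (a sketch; column names are hypothetical):
CREATE TABLE object_table (
  id   INT NOT NULL PRIMARY KEY,   -- the id shared across all tables
  type VARCHAR(20) NOT NULL        -- e.g. 'user', 'photo', 'comment'
);
CREATE TRIGGER photo_after_insert AFTER INSERT ON photo
FOR EACH ROW
  INSERT INTO object_table (id, type) VALUES (NEW.photo_id, 'photo');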
solution 3
prepend table name + id = encode into a new unique integer - using PHP
decode the id and get the original id with the table name (even if it's just a number type)
I don't know how to implement this in PHP, but this solution sounds better!? What are your suggestions?
I don't know what you mean by the Facebook reference in the comments, but I'll explain my comment a little further.
You don't need unique IDs across five DB tables, just one per table. You have a couple of options for how to create your links (you can create the links yourself, can't you?):
using GET variables: http://domain.com/page.html?pk={id}&table={table}
using plain URL: http://domain.com/{id}{table}
Depending on the syntax of the link you choose the function to parse it. You can for example use one or both of the following:
http://php.net/manual/en/function.explode.php
http://www.php.net/manual/en/function.parse-url.php
When you get the simple model working you may add encoding/decoding/hashing functions. But do you really need them? And in what level? (I have no experience in that area so I'll shut up now.)
Is it actually important to maintain uniqueness across tables?
If no, just implement the solution 3 if you can (e.g. using URL encoding).
If yes, you'll need the "parent" table in any case, so the DBMS can enforce the uniqueness.
You can still try to implement the solution 3 on top of that,
or add a type discriminator1 there and you'll be able to (quickly) know which table is referenced for any given ID.
1 Take a look at the lower part of this answer. This is in fact a form of inheritance.
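For illustration, a minimal sketch of that parent-table approach (hypothetical names):
CREATE TABLE object (
  object_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  type ENUM('user','photo','comment') NOT NULL  -- the type discriminator
);
CREATE TABLE user (
  user_id INT NOT NULL PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  FOREIGN KEY (user_id) REFERENCES object (object_id)
);
-- finding the table for any given id is then one indexed lookup:
SELECT type FROM object WHERE object_id = {primary_key_id};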

Proper database design for this task

Ok, so I am going to have at least 2 tables, possibly three.
The data is going to be as follows:
First, a list of search terms. These search terms are unrelated to anything else in the program (only involved in getting the outputs, no manipulation of this data at all), so I plan to store them separately in their own table.
Then things get trickier. I've got a list of words, and each word can be in multiple categories. So for example, if you have "sad", it could be under "angst" and "tragedy", just as "happy" could be under "joy" and "fulfillment".
Would it be better to set up a table where I've got three columns: a UID, a word, and a category, or would it be better to set up two tables: both with UIDs, one with the word, one with the category, and set them up as a foreign key?
The ultimate goal is generating the number of words in a given category over a given period of time.
I'll be using MySQL and Python (MySQLdb) if that helps anyone.
Ignoring your 'search terms' table (since it doesn't seem to have any relevance to the question), I would probably do it similar to this:
words (w_id int, w_word varchar(50))
categories (c_id int, c_category varchar(50))
wordcategories (wc_wordid int, wc_catid int)
Add foreign key constraints from the ids in wordcategories onto the words and categories tables.
Without having a whole lot of details, I would set it up the following way:
Word Table
id int PK
word varchar(20)
Category Table
id int PK
category varchar(20)
Word_Category Table
wordId int PK
categoryId int PK
The third would be the join table between the word and the category. This table would contain the foreign key constraints to the word and category tables.
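For what it's worth, spelled out as DDL with the constraints mentioned, plus the kind of aggregate the asker is after (a sketch; the time dimension would live wherever word occurrences are recorded, which isn't modeled here):
CREATE TABLE word (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word VARCHAR(20) NOT NULL
);
CREATE TABLE category (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category VARCHAR(20) NOT NULL
);
CREATE TABLE word_category (
  wordId INT NOT NULL,
  categoryId INT NOT NULL,
  PRIMARY KEY (wordId, categoryId),
  FOREIGN KEY (wordId) REFERENCES word (id),
  FOREIGN KEY (categoryId) REFERENCES category (id)
);
-- counting words per category:
SELECT c.category, COUNT(*) AS word_count
FROM category c
JOIN word_category wc ON wc.categoryId = c.id
GROUP BY c.id, c.category;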

Sphinx Search, compound key

After my previous question (http://stackoverflow.com/questions/8217522/best-way-to-search-for-partial-words-in-large-mysql-dataset), I've chosen Sphinx as the search engine on top of my MySQL database.
I've done some small tests with it, and it looks great. However, I'm at a point right now where I need some help/opinions.
I have a table articles (structure isn't important), a table properties (structure isn't important either), and a table with values of each property per article (this is what it's all about).
The table where these values are stored, has the following structure:
articleID UNSIGNED INT
propertyID UNSIGNED INT
value VARCHAR(255)
The primary key is a compound key of articleID and propertyID.
I want Sphinx to search through the value column. However, to create an index in Sphinx, I need a unique id, which I don't have here.
Also when searching, I want to be able to filter on the propertyID column (only search values for propertyID 2 for example, which I can do by defining it as attribute).
On the Sphinx forum, I found I could create a multi-value attribute, and set this as query for my Sphinx index:
SELECT articleID, value, GROUP_CONCAT(propertyID) FROM t1 GROUP BY articleID
articleID will be unique now; however, I'm missing values. So I'm pretty sure this isn't the solution, right?
There are a few other options, like:
Add an extra column to the table, which is unique
Create a calculated unique value in the query (like articleID*100000+propertyID)
Are there any other options I could use, and what would you do?
Regarding your suggestions:
Add an extra column to the table, which is unique
This cannot be done for an existing table with a large number of records, as adding a new field to a large table takes some time, and during that time the database will not be responsive.
Create a calculated unique value in the query (like articleID*100000+propertyID)
If you do this, you have to find a way to get the articleID and propertyID back out of the calculated unique id.
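(Decoding is just integer arithmetic, as long as propertyID is guaranteed to stay below the multiplier. A quick example with made-up values:)
-- id = articleID * 100000 + propertyID, so:
SELECT 123400042 DIV 100000 AS articleID,  -- 1234
       123400042 MOD 100000 AS propertyID; -- 42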
Another alternative is to create a new table with a key field for Sphinx and another two fields to hold articleID and propertyID:
new_sphinx_table, with the following fields:
id - UNSIGNED INT/ BIGINT
articleID - UNSIGNED INT
propertyID - UNSIGNED INT
Then you can write an indexing query like the one below:
SELECT nt.id, t1.articleID, t1.propertyID, t1.value FROM t1 INNER JOIN new_sphinx_table nt ON t1.articleID = nt.articleID AND t1.propertyID = nt.propertyID;
This is a sample, so you can modify it to fit your requirements.
What Sphinx returns is the matched new_sphinx_table.id values along with any attribute columns. You can then get your results by taking those new_sphinx_table.id values and joining your table t1 with new_sphinx_table.
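For example, once Sphinx returns a set of matching new_sphinx_table.id values, the original rows can be fetched with a join like this (a sketch; the ids are example values):
SELECT t1.articleID, t1.propertyID, t1.value
FROM new_sphinx_table nt
INNER JOIN t1 ON t1.articleID = nt.articleID AND t1.propertyID = nt.propertyID
WHERE nt.id IN (3, 17, 42);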

MySql database formatting

I am currently developing a database storage solution for product inventory information for the company I work for. I am using MySql, and I am having a hard time coming up with an efficient, feasible format for the data storage.
As it works right now, we have ~25000 products to keep track of. For each product, there are about 20 different categories that we need to track information for (quantity available, price, etc.). This report is downloaded and updated every 3-4 days, and it is stored and updated in Excel right now.
My problem is that the only solution I have come up with so far is to create separate tables for each of the categories mentioned above, using foreign keys based off of the product SKUs, and cascading to update each respective table. However, this method would require every table to add 24000 rows each time the program is run, given that each product needs to be updated for the date it was run. The problem with this is that the data will be stored for around a year, so the tables will grow an extensive amount. My research into other database formats has yielded some examples, but none on this scale; they are geared towards adding maybe 100 rows a day.
Does anybody know or have any ideas of a suitable way to set up this kind of database, or is the method I described above suitable and within the limitations of the MySql tables?
Thanks,
Mike
25,000 rows is nothing to MySQL, or to a flat file for that matter. Do not initially worry about data volume. I've worked on many retail database schemas, and products are usually defined by either a static or an arbitrary-length set of attributes. Your data quantity ends up not being that far off either way.
Static:
create table products (
product_id integer primary key auto_increment
, product_name varchar(255) -- or whatever
, attribute1_id -- FK
, attribute2_id -- FK
, ...
, attributeX_id -- FK
);
create table attributes (
attribute_id integer primary key -- whatever
, attribute_type -- Category?
, attribute_value varchar(255)
);
Or, the obvious arbitrary-length version:
create table products (
product_id integer primary key auto_increment
, product_name varchar(255) -- or whatever
);
create table product_attributes (
product_id integer
, attribute_id integer
, -- other stuff you want like date of assignment
, primary key (product_id , attribute_id)
);
create table attributes (
attribute_id integer primary key -- whatever
, attribute_type -- Category?
, attribute_value varchar(255)
);
I would not hesitate to shove a few hundred million records into a basic structure like either.
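As a quick usage sketch against the second structure, pulling all attributes for one product is a simple two-join query (the id is an example value):
SELECT p.product_name, a.attribute_type, a.attribute_value
FROM products p
JOIN product_attributes pa ON pa.product_id = p.product_id
JOIN attributes a ON a.attribute_id = pa.attribute_id
WHERE p.product_id = 12345;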