Proper database design for this task - mysql

Ok, so I am going to have at least 2 tables, possibly three.
The data is going to be as follows:
First, a list of search terms. These search terms are unrelated to anything else in the program (only involved in getting the outputs, no manipulation of this data at all), so I plan to store them separately in their own table.
Then things get trickier. I've got a list of words, and each word can be in multiple categories. So for example, if you have "sad", it could be under "angst" and "tragedy", just as "happy" could be under "joy" and "fulfillment".
Would it be better to set up a table where I've got three columns: a UID, a word, and a category, or would it be better to set up two tables: both with UIDs, one with the word, one with the category, and set them up as a foreign key?
The ultimate goal is to count the number of words in a given category over a given period of time.
I'll be using MySQL and Python (MySQLdb) if that helps anyone.

Ignoring your 'search terms' table (since it doesn't seem to have any relevance to the question), I would probably do it something like this:
words (w_id int, w_word varchar(50))
categories (c_id int, c_category varchar(50))
wordcategories (wc_wordid int, wc_catid int)
Add foreign key constraints from the ids in wordcategories onto the words and categories tables.
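A minimal runnable sketch of this three-table layout, using Python's sqlite3 standard library in place of MySQL/MySQLdb for illustration (the logical schema carries over to MySQL with minor type changes; the names follow the answer's sketch):

```python
import sqlite3

# In-memory database for illustration; with MySQL you would use
# MySQLdb and InnoDB tables with the same logical schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (
        w_id   INTEGER PRIMARY KEY,
        w_word VARCHAR(50)
    );
    CREATE TABLE categories (
        c_id       INTEGER PRIMARY KEY,
        c_category VARCHAR(50)
    );
    CREATE TABLE wordcategories (
        wc_wordid INTEGER REFERENCES words (w_id),
        wc_catid  INTEGER REFERENCES categories (c_id),
        PRIMARY KEY (wc_wordid, wc_catid)
    );
""")

# "sad" belongs to both "angst" and "tragedy"; "happy" to "joy".
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(1, "sad"), (2, "happy")])
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(1, "angst"), (2, "tragedy"), (3, "joy")])
conn.executemany("INSERT INTO wordcategories VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])

# Count words per category -- the "number of words in a given
# category" query the question is ultimately after.
rows = conn.execute("""
    SELECT c.c_category, COUNT(*) AS n
    FROM categories c
    JOIN wordcategories wc ON wc.wc_catid = c.c_id
    GROUP BY c.c_category
    ORDER BY c.c_category
""").fetchall()
print(rows)  # [('angst', 1), ('joy', 1), ('tragedy', 1)]
```

The compound primary key on wordcategories also prevents the same word/category pair from being inserted twice.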

Without having a whole lot of details, I would set it up the following way:
Word Table
id int PK
word varchar(20)
Category Table
id int PK
category varchar(20)
Word_Category Table
wordId int PK
categoryId int PK
The third would be the join table between the word and the category. This table would contain the foreign key constraints to the word and category tables.

Related

4 tables in my DB need a field called "categories". How can I make this work?

For my RSS aggregator, there are four tables that represent RSS and Atom feeds and their articles. Each feed type and entry type will have zero or more categories. In the interest of not duplicating data, I'd like to have only one table for categories.
How can I accomplish this?
One way is to keep categories in one single table - e.g. category - and define an X table for each entity/table that needs 0 or more category associations:
rssFeedXCategory
rssFeedId INT FK -> rssFeed (id)
categoryId INT FK -> category (id)
atomFeedXCategory
atomFeedId INT FK -> atomFeed (id)
categoryId INT FK -> category (id)
and so on.
You can define a PK over both columns in each table, but an extra identity column may also be used. When working with an ORM, I also add an extra identity/autoincrement column (e.g. XId INT) so that a single column can be used to identify a row.
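A runnable sketch of that layout (sqlite3 in-memory stands in for the real database; the cross-table names follow the answer, and the feed tables are reduced to a bare id for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rssFeed  (id INTEGER PRIMARY KEY);
    CREATE TABLE atomFeed (id INTEGER PRIMARY KEY);

    -- One X (cross) table per entity, all pointing at the single
    -- category table, so category rows are never duplicated.
    CREATE TABLE rssFeedXCategory (
        rssFeedId  INTEGER REFERENCES rssFeed (id),
        categoryId INTEGER REFERENCES category (id),
        PRIMARY KEY (rssFeedId, categoryId)
    );
    CREATE TABLE atomFeedXCategory (
        atomFeedId INTEGER REFERENCES atomFeed (id),
        categoryId INTEGER REFERENCES category (id),
        PRIMARY KEY (atomFeedId, categoryId)
    );
""")

conn.execute("INSERT INTO category VALUES (1, 'tech')")
conn.execute("INSERT INTO rssFeed VALUES (10)")
conn.execute("INSERT INTO atomFeed VALUES (20)")
# Both feed types share category 1 without duplicating its row.
conn.execute("INSERT INTO rssFeedXCategory VALUES (10, 1)")
conn.execute("INSERT INTO atomFeedXCategory VALUES (20, 1)")

n_categories = conn.execute("SELECT COUNT(*) FROM category").fetchone()[0]
print(n_categories)  # 1
```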

Do SQL tables benefit or need to have a unique ID column?

In my experience almost all tables have a field called Id that is a unique, indexed primary key. My question is: if I'm not using this value anywhere and will never need it, what is the benefit of having it?
Here is my problem:
The database has these many-many relationship tables (often called map tables) that link two other tables together by their unique Ids.
ex rows:
Id= 1 MachineId= 1 UserId= 2
Id= 2 MachineId= 1 UserId= 3
The way the code stands today, when it updates this table it removes all users of a machine and then adds all of the current users back; this is how they chose to remove old entries. The problem is that this inflates the Id column unnecessarily, because you remove/add a row for every user even if nothing has changed. This happens by default every 90 minutes.
One solution to this is to fix the code to do things the right way. Another solution is to just remove the Id field altogether. Since we don't link to this table somewhere else and we don't use the Id value in code anywhere (we don't even pull it from the DB) why do we need it?
So back to my original question. Is the Id field needed for something else? Or does it provide some benefit that I would lose that I may want?
No, it's not needed, and especially for those many-to-many relationships it is perfectly acceptable not to have one.
Those ids are especially useful if you have foreign key relations to that table. Even then you can get by without them, since a foreign key can reference a unique combination of multiple columns, but it is very much recommended to use single-value keys for this purpose.
The only added benefit of having a key you don't need is that you won't have to add it later if you ever do need it. Hardly a compelling reason. :)
In case you want to google more info:
Those many to many tables are often called a 'junction table' or 'cross-reference table'.
A 'meaningless' unique ID, often auto-numbered, is also called a 'surrogate key'
A key (including primary keys and foreign keys) that consists of multiple fields, is called a 'compound key'. 'Composite key' is often used as a synonym, although Wikipedia has a slightly different definition.
Technically, you don't NEED to have a unique id, but there are far too many situations where not having one will really screw you up. e.g. Consider an address book. You might assume that "firstname, lastname, address" is enough to identify someone, but consider "John Smith, 123 Main Street" and "John Smith, 123 Main Street" (John Junior). Obvious solutions: add a "Jr." field, or add more columns to the key and keep hoping you won't ever get a duplicate... or you just add an auto_increment ID field and be done with it. It doesn't matter what other fields are duplicated across records; you KNOW the id field will be unique.
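The address-book point can be seen directly with an auto-increment column (sqlite3 sketch; SQLite's AUTOINCREMENT differs slightly from MySQL's AUTO_INCREMENT, but the idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE address_book (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        firstname TEXT,
        lastname  TEXT,
        address   TEXT
    )
""")

# Two people with identical natural attributes (John Sr. and Jr.):
row = ("John", "Smith", "123 Main Street")
sql = "INSERT INTO address_book (firstname, lastname, address) VALUES (?, ?, ?)"
conn.execute(sql, row)
conn.execute(sql, row)

ids = [r[0] for r in conn.execute("SELECT id FROM address_book")]
print(ids)  # [1, 2] -- every other field is duplicated, but id is unique
```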
You can easily make a unique composite key if you'd like, but then if you need to set up a foreign key relationship, you'd have to duplicate ALL of those keys' fields in the foreign table.
e.g.
table A (
p, q, r, s, t -> char
primary key (p, q, r, s, t)
)
table B (
h, i, j, k -> whatever
p, q, r, s, t -> char
foreign key (p, q, r, s, t) -> A (p, q, r, s, t)
)
Now you've got your pqrst fields in two tables, and have to write them out, IN FULL, for every join operation. Whereas, if you had a simple single ID field:
table A (
id -> primary key int
p, q, r, s, t
)
table B (
h, i, j, k -> whatever
a_id -> int
foreign key (a_id) -> A (id)
)
one simple int field carried between the two tables, vs. n fields, one for every column in the composite key.
Short answer: In the case you described in your question, the ID column is not necessary; you can drop it if you like and add a PK/unique constraint built from the IDs of the linked tables.
Long answer (with my personal opinion): The ID column is used to speed up queries that involve lots of joins (comparing an integer is usually faster than comparing long strings) and to keep link tables and foreign key columns small. Another use is to give a unique identifier to tables that lack a simple real-life identifier (like log tables).
In some cases the ID column is added only because all the other tables have one.
You always have to consider whether the ID column has any meaning and whether it is really necessary: if you have character codes (using only ASCII characters) of fewer than 4 characters, the code will be smaller than an INT ID column (an INT is stored in 4 bytes, a BIGINT in 8).
Another thing: always include the name of the entity in the ID column (such as PersonID, InvoiceID) to make the queries and schema more readable. In my opinion a column's name should always describe what it stores; the name ID alone does not describe the stored value, while PersonID does. Furthermore, you can (and should) use the same name in the foreign keys.
In most cases, on current hardware, ID columns mostly complicate the database (you always have to join several tables to get the business/natural key), and the ID has no meaning for the business. You can always consider leaving out the ID column and using a natural key as the primary key. (You could drop the ID when you have it as a PK alongside another column defined as UNIQUE NOT NULL. For example, in a table of invoices the PK could be the InvoiceNumber printed on the paper invoice rather than the ID; but if the database is responsible for generating that number, you have to use a sequence-based column.)
An ID (or any machine-generated identifier) is useful when you don't have a simple natural key to use (or you have natural keys, but they are too wide or have to be built up from several other columns), or when the natural key is mutable and you need some kind of stable unique identifier (a car's license plate number is one example).
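For the machine/user map table from the question, dropping the surrogate Id and making the pair itself the primary key looks like this (sqlite3 sketch; in MySQL you would also declare the two foreign keys to the machine and user tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE machine_user (
        MachineId INTEGER,
        UserId    INTEGER,
        PRIMARY KEY (MachineId, UserId)   -- compound key, no surrogate Id
    )
""")

conn.execute("INSERT INTO machine_user VALUES (1, 2)")
conn.execute("INSERT INTO machine_user VALUES (1, 3)")

# The compound key also rejects duplicate pairings for free:
try:
    conn.execute("INSERT INTO machine_user VALUES (1, 2)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```

Nothing inflates on the periodic remove/re-add cycle, because there is no auto-increment counter left to inflate.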

Mysql database empty column values vs additional identifying table

Sorry, not sure if the question title reflects the real question, but here goes:
I'm designing a system that has a standard orders table, but with additional previous and next columns.
The question is which approach to the foreign keys is better.
Here I have a basic table with columns (previous, next) that are self-referencing foreign keys. The problem with this table is that the first placed order doesn't have previous and next values, so those fields are left empty. If I have, say, 10,000 records and 30% of them have those columns empty, that's 3,000 rows, which is quite a lot I think, and I also expect the numbers to grow; in, say, a year it could come to 30,000 rows with empty columns, and I'm not sure if that's OK.
The solution I've come up with is a main table plus two other tables that have foreign keys to it. In that case those two additional tables are identifying tables and nothing more, and there are no longer rows with empty columns.
So the question is which solution is better when considering query speed, table optimization, and common good practice, or maybe there's an even better one that I don't know? (P.S. I am using MySQL with the InnoDB engine.)
If your aim is to model sets of orders, you could simply add a new table for the sets and keep a single foreign key column to it in the orders table.
The orders could also include a rank column to indicate the position of each order within its set.
create table order_sets (
  id int not null auto_increment,
  -- customer related data, etc...
  primary key (id)
);
create table orders (
  id int not null auto_increment,
  name varchar(50),
  quantity int,
  set_id int,
  set_rank int,
  primary key (id),
  foreign key (set_id) references order_sets (id)
);
Then inserting a new order means updating the rank of any other orders in the same set that come after it.
Likewise, grouping queries become much easier than having to follow prev and next links. I'm pretty sure you will need those queries, and performance will be much better that way.
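The "insert and shift ranks" step the answer describes could look like this (sqlite3 sketch of a trimmed-down schema; the insert_order helper is a hypothetical name of mine, not from the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_sets (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        name     TEXT,
        set_id   INTEGER REFERENCES order_sets (id),
        set_rank INTEGER
    );
""")
conn.execute("INSERT INTO order_sets VALUES (1)")

def insert_order(conn, name, set_id, rank):
    # Make room: bump every order at or after the target rank...
    conn.execute(
        "UPDATE orders SET set_rank = set_rank + 1 "
        "WHERE set_id = ? AND set_rank >= ?", (set_id, rank))
    # ...then slot the new order in.
    conn.execute(
        "INSERT INTO orders (name, set_id, set_rank) VALUES (?, ?, ?)",
        (name, set_id, rank))

insert_order(conn, "first", 1, 1)
insert_order(conn, "third", 1, 2)
insert_order(conn, "second", 1, 2)   # squeeze in between the two

names = [r[0] for r in conn.execute(
    "SELECT name FROM orders WHERE set_id = 1 ORDER BY set_rank")]
print(names)  # ['first', 'second', 'third']
```

Reading a whole set back is then a single ORDER BY, with no link-following.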

Sphinx Search, compound key

After my previous question (http://stackoverflow.com/questions/8217522/best-way-to-search-for-partial-words-in-large-mysql-dataset), I've chosen Sphinx as the search engine on top of my MySQL database.
I've done some small tests with it, and it looks great. However, I'm at a point right now where I need some help / opinions.
I have a table articles (structure isn't important), a table properties (structure isn't important either), and a table with values of each property per article (this is what it's all about).
The table where these values are stored, has the following structure:
articleID UNSIGNED INT
propertyID UNSIGNED INT
value VARCHAR(255)
The primary key is a compound key of articleID and propertyID.
I want Sphinx to search through the value column. However, to create an index in Sphinx I need a unique id, which I don't have here.
Also when searching, I want to be able to filter on the propertyID column (only search values for propertyID 2 for example, which I can do by defining it as attribute).
On the Sphinx forum, I found I could create a multi-value attribute, and set this as query for my Sphinx index:
SELECT articleID, value, GROUP_CONCAT(propertyID) FROM t1 GROUP BY articleID
articleID will be unique now, however, now I'm missing values. So I'm pretty sure this isn't the solution, right?
There are a few other options, like:
Add an extra column to the table, which is unique
Create a calculated unique value in the query (like articleID*100000+propertyID)
Are there any other options I could use, and what would you do?
Regarding your suggestions:
Add an extra column to the table, which is unique
This cannot be done for an existing table with a large number of records, as adding a new field to a large table takes some time, during which the database will not be responsive.
Create a calculated unique value in the query (like articleID*100000+propertyID)
If you do this, you have to find a way to get the articleID and propertyID back from the calculated unique id.
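That round trip is straightforward as long as propertyID is guaranteed to stay below the multiplier; divmod recovers both parts. (The 100000 multiplier is the question's own example and an assumption about the real data.)

```python
MULTIPLIER = 100_000  # must exceed the largest possible propertyID

def make_sphinx_id(article_id: int, property_id: int) -> int:
    # Pack both halves of the compound key into one integer,
    # usable as Sphinx's unique document id.
    assert 0 <= property_id < MULTIPLIER
    return article_id * MULTIPLIER + property_id

def split_sphinx_id(sphinx_id: int) -> tuple:
    # Recover (articleID, propertyID) from the packed id.
    return divmod(sphinx_id, MULTIPLIER)

doc_id = make_sphinx_id(42, 7)
print(doc_id)                   # 4200007
print(split_sphinx_id(doc_id))  # (42, 7)
```

Note the packed id can exceed 32 bits for large articleID values, so the Sphinx document id should be treated as 64-bit.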
An alternative is to create a new table with a key field for Sphinx and another two fields to hold articleID and propertyID.
new_sphinx_table with following fields
id - UNSIGNED INT/ BIGINT
articleID - UNSIGNED INT
propertyID - UNSIGNED INT
Then you can write an indexing query like below
SELECT id, t1.articleID, t1.propertyID, value FROM t1 INNER JOIN new_sphinx_table nt ON t1.articleID = nt.articleID AND t1.propertyID = nt.propertyID;
This is a sample, so you can modify it to fit your requirements.
What Sphinx returns is the matching new_sphinx_table.id values along with the other attribute columns. You can then get the full result by taking those new_sphinx_table.id values and joining your t1 table with new_sphinx_table.

opinions and advice on database structure

I'm building this tool for classifying data. Basically I will be regularly receiving rows of data in a flat-file that look like this:
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
a:b:c:d:e
And I have a list of categories to break these rows up into, for example:
Original Cat1 Cat2 Cat3 Cat4 Cat5
---------------------------------------
a:b:c:d:e a b c d e
As of right this second, the category names are known, as well as the number of categories to break the data down into. But this might change over time (for instance, categories being added or removed, so the total number of categories changes).
Okay, so I'm not really looking for help on how to parse the rows or get the data into a db or anything... I know how to do all that, and have the core script mostly written already to handle parsing rows of values and separating them into a variable number of categories.
Mostly I'm looking for advice on how to structure my database to store this stuff. So I've been thinking about it, and this is what I came up with:
Table: Generated
generated_id int - unique id for each row generated
generated_timestamp datetime - timestamp of when row was generated
last_updated datetime - timestamp of when row last updated
generated_method varchar(6) - method in which row was generated (manual or auto)
original_string varchar (255) - the original string
Table: Categories
category_id int - unique id for category
category_name varchar(20) - name of category
Table: Category_Values
category_map_id int - unique id for each value (not sure if I actually need this)
category_id int - id value to link to table Categories
generated_id int - id value to link to table Generated
category_value varchar (255) - value for the category
Basically the idea is when I parse a row, I will insert a new entry into table Generated, as well as X entries in table Category_Values, where X is however many categories there currently are. And the category names are stored in another table Categories.
What my script will immediately do is process rows of raw values and output the generated category values to a new file to be sent somewhere. But then I have this db I'm making to store the data generated so that I can make another script, where I can search for and list previously generated values, or update previously generated entries with new values or whatever.
Does this look like an okay database structure? Anything obvious I'm missing or potentially gimping myself on? For example, with this structure... well... I'm not a SQL expert, but I think I should be able to do something like
select * from Generated where original_string = '$string'
// id is put into $id
and then
select * from Category_Values where generated_id = '$id'
...and then I'll have my data to work with, for search results or a form to alter the data. I'm fairly certain I could even combine this into one query with a join or something, but I'm not that great with SQL, so I don't know how to actually do that. The point is, I know I can do what I need with this db structure... but am I making this harder than it needs to be? Making some obvious noob mistake?
My suggestion:
Table: Generated
id unsigned int autoincrement primary key
generated_timestamp timestamp
last_updated timestamp default '0000-00-00' ON UPDATE CURRENT_TIMESTAMP
generated_method ENUM('manual','auto')
original_string varchar (255)
Table: Categories
id unsigned int autoincrement primary key
category_name varchar(20)
Table: Category_Values
id unsigned int autoincrement primary key
category_id int
generated_id int
category_value varchar (255) - value for the category
FOREIGN KEY `fk_cat` (category_id) REFERENCES Categories (id)
FOREIGN KEY `fk_gen` (generated_id) REFERENCES Generated (id)
Links
Timestamps: http://dev.mysql.com/doc/refman/5.1/en/timestamp.html
Create table syntax: http://dev.mysql.com/doc/refman/5.1/en/create-table.html
Enums: http://dev.mysql.com/doc/refman/5.1/en/enum.html
I think this solution is perfect for what you want to do. The Categories list is now flexible, so you can add new categories or retire old ones (I would recommend thinking long and hard before agreeing to delete a category: would you orphan its records or remove them too, etc.).
Basically, I'm saying you are right on target. The structure is simple but it will work well for you. Great job (and great job giving exactly the right amount of information in the question).
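For what it's worth, the single join the asker wasn't sure how to write could look like this (sqlite3 sketch of the suggested schema, trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Generated (
        id              INTEGER PRIMARY KEY,
        original_string VARCHAR(255)
    );
    CREATE TABLE Categories (
        id            INTEGER PRIMARY KEY,
        category_name VARCHAR(20)
    );
    CREATE TABLE Category_Values (
        id             INTEGER PRIMARY KEY,
        category_id    INTEGER REFERENCES Categories (id),
        generated_id   INTEGER REFERENCES Generated (id),
        category_value VARCHAR(255)
    );
""")
conn.execute("INSERT INTO Generated VALUES (1, 'a:b:c:d:e')")
conn.executemany("INSERT INTO Categories VALUES (?, ?)",
                 [(1, 'Cat1'), (2, 'Cat2')])
conn.executemany("INSERT INTO Category_Values VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 'a'), (2, 2, 1, 'b')])

# One query instead of two: all category values (with their names)
# for a given original string.
rows = conn.execute("""
    SELECT c.category_name, cv.category_value
    FROM Generated g
    JOIN Category_Values cv ON cv.generated_id = g.id
    JOIN Categories c       ON c.id = cv.category_id
    WHERE g.original_string = ?
    ORDER BY c.id
""", ('a:b:c:d:e',)).fetchall()
print(rows)  # [('Cat1', 'a'), ('Cat2', 'b')]
```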