Approach to Primary Key on Related Tables to Save Storage Space - mysql

I have a question regarding primary keys in Relational Databases. Let's assume that I have the following tables:
Box
id
box_name
BoxItems
id
item_name
belongs_to_box_id (foreign key)
Let's also assume that I intend to store millions of items per day. I would probably use bigint or a guid for the BoxItems.Id.
What I was thinking, and I need your advice on that, is instead of Bigint Id for the BoxItems, use a sequencial TinyInt number and what identified each item is the combination of the belongs_to_box_id plus the tinyint row (e.g. item_numner).
So now instead of the above we get:
BoxItems
belongs_to_box_id
item_sequence_number [TINYINT]
item_name
Example:
Items.Insert(1,1, "my item 1");
Items.Insert(1,2, "my item 2");
So instead of using bigint or GUID for that matter, I can use tinyint and save a lot of disk space.
I want to know what the cons and pros of such approach. I am developing my app using MySQL and ASP.NET 4.5

When you think about it, there's really not much difference between the "box/contents" problem and the "order/line item" problem.
create table boxes (
box_id integer primary key,
box_name varchar(35) not null
);
create table boxed_items (
box_id integer not null references boxes (box_id),
box_item_num tinyint not null,
item_name varchar(35) not null
);
For MySQL, you'd probably use unsigned integer and unsigned tinyint. There's no compelling reason for a database to avoid negative numbers, but developers should lean on the Principle of Least Surprise.
Make sure 256 values are enough. Getting that wrong can be expensive to correct in a table that gets millions of rows each day.

I would recommend writing a simple test for both approaches and compare performance, disk space and ease of implementation and make a judgement call. Both of your suggestions are reasonable and I doubt there will be much of a difference in performance but the best way to find out is to just try it out and then you will know for sure.

Related

2 independent auto_incremental fields in MySQL table

I am trying to create a MySQL table that has a generic ID column, but also a secondary ID column, both of which need some form of auto incrementing
currently my MySQL table looks like this:
`ban_id` mediumint unsigned NOT NULL AUTO_INCREMENT,
`student_uuid` varchar(36) NOT NULL,
`student_ban_id` tinyint unsigned NOT NULL AUTO_INCREMENT,
(a bunch of data irrelevant to this question)
PRIMARY KEY (`student_uuid`, `student_ban_id`),
UNIQUE (`ban_id`)
The desired behavior is that ban_id is just a generic entry_id and that student_ban_id is the ban's number for the given student. (my reasoning is that I want to be able to reference bans by an id value if the student_uuid is unavailable, but the program spec also requires the ability to take student:banID as a valid means of reference)
A example row might be BanID:501, {studentUUID}, studentBanID:2 (501st ban, 2nd ban against the given student)
I have run into the issue that the MyISAM engine does not support tracking two separate incremental columns at once (I believe it can handle both desired behaviors, but not at the same time)
What might be the best way to achieve such a behavior?
Much appreciated!
-Cryptic

BIGINT vs combined ( tinyint, int ) for Large auto_increment key?

I am designing a website and have a table which dealing with lot of inserts. On each month this table will get at least 50 million records.
So currently I am using bigint unsgined data type as the primary key of this table.
CREATE TABLE `class`.`add_contact_details`
(
`con_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`add_id_ref` BIGINT UNSIGNED NOT NULL,
`con_name` VARCHAR(200),
`con_email` VARCHAR(200),
`con_phone` VARCHAR(200),
`con_fax` VARCHAR(200),
`con_mailbox` VARCHAR(500),
`con_status_show_email` TINYINT(1),
`con_status_show_phone` TINYINT(1),
`con_status_show_fax` TINYINT(1),
`con_status_show_mailbox` TINYINT(1),
PRIMARY KEY (`con_id`) ) ENGINE=INNODB CHARSET=latin1 COLLATE=latin1_swedish_ci;
So by doing a big research I found that most of the people are worry about using BIGINT because it is memory consuming and need lot of space.
So I found an article that describing a alternate for that. Here it is
"You could use a combined ( tinyint, int ) key. The tinyint would start at, and default to, 1. IF the int value is ever about to overflow, you change the tinyint's default value to 2, and reset the int value to 1. You can create code that runs every day, or on another applicable schedule, which checks for that condition and makes that change if needed."
So it make sense right? So Is there anybody who is using this?
What should I use by considering the performance?
Is there any alternative enterprice level solution for this?
Stick with the BIGINT. You do save two or three bytes using the dual key, but you do pay for it.
References to the table need to use two keys instead of one, so all foreign key relationships are more complicated.
where clauses to find a single row are much more complicated. Consider the difference between:
where id in (1, 2, 3, 4, 5)
and
where id_part1 = 0 and id_part2 = 1 or
id_part1 = 0 and id_part2 = 2 or
. . .
The step to increment the first part is no automatic, requiring either manual intervention or the overhead of a trigger.
This reminds me (in a bad way) of segmented memory architectures that were in vogue 20+ years ago. Be happy the computer can understand 64-bit keys with no problems.
BIGINT needs 8 bytes of storage so 50 million records is 400 MB per month which shouldn't be an issue at all.
We are running databases (on DB2) with a couple of TB on a single server.
The only thing you should consider for querying by PK is putting an index on that field.
best regards
Romeo Kienzler

How to create index on massive data (mysql)

I am currently evaluating strategy for storing supplier catalogs.
There can be multiple items in catalog vary from 100 to 0.25million.
Each item may have multiple errors. application should support browsing of catalog items
Group by Type of Error, Category, Manufacturer, Suppliers etc..
Browse items for any group, Should be able to sort and search on multiple columns (partid,
names, price etc..)
Question is when i have to provide functionality of "Multiple SEARCH and SORT and GROUP" how should i create index.
According to mysql doc & blogs for index it seems that creating index on individual column will not be used by all query.
Creating multi column index is even not specific for my case.
There might be 20 - 30 combination of group search & sort.
How do i scale and how can i make search fast.
Expecting to handle 50 million records of data.
Currently evaluating on 15 million of data.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);
If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting. But you can probalby do it with far less indexes than that.
You also need to plan how these indexes will be maintained. I'm assuming because this is for supplier catalogs that the data is not going to change much. In this case, simply creating all the indexes you need should do the job nicely. If the data rows are going to be edited or inserted frequently in realtime, then you have to consider that with your indexing - then having 20 or 30 indexes might not be such a good idea (since MySQL will be constantly having to update them all). You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the default engine, basically fast flat files) is a good choice. If it changes a lot, then you should be using InnoDB so you can get row level locking. InnoDB would also allow you to define a clustered index, which is a special index that controls the order stuff is stored on disk. So if you had one particular query that is run 99% of the time, you could create a clustered index for it and all the data would already be in the right order on disk, and would return super super fast. But, every insert or update to the data would result in the entire table being reordered on disk, which is not fast for lots of data. You'd never use one if the data changed at all frequently, and you might have to batch load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in realtime.
Finally, you should consider alternative means than doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use them provide some really awesome search features like full text and/or faceted search. It's like having a private google search engine indexing your stuff, is a good way to describe how these work. It takes time to write the code to interface with the search server, but you will most likely save that time not having to write and wrap your head around the indexing problem and other issues I mentioned above.
If you do just go with creating all the indexes though, learn how to use the EXPLAIN command in MySQL. That will let you see what MySQL's plan for executing a query will be. You can create indexes then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query methods has indexes supporting it, and is not falling back to a scanning your entire table of data to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.

Need some tips on create table

I plan to create a table to store the race result like this:
Place RaceNumber Gender   Name  Result
12 0112 Male Mike Lee 1:32:40
16 0117 Female Rose Mary 2:20:40
I am confused at the items type definitions.
I am not sure the result can be set to varchar(32) or other type?
and for racenumber, between int(11) and varchar(11), which one is better?
Can I use UNIQUE KEY like my way?
Do I need to split name to firstname and lastName in my DB table?
DROP TABLE IF EXISTS `race_result`;
CREATE TABLE IF NOT EXISTS `race_result` (
`id` int(11) NOT NULL auto_increment,
`place` int(11) NOT NULL,
`racenumber` int(11) NOT NULL,
`gender` enum('male','female') NOT NULL,
`name` varchar(16) NOT NULL,
`result` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `racenumber` (`racenumber`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8 AUTO_INCREMENT=3;
Some advice/opinions regarding datatypes.
Result - This is a time, you may want to do some calculations on this time, therefore you should store it as a time type.
RaceNumber - This is a reference, whilst it is a number, you will be performing no calculations on this number. Therefore you should store it as a varchar rather than an int. This will avoid confusion as to its usage and avoid accidently manipulation of it as a number.
Name - Look at the length of string you allow for the Name. Be careful about limiting this value by so much. 16 characters may be too small for some names in the future.
Place - Is this required storage? Can you calculate the place of a runner based on their Result alone? However, you should keep a good primary key for your table.
In answer to your specific questions:
Result: I would just set the result to an integer number of seconds. My opinion is that data should be stored in databases, not formatting. Since the likely things you're going to want to do with this is sort by it and return rows less than or greater than specific values of it, an integer seems better to me.
Race number: Same for race number. If it's always going to be numeric, use an integer and worry about the formatting in the application. If it can be non-numeric then by all means make it varchar but, for a numeric value, I can't see enough gain in making it so.
Unique key: I don't really see the point in having a unique index on race number and ID. ID is, by definition, already unique as a primary key. Perhaps you meant race number and place although even that is risky in the event of two people drawing for a place.
Split names: If you're ever going to treat them as individual values, then yes. Otherwise no. In other words, avoid things like where fullname like 'Mike %'.
For the name, if you ever want to sort on lastname, while you display it as "firstname lastname", then you will need to use separate columns.
In general: think about what you want to do with the data. Leave formatting to the application that is displaying the data. Avoid situations where you need string manipulation or complicated maths to get at the values you need.

MySQL column with various types

I seem to often find myself wanting to store data of more than one type (usually specifically integers and text) in the same column in a MySQL database. I know this is horrible, but the reason it happens is when I'm storing responses that people have made to questions in a questionnaire. Some questions need an integer response, some need a text response and some might be an item selected from a list.
The approaches I've taken in the past have been:
Store everything as text and convert to int (or whatever) when needed later.
Have two columns - one for text and one for int. Then you just fill one in per row per response, and leave the other one as null.
Have two tables - one for text responses and one for integer responses.
I don't really like any of those, though, and I have a feeling there must be a much better way to deal with this kind of situation.
To make it more concrete, here's an example of the tables I might have:
CREATE TABLE question (
id int(11) NOT NULL auto_increment,
text VARCHAR(200) NOT NULL default '',
PRIMARY KEY ('id')
)
CREATE TABLE response (
id int(11) NOT NULL auto_increment,
question int (11) NOT NULL,
user int (11) NOT NULL,
response VARCHAR(200) NOT NULL default ''
)
or, if I went with using option 2 above:
CREATE TABLE response (
id int(11) NOT NULL auto_increment,
question int (11) NOT NULL,
user int (11) NOT NULL,
text_response VARCHAR(200),
numeric_response int(11)
)
and if I used option 3 there'd be a responseInteger table and a responseText table.
Is any of those the right approach, or am I missing an obvious alternative?
[Option 2 is] NOT the most normalized option [as #Ray claims]. The most normalized would have no nullable fields and obviously option 2 would require a null on every row.
At this point in your design you have to think about the usage, the queries you'll do, the reports you'll write. Will you want to do math on all of the numeric responses at the same time? i.e. WHERE numeric_response IS NOT NULL? Probably unlikely.
More likely would be, What's the average response WHERE Question = 11. In those cases you can either choose the INT table or the INT column and neither would be easier to do than the other.
If you did do two tables, you'd more than likely be constantly unioning them together for questions like, what % of questions have a response etc.
Can you see how the questions you ask your database to answer start to drive the design?
I'd opt for Option 1. The answers are always text strings, but sometimes the text string happens to be the representation of an integer. What is less easy is to determine what constraints, if any, should be placed on the answer to a given question. If some answer should only be a sequence of one or more digits, how do you validate that? Most likely, the Questions table should contain information about the possible answers, and that should guide the validation.
I note that the combination of QuestionID and UserID is (or should be) unique (for a given questionnaire). So, you really don't need the auto-increment column in the answer. You should also have a unique constraint (or primary key constraint) on the QuestionID and UserID anyway (regardless of whether you keep the auto-increment column).
Option 2 is the correct, most normalized option.