I'm creating the database schema for a system and I started to wonder something about the integer datatypes in MySQL. I've seen that, at least in Moodle, the datatypes are sometimes TINYINT (for things like flags), INT (for id numbers) or BIGINT (for almost-infinite auto-increment values, like user ids).
I was wondering: how does that affect the actual database? If I use INT for something like a flag (e.g. 1 = pending, 2 = in process, 3 = reviewing, 4 = processed) instead of TINYINT or BIGINT, does it have repercussions? And what about not setting up constraints (like, again with a flag, using TINYINT(1) versus TINYINT without a specific number)?
The size that you provide will not affect how data is stored.
So INT(11) is stored on disk the same way as INT(3).
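To illustrate (a minimal sketch; the width_demo table is made up): the number in parentheses is only a display width, not a storage size and not a limit on the range.
CREATE TABLE width_demo (
    a INT(11),
    b INT(3)
);
INSERT INTO width_demo VALUES (123456789, 123456789); # both succeed
SELECT * FROM width_demo;                              # both columns return 123456789
Both columns occupy 4 bytes per value; the display width only matters with ZEROFILL or when a client chooses to pad the output.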
With your example of
1 = pending, 2 = in process, 3 = reviewing, 4 = processed
I would use an ENUM('pending', 'inprocess', 'reviewing', 'processed') instead. That keeps it readable in your code and in the data, while providing the same speed as using an integer.
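For example, a sketch of what that column could look like (the task table name is an assumption):
CREATE TABLE task (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status ENUM('pending', 'inprocess', 'reviewing', 'processed') NOT NULL DEFAULT 'pending'
);
Internally MySQL stores the ENUM as a small integer (1 byte for up to 255 distinct values), but you read and write the readable labels.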
What is the difference between tinyint, smallint, mediumint, bigint and int in MySQL?
You should read about the different numeric datatypes and make your decision based on that.
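For a quick reference on storage sizes (taken from the MySQL documentation; the table itself is only illustrative):
CREATE TABLE int_sizes (
    a TINYINT,   # 1 byte:  -128 to 127 (0 to 255 unsigned)
    b SMALLINT,  # 2 bytes: -32,768 to 32,767 (0 to 65,535 unsigned)
    c MEDIUMINT, # 3 bytes: -8,388,608 to 8,388,607 (0 to 16,777,215 unsigned)
    d INT,       # 4 bytes: about +/-2.1 billion (0 to ~4.29 billion unsigned)
    e BIGINT     # 8 bytes: about +/-9.2 quintillion
);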
I am looking into storing a "large" amount of data and am not sure what the best solution is, so any help would be most appreciated. The structure of the data is:
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that it's often best to store the data as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing the data as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in the comments, there's no cleaner way than using one MySQL row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or just a non-unique INDEX on colNum, if you often access the matrix by column (the PRIMARY KEY is on ( `rowNum`, `colNum` ), note the order, so it will be inefficient when selecting a whole column).
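For example (a sketch), if column-wise access is common:
# Secondary index so that selecting a whole column doesn't scan the full table
ALTER TABLE `stackoverflow`.`matrix` ADD INDEX `idx_col` (`colNum`);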
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (there are two indexes to rebuild, and 450,000 entries [one per row] to add each time you add a column).
Updates should be very fast, as the indexes wouldn't change and the value is of fixed size.
If you often access the same subsets (rows + columns), you can use PARTITIONing of the table if you need something "faster" than what MySQL provides by default.
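A hedged sketch of what that could look like (the partition count is arbitrary; note that recent MySQL versions only support partitioning with InnoDB, not MyISAM):
ALTER TABLE `stackoverflow`.`matrix`
    PARTITION BY HASH(`rowNum`)
    PARTITIONS 16;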
After years of experience (2021 edit)
Re-reading this years later, I would say the "cache" ideas below are totally dumb, as it's MySQL's role to handle this sort of caching (the data should actually already be in the InnoDB buffer pool).
A better idea, if the matrix is full of zeroes, is not to store the zero values and to treat 0 as the default in the client code. That way you may lighten the storage (if needed: MySQL should actually be pretty fast at responding to queries even on such a 5-billion-row table).
Another option, if storage is an issue, is to use a single ID to identify both row and column: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000 * col + row) [though it needs BIGINT, so maybe not better than 2 columns].
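A sketch of that single-key variant (the table name and the range-scan query are assumptions, not something from the question):
CREATE TABLE `stackoverflow`.`matrixSingleKey` (
    `id` BIGINT UNSIGNED NOT NULL , # id = 450000 * colNum + rowNum
    `value` INT NOT NULL ,
    PRIMARY KEY ( `id` )
) ENGINE = MYISAM ;
# Reading column c then becomes one range scan on the primary key:
# SELECT value FROM matrixSingleKey WHERE id BETWEEN 450000 * c AND 450000 * c + 449999;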
Don't do what's described below: don't reinvent MySQL's cache
Add a cache (actually no)
Since you said you add values and don't seem to edit existing matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column/row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the result of that SELECT from the matrix table into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
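For completeness, a sketch of the cache-fill step described above (keeping in mind the later-added note above that MySQL's own caching usually makes this unnecessary; the row number is just an example):
# Copy one requested row of the matrix into the cache table;
# INSERT IGNORE skips (rowNum, colNum) pairs that are already cached.
INSERT IGNORE INTO `stackoverflow`.`cachedPartialMatrix` (`rowNum`, `colNum`, `value`)
SELECT `rowNum`, `colNum`, `value`
FROM `stackoverflow`.`matrix`
WHERE `rowNum` = 42;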
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When a request is made on your matrix (one may use TRIGGERS) for the Nth row or Kth column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from the cache entries because its `lastDate` is old enough, you may break the cache for some other entries), or to regularly clear the cache and keep only the recently selected values.
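A sketch of the counter update (using the column names from the table above; the row number is just an example):
# Bump the counter for a request on the 42nd row
INSERT INTO `stackoverflow`.`requestsCounter` (`isRowSelect`, `index`, `count`, `lastDate`)
VALUES (TRUE, 42, 1, NOW())
ON DUPLICATE KEY UPDATE `count` = `count` + 1, `lastDate` = NOW();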
I'm currently building a Rails app, and I'm trying to make sure I don't make any big mistakes now in the database design (MySQL) that will haunt me down the road.
Currently I have a bunch of fields in various tables that reference simple constant data. For example, one of my tables is called rates and its fields are
id, amount and unit, with unit being per hour / day / week / month.
These values will not change, and instead of using a reference table I just used a single CHAR to represent the values, so hour would be 'h', day would be 'd', week would be 'w' and so on. I figured this would make the database more human friendly and limit the possibility of IDs changing somehow and all the data being corrupted because of it.
Does this seem like a reasonable approach, or am I missing some potential pitfall?
If it were up to me I would use a lookup table which contains key/value pairs, where the value is anything you want it to be. Something like this:
CREATE TABLE `lookup_rate_type`
(
`id` TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(20) NOT NULL,
PRIMARY KEY (`id`), UNIQUE KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then, in any tables where you need to reference a rate, simply use the id which correlates to the rate you are interested in. Let's add a few rows:
INSERT INTO `lookup_rate_type` SET name = 'second'; # id = 1
INSERT INTO `lookup_rate_type` SET name = 'minute'; # id = 2
INSERT INTO `lookup_rate_type` SET name = 'hour'; # id = 3
Now, if you have a table which needs a reference to a rate, you would do this:
INSERT INTO `some_table` SET rate = 27, rate_type_id = 3; # 27 / hour
INSERT INTO `some_table` SET rate = 3600, rate_type_id = 2; # 3600 / minute
The nice thing about this approach is that if you ever decide you want to use just the first letter to identify a rate_type, you simply need to update the lookup_rate_type table:
UPDATE `lookup_rate_type` SET name = 's' WHERE ID = 1 LIMIT 1;
Even though you changed the name value (within a context you understand, second became s), any tables storing a relation to the value will remain unchanged.
With this solution you can add as many rows as you need to the lookup table, vs having to run an ALTER statement just to add an ENUM value, which can be painful when dealing with a large table.
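To make the contrast concrete (the rates table and ENUM column in the commented ALTER are hypothetical):
# With the lookup table, a new rate type is just another row:
INSERT INTO `lookup_rate_type` SET name = 'day'; # id = 4
# With an ENUM column you would instead have to alter every table that uses it, e.g.:
# ALTER TABLE `rates` MODIFY `unit` ENUM('second', 'minute', 'hour', 'day') NOT NULL;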
The only drawback with this solution is that you will probably need to use class constants to give your application code access to the int values so you can conduct CRUD ops according to your business logic, something like:
class LookupRateType
{
const SECOND = 1;
const MINUTE = 2;
}
// Calling code example
$sql = sprintf('INSERT INTO some_table SET rate = %d, rate_type_id = %d', 27, LookupRateType::MINUTE);
Also, if you wanted to, when dealing with a small list of constants you can forgo the lookup table altogether and deal only with class constants. The drawback here is that you will have to know the id when looking for specific data, instead of being able to query it by name:
# If you keep the lookup table, you can query by name
SELECT * FROM some_table WHERE rate_type_id = (SELECT id FROM lookup_rate_type WHERE name = 'hour');
# But with class constants only, you have to know the id and use it directly
SELECT * FROM some_table WHERE rate_type_id = 3;
These are scenarios I have come across in a multitude of different applications, and I have found the lookup table and/or class constants to be the best solution.
If you really want to do this, I'd suggest using an ENUM column:
CREATE TABLE rates (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
amount INTEGER NOT NULL,
unit ENUM( 'hour', 'day', 'week', 'month' ) NOT NULL
);
This is even more readable than using single-character abbreviations, while being no less efficient. In strict SQL mode, it also ensures that attempts to insert invalid values will be reported as errors.
(However, eggyal's suggestion of using a SMALLINT column and measuring durations in hours is also worth considering.)
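To illustrate the strict-mode behaviour (a sketch against the rates table above):
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO rates (amount, unit) VALUES (10, 'fortnight');
# Rejected with a "Data truncated for column 'unit'" error instead of being
# silently stored as the empty string (which is what happens in non-strict mode).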
P.S. I'm not really familiar with Rails, but at least the first result I got when I Googled "mysql rails enum" says:
"The best part from the Rails side, is that you don’t have to change anything at all in your code to swap a varchar out for an ENUM."
I have a MySQL (MyISAM) database with different tables. Let's take for example the database "rh955_omf" with the following tables:
signal (600 MBytes, 17925 entries)
picture (5'355 MBytes, 17925 entries)
velocity (680 MBytes, 4979 entries)
Actually I'm just concentrating on the signal table entries, so I want to describe this table a bit better. It's created as follows:
CREATE TABLE rh955_omf.signal(
    MeasNr TINYINT UNSIGNED,
    ExperimentNr TINYINT UNSIGNED,
    Time INT,
    SequenceNr SMALLINT UNSIGNED,
    MeanBeatRate SMALLINT UNSIGNED,
    MedBeatRate SMALLINT UNSIGNED,
    MeanAmp1 MEDIUMINT UNSIGNED,
    MeanAmp2 MEDIUMINT UNSIGNED,
    StdDeviationAmp1 DOUBLE,
    StdDeviationAmp2 DOUBLE,
    MeanDeltaAmp MEDIUMINT UNSIGNED,
    Offset INT UNSIGNED,
    NrOfPeaks SMALLINT UNSIGNED,
    `Signal` MEDIUMBLOB,
    Peakcoord MEDIUMBLOB,
    Validity BOOL,
    Comment VARCHAR(255),
    PRIMARY KEY(MeasNr, ExperimentNr, SequenceNr)
);
I load the values from this table with the following command:
SELECT MeanBeatRate FROM rh955_omf.signal WHERE MeasNr = 3 AND ExperimentNr = 10 AND SequenceNr BETWEEN 0 AND 407
If I load the whole "MeanBeatRate" column (16-bit integer values) for the first time, it takes about 54 seconds (MeasNr = 1..3, ExperimentNr = 1..24, SequenceNr >= min AND <= max). If I load it again, it takes 0.5 seconds (cache).
So what I want to do is speed up the database. Therefore I created some new databases, each containing only some of the tables:
rh955_copy_omf: signal table
rh955_p_copy_omf: signal table, picture table
rh955_v_p_copy_omf: signal table, picture table, velocity table
I restarted the computer and loaded all the "MeanBeatRate" values from the signal table in each database. That gave me the following times:
rh955_omf: 54s (as mentioned before)
rh955_copy_omf: 3.1s
rh955_p_copy_omf: 12.9s
rh955_v_p_copy_omf: 10.7s
So it looks like the time to load the data depends on the other tables in the database. Is that even possible (given that I'm only querying the "signal" table)? And what is even more confusing: in the database "rh955_v_p_copy_omf" I have all the data I also have in the original database, but the performance is ~5 times better. Any explanation for that behavior? I would be thankful for any help because I'm really stuck at this point and need to increase the database performance...
Additional information: In one case I stored the data in the table with the command "LOAD DATA INFILE 'D:/Exported MySQL/rh955/signal.omf' INTO TABLE rh955_omf.signal" (that's the case where loading data is slow); in the other cases I stored the data line by line. Maybe that's the reason why the performance is different? If so, what's the workaround for loading data from a file?
Are they indexed the same? (Some statements for checking this are sketched after this list.)
Are the server parameters the same for each DB (same memory, start up configuration and parameters, etc)?
Are they on the same disk?
If they are on the same disk is there some other application running at the same time which influences where the read-write heads are?
Do you stop the other databases each time or have some still running some times?
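A few statements that can help answer the first two points (a sketch; adjust the database names as needed):
SHOW INDEX FROM rh955_omf.signal;
SHOW INDEX FROM rh955_copy_omf.signal;
SHOW TABLE STATUS FROM rh955_omf LIKE 'signal';
SHOW TABLE STATUS FROM rh955_copy_omf LIKE 'signal';
# MyISAM-related server settings worth comparing between runs:
SHOW VARIABLES LIKE 'key_buffer_size';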
I'm a developer with some limited database knowledge, trying to put together a scalable DB design for a new app. Any thoughts that anyone could provide on this problem would be appreciated.
Assume I currently have the following table:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Varchar(250)
Looking forward, assume we will have 500 million records in this table. However, at any given time only 5000 or so records will have anything in the Attr5 column; all other records will have a blank or null Attr5 column. The Attr5 column is populated with 100-200 characters when a record is inserted, then a nightly process will clear the data in it.
My concern is that such a large varchar field in the center of a tablespace that otherwise contains mostly small numeric fields will decrease the efficiency of reads against the table. As such, I was wondering if it might be better to change the DB design to use two tables like this:
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Stuff_Text
------------
StuffID Integer
Attr5 Varchar(250)
Then just delete from Stuff_Text during the nightly process keeping it at 5,000 records, thus keeping the Stuff table minimal in size.
So my question is this: Is it necessary to break this table into two, or is the database engine intelligent enough to store and access the information efficiently? I could see the DB compressing the data efficiently and storing records without data in Attr5 as if there were no varchar column. I could also see the DB leaving an open 250 bytes of data in every record anticipating data for Attr5. I tend to expect the former, as I thought that was the purpose of varchar over char, but my DB experience is limited so I figure I'd better double check.
I am using MySQL 5.1, currently on Windows 2000AS, eventually upgrading to Windows Server 2008 family. Database is currently on a standard 7200 rpm magnetic disc, eventually to be moved to an SSD.
Stuff
------------
ID Integer
Attr1 Integer
Attr2 Integer
Attr3 Double
Attr4 TinyInt
Attr5 Integer NOT NULL DEFAULT 0 (build an index on this)
Stuff_Text
------------
Attr5_id Integer (primary key)
Attr5_text Varchar(250)
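A sketch of the corresponding DDL (the engine and index name are assumptions):
CREATE TABLE Stuff (
    ID INTEGER NOT NULL PRIMARY KEY,
    Attr1 INTEGER,
    Attr2 INTEGER,
    Attr3 DOUBLE,
    Attr4 TINYINT,
    Attr5 INTEGER NOT NULL DEFAULT 0, # 0 = no text, otherwise the id of the row in Stuff_Text
    KEY idx_attr5 (Attr5)
) ENGINE=InnoDB;
CREATE TABLE Stuff_Text (
    Attr5_id INTEGER NOT NULL PRIMARY KEY,
    Attr5_text VARCHAR(250)
) ENGINE=InnoDB;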
In action
desc select * from Stuff WHERE Attr5<>0;
desc select Stuff.*, Stuff_text.Attr5_text
from Stuff
inner join Stuff_text ON Stuff.Attr5=Stuff_text.Attr5_id;
don't store NULLs
use an integer as the foreign key
when pulling records where Attr5 <> 0, only ~5,000 rows are scanned
much smaller index size
do a benchmark yourself
If you're using VARCHAR and allowing NULL values, then you shouldn't have problems, because MySQL stores this kind of datatype really efficiently. This is very different from the CHAR datatype, but you already have VARCHAR.
Anyway, splitting it into two tables is not a bad idea. It could help keep the query cache alive, but it mostly depends on how these tables are used.
Last thing I can say: try to benchmark it. Insert a bulk of data and try to simulate some realistic use.
Funny thing I've found about MySQL: it has a 3-byte numeric type, MEDIUMINT, whose range is from -8388608 to 8388607. That seems strange to me. I thought the sizes of numeric types were chosen for better performance, with data aligned to a machine word or double word, and that if we need restriction rules for numeric ranges, they should be external to the datatype. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is that if you have a number that falls within the 3-byte range, you don't waste space by storing it using 4 bytes.
When you have twenty billion rows, it matters.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to keep using 3 bytes for that type while processing it.
It might, however, have a small advantage in using the disk cache more efficiently.
We frequently use TINYINT, SMALLINT, and MEDIUMINT for very significant space savings. Keep in mind, it makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
And then you have hundreds of millions or billions of records.
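For a rough sense of the savings (a sketch; the user_group table name is made up): each primary-key entry above carries 2 + 3 = 5 bytes of key data, versus 8 bytes with two INTs or 16 bytes with two BIGINTs, before row and index overhead.
CREATE TABLE user_group (
    id1 SMALLINT UNSIGNED NOT NULL,  # 2 bytes, values up to 65,535
    id2 MEDIUMINT UNSIGNED NOT NULL, # 3 bytes, values up to 16,777,215
    PRIMARY KEY (id1, id2)           # 5 bytes of key data per row
) ENGINE=InnoDB;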