Optimizing MySQL Table Structure and impact of row size

One of my database tables has grown quite large, to the point where I think it is impacting the performance on my site (it is definitely making backups a lot slower).
It has ~13,000,000 rows and is 4.2 GiB in size, of which 1.2 GiB is data.
The structure looks like this:
CREATE TABLE IF NOT EXISTS `t1` (
`id` int(10) unsigned NOT NULL,
`int2` int(10) unsigned NOT NULL,
`int3` int(10) unsigned NOT NULL,
`int4` int(10) unsigned NOT NULL,
`char1` varchar(255) NOT NULL,
`int5` int(10) NOT NULL,
`char2` varchar(1024) DEFAULT NULL,
`char3` varchar(1024) NOT NULL,
PRIMARY KEY (`id`,`int2`,`int3`,`int4`),
KEY `key1` (`id`,`int2`,`char1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Common operations in this table are insert and selects, rows are never updated and rarely deleted. int2 is a running version number, which means usually only the rows with the highest value of int2 for that id are selected.
I have been thinking of several ways of optimizing this and I was wondering which one would be the best to pursue:
char1 (which is in the index) actually only contains about 40,000 different strings. I could move the strings into a second table (idchar -> char) and then just save the id in my main table, at the cost of an additional id lookup step during inserts and selects.
char2 and char3 are often empty. I could move them to a separate table that I would then do a LEFT JOIN on in selects.
Even if char2 and char3 contain data they are usually shorter than 1024 chars. I could probably shorten these to ~200.
Which one of these do you think is the most promising? Does decreasing the row size (either by making char1 into an integer or by removing/resizing columns) in MySQL InnoDB tables actually have a big impact on performance?
Thanks

There are several options. From what you say, moving char1 to another table seems quite reasonable. The additional lookup could, under some circumstances, even be faster than storing the raw strings in the table. (This happens when the repeated values bloat the table, especially when the bloated table no longer fits in available memory.) And this would save space both in the data table and the corresponding index.
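A minimal sketch of that lookup-table approach (the table and column names here are only illustrative, not from your schema):
CREATE TABLE char1_lookup (
  char1_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  char1 VARCHAR(255) NOT NULL,
  PRIMARY KEY (char1_id),
  UNIQUE KEY uq_char1 (char1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- t1 would then store char1_id instead of char1, and key1 would become
-- (id, int2, char1_id). Selects resolve the string with a join:
SELECT t.id, t.int2, c.char1
FROM t1 t
JOIN char1_lookup c ON c.char1_id = t.char1_id
WHERE t.id = ? AND t.int2 = ?;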
The exact impact on performance is hard to say, without understanding much more about your system and the query load.
Moving char2 and char3 to another table will have minimal impact. The overhead of the link to the other table would eat up any gains in space. You could save a couple of bytes per record by storing them as varchar(255) rather than varchar(1024).
If you have a natural partitioning key, then partitioning is definitely an option, particularly for reducing the time for backups. This is very handy for a transaction-style table, where records are inserted and never or very rarely modified. If, on the other hand, the records contain customer records and any could be modified at any time, then you would still need to back up all the partitions.
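A hedged sketch of what that could look like on your table, assuming int2 (the running version number) is an acceptable partitioning key. MySQL requires the partitioning column to appear in every unique key, which your primary key satisfies; the ranges here are arbitrary:
ALTER TABLE t1
  PARTITION BY RANGE (int2) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN MAXVALUE
  );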

There are several factors that could affect the performance of your DB. Partitioning is definitely the best option, but it cannot always be done. If you are searching for char1 before an insert, then partitioning can be a problem because you have to search all the partitions for the key. You must analyze how the data is generated and, most importantly, how you query this table. That is the key, so you should post your queries against this table. In the case of char2 and char3, moving them to another table won't make any difference. You should also mention the physical distribution of your data. Are you using a single data file? Are the data files on the same physical disk as the OS? Give more details so we can give you more help.

Related

MySQL: Which is smaller, storing 2 sets of similar data in 1 table vs 2 tables (+indexes)?

I was asked to optimize (size-wise) the statistics system for a certain site and I noticed that they store 2 sets of stat data in a single table. Those sets are product displays on search lists and visits on product pages. Each row has product id, stat date, stat count and stat flag columns. The flag column indicates whether it's a search-list display or a page-visit stat. Stats are stored per day and product id; stat date (actually combined with product id and stat type) and stat count have indexes.
I was wondering if it's better (size-wise) to store those two sets as separate tables or keep them as a single one. I presume that the part which would make a difference would be the flag column (let's say it's a 1-byte TINYINT) and the indexes. I'm especially interested in how the space taken by indexes would change in the 2-table scenario. The table in question already has a few million records.
I'll probably do some tests when I have more time, but I was wondering if someone had already challenged a similar problem.
Ordinarily, if two kinds of observations are conformable, it's best to keep them in a single table. By "conformable," I mean that their basic data is the same.
It seems that your observations are indeed conformable.
Why is this?
First, you can add more conformable observations trivially easily. For example, you could add sales to search-list and product-page views, by adding a new value to the flag column.
Second, you can report quite easily on combinations of the kinds of observations. If you separate these things into different tables, you'll be doing UNIONs or JOINs when you want to get them back together.
Third, when indexing is done correctly the access times are basically the same.
Fourth, the difference in disk space usage is small. You need indexes in either case.
Fifth, the difference in disk space cost is trivial. You have several million rows, or in other words, a dozen or so gigabytes. The highest-quality Amazon Web Services storage costs about US$ 1.00 per year per gigabyte. That's less than it will cost to heat your office for the day you'll spend refactoring this stuff. Let it be.
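To illustrate the second point with your columns (the table names and flag values here are hypothetical), a combined report is a single filtered query against one table, but needs a UNION across two:
SELECT product_id, stat_date, SUM(stat_count) AS total
FROM stats
WHERE stat_flag IN (1, 2)   -- search-list displays plus page visits
GROUP BY product_id, stat_date;

-- versus, with two tables:
SELECT product_id, stat_date, SUM(stat_count) AS total
FROM (
  SELECT product_id, stat_date, stat_count FROM stats_search
  UNION ALL
  SELECT product_id, stat_date, stat_count FROM stats_visits
) AS s
GROUP BY product_id, stat_date;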
Finally I got a moment to conduct a test. It was just a small scale test with 12k and 48k records.
The table that stored both types of data had the following structure:
CREATE TABLE IF NOT EXISTS `stat_test` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
`stat_type` tinyint(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`,`stat_type`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
The other two tables had this structure:
CREATE TABLE IF NOT EXISTS `stat_test_other` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
With 12k records the two separate tables were actually slightly bigger than the one storing everything, but with 48k records the two tables were noticeably smaller.
In the end I didn't split the data into two tables to solve my initial space problem. I managed to considerably reduce the size of the database by removing the redundant id_off index and adjusting the data types (in most cases unsigned smallint was more than enough to store all the values I needed). Note that originally stat_type was also of type int, and for this column unsigned tinyint was enough. All in all this reduced the size of the database from 1.5 GB to 600 MB (and my limit was just 2 GB for the database). Another advantage of this solution was that I didn't have to modify a single line of code to make everything work (since the site was written by someone else, I didn't have to spend hours trying to understand the source code).
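For reference, that change amounts to a statement like the following (a sketch against the test table above; which columns can safely be shrunk depends on your actual value ranges):
ALTER TABLE stat_test
  DROP INDEX id_off,                              -- redundant: id_off already leads the primary key
  MODIFY stat_count SMALLINT UNSIGNED NOT NULL,
  MODIFY stat_type TINYINT UNSIGNED NOT NULL;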

Keeping video viewing statistics breakdown by video time in a database

I need to keep a number of statistics about the videos being watched, and one of them is what parts of the video are being watched most. The design I came up with is to split the video into 256 intervals and keep the floating-point number of views for each of them. I receive the data as a number of intervals the user watched continuously. The problem is how to store them. There are two solutions I see.
Row per every video segment
Let's have a database table like this:
CREATE TABLE `video_heatmap` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`position` tinyint(3) unsigned NOT NULL,
`views` float NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_lookup` (`video_id`,`position`)
) ENGINE=MyISAM
Then, whenever we have to process a number of views, make sure there are the respective database rows and add appropriate values to the views column. I found out it's a lot faster if the existence of rows is taken care of first (SELECT COUNT(*) of rows for a given video and INSERT IGNORE if they are lacking), and then a number of update queries is used like this:
UPDATE video_heatmap
SET views = views + ?
WHERE video_id = ? AND position >= ? AND position < ?
This seems, however, a little bloated. The other solution I came up with is
Row per video, update in transactions
A table will look (sort of) like this:
CREATE TABLE video (
id INT NOT NULL AUTO_INCREMENT,
heatmap BINARY(1024) NOT NULL, -- 4 bytes * 256 segments
...
) ENGINE=InnoDB
Then, every time a view needs to be stored, it will be done in a transaction with a consistent snapshot, in a sequence like this:
If the video doesn't exist in the database, it is created.
A row is retrieved; heatmap, an array of floats stored in binary form, is converted into a form more friendly for processing (in PHP).
Values in the array are increased appropriately and the array is converted back.
Row is changed via UPDATE query.
So far the advantages can be summed up like this:
First approach
Stores data as floats, not as some magical binary array.
Doesn't require transaction support, so doesn't require InnoDB, and we're using MyISAM for everything at the moment, so there won't be any need to mix storage engines. (only applies in my specific situation)
Doesn't require a transaction WITH CONSISTENT SNAPSHOT. I don't know what the performance penalties of those are.
I already implemented it and it works. (only applies in my specific situation)
Second approach
Uses a lot less storage space (the first approach stores the video ID 256 times and a position for every segment of the video, not to mention the primary key).
Should scale better, because of InnoDB's per-row locking as opposed to MyISAM's table locking.
Might generally work faster because far fewer requests are being made.
Easier to implement in code (although the other one is already implemented).
So, what should I do? If it wasn't for the rest of our system using MyISAM consistently, I'd go with the second approach, but currently I'm leaning to the first one. But maybe there are some reasons to favour one approach or another?
Second approach looks tempting at first sight, but it makes queries like "how many views for segment x of video y" unable to use an index on video.heatmap. Not sure if this is a real-life concern for you though. Also, you would have to parse back and forth the entire array every time you need data for one segment only.
But first and foremost, your second solution is hackish (but interesting nonetheless). I wouldn't recommend denormalising your database until you face an actual performance issue.
Also, try populating the video_heatmap table in advance with views = 0 as soon as a video is inserted (a trigger can help).
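A hedged sketch of such a trigger, assuming your application has a video table with an id primary key (names are illustrative; the loop simply creates the 256 zeroed segments):
DELIMITER //
CREATE TRIGGER video_heatmap_init
AFTER INSERT ON video
FOR EACH ROW
BEGIN
  DECLARE pos INT DEFAULT 0;
  WHILE pos < 256 DO
    -- one row per segment, starting at zero views
    INSERT IGNORE INTO video_heatmap (video_id, position, views)
    VALUES (NEW.id, pos, 0);
    SET pos = pos + 1;
  END WHILE;
END//
DELIMITER ;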
If space is really a concern, remove your surrogate key video_heatmap.id and instead make (video_id, position) the primary key (then get rid of the superfluous UNIQUE constraint). But this shouldn't come into the equation. 256 x 12 bytes per video (rough row length with 3 numeric columns, okay, add some for the index) is only an extra 3 KB per video!
Finally, nothing prevents you from switching your current table to InnoDB and leveraging its row-level locking capability.
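Concretely, those last two suggestions might look like this (just a sketch; only do it if nothing in your code relies on video_heatmap.id):
ALTER TABLE video_heatmap
  DROP COLUMN id,
  DROP INDEX idx_lookup,
  ADD PRIMARY KEY (video_id, position),
  ENGINE = InnoDB;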
Please note I fail to understand why views cannot be an UNSIGNED INT. I would recommend changing this type.

Time series data base for scientific experiments

I have to perform scientific experiments using time series.
I intend to use MySQL as the data storage platform.
I'm thinking of using the following set of tables to store the data:
Table1 --> ts_id (store the time series index, I will have to deal with several time series)
Table2 --> ts_id, obs_date, value (should be indexed by {ts_id, obs_date})
Because there will be many time series (hundreds) each with possibly millions of observations, table 2 may grow very large.
The problem is that I have to replicate this experiment several times, so I'm not sure what would be the best approach:
add an experiment_id to the tables and allow them to grow even more.
create a separate database for each experiment.
If option 2 is better (I personally think so), what would be the best logical way to do this? I have many different experiments to perform, each needing replication. If I create a different database for every replication, I'd get hundreds of databases pretty soon. Is there a way to logically organize them, such as each replication being a "sub-database" of its experiment's master database?
You might want to start out by considering how you will need to analyze your data.
Presumably your analysis will need to know about experiment name, experiment replica number, and internal replicates (e.g. at each timepoint there are 3 "identical" subjects measured for each treatment). So your db schema might be something like this:
CREATE TABLE experiments (
  exp_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  exp_name VARCHAR(45)
  -- other fields that any kind of experiment can have
) ENGINE=InnoDB;

CREATE TABLE replicates (
  rep_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  exp_id INT UNSIGNED NOT NULL,
  -- other fields that any kind of experiment replica can have
  FOREIGN KEY (exp_id) REFERENCES experiments (exp_id)
) ENGINE=InnoDB;

CREATE TABLE subjects (
  subject_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  subject_name VARCHAR(45)
  -- other fields that any kind of subject can have
) ENGINE=InnoDB;

CREATE TABLE observations (
  ob_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  rep_id INT UNSIGNED NOT NULL,
  subject_id INT UNSIGNED NOT NULL,
  ob_time TIMESTAMP,
  -- other fields to hold the measurements you make at each timepoint
  FOREIGN KEY (rep_id) REFERENCES replicates (rep_id),
  FOREIGN KEY (subject_id) REFERENCES subjects (subject_id)
) ENGINE=InnoDB;
If you have internal replicates you'll need another table to hold the internal replicate/subject relationship.
Don't worry about your millions of rows. As long as you index sensibly, there won't likely be any problems. But if worse comes to worst you can always partition your observation table (likely to be the largest) by rep_id.
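As a rough sketch of that fallback (names follow the schema above; note that MySQL needs the partitioning column in every unique key, so the primary key would have to become (ob_id, rep_id), and partitioned InnoDB tables cannot carry the foreign keys shown earlier):
ALTER TABLE observations
  PARTITION BY HASH (rep_id) PARTITIONS 16;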
Should you have more than one data base, one for each experiment?
The answer to your question hinges on your answer to this question: Will you want to do a lot of analysis that compares one experiment to another?
If you will do a lot of experiment-to-experiment comparison, it will be a horrendous pain in the neck to have a separate data base for every experiment.
I think your suggestion of an experiment ID column in your observation table is a fine idea. That way you can build an experiment table with an overall description of your experiment. That table can also hold the units of observation in your value column (e.g. temp, voltage, etc).
If you have some kind of complex organization of your multiple experiments, you can store that organization in your experiment table.
Notice that MySQL is quite efficient at slinging around short rows of data. You can buy a nice server for the cost of a few dozen hours of your labor, or rent one on a cloud service for the cost of a few hours of labor.
Notice also that MySQL offers the MERGE storage engine. http://dev.mysql.com/doc/refman/5.5/en/merge-storage-engine.html This allows a bunch of different tables with the same column structure to be accessed as if they were one table. This would allow you to store results from individual experiments or groups of them in their own tables, and then access them together. If you have problems scaling up your data collection system, you may want to consider this. But the good news is you can get your database working first and then convert to this.
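A hedged sketch of that setup, with illustrative table names (the underlying tables must be identical MyISAM tables):
CREATE TABLE observations_exp1 (
  ob_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  rep_id INT UNSIGNED NOT NULL,
  ob_time TIMESTAMP,
  value DOUBLE
) ENGINE=MyISAM;

CREATE TABLE observations_exp2 LIKE observations_exp1;

CREATE TABLE observations_all (
  ob_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  rep_id INT UNSIGNED NOT NULL,
  ob_time TIMESTAMP,
  value DOUBLE,
  INDEX (ob_id)
) ENGINE=MERGE UNION=(observations_exp1, observations_exp2) INSERT_METHOD=LAST;

SELECT COUNT(*) FROM observations_all;   -- reads across both underlying tables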
Another question: why do you have a table with nothing but ts_id values in it? I don't get that.

MySql - WAMP - Huge Table is very slow (20 million rows)

So I posted a question yesterday and got a perfect answer, which required running this code first: ALTER TABLE mytable AUTO_INCREMENT=10000001;
I ran it several times, but restarted WAMP after a couple of hours of it not working. After running overnight (12 hours), the code still hadn't run.
I am wondering if my database table size is past the limits of mysql or my computer or both.
However, I have a sneaky suspicion that proper indexing or some other factor could greatly impact my performance. I know 20 million is a lot of rows, but is it too much?
I don't know much about indexes, except that they are important. I attempted to add them to the name and state fields, which I believe I did successfully.
Incidentally, I am trying to add a unique ID field, which is what my post yesterday was all about.
So, the question is: Is 20 million rows outside the scope of MySQL? If not, am I missing an index or some other setting that would help me work better with these 20 million rows? Can I put indexes on all the columns and make it super fast?
As always, thanks in advance...
Here are the specs:
My PC is XP, running WAMPSERVER, Win32 NTFS, Intel Core 2 Duo T9300 @ 2.50 GHz (running at 1.17 GHz), 1.98 GB of RAM
DB: 1 table, 20 million rows
The size of the tables is:
Data 4.4 Gigs, Indexes 1.3 Gigs, Total 5.8 Gigs
The indexes are set up on the 'BUSINESS NAME' and 'STATE' fields
The table fields are like this:
`BUSINESS NAME` TEXT NOT NULL,
`ADDRESS` TEXT NOT NULL,
`CITY` TEXT NOT NULL,
`STATE` TEXT NOT NULL,
`ZIP CODE` TEXT NOT NULL,
`COUNTY` TEXT NOT NULL,
`WEB ADDRESS` TEXT NOT NULL,
`PHONE NUMBER` TEXT NOT NULL,
`FAX NUMBER` TEXT NOT NULL,
`CONTACT NAME` TEXT NOT NULL,
`TITLE` TEXT NOT NULL,
`GENDER` TEXT NOT NULL,
`EMPLOYEE` TEXT NOT NULL,
`SALES` TEXT NOT NULL,
`MAJOR DIVISION DESCRIPTION` TEXT NOT NULL,
`SIC 2 CODE DESCRIPTION` TEXT NOT NULL,
`SIC 4 CODE` TEXT NOT NULL,
`SIC 4 CODE DESCRIPTION` TEXT NOT NULL
Some answers:
20 million rows is well within the capability of MySQL. I work on a database that has over 500 million rows in one of its tables. It can take hours to restructure a table, but ordinary queries aren't a problem as long as they're assisted by an index.
Your laptop is pretty out of date and underpowered to use as a high-scale database server. It's going to take a long time to do a table restructure. The low amount of memory and typically slow laptop disk is probably constraining you. You're probably using default settings for MySQL too, which are designed to work on very old computers.
I wouldn't recommend using TEXT data type for every column. There's no reason you need TEXT for most of those columns.
Don't create an index on every column, especially if you insist on using TEXT data types. You can't even index a TEXT column unless you define a prefix index. In general, choose indexes to support specific queries.
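For example, a TEXT column can only be indexed through a prefix; something like this (the 50-character prefix is just an illustration):
ALTER TABLE mytable ADD INDEX idx_business_name (`BUSINESS NAME`(50));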
You probably have many other questions based on the above, but there's too much to cover in a single StackOverflow post. You might want to take training or read a book if you're going to work with databases.
I recommend High Performance MySQL, 2nd Edition.
Re your followup questions:
For MySQL tuning, here's a good place to start: http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/
Many ALTER TABLE operations cause a table restructure, which means basically lock the table, make a copy of the whole table with the changes applied, and then rename the new and old tables and drop the old table. If the table is very large, this can take a long time.
A TEXT data type can store up to 64KB, which is overkill for a phone number or a state. I would use CHAR(10) for a typical US phone number. I would use CHAR(2) for a US state. In general, use the most compact and thrifty data type that supports the range of data you need in a given column.
It's going to take a long time because you've only got 2GB RAM and 6GB of data/indexes and it's going to force a ton of swapping in/out between RAM and disk. There's not much you can do about that, though.
You could try running this in batches.
Create a separate empty table with the auto_increment column included in it. Then insert your records a certain amount at a time (say, 1 state at a time). That might help it go faster since you should be able to handle those smaller datasets completely in memory instead of paging to disk.
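A hedged sketch of that batched copy, with the column list abbreviated, the new table name made up, and assuming STATE holds two-letter abbreviations:
CREATE TABLE mytable_new (
  `ID` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `BUSINESS NAME` VARCHAR(255) NOT NULL,
  `STATE` CHAR(2) NOT NULL
  -- ... remaining columns ...
) ENGINE=InnoDB;

-- copy one state's worth of rows at a time
INSERT INTO mytable_new (`BUSINESS NAME`, `STATE`)
SELECT `BUSINESS NAME`, `STATE`
FROM mytable
WHERE `STATE` = 'AK';
-- repeat for each remaining state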
You'll probably get a lot better responses for this if it's on dba.stackexchange.com also.
I believe the hardware is fine, but you need to use your resources much more sparingly.
Db structure optimization!
Do not use TEXT!
For phone numbers use bigint unsigned. Any signs or alpha characters must be parsed out and converted.
For any other alpha-numeric column use e.g. varchar(32) to varchar(256), sized to the data.
Zip-code is of course mediumint unsigned.
Gender should be enum('Male','Female')
Sales could be an int unsigned
State should be enum('Alaska',...)
Country should be enum('Albania',...)
When building a large index the fastest way is to create a new table and do INSERT INTO ... SELECT FROM ... rather than ALTER TABLE ....
Changing the State and Country fields to enum will drastically reduce your index size.
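Putting those pieces together, a rough sketch of the rebuild (column list trimmed, enum list truncated, phone clean-up shown only for common punctuation, and assuming the existing STATE values match the enum members and zip codes are plain digits):
CREATE TABLE mytable_optimized (
  `BUSINESS NAME` VARCHAR(255) NOT NULL,
  `STATE` ENUM('Alabama','Alaska','Arizona') NOT NULL,   -- full state list in practice
  `ZIP CODE` MEDIUMINT UNSIGNED NOT NULL,
  `PHONE NUMBER` BIGINT UNSIGNED NOT NULL,
  INDEX (`BUSINESS NAME`),
  INDEX (`STATE`)
) ENGINE=InnoDB;

INSERT INTO mytable_optimized (`BUSINESS NAME`, `STATE`, `ZIP CODE`, `PHONE NUMBER`)
SELECT `BUSINESS NAME`, `STATE`, `ZIP CODE`,
       CAST(REPLACE(REPLACE(REPLACE(REPLACE(`PHONE NUMBER`, '-', ''), '(', ''), ')', ''), ' ', '') AS UNSIGNED)
FROM mytable;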

Mysql InnoDB performance optimization and indexing

I have 2 databases and I need to link information between two big tables (more than 3M entries each, continuously growing).
The 1st database has a table 'pages' that stores various information about web pages, and includes the URL of each one. The column 'URL' is a varchar(512) and has no index.
The 2nd database has a table 'urlHops' defined as:
CREATE TABLE urlHops (
dest varchar(512) NOT NULL,
src varchar(512) DEFAULT NULL,
timestamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY dest_key (dest),
KEY src_key (src)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Now, I need basically to issue (efficiently) queries like this:
select p.id,p.URL from db1.pages p, db2.urlHops u where u.src=p.URL and u.dest=?
At first, I thought to add an index on pages(URL). But it's a very long column, and I already issue a lot of INSERTs and UPDATEs on the same table (way more than the number of SELECTs I would do using this index).
Other possible solutions I thought are:
-adding a column to pages, storing the md5 hash of the URL and indexing it; this way I could do queries using the md5 of the URL, with the advantage of an index on a smaller column.
-adding another table that contains only page id and page URL, indexing both columns. But this is maybe a waste of space, having only the advantage of not slowing down the inserts and updates I execute on 'pages'.
I don't want to slow down the inserts and updates, but at the same time I would like to be able to do the queries on the URL efficiently. Any advice?
My primary concern is performance; if needed, wasting some disk space is not a problem.
Thank you, regards
Davide
The MD5 hash suggestion you had is very good - it's documented in High Performance MySQL 2nd Ed. There's a couple of tricks to get it to work:
CREATE TABLE urls (
id INT UNSIGNED NOT NULL primary key auto_increment,
url varchar(255) not null,
url_crc32 INT UNSIGNED not null,
INDEX (url_crc32)
);
Select queries have to look like this:
SELECT * FROM urls WHERE url='http://stackoverflow.com' AND url_crc32=crc32('http://stackoverflow.com');
The url_crc32 column is designed to work with the index; including url in the WHERE clause prevents wrong results from hash collisions.
I'd probably recommend crc32 over md5. There will be a few more collisions, but you have a higher chance of fitting the whole index in memory.
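A usage sketch: the checksum column has to be maintained alongside the URL, e.g. at insert time (the URL value here is just an example):
INSERT INTO urls (url, url_crc32)
VALUES ('http://stackoverflow.com', CRC32('http://stackoverflow.com'));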
If pages to URLs is a 1-to-1 relationship and that table has a unique id (primary key?), you could store that id value in the src and dest fields in the urlHops table instead of the full URL.
This would make indexing and joins much more efficient.
I would create a page_url table that has auto-inc integer primary key, and your URL value. Then update Pages and urlHops to use page_url.id.
Your urlHops would become (dest int,src int,...)
Your Pages table would replace url with pageid.
Index page_url.url field, and you should be good to go.
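A rough sketch of that layout (the page_url definition and the pageid column are illustrative, since the original pages table isn't shown in the question):
CREATE TABLE page_url (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(512) NOT NULL,
  UNIQUE KEY uq_url (url)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE urlHops (
  dest INT UNSIGNED NOT NULL,
  src INT UNSIGNED DEFAULT NULL,
  timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY dest_key (dest),
  KEY src_key (src)
) ENGINE=InnoDB;

-- the original query then joins on integers instead of varchar(512):
SELECT p.id, pu.url
FROM pages p
JOIN page_url pu ON pu.id = p.pageid
JOIN urlHops u ON u.src = p.pageid
WHERE u.dest = ?;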