I have encountered a problem when designing the table schema for our system.
Here is the situation:
Our system has a lot of items (more than 20 million), and each item has a unique id, but each item can have lots of records. For example, the item with id 1 has about 5,000 records, and each record has more than 20 attributes. A record needs to be identified by its item id and by the status of one or more of its attributes for use in select, update, or delete.
I want to use InnoDB, but the problem is that with InnoDB there must be a clustered index. Given the situation described above, it seems hard to find a natural clustered index, so I can only use an auto_increment int as the key.
The current design is as follows:
create table record (
  item_key int(10) unsigned NOT NULL AUTO_INCREMENT,
  item_id int(10) unsigned NOT NULL,
  attribute_1 char(32) NOT NULL,
  attribute_2 int(10) unsigned NOT NULL,
  ...
  attribute_20 int(10) unsigned NOT NULL,
  PRIMARY KEY (`item_key`),
  KEY `iattribute_1` (`item_id`,`attribute_1`),
  KEY `iattribute_2` (`item_id`,`attribute_2`)
) ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=latin1
The SQL statement:
select * from record
where item_id=1 and attribute_1='a1' and attribute_2 between 10 and 1000;
The update and delete statements are similar.
I don't think this is a good design, but I can't think of anything else; all suggestions welcome.
Sorry if I didn't make the question clear.
What I want to access (select, update, delete, insert) is the records, not the items.
The items have their own attributes, but the attributes mentioned in the description above belong to the records.
Every item can have many records; for example, item 1 has about 5,000 records.
Every record has 42 attributes, some of which can be NULL. Every record also has a unique id, and this id is unique across different items, but it is a string, not a number.
I want to access the records in this way:
A. I will only get (or update or delete) the records that belong to one specific item at a time, in one query.
B. I will get or update the values of all attributes, or of some specific attributes, in the query.
C. The attributes in the query condition may not be the same as the attributes I want to retrieve.
So there could be some SQL statements like:
Select attribute_1, attribute_N from record_table_1 where item_id=1 and attribute_K='some value' and attribute_M between 10 and 100
And the reasons why I think the original design is not good are:
I can't choose an attribute or the record id as the primary key, because neither is useful: in every query I have to give the item id and some attributes as the query condition (like "where item_id=1 and attribute_1='value1' and attribute_2 between 2 and 3"), so I can only use an auto_increment int as the primary key. The result is that every query has to traverse two B-trees, first the secondary index and then the clustered index, and the secondary-index scan does not look efficient.
Compound keys also seem useless, because the query condition can vary across many attributes.
With the original design it seems I would have to add a lot of indexes to satisfy the different queries, or else accept full table scans; but obviously too many indexes hurt update, delete, and insert operations.
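To see which index the optimizer actually picks for a given query, EXPLAIN can be run against it; a minimal sketch using the query above:

EXPLAIN select * from record
where item_id=1 and attribute_1='a1' and attribute_2 between 10 and 1000;

-- The key column shows the chosen index (e.g. iattribute_1); because the query
-- selects all columns, every matching index entry then costs a second lookup
-- into the clustered index via the item_key stored in the secondary index.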
If you want a clustered index and don't want to use the MyISAM engine, it sounds like you should use two tables: one for the unique properties of the items, and another for each instance of the item (with the specified attributes).
You're right, the schema is wrong. Having attributes 1..20 as fields within the table is not the way to do this; you need a separate table to store this information. That table would hold the item_key of the record together with its own key and a value, and so this second table would have indexes that allow much better searching.
Something like the following:
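A minimal DDL sketch of that layout (the table and column names here are illustrative, not from the original design):

create table record (
  record_id int unsigned NOT NULL AUTO_INCREMENT,  -- one row per record
  item_id int unsigned NOT NULL,                   -- the item the record belongs to
  PRIMARY KEY (record_id),
  KEY idx_item (item_id)
) ENGINE=InnoDB;

create table record_attribute (
  record_id int unsigned NOT NULL,        -- points back to record
  attribute_name varchar(64) NOT NULL,    -- e.g. 'attribute_1'
  attribute_value varchar(255) NOT NULL,  -- stored as a string; cast as needed
  PRIMARY KEY (record_id, attribute_name),
  KEY idx_name_value (attribute_name, attribute_value)
) ENGINE=InnoDB;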
Looking at the diagram, it is obvious that something is wrong, because the record table is too empty; it doesn't look right to me, so maybe I'm missing something in the original question....
Compound Keys
I think maybe you are looking for a compound key rather than a clustered index, which is a different thing. You can achieve this by:
create table record (
  item_id int(10) unsigned NOT NULL,
  attribute_1 char(32) NOT NULL,
  attribute_2 int(10) unsigned NOT NULL,
  ...
  attribute_20 int(10) unsigned NOT NULL,
  PRIMARY KEY (`item_id`,`attribute_1`,`attribute_2`),
  -- note: iattribute_1 duplicates a left prefix of the primary key above
  KEY `iattribute_1` (`item_id`,`attribute_1`),
  KEY `iattribute_2` (`item_id`,`attribute_2`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Related
I have a table definition:
CREATE TABLE `k_timestamps` (
  `id` bigint(20) NOT NULL,
  `k_timestamp` datetime NULL DEFAULT NULL,
  `data1` smallint(6) NOT NULL,
  KEY `k_timestamp_key` (`k_timestamp`,`id`) USING BTREE,
  CONSTRAINT `k_time_fk` FOREIGN KEY (`id`) REFERENCES `data` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Basically, I have a whole lot of id and data1 key-value pairs, and every few hours I either add key-value pairs not seen before to the list, or the value of a previous id changes. I want to track what the values were for every id over time. Thus, the id column can contain duplicate ids and is not the primary key.
Side note, k_time_fk points to another, much smaller table that has common information for a particular id regardless of what the current time is or value it currently holds.
(id, k_timestamp) should be thought of as the (composite) primary key of the table.
For example,
id          k_timestamp          data1
1597071247  2012-11-15 12:25:47  4
1597355222  2012-11-15 12:25:47  4
1597201376  2012-11-15 12:25:47  4
1597071243  2012-11-15 13:25:47  4
1597071247  2012-11-15 13:25:47  3
1597071249  2012-11-15 13:25:47  3
Anyways, I ran this query:
SELECT concat(table_schema,'.',table_name),
concat(round(table_rows/1000000,2),'M') rows,
concat(round(data_length/(1024*1024*1024),2),'G') DATA,
concat(round(index_length/(1024*1024*1024),2),'G') idx,
concat(round((data_length+index_length)/(1024*1024*1024),2),'G') total_size,
round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES ORDER BY data_length+index_length DESC LIMIT 20;
To pull space info on my table:
rows    Data   idx    total_size  idxfrac
11.25M  0.50G  0.87G  1.36G       1.76
I'm not really sure I understand this: how can the index be taking up so much space? Is there something obvious I did wrong here, or is this normal? I'm looking to reduce the footprint of this table if possible. I'm not even really sure what k_timestamp_key really buys me; can it be safely deleted?
The index is bigger because InnoDB assigns a hidden 6-byte primary key when the table has no unique column that it can treat as the clustered index, and all other indexes in the table also contain the primary key; see 14.2.3.12.2 Clustered and Secondary Indexes in the manual.
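Given that the question treats (id, k_timestamp) as the composite primary key, one option, assuming the pair really is unique and the timestamps contain no NULLs, is to declare it explicitly, so the secondary index carries the real key instead of the hidden row id; a sketch:

ALTER TABLE k_timestamps
  MODIFY k_timestamp datetime NOT NULL,  -- primary key columns cannot be NULL
  ADD PRIMARY KEY (id, k_timestamp);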
Firstly, yes, this is pretty normal behaviour, as innvo writes.
Secondly, you can optimize the table and its index using OPTIMIZE TABLE. As your primary key is likely to be "fragmented" - i.e. it's not safe to assume that an inserted row is physically next to the previous row - there may be some gains there.
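A minimal example of that step (on InnoDB, OPTIMIZE TABLE is carried out as a full rebuild of the table and its indexes):

OPTIMIZE TABLE k_timestamps;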
Finally, you may not need a primary key on the table, but you almost certainly need an index if you're querying across millions of rows...
Let's say we have an (InnoDB) table associations in a MySQL database which has the following structure:
CREATE TABLE `associations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `fk_id_1` int(11) NOT NULL,
  `fk_id_2` int(11) NOT NULL,
  `fk_id_3` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `some_unique_constraint` (`fk_id_1`,`fk_id_2`),
  KEY `fk_id_2_INDEX` (`fk_id_2`),
  KEY `fk_id_3_INDEX` (`fk_id_3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin$$
There are gaps in the column id (I know this is an artifact of how the auto-incremented value is generated when multiple threads request one). Since no other table uses the column id as a reference, I plan to drop the column and create it again, hoping the holes in the counting will be gone. I backed up my database and tested this. The result was a little confusing: the order of the rows seemed to have changed. If I am not mistaken, the order is first by fk_id_1, then fk_id_2, then fk_id_3.
Is this the natural order in which MySQL stores the table when assigning a newly generated auto-increment key to the rows?
Is there more I should know about what happens during this process?
The reason I need to know about this is that I need to make the column id useful for another task I intend to accomplish, where gaps are a no-go.
There is no natural order to a table in any mainstream RDBMS.
Only the outermost ORDER BY in a SELECT statement will guarantee the order of results.
If you want "order":
create a new table
INSERT..SELECT..ORDER BY fk_id_1, fk_id_2, fk_id_3
Drop old table
Rename new table
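A minimal sketch of those four steps, using the associations table above (do this in a maintenance window, since writes between the steps would be lost):

CREATE TABLE associations_new LIKE associations;

-- Re-insert in the desired order; id is re-generated without gaps
INSERT INTO associations_new (fk_id_1, fk_id_2, fk_id_3)
SELECT fk_id_1, fk_id_2, fk_id_3
FROM associations
ORDER BY fk_id_1, fk_id_2, fk_id_3;

DROP TABLE associations;
RENAME TABLE associations_new TO associations;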
Or live with gaps... OCD isn't good for developers
Edit:
Question says "no dependency" on this value but turns out there is.
If gaps are not allowed, then don't use an autonumber: use fk_id_1, fk_id_2, fk_id_3 as your key, with a ROW_NUMBER emulation (a sketch follows), or code your downstream to deal with gaps.
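Older MySQL versions have no ROW_NUMBER() window function (MySQL 8.0+ supports ROW_NUMBER() OVER (ORDER BY ...) directly), but it can be emulated with a user variable; a minimal sketch:

SELECT t.fk_id_1, t.fk_id_2, t.fk_id_3,
       @rn := @rn + 1 AS row_num  -- gap-free, generated at query time
FROM (SELECT fk_id_1, fk_id_2, fk_id_3
      FROM associations
      ORDER BY fk_id_1, fk_id_2, fk_id_3) AS t,
     (SELECT @rn := 0) AS init;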
Autonumbers will have gaps: immutable fact of life.
I have the following table:
create table stuff (
  id mediumint unsigned not null auto_increment primary key,
  title varchar(150) not null,
  link varchar(250) not null,
  time timestamp default current_timestamp not null,
  content varchar(1500)
);
If I EXPLAIN the query
select id from stuff order by id;
then it says it uses the primary key as an index for ordering the results. But with this query:
select id,title from stuff order by id;
EXPLAIN says no possible keys and it resorts to filesort.
Why is that? Isn't the data of a certain row stored together in the database? If it can order the results using the index when I'm querying only the id, why does adding another column to the query make a difference? The primary key already identifies the row, so I think it should use the primary key for ordering in the second case too.
Can you explain why this is not the case?
Sure, because it is more performant for this query: otherwise it would have to read the full index and then fetch the matching rows one by one from the data file, which is extremely inefficient. Instead, MySQL prefers to read the data straight from the data file and sort it (the filesort).
Also, what kind of storage engine do you use? It seems like MyISAM.
For this case InnoDB would be more efficient, since it uses a clustered index on the primary key (which is monotonically increasing in your case), so the rows are already stored in id order.
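Alternatively, on MyISAM a covering index that contains both columns lets the second query be answered from the index alone, in index order, with no filesort; a sketch (the index name is illustrative):

ALTER TABLE stuff ADD INDEX idx_id_title (id, title);

-- EXPLAIN should now show "Using index" and no filesort:
EXPLAIN select id, title from stuff order by id;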
So I have a class that creates a table to be populated with data. Right now all the tables have the same column names (product_name, date, etc.). I noticed that when I view my tables in Webmin there is only one index named "product_date", despite the fact that there are, supposedly, two tables using the index. I don't think this can be good.
My question is whether or not this will cause a conflict in the future. I don't want to populate the tables with thousands of rows if I'm only going to need to restructure everything later. I can't imagine that I'm the first to encounter this... maybe I'm just misinformed about how indexes work and how Webmin displays them, and I'm being overly paranoid.
(edit)
In response to one comment below, here are the results of SHOW CREATE TABLE tablename:
c_1 | CREATE TABLE c_1 (
  p_id int(11) NOT NULL auto_increment,
  nm varchar(100) NOT NULL,
  m_name text NOT NULL,
  PRIMARY KEY (p_id),
  KEY nm (nm),
  FULLTEXT KEY m_name (m_name)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

c_2 | CREATE TABLE c_2 (
  p_id int(11) NOT NULL auto_increment,
  ne varchar(100) NOT NULL,
  m_name text NOT NULL,
  PRIMARY KEY (p_id),
  KEY nm (ne),
  FULLTEXT KEY metaphone_name (m_name)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Note that all the indexes on equivalent columns are named the same way.
If it's an index per table, no problem
If I understand your question correctly (a big if), you must create an index for each table. Indexes do not span more than one table, unless you get into advanced concepts like indexed/materialized views, which I don't think MySQL supports.
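To confirm that each table has its own index regardless of the shared name, a quick check:

SHOW INDEX FROM c_1;
SHOW INDEX FROM c_2;
-- Each statement lists only that table's indexes; identical index names in
-- different tables do not collide.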
I am sorry if this is a dumb question (because it sounds unlikely).
I have a table with 20 million rows. However, only about 300K of these rows get accessed regularly, and they can be identified by a column condition, app_user=1.
Is there any way I can index just those rows, given that whenever I run a select I will be sure to pass in that condition as well?
I would recommend splitting the table into two separate tables. But in case you don't want to do that, the highest-performance option, if you're always going to include "where app_user=1" in your queries, is to create a primary key on the table that has the app_user column as its first part. InnoDB will use this as the clustered index, which saves you a few extra disk accesses. You can create the table like this:
create table testTable (
  app_user tinyint UNSIGNED default 0,
  id int UNSIGNED NOT NULL,
  name varchar(255) default '',
  PRIMARY KEY k1(app_user, id)
) ENGINE=InnoDB;
A friend wrote this article on clustered indexes in InnoDB a while back:
http://www.joehruska.com/?p=6
Add a column called app_user and index on that, then pass in "WHERE app_user = 1" in your query.
You could go further to partition your table based on that column.
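A sketch of that partitioning idea, reusing the testTable definition above (partition names are illustrative, and this assumes app_user only takes the values 0 and 1; note that the partition column must be part of every unique key, which (app_user, id) satisfies):

ALTER TABLE testTable
PARTITION BY LIST (app_user) (
  PARTITION p_active VALUES IN (1),  -- the ~300K hot rows
  PARTITION p_rest   VALUES IN (0)   -- everything else
);

-- Queries that filter on app_user = 1 are pruned to p_active only:
SELECT * FROM testTable WHERE app_user = 1 AND id = 42;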