I am designing a website and have a table that deals with a lot of inserts. Each month this table will receive at least 50 million records.
Currently I am using the BIGINT UNSIGNED data type as the primary key of this table.
CREATE TABLE `class`.`add_contact_details`
(
`con_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`add_id_ref` BIGINT UNSIGNED NOT NULL,
`con_name` VARCHAR(200),
`con_email` VARCHAR(200),
`con_phone` VARCHAR(200),
`con_fax` VARCHAR(200),
`con_mailbox` VARCHAR(500),
`con_status_show_email` TINYINT(1),
`con_status_show_phone` TINYINT(1),
`con_status_show_fax` TINYINT(1),
`con_status_show_mailbox` TINYINT(1),
PRIMARY KEY (`con_id`) ) ENGINE=INNODB CHARSET=latin1 COLLATE=latin1_swedish_ci;
After doing a lot of research I found that many people worry about using BIGINT because it consumes more memory and needs a lot of space.
I then found an article describing an alternative. Here it is:
"You could use a combined ( tinyint, int ) key. The tinyint would start at, and default to, 1. IF the int value is ever about to overflow, you change the tinyint's default value to 2, and reset the int value to 1. You can create code that runs every day, or on another applicable schedule, which checks for that condition and makes that change if needed."
It makes sense, right? Is there anybody who is using this?
What should I use, considering performance?
Is there any alternative enterprise-level solution for this?
Stick with the BIGINT. You do save two or three bytes using the dual key, but you do pay for it.
References to the table need to use two keys instead of one, so all foreign key relationships are more complicated.
WHERE clauses to find specific rows are much more complicated. Consider the difference between:
where id in (1, 2, 3, 4, 5)
and
where id_part1 = 0 and id_part2 = 1 or
id_part1 = 0 and id_part2 = 2 or
. . .
The step to increment the first part is not automatic, requiring either manual intervention or the overhead of a trigger.
This reminds me (in a bad way) of segmented memory architectures that were in vogue 20+ years ago. Be happy the computer can understand 64-bit keys with no problems.
BIGINT needs 8 bytes of storage, so 50 million records add only about 400 MB per month for that column, which shouldn't be an issue at all.
We are running databases (on DB2) with a couple of TB on a single server.
The only thing you should consider for querying by the PK is having an index on that field, which the primary key already provides.
Related
We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always filter on serial_number and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily query it as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application will always access the data based on the device first.
We have checked that there is no query that accesses the data without specifying the device serial number first.
The table for each device will be relatively small (around 9,000 rows per day).
A few challenges that we think we will face are:
We have a lot of devices, which means there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not impose a limit on the number of tables in a database, but will this impact performance compared to keeping everything in one table?
How will this impact us later on when we want to scale MySQL (e.g. using master/slave replication, etc.)?
Are there other alternatives / solutions for resolving this?
Update: below is the SHOW CREATE TABLE result for our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&As on Stack Overflow advising against doing this.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data, because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)); see the sketch after these comments. If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
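Here is the purge-oriented partitioning sketch referred to above. It is illustrative only; note that TO_DAYS() requires a DATE or DATETIME column, so dt is shown as DATETIME here, whereas a TIMESTAMP column would need UNIX_TIMESTAMP() instead. Columns, partition names, and dates are assumptions.

-- Sketch: PARTITION BY RANGE(TO_DAYS(dt)) so old data can be purged with a
-- cheap DROP PARTITION instead of a huge DELETE.
CREATE TABLE device_data_part (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  dt        DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  device_sn VARCHAR(20) NOT NULL,
  PRIMARY KEY (device_sn, dt, id),   -- same PK suggestion as above
  KEY (id)                           -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p2018_01 VALUES LESS THAN (TO_DAYS('2018-02-01')),
  PARTITION p2018_02 VALUES LESS THAN (TO_DAYS('2018-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Purging January then becomes:
ALTER TABLE device_data_part DROP PARTITION p2018_01;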
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id)
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefit for queries with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because the queries it served are probably helped enough by the new PK.
lat/lng decimal(10,5) -- you don't need 5 digits to the left of the decimal point; latitude needs only 2 and longitude only 3. So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
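Put together, a hedged sketch of applying those changes to the existing table (test on a copy first; rebuilding the clustered index on ~44 million rows takes a while):

ALTER TABLE test_udp_new
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (device_sn, dt, id),
  ADD KEY (id),                                   -- keeps AUTO_INCREMENT happy
  DROP KEY dt,                                    -- redundant with (dt, device_sn)
  DROP KEY test_udp_new_device_sn_dt_index,       -- now a prefix of the new PK
  DROP KEY test_udp_new_device_sn_data_dt_index,  -- probably covered well enough by the new PK
  MODIFY lat DECIMAL(7,5),
  MODIFY lng DECIMAL(8,5);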
I am currently facing an issue with designing a database table and updating/inserting values into it.
The table is used to collect and aggregate statistics that are identified by:
the source
the user
the statistic
an optional material (e.g. item type)
an optional entity (e.g. animal)
My main issue is that my proposed primary key is too large because of the VARCHARs that are used to identify a statistic.
My current table is created like this:
CREATE TABLE `Statistics` (
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL)
In particular, the server_id is configurable, the player_id is a UUID, statistic is the representation of an enumeration that may change, material and entity likewise. The value is then aggregated using SUM() to calculate the overall statistic.
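For reference, the aggregation looks roughly like this (a sketch; the exact grouping and the example UUID are assumptions based on the columns above):

-- Sketch: reading back an aggregated statistic for one player.
SELECT server_id, statistic, material, entity, SUM(`value`) AS total
FROM Statistics
WHERE player_id = UNHEX('123e4567e89b12d3a456426614174000')  -- example UUID as BINARY(16)
GROUP BY server_id, statistic, material, entity;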
So far it works, but I have to use DELETE and INSERT statements whenever I want to update a value, because I have no primary key and I can't figure out how to create one within the constraints of MySQL.
My main question is: How can I efficiently update values in this table and insert them when they are not currently present without resorting to deleting all the rows and inserting new ones?
The main issue seems to be the restriction MySQL puts on the primary key. I don't think adding an id column would solve this.
Simply add an auto-incremented id:
CREATE TABLE `Statistics` (
statistics_id int auto_increment primary key,
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL
);
Voila! A primary key. But you probably want an index. One that comes to mind:
create index idx_statistics_server_player_statistic on statistics(server_id, player_id, statistic);
Depending on what your code looks like, you might want additional or different keys in the index, or more than one index.
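A related point, hedged: if the eventual goal is INSERT ... ON DUPLICATE KEY UPDATE, the deciding index has to be UNIQUE (or the primary key), not just a plain index. A sketch under two assumptions: prefix lengths of 64 keep the key within MySQL's index-size limit (this also means rows must differ within the first 64 characters of each column), and material/entity are made NOT NULL DEFAULT '' because NULL values never collide in a UNIQUE index:

ALTER TABLE Statistics
  MODIFY material VARCHAR(255) NOT NULL DEFAULT '',
  MODIFY entity   VARCHAR(255) NOT NULL DEFAULT '',
  ADD UNIQUE KEY uq_statistics (server_id(64), player_id, statistic(64), material(64), entity(64));

-- Example upsert (values are illustrative):
INSERT INTO Statistics (server_id, player_id, statistic, material, entity, `value`)
VALUES ('server-1', UNHEX('123e4567e89b12d3a456426614174000'), 'MINE_BLOCK', 'STONE', '', 10)
ON DUPLICATE KEY UPDATE `value` = `value` + VALUES(`value`);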
Follow the steps below; hope it will solve your problem:
- First add a numeric column, let's call it "detailed", to your table.
- In your project, before running an INSERT statement, get the current maximum of detailed (SELECT MAX(detailed)+1 AS maxid FROM TABLE_NAME) and use that number for the new row; it will help you FETCH and DELETE the record.
- You can also UPDATE with it, but during an UPDATE the maximum of detailed is not required.
Hope this helps.
I have dug a bit more through the internet and optimized my code a lot.
I asked this question because of bad performance, which I assumed was because of the DELETE and INSERT statements following each other.
I was thinking that I could try to reduce the load by doing INSERT IGNORE statements followed by UPDATE statements, or INSERT ... ON DUPLICATE KEY UPDATE statements. But they require keys to be useful, which I didn't have access to because of the constraints in MySQL.
I have fixed the performance issues though:
By reducing the number of statements generated asynchronously (I know JDBC is blocking, but it worked; it just blocked thousands of threads) and disabling auto-commit, I was able to improve the performance by 600 times (from 60 seconds down to 0.1 seconds).
Next steps are to improve the connection string and gain even more performance.
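In SQL terms, disabling auto-commit and committing many inserts at once boils down to something like this (a sketch with made-up values; the real code drives it through JDBC):

-- Sketch: many inserts inside one transaction instead of one commit per row.
SET autocommit = 0;
INSERT INTO Statistics (server_id, player_id, statistic, `value`)
VALUES ('server-1', UNHEX('123e4567e89b12d3a456426614174000'), 'MINE_BLOCK', 1);
INSERT INTO Statistics (server_id, player_id, statistic, `value`)
VALUES ('server-1', UNHEX('123e4567e89b12d3a456426614174000'), 'WALK_ONE_CM', 250);
-- ...thousands more rows...
COMMIT;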
I have gone through some Q&As, e.g.:
How does database indexing work?
Mysql covering vs composite vs column index
These are good reads, but I have some more questions about indexes, i.e.:
Assume the table and execution plans below:
CREATE TABLE student(`id` INT(9),
`name` VARCHAR(50),
`rollNum` INT(9),
`address` VARCHAR(50),
`deleted` int(2) default 0,
Key `name_address_key`(`name`,`deleted`),
Key `name_key`(`name`)
);
Plan 1: explain select * from student where name = "abc" and deleted =0;
it shows key = name_address_key
Plan 2: explain select * from student where name = "abc"
it also shows the same key = name_address_key
My question is:
How does MySQL decide which index to use for the execution plan?
Since the name column is a prefix of name_address_key, that index can be used for matching name just as well as name_key can. There's no reason for it to prefer one over the other, but the cardinality of name_address_key is presumably higher, so it chooses that one.
There's no point in having name_key, since it's redundant with name_address_key and just wastes space.
I would expect it to pick name_key since the size of the index is smaller.
I would recommend removing name_key as being essentially useless, as @Barmar discusses.
Don't use a 4-byte INT for flags (deleted); see TINYINT and other smaller datatypes.
Do have a PRIMARY KEY; a consolidated sketch of these changes follows below.
Another good read (in my biased opinion): Indexing Cookbook
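Putting those points together, a consolidated sketch of the revised table (assuming id is meant to be an auto-increment primary key, and keeping the (name, deleted) index under a clearer name):

CREATE TABLE student (
  `id`      INT(9) NOT NULL AUTO_INCREMENT,
  `name`    VARCHAR(50),
  `rollNum` INT(9),
  `address` VARCHAR(50),
  `deleted` TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`),
  KEY `name_deleted_key` (`name`, `deleted`)   -- name_key dropped as redundant
);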
I have a question regarding primary keys in Relational Databases. Let's assume that I have the following tables:
Box
id
box_name
BoxItems
id
item_name
belongs_to_box_id (foreign key)
Let's also assume that I intend to store millions of items per day. I would probably use bigint or a guid for the BoxItems.Id.
What I was thinking, and I need your advice on this, is that instead of a BIGINT id for BoxItems, I could use a sequential TINYINT number, so that what identifies each item is the combination of belongs_to_box_id plus the tinyint column (e.g. item_number).
So now instead of the above we get:
BoxItems
belongs_to_box_id
item_sequence_number [TINYINT]
item_name
Example:
Items.Insert(1,1, "my item 1");
Items.Insert(1,2, "my item 2");
So instead of using bigint or GUID for that matter, I can use tinyint and save a lot of disk space.
I want to know the pros and cons of such an approach. I am developing my app using MySQL and ASP.NET 4.5.
When you think about it, there's really not much difference between the "box/contents" problem and the "order/line item" problem.
create table boxes (
box_id integer primary key,
box_name varchar(35) not null
);
create table boxed_items (
box_id integer not null references boxes (box_id),
box_item_num tinyint not null,
item_name varchar(35) not null
);
For MySQL, you'd probably use unsigned integer and unsigned tinyint. There's no compelling reason for a database to avoid negative numbers, but developers should lean on the Principle of Least Surprise.
Make sure 256 values (the most an unsigned TINYINT can hold) are enough per box. Getting that wrong can be expensive to correct in a table that gets millions of rows each day.
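One way the per-box sequence could be assigned (a sketch; box_id = 1 and the item name are illustrative, and under concurrent inserts this needs to run inside a transaction, or the application must hand out the numbers itself):

INSERT INTO boxes (box_id, box_name) VALUES (1, 'first box');

-- Next sequence number for this box = current max + 1 (1 if the box is empty).
INSERT INTO boxed_items (box_id, box_item_num, item_name)
SELECT 1, COALESCE(MAX(box_item_num), 0) + 1, 'my item 1'
FROM boxed_items
WHERE box_id = 1;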
I would recommend writing a simple test for both approaches, comparing performance, disk space, and ease of implementation, and making a judgement call. Both of your suggestions are reasonable, and I doubt there will be much of a difference in performance, but the best way to find out is to just try it, and then you will know for sure.
I am currently evaluating a strategy for storing supplier catalogs.
A catalog can contain anywhere from 100 to 0.25 million items.
Each item may have multiple errors. The application should support browsing of catalog items:
Group by type of error, category, manufacturer, supplier, etc.
Browse items for any group; it should be possible to sort and search on multiple columns (part id, name, price, etc.).
The question is: when I have to provide "multiple search, sort and group" functionality, how should I create the indexes?
According to the MySQL docs and blog posts about indexing, it seems that an index on an individual column will not be used by every query.
Creating a multi-column index is also not specific enough for my case.
There might be 20-30 combinations of group, search and sort.
How do I scale this, and how can I make search fast?
I expect to handle 50 million records of data.
I am currently evaluating with 15 million records.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);
If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting. But you can probably do it with far fewer indexes than that.
You also need to plan how these indexes will be maintained. I'm assuming, because this is for supplier catalogs, that the data is not going to change much. In that case, simply creating all the indexes you need should do the job nicely. If the data rows are going to be edited or inserted frequently in real time, then you have to consider that with your indexing - then having 20 or 30 indexes might not be such a good idea (since MySQL will be constantly having to update them all).

You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the default engine, basically fast flat files) is a good choice. If it changes a lot, then you should be using InnoDB so you can get row-level locking.

InnoDB would also allow you to define a clustered index, which is a special index that controls the order the data is stored on disk. So if you had one particular query that is run 99% of the time, you could create a clustered index for it: all the data would already be in the right order on disk and would return super fast. But every insert or update to the data would result in the entire table being reordered on disk, which is not fast for lots of data. You'd never use one if the data changed at all frequently, and you might have to batch-load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in real time.
Finally, you should consider alternative means beyond doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use them to provide some really powerful search features like full-text and/or faceted search. A good way to describe how these work is that it's like having a private Google search engine indexing your stuff. It takes time to write the code to interface with the search server, but you will most likely save that time by not having to write, and wrap your head around, the indexing problem and the other issues I mentioned above.
If you do just go with creating all the indexes, though, learn how to use the EXPLAIN command in MySQL. That will let you see what MySQL's plan for executing a query will be. You can create indexes, then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query patterns has indexes supporting it and is not falling back to scanning your entire table of data to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.
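For example, a sketch (the index and query are illustrative, not a complete set): one composite index that matches a common browse pattern, filtering by tenant and category and sorting by price, then checked with EXPLAIN:

CREATE INDEX idx_tenant_category_price
  ON CATALOG_ITEM (TENANT_ID, CATEGORY_CODE, PRICE);

EXPLAIN
SELECT ITEM_ID, NAME, PRICE, SUPPLIER_NAME
FROM CATALOG_ITEM
WHERE TENANT_ID = 'tenant-1'
  AND CATEGORY_CODE = 'CAT-100'
ORDER BY PRICE
LIMIT 50;
-- The plan should show key = idx_tenant_category_price and no filesort,
-- because the equality filters plus the ORDER BY column match the index order.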