I am currently facing an issue with designing a database table and updating/inserting values into it.
The table is used to collect and aggregate statistics that are identified by:
the source
the user
the statistic
an optional material (e.g. item type)
an optional entity (e.g. animal)
My main issue is that my proposed primary key is too large because of the VARCHARs that are used to identify a statistic.
My current table is created like this:
CREATE TABLE `Statistics` (
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL)
In particular, the server_id is configurable, the player_id is a UUID, statistic is the representation of an enumeration that may change, material and entity likewise. The value is then aggregated using SUM() to calculate the overall statistic.
So far it works, but I have to use DELETE and INSERT statements whenever I want to update a value, because I have no primary key and I can't figure out how to create one within the constraints of MySQL.
My main question is: How can I efficiently update values in this table and insert them when they are not currently present without resorting to deleting all the rows and inserting new ones?
The main issue seems to be the restriction MySQL puts on the primary key. I don't think adding an id column would solve this.
Simply add an auto-incremented id:
CREATE TABLE `Statistics` (
statistics_id int auto_increment primary key,
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL
);
Voila! A primary key. But you probably want an index. One that comes to mind:
create index idx_statistics_server_player_statistic on `Statistics`(server_id, player_id, statistic);
Depending on what your code looks like, you might want additional or different keys in the index, or more than one index.
Follow the steps below; I hope they solve your problem:
- First, add a column to your table, say "detailed", of type money.
- In your project, before each INSERT statement, get the maximum of detailed (SELECT MAX(detailed)+1 AS maxid FROM TABLE_NAME) and use it as the row number, which will help you to fetch and delete the record.
- You can also update using this number, but during an update the maximum of detailed is not required.
I hope this is clear and that it helps.
I have dug a bit more through the internet and optimized my code a lot.
I asked this question because of bad performance, which I assumed was because of the DELETE and INSERT statements following each other.
I was thinking that I could try to reduce the load by doing INSERT IGNORE statements followed by UPDATE statements or INSERT .. ON DUPLICATE KEY UPDATE statements. But they require keys to be useful which I haven't had access to, because of constraints in MySQL.
I have fixed the performance issues though:
By reducing the number of statements generated asynchronously (I know JDBC is blocking, but it worked; it just blocked thousands of threads) and disabling auto-commit, I was able to improve the performance by a factor of 600 (from 60 seconds down to 0.1 seconds).
The next steps are to improve the connection string and gain even more performance.
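The INSERT .. ON DUPLICATE KEY UPDATE idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it is self-contained, and SQLite spells the operation INSERT ... ON CONFLICT ... DO UPDATE. The stat_key column and helper are hypothetical: a fixed-width SHA-256 digest of the identifying columns is one possible way to get a short unique key when the natural key is too wide, and it also gives NULL material/entity values a stable place in the key.

```python
import hashlib
import sqlite3


def stat_key(server_id, player_id, statistic, material=None, entity=None):
    """Hypothetical helper: 32-byte digest of the identifying columns."""
    h = hashlib.sha256()
    for part in (server_id, player_id.hex(), statistic, material, entity):
        # Tag NULL differently from a real value, and length-prefix each part
        # so that ("ab", "c") and ("a", "bc") cannot produce the same digest.
        data = b"\x00" if part is None else b"\x01" + part.encode("utf-8")
        h.update(len(data).to_bytes(4, "big") + data)
    return h.digest()


def open_db():
    """Create an in-memory table keyed by the digest (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statistics ("
        " stat_key BLOB PRIMARY KEY,"
        " server_id TEXT NOT NULL,"
        " player_id BLOB NOT NULL,"
        " statistic TEXT NOT NULL,"
        " material TEXT,"
        " entity TEXT,"
        " value INTEGER NOT NULL)"
    )
    return conn


def upsert_value(conn, server_id, player_id, statistic, value,
                 material=None, entity=None):
    """Insert the row, or overwrite its value if the key already exists."""
    key = stat_key(server_id, player_id, statistic, material, entity)
    conn.execute(
        "INSERT INTO statistics VALUES (?, ?, ?, ?, ?, ?, ?) "
        "ON CONFLICT(stat_key) DO UPDATE SET value = excluded.value",
        (key, server_id, player_id, statistic, material, entity, value),
    )
```

Calling upsert_value twice with the same identifying columns leaves a single row holding the latest value, so no DELETE is needed. In MySQL the same shape would presumably be a BINARY(32) unique key combined with INSERT ... ON DUPLICATE KEY UPDATE.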
I'm revisiting my database and noticed I had some primary keys that were of type INT.
This wasn't unique enough, so I thought I would use a GUID.
I come from a Microsoft SQL background, and in SSMS you can
choose the type "uniqueidentifier" and auto-increment it.
In MySQL, however, I've found that you have to create triggers that execute on insert for the tables you want
to generate a GUID for. Example:
Table:
CREATE TABLE `tbl_test` (
`GUID` char(40) NOT NULL,
`Name` varchar(50) NOT NULL,
PRIMARY KEY (`GUID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Trigger:
CREATE TRIGGER `t_GUID` BEFORE INSERT ON `tbl_test`
FOR EACH ROW BEGIN
SET NEW.GUID = UUID();
END;
Alternatively you have to insert the guid yourself in the backend.
I'm no DB expert, but I still remember that triggers cause performance problems.
The above is something I found here and is 9 years old so I was hoping something has changed?
According to the documentation, you can use uuid() as a column default starting with version 8.0.13, so something like this should work:
create table tbl_test (
guid binary(16) default (uuid_to_bin(uuid())) not null primary key,
name varchar(50) not null
);
This is pretty much copied from the documentation. I don't have a recent enough version of MySQL at hand to test this.
You can run an
INSERT INTO `tbl_test` VALUES (uuid(),'testname');
This generates a new UUID each time you call it.
Alternatively, you can use the more modern UUID v4, which is more random than MySQL's standard uuid(), by using one of the functions from
"How to generate a UUIDv4 in MySQL?"
Since 8.0.13 you can use
CREATE TABLE t1 (
uuid_field VARCHAR(40) DEFAULT (uuid())
);
But if you want more than uniqueness, note that only built-in functions are allowed in a default expression, not user-defined ones such as a UUID v4 generator; for that you still need the trigger.
As per the documentation, BINARY(x) right-pads values shorter than x bytes with 0x00, and VARCHAR(40) wastes space by storing the UUID as text rather than as raw bytes. Storing the 16 raw bytes, for example in a VARBINARY(16) column, is more efficient.
Also, more entropy (unguessability / security) per byte is available from RANDOM_BYTES(16) than from standardized UUIDs, because UUIDs reserve some bits to encode constant metadata (version and variant).
Perhaps the below will work for your needs.
-- example
CREATE TABLE `tbl_test` (
`GUID` VARBINARY(16) DEFAULT (RANDOM_BYTES(16)) NOT NULL PRIMARY KEY,
`Name` VARCHAR(50) NOT NULL
);
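The point about constant metadata can be checked with a small sketch. It uses Python's uuid and os modules purely as an illustration, with os.urandom standing in for RANDOM_BYTES(16) (an assumption, not MySQL itself). Even a fully random version 4 UUID fixes 4 version bits and 2 variant bits, leaving 122 random bits out of 128.

```python
import os
import uuid

VERSION_BITS = 4   # bits 76..79 of the 128-bit value hold the version
VARIANT_BITS = 2   # bits 62..63 hold the RFC 4122 variant marker


def fixed_fields(u):
    """Return (version nibble, variant bits) extracted from a UUID."""
    version = (u.int >> 76) & 0xF
    variant = (u.int >> 62) & 0b11
    return version, variant


def random_bits_in_uuid4():
    """Number of bits in a v4 UUID that are actually random."""
    return 128 - VERSION_BITS - VARIANT_BITS


def random_id():
    """16 bytes with no reserved bits (stand-in for RANDOM_BYTES(16))."""
    return os.urandom(16)
```

Every uuid.uuid4() value yields the same pair from fixed_fields, whereas all 128 bits of random_id() come from the operating system's random source.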
I have a large table called "queue". It has 12 million records right now.
CREATE TABLE `queue` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`userid` varchar(64) DEFAULT NULL,
`action` varchar(32) DEFAULT NULL,
`target` varchar(64) DEFAULT NULL,
`name` varchar(64) DEFAULT NULL,
`state` int(11) DEFAULT '0',
`timestamp` int(11) DEFAULT '0',
`errors` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_unique` (`userid`,`action`,`target`),
KEY `idx_userid` (`userid`),
KEY `idx_state` (`state`)
) ENGINE=InnoDB;
Multiple PHP workers (150) use this table simultaneously.
They select a record, perform a network request using the selected data and then delete the record.
I get mixed execution times from the select and delete queries. Is the delete command locking the table?
What would be the best approach for this scenario?
SELECT record + NETWORK request + DELETE the record
SELECT record + NETWORK request + MARK record as completed + DELETE completed records using a cron from time to time (I don't want an even bigger table).
Note: The queue gets new records every minute but the INSERT query is not the issue here.
Any help is appreciated.
"Don't queue it, just do it". That is, if the tasks are rather fast, it is better to simply perform the action and not queue it. Databases don't make good queuing mechanisms.
DELETE does not lock an InnoDB table. However, you can write a DELETE that seems that naughty. Let's see your actual SQL so we can work on improving it.
12M records? That's a huge backlog; what's up?
Shrink the datatypes so that the table is not gigabytes:
action is only a small set of possible values? Normalize it down to a 1-byte ENUM or TINYINT UNSIGNED.
Ditto for state -- surely it does not need a 4-byte code?
There is no need for INDEX(userid) since there is already an index (UNIQUE) starting with userid.
If state has only a few values, the index won't be used. Let's see your enqueue and dequeue queries so we can discuss how to either get rid of that index or make it 'composite' (and useful).
What's the current value of MAX(id)? Is it threatening to exceed your current limit of about 4 billion for INT UNSIGNED?
How does PHP use the queue? Does it hang onto an item via an InnoDB transaction? That defeats any parallelism! Or does it change state? Show us the code; perhaps the lock & unlock can be made less invasive. It should be possible to run a single autocommitted UPDATE to grab a row and its id. Then, later, do an autocommitted DELETE with very little impact.
I do not see a good index for grabbing a pending item. Again, let's see the code.
150 seems like a lot -- have you experimented with fewer? They may be stumbling over each other.
Is the Slowlog turned on (with a low value for long_query_time)? If so, I wonder what is the 'worst' query. In situations like this, the answer may be surprising.
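The "single autocommitted UPDATE to grab a row, then an autocommitted DELETE" pattern mentioned above might look roughly like this. It is a sketch under assumptions, not the asker's code: Python's sqlite3 in autocommit mode stands in for MySQL, the table is reduced to a few columns, and the owner column and the function names are hypothetical.

```python
import sqlite3


def make_queue(targets):
    """Build a tiny in-memory queue; isolation_level=None means autocommit."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE queue ("
        " id INTEGER PRIMARY KEY,"
        " target TEXT,"
        " state INTEGER NOT NULL DEFAULT 0,"   # 0 = pending, 1 = claimed
        " owner TEXT)"                         # hypothetical claim marker
    )
    conn.executemany("INSERT INTO queue (target) VALUES (?)",
                     [(t,) for t in targets])
    return conn


def claim(conn, worker):
    """Mark the oldest pending row as claimed by this worker in one UPDATE."""
    cur = conn.execute(
        "UPDATE queue SET state = 1, owner = ? WHERE id = ("
        " SELECT id FROM queue WHERE state = 0 ORDER BY id LIMIT 1)",
        (worker,),
    )
    if cur.rowcount == 0:
        return None  # nothing left to do
    return conn.execute(
        "SELECT id, target FROM queue WHERE owner = ? AND state = 1 "
        "ORDER BY id LIMIT 1",
        (worker,),
    ).fetchone()


def finish(conn, row_id):
    """Delete the row after the network request has completed."""
    conn.execute("DELETE FROM queue WHERE id = ?", (row_id,))
```

No transaction stays open while the network request runs, so workers do not hold locks on each other's rows. The sketch assumes each worker finishes (deletes) its claimed row before claiming the next one.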
We have an analytics product. We give each customer a piece of JavaScript code, which they put on their website. When a user visits a customer's site, the JavaScript code hits our server so that we can store the page visit on behalf of that customer. Each customer has a unique domain name.
We store these page visits in a MySQL table.
Following is the table schema.
CREATE TABLE `page_visits` (
`domain` varchar(50) DEFAULT NULL,
`guid` varchar(100) DEFAULT NULL,
`sid` varchar(100) DEFAULT NULL,
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL,
`is_new` varchar(20) DEFAULT NULL,
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL,
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL,
`region` varchar(50) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`city_lat_long` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
KEY `sid_index` (`sid`) USING BTREE,
KEY `domain_index` (`domain`),
KEY `email_index` (`email`),
KEY `stats_time_index` (`stats_time`),
KEY `domain_statstime` (`domain`,`stats_time`),
KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
We don't have a primary key for this table.
MySQL server details
It is Google Cloud MySQL (version 5.6) with a storage capacity of 10 TB.
We currently have 350 million rows in the table, and its size is 300 GB. All customers' data is stored in the same table, even though there is no relation between one customer and another.
Problem 1: A few of our customers have a huge number of rows in the table, so queries for these customers are very slow.
Example Query 1:
SELECT count(DISTINCT sid) AS count,count(sid) AS total FROM page_visits WHERE domain = 'aaa' AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00') AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count | total |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)
I will add more queries here. We need results in under 5-10 seconds; is that possible?
Problem 2: The table is growing rapidly and might reach 5 TB by the end of this year, so we want to shard it. We want to keep all records for one customer on one machine. What are the best practices for this kind of sharding?
We are considering the following approaches; please suggest best practices for overcoming these issues.
1) Create a separate table for each customer. What are the advantages and disadvantages? We currently have 30k customers and might reach 100k by the end of this year, which means 100k tables in the database. We access all tables simultaneously for reads and writes.
2) Keep the single table and partition it by date range.
UPDATE: Is a "customer" determined by the domain? Yes.
Thanks
First, a critique of the excessively large datatypes:
`domain` varchar(50) DEFAULT NULL, -- normalize to MEDIUMINT UNSIGNED (3 bytes)
`guid` varchar(100) DEFAULT NULL, -- what is this for?
`sid` varchar(100) DEFAULT NULL, -- varchar?
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL, -- too big for IPv4, too small for IPv6; see below
`is_new` varchar(20) DEFAULT NULL, -- flag? Consider `TINYINT` or `ENUM`
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL, -- normalize! (add new rows as new agents are created)
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL, -- use standard 2-letter code (see below)
`region` varchar(50) DEFAULT NULL, -- see below
`city` varchar(50) DEFAULT NULL, -- see below
`city_lat_long` varchar(50) DEFAULT NULL, -- unusable in current format; toss?
`email` varchar(100) DEFAULT NULL,
For IP addresses, use inet6_aton(), then store in BINARY(16).
For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.
country + region + city + (maybe) latlng -- normalize this to a "location".
All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.
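To make the IP address note above concrete, here is a small sketch of what "store the address as raw bytes" means. It uses Python's ipaddress module as an illustration only. As far as I know, INET6_ATON() likewise returns 4 bytes for an IPv4 address and 16 bytes for an IPv6 address. The pack_ip16 helper, which maps IPv4 into the ::ffff:a.b.c.d form to get a uniform 16-byte width, is a hypothetical option and not something the answer prescribes.

```python
import ipaddress


def pack_ip(text):
    """Raw bytes of the address: 4 bytes for IPv4, 16 bytes for IPv6."""
    return ipaddress.ip_address(text).packed


def pack_ip16(text):
    """Always 16 bytes; IPv4 is stored in IPv4-mapped IPv6 form."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        return b"\x00" * 10 + b"\xff\xff" + addr.packed
    return addr.packed
```

Compared with varchar(20), which is too short for IPv6 text anyway, either form is smaller and compares bytewise.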
Other issues...
To greatly speed up your sid counter, change
KEY `domain_statstime` (`domain`,`stats_time`),
to
KEY dss (domain_id,`stats_time`, sid),
That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)
This is redundant with the above index, DROP it:
KEY domain_index (domain)
Is a "customer" determined by the domain?
Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)
If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL, -- normalized to `Domains`
PRIMARY KEY(domain_id, id), -- clustering on customer gives you the speedup
INDEX(id) -- this keeps AUTO_INCREMENT happy
Recommend you look into pt-online-schema-change for making all these changes. However, I don't know if it can work without an explicit PRIMARY KEY.
"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.
Sharding
"Sharding" is splitting the data across multiple machines.
To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.
When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.
I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
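The hybrid scheme described above can be sketched in a few lines. This is a minimal illustration, assuming CRC32 as the hash and made-up shard names (db0 to db3); neither choice comes from the answer.

```python
import zlib

NUM_BUCKETS = 1024

# Hypothetical lookup table: bucket number -> machine holding that bucket.
# In practice this small table would be distributed to every client.
bucket_to_shard = {b: "db%d" % (b % 4) for b in range(NUM_BUCKETS)}


def bucket_for(domain):
    """Hash the domain to one of NUM_BUCKETS stable bucket numbers."""
    return zlib.crc32(domain.encode("utf-8")) % NUM_BUCKETS


def shard_for(domain, table=bucket_to_shard):
    """Pick the machine by looking the bucket up in the table."""
    return table[bucket_for(domain)]


def move_bucket(bucket, new_shard, table=bucket_to_shard):
    """Reassign one bucket after its rows have been migrated."""
    table[bucket] = new_shard
```

Adding a shard then means migrating some buckets and updating their entries, rather than rehashing every domain.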
If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.
PARTITIONing
PARTITIONing is splitting the data across multiple "sub-tables".
There are only a limited number of use cases where partitioning buys you any performance. You have not indicated that any apply to your use case. Read that blog and see if you think that partitioning might be useful.
You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:
Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)
Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)
Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
Summary tables
Are many of the queries of the form of counting things over date ranges? I suggest building summary tables, perhaps per hour. More discussion.
COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
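A tiny worked example shows why per-hour unique counts cannot simply be added. The data is made up for illustration, and this only demonstrates the problem; it is not the technique the answer alludes to.

```python
# Hypothetical session ids seen in two consecutive hours.
hourly_sids = {
    "10:00": {"a", "b", "c"},
    "11:00": {"b", "c", "d"},
}

# Adding the hourly distinct counts counts "b" and "c" twice.
sum_of_hourly_counts = sum(len(sids) for sids in hourly_sids.values())

# The true daily distinct count needs the union of the underlying ids.
true_daily_count = len(set().union(*hourly_sids.values()))

print(sum_of_hourly_counts, true_daily_count)  # prints "6 4"
```

A summary table that stores only the hourly number has lost the information needed to merge correctly, so whatever is stored per hour must be mergeable, like the sets here.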
I wouldn't do this if I were you. The first thing that comes to mind: on receiving a pageview message, I would send it to a queue so that a worker can pick it up and insert it into the database later (perhaps in bulk); I would also increment a siteid:date counter in Redis (for example). Doing the count in SQL is just a bad idea for this scenario.
I am currently working on a project that involves altering data stored in a MySQL database. Since the table that I am working on does not have a key, I add a key with the following command:
ALTER TABLE deCoupledData ADD COLUMN MY_KEY INT NOT NULL AUTO_INCREMENT KEY
Due to the fact that I want to group my records according to selected fields, I try to create an index for the table deCoupledData that consists of MY_KEY, along with the selected fields. For example, If I want to work with the fields STATED_F and NOT_STATED_F, I type:
ALTER TABLE deCoupledData ADD INDEX (MY_KEY, STATED_F, NOT_STATED_F)
The real issue is that I usually work with more than 16 fields, and MySQL does not allow an index with more than 16 columns.
In conclusion, is there another way to do this? Can I somehow make MySQL order the records according to the desired super-key (something like clustering)? I really need to make my script faster, and the main overhead is that each group may contain records that are not stored on the same disk page, so I assume my PC performs random I/O to retrieve them.
Thank you for your time.
Nick Katsipoulakis
CREATE TABLE deCoupledData (
AA double NOT NULL DEFAULT '0',
STATED_F double DEFAULT NULL,
NOT_STATED_F double DEFAULT NULL,
MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
MY_KEY int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (MY_KEY),
KEY AA (AA) )
ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1
Okay, first of all, when you add an index over multiple columns and you don't really use the first column, the index is useless.
Example: You have a query like
SELECT *
FROM deCoupledData
WHERE
stated_f = 5
AND not_stated_f = 10
and an index over (MY_KEY, STATED_F, NOT_STATED_F).
The index can only be used if you also have a condition such as AND my_key = 1 in the WHERE clause.
Imagine you want to look up every person in a telephone book with first name 'John'. Then the knowledge that the book is sorted by last name is useless, you still have to look up every single name.
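The telephone book analogy can be turned into a short sketch. A sorted list of (last_name, first_name) tuples plays the role of a composite index, with made-up entries: a lookup on the leading column can binary-search, while a lookup on the second column alone has to scan everything.

```python
import bisect

# Hypothetical "index" sorted by (last_name, first_name).
phone_book = sorted([
    ("Doe", "John"),
    ("Smith", "Anna"),
    ("Smith", "John"),
    ("Young", "John"),
])


def find_by_last_name(last):
    """Leading column known: binary search, then read the adjacent matches."""
    i = bisect.bisect_left(phone_book, (last,))
    matches = []
    while i < len(phone_book) and phone_book[i][0] == last:
        matches.append(phone_book[i])
        i += 1
    return matches


def find_by_first_name(first):
    """Leading column unknown: the sort order does not help, so scan all."""
    return [entry for entry in phone_book if entry[1] == first]
```

The same reasoning applies to an index on (MY_KEY, STATED_F, NOT_STATED_F) when the query does not constrain MY_KEY.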
Also, the primary key does not have to be a surrogate / artificial one. It's nearly always better to have a primary key which is made up of columns which identify each row uniquely anyway.
Also it's not always good to have many indexes. Not only do indexes slow down INSERTs and UPDATEs, sometimes they just cause an extra lookup, since first a look at the index is taken and a second look to find the actual data.
That's just a few tips. Maybe Jordan's hint is not a bad idea, "You should maybe post a new question that has your actual SQL query, table layout, and performance questions".
UPDATE:
Yes, that is possible. According to the manual:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
which means that the data is practically sorted on disk, yes.
Be aware that it's also possible to define a primary key over multiple columns!
Like
CREATE TABLE deCoupledData (
AA double NOT NULL DEFAULT '0',
STATED_F double NOT NULL, -- primary key columns must be NOT NULL
NOT_STATED_F double NOT NULL,
MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
MY_KEY int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (NOT_STATED_F, STATED_F, AA),
KEY MY_KEY (MY_KEY), -- an AUTO_INCREMENT column must be indexed
KEY AA (AA) )
ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1
as long as the combination of the columns is unique.