this is the schema of my table:
CREATE TABLE [dbo].[ClassifiedDataStore_MasterTesting]
(
[PK_ID] [uniqueidentifier] NOT NULL,
[FK_SubCategory_Master] [int] NULL,
[FK_IntegratedWeb_Master] [int] NULL,
[FK_City_Master] [int] NULL,
[Title] [nvarchar](max) NULL,
[Description] [varchar](max) NULL,
[Url] [nvarchar](max) NULL,
[DisplayUrl] [varchar](max) NULL,
[Date] [datetime] NULL,
[ImageURL] [nvarchar](max) NULL,
[Price] [decimal](18, 2) NULL,
[Fetch_Date] [datetime] NULL,
[IsActive] [bit] NULL,
[record_id] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_ClassifiedDataStore_Master2] PRIMARY KEY CLUSTERED
(
[PK_ID] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Wow, all those MAX columns... do you really need MAX for URLs and titles? Do you really need the PK to be a GUID?
Since most systems are I/O bound, one key to good performance (in addition to sensible indexing and only pulling the data you need) is to fit as much data onto each page as possible. With all these LOB columns storing potentially 2GB of data each, every page fetch is going to be a little bit of a nightmare for SQL Server. Strongly recommend considering trimming some of these data types where possible, e.g.
use an IDENTITY column instead of a GUID if feasible - why have both?
for any INTs that FK to lookups that will always have < 32K rows, use SMALLINT
for any INTs that FK to lookups that will always have < 256 rows, use TINYINT (its range is 0-255)
use in-row storage (and not MAX types) for things like title and URL
you can shave a few bytes by using fewer than 18 digits for price (DECIMAL(9,2) takes 5 bytes instead of 9); I doubt you will have classified items worth $1,000,000,000,000+
if < minute accuracy is not needed for Date/Fetch_Date, use SMALLDATETIME
if < day accuracy is not needed for Date/Fetch_Date, use DATE
(I also find it odd that you need Unicode/nvarchar for title, but not for description, and you need Unicode/nvarchar for URL/ImageURL, but not DisplayURL. Can you explain the rationale there? I'll give a hint: if the title can contain Unicode then it is reasonable to assume that the title could also be mentioned in the description, so it should also support Unicode. And all of your URLs are probably just fine supporting only varchar; I don't recall ever seeing a URL with Unicode characters in it (these are typically URL-encoded).)
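Pulling those suggestions together, a trimmed version of the table might look something like this (a sketch only; the exact lengths, and whether you can drop the GUID entirely, depend on your application):

```sql
CREATE TABLE [dbo].[ClassifiedDataStore_MasterTesting]
(
    [record_id]               INT IDENTITY(1,1) NOT NULL, -- sole surrogate key
    [FK_SubCategory_Master]   SMALLINT NULL,  -- assumes < 32K subcategories
    [FK_IntegratedWeb_Master] SMALLINT NULL,
    [FK_City_Master]          SMALLINT NULL,
    [Title]       NVARCHAR(200) NULL,    -- in-row, Unicode
    [Description] NVARCHAR(4000) NULL,   -- Unicode, to match Title
    [Url]         VARCHAR(2048) NULL,    -- URLs are URL-encoded ASCII
    [DisplayUrl]  VARCHAR(2048) NULL,
    [ImageURL]    VARCHAR(2048) NULL,
    [Date]        SMALLDATETIME NULL,    -- if minute accuracy suffices
    [Fetch_Date]  SMALLDATETIME NULL,
    [Price]       DECIMAL(9, 2) NULL,    -- 5 bytes instead of 9
    [IsActive]    BIT NULL,
    CONSTRAINT [PK_ClassifiedDataStore_MasterTesting]
        PRIMARY KEY CLUSTERED ([record_id] ASC)
);
```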
Consider using data compression if you are on Enterprise Edition or better. Again, since most systems are I/O bound, we are happy to pay a slight CPU penalty compressing/decompressing data in order to fit it onto fewer pages, this will greatly reduce the time required to perform heavy read operations against the table.
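For example (assuming Enterprise Edition, and that the table name matches yours; PAGE compression is typically the better fit for read-heavy tables):

```sql
-- Estimate the savings first
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo',
     @object_name = 'ClassifiedDataStore_MasterTesting',
     @index_id = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';

-- Then apply it
ALTER TABLE [dbo].[ClassifiedDataStore_MasterTesting]
REBUILD WITH (DATA_COMPRESSION = PAGE);
```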
When discussing performance it's always important to know what kind of queries will be most frequent for your table.
Your searches will be filtered using what columns? Title and Date?
Suppose that most of your queries start by filtering your table by Date and then by Title.
You should create a non-unique clustered index with Date as the first key column and Title as the second.
Why is that? Because your records are then stored physically in that order, making those searches much faster; that is also why you can have only one clustered index per table.
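As a sketch (this assumes you have first dropped or replaced the existing clustered PK on PK_ID, since a table can have only one clustered index, and that Title has been changed to an in-row type such as NVARCHAR(200), because MAX columns cannot be index key columns):

```sql
-- Non-unique clustered index: rows are physically ordered by (Date, Title)
CREATE CLUSTERED INDEX IX_ClassifiedDataStore_Date_Title
ON [dbo].[ClassifiedDataStore_MasterTesting] ([Date] ASC, [Title] ASC);
```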
I have to store DOIs in a MySQL database. The handbook says:
There is no limitation on the length of a DOI name.
So far, the maximum length of a DOI in my current data is 78 chars. Which field length would you recommend in order to not waste storage space and to be on the safe side? In general:
How do you handle the problem of not knowing the maximum length of input data that has to be stored in a database, considering space and the efficiency of transactions?
EDIT
There are these two (simplified) tables document and topic with a one-to-many relationship:
CREATE TABLE document
(
ID int(11) NOT NULL,
DOI ??? NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE topic
(
ID int(11) NOT NULL,
DocID int(11) NOT NULL,
Name varchar(255) NOT NULL,
PRIMARY KEY (ID),
FOREIGN KEY (DocID) REFERENCES document(ID)
);
I have to run the following (simplified) query for statistics, returning the total value of referenced topic-categories per document (if there are any references):
SELECT COUNT(topic.Name) AS number, document.DOI
FROM document LEFT OUTER JOIN topic
ON document.ID = topic.DocID
GROUP BY document.DOI;
The collation used is utf8_general_ci.
TEXT and VARCHAR can each store up to 64KB. If you're being extra paranoid, use LONGTEXT, which allows 4GB, though if DOI names are actually longer than 64KB then that is a really abusive standard. VARCHAR(65535) is probably a reasonable accommodation.
Since VARCHAR is variable length then you really only pay for the extra storage if and when it's used. The limit is just there to cap how much data can, theoretically, be put in the field.
Space is not a problem; indexing may be a problem. Please provide the queries that will need an index on this column. Also provide the CHARACTER SET needed. With those, we can discuss the ramifications of various cutoffs: 191, 255, 767, 3072, etc.
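To make those cutoffs concrete: with utf8 (up to 3 bytes per character) and InnoDB's historical 767-byte index-key limit, 255 characters is the largest fully indexable VARCHAR; 191 is the equivalent for utf8mb4 (4 bytes per character). A sketch, assuming the document table from the question:

```sql
-- Fully indexable under the 767-byte limit with utf8 (255 * 3 = 765 bytes)
ALTER TABLE document MODIFY DOI VARCHAR(255) NOT NULL;
ALTER TABLE document ADD UNIQUE KEY uk_doi (DOI);

-- If DOIs could exceed 255 chars, index only a prefix instead:
-- ALTER TABLE document MODIFY DOI VARCHAR(1000) NOT NULL;
-- ALTER TABLE document ADD KEY k_doi (DOI(191));
```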
I need to decide which is better for performance:
A) Retrieving data from one of ~100 similar tables LEFT JOIN 1 table.
B) Retrieving data from one of ~100 similar tables after denormalizing 1 table that I joined in A).
I'm curious whether denormalizing this one table pays off in SELECT performance, since I'm creating a lot more columns in the database: the similar tables will have 3-15 (let's say 8) columns and the table to denormalize has ~6 columns.
So in variant A) I got 100 tables * 8 columns + 1 table * 6 columns = 806 columns.
In variant B) I got 100 tables * (8 columns + 6 columns) = 1400 columns.
So which is better when we're not looking at disk space, only focusing on performance?
-------------EDIT-----------------
As Rick James asked for SHOW CREATE TABLE - the competing one :
CREATE TABLE `ItemsGeneral` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`description` text COLLATE utf8mb4_unicode_ci NOT NULL,
`datePosted` datetime NOT NULL,
`dateEnds` datetime NOT NULL,
`photos` tinyint(3) unsigned NOT NULL,
`userId` int(10) unsigned NOT NULL,
`locationSimple` point NOT NULL,
`locationPrecise` point NOT NULL,
PRIMARY KEY (`id`),
SPATIAL KEY `locationPrecise` (`locationPrecise`),
SPATIAL KEY `locationSimple` (`locationSimple`),
KEY `userId` (`userId`),
KEY `dateEnds` (`dateEnds`)
) ENGINE=MyISAM AUTO_INCREMENT=10001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
And all the other ~100s tables will have tiny/small/medium ints.
The implication of your question is that you already threw normalization out the window from the start.
In terms of performance, a standard rule of thumb for result speed (I'll use > to mean "faster than" here):
cached data > cached query > query against SSD > query against magnetic disk
Another rule of thumb for database performance is that optimization of the size of the data set is directly related to performance against a given resource.
Now in terms of joins, there is a price to pay for a simple keyed join, but since these types of queries are typically measured in milliseconds, that certainly isn't a reason to denormalize lots of data and in the process blow up your dataset 20%, especially if you might need to update all that data.
Normalization is simply a cost for having atomic accurate data but it also helps keep your dataset to an optimal size.
Just as an example, if you have a mysql server, and you are using InnoDB, AND you have properly allocated memory on the server to your InnoDB cache, you can often see an extremely high cache hit ratio, where the queries are coming straight out of ram. At that point the fact that more of your database can be in cache due to the size of the dataset is more important than the fact that you joined 2 tables together.
I regularly find that people who set up MySQL but aren't experts in it either aren't aware that their entire dataset could fit in cache (were they to allocate it), or haven't changed any of the default values and have essentially almost no allocation to cache.
Just to be clear, this involves configuration of the innodb_buffer_pool_size and innodb_buffer_pool_instances.
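A quick way to check this on a running server (the values shown in the comments are illustrative; the old 128 MB default is far too small for a dataset you want fully cached):

```sql
-- How big is the buffer pool now?
SHOW VARIABLES LIKE 'innodb_buffer_pool%';

-- How big is the InnoDB data you would like to cache?
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS gb
FROM information_schema.tables
WHERE engine = 'InnoDB';

-- Then size the pool in my.cnf, e.g. on a dedicated 32 GB server:
-- innodb_buffer_pool_size = 24G
-- innodb_buffer_pool_instances = 8
```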
We have an analytics product. For each of our customers we provide a JavaScript snippet that they put on their web sites. When a user visits our customer's site, the JavaScript code hits our server so that we can store the page visit on behalf of that customer. Each customer has a unique domain name.
We store these page visits in a MySQL table.
Following is the table schema.
CREATE TABLE `page_visits` (
`domain` varchar(50) DEFAULT NULL,
`guid` varchar(100) DEFAULT NULL,
`sid` varchar(100) DEFAULT NULL,
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL,
`is_new` varchar(20) DEFAULT NULL,
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL,
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL,
`region` varchar(50) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`city_lat_long` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
KEY `sid_index` (`sid`) USING BTREE,
KEY `domain_index` (`domain`),
KEY `email_index` (`email`),
KEY `stats_time_index` (`stats_time`),
KEY `domain_statstime` (`domain`,`stats_time`),
KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
We don't have a primary key for this table.
MySQL server details
It is Google Cloud MySQL (version 5.6) and the storage capacity is 10 TB.
As of now we have 350 million rows in the table and the table size is 300 GB. We store all of our customers' data in the same table even though there is no relation between one customer and another.
Problem 1: A few of our customers have a huge number of rows in the table, so queries for those customers are very slow.
Example Query 1:
SELECT count(DISTINCT sid) AS count, count(sid) AS total FROM page_visits WHERE domain = 'aaa' AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00') AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count | total |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)
I will update with more queries here. We need results in under 5-10 seconds; will that be possible?
Problem 2: The table size is rapidly increasing; we might hit 5 TB by this year's end, so we want to shard our table. We want to keep all records related to one customer on one machine. What are the best practices for this sharding?
We are considering the following approaches for the above issues; please suggest best practices to overcome them.
Create a separate table for each customer
1) What are the advantages and disadvantages if we create a separate table for each customer? As of now we have 30k customers; we might hit 100k by this year's end, which would mean 100k tables in the DB. We access all tables simultaneously for reads and writes.
2) We will go with the same table and will create partitions based on date range
UPDATE: Is a "customer" determined by the domain? The answer is yes.
Thanks
First, a critique of the excessively large datatypes:
`domain` varchar(50) DEFAULT NULL, -- normalize to MEDIUMINT UNSIGNED (3 bytes)
`guid` varchar(100) DEFAULT NULL, -- what is this for?
`sid` varchar(100) DEFAULT NULL, -- varchar?
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL, -- too big for IPv4, too small for IPv6; see below
`is_new` varchar(20) DEFAULT NULL, -- flag? Consider `TINYINT` or `ENUM`
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL, -- normalize! (add new rows as new agents are created)
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL, -- use standard 2-letter code (see below)
`region` varchar(50) DEFAULT NULL, -- see below
`city` varchar(50) DEFAULT NULL, -- see below
`city_lat_long` varchar(50) DEFAULT NULL, -- unusable in current format; toss?
`email` varchar(100) DEFAULT NULL,
For IP addresses, use inet6_aton(), then store in BINARY(16).
For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.
country + region + city + (maybe) latlng -- normalize this to a "location".
All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.
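A sketch of those changes (column sizes are assumptions, and in practice you would migrate the existing varchar data in steps rather than converting it in place):

```sql
-- Assuming `domain` has been normalized into a Domains lookup table:
ALTER TABLE page_visits
    ADD COLUMN domain_id MEDIUMINT UNSIGNED NOT NULL,       -- FK to Domains
    MODIFY ip BINARY(16) NULL,            -- store INET6_ATON('203.0.113.9')
    MODIFY country CHAR(2) CHARACTER SET ascii NULL,
    MODIFY is_new TINYINT NULL;

-- Reading an address back:
-- SELECT INET6_NTOA(ip) FROM page_visits LIMIT 1;
```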
Other issues...
To greatly speed up your sid counter, change
KEY `domain_statstime` (`domain`,`stats_time`),
to
KEY dss (domain_id,`stats_time`, sid),
That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)
This is redundant with the above index, DROP it:
KEY domain_index (domain)
Is a "customer" determined by the domain?
Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)
If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL, -- normalized to `Domains`
PRIMARY KEY(domain_id, id), -- clustering on customer gives you the speedup
INDEX(id) -- this keeps AUTO_INCREMENT happy
Recommend you look into pt-online-schema-change for making all these changes. However, I don't know if it can work without an explicit PRIMARY KEY.
"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.
Sharding
"Sharding" is splitting the data across multiple machines.
To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.
When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.
I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
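The hybrid lookup could be as simple as this (the table name and server naming are hypothetical):

```sql
CREATE TABLE shard_map (
    bucket SMALLINT UNSIGNED NOT NULL PRIMARY KEY,  -- 0..1023
    server VARCHAR(50) NOT NULL                     -- e.g. 'db-07'
);

-- Which server holds this customer's data?
SELECT server
FROM shard_map
WHERE bucket = CRC32('customer-domain.com') MOD 1024;
```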
If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.
PARTITIONing
PARTITIONing is splitting the data across multiple "sub-tables".
There are only a limited number of use cases where partitioning buys you any performance. You have not indicated that any apply to your use case. Read that blog and see if you think partitioning might be useful.
You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:
Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)
Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)
Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
Summary tables
Are many of the queries of the form of counting things over date ranges? Suggest having summary tables, perhaps aggregated per hour. More discussion.
COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
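A per-hour summary table for the non-distinct counts might look like this (names are illustrative, and this assumes the domain_id normalization discussed above; as noted, COUNT(DISTINCT sid) needs separate handling):

```sql
CREATE TABLE page_visits_hourly (
    domain_id MEDIUMINT UNSIGNED NOT NULL,
    hr        DATETIME NOT NULL,          -- truncated to the hour
    visits    INT UNSIGNED NOT NULL,
    PRIMARY KEY (domain_id, hr)
);

-- Populate incrementally, e.g. once per hour for the previous hour:
INSERT INTO page_visits_hourly (domain_id, hr, visits)
SELECT domain_id,
       DATE_FORMAT(stats_time, '%Y-%m-%d %H:00:00'),
       COUNT(*)
FROM page_visits
WHERE stats_time >= NOW() - INTERVAL 2 HOUR
  AND stats_time <  NOW() - INTERVAL 1 HOUR
GROUP BY domain_id, DATE_FORMAT(stats_time, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE visits = visits + VALUES(visits);
```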
I wouldn't do this if I were you. The first thing that comes to mind: on receiving a pageview message, I would send it to a queue so that a worker can pick it up and insert it into the database later (in bulk, maybe); I would also increment a siteid:date counter in Redis (for example). Doing the count in SQL is just a bad idea for this scenario.
I am designing a website and have a table that handles a lot of inserts. Each month this table will get at least 50 million records.
So currently I am using the BIGINT UNSIGNED data type as the primary key of this table.
CREATE TABLE `class`.`add_contact_details`
(
`con_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`add_id_ref` BIGINT UNSIGNED NOT NULL,
`con_name` VARCHAR(200),
`con_email` VARCHAR(200),
`con_phone` VARCHAR(200),
`con_fax` VARCHAR(200),
`con_mailbox` VARCHAR(500),
`con_status_show_email` TINYINT(1),
`con_status_show_phone` TINYINT(1),
`con_status_show_fax` TINYINT(1),
`con_status_show_mailbox` TINYINT(1),
PRIMARY KEY (`con_id`) ) ENGINE=INNODB CHARSET=latin1 COLLATE=latin1_swedish_ci;
By doing a lot of research I found that many people worry about using BIGINT because it consumes memory and needs a lot of space.
And I found an article describing an alternative. Here it is:
"You could use a combined ( tinyint, int ) key. The tinyint would start at, and default to, 1. IF the int value is ever about to overflow, you change the tinyint's default value to 2, and reset the int value to 1. You can create code that runs every day, or on another applicable schedule, which checks for that condition and makes that change if needed."
So it makes sense, right? Is there anybody who is using this?
What should I use, considering performance?
Is there any alternative enterprise-level solution for this?
Stick with the BIGINT. You do save two or three bytes using the dual key, but you do pay for it.
References to the table need to use two keys instead of one, so all foreign key relationships are more complicated.
where clauses to find a single row are much more complicated. Consider the difference between:
where id in (1, 2, 3, 4, 5)
and
where id_part1 = 0 and id_part2 = 1 or
id_part1 = 0 and id_part2 = 2 or
. . .
The step to increment the first part is not automatic, requiring either manual intervention or the overhead of a trigger.
This reminds me (in a bad way) of segmented memory architectures that were in vogue 20+ years ago. Be happy the computer can understand 64-bit keys with no problems.
BIGINT needs 8 bytes of storage, so 50 million records is 400 MB per month, which shouldn't be an issue at all.
We are running databases (on DB2) with a couple of TB on a single server.
The only thing you should consider for querying by PK is putting an index on that field.
best regards
Romeo Kienzler
I am currently evaluating a strategy for storing supplier catalogs.
A catalog can contain anywhere from 100 to 0.25 million items.
Each item may have multiple errors. The application should support browsing of catalog items:
Group by type of error, category, manufacturer, supplier, etc.
Browse items for any group; should be able to sort and search on multiple columns (part ID, names, price, etc.)
The question is: when I have to provide "multiple search, sort, and group" functionality, how should I create indexes?
According to the MySQL docs and blogs about indexing, it seems that an index on an individual column will not be used by every query.
A multi-column index is also not specific enough for my case.
There might be 20-30 combinations of group, search, and sort.
How do I scale, and how can I make search fast?
Expecting to handle 50 million records of data.
Currently evaluating on 15 million rows.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);
If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting. But you can probably do it with far fewer indexes than that.
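For example, if two of the common access patterns are "browse a tenant's catalog by category, sorted by price" and "browse by manufacturer, sorted by name", two composite indexes can cover both the filters and the sorts (the column choices here are illustrative, not a recommendation for your exact workload):

```sql
ALTER TABLE CATALOG_ITEM
    ADD INDEX idx_cat_price (TENANT_ID, CATALOG_ID, CATEGORY_CODE, PRICE),
    ADD INDEX idx_mfr_name  (TENANT_ID, CATALOG_ID, MANUFACTURER_NAME, NAME);
```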
You also need to plan how these indexes will be maintained. I'm assuming, because this is for supplier catalogs, that the data is not going to change much. In that case, simply creating all the indexes you need should do the job nicely. If the data rows are going to be edited or inserted frequently in real time, then you have to factor that into your indexing: having 20 or 30 indexes might not be such a good idea, since MySQL would constantly have to update them all.
You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (basically fast flat files) is a good choice. If it changes a lot, then you should be using InnoDB so you can get row-level locking.
InnoDB would also let you define a clustered index, which is a special index that controls the order in which rows are stored on disk. So if you had one particular query that is run 99% of the time, you could create a clustered index for it and all the data would already be in the right order on disk, and would return super fast. But inserts and updates can force data to be reorganized on disk, which is not fast for lots of data. You'd never use one if the data changed at all frequently, and you might have to batch-load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in real time.
Finally, you should consider alternative means than doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use them to provide some really awesome search features like full-text and/or faceted search. A good way to describe how these work: it's like having a private Google search engine indexing your stuff. It takes time to write the code to interface with the search server, but you will most likely save that time by not having to write and wrap your head around the indexing problem and other issues I mentioned above.
If you do just go with creating all the indexes though, learn how to use the EXPLAIN command in MySQL. That will let you see what MySQL's plan for executing a query will be. You can create indexes, then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query methods has indexes supporting it, and is not falling back to scanning your entire table of data to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.
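For example, against a hypothetical composite index on (TENANT_ID, CATALOG_ID, CATEGORY_CODE, PRICE):

```sql
EXPLAIN
SELECT SUPPLIER_PART_ID, NAME, PRICE
FROM CATALOG_ITEM
WHERE TENANT_ID = 't1'
  AND CATALOG_ID = 'c1'
  AND CATEGORY_CODE = 'HW-123'
ORDER BY PRICE;
-- Look at the `key` column (which index was chosen), `rows`
-- (estimated rows examined), and `Extra` (beware "Using filesort"
-- and type=ALL, which means a full table scan).
```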