This is a theoretical question. Sorry, but I don't have working table data to show, so I'll improvise with a theoretical example.
Using MySQL/MariaDB. I have indexes on all relevant fields.
I have a system whose historical design includes a ProductType table, something like:
ID=1, Description="Milk"
ID=2, Description="Bread"
ID=3, Description="Salt"
ID=4, Description="Sugar"
and so on.
There are features in the system that rely on the ProductType ID, and the Description is also used in different places, such as for defining different properties of the product type.
There is also a Product table, with fields such as:
ID, ProductTypeID, Name
The Product:Name doesn't include the product type description, so a "Milk bottle 1l" will have an entry such as:
ID=101, ProductTypeID=1, Name="bottle 1l"
and "Sugar pack 1kg" will be:
ID=102, ProductTypeID=4, Name="pack 1kg"
You get the idea...
The system combines the ProductType:Description and the Product:Name to show full product names to the users. This creates systematic naming for all the products, so there is no way to define a product with a name such as "1l bottle of milk". I know that in English that might be hard to swallow, but it works great in my local language.
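Roughly, the display name is built with something like this (a sketch; the real query and column names may differ):

SELECT
    p.ID,
    CONCAT(pt.Description, ' ', p.Name) AS FullProductName
FROM Product p
JOIN ProductType pt ON pt.ID = p.ProductTypeID;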
Years passed, and the database grew to millions of products.
Since a full-text index requires all searched data to be in one table, I had to store the ProductType:Description inside the Product table, in a string field I added that holds various keywords related to the product, so the full-text search can find anything related to the product (type, name, barcode, SKU, etc.).
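For context, the keyword field and the search look roughly like this (a sketch; the column name Keywords is just my placeholder here):

ALTER TABLE Product ADD FULLTEXT INDEX ft_Keywords (Keywords);

SELECT ID, Name
FROM Product
WHERE MATCH(Keywords) AGAINST('milk 1l' IN NATURAL LANGUAGE MODE);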
Now I'm trying to eliminate the full table scans, and it makes me think that the current design might not be optimal and that I'll have to redesign and store the full product name (type + name) in the same table...
In order to show the products in the proper order, there's an ORDER BY TypeDescription ASC, ProductName ASC after the ProductType table is joined in the Product select queries.
From my research I see that the database can't use indexes when the order is done on fields from different tables, so it's doing full table scan to get to the right entries.
During pagination, there's an ORDER BY and a LIMIT 50000,100 in the query, and that takes a lot of time.
There are sections with lots of products, so that ordering and limiting cause very long full table scans.
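To make it concrete, the slow listing query has roughly this shape (a sketch; actual field and table names differ):

SELECT p.ID, pt.Description, p.Name
FROM Product p
JOIN ProductType pt ON pt.ID = p.ProductTypeID
ORDER BY pt.Description ASC, p.Name ASC   -- sort spans columns from two tables, so no single index can satisfy it
LIMIT 50000, 100;                         -- the whole sorted result is built before these 100 rows are returned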
How would you handle that situation?
Change the design and store all query-related data in the Product table? That feels like duplication and not a natural solution.
Or maybe there's another way to solve it?
Will an index on a VARCHAR field (product name) make the ORDER BY efficient? Or will the database still do a full table scan?
This is my first question here. I couldn't find answers for similar cases.
Thanks!
I've tried playing with the queries to see if ordering by an indexed VARCHAR field will work, but EXPLAIN SELECT still shows that the query didn't use the index for the sort and just did a plain WHERE scan :(
UPDATE
Trying to add some more data...
The situation is a bit more complicated, and after digging a bit more it looks like the initial question was not heading in the right direction.
I removed the product type from the queries and still have the slow query.
I feel like it's a chicken and egg situation...
I have a table that maps product IDs to section IDs:
CREATE TABLE `Product2Section` (
`SectionId` int(10) unsigned NOT NULL,
`ProductId` int(10) unsigned NOT NULL,
KEY `idx_ProductId` (`ProductId`),
KEY `idx_SectionId` (`SectionId`),
KEY `idx_ProductId_SectionId` (`ProductId`,`SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC
The query (after stripping all fields not relevant to the question):
SELECT DISTINCT
DRIVER.ProductId AS ID,
p.*
FROM
Product2Section AS DRIVER
LEFT JOIN Product p ON
(p.ID = DRIVER.ProductId)
WHERE
DRIVER.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,100;
explain shows:
id | select_type | table  | type   | possible_keys  | key                      | key_len | ref                        | rows   | Extra
1  | SIMPLE      | DRIVER | index  | idx_SectionId  | idx_ProductId_SectionId  | 8       | NULL                       | 589966 | Using where; Using index; Using temporary; Using filesort
1  | SIMPLE      | p      | eq_ref | PRIMARY,idx_ID | PRIMARY                  | 4       | 4project.DRIVER.ProductId  | 1      | Using where
I've tried selecting from the Product table and joining Product2Section in order to filter the results, but I get the same result:
SELECT DISTINCT
p.ID,
p.ProductName
FROM
Product p
LEFT JOIN
Product2Section p2s ON (p.ID=p2s.ProductId)
WHERE
p2s.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,
100;
explain:
id | select_type | table | type   | possible_keys                                        | key                      | key_len | ref                     | rows   | Extra
1  | SIMPLE      | p2s   | index  | idx_ProductId,idx_SectionId,idx_ProductId_SectionId | idx_ProductId_SectionId  | 8       | NULL                    | 589966 | Using where; Using index; Using temporary; Using filesort
1  | SIMPLE      | p     | eq_ref | PRIMARY,idx_ID                                       | PRIMARY                  | 4       | 4project.p2s.ProductId  | 1      | Using where
I don't see a way out of this situation.
The two single-column indices on Product2Section serve no purpose. You should change your junction table to:
CREATE TABLE `Product2Section` (
`SectionId` int unsigned NOT NULL,
`ProductId` int unsigned NOT NULL,
PRIMARY KEY (`SectionId`, `ProductId`),
KEY `idx_ProductId_SectionId` (`ProductId`, `SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
There are other queries in the system that probably use the single-field indexes.
The single-column indices cannot be used for anything that the two composite indices cannot be used for. They just waste space and cause unnecessary overhead on insert and for the optimizer. Setting one of the composite indices as the PRIMARY KEY stops InnoDB from having to create its own internal row ID, which just wastes space. It also adds the uniqueness constraint which is currently missing from your table.
From the docs:
Accessing a row through the clustered index is fast because the index search leads directly to the page that contains the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
This is not significant for a "simple" junction table, as both columns should be stored in both indices, therefore no further read is required.
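One way to migrate the existing table to that layout in place (a sketch; it assumes there are no duplicate (SectionId, ProductId) pairs, otherwise adding the primary key will fail):

ALTER TABLE Product2Section
    DROP INDEX idx_ProductId,
    DROP INDEX idx_SectionId,
    ADD PRIMARY KEY (SectionId, ProductId);
-- idx_ProductId_SectionId is kept to cover lookups that start from ProductId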
You said:
that didn't really bother me since there was no real performance hit
You may not see the difference when running an individual query with no contention but the difference in a highly contended production environment can be huge, due to the amount of effort required.
Do you really need to accommodate 4,294,967,295 (int unsigned) sections? Perhaps the 65,535 provided by smallint unsigned would be enough?
You said:
Might change it in the future. Don't think it will change the performance somehow
Changing SectionId to smallint will reduce each index entry from 8 to 6 bytes. That's a 25% reduction in size. Smaller is faster.
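If you do shrink it, something like this would be needed on every table that stores a section id, so the join columns keep identical types (a sketch):

ALTER TABLE Product2Section
    MODIFY SectionId smallint unsigned NOT NULL;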
Why are you using LEFT JOIN? The fact that you are happy to reverse the order of the tables in the query suggests it should be an INNER JOIN.
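For example, the same query with an inner join might look like this (a sketch; the list of section ids is abbreviated):

SELECT DISTINCT p.ID, p.ProductName
FROM Product2Section p2s
INNER JOIN Product p ON p.ID = p2s.ProductId
WHERE p2s.SectionId IN (544, 545, 546 /* ... */)
ORDER BY p.ProductName ASC
LIMIT 500900, 100;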
Do you have your buffer pool configured appropriately, or is it set to defaults? Please run ANALYZE TABLE Product2Section; and then provide the output from:
SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH + INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_NAME = 'Product2Section';
And:
SELECT ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/POW(1024, 3), 2)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';
And:
SHOW VARIABLES LIKE 'innodb_buffer%';
I need to retrieve columns from two tables, and I have used an INNER JOIN. But it's taking a lot of time when loading the page. Is there any better and faster way to achieve the same?
SELECT P.Col1, P.Col2, P.Col3, P.Col4, P.Col5, C.Col1, C.Col2, C.Col3
FROM Pyalers P
INNER JOIN Customers C ON C.Col1 = P.Col1
WHERE P.Col2 = 5
Thanks in Advance.
Without knowing your DDL, there's no way to say.
But conceptually this is OK; just be sure you have the proper indexes set.
For instance: (is your table name really 'Pyalers'? Assuming 'players')
CREATE INDEX idx_players ON `players` (col1);
CREATE INDEX idx_customers ON `customers` (col1);
Use the columns you need for joining the two tables.
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
You're doing it the right way, but if you don't have indexes on the correct columns of your tables, it's not going to be very fast for tables of any size. Do Pyalers.col1 and Customers.col1 both have indexes on them?
Show us how the tables are defined.
Be sure your table has the needed indexes... as a rule of thumb, every field used for searching (WHERE) or joining (INNER JOIN, LEFT JOIN, RIGHT JOIN) should be indexed.
Example: If you are creating a table, you can add your indexes at that time (notice that your tables should always have a primary key):
CREATE TABLE myTable (
myId int unsigned not null,
someField varchar(50),
primary key (myId),
index someIdx(someField)
);
If your table already exists, and you want to add indexes, you need to use the ALTER statement:
ALTER TABLE myTable
ADD INDEX someIdx(someField),
ADD PRIMARY KEY (myId);
Rules:
To define an index you must provide a unique name for it and specify the fields included in the index: INDEX myIndex(field1, field2, ...)
There are different types of indexes: PRIMARY KEY is used for primary keys (that's obvious, huh?); INDEX is an 'ordinary index', just used to speed up search and join operations; UNIQUE INDEX is an index that prevents duplicate values.
Recommendations:
Whenever you can, index all numeric and date fields that are relevant (ids, birth date, etc.). Avoid creating indexes on fields that contain 'double' values.
Don't abuse indexes, because overuse can create very large index files.
Tips:
If you want to see how your query will be executed, you can use the EXPLAIN statement:
EXPLAIN SELECT a.*, b.* FROM a INNER JOIN b ON a.myId = b.otherId
This statement will show you the execution plan of the query. If in the last column you see 'Using filesort' or 'Using temporary', you may (just may) need additional indexes (note that if you use GROUP BY you will almost always get the 'Using temporary' message).
Hope this helps.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table, I always need information about a particular user. So I use, for example, WHERE user_name='Albert Gates'. So every time, the MySQL server needs to analyze 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones, one per user?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it possible to find a record in a very small amount of time.
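For instance, assuming the table is called user_properties (a sketch; the composite index also serves plain user_name lookups via its leftmost column):

ALTER TABLE user_properties
    ADD INDEX idx_user_property (user_name, user_property);

SELECT user_property, value_of_property
FROM user_properties
WHERE user_name = 'Albert Gates';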
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and referenced by an integer foreign key in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (500K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need constraints based on other fields? E.g. start and end dates?
Why not simply have the properties as fields rather than some many-to-many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
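In other words, something like this (a sketch; the property columns are made up for illustration):

create table users
(
    user_id    int unsigned not null auto_increment primary key,
    username   varchar(32) not null,
    birth_date date null,        -- each property is an ordinary, typed column
    gender     varchar(10) null
)
engine=innodb;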
One way I regularly see database performance get totally crippled is by having a generic Id, PropertyType, PropertyName, PropertyValue table.
This is really lazy and exceptionally flexible, but it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
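For clarity, the generic shape being described is roughly this (a sketch; names are made up):

create table entity_properties
(
    id             int unsigned not null auto_increment primary key,
    property_type  varchar(50),
    property_name  varchar(100),
    property_value varchar(255)   -- everything stored as text, hence the casting on every use
)
engine=innodb;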
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.