If I have a table which contains previous product purchases of a user like so:
ID | ProductName | UserID
1 | Grapes | 3455
2 | Water | 1944
3 | Bread | 3455
4 | Milk | 3455
...
As you can see in the example above, user 3455 has bought grapes, bread, and milk. If I wanted to retrieve all the products a user has bought, I would have to find each of the records that has UserID 3455.
Would storing all the products from user 3455 together speed up searching for these records, much like defragmenting a hard drive? And if so, would the process of deleting the old records and re-adding them to the end of the database be a waste of processing power?
The fastest way to get the information you want is to have an index on t(UserId, ProductName). This is a covering index for the following query:
select ProductName
from t
where UserId = 3455;
This means that all the columns needed by the query are in the index, and in the proper order. So, the query optimizer can resolve the query using only the index. This should be quite fast.
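For example, such an index could be created like this (a minimal sketch; the index name is arbitrary):
CREATE INDEX idx_user_product ON t (UserId, ProductName);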
Suppose I had these 4 tables, consisting of various foreign key relationships (e.g. an area must belong to a location, a shop must belong to an area, an item price must belong to a shop, etc.):
----------------------------------
|Location Name | Location ID |
| | |
----------------------------------
-------------------------------------------------
|Area Name | Area ID | Location ID |
| | | |
-------------------------------------------------
-------------------------------------------------
| Shop Name | Shop ID | Area ID |
| | | |
-------------------------------------------------
----------------------------------
| Item Price | Shop ID |
| | |
----------------------------------
And I wanted the sum of 'Item Price' for a specific location ID, i.e. the total of the item prices across all areas and shops for location ID 'x'.
One way I found to do this is to join all the tables for one location and get the amount, e.g.:
SELECT SUM(item_price)
FROM items
LEFT JOIN shops ON items.shop_id = shops.shop_id
LEFT JOIN areas ON shops.area_id = areas.area_id
LEFT JOIN locations ON areas.location_id = locations.location_id
WHERE locations.location_id = 4;
However, is this the best way to do it, since it involves retrieving the full tree of the data and then filtering it? Would there be a better way if there are a million rows, or is this the best approach?
You can try a subquery (note the inner joins here, so that items from shops outside the chosen location are excluded from the sum):
SELECT SUM(item_price)
FROM items
JOIN shops ON items.shop_id = shops.shop_id
JOIN (SELECT area_id FROM areas WHERE location_id = 4) AS ar
    ON shops.area_id = ar.area_id;
If you define the right indexes, then the query does not read all the millions of rows for each table.
Think about a telephone book and how you look up a name. Do you read the whole book cover to cover looking for the name? No, you take advantage of the fact that the book is sorted by lastname, firstname, and you go directly to the name. It takes only a few tries to find the right page. In fact, on average it takes about log2(N) tries for a book with N names in it.
The same kind of search happens for each join. If you have indexes, each comparison expression uses a similar lookup to find matching rows in the joined table. It's pretty fast.
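As a hedged sketch, the indexes supporting the joins in the query above might look like this (assuming the underscore column names used there):
CREATE INDEX idx_items_shop ON items (shop_id);
CREATE INDEX idx_shops_area ON shops (area_id);
CREATE INDEX idx_areas_location ON areas (location_id);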
But if that's not fast enough, you can also use denormalization, which in this case would mean storing all the data in one wide table with many columns.
----------------------------------------------------------------------
|Location Name | Area Name | Shop Name | Item Name | Item Price |
| | | | | |
----------------------------------------------------------------------
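A minimal sketch of how such a denormalized table might be declared (column names and types are assumptions):
CREATE TABLE items_denormalized (
    location_name VARCHAR(100) NOT NULL,  -- repeated on every row for the shop's location
    area_name     VARCHAR(100) NOT NULL,  -- repeated on every row for the shop's area
    shop_name     VARCHAR(100) NOT NULL,  -- repeated on every row for the shop
    item_name     VARCHAR(100) NOT NULL,
    item_price    DECIMAL(10,2) NOT NULL
);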
The advantage of denormalization is that it avoids certain joins. It stores the row just like one of the rows you'd get from the result set of your example joined SQL query. You just read one row from the table and you have all the information you need.
The disadvantage of denormalization is the redundant storage of data. Presumably each shop has many items. But each item is stored on a row of its own, which means that row has to repeat the names of the shop, area, and location.
By storing those data repeatedly, you create an opportunity for "anomalies": for example, you change the name of a given shop, but mistakenly change it only on a few rows instead of everywhere the shop name appears. Now you have two names for the same shop, and someone else looking at the database has no way of knowing which one is correct.
In general, maintaining multiple normalized tables is preferable, because each "fact" is stored exactly once, so there can be no anomalies.
Creating indexes to help your queries is sufficient for most applications.
You might like my presentation, How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
I have two MySQL tables like below:
table_category
-----------------
id | name | type
1 | A | Cloth
2 | B | Fashion
3 | C | Electronics
4 | D | Electronics
table_product
------------------
id | cat_cloth | cat_fashion | cat_electronics
1 | 1 | 2 | 3
2 | NULL | 2 | 4
Here cat_cloth, cat_fashion, and cat_electronics are IDs from table_category.
It would be better to have another table for the category type, but I need a quick solution for now.
I want to get a list of categories with the total number of products. I wrote the following query:
SELECT table_category.*, table_product.id, COUNT(table_product.id) as count
FROM table_category
LEFT JOIN table_product ON table_category.id = table_product.cat_cloth
OR table_category.id = table_product.cat_fashion
OR table_category.id = table_product.cat_electronics
GROUP BY table_product.id
ORDER BY table_product.id ASC
Question: The SQL I wrote works, but I have more than 14K categories and 50K products, and it runs very slowly. I added indexes for the cat_* IDs, but there was no improvement. My question is: how can I optimize this query?
I found the query takes 3-4 minutes to process the volume of data I mentioned. I want to reduce the execution time.
In my experience, every "OR" in either the "ON" or the "WHERE" part is very expensive. It may sound crude, but I would recommend making 3 separate small selects combined together with UNION ALL.
We do this for similar problems in both MySQL and PostgreSQL, and in some cases, when we got "resources exceeded", we had to do it for BigQuery as well. It means more work for you, but it certainly works, and it is much quicker at producing results than many "OR"s.
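A hedged sketch of that rewrite for the query above, assuming the table_product schema from the question; each UNION ALL branch can use its own index on the corresponding cat_* column:
SELECT c.id, c.name, c.type, COUNT(p.id) AS count
FROM table_category c
LEFT JOIN (
    -- one branch per cat_* column; UNION ALL replaces the ORs in the join condition
    SELECT id, cat_cloth AS cat_id FROM table_product WHERE cat_cloth IS NOT NULL
    UNION ALL
    SELECT id, cat_fashion FROM table_product WHERE cat_fashion IS NOT NULL
    UNION ALL
    SELECT id, cat_electronics FROM table_product WHERE cat_electronics IS NOT NULL
) p ON c.id = p.cat_id
GROUP BY c.id, c.name, c.type
ORDER BY c.id;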
So, I have 2 tables
On the first table, lets call it products, lets say I have
product_id, company_id (this is a FK), product_name.
On the second table, lets call it deals, I have
deal_id, company_id (same one as the first table), deal_title.
I need to add products to the deals. If I added a product_id field to the deals table, I would have multiple rows and IDs for each deal, which is completely wrong. What is the correct way to do it?
You should add a table to manage the relation between products and deals,
eg:
table products_deal
product_id
deal_id
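A minimal sketch of that junction table (column types and constraints are assumptions):
CREATE TABLE products_deal (
    product_id INT NOT NULL,
    deal_id    INT NOT NULL,
    PRIMARY KEY (product_id, deal_id),  -- each product appears at most once per deal
    FOREIGN KEY (product_id) REFERENCES products (product_id),
    FOREIGN KEY (deal_id) REFERENCES deals (deal_id)
);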
What you want is a pivot table between the two tables you have, with a structure like:
|-deal_id----|-product_id----|
| 10 | 23 |
| 10 | 24 |
| 10 | 32 |
| ...
| ...
If you need to find all products associated with deal #10, you can just use a query like SELECT * FROM pivot_table WHERE deal_id = 10
I'll try to explain my situation: I'm trying to create a search engine for products on my website, so when the user searches for a product I need to show similar ones. Here's an example.
User searches:
assassins creed OR assassinscreed OR aSsAssIn's CreeD, assuming there is no letter/number misspelling (those 3 queries should produce the same result)
Expected results:
Assassin's Creed AND Assassin's Creed: Unity AND Assassin's Creed: Special Edition
What have I tried so far
I have created a MySQL field for the search engine which contains a parsed name of the product (Assassin's Creed: Unity -> assassinscreedunity)
I parse the search query
I search using MySQL's INSTR()
My problem
I'm fine with using this, but I have heard it can be slow when the number of rows increases. I have created a full-text index on my table, but I don't think it would help, so I need another solution.
First of all, you should keep track of performance issues in your queries more precisely than "heard it can be slow" and "think it would help". One starting point may be the Slow Query Log.
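As a hedged example, the Slow Query Log can be enabled at runtime in MySQL (the threshold below is only an illustration):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log statements that take longer than 1 second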
If you have a table which contains the same parsed name in more than one row, consider normalizing your database. In the specific case, store unique parsed names in one table, and only the id of the corresponding parsed name in the table you described in your question. This way, you only need to check the smaller table with unique names and can then quickly find all matching entries in the main table by id.
Example:
Consider the following table with your structure
id | product_name | rating
-----------------------------------
1 | assassinscreedunity | 5
2 | assassinscreedunity | 2
3 | monkeyisland | 3
4 | monkeyisland | 5
5 | assassinscreedunity | 4
6 | monkeyisland | 4
you would have to scan all six entries to find relevant rows.
In contrast, consider two tables like this:
id | p_id | rating
--------------------
1 | 1 | 5
2 | 1 | 2
3 | 2 | 3
4 | 2 | 5
5 | 1 | 4
6 | 2 | 4
id | name
--------------------------
1 | assassinscreedunity
2 | monkeyisland
In this case, you only have to scan two entries (compared to six) and can then efficiently look up relevant rows using the integer id.
To further enhance the performance, you could extend the concept of a parsed name and use hashes. For example, you could calculate the SHA1 hash of your parsed name, which is a 160-bit value. You can find entries in your database for this value very efficiently. To match substrings, you can add them to the second table as well. Since the hash only needs to be computed once, you can still use the database to match by a fixed-size key. Another thing to look at might be fuzzy hashing.
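A hedged sketch of the hashing idea, assuming a names table like the second one above (MySQL's built-in SHA1() returns the hash as a 40-character hex string):
ALTER TABLE names
    ADD COLUMN name_sha1 CHAR(40) NOT NULL,
    ADD INDEX idx_name_sha1 (name_sha1);
-- Populate the hash column once, then look search terms up by hash:
UPDATE names SET name_sha1 = SHA1(name);
SELECT id FROM names WHERE name_sha1 = SHA1('assassinscreedunity');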
In addition, you should read up on the Rabin–Karp algorithm or string searching in general.
I have a large database with two tables: stat and total.
The example of the relation is the following:
STAT:
| ID | total event |
+--------+--------------+
| 7 | 2 |
| 8 | 1 |
TOTAL:
|ID | Event |
+---+--------------+
| 7 | "hello" |
| 7 | "everybody" |
| 8 | "hi" |
This is a very simplified version; also consider that the STAT table could have 500K records, and for each STAT row there can be about 200 TOTAL rows.
Currently, if I run a simple SELECT query on table TOTAL, the system is terribly slow.
Could anyone help me with some advice on the creation of the TOTAL table? Is it possible to tell MySQL that the ID column is already sorted, so that there is no reason to scan all the rows when, for example, ID=7?
Add INDEX(ID) to both of your tables, if you have not already.
SELECT COUNT(*) FROM TOTAL WHERE ID=7 -> if ID is indexed, this will be fast.
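For example, a minimal sketch assuming the table names above:
ALTER TABLE STAT ADD INDEX (ID);
ALTER TABLE TOTAL ADD INDEX (ID);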
You can add an index, and furthermore you can partition your table.
As per #ypercube's comment, tables are not stored in a sorted state, so one cannot "tell" this to the database. However, you can add an index to tables to make them faster to search.
One important thing to check - it looks like TOTAL.ID is intended as a foreign key - if so, the table TOTAL should have a primary key called ID. Rename the existing column of that name to STAT_ID instead, so it is obvious what it is a foreign key for. Then add an index on STAT_ID.
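A hedged sketch of that restructuring (column types are assumptions):
ALTER TABLE TOTAL CHANGE COLUMN ID STAT_ID INT NOT NULL;    -- rename the foreign key column
ALTER TABLE TOTAL
    ADD COLUMN ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- new surrogate primary key
    ADD INDEX idx_stat_id (STAT_ID);                        -- index the foreign key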
Lastly, as a point of style, I recommend that you make your table and column names case-insensitive, and write them in lower-case. It makes it easier to read SQL when keywords are in upper case, and database objects are in lower.