Suppose I have the following database setup (a simplified version from what I actually have):
Table: news_posting (500,000+ entries)
| --------------------------------------------------------------|
| posting_id | name | is_active | released_date | token |
| 1 | posting_1 | 1 | 2013-01-10 | 123 |
| 2 | posting_2 | 1 | 2013-01-11 | 124 |
| 3 | posting_3 | 0 | 2013-01-12 | 125 |
| --------------------------------------------------------------|
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)
Table: news_category (500 entries)
| ------------------------------|
| category_id | name |
| 1 | category_1 |
| 2 | category_2 |
| 3 | category_3 |
| ------------------------------|
PRIMARY category_id
Table: news_cat_match (1,000,000+ entries)
| ------------------------------|
| category_id | posting_id |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 2 | 2 |
| 3 | 2 |
| 1 | 3 |
| 2 | 3 |
| ------------------------------|
UNIQUE idx (category_id, posting_id)
My task is as follows. I must get a list of 50 latest news postings (at some offset) that are active, that are before today's date, and that are in one of the 20 or so categories that are specified in the request. Before I choose the 50 news postings to return, I must sort the appropriate news postings by token in descending order. My query is currently similar to the following:
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
With just one specified category_id the query does not involve a filesort and is reasonably fast because it does not have to process removal of duplicate results. However, calling EXPLAIN on the above query that has multiple category_id's returns a table that says that there is filesort to be done. And, the query is extremely slow on my data set.
Is there any way to optimize the table setup and/or the query?
I was able to get the above query to run even faster than with a single-value category list version by rewriting it as follows:
SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
SELECT ncm.posting_id
FROM news_cat_match ncm
WHERE ncm.posting_id = np.posting_id
AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
LIMIT 1
)
ORDER BY np.token DESC LIMIT 50
This now takes under a second on my data set.
The sad part is that this is even faster than if there is just one category_id specified. That's because the subset of news items is bigger than with just one category_id, so it finds the results more quickly.
Now my next question is whether this can be optimized for cases when a category has only few news that are spread in time?
The following is still pretty slow on my development machine. Although it's fast enough on the production server, I would like to optimize this if possible.
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
Does anyone have any further suggestions?
Related
I have two database tables, one as the main table and the other as the relation table.
The first table is a table of contents and the second table is a table that connects to users or groups.
Some data may also be modified in this second table.
I'm not sure about the structure and performance.
for example, we have User Id 160 which is under group id 7
So for the first, we have a post Table.
id | title | content | cover | status
------------------------------------------------
1 | first | content 1 | /img/... | 1
2 | second | content 2 | /img/... | 1
3 | another | content 3 | /img/... | 1
4 | four | content 4 | /img/... | 1
5 | five | content 5 | /img/... | 1
and for the second we have a post_rel Table:
id | group_id | user_id | post_id | title | cover | sort | status
---------------------------------------------------------------------------
1 | 7 | NULL | 1 | g title | img/... | 1 | 1
2 | NULL | 160 | 1 | u title | NULL | 2 | 1 *** selected for user_id
3 | 7 | NULL | 2 | NULL | img/... | 6 | 0
4 | NULL | 160 | 2 | NULL | img/... | 4 | 1 *** selected for user_id
5 | NULL | 160 | 3 | some | img/... | 3 | 1 *** selected for user_id
6 | 7 | NULL | 4 | NULL | img/... | 9 | 1 *** selected for group_id
7 | NULL | 165 | 5 | NULL | img/... | 5 | 0
This is the basic query we have.
select
`post_rel`.`title` as `custom_title`,
`post_rel`.`cover` as `custom_cover`,
`post_rel`.`group_id`,
`post_rel`.`user_id`,
`post`.*
from
`post`
inner join `post_rel` on `post`.`id` = `post_rel`.`post_id`
where
`post`.`status` = 1
and `post_rel`.`status` = 1
and (
`post_rel`.`user_id` = 160
or (
`post_rel`.`group_id` = 7
and `post_rel`.`post_id` not in (
select
`post_rel`.`post_id`
from
`post_rel`
where
`post_rel`.`user_id` = 160
)
)
)
order by
`post_rel`.`sort` asc
So, what you think about the basic query? Especially in the subquery, won't performance drop in a large table? Is it possible to write a better and simpler query or change the structure?
Edit: this is sqlfiddle example of my code and structure http://sqlfiddle.com/#!9/ed9d4b/1
I would change it to use "not exists" instead of "not in" and would use aliases so I could pull it off like so:
select
b.`title` as `custom_title`,
b.`cover` as `custom_cover`,
b.`group_id`,
b.`user_id`,
a.*
from
`post` a
inner join `post_rel` b on a.`id` = b.`post_id`
where
a.`status` = 1
and b.`status` = 1
and (
b.`user_id` = 160
or (
b.`group_id` = 7
and not exists (
select
'x'
from
`post_rel` c
where
c.`user_id` = 160 and c.`post_id`=b.`post_id`
)
)
)
order by
b.`sort` asc
Typically when managing users and group, there's this notion of an exception user who directly can get assigned to assets just like the whole group. This seems to be an example of that.
From a modeling-only perspective, there are 2 ways to deal with that:
Ensure that every user exists in a group and that you only assign assets to groups. For the exception user, create a group. You could even enforce that every user belongs to only one group. This way your post_rel table deals with only groups. Unfortunately, the relationship between group and user is not understood well enough to weigh in appropriately.
Driven only by the need to eliminate null values towards a good model which also reduces overhead, the other option is to use name value pairs and allows the User and Group to exist in the same field with another field besides it, denoting Group or User.
These are the SQL Fiddle:
NOT EXISTS version: http://sqlfiddle.com/#!9/1af8cf/2
NOT IN version: http://sqlfiddle.com/#!9/1af8cf/1
Some reading on nulls https://dev.mysql.com/doc/refman/5.6/en/data-size.html
Specifically:
Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.
I have two tables in a MySQL database like this:
User:
userid |userid | Username | Plan(VARCHAR) | Status |
-----------+------------+--------------+---------------+---------+
1 | 1 | John | 1,2,3 |1 |
2 | 2 | Cynthia | 1,2 |1 |
3 | 3 | Charles | 2,3,4 |1 |
Plan: (planid is primary key)
planid(INT) | Plan_Name | Cost | status |
-------------+----------------+----------+--------------+
1 | Tamil Pack | 100 | ACTIVE |
2 | English Pack | 100 | ACTIVE |
3 | SportsPack | 100 | ACTIVE |
4 | KidsPack | 100 | ACTIVE |
OUTPUT
id |userid | Username | Plan | Planname |
---+-------+----------+------------+-------------------------------------+
1 | 1 | John | 1,2,3 |Tamil Pack,English Pack,SportsPack |
2 | 2 | Cynthia | 1,2 |Tamil Pack,English Pack |
3 | 3 | Charles | 2,3,4 |English Pack,Sportspack, Kidspack |
Since plan id in Plan table is integer and the user can hold many plans, its stored as comma separated as varchar, so when i try with IN condition its not working.
SELECT * FROM plan WHERE find_in_set(plan_id,(select user.planid from user where user.userid=1))
This get me the 3 rows from plan table but i want the desired output as above.
How to do that.? any help Please
A rewrite off your query what should work is as follows..
Query
SELECT
all columns you need
, GROUP_CONCAT(Plan.Plan_Name ORDER BY Plan.planid) AS Planname
FROM
Plan
WHERE
FIND_IN_SET(Plan.plan_id,(
SELECT
User.Plan
FROM
user
WHERE User.userid = 1
)
)
GROUP BY
all columns what are in the select (NOT the GROUP_CONCAT function)
You also can use FIND_IN_SET on the ON clause off a INNER JOIN.
One problem is that the join won't ever use indexes.
Query
SELECT
all columns you need
, GROUP_CONCAT(Plan.Plan_Name ORDER BY Plan.planid) AS Planname
FROM
User
INNER JOIN
Plan
ON
FIND_IN_SET(Plan.id, User.Plan)
WHERE
User.id = 1
GROUP BY
all columns what are in the select (NOT the GROUP_CONCAT function)
Like i said in the comments you should normalize the table structures and add the table User_Plan whats holds the relations between the table User and Plan.
Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`uid` = dat.`uid`
WHERE rel.`user_coach_id` = 1
ORDER BY val.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1 | 9 | 4 | 15 | foo | bar | ...
| 1 | 7 | 3 | 12 | oof | rab | ...
| 1 | 6 | 4 | 11 | ofo | abr | ...
| 1 | 4 | 4 | 5 | foo | bra | ...
(And so on. Of course time are not integers in reality but to keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id but still have those ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to only return 1 row per user, but then the ordering doesn't work and it seems kind of unpredictable which row will be in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
To limit the number of rows in each GROUP is complicated. It is probably best done with an #variable to count, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints of how to do such.
So basically what I'm trying to do is get gained experience, ordering it, then only displaying top 5 or 50. Now note that I'm not SQL expert but I have knowledge of indexes as well as file sorting. The query that I have is filesorting most likely due to "gained_xp" not being a index-- let alone even a column as it's only temporary. There's no clear explanation how to fix this as I'm trying to contain it all in one query. I'm trying to sort nearly 13k rows with that number only expanding. I'd also need the number of rows to be dynamic as well as the time since. Any help would be appreciated. Thank you
Explain Output: Using where; Using temporary; Using filesort
Indexes include: time userid override overallXp overallLevel overallRank
The closest I've gotten to order all rows (which never ends up completing and ends in a mysql reboot) are:
SELECT FROM_UNIXTIME(time, '%Y-%m-%d'), t_u.userid as uid, MAX(t_u.OverallXP)-(SELECT overallXP FROM track_updates WHERE `userid` = t_u.userid AND `time`>'1394323200' ORDER BY id ASC LIMIT 1) as gained_xp
FROM track_updates t_u
WHERE t_u.time>'1394323200'
GROUP BY t_u.userid
If I'd run a query that selects only one user and works correctly is:
SELECT FROM_UNIXTIME(time, '%Y-%m-%d'), (t_u.overallXP)-(SELECT overallXP FROM track_updates WHERE `userid`='1' ORDER BY `id` ASC LIMIT 1) as gained_xp, t_u.userid
FROM track_updates t_u
WHERE t_u.userid='1' AND t_u.time>'1393632000'
ORDER BY t_u.time DESC
LIMIT 1
Sample Data per request:
____________________________________________________________________
| id | userid | time | overallLevel | overallXP | overallRank |
| 1 | 1 | 1394388114 | 1 | 1 | 1 |
| 2 | 1 | 1394389114 | 2 | 10 | 1 |
| 3 | 2 | 1394388114 | 1 | 1 | 2 |
| 4 | 2 | 1394389114 | 1 | 5 | 2 |
| 5 | 2 | 1394390114 | 2 | 7 | 2 |
Output (most recent time; gained xp current-initial; ordered by gain_xp):
____________________________________________
| id | time | userid | gained_xp |
| 1 | March 9th 2014 | 1 | 9 |
| 2 | March 9th 2014 | 2 | 6 |
I have a data table that I use to do some calculations. The resulting data set after calculations looks like:
+------------+-----------+------+----------+
| id_process | id_region | type | result |
+------------+-----------+------+----------+
| 1 | 4 | 1 | 65.2174 |
| 1 | 5 | 1 | 78.7419 |
| 1 | 6 | 1 | 95.2308 |
| 1 | 4 | 1 | 25.0000 |
| 1 | 7 | 1 | 100.0000 |
+------------+-----------+------+----------+
By other hand I have other table that contains a set of ranges that are used to classify the calculations results. The range tables looks like:
+----------+--------------+---------+
| id_level | start | end | status |
+----------+--------------+---------+
| 1 | 0 | 75 | Danger |
| 2 | 76 | 90 | Alert |
| 3 | 91 | 100 | Good |
+----------+--------------+---------+
I need to do a query that add the corresponding 'status' column to each value when do calculations. Currently, I can do that adding the following field to calculation query:
select
...,
...,
[math formula] as result,
(select status
from ranges r
where result between r.start and r.end) status
from ...
where ...
It works ok. But when I have a lot of rows (more than 200K), calculation query become slow.
My question is: there is some way to find that 'status' value without do that subquery?
Some one have worked on something similar before?
Thanks
Yes, you are looking for a subquery and join:
select s.*, r.status
from (select s.*
from <your query here>
) s left outer join
ranges r
on s.result between r.start and r.end
Explicit joins often optimize better than nested select. In this case, though, the ranges table seems pretty small, so this may not be the performance issue.