I'm currently doing this to get some data from our table:
SELECT DISTINCT(CategoryID),Distance FROM glinks_DistancesForTowns WHERE LinkID = $linkID ORDER BY Distance LIMIT 20
I'm iterating over that for every link ID we have (50k odd). Then I'm processing the rows in Perl with:
my @cats;
while ( my ($catid, $distance) = $sth->fetchrow ) {
    push @cats, $catid;   # collect each category ID ($distance is also available here)
}
I'm trying to see if there is a better way to do this with a sub-query in MySQL, versus doing 50k smaller queries (i.e. one per link).
The basic structure of the table is:
glinks_Links
ID
glinks_DistancesForTowns
LinkID
CategoryID
Distance
I'm sure there must be a simple way to do it - but I'm just not seeing it.
As requested, here is a dump of the table structure. It's actually more complex than that, but the other fields just hold values, so I've taken those bits out to give a cleaner overview of the structure:
CREATE TABLE `glinks_DistancesForTowns` (
`LinkID` int(11) DEFAULT NULL,
`CategoryID` int(11) DEFAULT NULL,
`Distance` float DEFAULT NULL,
`isPaid` int(11) DEFAULT NULL,
KEY `LinkID` (`LinkID`),
KEY `CategoryID` (`CategoryID`,`isPaid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
CREATE TABLE `glinks_Links` (
`ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`Title` varchar(100) NOT NULL DEFAULT '',
`URL` varchar(255) NOT NULL DEFAULT 'http://',
PRIMARY KEY (`ID`),
KEY `booking_hotel_id_fk` (`booking_hotel_id_fk`)
) ENGINE=MyISAM AUTO_INCREMENT=617547 DEFAULT CHARSET=latin1
This is the kind of thing I'm hoping for:
SELECT glinks_Links.ID FROM glinks_Links as links, glinks_DistancesForTowns as distance (
SELECT DISTINCT(CategoryID),Distance FROM distance WHERE distance.LinkID = links.ID ORDER BY Distance LIMIT 20
)
But obviously that doesn't work ;)
It sounds like you want the top 20 towns by distance for each link, right?
MySQL 8.0 supports window functions, and this would be the way to write the query:
WITH cte AS (
SELECT l.ID, d.CategoryID, d.Distance,
       ROW_NUMBER() OVER (PARTITION BY l.ID ORDER BY d.Distance) AS rownum
FROM glinks_Links AS l
JOIN glinks_DistancesForTowns AS d ON d.LinkID = l.ID
) SELECT ID, CategoryID, Distance FROM cte WHERE rownum <= 20;
Versions older than 8.0 do not support these features of SQL, so you have to get creative with user-defined variables or self-joins. See for example my answer to How to SELECT the newest four items per category?
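For reference, here is a rough sketch of the user-variable approach for pre-8.0 servers. Treat it as an assumption-laden sketch rather than a drop-in solution: MySQL makes no strict guarantees about the evaluation order of user variables within a SELECT, so verify the results against your data:
SET @prev := NULL, @rn := 0;
SELECT LinkID, CategoryID, Distance
FROM (
    SELECT d.LinkID, d.CategoryID, d.Distance,
           @rn := IF(@prev = d.LinkID, @rn + 1, 1) AS rownum,  -- restart the counter on each new LinkID
           @prev := d.LinkID
    FROM glinks_DistancesForTowns AS d
    ORDER BY d.LinkID, d.Distance
) AS ranked
WHERE rownum <= 20;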
I am working on a MySQL 5.6 database, and I have a table looking something like this:
CREATE TABLE `items` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`account_id` int(11) NOT NULL,
`node_type_id` int(11) NOT NULL,
`property_native_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`parent_item_id` bigint(20) DEFAULT NULL,
`external_timestamp` datetime DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index_items_on_acct_node_prop` (`account_id`,`node_type_id`,`property_native_id`),
KEY `index_items_on_account_id_and_external_timestamp` (`account_id`,`external_timestamp`),
KEY `index_items_on_account_id_and_created_at` (`account_id`,`created_at`),
KEY `parent_item_external_timestamp_idx` (`parent_item_id`,`external_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=194417315 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I am trying to optimize a query doing this:
SELECT *
FROM items
WHERE parent_item_id = ?
AND external_timestamp < ( SELECT external_timestamp
                           FROM items
                           WHERE id = ? )
ORDER BY external_timestamp LIMIT 5
Currently, there is an index on parent_item_id, so when I run this query with EXPLAIN, I get an "extra" of "Using where; Using filesort"
When I modify the index to be (parent_item_id, external_timestamp), then the EXPLAIN's "extra" becomes "Using index condition"
The problem is that the EXPLAIN's "rows" field is still the same (which is usually a couple thousand rows, but it could be millions in some use-cases).
I know that I can add something like AND external_timestamp > (1 week ago), but I'd really like the number of examined rows to be just the LIMIT count, so 5 in this case.
Is it possible to instruct the database to lock onto a row and then get the 5 rows before it on that (parent_item_id, external_timestamp) index?
(I'm unclear on what you are trying to do. Perhaps you should provide some sample input and output.) See if this works for you:
SELECT i.*
FROM items AS i
WHERE i.parent_item_id = ?
AND i.external_timestamp < ( SELECT external_timestamp
FROM items
WHERE id = ? )
ORDER BY i.external_timestamp
LIMIT 5
Your existing INDEX(parent_item_id, external_timestamp) will probably be used; see EXPLAIN SELECT ....
If id was supposed to match in all 5 rows, then the subquery is not needed.
SELECT items.*
FROM items
CROSS JOIN ( SELECT external_timestamp
FROM items
WHERE id = ? ) subquery
WHERE items.parent_item_id = ?
AND items.external_timestamp < subquery.external_timestamp
ORDER BY external_timestamp LIMIT 5
id is PK, hence the subquery will return only one row (or none).
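If the intent is the 5 rows immediately preceding the anchor row (rather than the 5 oldest matches), a small variation may help. This is a sketch that assumes the (parent_item_id, external_timestamp) index from the posted DDL; ordering descending lets MySQL walk that index backwards from the anchor and stop after 5 rows:
SELECT i.*
FROM items AS i
WHERE i.parent_item_id = ?
  AND i.external_timestamp < ( SELECT external_timestamp
                               FROM items
                               WHERE id = ? )
ORDER BY i.external_timestamp DESC  -- newest first: the 5 rows just before the anchor
LIMIT 5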
I have a table defined as follows:
CREATE TABLE `book` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`provider_id` int(10) unsigned DEFAULT '0',
`source_id` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` longtext COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `provider` (`provider_id`,`source_id`),
KEY `idx_source_id` (`source_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1605425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
when there are about 10 concurrent reads with the following SQL:
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '1037122800') ORDER BY `book`.`id` ASC LIMIT 1
it becomes slow, taking about 100 ms.
However, if I change it to
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '221630001') LIMIT 1
then it is back to normal, taking only a few ms.
I don't understand why adding ORDER BY id makes the query so much slower. Could anyone explain?
Try listing the desired columns (SELECT col1, col2, ...) instead of *, or refer to this:
Why is my SQL Server ORDER BY slow despite the ordered column being indexed?
I'm not a MySQL expert and am not able to perform a detailed analysis, but my guess would be that because you are providing values for the UNIQUE KEY in the WHERE clause, the engine can go and fetch that row directly using that index.
However, when you ask it to ORDER BY the id column, which is the PRIMARY KEY, that changes the access path. The engine now guesses that since it has an index on id, and you want to order by id, it is better to fetch the data in PK order, which avoids a sort. In this case, though, it leads to a slower result, as it has to compare every row against the criteria (effectively a full scan).
Note that this is just conjecture. You would need to EXPLAIN both statements to see what is going on.
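One way to test that conjecture, sketched under the assumption that the posted DDL is current (the index name `provider` comes from it), is to hint the optimizer away from the PK scan and compare timings:
SELECT * FROM `book` FORCE INDEX (`provider`)
WHERE provider_id = '1' AND source_id = '1037122800'
ORDER BY `book`.`id` ASC LIMIT 1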
A table with a few million rows, something like this:
CREATE TABLE `my_table` (
`CONTVISITID` bigint(20) NOT NULL AUTO_INCREMENT,
`NODE_ID` bigint(20) DEFAULT NULL,
`CONT_ID` bigint(20) DEFAULT NULL,
`NODE_NAME` varchar(50) DEFAULT NULL,
`CONT_NAME` varchar(100) DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`HITS` bigint(20) DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CLIENT_TYPE` varchar(20) DEFAULT NULL,
`TYPE` bigint(1) DEFAULT NULL,
`PLAY_TIMES` bigint(20) DEFAULT NULL,
`FIRST_PUBLISH_TIME` bigint(20) DEFAULT NULL,
PRIMARY KEY (`CONTVISITID`),
KEY `cont_visit_contid` (`CONT_ID`),
KEY `cont_visit_createtime` (`CREATE_TIME`),
KEY `cont_visit_publishtime` (`FIRST_PUBLISH_TIME`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=57676834 DEFAULT CHARSET=utf8
I had a query that I have managed to optimize into the following, starting from a flat SELECT:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits,type,first_publish_time
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
Can this be further optimized?
Edit:
I started with a flat SELECT, as mentioned before. By a flat SELECT I mean one without a nested subquery like my current version has. The single SELECT that someone responded with is twice as slow, so it is not viable in my case.
Edit 2: A DBA friend suggested changing the query to this:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
As I do not need the extra fields (type, first_publish_time), the temporary table is smaller, which makes the query faster: about 1/4 of the total time of the fastest version I had. He also suggested adding a composite index on (create_time, cont_id, hits). He says I will get really good performance with it, but I have not done that yet, as this is a production DB and the ALTER might affect replication. I will post results once done.
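For reference, the index he suggested would presumably be written like this (the index name is illustrative):
ALTER TABLE my_table ADD INDEX create_cont_hits_idx (CREATE_TIME, CONT_ID, HITS);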
INDEX(type, first_publish_time)
INDEX(type, create_time)
Then do
SELECT cont_id, SUM(hits) AS tot_hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time > 1398310263000
AND type = 1
group by cont_id
order by tot_hits DESC
LIMIT 10;
Start the index with any = filters (type, in this case); then you get one chance to use a range.
The reason for 2 indexes: the Optimizer will look at the statistics and decide which looks better based on the values given.
Consider shrinking the BIGINTs (8 bytes) to some smaller INT type. Saving space will help speed, especially if the table is too big to be cached.
For further discussion, please provide EXPLAIN SELECT ...;.
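As DDL, those two suggested indexes might look like this (the index names are illustrative):
ALTER TABLE my_table
    ADD INDEX type_pub_idx (`TYPE`, FIRST_PUBLISH_TIME),
    ADD INDEX type_create_idx (`TYPE`, CREATE_TIME);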
I have a complicated issue, but rather than go into the specifics I have simplified it to the following.
Let's say we are trying to build a system where users can apply for priority levels on various services on a per-zip-code basis. This system would have four tables, like so...
CREATE TABLE `zip_code` (
`zip` varchar(7) NOT NULL DEFAULT '',
`lat` float NOT NULL DEFAULT '0',
`long` float NOT NULL DEFAULT '0',
PRIMARY KEY (`zip`,`lat`,`long`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `user` (
`user_id` int(10) NOT NULL AUTO_INCREMENT
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `service` (
`service_id` int(10) NOT NULL AUTO_INCREMENT
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `service_priority` (
`user_id` int(10) NOT NULL,
`service_id` int(10) NOT NULL,
`zip` varchar(7) NOT NULL,
`priority` tinyint(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Now let's also say that we have 45,000 zip codes, a few hundred services, and a few thousand users, and that no user can have the same priority level as another user for the same service in the same zip code.
I need a query that, given a particular zip code, radius, service, and user_id, will return the highest available priority level for all other zip codes within that radius for that service.
I would also like suggestions for restructuring this data.
The problem I see happening here is that as the user base grows, the service_priority table is going to get huge: in theory 45,000 rows bigger for every user, although in practice probably only 10,000 rows bigger.
What can I do to mitigate these problems?
Switch to InnoDB.
zip_code table should probably have PRIMARY KEY(zip) unless you really want multiple rows for a given zip.
"no user can have the same priority level as another user for the same service in the same zip code" -- can be enforced by
service_priority : UNIQUE(service_id, zip, priority)
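As DDL (the key name is illustrative):
ALTER TABLE service_priority ADD UNIQUE KEY svc_zip_priority (service_id, zip, priority);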
Then your query may look something like
SELECT sp.*
FROM ( SELECT b.zip
FROM ( SELECT lat, `long` FROM zip_code WHERE zip = '$zip' ) AS a
JOIN zip_code AS b
WHERE ... < $radius
) AS z
JOIN service_priority AS sp
WHERE sp.zip = z.zip
AND sp.user_id = $user_id
AND sp.service_id = $service_id
ORDER BY sp.priority DESC
LIMIT 1
Notes:
The index, above, is also tailored for this query.
The innermost query gets the one lat/lng for the center point.
The middle query focuses on finding the nearby zips. See the tag I added to find many questions discussing how to do that.
The outer query then filters results based on user and service.
Finally, the highest priority row is picked.
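For illustration only, the elided radius test is often written as a great-circle distance (spherical law of cosines; the constant 3959 is roughly the Earth's radius in miles), though the questions under that tag cover faster bounding-box approaches:
SELECT b.zip
FROM ( SELECT lat, `long` FROM zip_code WHERE zip = '$zip' ) AS a
JOIN zip_code AS b
WHERE 3959 * ACOS( COS(RADIANS(a.lat)) * COS(RADIANS(b.lat))
                 * COS(RADIANS(b.`long`) - RADIANS(a.`long`))
                 + SIN(RADIANS(a.lat)) * SIN(RADIANS(b.lat)) ) < $radius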
I've been working on a small Perl program that works with a table of articles, displaying them to the user if they have not already been read. It has been working nicely and has been quite speedy overall. However, this afternoon the performance degraded from fast enough that I wasn't worried about optimizing the query to a glacial 3-4 seconds per query. To select articles, I issue this query:
SELECT channelitem.ciid, channelitem.cid, name, description, url, creationdate, author
FROM `channelitem`
WHERE ciid NOT
IN (
SELECT ciid
FROM `uninet_channelitem_read`
WHERE uid = '1030'
)
AND (
cid =117
OR cid =308
OR cid =310
)
ORDER BY `channelitem`.`creationdate` DESC
LIMIT 0 , 100
The list of possible cids varies and could be quite a bit longer. In any case, I noted that about 2-3 seconds of the total query time is devoted to the ORDER BY. If I remove it, the query takes only about half a second. If I drop the subquery instead, the performance goes back to normal... but the subquery didn't seem to be problematic until just this afternoon, after working fine for a week or so.
Any ideas what could be slowing it down so much? What might I do to try to get the performance back up to snuff? The table being queried has 45,000 rows. The subquery's table has fewer than 3,000 rows at present.
Update: Incidentally, if anyone has suggestions on how to do multiple queries or some other technique that would accomplish what I am trying to do more efficiently, I am all ears. I'm really puzzled how to solve the problem at this point. Can I somehow apply the ORDER BY before the join so that it applies to the real table and not the derived table? Would that be more efficient?
Here is the latest version of the query, derived from suggestions from @Gordon below:
SELECT channelitem.ciid, channelitem.cid, name, description, url, creationdate, author
FROM `channelitem`
LEFT JOIN (
SELECT ciid, dateRead
FROM `uninet_channelitem_read`
WHERE uid = '1030'
)alreadyRead ON channelitem.ciid = alreadyRead.ciid
WHERE (
alreadyRead.ciid IS NULL
)
AND `cid`
IN ( 6648, 329, 323, 6654, 6647 )
ORDER BY `channelitem`.`creationdate` DESC
LIMIT 0 , 100
Also, I should mention what my db structure looks like with regards to these two tables -- maybe someone can spot something odd about the structure:
CREATE TABLE IF NOT EXISTS `channelitem` (
`newsversion` int(11) NOT NULL DEFAULT '0',
`cid` int(11) NOT NULL DEFAULT '0',
`ciid` int(11) NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_unicode_ci,
`url` varchar(222) DEFAULT NULL,
`creationdate` datetime DEFAULT NULL,
`urgent` varchar(10) DEFAULT NULL,
`name` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`lastchanged` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`author` varchar(255) NOT NULL,
PRIMARY KEY (`ciid`),
KEY `newsversion` (`newsversion`),
KEY `cid` (`cid`),
KEY `creationdate` (`creationdate`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1638554365 ;
CREATE TABLE IF NOT EXISTS `uninet_channelitem_read` (
`ciid` int(11) NOT NULL,
`uid` int(11) NOT NULL,
`dateRead` datetime NOT NULL,
PRIMARY KEY (`ciid`,`uid`),
KEY `ciid` (`ciid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
It never hurts to try the left outer join version of such a query:
SELECT ci.ciid, ci.cid, ci.name, ci.description, ci.url, ci.creationdate, ci.author
FROM `channelitem` ci left outer join
(SELECT ciid
FROM `uninet_channelitem_read`
WHERE uid = '1030'
) cr
on ci.ciid = cr.ciid
where cr.ciid is null and
ci.cid in (117, 308, 310)
ORDER BY ci.`creationdate` DESC
LIMIT 0 , 100
This query will be faster with an index on uninet_channelitem_read(ciid) and probably one on channelitem(cid, ciid, creationdate).
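Another formulation worth comparing with EXPLAIN is NOT EXISTS, which MySQL sometimes plans differently from NOT IN; this is a sketch of the same logic, not a guaranteed win:
SELECT ci.ciid, ci.cid, ci.name, ci.description, ci.url, ci.creationdate, ci.author
FROM `channelitem` ci
WHERE ci.cid IN (117, 308, 310)
AND NOT EXISTS ( SELECT 1
                 FROM `uninet_channelitem_read` cr
                 WHERE cr.uid = '1030'
                 AND cr.ciid = ci.ciid )
ORDER BY ci.creationdate DESC
LIMIT 0 , 100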
The problem could be with the index on the channelitem table for the creationdate column; indexes help a database run queries faster, so check with EXPLAIN that it is actually being used. Here is a link about MySQL indexing