I've been searching for an easy solution to a pretty trivial problem. I have a huge set of records (~120,000) that I need to screen for duplicates, assigning a sequential number to each set of duplicates, like Assign# below:
Eventually, I am trying to achieve this:
I use the P1, P2, and P3 fields as a set of sort parameters in the query (ascending/descending) to determine the best/top Name for each set of identical NCBI hits.
I have tried a lot of things already, and my main problem is that Access freezes halfway through, so I don't really know whether the script is even functional:
SELECT [sortquery].*
FROM [sortquery]
WHERE [sortquery].Name In
(
    SELECT TOP 1 Dupe.Name
    FROM [sortquery] AS Dupe
    WHERE Dupe.NCBI = [sortquery].NCBI
    ORDER BY Dupe.NCBI
)
ORDER BY [sortquery].NCBI;
I am open to any suggestions and corrections! Thanks for any help =)
The traditional method is to count:
SELECT
    *,
    (SELECT Count(*)
     FROM [sortquery] AS S
     WHERE S.NCBI = [sortquery].NCBI
       AND S.P1 * 1000 + S.P3 >= [sortquery].P1 * 1000 + [sortquery].P3) AS [Assign#]
FROM
[sortquery]
ORDER BY
NCBI Asc,
P1 Desc,
P3 Desc,
[Name] Asc,
[Assign#] Asc
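Since your stated goal is the best/top Name per NCBI group, here is a minimal follow-up sketch in the same style (assuming, as above, that a larger P1 * 1000 + P3 means a better row, and that P3 always stays below 1000 so the composite key can't collide) that keeps only the rows whose count is 1:
SELECT Q.*
FROM [sortquery] AS Q
WHERE (SELECT Count(*)
       FROM [sortquery] AS S
       WHERE S.NCBI = Q.NCBI
         AND S.P1 * 1000 + S.P3 >= Q.P1 * 1000 + Q.P3) = 1;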
EDIT: my question was not clear, so I have reformulated it here: Order sql result by occurrence of a set of keywords in a string
I'm improving the search system for my website. I'm trying to use and increment user variables in a SQL query, like this:
SET @titlematch = 0;
SELECT *,
    CASE
        when title like '%apple%' then (SET @titlematch = @titlematch+1)
        when title like '%orange%' then (SET @titlematch = @titlematch+1)
        when title like '%other_keyword_searched%' then (SET @titlematch = @titlematch+1)
        (...)
    END,
    (...)
FROM pages
(...)
ORDER by @titlematch desc
In fact, titlematch should be incremented each time a keyword appears in the title. If both "apple" and "orange" are in the title, titlematch should equal 2.
But actually, it doesn't work...
(sorry for my English)
I think it fails because it must handle all the data: if the title matches some word you don't account for, it will fail. You must account for all possible cases or use ELSE.
In response to your comment (Yes, always), I would rewrite your query this way:
SELECT *, (select count(*) from pages p2 where p1.field_date < p2.field_date) as pos
(...)
FROM pages p1
(...)
ORDER by pos desc
This way you count every row that precedes the current one (I based the count on a hypothetical field_date, but you can change the condition as you like), so you get an incremental value for each row; finally, I added that value to the ORDER BY clause.
Tell me if it's OK
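For the original goal of counting keyword matches, note that in MySQL a boolean comparison evaluates to 0 or 1, so you can simply add the LIKE tests up without any user variables. A minimal sketch, using the table and keywords from your question:
SELECT p.*,
    (title LIKE '%apple%')
    + (title LIKE '%orange%')
    + (title LIKE '%other_keyword_searched%') AS titlematch
FROM pages p
ORDER BY titlematch DESC;
Each LIKE contributes 1 when it matches, so a title containing both "apple" and "orange" gets titlematch = 2.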
I have a little query that goes like this:
It's slightly more complex than it looks; the only tricky part is using the output of one subquery as the parameter of an IN clause in another. It works to some degree, but it only returns results for the first id in the IN list. Oddly, if I manually insert the record ids "00003,00004,00005", it does give the proper results.
What I am trying to do is get a second-level many-to-many relationship: tour_stops have items, which in turn have images. I am trying to get all the images from all the items into a JSON string as 'item_images'. As stated, it runs quickly, but only returns the images from the first related item.
SELECT DISTINCT
tour_stops.record_id,
(SELECT
GROUP_CONCAT( item.record_id ) AS in_item_ids
FROM tour_stop_item
LEFT OUTER JOIN item
ON item.record_id = tour_stop_item.item_id
WHERE tour_stop_item.tour_stops_id = tour_stops.record_id
GROUP BY tour_stops.record_id
) AS rel_items,
(SELECT
CONCAT('[ ',
GROUP_CONCAT(
CONCAT('{ \"record_id\" : \"',record_id,'\",
\"photo_credit\" : \"',photo_credit,'\" }')
)
,' ]')
FROM images
WHERE
images.attached_to IN(rel_items) AND
images.attached_table = 'item'
ORDER BY img_order ASC) AS item_images
FROM tour_stops
WHERE
tour_stops.attached_to_tour = $record_id
ORDER BY tour_stops.stop_order ASC
I tried both of the answers linked below, but neither helped. The second one (placing the entire first subquery inside the IN statement) not only produced the same results I am already getting, but also increased the query time dramatically.
EDIT: I replaced my IN statement with
IN(SELECT item_id FROM tour_stop_item WHERE tour_stops_id = tour_stops.record_id)
and it works, but it is brutally slow now. Assuming I have everything indexed correctly, is this the best way to do it?
using group_concat in PHPMYADMIN will show the result as [BLOB - 3B]
GROUP_CONCAT in IN Subquery
Any insights are appreciated. Thanks
I am surprised that you can use rel_items in the subquery.
You might try:
concat(',', images.attached_to, ',') like concat('%,', rel_items, ',%') and
This may or may not be faster. The original version was presumably fast because it found no matches.
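As a variation on the same idea, MySQL's FIND_IN_SET() performs this comma-list membership test natively, so the condition could also be sketched as:
WHERE FIND_IN_SET(images.attached_to, rel_items) > 0
    AND images.attached_table = 'item'
This works because GROUP_CONCAT's default separator is a bare comma, which is exactly the list format FIND_IN_SET expects.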
Or, you can try changing your IN clause to a correlated EXISTS. Sometimes IN subqueries are poorly optimized:
exists (select 1
from tour_stop_item
where tour_stops_id = tour_stops.record_id and images.attached_to = item_id
)
And then be sure you have an index on tour_stop_item(tour_stops_id, item_id).
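That index could be created like so (the index name is just illustrative):
CREATE INDEX ix_stop_item ON tour_stop_item (tour_stops_id, item_id);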
I'm using this kind of query with different parameters:
EXPLAIN SELECT SQL_NO_CACHE
    `ilan_genel`.`id`, `ilan_genel`.`durum`, `ilan_genel`.`kategori`,
    `ilan_genel`.`tip`, `ilan_genel`.`ozellik`, `ilan_genel`.`m2`,
    `ilan_genel`.`fiyat`, `ilan_genel`.`baslik`, `ilan_genel`.`ilce`,
    `ilan_genel`.`parabirimi`, `ilan_genel`.`tarih`,
    `kgsim_mahalleler`.`isim` AS mahalle,
    `kgsim_ilceler`.`isim` AS ilce,
(
SELECT `ilanresimler`.`resimlink`
FROM `ilanresimler`
WHERE `ilanresimler`.`ilanid` = `ilan_genel`.`id`
LIMIT 1
) AS resim
FROM `ilan_genel`
LEFT JOIN `kgsim_ilceler` ON `kgsim_ilceler`.`id` = `ilan_genel`.`ilce`
LEFT JOIN `kgsim_mahalleler` ON `kgsim_mahalleler`.`id` = `ilan_genel`.`mahalle`
WHERE `ilan_genel`.`ilce` = '703'
AND `ilan_genel`.`durum` = '1'
AND `ilan_genel`.`kategori` = '1'
AND `ilan_genel`.`tip` = '9'
ORDER BY `ilan_genel`.`id` DESC
LIMIT 225 , 15
and this is what I get in the EXPLAIN output:
these are the indexes that I have already tried:
Any help will be deeply appreciated. What kind of index would be the best option, or should I use another table structure?
You should first simplify your query to understand your problem better. As your problem appears to be constrained to the ilan_genel table, the following query should show the same symptoms:
SELECT * FROM `ilan_genel` WHERE `ilan_genel`.`ilce` = '703'
AND `ilan_genel`.`durum` = '1'
AND `ilan_genel`.`kategori` = '1'
AND `ilan_genel`.`tip` = '9'
So the first thing to do is check that this is the case. If so, the simpler question is why this query requires a filesort on 3661 rows. Now the 'hepsi' index column order is:
ilce -> mahalle -> durum -> kategori -> tip -> ozellik
I've written it that way to emphasise that the index is sorted first on 'ilce', then 'mahalle', then 'durum', and so on. Note that your query does not specify a 'mahalle' value, so the best the index can do is a lookup on 'ilce'. Now I don't know the distribution of your data, but the next logical step in debugging would be:
SELECT * FROM `ilan_genel` WHERE `ilan_genel`.`ilce` = '703'
Does this return 3661 rows?
If so, you should be able to see what is happening: the database uses the hepsi index to the best of its ability, gets 3661 rows back, and then sorts those rows to filter them against the other criteria (i.e. 'durum', 'kategori', 'tip').
The key point here is that if data is sorted by A, B, C in that order and B is not specified, then the best that can logically be done is a lookup on A followed by a filter of the remaining rows against C. In this case, that filter is performed via a filesort.
Possible solutions
Supply 'mahalle' (B) in your query.
Add a new index on 'ilan_genel' that doesn't require 'mahalle', i.e. A -> C -> D..., as sketched below.
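A minimal sketch of that second option, using the columns from your WHERE clause (the index name is illustrative):
CREATE INDEX ix_ilce_durum_kategori_tip
    ON ilan_genel (ilce, durum, kategori, tip);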
Another tip
In case I have misdiagnosed your problem (easy to do when I don't have your system to test against), the important thing here is the approach to solving the problem. In particular, how to break a complicated query into a simpler query that produces the same behaviour, until you get to a very simple SELECT statement that demonstrates the problem. At this point, the answer is usually much clearer.
Is there a way to implement this algorithm in MySQL without 100500 queries and lots of resources?
if (exists %name% in table.name) {
    num = 2;
    while (exists %name%+(num) in table.name) num++;
    %name% = %name%+(num);
}
Thanks
I don't know how much better you can do with a stored procedure in MySQL, but you can definitely do better than 100500 queries:
SELECT name FROM table WHERE name LIKE 'somename%' ORDER BY name DESC LIMIT 1
At that point, you know that you can increment the number at the end of name and the result will be unused.
I'm glossing over some fine print (this approach will never find and fill any "holes" in the naming scheme that may exist, and the name is still not guaranteed to be available due to race conditions), but in practice it can be made to work quite easily.
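One more wrinkle: with plain string ordering, 'somename9' sorts after 'somename10', so comparing the suffixes numerically is safer. A rough sketch of computing the next free number, assuming names look like 'somename', 'somename2', 'somename3', and so on:
SELECT GREATEST(
        COALESCE(MAX(CAST(SUBSTRING(name, LENGTH('somename') + 1) AS UNSIGNED)), 0),
        1) + 1 AS next_num
FROM `table`
WHERE name LIKE 'somename%';
The bare 'somename' row yields an empty suffix, which CAST treats as 0, and GREATEST(..., 1) + 1 makes the numbering start at 2, matching the pseudocode.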
The simplest way I can see of doing it is to create a table of sequential numbers,
then cross join onto it....
SELECT a.name, b.id
FROM `table` a
CROSS JOIN atableofsequentialnumbers b
WHERE a.name = 'somename'
    AND NOT EXISTS (SELECT 1 FROM `table` x WHERE x.name = CONCAT(a.name, b.id))
LIMIT 10
This will return the first 10 available numbers/names
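A minimal way to materialize that helper table (the name and the 1-10 range are just illustrative; extend the VALUES list for more rows):
CREATE TABLE atableofsequentialnumbers (id INT PRIMARY KEY);
INSERT INTO atableofsequentialnumbers (id)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);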
I'm in over my head with a big MySQL query (MySQL 5.0), and I'm hoping somebody here can help.
Earlier I asked how to get distinct values from a joined query:
mysql count only for distinct values in joined query
The response I got worked (using a subquery joined as a derived table):
select *
from media m
inner join
( select uid
from users_tbl
limit 0,30) map
on map.uid = m.uid
inner join users_tbl u
on u.uid = m.uid
Unfortunately, my query has grown more unruly, and though I have it running, joining onto a derived table is taking too long because there are no indexes available to the derived query.
My query now looks like this:
SELECT mdate.bid, mdate.fid, mdate.date, mdate.time, mdate.title, mdate.name,
mdate.address, mdate.rank, mdate.city, mdate.state, mdate.lat, mdate.`long`,
ext.link,
ext.source, ext.pre, meta, mdate.img
FROM ext
RIGHT OUTER JOIN (
SELECT media.bid,
media.date, media.time, media.title, users.name, users.img, users.rank, media.address,
media.city, media.state, media.lat, media.`long`,
GROUP_CONCAT(tags.tagname SEPARATOR ' | ') AS meta
FROM media
JOIN users ON media.bid = users.bid
LEFT JOIN tags ON users.bid=tags.bid
WHERE `long` BETWEEN -122.52224684058 AND -121.79760915942
AND lat BETWEEN 37.07500915942 AND 37.79964684058
AND date = '2009-02-23'
GROUP BY media.bid, media.date
ORDER BY media.date, users.rank DESC
LIMIT 0, 30
) mdate ON (mdate.bid = ext.bid AND mdate.date = ext.date)
phew!
So, as you can see, if I understand my problem correctly, I have two derived tables without indexes (and I don't deny that I may have screwed up the join statements somehow; I kept messing with different types, as this ended up giving me the result I wanted).
What's the best way to create a query similar to this which will allow me to take advantage of the indexes?
Dare I say, I actually have one more table to add into the mix at a later date.
Currently, my query takes 0.8 seconds to complete, but I'm sure that if I could take advantage of the indexes, it would be significantly faster.
First, check for indexes on ext(bid, date), users(bid) and tags(bid); you should really have them.
It seems, though, that it's LONG and LAT that cause you the most problems. You should try keeping your LONG and LAT together in a POINT column (say, coordinate), creating a SPATIAL INDEX on that column, and querying like this:
WHERE MBRContains(@MySquare, coordinate)
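For illustration, a rough sketch of that setup, assuming a MyISAM media table (SPATIAL indexes in MySQL 5.0 require MyISAM) and the bounding box from your query; the column, index, and variable names are hypothetical:
ALTER TABLE media ADD COLUMN coordinate POINT NOT NULL;
UPDATE media SET coordinate = GeomFromText(CONCAT('POINT(', `long`, ' ', lat, ')'));
CREATE SPATIAL INDEX ix_coordinate ON media (coordinate);
-- build the search rectangle once per query:
SET @MySquare = GeomFromText('POLYGON((-122.52224684058 37.07500915942, -121.79760915942 37.07500915942, -121.79760915942 37.79964684058, -122.52224684058 37.79964684058, -122.52224684058 37.07500915942))');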
If you can't change your schema for some reason, you can try creating additional indices that include date as a first field:
CREATE INDEX ix_date_long ON media (date, `long`)
CREATE INDEX ix_date_lat ON media (date, lat)
These indexes will be more efficient for your query, as you combine an exact match on date with a range search on the axes.
Starting fresh:
Question: why are you grouping by both media.bid and media.date? Can a bid have records for more than one date?
Here's a simpler version to try:
SELECT
media.bid,
media.fid, -- assuming fid lives on media; your derived table didn't expose it
media.date,
media.time,
media.title,
users.name,
media.address,
users.rank,
media.city,
media.state,
media.lat,
media.`long`,
ext.link,
ext.source,
ext.pre,
users.img,
( SELECT GROUP_CONCAT(tags.tagname SEPARATOR ' | ')
  FROM tags
  WHERE tags.bid = ext.bid
) AS meta
FROM
ext
LEFT JOIN
media ON ext.bid = media.bid AND ext.date = media.date
JOIN
users ON ext.bid = users.bid
WHERE
`long` BETWEEN -122.52224684058 AND -121.79760915942
AND lat BETWEEN 37.07500915942 AND 37.79964684058
AND ext.date = '2009-02-23'
AND users.userid IN
(
    -- MySQL doesn't allow LIMIT directly in an IN subquery, so wrap it:
    SELECT userid FROM (SELECT userid FROM users ORDER BY rank DESC LIMIT 30) AS top_users
)
ORDER BY
media.date,
users.rank DESC
LIMIT 0, 30
You might want to compare your performance against using a temp table for each selection, and joining those tables together.
CREATE TEMPORARY TABLE whatever1 (...);
CREATE TEMPORARY TABLE whatever2 (...);
INSERT INTO whatever1 SELECT ...;
INSERT INTO whatever2 SELECT ...;
SELECT ... FROM whatever1 JOIN whatever2 ON ...;
DROP TEMPORARY TABLE whatever1;
DROP TEMPORARY TABLE whatever2;
If your system has enough memory to hold full tables this might work out much faster. It depends on how big your database is.
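One advantage worth noting: unlike a derived table in MySQL 5.0, a temporary table can be indexed before you join on it, which addresses the original complaint directly. A hypothetical example, assuming bid is the join column:
ALTER TABLE whatever1 ADD INDEX ix_bid (bid);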