MySQL query - Get first & last rows from groups of larger query

I have a table (names) with about 10M rows (id, first, last, etc.), and I need to break it down into digestible groups by last-name letter (e.g. all last names ending in A, in groups of 100), grabbing the first and last record of each group.
I'm not sure what the most efficient way is, and I'm not familiar with subqueries. I think I should count the rows for each last-name letter (all the A's), divide that count by 100, and then select the first and last row of each group? I'm struggling to get an efficient query to work.
SELECT COUNT(id) / 100
FROM names
WHERE last REGEXP '^[A].*$'
gives me the count of groups
SELECT COUNT(id), MIN(first), MAX(last),
    (SELECT COUNT(id)
     FROM names
     WHERE last REGEXP '^[A].*$') / 100
FROM names
can't get the syntax right

OK, let's start with the basics. First of all, to do the pagination, you would need a query like:
SELECT last, first
FROM names
WHERE last LIKE '%a'
ORDER BY last ASC
LIMIT 0,100 /* query for first page */
The challenge you then have is how to get the very first and last name from each of these groups. Unfortunately, there is no really straightforward way to do this other than manually inspecting the first and last record of the result set above and then repeating the same thing for each increment of 100. You would be best served by an application-side DB library that allows you to easily move your pointer within the result set. Assuming you can move the pointer easily, you could also do this with a single non-paginated query and just move the pointer to the 1st, 100th, 101st, 200th, etc. record to extract the values.
This is probably a pretty unreasonable amount of work for your application to do every time you want to render your navigation elements, as you would need to do it 26 times (once per letter). This might cause you to rethink your navigation experience altogether, or to come up with a way to reasonably cache the results for use in navigational display.
Alternatives could include using a surrogate counter field to number all rows from 1 to n within each first-letter grouping and using modular arithmetic (i.e. MOD) to pick out the boundary rows:
SET @x = 0;
SELECT `last`, `first`
FROM (
    SELECT @x := @x + 1 AS `counter`, `last`, `first`
    FROM names
    WHERE last LIKE '%a'
    ORDER BY `last` ASC
) AS all_rows
WHERE `counter` MOD 100 = 0
   OR `counter` MOD 100 = 1
Though again, you would need to do this 26 times if you wanted to generate all of your second-level navigation options.
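On MySQL 8.0+, ROW_NUMBER() avoids the user-variable trick entirely. A minimal sketch of the same boundary-row idea (the window-function rewrite is an assumption of mine, not part of the original answer):
SELECT `last`, `first`
FROM (
    SELECT `last`, `first`,
           ROW_NUMBER() OVER (ORDER BY `last` ASC) AS rn -- position within this letter group
    FROM names
    WHERE last LIKE '%a'
) AS numbered
WHERE rn MOD 100 IN (0, 1); -- rn MOD 100 = 1 marks the first row of each block of 100, = 0 the last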

Related

MySQL: in designing a loot drop table, is it possible to specify a number of times the query repeats itself and outputs each result on the same table

As part of teaching myself SQL, I'm coding a loot drop table that I hope to use in D&D campaigns.
The simplest form of the query is:
SELECT rarity,
    CASE
        WHEN rarity = 'common' THEN (SELECT item FROM common.table)
        WHEN rarity = 'uncommon' THEN (SELECT item FROM uncommon.table)
        ...etc
    END AS loot
FROM rarity.table
ORDER BY RAND() * (1 / weight)
LIMIT 1
the idea is that the query randomly chooses a rarity from the rarity.table based on a weighted probability. There are 10 types of rarity, each represented on the rarity.table as a single row and having a column for probabilistic weight.
If I want to randomly output 1 item (limit 1), this works great.
However, attempting to output more than 1 item at a time isn't probabilistic, in that the query can only put out 1 row of each rarity. If, say, I want to roll 10 items (LIMIT 10) for my players, it will just output all 10 rows, producing 1 item from each rarity, and never multiples of the higher-weighted rarities.
I have tried something similar, creating a different rarity.table that was 1000 rows long; instead of having a 'weight' column, probabilistic weight was represented in rows, e.g. common is rows 1-20, uncommon rows 21-35, etc.
Then writing the query as
ORDER BY RAND()
LIMIT x
-- (where x is the number of items I want to output)
and while this is better in some ways, its results are still limited by the number of rows for each rarity. I.e. if I set the limit to 100, it again just gives me the whole table without taking probability into consideration. This is fine in that I probably won't be rolling 100 items at once, but it feels incorrect that the output will always be limited to 20 common items, 15 uncommon, etc. This is also MUCH slower, as my actual code has a lot of case and sub-case statements.
So my thoughts moved on to whether it is possible to run the query with LIMIT 1, but have the query run x number of times and then include each result in the same output table, preserving probability and not being limited by the number of rows in the table. However, I haven't figured out how to do so.
Any thoughts on how to achieve these results? Or maybe a better approach?
Please let me know if I can clarify anything.
Thank you!
A big no-no is having several virtually identical tables (common and uncommon) as separate tables. Instead, have one table with an extra column to distinguish the types. That will let your sample query be written more simply, possibly with a JOIN.
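For example, a minimal sketch of that consolidation (the table and column names loot_items, rarity, and weight are assumptions for illustration, not from the original post):
CREATE TABLE loot_items (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    rarity VARCHAR(20)  NOT NULL, -- 'common', 'uncommon', ...
    weight DECIMAL(8,3) NOT NULL, -- probabilistic weight of this rarity
    item   VARCHAR(100) NOT NULL,
    KEY (rarity)
);
With everything in one table, the per-rarity subqueries and CASE branches disappear; each roll becomes a single SELECT against loot_items.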
attempting to output more than 1 item at a time isn't probabilistic in that the query can only put out 1 row of each rarity
Let's try to tackle that with something like
(SELECT ... WHERE ... 'common' ORDER BY ... LIMIT 1)
UNION
(SELECT ... WHERE ... 'uncommon' ORDER BY ... LIMIT 1)
...
If you don't want the entire list like that, then do
(
((the UNION above))
) ORDER BY RAND() LIMIT 3; -- to pick 3 of the 10
Yes, it looks inefficient. But ORDER BY RAND() LIMIT 1 is inherently inefficient -- it fetches the entire table, shuffles the rows, then peels off one row.
Munch on those. There are other possibilities.
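One such possibility, sketched here as an assumption rather than part of the original answer: a weighted pick of several rows in one pass, using -LN(1 - RAND()) / weight as the sort key (Efraimidis-Spirakis weighted sampling), against the consolidated table assumed above:
-- rows with larger weights tend to draw smaller keys, so they surface first
SELECT rarity, item
FROM loot_items
ORDER BY -LN(1 - RAND()) / weight
LIMIT 10; -- roll 10 items in one query
Note this samples without replacement, so a rarity repeats only if the table holds multiple rows of that rarity.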
While I'm sure there is room for improvement/optimization, I actually figured out a solution myself, in case anyone is interested.
Instead of querying the rarity table first, I made a new table thousands of entries long, called rolls.table, and query that table first. Here, the LIMIT clause works as a way to select the number of rolls I want to make.
Then, every time this table outputs a row, the query selects from the rarity.table independently.
Does that make sense?
I'll work with this for now, but would love to hear how to make it better... it takes like 20 seconds for the output table to load, haha.

mySQL - Pagination of filtered rows

I have a REST service which returns rows from a database table depending on the current page and the results per page.
When not filtering the results, it's pretty easy to do: I just SELECT ... WHERE id >= (page - 1) * perPage + 1 and LIMIT to perPage.
The problem is when trying to use pagination on filtered results, e.g. if I choose to filter only the rows WHERE type = someType.
In that case, the first match of the first page can start in id 7, and the last can be in id 5046. Then the first match of the second page can start at 7302 and end at 12430, and so on.
For the first page of filtered results, I'd be able to simply start from id 1 and LIMIT to perPage, but for the second page, etc., I need to know the index of the last matched row on the previous page, or even better, the first matched row on the current page, or some other indication.
How do I do it efficiently? I need to be able to do it on tables with millions of rows, so obviously fetching all the rows and taking it from there is not an option.
The idea is something like this:
SELECT ... FROM ... WHERE filterKey = filterValue AND id >= id_of_first_match_in_current_page
with id_of_first_match_in_current_page being the mystery.
You can't know what the first id on a given page is, because id values are not necessarily sequential. In other words, there can be gaps in the sequence, so the rows on the fifth page of 100 rows don't necessarily start at id 500. They could start at id 527, for example; it's impossible to know.
Stated yet another way: id is a value, not a row number.
One possible solution, if your client is advancing through pages in ascending order, is that each REST request fetches a page, notes the greatest id value on that page, then uses that value in the next REST request so it queries only id values that are larger:
SELECT ... FROM ... WHERE filterKey = filterValue
AND id > id_of_last_match_of_previous_page
But if your REST request can fetch any random page, this solution doesn't work. It depends on having fetched the prior page already.
Another solution is to use the LIMIT <x> OFFSET <y> syntax. This allows you to request any arbitrary page. LIMIT <y>, <x> works the same, but for some reason x and y are reversed between the two syntax forms, so keep that in mind.
Using LIMIT ... OFFSET isn't very efficient when you request a page that is deep into the result. Say you request the 5,000th page. MySQL has to generate a server-side result spanning 5,000 pages, then discard 4,999 of them and return the last page of the result. Sorry, but that's how it works.
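For illustration (the filter comes from the question; the page size of 100 is an assumption), page 5,000 looks like this, and the server still materializes and discards the first 499,900 matching rows:
SELECT ... FROM ...
WHERE filterKey = filterValue
ORDER BY id
LIMIT 100 OFFSET 499900; -- page 5,000 at 100 rows per page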
Re your comment:
You must understand that WHERE applies conditions to values in rows, but pages are defined by the position of rows. These are two different ways of determining rows!
If you have a column that is guaranteed to be a row-number, then you can use that value like a row position. You can even put an index on it, or use it as the primary key.
But primary key values may change, and may not be consecutive, for example if you update or delete rows, or roll back some transactions, and so on. Renumbering primary key values is a bad idea because other tables or external data may reference primary key values.
So you could add another column that is not the primary key, but only a row-number.
ALTER TABLE MyTable ADD COLUMN row_number BIGINT UNSIGNED, ADD KEY (row_number);
Then fill the values when you need to renumber the rows.
SET @row := 0;
UPDATE MyTable SET row_number = (@row := @row + 1) ORDER BY id;
You'd have to re-number the rows if you ever delete some, for example. It's not efficient to do this frequently, depending on the size of the table.
Also, new inserts cannot create correct row number values without locking the table. This is necessary to prevent race conditions.
If you have a guarantee that row_number is a sequence of consecutive values, then it's both a value and a row position, so you can use it for high-performance index lookups for any arbitrary page of rows.
SELECT * FROM MyTable WHERE row_number BETWEEN 401 AND 500;
At least until the next time the sequence of row numbers is put into doubt by a delete or by new inserts.
You're using the ID column for the wrong purpose. ID is the identifier of a record, not the sequence number of a record for any given set of results.
The LIMIT keyword extends to basic pagination. If you just wanted the first 10 records, you'd do something like:
LIMIT 10
To paginate, if you wanted the second 10 records, you'd do:
LIMIT 10,10
The 10 after that:
LIMIT 20,10
And so on.
The LIMIT clause is independent of the WHERE clause. Use WHERE to filter your results, use LIMIT to paginate them.
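Putting the two clauses together, a sketch (the filter is from the question; the table name and the page size of 10 are assumptions):
SELECT *
FROM MyTable
WHERE type = someType -- filter
ORDER BY id           -- a stable order, so pages don't shuffle between requests
LIMIT 20, 10;         -- skip the first two pages of 10, return the third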

Limit the result starting from a specific row with a given Id?

I want to write a query to select a subset of a table, only starting from a given id.
I know about LIMIT x, y, but x here is the number of the row to start from. In my case I want to start from a specific id, no matter where it is located inside the table.
What I mean is that the query below selects from row number 5, but I want it to select 10 records from row with id, say 213odin2d211d21:
SELECT * FROM my_table Limit 5, 10
I can't find a way to do this. Any help will be appreciated.
Note that the id here is a mix of strings and integers, so I can't do
SELECT * FROM <table> WHERE id > (id)
What you want to do is not possible. By default, records in the database are not ordered; without ORDER BY you can't expect the server to return your rows in any particular order. Since you say that you store some kind of digit/char identifier as your id, for which less than and greater than are not defined, it is not clear which records "follow" your specific record.
You will either have to:
Define another column to sort your records on, or
Define a behaviour for comparing your ids (What is "less than"? What is "greater than"?)
That being said, you can of course define that you want to sort your ids just like sorting strings! In this case, you can use STRCMP() to compare two strings. Your query would look like this:
SELECT * FROM <table> WHERE STRCMP(id,?) = 1 ORDER BY id LIMIT 10
This will select the first 10 records, with id "greater than" ?.
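As a usage sketch, with the table name and example id from the question (switching the comparison to >= 0 is my tweak so that the given row itself is included):
SELECT *
FROM my_table
WHERE STRCMP(id, '213odin2d211d21') >= 0 -- start at the given id, inclusive
ORDER BY id
LIMIT 10;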

Selecting X oldest date from the database

Good Afternoon
Please can someone help me; I'm nearly a total noob. I have a very simple DB which has thousands of rows and very few columns: an ID, Name, Image, Information, and Date Added. Really basic!
Now I'm trying to display only a single row of data at a time, so there is no need for loops and things in this request. Sounds very simple in theory.
I can display a row in date order, by the most recent or the oldest, ascending or descending. But I want to be able to display, for example, the 6th newest entry, and perhaps somewhere else on my site the 16th most recent entry, and so on. This could even be the 1232nd most recent entry.
Sounds to me like it would be a common task, but I can't find the answer anywhere. Can someone provide me with the very short command for doing this? I'm probably missing something really daft and fundamental.
Thanks
Leah
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. LIMIT takes one or two numeric arguments, which must both be nonnegative integer constants (except when using prepared statements).
With two arguments, the first argument specifies the offset of the first row to return, and the second specifies the maximum number of rows to return. The offset of the initial row is 0 (not 1):
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
http://dev.mysql.com/doc/refman/5.1/en/select.html
So if you want the 1232nd-oldest row from your table, you can do something like this:
SELECT * FROM tbl ORDER BY date_added LIMIT 1231,1;
In your query, use LIMIT, e.g.
LIMIT 5,1 -- skips the first 5 rows and retrieves one result: the 6th
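Note that ORDER BY date_added ascending counts from the oldest entry; for the question's "6th newest entry", sort descending instead (a sketch reusing the column name from the answer above):
SELECT * FROM tbl ORDER BY date_added DESC LIMIT 5,1; -- skip 5, return the 6th newest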

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought: since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work, I would first need to find out which ids have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto-number yourself (a 1-based autoincrement relative to a single status), as in the sample below:
Pkey  Status  StatusPkey
   1  A       1
   2  A       2
   3  B       1
   4  B       2
   5  C       1
 ...  C       ...
   n  C       m   (where m = number of rows with status C)
When you don't need to filter, you can generate random numbers against the pkey, as you mentioned above. When you do need to filter, generate them against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could maintain it live. The latter would be a performance hit, though, since calculating the StatusPkey can get expensive.
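A minimal sketch of the interval-rebuild variant (all names here are hypothetical, and the user-variable numbering matches the MySQL 5.1 era mentioned in the question):
-- hypothetical secondary table
CREATE TABLE status_index (
    pkey        INT     NOT NULL PRIMARY KEY,
    status      CHAR(1) NOT NULL,
    status_pkey INT     NOT NULL,
    KEY (status, status_pkey)
);
-- rebuild: a 1-based counter that restarts whenever the status changes
TRUNCATE status_index;
SET @s := '', @n := 0;
INSERT INTO status_index (pkey, status, status_pkey)
SELECT pkey, status, status_pkey
FROM (
    SELECT pkey, status,
           @n := IF(status = @s, @n + 1, 1) AS status_pkey, -- per-status counter
           @s := status AS s                                -- remember the last status seen
    FROM MyData
    ORDER BY status, pkey
) AS numbered;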
Check out this article by Jan Kneschke... It does a great job of explaining the pros and cons of different approaches to this problem.
You can do this efficiently, but you have to do it in two queries.
First, get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer from 0 up to the count minus 1. (FLOOR, rather than ROUND, keeps the offset in range; ROUND could produce the count itself, which would land past the last row.) Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.
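As a usage sketch gluing the two queries together in SQL (the table, column, and value names are placeholders; MySQL does accept a ? parameter for OFFSET inside a prepared statement):
SET @offset := (SELECT CAST(FLOOR(RAND() * COUNT(*)) AS UNSIGNED)
                FROM MyTable WHERE status = 'X');
PREPARE pick_row FROM
    'SELECT * FROM MyTable WHERE status = ''X'' LIMIT 1 OFFSET ?';
EXECUTE pick_row USING @offset;
DEALLOCATE PREPARE pick_row;
In practice, the two queries would more often run from application code, with the count result bound as the offset parameter.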