I'm trying to select random datasets with DataMapper, but it seems there is no such function supported.
For example, I have this set of data:
+----+------+-------+
| ID | Name | Value |
+----+------+-------+
| 1  | T1   | 123   |
| 2  | T2   | 456   |
| 3  | T3   | 789   |
| 4  | T4   | 101   |
| .. | ..   | ..    |
| N  | Tn   | value |
+----+------+-------+
There can be a lot of data, more than 100k rows.
And I need to map the data to this object:
class Item
  include DataMapper::Resource
  property :id,    Serial
  property :name,  String
  property :value, String
end
So, the question is: how do I select random rows from the table?
The equivalent query in SQL would be:
SELECT id, name, value FROM table ORDER BY RAND() LIMIT n;
A long time after the OP, but since this is the first Google hit for "datamapper random row"...
Using pure DataMapper, and without making assumptions about continuous IDs, etc, you can do:
Item.first(:offset => rand(Item.count))
which results in the queries:
SELECT COUNT(*) FROM `items`
SELECT <fields> FROM `items` ORDER BY `id` LIMIT 1 OFFSET <n>
If you'd prefer a single query, at the cost of potentially reduced speed, you can do:
Item.all.sample
which results in:
SELECT <fields> FROM `items` ORDER BY `id`
Obviously, wrap this in a transaction if you need to.
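In SQL terms, wrapping the two-query version looks like this (a sketch; 42 stands in for the random offset computed from the count in application code):

START TRANSACTION;
SELECT COUNT(*) FROM `items`;
-- compute rand(count) in application code
SELECT `id`, `name`, `value` FROM `items` ORDER BY `id` LIMIT 1 OFFSET 42;
COMMIT;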
I generally don't care about literally retrieving random records. In such cases, I use a slightly different approach.
ORDER BY value -- or value mod some number; you could also use name, or some function of the name
SELECT ... LIMIT n OFFSET k
where k is a random number generated in your code, less than N - n. This is sufficiently random for most cases, even though the records are somewhat contiguous in whatever you use for ORDER BY.
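Concretely, a sketch of that pattern, borrowing the items table from the question (10 and 4321 stand in for n and the randomly generated k):

SELECT id, name, value
FROM items
ORDER BY value          -- or some function of value or name
LIMIT 10 OFFSET 4321;   -- 4321 = k, computed in application code with k < N - n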
You could generate a random number x < number_of_rows, and just fetch that id.
You could also try entering the SQL directly, like this:
find_by_sql(<<-SQL, :properties => property_set)
  SELECT `id`, `name`, `value` FROM table ORDER BY RAND() LIMIT n;
SQL
You need to specify :properties, though, for it to map to your property set.
What I mean by literal order is that, although the IDs are auto-incremented, business logic can leave 8 coming right after 4 where 5 should have been. That is to say, when a row is deleted, there is no re-indexing of IDs.
This is how my rows look (table name is wp_posts):
+-----+-------------+-----+
| ID  | post_author | ... |
+-----+-------------+-----+
| 4   | ..          |     |
| 8   | ..          |     |
| 124 | ..          |     |
| 672 | ..          |     |
| 673 | ..          |     |
| 674 | ..          |     |
+-----+-------------+-----+
ID is an int with the auto-increment characteristic, but when a post is deleted, there is no re-assignment of IDs. The row simply gets deleted, and because the column is auto-increment, you can still assume that, reading down the table, the IDs that come after the one you're looking at are always bigger than the ones before it.
I'm querying for ID: SELECT ID FROM wp_posts to get a list of all the IDs I need. Now, it just so happens that I need to batch all of this, using AJAX requests because once I retrieve the IDs, I need to operate on them.
The thing is, I don't really understand how to pass my data back to AJAX. What LIMIT does is, if I provide 2 arguments, such as SELECT ID FROM wp_posts LIMIT 1,3, it returns 4, 8, 124, because it looks at row numbers. But what do I do on the next call? Yes, the first call always starts with 1, but once I launch the second AJAX request to perform yet another SELECT, how do I know where I should start? In my case, I'd want to start again at 4, so my second query would be SELECT ID FROM wp_posts LIMIT 4, 7, and so on.
Do I really need to send that counter (even if I can automate it, since, you see, it's an increment of 3) back?
Is there no way for SQL to handle this automatically?
You have many confusions in your question. Let me try to clear up some basic ones.
First, the auto-incremented key is the primary key for the table. You do not need to worry about gaps. In fact, the key should basically be meaningless. It fulfills the following:
It is guaranteed to be unique.
It is guaranteed to be in insertion order.
Gaps are allowed and are of no concern. There is no re-indexing, and re-indexing would be a bad idea because:
Primary keys uniquely identify each row and this mapping should be consistent across time.
Primary keys are used in other tables to refer to values, so re-indexing would either invalidate those relationships or require massive changes to many tables.
Re-indexing presupposes that the value means something, when it doesn't.
Second, a query such as:
SELECT ID
FROM wp_posts
LIMIT 1, 3;
Can return any three rows. Why? Because you have not specified an ORDER BY, and SQL result sets without an ORDER BY are unordered. There are no guarantees, so you should always be in the habit of using an ORDER BY.
Third, if you want to essentially "page" through results, then use the OFFSET feature in LIMIT (as you have above):
SELECT ID
FROM wp_posts
ORDER BY ID
LIMIT #offset, 3;
This allows you to adjust the #offset value and page to whichever rows you want.
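For instance, successive AJAX calls could simply advance the offset by the page size. A sketch against the question's data:

SELECT ID FROM wp_posts ORDER BY ID LIMIT 0, 3;  -- first batch: 4, 8, 124
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3, 3;  -- second batch: 672, 673, 674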
First query:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3
This returns 4,8,124 as you said. In your client, save the largest ID value in a variable.
Subsequent queries:
SELECT ID FROM wp_posts WHERE ID > ? ORDER BY ID LIMIT 3
Send a parameter into this query using the greatest ID value from the previous result. It's still in a variable.
This also helps make the query faster, because it doesn't have to skip all those initial rows every time. Paging through a large dataset using LIMIT/OFFSET is pretty inefficient. SQL has to actually read all those rows even though it's not going to return them.
But if you use WHERE ID > ? then SQL can efficiently start the scan in the right place, on the first row that would be included in the result.
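Put together with the question's data, the sequence looks like this (a sketch):

SELECT ID FROM wp_posts ORDER BY ID LIMIT 3;
-- returns 4, 8, 124; the client saves 124
SELECT ID FROM wp_posts WHERE ID > 124 ORDER BY ID LIMIT 3;
-- returns 672, 673, 674; the client saves 674, and so on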
It seems you want to return the first three rows of your query ordered by the currently existing ID values (whatever they are after all the DML statements applied to the table wp_posts).
Then consider using an auxiliary iteration variable @i to provide an ordered integer value set starting from 1 and increasing as 2, 3, ... without any gaps:
select t.*
from
(
  select @i := @i + 1 as rownum, t1.*
  from tab t1
  join (select @i := 0) t2
) t
order by rownum
limit 0,3;
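Adapted to the question's wp_posts table, that might look like this (a sketch; the inner derived table fixes the scan order before the numbering is applied):

select t.ID
from
(
  select @i := @i + 1 as rownum, s.ID
  from (select ID from wp_posts order by ID) s
  join (select @i := 0) t2
) t
order by rownum
limit 0,3;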
I have the following table in MySQL:
Name   Age  Group
abel   7    A
joe    6    A
Rick   7    A
Diana  5    B
Billy  6    B
Pat    5    B
I want to randomise the rows, but they should still remain grouped by the Group column.
For example, I want my result to look something like this:
Name   Age  Group
joe    6    A
abel   7    A
Rick   7    A
Billy  6    B
Pat    5    B
Diana  5    B
What query should I use to get this result? The entire table should be randomised and then grouped by the "Group" column.
What you describe in your question as GROUPing is more correctly described as sorting. This is a particular issue when talking about SQL databases where "GROUP" means something quite different and determines the scope of aggregation operations.
Indeed "group" is a reserved word in SQL, so although mysql and some other SQL databases can work around this, it is a poor choice as an attribute name.
SELECT *
FROM yourtable
ORDER BY `group`
Using random values also invites a lot of semantic confusion. A truly random number would have a different value every time it is retrieved, which would make any sorting impossible (and databases do a lot of sorting that is normally invisible to the user). As long as the implementation uses a finite-time algorithm such as quicksort, that shouldn't be a problem; a bubble sort would never finish, and a merge sort could get very confused.
There are also degrees of randomness, and different algorithms for generating random numbers. For encryption it's critical that the random numbers be evenly distributed and completely unpredictable; those generators often use hardware events (sometimes even dedicated hardware), but I don't expect you would need that. The real question here is: do you want the ordering to be repeatable across invocations?
SELECT *
FROM yourtable
ORDER BY `group`, RAND()
...will give different results each time.
OTOH
SELECT *
FROM yourtable
ORDER BY `group`, MD5(CONCAT(age, name, `group`))
...would give the results always sorted in the same order. While
SELECT *
FROM yourtable
ORDER BY `group`, MD5(CONCAT(CURDATE(), age, name, `group`))
...will give different results on different days.
DROP TABLE my_table;
CREATE TABLE my_table
(name VARCHAR(12) NOT NULL
,age INT NOT NULL
,my_group CHAR(1) NOT NULL
);
INSERT INTO my_table VALUES
('Abel',7,'A'),
('Joe',6,'A'),
('Rick',7,'A'),
('Diana',5,'B'),
('Billy',6,'B'),
('Pat',5,'B');
SELECT * FROM my_table ORDER BY my_group,RAND();
+-------+-----+----------+
| name | age | my_group |
+-------+-----+----------+
| Joe | 6 | A |
| Abel | 7 | A |
| Rick | 7 | A |
| Pat | 5 | B |
| Diana | 5 | B |
| Billy | 6 | B |
+-------+-----+----------+
Do the randomising first, then sort by the group column.
select Name, Age, `Group`
from (
  select *
  from yourtable
  order by RAND()
) t
order by `Group`
Try this:
SELECT * FROM yourtable ORDER BY `Group`, RAND()
I have a field for comments used to store the title of the item sold on the site as well as the bid number (bid_id). Unfortunately, the bid_id is not stored on its own in that table.
I want to query items that have a number (the bid_id) greater than 4,000 for example.
So, what I have is:
select * from mysql_table_name where comment like '< 4000'
I know this won't work, but I need something similar that works.
Thanks a lot!
Just get your bid_id into its own cleaned-up column. Then index it.
create table `prior`
( id int auto_increment primary key,
comments text not null
);
insert `prior` (comments) values ('asdfasdf adfas d d 93827363'),('mouse cat 12345678');
alter table `prior` add column bid_id int; -- add a nullable int column
select * from `prior`; -- bid_id is null atm btw
update `prior` set bid_id=right(comments,8); -- this will auto-cast to an int
select * from `prior`;
+----+-----------------------------+----------+
| id | comments | bid_id |
+----+-----------------------------+----------+
| 1 | asdfasdf adfas d d 93827363 | 93827363 |
| 2 | mouse cat 12345678 | 12345678 |
+----+-----------------------------+----------+
Create the index:
CREATE INDEX `idxBidId` ON `prior` (bid_id); -- or unique index
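With the index in place, the filter from the original question becomes a plain indexed comparison (a sketch):

SELECT * FROM `prior` WHERE bid_id > 4000;
-- returns both sample rows here, since 93827363 and 12345678 are both > 4000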
select * from mysql_table_name where cast(substring(comment, start, length) as signed) < 4000
This will work, but I suggest creating a new column, putting the bid value in it, and then comparing.
To populate the new column you can use:
update mysql_table_name set newcol = substring(comment, start, length)
Hope this helps.
There is nothing ready that works like that.
You could write a custom function or loadable UDF, but it would be significant work, with significant impact on the database. Then you could run WHERE GET_BID_ID(comment) < 4000.
What you can do more easily is devise some way of extracting the bid_id using available string functions.
For example, if the bid_id is always in the last ten characters, you can extract those and replace every character that is not a digit with the empty string. What is left is the bid_id, which you can compare.
Of course you need a complex expression with LENGTH(), SUBSTRING(), and REPLACE(). If the bid_id is between easily recognizable delimiters, then perhaps SUBSTRING_INDEX() is more your friend.
But better still... add an INTEGER column, initialize it to NULL, then store the extracted bid_id there (or zero, if you're positive there's no bid_id). Having data stored in mixed contexts is evil (and a known SQL antipattern to boot). Once you have the column available, you can select a small batch of items with new_bid_id still NULL every few seconds and subject those to extraction, thereby gradually amending the database without overloading the system.
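A sketch of that gradual backfill, run periodically (this assumes the bid_id is the last space-separated token; adjust the extraction to your data):

UPDATE mysql_table_name
SET new_bid_id = CAST(SUBSTRING_INDEX(comment, ' ', -1) AS UNSIGNED)
WHERE new_bid_id IS NULL
LIMIT 100;  -- small batches, to avoid overloading the system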
In practice
This is the same approach one would use with more complicated cases. We start by checking what we have (this is a test table)
SELECT commento FROM arti LIMIT 3;
+-----------------------------------------+
| commento |
+-----------------------------------------+
| This is the first comment 100 200 42500 |
| Another 7 Q 32768 |
| And yet another 200 15 55332 |
+-----------------------------------------+
So we need the last characters:
SELECT SUBSTRING(commento, LENGTH(commento)-5) FROM arti LIMIT 3;
+-----------------------------------------+
| SUBSTRING(commento, LENGTH(commento)-5) |
+-----------------------------------------+
| 42500 |
| 32768 |
| 55332 |
+-----------------------------------------+
This looks good, but it is not: there's an extra space left before the ID. Since SUBSTRING is 1-based, a start of LENGTH(commento)-5 keeps the last six characters, one too many. No matter; we just use 4.
...and we're done.
mysql> SELECT commento FROM arti WHERE SUBSTRING(commento, LENGTH(commento)-4) < 40000;
+-------------------+
| commento |
+-------------------+
| Another 7 Q 32768 |
+-------------------+
mysql> SELECT commento FROM arti WHERE SUBSTRING(commento, LENGTH(commento)-4) BETWEEN 35000 AND 55000;
+-----------------------------------------+
| commento |
+-----------------------------------------+
| This is the first comment 100 200 42500 |
+-----------------------------------------+
The problem is if you have numbers that are not all of the same length (e.g. 300 and 131072). Then you need to take a slice large enough for the larger number, and if the number is short, your slice may catch something like "1 5 300". That's where SUBSTRING_INDEX comes to the rescue: capturing seven characters yields anything from " 131072" to "1 5 300", and the ID will always be in the last space-separated token of the slice.
IN THIS LAST CASE, when numbers are not of the same length, you will find a problem. The extracted IDs are not numbers at all - to MySQL, they are strings. Which means that they are compared in lexicographic, not numerical, order; and "17534" is considered smaller than "202", just like "Alice" comes before "Bob". To overcome this you need to cast the string as unsigned integer, which further slows down the operations.
WHERE CAST( SUBSTRING(...) AS UNSIGNED) < 4000
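Putting SUBSTRING_INDEX and CAST together on the test table (a sketch, assuming the ID is always the last space-separated token):

SELECT commento
FROM arti
WHERE CAST(SUBSTRING_INDEX(commento, ' ', -1) AS UNSIGNED) < 4000;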
I have a table users, but I have shown only 2 columns here. I want to sum the ASCII values of all the characters in the name column.
+----+-------+
| id | name |
+----+-------+
| 0 | user |
| 1 | admin |
| 3 | edit |
+----+-------+
For example, the ASCII sum of "user" will be:
sum(user) = 117 + 115 + 101 + 114 = 447
I have tried this:
SELECT ASCII(Substr(name, 1, 1)) + ASCII(Substr(name, 2, 1)) FROM user
but it only sums the first 2 characters.
You are going to have to fetch one character at a time to do the sum. One method is to write a function with a WHILE loop (see the sketch after the query below). You can also do it with a single SELECT, if you know the longest string:
SELECT name, SUM(ASCII(SUBSTR(name, n, 1)))
FROM user u JOIN
     (SELECT 1 as n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
      SELECT 4 UNION ALL SELECT 5 -- sufficient for your examples
     ) n
     ON LENGTH(name) >= n.n
GROUP BY name;
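For completeness, the WHILE-loop function mentioned above could look like this (a sketch; ascii_sum is a hypothetical name):

DELIMITER //
CREATE FUNCTION ascii_sum(s VARCHAR(255)) RETURNS INT DETERMINISTIC
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE total INT DEFAULT 0;
  -- walk the string one character at a time, accumulating ASCII values
  WHILE i <= CHAR_LENGTH(s) DO
    SET total = total + ASCII(SUBSTRING(s, i, 1));
    SET i = i + 1;
  END WHILE;
  RETURN total;
END//
DELIMITER ;

SELECT name, ascii_sum(name) FROM user;  -- e.g. 'user' -> 447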
If your goal is to turn the string as something that can be easily compared or a fixed length, then you might consider the encryption functions in MySQL. Adding up the ASCII values is not a particularly good hash function (because strings with the same characters in different orders produce the same value). At the very least, multiplying each ASCII value by the position is a bit better.
My table is like this:
create table alphabet_soup (
  id numeric,
  "index" json
);
my data looks like this:
(id, "index") looks like this: (1, '{"1": "A", "2": "C", "3": "C", ... , "600": "B"}')
How do I sum across the json for the number of A and the number of B, and compute the % of occurrence of A or B? I have about 6 different types of values (ABCDEF), but for simplicity I am just looking at a comparison of 3 values.
I am trying to find something to help me calculate the % of occurrence of a value from a key-value pair in json. I am using Postgres 9.4. I am new to both json and Postgres, and I keep landing on the same json functions manual page of Postgres over and over.
I have managed to find a sum, but how do I calculate the % in a nested select and display the key and values in increasing order of occurrence, as follows:
value | occurrence | %
====================================
A | 300 | 50
B | 198 | 33
C | 102 | 17
The script I am using for the sum is:
select id, index->'key'::key as key
       sum(case when (1, index::json->'1')::text = (1, index::json->'2')::text
                then 1
                else 0
           end) / count(id) as res
from alphabet_soup
group by id;
limit 10;
I get the following error:
column "alphabet_soup.id" must appear in the group by clause or be used in an aggregate function.
Thanks for the comment, Patrick. Sorry, I forgot to add that I am using Postgres 9.4.
The easiest way to do this is to expand the json document into a regular row set using the json_each_text() function. Every single json document then becomes a set of rows, and you can apply aggregate functions to those as you would on any other row set. However, you need to use the function as a row source (section 7.2.1.4), since it returns a set of rows, and then select the value field, which has the category of interest. Note that the function uses a field of the table through an implicit LATERAL join (section 7.2.1.5).
SELECT id, value
FROM alphabet_soup, json_each_text("index");
which yields something like:
test=# SELECT id, value FROM alphabet_soup, json_each_text("index");
id | value
----+-------
1 | A
1 | C
1 | C
1 | B
To this you can apply regular aggregate functions over the appropriate windows to get the result you are looking for:
SELECT DISTINCT id, value,
count(value) OVER (PARTITION BY id, value) AS occurrence,
count(value) OVER (PARTITION BY id, value) * 100.0 /
count(id) OVER (PARTITION BY id) AS percentage
FROM (
SELECT id, value
FROM alphabet_soup, json_each_text("index") ) sub
ORDER BY id, value;
Which gives a result like:
id | value | occurrence | percentage
----+-------+------------+---------------------
1 | A | 1 | 25.0000000000000000
1 | B | 1 | 25.0000000000000000
1 | C | 2 | 50.0000000000000000
This will work for any number of categories (ABCDEF) and any number of ids.
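For reference, a minimal setup that reproduces the output above (a sketch; the JSON shape is assumed from the question):

CREATE TABLE alphabet_soup (id numeric, "index" json);
INSERT INTO alphabet_soup
VALUES (1, '{"1": "A", "2": "C", "3": "C", "4": "B"}');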
@Patrick, it was an accident; I am new to Stack Overflow and did not realize how it works. I was fiddling around and found the answer to my follow-up question in addition to the first one. Sorry about that!
For fun, I added some more to the code to compare percentages across the result set:
with q1 as (
  SELECT DISTINCT id, value,
         count(value) OVER (PARTITION BY id, value) AS occurrence,
         count(value) OVER (PARTITION BY id, value) * 100.0 /
         count(id) OVER (PARTITION BY id) AS percentage
  FROM (
    SELECT id, value
    FROM alphabet_soup, json_each_text("index")
  ) sub
  ORDER BY id, value
)
select distinct id, value, least(percentage)
from q1
where least(percentage) > 20
order by id, value;
The output for this is:
id | value | least
----+-------+--------
1 | B | 33
1 | C | 50