Cascade Updates with Math in SQL - mysql

I am building a database system that contains a number of columns with mathematical relations between them. For example (this is a chess game storage example that I made up), one column is called "num_pawns", one is called "num_knights", one is called "num_castles", and one is called "piece_score". The values in piece_score are determined by the formula: 1*(num_pawns) + 3*(num_knights) + 5*(num_castles). I would like to know how to set a rule that recalculates the value of piece_score whenever one of the other columns is updated. For example, if in row 1 the respective values of the first three columns are {6, 2, 1} and a knight is removed, giving {6, 1, 1}, the value of piece_score should automatically update from 17 to 14 after the value in the num_knights column is updated. I realize that it would probably be more efficient to do this in PHP or C, but I am designing my database model to be portable. It could be on a server with MySQL and PHP, or it could be on a mobile device with SQLite and Java. Is there any way to accomplish this kind of post-query update using strictly SQL?
Update:
To clarify and elaborate on the above example, imagine two additional columns: score_rank and skill_level. The reason for the above computations is so that each row can be assigned a value in score_rank based on its score compared to the other rows (highest score gets 1, second gets 2, etc.). After a rank is assigned, the top 5 scores are given a value of 1 in skill_level, scores 6-10 are given a value of 2, 11-15 are given 3, etc. Is this comparison and ordering possible in SQL? I'm guessing that a trigger is the proper approach, but how could I execute the ordering and assignments over the whole table after an UPDATE? As I mentioned, this database needs to be portable and self-contained, so ideally the sorting and ranking abstraction should be maintained inside the database. That way queries can be made against the score_rank or skill_level columns with the assumption that these values reflect the current state of the table.

Unless you have a really, really large table (think hundreds of thousands or millions of rows), you probably want to do this as a view and not by synchronizing the values. The view would be something like this:
CREATE VIEW v
AS
SELECT t.*,
( num_pawns * 1 + 3 * num_knights + 5 * num_castles ) AS piece_score
FROM table_name t;
To do this as a separate column requires writing triggers to handle inserts, updates, and deletes on each row. It is much easier to just calculate the value when you need it. And, given that all the values are on one row, there is no performance advantage to storing the value in another column and keeping it up-to-date using a trigger.

It turns out that a trigger is the most realistic solution to my problem. score_rank could be obtained for a certain row using
SELECT (SELECT COUNT(*) + 1
        FROM table_name AS p2
        WHERE p2.num_pawns + 3*p2.num_knights + 5*p2.num_castles
            > p1.num_pawns + 3*p1.num_knights + 5*p1.num_castles)
FROM table_name AS p1
WHERE p1.rowid = x; -- where x is a rowid
Similarly, skill_level can be calculated from this value by dividing by 5. This is probably the ideal way to do it, as it better fits the theoretical foundation of relational databases. However, with multiple columns being calculated in this way, the same values get recalculated each time they are needed, which wastes more time as more columns are added. Therefore, I decided to use a trigger that updates score_rank after all the necessary columns have been filled, and then another trigger updates skill_level after score_rank is filled, simply by dividing score_rank by 5, etc. This reduces redundant computations.
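For reference, here is a minimal sketch of the simpler per-row case, keeping piece_score in sync with a trigger (MySQL syntax; the table name games and the trigger name are assumptions, and a matching BEFORE INSERT trigger would be needed as well):
DELIMITER //
CREATE TRIGGER games_piece_score_bu
BEFORE UPDATE ON games
FOR EACH ROW
BEGIN
  -- recompute the derived score from the incoming column values
  SET NEW.piece_score = NEW.num_pawns + 3 * NEW.num_knights + 5 * NEW.num_castles;
END//
DELIMITER ;
Note that the whole-table score_rank recalculation is the harder part for portability: SQLite triggers may UPDATE the table they are defined on, but MySQL triggers cannot, so that step may have to live outside a trigger on MySQL.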

Related

Computational Complexity of SELECT DISTINCT(column) FROM table on an indexed column

Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe that would be O(1)*NUM_DISTINCT_COLS vs O(NUM_OF_ROWS)
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, you could visually just jump to the last item in each group, so the time complexity would be tied to the number of distinct groups and not the number of rows. The calculation would then take about the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, basically calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be handled in O(1) time for each distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page in my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression this will take longer as the number of messages grows, so instead I have a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the stats page I simply select a single row
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
If you are doing:
select distinct column
from table
And there is an index on column, then MySQL can process this query using a "loose index scan" (described in the MySQL documentation).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
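If you want to check whether the optimizer actually picks the loose index scan for your table, EXPLAIN reports "Using index for group-by" in the Extra column when it does. A quick illustration (the messages table and index name here are just placeholders):
CREATE INDEX idx_category ON messages (category_id);
EXPLAIN SELECT DISTINCT category_id FROM messages;
-- Extra: "Using index for group-by" indicates a loose index scan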

How to grab rows which contain a dual id reference

I have a messages table:
messages:
id(int)
send_id(int)
receive_id(int)
And I want to be able to select rows from this only when a->b and b->a exist, e.g.:
id send_id receive_id
0, 15, 16
1, 16, 15
So that basically one message has been passed to each person. How would I go about selecting just one of those two rows (either direction), and all such rows for a specific id?
I want to only return results that have this duality.
My code currently uses a nested SELECT and doesn't work at all as needed.
You can achieve the result by taking advantage of MySQL's LEAST and GREATEST built-in functions.
SELECT *
FROM messages
WHERE (LEAST(send_id, receive_id), GREATEST(send_id, receive_id), id)
IN
(
    SELECT LEAST(send_id, receive_id) AS x,
           GREATEST(send_id, receive_id) AS y,
           MAX(id) AS msg_id
    FROM messages
    GROUP BY x, y
    HAVING COUNT(DISTINCT send_id) = 2 -- keep only pairs where both directions exist
);
You have to define an additional synthesized column for this. There are different alternatives: permanent and indexed (fast), temporary if it is just for a selection once a month, or computed on the fly inside the actual query.
Whichever alternative you choose, that column should contain both ids, ordered numerically and concatenated, perhaps with a separator character such as -. When you then apply a uniqueness restriction to that column, only one of the two candidate rows can enter the result; the second one is rejected because it would violate that uniqueness rule.
The trick is the ordered concatenation; a normal combined index would allow both variants because of the different order of the ids.
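As an illustration of the on-the-fly variant, here is a sketch against the messages table above (the HAVING clause enforces the both-directions requirement; adjust it if duplicate messages in the same direction are possible):
SELECT CONCAT(LEAST(send_id, receive_id), '-', GREATEST(send_id, receive_id)) AS pair_key,
       MIN(id) AS kept_id
FROM messages
GROUP BY pair_key
HAVING COUNT(DISTINCT send_id) = 2; -- only pairs where both directions exist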

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table with around 80-100 million records in it. That table has a varchar field which stores from 3 up to 16 different "combinations". A combination is a 4-digit number, a colon, and a character (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (there can be up to 16). This field is in every row of the table, never null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to happen here is: when you send '0001:A/0002:A/0003:C' from a program (or as a parameter in SQL), it should check every row and see whether they share the same "group". The catch is that it has to work both ways, it has to be done quickly, and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all rows that share 3-16 of the same combinations and return them. This 3-16 range can be specified via a parameter, but the problem is that you would need to consider all possible combinations, because you could send '0002:A/0711:C/0713:A', and as you can see 0002:A can be the first parameter.
But you cannot rely on indexing, because a combination can appear at any place in the string, and you can send combinations that are not adjacent (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all rows that share the same 3-16 combinations,
and it has to go both ways: if you send "0001:A/0002:A/0003:C" it has to find the row above plus similar rows (all that contain all of the parameters).
Some things/options I tried:
Doing a LIKE for every sent combination is not practical + too slow
Giving the field a full-text index search isn't an option (I don't know exactly why)
One of the few things that could work would be making some "hash"-type encoding for the fields, calculating it in a program, and searching for all identical "hashes" (I don't know how you would do that, given that a hash would generate different values for similar texts; maybe some hash written exactly for that purpose)
Making a new field, calculating/writing (this can be done on insert) all possible combinations, and checking via SQL/program whether they share the same % of combinations; but I don't know how you can store 10080 combinations (in the case of 16) into a "varchar" effectively, or via some hash code, and then know which of them are similar.
There is another catch: this table is in use almost 24/7, and generating combinations to compare inside SQL is too slow because the table is too big. It could be done in a program or something, but I have no clue how you would store this in a new column so that you would somehow know which rows are the same. One possibility is to calculate the combinations on each row insert, store them via some hash code or similar, calculate the "hash" of the search value in the program, and check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to execute really fast, under 1 minute, because there could be new inserts into the table that would need to be checked.
The whole point of this is to see whether any similar combinations already exist in SQL, and to block the insert of any new combination that would be "similar".
I have been dealing with this problem for 3 days now without finding a solution; the closest thing was some kind of insert-time hash, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it should check every row and see whether they share the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: it should have each group in its own cell to be normalized. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 -- lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
EDIT
Rethinking it a bit more, you could even break up the group into its two parts, but as far as I understand you always have full groups to deal with, so it doesn't seem to be necessary.
EDIT2 If you want to speed the process up even more, I'd recommend translating the groups into numeric data via a bijection. For example, evaluate the first 4 digits as an integer, shift it 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16 + 10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16 + 11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16 + 13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16 + 14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
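The encoding can even be expressed directly in SQL if needed; the following is purely illustrative and assumes the NNNN:X format described above:
SELECT CAST(SUBSTRING('0343:D', 1, 4) AS UNSIGNED) * 16
       + (ASCII(SUBSTRING('0343:D', 6, 1)) - ASCII('A') + 10) AS encoded;
-- returns 5501 for '0343:D'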
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
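A rough sketch of that join, assuming a tokens table with columns row_id and token (both names are made up):
SELECT t.row_id, COUNT(*) AS matched_tokens
FROM tokens t
WHERE t.token IN ('0001:A', '0002:A', '0003:C') -- the tokenized search string
GROUP BY t.row_id
HAVING COUNT(*) >= 3; -- minimum number of shared combinations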
As people have commented, you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key, exploding your column out on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
    t1.`key`
  , t2.`key`
  , COUNT(*) AS matched
FROM
    temp_table t1
    JOIN temp_table t2
      ON  t2.combination = t1.combination
      AND t2.`key` <> t1.`key`
GROUP BY
    t1.`key`, t2.`key`
HAVING
    COUNT(*) = (SELECT COUNT(*)
                FROM temp_table t3
                WHERE t3.`key` = t1.`key`)
So return the two keys where key1 is a proper subset of key?
I can recommend building a special "index".
It will be quite big, but you will achieve super-speedy results.
Let's consider this task as searching for a set of symbols.
These are the design constraints:
The symbols are made by pattern "NNNN:X", where NNNN is number [0001-9999] and X is letter [A-E].
So we have 5 * 9999 = 49995 symbols in alphabet.
Maximum length of words with this alphabet is 16.
We can build for each word set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As the symbols are sorted within words, we have only 2^N - 1 combinations (15 for 4 symbols).
For 16-symbols word there are 2^16 - 1 = 65535 combinations.
So we make an additional index-organized table like this one:
create table spec_ndx(combination varchar(100), original_value varchar(100))
Performance will be excellent, at the price of overhead: in the worst case there will be 65535 "index" records for each record in the original table.
So for a 100-million-row table we would get an index table with several trillion rows.
But if the values are short, the size of the "special index" shrinks drastically.
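With that table in place, a lookup for a given set of combinations becomes a single indexed equality search, for example:
SELECT original_value
FROM spec_ndx
WHERE combination = '0001:A/0002:A/0003:C';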

Storing ids in a MySQL-Link-Table

I have a table "link_table" in which I want to link three other tables by id. So in every row I have a triplet (id_1, id_2, id_3). I could create a column for every element of the triplet and everything would be fine.
But I want more: =)
I need to respect one more "dimension". There is an algorithm that creates the triplets (the links between the tables). The algorithm sometimes outputs different links.
Example:
table_person represents a person.
table_task represents a task.
table_loc represents a location.
So a triplet of ids (p, t, l) means: A certain person did something at some location.
The tuple (person, task) is not changed by the algorithm; it is given. For a tuple (p, t) the algorithm outputs a location l. But sometimes the algorithm determines different locations for such a tuple. I want to store in a table the last 10 triplets for every tuple (person, task).
What would be the best approach for that?
I thought of something like:
IF there is a tuple (p,t) ALREADY stored in link_table ADD the id of location into the next free slot (column) of the row.
If there are already 10 values (all columns are full) delete the first one, move every value from column i to column i-1 and store the new value in the last column.
ELSE add a new row.
But I don't know if this is a good approach, and if it is, how to realise it...
Own partial solution
I figured out that I could make two columns: one which stores the person id and one which stores the task id. And by
...
UNIQUE INDEX (auth_id, task_id)
...
I could index them. So now I just have to figure out how to move values from column i to i-1 elegantly. =)
Kind regards
Aufwind
I would store the output of the algorithm in rows, with a date indicator. The requirement to only consider the last 10 records sounds fairly arbitrary - and I wouldn't enshrine it in my column layout. It also makes some standard relational tools redundant - for instance, the query "how many locations exist for person x and task y" couldn't be answered by "count", but instead by looking at which column is null.
So, I'd recommend something like:
personID  taskID  locationID  dateCreated
1         1       1           1 April 20:20:10
1         1       2           1 April 20:20:11
1         1       3           1 April 20:20:12
The "only 10" requirement could be enforced by using "top 10" in select queries; you could even embed that in a view if necessary.

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 but it's not exactly random and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which ids have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m   (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
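A sketch of the filtered lookup in two steps (status_index and data_table are made-up names for the secondary table and the data table):
-- Step 1: pick a random position within status 'C'
SELECT FLOOR(1 + RAND() * COUNT(*)) FROM status_index WHERE status = 'C';
-- Step 2: fetch that row, plugging in the value from step 1 (say it was 42)
SELECT d.*
FROM status_index s
JOIN data_table d ON d.pkey = s.pkey
WHERE s.status = 'C' AND s.status_pkey = 42;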
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.