I have a table "link_table" in which I want to link three other tables by id. So in every row I have a triplet (id_1, id_2, id_3). I could create a column for every element of the triplet and everything would be fine.
But I want more: =)
I need to respect one more "dimension". There is an algorithm that creates the triplets (the linkings between the tables). The algorithm sometimes outputs different linkings.
Example:
table_person represents a person.
table_task represents a task.
table_loc represents a location.
So a triplet of ids (p, t, l) means: A certain person did something at some location.
The tuples (person, task) are not changed by the algorithm. They are given. The algorithm outputs a location l for a tuple (p, t). But sometimes the algorithm determines different locations for such a tuple. I want to store in a table the last 10 triplets for every tuple (person, task).
What would be the best approach for that?
I thought of something like:
IF there is a tuple (p,t) ALREADY stored in link_table ADD the id of location into the next free slot (column) of the row.
If there are already 10 values (all columns are full) delete the first one, move every value from column i to column i-1 and store the new value in the last column.
ELSE add a new row.
But I don't know if this is a good approach and if it is, how to realise that...
Own partial solution
I figured out that I could make two columns: one that stores the author id and one that stores the task id. And by
...
UNIQUE INDEX (auth_id, task_id)
...
I could index them. So now I just have to figure out how to move values from column i to i-1 elegantly. =)
Kind regards
Aufwind
I would store the output of the algorithm in rows, with a date indicator. The requirement to only consider the last 10 records sounds fairly arbitrary - and I wouldn't enshrine it in my column layout. It also makes some standard relational tools redundant - for instance, the query "how many locations exist for person x and task y" couldn't be answered by "count", but only by looking at which columns are null.
So, I'd recommend something like:
personID  taskID  locationID  dateCreated
1         1       1           1 April 20:20:10
1         1       2           1 April 20:20:11
1         1       3           1 April 20:20:12
The "only 10" requirement could be enforced by using "top 10" in select queries; you could even embed that in a view if necessary.
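To make the row-per-result idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names, the index, the year in the timestamps and the twelve sample rows are all made up for illustration, and SQLite spells "top 10" as LIMIT 10; other engines have their own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE link_table (
        person_id    INTEGER NOT NULL,
        task_id      INTEGER NOT NULL,
        location_id  INTEGER NOT NULL,
        date_created TEXT    NOT NULL
    )
    """
)
# Hypothetical index so that "latest rows for one (person, task)" stays cheap.
conn.execute(
    "CREATE INDEX idx_link ON link_table (person_id, task_id, date_created)"
)

# Twelve algorithm runs for the same (person, task); location ids 1..12 are made up.
rows = [(1, 1, loc, f"2011-04-01 20:20:{loc:02d}") for loc in range(1, 13)]
conn.executemany("INSERT INTO link_table VALUES (?, ?, ?, ?)", rows)


def last_locations(person_id, task_id, limit=10):
    """Return the most recent location ids for one (person, task), newest first."""
    cur = conn.execute(
        """
        SELECT location_id
        FROM link_table
        WHERE person_id = ? AND task_id = ?
        ORDER BY date_created DESC
        LIMIT ?
        """,
        (person_id, task_id, limit),
    )
    return [r[0] for r in cur]


latest = last_locations(1, 1)
print(latest)
```

Nothing has to be shifted between columns here: older rows simply fall outside the LIMIT, and they can be deleted later if storage matters.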
Related
Context: I have three tables.
Table #1: Tbl_TraumaCodes: It marks all the dates, times, and hospital beds where a medical team is alerted to go treat a patient with a serious traumatic injury.
Table #2: Tbl_Location: Lists the date, time, and location (area of the hospital, bed number) where a patient was with an identifying number.
Table #3: Tbl_Outcomes: Has an identifying number paired with discharge outcomes.
Task: I need to match, with a reasonable amount of certainty, records in Tbl_TraumaCodes with Tbl_Outcomes.
Matching Tbl_Location and Tbl_Outcomes is easy and automatic through a matching query using the identifying number. Matching Tbl_Location records with Tbl_TraumaCodes will create the link I need.
I designed a look-up table in Tbl_Location where the date, time, and location of records from Tbl_TraumaCodes appear so that I can match them. However, the times that are supposed to correspond between Tbl_Location and Tbl_TraumaCodes are not exactly the same. The times are roughly in the same ballpark (usually +/- 30 min).
Problem: I have thousands of records to match. There may only be 10 records on a given day, which allows me to limit the options when I type in, say, July 1st in the look-up table. Not every item in Tbl_Location will have a matching item in Tbl_TraumaCodes. That means I have to match 10 records when there may be 40 extra records to work with. It’s incorrect to assign an item (time) in Tbl_TraumaCodes to more than one item in Tbl_Location. My goal is to reduce the potential for human error.
Is there a way to make the records from the look-up table that are already assigned to a record within Tbl_Location NOT display in the look-up field? I thought about drawing the look-up table from a query, but I don’t know how I would create a TraumaCode query that only displays records that aren’t matched in another table. I also don't know if it would impact the previously assigned records.
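One way to phrase "only trauma codes that are not matched yet" is an anti-join with NOT EXISTS. The sketch below is an assumption-heavy illustration: the table layout, the column names (code_id, trauma_code_id and so on) and the sample rows are hypothetical, and SQLite via Python is used only so the query can be run as written. The UNIQUE constraint is one possible way to stop the same trauma code from being assigned twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trauma_codes (
        code_id    INTEGER PRIMARY KEY,
        alert_time TEXT NOT NULL,
        bed        TEXT NOT NULL
    );
    CREATE TABLE location (
        location_id    INTEGER PRIMARY KEY,
        patient_id     INTEGER NOT NULL,
        seen_time      TEXT NOT NULL,
        bed            TEXT NOT NULL,
        -- NULL means "not matched yet"; UNIQUE forbids double assignment.
        trauma_code_id INTEGER UNIQUE REFERENCES trauma_codes (code_id)
    );
    INSERT INTO trauma_codes VALUES
        (1, '2011-07-01 10:00', 'A1'),
        (2, '2011-07-01 14:30', 'B2'),
        (3, '2011-07-01 21:15', 'A1');
    INSERT INTO location VALUES
        (10, 501, '2011-07-01 10:20', 'A1', 1),
        (11, 502, '2011-07-01 14:45', 'B2', NULL),
        (12, 503, '2011-07-01 18:00', 'C3', NULL);
    """
)

# Trauma codes on one day that no location record points to yet.
unmatched = conn.execute(
    """
    SELECT tc.code_id, tc.alert_time, tc.bed
    FROM trauma_codes AS tc
    WHERE tc.alert_time LIKE ?
      AND NOT EXISTS (
          SELECT 1 FROM location AS l WHERE l.trauma_code_id = tc.code_id
      )
    ORDER BY tc.alert_time
    """,
    ("2011-07-01%",),
).fetchall()
print(unmatched)

# Trying to reuse code 1 for a second location is rejected by the constraint.
try:
    conn.execute("UPDATE location SET trauma_code_id = 1 WHERE location_id = 11")
    double_assignment_blocked = False
except sqlite3.IntegrityError:
    double_assignment_blocked = True
```

The query only reads the tables, so the rows that already carry a trauma_code_id stay as they are.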
I avail myself of the collective wisdom and humbly thank you.
I have information which contains the following components
RecID
UID
Name
Namevariant1
Namevariant2
The NameVariant can be thought of as name aliases - for example, "John Paul" could be referred to as "JP", so if JP's information is sought, then I should be able to match JP with John Paul.
Naturally, I would like to see if my input name falls into any of these variants (including the exact match).
I see two ways of doing this:
a) Have a table with Name, NameVariant1, NameVariant2 as columns. The lookup will be the following query:
select * from <table> where Name=<input> OR NameVariant1=<input> OR NameVariant2=<input>
For speed, I might want to create indexes on Name, NameVariant1, NameVariant2.
b) Have a separate table which will contain a map of the variants, and the records associated with these aliases (captured as a recordset, with a suitable delimiter):
RecID
UID
Name
Variant
RecordSet
By going with plan (b), we can avoid storing duplicate aliases. For example, "John Peter" and "John Paul" have the same alias "JP", so I do not have to store "JP" twice.
Please note that a huge number of inputs is expected on these tables, on the order of a few million records.
With respect to lookup performance, I am confused: Assuming that there are 'N' records, plan (a) lookup amounts to searching in 3 different columns - that translates to 3 * (DB lookup on <=N elements). For plan (b), lookup amounts to searching along a maximum of 3N rows [If all aliases are different from each other, there will be 3 aliases per record, and hence 3N records in the second table]. So the complexity in plan (b) is (DB lookup on <=3N elements).
So, the question: Which strategy is better ? 3* (DB lookup on <=N elements) OR (DB lookup on <=3N elements)
For practical purposes, one can assume that there will be very few duplicate entries, so the total number of distinct aliases will be close to 3N.
I am building a database system that contains a number of columns with mathematical relations between them. For example (this is a chess game storage example that I made up), one column is called "num_pawns", one is called "num_knights", one is called "num_castles", and one is called "piece_score". The values in piece_score are determined by the formula: 1*(num_pawns) + 3*(num_knights) + 5*(num_castles). I would like to know how to set a rule that recalculates the value of piece_score whenever one of the other columns is updated. For example, if in row 1 the respective values of the first three columns are {6, 2, 1} and a knight is removed {6, 1, 1}, the value of piece_score should automatically update from 17 to 14 after the value in the num_knights column is updated. I realize that it would probably be more efficient to do this in PHP or C, but I am designing my database model to be portable. It could be on a server with MySQL and PHP, or it could be on a mobile device with SQLite and Java. Is there any way to accomplish this kind of post-query update using strictly SQL?
Update:
To clarify and elaborate on the above example, imagine two additional columns: score_rank and skill_level. The reason for the above computations is so that each row can be assigned a value in score_rank based on its score compared to the other rows (highest score gets 1, second gets 2, etc.). After a rank is assigned, the top 5 scores are given a value of 1 in skill_level, scores 6-10 are given a value of 2, 11-15 are given 3, etc. Is this comparison and ordering possible in SQL? I'm guessing that a trigger is the proper approach, but how could I execute the ordering and assignments over the whole table after an UPDATE? As I mentioned, this database needs to be portable and self-contained, so ideally the sorting and ranking abstraction should be maintained inside the database. That way queries can be made based off of the score_rank column or skill_level column with the assumption that these values reflect the current state of the table.
Unless you have a really, really large table (think hundreds of thousands or millions of rows), you probably want to do this as a view and not by synchronizing the values. The view would be something like this:
CREATE VIEW v
AS
SELECT t.*,
( 1 * num_pawns + 3 * num_knights + 5 * num_castles ) AS piece_score
FROM table_name t;
To do this as a separate column requires writing triggers to handle inserts, updates, and deletes in each row. It is much easier to just calculate the value when you need it. And, given that all the values are on one row, there is no performance advantage to storing the value in another column and keeping it up-to-date using a trigger.
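As a quick check of the view approach, here is a small sketch run against SQLite through Python's sqlite3 module. The table name games and the view name games_scored are made up; the sample row is the {6, 2, 1} example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE games (
        id          INTEGER PRIMARY KEY,
        num_pawns   INTEGER NOT NULL,
        num_knights INTEGER NOT NULL,
        num_castles INTEGER NOT NULL
    );
    -- piece_score is never stored; it is derived every time the view is read.
    CREATE VIEW games_scored AS
    SELECT g.*,
           (1 * g.num_pawns + 3 * g.num_knights + 5 * g.num_castles) AS piece_score
    FROM games AS g;
    INSERT INTO games VALUES (1, 6, 2, 1);
    """
)


def piece_score(game_id):
    """Read the derived score for one row through the view."""
    row = conn.execute(
        "SELECT piece_score FROM games_scored WHERE id = ?", (game_id,)
    ).fetchone()
    return row[0]


before = piece_score(1)
conn.execute("UPDATE games SET num_knights = 1 WHERE id = 1")  # a knight is removed
after = piece_score(1)
print(before, after)
```

Because the score is computed on read, there is nothing to keep in sync after the UPDATE.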
It turns out that a trigger is the most realistic solution to my problem. score_rank could be obtained for a certain row using
SELECT (SELECT COUNT(*) FROM table_name AS p2 WHERE p2.num_pawns + 3*p2.num_knights + 5*p2.num_castles > p1.num_pawns + 3*p1.num_knights + 5*p1.num_castles) FROM table_name AS p1 WHERE p1.rowid = x; where x is a rowid (the count is zero-based, so add 1 to get a rank that starts at 1)
Similarly, skill level can be calculated with this query by dividing by 5. This is probably the ideal way to do it, as it better fits the theoretical foundation of relational databases. However, with multiple columns being calculated in this way, the same values get recalculated each time they are needed, which can waste time as more columns are added. Therefore, I decided to use a trigger that updates score_rank after all necessary columns are filled, and then another trigger that updates skill_level after score_rank is filled, simply by dividing score_rank by 5, etc. This reduces redundant computations.
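For illustration, a trigger-based version could look roughly like the sketch below. It is written for SQLite (run through Python's sqlite3 module) and is not necessarily the exact set of triggers I ended up with: the table name games, the trigger names and the six sample rows are made up, trigger syntax differs between engines, and re-ranking the whole table on every change is only reasonable for small tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE games (
        id          INTEGER PRIMARY KEY,
        num_pawns   INTEGER NOT NULL,
        num_knights INTEGER NOT NULL,
        num_castles INTEGER NOT NULL,
        piece_score INTEGER,
        score_rank  INTEGER,
        skill_level INTEGER
    )
    """
)

# Shared trigger body: rescore the changed row, then re-rank every row.
BODY = """
    UPDATE games
    SET piece_score = num_pawns + 3 * num_knights + 5 * num_castles
    WHERE id = NEW.id;
    UPDATE games
    SET score_rank = 1 + (
        SELECT COUNT(*) FROM games AS g2 WHERE g2.piece_score > games.piece_score
    );
    UPDATE games SET skill_level = (score_rank - 1) / 5 + 1;
"""
conn.executescript(
    f"""
    CREATE TRIGGER games_after_insert AFTER INSERT ON games
    BEGIN {BODY} END;
    CREATE TRIGGER games_after_update
    AFTER UPDATE OF num_pawns, num_knights, num_castles ON games
    BEGIN {BODY} END;
    """
)

sample = [(1, 6, 2, 1), (2, 8, 2, 2), (3, 4, 1, 0),
          (4, 8, 1, 1), (5, 5, 0, 1), (6, 7, 2, 2)]
conn.executemany(
    "INSERT INTO games (id, num_pawns, num_knights, num_castles) VALUES (?, ?, ?, ?)",
    sample,
)

# Remove a knight from row 1: its score drops from 17 to 14 and it loses a rank.
conn.execute("UPDATE games SET num_knights = 1 WHERE id = 1")

state = {
    row[0]: row[1:]
    for row in conn.execute(
        "SELECT id, piece_score, score_rank, skill_level FROM games"
    )
}
print(state)
```

The triggers only listen to the three piece-count columns, so the updates they perform on piece_score, score_rank and skill_level do not fire them again.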
I am doing some prep for my DB final and am stuck on this question. I have the answer but I am not sure that my steps are correct. I would appreciate it if you could tell me whether my answers have the correct logic in them. Thanks
Assume that the EMPLOYEE table has 2000 tuples. It has a primary key
column called ID that has a range of [1 - 2000]. It also has a column
called DOB (date of birth). There are 250 distinct DOB values among
the employees (i.e., on average four employees share the same DOB).
Assume that 20 EMPLOYEE tuples can fit into a disk block. Each
scenario given below is independent, that is because ID is indexed in
one scenario does not mean it is indexed in the others.
Assume that EMPLOYEE has a sparse, B+-tree index on the ID attribute.
Each node in the index has a maximum fanout of 100 (each node in the
tree can have 100 children). Each node in the index is 50% full. (a)
How many disk blocks does the index occupy? (b) The following query
will read how many disk blocks in the worst case (give an exact
number, e.g., 50)?
SELECT * FROM EMPLOYEE WHERE ID = 80;
(c) The following query will read how many disk blocks? Note that this
query projects the ID.
SELECT ID FROM EMPLOYEE WHERE ID > 1500;
(d)
The following query will read how many disk blocks?
SELECT MAX(ID)
FROM EMPLOYEE;
A) My take on this is that there are 100 entries in the index (2000/20), 50 in each block. So 2 blocks for entries and one root block.
B) Because it is a sparse index, it will read one block on the top level to figure out which of the lower ones to go to, and then one of the lower blocks to find the correct values. So it will read 2 blocks in the worst case.
C) 2 blocks. One to find number 1500, another to find all the tuples >1500
D) 3 blocks. 2 to find the max value block, another to find max value itself.
Assume that EMPLOYEE has a single-level, dense, clustering index on
DOB. Assume that each node in the index can hold 200 index records.
(a) How many disk blocks does the index occupy?
(b) The following
query will read how many disk blocks in the worst case (give an exact
number, e.g., 50)?
SELECT ID
FROM EMPLOYEE
WHERE ID = 80;
(c)The following query will read how many disk blocks? Note that this query projects the DOB.
SELECT DOB FROM
EMPLOYEE WHERE DOB <> '1/1/2000';
(d) The following query will read
how many disk blocks?
SELECT * FROM EMPLOYEE WHERE DOB = '1/1/2000';
A) 3 Blocks again. One root, one with 200 entries one with 50
B) Since ID is not indexed in this example, it has to look through the whole table. But I am not sure how to calculate the blocks.
C) 3 Blocks? Whole table must be scanned.
D) 2 lower level blocks to find the indexes
Sorry for the long post, just tried to add all the details.
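The block arithmetic for the first scenario, and for the full table scan in the second, can be written out as a small sketch. It only restates the numbers given in the question; the assumptions that a sparse index holds one entry per data block and that a half-full node holds 50 entries are the usual textbook reading, not something the question states explicitly.

```python
import math

TUPLES = 2000
TUPLES_PER_BLOCK = 20
MAX_FANOUT = 100
FILL_FACTOR = 0.5

# Data file: 2000 tuples at 20 per block.
data_blocks = math.ceil(TUPLES / TUPLES_PER_BLOCK)

# Assumption: a sparse index has one entry per data block,
# and a 50%-full node holds 50 of its 100 possible entries.
entries_per_node = int(MAX_FANOUT * FILL_FACTOR)
leaf_nodes = math.ceil(data_blocks / entries_per_node)

# Add the levels above the leaves until a single root remains.
index_blocks = leaf_nodes
level = leaf_nodes
while level > 1:
    level = math.ceil(level / entries_per_node)
    index_blocks += level

# A query on a column with no usable index has to read every data block.
full_scan_blocks = data_blocks

print(data_blocks, leaf_nodes, index_blocks, full_scan_blocks)
```

Under those assumptions the sparse index takes two leaf blocks plus one root block, and a full scan of the table reads every data block.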
I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a varchar field that stores from 3 up to 16 different "combinations". A combination is a 4-digit number, a colon and a char (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (they can go up to 16). This field is in every row of the table, never a null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). The thing that needs to be done here is when you send '0001:A/0002:A/0003:C' via program(or parameter in SQL), that it checks every row and see if they have the same "group". Now the catch here is that it has to go both ways and it has to be done "quick", and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all fields where there are 3-16 same combinations and return the rows. This 3-16 can be specified via parameter, but the problem is that you would need to find all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see, 0002:A can be the first parameter.
But you cannot have indexing because a combination can be on any place in a string, and you can send different combinations that are not "attached" (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all fields that have the same 3-16 combinations
and it has to go both ways, if you send "0001:A/0002:A/0003:C" it has to find the row above + similar rows(all that contain all the parameters).
Some things/options I tried:
Doing LIKE for all send combinations is not practical + too slow
Giving the field a full-text index isn't an option (I don't know exactly why)
One of the few things that could work would be some "hash" type of encoding for the fields, calculated via program, and then searching for all identical "hashes" (I don't know how you would do that, given that a hash would generate different values for similar texts; maybe a hash written exactly for this purpose)
Making a new field, calculating/writing(can be done on insert) all possible combinations and checking via SQL/program if they have the same % of combinations, but I don't know how you can store 10080 combinations(in case of 16) into a "varchar" effectively, or via some hash code + knowing then which of them are familiar.
There is another catch, this table is in usage almost 24/7, doing combinations to check if they are the same in SQL is too slow because the table is too big, it can be done via program or something, but I don't have any clue on how could you store this in a new row that you would know somehow that they are the same. It is a possibility that you would calculate combinations, storing them via some hash code or something on each row insert, calculating "hash" via program, and checking the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to be executed really fast, under 1 minute, because there could be new inserts into the table, that you would need to check.
The whole point of this would be to see if there are any similar combinations in SQL already and blocking any new combination that would be "similar" for inserting.
I have been dealing with that problem for 3 days now without any possible solution, the thing that was the closest is different type of insert/hash like, but I don't know how could that work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: it should have each group in its own cell to be normalized. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 --lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
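To show the whole round trip, here is a runnable sketch of the three tables and the similarity query, using SQLite through Python's sqlite3 module. The helper names store_sequence and similar_headers are made up, the four sequences are the example rows from the question, and a real 80-100 million row table would of course need a proper RDBMS and index tuning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE grp_sequence_header (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE grp (id INTEGER PRIMARY KEY, group_txt TEXT NOT NULL UNIQUE);
    CREATE TABLE grp_grp_sequence_header (
        group_id                    INTEGER NOT NULL REFERENCES grp (id),
        group_sequence_header_id    INTEGER NOT NULL REFERENCES grp_sequence_header (id),
        group_sequence_header_order INTEGER NOT NULL,
        PRIMARY KEY (group_id, group_sequence_header_id)
    );
    """
)


def store_sequence(header_id, sequence):
    """Split 'NNNN:X/NNNN:X/...' into groups and link each one to the header."""
    conn.execute(
        "INSERT INTO grp_sequence_header (id, description) VALUES (?, ?)",
        (header_id, sequence),
    )
    for position, token in enumerate(sequence.split("/")):
        conn.execute("INSERT OR IGNORE INTO grp (group_txt) VALUES (?)", (token,))
        (group_id,) = conn.execute(
            "SELECT id FROM grp WHERE group_txt = ?", (token,)
        ).fetchone()
        conn.execute(
            "INSERT INTO grp_grp_sequence_header VALUES (?, ?, ?)",
            (group_id, header_id, position),
        )


def similar_headers(sequence, lower=3, upper=16):
    """Return (header_id, shared_group_count) for headers sharing lower..upper groups."""
    tokens = sequence.split("/")
    placeholders = ", ".join("?" for _ in tokens)
    query = f"""
        SELECT ggsh.group_sequence_header_id, COUNT(*) AS shared
        FROM grp_grp_sequence_header AS ggsh
        JOIN grp AS g ON ggsh.group_id = g.id
        WHERE g.group_txt IN ({placeholders})
        GROUP BY ggsh.group_sequence_header_id
        HAVING COUNT(*) BETWEEN ? AND ?
        ORDER BY ggsh.group_sequence_header_id
    """
    return conn.execute(query, (*tokens, lower, upper)).fetchall()


LONG = ("0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/"
        "0707:A/0709:A/0710:D/0711:C/0712:A/0713:A")
store_sequence(1, LONG)
store_sequence(2, "0001:A/0002:A/0003:C")
store_sequence(3, "0001:A/0002:A/0003:A/0006:C")
store_sequence(4, "0701:A/0709:A/0711:C/0712:A/0713:A")

short_hits = similar_headers("0001:A/0002:A/0003:C")
long_hits = similar_headers(LONG)
print(short_hits, long_hits)
```

The same query works in both directions: a short input finds the long row that contains it, and the long input finds the shorter rows it contains.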
EDIT
Rethinking it a bit more, you could even break up the group into the two parts, but as I seem to understand, you always have full groups to deal with, so it doesn't seem to be necessary.
EDIT2 Maybe if you want to speed the process up even more, I'd recommend to translate the sequences using bijection into numeric data. For example, evaluate the first 4 numbers to be an integer, shift it by 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 =26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 =43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 =5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 =159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
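The mapping described above can be sketched in a few lines of Python. The function names encode and decode are made up for illustration, and the input format "NNNN:X" with X in A-E is taken from the question.

```python
def encode(token: str) -> int:
    """Map 'NNNN:X' to an integer: number shifted left by 4 bits plus the hex digit."""
    number, letter = token.split(":")
    if letter not in "ABCDE" or len(letter) != 1:
        raise ValueError(f"unexpected letter in {token!r}")
    return (int(number) << 4) + int(letter, 16)


def decode(value: int) -> str:
    """Inverse of encode: recover the 4-digit number and the letter."""
    return f"{value >> 4:04d}:{value & 0xF:X}"


codes = [encode(t) for t in ("0001:A", "0002:B", "0343:D", "9999:E")]
print(codes)
print(decode(5501))
```

Since every token maps to exactly one integer and back, the integer can replace GROUP_TXT as the indexed column.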
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into those discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key and exploding out your column on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
t1.key AS subset_key
, t2.key AS superset_key
, COUNT(*) AS shared
FROM
temp_table t1
JOIN temp_table t2
ON t2.combination = t1.combination
AND t2.key <> t1.key
JOIN ( SELECT t3.key, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.key) t4
ON t4.key = t1.key
GROUP BY
t1.key, t2.key, t4.cnt
HAVING
COUNT(*) = t4.cnt -- every combination of t1.key also appears under t2.key
So return the two keys where key1 is a proper subset of key?
I would recommend building a special "index".
It will be quite big, but lookups will be very fast.
Let's consider this task as searching a set of symbols.
There are design conditions.
The symbols are made by pattern "NNNN:X", where NNNN is number [0001-9999] and X is letter [A-E].
So we have 5 * 9999 = 49995 symbols in alphabet.
Maximum length of words with this alphabet is 16.
For each word we can build the set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As the symbols are sorted within a word, we have only 2^N-1 combinations (15 for 4 symbols).
For a 16-symbol word there are 2^16 - 1 = 65535 combinations.
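A short sketch of how those combinations could be generated in a program, using itertools from the Python standard library. The function name sub_combinations and the "/" separator are assumptions made for illustration.

```python
from itertools import combinations


def sub_combinations(symbols):
    """Return every non-empty combination of the sorted symbols, joined with '/'."""
    ordered = sorted(symbols)
    result = []
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            result.append("/".join(combo))
    return result


small = sub_combinations(["0001:A", "0002:A", "0003:C", "0005:A"])
print(len(small))

# Worst case: a word with 16 distinct symbols.
sixteen = [f"{n:04d}:A" for n in range(1, 17)]
worst_case = len(sub_combinations(sixteen))
print(worst_case)
```

Each generated string would become one row of the "index" table, paired with the original value it came from.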
So we make an additional index-organized table like this one
create table spec_ndx(combination varchar2(100), original_value varchar2(100))
Performance will be excellent, at the price of overhead: in the worst case, each record in the original table will have 65535 "index" records.
So a 100-million-row table would yield, in the worst case, an index table of roughly 6.5 trillion rows.
But if the values are short, the size of the "special index" shrinks drastically.