How to calculate database disk blocks? - mysql

I am doing some prep for my DB final and am stuck on this question. I have the answer, but I am not sure that my steps are correct. I would appreciate it if you could tell me whether my answers use the correct logic. Thanks!
Assume that the EMPLOYEE table has 2000 tuples. It has a primary key
column called ID that has a range of [1 - 2000]. It also has a column
called DOB (date of birth). There are 250 distinct DOB values among
the employees (i.e., on average four employees share the same DOB).
Assume that 20 EMPLOYEE tuples can fit into a disk block. Each
scenario given below is independent; that is, just because ID is indexed in
one scenario does not mean it is indexed in the others.
Assume that EMPLOYEE has a sparse, B+-tree index on the ID attribute.
Each node in the index has a maximum fanout of 100 (each node in the
tree can have 100 children). Each node in the index is 50% full. (a)
How many disk blocks does the index occupy? (b) The following query
will read how many disk blocks in the worst case (give an exact
number, e.g., 50)?
SELECT * FROM EMPLOYEE WHERE ID = 80;
(c) The following query will read how many disk blocks? Note that this
query projects the ID.
SELECT ID FROM EMPLOYEE WHERE ID > 1500;
(d)
The following query will read how many disk blocks?
SELECT MAX(ID)
FROM EMPLOYEE;
A) My take on this is that there are 100 entries in the index (2000 tuples / 20 tuples per block = 100 data blocks, with one sparse entry per block), and each index node holds 50 entries since nodes are 50% full of a fanout of 100. So 2 blocks for the entries and one root block, 3 blocks in total.
B) Because it is a sparse index, it will read one block at the top level to figure out which of the lower ones to go to, and then one of the lower blocks to find the correct values. So it will read 2 blocks in the worst case.
C) 2 blocks. One to find number 1500, another to find all the tuples >1500
D) 3 blocks. Two to find the block containing the max value, and another to read the max value itself.
Assume that EMPLOYEE has a single-level, dense, clustering index on
DOB. Assume that each node in the index can hold 200 index records.
(a) How many disk blocks does the index occupy?
(b) The following
query will read how many disk blocks in the worst case (give an exact
number, e.g., 50)?
SELECT ID
FROM EMPLOYEE
WHERE ID = 80;
(c) The following query will read how many disk blocks? Note that this query projects the DOB.
SELECT DOB FROM
EMPLOYEE WHERE DOB <> '1/1/2000';
(d) The following query will read
how many disk blocks?
SELECT * FROM EMPLOYEE WHERE DOB = '1/1/2000';
A) 3 blocks again. One root, one with 200 entries, and one with 50.
B) Since ID is not indexed in this example, it has to look through the whole table. But I am not sure how to calculate the number of blocks.
C) 3 blocks? The whole table must be scanned.
D) 2 lower-level blocks to find the index entries.
Sorry for the long post, just tried to add all the details.

Related

Find a range of unused primary keys to create product numbers in batch

I am trying to find a way to identify an empty range of primary keys.
I have a table whose primary key is a numeric column named id.
I am trying to fill the gaps when receiving a batch of products.
My id column (the primary key) has numbers that can follow each other, but for each product type we jump to another thousand, and some employees are not following the rule of taking the next free spot; they find an empty one and use it!
I would need to find something that would have a functionality similar to:
select product.id
from product
where product.id >= x
and next product.id > x + y
x would be the last used product id;
y is the number of products I have to fill in this batch.
For example, if x is my starting point and has a value of 25000 and y = 50,
the first time 50 unused numbers are reached is from 26600 to 27500.
The result would give me 25999... which would be the last product entered.
It is mandatory that batch product have consecutive numbers.
Is there any query that can give that result?
thank you in advance!
This query should produce the first suitable gap, using 50 as the required size.
Note that you should query the gap and insert the batch in the same transaction, to prevent someone else from using the same gap at the same time.
select a.id+1
from yourtable a
where not exists (select id from yourtable where id between a.id+1 and a.id+50)
order by a.id
limit 1
If the table is empty, this query will return no rows.
Beware that filling gaps in primary keys in a context with high demand and many simultaneous inserts would be a bad idea, since it would serialize all transactions.
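A rough sketch of the same-transaction advice (assuming an InnoDB table so that FOR UPDATE actually locks the boundary row; the table name and batch size of 50 are just illustrative):
START TRANSACTION;
-- find the start of the first gap of at least 50 free ids and lock the boundary row
SELECT a.id + 1 AS gap_start
FROM yourtable a
WHERE NOT EXISTS (SELECT id FROM yourtable
                  WHERE id BETWEEN a.id + 1 AND a.id + 50)
ORDER BY a.id
LIMIT 1
FOR UPDATE;
-- insert the batch here, using ids gap_start .. gap_start + 49
COMMIT;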

Which mysql strategy is better ? 3* (DB lookup on <=N elements) OR (DB lookup on <=3N elements)

I have information which contains the following components
RecID
UID
Name
Namevariant1
Namevariant2
The NameVariant can be thought of as name aliases - for example, "John Paul" could be referred to as "JP", so if JP's information is sought, then I should be able to match JP with John Paul.
Naturally, I would like to see if my input name falls into any of these variants (including the exact match).
I see two ways of doing this:
a) Have table with Name, NameVariant1, NameVariant2 as columns. The lookup will be the following query:
select * from <table> where Name=<input> OR NameVariant1=<input> OR NameVariant2=<input>
For speed, I might want to create indexes on Name, NameVariant1, NameVariant2.
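A sketch of what those indexes would look like (my_names is a hypothetical table name; whether MySQL can actually combine the three indexes for the OR lookup via its index_merge optimization is worth checking with EXPLAIN):
CREATE INDEX idx_name     ON my_names (Name);
CREATE INDEX idx_variant1 ON my_names (NameVariant1);
CREATE INDEX idx_variant2 ON my_names (NameVariant2);
-- the type column should show index_merge if the OR is answered as a union of index scans
EXPLAIN SELECT * FROM my_names
WHERE Name = 'JP' OR NameVariant1 = 'JP' OR NameVariant2 = 'JP';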
b) Have Separate table which will contain a map of the variants, and the records associated with these aliases (captured as a recordset, with a suitable delimiter):
RecID
UID
Name
Variant
RecordSet
By going with plan (b), we can avoid storing duplicate aliases. For example, "John Peter" and "John Paul" have the same alias "JP", so I do not have to store "JP" twice.
Please note that there are huge number of inputs expected on these tables, in the order of a few million records.
With respect to lookup performance, I am confused: Assuming that there are 'N' records, plan (a) lookup amounts to searching in 3 different columns - that translates to 3 * (DB lookup on <=N elements). For plan (b), lookup amounts to searching along a maximum of 3N rows [If all aliases are different from each other, there will be 3 aliases per record, and hence 3N records in the second table]. So the complexity in plan (b) is (DB lookup on <=3N elements).
So, the question: Which strategy is better ? 3* (DB lookup on <=N elements) OR (DB lookup on <=3N elements)
For practical purposes, one can assume that there will be very less amount of duplicate entries, so the total number of distinct aliases will be close to 3N.
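For comparison, a rough sketch of a normalized take on plan (b), with one row per (alias, record) pair instead of a delimited RecordSet (all names here are hypothetical):
CREATE TABLE name_alias (
  alias  VARCHAR(100) NOT NULL,
  rec_id INT          NOT NULL,
  PRIMARY KEY (alias, rec_id)   -- lookups by alias use the leading column of the key
);
-- one indexed probe on the alias table, then join back to the main record
SELECT r.*
FROM name_alias a
JOIN records r ON r.RecID = a.rec_id
WHERE a.alias = 'JP';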

Computational Complexity of SELECT DISTINCT(column) FROM table on an indexed column

Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe it would be O(1)*NUM_DISTINCT_VALUES vs O(NUM_OF_ROWS).
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, intuitively you could simply jump to the last item of each group, so the time complexity would be tied to the number of distinct groups and not the number of rows. The calculation would then take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, basically calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be handled in O(1) time per distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page in my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression this will take longer as the number of messages grows, so instead I have a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the stats page I simply select a single row
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
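That is, on every new message I run something along these lines (a sketch, using the column names above):
UPDATE stats
SET total_messages = total_messages + 1
WHERE category_id = x;
-- (and similarly for total_unread_messages where appropriate)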
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
If you are doing:
select distinct column
from table
And there is an index on column, then MySQL can process this query using a "loose index scan" (described here).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
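One way to see whether the optimizer actually picks the loose index scan is to check EXPLAIN; assuming an index on column (placeholder names as in the question), MySQL reports it in the Extra field:
EXPLAIN SELECT DISTINCT column FROM table;
-- Extra = "Using index for group-by" means a loose index scan was chosen;
-- plain "Using index" means an ordinary full scan of the index.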

Storing ids in a MySQL-Link-Table

I have a table "link_tabl" in which I want to link three other tables by id. So in every row I have a triplet (id_1, id_2, id_3). I could create a column for every element of the triplet and everything would be fine.
But I want more: =)
I need to respect one more "dimension". There is an algorithm that creates the triplets (the links between the tables). The algorithm sometimes outputs different links.
Example:
table_person represents a person.
table_task represents a task.
table_loc represents a location.
So a triplet of ids (p, t, l) means: a certain person did a certain task at some location.
The tuples (person, task) are not changed by the algorithm; they are given. The algorithm outputs a location l for a tuple (p, t), but sometimes it determines different locations for such a tuple. I want to store in a table the last 10 triplets for every tuple (person, task).
What would be the best approach for that?
I thought of something like:
IF there is a tuple (p,t) ALREADY stored in link_table ADD the id of location into the next free slot (column) of the row.
If there are already 10 values (all columns are full) delete the first one, move every value from column i to column i-1 and store the new value in the last column.
ELSE add a new row.
But I don't know if this is a good approach and if it is, how to realise that...
Own partial solution
I figured out that I could make two columns: one that stores the person id and one that stores the task id. And by
...
UNIQUE INDEX (auth_id, task_id)
...
I could index them. So now I just have to figure out how to move values from column i to i-1 elegantly. =)
Kind regards
Aufwind
I would store the output of the algorithm in rows, with a date indicator. The requirement to only consider the last 10 records sounds fairly arbitrary - and I wouldn't enshrine it in my column layout. It also makes some standard relational tools redundant - for instance, the query "how many locations exist for person x and task y" couldn't be answered with COUNT, but only by looking at which columns are null.
So, I'd recommend something like:
personID  taskID  locationID  dateCreated
1         1       1           1 April 20:20:10
1         1       2           1 April 20:20:11
1         1       3           1 April 20:20:12
The "only 10" requirement could be enforced by using LIMIT 10 (MySQL's version of "top 10") in SELECT queries; you could even embed that in a view if necessary.
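A minimal sketch of that layout, plus the "last 10 for a given pair" query (names are illustrative):
CREATE TABLE link_table (
  person_id   INT      NOT NULL,
  task_id     INT      NOT NULL,
  location_id INT      NOT NULL,
  created_at  DATETIME NOT NULL,
  KEY (person_id, task_id, created_at)   -- supports the per-pair "last 10" lookup
);
SELECT location_id, created_at
FROM link_table
WHERE person_id = 1 AND task_id = 1
ORDER BY created_at DESC
LIMIT 10;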

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random and it doesn't play well with a LIMIT clause; you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m     (where m = # of C statuses)
When you don't need to filter, you can generate random numbers against the pkey as you mentioned above. When you do need to filter, generate them against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
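A rough sketch of the "run on an interval" option, numbering rows within each status with MySQL user variables (table and column names are assumptions, and the idiom relies on the variables being evaluated in ORDER BY order, which works in practice on 5.x but is not guaranteed):
-- rebuild the lookup table, numbering rows 1..m within each status
TRUNCATE TABLE status_lookup;
INSERT INTO status_lookup (pkey, status, status_pkey)
SELECT pkey, status, status_pkey
FROM (
  SELECT d.pkey,
         d.status,
         @n := IF(@prev = d.status, @n + 1, 1) AS status_pkey,
         @prev := d.status AS prev_status
  FROM data_table d
  CROSS JOIN (SELECT @n := 0, @prev := NULL) vars
  ORDER BY d.status, d.pkey
) numbered;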
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.
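That said, if you want to stay entirely in SQL, the two steps could be glued together with a user variable and a prepared statement, since LIMIT/OFFSET accept placeholders but not expressions (a sketch, assuming a table MyTable with a status column, and using FLOOR instead of ROUND so the offset stays within 0 .. COUNT(*)-1):
-- pick one random row matching the filter
SET @off := (SELECT FLOOR(RAND() * COUNT(*)) FROM MyTable WHERE status = 'X');
PREPARE pick FROM 'SELECT * FROM MyTable WHERE status = ''X'' LIMIT 1 OFFSET ?';
EXECUTE pick USING @off;
DEALLOCATE PREPARE pick;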