I currently have a table which represents the start and stop work times of an employee:
id_employee int
check_in datetime
check_out datetime
It requires an update on check_out when the employee is finished.
Would it be preferable to have a table like the following instead?
id_employee int
date_event datetime
event_type varchar, values can be CHECKIN or CHECKOUT.
To determine whether an employee has already checked in, all I have to do is check whether the last record for that employee has an event_type of CHECKIN. Also, fetching a record and updating it is no longer necessary.
Is the second approach better? Or do you have other suggestions?
I know this post is old, but this is for anyone who is still looking for a solution:
Attendance Table Structure
id | int
employee_code | varchar
status | enum('check_in','check_out')
created | datetime
Data
id employee_code status created
1 EMP0001 check_in 2016-08-20 09:30:30
2 EMP0001 check_out 2016-08-20 18:15:00
3 EMP0002 check_in 2016-08-21 14:52:48
4 EMP0002 check_out 2016-08-21 21:09:18
Query
SELECT
A1.employee_code,
A1.created AS check_in_at,
A2.created AS check_out_at,
TIMEDIFF(A2.created, A1.created) AS total_time
FROM
tbl_attendances AS A1
INNER JOIN tbl_attendances AS A2
ON A1.employee_code = A2.employee_code
AND DATE(A1.created) = DATE(A2.created)
WHERE 1 = 1
AND A1.status = 'check_in'
AND A2.status = 'check_out'
AND DATE(A1.created) BETWEEN '2016-08-20'
AND '2016-08-21'
AND DATE(A2.created) BETWEEN '2016-08-20'
AND '2016-08-21'
ORDER BY A1.created DESC
Results
employee_code check_in_at check_out_at total_time
EMP0002 2016-08-21 14:52:48 2016-08-21 21:09:18 06:16:30
EMP0001 2016-08-20 09:30:30 2016-08-20 18:15:00 08:44:30
For a specific employee, add AND A1.employee_code = 'EMP0001' to the WHERE clause.
As usual, "it depends".
Option 1 is easier to build, and simpler to query. Finding out who checked in but didn't check out is a simple query; finding the total hours worked for each employee is also straightforward. This simplicity probably means it will be faster for common queries. The only drawback I see is that it is harder to extend. If you want to capture a different event type for "lunch break", for instance, you have to add extra columns.
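For instance, the total-hours question with option 1 is a single aggregation. A minimal sketch, assuming the option 1 table is called work_times (the table name is my assumption):

SELECT id_employee,
       SUM(TIMESTAMPDIFF(MINUTE, check_in, check_out)) / 60 AS hours_worked
FROM work_times
WHERE check_out IS NOT NULL
GROUP BY id_employee;

-- Who checked in but never checked out:
SELECT id_employee, check_in FROM work_times WHERE check_out IS NULL;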
Option 2 is more flexible - you can add new event types without changing your schema. However, simple queries - how many hours did employee X work in June - are quite tricky. You pay for the flexibility with significant additional effort.
So, it depends what you mean by "better".
Format #2 is better because:
This table is just a punch-record log; even if it contains anomalies, it doesn't matter much.
Going forward this table will expand; for example, you might want to introduce two more events, INTERVAL_OUT and INTERVAL_IN. The second format keeps that simple.
If possible, use event_type_id instead of event_type, with either another table event_type or just a constant array, e.g.
$array_event_name = array(1 => 'CHECKIN', 2 => 'CHECKOUT', 3 => 'INTERVAL_IN', 4 => 'INTERVAL_OUT');
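If you go with a lookup table instead of the constant array, a sketch might look like this (all names here are my assumptions):

CREATE TABLE event_type (
  id TINYINT UNSIGNED PRIMARY KEY,
  name VARCHAR(20) NOT NULL UNIQUE
);

INSERT INTO event_type (id, name) VALUES
  (1, 'CHECKIN'), (2, 'CHECKOUT'), (3, 'INTERVAL_IN'), (4, 'INTERVAL_OUT');

-- The punch table then stores event_type_id TINYINT UNSIGNED
-- with a foreign key referencing event_type(id).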
I would go with the second one.
However, the main questions and business rules will be the same, and answerable by either approach.
Option #1
With the first option, the database can better protect itself from some anomalies¹. Some anomalies are still possible², but it's a start.
On the other hand, InnoDB tables are clustered and secondary indexes in clustered tables can be expensive (see the "Disadvantages of clustering" in this article), which is something to consider if you need to query on check_out.
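As a minimal sketch of the declarative protection mentioned above, assuming MySQL 8.0.16+ (where CHECK constraints are enforced) and a table named work_times (both assumptions of mine):

ALTER TABLE work_times
  ADD CONSTRAINT chk_out_after_in CHECK (check_out IS NULL OR check_out > check_in);
-- Prevents a row from being checked out before it was checked in; it does not
-- prevent overlapping stints (the second footnote).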
Option #2
With the second option, you are relying on the imperative code even for anomalies that can be prevented purely declaratively with the database design.
On a plus side, you are less likely to need secondary indexes.
Choice
So in a nutshell, go with the first option, unless you need a secondary index. If you do need the secondary index, depending on what kind of index covering you wish to achieve, you might go with either option.
¹ Such as checking out without first checking in.
² Such as checking in again without first checking out, overlapping "stints", etc.
I would go with the first option here. Putting both timestamps in a single row will speed up your searches and make your calculations easier.
Suppose you want to calculate the work hours of an employee for a day. Your search will stop at the first row it matches and you will have all the required data; you won't have to dig any deeper, which is not the case with option 2. Option 1 also reduces your table size by using only one row per check-in/check-out.
Option 2 does have one advantage, though. When checking out, your database has to do a search to update the data with option 1; with option 2, it's just a write.
Considering that you will search the data far more often than you insert it, you can give up the direct-insert advantage to gain a better structure and faster searches. The final choice is, of course, up to you.
Good Luck!
Related
For the following 2 table structures, assuming the data volume is really high:
cars table
Id | brand name | make year | purchase year | owner name
Is there any query performance benefit with structuring it this way and joining the 2 tables instead?
cars table
Id | brand_id | make year | purchase year | owner name
brands table
Id | name
Also, if all 4 columns appear in my WHERE clause, does it make sense to index any of them?
I would at least have INDEX(owner_name) since that is very selective. Having INDEX(owner_name, model_year) won't help enough to matter for this type of data. There are other cases where I would recommend a 4-column composite index.
"data volume is really high". If you are saying there are 100K rows, then it does not matter much. If you are saying a billion rows, then we need to get into a lot more details.
"data volume is really high". 10 queries/second -- Yawn. 1000/second -- more details, please.
2 tables vs 1.
Data integrity - someone could mess up the data either way
Speed -- a 1-byte TINYINT UNSIGNED (range 0..255) is smaller than the average of about 7 bytes for a VARCHAR(55) brand. But it is hardly enough smaller to matter on space or speed. (And if you goof and make brand_id a BIGINT, which is 8 bytes; well, oops!)
Indexing all columns is different than having no indexes. But "indexing all" is ambiguous:
INDEX(user), INDEX(brand), INDEX(year), ... is likely to make it efficient to search or sort by any of those columns.
INDEX(user, brand, year), ... makes it especially efficient to search by all those columns (with =), or certain ORDER BYs.
No index implies scanning the entire table for any SELECT.
Another interpretation of what you said (plus a little reading between the lines): Might you be searching by any combination of columns? Perhaps non-= things like year >= 2016? Or make IN ('Toyota', 'Nissan')?
Study http://mysql.rjweb.org/doc.php/index_cookbook_mysql
An argument for 1 table
If you need to do
WHERE brand = 'Toyota'
AND year = 2017
Then INDEX(brand, year) (in either order) is possible and beneficial.
But... If those two columns are in different tables (as with your 2-table example), then you cannot have such an index, and performance will suffer.
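As a sketch, that index on the single-table design could be created like this (the table and column names are adapted from the example above and are assumptions):

ALTER TABLE cars ADD INDEX idx_brand_year (brand, make_year);
-- The equivalent index cannot exist once brand lives in a separate brands table,
-- because an index cannot span two tables.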
For my bachelor thesis I have to analyze a password leak, and I have a table with 2 columns, MEMBER_EMAIL and MEMBER_HASH.
I want to calculate the frequency of each hash efficiently, so that the output looks like:
Hash | Amount
----------------
2e3f.. | 345
2f2e.. | 288
b2be.. | 189
My query until now was straightforward:
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC
While it works fine for smaller tables, I have problems computing the query on the whole list (112 million entries), where it takes over 2 days and ends in a weird connection timeout error, even though my settings in that regard are fine.
So I wonder if there is a better way to calculate this (as I can't really think of any); I would appreciate any help!
Your query itself can't be optimized much, as it's quite simple. The only way I can think of to improve how it executes is to index MEMBER_HASH.
This is how you can do it:
ALTER TABLE `table` ADD INDEX `hashed` (`MEMBER_HASH`);
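Once the index exists, EXPLAIN is a quick way to confirm MySQL actually uses it for the GROUP BY (a sanity check, not part of the original answer):

EXPLAIN
SELECT MEMBER_HASH AS hashed, COUNT(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC;
-- With the index in place, the plan should show the index being scanned ("Using index"),
-- so rows are grouped in index order instead of being sorted first; the final
-- ORDER BY amount still sorts the grouped result (one row per distinct hash).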
I have 2 tables: posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here the vote column can be either 1 (upvote) or -1 (downvote). Now if I need to fetch the total votes (upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from the votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it every time a user upvotes or downvotes. Then simply read that votes_counter.
My question is which one is better, and under what conditions. By conditions, I mean factors like scalability, peak time, et cetera.
As far as I know, if I use method 1, count(*) could be a heavy operation on a table with millions of rows. To avoid that situation, if I use a counter, then during peak time the votes_counter column might become a locking hot spot, with too many users trying to update the counter!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table.
The first approach can be sped up by well-designed indexes: rather than scanning the whole table, your RDBMS can retrieve a few entries from the index and do the counts using them.
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
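As a rough sketch of both ends of that progression, assuming the posts and votes tables from the question and MySQL syntax:

-- Approach 1, helped by an index so only the relevant rows are read:
ALTER TABLE votes ADD INDEX idx_post_vote (post_id, vote);

SELECT COALESCE(SUM(vote), 0) AS score
FROM votes
WHERE post_id = 123;          -- 123 is a placeholder post id

-- Approach 2, keeping posts.votes_counter in sync automatically:
DELIMITER $$
CREATE TRIGGER trg_votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
BEGIN
  UPDATE posts
  SET votes_counter = votes_counter + NEW.vote
  WHERE id = NEW.post_id;
END$$
DELIMITER ;
-- Similar AFTER UPDATE / AFTER DELETE triggers are needed if votes can be changed or removed.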
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so based on that it sounds like the way it would be run is:
SELECT VoteTypeId FROM Votes WHERE VoteTypeId = 2 OR VoteTypeId = 3
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults
if(vote == 2) score++;
if(vote == 3) score--;
Even with millions of results, this is probably going to be a very fast operation as it's so simple.
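If you would rather have the database do the arithmetic instead of looping in application code, the same score can be computed with one aggregate query. A sketch against the schema above, where @PostId is a placeholder for the post being scored:

SELECT SUM(CASE VoteTypeId WHEN 2 THEN 1 WHEN 3 THEN -1 ELSE 0 END) AS Score
FROM Votes
WHERE PostId = @PostId
  AND VoteTypeId IN (2, 3);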
This might sound like a silly question, but here it is. I am sure it has happened to everyone around here: you build a web app with a db structure per specifications (php/mysql), but then the specs change slightly and you need to change the db to reflect it. Here is a short example:
Order table
->order id
->user id
->closed
->timestamp
but because the orders are paid in a different currency than the one quoted in the db, I need to add an exchange rate field, which is only checked and known when the order is closed, not when the record is inserted. Thus I can either add the new field to the current table, leave it null/blank on insert, and then update it when necessary; or I can create a new table with the following structure:
Order exchange rates
->exchange id
->order id
->exchange rate
Although I believe the latter is better, because it is a less intrusive change and won't affect the rest of the application's functionality, you could end up with an insane amount of joined queries to get all the necessary information. On the other hand, the former approach could mess up some other queries in the db, but it is definitely more practical and also more logical in terms of the overall db structure. Also, I don't think that inserting null and updating later is good practice, but that might just be my lonely opinion...
Thus I would like to ask which approach you think is preferable.
I'm thinking of another alternative. Set up an exchange rate table like:
create table exchange_rate(
 cur_code_from varchar(3) not null
,cur_code_to   varchar(3) not null
,valid_from    date not null
,valid_to      date not null
,rate          decimal(20,6) not null
);
alter table exchange_rate
add constraint exchange_rate_pk
primary key(cur_code_from, cur_code_to, valid_from);
The table should hold data that looks something like:
cur_code_from cur_code_to valid_from valid_to rate
============= =========== ========== ======== ====
EUR EUR 2014-01-01 9999-12-31 1
EUR USD 2014-01-01 9999-12-31 1,311702
EUR SEK 2014-01-01 2014-03-30 8,808322
EUR SEK 2014-04-01 9999-12-31 8,658084
EUR GBP 2014-01-01 9999-12-31 0,842865
EUR PLN 2014-01-01 9999-12-31 4,211555
Note the special case when you convert from and to the same currency.
From a normalization perspective, you don't need valid_to, since it can be computed from the next valid_from; but from a practical point of view, it's easier to work with a valid-to date than to use a sub-query every time.
Then, to convert into the customer's currency, you would join with this table:
select o.order_value * x.rate as value_in_customer_currency
from orders o
join exchange_rate x on(
x.cur_code_from = 'EUR' -- Your default currency here
and x.cur_code_to = 'SEK' -- The customer's currency here
and o.order_close_date between x.valid_from and x.valid_to
)
where o.order_id = 1234;
Here I have used the rates valid as of the order_close_date. So if you have two orders, one with a close date of 2014-02-01, then it would pick up a different rate than an order with a close date of 2014-04-05.
I think you just need to add an exchange_rate_id column to the order table and create a lookup table Exchange_Rates with columns ex_rate_id, description, deleted, created_date.
So when an order closes, you just need to update the exchange_rate_id column in the order table with the id, and later on you can join with the lookup table to pull records.
Keep in mind that:
one order has only one currency upon closing;
one currency can be applied to one or many orders.
It is a one-to-many relationship, so I don't think you have to make a separate table for the rate itself. If you did, I think that would just be extra normalization.
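A minimal sketch of that lookup table and the extra column (the column names come from the answer; the types and the orders table name are my assumptions):

CREATE TABLE Exchange_Rates (
  ex_rate_id   INT AUTO_INCREMENT PRIMARY KEY,
  description  VARCHAR(100) NOT NULL,   -- e.g. 'EUR -> USD @ 1.3117'
  deleted      TINYINT(1) NOT NULL DEFAULT 0,
  created_date DATETIME NOT NULL
);

ALTER TABLE orders
  ADD COLUMN exchange_rate_id INT NULL,
  ADD CONSTRAINT fk_order_exchange_rate
      FOREIGN KEY (exchange_rate_id) REFERENCES Exchange_Rates (ex_rate_id);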
Hey. I have 160 columns that are filled with data when a user fills out a report form and submits it. A few of these sets of columns contain similar data, but there need to be multiple instances of this data per record set, as it may differ per instance in the report.
For example, an employee opens a case of a certain type at one point in the day, then at another point in the day they open another case of a different type. I want to create totals per user based on the values in these columns. There is one column set that I want to target right now: case type. I would like to be able to see all instances of the value "TSTO" in columns CT1, CT2, CT3... through CT20, then have that sorted by the employee ID number, which is just one column in the table.
Any ideas? I am struggling with this one.
So far I have SELECT CT1, CT2, CT3, CT4, CT5, CT6, CT7, CT8, CT9, CT10, CT11, CT12, CT13, CT14, CT15, CT16, CT17, CT18, CT19, CT20 FROM REPORTS GROUP BY OFFICER
This will display the values of all the case type entries in a record set, but I need to count them. I tried to use
SELECT CT1, CT2, CT3, CT4, CT5, CT6, CT7, CT8, CT9, CT10, CT11, CT12, CT13, CT14, CT15, CT16, CT17, CT18, CT19, CT20 FROM REPORTS COUNT(TSTO) GROUP BY OFFICER
but it just spits out an error. I am fairly new to mysql databases and php; I feel I have a good grasp, but querying the database and the syntax involved is a tad confusing and/or overwhelming right now. I just have to learn the language. I will keep looking, and I have found some similar things on here, but I don't completely understand what I am looking at, and I would like to shy away from using code that "works" but that I don't fully understand.
Thank you very much :)
Edit -
So this database is an activity report server for the day's work of the employees. A person will often open cases during the day. These cases vary in type, and the different types are designated by a four-letter convention, so your different case types could be TSTO, DOME, ASBA, etc. The user fills out their form throughout the day and then submits it down to the database. That's all fine :) Now I am trying to build a page which will query the database, on user request, for statistics of a user's activities. So right now I am trying to generate statistics. Specifically, in human terms, I want to be able to generate the statistic "HOW MANY OCCURRENCES OF "USER-INPUTTED CASE TYPE" ARE THERE FOR EMPLOYEEIDXXX".
So when a user submits a form, they will type in this four-letter case type up to 20 times in one form; there are 20 fields for this case type entry, thus there are 20 columns. These 20 columns for case type are in one record set, and one record set is generated per report. Another column that is generated is the employeeid column, which basically identifies who generated the record set through their form.
So I would like to be able to query all 20 case type columns, across all record sets, for a given case type (TSTO, DOME, ASBA, etc.) and then group that by the corresponding user(s).
So the output would look something like,
316 TSTO's for employeeid108
I hope this helps clear it up a bit. Again, I am fairly fresh to all of this, so I am not the best with the vernacular, best practices, etc.
Thanks so much :)
Edit 2 -
So to further elaborate on what I have going on: I have an HTML form that has 164 fields. Each of these fields ultimately puts a value into a column of a single record set in my DB on each submission. I couldn't post images or more than two URLs, so I will try to explain it the best I can without screenshots.
So what happens is this information gets into the DB. Then there is the querying. I have a search page which uses an HTML form to select the type of information to be searched for. It then displays a synopsis of each report that matches the query. The user then enters the REPORT ID # of the report they want to view in full into another small form (an input field with a submit button), which brings them to a page with the full report displayed when they click submit.
So right now I am trying to do totals and realizing my DB will need some work and tweaking to make it easier to write queries for the different information needed. I've gleaned some good information so far and will continue to try to provide concise information about my setup as best I can.
Thanks.
Edit 3 -
Maybe you can go to my photobucket and check them out; it should let me post one link. There are five screenshots, so you can kind of see better what I have happening there.
http://s1082.photobucket.com/albums/j376/hughessa
:)
The query you are looking for would be very long and complicated for your current db schema.
Every table like (some_id, column1, column2, column3, column4, ...), where the columns store the same type of data, can also be represented by a table (some_id, column_number, column_value), where instead of 1 row with values for 20 columns you have 20 rows.
So your table should rather look like:
officer ct_number ct_value
1 CT1 TSTO
1 CT2 DOME
1 CT3 TSTO
1 CT4 ASBA
(...)
2 CT1 DOME
2 CT2 TSTO
For a table like this, if you wanted to find how many occurrences of the different ct_values there are for officer 1, you would use a simple query:
SELECT officer, ct_value, count(ct_value) AS ct_count
FROM reports WHERE officer=1 GROUP BY ct_value
giving results
officer ct_value ct_count
1 TSTO 2
1 DOME 1
1 ASBA 1
If you wanted to find out how many TSTO's there are for different officers, you would use:
SELECT officer, ct_value, count( officer ) as ct_count FROM reports
WHERE ct_value='TSTO' GROUP BY officer
giving results
officer ct_value ct_count
1 TSTO 2
2 TSTO 1
Also, any query for your old schema can easily be converted to the new schema.
However, if you need to store additional information about each particular report, I suggest having two tables:
Submissions
submission_id report_id ct_number ct_value
(submission_id is the primary key, auto-increment)
------------------------------------------------
1 1 CT1 TSTO
2 1 CT2 DOME
3 1 CT3 TSTO
4 1 CT4 ASBA
5 2 CT1 DOME
6 2 CT2 TSTO
with report_id pointing to a record in another table with as many columns as you need for additional data:
Reports
report_id officer date some_other_data
(report_id is the primary key, auto-increment)
--------------------------------------------------------------------
1 1 2011-04-29 11:28:15 Everything went ok
2 2 2011-04-29 14:01:00 There were troubles
Example:
How many TSTO's are there for different officers:
SELECT r.officer, s.ct_value, count( officer ) as ct_count
FROM submissions s JOIN reports r ON s.report_id = r.report_id
WHERE s.ct_value='TSTO'
GROUP BY r.officer
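For completeness, the two tables described above might be created like this (a sketch; any column types beyond what the examples show are my assumptions):

CREATE TABLE reports (
  report_id       INT AUTO_INCREMENT PRIMARY KEY,
  officer         INT NOT NULL,
  `date`          DATETIME NOT NULL,
  some_other_data TEXT
);

CREATE TABLE submissions (
  submission_id INT AUTO_INCREMENT PRIMARY KEY,
  report_id     INT NOT NULL,
  ct_number     VARCHAR(5) NOT NULL,   -- 'CT1' .. 'CT20'
  ct_value      CHAR(4) NOT NULL,      -- 'TSTO', 'DOME', 'ASBA', ...
  FOREIGN KEY (report_id) REFERENCES reports (report_id)
);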