Updating DB structure - new table or adding a field - MySQL

This might sound like a silly question, but here it is; I am sure it has happened to everyone around here: you build a web app with a DB structure per specifications (PHP/MySQL), but then the specs change slightly and you need to change the DB to reflect it. Here is a short example:
Order table
->order id
->user id
->closed
->timestamp
But because the orders are paid in a different currency than the one quoted in the DB, I need to add an exchange rate field, which is only checked and known when closing the order, not upon insertion of the record. Thus I can either add the new field to the current table, leave it null/blank when inserting, and update it when necessary; or I can create a new table with the following structure:
Order exchange rates
->exchange id
->order id
->exchange rate
Although I believe the latter is better because it is a less intrusive change and won't affect the rest of the application's functionality, you could end up with an insane number of joined queries to get all the information necessary. On the other hand, the former approach could mess up some other queries in the DB, but it is definitely more practical and also more logical in terms of the overall DB structure. Also, I don't think it is good practice to insert null and update later, but that might be just my lonely opinion...
Thus I would like to ask which approach you think is preferable.
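In rough DDL terms (table name and column types assumed for illustration), the two options would be:
-- option 1: add a nullable column, left null at insert and updated when the order closes
alter table orders add column exchange_rate decimal(20,6) null;

-- option 2: a separate table, one row per closed order
create table order_exchange_rates (
  exchange_id   int unsigned not null auto_increment primary key,
  order_id      int unsigned not null,
  exchange_rate decimal(20,6) not null
);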

I'm thinking of another alternative. Set up an exchange rate table like:
create table exchange_rate(
 cur_code_from varchar(3) not null
,cur_code_to   varchar(3) not null
,valid_from    date not null
,valid_to      date not null
,rate          decimal(20,6) not null
);
alter table exchange_rate
add constraint exchange_rate_pk
primary key(cur_code_from, cur_code_to, valid_from);
The table should hold data that looks something like:
cur_code_from  cur_code_to  valid_from  valid_to    rate
=============  ===========  ==========  ==========  ========
EUR            EUR          2014-01-01  9999-12-31  1
EUR            USD          2014-01-01  9999-12-31  1.311702
EUR            SEK          2014-01-01  2014-03-31  8.808322
EUR            SEK          2014-04-01  9999-12-31  8.658084
EUR            GBP          2014-01-01  9999-12-31  0.842865
EUR            PLN          2014-01-01  9999-12-31  4.211555
Note the special case when you convert from and to the same currency.
From a normalization perspective, you don't need valid_to since it can be computed from the next valid_from, but from a practical point of view, it's easier to work with a valid-to date than to use a sub-query every time (sketched below).
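For illustration, the fully normalized variant would derive valid_to from the following row's valid_from with a correlated sub-query along these lines (a sketch):
select x.cur_code_from, x.cur_code_to, x.valid_from, x.rate,
       coalesce(
         date_sub((select min(x2.valid_from)
                     from exchange_rate x2
                    where x2.cur_code_from = x.cur_code_from
                      and x2.cur_code_to   = x.cur_code_to
                      and x2.valid_from    > x.valid_from),
                  interval 1 day),
         '9999-12-31'
       ) as valid_to
  from exchange_rate x;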
Then, to convert into the customer's currency, you would join with this table:
select o.order_value * x.rate as value_in_customer_currency
  from orders o
  join exchange_rate x on(
       x.cur_code_from = 'EUR' -- your default currency here
   and x.cur_code_to   = 'SEK' -- the customer's currency here
   and o.order_close_date between x.valid_from and x.valid_to
  )
 where o.order_id = 1234;
Here I have used the rates valid as of the order_close_date. So if you have two orders, one with a close date of 2014-02-01 and another with a close date of 2014-04-05, they would pick up different (SEK) rates.
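For completeness, loading the sample rates above would look something like this:
insert into exchange_rate
  (cur_code_from, cur_code_to, valid_from, valid_to, rate)
values
  ('EUR', 'EUR', '2014-01-01', '9999-12-31', 1.000000),
  ('EUR', 'USD', '2014-01-01', '9999-12-31', 1.311702),
  ('EUR', 'SEK', '2014-01-01', '2014-03-31', 8.808322),
  ('EUR', 'SEK', '2014-04-01', '9999-12-31', 8.658084),
  ('EUR', 'GBP', '2014-01-01', '9999-12-31', 0.842865),
  ('EUR', 'PLN', '2014-01-01', '9999-12-31', 4.211555);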

I think you just need to add an exchange_rate_id column to the order table and create a lookup table Exchange_Rates with columns ex_rate_id, description, deleted, created_date.
So when an order closes, you just need to update the exchange_rate_id column in the order table with the id, and later on you can join with the lookup table to pull records.
Keep in mind that:
one order has only one currency upon closing;
one currency can be applied to one or many orders.
It is a one-to-many relationship, so I don't think you have to make a separate table for that. If you do, I think that would count as over-normalization.
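A sketch of that layout (column types and the orders table name are assumptions; the ids are placeholders):
create table Exchange_Rates (
  ex_rate_id   int unsigned not null auto_increment primary key,
  description  varchar(100) not null,
  deleted      tinyint(1) not null default 0,
  created_date datetime not null
);

alter table orders add column exchange_rate_id int unsigned null;

-- when the order closes:
update orders set exchange_rate_id = 42 where order_id = 1234;

-- later, pull records via the lookup table:
select o.order_id, r.description
  from orders o
  join Exchange_Rates r on r.ex_rate_id = o.exchange_rate_id;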

Related

Apply value specified for date range (from, to) to value in table containing single row per date

First of all, sorry for the title, but my English is too poor to explain the meaning of my question. :)
Let's suppose that we have two tables. The first table, tbl_percents, contains the percent value history over date ranges. If the to date field equals 0000-00-00, it means that it is an unfinished (open-ended) range.
table: tbl_percents
from date
to date
percent int
example content:
2001-01-01 | 2015-01-21 | 10%
2015-01-21 | 0000-00-00 | 20%
The second table, tbl_revenue, contains revenue values for specific dates.
table: tbl_revenue
date date
revenue bigint
example content:
2014-01-10 | 10
2015-01-22 | 10
Now we want to apply the percent specified in tbl_percents to the revenue. As a result we want to get something like this:
2014-01-10 | 1 #because from 2001-01-01 to 2015-01-21 percent = 10%
2015-01-22 | 2 #because from 2015-01-21 till now percent = 20%
Is it possible to get this result in a single SQL query?
Yep. You want to do a join using a BETWEEN condition. I have to caution you that these types of queries get very expensive very fast, so you don't want to do this on a huge dataset. That being said, you can join your tables with something like the following:
SELECT b.revenue, a.percent
FROM tbl_percents AS a
INNER JOIN tbl_revenue AS b
  ON b.date BETWEEN a.from_date AND
     CASE WHEN a.to_date = DATE("0000-00-00") THEN DATE("2100-01-01")
          ELSE a.to_date END
Basically, what I'm doing is setting the end of the range to something far in the future (namely Jan 01, 2100) whenever the to_date is 0000-00-00; otherwise I just use the to_date. Using that, I join the revenue table to my percents table where the revenue date is between the percent start date and my modified percent end date.
Again, this is computationally not a good idea on a huge dataset... but for general purposes, it should work just fine. If you start having trouble with speed/performance, I'd suggest trying to apply similar logic using a scripting language like R or Python.
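To get the exact figures from the question (1 and 2), you can fold the multiplication into the same join; a sketch, keeping the from_date/to_date column names used above:
SELECT b.date, b.revenue * a.percent / 100 AS applied_revenue
FROM tbl_percents AS a
INNER JOIN tbl_revenue AS b
  ON b.date BETWEEN a.from_date AND
     CASE WHEN a.to_date = DATE("0000-00-00") THEN DATE("2100-01-01")
          ELSE a.to_date END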
Best of luck!
Something like:
SELECT (CAST(COALESCE((SELECT `percent` FROM tbl_percents
         WHERE tbl_revenue.`date` BETWEEN `from` AND `to`
            OR (tbl_revenue.`date` > `from` AND `to` = '0000-00-00')
         LIMIT 1), 0) AS DECIMAL(12,2)) / 100) * revenue
       AS MyNewVal FROM tbl_revenue
I can't test this where I am, but it might get you pointed in a good direction. I think you need to cast the int-stored `percent` field to decimal to avoid 10/100 == 0, but it seems straightforward otherwise.

Adding or subtracting values based on another field's contents

I have a table with transactions. All transactions are stored as positive numbers; whether it's a deposit or a withdrawal, only the action changes. How do I write a query that can sum up the numbers based on the action?
Actions: 1 = Buy, 2 = Sell, 5 = Dividend

ID  ACTION  SYMBOL  PRICE  SHARES
1   1       AGNC    27.50  150
2   2       AGNC    30.00  50
3   5       AGNC    1.25   100
So the query should show AGNC has a total of 100 shares.
SELECT
  symbol, sum(shares) AS shares,
  ROUND(abs(sum((price * shares))), 2) AS cost
FROM bf_transactions
WHERE (action_id <> 5)
GROUP BY symbol
HAVING sum(shares) > 0
I was originally using that query when I had positive/negative numbers and that worked great, but I don't know how to do it now with just positive numbers.
This ought to do it:
SELECT symbol, sum(case action
when 1 then shares
when 2 then -shares
end) as shares
FROM bf_transactions
GROUP BY symbol
SQL Fiddle here
It is, however, good practice to denormalize this kind of data: what you appear to have now is a correctly normalized database with no duplicate data, but it's rather impractical to use, as you can see in cases like this. You should keep a separate table with the current stock portfolio that you update whenever a transaction is executed.
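A minimal sketch of such a portfolio table, assuming the application updates it on every executed transaction (the name bf_portfolio is hypothetical):
CREATE TABLE bf_portfolio (
  symbol VARCHAR(10) NOT NULL PRIMARY KEY,
  shares INT NOT NULL
);

-- on a buy of 150 AGNC; for a sell, pass the share count as a negative number
INSERT INTO bf_portfolio (symbol, shares)
VALUES ('AGNC', 150)
ON DUPLICATE KEY UPDATE shares = shares + VALUES(shares);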
Also, including a HAVING clause to 'hide' corrupted data (someone has sold more than they have purchased) seems like rather bad practice to me; when a situation like that is detected, you should definitely raise some kind of error or internal alert.

Summary of a day or run a query and sum the results?

Since I don't know how to calculate efficiency, I'll ask here and hope someone can tell me which is better and explain it a bit.
The scenario:
Currently I have a table into which I insert a row for each worker's production.
Something like: (Worker1) produced (Product10) with (some amount) on a date.
And that goes on for each station he worked at through the day.
The Question:
I need to generate a report of the sum of amounts each worker produced for each date. I know how to generate the report either way; the question is which is more efficient:
running a query for each person that sums up the production for each date, or keeping a table into which I insert the total amount, worker ID, and date?
Again, if you could explain it a bit further that would be nice; if not, then at least an educated answer would help me a lot with this problem.
Example:
This is what I have right now in my production table:
ID  EmpID  ProductID  Amount  Dateofproduction
----------------------------------------------
1   1      1          100     14/01/2013
2   1      2          20      14/01/2012
This is what I want in the end:
EmpID  Amount  DateofProduction
-------------------------------
1      120     14/01/2013
Should I start another table for this, or should I just sum what I have in the production table and take what I need?
Bear in mind that the production table will get larger and larger each day (of course).
i) Direct:
select EmpId, sum(Amount) as Amount, DateOfProduction
from ProductionTable
group by EmpId, DateOfProduction
ii) Now, the size of the table will keep growing, and you need only day-wise reports.
Is this table being used by anyone else? Can some of the data be archived? If so, I would suggest that after each day's reporting you back up all the data from this table to a secondary archive table. That way, every day you will have to query only today's worth of records.
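The archiving step itself could be as simple as this sketch (ProductionArchive is an assumed mirror of the production table):
insert into ProductionArchive
select * from ProductionTable
where DateOfProduction < CURDATE();

delete from ProductionTable
where DateOfProduction < CURDATE();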
Secondly, you can consider adding an index on DateOfProduction. You will then be able to restrict your queries to a date range. For example: select EmpId, sum(Amount) as Amount, DateOfProduction from ProductionTable where DateOfProduction = DATE(NOW()) group by EmpId, DateOfProduction (or something similar).
Because it is just a single table with no complicated queries, MySQL will easily be able to take care of millions of records. Try EXPLAIN on the queries to check the number of records being touched and the indexes being used.
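The index from the second point would be, for example (a sketch):
alter table ProductionTable add index idx_date (DateOfProduction);
-- or, to cover the grouping as well:
alter table ProductionTable add index idx_emp_date (EmpId, DateOfProduction);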
Unless I am missing something, it sounds like you just want this:
select empid,
sum(amount) TotalAmount,
Dateofproduction
from yourtable
group by empid, Dateofproduction
See SQL Fiddle with Demo
Result:
| EMPID | TOTALAMOUNT | DATEOFPRODUCTION |
------------------------------------------
| 1 | 120 | 2013-01-14 |
Note: I am guessing that the second row of data you provided is supposed to be 2013 not 2012.

Design to represent employee check-in and check-out

I currently have a table which represents the start and stop work times of an employee:
id_employee int
check_in datetime
check_out datetime
It requires an update on check_out when the employee is finished.
Would it be preferable to have a table as follows?
id_employee int
date_event datetime
event_type varchar, values can be CHECKIN or CHECKOUT.
To determine if an employee has already checked in, all I have to do is check whether the last record for a given employee has an event_type of CHECKIN. Also, fetching a record and updating it is no longer necessary.
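For example, that last-event check could look like this (a sketch; the table name is assumed):
select event_type
  from employee_events
 where id_employee = 42
 order by date_event desc
 limit 1;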
Is the second approach better? Or do you have other suggestions?
I know this post is old, but this is for anyone who's still looking for a solution:
Attendance Table Structure
id | int
employee_code | varchar
status | enum('check_in','check_out')
created | datetime
Data
id  employee_code  status     created
1   EMP0001        check_in   2016-08-20 09:30:30
2   EMP0001        check_out  2016-08-20 18:15:00
3   EMP0002        check_in   2016-08-21 14:52:48
4   EMP0002        check_out  2016-08-21 21:09:18
Query
SELECT
A1.employee_code,
A1.created AS check_in_at,
A2.created AS check_out_at,
TIMEDIFF(A2.created, A1.created) AS total_time
FROM
tbl_attendances AS A1
INNER JOIN tbl_attendances AS A2
ON A1.employee_code = A2.employee_code
AND DATE(A1.created) = DATE(A2.created)
WHERE 1 = 1
AND A1.status = 'check_in'
AND A2.status = 'check_out'
AND DATE(A1.created) BETWEEN '2016-08-20'
AND '2016-08-21'
AND DATE(A2.created) BETWEEN '2016-08-20'
AND '2016-08-21'
ORDER BY A1.created DESC
Results
employee_code  check_in_at          check_out_at         total_time
EMP0002        2016-08-21 14:52:48  2016-08-21 21:09:18  06:16:30
EMP0001        2016-08-20 09:30:30  2016-08-20 18:15:00  08:44:30
For a specific employee, add AND A1.employee_code = 'EMP0001' to the WHERE clause.
As usual, "it depends".
Option 1 is easier to build and simpler to query. Finding out who checked in but didn't check out is a simple query; finding the total hours worked by each employee is also straightforward. This simplicity probably means it will be faster for common queries. The only drawback I see is that it is harder to extend: if you want to capture a different event type, such as "lunch break", you have to add extra columns.
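For instance, with option 1 both of those queries are one-liners (a sketch; the table name is assumed):
-- who checked in but never checked out
select id_employee
from work_sessions
where check_out is null;

-- total hours worked per employee
select id_employee,
       sum(timestampdiff(minute, check_in, check_out)) / 60 as hours_worked
from work_sessions
where check_out is not null
group by id_employee;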
Option 2 is more flexible: you can add new event types without changing your schema. However, simple queries (how many hours did employee X work in June?) are quite tricky. You pay for the flexibility with significant additional effort.
So, it depends on what you mean by "better".
Format #2 is better because:
This table is just a punch record entry; even if it has anomalies, it doesn't matter.
Going forward this table will expand; for example, you might want to introduce two more events, INTERVAL_OUT and INTERVAL_IN. The second format will keep that simple.
If possible, use event_type_id instead of event_type, with another table event_type or just a constant array, e.g.
array_event_name = array(1 => CHECKIN, 2 => CHECKOUT, 3 => INTERVAL_IN, 4 => INTERVAL_OUT)
I would go with the second one.
However, the main questions and business rules will be the same and answerable by either approach.
Option #1
With the first option, the database can better protect itself from some anomalies [1]. Some anomalies are still possible [2], but it's a start.
On the other hand, InnoDB tables are clustered, and secondary indexes in clustered tables can be expensive (see the "Disadvantages of clustering" in this article), which is something to consider if you need to query on check_out.
Option #2
With the second option, you are relying on imperative code even for anomalies that could be prevented purely declaratively by the database design.
On the plus side, you are less likely to need secondary indexes.
Choice
So in a nutshell, go with the first option, unless you need a secondary index. If you do need the secondary index, depending on what kind of index covering you wish to achieve, you might go with either option.
[1] Such as checking out without first checking in.
[2] Such as checking in again without first checking out, overlapping "stints", etc.
I would go with the first option here. Putting both timestamps in a single row will improve your search time and make your calculations easier.
Suppose you want to calculate the work hours of an employee for a day. Your search will stop at the first row it matches and you will have all the required data; you won't have to dig any deeper, which is not the case with option 2. Option 1 also reduces your table size by using only one row per check-in/check-out.
Option 2 does have one advantage, though: when checking out, the database has to search for the row to update under option 1, whereas under option 2 it's just a write.
Considering that you will search the data more than once, you can give up the direct-insert advantage to gain a better structure and faster searches. Although the final choice is up to you.
Good Luck!

MySQL query puzzle - finding what WOULD have been the most recent date

I've looked all over and haven't yet found an intelligent way to handle this, though I feel sure one is possible:
One table of historical data has quarterly information:
CREATE TABLE Quarterly (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
PRIMARY KEY (unique_ID));
Another table of historical data (which is very large) contains daily information:
CREATE TABLE Daily (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
qtr_ID INT UNSIGNED,
PRIMARY KEY (unique_ID));
The qtr_ID field is not part of the feed of daily data that populated the database; instead, I need to retroactively populate the qtr_ID field in the Daily table with the Quarterly.unique_ID row ID, using what would have been the most recent quarterly data as of that Daily.date_posted for that data source.
For example, if the quarterly data is
unique_ID  date_posted  datasource  data
101        2009-03-31   1           4.5
102        2009-06-30   1           4.4
103        2009-03-31   2           7.6
104        2009-06-30   2           7.7
105        2009-09-30   1           4.7
and the daily data is
unique_ID  date_posted  datasource  data  qtr_ID
1001       2009-07-14   1           3.5   ??
1002       2009-07-15   1           3.4   &&
1003       2009-07-14   2           2.3   ^^
then we would want the ?? qtr_ID field to be assigned '102' as the most recent quarter for that data source on that date, and && would also be '102', and ^^ would be '104'.
The challenges include that both tables (particularly the daily table) are very large, that they can't be normalized to get rid of the repetitive dates or otherwise optimized, and that for certain daily entries there is no preceding quarterly entry.
I have tried a variety of joins, using DATEDIFF (where the challenge is finding the minimum value of DATEDIFF greater than zero), and other attempts, but nothing is working for me; usually my syntax breaks somewhere. Any ideas welcome; I'll try any basic idea or concept and report back.
Just subquery for the quarter id using something like:
(
    SELECT unique_ID
    FROM Quarterly
    WHERE
        datasource = ?
        AND date_posted <= ?
    ORDER BY
        unique_ID DESC
    LIMIT 1
)
Of course, this probably won't give you the best performance, and it assumes that rows are added to Quarterly in date order (otherwise, order by date_posted DESC). However, it should solve your problem.
You would use this subquery in your INSERT or UPDATE statements as the value of the qtr_ID field for your Daily table.
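Embedded in the UPDATE, that would look roughly like this (a sketch):
UPDATE Daily d
SET qtr_ID =
    (
    SELECT q.unique_ID
    FROM Quarterly q
    WHERE q.datasource = d.datasource
      AND q.date_posted <= d.date_posted
    ORDER BY q.unique_ID DESC
    LIMIT 1
    );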
The following appears to work exactly as intended, but it surely is ugly (with three calls to the same DATEDIFF!); perhaps by seeing a working query someone will be able to further reduce or improve it:
UPDATE Daily SET qtr_ID =
    (SELECT unique_ID FROM Quarterly
     WHERE Quarterly.datasource = Daily.datasource AND
           DATEDIFF(Daily.date_posted, Quarterly.date_posted) =
               (SELECT MIN(DATEDIFF(Daily.date_posted, Quarterly.date_posted))
                FROM Quarterly
                WHERE Quarterly.datasource = Daily.datasource AND
                      DATEDIFF(Daily.date_posted, Quarterly.date_posted) > 0));
After more work on this query, I ended up with enormous performance improvements over the original concept. The most important improvement was to create indices in both the Daily and Quarterly tables: in Daily I created indices on (datasource, date_posted) and (date_posted, datasource) USING BTREE and on (datasource) USING HASH, and in Quarterly I did the same thing. This is overkill, but it made sure there was an option the query engine could use. That reduced the query time to less than 1% of what it had been. (!!)
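For reference, those indices look roughly like this (a sketch; note that InnoDB accepts USING HASH but silently builds a BTREE index anyway):
CREATE INDEX idx_daily_src_date ON Daily (datasource, date_posted) USING BTREE;
CREATE INDEX idx_daily_date_src ON Daily (date_posted, datasource) USING BTREE;
CREATE INDEX idx_daily_src      ON Daily (datasource) USING HASH;
-- plus the same three on Quarterly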
Then I learned that, given my particular circumstances, I could use MAX() instead of ORDER BY with LIMIT, so I use a call to MAX() to get the appropriate unique_ID. That reduced the query time by about 20%.
Finally, I learned that with the InnoDB storage engine I could segment the chunk of the Daily table that any one query was updating, which allowed me to multi-thread the queries with a little elbow grease and scripting. The parallel processing worked well, and every thread reduced the query time roughly linearly.
So, the basic query that is performing literally 1000 times better than my own first attempt is:
UPDATE Daily
SET qtr_ID =
    (
    SELECT MAX(unique_ID)
    FROM Quarterly
    WHERE Daily.datasource = Quarterly.datasource AND
          Daily.date_posted > Quarterly.date_posted
    )
WHERE unique_ID > ScriptVarLowerBound AND
      unique_ID <= ScriptVarHigherBound
;
;