Creating the right database structure from a manual tariff
I have been assigned a rather challenging database design task and thought someone might be able to give me a few pointers to get going. We currently have a warehouse goods-in and goods-out system, and we would now like to use the data to calculate storage charges.
The database already holds the following: Goods date in, Goods date out, Consignment weight, Number of pieces, Dimensions, Description of goods, Storage container type (if applicable). The data is held in MySQL, which may not be suitable for the tariff structure below.
Here is the charging structure for Bands 1 to 4. We have about 12 bands, depending on customer size and importance; all the other bands are derivatives of the following:
BAND 1
On arrival in our facility
€0.04 per kilo + €4.00 per consignment for general cargo
€0.07 per kilo for MAGAZINES – NO STORAGE CHARGE
STORAGE CHARGES AFTER 5 DAYS
€4.00 per intact pallet max size 120x100x160cm (Standard warehouse wooden pallet)
€6.50 per cubic metre on loose cargo or out of gauge cargo.
CARGO DELIVERED IN SPECIFIC CONTAINERS
20FT PALLET ONLY - €50.00
40FT PALLET ONLY - €20.00
BAND 2
€0.04 per kilo, no minimum charge
STORAGE CHARGES AFTER 6 DAYS
€2.50 per cubic metre
CONTAINERS
20FT PALLET ONLY - €50.00
40FT PALLET ONLY - €20.00
BAND 3
€0.03 per kilo + €3.00 per consignment up to 2000kg
€0.02 per kilo + €2.00 per consignment over 2000kg
STORAGE CHARGES AFTER 5 DAYS
€4.00 per pallet max size 120x100x160cm
€0.04 per kilo loose cargo
BAND 4
€5.00 per pallet
STORAGE CHARGES AFTER 4 DAYS
€5.00 per pallet max size 120x100x160cm
My thoughts so far are to collect the charging band on arrival of the freight, then try to fit the tariff into a table with some normalisation, such as container type.
Anyone had experience of this type of manual to system conversion?
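For reference, here is the rough shape of the table I have been sketching (purely illustrative; the column names are invented to show the idea, not a finished design):

    -- One row per band / cargo type / container type combination (hypothetical names)
    CREATE TABLE tariff (
        tariff_id               INT AUTO_INCREMENT PRIMARY KEY,
        band                    TINYINT NOT NULL,       -- 1..12
        cargo_type              VARCHAR(20) NOT NULL,   -- 'GENERAL', 'MAGAZINES', 'LOOSE', ...
        container_type          VARCHAR(20) NULL,       -- 'PALLET_20FT', 'PALLET_40FT', NULL if n/a
        arrival_per_kilo        DECIMAL(6,2) NULL,      -- e.g. 0.04
        arrival_per_consignment DECIMAL(6,2) NULL,      -- e.g. 4.00
        free_days               TINYINT NULL,           -- storage charged after this many days
        storage_per_pallet      DECIMAL(6,2) NULL,      -- e.g. 4.00
        storage_per_cbm         DECIMAL(6,2) NULL       -- e.g. 6.50
    );

Band 1 general cargo would then be one row (0.04, 4.00, 5 free days, 4.00 per pallet, 6.50 per cbm), magazines another row with the storage columns left NULL, and so on.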
The algorithm for computing the tariff is probably too messy to do in SQL, so let's approach your question from a different point of view.
1. Build the algorithm in your client language (Java/PHP/VB/...).
2. As you are doing step 1, think about what data is needed -- perhaps a 2-column array of "days" and "Euros"? Maybe something involving "kilos"? Maybe there are multiple patterns -- days and/or kilos?
3. Build the table or tables necessary to store those arrays.
4. Decide how to indicate that kilos is irrelevant -- perhaps by leaving out any rows in the kilo table? Or an extra column that gives size/weight?
My point is that the algorithm needs to drive the task; the database is merely a persistent store.
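To make step 3 concrete, here is one purely illustrative shape such a table could take, using the Band 1 and Band 2 storage figures from the question (table and column names are invented for the sketch):

    -- One row per band and storage charge; the client fetches a band's rows and applies them
    CREATE TABLE storage_step (
        band       TINYINT      NOT NULL,
        after_days TINYINT      NOT NULL,   -- the "days" part of the array
        euros      DECIMAL(6,2) NOT NULL,   -- the "Euros" part of the array
        unit       VARCHAR(10)  NOT NULL    -- 'PALLET' or 'CBM'
    );

    INSERT INTO storage_step (band, after_days, euros, unit) VALUES
        (1, 5, 4.00, 'PALLET'),
        (1, 5, 6.50, 'CBM'),
        (2, 6, 2.50, 'CBM');

    -- In the client algorithm (step 1):
    SELECT after_days, euros, unit FROM storage_step WHERE band = 1;

A similar table (or extra columns) would carry the per-kilo and per-consignment arrival charges.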
Here's another approach. Instead of having columns for days, kilos, etc., just have a JSON string of whatever factors are needed. In the client code, decode the JSON, then have suitable IF statements to act on kilos if present, ELSE ..., and so on.
Again, the database is just a persistent store; the client is driving the format by what is convenient to it.
In either implementation, there would be a column for the type of item involved, plus the Band. The SELECT would have ORDER BY Band. Note that there is no hard-coded notion of 12 bands; any number could be implemented.
Performance? Fetching all ~12 rows and stepping through them -- this should not be a performance problem. If you had a thousand bands, you might notice a slight delay.
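A minimal sketch of the JSON variant, assuming MySQL 5.7+ and its JSON column type (names invented):

    CREATE TABLE band_tariff (
        band      TINYINT     NOT NULL,
        item_type VARCHAR(20) NOT NULL,   -- 'GENERAL', 'MAGAZINES', 'PALLET_20FT', ...
        factors   JSON        NOT NULL,   -- only the factors this band/item needs
        PRIMARY KEY (band, item_type)
    );

    INSERT INTO band_tariff VALUES
        (1, 'GENERAL',   '{"per_kilo": 0.04, "per_consignment": 4.00, "free_days": 5}'),
        (1, 'MAGAZINES', '{"per_kilo": 0.07}');

    -- The client fetches the rows, decodes `factors`, and acts only on the keys present
    SELECT band, item_type, factors FROM band_tariff ORDER BY band;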
I have a project where customers buy a product with platform-based tokens. I have one MySQL table that tracks a customer buying x amount and one tracking customer consumption (-x amount). In order to display the amount of tokens they have left on the platform, and to check funds left before spending, I wanted to query (buys - consumed). But I remembered that people always say space is cheaper than computation (not just money, but query time as well). Should I have a separate table for querying the amount, which gets updated with each buy or consume?
So far I have always tried to use as few tables as possible, to keep things simple and easy to oversee, but I am starting to question whether that is right...
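For context, the on-the-fly calculation I had in mind is just something like this (table and column names simplified):

    SELECT (SELECT COALESCE(SUM(amount), 0) FROM token_buys     WHERE customer_id = 42)
         - (SELECT COALESCE(SUM(amount), 0) FROM token_consumes WHERE customer_id = 42)
           AS tokens_left;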
There is no single right answer; keep in mind the goal of the application and the changes to the software that are likely to happen.
If these two tables hold every transaction a user may make, then the extra column becomes more tempting, because you have to sum over many rows. If there is one row per user (likely your case), then 90% of the time you should use just those two tables.
I would suggest you not have that extra column. In my experience, that kind of setup has the downside that, as the project grows, it becomes harder for you and the other developers to remember to update the extra column, because it is a derived (dependent) value.
Also, whenever the user buys or consumes tokens, you have to update that column too, which costs extra time and effort.
You can store (buys - consumed) in the session and update it when needed (if real-time updates are not necessary and there are no multiple devices).
If you do need continuous updates, i.e. many queries over time, then the cost of recomputing outweighs the cost of the extra storage, and you should have that third table/column.
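To illustrate the maintenance cost being described, the derived-balance option would look roughly like this (hypothetical names): every buy or consume now has to touch two tables in one transaction, which is exactly the extra work developers can forget.

    CREATE TABLE token_balance (
        customer_id INT PRIMARY KEY,
        balance     DECIMAL(12,2) NOT NULL DEFAULT 0
    );

    START TRANSACTION;
    INSERT INTO token_buys (customer_id, amount) VALUES (42, 100);
    UPDATE token_balance SET balance = balance + 100 WHERE customer_id = 42;
    COMMIT;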
An organization is interested in modelling the sales rate of cases of product sold each week. The product is a luxury item, so distributions of sales tend to be small and right-skewed. A typical month (4 weeks) of sales might look like {0, 1, 1, 4}.
While we were originally developing the analysis, this seemed like an obvious application of a GLM -- specifically, Poisson regression to estimate the mean sales rate.
However, the sales team has recently come back and mentioned that they actually sell the product in many smaller sizes, such as 750-mL bottles and 187-mL samples. I could easily convert the sales into equivalent units (a case contains 9 L of product), but this would result in some non-integer sales figures. If the previous 4-week sales distribution had all been 750-mL bottles, for example, the new distribution would look like {0, 0.0833, 0.0833, 0.333}.
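(To spell out the conversion: one 750-mL bottle is $0.75/9 \approx 0.0833$ cases, so a week with 4 bottles becomes $4 \times 0.75/9 \approx 0.333$ cases.)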
I would like to be able to model the sales rate in terms of a common unit (a case, or 9 L), and I thought I could use an offset term to do this, but I've run into difficulties whenever there are zero products sold (the offset term is also zero).
My understanding is that the non-integer values preclude the direct use of a Poisson likelihood for these data (without some sort of transformation). I could simply try a normal linear model, but the sales data are still discrete (e.g., they can only take a handful of values determined by the volume of product and the number of units sold). I still feel that a discrete model would be more appropriate, but I am stumped as to how to account for the different "sizes" of product appearing in the data without simply running a separate model for each product.
Have you ever handled data like these in a similar fashion, and how did you make this accommodation?
I have a customer dimension table, and the location of a customer can change.
The customerid filters the sales fact table.
I have 2 options:
1. A slowly changing dimension (type 2), holding a new record each time a customer's location changes.
2. Storing the location, at the time of data load, in the sales fact table.
Both ways allow me to see sales by location (although it is a customer location, the ETL will place it on the fact table).
The latter option saves me from implementing SCD on the dimension table.
What factors should decide which of the two approaches is suitable?
Your fact table should contain the things that we measure, count, or total. Your dimensions should be descriptive elements that allow users to slice their data along an axis - basically, they answer the "by" part of a request:
"I want to see total sales by year and month across this customer-based regional hierarchy."
Don't take my word for it; grab a data warehousing book or go read the freely available information from the Kimball Group.
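For instance, that request maps naturally onto a star join over a narrow fact table (schema names here are illustrative):

    SELECT d.calendar_year,
           d.calendar_month,
           r.region_name,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales   AS f
    JOIN   dim_date     AS d ON d.date_key     = f.date_key
    JOIN   dim_customer AS c ON c.customer_key = f.customer_key
    JOIN   dim_region   AS r ON r.region_key   = c.region_key
    GROUP BY d.calendar_year, d.calendar_month, r.region_name;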
Storing the customer data on the fact is a bad idea regardless of your database engine. To satisfy a query like the above, the storage engine needs to read in the entirety of your fact table and the supporting dimensions. It could read (Date, RegionId, CustomerId, SalesAmount), which likely costs something like 16 bytes per row times however many rows you have. Or it can read (Date, RegionId, CustomerName, CustomerAddress, CustomerCity, CustomerState, CustomerPostalCode, SalesAmount) at a cost of, what, 70 bytes per row? That inflation costs you when you:
store your data (disk is cheap, but that's not the point)
read your data (basic physics: the more data you wrote to disk, the longer it takes to read it back out)
need memory for other queries (you're in a multi-user/multi-query environment; when you hog resources, there's less for everyone else)
write data (ETL processing will take longer because you have to write more pages to disk than you should have)
want to optimize (what if the business just wants to see "Total Sales by Year and Month", with no customer hierarchy? The database engine still has to read all the pages with all that useless customer data just to get at the things the user actually wanted)
Finally, the most important takeaway from The Data Warehouse Toolkit is on about page 1: the biggest reason Data Warehouse projects fail is that IT drives the requirements, and it sounds like you're thinking of doing exactly that to avoid creating an SCD type 2 dimension. If the business problem you're attempting to solve is that they need to see sales data associated with the customer data as it was at the point in time the sale happened, then you have a Type 2 customer dimension.
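A minimal sketch of what that Type 2 customer dimension could look like (column names are illustrative, MySQL-style DDL):

    CREATE TABLE dim_customer (
        customer_key  INT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key stored on each fact row
        customer_id   INT NOT NULL,                     -- natural/business key
        customer_name VARCHAR(100) NOT NULL,
        city          VARCHAR(50),
        region        VARCHAR(50),
        valid_from    DATE NOT NULL,
        valid_to      DATE NOT NULL DEFAULT '9999-12-31',
        is_current    BOOLEAN NOT NULL DEFAULT TRUE
    );
    -- When a customer moves: close off the current row (set valid_to and is_current = FALSE)
    -- and insert a new row; new fact rows then reference the new surrogate key.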
Yes, technologies like columnstore compression can reduce the amount of storage required, but it's not free, because now you're adding workload to the CPU. Maybe you have that capacity to spare, maybe you don't. Or you model it correctly, apply the compression as well, and still come out ahead with a proper dimensional model.
How you model location depends on what it relates to. If it is an attribute of a sale, then it belongs in its own dimension related to the sale. If it is an attribute of a customer (such as their home address), then it belongs in the customer dimension. If the location is an attribute of both a sale and a customer, then it belongs in both.
I am trying to make a stock market simulator and I want it to be as realistic as possible.
My question is: Nasdaq has 3000+ companies in its database of stocks, right? But is there one row for every share of every symbol in the SQL database, like the following example?
Company Microsoft = MSFT
db `companies_shares`
ID symbol price owner_id* company_id last_trade_datetime
1 msft 58.99 54334 101 2019-06-15 13:09:32
2 msft 58.99 54334 101 2019-06-15 13:09:32
3 msft 58.91 2231 101 2019-06-15 13:32:32
4 msft 58.91 544 101 2019-06-15 13:32:32
*owner_id = user id of the person that last bought the share.
Or is it calculated based on the shares available to trade and the demand to buy and sell provided by the market maker?
I've already tried the first approach, but it takes a lot of space in my DB and I'm concerned about the bandwidth of all those trades, especially when millions of requests (trades) are being made every minute.
What is the best solution? Database or math?
Thanks in advance.
You might want to Google many-to-many relationships.
Think about it this way: one person might own many stocks, and one stock might be held by many people. That is a many-to-many relationship, usually modelled using three tables in a physical database. This is often written as M:M.
Also, people might buy or sell a single company on multiple occasions; this would likely be modelled using another table. From the person's perspective there will be many transactions, so we have a new type of relationship: one (person) to many (transactions). This is often written as 1:M.
As to what to store, as a general rule it is best to store the atomic pieces of data. For example, for a transaction, store at the very least the customer ID, the transaction date/time, the quantity bought or sold, and the price.
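As a rough sketch of those rules (names are illustrative, MySQL-style DDL), the three-table M:M plus a transactions table might look like:

    CREATE TABLE person (
        person_id INT AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(100) NOT NULL
    );

    CREATE TABLE company (
        company_id INT AUTO_INCREMENT PRIMARY KEY,
        symbol     VARCHAR(10) NOT NULL UNIQUE   -- e.g. 'MSFT'
    );

    -- Resolves the M:M: one row per person per company they currently hold
    CREATE TABLE holding (
        person_id  INT NOT NULL,
        company_id INT NOT NULL,
        quantity   INT NOT NULL,
        PRIMARY KEY (person_id, company_id),
        FOREIGN KEY (person_id)  REFERENCES person (person_id),
        FOREIGN KEY (company_id) REFERENCES company (company_id)
    );

    -- 1:M from person to trades: store the atomic facts of each trade
    CREATE TABLE trade (
        trade_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        person_id  INT NOT NULL,
        company_id INT NOT NULL,
        traded_at  DATETIME NOT NULL,
        quantity   INT NOT NULL,            -- positive for a buy, negative for a sell
        price      DECIMAL(10,2) NOT NULL,
        FOREIGN KEY (person_id)  REFERENCES person (person_id),
        FOREIGN KEY (company_id) REFERENCES company (company_id)
    );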
You might also want to read up on normalization. Usually third normal form is a good level to strive for, but a lot of this is "it depends upon your circumstances and what you need to do". Often people will denormalize for speed of access at the expense of more storage and potentially more complicated updating.
You also mentioned performance. More often than not, big organisations such as Nasdaq will use multiple layers of IT infrastructure. Each layer will have a different role and thus different functional and performance characteristics, and often there will be multiple servers operating together in a cluster. For example, they might use a NoSQL system to manage the high volume of trading, and from there a feed (e.g. Kafka) into other systems for other purposes (e.g. fraud prevention, analytics, reporting and so on).
You also mention data volumes. I do not know how much data you are talking about, but one financial customer I've worked with had several petabytes of storage (1 petabyte = 1000 TB) running on over 300 servers just for their analytics platform, and they were probably only medium to large as financial institutions go.
I hope this helps point you in the right direction.
What I have is about 130 GB of time-varying state data for several thousand financial instruments' orderbooks.
The CSV files I have contain one row per change in the orderbook state (due to an executed trade, an inserted order, etc.). The state is described as: a few fields of general orderbook information (e.g. the ISIN code of the instrument), a few fields of information about the state change (such as orderType and time), and finally the buy and sell levels of the current state. There are up to 20 levels of both buy and sell orders (buy level 1 representing the best buy price, sell level 1 the best sell price, and so on), and each level consists of 3 fields (price, aggregated volume and order amount). Finally there are an additional 3 fields of aggregated data for the levels beyond 20, on both the buy and sell side. This amounts to a maximum of 21*2*3 = 126 fields of level data per state.
The problem is that since there are rarely anywhere near 20 levels, it doesn't seem to make sense to reserve fields for each of them. E.g. I'd have rows with only 3 buy levels and the rest of the fields empty, while the same orderbook can have 7 buy levels a few moments later.
I will definitely normalize the general orderbook info into its own table, but I don't know how to handle the levels efficiently.
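One way I have considered handling the levels is a child table with one row per populated level, roughly like this (names made up, MySQL-style DDL), but I am unsure whether that is sensible at this data volume:

    -- One row per orderbook state change (general info + change info)
    CREATE TABLE book_state (
        state_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        isin       CHAR(12)    NOT NULL,
        event_time DATETIME(6) NOT NULL,
        order_type VARCHAR(20) NOT NULL
    );

    -- Only the levels that actually exist for that state
    CREATE TABLE book_level (
        state_id     BIGINT        NOT NULL,
        side         CHAR(1)       NOT NULL,   -- 'B' or 'S'
        level_no     TINYINT       NOT NULL,   -- 1..20, or 21 for the beyond-20 aggregate
        price        DECIMAL(18,6) NOT NULL,
        volume       DECIMAL(18,2) NOT NULL,
        order_amount INT           NOT NULL,
        PRIMARY KEY (state_id, side, level_no),
        FOREIGN KEY (state_id) REFERENCES book_state (state_id)
    );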
Any help would be much appreciated.
I have had to deal with exactly this structure of data at one point in time. One important question is how the data will be used. If you are only looking for the best bid and ask price at any given time, then the levels do not make much of a difference. If you are analyzing market depth, then the levels can be important.
For the volume of data you are using, other considerations such as indexing and partitioning may be more important. If the data you need for a particular query fits into memory, then it doesn't matter how large the overall table is.
My advice is to keep the different levels in the same record. Then, you can use page compression (depending on your storage engine) to eliminate most of the space reserved for the empty values. SQL Server does this automatically, so it was a no-brainer to put the levels in a single record.
A compromise solution, if page compression does not work, is to store a fixed number of levels. Five levels would typically be populated, so you wouldn't have the problem of wasted space on empty fields. And, that number of levels may be sufficient for almost all usage.
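If it helps to picture that compromise, a fixed-level wide record might look roughly like this (illustrative names; only levels 1 and 2 shown, with the pattern repeating up to however many levels you keep):

    CREATE TABLE book_state_wide (
        state_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
        isin        CHAR(12)    NOT NULL,
        event_time  DATETIME(6) NOT NULL,
        order_type  VARCHAR(20) NOT NULL,
        buy1_price  DECIMAL(18,6) NULL, buy1_volume  DECIMAL(18,2) NULL, buy1_orders  INT NULL,
        buy2_price  DECIMAL(18,6) NULL, buy2_volume  DECIMAL(18,2) NULL, buy2_orders  INT NULL,
        sell1_price DECIMAL(18,6) NULL, sell1_volume DECIMAL(18,2) NULL, sell1_orders INT NULL,
        sell2_price DECIMAL(18,6) NULL, sell2_volume DECIMAL(18,2) NULL, sell2_orders INT NULL
        -- ... and so on for the remaining levels; unused levels simply stay NULL
    );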