I am fairly new to MySQL. I am working on a project where I want to run evaluations of the economic efficiency of certain goods in Python. For my first attempts, I kept my data in a CSV file. I now want to move on to a solid database solution and decided to use MySQL.
The data consists, for example, of the following columns:
id, GoodName, GoodWeightRaw, GoodWeightProcessed, RawMaterial, SellingPrice.
The "problem" is that there are simple goods and combined goods.
Say:
id 1 - copper
id 2 - plastics
Somewhere further down we might have
id 50 - copper cable
Copper cables are made from copper and plastics, so the RawMaterial of id 50 would be the goods id 1 and id 2. Also, the raw weight of the copper cables would be the processed weight of copper plus the processed weight of plastics.
In the CSV file I currently use, those values are all "hard coded"; if some values of the basic materials change, I have to look up which combined goods they are used in and change the values by hand accordingly.
I wonder whether there is a way in SQL to automatically compute values in a row from values in another row, and to have them updated every time the other rows change.
So far I have tried the following:
First I thought I might create two tables for basic and combined goods, but the combined goods would still not update themselves.
I found out that I can create table rows from SELECT statements and build combined goods from a combination of basic goods this way. However, those rows are also "permanent" once created and would still have to be updated manually.
So is there a clean best practice in SQL for rows that are derived from other rows and updated automatically when the underlying rows change?
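As a minimal sketch of the kind of derived rows this is asking about (the composition table and view below are hypothetical, not part of the original schema): the components of a combined good can be listed in a separate table, and a view can then compute the derived values at query time, so they always reflect the current basic goods.

-- Hypothetical sketch: a composition table plus a view that derives
-- the raw weight of a combined good from its components' processed weights.
CREATE TABLE goods (
    id                  INT PRIMARY KEY,
    GoodName            VARCHAR(100),
    GoodWeightRaw       DECIMAL(10,3),
    GoodWeightProcessed DECIMAL(10,3),
    SellingPrice        DECIMAL(10,2)
);

CREATE TABLE good_components (
    combined_id  INT NOT NULL,   -- e.g. 50 (copper cable)
    component_id INT NOT NULL,   -- e.g. 1 (copper), 2 (plastics)
    PRIMARY KEY (combined_id, component_id),
    FOREIGN KEY (combined_id)  REFERENCES goods (id),
    FOREIGN KEY (component_id) REFERENCES goods (id)
);

-- The view is recomputed on every read, so a change to copper or
-- plastics shows up immediately in the combined goods.
CREATE VIEW combined_goods AS
SELECT gc.combined_id             AS id,
       g.GoodName                 AS GoodName,
       SUM(c.GoodWeightProcessed) AS GoodWeightRaw
FROM   good_components gc
JOIN   goods g ON g.id = gc.combined_id
JOIN   goods c ON c.id = gc.component_id
GROUP BY gc.combined_id, g.GoodName;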
Related
We're building an e-commerce system and we need some help deciding on the best way to determine how much stock is available per product.
Say we have the tables "products", "products_in", and "products_out". "products_in" records all our transactions that increase the quantities of our products (e.g. when we buy the products from our wholesale suppliers). While "products_out" records all our transactions that decrease the quantities of our products (e.g. when our customers buy the products).
In our apps, retrieving the quantities available for our products is more common than writing/updating records in the "products_in" and "products_out" tables. Given this, will the use of a MySQL view that depends on "products_in" and "products_out" and computes the available stock be more efficient than computing it on the fly every time we query it? Will the value on the view be recomputed every time there's a new record in "products_in" or "products_out"? Or will the view recompute the value every time we query it (which can be quite expensive in our case)?
Let's think of the db steps in each case:
Case 1: You compute available_stock every time a product comes in or goes out and store it in, say, the product table.
If a product comes in, an INSERT runs on the product_in table; if a product goes out, an INSERT runs on the product_out table.
In either case, an UPDATE runs on the available_stock column of product. (If 10 products come in or go out, 10 individual queries will be fired.) Expensive?
Case 2: You compute available_stock in the view every time and do not store it in the database.
Fetch records from the product_in and product_out tables (only for the few products for which you want available_stock), do some math, and display the computed stock. Expensive?
I personally would go with case 2, because it involves fewer db transactions overall than case 1, which involves a lot of transactions just to keep the stock in sync.
Footnote: As an aside, I'd definitely say that if you are a hardcore 'object-oriented programmer', then your db mappings violate the fundamentals. products_in and products_out are the same kind of entity (objects) recording inventory/stock transactions (just as Father and Mother entities are both Persons), so you should fold them into one general table, ProductInOutData.
In ProductInOutData you can then add an enum column holding either an in or an out value. Having both in and out records in one table will not only improve readability and accessibility but will also make it easy to calculate the products coming in or going out, making case 2 even more lightweight.
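A minimal sketch of that layout, with illustrative names (not from the original schema): one transaction table with an in/out enum, and a view that derives the available stock. Note that an ordinary MySQL view is not materialized, so the underlying aggregation runs each time the view is queried.

CREATE TABLE ProductInOutData (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    product_id INT NOT NULL,
    direction  ENUM('in', 'out') NOT NULL,
    quantity   INT NOT NULL,
    created_at DATETIME NOT NULL,
    INDEX (product_id)
);

CREATE VIEW available_stock AS
SELECT product_id,
       SUM(CASE WHEN direction = 'in' THEN quantity ELSE -quantity END) AS stock
FROM   ProductInOutData
GROUP BY product_id;

-- Example lookup for a single product:
SELECT stock FROM available_stock WHERE product_id = 42;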
I've been thinking about this for a couple of days, but I feel that I'm lacking the right words to ask Google the questions I need answered. That's why I'd really appreciate any kind of help, hints, or guidance.
First of all, I have almost no experience with databases (apart from misusing Excel as one), and, unfortunately, I have all my data in very impractical and huge .csv files.
What I have:
I have time series data (in 15-minute steps) for several hundred sensors (SP) over the course of several years (a couple of million rows in total) in Table 1. There is also some weather condition data (WCD) that applies to all of my sensors and is therefore stored in the same table.
Note that each sensor delivers two data points per measurement.
Table1 (Sensors as Columns)
Now I also have another table (Table 2) that lists several static properties that define each sensor in Table 1.
Table 2 (Sensors as Rows)
My main question concerns database design and general implementation (MySQL or MS Access): is it really necessary to have hundreds of columns (two for each sensor) in Table 1? I wish I could store the "link" to the respective time series data simply as two additional columns in Table 2.
Is that feasible? Does that even make sense? How would I set up this database automatically (coming from .csv files with a different structure), since I can't type in every column by hand for hundreds of sensors and their attached time series?
In the end, I want to be able to query/sort my data (see below) by timeframe, date, and sensor properties.
The reason for all of this is the following:
I want to create a third table (Table3) which “stores” dynamic values. These values are results of calculations based on the sensor-measurements and WCD in Table 1. However, depending on the sensor-properties in Table2, the sensors and their respective time series data that serve as input for the calculations of Table3 might differ from set to set.
That way I want to obtain, e.g., Set 1: "a portfolio of sensors with location A for each month between January 2010 and November 2011" and store it somewhere. Then I want to do the same for Set 2, e.g. "a portfolio of sensors with location B for the same time frame". Finally, I will compare these different portfolios and conduct further analysis on them. Does that sound reasonable at all?
So far, I'm not even sure whether I should actually store the results of each Table 3 calculation in the database, or whether I should just output the query results and feed them directly into my analysis tool. What makes more sense?
A more useful structure for your sensor and WCD data would be:
Table SD - Sensor Data
Columns:
Datetime
Sensor
A_value
B_value
With this structure you do not need to store a link to the time series data in Table 2--the Sensor value is the common data that links the tables.
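A rough sketch of that long-format table, plus a query that joins it back to the static sensor properties (names are illustrative, and Table2 is assumed to have a matching Sensor column):

CREATE TABLE SD (
    `Datetime` DATETIME    NOT NULL,
    Sensor     VARCHAR(32) NOT NULL,
    A_value    DOUBLE,
    B_value    DOUBLE,
    PRIMARY KEY (Sensor, `Datetime`)
);

-- All measurements in a timeframe for sensors with a given static property:
SELECT sd.`Datetime`, sd.Sensor, sd.A_value, sd.B_value
FROM   SD sd
JOIN   Table2 t2 ON t2.Sensor = sd.Sensor
WHERE  t2.Location = 'A'
  AND  sd.`Datetime` BETWEEN '2010-01-01' AND '2011-11-30 23:59:59';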
If your weather condition data all have the same types of values and/or attributes, then you should normalize them similarly:
Table WCD - Weather Conditions Data, Normalized
Columns:
Datetime
Weather_condition
Weather_condition_value
From your example, it looks like different weather conditions may have different attributes (or different data types of attributes), in which case the form in which you have the WCD in your Table 1 may be most appropriate.
Storing the results of your calculations in another table sounds like a reasonable thing to do if at least some of your further analysis could be, or will be, done using SQL.
Using a basic star schema, I have been told that a fact table would have at least as many rows as the product of the number of rows in each dimension.
For example, 3 products, 5 promotions, and 10 stores would mean that the fact table should have at least 150 records, regardless of whether or not a product actually had every promotion or exists in every store. Specifically, null values would exist where, for example, a product does not have a specific promotion, etc.
Can someone please provide an academic source that supports this, or at the very least just confirm the idea?
The reason why I am asking this is that my understanding tells me this would create a MASSIVE amount of useless data in the fact table.
Thanks!
Hi thanks for the replies. I consulted my lecturer and he actually found a page reference for me: "...Take a very simplistic example of 3 products, 5 customers, 30 days, and 10 sales representatives represented as row in the dimension tables. Even in this example, the number of fact table rows will be 4500, very large in comparison with the dimension table rows..." (Ponniah, P., 2009. Data warehousing: Fundamentals for IT professionals, 2nd Edition. John Wiley & Sons, Inc., New Jersey. p. 237)
However, the author goes on to say that: "We have said that a single row in the fact table relates to a particular product, a specific calendar date, a specific customer, and an individual sales representative. In other words, for a particular product, a specific calendar date, a specific customer, and an individual sales representative, there is a corresponding row in the fact table. What happens when the date represents a closed holiday and no orders are received and processed? The fact table rows for such dates will not have values for the measures. Also there could be other combinations of dimension table attributes, values for which the fact table rows will have null measures. Do we need to keep such rows with nulls measures in the fact table? There is no need for this. Therefore it is important to realize this type of sparse data and understand that the fact table could have gaps."
In short, you guys seem to be correct, thanks!
Of course not. I suggest you ask your source to clarify this claim; it sounds as if there is a misunderstanding somewhere here.
And what if you add a time dimension..?
Also, it is not even possible to have null values as keys where, e.g., promotions are missing, because the purpose of the key is to point to a dimensional value, which a null value does not do.
The dimension values are there to support whatever facts you have, not the other way around.
This may relate to a specific kind of fact table: the pattern that Ralph Kimball terms a Periodic Snapshot Fact Table. That is where the fact table repeats an entire population of rows for each point in time. IMO the usefulness of that approach is extremely limited.
A Snapshot Fact Table does not in itself require that the fact table be the product of its dimensions, but it does pose the potential problem of what the correct population of each snapshot should be. The cross product of dimensions is one way to do it, I suppose.
We have a database table that has way too many rows. To speed up performance, we are trying to create a summary table. This works great for one-to-one relationships. For example, let's say furniture has a type and a manufacturer_id; you could have a table with both of these columns and a counts column. It would then be easy to query that table and very quickly get the number of pieces of furniture of a given type.
But what if there is a many-to-many relationship? Each piece of furniture can also have one or many colors and one or many distributors. What happens then? Is there any way to summarize this data so I can quickly find how many pieces of furniture are green? Or how many are blue and yellow?
Obviously this is just a made-up example. But given a huge database table with millions and millions of rows, how can I create a summary table to quickly look up aggregate information?
Assuming you know what you are doing and know this is a real bottleneck: do you have measurements of the performance now? Do you know where it starts taking time?
You will have to query the database anyway to get that count, so you can store it in separate tables such as a color count and a distributor count. Another solution is to cache the results of these queries in a caching system, for example if you already have memcached or some other tool in use.
The simplest option, when all you have is the database, is to create a table:
Table: color_count
color_id
amount
Querying that table is very simple; you can index it well and no joins are needed.
Updating can be done with triggers, with a cron job, or at the moment you update the many-to-many table, depending on your needs and capacity. Take into consideration that updating the records also takes time, so use this to optimize reads, which is what I read in your question.
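As an illustration of the trigger option, assuming a furniture_color link table with a color_id column and the color_count table above (both names are illustrative): increment the counter whenever a row is added to the link table; a matching AFTER DELETE trigger would decrement it.

CREATE TABLE color_count (
    color_id INT PRIMARY KEY,
    amount   INT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER furniture_color_after_insert
AFTER INSERT ON furniture_color    -- assumed many-to-many link table
FOR EACH ROW
BEGIN
    INSERT INTO color_count (color_id, amount)
    VALUES (NEW.color_id, 1)
    ON DUPLICATE KEY UPDATE amount = amount + 1;
END//
DELIMITER ;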
Multiple tables should keep the size down... and a good database system should keep the performance up.
In my opinion, keeping a separate 'summary table' creates a lot of overhead and maintenance problems and is only really useful if the same summary information is desired over and over (i.e., how many pieces of furniture are green, without also storing how many are blue, how many are yellow, how many are blue and yellow, etc.).
What I would do is:
Table 1: furnitures
Column 1: uniqueID
Column 2: name
Table 2: distributors
Column 1: uniqueID
Column 2: name
Table 3: colors
Column 1: uniqueID
Column 2: name
Table 4: furniture-distributor
Column 1: furnitureUniqueIDvalue
Column 2: distributorUniqueIDvalue
Table 5: furniture-color
Column 1: furnitureUniqueIDvalue
Column 2: colorUniqueIDvalue
How many pieces of furniture are green:
SELECT COUNT(*) FROM `furniture-color` WHERE colorUniqueIDvalue = 'green ID';
How many pieces of furniture are both blue and yellow:
SELECT COUNT(*)
FROM `furniture-color` AS t1
INNER JOIN `furniture-color` AS t2
    ON t1.furnitureUniqueIDvalue = t2.furnitureUniqueIDvalue
WHERE t1.colorUniqueIDvalue = 'blue ID'
  AND t2.colorUniqueIDvalue = 'yellow ID';
Getting lists of distributors of blue and yellow furniture, or furniture from a particular distributor that is either green or red, or most anything else is possible with the right SQL statement (left as an exercise for the reader).
You need to distinguish between counting different types of furniture (distinct furniture id) and counting actual pieces of furniture.
If you have a distributor-color table, then you can count actual pieces of furniture. However, you cannot count different types of furniture. This is the difference between additive facts and non-additive facts, in the terminology of OLAP. If you are interested in this subject, check out Ralph Kimball and his classic book "The Data Warehouse Toolkit".
To count furniture types, you need to include that in your table. So, you need a distributor-color-furniture table. Now to get the total for a distributor, you can use:
select distributor, count(distinct furnitureid)
from dcf
group by distributor
And similarly for color.
It seems that you want to translate your original data into a fact table, for ease of reporting. This is a very good and standard idea for developing data marts. Your data mart could have two fact tables: one for each type of furniture (so you can handle the manufacturing questions easily) and another for distributor-color-furniture (for harder questions).
Some databases, such as Oracle and SQL Server, have support for these types of data structures. What you are talking about is more like a new "system", rather than just a new "table". You need to think about the dimensions for the fact table, the updates, and the types of reports that you need.
There will be 2^n possible rows in the color summary table, where n is the number of colors. If you reduce the colors to a bitmap and assign each color a bit position (red=0, orange=1, yellow=2, green=3, etc.), then your color summary table could be:
Color Count
0x0001 256
0x0002 345
0x0003 23839
etc.
Here, 256 pieces only have red, 345 only have orange, and 23,839 have red and orange. Getting a count of how many have red but could have other colors would require summing the rows with bit position 0 set. Alternatively, a separate summary table could be set up with only n entries, one per color, to avoid summing over the rows.
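A sketch of that lookup, assuming an illustrative color_summary table with the bitmap stored in a color_mask column and the row count in cnt:

-- How many pieces of furniture include red (bit 0), whatever other colors they have:
SELECT SUM(cnt) AS red_total
FROM   color_summary
WHERE  (color_mask & 1) <> 0;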
If you want the summary table to manage both distributor and color then I think it would have 2^n * 2^m rows (where 'm' is the number of distributors) to have all the combinations of multiple distributors for multiple pieces of furniture each possibly with multiple colors.
I'm designing a statistics tracking system for a sales organization that manages 300+ remote sales locations around the world. The system receives daily reports on sales figures (raw dollar values, and info-stats such as how many of X item were sold, etc.).
I'm using MAMP to build the system.
I'm planning on storing these figures in one big MySQL table, so each row is one day's statistics from one location. Here is a sample:
------------------------------------------------------------------
| LocationID | Date | Sales$ | Item1Sold | Item2Sold | Item3Sold |
------------------------------------------------------------------
| Hawaii | 3/4 | 100 | 2 | 3 | 4 |
| Turkey | 3/4 | 200 | 1 | 5 | 9 |
------------------------------------------------------------------
Because the organization will potentially receive a statistics update from each of its 300 locations on a daily basis, I estimate that within a month the table will have 9,000 records and within a year around 108,000. Partitioning the MySQL table by year should therefore keep queries within a roughly 100,000-record range, which I think will allow steady performance over time.
(If anyone sees a problem with the theories in my above 'background data', feel free to mention them as I have no experience with building a large-scale database and this was simply what I have gathered with searching around the net.)
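For reference, a minimal sketch of the year-partitioned table described above (column names loosely follow the sample table; the types and partition boundaries are assumptions):

CREATE TABLE daily_sales (
    LocationID VARCHAR(32)   NOT NULL,
    `Date`     DATE          NOT NULL,
    Sales      DECIMAL(12,2) NOT NULL,
    Item1Sold  INT,
    Item2Sold  INT,
    Item3Sold  INT,
    PRIMARY KEY (LocationID, `Date`)
)
PARTITION BY RANGE (YEAR(`Date`)) (
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);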
Now, the front end of this system is web-based, with a primary focus on PHP. I plan on using the YUI framework I found online to display graph information.
What the organization needs to see is daily/weekly graphs of the sales figures of their remote locations, and whatever 'breakdown' statistics such as items sold (so you can "drill down" into a monetary graph and see what percentage of that income came from item X).
So if I have the statistics by LocationID, it's a fairly simple matter to organize this information by continent. If the system needs to display a graph of the sales figures for all locations in Europe, I can do a Query that JOINs a Dimension Table for the LocationID that gives its "continent" category and thereby sum (by date) all of those figures and display them on the graph. Or, to display weekly information, sum all of the daily reports in a given week and return them to my JS graph object as a JSON array, voila. Pretty simple stuff as far as I can see.
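A sketch of that continent roll-up, reusing the hypothetical daily_sales table from above and assuming a locations dimension table that maps each LocationID to a Continent:

SELECT d.`Date`, SUM(d.Sales) AS total_sales
FROM   daily_sales d
JOIN   locations  l ON l.LocationID = d.LocationID
WHERE  l.Continent = 'Europe'
GROUP BY d.`Date`
ORDER BY d.`Date`;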
Now, my thought was to create "summary" tables for these common queries. When the user wants to pull up the last 3 months of sales for Africa, and the query has to go all the way down to the daily level with various WHERE and JOIN clauses, sum up the appropriate LocationIDs' figures on a weekly basis, and then display them to the user... well, it just seemed more efficient to have a less granular table. Such a table would need to be updated automatically as new daily reports arrive in the main table.
Here's the sort of hierarchy of data that would then need to exist:
1) Daily Figures by Location
2) Daily Figures by Continent based on Daily Figures by Location
3) Daily Figures for Planet based on Daily Figures by Continent
4) Weekly Figures by Location based on Daily Figures by Location
5) Weekly Figures By Continent based on Weekly Figures by Location
6) Weekly Figures for Planet based on Weekly Figures by Continent
So we have a kind of tree here, with the most granular information at the bottom (in one table, admittedly) and a series of less and less granular tables so that it is easier to fetch the data for long-term queries (partitioning the Daily Figures table by year will be useless if it receives queries for 3 years of weekly figures for the planet).
Now, first question: is this necessary at all? Is there a better way to achieve broad-scale query efficiency in the scenario I'm describing?
Assuming that there is no particularly better way to do this, how do I go about it?
I discovered MySQL triggers, which to me would seem capable of 'cascading the updates', as it were. After an INSERT into the Daily Figures table, a trigger could theoretically read the inserted record and, based on its values, run an UPDATE on the appropriate record of the higher-level table. For example, $100 made in Georgia on April 12th would prompt the United States table's 'April 10th-April 17th' record to be updated with a SUM of all the daily records in that range, which would of course include the newly entered $100, so the new value would be correct.
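A rough sketch of such a cascading trigger, again using the hypothetical daily_sales and locations tables from above plus an assumed weekly_continent_sales summary table with a unique key on (week_start, continent):

DELIMITER //
CREATE TRIGGER daily_sales_after_insert
AFTER INSERT ON daily_sales
FOR EACH ROW
BEGIN
    DECLARE v_continent  VARCHAR(32);
    DECLARE v_week_start DATE;

    -- Look up the continent from the dimension table rather than hard-coding it.
    SELECT Continent INTO v_continent
    FROM   locations
    WHERE  LocationID = NEW.LocationID;

    -- Monday of the week the report belongs to.
    SET v_week_start = DATE_SUB(NEW.`Date`, INTERVAL WEEKDAY(NEW.`Date`) DAY);

    INSERT INTO weekly_continent_sales (week_start, continent, total_sales)
    VALUES (v_week_start, v_continent, NEW.Sales)
    ON DUPLICATE KEY UPDATE total_sales = total_sales + NEW.Sales;
END//
DELIMITER ;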
Okay, so that's theoretically possible, but it seems too hard-coded. I want to build the system so that the organization can add/remove locations and set which continent they are in, which would mean that the triggers would have to be reconfigured to include those LocationIDs. The inability to define multiple triggers for a given event and table means that I would have to either store the trigger data separately, or extract it from the trigger object and parse the particular rule being added or removed in and out of it, or keep an external array that I handle with PHP before this step, or... basically, a ton of annoying work.
While MySQL triggers initially seemed like my salvation, the more I look into how tricky it will be to implement them in the way that I need the more it seems like I am totally off the mark in how I am going about this, so I wanted to get some feedback from more experienced database people.
While I would appreciate intelligent answers with technical advice on how to accomplish what I'm trying to do, I will more deeply appreciate wise answers that explain the correct action (even if it's what I'm doing) and why it is correct.