I have a requirement to have 612 columns in my database table. The number of columns per data type is:
BigInt – 150 (PositionCol1, PositionCol2, …, PositionCol150)
Int – 5
SmallInt – 5
Date – 150 (SourceDateCol1, SourceDateCol2, …, SourceDateCol150)
DateTime – 2
Varchar(2000) – 150 (FormulaCol1, FormulaCol2, …, FormulaCol150)
Bit – 150 (IsActiveCol1, IsActiveCol2, …, IsActiveCol150)
When the user does the import for the first time, the data gets stored in PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1, etc. (plus the other DateTime, Int, and SmallInt columns).
When the user does the import for the second time, the data gets stored in PositionCol2, SourceDateCol2, FormulaCol2, IsActiveCol2, etc., and so on.
There is a ProjectID column in the table identifying the project for which the data is being imported.
Before starting the import process, the user maps the Excel column names to the database column names (PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1), and this mapping gets stored in a separate table, so that when the data is retrieved it can be shown under these mapped column names instead of the DB column names. E.g.
PositionCol1 may be mapped to SAPDATA
SourceDateCol1 may be mapped to SAPDATE
FormulaCol1 may be mapped to SAPFORMULA
IsActiveCol1 may be mapped to SAPISACTIVE
40,000 rows will be added to this table every day. My question is: will SQL Server be able to handle that load in the long run?
Most of the time, a row will have data in about 200-300 columns; in the worst case it will have data in all 612 columns. Keeping this point in view, should I make some changes to the design to avoid future performance issues? If so, please suggest what could be done.
If I stick to my current design, what points should I take care of, apart from indexing, to get optimal performance while retrieving data from this huge table?
If I need to retrieve the data of a particular entity, e.g. SAPDATA, I'll have to go to my mapping table, get the database column name mapped to SAPDATA (PositionCol1 in this case), and then retrieve it. But that way I'll have to write dynamic queries. Is there any better way?
Don't stick with your current design. Your repeating groups are unwieldy and self-limiting... What happens when somebody uploads 151 times? Normalise this table so that you have one of each column per row rather than 150. You won't need the mapping this way, as you can select SAPDATA from the position column without worrying whether it is 1-150.
You probably want a PROJECTS table with an ID, a PROJECT_UPLOADS table with an ID and an FK to the PROJECTS table. This table would have Position, SourceDate, Formula and IsActive given your use-case above.
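For illustration, here is a minimal T-SQL sketch of that normalized design. The table and column names, types, and keys are assumptions based on the question; the remaining Int/SmallInt/DateTime columns would be added the same way.

CREATE TABLE PROJECTS (
    id   INT IDENTITY(1,1) PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE PROJECT_UPLOADS (
    id         BIGINT IDENTITY(1,1) PRIMARY KEY,
    projectid  INT NOT NULL REFERENCES PROJECTS(id),   -- FK to PROJECTS
    position   BIGINT,
    sourcedate DATE,
    formula    VARCHAR(2000),
    isactive   BIT
);

Each import simply inserts new rows, so there is no 150-upload ceiling.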
Then you could do things like
SELECT p.name, pu.position
FROM PROJECTS p
INNER JOIN PROJECT_UPLOADS pu ON pu.projectid = p.id
WHERE pu.position = 'SAPDATA'
etc.
I've been thinking about this for a couple of days, but I feel that I'm lacking the right words to ask Google the questions I need answered. That's why I'd really appreciate any kind of help, hints, or guidance.
First of all, I have almost no experience with databases (apart from misusing Excel as such) and, unfortunately, I have all my data written in very impractical and huge .csv files.
What I have:
I have time series data (in 15 minute-steps) for several hundred sensors (SP) over the course of several years (a couple of million rows in total) in Table 1. There are also some weather condition data (WCD) that applies to all of my sensors and is therefore stored in the same table.
Note that each sensor delivers two data points per measurement.
Table1 (Sensors as Columns)
Now I also have another table (Table 2) that lists several static properties that define each sensor in Table 1.
Table 2 (Sensors as Rows)
My main question concerns database design and general implementation (MySQL or MS Access): is it really necessary to have hundreds of columns (two for each sensor) in Table 1? I wish I could store the "link" to the respective time series data simply as two additional columns in Table 2.
Is that feasible? Does that even make sense? How would I set up this database automatically (coming from .csv files with a different structure), since I can't type in every column by hand for hundreds of sensors and their attached time series?
In the end, I want to be able to make a query/sort my data (see below) by timeframe, date and sensor-properties.
The reason for all of this is the following:
I want to create a third table (Table3) which “stores” dynamic values. These values are results of calculations based on the sensor-measurements and WCD in Table 1. However, depending on the sensor-properties in Table2, the sensors and their respective time series data that serve as input for the calculations of Table3 might differ from set to set.
That way I want to obtain e.g. Set 1: “a portfolio of sensors with location A for each month between January 2010 and November 2011” and store it somewhere. Then I want to do the same for Set 2: e.g. “a portfolio of sensors with location B for the same time frame”. Finally I will compare these different portfolios and conduct further analysis on them. Does that sound reasonable at all??
So far, I'm not even sure whether I should actually store the results of each Table 3 calculation in the database, or whether I should output them via a query and feed them directly into my analysis tool. What makes more sense?
A more useful structure for your sensor and WCD data would be:
Table SD - Sensor Data
Columns:
Datetime
Sensor
A_value
B_value
With this structure you do not need to store a link to the time series data in Table 2--the Sensor value is the common data that links the tables.
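For concreteness, a minimal MySQL sketch of that structure (the column types and the composite key are assumptions; adjust them to your actual value ranges):

CREATE TABLE SD (
    `datetime` DATETIME NOT NULL,
    sensor     VARCHAR(32) NOT NULL,
    a_value    DOUBLE,
    b_value    DOUBLE,
    PRIMARY KEY (sensor, `datetime`)
);

-- example query joining the sensor properties table (Table 2);
-- the column names sensor and location in Table 2 are assumptions
SELECT sd.*
FROM SD sd
JOIN Table2 t2 ON t2.sensor = sd.sensor
WHERE t2.location = 'A'
  AND sd.`datetime` BETWEEN '2010-01-01' AND '2011-11-30';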
If your weather conditions data all have the same type of values and/or attributes then you should normalize it similarly:
Table WCD - Weather Conditions Data, Normalized
Columns:
Datetime
Weather_condition
Weather_condition_value
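A matching sketch for the normalized weather table, under the same naming assumptions:

CREATE TABLE WCD (
    `datetime`              DATETIME NOT NULL,
    weather_condition       VARCHAR(32) NOT NULL,
    weather_condition_value DOUBLE,
    PRIMARY KEY (weather_condition, `datetime`)
);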
From your example, it looks like different weather conditions may have different attributes (or different data types of attributes), in which case the form in which you have the WCD in your Table 1 may be most appropriate.
Storing the results of your calculations in another table sounds like a reasonable thing to do if at least some of your further analysis could be, or will be, done using SQL.
I'm working now on a project that involves many users and their login/logout times (and summary details), so that I can track their presence.
My question is: what is the best way to store that data (if we are talking about hundreds or maybe thousands of users)?
1. Make a DB that contains a table for each user, where each table holds all of that user's dates and hours?
2. Make one big table which contains all this data?
Thanks.
A table for each user is a weird approach.
Make one table for ALL users; that is the correct way to go.
Then make a table called actions with the user_id as a FOREIGN KEY, and two more columns: type and time.
When the user logs in, add a new row to the actions table with type = 1 (login), and when he logs out, add one with type = 2 (logout).
Using numbers instead of strings is better since it reduces database size; repeating the same string is costly.
The type column must be an INT type.
The time column can have CURRENT_TIMESTAMP as the default value, since it will log the action when it has happened.
See an example fiddle with schema and query.
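In case the fiddle isn't available, here is a minimal MySQL sketch of that schema (names and exact types are assumptions):

CREATE TABLE users (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE actions (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    type    INT NOT NULL,                                -- 1 = login, 2 = logout
    time    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

-- record a login for user 42
INSERT INTO actions (user_id, type) VALUES (42, 1);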
We are building an analytics engine which has to store an attribute preference score for each user. We are expecting 400 attributes, and they may change (at what frequency is not yet known). We are planning to store this in Redshift.
My questions are:
Should we store 1 row per user with 400 columns (1 column for each attribute),
or should we go for a table structure like
(uid, attribute id, attribute value, preference score), which will be 20-400 rows per user by 4 columns?
Which kind of storage would lead to better performance in Redshift?
Should we really consider NoSQL for this?
Note:
1. This is a backend for a real-time application with an increasing number of users.
2. For processing, the above table has to be read with the entire information of all attributes for one user, i.e. indirectly creating a 1×400 matrix at runtime.
Please help me decide which design would be ideal for such a use case. Thank you.
You can go for tables like the ones given in this example and then use the bitwise functions documented here:
http://docs.aws.amazon.com/redshift/latest/dg/r_bitwise_examples.html
For your problem, I would suggest a two-table design. It's more pain in the beginning but will help in the future.
The first table would be a key-value table which stores all the base data and is essentially future-proof: you can add or remove attributes and this table will keep working.
The second table would be an N-column table (400 in your case), which you can build from the first table. For this second table you can start with a bare minimum set of columns, let's say only 50 out of those 400, so that querying it is really fast. Its structure can be refreshed periodically to match the current reporting requirements, and you will always have the base table in case you need to backfill any data.
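A rough Redshift sketch of that two-table idea; the table names, column types, and distribution/sort keys are assumptions, not a definitive implementation:

-- base key-value table: attributes can come and go without schema changes
CREATE TABLE user_attribute_scores (
    uid              BIGINT NOT NULL,
    attribute_id     INT NOT NULL,
    attribute_value  VARCHAR(256),
    preference_score DECIMAL(10,4)
)
DISTKEY (uid)
SORTKEY (uid, attribute_id);

-- wide reporting table, rebuilt periodically from the base table
CREATE TABLE user_attribute_matrix (
    uid    BIGINT NOT NULL,
    attr_1 DECIMAL(10,4),
    attr_2 DECIMAL(10,4)   -- ... only the attributes currently needed for reporting
)
DISTKEY (uid)
SORTKEY (uid);

Distributing and sorting on uid keeps all of one user's attributes on the same slice, which suits the "read everything for one user" access pattern.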
I am currently working on a web service that stores and displays money currency data.
I have two MySQL tables, CurrencyTable and CurrencyValueTable.
The CurrencyTable holds the names of the currencies as well as their description and so forth, like so:
CREATE TABLE CurrencyTable ( name VARCHAR(20), description TEXT, .... );
The CurrencyValueTable holds the values of the currencies during the day - a new value is inserted every 2 minutes when the market is open. The table looks like this:
CREATE TABLE CurrencyValueTable ( currency_name VARCHAR(20), value FLOAT, `datetime` DATETIME, ....);
I have two questions regarding this design:
1) I have more than 200 currencies. Is it better to have a separate CurrencyValueTable for each currency or hold them all in one table?
2) I need to be able to show the current (latest) value of the currency. Is it better to just insert such a field to the CurrencyTable and update it every two minutes or is it better to use a statement like:
SELECT value FROM CurrencyValueTable ORDER BY `datetime` DESC LIMIT 1
The second option seems slower.. I am leaning towards the first one (which is also easier to implement).
Any input would be greatly appreciated!!
p.s. - please ignore SQL syntax / other errors, I typed it off the top of my head..
Thanks!
To your questions:
I would use one table. Especially if you need to report on or compare data from multiple currencies, that will be incredibly easier with everything in one table.
If you don't have a need to track the history of each currency's value, then go ahead and just update a single value -- but in that case, why even have a separate table? You can just add "latest value" as a field in the currency table and update it there. If you do need to track history, then you will need the two tables and the SQL you posted will work.
As an aside, instead of FLOAT I would use DECIMAL(10,2). As of MySQL 5.0, this actually gives better results when it comes to currency handling and rounding.
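A hedged MySQL sketch of that history-plus-index setup (the index name and the sample currency are assumptions):

CREATE TABLE CurrencyValueTable (
    currency_name VARCHAR(20) NOT NULL,
    value         DECIMAL(10,2) NOT NULL,
    `datetime`    DATETIME NOT NULL,
    INDEX idx_currency_datetime (currency_name, `datetime`)
);

-- latest value for one currency
SELECT value
FROM CurrencyValueTable
WHERE currency_name = 'USD'
ORDER BY `datetime` DESC
LIMIT 1;

With the composite index, the latest-value query is an index seek rather than a full scan, so it stays fast even as the history grows.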
It is better to have one table holding all currencies
If there is need for historical prices, then the table needs to hold them. A reasonable compromise in many situations is to split the price table into a full list of historical prices and another table which only has the current prices.
Using the FLOAT data type can be troublesome. Please be sure you know what you are doing; if not, use an exact currency data type such as DECIMAL.
As your web service is transactional, it is better if you have to access fewer tables at the same time. Since you will be reading and writing a lot, I would suggest having a single table.
It's better to add a field to the CurrencyTable and update it rather than hitting two tables for a single request.
I have several tables like Buyers, Shops, Brands, Money_Collectors, etc.
Each one of those has a default value, e.g. the default Buyer is David, the default Shop is Ebay, and so on.
I would like to save those default values in a database (so that user could change them).
I thought about adding an is_default column to each one of the tables, but it seems inefficient because only one row in each table may be the default.
Then I thought that the best approach would be to have a Defaults table containing all the default values. This table would have 1 row and N columns, where N is the number of default values:
Defaults table:
buyer   shop   brand   money_collector
-----   ----   -----   ---------------
David   Ebay   Dell    NULL (no default value)
But this does not seem to be the best approach, because the table structure changes whenever a new default value is added.
What would be the best approach to store default values ?
Just to be clear: the best way is a column on each table that the dropdowns source from.
And here's why...
"Shouldn't I worry about space when
saving data in a database?"
The short answer is no. The longer answer is that what you should worry about is performance. Focusing on space will lead you to do very bad things.
Bad things that you'll do if space is a concern.
You'll bury meaning into Primary Keys. i.e. Smart Keys.
You'll try to store multiple values in one column.
You'll index too little
(No doubt we could create a list of 50 bad practices which save space)
"Suppose there are 50 shops (select box with 50 possible values). In this case, to store the default shop you need 50 boolean fields."
Well it's ONE Boolean column. It exists on each row.
Let me ask you this. If you created a table with 1 date column and inserted 1 row, how much space would you use on disk?
If you said 7 or 8 bytes, then you're off by about 1000 times.
The smallest unit of disk space is a block. Blocks are typically 8 KB (they can be as small as 2 KB or as large as 32 KB; no nitpicking here, the actual limits are unimportant).
Let's say you have 8 KB blocks; then your 1-column, 1-row table takes 8 KB. If you insert another 999 rows it will still take up 8 KB. (Again, no nitpicking; there is overhead per block and per row - it's an example.)
So in your lookup table with 50 store names, the likelihood that adding 50 bytes to the size of the table forces you to expand from 1 block to 2 is slim to none, and completely irrelevant.
On the other hand, your default table will certainly take up at least one additional block.
But the worst hit to PERFORMANCE is that your call to fill a drop down will need two round trips to the database, one to get the list, one to get the default. (yes, you may be able to do this in one but go with it)
So you've saved exactly zero space and doubled your network traffic.
You see what I'm saying.
Another crucial reason to stop worrying about space is that you're giving up clarity. Think of the developer you're going to hire to run this app. When he joins the team and looks at the database, imagine the two scenarios.
There's a Boolean column named Default_value
There's a table with no relationships to anything that's named Default_Values
You ask him to build a new form with a dropdown for 'store'.
In scenario 1 he finds the store table, wires up the dropdown to a simple query of the table and uses the default_value field to select the initial value.
In scenario 2, without some training, how would he know to look for a separate table? Maybe he'd see the table but by the time you're hiring, your datamodel now has hundreds of tables.
Again, a little contrived but the point is salient. Clarity in the database is well, well worth a byte per row.
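To make scenario 1 concrete, a minimal sketch; the table, the column names, and the single query are illustrative assumptions:

CREATE TABLE shops (
    id         INT PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    is_default BOOLEAN NOT NULL DEFAULT FALSE   -- exactly one row flagged TRUE
);

-- one round trip fills the dropdown and tells the UI which entry to preselect
SELECT id, name, is_default FROM shops ORDER BY name;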
Technical stuff
I'm not a MySQL guy, but in Oracle a null column at the end of a row takes no additional space. In Oracle I would use a VARCHAR2(1) and let 'T' = default and leave the others null. That would have the effect of using only 1 additional byte total, and not per row. YMMV with MySQL; you can pose that question separately if you can't Google the answer.
But the time to worry about that is on millions of rows, not hundreds. Any table which feeds a dropdown will never be big enough to start worrying about extra bytes.
What if you create an XML document and store it in the table in an XML column? The XML could have tags for the tables and a sub-node of default values.
You should rather create a table with two columns and N rows:
Defaults table:
buyer   David
shop    Ebay
brand   Dell
This way you can add new values without having to change the table structure.
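A sketch of that key/value layout (the table and column names are assumptions):

CREATE TABLE defaults (
    setting_name  VARCHAR(50) PRIMARY KEY,
    setting_value VARCHAR(100)
);

INSERT INTO defaults (setting_name, setting_value) VALUES
    ('buyer', 'David'),
    ('shop',  'Ebay'),
    ('brand', 'Dell');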
You can create a catalog table (some kind of metadata table) containing the default values as strings for the desired table columns. Then you can use the convert function for getting the appropriate value. Below is a sample table definition (Transact-SQL was used):
create table dbo.cat_default_values
(
id_column varchar(30) not null,
id_table varchar(30) not null,
datatype varchar(30) not null,
value varchar(100) not null,
f_creation datetime not null,
usr_creation char(8) null,
primary key clustered (id_column, id_table)
)
declare @defaultValueInt int,
        @defaultValueVarchar varchar(30)

select @defaultValueInt = convert(int, value)
from cat_default_values where id_column = 'defColumInteger' and id_table = 'table1'

select @defaultValueVarchar = value
from cat_default_values where id_column = 'defColumVarchar' and id_table = 'table1'
What you are trying to store is not metadata, so I would not invent an external data store (coupled with extra code) to hold it.
I assume you have PK sequence generation logic under your control. I would reserve a magic number x and insert a record in each table with _id = x as the default value. If you want to show the user the default value, you can handle it uniformly in your query, or you can handle it in application logic at insert time. The good thing about this is that you have access to the default value all the time without writing any extra logic, and the logic for maintaining the default value of a table can be maintained using the same code (templating ;)).
(From the lessons the W3C learned from modeling schema information of XML using DTDs.)
The only catch is that this logic should be made explicit, either through extensive documentation or enforced with a trigger.
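A hedged illustration of that magic-id approach; the reserved id value 0 and the column names are assumptions (with AUTO_INCREMENT keys you may need a different reserved value):

-- reserve id = 0 in each lookup table for the editable default row
INSERT INTO Buyers (id, name) VALUES (0, 'David');
INSERT INTO Shops  (id, name) VALUES (0, 'Ebay');

-- fetching the default is then uniform across all tables
SELECT name FROM Shops WHERE id = 0;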