I asked the following question today regarding a DB design and performance issue in my application.
DB Design and Data Retrieval from a heavy table
But I did not get many replies; perhaps I did not explain the question properly. I have now restated the question, hoping to get some suggestions from the experts.
I am facing performance issues while selecting data from a particular table. The business logic of the application is as follows:
I have a number of import processes which create pivot columns under their parent column names when the data is shown to the user. As the columns get pivoted, the system takes time to convert rows into columns, which results in slow performance.
The database tables related to this functionality are as follows:
There can be N number of clients. CLT_Clients table stores client information.
There can be N number of projects associated to a client. PRJ_Projects table stores project information and a link to the client.
There can be N number of listings associated to a project. PRJ_Listings table stores listing information and a link to the project.
There can be N number of source entities associated to a listing. ST_Entities table stores source entity information and a link to the listing.
This source entity is the actual import that contains the InvestorID, position values, source date, active and formula status.
The name of the import, e.g. L1Entity1, is stored in the ST_Entities table along with an ID field, i.e. EntityID.
The InvestorID, Position, Source Date, Active and Formula values get stored in the ST_Positions table.
Database Diagram
The data needs to be viewed as follows:
With this design I’m able to handle any number of imports, because the Position, Source Date, IsActive and Formula columns get pivoted.
The problem I’m facing with this design is that the system performs very slowly when it has to select data for more than 10-12 source entities, and the requirement is to show about 150 source entities. Because the data is stored in rows but has to be shown in columns, dynamic queries are written to pivot these columns, and that takes a long time.
Ques 1: Please comment on my current database design: is it correct, or should it be changed to a new design with 150 columns each for Position, Source Date, IsActive and Formula? In that new design the data would already be stored the way I need to retrieve it, i.e. I would not have to pivot/unpivot it. But the downsides are:
a) There would be more than 600 columns in this table.
b) There would be a hard limit of 150 source entities.
Ques 2: If I stick to my current design, what can be done to improve the performance?
Please see below the indexing and Pivot method information:
Regarding indexes on the Position table: I have a clustered index on the ProjectID field, because data is selected from the Position table on the basis of either ProjectID or EntityID.
Whenever EntityID is used to select data from the Position table, it is always used in a JOIN; whenever ProjectID is used to select data from this table, it is always in the WHERE clause.
The point to note here is that I have a clustered index on ProjectID but no index on EntityID or on the pivoted columns. Is there any room for improvement here?
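For reference, a rough sketch of that arrangement, plus a possible nonclustered index on EntityID, using the table and column names that appear in the pivot examples below (the index names are made up, and whether the extra index actually helps would have to be verified against the real query plans):

-- Existing clustered index on ProjectID, as described above (name is illustrative)
CREATE CLUSTERED INDEX IX_PositionData_ProjectID
    ON DE_PositionData (ProjectID);

-- Hypothetical covering index for the JOINs on EntityID used by the pivot queries;
-- INCLUDE carries the columns those queries read so they can be served from the index
CREATE NONCLUSTERED INDEX IX_PositionData_EntityID
    ON DE_PositionData (EntityID)
    INCLUDE (InvestorID, Position, IsActive);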
Pivoting Method used:
Example 1:
'Select * From
(
Select DD.InvestorID,Cast(1 As Bit) As IsDSInvestor,DD.Position,
Case DD.ProjectID
When ' + CAST(@ProjectID AS VARCHAR) + ' Then DE.SourceName
Else ''' + @PPDeliveryDate + ''' + DE.SourceName
End As SourceName
From DE_PositionData DD
Inner Join DE_DataEntities DE ON DE.EntityID=DD.EntityID
Where DD.ProjectID IN (' + CAST(@ProjectID AS VARCHAR) + ',' + CAST(@PreviousProjectID AS VARCHAR) + ') AND InvestorID IS NOT NULL
) IDD
Pivot
(
max(Position) for SourceName in (' + @DataColumns + ')
) as p1'
Example 2:
'Select * From
(
Select DD.InvestorID As DSSOFID,Cast(1 As Bit) As IsActiveInvestor,
Case ST.SourceTypeCode
When ''RSH'' Then Cast(IsNull(DD.IsActive,0) As Int)
Else Cast(IsNull(DD.IsActive,1) As Int)
End As IsActive,
''~''+DE.SourceName As ActiveSourceName
From DE_DataEntities DE
Left Join DE_PositionData DD ON DE.EntityID=DD.EntityID
Left Join
(
Select * From #DataSources
Where ProjectID=' + CAST(@ProjectID AS VARCHAR) + '
) ST ON ST.ESourceID=DE.ESourceID
Where DE.ProjectID=' + CAST(@ProjectID AS VARCHAR) + ' AND ST.SourceTypeCode NOT IN (''PBC'',''EBL'',''REG'')
AND InvestorID IS NOT NULL
) IDD
Pivot
(
Max(IsActive) for ActiveSourceName in (' + @DataColumns + ')
) As p1'
I would suggest the following.
You should store your data in the normalized format. You should be able to set up indexes to make the pivoting of the data go faster. If you post the actual query that you are using for pivoting, we might be able to help.
There are many reasons you want to store the data this way:
Flexibility in the number of repeating blocks
Many databases have limits on the number of columns or total width of a table. You don't want your design to approach those limits.
You may want additional information on each block. I always include CreatedBy and CreatedAt columns in my tables, and you would want this per block.
You have additional flexibility in summarization.
Adding/removing intermediate values is cumbersome.
That said, the pivoted table has one key advantage: it is what users want to see. If your data is updated only once per day, then you should create a reporting table with the pivot.
If your data is updated incrementally throughout the day, then you can set up triggers (or stored procedure code) to update the base tables and the reporting summary.
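A minimal sketch of that reporting-table idea, assuming SQL Server; dbo.PositionReport is a made-up name, ST_Positions and ST_Entities stand in for the base tables, and in practice the column list would be built from the actual entity names:

-- Rebuild the denormalized reporting table from the normalized base tables.
-- Run this from a nightly job, or call it after each incremental import.
IF OBJECT_ID('dbo.PositionReport') IS NOT NULL
    DROP TABLE dbo.PositionReport;

SELECT InvestorID, [L1Entity1], [L1Entity2]   -- one column per source entity
INTO   dbo.PositionReport
FROM (
    SELECT p.InvestorID, e.SourceName, p.Position
    FROM   ST_Positions p
    INNER JOIN ST_Entities e ON e.EntityID = p.EntityID
) AS src
PIVOT (MAX(Position) FOR SourceName IN ([L1Entity1], [L1Entity2])) AS pvt;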
But as I said earlier, another question you should ask is about the particular method you are using for pivoting. Perhaps we can improve the performance there.
Related
Our problem lies in performing a left join on two large tables (both having millions of entries).
The first one is a table that contains input supplied by the end-user of our program. It contains answers to a variety of questions. Every question belongs to a certain questionnaire. The most important columns are an identifier for the given response, an identifier for the questionnaire form, the datetime the answer is given and an identifier for the user that supplied the answer.
The second table contains information on daily progress of the users regarding the completion of questionnaires. It contains information on the amount of answers a certain user has given on a certain day for a given activity. The most important columns in this table are the user id, the questionnaire id and the date.
The second table is updated right after a new answer enters the first one. The updating is performed by code (workers) that runs on a different server. We would like to make the system robust against failure of this other server. An important step in ensuring that the results table ('responses') remains in sync with the progress table ('progress_questionnaires') is being able to check whether a combination of user_id, questionnaire_id and datetime from the 'responses' table is also present in the 'progress_questionnaires' table. A query that captures the required results, but does not perform well on large tables (N×N, where N is a couple of million entries), is shown below.
A query that captures the required results is:
SELECT r.chapter_id, r.user_id, CAST(first_created as date) as date, 1 as original
FROM responses r
LEFT JOIN progress_questionnaires pq ON r.questionnaire_id = pq.questionnaire_id AND r.user_id = pq.user_id AND CAST(r.first_created as date) = pq.date
WHERE pq.activity_id IS NULL
GROUP BY r.questionnaire_id, r.user_id, CAST(r.first_created as date)
As stated before, this query does capture the required results, but does not perform well on large tables. All key columns are properly indexed as far as we know.
We would be very happy if someone could help us out.
P.S. We are using MariaDB 5.5.43. I hope I supplied all the necessary information, but of course I would be happy to supply additional details where necessary.
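For illustration only, the same missing-combination check can also be written as an anti-join with NOT EXISTS, backed by a composite index the inner probe can use; the column names are taken from the description above and the index name is made up:

-- Composite index so the probe into progress_questionnaires is index-only
CREATE INDEX idx_pq_user_questionnaire_date
    ON progress_questionnaires (user_id, questionnaire_id, date);

-- Response days whose (user, questionnaire, day) combination is missing from progress
SELECT r.questionnaire_id,
       r.user_id,
       CAST(r.first_created AS DATE) AS date,
       1 AS original
FROM responses r
WHERE NOT EXISTS (
        SELECT 1
        FROM progress_questionnaires pq
        WHERE pq.user_id = r.user_id
          AND pq.questionnaire_id = r.questionnaire_id
          AND pq.date = CAST(r.first_created AS DATE)
      )
GROUP BY r.questionnaire_id, r.user_id, CAST(r.first_created AS DATE);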
I have a requirement to have 612 columns in my database table. The number of columns per data type is:
BigInt – 150 (PositionCol1, PositionCol2, …, PositionCol150)
Int – 5
SmallInt – 5
Date – 150 (SourceDateCol1, SourceDateCol2, …, SourceDateCol150)
DateTime – 2
Varchar(2000) – 150 (FormulaCol1, FormulaCol2, …, FormulaCol150)
Bit – 150 (IsActiveCol1, IsActiveCol2, …, IsActiveCol150)
When the user does the import for the first time, the data gets stored in PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1, etc. (plus the other DateTime, Int and SmallInt columns).
When the user does the import for the second time, the data gets stored in PositionCol2, SourceDateCol2, FormulaCol2, IsActiveCol2, etc., and so on.
There is a ProjectID column in the table for which data is being imported.
Before starting the import process, the user maps the Excel column names to the database column names (PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1), and this mapping gets stored in a separate table, so that when the data is retrieved it can be shown under the mapped column names instead of the DB column names (a sketch of such a mapping table follows this list). E.g.:
PositionCol1 may be mapped to SAPDATA
SourceDateCol1 may be mapped to SAPDATE
FormulaCol1 may be mapped to SAPFORMULA
IsActiveCol1 may be mapped to SAPISACTIVE
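A minimal sketch of what such a mapping table could look like (the table and column names here are illustrative, not the actual schema):

CREATE TABLE ColumnMapping (
    ProjectID    INT          NOT NULL,
    DbColumnName VARCHAR(128) NOT NULL,   -- e.g. 'PositionCol1'
    MappedName   VARCHAR(128) NOT NULL,   -- e.g. 'SAPDATA'
    CONSTRAINT PK_ColumnMapping PRIMARY KEY (ProjectID, DbColumnName)
);

-- Resolve the physical column behind a user-facing name such as SAPDATA
SELECT DbColumnName
FROM ColumnMapping
WHERE ProjectID = 1
  AND MappedName = 'SAPDATA';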
40,000 rows will be added to this table every day. My question is: will SQL Server be able to handle that load in the long run?
Most of the time a row will have data in about 200-300 columns; in the worst case it will have data in all 612 columns. Keeping this in view, should I make changes to the design to avoid future performance issues? If so, please suggest what could be done.
If I stick to my current design, what points should I take care of, apart from indexing, to get optimal performance while retrieving data from this huge table?
If I need to retrieve the data of a particular entity, e.g. SAPDATA, I will have to go to my mapping table, get the database column name mapped to SAPDATA (PositionCol1 in this case), and retrieve it. But that way I will have to write dynamic queries. Is there any better way?
Don't stick with your current design. Your repeating groups are unwieldy and self-limiting... What happens when somebody uploads 151 times? Normalise this table so that you have one of each type per row rather than 150. You won't need the mapping this way, as you can select SAPDATA from the position column without worrying whether it is 1-150.
You probably want a PROJECTS table with an ID, and a PROJECT_UPLOADS table with an ID and an FK to the PROJECTS table. This table would have Position, SourceDate, Formula and IsActive columns, given your use case above.
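A rough sketch of that structure (column names and types are assumptions based on the question):

CREATE TABLE PROJECTS (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- One row per upload instead of 150 repeating column groups
CREATE TABLE PROJECT_UPLOADS (
    id         INT PRIMARY KEY,
    projectid  INT NOT NULL REFERENCES PROJECTS (id),
    Position   BIGINT,
    SourceDate DATE,
    Formula    VARCHAR(2000),
    IsActive   BIT
);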
Then you could do things like
select p.name, pu.position from PROJECTS p inner join PROJECT_UPLOADS pu on pu.projectid = p.id WHERE pu.position = 'SAPDATA'
etc.
I want to try to keep this as one query rather than looping in PHP, but it's proving to be tough.
I have a table called applications, that stores all the applications and some basic information about them.
Then I have a table with all the types of applications in it, and that table contains a reference to another table which stores more specific data about that particular type of application.
select applications.id as appid, applications.category, type.title as type, type.id as tid, type.valuefld, type.tablename
from applications
left join type on applications.typeid=type.id
left join department on type.deptid=department.id
where not isnull(work_cat)
and work_cat != ''
and applications.deleted=0
and datei between '10-04-14' and '11-04-14'
order by type, work_cat
Now, in the old version, another query is run for every single result. Over hundreds of results... that sucks.
This is the query I'd like to integrate so I can get all the data in one result row. (Old is ASP, I'm re-writing it in PHP)
query = "select sum("&adors.fields("valuefld")&") as cost, description from "&adors.fields("tablename")&" where appid = '"&adors.fields("tablename")&"'"
Prepared statements, I'm aware, are the best solution, but for now they are not an option.
You can't do this with a plain SQL query - you need a defined set of tables that your query is based on. The fact that your current implementation queries whatever table is named by tablename in the first result set means that, to get this all in one query, you will have to restructure your data. You have to know which tables you're querying rather than having it be dynamic.
If the reason for these different tables is that the different information stored in each requires different record (column) structures, you might want to look into key/value pair storage in one large table. Once you combine the dynamically named tables into a single location, you can integrate your two queries.
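A minimal sketch of that key/value idea (all names here are illustrative, since the per-type tables themselves are not shown):

-- One detail table replaces the many per-type tables referenced by tablename
CREATE TABLE application_detail (
    appid       INT          NOT NULL,
    field_name  VARCHAR(100) NOT NULL,   -- e.g. 'cost', 'description'
    field_value VARCHAR(255),
    PRIMARY KEY (appid, field_name)
);

-- The detail values can then be pulled in by the main query directly
SELECT a.id AS appid, t.title AS type, d.field_name, d.field_value
FROM applications a
LEFT JOIN type t ON a.typeid = t.id
LEFT JOIN application_detail d ON d.appid = a.id;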
I am currently working on a web service that stores and displays money currency data.
I have two MySQL tables, CurrencyTable and CurrencyValueTable.
The CurrencyTable holds the names of the currencies as well as their description and so forth, like so:
CREATE TABLE CurrencyTable ( name VARCHAR(20), description TEXT, .... );
The CurrencyValueTable holds the values of the currencies during the day - a new value is inserted every 2 minutes when the market is open. The table looks like this:
CREATE TABLE CurrencyValueTable ( currency_name VARCHAR(20), value FLOAT, `datetime` DATETIME, ....);
I have two questions regarding this design:
1) I have more than 200 currencies. Is it better to have a separate CurrencyValueTable for each currency or hold them all in one table?
2) I need to be able to show the current (latest) value of each currency. Is it better to just add such a field to the CurrencyTable and update it every two minutes, or is it better to use a statement like:
SELECT value FROM CurrencyValueTable ORDER BY `datetime` DESC LIMIT 1
The second option seems slower.. I am leaning towards the first one (which is also easier to implement).
Any input would be greatly appreciated!!
p.s. - please ignore SQL syntax / other errors, I typed it off the top of my head..
Thanks!
To your questions:
I would use one table. Especially if you need to report on or compare data from multiple currencies, that will be much easier with everything in one table.
If you don't have a need to track the history of each currency's value, then go ahead and just update a single value -- but in that case, why even have a separate table? You can just add "latest value" as a field in the currency table and update it there. If you do need to track history, then you will need the two tables and the SQL you posted will work.
As an aside, instead of FLOAT I would use DECIMAL(10,2). Since MySQL 5.0 it is stored as an exact value, which avoids the rounding problems FLOAT can introduce when handling currency.
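If you do keep the history table, here is a sketch of the per-currency lookup together with an index that supports it (the index name is made up):

-- Composite index so the newest row per currency is found without a full scan
CREATE INDEX idx_value_currency_datetime
    ON CurrencyValueTable (currency_name, `datetime`);

-- Latest stored value for a single currency
SELECT value
FROM CurrencyValueTable
WHERE currency_name = 'EUR'
ORDER BY `datetime` DESC
LIMIT 1;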
It is better to have one table holding all currencies.
If there is need for historical prices, then the table needs to hold them. A reasonable compromise in many situations is to split the price table into a full list of historical prices and another table which only has the current prices.
Using data type float can be troublesome. Please be sure you know what you are doing. If not, use a database currency data type.
As your web service is transactional, it is better to access as few tables as possible per request. Since you will be reading and writing a lot, I would suggest having a single table.
It's better to add such a field to the CurrencyTable and update it, rather than hitting two tables for a single request.
I want to design a database which is described as follows:
Each product has only one status at any point in time, but the status of a product can change during its lifetime. How should I design the relationship between product and status so that I can easily query all products with a specific status at the current time? In addition, could anyone please give me some in-depth details about designing a database that deals with time durations, as in the problem above? Thanks for any help.
Here is a model to achieve your stated requirement.
Link to Time Series Data Model
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
Normalised to 5NF; no duplicate columns; no Update Anomalies, no Nulls.
When the Status of a Product changes, simply insert a row into ProductStatus, with the current DateTime. No need to touch previous rows (which were true, and remain true). No dummy values which report tools (other than your app) have to interpret.
The DateTime is the actual DateTime that the Product was placed in that Status; the "From", if you will. The "To" is easily derived: it is the DateTime of the next (DateTime > "From") row for the Product; where it does not exist, the value is the current DateTime (use ISNULL).
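The linked model itself is not reproduced here, but purely as an illustration of the idea, a rough sketch of the ProductStatus table and the insert-only status change (names and types are assumptions):

CREATE TABLE ProductStatus (
    ProductId  INT      NOT NULL REFERENCES Product (ProductId),
    DateTime   DATETIME NOT NULL,   -- the "From"; the "To" is the next row's DateTime
    StatusCode INT      NOT NULL REFERENCES Status (StatusCode),
    CONSTRAINT PK_ProductStatus PRIMARY KEY (ProductId, DateTime)
);

-- A status change is a single insert; earlier rows are never updated
INSERT INTO ProductStatus (ProductId, DateTime, StatusCode)
VALUES (1, CURRENT_TIMESTAMP, 2);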
The first model is complete; (ProductId, DateTime) is enough to provide uniqueness, for the Primary Key. However, since you request speed for certain query conditions, we can enhance the model at the physical level, and provide:
An Index (we already have the PK Index, so we will enhance that first, before adding a second index) to support covered queries (those based on any arrangement of { ProductId | DateTime | Status } can be supplied by the Index, without having to go to the data rows). Which changes the Status::ProductStatus relation from Non-Identifying (broken line) to Identifying type (solid line).
The PK arrangement is chosen on the basis that most queries will be Time Series, based on Product⇢DateTime⇢Status.
The second index is supplied to enhance the speed of queries based on Status.
In the Alternate Arrangement, that is reversed; ie, we mostly want the current status of all Products.
In all renditions of ProductStatus, the DateTime column in the secondary Index (not the PK) is DESCending; the most recent is first up.
I have provided the discussion you requested. Of course, you need to experiment with a data set of reasonable size, and make your own decisions. If there is anything here that you do not understand, please ask, and I will expand.
Responses to Comments
Report all Products with Current State of 2
SELECT ProductId,
Description
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId -- Join
AND StatusCode = 2 -- Request
AND DateTime = ( -- Current Status on the left ...
SELECT MAX(DateTime) -- Current Status row for outer Product
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId
)
ProductId is Indexed, leading col, both sides
DateTime is Indexed, 2nd col in Covered Query Option
StatusCode is Indexed, 3rd col in Covered Query Option
Since DateTime in the Index is DESCending, only one fetch is required to satisfy the inner query
the rows are required at the same time, for the one query; they are close together (due to the Clustered Index); almost always on the same page due to the short row size.
This is ordinary SQL, a subquery, using the power of the SQL engine, Relational set processing. It is the one correct method, there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.
Two Dates in ProductStatus
Columns such as DateTimeFrom and DateTimeTo are gross errors. Let's take it in order of importance.
It is a gross Normalisation error. "DateTimeTo" is easily derived from the single DateTime of the next row; it is therefore redundant, a duplicate column.
The precision does not come into it: that is easily resolved by virtue of the DataType (DATE, DATETIME, SMALLDATETIME). Whether you display one less second, microsecond, or nanosecond is a business decision; it has nothing to do with the data that is stored.
Implementing a DateTo column is a 100% duplicate (of DateTime of the next row). This takes twice the disk space. For a large table, that would be significant unnecessary waste.
Given that it is a short row, you will need twice as many logical and physical I/Os to read the table, on every access.
And twice as much cache space (or put another way, only half as many rows would fit into any given cache space).
By introducing a duplicate column, you have introduced the possibility of error (the value can now be derived two ways: from the duplicate DateTimeTo column or the DateTimeFrom of the next row).
This is also an Update Anomaly. Whenever a DateTimeFrom is updated, the DateTimeTo of the previous row has to be fetched (no big deal, as it is close by) and updated (a big deal, as it is an additional verb that can be avoided).
"Shorter" and "coding shortcuts" are irrelevant, SQL is a cumbersome data manipulation language, but SQL is all we have (Just Deal With It). Anyone who cannot code a subquery really should not be coding. Anyone who duplicates a column to ease minor coding "difficulty" really should not be modelling databases.
Note well, that if the highest order rule (Normalisation) was maintained, the entire set of lower order problems are eliminated.
Think in Terms of Sets
Anyone having "difficulty" or experiencing "pain" when writing simple SQL is crippled in performing their job function. Typically the developer is not thinking in terms of sets and the Relational Database is set-oriented model.
For the query above, we need the Current DateTime; since ProductStatus is a set of Product States in chronological order, we simply need the latest, or MAX(DateTime) of the set belonging to the Product.
Now let's look at something allegedly "difficult", in terms of sets. For a report of the duration that each Product has been in a particular State: the DateTimeFrom is an available column, and defines the horizontal cut-off, a sub set (we can exclude earlier rows); the DateTimeTo is the earliest of the sub set of Product States.
SELECT ProductId,
Description,
[DateFrom] = DateTime,
[DateTo] = (
SELECT MIN(DateTime) -- earliest in subset
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId -- our Product
AND ps_inner.DateTime > ps.DateTime -- defines subset, cutoff
)
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId
AND StatusCode = 2 -- Request
Thinking in terms of getting the next row is row-oriented, not set-oriented processing. Crippling, when working with a set-oriented database. Let the Optimiser do all that thinking for you. Check your SHOWPLAN, this optimises beautifully.
Inability to think in sets, thus being limited to writing only single-level queries, is not a reasonable justification for: implementing massive duplication and Update Anomalies in the database; wasting online resources and disk space; guaranteeing half the performance. Much cheaper to learn how to write simple SQL subqueries to obtain easily derived data.
"In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above?"
Well, there exists a 400-page book entitled "Temporal Data and the Relational Model" that addresses your problem.
That book also addresses numerous problems that the other responders have not addressed in their responses, for lack of time or for lack of space or for lack of knowledge.
The introduction of the book also explicitly states that "this book is not about technology that is (commercially) available to any user today.".
All I can observe is that users wanting temporal features from SQL systems are, to put it plain and simple, left wanting.
PS
Even if those 400 pages could be "compressed a bit", I hope you don't expect me to give a summary of the entire meaningful content within a few paragraphs here on SO ...
Tables similar to these:
product
-----------
product_id
status_id
name
status
-----------
status_id
name
product_history
---------------
product_id
status_id
status_time
Then write a trigger on product that records the status and timestamp (sysdate) into product_history on each update where the status changes.
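For example, an Oracle-flavoured sketch of such a trigger (the mention of sysdate suggests Oracle; adjust the syntax for your RDBMS):

CREATE OR REPLACE TRIGGER trg_product_status_history
AFTER UPDATE OF status_id ON product
FOR EACH ROW
WHEN (OLD.status_id <> NEW.status_id)
BEGIN
    -- Record the new status and the time of the change
    INSERT INTO product_history (product_id, status_id, status_time)
    VALUES (:NEW.product_id, :NEW.status_id, SYSDATE);
END;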
Google "bi-temporal databases" and "slowly changing dimensions".
These are two names for essentially the same pattern.
You need to add two timestamp columns to your product table, "VALID_FROM" and "VALID_TO".
When your product's status changes, you add a NEW row with "VALID_FROM" set to now() (or some other known effective date/time) and "VALID_TO" set to 9999-12-31 23:59:59 or some other date ridiculously far in the future.
You also need to update the "9999-12-31..." date on the previously current row to the new row's "VALID_FROM" time minus 1 microsecond.
You can then easily query the product status at any given time.
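A sketch of both steps, using assumed names (a product table with status_id, valid_from and valid_to columns, as described above); the minus-one-microsecond convention would simply make the upper bound inclusive:

-- Close off the previously current row for this product ...
UPDATE product
SET valid_to = NOW()
WHERE product_id = 42
  AND valid_to = '9999-12-31 23:59:59';

-- ... then insert the new current row with the far-future valid_to
INSERT INTO product (product_id, status_id, valid_from, valid_to)
VALUES (42, 2, NOW(), '9999-12-31 23:59:59');

-- All products that were in status 2 at a given point in time
SELECT product_id
FROM product
WHERE status_id = 2
  AND valid_from <= '2015-06-01 12:00:00'
  AND valid_to   >  '2015-06-01 12:00:00';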