Database indices for log table - mysql

I need to create a log table in a database. Each log entry will have various parameters. This is the table design I've come up with.
Table Log
============
ID (INT)
Action (INT)
Created (TIMESTAMP)

Table Parameter
================
ID (INT)
ActionID (INT)
Title (VARCHAR(50))
Value (VARCHAR(300))
My question is, I need to perform complex queries on this log table, such as:
Who made what action and when? "Who" means I need to check parameters where title = "personID", and "what" means I need to check the action code.
Total of prices of sale actions. For this one, I need to check the action code for "sale" (say 3, as an integer), then retrieve the parameters of those actions with title "price", cast them to double, and sum them up.
These queries will expand, and I need to ensure that my design can answer more complex ones. I know that indices give great boosts, but I am not sure where to add them to get the most out of the database engine. The underlying database is MySQL.
Which fields should have indices? Or is my design adequate for my purpose? I am also using the InnoDB engine.
Edit
Sample SQL
SELECT LOG.ID, LOG.ACTION, LOG.CREATED FROM LOG, PARAMETER
WHERE PARAMETER.ACTIONID = LOG.ID AND
PARAMETER.TITLE = 'PERSONNAME' AND
PARAMETER.VALUE = 'JOHN' AND
PARAMETER.TITLE = 'PRICE' AND
(CAST(PARAMETER.VALUE AS DOUBLE) > 30.0) AND
PARAMETER.TITLE = 'DATE' AND
(CAST(PARAMETER.VALUE AS TIMESTAMP) > (NOW() - INTERVAL 1 DAY))
The more conditions we have, the more Parameter.TITLE and Parameter.VALUE pairs will be used in the SQL.

You need to JOIN to PARAMETER multiple times, once per key-value pair that you are testing for. There are many examples; follow the Entity-Attribute-Value tag I added.
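For example, here is a sketch of the sample query rewritten with one join per key-value pair. The aliases and the DECIMAL/DATETIME cast targets are my assumptions; older MySQL versions do not accept DOUBLE or TIMESTAMP as CAST targets, so DECIMAL and DATETIME are used instead:

SELECT l.ID, l.Action, l.Created
FROM Log l
JOIN Parameter p_name  ON p_name.ActionID  = l.ID
                      AND p_name.Title     = 'PERSONNAME'
                      AND p_name.Value     = 'JOHN'
JOIN Parameter p_price ON p_price.ActionID = l.ID
                      AND p_price.Title    = 'PRICE'
                      AND CAST(p_price.Value AS DECIMAL(10,2)) > 30.0
JOIN Parameter p_date  ON p_date.ActionID  = l.ID
                      AND p_date.Title     = 'DATE'
                      AND CAST(p_date.Value AS DATETIME) > NOW() - INTERVAL 1 DAY;

A composite index on Parameter (Title, Value) or (ActionID, Title) would likely let each of those joins be resolved largely from the index rather than by scanning the table.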
Since this is an EAV problem, you will have trouble with performance.
One improvement can be found in the changes I advocate for WP here
Another discussion is here on why not to hide the things you want to filter on in the K-V table. And an alternative.


How to design the database when you need too many columns? [duplicate]

I have a table called cars, but each car has hundreds of attributes and they keep increasing over time (horsepower, torque, a/c, electric windows, etc...). My table has each attribute as a column. Is that the right way to do it when I have thousands of rows and hundreds of columns? Also, I made each attribute a column to facilitate advanced searching / filtering.
Using MySQL database.
Thanks
This is an interesting question IMHO, and the answer may depend on your specific data model and implementation. The most important factor in this case is data density.
How much of each row is actually filled up, on average?
If most of your fields are always present, then data scope partition may be the way to go.
If most of your fields are empty, then a metadata-like structure (like @JayC suggested) may be more attractive.
Let's use the case you mentioned, and do some simulations.
On the first case, scope partition, the idea is to implement partitions based on scope or usage. As an example of partitioning by usage, let's say that the most retrieved fields are Model, Year, Maker and Color. These fields may compose your main [CAR] table, the owner of the ID field which will exclusively identify the vehicle.
Now let's say that Engine, Horsepower, Torque and Cylinders are also used for searches from time to time, but not so frequently. These may exist on a secondary table [CAR_INFO_1], which is tied to the first table by the presence of the CAR_ID field, a foreign key. Proceed by creating as many partitions as you need.
Advantage: Simpler queries. You can coalesce all information about a vehicle with a join query (for example, inside a VIEW).
Downside: Maintenance. Each new field must be implemented in the model itself, and you need an updated data model to locate where the field you need is actually stored (or abstract it inside a view.)
The metadata format is much more elegant, but demands more of your database engine. Check @JayC's and @Nitzan Shaked's answers for details.
Advantages: 100% data density. You'll never have empty Data values. Also maintenance - a new attribute is created by adding it as a row to the metadata identifier table. Data structure is less complex as well.
Downside: Complex queries, together with more complex execution plans. Let's say you need all Ford cars made in 2010 that are blue. It would be very trivial in the first case:
SELECT * FROM CAR WHERE Model='Ford' AND Year='2010' AND Color='Blue'
Now the same query on a metadata-structured model:
Assume the existence of these two tables,
CAR_METADATA_TYPE
ID DESC
1 'Model'
2 'Year'
3 'Color'
and
CAR_METADATA [CAR_ID], [METADATA_TYPE_ID], [VALUE]
The query itself would look something like this:
SELECT * FROM CAR, CAR_METADATA [MP1], CAR_METADATA [MP2], CAR_METADATA [MP3]
WHERE MP1.CAR_ID = CAR.ID AND MP1.METADATA_TYPE_ID = 1 AND MP1.Value='Ford'
AND MP2.CAR_ID = CAR.ID AND MP2.METADATA_TYPE_ID = 2 AND MP2.Value='2010'
AND MP3.CAR_ID = CAR.ID AND MP3.METADATA_TYPE_ID = 3 AND MP3.Value='Blue'
So, it all depends on your needs. But given your case, my suggestion would be the Metadata format.
(But do a model cleanup first - no repeated fields, 1:N data on their own table instead of inline fields like Color1, Color2, Color3, this kind of stuff ;) )
I guess the obvious question is, then: why not have a table car_attrs(car, attr, value)? Each attribute is a row. Most queries can be re-written to use this form.
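A rough sketch of that shape (the column names, and the assumption that cars has an id primary key, are mine):

CREATE TABLE car_attrs (
  car   INT          NOT NULL,   -- references cars.id
  attr  VARCHAR(50)  NOT NULL,   -- attribute name, e.g. 'horsepower'
  value VARCHAR(255),            -- attribute value stored as text
  PRIMARY KEY (car, attr),
  KEY (attr, value)              -- supports filtering by attribute/value
);

-- the earlier "blue 2010 Ford" filter then becomes one join per attribute:
SELECT c.*
FROM cars c
JOIN car_attrs a1 ON a1.car = c.id AND a1.attr = 'Model' AND a1.value = 'Ford'
JOIN car_attrs a2 ON a2.car = c.id AND a2.attr = 'Year'  AND a2.value = '2010'
JOIN car_attrs a3 ON a3.car = c.id AND a3.attr = 'Color' AND a3.value = 'Blue';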
If it is all about features, create a features table, list all your features as rows and give them some sort of automatic id, and create a car_features table with foreign keys to both your cars table and your features table to associate cars with features, perhaps along with any values associated with the relationship (one passenger electric seat, etc.).
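That could look roughly like this (table and column names are illustrative; it assumes a cars table with an id primary key, and InnoDB for the foreign keys):

CREATE TABLE features (
  feature_id INT AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE car_features (
  car_id     INT NOT NULL,
  feature_id INT NOT NULL,
  value      VARCHAR(100),                 -- e.g. 'one passenger electric seat'
  PRIMARY KEY (car_id, feature_id),
  FOREIGN KEY (car_id)     REFERENCES cars (id),
  FOREIGN KEY (feature_id) REFERENCES features (feature_id)
) ENGINE=InnoDB;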
If you have ever-changing attributes, then consider storing them in an XML blob or text structure in one column. This structure is not relational. The most important attributes should then be duplicated in additional columns so you can craft queries to search on them, as the blob will not be searchable from SQL queries. This will cut down on the number of columns in that table and allow for expansion without changing the database schema.
As others have suggested, if you want all the attributes in a table, then use an attribute table to define them. Which to choose will depend on your requirements and the needs of the application.

How to efficiently design MySQL database for my particular case

I am developing a forum in PHP and MySQL. I want to make my forum as efficient as I can.
I have made these two tables
tbl_threads
tbl_comments
Now, the problem is that there is a like and dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, together with the comment_id. I have made a column user_likes and a column user_dislikes in tbl_comments to store the comma-separated user_names. But on this forum, I have read that this is not an efficient way. I have been advised to create a third table to store the Likes and Dislikes and to make my database design comply with 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this:
1. comment_id
2. type (like or dislike)
So, will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Won't that be inefficient? I think there is some confusion on my part here. Can someone clarify this?
There are two ways to solve this. The first one, the "clean" one, is to build your "like" table and do COUNT(*)'s on the appropriate column.
The second one would be to store a counter in each comment, indicating how many ups and downs there have been.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query and merge the two result sets outside of your database (for this, use a query returning the comment_id and the kind of vote the user cast in a specific thread).
Your approach with a comma-separated list is not very performant, because you cannot parse it without extra logic or a huge amount of string parsing. If you have a database - use it!
("One piece of information - one dataset"!)
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
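Roughly, the tables implied by the query and the keys described below (column types are my assumptions) would be:

CREATE TABLE COMMENT (
  COMMENT_ID INT PRIMARY KEY,
  BODY       TEXT
) ENGINE=InnoDB;

CREATE TABLE UP_VOTE (
  COMMENT_ID INT NOT NULL,
  USER_ID    INT NOT NULL,
  PRIMARY KEY (COMMENT_ID, USER_ID),   -- InnoDB clusters on the PK: one comment's votes sit together
  FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
) ENGINE=InnoDB;

CREATE TABLE DOWN_VOTE (
  COMMENT_ID INT NOT NULL,
  USER_ID    INT NOT NULL,
  PRIMARY KEY (COMMENT_ID, USER_ID),
  FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
) ENGINE=InnoDB;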
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
COMMENT.COMMENT_ID,
<other COMMENT fields>,
COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
COMMENT.COMMENT_ID = <whatever>
GROUP BY
COMMENT.COMMENT_ID,
<other COMMENT fields>;
[SQL Fiddle]
Please measure on realistic amounts of data whether that works fast enough for you. If not, then denormalize the model and cache the total score in the COMMENT table, and keep it current through triggers every time a row is inserted into or deleted from the *_VOTE tables.
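A minimal sketch of that trigger-based cache, assuming a SCORE column is added to COMMENT (names follow the tables above; this is illustrative, not the answer's own code):

ALTER TABLE COMMENT ADD COLUMN SCORE INT NOT NULL DEFAULT 0;

CREATE TRIGGER UP_VOTE_AI AFTER INSERT ON UP_VOTE
FOR EACH ROW
  UPDATE COMMENT SET SCORE = SCORE + 1 WHERE COMMENT_ID = NEW.COMMENT_ID;

CREATE TRIGGER UP_VOTE_AD AFTER DELETE ON UP_VOTE
FOR EACH ROW
  UPDATE COMMENT SET SCORE = SCORE - 1 WHERE COMMENT_ID = OLD.COMMENT_ID;

-- plus the mirror-image pair of triggers on DOWN_VOTE (subtract on insert, add on delete)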
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above.1
1 This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.

versioning each field vs history date field?

Which do you recommend and why?
I have a few tables; when I make a change to the data, it should go to a history table (audit) with an effective date.
The other solution is versioning each field, inserting a new row when the data changes.
Which is the best method for the invoice information? Item name and price change all the time.
These are slowly changing dimensions, type 2 and type 4, respectively.
Both methods are valid and may be more appropriate for your needs, depending on your model and query requirements.
Basically, type 2 (versioning) is more appropriate when you need to query historical values as often as the current one, while type 4 (history table) is more suited when you are querying the current value more often and there are more queries (more queries to develop I mean) against the most recent value.
A system we use and are happy with:
For each table that requires history, we create a similar table, adding a timestamp field at the end which becomes part of the PK.
On each update to the original table, we insert into the history table with the same conditions:
update table x WHERE something something
insert into table x_history
select * from x WHERE something something
That keeps your data clean and your tables slim.
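For example, with a hypothetical invoice table (names and columns are purely illustrative):

CREATE TABLE invoice (
  invoice_id INT NOT NULL,
  item_name  VARCHAR(100),
  price      DECIMAL(10,2),
  PRIMARY KEY (invoice_id)
);

CREATE TABLE invoice_history (
  invoice_id INT NOT NULL,
  item_name  VARCHAR(100),
  price      DECIMAL(10,2),
  changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (invoice_id, changed_at)   -- the timestamp becomes part of the PK
);

-- on each change: update the live row, then copy it into the history table
UPDATE invoice SET price = 19.99 WHERE invoice_id = 42;
INSERT INTO invoice_history (invoice_id, item_name, price)
  SELECT invoice_id, item_name, price FROM invoice WHERE invoice_id = 42;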
My personal preference would be to use the Observer Pattern in your application and to implement a separate history table. This means that you can pull the data from the history table when you need it and you don't compromise the speed of querying the main table.

design database relating to time attribute

I want to design a database which is described as follows:
Each product has only one status at any point in time. However, the status of a product can change during its lifetime. How could I design the relationship between product and status so that I can easily query all products with a specific status at the current time? In addition, could anyone please give me some in-depth details about designing databases that deal with time duration, as in the problem above? Thanks for any help.
Here is a model to achieve your stated requirement.
Link to Time Series Data Model
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
Normalised to 5NF; no duplicate columns; no Update Anomalies, no Nulls.
When the Status of a Product changes, simply insert a row into ProductStatus, with the current DateTime. No need to touch previous rows (which were true, and remain true). No dummy values which report tools (other than your app) have to interpret.
The DateTime is the actual DateTime that the Product was placed in that Status; the "From", if you will. The "To" is easily derived: it is the DateTime of the next (DateTime > "From") row for the Product; where it does not exist, the value is the current DateTime (use ISNULL).
The first model is complete; (ProductId, DateTime) is enough to provide uniqueness, for the Primary Key. However, since you request speed for certain query conditions, we can enhance the model at the physical level, and provide:
An Index (we already have the PK Index, so we will enhance that first, before adding a second index) to support covered queries (those based on any arrangement of { ProductId | DateTime | Status } can be supplied by the Index, without having to go to the data rows). Which changes the Status::ProductStatus relation from Non-Identifying (broken line) to Identifying type (solid line).
The PK arrangement is chosen on the basis that most queries will be Time Series, based on Product⇢DateTime⇢Status.
The second index is supplied to enhance the speed of queries based on Status.
In the Alternate Arrangement, that is reversed; ie, we mostly want the current status of all Products.
In all renditions of ProductStatus, the DateTime column in the secondary Index (not the PK) is DESCending; the most recent is first up.
I have provided the discussion you requested. Of course, you need to experiment with a data set of reasonable size, and make your own decisions. If there is anything here that you do not understand, please ask, and I will expand.
Responses to Comments
Report all Products with Current State of 2
SELECT ProductId,
       Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId    -- Join
    AND   StatusCode  = 2               -- Request
    AND   DateTime    = (               -- Current Status on the left ...
        SELECT MAX(DateTime)            -- Current Status row for outer Product
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId
            )
ProductId is Indexed, leading col, both sides
DateTime is Indexed, 2nd col in Covered Query Option
StatusCode is Indexed, 3rd col in Covered Query Option
Since DateTime in the Index is DESCending, only one fetch is required to satisfy the inner query
the rows are required at the same time, for the one query; they are close together (due to the Clustered Index); almost always on the same page due to the short row size.
This is ordinary SQL, a subquery, using the power of the SQL engine, Relational set processing. It is the one correct method, there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.
Two Dates in ProductStatus
Columns such as DateTimeFrom and DateTimeTo are gross errors. Let's take it in order of importance.
It is a gross Normalisation error. "DateTimeTo" is easily derived from the single DateTime of the next row; it is therefore redundant, a duplicate column.
The precision does not come into it: that is easily resolved by virtue of the DataType (DATE, DATETIME, SMALLDATETIME). Whether you display one less second, microsecond, or nanosecond is a business decision; it has nothing to do with the data that is stored.
Implementing a DateTo column is a 100% duplicate (of DateTime of the next row). This takes twice the disk space. For a large table, that would be significant unnecessary waste.
Given that it is a short row, you will need twice as many logical and physical I/Os to read the table, on every access.
And twice as much cache space (or put another way, only half as many rows would fit into any given cache space).
By introducing a duplicate column, you have introduced the possibility of error (the value can now be derived two ways: from the duplicate DateTimeTo column or the DateTimeFrom of the next row).
This is also an Update Anomaly. Whenever a DateTimeFrom is updated, the DateTimeTo of the previous row has to be fetched (no big deal as it is close) and updated (a big deal, as it is an additional verb that can be avoided).
"Shorter" and "coding shortcuts" are irrelevant, SQL is a cumbersome data manipulation language, but SQL is all we have (Just Deal With It). Anyone who cannot code a subquery really should not be coding. Anyone who duplicates a column to ease minor coding "difficulty" really should not be modelling databases.
Note well, that if the highest order rule (Normalisation) was maintained, the entire set of lower order problems are eliminated.
Think in Terms of Sets
Anyone having "difficulty" or experiencing "pain" when writing simple SQL is crippled in performing their job function. Typically the developer is not thinking in terms of sets and the Relational Database is set-oriented model.
For the query above, we need the Current DateTime; since ProductStatus is a set of Product States in chronological order, we simply need the latest, or MAX(DateTime) of the set belonging to the Product.
Now let's look at something allegedly "difficult", in terms of sets. For a report of the duration that each Product has been in a particular State: the DateTimeFrom is an available column, and defines the horizontal cut-off, a sub set (we can exclude earlier rows); the DateTimeTo is the earliest of the sub set of Product States.
SELECT ProductId,
       Description,
       [DateFrom] = DateTime,
       [DateTo]   = (
        SELECT MIN(DateTime)                         -- earliest in subset
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId   -- our Product
            AND   ps_inner.DateTime > ps.DateTime    -- defines subset, cutoff
            )
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId
    AND   StatusCode  = 2                            -- Request
Thinking in terms of getting the next row is row-oriented, not set-oriented processing. Crippling, when working with a set-oriented database. Let the Optimiser do all that thinking for you. Check your SHOWPLAN, this optimises beautifully.
Inability to think in sets, thus being limited to writing only single-level queries, is not a reasonable justification for: implementing massive duplication and Update Anomalies in the database; wasting online resources and disk space; guaranteeing half the performance. Much cheaper to learn how to write simple SQL subqueries to obtain easily derived data.
"In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above?"
Well, there exists a 400-page book entitled "Temporal Data and the Relational Model" that addresses your problem.
That book also addresses numerous problems that the other responders have not addressed in their responses, for lack of time or for lack of space or for lack of knowledge.
The introduction of the book also explicitly states that "this book is not about technology that is (commercially) available to any user today".
All I can observe is that users wanting temporal features from SQL systems are, to put it plain and simple, left wanting.
PS
Even if those 400 pages could be "compressed a bit", I hope you don't expect me to give a summary of the entire meaningful content within a few paragraphs here on SO ...
tables similar to these:
product
-----------
product_id
status_id
name
status
-----------
status_id
name
product_history
---------------
product_id
status_id
status_time
then write a trigger on product to record the status and timestamp (sysdate) on each update where the status changes
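The "sysdate" above suggests Oracle; a MySQL version of that trigger might look roughly like this (using NOW() instead, and the table layout sketched above):

DELIMITER $$
CREATE TRIGGER product_status_history
AFTER UPDATE ON product
FOR EACH ROW
BEGIN
  -- record a history row only when the status actually changed
  IF NEW.status_id <> OLD.status_id THEN
    INSERT INTO product_history (product_id, status_id, status_time)
    VALUES (NEW.product_id, NEW.status_id, NOW());
  END IF;
END$$
DELIMITER ;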

			
				
Google "bi-temporal databases" and "slowly changing dimensions".
These are two names for essentially the same pattern.
You need to add two timestamp columns to your product table "VALID_FROM" and "VALID_TO".
When your product status changes, you add a NEW row with a "VALID_FROM" of now() or some other known effective date/time, and set the "VALID_TO" to 9999-12-31 23:59:59 or some other date ridiculously far into the future.
You also need to zap the "9999-12-31..." date on the previously current row to the current "VALID_FROM" time - 1 microsecond.
You can then easily query the product status at any given time.
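A rough sketch of that pattern (column names and the sample values are illustrative; note that with many rows per product, product_id alone can no longer be the primary key, but something like (product_id, VALID_FROM) can be):

-- close out the currently open row (new VALID_FROM minus 1 second, or microsecond,
-- depending on your column precision)
UPDATE product
SET    VALID_TO = '2014-06-01 11:59:59'
WHERE  product_id = 42
AND    VALID_TO = '9999-12-31 23:59:59';

-- insert the new current row
INSERT INTO product (product_id, status_id, VALID_FROM, VALID_TO)
VALUES (42, 2, '2014-06-01 12:00:00', '9999-12-31 23:59:59');

-- "what was the status of product 42 at time T?"
SELECT status_id
FROM   product
WHERE  product_id = 42
AND    '2014-06-01 12:30:00' BETWEEN VALID_FROM AND VALID_TO;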

Critique my MySQL Database Design for Unlimited DYNAMIC Fields

Looking for a scalable, flexible and fast database design for a 'Build your own form' style website - e.g. Wufoo.
Rules:
User has only 1 Form they can build
User can create their own fields or choose from 'standard' fields
User's 1 Form has as many fields as the user wants
Values can be the sibling of another value, e.g. a photo value could have name, location, width, and height as sibling values
Special Rules:
User can submit their form a maximum of 5 times a day
Value's Date is important
Flexibility to report on values (for single user, across all users, 1 field, many fields) is very important -- data visualization (most will be chronologically based e.g. all photos for July 2009 for all users).
Table "users"
uid
Table "field_user" - assign a field to a users form
fid
uid
weight - int - used to order the fields on the users form
Table "fields"
fid
creator_uid - int - the field 'creator'
label - varchar - e.g. Email
value_type - varchar - used to determine what field in the 'values' table will be filled in (e.g. if 'int' then values of this field will submit data into the values.type_int field - and all other .type_x fields will be NULL).
field_type - varchar - e.g. 'email' - used for special conditions e.g. validation rules
Table "values"
vid
parent_vid
fid
uid
date - date
date_group - int - value 1-5 (user may submit max of 5 forms per day)
type_varchar - varchar
type_text - text
type_int - int
type_float - float
type_bool - bool
type_date - date
type_timestamp - timestamp
I understand that this approach will mean records in the 'Value' table will only have 1 piece of data, with the other .type_x fields containing NULLs... but from my understanding this design will be the 'fastest' solution (fewer queries, fewer join tables)
At OSCON yesterday, Josh Berkus gave a good tutorial on DB design, and he spent a good fraction of it mercilessly tearing into such "EAV"il tables; you should be able to find his slides on the OSCON site soon, and eventually the audio recording of his whole tutorial online (the latter will probably take a while).
You'll need a join per attribute (multiple instances of the values table, one per attribute you're fetching or updating), so I don't know what you mean by "fewer join tables". Joining many instances of the same table isn't a particularly fast operation, and your design makes indices nearly unfeasible and unusable.
At least as a minor improvement, use separate per-type tables for your attributes' values (maybe some indexing might be applicable in that case, though with MySQL's limitation of one index per query per table, even that is somewhat dubious).
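A sketch of that per-type split, reusing the question's column names (the index choices are only a guess at the reporting queries described above):

CREATE TABLE values_int (
  vid        INT AUTO_INCREMENT PRIMARY KEY,
  parent_vid INT NULL,
  fid        INT NOT NULL,
  uid        INT NOT NULL,
  `date`     DATE NOT NULL,
  date_group TINYINT NOT NULL,
  value      INT NOT NULL,
  KEY (fid, value),    -- "all values of field X in some range"
  KEY (uid, `date`)    -- "all of user Z's values for July 2009"
);

-- ... and likewise values_varchar, values_float, values_date, and so on,
-- each with a value column of the matching type.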
You should really look into schema-free DBs like CouchDB; problems like this are exactly the kind these types of DBs are designed to solve.
Y'know, CREATE TABLE, ALTER TABLE, adding a column, etc. are operations you can do at run time in many modern RDBMS implementations. Why be EAVil? Especially if you are using dynamic SQL.
It's not for the fainthearted. I recall an implementation at Boeing which resulted in 70,000 tables in a database.
Obviously there are pitfalls in dynamic table creation, but they also exist for EAV tables. Things like two attributes for the same key expressing the same fact. Or transitive dependencies and other normalization gotchas. So why not at least leverage the power of the RDBMS on your behalf?
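A sketch of what that run-time DDL could look like here (the table name is hypothetical, e.g. one table per user's form):

-- the application issues ordinary DDL as the user adds fields to their form
CREATE TABLE form_data_user_17 (
  submission_id INT AUTO_INCREMENT PRIMARY KEY,
  submitted_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

ALTER TABLE form_data_user_17 ADD COLUMN email VARCHAR(255);
ALTER TABLE form_data_user_17 ADD COLUMN photo_width INT;

-- values are then real, typed, indexable columns rather than EAV rows
SELECT email, photo_width
FROM   form_data_user_17
WHERE  submitted_at >= '2009-07-01';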
I agree with john owen.
Dynamically creating a query from the schema is a small price to pay compared to querying EAV tables. Especially if the tables are large.
Usually table columns are considered an "interface". A design that relies on a dynamically changing interface is usually bad, but EAV data is a special case where you don't have many options. You have to choose between slow unintuitive queries or dynamic schema.