SQL (MySQL) optimization via BOOLEAN values

I am working on a side project that is quite an undertaking; my question concerns the efficiency gained by using a BOOLEAN value to determine whether or not further data processing is required.
For example: suppose I had a table that listed all creatures, and another, relational table that listed each creature's hibernation period and the calories it consumes each day during hibernation.
Would it be efficient to have a BOOLEAN "hibernates" column inside the Creatures table?
If true, then go to the "hibernation_creature_info_relations" table, find the creature with that ID, and return that information.
That way, for all the creatures whose "hibernates" value is false, SQL is spared from having to search through the large "hibernation_creature_info_relations" table.
Or is an ID lookup in the "hibernation_creature_info_relations" table so fast that branching on whether "hibernates" is true or false would actually have a larger impact on performance?
I hope this is enough information for you to understand what I am asking; if not, please let me know so I can rephrase or include more details.

No, that is not a good way to do things.
Use a normal field that can be null instead.
Example
table creatures
---------------
id   name   info_id
1    dino   null
2    dog    1
3    cat    2

table info
---------------
id   info_text
1    dogs bark
2    cats miauw
Now you can just do a join:
SELECT c.name, i.info_text
FROM creatures c
LEFT JOIN info i ON (c.info_id = i.id)
If you do it like this, SQL can use an index.
No SQL database will benefit from an index on a boolean field.
The cardinality of that field is too low and using indexes on low cardinality fields slows things down instead of speeding things up.
See: MySQL: low cardinality/selectivity columns = how to index?
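For concreteness, a minimal sketch of that schema (the column types are illustrative assumptions):

CREATE TABLE info (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    info_text VARCHAR(255) NOT NULL
);

CREATE TABLE creatures (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    info_id INT NULL,  -- NULL means "no extra info", replacing the boolean flag
    FOREIGN KEY (info_id) REFERENCES info (id)
);

With InnoDB, the foreign key ensures there is an index on creatures.info_id, which is what makes the LEFT JOIN cheap.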

If you want to use the "hibernates" column only to prevent SQL from having to search through the other table, then you should follow @Johan's advice; otherwise you can create an index on the "hibernates" column and it will improve the execution time. But keep in mind what @Johan is trying to tell you.
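For reference, that index would be created like this (a sketch; as the answer above explains, its low cardinality limits how much it can help):

CREATE INDEX ix_creatures_hibernates ON creatures (hibernates);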

Related

SQL - does database extract repeating joined data multiple times or just once?

This is a performance question. In a query joining another table (the other acting as a dictionary), where the joined data repeats because a foreign key value is repeated in many records of the base table, will the database engine extract the repeating data multiple times (I mean by that not the presented output, but actually accessing and searching the table again and again), or is it smart enough to somehow cache the results and extract everything just once? I am using MySQL.
I mean a situation like this:
SELECT *
FROM Tasks
JOIN People
ON Tasks.personID = People.ID;
Let's assume the People table consists of:
ID | Name
1 | John
2 | Mary
And Tasks:
ID | personID
1 | 1
2 | 1
3 | 2
Will "John" data be physically extracted twice or once? Is it worth trying to avoid such queries?
John will show up twice in the result set.
However, if I interpret your question right, this is not about the result set itself, but about how the data is internally read to produce it.
In this case you have a join between two tables. In a join between two tables there's a "driving table" that's read first, and then the "secondary table" that is accessed once per each row of the driving table.
Now:
If MySQL chooses Tasks as the driving table, then the row John from the People will be accessed twice (because it will be in the secondary table).
If MySQL chooses People as the driving table, then naturally the row John will be accessed only once.
So, which option will MySQL pick? Get the execution plan and you'll find out. The table that shows up first in the plan is the driving table; the other is the secondary table. Mind that the execution plan may change in the future without notice.
Note: accessing doesn't mean to perform physical I/O on the disk. Once the row is read, it becomes "hot" and it's usually cached for some time; any repeated access will probably end up reading from the cache and won't cause more physical I/O.
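For reference, you can see the chosen join order with EXPLAIN; using the tables above, the first table listed in the output is the driving table:

EXPLAIN
SELECT *
FROM Tasks
JOIN People
ON Tasks.personID = People.ID;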
The answer to your question is that it repeats the data. The string values are not cached or reduced to one per distinct value.
In general, this isn't a problem because you would run queries that have small result sets by selecting a limited subset of data.
But if you don't limit the query, it would produce a large result set, potentially with strings repeated.
MySQL takes the Tasks table and, for every row, adds the row(s) from People that fit. It has to gather every People row that belongs to each row of the Tasks table, so for a second row with the same personID it would grab the same data again.
This is usually not a problem, because you would put the join columns in an index, and so it would find them quickly.

Is there a way to add an attribute to only 1 row in SQL?

Take this table as an example :
CREATE TABLE UserServices (
ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
Service1 TEXT,
Service2 TEXT,
.
.
.
) ENGINE = MYISAM;
Every user will have a different number of services, so let's say the table starts with 10 service columns for each user. If one user has 11 services, must all other users also have 11 columns? Of course it is a table and every row needs to have the same number of columns, but it just seems like an awful waste of memory. Maybe another type of database would be better?
Thank you!!
Storing a boatload of NULLs isn't really a "waste of memory", because the space is negligible - hard disks cost pence per gigabyte, while programmers cost tens or hundreds of $/hr - so it's certainly economical to burn the space, and it's not a great argument for avoidance.
There is a better argument though, as others have said: databases don't do variable numbers of columns for a particular ID in a table, but they DO do variable numbers of rows per ID. This is how DBs are designed: columns are fixed, rows are variable. Everything a database does and offers in terms of querying, storage, retrieval, internal design etc. is optimised towards this pattern.
There are well-established operations (called pivots) that will turn your vertical arrangement of data into a horizontal one (with nulls) at query time, so you don't have to store the data horizontally.
Here's a pivot example:
Table:
ID, ServiceIdentifier, ServiceOwner
1, SV1, John
1, SV2, Sarah
2, SV1, Phil
2, SV2, John
2, SV3, Joe
3, SV2, Mark
SELECT
  ID,
  MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) as SV1_Owner,
  MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) as SV2_Owner,
  MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) as SV3_Owner
FROM
  `Table`
GROUP BY
  ID
Result:
ID  SV1_Owner  SV2_Owner  SV3_Owner
1   John       Sarah      NULL
2   Phil       John       Joe
3   NULL       Mark       NULL
As noted, it's not a huge cost to just store the data horizontally, and if you're sure the table will never change or need new columns added on a weekly basis to cope with new services etc., then it might be a sensible developer optimisation to just have columns full of nulls. If you'll add columns regularly, or will one day have thousands of services, then vertical storage is the way it will have to go.
To expand a little on what's already been said:
Is there a way to add an attribute to only 1 row in SQL?
No, and that's kinda fundamental to how relational databases (SQL) work - and that's in any dialect of SQL, whether it's MySQL, T-SQL, etc. If you have a table and you want to add an attribute to that table, it's going to be another column, and that column will be there for every row. Not just relational databases - that's just how tables work.
But, that's not how anyone would do it. What you would do is what Alan suggested - a separate table for Services, then a 3rd table (he suggested naming it 'UserServices') that links the two. And that's not a one-off suggestion - that's pretty much "the" way to do it. There's no waste.
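A minimal sketch of that three-table design (names and types are illustrative assumptions; InnoDB is used so the foreign keys are enforced):

CREATE TABLE Users (
    ID   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
) ENGINE = InnoDB;

CREATE TABLE Services (
    ID   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
) ENGINE = InnoDB;

CREATE TABLE UserServices (
    UserID    BIGINT UNSIGNED NOT NULL,
    ServiceID BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (UserID, ServiceID),
    FOREIGN KEY (UserID) REFERENCES Users (ID),
    FOREIGN KEY (ServiceID) REFERENCES Services (ID)
) ENGINE = InnoDB;

A user with 11 services simply has 11 rows in UserServices; nobody else's rows change.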
Maybe the use of another database type is better?
Possibly, if you want something with fewer restrictions you could go with something other than SQL. Since SQL is so dominant, everything else is usually categorized as NoSQL. Mongo is the most popular NoSQL database currently, which is why RC brought it up.

Database indices for log table

I need to create a log table in a database. Each log entry will have various parameters. This is the table design I've come up with.
Table Log
============
ID(INT)
Action(INT)
Created(TIMESTAMP)
Table Parameter
================
ID(INT)
ActionID(INT)
Title(VARCHAR(50))
VALUE(VARCHAR(300))
My question is, I need to perform complex queries on this log table, such as:
Who made what action, and when? "Who" means I need to check the parameters for title = "personID", and "what" means I need to check the action code.
Total of prices of sale actions. For this one, I need to check the action code for "sale" (say 3, as an integer), then retrieve the parameters of those actions with title "price", cast them to double, and sum them up.
These queries will expand, and I need to ensure that my design can answer more complex ones. I know that indices give great boosts, but I am not very sure which fields need them to get the most out of the database engine. The underlying database is MySQL.
Which fields should I have indices on? Or is my design adequate for my purpose? I am also using the InnoDB engine.
Edit
Sample SQL
SELECT LOG.ID, LOG.ACTION, LOG.CREATED FROM LOG, PARAMETER
WHERE PARAMETER.ACTIONID = LOG.ID AND
PARAMETER.TITLE = 'PERSONNAME' AND
PARAMETER.VALUE = 'JOHN' AND
PARAMETER.TITLE = 'PRICE' AND
(CAST(PARAMETER.VALUE AS DOUBLE) > 30.0) AND
PARAMETER.TITLE = 'DATE' AND
(CAST(PARAMETER.VALUE AS DATETIME) > (NOW() - INTERVAL 1 DAY))
The more conditions we have, the more Parameter.TITLE and Parameter.VALUE pairs will be used in the SQL.
You need to JOIN to PARAMETER multiple times, once per key-value pair that you are testing for. There are many examples; follow the Entity-Attribute-Value tag I added.
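A sketch of how that looks for the sample above (assuming Parameter.ActionID references Log.ID, and joining once per key-value pair):

SELECT l.ID, l.Action, l.Created
FROM Log l
JOIN Parameter p1 ON p1.ActionID = l.ID
                 AND p1.Title = 'personName'
                 AND p1.Value = 'John'
JOIN Parameter p2 ON p2.ActionID = l.ID
                 AND p2.Title = 'price'
                 AND CAST(p2.Value AS DECIMAL(10,2)) > 30.0;

Each extra condition becomes one more join to Parameter, which is exactly why EAV queries grow painful.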
Since this is an EAV problem, you will have troubles with performance.
One improvement can be found in the changes I advocate for WP here
Another discussion is here on why not to hide the things you want to filter on in the K-V table. And an alternative.

Can I optimize such a MySQL query without using an index?

A simplified version of my MySQL db looks like this:
Table books (ENGINE=MyISAM)
id <- KEY
publisher <- LONGTEXT
publisher_id <- INT <- This is a new field that is currently null for all records
Table publishers (ENGINE=MyISAM)
id <- KEY
name <- LONGTEXT
Currently books.publisher holds values that keep getting repeated, but that publishers.name holds uniquely.
I want to get rid of books.publisher and instead populate the books.publisher_id field.
The straightforward SQL code that describes what I want done, is as follows:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id;
The problem is that I have a big number of records, and even though it works, it's taking forever.
Is there a faster solution than using something like this in advance?:
CREATE INDEX publisher ON books (publisher(20));
Your question title says ".. optimize ... query without using an index?"
What have you got against using an index?
You should always examine the execution plan if a query is running slowly. I would guess it's having to scan the publishers table for each row in order to find a match. It would make sense to have an index on publishers.name to speed the lookup of an id.
You can drop the index later, but it wouldn't harm to leave it in, since you say the process will have to run for a while until other changes are made. I imagine the publishers table doesn't get updated very frequently, so performance of INSERT and UPDATE on the table should not be an issue.
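That index would look something like this (the prefix length is required because name is a LONGTEXT; 50 is an assumption about how long publisher names get):

CREATE INDEX ix_publishers_name ON publishers (name(50));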
There are a few problems here that might be helped by optimization.
First of all, a few thousand rows doesn't count as "big" ... that's "medium."
Second, in MySQL saying "I want to do this without indexes" is like saying "I want to drive my car to New York City, but my tires are flat and I don't want to pump them up. What's the best route to New York if I'm driving on my rims?"
Third, you're using a LONGTEXT item for your publisher. Is there some reason not to use a fully indexable datatype like VARCHAR(200)? If you do that your WHERE statement will run faster, index or none. Large scale library catalog systems limit the length of the publisher field, so your system can too.
Fourth, from one of your comments this looks like a routine data maintenance update, not a one time conversion. So you need to figure out how to avoid repeating the whole deal over and over. I am guessing here, but it looks like newly inserted rows in your books table have a publisher_id of zero, and your query updates that column to a valid value.
So here's what to do. First, put an index on books.publisher_id.
Second, run this variant of your maintenance query:
UPDATE books
SET publisher_id = (
    SELECT publishers.id
    FROM publishers
    WHERE publishers.name = books.publisher
)
WHERE publisher_id = 0
  AND publisher IN (SELECT name FROM publishers)
LIMIT 100;
This will limit your update to rows that haven't yet been updated, and will update at most 100 rows at a time. (MySQL does not allow LIMIT on a multi-table UPDATE, hence the subquery form.) In your weekly data-maintenance job, re-issue this query until MySQL announces that it affected zero rows (look at mysqli::$affected_rows or the equivalent in your PHP-to-MySQL interface). That's a great way to monitor database update progress and keep your update operations from getting out of hand.
The way to get your update to run faster is to add a WHERE clause so that you are only updating the necessary records.

design database relating to time attribute

I want to design a database which is described as follows:
Each product has only one status at any one point in time. However, the status of a product can change during its lifetime. How can I design the relationship between product and status so that it is easy to query all products having a specific status at the current time? In addition, could anyone please give me some in-depth details about designing databases that deal with time durations, as in the problem above? Thanks for any help.
Here is a model to achieve your stated requirement.
Link to Time Series Data Model
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
Normalised to 5NF; no duplicate columns; no Update Anomalies, no Nulls.
When the Status of a Product changes, simply insert a row into ProductStatus, with the current DateTime. No need to touch previous rows (which were true, and remain true). No dummy values which report tools (other than your app) have to interpret.
The DateTime is the actual DateTime that the Product was placed in that Status; the "From", if you will. The "To" is easily derived: it is the DateTime of the next (DateTime > "From") row for the Product; where it does not exist, the value is the current DateTime (use ISNULL).
The first model is complete; (ProductId, DateTime) is enough to provide uniqueness, for the Primary Key. However, since you request speed for certain query conditions, we can enhance the model at the physical level, and provide:
An Index (we already have the PK Index, so we will enhance that first, before adding a second index) to support covered queries (those based on any arrangement of { ProductId | DateTime | Status } can be supplied by the Index, without having to go to the data rows). Which changes the Status::ProductStatus relation from Non-Identifying (broken line) to Identifying type (solid line).
The PK arrangement is chosen on the basis that most queries will be Time Series, based on Product⇢DateTime⇢Status.
The second index is supplied to enhance the speed of queries based on Status.
In the Alternate Arrangement, that is reversed; ie, we mostly want the current status of all Products.
In all renditions of ProductStatus, the DateTime column in the secondary Index (not the PK) is DESCending; the most recent is first up.
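As an illustration in MySQL terms (a sketch under stated assumptions: the names are invented, the Product table exists, and descending index keys require MySQL 8.0+):

CREATE TABLE ProductStatus (
    ProductId  INT      NOT NULL,
    DateTime   DATETIME NOT NULL,
    StatusCode INT      NOT NULL,
    PRIMARY KEY (ProductId, DateTime),
    KEY ix_productstatus_status (StatusCode, DateTime DESC),
    FOREIGN KEY (ProductId) REFERENCES Product (ProductId)
) ENGINE = InnoDB;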
I have provided the discussion you requested. Of course, you need to experiment with a data set of reasonable size, and make your own decisions. If there is anything here that you do not understand, please ask, and I will expand.
Responses to Comments
Report all Products with Current State of 2
SELECT ProductId,
Description
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId -- Join
AND StatusCode = 2 -- Request
AND DateTime = ( -- Current Status on the left ...
SELECT MAX(DateTime) -- Current Status row for outer Product
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId
)
ProductId is Indexed, leading col, both sides
DateTime is Indexed, 2nd col in Covered Query Option
StatusCode is Indexed, 3rd col in Covered Query Option
Since DateTime in the Index is DESCending, only one fetch is required to satisfy the inner query
The rows are required at the same time, for the one query; they are close together (due to the Clustered Index); almost always on the same page due to the short row size.
This is ordinary SQL, a subquery, using the power of the SQL engine, Relational set processing. It is the one correct method, there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.
Two Dates in ProductStatus
Columns such as DateTimeFrom and DateTimeTo are gross errors. Let's take it in order of importance.
It is a gross Normalisation error. "DateTimeTo" is easily derived from the single DateTime of the next row; it is therefore redundant, a duplicate column.
The precision does not come into it: that is easily resolved by virtue of the DataType (DATE, DATETIME, SMALLDATETIME). Whether you display one less second, microsecond, or nanosecond is a business decision; it has nothing to do with the data that is stored.
Implementing a DateTimeTo column is a 100% duplication (of the DateTime of the next row). It takes twice the disk space. For a large table, that would be significant unnecessary waste.
Given that it is a short row, you will need twice as many logical and physical I/Os to read the table, on every access.
And twice as much cache space (or put another way, only half as many rows would fit into any given cache space).
By introducing a duplicate column, you have introduced the possibility of error (the value can now be derived two ways: from the duplicate DateTimeTo column or the DateTimeFrom of the next row).
This is also an Update Anomaly. When any DateTimeFrom is updated, the DateTimeTo of the previous row has to be fetched (no big deal, as it is close) and updated (a big deal, as it is an additional verb that could have been avoided).
"Shorter" and "coding shortcuts" are irrelevant, SQL is a cumbersome data manipulation language, but SQL is all we have (Just Deal With It). Anyone who cannot code a subquery really should not be coding. Anyone who duplicates a column to ease minor coding "difficulty" really should not be modelling databases.
Note well, that if the highest order rule (Normalisation) was maintained, the entire set of lower order problems are eliminated.
Think in Terms of Sets
Anyone having "difficulty" or experiencing "pain" when writing simple SQL is crippled in performing their job function. Typically the developer is not thinking in terms of sets and the Relational Database is set-oriented model.
For the query above, we need the Current DateTime; since ProductStatus is a set of Product States in chronological order, we simply need the latest, or MAX(DateTime) of the set belonging to the Product.
Now let's look at something allegedly "difficult" in terms of sets. For a report of the duration that each Product has been in a particular State: the DateTimeFrom is an available column, and defines the horizontal cut-off, a subset (we can exclude earlier rows); the DateTimeTo is the earliest DateTime of that subset of Product States.
SELECT ProductId,
Description,
[DateFrom] = DateTime,
[DateTo] = (
SELECT MIN(DateTime) -- earliest in subset
FROM ProductStatus ps_inner
WHERE p.ProductId = ps_inner.ProductId -- our Product
AND ps_inner.DateTime > ps.DateTime -- defines subset, cutoff
)
FROM Product p,
ProductStatus ps
WHERE p.ProductId = ps.ProductId
AND StatusCode = 2 -- Request
Thinking in terms of getting the next row is row-oriented, not set-oriented processing. Crippling, when working with a set-oriented database. Let the Optimiser do all that thinking for you. Check your SHOWPLAN, this optimises beautifully.
Inability to think in sets, thus being limited to writing only single-level queries, is not a reasonable justification for: implementing massive duplication and Update Anomalies in the database; wasting online resources and disk space; guaranteeing half the performance. Much cheaper to learn how to write simple SQL subqueries to obtain easily derived data.
"In addition, could anyone please give me some in-depth details about design database which related to time duration as problem above?"
Well, there exists a 400-page book entitled "Temporal Data and the Relational Model" that addresses your problem.
That book also addresses numerous problems that the other responders have not addressed in their responses, for lack of time or for lack of space or for lack of knowledge.
The introduction of the book also explicitly states that "this book is not about technology that is (commercially) available to any user today".
All I can observe is that users wanting temporal features from SQL systems are, to put it plain and simple, left wanting.
PS
Even if those 400 pages could be "compressed a bit", I hope you don't expect me to give a summary of the entire meaningful content within a few paragraphs here on SO ...
Use tables similar to these:
product
-----------
product_id
status_id
name
status
-----------
status_id
name
product_history
---------------
product_id
status_id
status_time
then write a trigger on product to record the status and timestamp (sysdate) on each update where the status changes
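A sketch of such a trigger in MySQL (names follow the tables above; NOW() stands in for sysdate):

DELIMITER //
CREATE TRIGGER product_status_history
AFTER UPDATE ON product
FOR EACH ROW
BEGIN
    -- record a history row only when the status actually changes
    IF NEW.status_id <> OLD.status_id THEN
        INSERT INTO product_history (product_id, status_id, status_time)
        VALUES (NEW.product_id, NEW.status_id, NOW());
    END IF;
END//
DELIMITER ;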

Google "bi-temporal databases" and "slowly changing dimensions".
These are two names for essentially the same pattern.
You need to add two timestamp columns to your product table "VALID_FROM" and "VALID_TO".
When your product status changes, you add a NEW row with a "VALID_FROM" of now() (or some other known effective date/time) and set the "VALID_TO" to 9999-12-31 23:59:59, or some other date ridiculously far into the future.
You also need to zap the "9999-12-31..." date on the previously current row to the new "VALID_FROM" time minus 1 microsecond.
You can then easily query the product status at any given time.
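For instance, an as-of query might look like this (a sketch; the product_id and status_id column names are assumptions based on the description above):

SELECT product_id, status_id
FROM product
WHERE VALID_FROM <= '2020-06-01 12:00:00'
  AND VALID_TO   >= '2020-06-01 12:00:00';

Because each row's VALID_TO is the next row's VALID_FROM minus 1 microsecond, exactly one row per product matches any given point in time.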