I am new to MySQL. What I would like to do is create a new table that is a copy of an original table, with one extra column holding the value used in a specific condition. I mean:
Let table be a sequence of given points (x, y). I want to create the table temp containing (x, y, r), where r = x*x + y*y and only rows with r < 1 are kept. But what I did is
CREATE TABLE temp LIKE table;
ALTER TABLE temp ADD r FLOAT;
INSERT INTO temp (x,y) SELECT * FROM table WHERE x*x+y*y<1;
UPDATE temp SET r=x*x+y*y;
It is OK, it gives what I want, but my database is much bigger than this simple example, and here I calculate the radius r twice. That is not great for optimization.
Is there a way to pass the clause into the new column directly?
Thanks in advance.
You should (almost) never store calculated data in a database. It ends up creating maintenance and application nightmares when the calculated values end up out of sync with the values from which they are calculated.
At this point you're probably saying to yourself, "Well, I'll do a really good job keeping them in sync." It doesn't matter, because down the road at some point, for whatever reason, they will get out of sync.
Luckily, SQL provides a nice mechanism to handle what you want - views.
CREATE VIEW temp
AS
SELECT
x,
y,
x*x + y*y AS r
FROM My_Table
WHERE
x*x + y*y < 1
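Once the view exists, you query it just like a table; for example (assuming My_Table as above; the 0.5 threshold is just an illustration):
SELECT x, y, r
FROM temp
WHERE r < 0.5
ORDER BY r;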
You don't actually need to worry about doing the calculation twice; the real overhead is in doing a separate insert and update. So, you should do the calculation as part of the insert.
MySQL extends the use of the having clause, so this is easy:
CREATE TABLE temp LIKE table;
ALTER TABLE temp ADD r FLOAT;
INSERT INTO temp(x, y, r)
SELECT x, y, x*x+y*y as r
FROM table
HAVING r < 1;
It is quite possible that an additional table is not actually necessary, but it depends on how you are using the data. For instance, if you have rather complicated processing and are referring to temp multiple times and temp is rather smaller than the original data, then this could be a useful optimization.
Also, materializing the calculation in a table not only saves time (when the calculation is expensive, which this isn't), but it also allows building an index on the computed value -- something you cannot otherwise do in MySQL.
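For instance, once r is materialized in temp, an ordinary index on it becomes possible (a sketch; the index name is made up):
CREATE INDEX idx_temp_r ON temp (r);
SELECT x, y FROM temp WHERE r < 0.25;  -- this filter can now use the index on r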
Personally, my preference is for more complicated queries rather than a profusion of temporary tables. As with many extremes, the best solution often lies somewhere in the middle (well, not exactly the middle, but temporary tables aren't all bad).
Rather than:
INSERT INTO temp (x,y) SELECT * FROM table WHERE x*x+y*y<1;
UPDATE temp SET r=x*x+y*y;
Try this:
INSERT INTO temp (x,y,r)
SELECT x,
y,
x*x+y*y AS r
FROM table
WHERE x*x+y*y<1;
We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially with a "snowflake" design, it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these performs very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
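For instance, if a typical query filters on one id plus metric and a time range (this query shape, and the values in it, are assumptions on my part), the natural candidate is a composite index covering those columns:
SELECT device, browser, os
FROM browser_data
WHERE id_1 = 42
  AND metric = 'page_views'
  AND timestamp BETWEEN 1609459200 AND 1612137600;

ALTER TABLE browser_data ADD INDEX idx_id1_metric_ts (id_1, metric, timestamp);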
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such values is also inefficient on space: you replace a 4-byte value with a 4-byte id plus a join.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, and so on. (See "Nested Loop Join".)
For example (building on the code above), when two columns are normalized out and you are testing on their values, you do not have the two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
1. Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x).
2. Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
3. Only now do you get to b (using b_id) to check for y=345 and toss the unnecessary rows you have been hauling around. b needs INDEX(b_id).
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: if you also have LIMIT 10 and the sort is avoided, then only 10 rows need to be looked at, not the entire set matched by the WHERE.
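A hedged sketch of the single-table case described above (the names follow the example; the index lets one access path satisfy both the filter and the ordering):
ALTER TABLE main ADD INDEX idx_x_y (x, y);

-- Filters on x, orders within it by y, and can stop after 10 rows with no filesort:
SELECT x, y
FROM main
WHERE x = 123
ORDER BY y
LIMIT 10;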
Complete noob alert! I need to store a largish set of data fields (480) for each of many devices i am measuring. Each field is a Decimal(8,5). First, is this an unreasonably large table? I have no experience really, so if it is unmanageable, I might start thinking of an alternative storage method.
Right now, I am creating a new row using INSERT, then trying to put the 480 data values in to the new row using UPDATE (in a loop). Currently each UPDATE is overwriting the entire column. How do I specify only to modify the last row? For example, with a table ("magnitude") having columns "id", "field1", "field2",...:
UPDATE magnitude SET field1 = 3.14; this modifies the entire "field1" column.
Was trying to do something like:
UPDATE magnitude SET field1 = 3.14 WHERE id = MAX(id)
Obviously I am a complete noob. Just trying to get this one thing working and move on... Did look around a lot but can't find a solution. Any help appreciated.
Instead of inserting a row and then updating it with values, you should insert an entire row, with populated values, at once, using the insert command.
I.e.
insert into tTable (column1, column2, ..., column n) values (datum1, datum2, ..., datum n)
Your table's definition should have the ID column declared as auto-increment (AUTO_INCREMENT in MySQL), which means it will be filled in for you automatically when you insert, i.e. you don't need to specify it.
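A minimal sketch of what that looks like in MySQL (only the first few of the 480 fields are shown; the names follow the question and the inserted values are placeholders):
CREATE TABLE magnitude (
  id INT AUTO_INCREMENT PRIMARY KEY,
  field1 DECIMAL(8,5),
  field2 DECIMAL(8,5)
  -- ... field3 through field480 ...
);

INSERT INTO magnitude (field1, field2) VALUES (3.14000, 2.71828);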
Re: appropriateness of the schema, I think 480 is a large number of columns. However, this is a straightforward enough example that you could try it and determine empirically if your system is able to give you the performance you need.
If I were doing this myself, I would go for a different solution that has many rows instead of many columns:
CREATE TABLE tDevice (ID INT, Name VARCHAR(100));   -- pick a Name length that suits your data
CREATE TABLE tData (ID INT, Device_ID INT, Value DECIMAL(8,5));
-- With a foreign key on Device_ID back to tDevice.ID
Then, to populate:
Insert all your devices in tDevice
Insert one row into tData for every Device / Data combination
-- i.e. 480 x n rows, n being the number of devices
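A hedged sketch of that population step (the device name and values are placeholders; in practice the 480 rows per device would come from your measurement loop or a multi-row INSERT):
INSERT INTO tDevice (ID, Name) VALUES (1, 'sensor-01');

-- one row per measurement, 480 of these per device:
INSERT INTO tData (ID, Device_ID, Value) VALUES (1, 1, 3.14159);
INSERT INTO tData (ID, Device_ID, Value) VALUES (2, 1, 2.71828);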
Then, you can query the data you want like so:
select * from tData join tDevice on tDevice.ID = tData.Device_ID
I need to calculate returns at different frequencies. In order to do so, I would like to be able to lag the values in a column by k units. While I have found different specific solutions, I have not been able to make a general stored procedure (most likely due to my inexperience with mysql). How could I best do this?
I have a table with multiple columns, amongst which columns containing info on:
ID
Date
Price
The end result should be a table with all the original columns, plus a column containing the lagged values of Price.
To keep the procedure general, I could imagine the procedure would take the table name, necessary column names (e.g. ID, Date, Price), and number of lags k as input, and append a column to the table.
You can do what you want with a correlated subquery. Here is an example:
select t.*,
(select t2.price
from <tablename> t2
where t2.date < t.date
order by t2.date desc
limit 1 offset 0 -- use offset k-1 for a lag of k
) as price_lag_1
from <tablename> t;
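As an aside, if you are on MySQL 8.0 or later, a window function expresses the same thing more directly (a sketch using the same column names; add PARTITION BY ID if the lag should be computed per series):
select t.*,
       lag(price, 1) over (order by date) as price_lag_1  -- lag(price, k) for a lag of k
from <tablename> t;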
Your desire to create a generic stored procedure is not very SQL-y. MySQL doesn't support table-valued functions, so you wouldn't be able to use the resulting table as an actual table.
If you want to put this in a stored procedure that is generic, you will need dynamic SQL to construct the SQL statement, using the particular table and columns that you pass in.
Instead, I would suggest that you simply learn how to express what you want as a query. If you have multiple tables with the same structure, then you may want to revisit your data model. Having multiple similar tables is often a sign of an entity being inappropriately spread across too many tables.
I have a table named demo:
create table demo (name varchar(10), mark1 int, mark2 int);
I need the total of mark1 and mark2 for each row many times.
select name, (mark1 + mark2) as total from demo;
Which I am told is not efficient. I am not allowed to add a new total column in the table.
Can I store such business logic in Index?
I created a view
CREATE VIEW view_total AS SELECT name, (mark1 + mark2) as 'total' from demo;
I populated the demo table with:
DELIMITER $$
CREATE PROCEDURE InsertRand(IN NumRows INT)
BEGIN
DECLARE i INT;
SET i = 1;
START TRANSACTION;
WHILE i <= NumRows DO
INSERT INTO demo VALUES (i,i+1,i+2);
SET i = i + 1;
END WHILE;
COMMIT;
END$$
DELIMITER ;
CALL InsertRand(100000);
The execution time of
select * from view_total;
and
select * from demo;
is the same, 10 ms, so I have not gained any benefit from the view. I tried to create an index over the view with:
create index demo_total_view on view_total (name, total);
which failed with the error:
ERROR 1347 (HY000): 'test.view_total' is not BASE TABLE
Any pointers on how I can avoid the redundant work of totaling the columns?
As a general rule, never store in a table what you can calculate on the way out of it. For instance, if you want age, store date of birth. If you want the sum of two columns, store those two columns and nothing else.
Maintaining the data-integrity, -quality and -consistency in your database should be your paramount concern. If there is the slightest chance that a third column, which is the sum of the first two, could be out-of-sync then it is not worth doing.
As you cannot maintain the column without embedding the calculation into all code that inserts data into the table (open to being forgotten in the future, and updates may break it) or firing a trigger every time you insert something (lots of additional work), you should not do this.
Your situation is a perfect use-case for views. You need to consistently calculate a column in the same way. If you let everyone calculate this as they wish then the same problems as with inserting the calculated column occur, you need to guarantee that this is always calculated the same way. The way to do this is to have a view on your table that pre-calculates the column in a standard way, that will be identical for every user.
Calculating a sum hundreds of times would be much costlier than reading it from somewhere... right?
Not necessarily; this depends entirely on your own situation. If you have slower disks then reading the data may easily be more expensive than calculating it, especially since it's an extremely simple calculation.
In all likelihood it will make no difference at all but if it is a major performance concern you should test both situations and decide whether the potential loss of data-quality and the additional overhead in maintaining the calculation in a table is worth the odd nano-second on extraction from the database.
Which I am told is not efficient.
By whom? Surely you should ask the person who made the statement to explain it - not us?
How is it not efficient? The only time it would affect performance significantly is where you could use an index on mark1 and/or mark2 - it won't be used for a query like:
SELECT *
FROM demo
WHERE mark1+mark2 > 200;
But with indexes on both values you can do this:
SELECT *
FROM demo
WHERE mark1+mark2 > 200
AND (mark1 > (200/2) OR mark2 > (200/2));
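The indexes referred to here would just be ordinary single-column indexes (the index names are made up):
CREATE INDEX idx_demo_mark1 ON demo (mark1);
CREATE INDEX idx_demo_mark2 ON demo (mark2);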
The overhead of adding the 2 columns together is negligible. You can prove this yourself by comparing the elapsed time of:
SELECT SQL_NO_CACHE mark1, mark2, name FROM demo;
and
SELECT SQL_NO_CACHE mark1+mark2, name FROM demo;
(Regarding your error - if you create the index on the table then the view will automatically detect and use it).
(MariaDB supports virtual columns which can be used to create a behaviour like Oracle's function-based indexes).
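A hedged sketch of that approach; generated columns are also available in MySQL 5.7+, though the exact keyword (STORED vs PERSISTENT) varies by product and version:
ALTER TABLE demo
    ADD COLUMN total INT AS (mark1 + mark2) STORED,
    ADD INDEX idx_demo_total (total);

SELECT name, total FROM demo WHERE total > 200;  -- can use idx_demo_total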
Bit of a newbie here. I'm currently working on a MySQL table that lists the details for different cars. I need a new field that is built up of the information from three other fields. So I have 'Acceleration', 'Speed' and 'Braking' which all contain double digit integers that are averaged out to another field I want to call 'Average'.
The logic being ('Acceleration' + 'Speed' + 'Braking') / 3
I can't seem to figure out the correct syntax to do this. I do specifically need this to be a field, as I need those values to show up in other queries. I know a SELECT query can get the result values I need, but how do I get those values into a permanent field on that table?
Thanks in advance for any help on this.
First, you'd need to alter the table schema to define the new column:
ALTER TABLE my_table ADD COLUMN Average FLOAT;
Next, update the table to set the values:
UPDATE my_table SET Average = (Acceleration + Speed + Braking) / 3;
Consider how to correctly set Average for newly inserted/updated data. Perhaps use triggers:
CREATE TRIGGER calc_average_ins BEFORE INSERT ON my_table FOR EACH ROW
SET NEW.Average = (NEW.Acceleration + NEW.Speed + NEW.Braking) / 3;
CREATE TRIGGER calc_average_upd BEFORE UPDATE ON my_table FOR EACH ROW
SET NEW.Average = (NEW.Acceleration + NEW.Speed + NEW.Braking) / 3;
You might want to consider instead introducing this column in a view, to create the averages as required, on-the-fly, and thereby preventing it from becoming desynchronised from the underlying data values (but note you no longer achieve the performance benefit of having the values cached):
CREATE VIEW my_view AS
SELECT *, (Acceleration + Speed + Braking) / 3 AS Average FROM my_table;
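Either way, downstream queries then read Average like any other column; with the view, for example (the threshold is just an illustration):
SELECT *
FROM my_view
WHERE Average > 50
ORDER BY Average DESC;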
Finally, note that your average has no physical meaning in the real world (what would be its units?): a more meaningful metric may or may not be more suitable to your needs.