I have a table named demo (in a database called test):
create table demo (name varchar(10), mark1 int, mark2 int);
I need the total of mark1 and mark2 for each row many times.
select name, (mark1 + mark2) as total from demo;
Which I am told is not efficient. I am not allowed to add a new total column in the table.
Can I store such business logic in an index?
I created a view
CREATE VIEW view_total AS SELECT name, (mark1 + mark2) AS total FROM demo;
I populated the demo table with:
DELIMITER $$
CREATE PROCEDURE InsertRand(IN NumRows INT)
BEGIN
    DECLARE i INT;
    SET i = 1;
    START TRANSACTION;
    WHILE i <= NumRows DO
        INSERT INTO demo VALUES (i, i+1, i+2);
        SET i = i + 1;
    END WHILE;
    COMMIT;
END$$
DELIMITER ;
CALL InsertRand(100000);
The execution time of
select * from view_total;
and
select * from demo;
is the same, about 10 ms, so I have not gained any benefit from the view. I tried to create an index over the view with:
create index demo_total_view on view_total (name, total);
which failed with the error:
ERROR 1347 (HY000): 'test.view_total' is not BASE TABLE
Any pointers on how I can avoid the redundant work of totaling the columns?
As a general rule, never store in a table what you can calculate on the way out of it. For instance, if you want age, store date of birth. If you want the sum of two columns, store those two columns, nothing else.
Maintaining the data-integrity, -quality and -consistency in your database should be your paramount concern. If there is the slightest chance that a third column, which is the sum of the first two, could be out-of-sync then it is not worth doing.
As you cannot maintain the column without either embedding the calculation into all code that inserts data into the table (easy to forget in the future, and updates may break it) or firing a trigger every time you insert something (lots of additional work), you should not do this.
Your situation is a perfect use-case for views. You need to consistently calculate a column in the same way. If you let everyone calculate it as they wish, the same problems as with storing the calculated column occur; you need to guarantee that it is always calculated the same way. The way to do this is to have a view on your table that pre-calculates the column in a standard way, identical for every user.
Calculating a sum hundreds of times would be much costlier than reading it from somewhere... right?
Not necessarily; this depends entirely on your own situation. If you have slower disks then reading the data may easily be more expensive than calculating it, especially since it's an extremely simple calculation.
In all likelihood it will make no difference at all, but if it is a major performance concern you should test both situations and decide whether the potential loss of data quality, and the additional overhead of maintaining the calculation in a table, is worth the odd nanosecond saved on extraction from the database.
Which I am told is not efficient.
By whom? Surely you should ask the person who made the statement to explain it - not us?
How is it not efficient? The only time it would affect performance significantly is where you could use an index on mark1 and/or mark2 - it won't be used for a query like:
SELECT *
FROM demo
WHERE mark1+mark2 > 200;
But with indexes on both values you can do this:
SELECT *
FROM demo
WHERE mark1+mark2 > 200
AND (mark1 > (200/2) OR mark2 > (200/2));
The overhead of adding the 2 columns together is negligible. You can prove this yourself by comparing the elapsed times of:
SELECT SQL_NO_CACHE mark1, mark2, name FROM demo;
and
SELECT SQL_NO_CACHE mark1+mark2, name FROM demo;
(Regarding your error - if you create the index on the table then the view will automatically detect and use it).
(MariaDB supports virtual columns which can be used to create a behaviour like Oracle's function-based indexes).
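For example, a minimal sketch of both options (the generated-column syntax assumes MySQL 5.7+ or MariaDB 10.2+; the index and column names are just examples):
CREATE INDEX demo_name_marks ON demo (name, mark1, mark2);   -- index on the base table; queries through the view can use it

ALTER TABLE demo ADD COLUMN total INT AS (mark1 + mark2) VIRTUAL;   -- virtual generated column
CREATE INDEX demo_total ON demo (name, total);                      -- index on it, similar to a function-based index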
Related
Sample Table:
+----+-------+-------+-------+-------+-------+---------------+
| id | col1 | col2 | col3 | col4 | col5 | modifiedTime |
+----+-------+-------+-------+-------+-------+---------------+
| 1 | temp1 | temp2 | temp3 | temp4 | temp5 | 1554459626708 |
+----+-------+-------+-------+-------+-------+---------------+
The above table has 50 million records.
(col1, col2, col3, col4, col5 are VARCHAR columns)
(id is the PK)
(modifiedTime is a millisecond timestamp)
Every column is indexed.
For example: I have two tabs on my website.
FirstTab - I print the count of the above table with the following criteria [col1 like "value1%" and col2 like "value2%"]
SecondTab - I print the count of the above table with the following criteria [col3 like "value3%"]
As I have 50 million records, the count with those criteria takes too much time to return the result.
Note: I change the row data sometimes: insert new rows, delete records that are no longer needed.
I need a feasible solution instead of querying the whole table, e.g. caching the older count. Is anything like this possible?
While I'm sure it's possible for MySQL, here's a solution for Postgres, using triggers.
The count is stored in another table, and there's a trigger on each insert/update/delete that checks whether the new row meets the condition(s) and, if it does, adds 1 to the count. Another part of the trigger checks whether the old row meets the condition(s) and, if it does, subtracts 1.
Here's the basic code for the trigger that counts the rows with temp2 = '5':
CREATE OR REPLACE FUNCTION updateCount() RETURNS TRIGGER AS
$func$
BEGIN
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
        USING NEW;
    END IF;
    IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
        EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
        USING OLD;
    END IF;
    RETURN NEW;
END
$func$ LANGUAGE plpgsql;
Here's a working example on dbfiddle.
You could of course modify the trigger code to have dynamic where expressions and store counts for each in the table like:
CREATE TABLE someTableCount
(
whereExpr text,
cnt INT
);
INSERT INTO someTableCount VALUES ('temp2 = ''5''', 0);
In the trigger you'd then loop through the conditions and update accordingly.
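A rough, untested sketch of that looping variant (assuming each stored whereExpr only references the x(id, temp1, temp2, temp3) columns used above; the function name is made up):
CREATE OR REPLACE FUNCTION updateCounts() RETURNS TRIGGER AS
$func$
DECLARE
    cond record;
BEGIN
    FOR cond IN SELECT whereExpr FROM someTableCount LOOP
        IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
            EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE whereExpr = $2 AND 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE ' || cond.whereExpr || ')'
            USING NEW, cond.whereExpr;
        END IF;
        IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
            EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE whereExpr = $2 AND 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE ' || cond.whereExpr || ')'
            USING OLD, cond.whereExpr;
        END IF;
    END LOOP;
    RETURN NEW;
END
$func$ LANGUAGE plpgsql;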
FirstTab - I print the count of the above table with the following criteria [col1 like "value1%" and col2 like "value2%"]
That would benefit from a 'composite' index:
INDEX(col1, col2)
because it would be "covering". (That is, all the columns needed in the query are found in a single index.)
SecondTab - I print the count of the above table with the following criteria [col3 like "value3%"]
You apparently already have the optimal (covering) index:
INDEX(col3)
Now, let's look at it from a different point of view. Have you noticed that search engines no longer give you an exact count of matching rows? You are finding out why -- it takes too long to do the tally, no matter what technique is used.
Since "col1" gives me no clue of your app, nor any idea of what is being counted, I can only throw out some generic recommendations:
Don't give the counts.
Precompute the counts, save them somewhere and deliver 'stale' values. This can be handy if there are only a few different "values" being counted. It is probably not practical for arbitrary strings.
Say "about nnnn" in the output.
Play some tricks to decide whether it is practical to compute the exact value or just say "about".
Say "more than 1000".
etc
If you would like to describe the app and the columns, perhaps I can provide some clever tricks.
You expressed concern about "insert speed". This is usually not an issue, and the benefit of having the 'right' index for SELECTs outweighs the slight performance hit for INSERTs.
It sounds like you're trying to use a hammer when a screwdriver is needed. If you don't want to run batch computations, I'd suggest using a streaming framework such as Flink or Samza to add and subtract from your counts when records are added or deleted. This is precisely what those frameworks are built for.
If you're committed to using SQL, you can set up a job that performs the desired count operations every given time window, and stores the values to a second table. That way you don't have to perform repeated counts across the same rows.
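If you stay in plain MySQL, a minimal sketch of such a periodic job using the event scheduler (assuming the scheduler is enabled; the table and column names here are made up):
CREATE TABLE cached_counts (
    tab VARCHAR(20) PRIMARY KEY,
    cnt BIGINT,
    computed_at TIMESTAMP
);

CREATE EVENT refresh_first_tab_count
ON SCHEDULE EVERY 5 MINUTE
DO
    REPLACE INTO cached_counts
    SELECT 'FirstTab', COUNT(*), NOW()
    FROM sampleTable
    WHERE col1 LIKE 'value1%' AND col2 LIKE 'value2%';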
As a general rule of thumb when it comes to optimisation (and yes, a single SQL server node with 50 million entries per table needs it!), here is a list of a few possible optimisation techniques, some fairly easy to implement, others maybe needing more serious modifications:
optimize your MySQL field types and sizes, e.g. use INT instead of VARCHAR if the data can be represented as numbers, use SMALLINT instead of BIGINT, etc. In case you really need VARCHAR, use the smallest possible length for each field,
look at your dataset; are there any repeating values? Say one of your fields has only 5 unique values across 50 million rows; then save those values in a separate table and just link its PK to this sample table,
MySQL partitioning: a basic introduction is given at this link, and the general idea is to implement some kind of partitioning scheme, e.g. a new partition is created by a CRONJOB every day at "night" when server utilization is at its minimum, or when you reach another 50k INSERTs or so (btw some extra effort will also be needed for UPDATE/DELETE operations across partitions); a sketch is shown after this list,
caching is another very simple and effective approach, since you are requesting (almost) the same data (I am assuming your value1%, value2%, value3% are always the same?) over and over again. So run SELECT COUNT() once in a while, and then use a differential count to get the actual number of selected rows,
an in-memory database can be used alongside traditional SQL DBs to hold often-needed data: a simple key-value pair style could be enough. Redis, Memcached, VoltDB and MemSQL are just some of them; MySQL also has a MEMORY engine,
use other types of DBs, e.g. a NoSQL DB like MongoDB, if your dataset/system can make use of a different concept.
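To illustrate the partitioning point from the list above (the table name is assumed, the boundary values are made up, and MySQL requires the partitioning column to be part of every unique key, so the primary key is assumed to be widened to (id, modifiedTime)):
ALTER TABLE sampleTable DROP PRIMARY KEY, ADD PRIMARY KEY (id, modifiedTime);
ALTER TABLE sampleTable PARTITION BY RANGE (modifiedTime) (
    PARTITION p2019_04 VALUES LESS THAN (1556668800000),   -- rows modified before 2019-05-01
    PARTITION p2019_05 VALUES LESS THAN (1559347200000),   -- rows modified before 2019-06-01
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);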
If you are looking for aggregation performance and don't really care about insert times, I would consider changing your Row DBMS for a Column DBMS.
A column RDBMS stores data as columns, meaning each column is indexed independently from the others. This allows much faster aggregations; I switched from Postgres to MonetDB (an open source column DBMS) and summing one field of a 6-million-row table dropped from ~60 s to 50 ms. I chose MonetDB because it supports SQL querying and ODBC connections, which were a plus for my use case, but you will experience similar performance improvements with other column DBMSs.
There is a downside to column storage, which is that you lose performance on insert, update and delete queries, but from what you said, I believe it won't affect you that much.
In Postgres, you can get an estimated row count from the internal statistics that are managed by the query planner:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'mytable';
Here you have more details: https://wiki.postgresql.org/wiki/Count_estimate
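For filtered queries, that wiki page also describes parsing the planner's estimate out of EXPLAIN; roughly along these lines (illustrative sketch, the function name is the one used on the wiki):
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS
$$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END
$$ LANGUAGE plpgsql VOLATILE STRICT;

SELECT count_estimate('SELECT 1 FROM mytable WHERE col1 LIKE ''value1%''');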
You could create a materialized view first. Something like this:
CREATE MATERIALIZED VIEW mytable AS SELECT * FROM the_table WHERE col1 LIKE 'value1%' AND col2 LIKE 'value2%';
You can also materialize the count queries directly. If you have 10 tabs, then you would have to materialize 10 views:
CREATE MATERIALIZED VIEW count_tab1 AS SELECT count(*) FROM the_table WHERE col1 LIKE 'value1%' AND col2 LIKE 'value2%';
CREATE MATERIALIZED VIEW count_tab2 AS SELECT count(*) FROM the_table WHERE col2 LIKE 'value2%' AND col3 LIKE 'value3%';
...
After each insert, you should refresh the views (asynchronously):
REFRESH MATERIALIZED VIEW count_tab1;
REFRESH MATERIALIZED VIEW count_tab2;
...
As noted in the critique, you have not posted what you have tried, so I will assume the scope of the question is exactly what you posted. Kindly report the results for exactly that much:
What time are you currently spending on the subset of the problem, i.e. the count of [col1 like "value1%" and col2 like "value2%"] and, second, [col3 like "value3%"]?
The trick is to scan the data source once, and to make the data source smaller by creating an index. So first create an index on (col1, col2, col3, id). The purpose of col3 and id is so that the database scans just the index. And I would get both counts in the same SQL:
select sum(case
               when col1 like 'value1%' and col2 like 'value2%' then 1
               else 0
           end) as cnt_condition_1,
       sum(case
               when col3 like 'value3%' then 1
               else 0
           end) as cnt_condition_2
from the_table
where (col1 like 'value1%' and col2 like 'value2%')
   or (col3 like 'value3%');
So the 50M-row table is probably very wide right now; this should trim it down. On a reasonable server I would expect the above to return in a few seconds. If it does not, and each condition returns < 10% of the table, the second option is to create separate indexes for each scenario (see the sketch below) and run a count for each, so that an index is used in each case.
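For instance (the index names are arbitrary; the first statement mirrors the covering index described above, the others the per-scenario option):
CREATE INDEX idx_cover ON the_table (col1, col2, col3, id);
-- or, one per scenario:
CREATE INDEX idx_tab1 ON the_table (col1, col2);
CREATE INDEX idx_tab2 ON the_table (col3);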
If there are no bulk inserts/bulk updates happening in your system, can you try vertical partitioning of your table? With vertical partitioning, you can separate the data block of col1 and col2 from the other data of the table, so your search space will be reduced.
Also, indexing every column doesn't seem to be the best approach; index only where it is absolutely needed. In this case, I would say INDEX(col1, col2) and INDEX(col3).
Even after indexing, you need to look at the fragmentation of those indexes and rebuild them accordingly to get the best results. Because sometimes a 50-million-entry index on one column can sit as one huge chunk, which will restrict the multiprocessing capabilities of your SQL server.
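For example, on MySQL/InnoDB the usual way to rebuild a fragmented table and its indexes is (the table name is assumed, and note it can block writes for a while):
OPTIMIZE TABLE sampleTable;   -- InnoDB implements this as a table rebuild plus ANALYZE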
Each database has its own peculiarities in how to "enhance" its RDBMS. I can't speak for MySQL or SQL Server, but for PostgreSQL you should consider making the indexes that you search GIN (Generalized Inverted Index) indexes.
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- GIN on text needs an operator class such as gin_trgm_ops for LIKE searches
CREATE INDEX idx_col1_gin ON table USING gin(col1 gin_trgm_ops);
CREATE INDEX idx_col2_gin ON table USING gin(col2 gin_trgm_ops);
CREATE INDEX idx_col3_gin ON table USING gin(col3 gin_trgm_ops);
More information can be found here.
-HTH
This will work:
select count(*) from (
    select * from tablename
    where col1 like 'value1%' and col2 like 'value2%' and col3 like 'value3%'
) t
where REGEXP_LIKE(t.col1, '^value1(.*)$')
  and REGEXP_LIKE(t.col2, '^value2(.*)$')
  and REGEXP_LIKE(t.col3, '^value3(.*)$');
Try not to apply an index to all the columns, as that slows down the processing of a SQL query; have indexes on the required columns only.
I have the following table, similar to Oracle's user_sequences.
I also have sequence prefix/suffix logic, but for simplicity I'm skipping it, as it matters less here.
create table my_seq(
    min_value integer,
    max_value integer,
    last_number integer,
    increment_by tinyint,
    customer_id integer);
Assume the table currently holds two records:
insert into my_seq(min_value,max_value,last_number,increment_by,customer_id)
values(1,99999999,1,1,1),(1,999999999,100,1,2);
My foo table structure is:
create table foo(id integer auto_increment primary key, foo_number varchar(20), customer_id integer);
Constraint:
I can't use MySQL AUTO_INCREMENT columns because foo contains different customers' data, every customer can opt for foo_number auto-generation or manual entry, and there should be no gaps if the customer opted for auto-generation. So for customer=1, who has opted for it, foo# should be 1, 2, 3, 4, etc.; no gaps are allowed.
So far so good with the auto-increment logic we have implemented, as long as my app runs in a single thread. We generate foo_number and populate it in the foo table, along with the other data points.
I simply run a query to get the next number:
select last_number from my_seq where customer_id=?;
read the number, and then update the record:
update my_seq set last_number=last_number+increment_by where customer_id=?;
Problem:
When multiple concurrent sessions run select last_number from my_seq..., it returns the same foo_number multiple times. Also, I can't enforce a single thread in the application because of application-side limitations and performance bottlenecks, hence I need to solve this on the database side.
Please suggest how I could avoid duplicate numbers. Thanks in advance.
I did Google this; many Stack Overflow links suggest get_last_id(), but as you can see, I can't use it.
I was able to solve this problem by combining the suggestions of @Akina and @RickJames; thank you both for your support.
create table my_seq(
    min_value integer,
    max_value integer,
    last_number integer,
    increment_by tinyint,
    customer_id integer) ENGINE = InnoDB;
Here ENGINE=InnoDB is very important.
In order to make sure there is table level locking while reading, I have modified my app code to:
Auto-Commit=FALSE
Then,
begin;   -- very important to begin the transaction
select last_number from my_seq where customer_id=? FOR UPDATE;
-- read the result in the app
update my_seq set last_number=last_number+1 where customer_id=?;
commit;
This generates unique sequence numbers even with multiple concurrent sessions.
I then faced another problem: this solution slowed down other areas where I generate sequence numbers. I solved that by switching to a row-level lock instead of a table-level lock, by indexing customer_id.
ALTER TABLE TABLE_NAME ADD INDEX (customer_id);
Hope this will be helpful to others.
I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM('hour', 'day', 'week')
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I have done something similar, and here are the problems I had to deal with:
You can use floor(start_time/86400)*86400 in the "group by" part to get a summary of all entries from the same day. (For a week it is almost the same.)
The SQL will look like:
insert into the_table
select
    customer,
    NULL,
    floor(start_time/86400)*86400,
    floor(start_time/86400)*86400 + 86400,
    'day',
    sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, floor(start_time/86400)*86400;
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you are going to insert a summary from the same table into itself, MySQL will use a temporary table (which means part of the data is copied aside and then dropped, for each statement), so you must be very careful with the indexes and the size of the data returned by the inner select.
When you are constantly inserting and deleting rows you will get fragmentation issues sooner or later, and it will slow you down dramatically. The solution is to use partitioning and to drop old partitions from time to time. Alternatively you can run an "optimize table" statement, but it will stop your work for a relatively long time (possibly minutes).
To avoid a mess with duplicate data, you may want to clone the table for each aggregation period (hour_table, day_table, ...), as sketched below.
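For example (purely illustrative; the rollup INSERT ... SELECT above would then read from hour_table and write into day_table):
CREATE TABLE hour_table LIKE the_table;
CREATE TABLE day_table  LIKE the_table;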
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.
Bit of a newbie here. I'm currently working on a MySQL table that lists the details of different cars. I need a new field that is built up from the information in three other fields. I have 'Acceleration', 'Speed' and 'Braking', which all contain double-digit integers that should be averaged out into another field I want to call 'Average'.
The logic being ('Acceleration' + 'Speed' + 'Braking') / 3
I can't seem to figure out the correct syntax to do this. I do specifically need this to be a field, as I need those values to show up in other queries. I know a SELECT query can get the result values I need, but how do I commit those values to a permanent field in that table?
Thanks in advance for any help on this.
First, you'd need to alter the table schema to define the new column:
ALTER TABLE my_table ADD COLUMN Average FLOAT;
Next, update the table to set the values:
UPDATE my_table SET Average = (Acceleration + Speed + Braking) / 3;
Consider how to correctly set Average for newly inserted/updated data. Perhaps use triggers:
CREATE TRIGGER calc_average_ins BEFORE INSERT ON my_table FOR EACH ROW
SET NEW.Average = (NEW.Acceleration + NEW.Speed + NEW.Braking) / 3;
CREATE TRIGGER calc_average_upd BEFORE UPDATE ON my_table FOR EACH ROW
SET NEW.Average = (NEW.Acceleration + NEW.Speed + NEW.Braking) / 3;
You might instead want to consider introducing this column in a view, to calculate the averages as required, on the fly, thereby preventing it from becoming desynchronised from the underlying data values (but note you then no longer get the performance benefit of having the values cached):
CREATE VIEW my_view AS
SELECT *, (Acceleration + Speed + Braking) / 3 AS Average FROM my_table;
Finally, note that your average has no physical meaning in the real world (what would be its units?): a more meaningful metric may or may not be more suitable to your needs.
I've got a mysql table where each row has its own sequence number in a "sequence" column. However, when a row gets deleted, it leaves a gap. So...
1
2
3
4
...becomes...
1
2
4
Is there a neat way to "reset" the sequencing, so it becomes consecutive again in one SQL query?
Incidentally, I'm sure there is a technical term for this process. Anyone?
UPDATED: The "sequence" column is not a primary key. It is only used for determining the order that records are displayed within the app.
If the field is your primary key...
...then, as stated elsewhere on this question, you shouldn't be changing IDs. The IDs are already unique and you neither need nor want to re-use them.
Now, that said...
Otherwise...
It's quite possible that you have a different field (that is, as well as the PK) for some application-defined ordering. As long as this ordering isn't inherent in some other field (e.g. if it's user-defined), then there is nothing wrong with this.
You could recreate the table using a (temporary) auto_increment field and then remove the auto_increment afterwards.
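A rough sketch of that idea (the `id` and `name` columns are made up, and it assumes the table has no other AUTO_INCREMENT column):
CREATE TABLE `table_new` LIKE `table`;
ALTER TABLE `table_new` MODIFY `myOrderCol` INT NOT NULL AUTO_INCREMENT UNIQUE;
INSERT INTO `table_new` (`id`, `name`)          -- every column except myOrderCol
SELECT `id`, `name` FROM `table` ORDER BY `myOrderCol`;
ALTER TABLE `table_new` MODIFY `myOrderCol` INT NOT NULL;   -- drop the AUTO_INCREMENT again
RENAME TABLE `table` TO `table_old`, `table_new` TO `table`;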
I'd be tempted to UPDATE in ascending order and apply an incrementing variable.
SET @i = 0;
UPDATE `table`
SET `myOrderCol` = @i:=@i+1
ORDER BY `myOrderCol` ASC;
(Query not tested.)
It does seem quite wasteful to do this every time you delete items, but unfortunately with this manual ordering approach there's not a whole lot you can do about that if you want to maintain the integrity of the column.
You could possibly reduce the load, such that after deleting the entry with myOrderCol equal to, say, 5:
SET @i = 5;
UPDATE `table`
SET `myOrderCol` = @i:=@i+1
WHERE `myOrderCol` > 5
ORDER BY `myOrderCol` ASC;
(Query not tested.)
This will "shuffle" all the following values down by one.
I'd say don't bother. Reassigning sequential values is a relatively expensive operation, and if the column value is for ordering purposes only there is no good reason to do it. The only concern you might have is if, for example, your column is UNSIGNED INT and you suspect that over the lifetime of your application you might have more than 4,294,967,296 rows (including deleted rows) and go out of range; even if that is a concern, you can do the reassigning as a one-time task 10 years later when it happens.
This is a question I often read here and in other forums. As already written by zerkms, this is a false problem. Moreover, if your table is related to other ones, you'll lose the relations.
Just for learning purposes, a simple way is to store your data in a temporary table, truncate the original one (this resets auto_increment) and then repopulate it.
Silly example:
create table seq (
id int not null auto_increment primary key,
col char(1)
) engine = myisam;
insert into seq (col) values ('a'),('b'),('c'),('d');
delete from seq where id = 3;
create temporary table tmp select col from seq order by id;
truncate seq;
insert into seq (col) select * from tmp;
but it's totally useless. ;)
If this is your PK then you shouldn't change it. PKs should be (mostly) unchanging columns. If you were to change them then not only would you need to change them in that table but also in any foreign keys where they exist.
If you do need a sequential sequence then ask yourself why. In a table there is no inherent or guaranteed order (even in the PK, although it may turn out that way because of how most RDBMSs store and retrieve the data); that's why we have the ORDER BY clause in SQL. If you want to generate sequential numbers based on something else (time added to the database, etc.), then consider generating them either in your query (see the sketch below) or in your front end.
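For example, on MySQL 8.0+ you can compute the position at read time instead of storing it (the column names here are illustrative):
SELECT id, ROW_NUMBER() OVER (ORDER BY created_at) AS seq
FROM `table`;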
Assuming that this is an ID field, you can do this when you insert:
INSERT INTO yourTable (ID)
SELECT MIN(ID)
FROM yourTable
WHERE ID > 1
As others have mentioned I don't recommend doing this. It will hold a table lock while the next ID is evaluated.