Sample Table:
+----+-------+-------+-------+-------+-------+---------------+
| id | col1 | col2 | col3 | col4 | col5 | modifiedTime |
+----+-------+-------+-------+-------+-------+---------------+
| 1 | temp1 | temp2 | temp3 | temp4 | temp5 | 1554459626708 |
+----+-------+-------+-------+-------+-------+---------------+
above table has 50 million records
(col1, col2, col3, col4, col5 these are VARCHAR columns)
(id is PK)
(modifiedTime)
Every column is indexed
For Ex: I have two tabs in my website.
FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
SeocndTab - I print the count of above table with following criteria [col3 like "value3%"]
As I have 50 million records, the count with those criteria takes too much time to get the result.
Note: I would change records data(rows in table) sometime. Insert new rows. Delete not needed records.
I need a feasible solution instead of querying the whole table. Ex: like caching the older count. Is anything like this possible.
While I'm sure it's possible for MySQL, here's a solution for Postgres, using triggers.
Count is stored in another table, and there's a trigger on each insert/update/delete that checks if the new row meets the condition(s), and if it does, add 1 to the count. Another part of the trigger checks if the old row meets the condition(s), and if it does, subtracts 1.
Here's the basic code for the trigger that counts the rows with temp2 = '5':
CREATE OR REPLACE FUNCTION updateCount() RETURNS TRIGGER AS
$func$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING NEW;
END IF;
IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING OLD;
END IF;
RETURN new;
END
$func$ LANGUAGE plpgsql;
Here's a working example on dbfiddle.
You could of course modify the trigger code to have dynamic where expressions and store counts for each in the table like:
CREATE TABLE someTableCount
(
whereExpr text,
cnt INT
);
INSERT INTO someTableCount VALUES ('temp2 = ''5''', 0);
In the trigger you'd then loop through the conditions and update accordingly.
FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
That would benefit from a 'composite' index:
INDEX(col1, col2)
because it would be "covering". (That is, all the columns needed in the query are found in a single index.)
SeocndTab - I print the count of above table with following criteria [col3 like "value3%"]
You apparently already have the optimal (covering) index:
INDEX(col3)
Now, let's look at it from a different point of view. Have you noticed that search engines no longer give you an exact count of rows that match? You are finding out why -- It takes too long to do the tally not matter what technique is used.
Since "col1" gives me no clue of your app, nor any idea of what is being counted, I can only throw out some generic recommendations:
Don't give the counts.
Precompute the counts, save them somewhere and deliver 'stale' values. This can be handy if there are only a few different "values" being counted. It is probably not practical for arbitrary strings.
Say "about nnnn" in the output.
Play some tricks to decide whether it is practical to compute the exact value or just say "about".
Say "more than 1000".
etc
If you would like to describe the app and the columns, perhaps I can provide some clever tricks.
You expressed concern about "insert speed". This is usually not an issue, and the benefit of having the 'right' index for SELECTs outweighs the slight performance hit for INSERTs.
It sounds like you're trying to use a hammer when a screwdriver is needed. If you don't want to run batch computations, I'd suggest using a streaming framework such as Flink or Samza to add and subtract from your counts when records are added or deleted. This is precisely what those frameworks are built for.
If you're committed to using SQL, you can set up a job that performs the desired count operations every given time window, and stores the values to a second table. That way you don't have to perform repeated counts across the same rows.
As a general rule of thumb when it comes to optimisation (and yes, 1 SQL server node#50mio entries per table needs one!), here is a list of few possible optimisation techniques, some fairly easy to implement, others maybe need more serious modifications:
optimize your MYSQL field type and sizes, eg. use INT instead of VARCHAR if data can be presented with numbers, use SMALL INT instead of BIG INT, etc. In case you really need to have VARCHAR, then use as small as possible length of each field,
look at your dataset; is there any repeating values? Let say if any of your field has only 5 unique values in 50mio rows, then save those values to separate table and just link PK to this Sample Table,
MYSQL partitioning, basic understanding is shown at this link, so the general idea is so implement some kind of partitioning scheme, e.g. new partition is created by CRONJOB every day at "night" when server utilization is at minimum, or when you reach another 50k INSERTs or so (btw also some extra effort will be needed for UPDATE/DELETE operations on different partitions),
caching is another very simple and effective approach, since requesting (almost) same data (I am assuming your value1%, value2%, value3% are always the same?) over and over again. So do SELECT COUNT() once a while, and then use differencial index count to get actual number of selected rows,
in-memory database can be used alongside tradtional SQL DBs to get often-needed data: simple key-value pair style could be enough: Redis, Memcached, VoltDB, MemSQL are just some of them. Also, MYSQL also knows in-memory engine,
use other types of DBs, e.g NoSQL DB like MongoDB, if your dataset/system can utilize different concept.
If you are looking for aggregation performance and don't really care about insert times, I would consider changing your Row DBMS for a Column DBMS.
A Column RDBMS stores data as columns, meaning each column is indexed independantly from the others. This allows way faster aggregations, I have switched from Postgres to MonetDB (an open source column DBMS) and summing one field from a 6 milions lines table dropped down from ~60s to 50ms. I chose MonetDB as it supports SQL querying and odbc connections which were a plus for my use case, but you will experience similar performance improvements with other Column DBMS.
There is a downside to Column storing, which is that you lose performance on insert, update and delete queries, but from what you said, I believe it won't affect you that much.
In Postgres, you can get an estimated row count from the internal statistics that are managed by the query planner:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'mytable';
Here you have more details: https://wiki.postgresql.org/wiki/Count_estimate
You could create a materialized view first. Something like this:
CREATE MATERIALIZED VIEW mytable AS SELECT * FROM the_table WHERE col1 like "value1%" and col2 like "value2%";`
You can also materialize directly the count queries. If you have 10 tabs, then you should have to materialize 10 views:
CREATE MATERIALIZED VIEW count_tab1 AS SELECT count(*) FROM the_table WHERE col1 like "value1%" and col2 like "value2%";`
CREATE MATERIALIZED VIEW count_tab2 AS SELECT count(*) FROM the_table WHERE col2 like "value2%" and col3 like "value3%";`
...
After each insert, you should refresh views (asynchronously):
REFRESH MATERIALIZED VIEW count_tab1
REFRESH MATERIALIZED VIEW count_tab2
...
As noted in the critique, you have not posted what you have tried. So I would assume that the limit of question is exactly what you posted. So kindly report results of exactly that much
What is the current time you are spending for the subset of the problem, i.e. count of [col1 like "value1%" and col2 like "value2%"] and 2nd [col3 like "value3%]
The trick would be to scan the data source once and make the data source smaller by creating an index. So first create an index on col1,col2,col3,id. Purpose of col3 and id is so that database scans just the index. And I would get both counts in same SQL
select sum
(
case
when col1 like 'value1%' and col2 like 'value2%' then 1
else 0
end
) cnt_condition_1,
sum
(
case
when col3 like 'value3%' then 1
else 0
end
) cnt_condition_2
from table
where (col1 like 'value1%' and col2 like 'value2%') or
(col3 like 'value3%')
```
So the 50M row table is probably very wide right now. This should trim it down - on a reasonable server I would expect above to return in a few seconds. If it does not and each condition returns < 10% of the table, second option will be to create multiple indexes for each scenario and do count for each so that index is used in each case.
If there is no bulk insert/ bulk updates happening in your system, Can you try vertical partitioning in your table? By vertical partitioning, you can separate the data block of col1, col2 from other data of the table and so your searching space will reduce.
Also, indexing on every columns doesn't seem to be the best approach to go with. Index wherever it is absolutely needed. In this case, I would say Index(col1,col2) and Index(col3).
Even after indexing, you need to look into the fragmentation of those indexes and modify it accordingly to get the best results. Because, sometimes 50 million index of one column can sit as one huge chunk, which will restrict multi processing capabilities of your SQL server.
Each Database has their own peculiarities in how to "enhance" their RDBMS. I can't speak for MySQL or SQL Server but for PostgreSQL you should consider making the indexes that you search as GIN (Generalized Inverted Index)-based indexes.
CREATE INDEX name ON table USING gin(col1);
CREATE INDEX name ON table USING gin(col2);
CREATE INDEX name ON table USING gin(col3);
More information can be found here.
-HTH
this will work:
select count(*) from (
select * from tablename where col1 like 'value1%' and col2 like 'value2%' and col3
like'value3%')
where REGEXP_LIKE(col1,'^value1(.*)$') and REGEXP_LIKE(col2,'^value2(.*)$') and
REGEXP_LIKE(col1,'^value2(.*)$');
try not to apply index on all the columns as it slows down the processing of a sql
query and have it in required columns only.
I have two columns primary key (id, id2) in MySql.
Those ids have a direct connection (for id=1 id2=11, for id=2 id=22, ect.)
I was wonder if the following query:
select * from my_table where id IN (1,2,3..) AND id2 IN (11,22,33..)
Is actually damage the performance, although it is a primary key.
Will run a single select in loop:
select * from my_table where id = 1 AND id2 = 11
select * from my_table where id = 2 AND id2 = 22
...
run faster?
I believe the answer is yes, cause for each id, the query compare id2 with a list of integers.
Is it correct?
Also, does IN makes a difference for a single column primary key?
If you are checking 'most' of the values, then a single query doing a table scan is probably fastest.
If the IN clauses are rather short, the Optimizer may hopscotch through the table very efficiently.
If the table is huge and the values are scattered and disk hits are needed, it may be slow regardless.
Roughly speaking, 10 1-row SELECTs inherently take as long as a single SELECT fetching 100 rows. (This assumes no I/O and good indexes.) So, you need to be desperate to do single selects.
In other words, you test it with your data and your IN lists. We cannot give you a simple answer. But beware, as the table grows and/or your IN lists change, performance could change.
so I have written a query as follows:
UPDATE
table1 latest, table2 previous
SET latest.col1 = previous.col1
WHERE latest.col2 = previous.col2 and previous.col1 is not null;
which copies the value of col2, from table2 to table 1 wherever the value of col1, matches. However due to the context, there can be no primary/foreign key constraints involve and col2 doesn't contain nulls but col1 does( in both tables)..
however this query takes several minutes to execute! is there a way to speeded it up?
Fixed by adding indexes to both table. Created indexes for the common column on which the tables were being joined. Wherever lookups are performed, either through joins, the columns being joined/lookedup should have indexes
I have two tables that are structurally identical in every aspect except column order, and I need to transfer the data from one to the other.
Normally, I would solve this through e.g. INSERT INTO tbl1 SELECT * FROM tbl2, but of course this won't work because the columns are out of order.
I know I can specify the columns manually (INSERT INTO tbl1 (col1, col2, ..., coln) SELECT col1, col2, ..., coln FROM tbl2) but I have inherited a project where this scenario happens for more than two hundred tables, and I would very much like to not have to do all that manual labour.
I tried variations of getting the column names through SELECT COLUMN NAME in the information_schema, but I couldn't get that to work, which is kinda sensible on MySQL's part anyway. Worst case I could write a script that reads all tables and then all columns for each table, constructing a query that I can run that way. But I'd ideally like a pure MySQL solution, if any exists.
Here's my problem...
I need to be able to check which items in a list of about 1,000 items (the needles) are in a fairly large table containing about ~500,000 rows (the haystack).
My question is, what's the best/fastest/most efficient way to do this?
I know that I can create a SQL statement like this:
SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn)
(assuming ID1, ID2, ID3, ..., IDn are the the needles.)
However, I'm not sure how performant or wise that is if the needles list contains 1,000+ items.
I also know that, if my needles list was in a table of it's own, I could join that table to the haystack table. However, the needles list isn't already in a table.
So - I guess another possible option is to put those 1,000 items into a temporary table and then join that to the haystack table. If that's the best option - then what's the best way to quickly load 1,000 items into a temporary table? (E.g., 1,000 individual INSERT statements? Insert all rows in a single INSERT statment? Is there a limit on how long an INSERT statement can be?)
A third possible option - write the needles list to a text file, then use LOAD DATA INFILE to load that into a (temporary) table, then join the temp table to the haystack table. But, wow... that seems like a lot of overhead.
Is there another, better option?
For what it's worth, the context of this is PHP, and I'm getting the needles list from a JSON web-service response, and using MySQLi for the database interaction.
According to this benchmark, it is faster in your case to use a temporary table and the JOIN method.
I am not sure though that's not a premature optimisation. You should perform your own benchmark and determine if the added complexity deserves the effort. I would recommend going with the simple IN method and only start to optimise when you detect a performance issue.
Just remember that according to the manual:
The number of values in the IN list is only limited by the max_allowed_packet value.
I think your query SELECT id FROM haystack WHERE id IN (ID1, ID2, ID3, ..., IDn) would be fine. I have a very similar use case where I have millions of "needles" and I pass them to the IN clause in blocks of 10,000 via PDO with no issues.
I would add that the column you are checking should be indexed. In my case it is the primary key of the table.
If the needles are going to be used to query the haystack frequently, you absolutely want to create a new table. For this example, I'm going to assume that the needles are int values and will label them as id in the table needle.
First, you need to create the table
CREATE TABLE needle (
id INT(11) PRIMARY KEY
)
Next, you need to insert the values
INSERT INTO needle (id)
VALUES (ID1),
(ID2),
...,
(IDn)
Now, you can query haystack using a join.
SELECT h.id
FROM haystack h
JOIN needle n
ON h.id = n.id
If this is an infrequent query and the number of needles won't grow beyond the 1,000, using the IN clause won't hurt your performance greatly.