I have a very large staging table that I want to process a few rows at a time into an indexed table.
As the time to write indexes results in longer than desired locks on the target table I usually do this a few 100k rows at a time. I pick the rows using order by a unique column as well as Limit and Offset to pick a value to repeatedly churn away on the staging table.
SELECT unique_id INTO #cut_off FROM staging_X ORDER BY unique_id;
START TRANSACTION;
INSERT INTO my_indexed_table ([columns])
SELECT columns FROM staging_X where unique_id <= #cut_off;
DELETE FROM my_indexed_table WHERE unique_id <= #cut_off;
COMMIT;
I've done this for a couple of tables successfully, but am now faced with the largest table in my list. This one has more than 100 million rows. It is created by Apache Spark, so I have no control over setting up partitions or anything.
I've been wondering if I can just use LIMIT with a constant value on both the INSERT and DELETE queries without trying to sort the data. But I cannot find anything that states that the rows will be returned in a reliably repeatable order.
For reference I am using MySQL 5.7 and INNODB tables.
Update
On request the data is something like this:
uuid - Text
timestamp1 - Bigint - unixtime
timestmap2 - Bigint - unixtime
timestmap3 - Bigint - unixtime
timestmap4 - Bigint - unixtime
url - Text
metric1 - Int
metric2 - Int
There are about 30 million rows per day and I can only process this weekly. I can throttle the provision of the data create multiple tables (I cannot create partitions) with a limited row count each but ideally, I'd just like to be able to reliably get the first N rows, insert them elsewhere and delete them, without trying to sort the data.
Related
I have table named data_table.
This table already has about 10 million records. Previously I used to check if combination of itemID, FromDate and ToDate exists before inserting data. To make things easier I created unique index with fields itemID+FromDate+ToDate.
This table has now all together three indexes, ID (Pk), itemID and UniqueIndex
Problem
The first time if I try to generate report for say itemID=2630, and for date range 2018-01-01 to 2021-01-01. It takes around 60 seconds.
Second time for same parameters, it takes around 1 second.
I then deleted all data for this item (2630) , and reinserted many random data for this itemID between selected date range.
Now if I run the report third time, it still takes around 1 second.
I thought first time, the query results was cached so second time it was very fast.
In third step I removed all data and re inserted different records and then generated the report, but it was as fast as second time. For particular item, why first time the report generation process is very slow? Can anybody help me how to overcome this issue?
The table engine I use is innodb and my mysql version is 5.7.33
Update
This is my query
SELECT
*
FROM
DataTable AS D
WHERE ItemID = :ItemID
AND IsJoined = '0'
AND (
(
:paramFromDate < ToDate
AND ToDate < :paramToDate
)
OR (
:paramFromDate < FromDate
AND FromDate < :paramToDate
)
OR (
FromDate < :paramFromDate
AND :paramToDate < FromDate
)
)
ORDER BY FromDate DESC
UPDATE
Restarting mysql, causes the query slow again. And subsequent queries are fast until I restart mysql again.
Thanks
Caching
The main cache for InnoDB is the "buffer_pool". It caches blocks (16KB each) each of which contains several consecutive rows of data or rows of an index. All operations (read or write) of rows work with those blocks.
After starting or restarting MySQL, the cache is empty. Hence, everything needs to be fetched from disk, hence 'slow'.
After reading the data once (and getting the relevant blocks pulled into cache, a second query will find them cached and be 'fast'.
Inserted rows require the relevant block(s) to be in the cache. So, for some time after doing an an INSERT a SELECT of those rows will be 'fast'.
Better INDEX
As for optimizing that query, have
INDEX(ItemID, IsJoined, FromDate) -- (in this order)
The first two columns help with part of the WHERE.
The OR in the rest of the WHERE prevents any useful optimization involving the two date columns.
However, the Optimizer may be able to avoid a sort (for the ORDER BY) if it chooses to use the FromDate that I added to the index.
If you are checking for overlapping date ranges, see if this meets your needs:
AND fromDate <= :toParam
AND :fromParam <= toDate
If that works for you, then another part of the WHERE is handled by my index. (But it will not be possible to also have the other part handled.) (Also, I don't know whether you need < or <=.)
I am using Mysql database.
I have a table daily_price_history of stock values stored with the following fields. It has 11 million+ rows
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock SymbolName there are various daily stock values. And the data is now more than 11 million rows,
The following sql try to get the last 100 days of daily data for a set of 1500 symbols
SELECT `daily_price_history`.`id`,
`daily_price_history`.`symbolId_id`,
`daily_price_history`.`volume`,
`daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
(SELECT U0.`id`
FROM `daily_price_history` U0
WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
AND U0.`datetime` >= 1598471533546))
AND `daily_price_history`.`symbolName` IN (A,AA, ...... 1500 symbols Names)
I have the table indexed on symbolName and also datetime
For getting 130K (i.e 1500 x 100 ~ 150000) rows of data it takes 20 secs.
Also i have weekly_price_history and monthly_price_history tables, and I try to run the similar sql, they take less time for the same number (130K) of rows, because they have less data in the table than daily.
weekly_price_history getting 150K rows takes 3s. The total number of rows in it are 2.5million
monthly_price_history getting 150K rows takes 1s. The total number of rows in it are 800K
So how to speed up the thing when the size of table is large.
As a starter: I don't see the point for the subquery at all. Presumably, your query could filter directly in the where clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches on the predicate on datetime. It is not very likley, however, that the database will be able to use the index to filter symbolname against a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where s.datetime >= 1598471533546
That should allow the database to use the above index. We can take one step forward and try and add the 4 columns of the select clause to the index:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
If I have a table TableA with 10k rows and I want to search all rows where id > 8000
When I use the SQL statement SELECT * FROM TableA WHERE id > 8000 to search them, what will MySQL do? Will it search 10k rows and return the 2k rows that match the condition or just ignore those 8k rows and return the 2k rows of data?
I also have a requirement to store a lot of data in the database per day and need to have a quick search for today records. Is one big table still the best method or are other solutions available?
Or would it be best to create 2 tables. 1 for the all records and 1 for today's records and when the new data coming, both table will insert but in the next day the record of the second table will delete.
Which method are better when comparing the speed of select or any other good method can for this case?
Actually i don't have the real database here now but i just worry
about which way/method can be better in that case
Updated information below at (8-12-2016 11:00)
I am using InnoDB but i will use the date as the search key and it is not a PK.
Returning 2k rows is just a extremely case for study but in the real case may returning (User Numbers * each record for that User), so if i got 100 user and they make 10 record in that day, i may need to returning 1k rows record.
My real case is i need to store all user records per days (maybe 10 records per 1 user) and i need to generate the rank for the last day records and the last 7 days records so i just worry if i just search the last day records in a large table, would it be slow or create another table just for save the last day records?
Are you fetching more than about 20% of the table? (The number 20% is inexact.)
Is the PRIMARY KEY on id? Or is it a secondary key?
Are you using ENGINE=InnoDB?
Case: InnoDB and PRIMARY KEY(id): The execution will start at 8000 and go until finished. This is optimal
Case: InnoDB, id is a secondary key, and a 'small' percentage of table is being fetched: The index will be used; it is a BTree and is scanned from 8000 to end, jumping over to the data (via the PK) to find the rows.
Case: InnoDB, id is secondary, and large percentage: The index will be ignored, and the entire table will be scanned ("table scan"), ignoring rows that don't match the WHERE clause. A table scan is likely to be faster than the previous case because of all the 'jumping over to the data'.
Other comments:
10K rows is "small" as tables go.
returning 2K rows is "large" as result sets go. What are you doing with them?
Is there further filtering that you could turn over to MySQL, so that you don't get all 2K back? Think COUNT or SUM with GROUP BY, FULLTEXT search index, etc.
More tips on indexes.
I've been running a website, with a large amount of data in the process.
A user's save data like ip , id , date to the server and it is stored in a MySQL database. Each entry is stored as a single row in a table.
Right now there are approximately 24 million rows in the table
Problem 1:
Things are getting slow now, as a full table scan can take too many minutes but I already indexed the table.
Problem 2:
If a user is pulling a select data from table it could potentially block all other users (as the table is locked) access to the site until the query is complete.
Our server
32 Gb Ram
12 core with 24 thread cpu
table use MyISAM engine
EXPLAIN SELECT SUM(impresn), SUM(rae), SUM(reve), `date` FROM `publisher_ads_hits` WHERE date between '2015-05-01' AND '2016-04-02' AND userid='168' GROUP BY date ORDER BY date DESC
Lock to comment from #Max P. If you write to MyIsam Tables ALL SELECTs are blocked. There is only a Table lock. If you use InnoDB there is a ROW Lock that only locks the ROWs they need. Aslo show us the EXPLAIN of your Queries. So it is possible that you must create some new one. MySQL can only handle one Index per Query. So if you use more fields in the Where Condition it can be useful to have a COMPOSITE INDEX over this fields
According to explain, query doesn't use index. Try to add composite index (userid, date).
If you have many update and delete operations, try to change engine to INNODB.
Basic problem is full table scan. Some suggestion are:
Partition the table based on date and dont keep more than 6-12months data in live system
Add an index on user_id
I have a table 'tbl' something like that:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 second
Query:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 second
Query:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 second
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 second
Those queries are tested in phpmyadmin.
And the result is query 3 and query 4 together return necessarily data.
Query 1 does the same jobs but much slower...
This doesn't look right for me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database, any time there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.