MySQL: force limited entries in a table

I keep some temporary data in a MEMORY table. I only need the 20 most recent entries and would prefer that the data always stay on the heap. How should I accomplish this? I'm fairly sure there's nothing I can do about the MEMORY table itself, but how should I handle the entries? Should I add an auto-increment key and delete the oldest row whenever I want to push a new value in?

Could you please describe in more detail what you are trying to do? I don't see why you want to keep the most recent data in an additional table when you can just use a SELECT with descending order and a LIMIT 20. If the SELECT query is too expensive, then just cache the result using memcached or similar and clear the cache every time new data is inserted.
If the additional table is really necessary, there are several ways to prune old data from it. Either you fetch the id of the 20th most recent row (again with descending order and LIMIT 19,1) and delete everything that has a smaller id (in case you have an auto-increment index, timestamp, etc.), or you SELECT COUNT(*) and then do a DELETE with ascending order and a LIMIT of (all items - 20). This could be packed into a cronjob that runs every few minutes.
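A minimal sketch of that pruning query, assuming the table is called recent_entries and has an auto-increment column named id (both names are placeholders):
DELETE FROM recent_entries
WHERE id < (
    SELECT id FROM (
        SELECT id FROM recent_entries ORDER BY id DESC LIMIT 19, 1
    ) AS t
);
The extra derived table is needed because MySQL will not let you select from the table you are deleting from; if the table holds fewer than 20 rows, the subquery returns NULL and nothing is deleted.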
But I would really recommend using a cache and looking at the table definition. With a decent index there shouldn't be any problems.

Appending to the 20-entry table and removing the eldest element (i.e. the one with the minimum ID?) is possible. However, note that this will fragment the table.
That's OK so long as you run OPTIMIZE every once in a while.
A different way would be to pre-allocate 20 entries and keep a separate counter of which entry is the latest. Then instead of insert/delete, you would update the item ID based on the counter, which you would then increment (mod 20 + 1) and store again.
However note that both of these models work only under a "single-threaded" model. If multiple threads are running on the table it's possible that they'll conflict.
If the counter is in program memory, shared by threads but guarded properly, that will be both thread-safe and efficient.
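A rough sketch of the pre-allocated variant, assuming 20 fixed rows keyed by a slot column, with the counter kept by the application (the table and column names here are made up):
SET @next_slot = 3;  -- in practice the application computes (counter MOD 20) + 1
UPDATE recent_entries
SET payload = 'new value', created_at = NOW()
WHERE slot = @next_slot;
Since the row already exists, this never fragments the table; the trade-off is that the slot bookkeeping lives in your code.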


Optimize SQL to get row count

I have a page on my site that keeps track of the number of people accessing it. On another part of the site I display data containing information about the users that access this page, showing only about 10 at a time.
The problem is that I need to create pagination, so I need to know how much data is in my table at any time, and this causes the display page to take some time to load: 2-3 seconds, sometimes 7-10, because I have millions of records. I am wondering how to get this page to load faster.
Select COUNT(*) as Count from visits
My first response is . . . if you are paging records 10 at a time, why do you need the total count of more than a million?
Second, counting a million rows should not take very long, unless your rows are wide (lots of columns or wide columns). If that is the case, then:
select count(id) from t;
can help, because it will explicitly use an index. Note that the first run may be slower than subsequent runs because of caching.
If you decide that you do need an exact row count, then your only real option for speeding it up using MySQL is to create triggers to maintain the count in another table. However, that will slow down inserts and deletions, which might not be a good idea.
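A minimal sketch of that trigger approach, assuming a one-row helper table named visit_count (the table and column names are made up):
CREATE TABLE visit_count (n BIGINT NOT NULL);
INSERT INTO visit_count SELECT COUNT(*) FROM visits;  -- seed it once
CREATE TRIGGER visits_ai AFTER INSERT ON visits
FOR EACH ROW UPDATE visit_count SET n = n + 1;
CREATE TRIGGER visits_ad AFTER DELETE ON visits
FOR EACH ROW UPDATE visit_count SET n = n - 1;
Reading the count then becomes a single-row lookup: SELECT n FROM visit_count;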
The best answer is to say "About 1,234,000 visits", not the exact number. Then calculate it daily (or whatever).
But if you must have the exact count, ...
If this table is "write only", then there is a solution. It involves treating it as a "Fact" table in a Data Warehouse. Then create and maintain a "Summary table" with a row for, say, each hour. Then the COUNT becomes:
SELECT SUM(hourly_count) FROM SummaryTable;
This will be much faster because there is much less to scan. However, there is a problem in that it does not include the count for the last (partial) hour. But that can be solved if you use INSERT ... ON DUPLICATE KEY UPDATE ... to increment the counter for the current hour or insert a new row with a "1".
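That upsert might look like this, assuming SummaryTable has a unique key on an hour_start column (hour_start is an assumed name; hourly_count comes from the query above):
INSERT INTO SummaryTable (hour_start, hourly_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hourly_count = hourly_count + 1;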
But, before we take this too far, please inform us of how often a new 'visit' occurs.
You cannot make that query faster without changing the server's hardware or adding more servers to run it in parallel. In the second case it would be better to move to a NoSQL database.
My approach would be to reduce the number of records. You could do that by having a temporary table where you record the access logs for the past hour/day, and after that time run a cronjob that deletes the data or moves it to another table for long-term storage.
You usually do not need to know the exact number of rows for pagination.
SELECT COUNT(*) FROM
(SELECT * FROM visits LIMIT 10000) AS v
would tell you that there are at least 1,000 pages. In most cases you do not need to know more.
You can store the total count somewhere and update it from time to time if you want a reasonable estimate. If you need an exact number, you can use a trigger to keep it current. The more up to date the info, the more expensive it is, of course.
Decide on a limit (let's say the last 1,000 records) from a practical (business requirements) point of view. Have an auto_increment index (id) or a timestamp (createdon). Grab at most 1,000 records:
select count(*) from (select id from visits order by id desc limit 1000) as t
or grab all 1,000 and paginate on the client side (PHP), counting there (even if you paginate in MySQL, it will still go through those records):
select * from visits order by id desc limit 1000

How to speed up InnoDB count(*) query?

There are a number of similar questions on here, but a lot of the answers say to force the use of an index and that doesn't seem to speed anything up for me.
I want to show a "live" counter on my website showing the number of rows in a table, kind of like how some websites show the number of registered users, or some other statistic, in "real time" (i.e. updated frequently using ajax or websockets).
My table has about 5M rows. It's growing fairly quickly and there is a high volume of inserts and deletes on it. Running
select count(*) from my_table
takes 1.367 seconds, which is unacceptable because I need my application to get the new row count about once per second.
I tried what many of the answers on here suggest and changed the query to:
select count(*) from my_table use index(my_index)
where my_index is a normal BTREE index on a bigint field. But the time actually increased to 1.414 seconds.
Why doesn't using an index speed up the query as many answers on here said it would?
Another option some answers suggest is to put a trigger on the table that increments a column in another table. So I could create a stats table and whenever a row is inserted or deleted in my_table have a trigger increment or decrement a column in the stats table. Is this the only other option, since using an index doesn't seem to work?
EDIT: Here's a perfect example of the type of thing I'm trying to accomplish: https://www.freelancer.com. Scroll to the bottom of the page and you'll see live counters there.
Those numbers update every second or so.
It takes time to read 5 million records and count them -- whether in an index or in the raw data form.
If a "fast-and-dirty" solution is acceptable, you can use metadata:
SELECT table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = <whatever> and TABLE_NAME = <whatever2>;
Note that this can get out-of-sync.
Another possibility is to partition the table into smaller chunks. One advantage is that if the inserts and deletes tend to be to one partition, you can just count that and use metadata for the other partitions.
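For the partitioned variant, per-partition estimates are also available from metadata, with the same staleness caveat (placeholders as above):
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = <whatever> AND TABLE_NAME = <whatever2>;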
A trigger may or may not help in this situation, depending on the insert/delete load. If you are doing multiple inserts per minute, then a trigger is a no-brainer -- a fine solution. If you are doing dozens or hundreds of changes per second, then the overhead of the trigger might slow down the server.
If your system is so busy that the counting is having too much impact, then probably the INSERTing/DELETEing is also having impact. One way to improve INSERT/DELETE is to do them in 'batches' instead of one at a time.
Gather the INSERTs, preferably in the app, but optionally in a 'staging' table. Then, once a second (or whatever) copy them into the real table using an INSERT..SELECT, or (if needed) INSERT..ON DUPLICATE KEY UPDATE. DELETEs can go into the same table (with a flag) or a separate table.
The COUNT(*) can be done at the end of the batch. Or it could be dead reckoned (at much lower cost) by knowing what the count was, then adjusting by what the staging table(s) will change it by.
This is a major upheaval to your app code, so don't embark on it unless you have spikes of, say, >100 INSERTs/DELETEs per second. (A steady 100 INSERTs/sec = 3 billion rows per year.)
For more details on "staging table", see http://mysql.rjweb.org/doc.php/staging_table . Note that that blog advocates flip-flopping between a pair of staging tables, so as to minimize locks on them, and to allow multiple clients to coexist.
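A rough sketch of that batch flush, assuming a staging table named my_table_staging and a stats table holding a dead-reckoned row_count (all three names are assumptions):
INSERT INTO my_table SELECT * FROM my_table_staging;  -- flush the batch in one statement
UPDATE stats SET row_count = row_count + (SELECT COUNT(*) FROM my_table_staging);  -- adjust the running count by the batch size
DELETE FROM my_table_staging;  -- clear the batch (the linked article flip-flops between two staging tables instead)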
Have a job running in the background that does the following; then use its table for getting the count:
Loop:
INSERT INTO Counter (ct_my_table)
SELECT COUNT(*) FROM my_table;
sleep 1 second
end loop
At worst, it will be a couple of seconds out of date. Also note that INSERTs and DELETEs interfere with (read: slow down) the SELECT COUNT(*), hence the "sleep".
Have you noticed that some UIs say "About 120,000 thingies"? They are using even cruder estimations. But it is usually good enough for the users.
Take the inaccurate value from information_schema, as Gordon Linoff suggested.
Another inaccurate source of the row count is SELECT MAX(id) - MIN(id).
Create a table my_table_count where you store the row count of table my_table and update it with triggers.
In many cases you don't need an accurate value. Who cares if you show 36,400 users instead of the accurate 36,454?

Finding records not updated in last k days efficiently

I have a table which contains records from the last n days. There are around 100 million records in this table. I need to find the records which have not been updated in the last k days.
My solution to this problem is:
Partition the table on k1. Index the timestamp column. Now, instead of updating the timestamp (so that the index is not rebuilt), perform remove + insert. I think that, by doing this, the query to find the records not updated in the last k days will be fast.
Is there any other better way to optimize these operations?
For example,
Suppose we have many users and each user can use different products. Also, a user can start using (become the owner of) new products at any time. If a user does not use a product for n days, his ownership expires. Now we need to find all the products for a user which have not been used by him in the last k days. The number of users is on the order of 10,000 and the number of products from which he can choose is on the order of 100,000.
I modeled this problem using a table with schema (user_id, product_id, last_used). product_id is the id of the product the user is using. Whenever a user uses the product, last_used is updated. Also, a user's ownership of a product expires if it is not used for n days by the user. I partitioned the table on user_id and indexed last_used (timestamp). Also, instead of updating I performed delete + create. I did the partitioning and indexing to optimize the query that fetches the records not updated in the last k days for a user.
Is there a better way to solve this problem?
You have said you need to "find" and, I think, "expire" the records belonging to a particular user after a certain number of days.
Look, this can be done even in a large table with good indexing without too much trouble. I promise you, partitioning the table will be a lot of trouble. You have asserted that it's too expensive in your application to carry an index on your last_used column because of updates. But, considering the initial and ongoing expense of maintaining a partitioned table, I strongly suggest you prove that assertion first. You may be wrong about the cost of maintaining indexes.
(Updating one row with a column that's indexed doesn't rebuild the index, it modifies it. The MySQL storage engine developers have optimized that use case, I promise you.)
As I am sure you know, this query will retrieve old records for a particular user.
SELECT product_id
FROM tbl
WHERE user_id = <<<chosen user>>>
AND last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
will yield your list of products. This will work very efficiently indeed if you have a compound covering index on (user_id, last_used, product_id). If you don't know what a compound covering index is, you really should find out using your favorite search engine. This one will random-access the particular user and then do a range scan on the last_used date. It will then return the product ids from the index.
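If that index is not already in place, it could be created like this (the index name is arbitrary):
ALTER TABLE tbl ADD INDEX idx_user_lastused_product (user_id, last_used, product_id);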
If you want to get rid of all old records, I suggest you write a host program that repeats this query in a loop until you find that it has processed zero rows. Run this at an off-peak time in your application. The LIMIT clause will prevent each individual query from taking too long and interfering with other uses of the table. For the sake of speed on this query, you'll need an index on last_used.
DELETE FROM tbl
WHERE last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
LIMIT 500
I hope this helps. It comes from someone who's made the costly mistake of trying to partition something that didn't need partitioning.
MySQL doesn't "rebuild" indexes (not completely) when you modify an indexed value. In fact, it doesn't even reorder the records. It just moves the record to the proper 16KB page.
Within a page, the records are in the order they were added. If you inserted in order, then they're in order, otherwise, they're not.
So, when they say that MySQL's clustered indexes are in physical order, it's only true down to the page level, but not within the page.
Clustered indexes still get the benefit that the page data is on the same page as the index, so no further lookup is needed if the row data is small enough to fit in the pages. Reading is faster, but restructuring is slower because you have to move the data with the index. Secondary indexes are much faster to update, but to actually retrieve the data (with the exception of covering indexes), a further lookup must be made to retrieve the actual data via the primary key that the secondary index yields.
Example
Page 1 might hold user records for people whose last name start with A through B. Page 2 might hold names C through D, etc. If Bob renames himself Chuck, his record just gets copied over from page 1 to page 2. His record will always be put at the end of page 2. The keys are kept sorted, but not the data they point to.
If the page becomes full, MySQL will split the page. In this case, assuming even distribution between C and D, page 1 will be A through B, page 2 will be C, and page 3 will be D.
When a record is deleted, the space is compacted, and if the page becomes less than half full, MySQL will merge neighboring pages and possibly free up a page in between.
All of these changes are buffered, and MySQL does the actual writes when it's not busy.
The example works the same for both clustered (primary) and secondary indexes, but remember that with a clustered index, the keys point to the actual table data, whereas with a secondary index, the keys point to a value equal to the primary key.
Summary
After a while, page splitting caused by random inserts will cause the pages to become noncontiguous on disk. The table will become "fragmented". Optimizing the table (rebuilding the table/index) fixes this.
There would be no benefit in deleting then reinserting the record. In fact, you'll just be adding transactional overhead. Let MySQL handle updating the index for you.
Now that you understand indexes a bit more, perhaps you can make a better decision of how to optimize your database.

A very interesting MySQL problem (related to indexing, millions of records, algorithms)

This problem is pretty hard to describe and therefore difficult to search for an answer to. I hope some expert could share their opinion on it.
I have a table with around 1 million of records. The table structure is similar to something like this:
items{
uid (primary key, bigint, 15)
updated (indexed, int, 11)
enabled (indexed, tinyint, 1)
}
The scenario is like this: I have to select all of the records every day and do some processing. It takes around 3 seconds to handle each item.
I have written a PHP script to fetch 200 items each time using the following:
select * from items where updated > unix_timestamp(now()) - 86400 and enabled = 1 limit 200;
I will then update the "updated" field of the selected items to make sure that they won't be selected again within one day. The update query is something like this:
update items set updated = unix_timestamp(now()) where uid in (1,2,3,4,...);
Then the PHP script continues to run and process the data, which doesn't require any MySQL connection anymore.
Since I have a million records and each record takes 3 seconds to process, it's definitely impossible to do it sequentially. Therefore, I execute the PHP script every 10 seconds.
However, as time goes by and the table grows, the select gets much slower. Sometimes it takes more than 100 seconds to run!
Do you guys have any suggestion how may I solve this problem?
There are two points that I can think of that should help:
a. unix_timestamp(now()) - 86400
... this will evaluate now() for every single row; make it a constant by setting a variable to that value before each run (see the sketch after point b).
b. Indexes help reads but can slow down writes
Consider dropping indexes before updating (DISABLE KEYS) - and then re-add them before reading (ENABLE KEYS).
I don't think the index on enabled is doing you any good; the cardinality is too low. Remove that and your UPDATEs should go faster.
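For point (a), a minimal sketch of computing the cutoff once per run in a user variable (the variable name is arbitrary; the SELECT mirrors the one in the question):
SET @cutoff = UNIX_TIMESTAMP(NOW()) - 86400;
SELECT * FROM items WHERE updated > @cutoff AND enabled = 1 LIMIT 200;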
I am not sure what you mean when you say each record takes 3 seconds, since you are handling them in batches of 200. How are you determining this, and what other processing is involved?
You could do this:
dispatcher.php: Manages the whole process.
fetches items in convenient packages from the database
calls worker.php on the same server with an HTTP post containing all UIDs fetched (I understand that worker.php would not need more than the UID to do its job)
maintains a counter of how many worker.php scripts are running. When one is started, the counter increments up to a certain limit; when a worker returns, the counter is decremented. See "Asynchronous PHP calls?".
repeats until all records are fetched once. Maintain a MySQL LIMIT counter and do not work with updated.
worker.php: does the actual work
does its thing with each item posted.
writes to a helper table the ID of each item it has processed (no index on that table)
dispatcher.php: housekeeping.
once all workers have returned, updates the main table with the helper table in a single statement (see the sketch after this list)
error recovery
since worker.php would update the helper table after each item done, you can use the state of the helper table to recover from a crash. Saving the "work package" of each individual worker before it starts running would help to recover worker states as well.
You would have a multi-threaded processing chain this way and could even distribute the whole thing across multiple machines.
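For the housekeeping step, the single-statement update might look like this, assuming the helper table is named processed_items and stores the processed uids (both assumptions):
UPDATE items i
JOIN processed_items p ON p.uid = i.uid
SET i.updated = UNIX_TIMESTAMP(NOW());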
You could try running this before the update:
ALTER TABLE items DISABLE KEYS;
and then when you're done updating,
ALTER TABLE items ENABLE KEYS;
That should rebuild the indexes much faster than updating them one record at a time would.
For a table with fewer than a couple of billion records, the primary key should be an unsigned int rather than a bigint.
One idea:
Use a HANDLER; that will improve your performance considerably:
http://dev.mysql.com/doc/refman/5.1/en/handler.html
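A minimal sketch of the HANDLER interface, which gives lower-overhead direct reads from the table; the batch size of 200 matches the question, and the loop around READ NEXT would live in the PHP script:
HANDLER items OPEN;
HANDLER items READ FIRST LIMIT 200;  -- first batch, in storage order
HANDLER items READ NEXT LIMIT 200;   -- repeat for each following batch
HANDLER items CLOSE;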

Should totals be denormalized?

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?
SELECT p.pageid, COUNT(*) AS viewcount
FROM pages p
INNER JOIN views v ON p.pageid = v.pageid
GROUP BY p.pageid
ORDER BY COUNT(*) DESC
LIMIT 10 OFFSET 0;
I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).
It depends on the level of information you are trying to maintain. If you want to record who viewed what and when, then the separate table is fine. Otherwise, a column for views is the way to go. Also, if you keep a separate column, you'll find that the table is locked more often, since each page view will try to update the column for its corresponding row.
SELECT pageid, COUNT(*) AS countCol FROM views
GROUP BY pageid ORDER BY countCol DESC
LIMIT 10 OFFSET 0;
Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
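A periodic recalculation of that totals table could be as simple as the following, assuming a table page_view_totals with pageid as its primary key (the table name is made up):
REPLACE INTO page_view_totals (pageid, viewcount)
SELECT pageid, COUNT(*) FROM views GROUP BY pageid;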
I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.
Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.
Alternatively you could set up a scheduled job to populate this information on a nightly basis, whenever your traffic is low, or at whatever interval suits you.
In this case you would be losing the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
--Kris
While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to have a COUNT for a single selected row; it is quite another to sort by the COUNT across ALL the rows.
Even if you have just 1,000 rows, each counted with some join, you can easily end up reading tens of thousands if not millions of rows.
It can be ok if you only call this occasionally, but it is very costly otherwise.
What you can do is to add a TRIGGER:
CREATE TRIGGER ins AFTER INSERT ON views FOR EACH ROW
UPDATE pages
SET views = views + 1
WHERE pageid = NEW.pageid;