Finding records not updated in last k days efficiently - mysql

I have a table which contains records from the last n days. There are around 100 million records in this table. I need to find the records which have not been updated in the last k days.
My solution to this problem is:
Partition the table on k1. Index the timestamp column. Then, instead of updating the timestamp (so that the index is not rebuilt), perform a remove + insert. By doing this, I think the query to find the records not updated in the last k days will be fast.
Is there any other better way to optimize these operations?
For example,
Suppose we have many users and each user can use different products. A user can also start using (become the owner of) new products at any time. If a user does not use a product for n days, his ownership expires. Now we need to find all the products for a user which have not been used by him in the last k days. The number of users is on the order of 10,000 and the number of products from which a user can choose is on the order of 100,000.
I modeled this problem using a table with schema (user_id, product_id, last_used). product_id is the id of the product the user is using. Whenever a user uses the product, last_used is updated. Also, a user's ownership of a product expires if he does not use it for n days. I partitioned the table on user_id and indexed last_used (a timestamp). Also, instead of updating, I performed delete + create. I did the partitioning and indexing to optimize the query that fetches the records not updated in the last k days for a user.
Is there a better way to solve this problem?

You have said you need to "find" and, I think "expire" the records belonging to a particular user after a certain number of days.
Look, this can be done even in a large table with good indexing without too much trouble. I promise you, partitioning the table will be a lot of trouble. You have asserted that it's too expensive in your application to carry an index on your last_used column because of updates. But, considering the initial and ongoing expense of maintaining a partitioned table, I strongly suggest you prove that assertion first. You may be wrong about the cost of maintaining indexes.
(Updating one row with a column that's indexed doesn't rebuild the index, it modifies it. The MySQL storage engine developers have optimized that use case, I promise you.)
As I am sure you know, this query will retrieve old records for a particular user.
SELECT product_id
FROM tbl
WHERE user_id = <<<chosen user>>>
AND last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
This query will yield your list of products, and it will work very efficiently indeed if you have a compound covering index on (user_id, last_used, product_id). If you don't know what a compound covering index is, you really should find out using your favorite search engine. This query will random-access the particular user and then do a range scan on the last_used date. It will then return the product ids straight from the index.
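For reference, such a covering index could be created like this (the table name tbl and the index name simply follow the example query above):
ALTER TABLE tbl ADD INDEX idx_user_lastused_product (user_id, last_used, product_id);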
If you want to get rid of all old records, I suggest you write a host program that repeats this query in a loop until you find that it has processed zero rows. Run this at an off-peak time in your application. The LIMIT clause will prevent each individual query from taking too long and interfering with other uses of the table. For the sake of speed on this query, you'll need an index on last_used.
DELETE FROM tbl
WHERE last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
LIMIT 500
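If you would rather keep that loop inside the database than in a host program, a stored-procedure sketch of the same idea might look like this (the procedure name is an assumption; ROW_COUNT() reports how many rows the last DELETE removed):
DELIMITER //
CREATE PROCEDURE purge_old_rows(IN k INT)
BEGIN
  REPEAT
    DELETE FROM tbl
    WHERE last_used <= CURRENT_DATE() - INTERVAL k DAY
    LIMIT 500;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END //
DELIMITER ;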
I hope this helps. It comes from someone who's made the costly mistake of trying to partition something that didn't need partitioning.

MySQL doesn't "rebuild" indexes (not completely) when you modify an indexed value. In fact, it doesn't even reorder the records. It just moves the record to the proper 16KB page.
Within a page, the records are in the order they were added. If you inserted in order, then they're in order, otherwise, they're not.
So, when they say that MySQL's clustered indexes are in physical order, it's only true down to the page level, but not within the page.
Clustered indexes still get the benefit that the page data is on the same page as the index, so no further lookup is needed if the row data is small enough to fit in the pages. Reading is faster, but restructuring is slower because you have to move the data with the index. Secondary indexes are much faster to update, but to actually retrieve the data (with the exception of covering indexes), a further lookup must be made to retrieve the actual data via the primary key that the secondary index yields.
Example
Page 1 might hold user records for people whose last name start with A through B. Page 2 might hold names C through D, etc. If Bob renames himself Chuck, his record just gets copied over from page 1 to page 2. His record will always be put at the end of page 2. The keys are kept sorted, but not the data they point to.
If the page becomes full, MySQL will split the page. In this case, assuming even distribution between C and D, page 1 will be A through B, page 2 will be C, and page 3 will be D.
When a record is deleted, the space is compacted, and if the page becomes less than half full, MySQL will merge neighboring pages, possibly freeing up a page in between.
All of these changes are buffered, and MySQL does the actual writes when it's not busy.
The example works the same for both clustered (primary) and secondary indexes, but remember that with a clustered index, the keys point to the actual table data, whereas with a secondary index, the keys point to a value equal to the primary key.
Summary
After a while, page splitting caused by random inserts will cause the pages to become noncontiguous on disk. The table becomes "fragmented". Optimizing the table (rebuilding the table/index) fixes this.
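For an InnoDB table, that rebuild is typically done with OPTIMIZE TABLE (the table name is a placeholder); InnoDB implements it by recreating the table and its indexes, so run it at an off-peak time:
OPTIMIZE TABLE tbl;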
There would be no benefit in deleting then reinserting the record. In fact, you'll just be adding transactional overhead. Let MySQL handle updating the index for you.
Now that you understand indexes a bit more, perhaps you can make a better decision of how to optimize your database.

Related

SQL Index to Optimize WHERE Query

I have a Postgres table with several columns; one column is the datetime that the row was last updated. My query is to get all the updated rows between a start and end time. My understanding is that this query should use WHERE conditions instead of BETWEEN. The basic query is as follows:
SELECT * FROM contact_tbl contact
WHERE contact."UpdateTime" >= '20150610' and contact."UpdateTime" < '20150618'
I am new at creating SQL queries, and I believe this query is doing a full table scan. I would like to optimize it if possible. I have placed a normal index on the UpdateTime column, which takes a long time to create, but with this index the query is faster. One thing I am not sure about is whether I have to keep recalculating this index if the table gets bigger or columns get changed. Also, I am considering a CLUSTERED index on the UpdateTime column, but I wanted to ask first whether there is a canonical way of optimizing this and whether I am on the right track.
Placing an index on UpdateTime is correct. It will allow the index to be used instead of full table scans.
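For reference, such an index can be created like this (the index name is a placeholder; the quoted column name matches the query above):
CREATE INDEX idx_contact_updatetime ON contact_tbl ("UpdateTime");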
Two WHERE conditions like the above and the BETWEEN keyword behave exactly the same:
http://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#operator_between
BETWEEN is just "syntactical sugar" for those that like that syntax better.
Indexes allow for faster reads, but slow down writes (because like you mention, the new data has to be inserted into the index as well). The entire index does not need to be recalculated. Indexes are smart data structures, so the extra data can be added without a lot of extra work, but it does take some.
You're probably doing many more reads than writes, so using an index is a good idea.
If you're doing lots of writes and few reads, then you'd want to think a bit more about it. It would then come down to business requirements. Although overall the throughput may be slowed, read latency may not be a requirement but write latency may be, in which case you wouldn't want the index.
For instance, think of this lottery example: Every time someone buys a ticket, you have to record their name and the ticket number. However, the only time you ever have to read that data is after the one and only drawing, to see who had the winning ticket number. In this database, you wouldn't want to index the ticket number, since there will be so many writes and very few reads.

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during SELECT or UPDATE queries? (The row count is the same, and I will update/select the same number of columns in both cases.)
For example: I have a database to store user details and their last active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id, f_name, l_name, email, mobile, verified_status). Is it a good idea to store the last active time in the same table too? Or is it better to make a separate table (say, user_active) to store the last activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication, as the user_active table's columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables -- that is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for the display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
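As a concrete sketch of the single-table approach (column types, the index name, and the "active within 10 minutes" window are all assumptions):
CREATE TABLE userinfo (
  id INT PRIMARY KEY,
  f_name VARCHAR(50),
  l_name VARCHAR(50),
  email VARCHAR(100),
  mobile VARCHAR(20),
  verified_status TINYINT,
  last_active TIMESTAMP,
  INDEX idx_last_active (last_active)
);
-- show currently active users:
SELECT id, f_name FROM userinfo WHERE last_active > NOW() - INTERVAL 10 MINUTE;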
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL utilizes a binary search so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).

Move inactive rows to another table?

I have a table where when a row is created, it will be active for 24 hours with some writes and lots of reads. Then it becomes inactive after 24 hours and will have no more writes and only some reads, if any.
Is it better to keep these rows in the table or move them when they become inactive (or via batch jobs) to a separate table? Thinking in terms of performance.
This depends largely on how big your table will get, but if it grows forever, and has a significant number of rows per day, then there is a good chance that moving old data to another table would be a good idea. There are a few different ways you could accomplish this, and which is best depends on your application and data access patterns.
Essentially as you said, when a row becomes "old", INSERT to the archive table, and DELETE from the current table.
Create a new table every day (or perhaps every week, or every month, depending on how big your dataset is), and never worry about moving old rows. You'll just have to query old tables when accessing old data, but for the current day, you only ever access the current table.
Have a "today" table and a "all time" table. Duplicate the "today" rows in both tables, keeping them in sync with triggers or other mechanisms. When a row becomes old, simply delete from the "today" table, leaving the "all time" row in tact.
One advantage to #2, that may not be immediately obvious, is that I believe MySQL indexes can be optimized for read-only tables. So by having old tables that are never written to, you can take advantage of this extra optimization.
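A minimal sketch of the first approach (INSERT to the archive table, then DELETE from the current table; table and column names are assumptions):
SET @cutoff = NOW() - INTERVAL 24 HOUR;
INSERT INTO tbl_archive SELECT * FROM tbl WHERE created_at < @cutoff;
DELETE FROM tbl WHERE created_at < @cutoff;
Capturing the cutoff once in @cutoff keeps the two statements consistent with each other, so no row crosses the 24-hour boundary between the copy and the delete.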
Generally, moving rows between tables should not be necessary in a proper RDBMS.
I'm not familiar with mysql specifics, but you should do fine with the following:
Make sure your timestamp column is indexed
In addition, you can use active BOOLEAN default true column
Make a batch run every day to mark >24h old rows inactive
Use a partial index for timestamp column so only rows marked active are indexed
Remember to have timestamp and active = TRUE in your where conditions to hit indexes. Use EXPLAIN a lot.
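A sketch of those steps (this answer is not MySQL-specific; the partial index below uses PostgreSQL syntax, since MySQL does not support partial indexes, and all names are assumptions):
ALTER TABLE tbl ADD COLUMN active BOOLEAN NOT NULL DEFAULT TRUE;
CREATE INDEX idx_tbl_active_created ON tbl (created_at) WHERE active;
-- daily batch job:
UPDATE tbl SET active = FALSE WHERE active AND created_at < NOW() - INTERVAL '24 hours';
Queries then include both conditions, e.g. WHERE active AND created_at > ..., and EXPLAIN confirms whether the partial index is actually used.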
That all depends on the balance between ease of programming, and performance. Performance wise, yes it will definitely be faster. But whether the speed increase is worth the effort is hard to say.
I've worked on systems that run perfectly fine with millions of rows. However, if the data is ever growing it does eventually become a problem.
I've worked on a database storing transaction logging for automated equipment. It generates hundreds of thousands of events per day. After a year, the queries just wouldn't run at acceptable speeds any more. We now keep the last month's worth of logs in the main table (millions of rows still), and move older data to archive tables.
None of the application's functionality ever looks in the archive table (if you do a query of the transaction log, it will return no results). It is only really kept for emergency use, and is just queried with any standalone database query tool. Because the archive has well over a hundred million rows, and the nature of this emergency use is generally unplannable (and therefore mostly un-indexed) queries, they can take a long time to run.
There is another solution: have another table containing only the active records (tblactiverecords). When the number of active records is really small, you can just do an inner join to get the active records. This should take very little time because primary keys are indexed by default in MySQL. As your rows become inactive, you delete them from the tblactiverecords table.
create table tblrecords (id int primary key, data text);
Then,
create table tblactiverecords (tblrecords_id int primary key);
you can do
select data from tblrecords join tblactiverecords on tblrecords.id = tblactiverecords.tblrecords_id;
to get all data that are active.

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason to not do this, but how about three fields in a thirty-field table? 10 in a 30 field? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many or too large indexes, the DB will have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every piece of data inserted/deleted/updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
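If you want to see how much space your tables and their indexes actually consume, MySQL exposes this in information_schema (sizes are reported in bytes):
SELECT table_name,
       data_length  AS table_bytes,
       index_length AS index_bytes
FROM information_schema.tables
WHERE table_schema = DATABASE();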
You have to balance CRUD needs. Writing to tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing takes up more allocated space, both on disk and in RAM, but it also improves performance a lot. Unfortunately, when it reaches the memory limit, the system falls back to disk and performance suffers. Practically, you shouldn't index any field that isn't involved in some kind of data traversal, whether inserting or searching (WHERE clauses), but you should index the ones that are. Fields that are queried only occasionally by a moderator are candidates to leave unindexed, unless those queries need to be fast too.
It is not a good idea to index all the columns in a table. While this would make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed means putting the new record in the table and then putting each column's value into its own index.
This answer is my personal opinion; I'm using mathematical reasoning to answer.
The second question was about where to draw the line. First, let's do some rough math. Suppose we have N rows with L fields in a table. If we index all the fields, we get L new index structures, each sorting the data of one field in a meaningful way. At first glance, if your table weighs W, it will become roughly 2W (1 terabyte becomes 2 terabytes). If you have 100 big tables (I have worked on a project where the table count was around 1,800), you will waste 100 times that space (100 terabytes), which is far from wise.
If we index everything, we also have to think about index maintenance: one update triggers an update of every index, which in time is roughly the equivalent of an unordered select over everything.
From this I conclude that if you are going to lose that time anyway, it is preferable to lose it on a select rather than on an update, because selecting on a field that is not indexed does not trigger extra work on all the other fields that are not indexed.
What to index?
Foreign keys: indexing them is a must.
Primary key: I'm not yet sure about it; maybe someone reading this can help with this case.
Other fields: the first natural answer is half of the remaining fields. Why: if you should have indexed more, you're not far from the best answer; if you should have indexed less, you're also not far, because we know that having no indexes is bad and having everything indexed is also bad.
From these three points I conclude that if we have L fields of which K are keys, the limit should be somewhere near ((L-K)/2)+K, plus or minus L/10.
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP and its underlying database tables, we can create one index table for all the required index fields, holding only their addresses. Other SQL-based database systems could likewise use one table for all the fields to be indexed.
Secondly, how much does write performance really matter? Say a company records 50 sales orders in one day, and assume there is a sales order header table VBAK with, for example, 30 fields of 20 characters each.
I can write to the real table in seconds, and the index table can be maintained in the background. If a report is run at the same time, the database logic can notice that an index write is still in progress and wait for it to finish (say 5 sales orders are being recorded at the same time and take maybe 5 seconds), so a running report might wait 5 seconds and then run for 5 seconds, 10 seconds in total.
Without the index, a running report does not wait those 5 seconds for the write, but it might run for 40 seconds instead.
So what is the real cost of write performance? No one writes thousands of records at the same time, but plenty of people read them.
And reading from a second table means the fields are already sorted: if I have 3 fields selected, I can find which sorted sets I need to search and then fetch the data. As for RAM and memory, it is just a copied index table with only one piece of data per field, an address. What memory cost?
I think this is one of the secrets software companies keep from their customers, so as not to wake them up; otherwise the customers would not need to buy another expensive system in the future.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 million records have status = 'enabled' and 150 million records have status = 'disabled'
My understanding is that, without an index on status, a select with where status='enabled' would result in a full table scan, with 300 million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the "where status="enabled" clause in the given example with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: you have a where clause of where status = 'enabled'. The index will return 150M rows, and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient, larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
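For example, you could time the two access paths against each other (the table and index names are placeholders for whatever exists in your schema):
SELECT COUNT(*) FROM tbl FORCE INDEX (idx_status) WHERE status = 'enabled';
SELECT COUNT(*) FROM tbl IGNORE INDEX (idx_status) WHERE status = 'enabled';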
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full record scans MySQL has to perform, thereby limiting IO, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index over 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create the index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
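A quick way to run that check (the table name is a placeholder, since the question does not give one):
EXPLAIN SELECT id FROM tbl WHERE status = 'disabled';
The key column of the output shows which index, if any, the optimizer chose.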
If the data is distributed 50:50, then a query like where status='enabled' with an index only avoids scanning half of the table.
Whether an index on such a column helps depends entirely on the distribution of the data. For example, if 90% of the entries have status enabled and 10% have status disabled, then a query with where status='disabled' scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
The answer by #a'r is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly ever need all 150 million records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname).
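For instance (the table and index names are assumptions):
ALTER TABLE tbl ADD INDEX idx_status_fullname (status, fullname);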
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor such as "limit 10", an index may help. Also, remember that indexes are used in GROUP BY and ORDER BY optimizations too. If you are doing "select count(*), status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
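A possible migration sketch, assuming nothing else depends on the string values (table name is a placeholder; this rewrites a 300M-row table, so run it off-peak):
ALTER TABLE tbl ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE tbl SET status_flag = IF(status = 'enabled', 1, 0);  -- 1 = enabled, 0 = disabled
ALTER TABLE tbl DROP COLUMN status;
ALTER TABLE tbl CHANGE status_flag status TINYINT NOT NULL DEFAULT 0;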
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.