I have a MySQL InnoDB table with a status column. The status can be 'done' or 'processing'. As the table grows, at most .1% of the status values will be 'processing,' whereas the other 99.9% of the values will be 'done.' This seems like a great candidate for an index due to the high selectivity for 'processing' (though not for 'done'). Is it possible to create an index for the status column that only indexes the value 'processing'? I do not want the index to waste an enormous amount of space indexing 'done.'
I'm not aware of any standard way to do this in MySQL, but we have solved a similar problem before by using two tables, Processing and Done in your case: the former with an index, the latter without.
Assuming that rows don't ever switch back from done to processing, here are the steps you can use:
When you create a record, insert it into the Processing table with the column set to processing.
When it's finished, set the column to done.
Periodically sweep the Processing table, moving done rows to the Done table.
That last one can be tricky. You can do the insert/delete in a transaction to ensure each row transfers properly, or you could use a unique ID to detect whether a row has already been transferred and then just delete it from Processing (I have no experience with MySQL transaction support, which is why I'm also giving that option).
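For the transactional variant, here is a minimal sketch of the sweep, assuming an id primary key and a payload column (adapt the column list to your schema):

START TRANSACTION;
INSERT INTO Done (id, payload, status)
    SELECT id, payload, status FROM Processing WHERE status = 'done';
-- delete only the rows that were actually copied, so rows that flip to
-- 'done' between the two statements are left for the next sweep
DELETE p FROM Processing AS p JOIN Done AS d ON p.id = d.id;
COMMIT;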
That way, you're only indexing a small fraction of the rows: the ones that have yet to be transferred to the Done table, rather than the 99.9% that are done. It will also work with multiple processing states, as you have alluded to in comments (entries are only transferred when they hit the done state; all other states stay in the Processing table).
It's akin to having historical data (stuff that will never change again) transferred to a separate table for efficiency. It can complicate some queries where you need access to both done and non-done rows since you have to join two tables so be aware there's a trade-off.
Better solution: don't use strings to indicate statuses. Instead, use constants in your code with descriptive names mapped to integer values. The integer is what gets stored in the database, and MySQL will handle it a lot faster than strings: comparisons are cheaper and any index on the column is smaller.
I don't know what language you use, but for example in PHP:
class Member
{
    const STATUS_ACTIVE = 1;
    const STATUS_BANNED = 2;
}
if ($member->getStatus() == Member::STATUS_ACTIVE)
{
}
instead of what you have now:
if ($member->getStatus() == 'active')
{
}
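On the MySQL side the status column then becomes a small integer; a minimal sketch, assuming a table named member with a status column (your names will differ):

-- assumed table and column names; the status becomes a 1-byte integer
ALTER TABLE member
    MODIFY status TINYINT UNSIGNED NOT NULL DEFAULT 1;

-- queries compare against the constant's integer value
SELECT COUNT(*) FROM member WHERE status = 2;  -- Member::STATUS_BANNED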
Related
I have a system whereby users can input data into a mysql table from many sites across the globe.
The data is posted via AJAX to my table without issues. But I would like to improve my insertion code to prevent an insert if a matching row was inserted within some time interval. This would weed out duplicate rows in my table.
Before you get angry -> I do understand I can set a primary key to certain columns and prevent duplicate insertion.
In my use case, I need to allow duplicate numeric data when the values are genuinely duplicated across distinct submissions -> that is valid in my case. I would like to leverage the timestamp to weed out obvious double insertions where the same variables were submitted twice by accident.
I have tried to disable the button for 1-2 seconds, but this hasn't solved the problem entirely.
If I have columns: weight, height, country and the timestamp, I'd like to somehow check if there is an insert within n seconds of the timestamp where the post includes data that matches these variables. This would tell me that there is an accidental duplication from a user and I shouldn't insert it into the database.
I'm not too familiar with MYSQL, so I was hoping to get some guidance here.
Thanks.
There are different solutions, depending on the specifics of your case:
If you need to apply some rule that validates the new row using values inside the row itself, a CHECK constraint will do. Consider, though, that MySQL only enforces CHECK constraints starting in version 8.0.16; earlier versions parse but ignore them.
If you want to enforce a rule in relation to other rows, you can serialize the insertions into a queue. The consumer of the queue will validate the insertions one by one and will accept or reject them. Consider that serialization is not a good option for a massive level of insertions, since it produces a bottleneck (this may be your case, since you say insertions come from across the globe).
Alternatively, you can use optimistic insertion and always perform the insert with an intermediate status such as "waiting for validation". Another process (or processes) then validates the row: if all is good, the row is approved; if not, a compensation procedure is executed, in an a-la-microservices way.
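For the rule you describe in the question (reject a submission when a matching row already exists within the last n seconds), a conditional insert is a minimal sketch of the check itself; the table name measurements, its columns and the 5-second window are all assumptions:

INSERT INTO measurements (weight, height, country, created_at)
SELECT * FROM (SELECT 70.5 AS weight, 180 AS height, 'DE' AS country, NOW() AS created_at) AS new_row
WHERE NOT EXISTS (
    -- skip the insert if a matching submission arrived in the last 5 seconds
    SELECT 1 FROM measurements
    WHERE weight = 70.5
      AND height = 180
      AND country = 'DE'
      AND created_at > NOW() - INTERVAL 5 SECOND
);

Two requests arriving at almost the same instant can still both pass the NOT EXISTS check, which is why the serialized queue (or a suitable unique key) is needed to close that gap completely.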
Which one is your case?
I have a big load file that I downloaded. It contains records that I will have to load into the database. Based on the size of the data, it will likely take 2 weeks or more to finish (since there is preprocessing etc.). A coworker asked me to make what she called a delta file, which checks the current database to see if the data already exists based on a certain field in the database, and IFF it exists then we will keep that record in the load file, otherwise we will discard it.
I'm confused, because to implement this I would need to do a SELECT query for every record in the load file to check whether it exists. A SELECT would take O(n), I'm assuming, and then the insert (for a smaller data set) an additional O(1),
whereas a plain insert would just take O(1).
I'd like to 1) understand why this implementation is faster (in case I'm not understanding things properly) and 2) hear a possible way to implement this delta file if you can think of something smarter than what I suggested.
Thanks
Databases maintain indexes for the columns specified in the schema, and the way your data is indexed can make a massive difference in performance. Without an index, a select operation may be O(n); with a B-tree index it is O(log n), and with a hash index it can approach O(1).
Insert operations must maintain the index. For large data-loading operations you may be better off disabling index maintenance until the end, so you do a single index build over all the data instead of an index update for every record you insert.
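How you defer the index maintenance depends on the storage engine; a sketch with a hypothetical table name and file path:

-- MyISAM: suspend non-unique index maintenance during the bulk load
ALTER TABLE load_target DISABLE KEYS;
LOAD DATA INFILE '/tmp/big_load_file.csv'
    INTO TABLE load_target
    FIELDS TERMINATED BY ',';
ALTER TABLE load_target ENABLE KEYS;  -- indexes are rebuilt once, here

-- InnoDB: relax per-row checks for the session doing the load
SET unique_checks = 0;
SET foreign_key_checks = 0;
-- ... run the load ...
SET unique_checks = 1;
SET foreign_key_checks = 1;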
Some measurements I did the other day indicate that selects are faster than inserts in my situation. I came across this question because I am trying to learn whether this is generally true or reflects something specific about the way I have it set up.
I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but potentially other things too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!
I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not minimal. As the database grows, there are tens of millions of URIs. So just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in Hadoop, so the database table was, in essence, just a copy of a Hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. An index solution would work well up to the limit of your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know whether MySQL or SQL Server recognizes this, although columnar databases (such as Vertica) should.
SQL Server has another option. If you declare the string as VARCHAR(max), then it is stored on separate data pages from the rest of the data. During a full table scan, there is no need to load the additional pages into memory if the column is not referenced in the query.
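In MySQL, the "identify the right string during an INSERT" step can be done in a single round trip with INSERT ... ON DUPLICATE KEY UPDATE and the LAST_INSERT_ID(expr) trick; a sketch against the unique_string table from the question (the example value is made up), relying on the UNIQUE index on string:

INSERT INTO unique_string (string)
VALUES ('Mozilla/5.0 (X11; Linux x86_64)')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

-- returns the id of the existing or newly inserted row
SELECT LAST_INSERT_ID();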
This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table that it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table. You can include more than just a string on the lookup table, other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
Date
Time
IP Address
Browser_ID
Browser Table:
Browser_ID
Browser_Name
Browser_Version
Browser_Properties
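As DDL, a minimal sketch of that pair of tables (column types, sizes and index names are assumptions):

CREATE TABLE browser (
    browser_id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    browser_name       VARCHAR(255) NOT NULL,
    browser_version    VARCHAR(32)  NOT NULL,
    browser_properties VARCHAR(255) NULL,
    UNIQUE KEY uq_browser (browser_name, browser_version)
);

CREATE TABLE request (
    request_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    request_date DATE NOT NULL,
    request_time TIME NOT NULL,
    ip_address   VARBINARY(16) NOT NULL,  -- fits both IPv4 and IPv6
    browser_id   INT UNSIGNED NOT NULL,
    KEY idx_request_browser (browser_id),
    CONSTRAINT fk_request_browser FOREIGN KEY (browser_id) REFERENCES browser (browser_id)
);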
If you are planning on logging data in real time (as opposed to in a batch job), then you want to ensure the time to write a record to the database is as short as possible. If you are logging synchronously, then the record creation time will directly affect the time it takes for an HTTP request to complete. If it is asynchronous, then slow record creation times will lead to a bottleneck. However, if this is a batch job, then performance will not matter so long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record, you really want to flatten out your database structure. Your current query, in pseudocode, might look like:
SELECT #id = id FROM PagesTable
WHERE PageName = #RequestedPageName

IF #id = 0
THEN
    INSERT #RequestedPageName INTO PagesTable
    #id = SELECT @@IDENTITY  'or whatever method your db supports for
                             'fetching the id of a newly created record
END IF

INSERT #id, #BrowserName INTO BrowsersLogTable
Whereas in a flat structure you would just need one INSERT.
If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use that copy for querying against.
Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally, I don't think you have much choice in this regard: even if an INSERT were faster (which remains to be proven), can you convert an update into an insert?! Frankly, I don't think you can do that all the time.
During an INSERT you don't usually have to use a WHERE clause to identify a row the way you do for an UPDATE, but depending on the indices on that table the operation can still have some cost.
During an UPDATE, if you do not change any column included in any index, you could get quick execution, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and really I would imagine it depends on the whole database setup, the indices and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform large-scale processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Performing large-scale updates versus creating new tables and indexes has proven to reduce my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
In the first option, I created the billing table and then, on every customer summary call, I updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option, I created 4 phases.
Each phase reads the previous table(s), creates an index (after the inserts into that table have completed) and, using INSERT INTO ... SELECT ..., creates the table for the next phase.
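A minimal sketch of one such phase; the table and column names are made up, the point is the bulk insert followed by a single index build:

-- build the next phase's table with one bulk INSERT ... SELECT
CREATE TABLE billing_phase2 AS
SELECT customer_id,
       SUM(duration) AS duration,
       SUM(price)    AS price,
       SUM(discount) AS discount
FROM billing_phase1
GROUP BY customer_id;

-- create the index only after the bulk insert has completed
CREATE INDEX idx_phase2_customer ON billing_phase2 (customer_id);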
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
If there was a problem in the middle of the process, I could restart from the point where it failed, since all the tables from the start of that phase were ready; if the process fails under the first option, you need to start the whole process all over again.
It made development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS, to be precise), and we should evaluate which of the two results in the minimum of that.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed: insertion means inserting a new key, and update means updating the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because the update involves an indexed lookup and changes an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, would involve two disk writes, one for the index and one for the data. And another hidden cost is B-tree node splitting and new node creation, which happen in the background during insertion and lead to more disk accesses on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with the schema definition) and we will explain which one costs more and why. You can also compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field, which is not indexed, you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT that also affects only one row but whose row has many NOT NULL constrained, indexed fields, since all of those indexes have to be maintained (e.g. by adding a new leaf entry).
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changes to the key, it will most likely only cost about as much as a search, and then it will probably cost less than an insert, unless the database is organized as a heap.
That is the only thing I can state, because performance greatly depends on the database organization used.
If, for example, you use MyISAM, which I suppose is organized like an ISAM, an insert should generally cost about the same in terms of read accesses but will require some additional write operations.
On Sybase / SQL Server, an update which impacts an indexed column is internally replaced by a delete followed by an insert, so it is obviously slower than a plain insert. I do not know the implementation for other engines, but I think this is a common strategy, at least when indices are involved.
For tables without indices (or for update requests not involving any index), I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can turn your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE. When a row with a=1 already exists (and a is a unique key), the two statements below have the same effect:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
A lot of people here are commenting that you can't compare an insert vs. an update, but I disagree. People should understand that an update takes a lot more resources than an insert, or even than a delete plus an insert.
As for how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement, which takes a lot more resources, I can resolve this in the backend, which is faster, and just do a single insert statement instead of an insert and then an update. The update statement also locks the table, causing a traffic jam, so to speak.
I want to start counting the numbers of times a webpage is viewed and hence need some kind of simple counter. What is the best scalable method of doing this?
Suppose I have a table Frobs where each row corresponds to a page - some obvious options are:
Have an unsigned int NumViews field in the Frobs table which gets updated upon each view using UPDATE Frobs SET NumViews = NumViews + 1. Simple, but not so good at scaling as I understand it.
Have a separate table FrobViews where a new row is inserted for each view. To display the number of views, you then need to do a simple SELECT COUNT(*) AS NumViews FROM FrobViews WHERE FrobId = '%d' GROUP BY FrobId. This doesn't involve any updates, so it can avoid table locking in MyISAM tables; however, the read performance will suffer if you want to display the number of views on each page.
How do you do it?
There's some good advice here:
http://www.mysqlperformanceblog.com/2007/07/01/implementing-efficient-counters-with-mysql/
but I'd like to hear the views of the SO community.
I'm using InnoDb at the moment, but am interested in answers for both InnoDb and MyISAM.
If scalability is more important to you than absolute accuracy of the figures then you could cache the view count in your application for a short time rather than hitting the database on every page view - eg, only update the database once every 100 views.
If your application crashes between database updates then obviously you'll lose some of your data, but if you can tolerate a certain amount of inaccuracy then this might be a useful approach.
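The periodic flush is then a single statement per counter that applies the cached delta; a sketch using the Frobs table from the question and a made-up page id and batch size:

-- apply 100 cached views for page 42 in one write
UPDATE Frobs SET NumViews = NumViews + 100 WHERE FrobId = 42;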
Inserting into a Database is not something you want to do on page views. You are likely to run into problems with updating your slave databases with all of the inserts since replication is single threaded on MySQL.
At my company we serve 25M page views a day and we have taken a tiered approach.
The view counter is stored in a separate table with 2 columns (profileId, viewCounter) both are unsigned integers.
For items that are infrequently viewed we update the table on page view.
For frequently viewed items we update MySQL about 1/10 of the time. For both types we update Memcache on every hit.
int Memcache::increment ( string $key [, int $value = 1 ] )
if (pageViews < 10000) { UPDATE page_view SET viewCounter = viewCounter + 1 WHERE profileId = ? }
else if (rand(1, 10) == 1) { UPDATE page_view SET viewCounter = :cache_value WHERE profileId = ? }
Doing COUNT(*) is very inefficient in InnoDB (MyISAM keeps an exact row count per table, but that only helps COUNT(*) without a WHERE clause), while MyISAM's table-level locking reduces concurrency. Doing a COUNT() over 50,000 or 100,000 rows is going to take a long time. Doing a select on a PK will be very fast.
If you require more scalability, you might want to look at redis
I would take your second approach and aggregate the data into the table from your first solution on a regular basis. This way you get the advantages of both solutions. To be clearer:
On every hit you insert a row into a table (let's name it hit_counters). This table has only one field (the pageid). Every x seconds you run a script (via a cron job) which aggregates the data from the hit_counters table and puts it into a second table (let's name it hits). There you have two fields: the pageid and the total hits.
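The cron job's aggregation step could look like the sketch below; it assumes hit_counters also has an auto-increment id column and that hits has a primary key on pageid, so hits arriving while the job runs are not lost:

-- remember how far this run aggregates
SELECT MAX(id) INTO @max_id FROM hit_counters;

INSERT INTO hits (pageid, total_hits)
SELECT pageid, COUNT(*)
FROM hit_counters
WHERE id <= @max_id
GROUP BY pageid
ON DUPLICATE KEY UPDATE total_hits = total_hits + VALUES(total_hits);

-- remove only what was just aggregated
DELETE FROM hit_counters WHERE id <= @max_id;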
I'm not sure, but IMHO InnoDB does not help you very much with solution 1 if you get many hits on the same page: InnoDB locks the row while updating, so all other updates to this row will be delayed.
Depending on what your program is written in, you could also batch the updates together by counting in your application and updating the database only every x seconds. This only works if you use a runtime with persistent in-memory state (like Java servlets, but not plain PHP).
What I do, and it may not apply to your scenario, is that in the stored procedure that prepares and returns the data displayed on the page, I update the counter at the same time the data is returned. That way there is only one call to the server that both gets the data and updates the counter.
If you are not using stored procedures (or if there is no database data on your page), this option may not be available to you, but if you are, it's something to consider.