Efficient way to add up usage statistics using NoSQL - mysql

I am working on a system that tracks usage in the form of start and stop events for various things. In my current implementation, I use MySQL, and each row of my table contains the start and stop timestamps, plus a unique ID that points to information about the event.
Running an aggregate query in MySQL to take a total of the difference of the stop and start times is very easy and relatively fast, and this returns the total minutes of usage.
I am trying to see how this would translate to NoSQL and wanted some suggestions for the best way to implement this in a performant way.
Since NoSQL seems like it doesn't really support this sort of calculation out of the box, I would have to ship a whole bunch of data to my client and do the calculations which would be extremely slow. One idea is to pre-compute the differences at insert time (basically denormalizing) which would create redundant data but make the subtractions faster.
The next problem is the additions, and that could be done for the simple aggregate case by maintaining a counter of this total sum (actually in this case I might not even need the pre-computed differences). However, the problem is that in reality I need to generate this usage across different slices of my data, so pre-computing would be difficult to do. I guess it would be possible to pre-compute a bunch of common sums, say ten or so, but then it seems the insert times would be slowed significantly because this logic has to be done for each insert. And to me one of the biggest advantages of NoSQL is the very small insert time even for large datasets.
If anyone has any suggestions please let me know.
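For reference, the aggregate described in the question, and the same sum over one slice of the data, can be sketched as below. This is a minimal sketch, not the asker's schema: the table `usage_event`, its columns, and the per-device slice are assumptions, and Python's built-in `sqlite3` stands in for MySQL so the example runs on its own.

```python
import sqlite3

# Hypothetical schema: one row per usage event, timestamps in epoch seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_event ("
    " id INTEGER PRIMARY KEY,"
    " device_id INTEGER NOT NULL,"
    " start_ts INTEGER NOT NULL,"
    " stop_ts INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO usage_event (device_id, start_ts, stop_ts) VALUES (?, ?, ?)",
    [(1, 0, 600), (1, 1000, 1300), (2, 0, 1200)],
)


def total_minutes(conn):
    """Total usage in minutes across all events."""
    (seconds,) = conn.execute(
        "SELECT COALESCE(SUM(stop_ts - start_ts), 0) FROM usage_event"
    ).fetchone()
    return seconds / 60


def minutes_by_device(conn):
    """Usage in minutes per device: one possible 'slice' of the data."""
    rows = conn.execute(
        "SELECT device_id, SUM(stop_ts - start_ts) / 60.0"
        " FROM usage_event GROUP BY device_id ORDER BY device_id"
    )
    return dict(rows)
```

A different slice only changes the `GROUP BY` column, which is why no per-slice counters need to be maintained at insert time in the SQL version.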

Let's fix the MySQL insert times.
Batch inserts -- An INSERT with 100 rows runs 10 times as fast as 100 1-row INSERTs. Or use LOAD DATA.
(Assuming InnoDB) -- innodb_flush_log_at_trx_commit = 2 -- potentially a significant speedup.
Minimize the number of indexes.
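The batching advice can be sketched as follows. This is only an illustration under assumptions: the table `usage_event` and the helper `insert_batched` are hypothetical, and `sqlite3` stands in for MySQL (the multi-row `VALUES` syntax is the same in both).

```python
import sqlite3


def insert_batched(conn, rows, batch_size=100):
    """Insert (start_ts, stop_ts) pairs using one multi-row INSERT per batch."""
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # One "(?, ?)" group per row, so the whole batch is a single statement.
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        with conn:  # one transaction (and one commit) per batch
            conn.execute(
                "INSERT INTO usage_event (start_ts, stop_ts) VALUES " + placeholders,
                params,
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_event (id INTEGER PRIMARY KEY, start_ts INTEGER, stop_ts INTEGER)"
)
events = [(t, t + 60) for t in range(0, 250 * 100, 100)]  # 250 sample events
insert_batched(conn, events)
```

With 250 sample rows and a batch size of 100, this issues three INSERT statements instead of 250.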

Related

Running a cron to update 1 million records in every hour fails

We have an e-commerce system with more than 1 million users and a total of 4 to 5 million records in the order table. We use the CodeIgniter framework as the back end and MySQL as the database.
Due to this large number of users and purchases, we use cron jobs to update the order details and referral bonus points every hour to keep things working.
Now we have a situation where these data updates take longer than one hour, so the next batch of updates starts before the previous one has finished, thereby leading to a deadlock and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get rid of this situation. We are using only the monolithic architecture to run this application.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would have only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERT. (One INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but can be done with IODKU (INSERT ... ON DUPLICATE KEY UPDATE).
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
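A batched update via IODKU might look roughly like the sketch below. Everything named here is an assumption for illustration: the table `user_bonus`, the helper `upsert_bonus_points`, and the rule of adding points to the existing total. MySQL spells the statement `INSERT ... ON DUPLICATE KEY UPDATE`; the sketch uses SQLite's equivalent `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or later) so that it runs with Python's standard library.

```python
import sqlite3


def upsert_bonus_points(conn, rows, batch_size=100):
    """Apply (user_id, points) pairs in batches, adding to any existing total."""
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        # "with conn" commits the batch, or rolls it back and re-raises on error,
        # so a failed batch is never half-applied.
        with conn:
            conn.execute(
                "INSERT INTO user_bonus (user_id, points) VALUES " + placeholders +
                " ON CONFLICT(user_id) DO UPDATE"
                " SET points = points + excluded.points",
                params,
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_bonus (user_id INTEGER PRIMARY KEY, points INTEGER NOT NULL)")
conn.execute("INSERT INTO user_bonus VALUES (1, 10)")
conn.commit()
upsert_bonus_points(conn, [(1, 5), (2, 7)])
```

Existing users are updated and new users are inserted by the same statement, so a batch of 100 rows costs one round trip rather than 100 separate UPDATEs.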
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect your way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or that's how many you have in total for all time? Obviously just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once. I'm pretty sure there's a proper term for that but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?
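The multi-record (set-based) style mentioned above can be sketched like this. The tables `users` and `orders` and the bonus rule (one point per 10 units of order amount) are invented for the example, and `sqlite3` stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, bonus_points INTEGER NOT NULL DEFAULT 0);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, amount INTEGER NOT NULL);
INSERT INTO users (id) VALUES (1), (2), (3);
INSERT INTO orders (user_id, amount) VALUES (1, 100), (1, 250), (2, 40);
""")


def recompute_bonus_points(conn):
    """Update every user in one statement instead of looping row by row."""
    with conn:
        cur = conn.execute(
            "UPDATE users SET bonus_points = ("
            " SELECT COALESCE(SUM(amount), 0) / 10 FROM orders"
            " WHERE orders.user_id = users.id)"
        )
    return cur.rowcount  # number of rows the single UPDATE touched
```

The application sends one statement and the database does the iteration, which avoids one network round trip per row.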

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting amounts to 5 terabytes a year. I will keep my data indefinitely; I don't think I want to delete anything early.
I ask myself whether I should use a distributed database, because my data will grow every year. After 5 years I will have 25 terabytes, not counting indexes. (I just calculated the raw data I save every day.)
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly connected.
I know it depends on the queries and on the database table design, and I can also have a distributed MySQL database.
I just want to know when I should think about a distributed database.
Would this be a use case? Or could MySQL handle this large dataset?
EDIT:
On average I will have 1500 clients writing data per second; they affect all tables.
I just need the old dataset for analytics, like machine learning and pattern matching.
Also, a client should be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
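The difference between sequential and random index keys can be illustrated with a toy simulation. This is a deliberately simplified model, not how InnoDB actually manages its buffer pool: it treats the index as a flat set of leaf blocks, ignores block splits, and uses a plain LRU cache. All sizes are made-up assumptions.

```python
import random
from collections import OrderedDict


def miss_rate(block_ids, cache_blocks):
    """Fraction of index-block accesses that miss an LRU cache of the given size."""
    cache = OrderedDict()
    misses = 0
    for block in block_ids:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # would require a disk read
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(block_ids)


ROWS = 100_000
KEYS_PER_BLOCK = 100    # assumed number of keys per leaf block
TOTAL_BLOCKS = 10_000   # assumed size of the existing index, in leaf blocks
CACHE_BLOCKS = 100      # assumed cache size: 1% of the index

# Sequential keys (AUTO_INCREMENT-like): each new row lands in the last block.
sequential = [TOTAL_BLOCKS + i // KEYS_PER_BLOCK for i in range(ROWS)]
# Random keys (UUID-like): each new row lands in an arbitrary block.
rng = random.Random(42)
scattered = [rng.randrange(TOTAL_BLOCKS) for _ in range(ROWS)]

seq_rate = miss_rate(sequential, CACHE_BLOCKS)
rand_rate = miss_rate(scattered, CACHE_BLOCKS)
```

In this model the sequential workload misses only when it moves on to a new block, while the random workload misses on nearly every insert because the cache holds only a small fraction of the index.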
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
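The arithmetic behind that figure, written out as a small sketch. The function name and parameters are hypothetical; the factors are the ones stated above (2 to 3 for InnoDB overhead, 2 for the space needed to copy a table during a schema change).

```python
def storage_estimate_tb(raw_tb_per_year, years, innodb_factor, alter_headroom=2):
    """Rough disk estimate: raw data x InnoDB overhead x room for a table copy."""
    return raw_tb_per_year * years * innodb_factor * alter_headroom


low = storage_estimate_tb(5, 5, innodb_factor=2)   # 25 TB raw, overhead factor 2
high = storage_estimate_tb(5, 5, innodb_factor=3)  # 25 TB raw, overhead factor 3
```

With the lower overhead factor this gives the 100TB mentioned above; with the higher factor it gives 150TB.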
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-a-kind-a out of the box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I would have multiple tables and normalize everything the queries would need a lot of joins (which probably makes everything quite slow)
Queries mostly need to be performed on the whole ~ 500 million rows database, rarely on separately downloaded subsets
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
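The staging workflow described above can be sketched as follows. The schema, the hourly bucket size, and the helper `load_batch` are assumptions made for the example, and `sqlite3` stands in for MySQL. The sketch also assumes each hour bucket arrives in exactly one upload; if uploads can split a bucket, the summary insert would need to be an upsert that adds to the existing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE fact (sensor_id INTEGER, type_id INTEGER, ts INTEGER, value REAL);
CREATE TABLE summary_hourly (
    sensor_id INTEGER, type_id INTEGER, hour_ts INTEGER,
    cnt INTEGER, sum_value REAL,
    PRIMARY KEY (sensor_id, type_id, hour_ts));
""")


def load_batch(conn, rows):
    """Load one upload; rows are (sensor_id, type_name, ts, value) tuples."""
    conn.execute(
        "CREATE TEMP TABLE staging (sensor_id INTEGER, type_name TEXT, ts INTEGER, value REAL)"
    )
    try:
        with conn:  # the whole batch commits or rolls back together
            conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)
            # Normalize: replace the repeated type string with a small integer id.
            conn.execute(
                "INSERT OR IGNORE INTO sensor_type (name)"
                " SELECT DISTINCT type_name FROM staging"
            )
            # Summarize: one row per sensor, type and hour.
            conn.execute("""
                INSERT INTO summary_hourly
                SELECT s.sensor_id, t.id, s.ts - s.ts % 3600, COUNT(*), SUM(s.value)
                FROM staging s JOIN sensor_type t ON t.name = s.type_name
                GROUP BY s.sensor_id, t.id, s.ts - s.ts % 3600
            """)
            # Finally copy the normalized raw rows into the Fact table.
            conn.execute("""
                INSERT INTO fact
                SELECT s.sensor_id, t.id, s.ts, s.value
                FROM staging s JOIN sensor_type t ON t.name = s.type_name
            """)
    finally:
        conn.execute("DROP TABLE staging")


load_batch(conn, [
    (1, "temperature", 10, 20.0),
    (1, "temperature", 20, 22.0),
    (1, "humidity", 3700, 50.0),
])
```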
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
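A small sketch of why sum, count and sum-of-squares are the quantities to store: averages of averages are wrong when buckets hold different numbers of rows, whereas these three columns can be added across buckets and then turned into a mean and a (population) standard deviation. Table and column names are assumptions, the 10-minute bucket follows the interval suggested earlier, and `sqlite3` stands in for MySQL.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (ts INTEGER, value REAL);
CREATE TABLE summary (bucket INTEGER PRIMARY KEY, cnt INTEGER, sum_v REAL, sum_sq REAL);
""")
values = [20.0, 21.5, 19.0, 22.0, 23.5, 18.0]
conn.executemany(
    "INSERT INTO fact VALUES (?, ?)",
    [(i * 400, v) for i, v in enumerate(values)],  # one reading every 400 seconds
)
# Summarize into 10-minute (600-second) buckets of unequal size.
conn.execute("""
    INSERT INTO summary
    SELECT ts / 600, COUNT(*), SUM(value), SUM(value * value)
    FROM fact GROUP BY ts / 600
""")


def avg_and_stddev(conn):
    """Mean and population standard deviation, using only the summary table."""
    n, s, sq = conn.execute(
        "SELECT SUM(cnt), SUM(sum_v), SUM(sum_sq) FROM summary"
    ).fetchone()
    mean = s / n
    variance = max(sq / n - mean * mean, 0.0)  # guard against tiny negative rounding
    return mean, math.sqrt(variance)
```

The result matches what a scan of the raw `fact` rows would give, up to floating-point rounding.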
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table-scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: Start with the minimal datasize for each field: Look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
Sure, it is possible to have 'queries that never come back'. This one 'never comes back', even with only 100 rows in a: SELECT * FROM a JOIN a JOIN a JOIN a JOIN a. The resultset has 10 billion rows.
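The row count grows as the table size raised to the number of joined copies, which a scaled-down sketch makes visible. The table here has 3 rows instead of 100 so that the query finishes; each copy of `a` gets an alias (`a1` to `a5`, names chosen for the example) so the statement is valid, and `sqlite3` stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(3)])


def five_way_join_count(conn):
    """Number of rows produced by joining the table to itself five times."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5"
    ).fetchone()
    return n


small = five_way_join_count(conn)  # 3 ** 5 rows
large = 100 ** 5                   # the 100-row case, computed rather than executed
```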

How to optimize and Fast run SQL query

I have the following SQL query that is taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
Above query takes 230976.6ms to execute.
I added indexes on first_name, last_name, customer_state and status_id.
How can I execute the query in less than 3 seconds?
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
This is a case where understanding that all tables are partitions slashed hours from a data load.
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
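Splitting the deletes into one transaction per table might be sketched like this. The helper `purge_old_rows`, the table names and the `created_ts` column are hypothetical, and `sqlite3` is used only so the example runs on its own; SQLite's locking is far coarser than SQL Server's, so the sketch shows the structure, not the blocking behavior.

```python
import sqlite3


def purge_old_rows(conn, tables, cutoff_ts):
    """Delete rows older than cutoff_ts, committing after each table."""
    deleted = {}
    for table in tables:
        # Table names come from a fixed, trusted list, never from user input.
        with conn:  # separate transaction per table: locks are released at commit
            cur = conn.execute(
                f'DELETE FROM "{table}" WHERE created_ts < ?', (cutoff_ts,)
            )
            deleted[table] = cur.rowcount
    return deleted


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit_log (id INTEGER PRIMARY KEY, created_ts INTEGER);
CREATE TABLE session_log (id INTEGER PRIMARY KEY, created_ts INTEGER);
INSERT INTO audit_log (created_ts) VALUES (10), (20), (300);
INSERT INTO session_log (created_ts) VALUES (50), (400);
""")
result = purge_old_rows(conn, ["audit_log", "session_log"], cutoff_ts=100)
```

The trade-off is that the purge is no longer atomic across tables: if it fails halfway, some tables have been cleaned and others have not, so the routine should be safe to re-run.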
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
    -- perform the action here
END
It’s completely unnecessary. If you want to check for existence, then do this:
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
    -- perform the action here
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows FROM sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows FROM sysindexes WHERE object_name(id) = 'T1' AND indid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
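The same existence check from application code, as a minimal sketch. The article's examples are T-SQL; here `sqlite3` stands in so the snippet runs on its own, and the table name `t1` and helper `has_rows` are assumptions.

```python
import sqlite3


def has_rows(conn):
    """True if t1 contains at least one row; stops at the first row found."""
    (found,) = conn.execute("SELECT EXISTS (SELECT 1 FROM t1)").fetchone()
    return bool(found)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY)")
empty_result = has_rows(conn)       # table is still empty here
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1000)])
filled_result = has_rows(conn)      # table now holds 1000 rows
```

The query returns a single 0 or 1 regardless of table size, so the application never needs the full count just to branch on it.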
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
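Before swapping one form for the other, it is worth confirming that both return the same rows. A small sketch of that check, with a made-up `customers` table and `sqlite3` standing in for SQL Server. Note that rows with a NULL region are excluded by both forms, since neither `<>`, `<` nor `>` is true for NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region_id INTEGER)")
conn.execute("CREATE INDEX idx_customers_region ON customers (region_id)")
conn.executemany(
    "INSERT INTO customers (region_id) VALUES (?)",
    [(1,), (2,), (3,), (3,), (4,), (5,), (None,)],
)


def not_region_negative(conn, region):
    """The original negative search."""
    rows = conn.execute("SELECT id FROM customers WHERE region_id <> ?", (region,))
    return sorted(r[0] for r in rows)


def not_region_union(conn, region):
    """The rewrite: two range searches glued together with UNION ALL."""
    rows = conn.execute(
        "SELECT id FROM customers WHERE region_id < ?"
        " UNION ALL "
        "SELECT id FROM customers WHERE region_id > ?",
        (region, region),
    )
    return sorted(r[0] for r in rows)
```

Whether the rewrite is actually faster depends on the data distribution and the optimizer, so the equivalence check should be followed by a timing test on realistic data.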
Ref: http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that once the table holds a significant number of records, saving a new record will become slow. Please advise on how best to approach indexing the columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example; I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow; customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I ran some simple tests using my real project and a real MySQL database.
My results: adding an average index (1-3 columns per index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
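For what it's worth, the 40-50% figure is consistent with the 2.1% measurement whether the per-index cost simply adds up or compounds. A quick sketch of the arithmetic (the variable names are mine, and treating every index as costing the same 2.1% is an assumption):

```python
per_index_slowdown = 0.021  # 2.1% per index, from the measurement above
index_count = 20

# Linear model: each index adds a fixed 2.1% to the baseline insert time.
linear = per_index_slowdown * index_count

# Compound model: each index makes the already-slowed insert 2.1% slower again.
compound = (1 + per_index_slowdown) ** index_count - 1

print(f"linear:   {linear:.1%}")
print(f"compound: {compound:.1%}")
```

The linear model comes out at about 42% and the compound model at a little over 50%, which brackets the range quoted above.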
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
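The drop-then-recreate import routine described above might look roughly like this. It is a sketch only, using an in-memory SQLite database instead of Sybase or MySQL; the `runs` table, the index names, and the `bulk_reload` helper are all hypothetical. How much time it saves depends on the engine and the data, so measure it rather than assuming the 90% figure.

```python
import sqlite3

# Hypothetical index definitions, kept in one place so they can be rebuilt after the load.
INDEXES = {
    "idx_runs_user": "CREATE INDEX idx_runs_user ON runs (user_id)",
    "idx_runs_event": "CREATE INDEX idx_runs_event ON runs (event_id)",
    "idx_runs_points": "CREATE INDEX idx_runs_points ON runs (point_value)",
}


def bulk_reload(conn, rows):
    """Replace the table contents, dropping indexes first and rebuilding them after."""
    with conn:  # commits when the block exits without an error
        for name in INDEXES:
            conn.execute(f"DROP INDEX IF EXISTS {name}")
        conn.execute("DELETE FROM runs")
        conn.executemany(
            "INSERT INTO runs (user_id, event_id, point_value) VALUES (?, ?, ?)",
            rows,
        )
        # Rebuild each index once over the full data set instead of
        # maintaining it row by row during the insert.
        for ddl in INDEXES.values():
            conn.execute(ddl)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " event_id INTEGER, point_value INTEGER)"
)
for ddl in INDEXES.values():
    conn.execute(ddl)

bulk_reload(conn, [(i % 500, i % 162, 1) for i in range(50_000)])

(row_count,) = conn.execute("SELECT COUNT(*) FROM runs").fetchone()
index_names = {
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'runs'"
    )
}
print(row_count, sorted(index_names))
```

This only makes sense for large batch loads where nothing else needs to query the table in the meantime; for a few hundred inserts a day it is not worth the trouble.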
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like premature optimization, which should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving that it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different methods to fix the issue.
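A test setup of that kind does not need to be elaborate. Here is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL and a hypothetical `runs` table with the three columns from the question. The absolute timings will not carry over to MySQL, so the same idea should be repeated against a real test instance.

```python
import sqlite3
import time


def time_inserts(index_columns, row_count=72_000):
    """Insert row_count rows into a fresh table and return (seconds, rows stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (id INTEGER PRIMARY KEY, user_id INTEGER,"
        " event_id INTEGER, point_value INTEGER)"
    )
    for col in index_columns:
        conn.execute(f"CREATE INDEX idx_{col} ON runs ({col})")

    rows = [(i % 500, i % 810, i % 5) for i in range(row_count)]
    start = time.perf_counter()
    with conn:  # commit once at the end of the batch
        conn.executemany(
            "INSERT INTO runs (user_id, event_id, point_value) VALUES (?, ?, ?)",
            rows,
        )
    elapsed = time.perf_counter() - start

    (stored,) = conn.execute("SELECT COUNT(*) FROM runs").fetchone()
    conn.close()
    return elapsed, stored


no_index_time, no_index_rows = time_inserts([])
three_index_time, three_index_rows = time_inserts(
    ["user_id", "event_id", "point_value"]
)
print(f"no indexes:    {no_index_time:.3f}s for {no_index_rows} rows")
print(f"three indexes: {three_index_time:.3f}s for {three_index_rows} rows")
```

The default of 72,000 rows matches the two-season figure from the question; raising `row_count` to a quarter million covers the five-season case.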
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
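One way to see whether a given query falls into the full-table-scan case is to ask the database for its query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` on a hypothetical `runs` table; MySQL has its own `EXPLAIN` statement with a different output format, so treat this as an illustration of the idea rather than MySQL output.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " event_id INTEGER, point_value INTEGER)"
)

query = "SELECT * FROM runs WHERE user_id = ?"

# Without an index on user_id the engine has to read every row.
before = plan_details(conn, query, (42,))

conn.execute("CREATE INDEX idx_runs_user ON runs (user_id)")

# With the index in place the engine can look up matching rows directly.
after = plan_details(conn, query, (42,))

print("before:", before)
print("after: ", after)
```

In SQLite the first plan reports a scan of the table and the second reports a search using the new index.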
If you find that insert operations are slow because of the index, then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at it.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"