Simple MySQL output questions - mysql

I have a two-column table pairing each number with its cube. Right now I have about 13 million numbers inserted, and that count is growing very, very quickly.
Is there a faster way to output a simple table like this than a plain SELECT * FROM table?
My second question pertains to selecting a range of numbers. As stated above, I have a large, fast-growing database holding numbers and their cubes. If you're wondering, I'm trying to find the 3 numbers whose cubes sum to 33. So I'm doing this by using a server/client program to send a range of numbers to a client so it can do the equations on said range of numbers.
So, for example, let's say the first client chimes in. I give him a range of 0-100. He then goes off to compute the numbers and reports back to tell the server whether he found the triplet. If he didn't, the loop just continues.
When the client does the calculations for the numbers by itself, it is extremely slow. So I have decided to use a database to store the cubed numbers so the client does not have to do the calculations. The problem is, I don't know how to access only a range of numbers. For example, if the client had the range 0-100, it would need to access the cubes of all numbers from 0 to 100.
What is the select command that will return a range of numbers?
The engine I am using for the table is MyISAM.

If your table "mytable" has two columns
number cube
0 0
1 1
2 8
3 27
the query will be (assuming the start of the range is 100 and the end is 200):
select number, cube from mytable where number between 100 and 200 order by number;
If you want this query to be as fast as possible, make sure of the following:
number is indexed, so you don't need a table scan to find the start of your range.
the index you create is clustered. Clustered indexes are much faster for scans like this because the leaf of the index is the record itself (in comparison, the leaf in a non-clustered index is a pointer to the record, which may be in a completely different part of the disk). As well, the clustered index forces a sorted structure on the data, so you may be able to read all 100 records from a single block.
Of course, adding an index will make writing to the table slightly slower. As well, I am assuming you are writing to the table in order (i.e. 0,1,2,3,4 etc. not 10,5,100,3 etc.). Writes to tables with clustered indexes are very slow if you write to the table in a random order (as the DB has to keep moving records to fit the new ones in).
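A minimal sketch of the range query above, using Python's sqlite3 as a self-contained stand-in for MySQL (the table and column names follow the answer; in sqlite, an INTEGER PRIMARY KEY plays the role of the clustered index):

```python
import sqlite3

# In-memory stand-in for the MySQL table described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (number INTEGER PRIMARY KEY, cube INTEGER)")

# Insert in order, as the answer recommends for a clustered index.
conn.executemany(
    "INSERT INTO mytable (number, cube) VALUES (?, ?)",
    ((n, n ** 3) for n in range(1000)),
)

# Range scan: with number as the (clustered) primary key, this is a
# single index seek followed by a sequential read of ~101 rows.
rows = conn.execute(
    "SELECT number, cube FROM mytable "
    "WHERE number BETWEEN 100 AND 200 ORDER BY number"
).fetchall()

print(rows[0], rows[-1], len(rows))  # (100, 1000000) (200, 8000000) 101
```

The same SELECT carries over to MySQL unchanged; only the setup code is sqlite-specific.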

Related

MySQL performance with index clarification

Say I have a mysql table with an index on the column 'name':
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, none of which have the name John, so there are still only 5 Johns in the table. Will the select statement be as fast as before? That is, will inserting all these rows have an impact on the read speed of an indexed table?
Indexes have their own "tables", and when the MySQL engine determines that a lookup references an indexed column, the lookup happens on that table. It isn't really a table per se, but the gist checks out.
That said, the query will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern yourself with indexing pertinent data and with column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, use EXPLAIN:
EXPLAIN select * from name_table where name = 'John';
Note: in addition to the column ordering listed in the link, it is a good (nay, great) idea to place variable-length columns (VARCHAR) after their fixed-length counterparts (CHAR). During a lookup, the engine either has to look at the row, read the column lengths, and skip forward to find the value (mind you, this is only for non-indexed columns), or it can read the table declaration and know it always has to look at the column at offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically:
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
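To see the planner's choice change as the EXPLAIN advice above suggests, here is a hedged sketch where sqlite3's EXPLAIN QUERY PLAN stands in for MySQL's EXPLAIN (table, index, and data are made up):

```python
import sqlite3

# Tiny hypothetical table mirroring the question's name_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO name_table VALUES (?, ?)",
                 enumerate(["John", "Jane", "Anna", "Carl", "John"]))

def plan(sql):
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM name_table WHERE name = 'John'"
before = plan(query)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_name ON name_table (name)")
after = plan(query)   # now an indexed search

print(before)  # e.g. SCAN name_table
print(after)   # e.g. SEARCH name_table USING INDEX idx_name (name=?)
```

The exact wording differs between sqlite versions and, of course, MySQL's EXPLAIN output looks different again, but the scan-versus-index-search distinction is the same.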
Yes, it will be just as fast.
(In addition to the excellent points made in Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' through 'Anna'; a little later in the index, a block contains entries for 'Caleb' through 'Carl'; further along in the index, 'James' through 'Jane', etc.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (because the index is in order, there's no way the value John could appear in the blocks we mentioned). So none of those blocks needs to be checked. (The database figures out, in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.)
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
This should clear up the answer to the question you were asking. You can add bazillions of rows to the table. If only five of those rows have the value
'John' in the name column, then when you do a query with
WHERE name = 'John'
it will be just as fast. The database will be able to locate the entries you're looking for nearly as fast as it could when you had a thousand rows in the table.
(As the index grows larger, it does add "levels" to traverse down to the leaf nodes... so it gets ever so slightly slower because of a few more operations. Where performance really starts to bog down is when the InnoDB buffer cache is too small, and we have to wait for the (glacially slow, in comparison) disk I/O operations to fetch blocks into memory.)
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and those rows are likely to be evenly distributed in the table. So nearly every page in the table is going to contain at least one row we need, and the database is going to opt for a full table scan (looking at every row in every block of the table) to locate the rows, rather than using an index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.
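The "eliminate vast swaths of blocks" idea described above is just binary search over sorted entries. A toy sketch with Python's `bisect` (the names and row ids are hypothetical, and a real B-tree works block-by-block rather than entry-by-entry):

```python
from bisect import bisect_left, bisect_right

# Toy model of a B-tree's leaf level: (name, row_id) entries kept in
# sorted order, as the answer describes.
index_entries = [("Anna", 7), ("Caleb", 3), ("James", 11), ("John", 1),
                 ("John", 4), ("John", 9), ("Kate", 2), ("Zoe", 5)]
names = [name for name, _ in index_entries]  # already sorted

# Because the entries are ordered, we can locate every 'John' with a
# handful of comparisons instead of inspecting all entries.
lo = bisect_left(names, "John")   # first entry that could be 'John'
hi = bisect_right(names, "John")  # first entry past the last 'John'
matches = index_entries[lo:hi]

print(matches)  # [('John', 1), ('John', 4), ('John', 9)]
```

With 8 entries this takes about 3 comparisons per probe; with millions of entries it is still only a few dozen, which is why the table's size barely matters for a high-cardinality lookup.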

Finding records not updated in last k days efficiently

I have a table which contains records of the last n days. The records in this table number around 100 million. I need to find the records which have not been updated in the last k days.
My solution to this problem is
Partition the table on k1. Index on the timestamp column. Now, instead of updating the timestamp (so that the index is not rebuilt), perform a remove + insert. By doing this, I think the query to find the records not updated in the last k days will be fast.
Is there any other better way to optimize these operations?
For example,
Suppose we have many users, and each user can use different products. A user can also start using (become the owner of) new products at any time. If a user does not use a product for n days, his ownership expires. Now we need to find all the products a user has not used in the last k days. The number of users is of order 10,000 and the number of products to choose from is of order 100,000.
I modeled this problem using a table with schema (user_id, product_id, last_used). product_id is the id of the product the user is using. Whenever a user uses the product, last_used is updated. Also, a user's ownership of a product expires if he does not use it for n days. I partitioned the table on user_id and indexed last_used (a timestamp). Also, instead of updating, I performed delete + create. I did the partitioning and indexing to optimize the query that fetches records not updated in the last k days for a user.
Is there a better way to solve this problem?
You have said you need to "find" and, I think "expire" the records belonging to a particular user after a certain number of days.
Look, this can be done even in a large table with good indexing without too much trouble. I promise you, partitioning the table will be a lot of trouble. You have asserted that it's too expensive in your application to carry an index on your last_used column because of updates. But, considering the initial and ongoing expense of maintaining a partitioned table, I strongly suggest you prove that assertion first. You may be wrong about the cost of maintaining indexes.
(Updating one row with a column that's indexed doesn't rebuild the index, it modifies it. The MySQL storage engine developers have optimized that use case, I promise you.)
As I am sure you know, this query will retrieve old records for a particular user.
SELECT product_id
FROM tbl
WHERE user_id = <<<chosen user>>>
AND last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
will yield your list of products. This will work very efficiently indeed if you have a compound covering index on (user_id, last_used, product_id). If you don't know what a compound covering index is, you really should find out using your favorite search engine. This one will random-access the particular user and then do a range scan on the last_used date. It will then return the product ids from the index.
If you want to get rid of all old records, I suggest you write a host program that repeats this query in a loop until you find that it has processed zero rows. Run this at an off-peak time in your application. The LIMIT clause will prevent each individual query from taking too long and interfering with other uses of the table. For the sake of speed on this query, you'll need an index on last_used.
DELETE FROM tbl
WHERE last_used <= CURRENT_DATE() - INTERVAL <<<k>>> DAY
LIMIT 500
I hope this helps. It comes from someone who's made the costly mistake of trying to partition something that didn't need partitioning.
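The "repeat until zero rows" host program suggested above might look like this sketch. Python's sqlite3 stands in for MySQL; since stock sqlite lacks DELETE ... LIMIT, a rowid subquery substitutes for it (in MySQL you would write DELETE ... LIMIT 500 directly), and the dates, batch size, and k value are all made up:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (user_id INT, product_id INT, last_used TEXT)")

# Hypothetical data: 40 products, each last used p days ago.
today = date(2023, 1, 31)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, p, str(today - timedelta(days=p))) for p in range(1, 41)],
)

k = 10  # hypothetical expiry threshold in days
cutoff = str(today - timedelta(days=k))

# Delete old rows in small batches so no single statement holds locks
# for long; loop until a batch deletes nothing.
deleted = 0
while True:
    cur = conn.execute(
        "DELETE FROM tbl WHERE rowid IN "
        "(SELECT rowid FROM tbl WHERE last_used <= ? LIMIT 5)",
        (cutoff,),
    )
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount

print(deleted)  # number of rows at least k days old that were purged
```

ISO-formatted date strings compare correctly, which is why the TEXT column works here; in MySQL you would use a real DATE/DATETIME column as the answer assumes.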
MySQL doesn't "rebuild" indexes (not completely) when you modify an indexed value. In fact, it doesn't even reorder the records. It just moves the record to the proper 16KB page.
Within a page, the records are in the order they were added. If you inserted in order, then they're in order, otherwise, they're not.
So, when they say that MySQL's clustered indexes are in physical order, it's only true down to the page level, but not within the page.
Clustered indexes still get the benefit that the page data is on the same page as the index, so no further lookup is needed if the row data is small enough to fit in the pages. Reading is faster, but restructuring is slower because you have to move the data with the index. Secondary indexes are much faster to update, but to actually retrieve the data (with the exception of covering indexes), a further lookup must be made to retrieve the actual data via the primary key that the secondary index yields.
Example
Page 1 might hold user records for people whose last name start with A through B. Page 2 might hold names C through D, etc. If Bob renames himself Chuck, his record just gets copied over from page 1 to page 2. His record will always be put at the end of page 2. The keys are kept sorted, but not the data they point to.
If the page becomes full, MySQL will split the page. In this case, assuming even distribution between C and D, page 1 will be A through B, page 2 will be C, and page 3 will be D.
When a record is deleted, the space is compacted, and if the page becomes less than half full, MySQL will merge neighboring pages and possibly free up a page in between.
All of these changes are buffered, and MySQL does the actual writes when it's not busy.
The example works the same for both clustered (primary) and secondary indexes, but remember that with a clustered index, the keys point to the actual table data, whereas with a secondary index, the keys point to a value equal to the primary key.
Summary
After a while, page splitting caused by random inserts will cause the pages to become noncontiguous on disk. The table will become "fragmented". Optimizing the table (rebuilding the table/index) fixes this.
There would be no benefit in deleting then reinserting the record. In fact, you'll just be adding transactional overhead. Let MySQL handle updating the index for you.
Now that you understand indexes a bit more, perhaps you can make a better decision of how to optimize your database.

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. The only drawback is that it returns only one random row at a time.
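The subquery approach above can be sketched end to end with Python's sqlite3 standing in for MySQL (RAND() becomes RANDOM(); the table and column names come from the question, the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_mst_users (userid INTEGER PRIMARY KEY, name TEXT, pic TEXT)")
conn.executemany(
    "INSERT INTO tbl_mst_users (name, pic) VALUES (?, ?)",
    [(f"user{i:04d}", f"pic{i}.png") for i in range(2000)],
)
conn.execute("CREATE INDEX idx_name ON tbl_mst_users (name)")

prefix = "user01"  # stand-in for the string the user typed
# The subquery narrows and shuffles on the index first; the outer join
# fetches the wide columns only for the 5 chosen rows.
rows = conn.execute(
    "SELECT u.userid, u.name, u.pic FROM tbl_mst_users u "
    "JOIN (SELECT userid FROM tbl_mst_users "
    "      WHERE name LIKE ? ORDER BY RANDOM() LIMIT 5) sub "
    "USING (userid)",
    (prefix + "%",),
).fetchall()

print(len(rows))  # 5 random users whose name starts with the prefix
```

With an index on name, the LIKE 'prefix%' predicate is a range scan, so only the matching slice gets shuffled rather than the whole table.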
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (I don't recall which) value in the random-number column. With an index, this can be very fast. (And MediaWiki is written in PHP and was developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random-number column).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
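The MediaWiki-style trick from point 2 can be sketched like this (Python's sqlite3 as a stand-in; table name and wrap-around handling are assumptions, not MediaWiki's actual code):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row stores a precomputed random number, assigned at creation.
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, rand_val REAL)")
conn.executemany(
    "INSERT INTO articles (rand_val) VALUES (?)",
    [(random.random(),) for _ in range(1000)],
)
conn.execute("CREATE INDEX idx_rand ON articles (rand_val)")

# Picking a random row is now an indexed range lookup instead of an
# ORDER BY RAND() over the whole table.
r = random.random()
row = conn.execute(
    "SELECT id FROM articles WHERE rand_val >= ? ORDER BY rand_val LIMIT 1",
    (r,),
).fetchone()
if row is None:  # r landed past the largest stored value: wrap around
    row = conn.execute(
        "SELECT id FROM articles ORDER BY rand_val LIMIT 1").fetchone()

print(row[0])  # id of a (roughly) uniformly chosen article
```

This is where the distribution caveat above comes from: rows immediately after a large gap in rand_val are picked more often, which is why periodically regenerating the column helps.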

Limitations on amount of rows in MySQL

What are the limitations in terms of performance of MySQL when it comes to the amount of rows in a table? I currently have a running project that runs cronjobs every hour. Those gather data and write them into the database.
In order to boost performance, I'm thinking about saving the data of those cronjobs in a table (not just the result, but all of it). The data itself will be something similar to this:
imgId1 (INT, FKEY->images.id) | imgId2 (INT, FKEY->images.id) | myData (INT)
So, the actual data per row is quite small. The problem is, that the amount of rows in this table will grow exponentially. With every imgId I add, I need the myData for every other image. That means, with 3000 images, I will have 3000^2 = 9 million rows (not counting the diagonals because I'm too lazy to do it now).
I'm concerned about what MySQL can handle under such preconditions. Every hour will add roughly 100-300 new entries to the origin table, meaning 10,000 to 90,000 new entries in the cross table.
Several questions arise:
Are there limitations to the number of rows in a table?
When (if) will MySQL significantly drop in performance?
What actions can I take to make this cross table as fast as possible (access-wise; writing doesn't have to be fast)?
EDIT
I just finished my polynomial interpolation and it turns out the growth will not be as drastic as I originally thought. As the relation 1-2 has the same data as 2-1, I only need "half" a table, bringing the growth down to (x^2-x)/2.
Still, it will get a lot.
9 million rows is not a huge table. Given the structure you provided, as long as it's indexed properly performance of select / update / insert queries won't be an issue. DDL may be a bit slow.
Since all the rows are already described by a Cartesian join, you don't need to populate the entire table.
If the order of the image pairs is not significant then you can save some space by sorting the attributes or using a two / three table schema where the imgIds are equivalent.
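The "sort the attributes" idea above can be enforced by normalizing each pair before it is stored, so (a, b) and (b, a) always map to the same row; a small sketch (the function name is hypothetical):

```python
# Store each unordered image pair exactly once by putting the smaller
# id first, roughly halving the cross table as the question's edit notes.
def canonical_pair(img_a: int, img_b: int) -> tuple[int, int]:
    """Return the pair sorted so (a, b) and (b, a) map to the same key."""
    return (img_a, img_b) if img_a <= img_b else (img_b, img_a)

# For n = 3 images, all ordered pairs collapse to (n^2 - n) / 2 = 3 rows.
pairs = {canonical_pair(a, b) for a in range(1, 4) for b in range(1, 4) if a != b}
print(sorted(pairs))  # [(1, 2), (1, 3), (2, 3)]
```

In SQL the same invariant can be kept with a CHECK (imgId1 < imgId2) constraint, so every lookup also normalizes its pair before querying.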

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason to not do this, but how about three fields in a thirty-field table? 10 in a 30 field? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many or too-large indexes, the DB will have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every piece of data inserted/deleted/updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
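A rough illustration of that space cost, using Python's sqlite3 and its page counter as a stand-in (the schema and row counts are made up, and MySQL's storage accounting differs, but the direction is the same):

```python
import sqlite3

def build(with_indexes: bool) -> int:
    """Store the same rows, optionally with an index per column,
    and return how many database pages the result occupies."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INT, b INT, c INT)")
    if with_indexes:
        for col in ("a", "b", "c"):
            conn.execute(f"CREATE INDEX idx_{col} ON t ({col})")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                     [(i, i * 2, i * 3) for i in range(5000)])
    return conn.execute("PRAGMA page_count").fetchone()[0]

plain, indexed = build(False), build(True)
print(plain, indexed)  # the fully indexed copy occupies many more pages
```

Every one of those extra pages must also be updated on each insert, which is the write-time cost the answers above describe.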
Check this out
You have to balance CRUD needs. Writing to tables becomes slower. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing takes up more allocated space, both on disk and in RAM, but it also improves performance a lot. Unfortunately, when the memory limit is reached, the system will fall back on disk space and risk the performance. Practically, you shouldn't index any field that isn't involved in any kind of data-traversal operation, neither inserting nor searching (WHERE clauses), but you should index the fields that are. Fields worth leaving unindexed are those used only in queries run by a moderator, unless those also need speed.
It is not a good idea to index all the columns in a table. While this would make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed involves putting the new record in the table and then putting each column's information in its own index table.
This answer is my personal opinion; I am using my own mathematical reasoning to answer.
The second question was about where to draw the line. First, let's do some calculation. Suppose we have N rows with L fields in a table. If we index all the fields, we get L new index tables, each sorting the data of its indexed field in a meaningful way. At first glance, if your table weighs W, it will become W*2 (1 terabyte becomes 2 terabytes). If you have 100 big tables (I have already worked on a project where the table count was around 1800), you will waste that space 100 times over (100 terabytes); this is far from wise.
If we apply indexes to all tables, we also have to think about index updates, where one update triggers an update of every index; in time, this is the equivalent of a SELECT over everything unordered.
From this I conclude that, in this scenario, if you must lose this time, it is preferable to lose it in a SELECT rather than in an UPDATE, because selecting a field that is not indexed does not trigger another select on all the other fields that are not indexed.
What to index?
Foreign keys: a must.
Primary key: I am not yet sure about it; maybe someone reading this can help with that case.
Other fields: the first natural answer is half of the remaining fields. Why? If you should index more, you are not far from the best answer; if you should index less, you are also not far, because we know that no indexes is bad and everything indexed is also bad.
From these 3 points I conclude that, if we have L fields including K keys, the limit should be somewhere near ((L-K)/2)+K, give or take L/10.
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP and its backing database tables, we can create one index table for all required index fields, holding only their addresses. So other SQL-based database systems could also use one table for all the fields to be indexed.
Secondly, what is the writing performance really? A company records, say, 50 sales orders in one day. And let's assume a sales order header table, VBAK, with 30 fields, each 20 CHAR long.
I can write to the real table in seconds, while the index table is maintained in the background. If, at the same time, a report is run that searches the index table, there can be logic (in the database programming) that waits for an ongoing index write to finish (5 sales orders being recorded at the same time might take 5 seconds). So the running report waits 5 seconds, then runs for 5 seconds: 10 seconds in total.
Without the index, the running report does not wait 5 seconds for writes, but runs for maybe 40 seconds.
So what is the meaning of writing performance? No one writes thousands of records at the same time, but everyone reads them.
And reading a second table means the fields there are already sorted. If I have 3 fields selected, I can find which sorted sets I need to search, then fetch the data. What RAM, what memory? It is just a copied index table with only one piece of data per field: the address. What memory?
I think this is one of the secrets software companies hide from customers, so as not to wake them up; otherwise the customers would not need another, expensive system in the future.