Database optimization for insert and search - MySQL

I was having an argument with a friend of mine. Suppose we have a db table with a userid and some other fields. This table might have a lot of rows. Let's suppose also that by design we limit the records for each userid in the table to about 50. My friend suggested that if I stored every row for each userid one after another, the lookup would be faster, e.g.
userid otherfield
1 .........
1 .........
.....until 50...
2 ........
etc. So when user id 1 is created, I pre-populate its 50 rows in the table with null values, etc. The idea is that if I know the number of rows and find the first row with userid = 1, I just have to look at the next 49 and, voila, I don't have to search the whole table. Is this correct? Can this be done without indexing? Is the pre-population an expensive process? Is there a performance difference if I just inserted rows the old-fashioned way, like
1 ........
2 ........
2 ........
1 ........
etc?

To answer a performance question like this, you should run performance tests on the different configurations.
But, let me make a few points.
First, although you might know that the records for a given id are located next to each other, the database does not know this. So, if you are searching for one user -- without an index -- then the engine needs to search through all the records (unless you have a limit clause in the query).
Second, if the data is fixed length (numeric and dates), then populating it with real values after populating it with NULL values will occupy the same space on the page. But if the data is variable length, a given page will initially be filled with empty records, and when you update those records with real values you will get page splits.
What you are trying to do is to outsmart the database engine. This isn't necessary, because MySQL provides indexes, which provide almost all the benefits that you are describing.
Now, having said that, there is some performance benefit to having all the records for a user co-located. If a user has 50 records, then reading them via an index could require loading up to 50 pages into memory; if the records are co-located, only one or two pages would need to be read. Typically this is a very small gain, because the most frequently accessed tables fit into memory, but there might be circumstances where it is worth it.
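If you do want that co-location, you don't need manual pre-population: in InnoDB the table is clustered on its primary key, so a composite primary key that starts with userid keeps a user's rows physically adjacent. A minimal sketch, with hypothetical table and column names:
-- InnoDB stores rows in primary-key order, so rows sharing a userid
-- end up on the same (or adjacent) pages.
CREATE TABLE user_records (
    userid     INT NOT NULL,
    seq        TINYINT NOT NULL,   -- per-user slot, 1..50
    otherfield VARCHAR(255),
    PRIMARY KEY (userid, seq)
) ENGINE=InnoDB;

-- Fetching one user's rows is then a single range scan on the clustered index:
SELECT * FROM user_records WHERE userid = 1;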

Related

Whether or not SQL query (SELECT) continues or stops reading data from table when find the value

Greetings,
My question: does an SQL query (SELECT) continue or stop reading data (records) from the table once it finds the value I was looking for?
reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
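As an illustration (a minimal sketch, using the hypothetical table from the query above), you can add an index and check the plan with EXPLAIN:
-- A composite index that matches both predicates in the WHERE clause:
CREATE INDEX idx_a_b ON mytable (a, b);

-- EXPLAIN shows whether MySQL uses the index or falls back to a full table scan:
EXPLAIN SELECT a, b, c FROM mytable WHERE a = 3 AND b = 5;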
As to whether or not the scanning stops: in general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have an order by clause, the database engine cannot stop after the first hit; there might be later rows that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available, etc. If your select queries take too long, then consider reorganizing your indices, writing your select statements differently, or rebuilding the table.
How the RDBMS reads data from disk is something you cannot know, should not care about, and must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors can lead to very different storage access patterns when running the same query several times a couple of minutes apart.
Yes, it scans the entire table, unless you add something like
select * from user where id=100 limit 1
This of course will still scan all the rows if id 100 happens to be the last record.
If id is a primary key, it will automatically be indexed and the search will be optimized.
I'm sorry... I was thinking of the whole table.
I will change the question and explain it with the following image;
I understand that in CASE 1 all columns must be read on each iteration.
My question is: is it the same in CASE 2, or are columns that are not selected in the query excluded from reading on each iteration?
Also, are both queries the same from a performance perspective?
Clarify:
CASE 1: the first SELECT prints all the data.
CASE 2: the second SELECT prints only the columns first_name and last_name.
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire table to get that data (the first_name, last_name rows)?
I'm interested in how the server reads a table row in CASE 1 and in CASE 2.
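As the earlier answer notes, if all the columns needed by a query are available in an index, the RDBMS won't even read the data file. So CASE 2 can be made index-only, while CASE 1 always reads whole rows. A minimal sketch, with a hypothetical users table and index name:
-- Covering index: CASE 2 can then be answered from the index alone,
-- without touching the full rows.
CREATE INDEX idx_names ON users (first_name, last_name);

SELECT first_name, last_name FROM users;  -- CASE 2: index-only scan possible
SELECT * FROM users;                      -- CASE 1: full rows must be read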

What works faster: a "longer table with fewer columns" or a "shorter table with more columns"?

I have to decide how to design a table that will be used to store dates.
I have about 20 different dates for each user, and roughly 100,000 users right now, and growing.
So the question is: for SELECT queries, what will work faster? Do I make one table with 20 fields, e.g.
"user_dates"
userId, date_registered, date_paid, date_started_working, ... date_reported, date_fired (20 fields total, with 100,000 records in the table)
or make 2 tables, like:
first table "date_types" with 3 fields and 20 records for the above column names:
id, date_type_id, date_type_name
1 5 date_reported
2 3 date_registered
...
and a second table with 3 fields for the actual records:
"user_dates"
userId, date_type, date
201 2 2012-01-28
202 5 2012-06-14
...
but then it will have 2,000,000 records?
I think the second option is more universal: if I need to add more dates, I can do it from the front end just by adding a record to the "date_types" table and then using it in "user_dates". However, I am now worried about performance with 2 million records in the table.
So which option do you think will work faster?
A longer table will have a larger index. A wider table will have a smaller index but take more physical space and probably have more overhead. You should carefully examine your schema to see whether normalization is complete.
I would, however, go with your second option, because you don't necessarily need the fields to exist if they are empty. So if the user hasn't been fired, there's no need to create a record for them.
If the dates are pretty concrete and the users will have all (or most) of the dates filled in, then I would go with the wide table because it's easier to actually write the queries to get the data. Writing a query that asks for all the users that have date1 in a range and date2 in a range is much more difficult with a vertical table.
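To illustrate the difference, here is a minimal sketch using the schemas from the question (date_type 3 is date_registered per the sample data; date_type 4 standing in for date_paid is hypothetical):
-- Wide table: one straightforward WHERE clause.
SELECT userId
FROM user_dates
WHERE date_registered BETWEEN '2012-01-01' AND '2012-12-31'
  AND date_paid       BETWEEN '2012-01-01' AND '2012-06-30';

-- Vertical table: the same question needs a self-join, one alias per date type.
SELECT d1.userId
FROM user_dates d1
JOIN user_dates d2 ON d2.userId = d1.userId
WHERE d1.date_type = 3 AND d1.date BETWEEN '2012-01-01' AND '2012-12-31'
  AND d2.date_type = 4 AND d2.date BETWEEN '2012-01-01' AND '2012-06-30';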
I would only go with the longer table if you know you need the option to create date types on the fly.
The best way to determine this is through testing. Generally the sizes of data you are talking about (20 date columns by 100K records) are really pretty small for MySQL tables, so I would probably just use one table with multiple columns, unless you think you will be adding new types of date fields all the time and desire a more flexible schema. You just need to make sure you index all the fields that will be used for filtering, ordering, joining, etc. in queries.
The design may also be informed by what type of queries you want to perform against the data. If for example you expect that you might want to query data based on a combination of fields (i.e. user has some certain date, but not another date), the querying will likely be much more optimal on the single table, as you would be able to use a simple SELECT ... WHERE query. With the separate tables, you might find yourself needing to do subselects, or odd join conditions, or HAVING clauses to perform the same kind of query.
As long as the user ID and the date-type ID are indexed on the main tables and the user_dates table, I doubt you will notice a problem when querying. If you were to query the entire table in either case, I'm sure it would take a pretty long time (mostly to send the data, though). A single-user lookup will be instantaneous in either case.
Don't sacrifice the relation for some possible efficiency improvement; it's not worth it.
Usually I go both ways: put the basic and most often used attributes into one table, and make an additional-attributes table for rarely used attributes, which can then be fetched lazily from the application layer. This way you are not doing JOINs every time you fetch a user.
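A minimal sketch of that hybrid layout (hypothetical table and column names):
-- Hot, frequently used dates live directly on the user row:
CREATE TABLE users (
    userId          INT PRIMARY KEY,
    date_registered DATE,
    date_paid       DATE
);

-- Rarely used dates go into a narrow side table, fetched only when needed:
CREATE TABLE user_extra_dates (
    userId    INT NOT NULL,
    date_type INT NOT NULL,
    date      DATE NOT NULL,
    PRIMARY KEY (userId, date_type)
);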

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an ajax call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow, as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
    -- here, MySQL works on indexes only
    SELECT userid
    FROM tbl_mst_users
    WHERE name LIKE 'queryStr%'
    ORDER BY RAND()
    LIMIT 5
) AS sub USING (userid); -- join the other columns only after picking the rows in the sub-query
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. The only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random-number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random-number column). A sketch of this trick follows the list.
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
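A minimal sketch of the random-column trick from point 2, applied to the table from the question (the rand_val column and index name are hypothetical):
-- One-time setup: a random value per row, kept in an indexed column.
ALTER TABLE tbl_mst_users ADD COLUMN rand_val DOUBLE;
UPDATE tbl_mst_users SET rand_val = RAND();
CREATE INDEX idx_rand ON tbl_mst_users (rand_val);

-- Pick one random row with an indexed range scan instead of ORDER BY RAND().
-- RAND() goes into a variable first, so it is evaluated once, not per row.
SET @r = RAND();
SELECT userid, name, pic
FROM tbl_mst_users
WHERE rand_val >= @r
ORDER BY rand_val
LIMIT 1;
-- Note: this returns one row at a time; repeat for more, and combining it with
-- the LIKE filter from the question will limit how much the index helps.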

MySQL indexing - optional search criteria

"How many indexes should I use?" This question has been asked generally multiple times, I know. But I'm asking for an answer specific to my table structure and querying purposes.
I have a table with about 60 columns. I'm writing an SDK which has a function to fetch data based on optional search criteria. There are 10 columns for which the user can optionally pass in values (so the user might want all entries for a certain username and clientTimestamp, or all entries for a certain userID, etc). So potentially, we could be looking up data based on up to 10 columns.
This table will run INSERTS almost as often as SELECTS, and the table will usually have somewhere around 200-300K rows. Each row contains a significant amount of data (probably close to 0.5 MB).
Would it be a good or bad idea to have 10 indexes on this table?
A simple guide that may help you make a decision (a sketch follows this list):
1. Index columns that have high selectivity.
2. Try normalizing your table (you mentioned username and userID columns; if this is not the user table, there is no need to store the name here).
3. Unless your system is completely abstract, some parameters will be used more often than others. First of all, make sure you have indexes that support fast result retrieval with those parameters.
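For example, a minimal sketch using the column names from the question (the table name is hypothetical):
-- Single-column index for a selective, frequently used parameter:
CREATE INDEX idx_userid ON sdk_events (userID);

-- Composite index for a combination of parameters that is queried together:
CREATE INDEX idx_name_ts ON sdk_events (username, clientTimestamp);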

Having a column 'number_of_likes' or using a separate table...?

In my project, I need to calculate number_of_likes for a particular comment.
Currently I have following structure of my comment_tbl table:
id user_id comment_details
1 10 Test1
2 5 Test2
3 7 Test3
4 8 Test4
5 3 Test5
And I have another table 'comment_likes_tbl' with following structure:
id comment_id user_id
1 1 1
2 2 5
3 2 7
4 1 3
5 3 5
The above are sample data.
Question:
On my live server there are around 50K records, and I calculate the number_of_likes for a particular comment by joining the two tables above.
I need to know: is this OK?
Or should I add one more field to the comment_tbl table to record the number_of_likes and increment it by 1 each time a comment is liked, along with inserting into comment_likes_tbl?
Would that help me in any way?
Thanks in advance.
Yes, you should have one more field, number_of_likes, in the comment_tbl table. It will remove the unnecessary joining of tables.
This way you don't need a join until you need to know who liked the comment.
A good example you can see here is the database design of Stack Overflow itself. In the Users table they keep a Reputation field in the table itself; instead of joining and calculating a user's reputation every time, they use this one.
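A minimal sketch of keeping that counter in sync (this assumes the added number_of_likes column; the id values are just examples):
-- Record who liked what and bump the counter atomically, in one transaction.
START TRANSACTION;
INSERT INTO comment_likes_tbl (comment_id, user_id) VALUES (2, 9);
UPDATE comment_tbl SET number_of_likes = number_of_likes + 1 WHERE id = 2;
COMMIT;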
You can take a few different approaches to something like this:
As you're doing at the moment, run a JOIN query to return the collated results of comments and how many "likes" each has.
As time goes on, you may find this is a drain on performance. Instead you could simply have a counter that increments, attached to each comment. But you may find it useful to also keep your comment_likes_tbl table, as this will be a permanent record of who liked what, and when (otherwise, you would just have a single figure with no additional metadata attached).
You could potentially also have a solution where you simply store your users' likes in comment_likes_tbl, and then a cron task runs, on a pre-determined schedule, to automatically update all "like" counts across the board; see the sketch after this answer. Further down the line, with a busier site, this could help even out performance, even if it does mean that "like" counts lag slightly behind the real count.
(On top of these, you can also implement caching solutions etc. to store temporary records of like values attached to comments; MySQL also has useful caching technology you can make use of.)
But what you're doing just now is absolutely fine, although you should still make sure you've set up your indexes correctly, otherwise you will notice performance degradation more quickly (a non-unique index on comment_id should suffice).
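A minimal sketch of the scheduled recount described above (again assuming the number_of_likes column exists):
-- Recompute every comment's like count from the detail table in one statement.
-- LEFT JOIN + COALESCE resets comments that have no likes back to 0.
UPDATE comment_tbl c
LEFT JOIN (
    SELECT comment_id, COUNT(*) AS cnt
    FROM comment_likes_tbl
    GROUP BY comment_id
) l ON l.comment_id = c.id
SET c.number_of_likes = COALESCE(l.cnt, 0);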
Use the query: as they are foreign keys, the columns will be indexed and the query will be quick.
Yes, your architecture is good as it is and I would stick to it, for the moment.
Running too many joins can be a performance problem, but as long as you don't face such problems, you shouldn't worry about it.
Even if you do run into performance problems, you should first:
check that you use (foreign) keys, so that MySQL can look up the data very fast
take advantage of the MySQL query cache
use some sort of second caching layer, like memcached, to store the number of likes (as this is only an incremental value).
Using memcached would solve the problem of running too many joins and avoid creating a column that isn't really necessary.