How does MySQL handle huge databases? - mysql

For the sake of example:
Let's say my database has 1 table where the fields are
id, first_name (VARCHAR 100 chars), last_name (VARCHAR 100 chars), about (VARCHAR 10,000 chars)
Now let's say the database is 100 Gigs large.
How will random access look like on a machine that only has 4 Gigs of RAM?
Will the query take constant time every time it's made?

If you search on first_name and it is not indexed, the server will read each row in the table and compare it to the WHERE clause. The time this takes can vary hugely, since it depends on where the matching row sits in the table: a row near the start is found quickly, while a row near the end takes much longer. Essentially you are doing a linear, sequential pass over the table.
If there is an index on first_name, MySQL builds a tree (a B-tree) on the column. Trees are highly efficient for searching. Finding the values 'a' and 'z' should take roughly the same number of operations, since the lookup works much like a binary search. Note that I am saying operations, not time.
There is no way you can guarantee that a query will always execute in the same amount of time. Remember that while a database is memory intensive, most people overlook the fact that a database is really bound by disk I/O. With 100 GB of data and 4 GB of RAM, most lookups will have to touch the disk. These factors make it highly unlikely that you can guarantee a predictable, constant execution time. However, you can ensure that the number of operations stays optimised.
One other thing: while indexes speed up reads, they slow down writes, so indexing is a double-edged sword. Index only what you really need to.
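A minimal sketch of the two cases above, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere; MySQL's planner differs in detail and its equivalent command is EXPLAIN). The table name people and the index name idx_first_name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " first_name VARCHAR(100),"
    " last_name VARCHAR(100),"
    " about VARCHAR(10000))"
)
conn.executemany(
    "INSERT INTO people (first_name, last_name, about) VALUES (?, ?, ?)",
    [(f"name{i:05d}", "x", "...") for i in range(10000)],
)

def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT about FROM people WHERE first_name = ?"

# No index yet: every row has to be read and compared.
plan_without_index = plan(query, ("name09999",))

# With an index, the engine descends a B-tree instead of scanning.
conn.execute("CREATE INDEX idx_first_name ON people (first_name)")
plan_with_index = plan(query, ("name09999",))

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of the whole table, the second a search through the index. Neither says anything about wall-clock time, which still depends on whether the pages involved are in memory or on disk.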

Related

Query plan for database table containing trillions of records [closed]

I have a big table containing trillions of records of the following schema (Here serial no. is the key):
MyTable
    Column     |           Type           | Modifiers
---------------+--------------------------+-----------
 serial_number | int                      |
 name          | character varying(255)   |
 Designation   | character varying(255)   |
 place         | character varying(255)   |
 timeOfJoining | timestamp with time zone |
 timeOfLeaving | timestamp with time zone |
Now I want to fire queries of the form given below on this table:
select place from myTable where Designation='Manager' and timeOfJoining>'1930-10-10' and timeOfLeaving<'1950-10-10';
My aim is to achieve fast query execution times. Since, I am designing my own database from scratch, therefore I have the following options. Please guide me as to which one of the two options will be faster.
1. Create 2 separate tables. Here, table1 contains the schema (serial_no, name, Designation, place) and table2 contains the schema (serial_no, timeOfJoining, timeOfLeaving). Then perform a merge join between the two tables. Here, serial_no is the key in both tables.
2. Keep one single table, MyTable, and run the following plan: create an index Designation_place_name and, using that index, find the rows that fit the index condition Designation = 'Manager' (the rows on disk are accessed randomly). Then use a filter to keep only the rows that match the timeOfJoining and timeOfLeaving criteria.
Please help me figure out which one will be faster. It'll be great if you could also tell me the respective pros and cons.
EDIT: I intend to use my table as read-only.
If you are dealing with lots and lots of rows and you want to use a relational database, then your best bet for such a query is to satisfy it entirely in an index. The example query is:
select place
from myTable
where Designation='Manager' and
timeOfJoining > '1930-10-10' and
timeOfLeaving < '1950-10-10';
The index should contain the four fields mentioned in the query. This suggests an index like: mytable(Designation, timeOfJoining, timeOfLeaving, place). Note that only the first two will be used to narrow the search for the where clause, because of the inequality on timeOfJoining. However, most databases will do an index scan over the remaining entries and check timeOfLeaving there, without touching the table.
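To see what satisfying the query entirely in an index looks like, here is a small sketch using Python's sqlite3 module as a stand-in for whichever relational database is used (an assumption; plan output and optimizer behaviour differ between products). The index name idx_covering and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    " serial_number INTEGER PRIMARY KEY,"
    " name TEXT, Designation TEXT, place TEXT,"
    " timeOfJoining TEXT, timeOfLeaving TEXT)"
)
# All four columns the query touches, with the equality column first.
conn.execute(
    "CREATE INDEX idx_covering ON myTable"
    " (Designation, timeOfJoining, timeOfLeaving, place)"
)
conn.executemany(
    "INSERT INTO myTable"
    " (name, Designation, place, timeOfJoining, timeOfLeaving)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("Ann", "Manager", "Leeds", "1935-01-01", "1945-01-01"),
        ("Bob", "Manager", "York", "1935-01-01", "1960-01-01"),
        ("Cy", "Clerk", "Bath", "1935-01-01", "1945-01-01"),
    ],
)

query = (
    "SELECT place FROM myTable"
    " WHERE Designation = 'Manager'"
    " AND timeOfJoining > '1930-10-10'"
    " AND timeOfLeaving < '1950-10-10'"
)
places = [row[0] for row in conn.execute(query)]
plan_detail = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

print(places)
print(plan_detail)
```

The plan mentions a covering index, meaning the table rows themselves are never read. It also lists only Designation and timeOfJoining as search constraints, which matches the point about the inequality.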
With such a large amount of data, you have other problems. Although memory is getting cheaper and machines bigger, indexes often speed up queries because an index is smaller than the original table and faster to load in memory. For "trillions" of records, you are talking about tens of trillions of bytes of memory, just for the index -- and I don't know which databases are able to manage that amount of memory.
Because this is such a large system, just the hardware costs are still going to be rather expensive. I would suggest a custom solution that stored the data in a compressed format with special purpose indexing for the queries. Off-the-shelf databases are great products applicable in almost all data problems. However, this seems to be going near the limit of their applicability.
Even small efficiencies over an off-the-shelf database start to add up with such a large volume of data. For instance, the layout of records on pages invariably leaves empty space on a page (records don't exactly fit on a page, the database has overhead that you may not need such as bits for nullability, and so on). Say the overhead of the page structure and empty space amount to 5% of the size of a page. For most applications, this is in the noise. But 5% of 100 trillion bytes is 5 trillion bytes -- a lot of extra I/O time and wasted storage.
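The arithmetic in that example can be checked in a couple of lines. This is only a sketch of the calculation, using the 100 trillion bytes and 5% figures from the paragraph above; the variable names are arbitrary.

```python
data_bytes = 100 * 10**12      # "100 trillion bytes", as in the text
overhead_percent = 5           # page structure plus empty space

# Integer arithmetic keeps the result exact.
wasted_bytes = data_bytes * overhead_percent // 100
wasted_terabytes = wasted_bytes / 10**12

print(f"{wasted_bytes:,} bytes wasted (about {wasted_terabytes:.0f} TB)")
```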
EDIT:
The real answer to the choice between the two options is to test them. This shouldn't be hard, because you don't need to test them on trillions of rows -- and if you have the hardware for that, you have the hardware for smaller tests. Take a few billions of rows on a machine with correspondingly less memory and CPUs and see which performs better. Once you are satisfied with the results, multiply the data by 10 and try again. You might want to do this one more time if you are not convinced of the results.
My opinion, though, is that the second is faster. The first duplicates the "serial number" in both tables, adding 8 bytes to each row ("int" is typically 4-bytes and that isn't big enough, so you need bigint). That alone will increase the I/O time and size of indexes for any analysis. If you were considering a columnar data store (such as Vertica) then this space might be saved. The savings on removing one or two columns is at the expense of reading in more bytes in total.
Also, don't store the raw form of any of the variables in the table. The "Designation" should be in a lookup table as well as the "place" and "name", so each would be 4-bytes (that should be big enough for the dimensions, unless one is all people on earth).
But . . . The "best" solution in terms of cost, maintainability, and scalability is probably something like Hadoop. That is how companies like Google and Yahoo manage vast quantities of data, and it seems apt here too.
Given the amount and type of data, I would suggest going with the second option. The upside is that you do not need to join anything, and a join is usually very costly. However, in that case you are holding lots of redundant data.
The first option would be more memory efficient, the second more time efficient.
Furthermore, using indices, the DBMS is able to use index scans to read data from storage. Also, you should consider changing the variable length datatypes to fixed length datatypes, then the DBMS has an easier job of jumping between tuples as every tuple has a fixed (and known) length. In that case, operations like skip the next 100000 tuples are easy for the DBMS.
I am sorry to tell you but this schema just won't work for 'trillions' of records with any relational database. Just to store the index pages for serial_number and Designation for 1 trillion rows will require 465 terabytes. That is more than double the size of the entire World Data Centre for Climate database that currently holds the world record as the largest. If these requirements are for real, you really need to move to a star/snowflake schema. That means no varchars in this fact table, not even dates, only integers. Move all text and date fields to dimensions.
For the most part a single table makes some sense, but it would be ridiculous to store all those values as strings. Depending on the uniqueness of your name/designation/place fields, you could use something like this:
serial_number | BIGINT
name_ID | INT
Designation_ID | INT
place_ID | INT
timeOfJoining | timestamp with time zone
timeOfLeaving | timestamp with time zone
Without knowing the data it's impossible to know which lookups would be practical. As others have mentioned you've got some challenges ahead. Regarding indexing, I agree with Gordon.
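A rough sketch of the lookup-table layout, again with Python's sqlite3 module standing in for the real database (an assumption). The table names designation, place and my_table, the helper lookup_id and the sample rows are all made up, and name_ID is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE designation (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE place       (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE my_table (
    serial_number  INTEGER PRIMARY KEY,
    designation_id INTEGER REFERENCES designation(id),
    place_id       INTEGER REFERENCES place(id),
    timeOfJoining  TEXT,
    timeOfLeaving  TEXT
);
""")

def lookup_id(table, label):
    """Return the id for label, adding it to the lookup table if it is new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (label) VALUES (?)", (label,))
    row = conn.execute(f"SELECT id FROM {table} WHERE label = ?", (label,))
    return row.fetchone()[0]

sample = [
    ("Manager", "Leeds", "1935-01-01", "1945-01-01"),
    ("Manager", "York", "1935-01-01", "1960-01-01"),
    ("Clerk", "Leeds", "1935-01-01", "1945-01-01"),
]
for designation, place, joined, left in sample:
    conn.execute(
        "INSERT INTO my_table"
        " (designation_id, place_id, timeOfJoining, timeOfLeaving)"
        " VALUES (?, ?, ?, ?)",
        (lookup_id("designation", designation), lookup_id("place", place),
         joined, left),
    )

# The big table stores only small integers; text is resolved via the lookups.
places = [row[0] for row in conn.execute(
    "SELECT p.label FROM my_table t"
    " JOIN designation d ON d.id = t.designation_id"
    " JOIN place p ON p.id = t.place_id"
    " WHERE d.label = 'Manager'"
    " AND t.timeOfJoining > '1930-10-10'"
    " AND t.timeOfLeaving < '1950-10-10'"
)]
print(places)
```

Each distinct string is stored once, so the repeated 255-character columns shrink to fixed-size integers. Whether the extra joins pay off depends on how many distinct values each column really has.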

MySQL Improving speed of order by statements

I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, and order by any single field, and any number of qualifiers. There's a big slowdown in the sort, which is generally taking several seconds, sometimes as much as 7-10 seconds.
an example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing what fields are going to be sorted on, i'd have to index all 200 columns - what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to each column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes, most of them would probably never be used while still generating quite a large overhead (see chapter 7.3 in the link below for details). What you could do, however, is try to identify which columns appear most often in the WHERE clause, and index those.
In the long run, however, you will probably need to find a way to rework your data structure into something more manageable, because as it is now, it has the smell of 'spreadsheet turned into database', which is not a nice smell.
I've read about sort_buffer_size but it seems like everything I read
conflicts with the last thing I read - is it advisable to increase
this value, or any of the other similar values (read_buffer_size,
etc)?
In general the answer is yes. However, the actual details depend on your hardware, OS and what storage engine you use. See chapter 7.11 (especially 7.11.4) in the link below.
Also, the primary identifier is a crazy pattern they came up with in
the nineties.[...] I wonder if there's anything I can do to speed up
the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, in which you store numeric values that map to the PK order (precalculated with the stored procedure you're using).
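A small sketch of that idea, using Python's sqlite3 module in place of MySQL (an assumption). The identifier format and the legacy_sort_key function are hypothetical stand-ins for the real pattern and the stored procedure, which the question does not show.

```python
import sqlite3

def legacy_sort_key(pk):
    """Hypothetical ordering rule: by year, then by number within the year."""
    number, year = pk.split("-")[1].split("/")
    return (int(year), int(number))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (pk TEXT PRIMARY KEY, primarySortOrder INTEGER)"
)
ids = ["C-12/1994", "C-3/1994", "C-1/2001", "C-250/1993"]
conn.executemany("INSERT INTO records (pk) VALUES (?)", [(i,) for i in ids])

# Run the expensive ordering logic once and store the resulting rank.
ranked = sorted(ids, key=legacy_sort_key)
conn.executemany(
    "UPDATE records SET primarySortOrder = ? WHERE pk = ?",
    [(rank, pk) for rank, pk in enumerate(ranked)],
)
conn.execute("CREATE INDEX idx_sort ON records (primarySortOrder)")

# Sorting is now a plain integer ORDER BY that an index can serve.
ordered = [row[0] for row in conn.execute(
    "SELECT pk FROM records ORDER BY primarySortOrder"
)]
print(ordered)
```

The catch is that the ranks have to be maintained when rows are added, for example by recomputing them in a batch job or by leaving gaps between values.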
And the link you've been waiting for: Chapter 7 from the MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason to not do this, but how about three fields in a thirty-field table? 10 in a 30 field? Where should one draw the line, and why?
Indexes take up space in memory (RAM). With too many indexes, or indexes that are too large, the DB is going to have to swap them to and from disk. They also increase insert, update and delete time (each index must be updated for every piece of data inserted, deleted or updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
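The space cost is easy to observe. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (an assumption, so the exact numbers will differ) and compares how many database pages the same rows occupy with and without an index on every column. The table layout and the helper name pages_used are made up.

```python
import os
import sqlite3
import tempfile

def pages_used(index_every_column):
    """Load the same rows and return how many pages the database file uses."""
    columns = [f"c{i}" for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
        column_defs = ", ".join(f"{c} TEXT" for c in columns)
        conn.execute(f"CREATE TABLE t (id INTEGER PRIMARY KEY, {column_defs})")
        if index_every_column:
            for c in columns:
                conn.execute(f"CREATE INDEX idx_{c} ON t ({c})")
        rows = [tuple(f"value-{i}-{j}" for j in range(10)) for i in range(5000)]
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO t ({', '.join(columns)}) VALUES ({placeholders})",
            rows,
        )
        conn.commit()
        pages = conn.execute("PRAGMA page_count").fetchone()[0]
        conn.close()
        return pages

pages_plain = pages_used(index_every_column=False)
pages_indexed = pages_used(index_every_column=True)
print(pages_plain, pages_indexed)
```

Every extra page is something that must be written on insert and that competes for RAM on read, which is the trade-off described in the answers above.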
You have to balance CRUD needs. Writing to tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing takes up more space on both disk and in RAM, but it also improves performance a lot. Unfortunately, once the indexes no longer fit in memory, the system falls back to disk and performance suffers. In practice, you shouldn't index a field that you don't expect to be involved in any kind of data lookup, whether for inserting or for searching (a WHERE clause). If it is involved, index it. The fields to consider leaving unindexed are those queried only by moderators, unless those queries need to be fast too.
It is not a good idea to index all the columns in a table. While this will make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed involves putting the new record in that table and then putting each column's information into its own index table.
This answer is my personal opinion, based on my own mathematical reasoning.
The second question was about where to stop. First, let's do some calculation. Suppose we have N rows with L fields in a table. If we index all the fields, we get L new index tables, each of which sorts the data of its indexed field in a meaningful way. At first glance, if your table has size W it will become roughly 2W (1 TB becomes 2 TB). If you have 100 big tables (I have worked on a project with around 1,800 tables), you waste 100 times this space (100 TB), which is far from wise.
If we index every field in every table, we also have to think about index updates: one update triggers an update of every index, which costs about as much time as an unordered select over everything.
From this I conclude that, if you have to lose this time somewhere, it is preferable to lose it in a select rather than in an update, because selecting on a field that is not indexed does not trigger further selects on all the other unindexed fields.
What to index?
Foreign keys: a must, based on
Primary key: I'm not yet sure about it; maybe someone who reads this could help with this case.
Other fields: the first natural answer is half of the remaining fields. Why? If you should have indexed more, you are not far from the best answer, and if you should have indexed less, you are not far either, because we know that no indexes is bad and indexing everything is also bad.
From these 3 points I conclude that if we have L fields, of which K are keys, the limit should be somewhere near ((L-K)/2)+K, give or take L/10.
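Spelled out as code, this rule of thumb looks as follows. It is only the heuristic from this answer, not an established guideline, and the function name and the example figures (30 fields, 4 keys) are made up.

```python
def suggested_index_range(total_fields, key_fields):
    """Range of index counts according to the ((L-K)/2)+K rule, +/- L/10."""
    midpoint = (total_fields - key_fields) / 2 + key_fields
    tolerance = total_fields / 10
    return midpoint - tolerance, midpoint + tolerance

low, high = suggested_index_range(total_fields=30, key_fields=4)
print(low, high)
```

For a 30-field table with 4 key fields the midpoint is 17, so the rule suggests somewhere between 14 and 20 indexes. Actual query patterns are a better guide than any fixed ratio.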
This answer is based on my logic and personal practices.
First of all, at least in SAP - ABAP and in background database table, we can create one index table for all required index fields, we will have their addresses only. So other SQL related software-database system can also use one table for all fields to be indexed.
Secondly, what is write performance? A company records, for example, 50 sales orders in one day. Let's assume there is a sales order header table, VBAK, with 30 fields, each 20 CHAR long.
I can write to the real table in seconds, while the index table is updated in the background. Suppose a report is run at the same time: while the index table is being searched, the database can apply a rule that waits for any index write in progress to finish (say 5 sales orders were being recorded at once, taking maybe 5 seconds). So the running report waits 5 seconds and then runs for 5 seconds, 10 seconds in total.
Without an index, the running report does not wait 5 seconds for the write, but it runs for maybe 40 seconds.
So what does write performance mean here? No one writes thousands of records at the same time, but they do read them.
And reading a second table means the fields are already sorted. I have 3 fields selected, I can find which sorted sets I need to search, and then I fetch the data. As for RAM: it is just a copied index table holding one piece of data per field, the address. What memory?
I think, this is one of the software company secrets hide from customers, not to wake them up , otherwise they will not need another system in the future with an expensive price.

mysql index optimization for a table with multiple indexes that index some of the same columns

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:
id, site_id, unixtime, unixtime_last, ip_address, uid
There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid
There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.
Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.
Any ideas on making this more efficient?
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? Because I considered having the site_id being the second column of the index for both ip_address and uid but I think that would make the index less efficient since the IP and UID are going to vary more than the site ID will, because we only have about 8000 unique sites per database server, but millions of unique visitors across all ~8000 sites on a daily basis.
I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:
select id from sessions where uid = 'value' and site_id = 123 limit 1
... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.
Any insight on making this as efficient as possible would be appreciated :)
Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage space. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.
We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.
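The prefix rule can be seen in a query plan. This sketch uses Python's sqlite3 module as a stand-in for MySQL (an assumption; on MySQL you would run EXPLAIN and the output looks different). The column name ip is shortened from ip_address to match the index names used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    " id INTEGER PRIMARY KEY, site_id INTEGER, unixtime INTEGER,"
    " ip TEXT, uid TEXT)"
)
conn.execute("CREATE INDEX site_ip ON sessions (site_id, ip)")

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

prefix_query = "SELECT uid FROM sessions WHERE site_id = 123 AND ip = '1.2.3.4'"
suffix_query = "SELECT uid FROM sessions WHERE ip = '1.2.3.4'"

plan_prefix = plan(prefix_query)   # both leading columns given: index search
plan_suffix = plan(suffix_query)   # only the second column given: scan

# With ip as the leading column, the same query can use an index.
conn.execute("CREATE INDEX ip_site ON sessions (ip, site_id)")
plan_suffix_after = plan(suffix_query)

print(plan_prefix)
print(plan_suffix)
print(plan_suffix_after)
```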
There is a second property of B-tree indices you should be aware of as well: they are sorted. A B-tree index can be used for queries like where site_id < 40.
There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.
The same, by the way, applies to seeks through an index. Finding a key in a B-tree potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, because each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.
Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.
PS: Benchmark InnoDB as well, it often has better concurrent performance. Same with PostgreSQL.
First of all, if you are storing the IP as a string, then change it to an INT UNSIGNED column and use the INET_ATON(expr) and INET_NTOA(expr) functions to convert. Indexing on an integer value is more efficient than indexing on strings of variable length.
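For illustration, here is what that conversion does, written with Python's standard ipaddress module. The function names simply mirror MySQL's INET_ATON and INET_NTOA; this is a sketch of the mapping for IPv4 addresses, not MySQL's own implementation.

```python
import ipaddress

def inet_aton(dotted):
    """Dotted-quad IPv4 string to an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(dotted))

def inet_ntoa(number):
    """Unsigned 32-bit integer back to a dotted-quad string."""
    return str(ipaddress.IPv4Address(number))

as_int = inet_aton("1.2.3.4")
as_text = inet_ntoa(as_int)
largest = inet_aton("255.255.255.255")

print(as_int, as_text, largest)
```

Every IPv4 address fits in 4 bytes this way, compared with up to 15 characters for the text form.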
Well, indexes trade storage for performance. It's hard if you want both. It's hard to optimize this any further without knowing all the queries you run and how often each one runs.
What you have will work. If you're running into a bottleneck, you'll need to find out whether it's CPU, RAM, disk and/or network, and adjust accordingly. It's hard and wrong to optimize prematurely.
You probably want to switch to InnoDB if you have any updates; otherwise MyISAM is good for insert/select. Also, since your row size is small, you could look into MySQL Cluster (NDB). There is also an ARCHIVE engine that can help with storage requirements, but partitioning in 5.1 is probably a better thing to look into.
Flipping the order of your index doesn't make any sense, if these indexes are already used in all of your queries.
but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
Not sure, but I haven't heard this before, and it doesn't seem true to me for this application. The column order of an index matters for sorting, and having a different leading column in each index allows more kinds of queries to use an index.

char vs varchar for performance in stock database

I'm using mySQL to set up a database of stock options. There are about 330,000 rows (each row is 1 option). I'm new to SQL so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), company name (varies from 5 to 60 characters).
I want to optimize for speed. Both creating the database (which happens every 5 minutes as new price data comes out -- i don't have a real-time data feed, but it's near real-time in that i get a new text file with 330,000 rows delivered to me every 5 minutes; this new data completely replaces the prior data), and also for lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the db lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100MBs for the whole thing) then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!
In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width. CHAR is fixed-width. If your rows have only fixed-width data types, then the whole row is fixed-width, and MySQL gains some advantage calculating the space requirements and offset of rows in that table. That said, the advantage may be small and it's hardly worth a possible tiny gain that is outweighed by other costs (such as cache efficiency) from having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
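A quick way to picture this, sketched in Python rather than MySQL: the display width only affects zero-padding of the printed value, while the stored value occupies the same 4 bytes either way. The use of struct here is just an analogy for a 32-bit integer, not MySQL's on-disk format.

```python
import struct

value = 1234

# What ZEROFILL would display for INT(7) and INT(11).
shown_as_int7 = str(value).zfill(7)
shown_as_int11 = str(value).zfill(11)

# A 32-bit signed integer takes 4 bytes regardless of display width.
stored = struct.pack("<i", value)

print(shown_as_int7, shown_as_int11, len(stored))
```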
If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.
Given your system constraints I would suggest a varchar since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width char. This means more code somewhere which is more to debug, and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and varchar. For one, if you have a web user attempting to run an ad hoc query and the database is in the process of being recreated, you are going to get errors (i.e. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
I would definitely not recreate the database each time. Instead I would do the following:
read in the update/snapshot file and create some object based on each row.
for each row get the symbol/option name (unique) and set that in the database
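Those two steps could look roughly like this. The sketch uses Python's sqlite3 module and its INSERT OR REPLACE statement as a stand-in (an assumption; in MySQL, REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE play the same role). The quotes table, its columns, the apply_snapshot helper and the sample symbols are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes ("
    " option_symbol TEXT PRIMARY KEY,"
    " stock_symbol TEXT,"
    " price TEXT,"
    " updated_at INTEGER)"
)

def apply_snapshot(rows, timestamp):
    """Insert new symbols and overwrite existing ones in one transaction."""
    with conn:  # commits on success, rolls back if anything fails
        conn.executemany(
            "INSERT OR REPLACE INTO quotes VALUES (?, ?, ?, ?)",
            [(opt, stock, price, timestamp) for opt, stock, price in rows],
        )

# Two consecutive 5-minute files; the second updates one row and adds one.
apply_snapshot([("AAPLA", "AAPL", "1.25"), ("MSFTB", "MSFT", "0.80")], 1000)
apply_snapshot([("AAPLA", "AAPL", "1.30"), ("IBMC", "IBM", "2.10")], 1300)

current = dict(conn.execute("SELECT option_symbol, price FROM quotes"))
print(current)
```

Readers never see a missing database, because the table always exists and each refresh becomes visible in one commit. Rows absent from the latest file keep their older updated_at, so they can be filtered out or deleted as needed.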
If it were me I would also have an in memory cache of all the symbols and the current price data.
Price data is never an int - you can use characters.
The company name is probably not unique as there are many options for a particular company. That should be an index and you can save space just using the id of a company.
As someone else also pointed out - your web clients do not need to have to hit the actual database and do a query - you can probably just hit your cache. (though that really depends on what tables and data you expose to your clients and what data they want)
Having query access for other users is also a reason NOT to keep removing and creating a database.
Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively slow operation. It is orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.