The effect of the fields length on the querying time - mysql

I have a MySQL database in which I keep information about items, including a description.
The thing is that the description column can hold up to 150 chars, which I think is long, and I wondered if it slows down querying. I also wanted to know if it's recommended to shrink integer columns: if I have a price which is normally not that big, should I limit the column to a small/medium int?
The columns are something like this:
id name category publisher mail price description
Thanks in advance.

Store your character data as varchar() and not as char() and read up on the MySQL documentation on these data types (here). A varchar() column stores only the characters actually in the description, plus a few more bytes of overhead.
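To make the varchar() versus char() difference concrete, here is a rough sketch of the per-value storage. It assumes a single-byte character set and uses hypothetical helper names (char_storage, varchar_storage); the 1-byte or 2-byte length prefix follows the documented VARCHAR rule, but real row storage has further engine-specific overhead that is ignored here.

```python
def char_storage(declared_len):
    """CHAR(n): always n bytes (single-byte character set assumed)."""
    return declared_len


def varchar_storage(value, declared_len):
    """VARCHAR(n): actual length plus a 1- or 2-byte length prefix."""
    if len(value) > declared_len:
        raise ValueError("value does not fit in the column")
    # One prefix byte is enough when the column never needs more than 255 bytes.
    prefix = 1 if declared_len <= 255 else 2
    return prefix + len(value)


description = "Hardcover, 320 pages"  # 20 characters, made-up sample value
print(char_storage(150))                   # 150
print(varchar_storage(description, 150))   # 21
```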
As for whether longer fields imply worse-performing queries, that is a complicated subject. Obviously, at the extreme, having the maximum size records is going to slow things down versus a 10-byte record. The reason has to do with I/O performance. MySQL reads in pages and a page can contain one or more records. The records on the page are then processed.
The more records that fit on the page, the fewer the I/Os.
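As a back-of-the-envelope illustration of that point, the sketch below estimates how many pages a full scan touches for different record sizes. It assumes a 16 KiB page (the InnoDB default) and ignores page headers, per-row overhead and free space, so the numbers are only indicative; the function names are hypothetical.

```python
PAGE_SIZE = 16 * 1024  # assumed page size in bytes (InnoDB default)


def rows_per_page(record_bytes, page_size=PAGE_SIZE):
    """How many whole records fit on one page, ignoring all overhead."""
    return page_size // record_bytes


def pages_needed(row_count, record_bytes, page_size=PAGE_SIZE):
    """Pages required to hold row_count records (integer ceiling division)."""
    per_page = rows_per_page(record_bytes, page_size)
    return -(-row_count // per_page)


print(rows_per_page(200))            # 81
print(pages_needed(1_000_000, 200))  # 12346
print(pages_needed(1_000_000, 10))   # 611
```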
But then it gets more complicated, depending on the hardware and the storage engine. Disks, nowadays, do read-aheads as do operating systems. So, the next read of a page (if pages are not fragmented and are adjacent to each other) may be much faster than the read of the initial page. In fact, you might have the next page in memory before processing on the first page has completed. At that point, it doesn't really matter how many records are on each page.
And, 200 bytes for a record is not very big. You should worry first about getting your application working and second about getting it to meet performance goals. Along the way, make reasonable choices, such as using varchar() instead of char() and appropriately sized numerics (you might consider fixed point numeric types rather than float for monetary values).
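The remark about fixed point types for monetary values can be illustrated outside the database. This is only an analogy: Python's float behaves like a binary floating point column, and decimal.Decimal stands in for a fixed point type such as DECIMAL.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)                  # False

# Decimal arithmetic keeps the cents exact.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))    # True
```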

Only you consider 150 characters long; the database most likely does not, as databases are designed to handle much more at once. Do not consider sacrificing your data for "performance". If the nature of your application requires you to store up to 150 characters of text at once, don't be afraid to do so, but do look up optimization tips.
Using proper data types, though, can help you save space. For instance, if you have a field which is meant to store values 0 to 20, there's no need for an INT field type. A TINYINT will do.
The documentation lists the data types and provides information on how much space they use and how they're managed.
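As a small sketch of how that choice could be made, the helper below picks the narrowest MySQL integer type for a known value range. The byte sizes (1, 2, 3, 4 and 8) are the documented ones for TINYINT through BIGINT; the function name and the preference for signed over unsigned types are assumptions made for this illustration.

```python
# (type name, storage size in bytes), narrowest first
INTEGER_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("MEDIUMINT", 3),
    ("INT", 4),
    ("BIGINT", 8),
]


def smallest_integer_type(min_value, max_value):
    """Return (type name, bytes) of the narrowest type covering the range."""
    for name, size in INTEGER_TYPES:
        bits = size * 8
        # Try the signed variant first.
        if -(2 ** (bits - 1)) <= min_value and max_value <= 2 ** (bits - 1) - 1:
            return name, size
        # Fall back to the unsigned variant for non-negative ranges.
        if min_value >= 0 and max_value <= 2 ** bits - 1:
            return name + " UNSIGNED", size
    raise ValueError("range does not fit in any MySQL integer type")


print(smallest_integer_type(0, 20))       # ('TINYINT', 1)
print(smallest_integer_type(0, 200))      # ('TINYINT UNSIGNED', 1)
print(smallest_integer_type(0, 100_000))  # ('MEDIUMINT', 3)
```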

Related

mysql getting rid of redundant values

I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points(~4000) a couple times every minute and stores them in my database. I need to be able to down sample based on the time stamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd like to run would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is that this seems super inefficient with respect to memory. Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I thought of that reduce the memory footprint of my database require me either to use separate tables (which to my understanding is super bad practice) or to store the data in CSV files, which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and would probably search substantially slower). Is there a database structure that I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what. That is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So, a record would have, say 4+4+8=16 bytes (assuming a double floating point representation for value). By removing the timestamp you would get 12 bytes -- savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 Mbytes/day and 138 Mbytes/day, or 67 Gbytes/year and 50 Gbytes. You know, you are going to have to deal with biggish data issues regardless of how you store the timestamp.
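The figures above can be reproduced with a few lines of arithmetic. The sketch assumes that "a couple times every minute" means exactly 2 samples per minute and uses decimal megabytes and gigabytes; the variable and function names are made up for the illustration.

```python
POINTS = 4000            # data points per sample (from the question)
SAMPLES_PER_MINUTE = 2   # assumption: "a couple times every minute"

rows_per_day = POINTS * SAMPLES_PER_MINUTE * 60 * 24  # 11,520,000 rows


def mbytes_per_day(record_bytes):
    """Daily volume in decimal megabytes for a given record size."""
    return rows_per_day * record_bytes / 1e6


def gbytes_per_year(record_bytes):
    """Yearly volume in decimal gigabytes for a given record size."""
    return mbytes_per_day(record_bytes) * 365 / 1e3


# 16 bytes per record with the timestamp, 12 bytes without it
print(round(mbytes_per_day(16)), round(mbytes_per_day(12)))    # 184 138
print(round(gbytes_per_year(16)), round(gbytes_per_year(12)))  # 67 50
```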
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that!!! Normalising data (splitting into separate, linked data structures) is actually good practice, so long as you don't overdo it, and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.

Optimization of Mysql table size

I am considering a problem.
In C, we are advised that the size of a struct should be a multiple of 2 bytes.
e.g.:
struct text {
    int index;     // assume int is 4 bytes
    char word[8];
}; // although text is only 12 bytes, the compiler would assign 16 bytes for this struct
Therefore, I am wondering whether the record size (thanks to Gordon Linoff) in MySQL runs into the same problem?
Moreover, how can I optimize MySQL via controlling table size?
First, you are referring to a record size and not the table size.
Second, databases do not work the way that procedural languages do. Records are stored on pages, which are filled up until no more fit. Then additional pages are used. Typically, there are many records on a page.
You can get an idea of what a page looks like here. They are complicated but basically hidden from the user.
It sounds like you are attempting "premature optimization". This isn't quite the root of all evil, but it is a major distraction to getting things accomplished. In other words, define the record as you need it defined. Do what you want to do. If you have performance problems, then fix those when they arise.
The size of a record is going to be the least of your problems. Databases perform I/O in units of pages, so the difference between 12 and 16 bytes is meaningless for a single record. You still have to read the entire page (which is much larger).

To BLOB or not to BLOB

I am in the process of writing a web app backed up by a MySQL database where one of the tables has a potential for getting very large (order of gigabytes) with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests thus far it has reached a size of 289 bytes but to be on the safe side I want to design for a maximum size of 1 kb. Currently I am storing that column as a MySQL MediumBlob field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs vs other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (768 if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also got the notion that if a row has more than one BLOB column (which my table does not) then the "elsewhere" is a different location for each BLOB. That apart I have got the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this - in light of my BLOB size and the large potential size of the table should I bother with a blob at all? Also, if I use some form of inrow storage instead will that not have an adverse effect on the maximum number of rows that the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.
I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (maximum size of a table) is 64TBytes. Using variable size types to move that storage into a separate file would certainly change the upper limit on number of rows but 64TBytes is quite a lot of space so the ratio might be very small.
Having a column with a 1KByte string type that is stored inside the table seems like a viable solution since it's also very small compared to 64TBytes. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64TByte limit might be pushed down by the maximum file size for the OS you're using. You can always link several files together to get more space for your table, but then it starts to get a bit messier.
If the BLOB data is more than 250 KB it is not worth it. In your case I wouldn't bother with a BLOB. Read this

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
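A quick way to see the difference is to compare the raw digest with its hex form. The sketch below uses Python's hashlib with a made-up input; note that a plain SHA-1 hex digest is 40 characters, so the 48 characters mentioned in the question presumably include something extra (that is a guess, not something the question states).

```python
import hashlib

payload = b"example record"  # made-up input, for illustration only

raw_id = hashlib.sha1(payload).digest()     # bytes, as you would store them
hex_id = hashlib.sha1(payload).hexdigest()  # human-readable hex string

print(len(raw_id))  # 20
print(len(hex_id))  # 40

# The two forms convert losslessly in both directions.
assert bytes.fromhex(hex_id) == raw_id
assert raw_id.hex() == hex_id

dataset_size = 500_000
raw_mib = dataset_size * len(raw_id) / 2**20
print(round(raw_mib, 1))  # 9.5
```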
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order: (search_id bigint unsigned, item_id binary(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
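The first option (one table keyed by search id and item id) could look roughly like the sketch below. It uses Python's built-in sqlite3 module with an in-memory database purely so the example is self-contained; that is a stand-in for the MySQL table described above, and SQLite's storage behaviour differs from InnoDB's clustering. The helper names (save_search, load_search, expire_search) are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE saved_searches (
        search_id INTEGER NOT NULL,
        item_id   BLOB    NOT NULL,  -- raw 20-byte SHA-1, not hex text
        PRIMARY KEY (search_id, item_id)
    )
    """
)


def save_search(conn, search_id, item_ids):
    """Insert one dataset, with keys in sorted order as suggested above."""
    rows = [(search_id, item_id) for item_id in sorted(item_ids)]
    conn.executemany(
        "INSERT INTO saved_searches (search_id, item_id) VALUES (?, ?)", rows
    )


def load_search(conn, search_id):
    """Return every item id of a saved dataset, in key order."""
    cur = conn.execute(
        "SELECT item_id FROM saved_searches WHERE search_id = ? ORDER BY item_id",
        (search_id,),
    )
    return [row[0] for row in cur]


def expire_search(conn, search_id):
    """Drop a whole dataset once the session or job is finished."""
    conn.execute("DELETE FROM saved_searches WHERE search_id = ?", (search_id,))


# Made-up item ids standing in for the Solr document checksums.
ids = [hashlib.sha1(str(n).encode()).digest() for n in range(1000)]
save_search(conn, 1, ids)
print(len(load_search(conn, 1)))  # 1000
expire_search(conn, 1)
print(len(load_search(conn, 1)))  # 0
```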

How important is it to select the smallest possible data type when designing a database?

How much of a difference does using tinyint or smallint (when applicable) instead of just int do? Or restricting a char field to the minimum characters needed?
Do these choices affect performance or just allocated space?
On an indexed field in a significantly large table, the size of your field can have a large effect on performance. On a non-indexed field it's not nearly as important, but the database still has to write the extra data.
That said, the downtime of a resize of a large table can be several minutes or several hours even, so don't make them smaller than you'd imagine ever needing.
Yes it affects performance too.
If the indexes are larger, it takes longer to read them from disk, and less can be cached in memory.
I've frequently seen these three schema design defects causing problems:
A varchar(n) field was created with n only big enough for the sample of data that the designer had pulled in, not the global population: fine in unit tests, silent truncations in the real world.
A varchar(n) used where the data is fixed size. This masks data bugs.
A char(n) used for variable length data. This provides performance improvements (by enabling the data to sit in-line in the row on disc), but all the client code (and various stored procs/views etc.) needs to cope with whitespace padding issues (and often it doesn't). Whitespace padding can be difficult to track down, because spaces don't show up too well, and various libraries/sql clients suppress them.
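The padding problem in the third defect can be simulated without a database. The sketch below only imitates a value that comes back space-padded to the column width; whether a client actually sees the padding depends on the database and its settings, and store_as_char is a hypothetical helper.

```python
def store_as_char(value, width):
    """Imitate a fixed-width char(n) column that pads with trailing spaces."""
    if len(value) > width:
        raise ValueError("value does not fit in the column")
    return value.ljust(width)


stored = store_as_char("GB", 5)

print(repr(stored))                # 'GB   '
print(stored == "GB")              # False: the padding breaks naive comparisons
print(stored.rstrip(" ") == "GB")  # True: client code has to strip it itself
```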
I've never seen a well-intentioned (i.e. not just using varchar(255) for all cols) but conservative selection of the wrong data size cause a significant performance problem. By significant, I mean a factor of 10. I regularly see algorithmic design flaws (missing indexes, sending too much data over the wire etc.) causing much bigger performance hits.
Both, in some cases. But imo, it's more of a question of design than performance and storage considerations. The reason you don't make everything varchar(...) is because that doesn't accurately reflect what sort of data should be stored there, and it reduces your data's integrity and type-safety.