Formula to calculate MySQL single row size (engine MyISAM) - mysql

I have a situation where I have to create tables dynamically. Depending on some criteria I am going to vary the size of the columns of a particular table.
For that purpose I need to calculate the size of one row.
e.g.
If I am going to create the following table
CREATE TABLE sample(id int, name varchar(30));
so that formula would give me the size of a single row for the table above considering all overheads for storing a row in a mysql table.
Is it possible to do so, and is it feasible?

It depends on the storage engine you use, the row format chosen for that table, and also your indexes. But it is not very useful information.
Edit:
I suggest going against normalization only when you know exactly what you're doing. A DBMS is created to deal with large amounts of data. You probably don't need to serialize your structured data into a single field.
Keep in mind that your application layer then has to tokenize (or worse) the serialized field data to get the original meaning back, which certainly has a larger overhead than getting the data from the DB already in structured form.
The only exception I can think of is a client-heavy architecture, where moving processing to the client side actually takes burden off the server, and you would serialize your data anyway for the sake of the transfer. In server-side code (like PHP) it is not good practice to save serialized-style data into the DB.
(Though using PHP's built-in serialization may be a good idea in some cases. Your current project does not seem to benefit from it.)

VARCHAR is a variable-length data type: it has a maximum length, but the stored value can be shorter or even empty, so any calculation can only be approximate. Have a look at the Avg_row_length field in information_schema.tables.
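As a sketch of that lookup (the schema name your_db is a placeholder for your actual database), MySQL's own running estimate can be read like this:

```sql
-- Read MySQL's own average row size for the table; the value is an
-- estimate maintained by the storage engine, not an exact formula.
SELECT table_name,
       avg_row_length,   -- average bytes per row
       data_length,      -- total bytes of row data
       table_rows        -- row count (exact for MyISAM, estimated for InnoDB)
FROM   information_schema.tables
WHERE  table_schema = 'your_db'   -- placeholder database name
  AND  table_name   = 'sample';
```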

Related

Is it correct to have a BLOB field directly in the main table?

Which one is better: having a BLOB field in the same table or having a 1-TO-1 reference to it in another table?
I'm making a MySQL database whose main table is called item(ID, Description). This table is consulted by a program I'm developing in VB.NET, which offers the possibility to double-click a specific item obtained with a query. Once its dedicated form is opened, I would like to show an image stored in the BLOB field, a sort of item preview. The problem is I don't know where it is better to create this BLOB field.
Assuming to have a table like this: Item(ID, Description, BLOB), will the BLOB field affect the database performance on queries like:
SELECT ID, Description FROM Item;
If yes, what do you think about this solution:
Item(ID, Description)
Images(Item, File)
Where Images.Item references to Item.ID, and File is the BLOB field.
You can add the BLOB field directly to your main table, as BLOB fields are not stored in-row and require a separate look-up to retrieve their contents. Your dependent table is needless.
BUT another, and preferred, way is to store in your database table only a pointer to your image file (the path to the file on the server). That way you can retrieve the path and access the file from your VB.NET application.
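A minimal sketch of that approach, assuming hypothetical column names; the application reads image_path and opens the file itself:

```sql
CREATE TABLE item (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    description VARCHAR(255) NOT NULL,
    image_path  VARCHAR(255)   -- e.g. '/var/www/images/item_42.jpg'
);
```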
To quote the documentation about blobs:
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
In simpler terms, the blob's data isn't stored inside the table's row, only a pointer is - which is pretty similar to what you're trying to achieve with the secondary table. To make a long story short: there's no need for another table, MySQL already does the same thing internally.
Most of what has been said in the other answers is correct. I'll start from scratch, adding some caveats.
The two-table, 1-1, design is usually better for MyISAM, but not for InnoDB. The rest of my Answer applies only to InnoDB.
"Off-record" storage may happen to BLOB, TEXT, and 'large' VARCHAR and VARBINARY, almost equally.
"Large" columns are usually stored "off-record", thereby providing something very similar to your 1-1 design. However, by having InnoDB do the work usually leads to better performance.
The ROW_FORMAT and the size of the column makes a difference.
A "small" BLOB may be stored on-record. Pro: no need for the extra fetch when you include the blob in the SELECT list. Con: clutter.
Some ROW_FORMATs cut off at 767 bytes.
Some ROW_FORMATs store 20 bytes on-record; this is just a 'pointer'; the entire blob is off-record.
etc, etc.
Off-record is beneficial when you need to filter out a bunch of rows, then fetch only a few. Also, when you don't need the column.
As a side note, TINYTEXT is possibly useless. There are situations where the 'equivalent' VARCHAR(255) performs better.
Storing an image in the table (on- or off-record) is arguably unwise if that image will be used in an HTML page. HTML is quite happy to request the <img src=...> from your server or even some other server. In this case, a smallish VARCHAR containing a url is the 'correct' design.

How do databases handle redundant values?

Suppose I have a database with several columns. In each column there are lots of values that are often similar.
For example I can have a column with the name "Description" and a value could be "This is the description for the measurement". This description can occur up to 1000000 times in this column.
My question is not how I could optimize the design of this database but how a database handles such redundant values. Are these redundant values stored as effectively as with a perfect design (with respect to the total size of the database)? If so, how are the values compressed?
The only correct answer would be: it depends on the database and the configuration, because there is no silver bullet for this one. Some databases do store the values of each column only once (some column stores, or the like), but technically there is no necessity either way.
In some databases you can let the DBMS propose optimizations and in such a case it could possibly propose an ENUM field that holds only existing values, which would reduce the string to an id that references the string. This "optimization" comes at a price, for example, when you want to add a new value in the field description you have to adapt the ENUM field.
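A hedged sketch of that trade-off (table and values are made up for illustration): the ENUM stores a small integer per row instead of the full string, but adding a value requires redefining the column:

```sql
-- Description stored as an ENUM: each row holds a small integer id,
-- not the full string.
CREATE TABLE measurement (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    description ENUM('This is the description for the measurement',
                     'Another common description')
);

-- The price of the optimization: a new value means altering the column.
ALTER TABLE measurement
    MODIFY description ENUM('This is the description for the measurement',
                            'Another common description',
                            'A brand-new description');
```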
Depending on the actual use case, those optimizations are worth nothing or are even a show stopper, for example when the data changes very often (inserts or updates). The DBMS would spend more time managing uniqueness/duplicates than actually processing queries.
On the question of compression: that also depends on the configuration and the database system, I guess, and on the field type too. Text data can be compressed, and in the case of non-indexed text fields there should be almost no drawback to using a simple compression algorithm. Which algorithm depends on the DBMS and configuration, I suspect.
Unless you become more specific, there is no more specific answer, I believe.

Dedicated SQL table containing only unique strings

I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
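A sketch of the proposed layout; unique_string is as described above, and page_request is a hypothetical referencing table:

```sql
CREATE TABLE unique_string (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    string VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_string (string)
);

-- Hypothetical consumer table: stores only the integer id.
CREATE TABLE page_request (
    id                       INT AUTO_INCREMENT PRIMARY KEY,
    requested_at             DATETIME NOT NULL,
    browser_unique_string_id INT NOT NULL,
    FOREIGN KEY (browser_unique_string_id) REFERENCES unique_string (id)
);

-- Getting the "real" string back is a single join.
SELECT r.requested_at, s.string AS browser
FROM   page_request r
JOIN   unique_string s ON s.id = r.browser_unique_string_id;
```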
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but potentially other things too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!
I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not minimal. As the database grows, there are tens of millions of URIs. So, just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in hadoop, so the database table was, in essence, just a copy of a hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. And an index solution would work well up to your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know whether MySQL or SQL Server recognizes this, although columnar databases (such as Vertica) should.
SQL Server has another option. If you declare the string as VARCHAR(max), then it is stored on separate data pages from the rest of the data. During a full table scan, there is no need to load the additional pages into memory if the column is not referenced in the query.
This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table that it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table. You can include more than just a string on the lookup table, other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
Date
Time
IP Address
Browser_ID
Browser Table:
Browser_ID
Browser_Name
Browser_Version
Browser_Properties
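Written out as DDL, the sketch above might look like this (column types are assumptions chosen for illustration):

```sql
CREATE TABLE browser (
    browser_id         INT AUTO_INCREMENT PRIMARY KEY,
    browser_name       VARCHAR(100) NOT NULL,
    browser_version    VARCHAR(50),
    browser_properties VARCHAR(255)
);

CREATE TABLE request (
    request_date DATE NOT NULL,
    request_time TIME NOT NULL,
    ip_address   VARCHAR(45) NOT NULL,   -- 45 chars also fits IPv6
    browser_id   INT NOT NULL,
    FOREIGN KEY (browser_id) REFERENCES browser (browser_id)
);
```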
If you are planning on logging data in real time (as opposed to a batch job), then you want to ensure the time it takes to write a record to the database is as quick as possible. If you are logging synchronously, then obviously record creation time will directly affect the time it takes for an HTTP request to complete. If this is async, then slow record creation times will lead to a bottleneck. However, if this is a batch job, then performance will not matter so long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record you really want to flatten out your database structure. Your current query, in pseudocode, might look like:
SELECT #id = id FROM PagesTable
WHERE PageName = #RequestedPageName
IF #id = 0
THEN
    INSERT #RequestedPageName INTO PagesTable
    #id = SELECT @@IDENTITY 'or whatever method your db supports
                            'for fetching the id of a new record
END IF
INSERT #id, #BrowserName INTO BrowserLogTable
Whereas in a flat structure you would just need one INSERT.
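In MySQL specifically, the lookup-or-insert round trips can be collapsed with the documented INSERT ... ON DUPLICATE KEY UPDATE / LAST_INSERT_ID() idiom (table names here are illustrative, and pages.name needs a UNIQUE index):

```sql
-- Insert the page name, or, if it already exists, make
-- LAST_INSERT_ID() return the existing row's id.
INSERT INTO pages (name)
VALUES ('/some/page')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

-- LAST_INSERT_ID() now holds the id of the new or existing row.
INSERT INTO browser_log (page_id, browser_name)
VALUES (LAST_INSERT_ID(), 'SomeBrowser');
```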
If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use that for querying against.

Mysql polymorphic tables?

The needs would be long to describe, so I'll simplify the example.
I want to make a form creation system ( the user can create a form, adding fields, etc... ). Let's focus on checkbox vs textarea.
The checkbox can have a value of 0 or 1, depending on the checked status.
The textarea must be a LONGTEXT type.
So in the database, that gives me 3 choices concerning the structure of the table field_value:
1.
checkbox_value (TINYINT) | textarea_value (MEDIUMTEXT)
That means that no input will ever use all columns of the table. The table will waste some space.
2.
allfield_value (MEDIUMTEXT)
That means that for the checkbox I'll store a really tiny value in a MEDIUMTEXT, which is wasteful.
3.
tblcheckbox.value
tbltextarea.value
Now I have one separate table per field type. That's optimal in terms of space, but in the whole context of the application, I might expect to have to read over 100 tables (one query with many JOINs) in order to generate a single page that displays a form.
In your opinion, what's the best way to proceed?
Do not consider an EAV data model. It's easy to put data in, but hard to get data out. It doesn't scale. It has no data integrity. You have to write lots of code yourself to do things that any RDBMS does for you if you model your data properly. Trying to use an RDBMS to create a general-purpose form management system that can accommodate any future needs is an example of the Inner-Platform Effect antipattern.
(By the way, if you do use EAV, don't try to join all the attributes back into a single row. You already commented that MySQL has a limit on the number of joins per query, but even if you can live within that, it doesn't perform well. Just fetch an attribute per row, and sort it out in application code. Loop over the attribute rows you fetch from the database, and populate your object field by field. That means more code for you to write, but that's the price of Inner-Platform Effect.)
If you want to store form data relationally, each attribute would go in its own column. This means you need to design a custom table for your form (or actually set of tables if your forms support multivalue fields). Name the columns according to the meaning of each given form field, not something generic like "checkbox_value". Choose a data type according to the needs of the given form field, not a one-size-fits-all MEDIUMTEXT or VARCHAR(255).
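For example, a contact form with a newsletter checkbox and a comments textarea might get its own purpose-built table (names and types here are illustrative):

```sql
CREATE TABLE contact_form_response (
    id               INT AUTO_INCREMENT PRIMARY KEY,
    wants_newsletter TINYINT(1) NOT NULL DEFAULT 0,  -- the checkbox: 0 or 1
    comments         LONGTEXT                        -- the textarea
);
```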
If you want to store form data non-relationally, you have more flexibility. You can use a non-relational document store such as MongoDB or even Solr. You can store documents without having to design a schema as you would with a relational database. But you lose many of the structural benefits that a schema gives you. You end up writing more code to "discover" the fields of documents instead of being able to infer the structure from the schema. You have no constraints or data types or referential integrity.
Also, you may already be using a relational database successfully for the rest of your data management and can't justify running two different databases simultaneously.
A compromise between relational and non-relational extremes is the Serialized LOB design, with the extension described in How FriendFeed Uses MySQL to Store Schema-Less Data. Most of your data resides in traditional relational tables. Your amorphous form data goes into a single BLOB column, in some format that encodes fields and data together (for example, XML or JSON or YAML). Then for any field of that data you want to be searchable, create an auxiliary table to index that single field and reference rows of form data where a given value in that respective field appears.
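A hedged sketch of that layout (table and column names are illustrative, not taken from the article):

```sql
-- Main table: the amorphous form data serialized into one BLOB.
CREATE TABLE form_entity (
    id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    body BLOB NOT NULL   -- e.g. JSON such as {"age": 31, "city": "Oslo"}
);

-- One auxiliary table per field that must be searchable.
CREATE TABLE index_city (
    city      VARCHAR(100) NOT NULL,
    entity_id BIGINT NOT NULL,
    PRIMARY KEY (city, entity_id),
    FOREIGN KEY (entity_id) REFERENCES form_entity (id)
);

-- A search on the indexed field scans only the small index table.
SELECT e.body
FROM   index_city i
JOIN   form_entity e ON e.id = i.entity_id
WHERE  i.city = 'Oslo';
```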
You might want to consider an EAV data model.

Does MySQL store it in an optimal way if the same string is stored in multiple rows?

I have a table where one of the columns is a sort of id string used to group several rows of the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter the rows which belong to a certain map.
I have lots of such maps and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored for multiple rows, store it only once internally, and only reference it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
I think MySQL will duplicate your content each time : it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN in some of your queries: you might want to think a bit about the size of your data (is it really that big?) compared to the (small?) performance loss you may encounter because of that join.
Another solution would be using an ENUM datatype, at least if you know in advance which string you will have in your table, and there are only a few of those.
Finally, another solution might be to store an integer "code" corresponding to each string, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application, instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance -- still, this may be some kind of micro-optimization that is not necessary in your case...
If you are using the same values over and over again, then there is a good functional reason to move it to a separate table, totally aside from disk space considerations: To avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.