Does MySQL store it in an optimal way if the same string is stored in multiple rows? - mysql

I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter those rows which belong to a certain map.
I have lots of such maps and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored in multiple rows and store it only once internally, referencing it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?

MySQL will store the full value for every row, regardless of whether the same value already exists in a different row.
If you have a limited set of options, you could use an ENUM field; otherwise you could pull the names into another table and join on it.
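For illustration, a minimal sketch of the ENUM variant (the table, column and map names are made up for the example):

CREATE TABLE locations (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    map ENUM('walmart', 'target', 'costco') NOT NULL,
    label VARCHAR(100) NOT NULL,
    KEY idx_map (map)
);
-- Internally the ENUM value is stored as a small integer (1-2 bytes per row),
-- so repeating 'walmart' across millions of rows stays cheap.
-- The trade-off: adding a new map name requires an ALTER TABLE.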

I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN to some of your queries: you might want to think a bit about the size of your data (is it really that big?), compared to the (small?) performance loss you may encounter because of that join.
Another solution would be using an ENUM datatype, at least if you know in advance which strings you will have in your table, and there are only a few of those.
Finally, another solution might be to store an integer "code" corresponding to the strings, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application, instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance. Still, this may be some kind of micro-optimization that is not necessary in your case...
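As a rough sketch of the lookup-table variant (all names here are invented for the example), whether you resolve the id with a JOIN or cache the mapping in your application:

CREATE TABLE maps (
    map_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    map_name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE locations (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    map_id INT UNSIGNED NOT NULL,
    label VARCHAR(100) NOT NULL,
    KEY idx_map_id (map_id),
    CONSTRAINT fk_locations_maps FOREIGN KEY (map_id) REFERENCES maps (map_id)
);

-- Filtering by map name through a JOIN on the small lookup table:
SELECT l.*
FROM locations AS l
JOIN maps AS m ON m.map_id = l.map_id
WHERE m.map_name = 'walmart';

-- Or, with the mapping cached in the application, query by id directly:
SELECT l.* FROM locations AS l WHERE l.map_id = 3;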

If you are using the same values over and over again, then there is a good functional reason to move it to a separate table, totally aside from disk space considerations: To avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and maintain look-ups for them is probably a lot of work for very little gain.

Related

Storing attributes with multiple integer values

I need to store a dynamic number of integer attributes (1-8). I'm storing them in individual columns in the database table, like:
attribute_1, attribute_2, ..., attribute_8
This makes for a fairly messy model when methods need to reference these, as well as an unwieldy database table and schema.
These are assigned default values (but are overridable on a form), and represent unique identifiers for the user.
For instance, a Brew is composed of up to eight batches before they are mixed together in a fermenter. The brewer might want to go back and refer to any one of these by its unique identifying number. I'm assigning these unique values based on the last highest value when creating a new Brew. However, the user may want to override these default values to some other numbers.
In most cases (smaller breweries), they'll probably only use the first two, but some larger breweries would use all eight.
There must be a better way to store these than having eight different attributes with the same name and a number at the end.
I'm using MySQL. Is there an easy/concise way to store an array or a JSON hash but still be able to edit these values on a form?
I would not store attributes like that; it will limit you in the future. Let's say you want to know which brews have used attribute_4: you will have to scan the entire brews table, open the attributes field and deconstruct it to see if 4 is in there.
Much better to separate Brew and Attributes in two tables, and link them, like so:
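(A rough sketch of what that could look like; the table and column names below are only illustrative, not taken from the question:)

CREATE TABLE brews (
    brew_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE brew_attributes (
    brew_id INT UNSIGNED NOT NULL,
    position TINYINT UNSIGNED NOT NULL,   -- batch slot 1..8
    attribute_value INT NOT NULL,         -- the user-visible identifier, default or overridden
    PRIMARY KEY (brew_id, position),
    KEY idx_value (attribute_value),
    CONSTRAINT fk_brew_attributes_brews FOREIGN KEY (brew_id) REFERENCES brews (brew_id)
);

-- "Which brews used identifier 4711 in any slot?" becomes a simple indexed query:
SELECT brew_id FROM brew_attributes WHERE attribute_value = 4711;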
Another benefit is that you can add attributes easily.
Storing JSON is OK, as #max pointed out; I'm just proposing the normalized database way of doing it.

Save data as array or in multiple rows? Which way is more efficient?

Currently I have a table in my MySQL db with lots of columns:
venue_id
venue_name
venue_location
venue_geolocation
venue_type
venue_url
venue_manager
venue_phone
venue_logo
venue_company
venue_zip
venue_vat
venue_visible
Would it be more efficient to store most of the data as one array in a single column, say venue_data? That would leave only 3 columns: venue_id, venue_data, venue_visible. Then in my application I could explode that array. Would it save time and server load?
Storing the values as an array (concatenating different values into a string?) is definitely a bad idea because:
You will lose readability,
you won't be able to easily search on concatenated columns,
you cannot index these columns properly.
Furthermore, keeping the many separate columns does not hurt performance - see also Is there a performance decrease if there are too many columns in a table?
If you are unhappy with the many columns, you should consider normalizing (DB Normalization) your db schema.
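If, for example, the company-related fields describe a separate entity, a normalized layout could look roughly like this (the split chosen here is only an assumption based on the column names in the question):

CREATE TABLE companies (
    company_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_name VARCHAR(100) NOT NULL,
    vat VARCHAR(30)
);

CREATE TABLE venues (
    venue_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_id INT UNSIGNED,
    venue_name VARCHAR(100) NOT NULL,
    venue_location VARCHAR(255),
    venue_zip VARCHAR(16),
    venue_visible TINYINT(1) NOT NULL DEFAULT 1,
    CONSTRAINT fk_venues_companies FOREIGN KEY (company_id) REFERENCES companies (company_id)
);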
You must ask yourself whether the amount of time and space you 'might' save is worth the cost.
Consider:
Combining the columns into one will still take up roughly as much space as keeping them separate
More space could potentially be saved by using appropriately sized data types
Disk space is cheap
Having distinct columns gives you the power to query any of those columns
Distinct columns also allows you to easily add or remove columns at a later date without having to re-construct every row's combined column
With distinct columns you can use $result->fetch_assoc() to immediately get your result row as an array, instead of spending processing time parsing a complex string
Parsing such a string may be prone to errors that selecting specific columns is not
You can add foreign key constraints and indexes on individual columns which would not work if you combined them
You can easily search on distinct columns, but not if you combine them
I can think of plenty more reasons why distinct columns are a better choice than trying to optimize code in a way that likely will not even save you any time. The query may be a few milliseconds faster, but you lose that time again parsing the string.
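As a concrete example (the venues table name and the serialized format are assumptions; the column names come from the question): a lookup on a distinct, indexed column stays a plain query, while a combined column forces a pattern match that cannot use an index:

-- Distinct column: can use an index on venue_zip
SELECT venue_id, venue_name FROM venues WHERE venue_zip = '90210';

-- Combined column (assuming a JSON-like serialization): every row has to be read and parsed
SELECT venue_id FROM venues WHERE venue_data LIKE '%"zip":"90210"%';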

Formula to calculate MySql single row size (engine MyISAM)

I have a situation where I have to create tables dynamically. Depending on some criteria I am going to vary the size of the columns of a particular table.
For that purpose I need to calculate the size of one row.
e.g.
If I am going to create the following table
CREATE TABLE sample(id int, name varchar(30));
The formula should give me the size of a single row for the table above, considering all the overhead for storing a row in a MySQL table.
Is it possible, and is it feasible, to do so?
It depends on the storage engine you use and the row format chosen for that table, and also on your indexes. But it is not very useful information.
Edit:
I suggest going against normalization only when you know exactly what you're doing. A DBMS is designed to deal with large amounts of data. You probably don't need to serialize your structured data into a single field.
Keep in mind that your application layer then has to tokenize (or worse) the serialized field data to get the original meaning back, which certainly has more overhead than getting the data from the DB already in a structured form.
The only exception I can think of is a client-heavy architecture, where moving processing to the client side actually takes burden off the server, and you would serialize your data anyway for the sake of the transfer. In server-side code (like PHP) it is not good practice to save serialized-style data into the DB.
(Though using PHP's built-in serialization may be a good idea in some cases, your current project does not seem to benefit from it.)
VARCHAR is a variable-length data type: it has a maximum length, but the stored value can be shorter or even empty, so the calculation may not be exact. Have a look at the 'Avg_row_length' field in information_schema.tables.
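For an existing table you can read the observed averages directly (replace the schema and table names with your own):

SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database' AND TABLE_NAME = 'sample';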

How do I search part of a column?

I have a mysql table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be searchable by the Name column. But the name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName,
substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname,
substring_index(OwnersName,' ',2) as lastname
FROM table1
How do I use sphinx search to search by firstname and/or lastname? I would like to be able to search for 'Smith' in only the first name.
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), inserting new columns called OwnersFirstName and OwnersLastName along with update/insert triggers which extract the relevant information from OwnersName and populate the new columns appropriately.
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
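A minimal sketch of those triggers (assuming the stored format really is 'Last First Middle'; note the last-name expression takes only the first word, unlike the two-word split in the question):

ALTER TABLE table1
    ADD COLUMN OwnersLastName VARCHAR(100),
    ADD COLUMN OwnersFirstName VARCHAR(100),
    ADD INDEX idx_last (OwnersLastName),
    ADD INDEX idx_first (OwnersFirstName);

DELIMITER $$
CREATE TRIGGER trg_table1_bi BEFORE INSERT ON table1 FOR EACH ROW
BEGIN
    SET NEW.OwnersLastName  = SUBSTRING_INDEX(NEW.OwnersName, ' ', 1);
    SET NEW.OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(NEW.OwnersName, ' ', 2), ' ', -1);
END$$
CREATE TRIGGER trg_table1_bu BEFORE UPDATE ON table1 FOR EACH ROW
BEGIN
    SET NEW.OwnersLastName  = SUBSTRING_INDEX(NEW.OwnersName, ' ', 1);
    SET NEW.OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(NEW.OwnersName, ' ', 2), ' ', -1);
END$$
DELIMITER ;

-- Plus a one-time UPDATE with the same expressions to backfill the existing rows.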
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most of the problems people have with databases concern the speed of their queries. Wasting a bit of disk space to gain a large amount of performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month however so the first option is the better one, if allowed.
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: @firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query cannot use any kind of index to do the comparison, so it has to touch each record in the table.
The best approach would be not to store several values in the same field, but to put the name components in three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem because full names can contain prefixes, suffixes, middle names or no middle name, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.

Which is faster: Many rows or many columns?

In MySQL, is it generally faster/more efficient/scalable to return 100 rows with 3 columns, or 1 row with 100 columns?
In other words, when storing many key => value pairs related to a record, is it better to store each key => value pair in a separate row with the record_id as a key, or to have one row per record_id with a column for each key?
Also, assume also that keys will need to be added/removed fairly regularly, which I assume would affect the long term maintainability of the many column approach once the table gets sufficiently large.
Edit: to clarify, by "a regular basis" I mean the addition or removal of a key once a month or so.
You should never add or remove columns on a regular basis.
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
There are a lot of bad things about this model and I would not use it if there was any other alternative. If you don't know the majority (except a few user customizable fields) of data columns you need for your application, then you need to spend more time in design and figure it out.
If your keys are preset (known at design time), then yes, you should put each key into a separate column.
If they are not known in design time, then you have to return your data as a list of key-value pairs which you should later parse outside the RDBMS.
If you are storing key/value pairs, you should have a table with two columns, one for the key (make this the PK for the table) and one for the value (probably don't need this indexed at all). Remember, "The key, the whole key, and nothing but the key."
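Read in the context of the question, where each pair belongs to a record, that might look roughly like this (the record_id is added to tie the pairs to their record; all names are illustrative):

CREATE TABLE record_values (
    record_id  INT UNSIGNED NOT NULL,
    attr_key   VARCHAR(64) NOT NULL,
    attr_value VARCHAR(255),
    PRIMARY KEY (record_id, attr_key)
);

-- All pairs for one record:
SELECT attr_key, attr_value FROM record_values WHERE record_id = 42;

-- Adding or removing a "key" is just an INSERT or DELETE, never an ALTER TABLE.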
In the multi-column approach, you will find that your table grows without bound, because removing a column will nuke all its values and you won't want to do it. I speak from experience here, having worked on a legacy system that had one table with almost 1000 columns, most of which were bit fields. Eventually, you stop being able to make the case to delete any of the columns because someone might be using it, and the last time you did it, you had to work till 2 am rolling back to backups.
First: determine how frequently your data needs to be accessed. If the data always needs to be retrieved in one shot and most of it is used, then consider storing all the key pairs as a serialized value or as an XML value. If you need to do any sort of complex analysis on that data and you need the value pairs, then columns are OK, but limit them to values that you know you will need for your queries. It's generally easier to design queries that use one column per parameter than one row per parameter, and you will also find it easier to work with the returned values if they are all in one row rather than spread across many.
Second: separate your most frequently accessed data and put it in its own table, and the other data in another. 100 columns is a lot, by the way, so I recommend that you split your data into smaller chunks that will be more manageable.
Lastly: if you have data that may change frequently, then you should create the column (key) in one table and then store the values against its numerical key. This assumes that you will be using the same key more than once; it should speed up your search when you do your lookup.