I have a dataset of images on which I'm running an autoencoder to encode each image into a vector of 32 floats. To store these float values, should I create 32 named columns or put them in a single BLOB of text and parse it when needed? What would be the performance trade-offs of the former vs. the latter?
Example of the data:
key:72
value:[1.8609547680625838e-8,2.9573993032272483e-8,0.9999995231628418,0.03153182193636894,
0.000003173188815708272,0.9999996423721313,0.8707512617111206,0.00005991563375573605,
0.9999498128890991,0.9999982118606567,0.947956383228302,0.9749470353126526,
0.9999994039535522,5.490094281412894e-7,0.9999681711196899,0.9958689212799072]
I would always be retrieving all the values for given image IDs.
Tables don't have performance. Queries have performance. Any decision you make about database storage for the sake of performance has to be made in the context of the types of queries you will run against the data.
If you will always query for the full array of values as a single entity, then use a blob.
If you will always query for a specific value in the Nth position in the array, then maybe a series of columns is good.
If you want to do aggregate queries like MIN(), MAX(), AVG() on the data using SQL, then make a second table with one float value per row.
You can't make this decision until you know the queries you will need to run.
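For that third case, a minimal sketch of the per-value table (the table and column names here are hypothetical):

    -- One row per vector component, keyed by image and position.
    CREATE TABLE image_vector_value (
        image_id INT NOT NULL,
        position TINYINT UNSIGNED NOT NULL,  -- 0..31
        value    FLOAT NOT NULL,
        PRIMARY KEY (image_id, position)
    );

    -- Aggregates then become plain SQL:
    SELECT MIN(value), MAX(value), AVG(value)
    FROM image_vector_value
    WHERE image_id = 72;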
Usually you would use a mapping table to record which values belong to which vector.
But since the array you provided is all one value, one vector, and since a mapping table would require adding 32 rows per vector, it may be best to just save it as TEXT/BLOB.
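If you go the TEXT/BLOB route, a minimal sketch (names are made up; the JSON-style text from the question works fine as the serialization format):

    -- One row per image; the whole vector stored as serialized text.
    CREATE TABLE image_vector (
        image_id INT PRIMARY KEY,
        vector   TEXT NOT NULL  -- e.g. '[1.86e-8, 2.96e-8, ...]'
    );

    -- Retrieval is a single indexed lookup; parsing happens in the application.
    SELECT vector FROM image_vector WHERE image_id = 72;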
I was just wondering about the efficiency of storing a large number of boolean values in a CHAR or VARCHAR:
data
"TFTFTTF"
vs
isFoo isBar isText
false true false
Would it be worth the worse performance to store the values this way? I figured it would be easier just to set a single value rather than having all of those separate fields.
Don't do it. MySQL offers types such as char(1) and tinyint that occupy the same space as a single character. In addition, MySQL offers enumerated types, if you want your flags to have more than one value -- and for the values to be recognizable.
That last point is critical. You want your code to make sense. The string 'FTF' does not make sense. The columns isFoo, isBar, and isText do make sense.
There is no need to obfuscate your data model.
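To illustrate, a sketch of the self-describing layout (BOOLEAN is MySQL shorthand for tinyint(1); the flag names come from the question, the rest is made up):

    CREATE TABLE flags_example (
        id     INT AUTO_INCREMENT PRIMARY KEY,
        isFoo  BOOLEAN NOT NULL DEFAULT FALSE,
        isBar  BOOLEAN NOT NULL DEFAULT FALSE,
        isText BOOLEAN NOT NULL DEFAULT FALSE,
        -- an ENUM when a "flag" has more than two recognizable values:
        status ENUM('draft', 'published', 'archived') NOT NULL DEFAULT 'draft'
    );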
This would be a bad idea: not only does it have no advantage in terms of space used, it also hurts query performance and the comprehensibility of your data model.
Disk Space
In terms of storage usage, it makes no real difference whether the data is stored in a single varchar(n) or char(n) column or in multiple tinyint, char(1) or bit(1) columns. Only with varchar would you need 1 to 2 bytes more disk space per entry (for the length prefix).
For more information about the storage requirements of the different data types, see the MySQL documentation.
Query Performance
If the boolean values were stored in a VARCHAR, searching for all entries where a specific value is true would take much longer, since string operations would be necessary to find the correct entries. Even when searching for a combination of boolean values such as "TFTFTFTFTT", the query would still take longer than if the values were stored in individual columns. Furthermore, you can put indexes on single columns like isFoo or isBar, which has a great positive effect on query performance.
Data Model
A data model should be as comprehensible as possible and if possible independent of any kind of implementation considerations.
Realistically, a database field should only contain one atomic value, that is to say: a value that can't be subdivided into separate parts.
Columns that do not contain atomic values:
cannot be sorted
cannot be grouped
cannot be indexed
So let's say you want to find all rows where isFoo is true: you couldn't do it without string operations like "take the first character of the string and see whether it equals 'T'". That implies a full table scan on every query, which degrades performance quite dramatically.
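The difference shows up directly in the queries. A sketch, assuming a hypothetical table t that stores the flags once as a packed string and once as separate columns:

    -- Packed string: string surgery, no usable index, full table scan.
    SELECT * FROM t WHERE SUBSTRING(flags, 1, 1) = 'T';

    -- Separate column: straightforward, and an index on isFoo can be used.
    SELECT * FROM t WHERE isFoo = TRUE;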
It depends on what you want to do after storing the data in this format.
After retrieving such a record you would have to do further processing on the application side, which hurts performance whenever you want to load data filtered by specific conditions, and the server-side logic would become complex.
The columns isFoo, isBar, and isText would help you to write queries better.
I have the following problem:
We have a lot of different, yet similar types of data items that we want to record in a (MariaDB) database. All data items have some common parameters such as id, username, status, file glob, type, comments, start & end time stamps. In addition there are many (let's say between 40 and 100) parameters that are specific to each type of data item.
We would prefer to have the different data item types in the same table because they will be displayed along with several other data, as they happen, in one single list in the web application. This will appear like an activity stream or "Facebook wall".
It seems that the normalised approach, with a top-level generic table joined to specific tables underneath, will lead to bad performance. We would have to do a lot of joins and unions to display the activity stream, and the application will poll with this query frequently, so it's important that the query runs fast.
So, which is the better solution in terms of performance and storage optimization?
to utilize MariaDB's dynamic columns
to just add in all the different kinds of columns we need in one table, and just accept that each data item type will only use a few of the columns, i.e. the rest will be null.
something else?
Does it matter if we use regular columns when a lot of the data in them will be null?
When should we use dynamic columns and when is it better to use regular columns?
I believe you should have separate columns for the values you filter by. However, you might have some values you never filter on; for those it might be a good idea to store them together in a single column as a JSON object (simple to encode/decode).
A few columns -- the main ones for use in WHERE and ORDER BY clauses (but not necessarily all the columns you might filter on).
A JSON column or MariaDB Dynamic columns.
See my blog on why not to use an EAV schema. I focus on how to do it in JSON, but MariaDB's Dynamic Columns is arguably better.
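A minimal sketch of that hybrid layout, with made-up names (recent MariaDB versions accept JSON as a column type):

    CREATE TABLE activity_item (
        id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        username  VARCHAR(64) NOT NULL,
        status    TINYINT UNSIGNED NOT NULL,
        item_type VARCHAR(32) NOT NULL,
        start_ts  DATETIME NOT NULL,
        end_ts    DATETIME NULL,
        extra     JSON,                 -- the 40-100 type-specific parameters
        KEY idx_stream (start_ts)       -- the polled stream sorts on this
    );

    -- The frequently polled activity-stream query stays simple and index-friendly:
    SELECT id, username, item_type, start_ts, extra
    FROM activity_item
    ORDER BY start_ts DESC
    LIMIT 50;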
Currently I have a MySQL table with lots of columns:
venue_id
venue_name
venue_location
venue_geolocation
venue_type
venue_url
venue_manager
venue_phone
venue_logo
venue_company
venue_zip
venue_vat
venue_visible
Would it be more efficient to store most of the data as one array in a single column, say venue_data? That would leave only three columns: venue_id, venue_data and venue_visible. Then in my application I could explode that array. Would it save time and server load?
Storing the values as an array (concatenating different values into a string) is definitely a bad idea, because:
you will lose readability,
you won't be able to easily search on the concatenated column,
you cannot index it properly.
Furthermore, having many columns has no real impact on performance -- see also Is there a performance decrease if there are too many columns in a table?
If you are unhappy with the many columns, you should consider normalizing (DB Normalization) your db schema.
You must ask yourself whether the amount of time and space you 'might' save is worth the cost.
Consider:
Combining the columns into one will still take a comparable amount of space as storing them all separately
More space could potentially be saved by using appropriately sized data types
Disk space is cheap
Having distinct columns gives you the power to query any of those columns
Distinct columns also allows you to easily add or remove columns at a later date without having to re-construct every row's combined column
With distinct columns you can use $result->fetch_assoc() to immediately get your result row as an array, instead of spending processing time parsing a complex string
Parsing such a string may be prone to errors that selecting specific columns is not
You can add foreign key constraints and indexes on individual columns which would not work if you combined them
You can easily search on distinct columns, but not if you combine them
I can think of plenty more reasons why distinct columns are a better choice than trying to optimize in a way that likely will not even save you any time. The query may be a few milliseconds faster, but you lose that time again parsing the string.
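For example (assuming a venue table with the columns from the question):

    -- Distinct columns: the database filters, an index on venue_zip can be used.
    SELECT venue_id, venue_name FROM venue WHERE venue_zip = '90210';

    -- With a combined venue_data string you would have to fetch every row
    -- and parse it in application code just to answer the same question.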
I currently have a table in MySQL that stores values normally, but I want to add a field that stores an array of values, such as cities. Should I simply store that array as a CSV string? Each row will need its own array, so I feel uneasy about making a new table and inserting 2-5 rows for each row inserted in the previous table.
I feel like this situation should have a name, I just can't think of it :)
Edit
Number of elements: 2-5 (a selection from a dynamic list of cities; the array references the list, which is a table).
This field would not need to be searchable, simply retrieved alongside other data.
The "right" way would be to have another table that holds each value but since you don't want to go that route a delimited list should work. Just make sure that you pick a delimiter that won't show up in the data. You can also store the data as XML depending on how you plan on interacting with the data this may be a better route.
I would go with the idea of a field containing your comma (or other logical delimiter) separated values. Just make sure that your field is going to be big enough to hold your maximum array size. Then when you pull the field out, it should be easy to perform an explode() on the long string using your delimiter, which will then immediately populate your array in the code.
Maybe the word you're looking for is "normalize". As in, move the array to a separate table, linked to the first by means of a key. This offers several advantages:
The array size can grow almost indefinitely
Efficient storage
Ability to search for values in the array without having to use "like"
Of course, the decision of whether to normalize this data depends on many factors that you haven't mentioned, like the number of elements, whether or not the number is fixed, whether the elements need to be searchable, etc.
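A sketch of the normalized version, given that the cities already live in their own table (names here are hypothetical):

    -- Junction table: one row per (row, city) pair.
    CREATE TABLE row_city (
        row_id  INT NOT NULL,
        city_id INT NOT NULL,
        PRIMARY KEY (row_id, city_id),
        FOREIGN KEY (city_id) REFERENCES city (id)
    );

    -- All cities for one row, retrieved alongside your other data:
    SELECT c.name
    FROM row_city rc
    JOIN city c ON c.id = rc.city_id
    WHERE rc.row_id = 42;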
Is your application PHP? It might be worth investigating the functions serialize and unserialize.
These two functions allow you to easily store an array in the database, then recreate that array at a later time.
As others have mentioned, another table is the proper way to go.
But if you really don't want to do that, then assuming you're using PHP with MySQL, why not use serialize() and store the serialized value?
I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter the rows which belong to a certain map.
I have lots of such maps and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored for multiple rows and store it only once internally, referencing it with an internal numeric id?
Or do I have to explicitly replace the map string with a numeric id, and use a separate table to pair map strings with ids, if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
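A sketch of the lookup-table variant (all names here are made up; assume the existing table is called places):

    CREATE TABLE map (
        id   SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(64) NOT NULL UNIQUE
    );

    -- The big table stores only the small numeric id:
    ALTER TABLE places
        ADD COLUMN map_id SMALLINT UNSIGNED,
        ADD FOREIGN KEY (map_id) REFERENCES map (id);

    -- Filtering by map name becomes a join on the indexed id:
    SELECT p.*
    FROM places p
    JOIN map m ON m.id = p.map_id
    WHERE m.name = 'walmart';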
I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN to some of your queries: you might want to weigh the size of your data (is it really that big?) against the (small?) performance loss you may incur because of that join.
Another solution would be an ENUM datatype, at least if you know in advance which strings you will have in your table, and there are only a few of those.
Finally, another solution might be to store an integer "code" corresponding to each string, and have those codes translated to strings by your application, entirely outside of the database (or use some table to store the correspondences, but have that table cached by your application instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance -- still, this may be the kind of micro-optimization that is not necessary in your case...
If you are using the same values over and over again, then there is a good functional reason to move it to a separate table, totally aside from disk space considerations: To avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName, "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is it a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.