Variant data type in DB - mysql

I'm looking for a way to have a variant column in my database (probably MySQL). I know this isn't possible natively, so what I need is a way to emulate that behavior.
I have a simple pair of tables like:
# task table
(
    id int ...,
    date timestamp,
    owner int
)

# info table
(
    id int ...,
    relative int,   # points to task
    name varchar,
    value VARIANT
)
Basically I need to associate a variable number of information fields with each task, and each information value could be of a different type (string, datetime, bool, or integer).
My plan was to create one column per type instead of a single VARIANT column and populate only the matching one, roughly like the sketch below. But that table will grow a lot (around 600 MB/month), and I think this would be a huge waste of space.
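Roughly what I had in mind (column names and sizes are just placeholders, nothing is settled yet):

CREATE TABLE info (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    relative       INT NOT NULL,            -- points to task.id
    name           VARCHAR(50) NOT NULL,
    value_string   VARCHAR(255) NULL,       -- only one of these four is populated per row
    value_datetime DATETIME     NULL,
    value_bool     TINYINT(1)   NULL,
    value_int      INT          NULL
);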
Does someone know a better way to accomplish that?
I don't know whether this makes things better or worse, but I'll be doing this in Django!

You are implementing something called an entity-attribute-value (EAV) model. You've described it pretty well, in case you don't know what it is.
In terms of the data structure, string columns occupy little space when their values are NULL, but the other types do take up space, so you will have some waste. You could store everything as a string -- numbers as digit strings, dates as YYYY-MM-DD -- and make do with a single string column. You do lose some of the flexibility of a native data type, though.
In general, EAV models are computationally expensive, and 600 MB per month is a respectable amount of data. Churning through gigabytes of data to bring records back together can be painful in MySQL (which performs poorly on GROUP BY). I generally recommend a hybrid EAV model, where a "regular" record stores the commonly used attributes and the EAV piece holds only the uncommon ones.
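A minimal sketch of that hybrid idea, reusing the asker's table names (the "promoted" columns are only illustrative examples of common attributes, not a recommendation of specific fields):

CREATE TABLE task (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    date     TIMESTAMP,
    owner    INT,
    due_date DATETIME NULL,   -- example of a frequently used attribute promoted to a real column
    priority INT NULL         -- another promoted example
);

CREATE TABLE info (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    relative INT NOT NULL,      -- FK to task.id
    name     VARCHAR(50) NOT NULL,
    value    VARCHAR(255),      -- everything stored as a string, as suggested above
    FOREIGN KEY (relative) REFERENCES task (id)
);

Only the rarely used, truly variable attributes end up as rows in info; everything queried all the time stays on task.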

Related

What is the most efficient data type to store a categorical string variable in MySQL?

I have a table with about 50k rows and multiple columns.
Some columns have the data type VARCHAR but store only a limited set of distinct values -- categorical strings.
I'm having some performance issues with this table, so I'm refactoring the data types. I did my research and found out that SET and ENUM are supposedly no better than VARCHAR, since there will be lookup-table overhead.
What should I do?
I guess by "categorical" you mean those columns have a "controlled vocabulary" – a limited set of possible values.
Here are some things you can do to make this table serve you more efficiently. You don't have to do them all; I list them in order of difficulty (difficulty for me, at any rate).
Put indexes on the column or columns you will use in WHERE clauses when querying. Doing this is very likely to solve your performance issues: 50k rows is not tiny, but it is small.
Good index choices are an art. Check out https://use-the-index-luke.com for an introduction. Or, ask another question here if you have performance problems with certain queries.
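For example (table and column names here are made up, not taken from the question), a plain secondary index, or a composite one matching a common WHERE clause:

ALTER TABLE my_table ADD INDEX idx_category (category);
ALTER TABLE my_table ADD INDEX idx_category_created (category, created_at);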
If possible, and if necessary, declare those columns with COLLATE latin1_bin. That makes them shorter and makes looking them up faster. This won't work if your categorical values are in Arabic or some other language that needs Unicode.
Make a new table. Maybe call it category, and give it an INT UNSIGNED column for category_id and a VARCHAR column for category_name. Then, in your main table, use INT UNSIGNED columns rather than VARCHAR columns: treat the new table as a lookup table, and the columns in your main table as numeric references to it (sketched below).
This approach is often used in large (megarow) tables to save RAM and disk space, and to formalize the "controlled vocabulary" of your categories. But I suspect it may be overkill for your app.
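A minimal sketch of that lookup-table arrangement (names are illustrative only):

CREATE TABLE category (
    category_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL UNIQUE
);

-- in the main table, replace the VARCHAR column with:
--   category_id INT UNSIGNED NOT NULL,
--   FOREIGN KEY (category_id) REFERENCES category (category_id)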
Your conclusions about SETs and ENUMs match my experience. Plus, adding values to ENUMs in a production database can be a shockingly expensive operation.

Multiple possible data types for the same attribute: null entries, EAV, or store as varchar?

I'm creating a database for combustion experiments. Each experiment has some scientific metadata which I call 'details'. For example ('Fuel', 'C2H6') or ('Pressure', 120). Because the same detail names (like 'Fuel') show up a lot, I created a table just to store the names and units. Here's a simplified version:
CREATE TABLE properties (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    units NVARCHAR(15) NOT NULL DEFAULT 'dimensionless'
);
I also created a table called 'details' which maps 'properties' to values.
CREATE TABLE details (
    id INT AUTO_INCREMENT PRIMARY KEY,
    property_id INT NOT NULL,
    value VARCHAR(30),
    FOREIGN KEY (property_id) REFERENCES properties (id)
);
This isn't ideal because the value attribute is sometimes a chemical name and sometimes a float. In the future, there may even be new entries that have integer values. Storing everything in a VARCHAR seems wasteful. Since it'll be hard to change later, I want to make the right decision now.
I've been researching this for hours and have considered four options:
(1) Store everything as varchar under value (simplest to develop).
(2) Use an EAV model (most complicated to develop).
(3) Create a column for each type (value_float, value_int, value_char) and have plenty of NULL entries.
(4) Use the JSON datatype.
Looking into each one, it seems like they're all bad in different ways. (1) is bad since it takes up extra space and I have to do extra operations to parse strings into numeric values. (2) is bad because of the huge increase in complexity (four extra tables and a lot more join operations), plus I hear EAV is to be avoided. (3) is a middle-ground in complexity, but there will be two NULL values for each table entry. (4) seems similar to (1), and I'm not sure how it might be better or worse.
I don't expect to have huge growth on this database or millions of entries. It just needs to be fast and searchable for researchers. I'm willing to have more backend complexity for a better/faster user experience.
By now I realize that there aren't many clear-cut answers in database design. I'm simply asking for some insight into my four options, or perhaps another option I haven't thought of.
EDIT: Added JSON as an option.
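For reference, here is roughly what option (4) could look like on MySQL 5.7+ (column names are just placeholders I made up), with a generated column pulled out of the JSON so one field stays indexable and searchable:

CREATE TABLE details_json (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    experiment_id INT NOT NULL,
    detail        JSON NOT NULL,
    fuel          VARCHAR(20) AS (JSON_UNQUOTE(JSON_EXTRACT(detail, '$.Fuel'))) STORED,
    INDEX idx_fuel (fuel)
);

INSERT INTO details_json (experiment_id, detail)
VALUES (1, '{"Fuel": "C2H6", "Pressure": 120}');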
Well, you have to sacrifice something: disk space, performance, the specific-vs-general dimension, or the easy-vs-complex-to-develop dimension. Choose a mix suited to your needs and situation. I solved this back in 2000 with a fairly general EAV-style design: the basic record held the common properties shared by the majority of events, an associative table joined to properties that carry no values, and the very specific properties/values were stored in a BLOB using XML-like tags. That way I combined the frequent properties with the very specific ones. Since that was intended as a VERY general solution, you probably don't need it; I'd sacrifice space, it's cheap today. Who cares if you take more space than is "correct" according to data-modeling theory? The data model will be ugly, so what? You'll still need to decide on the specific-vs-general dimension -- how specific attributes are handled: either as dedicated columns (yes, if they are repeated often) or in a Property/TypeOfProperty/Value kind of table.
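A rough sketch of that last "Property / TypeOfProperty / Value" idea, with one typed column per kind of value (names and types are assumptions, not the commenter's actual schema):

CREATE TABLE details_typed (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    property_id INT NOT NULL,
    value_type  ENUM('string', 'int', 'float') NOT NULL,
    value_char  VARCHAR(30) NULL,
    value_int   INT NULL,
    value_float DOUBLE NULL,
    FOREIGN KEY (property_id) REFERENCES properties (id)
);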

How much does performance change with a VARCHAR or INT column - MySQL

I have many tables with millions of rows in MySQL. These tables are used to store log lines.
I have a field "country" in VARCHAR(50). There is an index on this column.
Would it change performance much to store a countryId as an INT instead of this country field?
Thank you!
Your question is a bit more complicated than it first seems. The simple answer is that Country is a string up to 50 characters. Replacing it by a 4-byte integer should reduce the storage space required for the field. Less storage means less I/O overhead in processing the query and smaller indexes. There are outlier cases of course. If country typically has a NULL value, then the current storage might be more efficient than having an id.
It gets a little more complicated, though, when you think about keeping the field up-to-date. One difference with a reference table is that the countries are now standardized, rather than being ad-hoc names. In general, this is a good thing. On the other hand, countries do change over time, so you have to be prepared to add a "South Sudan" or "East Timor" now and then.
If your database is heavy on inserts/updates, then changing the country field requires looking in the reference table for the correct value -- and perhaps inserting a new record there.
My opinion is "gosh . . . it would have been a good idea to set the database up this way in the beginning". At this point, you need to weigh the effect on the application of maintaining a country reference table against the small performance gain of making the data structure more efficient and more accurate.
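A hedged sketch of the migration the answer describes (table and column names are assumed, since the question doesn't give them):

CREATE TABLE country (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);

-- seed the reference table from the values already in the log table
INSERT INTO country (name)
SELECT DISTINCT country FROM log_lines;

-- add the numeric column, backfill it, then index it
ALTER TABLE log_lines ADD COLUMN country_id INT UNSIGNED;
UPDATE log_lines l JOIN country c ON c.name = l.country SET l.country_id = c.id;
ALTER TABLE log_lines ADD INDEX idx_country_id (country_id);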
Indexes on INT values generally perform better than indexes on string data types (VARCHAR),
because comparing an integer is cheaper than comparing a string, and the search algorithms underneath an index rely on exactly those comparisons.
In your case you will likely get better performance with an index on an INT column than on a VARCHAR one.

Tradeoff between using a string or int for column value

I'm making a database table where one of the columns is type. This is the type of thing that's being stored into this row.
Since this software is open source, I have to consider other people using it. I can use an int, which would theoretically be smaller to save in the database as well as much faster on lookup, but then I would have to have some documentation and it would make things more confusing for my users. The other option is to use a string, which takes up much more space and is slower on lookup.
Assuming this table will handle thousands of rows per day, it can reach the point of being unscalable quickly if I select the wrong data type.
Is using int always preferred in this case, when there are many millions of rows potentially in the database?
You are correct: INT is faster and therefore the better choice.
If you are concerned about future developers, add comments to the column explaining each value. If there are a lot of values, consider using a lookup table, so you can ask for a string, get its numeric ID (a little bit like a constant), and then filter by that.
Like this
id | id_name
---|------------
1 | TYPE_ALPHA
2 | TYPE_BETA
3 | TYPE_DELTA
Now you have a literal explanation of the IDs. Just look up the ID (WHERE id_name = 'TYPE_ALPHA') and then use it to filter your table.
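In SQL terms, a minimal sketch (the main-table name and its type_id column are placeholders):

CREATE TABLE type_lookup (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    id_name VARCHAR(30) NOT NULL UNIQUE
);

SELECT t.*
FROM   my_table t
JOIN   type_lookup l ON l.id = t.type_id
WHERE  l.id_name = 'TYPE_ALPHA';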
Perhaps a happy medium between the two solutions, however, is to use the ENUM data type (see the MySQL documentation on ENUM).
If my understanding of ENUM is correct, it treats the field like a string during comparisons, but stores the actual data as enumerated integers. When you look for a string that isn't defined in the table schema, MySQL will simply throw an error; if it does exist, it uses the integer equivalent without ever showing it. This gives you both speed and readability.
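A tiny illustration of the ENUM version (table and value names are made up):

CREATE TABLE item (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    type ENUM('TYPE_ALPHA', 'TYPE_BETA', 'TYPE_DELTA') NOT NULL
);

-- reads like a string comparison, but is stored and compared as a small integer
SELECT id FROM item WHERE type = 'TYPE_BETA';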

Best way to store 'extra' user data in MySQL?

I'm adding a new feature to my user module for my CMS and I've hit a road block... Or I guess, a fork in the road, and I wanted to get some opinions from stackoverflow before I commit to anything.
Basically I want to allow admins to add new, 'extra' user fields that users can fill out on registration, edit in their profile, and/or be controlled by other modules. An example of this would be a birthday field, a lengthy description of themselves, or maybe points the user has earned on the site. Needless to say, the data stored will be varied and can range from large amounts of text, to a small integer value. To make matters worse - I want there to be the option to search this data.
With that out of the way - what would be the best way to do this? Right now I'm leaning towards having a table with the following columns.
userid, refFieldID, varchar, tinyint, smallint, int, text, date, datetime, etc.
I would prefer this as it would make searching significantly faster, and the reference table (which holds all of the field's data, such as the name of the field, whether it's searchable or not, etc.) can reference which column should be used when storing data for that field.
The other idea, which was suggested to me and which I've seen used in other solutions (vBulletin being one, although I have seen others whose names escape me at the moment), is to just have the userid, a reference id, and a MEDIUMTEXT field. I don't know enough about MySQL to say this with any certainty, but this method seems like it would be slower to search, and possibly have a larger overhead.
So which method would be 'best'? Is there another method I'm missing? Whichever method I end up using, it needs to be fast to search, not massive (A tiny bit of overhead is fine), and preferably allow complex queries used against the data.
I agree that a key-value table is probably the best solution. My first inclination would be to just store a text column, like vBulletin did. But, if you wanted to add the ability for the data store to be a bit more extensible and searchable like you've laid out, I might suggest:
1 medium/longtext or medium/longblob field for arbitrary text/binary storage (whatever is stored + overhead of 3-4 bytes for string length). The only reason to choose medium over long is to limit what can be stored to 2^24 bytes (16.7 MB) versus 2^32 bytes (4 GB).
1 integer (4 bytes) or bigint (8 bytes)
1 datetime (8 bytes)
Perhaps 1 float or double (4-8 bytes) for floating point storage
These fields will allow you to store nearly any type of data in the table without inflating the width of the table** (as a varchar would) and without redundant storage (like having tinyint and mediumint and int columns side by side). The text stored in the longtext field can still be reasonably searched using a fulltext index or a regular limited-length index (e.g. INDEX (longtext_storage(8))).
** all blob values, such as longtext, are stored independently from the main table.
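A sketch of the layout described above (column names are guesses based on the answer, not a fixed spec):

CREATE TABLE user_extra (
    userid           INT UNSIGNED NOT NULL,
    ref_field_id     INT UNSIGNED NOT NULL,
    longtext_storage MEDIUMTEXT NULL,
    int_storage      BIGINT NULL,
    datetime_storage DATETIME NULL,
    double_storage   DOUBLE NULL,
    PRIMARY KEY (userid, ref_field_id),
    INDEX idx_text_prefix (longtext_storage(8))   -- the limited-length index mentioned above
);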
One technique that might work for you is to store this arbitrary data as text, in some notation like JSON, XML, or YAML. This decision depends on how you'll need to access the data: if you only look up each user's full chunk of user data, it could be ideal. If you need to run SQL queries on specific fields in the user data, you'll need to use a pure SQL or a hybrid approach.
Many of the newer, highly scalable "NoSQL" systems seem to favor JSON data (eg, MongoDB, CouchDB, and Project Voldemort). It's nice and terse, and you can create arbitrarily complex structures including maps (JSON objects) and lists (JSON arrays).
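If each user's extra data really is only ever read and written as one chunk, the simplest version of this is a single text column holding the serialized JSON (a sketch, with assumed names):

CREATE TABLE user_extra_blob (
    userid     INT UNSIGNED PRIMARY KEY,
    extra_json MEDIUMTEXT NOT NULL
);

INSERT INTO user_extra_blob (userid, extra_json)
VALUES (42, '{"birthday": "1990-05-01", "points": 1200}')
ON DUPLICATE KEY UPDATE extra_json = VALUES(extra_json);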