Tradeoff between using a string or int for column value - mysql

I'm making a database table where one of the columns is type. This is the type of thing that's being stored into this row.
Since this software is open source, I have to consider other people using it. I can use an int, which would theoretically be smaller to store and faster to look up, but then I would need to document what each value means, which makes things more confusing for my users. The other option is to use a string, which takes up more space and is slower to look up.
Assuming this table will handle thousands of rows per day, it could quickly become unscalable if I pick the wrong data type.
Is using int always preferred in this case, when there are many millions of rows potentially in the database?

You are correct, INT is faster and therefore the better choice.
If you are concerned about future developers, add a comment to the column explaining each value. If there are a lot of values, consider using a lookup table, so you can look up a string, get its numeric ID (a little bit like a constant) and then filter by that.
Like this
id | id_name
---|------------
1 | TYPE_ALPHA
2 | TYPE_BETA
3 | TYPE_DELTA
Now you have a literal explanation of the IDs. Just collect the ID (WHERE id_name = 'TYPE_ALPHA') and then use that to filter your table.
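A minimal sketch of that lookup-table approach (the table and column names here are only placeholders):
-- Hypothetical lookup table for the type values
CREATE TABLE item_type (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id_name VARCHAR(32)  NOT NULL UNIQUE
);
INSERT INTO item_type (id_name) VALUES ('TYPE_ALPHA'), ('TYPE_BETA'), ('TYPE_DELTA');
-- The main table stores only the small integer ID
CREATE TABLE item (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (type_id) REFERENCES item_type (id)
);
-- Filter by the readable name; the join resolves it to the integer ID
SELECT i.*
FROM item i
JOIN item_type t ON t.id = i.type_id
WHERE t.id_name = 'TYPE_ALPHA';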
Perhaps a happy medium between the two solutions, however, is to use the ENUM data type. Documentation here.
If my understanding of ENUM is correct, it behaves like a string in comparisons but stores the actual data as enumerated integers. If you try to store a value that isn't defined in the table schema, MySQL (in strict mode) will simply throw an error; if it is defined, it stores the integer equivalent without you ever seeing it. This gives you both speed and readability.
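If my reading of the MySQL docs is right, a column declared like this (values invented for illustration) is written and compared as a string but stored as a one- or two-byte integer:
CREATE TABLE item (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type ENUM('TYPE_ALPHA', 'TYPE_BETA', 'TYPE_DELTA') NOT NULL
);
INSERT INTO item (type) VALUES ('TYPE_BETA');     -- accepted
-- INSERT INTO item (type) VALUES ('TYPE_GAMMA'); -- rejected in strict SQL mode
SELECT * FROM item WHERE type = 'TYPE_ALPHA';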

Related

Single Column vs Multi Column Design (for Non Primary Key columns)

In database table design, which of the following is better design for event-log type of data growth
Design 1) Numeric columns(Long) and character columns (Varchar2) with
Index:
..(pkey)|..|..|StockNumber Long | StockDomain Varchar2 |...
.. |..|..|11111 | Finance
.. |..|..|23458 | Medical
Design 2) Character column Varchar2 with Index:
..(pkey)|..|..|StockDetails Varchar2(1000) |..|..
.. |..|..|11111;Finance |..|..
.. |..|..|23458;Medical |..|..
Design advantages: the first design is very specific, while the second design is more general and can accommodate more varied data. In both cases, the columns are indexed.
Storage: the first design's indexes require less storage than the second's.
Performance: the same?
My question is about performance vs. flexibility. Obviously the first design is better, but the second is more general purpose. Let me know your insights.
Note: Edited the question for more clarity.
In general, having discrete columns is the better way to go for a few reasons:
Datatypes - You have guarantees that the data you have saved is in the right format, at least as far as non-string columns go: your StockNumber will always be a number if it's a bigint/long, and trying to set it to anything else will cause your insert/update to error. As part of a packed, semicolon-separated string there is always a chance of bad data creeping in.
Querying - Querying a single packed column has to be done using LIKE, since you are looking for a substring of that column's string. If I look for WHERE StockDetails LIKE '%11111%' I will find the first line, but I may also match another line where a different field inside that column happens to contain a dollar value of $11111. With discrete columns your query would be WHERE StockNumber = 11111, guaranteeing it finds the data only in that column.
Using the data - Once you have found the row you want, you still have to read the data, which means parsing the packed string back out into separate fields. If one of those fields contains a semicolon and it is improperly escaped, the rest of the data will be parsed wrong. You also need the values in a guaranteed, fixed order, leaving blank sections (;;) wherever a column would have held a NULL value.
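To make the comparison concrete, here is a rough sketch of the first design with invented names, showing how the types and the exact-match query fall out of discrete columns:
CREATE TABLE stock_event (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stock_number BIGINT       NOT NULL,
    stock_domain VARCHAR(100) NOT NULL,
    INDEX idx_stock_number (stock_number),
    INDEX idx_stock_domain (stock_domain)
);
-- Exact match on one column; no LIKE or parsing needed
SELECT * FROM stock_event WHERE stock_number = 11111;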
There is a middle ground between storing packed strings and separate columns: I have seen, and in fact am using on one major project, data stored in a table as JSON. With JSON you have property names, so you don't care what order the fields appear in within the string, because domain will always be domain. Any non-standard fields that an entry doesn't need (say, a property that only exists for the medical domain) can simply be omitted rather than left as a blank ;; placeholder. And JSON parsers exist in every language I can think of that you would connect to your database, so there is no need to hand-code something to parse the packed string. For example, your StockDetails given above would look like this:
+--------------------------------------+
| StockDetails |
+--------------------------------------+
| {"number":11111, "domain":"Finance"} |
| {"number":23458, "domain":"Medical"} |
+--------------------------------------+
This solves issues 2 and 3 above:
You now write your query as WHERE StockDetails LIKE '%"number":11111%'; including the JSON property name guarantees you don't match that value anywhere else in the string.
You don't need to worry about fields being out of order or missing from the string making your data unusable: JSON gives you key/value pairs, so all you need to do is handle nulls where a key doesn't exist. This also lets you add fields easily. Adding a new field to a packed string can break your parsing code, because the number of values will be off for existing data, so you would potentially need to update every row; with JSON you only store non-null fields, so a new field is simply treated like any other null value on existing data.
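If you are on MySQL 5.7 or later, you can also go one step further than LIKE and use the native JSON type with its extraction functions; a sketch with invented names:
CREATE TABLE stock_event (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stock_details JSON NOT NULL   -- documents are validated on insert
);
INSERT INTO stock_event (stock_details)
VALUES ('{"number": 11111, "domain": "Finance"}'),
       ('{"number": 23458, "domain": "Medical"}');
-- Pull out a single property instead of pattern-matching the whole string
SELECT * FROM stock_event
WHERE JSON_EXTRACT(stock_details, '$.number') = 11111;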
In relational database design, you need discrete columns. One value per column per row.
This is the only way to use data types and constraints to implement some data integrity. In your second design, how would you implement a UNIQUE constraint on either StockNumber or StockDomain? How would you make sure StockNumber is actually a number?
This is the only way to create indexes on each column individually, or create a compound index that puts the StockDomain first.
As an analogy, look in the telephone book: can you find all people whose first name is "Bill" easily or efficiently? No, you have to search the whole book to find people with a specific first name. The order of columns in an index matters.
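A sketch of how the first design lets the database enforce those rules (table and index names are placeholders):
CREATE TABLE stock (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stock_number BIGINT       NOT NULL,
    stock_domain VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_stock_number (stock_number),          -- impossible on a packed string
    KEY idx_domain_number (stock_domain, stock_number)  -- compound index, domain first
);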
The second design is practically not a database at all — it's a file.
To respond to your comments, I'm reiterating what I wrote in a comment:
Sometimes denormalization is worthwhile, but I can't tell [if your second design is worthwhile], because you haven't described how you will query this data. You must take into account your query needs before you can decide on any optimization.
Stated another way: denormalization, like all other optimizations, benefits one query type, at the expense of other query types. Therefore you need to know which queries you need to be optimal, and which queries are less important, so it won't hurt your overall performance if the other queries are degraded.
If you can't predict the queries, default to designing a database with rules of normalization. Normalization is not designed for performance optimization, it's designed to prevent data anomalies, which is a good goal too.
You have posted several new comments, I guess in the hopes that I will suddenly understand and endorse your second design. But you still haven't described any specific query that will be optimized by using your second design.

Comparing a Set with a Big Collection of Sets

How do I match a set against a big collection of sets stored in a database?
[The collection may have millions of sets.]
Detailed Statement
[Prerequisite] A cluster has a special property, which is a set of attributes.
I will receive an entity that has a set of attributes.
If an existing cluster has exactly the same attribute set (neither more nor less), I will add the entity to that cluster. Otherwise I will create a new cluster whose property is the attribute set of the new entity.
The above is the clustering process.
The problem is how I should store the data so that the system runs smoothly on a very large dataset without performance issues.
What kind of database should I use for this, SQL or NoSQL?
Possible solutions I have thought of:
[MySQL] Store the attributes with the cluster in a table, so that clusterId to attributeId is an m:n relation [table cluster_attribute].
Whenever an entity comes in, we run:
SELECT clusterId, COUNT(1)
FROM cluster_attribute
WHERE attributeId IN (/* comma-separated IDs of attributes */)
GROUP BY clusterId;
But this will not be good, since we may get back a long list of clusterIds that satisfy the above query.
On the same table we could instead run a query like:
SELECT a.clusterId, COUNT(1) AS cnt
FROM cluster_attribute a
INNER JOIN cluster_attribute b ON a.clusterId = b.clusterId
WHERE b.attributeId IN (/* comma-separated IDs of attributes */)
GROUP BY a.clusterId
HAVING cnt = @sizeOfEntityAttributeSet;
This will scan a lot of rows, resulting in a slow query.
Alternatively, we store the attributes as a sorted concatenation joined by some character such as |, and index that column. This way the exact-match query is fast, but whenever I need to know which clusters contain a certain attribute (A1), the query will be slow, since I would need a regexp search in MySQL.
Items in a set are non-duplicate; that is, [a1,b1,c1] is valid while [a1,b1,a1,c1] is not.
There are millions of sets, each with hundreds of items.
Have 2 columns in the table for searching. One is the exact, complete list of the values, sorted. It's a long string, probably TEXT. The other is a hash of that string. I might suggest MD5, then chop it to 32 bits and put it into an INT UNSIGNED (or BINARY(4)). INDEX this column, but not UNIQUE.
Now, to check for existence, do likewise with the incoming 'set': build the string and compute the hash. Look up the hashed value in the table. It will give you only a few rows, possibly including some duds. Double-check against the long string:
WHERE hash = $hash
AND str = '$str'
The lookup will be quite fast. The prep work (building the sorted string and computing the hash) is not difficult, and is quite easy to code in, say, PHP.
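A rough sketch of that layout in MySQL, assuming the 32-bit hash is taken from the first 8 hex digits of the MD5 (all names are made up):
CREATE TABLE cluster (
    cluster_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    attr_str   TEXT         NOT NULL,  -- sorted, delimited attribute list, e.g. 'a1|b1|c1'
    attr_hash  INT UNSIGNED NOT NULL,  -- first 32 bits of MD5(attr_str)
    KEY idx_attr_hash (attr_hash)      -- indexed, but not UNIQUE: collisions are possible
);
-- The application builds the same sorted string for the incoming set, then:
SELECT cluster_id
FROM cluster
WHERE attr_hash = CONV(LEFT(MD5('a1|b1|c1'), 8), 16, 10)
  AND attr_str  = 'a1|b1|c1';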
Caveats:
This works only for an exact match of the set.
It scales quite well. If you have more than, say, a billion sets, then a 32-bit hash won't be big enough. (But BIGINT and a longer BINARY would work.)

Insert id number instead of using Enum type - How much queries become slower?

Sorry for my English (I'm posting from Italy); if possible, please answer in simple language (thanks).
I want you to know that I'm not a professional programmer: I'm just a programming enthusiast.
Here's the problem:
I'm going to create a table with four fields that contain repeated values, with 6, 20, 165 and about 500 possible values respectively (all varchar between 10 and 50 characters).
My first idea was to use four tables, each with only one field (the primary key), and four one-to-many relationships.
Later I read about the ENUM data type and looked for a way to use it (because I thought that storing all these varchars would take up many bytes).
I've read a lot of web pages on the advantages and disadvantages of the ENUM type, but I still have many doubts.
My problems with the ENUM type are:
1) I need the string to never be empty (so I have to avoid any insert error by working in strict mode);
2) I'm afraid of possible changes in the association between an ENUM value and the other data (for example, if an ENUM value disappears and I need to split its data among the others).
So I was scared off using ENUM and thought about a third solution:
Given that the ENUM type stores the numeric ID of a list entry instead of a varchar, I thought about adding a numeric ID column to each of the four tables and storing those four IDs in the data table instead of the string values.
And now the question:
Using this third approach I'll need a much more complex query than with ENUM to extract the string data.
So I don't know which is the best choice for saving bytes without losing time extracting the data.
Would you suggest a choice?
How can I calculate the row limit beyond which a query will become sluggish?
Thanks in advance to anyone who will help me.
The ENUM type is suitable for storing a small number of values that are not supposed to change.
Your scenario is pretty common and the standard practice is to create four reference tables each with an ID field and a string value field, and then store those IDs in your main table.
Defining foreign key constraints ensures data integrity, and with indexes your queries will be fast; do not worry about that at all.
Queries will not be complex, just needing four joins. It is also possible to create a view to avoid repeating the joins in different queries.
This solution is more flexible and easier to manage or port.
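A sketch of what that looks like for one of the four value sets (names are invented; repeat the pattern for the other three):
-- One of the four reference tables
CREATE TABLE color (
    color_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    color_name VARCHAR(50)       NOT NULL UNIQUE
);
-- The main table stores only the small IDs
CREATE TABLE product (
    product_id INT UNSIGNED      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    color_id   SMALLINT UNSIGNED NOT NULL,
    -- ...the other three ID columns and their foreign keys...
    FOREIGN KEY (color_id) REFERENCES color (color_id)
);
-- A view hides the joins from everyday queries
CREATE VIEW product_readable AS
SELECT p.product_id, c.color_name
FROM product p
JOIN color c ON c.color_id = p.color_id;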

Variant data type in DB

I'm looking for a way to have a variant column in my database (MySQL, probably). I know this is not possible, but what I need is a way to emulate this behavior.
I have a simple pair of tables like:
#task table
(
id int ...,
date timestamp,
owner int
)
#info table
(
id int ...,
relative int, #points to Task
name varchar,
value VARIANT
)
Basically I need to associate a variable number of information fields with each task, and each information.value would be of a distinct type (string, datetime, bool or integer).
I've planned to create four columns, one of each type, instead of a single VARIANT, and populate the correct one. But that table will grow a lot (600 MB/month), and I think this would be a huge waste of space.
Does someone know a better way to accomplish that?
I don't know whether this makes it better or worse, but I'll be doing this in Django!
You are implementing something called an entity-attribute-value (EAV) model. You've described it pretty well, in case you don't know what it is.
In terms of the data structure, string types occupy little space when they hold NULL values, but other types do take up space, so you will have some waste. You could store everything as a string: numbers as digit strings, dates as YYYY-MM-DD, and make do with a single value column. You do lose some of the flexibility of a native data type, though.
In general, EAV models are computationally expensive. 600 MB per month is a respectable amount of data. Churning through gigabytes of data to bring records back together can be painful in MySQL (which has relatively poor GROUP BY performance). I generally recommend a hybrid EAV model, where a "regular" record stores the commonly used attributes and the EAV piece is only there for the uncommon ones.
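A rough sketch of that hybrid layout, based on the task/info tables above; which attributes get promoted to real columns is of course an assumption:
-- Common, frequently queried attributes live as real columns on the task
CREATE TABLE task (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    owner    INT UNSIGNED NOT NULL,
    status   VARCHAR(20)  NOT NULL,
    due_date DATE         NULL
);
-- Only the uncommon, variable attributes go into the EAV table,
-- everything stored as a string (dates as 'YYYY-MM-DD', booleans as '0'/'1')
CREATE TABLE task_info (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    task_id INT UNSIGNED NOT NULL,
    name    VARCHAR(64)  NOT NULL,
    value   VARCHAR(255) NULL,
    KEY idx_task (task_id),
    FOREIGN KEY (task_id) REFERENCES task (id)
);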

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the fast search work when using a hash table?
Isn't it the same as using "SELECT * FROM ..." in MySQL? I've read a lot, and people say "SELECT *" searches from beginning to end, but a hash table does not. Why and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how inserting into and retrieving from a hash table would look in MySQL query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value LIKE 'S%'
Isn't that the same, and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key: the thing we use to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table, it will break any old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
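For example, in MySQL the ordinary InnoDB index is a B-tree, while the MEMORY engine supports true HASH indexes, so you can compare the two without implementing anything yourself (table names are made up):
-- InnoDB: the secondary index is a B-tree, good for both = and range lookups
CREATE TABLE people_btree (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    INDEX idx_name (name)
) ENGINE = InnoDB;
-- MEMORY engine: a HASH index, good for exact = lookups only
CREATE TABLE people_hash (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    INDEX idx_name (name) USING HASH
) ENGINE = MEMORY;
-- EXPLAIN shows whether the index is used for a given query
EXPLAIN SELECT * FROM people_btree WHERE name = 'Alex';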
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since you know that the row will always have the same id (because of the input you build the hash value from), you can always recreate this id with the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you need a deterministic strategy to apply each time two values end up with the same id. Check Wikipedia for more info on this strategy; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.