Is this table any good for MySQL? I wanted to make it flexible for future storage of this type of data. With this table structure you can't use a PRIMARY KEY, only an index ...
Should I change the format of the table to have headers like Primary Key, Width, Length, Space, Coupling ...?
ID_NUM  Param        Value
1       Width        5e-081
1       Length       12
1       Space        5e-084
1       Coupling     1.511
1       Metal Layer  M3-0
2       Width        5e-082
2       Length       1.38e-061
2       Space        5e-081
2       Coupling     1.5
2       Metal Layer  M310
No, this is a bad design for a relational database. This is an example of the Entity-Attribute-Value design. It's flexible, but it breaks most rules of what it means to be a relational database.
Before you descend into the EAV design as a solution for a flexible database, read this story: Bad CaRMa.
More specifically, some of the problems with EAV include:
You don't know what attributes exist for any given ID_NUM without querying for them.
You can't make any attribute mandatory, the equivalent of NOT NULL.
You can't use database constraints.
You can't use SQL data types; the value column must be a long VARCHAR.
Particularly in MySQL, each VARCHAR is stored on its own data page, so this is very wasteful.
Queries are also incredibly complex when you use the EAV design. Magento, an open-source ecommerce platform, uses EAV extensively, and many users say it's very slow and hard to query if you need custom reports.
To be relational, you should store each different attribute in its own column, with its own name and an appropriate datatype.
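For the data in the question, a conventional design might look like this (a minimal sketch; the table name and column types are assumptions based on the sample values):

CREATE TABLE parameters (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  width       DOUBLE NOT NULL,       -- mandatory, unlike in EAV
  length      DOUBLE NOT NULL,
  space       DOUBLE NOT NULL,
  coupling    DOUBLE NOT NULL,
  metal_layer VARCHAR(20) NOT NULL   -- the one genuinely string-typed attribute
);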
I have written more about EAV in my presentation Practical Object-Oriented Models in SQL and in my blog post EAV FAIL, and in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
What you suggest is called the EAV model (Entity-Attribute-Value).
It has several drawbacks, such as severe difficulties in enforcing referential integrity constraints. In addition, the queries you'll have to come up with will be a bit more complicated than with a normalized table, as in your second suggestion (a table with columns: Primary Key, Width, Length, Space, Coupling, etc.).
So, for a simple project, do not use EAV model.
If your plans are for a more complex project and you want maximum flexibility, do not use EAV either. You should look into 6NF (6th Normal Form), which is even harder to implement and certainly not an easy task in MySQL. But if you succeed, you'll get both benefits: flexibility and normalization to the highest level (some people describe EAV as "6NF done wrongly").
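To give a flavor of what 6NF looks like here (a hedged sketch, not a full treatment; all names are made up): each attribute gets its own narrow table keyed by the entity ID, so every table records a single fact, each value keeps a proper type, and a missing attribute is simply an absent row rather than a NULL.

CREATE TABLE component (
  id INT PRIMARY KEY
);

CREATE TABLE component_width (
  component_id INT PRIMARY KEY,   -- at most one width per component
  width DOUBLE NOT NULL,
  FOREIGN KEY (component_id) REFERENCES component(id)
);

CREATE TABLE component_metal_layer (
  component_id INT PRIMARY KEY,
  metal_layer VARCHAR(20) NOT NULL,
  FOREIGN KEY (component_id) REFERENCES component(id)
);

-- ... one similar table each for length, space and coupling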
In my experience, this idea of storing fields row-wise needs to be considered extremely carefully: although it seems to give many advantages, it makes many common tasks much more difficult.
On the positive side: It is easily extensible without changes to the structure of the database and in some ways abstracts the details of the data storage.
On the negative side: you need to look at all the everyday things that storing fields column-wise gives you automatically in the DBMS: simple inner/outer joins, single-statement inserts/updates, uniqueness, foreign keys and other DB-level constraint checking, and simple filtering and ordering of search results.
Consider, in your architecture, a query to return all items with MetalLayer = X and Width between y and z, with results sorted by Coupling, then by Length. This query is much harder for you to construct and much, much harder for the DBMS to execute than it would be using columns to store specific fields, as the sketch below shows.
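To illustrate (a sketch assuming the question's ID_NUM/Param/Value table is named params, and a columnar alternative named components): the EAV version needs one self-join per attribute involved, plus casts because everything is stored as a string.

-- EAV version: one self-join per attribute, with casts for the numeric comparisons
SELECT w.ID_NUM
FROM params AS w
JOIN params AS m ON m.ID_NUM = w.ID_NUM AND m.Param = 'Metal Layer'
JOIN params AS c ON c.ID_NUM = w.ID_NUM AND c.Param = 'Coupling'
JOIN params AS l ON l.ID_NUM = w.ID_NUM AND l.Param = 'Length'
WHERE w.Param = 'Width'
  AND m.Value = 'M3-0'
  AND CAST(w.Value AS DECIMAL(20, 12)) BETWEEN 4e-08 AND 6e-08
ORDER BY CAST(c.Value AS DECIMAL(20, 12)),
         CAST(l.Value AS DECIMAL(20, 12));

-- Columnar version: a single-table query
SELECT ID_NUM
FROM components
WHERE Metal_Layer = 'M3-0'
  AND Width BETWEEN 4e-08 AND 6e-08
ORDER BY Coupling, Length;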
In the balance the only time I have used a structure like the one you suggest was in a context where random unstructured additional data needed to be added on an ad-hoc basis. In my opinion this would be a last resort strategy if there was no way I could make a more traditional table structure work.
A few things to consider here:
First, there is no single primary key. This can be overcome by making the primary key consist of two columns (as in the second example from Carl T).
Second, the Param column is repeated; to normalize this, you should look at the example from MGA.
Third, the "Metal Layer" value is a string, not a float like the others.
So best to go for a table def like this:
create table yourTable(
ID int primary key,
ParamId int not null,
Width float,
Length float,
Space float,
Coupling float,
Metal_layer varchar(20),
Foreign key(ParamID) references Param(ID),
Value varchar(20)
)
create table Param(
ID int primary key,
Name varchar(20)
)
The main question you have to ask when creating a table, especially one meant for future use, is how the data will be retrieved and what purpose it serves. Personally, I always give the table a unique identifier, usually an ID.
Looking at your list, you do not seem to have anything that uniquely identifies the entries, so you will not be able to detect duplicate entries or uniquely retrieve a record.
If you want to keep this design, you could create a composite primary key composed of the name and the param value.
CREATE TABLE testtest (
ID INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
Name VARCHAR(100) NOT NULL,
value DOUBLE NOT NULL -- MySQL has no NUMBER type; use DOUBLE or DECIMAL
/*add other fields here*/
);
CREATE TABLE testtest (
name VARCHAR(100) NOT NULL,
value int NOT NULL,
/*add other fields here*/
primary key(name,value)
);
Those two CREATE TABLE examples express the two options mentioned above.
Running Mysql Server version: 5.7.27-0ubuntu0.18.04.1
I'm creating a site/app where a user "submission" can be one of:
Text Comments
picture/file upload
video/file upload (more or less technically same as #2, just with different mime type)
I'm having trouble deciding between the two designs (shortened for brevity)...
CREATE TABLE submissions
(
submissionID INT,
userID INT,
submissionComments TEXT,
fileDirectory VARCHAR(32), -- starting here, these are only used ~20% of the time
fileName VARCHAR(128),
fileMimeType VARCHAR(128),
fileSize INT,
originalFileName VARCHAR(64)
);
-OR-
CREATE TABLE submissions
(
submissionID INT,
userID INT,
submissionComments TEXT
);
CREATE TABLE submissionFiles
(
submissionFileID INT,
submissionID INT, -- FK to submissions table
fileDirectory VARCHAR(32),
fileName VARCHAR(128),
fileMimeType VARCHAR(128),
fileSize INT,
originalFileName VARCHAR(64)
);
I'm assuming text comments will prob be 70-80% of submissions.
So, the question becomes, is it better to use a single table and have a bunch of NULL values in fileDirectory/fileName/fileMimeType/fileSize/originalFileName?
Or, is it better to have a 1:1 relationship to support when files are uploaded. In that case, I'd be creating both a submissions and submissionFiles record. Obviously most queries would then require joining the two tables.
This essentially comes down to not having a good understanding of the impact of VARCHAR (and one INT) columns in tables where they are majority NULL. I'm probably pre-optimizing a bit here considering this is a brand-new site/app, but I'm trying to plan ahead.
Late addition, 2nd question (as I type this out): I see that TEXT is capable of handling 65,535 characters, or 64 KB. That seems like a lot for what a typical user would be submitting (probably less than 500 characters), and it would eat up storage pretty quickly. What would be the impacts of making submissionComments a VARCHAR(500) instead of TEXT? I'm assuming there are no negative trade-offs besides being able to store "less".
Thanks!
Edit: as madhur pointed out, there are similar questions/good answers about "design patterns". I'm more concerned about performance: does the presence of a large number of VARCHARs negatively impact data storage/retrieval (by messing up the way MySQL implements pages/extents/etc.)?
I have built schemas either way. At some level, it does not matter. But you may find that certain queries are faster one way (or the other way). The disk usage is about the same.
Your second option allows for (and hence implies) multiple 'files' per 'submission'. For such a "many:1" relationship, you must use 2 tables.
On the other hand, if there can be only one "file" per "submission", you don't need submissionFileID (which I assume was intended to be the PRIMARY KEY?). Instead, use PRIMARY KEY(submissionID) for that second table.
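A minimal sketch of that 1:1 variant, reusing the column list from the question (the NOT NULL and the assumption that submissionID is the PK of submissions are mine):

CREATE TABLE submissionFiles
(
submissionID INT NOT NULL, -- doubles as the FK to submissions
fileDirectory VARCHAR(32),
fileName VARCHAR(128),
fileMimeType VARCHAR(128),
fileSize INT,
originalFileName VARCHAR(64),
PRIMARY KEY (submissionID),
FOREIGN KEY (submissionID) REFERENCES submissions(submissionID)
);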
If you wish to discuss further, please provide the full CREATE TABLE, including NULL or NOT NULL, the PRIMARY KEY of each table, and any secondary indexes.
submissionComments into VARCHAR(500) instead of TEXT?
No storage difference.
No speed difference.
The former would truncate, giving a warning or error, at 500 characters; the latter would truncate at 65535 bytes. I would simply use TEXT.
Back to the main question. Your example has several columns that are either all NULL or all filled in. Hence, I would lean toward 2 tables.
I'm creating a database for combustion experiments. Each experiment has some scientific metadata which I call 'details'. For example ('Fuel', 'C2H6') or ('Pressure', 120). Because the same detail names (like 'Fuel') show up a lot, I created a table just to store the names and units. Here's a simplified version:
CREATE TABLE properties (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(50) NOT NULL,
units NVARCHAR(15) NOT NULL DEFAULT 'dimensionless'
);
I also created a table called 'details' which maps 'properties' to values.
CREATE TABLE details (
id INT AUTO_INCREMENT PRIMARY KEY,
property_id INT NOT NULL,
value VARCHAR(30),
FOREIGN KEY(property_id) REFERENCES properties(id)
);
This isn't ideal because the value attribute is sometimes a chemical name and sometimes a float. In the future, there may even be new entries that have integer values. Storing everything in a VARCHAR seems wasteful. Since it'll be hard to change later, I want to make the right decision now.
I've been researching this for hours and have considered four options:
Store everything as varchar under value (simplest to develop)
Use an EAV model (most complicated to develop).
Create a column for each type, and have plenty of NULL entries.
value_float, value_int, value_char
Use the JSON datatype.
Looking into each one, it seems like they're all bad in different ways. (1) is bad since it takes up extra space and I have to do extra operations to parse strings into numeric values. (2) is bad because of the huge increase in complexity (four extra tables and a lot more join operations), plus I hear EAV is to be avoided. (3) is a middle-ground in complexity, but there will be two NULL values for each table entry. (4) seems similar to (1), and I'm not sure how it might be better or worse.
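For reference, option (3) applied to the details table might look like this (a sketch; the convention that exactly one of the three value columns is non-NULL per row is an assumption about intent, and nothing in the schema itself enforces it):

CREATE TABLE details (
id INT AUTO_INCREMENT PRIMARY KEY,
property_id INT NOT NULL,
value_float DOUBLE,      -- used when the value is a float, e.g. 120.0
value_int INT,           -- used when the value is an integer
value_char VARCHAR(30),  -- used when the value is a name, e.g. 'C2H6'
FOREIGN KEY (property_id) REFERENCES properties(id)
);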
I don't expect to have huge growth on this database or millions of entries. It just needs to be fast and searchable for researchers. I'm willing to have more backend complexity for a better/faster user experience.
By now I realize that there aren't that many clear-cut answers in database design. I'm simply asking for some insight into my four options, or perhaps another option I haven't thought of.
EDIT: Added JSON as an option.
Well, you have to sacrifice something: either disk space, or performance, or the specific-vs-general dimension, or the easy-vs-complex-to-develop dimension. Choose a mix suitable for your needs and situation.
I solved this in 2000 with a general kind of EAV solution, this way: the basic record had the common properties shared by the majority of events, then joins to properties without values (an associative table), and the very specific properties/values I stored in a BLOB in XML-like tags. This way I combined frequent properties with very specific ones.
As that was intended as a VERY GENERAL solution, which you probably don't need, I'd sacrifice space; it's cheap today. Who cares if you take more space than is "correct according to data modeling theory"? OK, the data model will be ugly, so what?
You'll still need to decide on the specific-vs-general dimension, i.e. how specific attributes will be handled: either as dedicated columns (yes, if they are repeated often) or in a Property-TypeOfProperty-Value type of table.
I was listening to some people at work speak about a database column.
Essentially, we wanted to add a new column which serves as an FK to a lookup table. It's basically your preferred contact method (primary phone or primary e-mail). So main table 'User' has an FK to 'PreferredContactMethod'
Person # 1:
"Let's store the FK column as an unsigned tiny int, and make the PK lookup table simply have a tinyint PK and a text description/code"
Person # 2
"It's irreleant whether we store the datatype as unsigned tinyint, or char(x) in terms of space in MySQL, so why make you have to do joins to find out what the value is for the lookup column? Instead, make the FK a char(x), and make the PK char(x) on the actual table"
Person # 3
"No it's not a good idea to make a PK represented as characters. It's not efficient. Handling something like unsigned tinyint is better than text. And since there are only two values, why don't we just store it as a single column (not an FK) with a value of either 0, or 1. This way it's more efficient and you don't have to join anything."
So after listening to this all, I started wondering who is right. My suspicion is this is so trivial that it wouldn't matter in terms of performance, but i'm so curious now as to what the pros and cons are that I'd love someone's take on this.
Thank you for your time.
It's typical for questions like this to have no right or wrong answer. Or actually, either answer can be right, because it depends on how you're going to use the data.
A good case for storing an int/tinyint FK to the lookup table is when the values change regularly and you want to allow changes to happen in the lookup table without changing all the rows that reference it.
It's a good thing to store the PK as a string if you have a relatively small lookup table that doesn't change frequently and the strings are fairly short. If the strings are long, this could make the FK reference bulkier than necessary and use a lot of space.
The inefficiency of storing a PK as a string doesn't affect a lookup table very much. That table is pretty small. The cost of a string PK is mostly due to frequent random inserts. But this doesn't impact a lookup table.
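As a concrete sketch of the two alternatives being weighed (all table and column names are made up for illustration):

-- Person #1's design: tinyint surrogate key; reads need a join to get the code
CREATE TABLE PreferredContactMethod (
id TINYINT UNSIGNED PRIMARY KEY,
code VARCHAR(20) NOT NULL UNIQUE
);
CREATE TABLE UserA (
id INT PRIMARY KEY,
contact_method_id TINYINT UNSIGNED NOT NULL,
FOREIGN KEY (contact_method_id) REFERENCES PreferredContactMethod(id)
);

-- Person #2's design: the readable code itself is the key; no join needed to display it
CREATE TABLE ContactMethodCode (
code CHAR(5) PRIMARY KEY -- e.g. 'PHONE', 'EMAIL'
);
CREATE TABLE UserB (
id INT PRIMARY KEY,
contact_method CHAR(5) NOT NULL,
FOREIGN KEY (contact_method) REFERENCES ContactMethodCode(code)
);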
I have a MySQL database, and a particular table in that database will need to be self-referencing, in a one-to-many fashion. For scalability, I need to find the most efficient solution possible. The two ways most evident to me are:
1) Add a text field to the table, and store a serialized list of primary keys there
2) Keep a linker table, with each row being a one-to-one.
In case #1, I see the table growing very very wide (using a spatial analogy), but in case #2, I see the linker table growing to a very large number of rows, which would slow down lookups (by far the most common operation).
What's the most efficient manner in which to implement such a one-to-many relationship in MySQL? Or, perhaps, there is a much saner solution keeping the data all directly on the filesystem somehow, or else some other storage engine?
Just keep a table for the "many", with a key column for the primary table.
I guarantee you'll have lots of other, more important problems to solve before you run into efficiency or capacity constraints in a standard industrial-strength relational DBMS.
IMHO the most likely second option (with numerous alternative products) is to use an ISAM file.
If you need to do deep/recursive traversals into the data, a graph database like Neo4j (where I'm on the team) is a good choice. You'll find some information in the article Should you go Beyond Relational Databases? and in this post at High Scalability. For a use case that may be similar to yours, read this thread on MetaFilter. For information on language bindings and other things you may also find the Neo4j wiki and mailing list useful.
Not so much an answer but a few questions and a possible approach....
If you want to make the table self-referencing and only use one field ... there are some options. A calculated maskable 'join' field is one way to associate many rows with each other.
The best solution will probably consider the nature of the data and relationships?
What is the nature of the data and lookups? What sort of relationship are you trying to contain? Association? Related? Parent/Children?
My first comment would be that you'll get better responses if you can describe how the data will be used (frequency of adds/updates vs lookups, adds vs updates, etc) in addition to what you've already described. That being said, my first thought would be to just go with a generic representation of
CREATE TABLE IF NOT EXISTS one_table (
`one_id` INT UNSIGNED NOT NULL AUTO_INCREMENT
COMMENT 'The ID of the items in the one table',
-- ... other data
PRIMARY KEY (`one_id`) -- AUTO_INCREMENT columns must be keyed in MySQL
);
CREATE TABLE IF NOT EXISTS many_table (
`many_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
COMMENT 'The ID of the items in the many table',
`one_id` INT UNSIGNED NOT NULL
COMMENT 'The ID of the item in the one table that this many item belongs to',
-- ... other data
PRIMARY KEY (`many_id`)
);
Making sure, of course, to create an index on the one_id in both tables.
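Concretely (a sketch; in one_table the primary key already covers one_id, so only many_table needs the secondary index):

CREATE INDEX idx_many_table_one_id ON many_table (one_id);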
I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?
For example, I have a database that would have about 100 million rows, with columns like mobile number, name, and email. Mobile number and email would be unique, so could I have the mobile number or email as the primary key?
Will it affect my query performance when I search based on email or mobile number? Similarly, the primary key will be used as a foreign key in 5 to 6 tables or even more.
I am using MySQL database
Technically yes, but if a string makes sense as the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the performance slowdown you'll get by using a string on smaller tables will be minuscule compared to the headaches you can have from an integer that doesn't mean anything in relation to the data.
Another issue with using strings as a primary key is that, because the index is constantly kept in sequential order, when a new key is created that falls in the middle of the order, the index has to be resequenced... whereas if you use an auto-number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index, where the insertion occurs in the middle of the sequence, DO NOT cause the index to be rewritten. They do not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page, and the single page is reformatted to place the row in the right position. When the page is full, a page split happens, with half of the rows going to one page and half to the other. The pages are then relinked into the linked list of pages that comprise a table's data under the clustered index. At most, you will end up writing two database pages.
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an auto-incrementing key (or in some specialized cases, a GUID) as the PK, and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
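A sketch of that pattern, with illustrative names:

CREATE TABLE company (
id INT UNSIGNED NOT NULL AUTO_INCREMENT, -- surrogate key: stable, fast to join on
name VARCHAR(100) NOT NULL,              -- natural key: free to change
PRIMARY KEY (id),
UNIQUE KEY uq_company_name (name)        -- still prevents duplicate records
);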
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: GUIDs.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design, use the INT; unless you plan on replicating data, in which case use a GUID.
If this is an Access database or some tiny app, then who really cares? I think the reason most of us developers slap the old INT or GUID at the front is that projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry that it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers, and collation rules may be applied for comparison, so comparing strings is usually a more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.
Two reasons to use integers for PK columns:
We can set an identity on an integer field, which increments automatically.
When we create PKs, the DB creates an index (clustered or non-clustered) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every row written will cause index maintenance; if you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
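In MySQL, that is done with a prefix index (a sketch with made-up table/column names):

-- Index only the first 5 characters of the string column
CREATE INDEX idx_name_prefix ON people (name(5));

-- Or index the whole column
CREATE INDEX idx_name_full ON people (name);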
From a performance standpoint: yes, a string PK will slow down performance compared to an integer PK.
From a requirements standpoint (although this is not part of your question, I'd still like to mention it): when we are handling huge data across different tables, we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables, and most tables are related to others through some relation (the concept of a foreign key). Therefore we really cannot always choose a single integer as a primary key; rather, we go for a combination of 3, 4 or 5 attributes as the primary key for such tables. Those keys can then be used as foreign keys when we relate the records to some other table. This makes it possible to relate records across different tables when required.
Therefore, for optimal usage, we make a combination of 1 or 2 integers with 1 or 2 string attributes, but again, only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
sample_pk INT NOT NULL AUTO_INCREMENT,
sample_id VARCHAR(100) NOT NULL,
...
PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There can be a very big misunderstanding related to strings in databases. Almost everyone assumes that the database representation of a number is more compact than that of a string; they think numbers are represented in the database the same way they are in memory. But that is not true: in most cases, a number's representation is closer to a string-like representation than to anything else.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default, ASPNetUserIds are 128-character strings, and performance is just fine.
If the key HAS to be unique in the table, it should be the key. Here's why:
A primary string key means correct DB relationships, one string key (the primary) and one string index (the primary).
The other option is a typical int key, but if the string HAS to be unique, you'll still probably need to add an index because of non-stop queries to validate or check that it's unique.
So using an int identity key means incorrect DB relationships, one int key (primary), one int index (primary), probably a unique string index on top, and manually having to validate that the same string doesn't exist (something like a SQL CHECK, maybe).
For an int to give better performance than a string as the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb: don't denormalize a database until you NEED to.