Primary key: a string or number (id)? - mysql

I am aware of the benefits of using integers (amount of space, performance, indexes) as primary keys as opposed to strings.
Considering situation below...
I have a lookup table called ap_habitat (habitat values are also unique):
id  habitat
1   Forest 1
2   Forest 2
Referencing table (fauna):
Especie  habitat
X        1
Y        1
The referencing table is not very human readable (I know end users should not care about that, but it would be useful for me to see the NAME of the habitat directly in the fauna table).
To get a list of fauna and its habitat name I have to do a join:
SELECT fauna.habitat, fauna.especie, AP_h.habitat FROM fauna INNER JOIN ap_habitat AS AP_h ON AP_h.id = fauna.habitat;
I could create a view, but if I have to create a view for each table referencing a foreign key...
I just want to check what more experienced people would recommend.

Databases and, in general, computers are not designed to make your life more simple. They are designed to handle more data than a human mind can ever hope to remember in less time than it takes a human to blink. ;-)
Readability (especially in ideas conceived in the pre-Apple age) is not an issue at all.
On top of that: If you enjoy strange problems, data mapping impedance and spending endless nights writing workarounds for problems that using real-world names as primary keys gets you for free, then be our guest. But please, don't ask for our help. We already know all the problems that you'll run into and it will be very hard for us to restrain our spite.
So: Never, ever use anything but an ID (UUID or long sequence) for a primary key. There are no (good) reasons to do otherwise, and if you think you have found one, then you simply don't see the whole picture.
Yes, it makes a couple of things harder (like understanding what your data actually means). But as I said above, computers are meant to solve "lots of data" and "too slow" and nothing else.
Create a view or write a small helper application that can run your most important queries at the click of a button.
That said, I had some success with an application which runs a query and then displays a list of check boxes where I can pull in the foreign key relations to the data that the query returns (i.e. one checkbox per FK).
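The view suggested above can be sketched as follows. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, so that the example runs without a server), and the view name fauna_readable and the alias habitat_name are hypothetical. The tables and rows are the ones from the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ap_habitat (
    id      INTEGER PRIMARY KEY,
    habitat TEXT NOT NULL UNIQUE
);
CREATE TABLE fauna (
    especie TEXT PRIMARY KEY,
    habitat INTEGER NOT NULL REFERENCES ap_habitat (id)
);
INSERT INTO ap_habitat (id, habitat) VALUES (1, 'Forest 1'), (2, 'Forest 2');
INSERT INTO fauna (especie, habitat) VALUES ('X', 1), ('Y', 1);

-- Hypothetical view: hides the join so fauna can be read with habitat names.
CREATE VIEW fauna_readable AS
SELECT fauna.especie, ap_habitat.habitat AS habitat_name
FROM fauna
INNER JOIN ap_habitat ON ap_habitat.id = fauna.habitat;
""")

# Reading the view looks like reading a table that stores the name directly.
rows = conn.execute(
    "SELECT especie, habitat_name FROM fauna_readable ORDER BY especie"
).fetchall()
for especie, habitat_name in rows:
    print(especie, habitat_name)
```

The fauna table still stores only the integer key, so renaming a habitat remains a one-row update in ap_habitat.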

You ask about number or string as primary key. But based on your example if you use a string it wouldn't be a primary key at all, because you would no longer have a lookup table for it to be the primary key of. Perhaps you would still have the table for reasons not shown, like populating a drop down or storing extended descriptions beyond just the name.
Doing needless joins is not a good thing for performance. And having needless tables might be bad for storage size as well, depending on the length of the strings and the ratio of the sizes of the two tables.
You could also consider enumerated types, in which the data is stored as numbers (more or less) but the database translates them to and from strings automatically.


Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
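The logarithmic growth described above can be checked with a little arithmetic. This is a rough sketch, not a model of how MySQL actually lays out its pages: the function name btree_levels is hypothetical, and the fanout of 32 keys per page is an assumption picked only so that 1,000 rows need 2 reads, as in the example above. Real index pages usually hold far more keys, which makes the tree even shallower.

```python
def btree_levels(row_count: int, fanout: int) -> int:
    """Estimate how many levels (page reads) a lookup needs in a B-tree
    where every page points to `fanout` children. Integer math only,
    to avoid floating-point rounding in logarithms."""
    levels = 1
    capacity = fanout  # rows reachable with the current number of levels
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


# Assumed fanout of 32 (illustrative only).
for rows in (1_000, 1_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> {btree_levels(rows, 32)} reads")
```

Under this assumption the table grows a thousandfold at each step while the number of reads grows by only two.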
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
If you have the key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234;
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use them, be aware that GUIDs perform poorly if the table is bigger than RAM, because their randomness scatters inserts and lookups across the whole index and defeats caching.
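For the case where the client really must generate the identifier itself, a minimal sketch of the pattern might look like this. It uses Python's uuid module in place of .NET's System.Guid.NewGuid() and sqlite3 in place of MySQL (both assumptions made so the example runs standalone); the table entry and the functions submit_entry and find_entry are hypothetical. Storing the 16 raw bytes instead of the 36-character text form is a commonly used way to keep the key and its index smaller; in MySQL that would typically be a BINARY(16) column.

```python
import sqlite3
import uuid

# SQLite stand-in for MySQL (assumption). The GUID is the primary key,
# so the database builds an index on it automatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (guid BLOB PRIMARY KEY, payload TEXT NOT NULL)")


def submit_entry(payload: str) -> str:
    """Create an entry under a client-generated GUID and return its text form."""
    guid = uuid.uuid4()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entry (guid, payload) VALUES (?, ?)",
            (guid.bytes, payload),  # 16 raw bytes rather than 36 characters
        )
    return str(guid)


def find_entry(guid_text: str):
    """Look an entry up by the GUID the user kept; None if it does not exist."""
    row = conn.execute(
        "SELECT payload FROM entry WHERE guid = ?",
        (uuid.UUID(guid_text).bytes,),
    ).fetchone()
    return row[0] if row else None


ticket = submit_entry("high score: 9001")
print(ticket, "->", find_entry(ticket))
```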

Can I have one million tables in my database?

Would there be any advantages/disadvantages to having one million tables in my database?
I am trying to implement comments. So far, I can think of two ways to do this:
1. Have all comments from all posts in 1 table.
2. Have a separate table for each post and store all comments from that post in its own table.
Which one would be better?
Thanks
You're better off having one table for comments, with a field that identifies which post id each comment belongs to. It will be a lot easier to write queries to get comments for a given post id if you do this, as you won't first need to dynamically determine the name of the table you're looking in.
I can only speak for MySQL here (not sure how this works in PostgreSQL), but make sure you add an index on the post id field so the queries run quickly.
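A minimal sketch of that single-table layout, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL; the column names, the index name idx_comment_post_id and the helper comments_for are hypothetical. The query plan is printed only to show that the lookup by post id goes through the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE comment (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL,   -- which post the comment belongs to
    body    TEXT
);
-- The index that keeps "comments for post N" fast as the table grows.
CREATE INDEX idx_comment_post_id ON comment (post_id);
""")
conn.executemany("INSERT INTO post (id, body) VALUES (?, ?)",
                 [(1, "First post"), (2, "Second post")])
conn.executemany("INSERT INTO comment (post_id, body) VALUES (?, ?)",
                 [(1, "Nice"), (1, "Agreed"), (2, "Hmm")])


def comments_for(post_id: int) -> list:
    """All comments of one post: the table name never has to be computed."""
    rows = conn.execute(
        "SELECT body FROM comment WHERE post_id = ? ORDER BY id", (post_id,)
    )
    return [body for (body,) in rows]


plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT body FROM comment WHERE post_id = ?", (1,)
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the description

print(comments_for(1))
print(plan_text)
```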
You can have a million tables but this might not be ideal for a number of reasons[*]. Classical RDBMS are typically deployed & optimised for storing millions/billions of rows in hundreds/thousands of tables.
As for the problem you're trying to solve, as others state, use foreign keys to relate a pair of tables: posts & comments a la [MySQL syntax]:
create table post (id integer primary key, post text);
create table comment (id integer primary key, postid integer, comment text, key fk (postid));
(You can add constraints to enforce referential integrity between comments and posts and so avoid orphaned comments, but this requires a storage engine that supports foreign keys, such as InnoDB, to be effective.)
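What such a constraint buys you can be sketched like this. The example assumes SQLite (via Python's sqlite3) rather than MySQL, where enforcement has to be switched on with a PRAGMA; the helper add_comment is hypothetical, and the table and column names follow the two statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
# SQLite only enforces foreign keys when asked to; this mirrors the point
# that the storage engine must actually support the constraint.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, post TEXT);
CREATE TABLE comment (
    id      INTEGER PRIMARY KEY,
    postid  INTEGER NOT NULL REFERENCES post (id),
    comment TEXT
);
INSERT INTO post (id, post) VALUES (1, 'Hello');
""")


def add_comment(postid: int, text: str) -> bool:
    """Insert a comment; return False if it would be orphaned."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO comment (postid, comment) VALUES (?, ?)",
                (postid, text),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_comment(1, "belongs to post 1"))      # accepted
print(add_comment(999, "no such post exists"))  # rejected by the constraint
```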
The generation of primary key IDs is left to the reader, but something as simple as auto increment might give you a quick start [http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html].
Which is better?
Unless this is a homework assignment, storing this kind of material in a classic RDBMS might not fit with contemporary idioms. Keep the same schema in spirit and use something like Solr/Elasticsearch to store your material and benefit from the content indexing, since I trust that you'll want to avoid writing your own search engine. You can use something like Sphinx [http://sphinxsearch.com] to index MySQL in a similar manner.
[*] Without some unconventional structuring of your schema, the amount of metadata and pressure on the underlying filesystem will be problematic (for example some dated/legacy storage engines, like MyISAM on MySQL will create three files per table).
When working with relational databases, you have to understand (a little bit about) normalization. The third normal form (3NF) is easy to understand and works in almost any case. A short tutorial can be found here. Use Google if you need more/other/better examples.
One table per record is a red flag: you know you're missing something. It also means you need dynamic DDL, because you must create new tables when you have new records. This is also a security issue: the database user needs too many permissions and becomes a security risk.

Would this style of relating of tables work?

Would the following relationships between the tables work out?
There are over 4000 rows for Airline Data, 150k rows for RAW DATA and about 2000 rows for Airports.
I cannot create a primary key for RAW DATA because there are many repeated values.
http://i108.photobucket.com/albums/n32/lurker3345/ACCESSHELP-1.png
The relationships look fine. I assume many things -- for starters, that the data types match where they are linked. The diagram doesn't communicate much, and there could be many reasons why the schema shown is not optimal.
You certainly can create a PK for RAW DATA, and you had better because it is voluminous.
A common approach is to select multiple fields to serve as the key because together they form a unique value. This is called a compound key. It's helpful (even essential) because it naturally ensures the unique combination is not unintentionally duplicated. (In most situations you will want to make sure all key fields are set to not allow a zero-length or null entry.)
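A compound key can be sketched as follows. Since the diagram is not reproduced here, the table raw_data and every column name in it are hypothetical, and the example assumes SQLite (via Python's sqlite3) rather than Access; the point is only that the combination of key fields is what must be unique, not each field on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for Access (assumption)
conn.execute("""
CREATE TABLE raw_data (
    flight_date   TEXT    NOT NULL,  -- hypothetical key field
    airline_code  TEXT    NOT NULL,  -- hypothetical key field
    flight_number INTEGER NOT NULL,  -- hypothetical key field
    delay_minutes INTEGER,
    PRIMARY KEY (flight_date, airline_code, flight_number)
)
""")


def try_insert(row: tuple) -> bool:
    """Insert one row; return False when the key combination already exists."""
    try:
        with conn:
            conn.execute("INSERT INTO raw_data VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(("2013-05-01", "AA", 100, 12)))  # new combination
print(try_insert(("2013-05-01", "AA", 101, 0)))   # repeats date and airline only
print(try_insert(("2013-05-01", "AA", 100, 45)))  # exact duplicate of the key
```

Each field repeats freely; only the third insert is refused, because all three key fields match an existing row.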
There is a simpler approach that serves many situations. Maybe you don't need this kind of data integrity, or you aren't sure yet what would make up a compound key, or you just want to get a provisional PK in place. Merely add an autonumber field and declare that as PK.
Some developers take that easy approach and accomplish data validation outside the table...and some ignore data validation needs, which can result in a disaster.
Once you have the PK declared, making sure the table has indexes on critical fields (in addition to the PK) is important for efficiency.
Really, before all else, do yourself a favor and rename all tables and fields so there are no spaces. While at it, rethink every name and try for most descriptive and standardized name possible. Access is cruel when it comes to renaming things later on. Avoiding spaces is a practice that will help you greatly further down the road.

MySQL - Should I use ID columns?

I have a question about best practices and how the database engine works.
Suppose I create a table called Employee, with the following columns:
SS ID (Primary Key)
Name
Sex
Age
The thing is, I see a lot of databases in which every table has an additional column called ID, which is a sequential number. Should I put an ID field in my table here? I mean, it already has a primary key to be indexed. Will the database work faster with a sequential ID field? I don't see how it helps if I won't use it to link or search any table.
Does it help? If so, why, and what happens in the database?
thanks!
EDIT -----
This is just a silly example. Forget about the SS_ID, I know there are better ways of choosing a primary key. The main point is that some people I know just ask me to add a column named ID, even if I know we won't use it in any SQL query. They just think it helps the database's performance in some way, especially because some database tools like Microsoft Access always ask us if we want to add this new column.
This is wrong, right?
If SS means "Social Security", I'd strongly advise against using that as a PK. An auto-incremented identity is the way to go.
Using keys with business logic built in is a bad idea. Lots of people are sensitive about giving SS information. Your app could be losing part of its audience if it uses SS as a primary key. Laws like HIPAA can make it impossible for you to use.
The actual performance gain in having a sequential id is going to depend a lot on how you use the table.
If you're using some ORM framework, these generally work better with a sequential ID of an integral type [1], which is typically achieved with a sequential id column.
If you don't use an ORM framework, having a surrogate id key that you never use alongside a natural ss_id key that is effectively what you always use makes little sense.
If you're referencing employees from other database tables (foreign key), then it'll probably be more efficient to have an id column, as storing that integer is going to consume less space in the child tables than storing the ss_id (which I assume is a CHAR or VARCHAR) everywhere.
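The foreign-key case can be sketched like this. It is a minimal example under stated assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the child table timesheet and its columns are hypothetical, and the ss_id value is an obvious placeholder rather than real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key used for links
    ss_id TEXT NOT NULL UNIQUE,               -- natural key, still unique
    name  TEXT NOT NULL
);
CREATE TABLE timesheet (                      -- hypothetical child table
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    employee_id INTEGER NOT NULL REFERENCES employee (id),
    hours       REAL NOT NULL
);
""")

cur = conn.execute(
    "INSERT INTO employee (ss_id, name) VALUES (?, ?)",
    ("000-00-0001", "Ana"),  # placeholder value, not a real number
)
ana_id = cur.lastrowid
conn.execute(
    "INSERT INTO timesheet (employee_id, hours) VALUES (?, ?)", (ana_id, 7.5)
)

# The child table carries only the small integer, never the ss_id itself.
child_columns = [row[1] for row in conn.execute("PRAGMA table_info(timesheet)")]
total = conn.execute("""
    SELECT e.name, SUM(t.hours)
    FROM employee AS e
    JOIN timesheet AS t ON t.employee_id = e.id
    GROUP BY e.id
""").fetchone()

print(child_columns)
print(total)
```

The sensitive value lives in exactly one column of one table, while every relationship is expressed through the integer id.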
On the ss_id, assuming it's a social security number (looks like it would be), there might be legal & privacy concerns attached to it that you should care about - my answer assumes you do have valid reasons to have social security numbers in your database, and that you would be legally allowed to use & store them.
[1] This is usually explained by the fact that ORM frameworks rely on highly specialized cache mechanisms that are tailored for typical ORM use, which usually implies having a sequential id primary key and letting the application deal with the actual business identity. This is in fact related to considerations very similar to those of the "foreign key" case.
US Social Security numbers are not sufficiently identifying. And banks certainly do not use them in that way. Not everybody has one. Errors result in duplicates. Foreigners don't have them. They are far too fragile to use as a database PK.
Most importantly: they are reused after death.
Do some research: SSN as Primary Key
What's more important (obviously) is that you have a primary key, as long as the data you use for that primary key is uniquely identifying. In your example, SSNs are uniquely identifying, which is why banks use them, and they will work. The problem with this example is that your Employee ID is likely to be used as a foreign key in other tables, which means you're taking personal information (that is legally protected) and spraying it across your data model. You might do better using an auto-incremented field in this case.

MySQL indexes - how to boost performance?

I'm trying to improve performance of an existing MySQL database.
It's a database about restaurants, and there are two relevant tables.
The first is a table for all entities of the website. Every entity has a unique id, and an entity can be pretty much anything: a restaurant, a user and many other things. There are several entity types; for restaurants, the entity type is "object".
Let me also say that this structure is already in place, so I don't want to make big changes; I'm not going to remove the table of all the entities, for example. (The database itself has no data yet, but the PHP engine is already built around it, so it will be hard to make big changes to the structure.)
The second is a table only for objects. There are several types of objects in that database, but restaurants specifically are going to be searched for a lot, since that's the subject of the website. Restaurants have several fields: country, city, name, genre.
There can't be two restaurants with the same name in the same city and country. (There CAN be, for example, two restaurants with the same name in different cities of the same country, or in two cities that have the same name but are in different countries.) From this I guess I should make a unique three-column index on the country, city and name columns.
Also, the URL is built in the form www.domain.com/Country/City/Restaurant-Name, so the country-city-name combination should be fetched fast, and this type of query will happen a lot.
There will also be many other types of queries, such as:
searching for the name of a restaurant (using a LIKE query, because the name searched for can be part of the full name) in a certain city or in a certain country;
searching for all the restaurants of a certain genre in a certain country and city;
and pretty much every possible combination.
Probably the most used queries will be (a) searching for a restaurant name in a certain city and country (the same as the query used when a URL is typed, but using LIKE), (b) searching for restaurants of a certain type in a certain city and country, and lastly (c) searching for a restaurant name globally (in the whole database, without specifying the city and the country).
This table (the objects table) currently has a PRIMARY KEY that is the ID of the objects, and the ID is also used a lot. Would the best practice be the following?
Make a three-column UNIQUE index on country, city, name.
Make another (non-unique) index on the names (so a query of type (c), which I've described above, will execute fast).
Maybe make some kind of sub-table that contains only the restaurants out of the objects table, so that this sub-table will be queried. (This is less important, since if I decide to make a large change I'll probably separate the restaurants from the rest of the objects to begin with.)
I'd really appreciate any help, because I've been trying to decide this for a long time.
P.S. In the objects table some of the objects won't have any genre or any country or city, so those columns will stay NULL. I know that NULL values are allowed in a UNIQUE KEY, but will they have an impact on performance?
Thanks a lot to anyone who was willing to read this long question :)
You can think and plan as long as you want, but you won't know for certain what's best until you try, benchmark, and compare your options. That said, it certainly sounds like you're on the right track.
composite key
Your "country-city-name" composite key appears to be in the most useful order, since it's ordered from broadest to narrowest selection criteria. I'm sure you did this intentionally, as a composite key can only be used from left to right, starting with its leftmost column. Because name does not come first in that index, you'd need a separate key for just name, as you noted.
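The left-to-right rule can be observed with a small experiment. This is a sketch under assumptions: it uses SQLite (via Python's sqlite3) and its EXPLAIN QUERY PLAN output rather than MySQL's EXPLAIN, and the table name restaurant, the index names and the helper plan are hypothetical. The exact plan wording varies between SQLite versions, so only its first word and the index name are relied on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
CREATE TABLE restaurant (
    id      INTEGER PRIMARY KEY,
    country TEXT,
    city    TEXT,
    name    TEXT,
    genre   TEXT
);
CREATE UNIQUE INDEX uq_country_city_name ON restaurant (country, city, name);
""")


def plan(sql: str, params: tuple) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


url_query = ("SELECT genre FROM restaurant "
             "WHERE country = ? AND city = ? AND name = ?")
name_query = "SELECT genre FROM restaurant WHERE name = ?"

# All three columns given, leftmost first: the composite index is usable.
by_url = plan(url_query, ("France", "Paris", "Chez Nous"))

# Only the last column given: the composite index cannot be used.
by_name_before = plan(name_query, ("Chez Nous",))

# A separate single-column index covers the global name search.
conn.execute("CREATE INDEX idx_restaurant_name ON restaurant (name)")
by_name_after = plan(name_query, ("Chez Nous",))

print(by_url)
print(by_name_before)
print(by_name_after)
```

A "SEARCH" entry means an index lookup and a "SCAN" entry means every row is visited, which is the difference the separate name index is meant to remove.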
index values of NULL
According to imysql.cn, "allowing NULL values in the index really doesn't impact performance." That's simply stated as an aside without data or reference, so I don't know how/if they'd proven that.
splitting the table
If there's a lot of other data mixed in with the restaurant records, sure, it could slow things a bit. If you shard the table into identically-structured "restaurant" and "other" tables, you could still easily query their combined data if necessary with a simple UNION. Unless you have an idea of the data/slowdown to expect, I'd prefer to avoid sharding the table unless necessary, at least for the sake of simplicity/uniformity.
Are there any foreseeable queries that current indexing wouldn't account for, such as a city without the country? If so, be sure to index appropriately to cover all foreseeable cases. You didn't mention it, but I assume you'll also have an index on genre.
Ultimately, you need to generate lots of test data and try it out. (Determine how much data you could eventually expect, and generate at least triple that much test data to put the system through its paces.) From what you've described, the design sounds pretty good, but testing may reveal unexpected issues, places where you'd benefit from different indexing, etc. With any issue found, you'd have a specific goal to accomplish rather than simply pondering all what-if scenarios.