Most efficient, scalable mysql database design - mysql

I have an interesting question about database design.
I came up with the following design:
first table:
**Survivors:**
Survivor_Id | Name | Strength | Energy
second table:
**Skills:**
Skill_Id | Name
third table:
**Survivor_skills:**
Survivor_Id | Skill_Id | Level
The first table, Survivors, will hold many records and will keep growing over time.
The second table will hold just a few skills that survivors can learn (for example: recon (greater view range), sniper (better accuracy), ...). These skills aren't like strength or energy, which all survivors have.
The third table is the most interesting: it is where survivors and skills are joined together. Everything will work just fine, but I am worried about data duplication.
For example, the survivor with id 1 will have 5 skills, so the third table would look like this:
// survivor_id | skill_id | level
1 | 1 | 2
1 | 2 | 3
1 | 3 | 1
1 | 4 | 5
1 | 5 | 1
First record: the survivor with id 1 has the skill with id 1 at level 2
Second record ...
Is this the proper approach, or should I use something different?

Looks good to me. If you are worried about data duplication:
1) your server-side code should be geared toward not letting this happen
2) you could check whether the row already exists before inserting
3) you could use MySQL's REPLACE INTO, which replaces duplicate rows if a primary key or unique index is configured properly, and inserts new ones otherwise (http://dev.mysql.com/doc/refman/5.0/en/replace.html)
4) set a unique index on the columns where you want only unique rows, e.g. (Survivor_Id, Skill_Id)
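Points 3 and 4 can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere; SQLite also accepts REPLACE INTO), and the lower-case table and column names are illustrative.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); both accept REPLACE INTO.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE survivor_skills (
        survivor_id INTEGER NOT NULL,
        skill_id    INTEGER NOT NULL,
        level       INTEGER NOT NULL,
        PRIMARY KEY (survivor_id, skill_id)  -- one row per survivor/skill pair
    )
    """
)

conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 2)")

# A second plain INSERT for the same pair is rejected by the key.
try:
    conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# REPLACE INTO overwrites the existing row instead of failing.
conn.execute("REPLACE INTO survivor_skills VALUES (1, 1, 3)")
rows = conn.execute(
    "SELECT survivor_id, skill_id, level FROM survivor_skills"
).fetchall()
print(duplicate_rejected, rows)
```

With the key in place, the database itself guarantees at most one row per survivor and skill, so the application-side checks in points 1 and 2 become a convenience rather than the only safeguard.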

I concur with the others - this is the proper approach.
However, there is one aspect which hasn't been discussed: the order of columns in the composite key {Survivor_Id, Skill_Id}, which will be governed by the kinds of queries you need to run...
If you need to find skills of the given survivor, the order needs to be: {Survivor_Id, Skill_Id}.
If you need to find survivors with the given skill, the order needs to be: {Skill_Id, Survivor_Id}.
If you need both, you'll need both the key (and the implied index) on {Survivor_Id, Skill_Id} and an index on {Skill_Id, Survivor_Id} [1]. Since InnoDB tables are clustered, accessing Level through that secondary index requires a double lookup. To avoid that, consider using a covering index {Skill_Id, Survivor_Id, Level} instead.
[1] Or vice versa.
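A rough way to see the effect of the covering index is sketched below. It uses Python's sqlite3 module rather than MySQL/InnoDB (an assumption for runnability; the two planners differ in detail), and the index name idx_skill_survivor_level is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE survivor_skills (
        survivor_id INTEGER NOT NULL,
        skill_id    INTEGER NOT NULL,
        level       INTEGER NOT NULL,
        PRIMARY KEY (survivor_id, skill_id)   -- serves "skills of a survivor"
    );
    -- Serves "survivors with a skill"; level is included so the
    -- index alone can answer the query (a covering index).
    CREATE INDEX idx_skill_survivor_level
        ON survivor_skills (skill_id, survivor_id, level);
    INSERT INTO survivor_skills VALUES (1, 1, 2), (1, 2, 3), (2, 1, 4);
    """
)

def plan(sql):
    """Return the planner's description of how it would run the query."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

by_skill_sql = (
    "SELECT survivor_id, level FROM survivor_skills "
    "WHERE skill_id = 1 ORDER BY survivor_id"
)
by_skill_plan = plan(by_skill_sql)
by_skill_rows = conn.execute(by_skill_sql).fetchall()
print(by_skill_plan)
print(by_skill_rows)
```

In MySQL the equivalent check would be EXPLAIN on the same SELECT, looking for "Using index" in the Extra column.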

Related

Is it okay to have non sequential ids as primary keys for a table in your database?

I don't know enough about databases to find the right words to ask this question, so let me give an example to explain what I'm trying to do: Suppose I want the primary key for a table to be an ID I grab from an API, but the majority of those API requests result in 404 errors. As a result, my table would look like this:
I also don't know how to format a table-like structure on Stack Overflow, so this is going to be a rough visual:
API_ID_PK | name
------------------
1 | Billy
5 | Timmy
23 | Richard
54 | Jobert
104 | Broccoli
Is it okay for the IDs not to be sequential (that is, not separated by exactly 1)? Or should I do this:
ID PK | API_ID | NAME
----------------------------------------
1 | 1 | Billy
2 | 5 | Timmy
3 | 23 | Richard
4 | 54 | Jobert
5 | 104 | Broccoli
Would the second table be more efficient for indexing reasons? Or is the first table perfectly fine? Thanks!
No, non-consecutive IDs have no effect on efficiency. In fact, MySQL (and other databases) let you set a variable, auto_increment_increment, to have the ID increment by more than 1. This is commonly used in multi-master setups.
It's fine to have non-sequential IDs. I regularly use GUIDs for IDs when dealing with enterprise software where multiple businesses could share the same object, and they're never sequential.
The one thing to watch out for is duplicate values. What determines the ID value you're storing?
If you have a clustered index (SQL Server) on an ID column and insert IDs with random values (like GUIDs), this can have a negative effect, because the physical order of the clustered index corresponds to the logical order. This can lead to a lot of index reorganisations. See: Improving performance of cluster index GUID primary key.
However, ordered but non-consecutive values (values not separated by 1) are not a problem for clustered indexes.
For non-clustered indexes the order doesn't matter. It is okay to insert random values for primary keys as long as they are unique.
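As a small illustration of the first layout from the question, the sketch below stores the API's IDs directly as the primary key. It uses Python's sqlite3 module in place of MySQL (an assumption for runnability), and the table name people is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (api_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)

# Gaps between the keys are harmless; only uniqueness matters.
conn.executemany(
    "INSERT INTO people (api_id, name) VALUES (?, ?)",
    [(1, "Billy"), (5, "Timmy"), (23, "Richard"), (54, "Jobert"), (104, "Broccoli")],
)

name = conn.execute(
    "SELECT name FROM people WHERE api_id = ?", (23,)
).fetchone()[0]

# Reusing an existing ID is the one case the key rejects.
try:
    conn.execute("INSERT INTO people (api_id, name) VALUES (23, 'Someone else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(name, duplicate_rejected)
```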

Database design. How to approach this please?

I'm designing a database (MySQL) that will manage a fleet of vehicles.
The company has many garages across the city. At each garage, vehicles get serviced (an operation). An operation can be any of 3 types of service.
Table Vehicle, Table Garage, Table Operation, Table Operation Type 1, Table Operation Type 2, Table Operation Type 3.
Each Operation has the vehicle ID and the garage ID, but how do I link it to the other tables (the service tables) depending on which type of service the user chooses?
I would also like to add a billing table, but I'm lost as to how to design the relationship between these tables.
If I have understood it correctly, I would suggest something like this (first of all, you shouldn't have three operation tables):
Vehicles Table
- id
- garage_id
Garages Table
- id
Operations/Services Table
- id
- vehicle_id
- garage_id
- type
Customer Table
- id
- service_id
billings Table
- id
- customer_id
You need six tables:
vehicle: id, ...
garage: id, ...
operation: id, vehicle_id, garage_id, operation_type (which can be one of the three options/operations available, with the possibility to be extended)
customer: id, ...
billing: id, customer_id, total_amount
billingoperation: id, billing_id, operation_id, item_amount
You definitely should not create three tables for operations. If you wanted to introduce a new operation in the future, that would involve creating a new table in the database.
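The six tables listed above could be sketched roughly as follows. This uses Python's sqlite3 module as a stand-in for MySQL (an assumption for runnability); the operation_type values and the amounts are invented sample data, and the "..." columns from the list are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vehicle  (id INTEGER PRIMARY KEY);
    CREATE TABLE garage   (id INTEGER PRIMARY KEY);
    CREATE TABLE customer (id INTEGER PRIMARY KEY);
    CREATE TABLE operation (
        id             INTEGER PRIMARY KEY,
        vehicle_id     INTEGER NOT NULL REFERENCES vehicle(id),
        garage_id      INTEGER NOT NULL REFERENCES garage(id),
        operation_type TEXT    NOT NULL  -- a new type is a new value, not a new table
    );
    CREATE TABLE billing (
        id           INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(id),
        total_amount INTEGER NOT NULL
    );
    CREATE TABLE billingoperation (
        id           INTEGER PRIMARY KEY,
        billing_id   INTEGER NOT NULL REFERENCES billing(id),
        operation_id INTEGER NOT NULL REFERENCES operation(id),
        item_amount  INTEGER NOT NULL
    );

    -- Hypothetical sample data.
    INSERT INTO vehicle  VALUES (1);
    INSERT INTO garage   VALUES (1);
    INSERT INTO customer VALUES (1);
    INSERT INTO operation VALUES (1, 1, 1, 'oil_change'), (2, 1, 1, 'tire_rotation');
    INSERT INTO billing VALUES (1, 1, 200);
    INSERT INTO billingoperation VALUES (1, 1, 1, 120), (2, 1, 2, 80);
    """
)

# The line items of a bill should add up to its stored total.
items_total = conn.execute(
    "SELECT SUM(item_amount) FROM billingoperation WHERE billing_id = 1"
).fetchone()[0]
stored_total = conn.execute(
    "SELECT total_amount FROM billing WHERE id = 1"
).fetchone()[0]
print(items_total, stored_total)
```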
For the record, I disagree with everyone who is saying you shouldn't have multiple operation tables. I think that's perfectly fine, as long as it is done properly. In fact, I'm doing that with one of my products right now.
If I understand correctly, at the core of your question you're asking how to do table inheritance, because Op Type 1 and Op Type 2 (etc.) each IS AN Operation. The short answer is that you can't. The longer answer is that you can't...at least not without some helper logic.
I assume you have some sort of program that will pull data from the database, rather than you just writing sql commands by hand. Working under that assumption, let's use this as a subset of your database:
Garage
------
GarageId | GarageLocation | etc.
---------|----------------|------
1        | 123 Main St.   | XX
Operation
---------
OperationId | GarageId | TimeStarted | TimeEnded | OperationTypeDescId | OperationTypeId
------------|----------|-------------|-----------|---------------------|----------------
2           | 1        | noon        | NULL      | 2                   | 2
OperationTypeDesc
-----------------
OperationTypeDescId | Name  | Description
--------------------|-------|-------------------------
1                   | OpFoo | Do things with the stuff
2                   | OpBar | Do stuff with the things
OpFoo
-----
OpID | Thing1 | Thing2
-----|--------|-------
1    | 123    | abc
OpBar
-----
OpID | Stuff1 | Stuff2
-----|--------|-------
1    | 456    | def
2    | 789    | ghi
Using this setup, you have the following information:
- A garage has its information, plain and simple.
- An operation has a unique ID (OperationId), a garage where it was executed, an ID referencing the description of the operation, and the OperationType ID (more on this in a moment).
- A pre-populated table of operation types. Each type has a unique ID (OperationTypeDescId), the name of the operation, and a human-readable description of what that operation is.
- One table for each row in OperationTypeDesc. For convenience, the table name should be the same as the Name column.
Now we can begin to see where inheritance comes into play. In the operation table, the OperationTypeId references the OpId of the relevant table...the "relevant table" is determined by the OperationTypeDescId.
An example: Let's say we had the above data set. In this example we know that there is an operation happening in a garage at 123 Main St. We know it started at noon, and has not yet ended. We know the type of operation is "OpBar". Since we know we're doing an OpBar operation instead of an OpFoo operation, we can focus on only the OpBar-relevant attributes, namely Stuff1 and Stuff2. Since the Operation's OperationTypeId is 2, we know that Stuff1 is 789 and Stuff2 is ghi.
Now the tricky part. In your program, this is going to require Reflection. If you don't know what that is, it's the practice of getting a Type from the NAME of that type. In our example, we know what table to look at (OpBar) because of its name in the OperationTypeDesc table. Put another way, you don't automatically know what table to look in; reflection tells you that information.
Edit:
Csaba says "In the future if you would like to introduce a new operation that would involve creating a new table in the database". That is correct. You would also need to add a new row to the OperationTypeDesc table. Csaba implies this is a bad thing, and I disagree - with a few provisos. If you are going to be adding a new operation type frequently, then yes, he makes a very good point. You don't want to be creating new tables constantly. If, however, you know ahead of time what types of operations will be performed, and will very rarely add new types of operations, then I maintain this is the way to go. All of your info common to all operations goes in the Operation table, and all op-specific info goes into the relevant "sub-table".
There is one more very important note regarding this. Because of how this is designed, you, the human, must be aware of the design. Whenever you create a new operation type, it's not as simple as creating the new table. Specifically, you have to make sure that the new table name and the OperationTypeDesc "Name" entry are the same. Think of it as an extra constraint - an "INTEGER" column can only contain ints, otherwise the db won't allow the data. In the same manner, the "Name" column can only contain the name of an existing table. You the human must be aware of that constraint, because it cannot be (easily) automatically enforced.
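The helper logic described in this answer might look roughly like the sketch below. It uses Python's sqlite3 module in place of MySQL (an assumption for runnability) and a plain name lookup in place of language-level reflection; the SUB_TABLES whitelist and the load_details function are hypothetical names, added so that the "Name must be an existing table" constraint is at least checked in code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE OperationTypeDesc (
        OperationTypeDescId INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL UNIQUE,  -- must match a sub-table name
        Description TEXT
    );
    CREATE TABLE Operation (
        OperationId         INTEGER PRIMARY KEY,
        GarageId            INTEGER,
        TimeStarted         TEXT,
        TimeEnded           TEXT,
        OperationTypeDescId INTEGER NOT NULL
            REFERENCES OperationTypeDesc(OperationTypeDescId),
        OperationTypeId     INTEGER NOT NULL  -- OpID in the sub-table
    );
    CREATE TABLE OpFoo (OpID INTEGER PRIMARY KEY, Thing1 INTEGER, Thing2 TEXT);
    CREATE TABLE OpBar (OpID INTEGER PRIMARY KEY, Stuff1 INTEGER, Stuff2 TEXT);

    INSERT INTO OperationTypeDesc VALUES
        (1, 'OpFoo', 'Do things with the stuff'),
        (2, 'OpBar', 'Do stuff with the things');
    INSERT INTO Operation VALUES (2, 1, 'noon', NULL, 2, 2);
    INSERT INTO OpFoo VALUES (1, 123, 'abc');
    INSERT INTO OpBar VALUES (1, 456, 'def'), (2, 789, 'ghi');
    """
)

# Hypothetical whitelist: table names cannot be bound as SQL parameters,
# so only known sub-table names are ever placed into the query text.
SUB_TABLES = {"OpFoo", "OpBar"}

def load_details(operation_id):
    """Return (type name, type-specific attributes) for one operation."""
    row = conn.execute(
        """
        SELECT d.Name, o.OperationTypeId
        FROM Operation o
        JOIN OperationTypeDesc d
          ON d.OperationTypeDescId = o.OperationTypeDescId
        WHERE o.OperationId = ?
        """,
        (operation_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no operation with id {operation_id}")
    table = row["Name"]
    if table not in SUB_TABLES:
        raise ValueError(f"unknown operation type table: {table}")
    detail = conn.execute(
        f'SELECT * FROM "{table}" WHERE OpID = ?', (row["OperationTypeId"],)
    ).fetchone()
    return table, dict(detail)

print(load_details(2))
```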

MySQL full text search matching similar results

I'll try to explain my situation: I'm trying to create a search engine for products on my website, so when the user needs to find a product I need to show similar ones, here's an example.
User searches:
assassins creed OR assassinscreed OR aSsAssIn's CreeD, assuming there are no misspelled letters/numbers (those 3 queries should produce the same result)
Expected results:
Assassin's Creed AND Assassin's Creed: Unity AND Assassin's Creed: Special Edition
What I have tried so far:
- I have created a MySQL field for the search engine which contains a parsed name of the product (Assassin's Creed: Unity -> assassinscreedunity)
- I parse the search query
- I search using MySQL's INSTR()
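The three steps above can be sketched as follows. Python's sqlite3 module stands in for MySQL here (an assumption for runnability; SQLite also provides INSTR()), and the parse_name and search helpers and the products table are hypothetical names.

```python
import re
import sqlite3

def parse_name(text):
    """Lower-case the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " search_name TEXT NOT NULL)"  # the parsed name, stored once per product
)
titles = [
    "Assassin's Creed",
    "Assassin's Creed: Unity",
    "Assassin's Creed: Special Edition",
    "Monkey Island",
]
conn.executemany(
    "INSERT INTO products (name, search_name) VALUES (?, ?)",
    [(title, parse_name(title)) for title in titles],
)

def search(query):
    """Return product names whose parsed name contains the parsed query."""
    needle = parse_name(query)
    rows = conn.execute(
        "SELECT name FROM products WHERE INSTR(search_name, ?) > 0 ORDER BY id",
        (needle,),
    )
    return [row[0] for row in rows]

print(search("aSsAssIn's CreeD"))
```

A substring test like this cannot use an ordinary index, so every row is examined, which is where the concern about growing row counts comes from.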
My problem
I'm fine with using this, but I heard it can be slow as the number of rows increases. I've created a full-text index on my table, but I don't think it would help, so I need another solution.
Thanks for any answer, and ask me anything before downvoting.
First of all, you should track performance issues in your queries more precisely than 'heard it can be slow' and 'think it would help'. One starting point may be the Slow Query Log.
If you have a table which contains the same parsed name in more than one row, consider normalizing your database. In the specific case, store unique parsed names in one table, and only the id of the corresponding parsed name in the table you described in your question. This way, you only need to check the smaller table with unique names and can then quickly find all matching entries in the main table by id.
Example:
Consider the following table with your structure
id | product_name | rating
-----------------------------------
1 | assassinscreedunity | 5
2 | assassinscreedunity | 2
3 | monkeyisland | 3
4 | monkeyisland | 5
5 | assassinscreedunity | 4
6 | monkeyisland | 4
you would have to scan all six entries to find relevant rows.
In contrast, consider two tables like this:
id | p_id | rating
--------------------
1 | 1 | 5
2 | 1 | 2
3 | 2 | 3
4 | 2 | 5
5 | 1 | 4
6 | 2 | 4
id | name
--------------------------
1 | assassinscreedunity
2 | monkeyisland
In this case, you only have to scan two entries (compared to six) and can then efficiently look up relevant rows using the integer id.
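The two-table layout can be tried out with the sketch below. It uses Python's sqlite3 module in place of MySQL (an assumption for runnability), and the table names product_names and ratings are made up, since the answer does not name its tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each parsed name is stored exactly once.
    CREATE TABLE product_names (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- The main table refers to the name by integer id.
    CREATE TABLE ratings (
        id     INTEGER PRIMARY KEY,
        p_id   INTEGER NOT NULL REFERENCES product_names(id),
        rating INTEGER NOT NULL
    );
    CREATE INDEX idx_ratings_p_id ON ratings (p_id);

    INSERT INTO product_names VALUES (1, 'assassinscreedunity'), (2, 'monkeyisland');
    INSERT INTO ratings VALUES
        (1, 1, 5), (2, 1, 2), (3, 2, 3), (4, 2, 5), (5, 1, 4), (6, 2, 4);
    """
)

# The substring test only touches the small names table;
# the matching ratings are then found through the integer index.
matching_ratings = [
    row[0]
    for row in conn.execute(
        """
        SELECT r.rating
        FROM product_names n
        JOIN ratings r ON r.p_id = n.id
        WHERE INSTR(n.name, ?) > 0
        ORDER BY r.id
        """,
        ("assassinscreed",),
    )
]
print(matching_ratings)
```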
To further enhance the performance, you could extend the concept of a parsed name and use hashes. For example, you could calculate the SHA1 hash of your parsed name, which is a 160-bit value. You can find entries in your database for this value very efficiently. To match substrings, you can add them to the second table as well. Since the hash only needs to be computed once, you can still use the database to match by an integer. Another thing that might be of interest to you is fuzzy hashing.
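One possible reading of the hashing idea is sketched below, using Python's hashlib and sqlite3 modules in place of MySQL (an assumption for runnability). The name_hash and find_by_hash helpers and the name_sha1 column are hypothetical, and the hash is stored as a hex string for simplicity.

```python
import hashlib
import sqlite3

def name_hash(parsed_name):
    """SHA-1 of the parsed name as a 40-character hex string (160 bits)."""
    return hashlib.sha1(parsed_name.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_names ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " name_sha1 TEXT NOT NULL UNIQUE)"  # the unique index makes lookups cheap
)
for parsed in ("assassinscreedunity", "monkeyisland"):
    conn.execute(
        "INSERT INTO product_names (name, name_sha1) VALUES (?, ?)",
        (parsed, name_hash(parsed)),
    )

def find_by_hash(parsed_query):
    """Exact-match lookup through the hash index; returns the id or None."""
    row = conn.execute(
        "SELECT id FROM product_names WHERE name_sha1 = ?",
        (name_hash(parsed_query),),
    ).fetchone()
    return None if row is None else row[0]

print(find_by_hash("assassinscreedunity"), find_by_hash("assassinscreed"))
```

A hash only matches the exact string it was computed from, which is why the answer suggests storing the substrings you want to match as separate entries.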
In addition, you should read up on the Rabin–Karp algorithm or string searching in general.

Advantages of a lookup table with INTs over decimals in MySQL records?

Trying to summarize in as few words as possible:
I am trying to create a system that tracks the various products an individual can sell and the commission percentage they earn on that particular item. I am thinking about creating reference integers for each product called "levels", which will relate to their commission percentage in a new lookup table instead of a single reference point. Is this overkill, though, or are there any benefits over just placing the percentage inline for each record?
My gut tells me there are advantages to design 1 below, but the more I think about it, the less sure I am what they are. If I need to update all individuals selling product X with level Y, indexes and replaces make that easy and fast in both methods. By using design 2, I can dynamically change any "earn" to whatever percentage I can come up with (0.58988439) for a product, whereas I would have to create this "level" in design 1.
Note: the product does not relate to the earn directly (one sales rep can earn 50% for the same product another sales rep only earns 40% on).
Reference Examples:
Design 1 - two tables
table 1
ID | seller_id | product_id | level
-----------------------------------------------
1 | 11111 | 123A | 2
2 | 11111 | 15J1 | 6
3 | 22222 | 123A | 3
table 2
ID | level | earn
--------------------------
1 | 1 | .60
2 | 2 | .55
3 | 3 | .50
4 | 4 | .45
5 | 5 | .40
6 | 6 | .35
Design 2 - one table
ID | seller_id | product_id | earn
-----------------------------------------------
1 | 11111 | 123A | .55
2 | 11111 | 15J1 | .35
3 | 22222 | 123A | .45
(where earn is decimal based, commission percentage)
Update 1 - 7/9/13
It should also be noted that a rep's commission level can change at any given time. For this, we have planned on simply using status, start, and end dates with ranges for eligible commission levels / earn. For example, a rep may earn a Level 2 (or 55%) from Jan 1 to Feb 1. This would be noted in both designs above. Then when finding what level or percentage a rep was earning at any given time: select * from table where (... agent information) AND start <= :date AND (end > :date or END IS NULL)
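The date-range lookup from this update can be sketched as below, with Python's sqlite3 module standing in for MySQL (an assumption for runnability). The table name seller_levels and the column names start_date and end_date are hypothetical; the shorter names start and end were avoided because END is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE seller_levels (
        id         INTEGER PRIMARY KEY,
        seller_id  INTEGER NOT NULL,
        product_id TEXT    NOT NULL,
        level      INTEGER NOT NULL,
        start_date TEXT    NOT NULL,  -- ISO dates compare correctly as text
        end_date   TEXT               -- NULL means the range is still open
    );
    INSERT INTO seller_levels
        (seller_id, product_id, level, start_date, end_date)
    VALUES
        (11111, '123A', 2, '2013-01-01', '2013-02-01'),
        (11111, '123A', 3, '2013-02-01', NULL);
    """
)

def level_on(seller_id, product_id, date):
    """Return the level in effect on the given ISO date, or None."""
    row = conn.execute(
        """
        SELECT level FROM seller_levels
        WHERE seller_id = :seller AND product_id = :product
          AND start_date <= :date
          AND (end_date > :date OR end_date IS NULL)
        """,
        {"seller": seller_id, "product": product_id, "date": date},
    ).fetchone()
    return None if row is None else row[0]

print(level_on(11111, "123A", "2013-01-15"))
```

Because the end of a range is exclusive (end > :date), a range ending on Feb 1 and the next one starting on Feb 1 do not overlap.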
Does level mean anything to the business?
For instance, I could imagine a situation where the levels are the unit of management. Perhaps there is a rush for sales one quarter, and the rates for each level change. Or is there reporting by level? In these situations it would make sense to have a separate "level" table.
Another situation would be different levels for different prices of the product -- perhaps the more you sell it for, the higher the commission. Or the commissions could be based on thresholds, so someone who has sold enough this year suddenly gets a higher commission.
In other words, there could be lots of rules around commission that go beyond the raw percentage. In that case, a "rule" table would be a necessary part of the data model (and "levels" are a particular type of rule).
On the other hand, if you don't have any such rules and the commission is always based on the person and product, then storing the percentage in the table makes a lot of sense. It is simple and understandable. It also has good performance when accessing the percentage -- which presumably happens much more often than changing it.
First of all, using id values to reference a lookup table has nothing to do with normalization per se. Your design #2 shown above is just as normalized. Lots of people have this misunderstanding about normalization.
One advantage to using a lookup table (design #1) is that you can change what is earned by level 6 (for example), and by updating one row in the lookup table, you implicitly affect all rows that reference that level.
Whereas in design #2, you would have to update every row to apply the same change. Not only does this mean updating many rows (which has performance implications), but it also opens the possibility that your UPDATE might not match all the rows that need updating. So some rows may have the wrong value for what should be the same earning level.
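The difference can be seen in a small sketch of design #1, using Python's sqlite3 module in place of MySQL (an assumption for runnability). The table names levels and seller_products are made up, the lookup table is keyed directly by level for brevity, and the new rate of 0.30 is an invented value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE levels (
        level INTEGER PRIMARY KEY,
        earn  REAL NOT NULL
    );
    CREATE TABLE seller_products (
        id         INTEGER PRIMARY KEY,
        seller_id  INTEGER NOT NULL,
        product_id TEXT    NOT NULL,
        level      INTEGER NOT NULL REFERENCES levels(level)
    );
    INSERT INTO levels VALUES
        (1, 0.60), (2, 0.55), (3, 0.50), (4, 0.45), (5, 0.40), (6, 0.35);
    INSERT INTO seller_products VALUES
        (1, 11111, '123A', 2), (2, 11111, '15J1', 6), (3, 22222, '123A', 3);
    """
)

# Changing what level 6 earns touches exactly one row in the lookup table...
rows_updated = conn.execute(
    "UPDATE levels SET earn = 0.30 WHERE level = 6"
).rowcount

# ...and every seller/product row that references level 6 sees the new rate.
earn = conn.execute(
    """
    SELECT l.earn
    FROM seller_products sp
    JOIN levels l ON l.level = sp.level
    WHERE sp.seller_id = 11111 AND sp.product_id = '15J1'
    """
).fetchone()[0]
print(rows_updated, earn)
```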
Again, using a lookup table can be a good idea in many cases, it's just not correct to call it normalization.

mysql fast select query without reading all db

I have a large database with two tables: stat and total.
The example of the relation is the following:
STAT:
| ID | total event |
+----+-------------+
| 7  | 2           |
| 8  | 1           |
TOTAL:
| ID | Event       |
+----+-------------+
| 7  | "hello"     |
| 7  | "everybody" |
| 8  | "hi"        |
This is a very simplified version; also consider that STAT table could have 500K records, and for each STAT I can have about 200 TOTAL rows.
Currently, if I run a simple SELECT query in table TOTAL the system is terribly slow.
Could anyone help me with some advice for the creation of the TOTAL table? Is it possible to say to MySQL that the id column is already sorted so that there is no reason to scan all the rows till the end where, for example, id=7?
Add INDEX(ID) to both of your tables, if you have not already.
SELECT COUNT(*) FROM TOTAL WHERE ID=7 -> if ID is indexed, this will be fast.
You can add an index, and furthermore you can partition your table.
As per @ypercube's comment, tables are not stored in a sorted state, so one cannot "tell" this to the database. However, you can add an index to a table to make it faster to search.
One important thing to check - it looks like TOTAL.ID is intended as a foreign key - if so, the table TOTAL should have a primary key called ID. Rename the existing column of that name to STAT_ID instead, so it is obvious what it is a foreign key for. Then add an index on STAT_ID.
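The effect of the suggested index can be checked with a sketch like the one below. It uses Python's sqlite3 module rather than MySQL (an assumption for runnability; in MySQL the equivalent check is EXPLAIN), and the index name idx_total_stat_id is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stat (
        id          INTEGER PRIMARY KEY,
        total_event INTEGER NOT NULL
    );
    CREATE TABLE total (
        id      INTEGER PRIMARY KEY,                   -- the table's own key
        stat_id INTEGER NOT NULL REFERENCES stat(id),  -- renamed foreign key
        event   TEXT    NOT NULL
    );
    INSERT INTO stat VALUES (7, 2), (8, 1);
    INSERT INTO total (stat_id, event) VALUES
        (7, 'hello'), (7, 'everybody'), (8, 'hi');
    """
)

def plan(sql):
    """Return the planner's description of how it would run the query."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT COUNT(*) FROM total WHERE stat_id = 7"

plan_before = plan(query)  # without an index, every row is scanned
conn.execute("CREATE INDEX idx_total_stat_id ON total (stat_id)")
plan_after = plan(query)   # with the index, only matching entries are read

count = conn.execute(query).fetchone()[0]
print(plan_before)
print(plan_after)
print(count)
```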
Lastly, as a point of style, I recommend that you make your table and column names case-insensitive, and write them in lower-case. It makes it easier to read SQL when keywords are in upper case, and database objects are in lower.