Specify probability to find a row in MySQL - mysql

I have a table with 3 columns:
ID, name, and class.
class can have one of two values, "A" or "B".
If I wanted to look up a student by ID, my SQL query would be
SELECT *
FROM users
WHERE ID='4'
This gives me the correct results.
Suppose I know that, for a specific value such as 4, the user is much more likely to be found among the class A rows than among the class B rows.
Is there a way to optimize this query, OR
optimize the schema, OR
should I split the table into two, one for class A users and one for class B users?
(Or does MySQL already know about these optimizations?)

You asked whether knowing the probability of finding a row in one class or the other can affect query performance.
Don't even start thinking about this until you have many crore (many tens of millions) of rows to deal with. Seriously. Instead, spend your irreplaceable time learning about table indexing and getting your application finished.
Looking up rows based on id values is very fast indeed if your indexes are correct. Partitioning doesn't help performance much if at all for simple lookups. And, maintaining partitioned tables is a huge and ongoing pain in the neck.
A query of the form WHERE id = constant AND class = constant can be made very near optimal with a compound index on (id, class).
Good material to learn about SQL performance is here. http://use-the-index-luke.com/sql/table-of-contents
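To see why an id lookup is already cheap, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL (an assumption made only so the example runs anywhere); the sample rows and the plan helper are made up for illustration. The planner reports a key search rather than a full scan, which is the same thing EXPLAIN would tell you in MySQL.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (ID INTEGER PRIMARY KEY, name TEXT, class TEXT)")
conn.executemany(
    "INSERT INTO users (ID, name, class) VALUES (?, ?, ?)",
    [(i, f"user{i}", "A" if i % 3 else "B") for i in range(1, 1001)],
)

def plan(sql, params=()):
    """Return the planner's readable description of how it will run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text

lookup = "SELECT * FROM users WHERE ID = ?"
print(plan(lookup, (4,)))                     # a SEARCH on the primary key, not a SCAN
print(conn.execute(lookup, (4,)).fetchone())  # the row for ID 4
```

Because the primary key already pins down a single row, knowing which class the row probably belongs to gives the database nothing extra to work with.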

You should have a table for class:
id | name
---+-----
1  | A
2  | B
Then, instead of class names, store class ids in users:
id | name | classId
---+------+--------
1  | u1   | 1
2  | u2   | 2
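A rough sketch of that layout, again using Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). The foreign key and the user_with_class helper are my additions for illustration, not something the answer above spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE class (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE users (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    classId INTEGER NOT NULL REFERENCES class(id)
);
INSERT INTO class (id, name) VALUES (1, 'A'), (2, 'B');
INSERT INTO users (id, name, classId) VALUES (1, 'u1', 1), (2, 'u2', 2);
""")

def user_with_class(user_id):
    """Look up a user by id and resolve the class name through the lookup table."""
    return conn.execute(
        """
        SELECT u.name, c.name
        FROM users AS u
        JOIN class AS c ON c.id = u.classId
        WHERE u.id = ?
        """,
        (user_id,),
    ).fetchone()

print(user_with_class(2))
```

With the foreign key in place, a user row cannot point at a class id that does not exist.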

Related

Relational databases: Integrate one tree structure into another

I'm currently designing a relational database table in MySQL for handling multiple categories, which will later be shown as a tree structure on the client side and used for filtering.
We have a root element which is set by default. We can then add children to it (level one). So far, the table structure in the simplest case could be defined like this:
id | name           | parent_id
---+----------------+----------
1  | All Categories | NULL
2  | History        | 1
However, I have a requirement to include another type of tree structure (Products) in the table (a corresponding API is available). The records from the other table have their own id type (UUID). Basically, I need to ingest them into my table. A possible structure would look like this:
id | UUID         | name                     | parent_id
---+--------------+--------------------------+----------
1  | NULL         | All Categories           | NULL
2  | NULL         | History                  | 1
3  | NULL         | Products                 | 1
4  | CN1001231232 | Catalog electricity      | 3
5  | CN1001231242 | Catalog basic components | 4
6  | NULL         | Shipping                 | 1
I am new to relational databases, but all of these NULL values for the UUID look (at least to me) like bad table design. Is there a way of avoiding this, or an even better way to do this "ingestion"?
If you had a table for users, with columns first_name, middle_name, last_name but then a user signed up and said they have no middle name, you could just store NULL for that user's middle_name column. What's bad design about that?
NULL is used when an attribute is unknown or inapplicable on a given row. It seems appropriate for the case you describe, i.e. when records that did not come from the external source have no UUID, and need no UUID.
That said, some computer science theorists insist that NULL is never appropriate. There's a decades-old controversy about whether SQL should even have a NULL.
The alternative would be to create a second table, in which you store only the UUID and the reference to the entity in your first table. Then just don't store rows for the other ids.
id | UUID
---+-------------
4  | CN1001231232
5  | CN1001231242
And don't store the UUID column in your first table. This eliminates the NULLs, but it means you need to JOIN the two tables whenever you want to query the entities together with their UUIDs.
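Here is a small sketch of the two-table variant, using Python's sqlite3 as a stand-in for MySQL (an assumption so it runs without a server). The table names categories and category_uuid and the helper functions are hypothetical; the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES categories(id)
);
-- Only rows that came from the external source get an entry here.
CREATE TABLE category_uuid (
    id   INTEGER PRIMARY KEY REFERENCES categories(id),
    uuid TEXT NOT NULL UNIQUE
);
INSERT INTO categories (id, name, parent_id) VALUES
    (1, 'All Categories', NULL),
    (2, 'History', 1),
    (3, 'Products', 1),
    (4, 'Catalog electricity', 3),
    (5, 'Catalog basic components', 4),
    (6, 'Shipping', 1);
INSERT INTO category_uuid (id, uuid) VALUES
    (4, 'CN1001231232'),
    (5, 'CN1001231242');
""")

def external_categories():
    """Categories that have a UUID (inner join drops the rest)."""
    return conn.execute("""
        SELECT c.id, c.name, u.uuid
        FROM categories AS c
        JOIN category_uuid AS u ON u.id = c.id
        ORDER BY c.id
    """).fetchall()

def all_categories_with_uuid():
    """Every category; uuid is None for rows without an external id."""
    return conn.execute("""
        SELECT c.id, c.name, u.uuid
        FROM categories AS c
        LEFT JOIN category_uuid AS u ON u.id = c.id
        ORDER BY c.id
    """).fetchall()

print(external_categories())
```

Note that the LEFT JOIN brings the NULLs back in the result set; they are just no longer stored.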
First make sure you actually have to combine these in the same table. Are the products categories? If they are categories and are used like categories then it makes sense to have them in the same table, but if they have categories then they should be kept separate and given a category/parent id.
If you're sure it's appropriate to store them in the same table then the way you have it is good with one adjustment. For the UUID you can use a separate naming scheme that makes it interchangeable with id for those entries and avoids collisions with the other uuids. For example:
id | UUID         | name                     | parent_id
---+--------------+--------------------------+----------
1  | CAT000000001 | All Categories           | NULL
2  | CAT000000002 | History                  | 1
3  | CAT000000003 | Products                 | 1
4  | CN1001231232 | Catalog electricity      | 3
5  | CN1001231242 | Catalog basic components | 4
6  | CAT000000006 | Shipping                 | 1
Your requirements combine the two things relational databases are not great at out of the box: modelling hierarchies, and inheritance (in the object-oriented sense).
Your design uses the "single table inheritance" model (one of 3 competing options). It's the simplest option in terms of design.
In practical terms, you may want to add a column to explicitly state which type of record you're dealing with ("regular category" and "product category") so your queries are more obvious to others.
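One way such a type column might look, sketched with Python's sqlite3 standing in for MySQL (an assumption for runnability). The column name record_type, its two values, and the CHECK constraints are hypothetical additions of mine; MySQL only enforces CHECK constraints from 8.0.16 onward, so on older versions the rule would have to live in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    id          INTEGER PRIMARY KEY,
    uuid        TEXT UNIQUE,
    name        TEXT NOT NULL,
    parent_id   INTEGER REFERENCES categories(id),
    -- Hypothetical discriminator column: says which kind of row this is.
    record_type TEXT NOT NULL CHECK (record_type IN ('category', 'product')),
    -- Product rows must carry a UUID; regular categories must not.
    CHECK (
        (record_type = 'product'  AND uuid IS NOT NULL) OR
        (record_type = 'category' AND uuid IS NULL)
    )
);
INSERT INTO categories (id, uuid, name, parent_id, record_type) VALUES
    (1, NULL, 'All Categories', NULL, 'category'),
    (3, NULL, 'Products', 1, 'category'),
    (4, 'CN1001231232', 'Catalog electricity', 3, 'product'),
    (5, 'CN1001231242', 'Catalog basic components', 4, 'product');
""")

def names_of(record_type):
    """List the names of all rows of one kind, without testing uuid for NULL."""
    rows = conn.execute(
        "SELECT name FROM categories WHERE record_type = ? ORDER BY id",
        (record_type,),
    ).fetchall()
    return [name for (name,) in rows]

print(names_of("product"))
```

Queries can then filter on record_type directly, which reads more clearly than "WHERE uuid IS NOT NULL".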

Storing data counts from table into a "trending" table

We have a table for which we have to present a great many counts for different combinations of fields.
This takes quite a while to do on the fly and doesn't provide historical data, so I'm thinking about the best way to store those counts in another table, with a timestamp, so we can query them quickly and get historical trends.
For each count we need 4 pieces of information to identify it, and there are about 1000 different metrics we would like to store.
I'm considering three different strategies, each storing a count and a timestamp but differing in how the count is identified for retrieval:
1) One table with 4 fields to identify the count. The 4 fields wouldn't be normalized, as they contain data from different external tables.
2) One table with a single "tag" field, which will contain the 4 pieces of information as a tag. These tags could be enriched and kept in another table, perhaps with a field for each tag part, linking them to the external tables.
3) Different tables for the different groups of counts, to be able to normalize on one or more fields, but this will need anywhere from 6 to tens of tables.
I'm going with the first one, not normalized at all, but I'm wondering if anyone has a better or simpler way to store all these counts.
Sample of a value:
status,installed,all,virtual,1234,01/05/2015
First field, status, can have up to 10 values.
Second field, installed, can have up to 10 values for each value of the first field.
Third field, all, can have up to 10 different values, but they are the same for all categories.
Fourth field, virtual, can have up to 30 values, which will also be the same for all previous categories.
The last two fields will be a number and a timestamp.
Thanks,
Isaac
When you have a lot of metrics and you don't need to use them for calculations across metrics, you can go for the first solution.
I would probably build a table like this
Status_id | Installed_id | All_id | Virtual_id | Date | Value
Or, if each combination of the first four columns has a proper name, I would probably create two tables (I think this is what you describe as your second option):
Metric Table
Status_id | Installed_id | All_id | Virtual_id | Metric_id | Metric_Name
Values Table
Metric_id | Date | Value
This is good if you have names for your metrics or other details which, with the first approach, you would otherwise need to duplicate for each combination.
In both cases it will be a bit complicated to do calculations that combine rows from different metrics. For this reason, this approach is suggested only for high-level KPIs.
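A minimal sketch of the two-table layout, using Python's sqlite3 as a stand-in for MySQL (an assumption so it runs as-is). The table names metric and metric_value, the TEXT column types, the helper functions and the sample dates and values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metric (
    metric_id    INTEGER PRIMARY KEY,
    status_id    TEXT NOT NULL,
    installed_id TEXT NOT NULL,
    all_id       TEXT NOT NULL,
    virtual_id   TEXT NOT NULL,
    metric_name  TEXT,
    UNIQUE (status_id, installed_id, all_id, virtual_id)
);
CREATE TABLE metric_value (
    metric_id INTEGER NOT NULL REFERENCES metric(metric_id),
    date      TEXT NOT NULL,      -- ISO date of the snapshot
    value     INTEGER NOT NULL,
    PRIMARY KEY (metric_id, date)
);
""")

def record(status, installed, all_, virtual, date, value):
    """Store one snapshot, creating the metric row the first time it is seen."""
    key = (status, installed, all_, virtual)
    conn.execute(
        "INSERT OR IGNORE INTO metric (status_id, installed_id, all_id, virtual_id) "
        "VALUES (?, ?, ?, ?)",
        key,
    )
    (metric_id,) = conn.execute(
        "SELECT metric_id FROM metric WHERE status_id = ? AND installed_id = ? "
        "AND all_id = ? AND virtual_id = ?",
        key,
    ).fetchone()
    conn.execute(
        "INSERT INTO metric_value (metric_id, date, value) VALUES (?, ?, ?)",
        (metric_id, date, value),
    )

def trend(status, installed, all_, virtual):
    """Return (date, value) pairs for one metric, oldest first."""
    return conn.execute(
        """
        SELECT v.date, v.value
        FROM metric AS m
        JOIN metric_value AS v ON v.metric_id = m.metric_id
        WHERE m.status_id = ? AND m.installed_id = ?
          AND m.all_id = ? AND m.virtual_id = ?
        ORDER BY v.date
        """,
        (status, installed, all_, virtual),
    ).fetchall()

record("status", "installed", "all", "virtual", "2015-05-01", 1234)
record("status", "installed", "all", "virtual", "2015-05-02", 1250)
print(trend("status", "installed", "all", "virtual"))
```

The four identifying fields are stored once per metric, and each snapshot only adds a narrow (metric_id, date, value) row.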
Finally, because all possible combinations of the last two fields are always present in your table, you could consider converting them to columns:
Status_id | Installed_id | Date | All1_Virtual1 | All1_Virtual2 | ... | All10_Virtual30
With 10 values for All and 30 for Virtual you will have 300 columns, which is not very easy to handle, but they will be worth having if you have to do something like:
(All1_Virtual2 - All5_Virtual23) * All6_Virtual12
But in this case I would prefer (if possible) to do the calculation in advance to reduce the number of columns.

Database design. How to approach this please?

I'm designing a database (MySQL) that will manage a fleet of vehicles.
The company has many garages across the city. At each garage, vehicles get serviced (an operation). An operation can be any of 3 types of service.
Table Vehicle, Table Garage, Table Operation, Table Operation Type 1, Table Operation Type 2, Table Operation Type 3.
Each Operation has the vehicle ID and garage ID, but how do I link it to the other tables (the service tables) depending on which type of service the user chooses?
I would also like to add a billing table, but I'm lost as to how to design the relationships between these tables.
If I have fully understood it I would suggest something like this (first of all you shouldn't have three operation tables):
Vehicles Table
- id
- garage_id
Garages Table
- id
Operations/Services Table
- id
- vehicle_id
- garage_id
- type
Customer Table
- id
- service_id
Billings Table
- id
- customer_id
You need six tables:
vehicle: id, ...
garage: id, ...
operation: id, vehicle_id, garage_id, operation_type (which can be one of the three options/operations available, with the possibility to be extended)
customer: id, ...
billing: id, customer_id, total_amount
billingoperation: id, billing_id, operation_id, item_amount
You definitely should not create three tables for operations. If you did, introducing a new operation in the future would involve creating a new table in the database.
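The six tables could be sketched roughly like this, using Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). The extra columns (plate, address, name), the sample rows, the operation type names and the billed_items helper are hypothetical; amounts are stored as integer cents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE vehicle  (id INTEGER PRIMARY KEY, plate TEXT);
CREATE TABLE garage   (id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE operation (
    id             INTEGER PRIMARY KEY,
    vehicle_id     INTEGER NOT NULL REFERENCES vehicle(id),
    garage_id      INTEGER NOT NULL REFERENCES garage(id),
    operation_type TEXT NOT NULL   -- one of the service types; new types need no new table
);
CREATE TABLE billing (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(id),
    total_amount INTEGER NOT NULL
);
-- One row per billed operation, so a bill can cover several operations.
CREATE TABLE billingoperation (
    id           INTEGER PRIMARY KEY,
    billing_id   INTEGER NOT NULL REFERENCES billing(id),
    operation_id INTEGER NOT NULL REFERENCES operation(id),
    item_amount  INTEGER NOT NULL
);
INSERT INTO vehicle  VALUES (1, 'ABC-123');
INSERT INTO garage   VALUES (1, '123 Main St.');
INSERT INTO customer VALUES (1, 'Example Customer');
INSERT INTO operation VALUES (1, 1, 1, 'oil_change'), (2, 1, 1, 'tire_rotation');
INSERT INTO billing  VALUES (1, 1, 7500);
INSERT INTO billingoperation VALUES (1, 1, 1, 5000), (2, 1, 2, 2500);
""")

def billed_items(billing_id):
    """Return (operation_type, item_amount) for every line on one bill."""
    return conn.execute(
        """
        SELECT o.operation_type, bo.item_amount
        FROM billingoperation AS bo
        JOIN operation AS o ON o.id = bo.operation_id
        WHERE bo.billing_id = ?
        ORDER BY bo.id
        """,
        (billing_id,),
    ).fetchall()

print(billed_items(1))
```

The billingoperation table is what ties a bill to the individual operations it charges for.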
For the record, I disagree with everyone who is saying you shouldn't have multiple operation tables. I think that's perfectly fine, as long as it is done properly. In fact, I'm doing that with one of my products right now.
If I understand, at the core of your question, you're asking how to do table inheritance, because Op Type 1 and Op Type 2 (etc.) IS AN Operation. The short answer is that you can't. The longer answer is that you can't...at least not without some helper logic.
I assume you have some sort of program that will pull data from the database, rather than you just writing SQL commands by hand. Working under that assumption, let's use this as a subset of your database:
Garage
------
GarageId | GarageLocation | etc.
---------|----------------|------
1 | 123 Main St. | XX
Operation
---------
OperationId | GarageId | TimeStarted | TimeEnded | OperationTypeDescId | OperationTypeId
------------|----------|-------------|-----------|---------------------|----------------
2 | 1 | noon | NULL | 2 | 2
OperationTypeDesc
-------------
OperationTypeDescId | Name | Description
--------------------|-------|-------------------------
1 | OpFoo | Do things with the stuff
2 | OpBar | Do stuff with the things
OpFoo
-----
OpID | Thing1 | Thing2
-----|--------|-------
1 | 123 | abc
OpBar
-----
OpID | Stuff1 | Stuff2
-----|--------|-------
1 | 456 | def
2 | 789 | ghi
Using this setup, you have the following information:
A garage has its information, plain and simple.
An operation has a unique ID (OperationId), a garage where it was executed, an ID referencing the description of the operation, and the OperationType ID (more on this in a moment).
A pre-populated table of operation types. Each type has a unique ID (OperationTypeDescId), the name of the operation, and a human-readable description of what that operation is.
One table for each row in OperationTypeDesc. For convenience, the table name should be the same as the Name column.
Now we can begin to see where inheritance comes into play. In the operation table, the OperationTypeId references the OpId of the relevant table...the "relevant table" is determined by the OperationTypeDescId.
An example: Let's say we had the above data set. In this example we know that there is an operation happening in a garage at 123 Main St. We know it started at noon, and has not yet ended. We know the type of operation is "OpBar". Since we know we're doing an OpBar operation instead of an OpFoo operation, we can focus on only the OpBar-relevant attributes, namely Stuff1 and Stuff2. Since the Operation's OperationTypeId is 2, we know that Stuff1 is 789 and Stuff2 is ghi.
Now the tricky part. In your program, this is going to require Reflection. If you don't know what that is, it's the practice of getting a Type from the NAME of that type. In our example, we know what table to look at (OpBar) because of its name in the OperationTypeDesc table. Put another way, you don't automatically know what table to look in; reflection tells you that information.
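To make the name-based lookup concrete, here is a rough sketch using Python's sqlite3 as a stand-in for MySQL (an assumption so it runs without a server). It uses the sample rows above; the operation_details helper is hypothetical, and the existence check against sqlite_master is SQLite-specific (in MySQL the equivalent check would go against information_schema.tables).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OperationTypeDesc (
    OperationTypeDescId INTEGER PRIMARY KEY,
    Name                TEXT NOT NULL UNIQUE,
    Description         TEXT
);
CREATE TABLE Operation (
    OperationId         INTEGER PRIMARY KEY,
    GarageId            INTEGER,
    TimeStarted         TEXT,
    TimeEnded           TEXT,
    OperationTypeDescId INTEGER NOT NULL REFERENCES OperationTypeDesc(OperationTypeDescId),
    OperationTypeId     INTEGER NOT NULL   -- OpID in whichever sub-table applies
);
CREATE TABLE OpFoo (OpID INTEGER PRIMARY KEY, Thing1 INTEGER, Thing2 TEXT);
CREATE TABLE OpBar (OpID INTEGER PRIMARY KEY, Stuff1 INTEGER, Stuff2 TEXT);
INSERT INTO OperationTypeDesc VALUES
    (1, 'OpFoo', 'Do things with the stuff'),
    (2, 'OpBar', 'Do stuff with the things');
INSERT INTO Operation VALUES (2, 1, 'noon', NULL, 2, 2);
INSERT INTO OpFoo VALUES (1, 123, 'abc');
INSERT INTO OpBar VALUES (1, 456, 'def'), (2, 789, 'ghi');
""")

def operation_details(operation_id):
    """Find the sub-table by name, then fetch the type-specific row from it."""
    row = conn.execute(
        """
        SELECT d.Name, o.OperationTypeId
        FROM Operation AS o
        JOIN OperationTypeDesc AS d
          ON d.OperationTypeDescId = o.OperationTypeDescId
        WHERE o.OperationId = ?
        """,
        (operation_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no operation {operation_id}")
    table_name, sub_id = row
    # Table names cannot be bound as parameters, so confirm the name is a real
    # table before putting it into the SQL text.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    if exists is None:
        raise LookupError(f"no sub-table named {table_name}")
    details = conn.execute(
        f'SELECT * FROM "{table_name}" WHERE OpID = ?', (sub_id,)
    ).fetchone()
    return table_name, details

print(operation_details(2))
```

The check on the table name is the "helper logic" part: nothing in the database itself guarantees that Name matches an existing table.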
Edit:
Csaba says "In the future if you would like to introduce a new operation that would involve creating a new table in the database". That is correct. You would also need to add a new row to the OperationTypeDesc table. Csaba implies this is a bad thing, and I disagree, with a few provisos. If you are going to be adding a new operation type frequently, then yes, he makes a very good point. You don't want to be creating new tables constantly. If, however, you know ahead of time what types of operations will be performed, and will very rarely add new types of operations, then I maintain this is the way to go. All of your info common to all operations goes in the Operation table, and all op-specific info goes into the relevant "sub-table".
There is one more very important note regarding this. Because of how this is designed, you, the human, must be aware of the design. Whenever you create a new operation type, it's not as simple as creating the new table. Specifically, you have to make sure that the new table name and the OperationTypeDesc "Name" entry are the same. Think of it as an extra constraint - an "INTEGER" column can only contain ints, otherwise the db won't allow the data. In the same manner, the "Name" column can only contain the name of an existing table. You the human must be aware of that constraint, because it cannot be (easily) automatically enforced.

Most efficient, scalable mysql database design

I have an interesting question about database design:
I came up with the following design:
first table:
**Survivors:**
Survivor_Id | Name | Strength | Energy
second table:
**Skills:**
Skill_Id | Name
third table:
**Survivor_skills:**
Survivor_Id | Skill_Id | Level
The first table, Survivors, will have many records and will grow over time.
The second table will hold just a few skills that survivors can learn (for example: recon (higher view range), sniper (better accuracy), ...). These skills aren't like strength or energy, which all survivors have.
The third table is the most interesting: it is where survivors and skills join together. Everything will work just fine, but I am worried about data duplication.
For example, survivor with id 1 will have 5 skills, so the third table would look like this:
// survivor_id | skill_id | level
1 | 1 | 2
1 | 2 | 3
1 | 3 | 1
1 | 4 | 5
1 | 5 | 1
First record: survivor with id 1 has skill with id 1 on level 2
Second record ...
Is this the proper approach, or should I use something different?
Looks good to me. If you are worried about data duplication:
1) your server-side code should be geared to not letting this happen
2) you could check before inserting whether the row already exists
3) you could use MySQL's REPLACE INTO, which will replace duplicate rows if configured properly, or insert new ones (http://dev.mysql.com/doc/refman/5.0/en/replace.html)
4) set a unique index on the columns where you want only unique rows, e.g. survivor_id, skill_id
I concur with the others - this is the proper approach.
However, there is one aspect which hasn't been discussed: the order of columns in the composite key {Survivor_Id, Skill_Id}, which will be governed by the kinds of queries you need to run...
If you need to find skills of the given survivor, the order needs to be: {Survivor_Id, Skill_Id}.
If you need to find survivors with the given skill, the order needs to be: {Skill_Id, Survivor_Id}.
If you need both, you'll need both the key (and the implied index) on {Survivor_Id, Skill_Id} and an index on {Skill_Id, Survivor_Id} [1]. Since InnoDB tables are clustered, accessing Level through that secondary index requires a double lookup. To avoid that, consider using a covering index {Skill_Id, Survivor_Id, Level} instead.
[1] Or vice versa.
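A minimal sketch of the composite key plus the covering index, using Python's sqlite3 as a stand-in for MySQL/InnoDB (an assumption for runnability; the storage details differ, but the planner output shows the same idea). The index name, the helper functions and the extra sample row for survivor 2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Survivor_skills (
    Survivor_Id INTEGER NOT NULL,
    Skill_Id    INTEGER NOT NULL,
    Level       INTEGER NOT NULL,
    PRIMARY KEY (Survivor_Id, Skill_Id)   -- serves "skills of a given survivor"
);
-- Serves "survivors with a given skill" and carries Level along with it.
CREATE INDEX idx_skill_survivor_level
    ON Survivor_skills (Skill_Id, Survivor_Id, Level);
INSERT INTO Survivor_skills VALUES
    (1, 1, 2), (1, 2, 3), (1, 3, 1), (1, 4, 5), (1, 5, 1), (2, 1, 4);
""")

BY_SKILL = (
    "SELECT Survivor_Id, Level FROM Survivor_skills "
    "WHERE Skill_Id = ? ORDER BY Survivor_Id"
)

def survivors_with_skill(skill_id):
    """Return (Survivor_Id, Level) for everyone who has the given skill."""
    return conn.execute(BY_SKILL, (skill_id,)).fetchall()

def plan(sql, params=()):
    """Return the planner's readable description of how it will run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

print(survivors_with_skill(1))
print(plan(BY_SKILL, (1,)))  # mentions the covering index
```

The composite primary key also takes care of the duplication worry: a second row for the same survivor and skill is rejected by the database.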

Updating row in tables with huge amount of data

I have to update the view count of the current post in the posts table, which has more than 2 million rows, and the page loads slowly.
Tables:
idpost | iduser | views | title |
1 | 5675 | 45645 | some title |
2 | 345 | 457 | some title 2 |
6 | 45 | 98 | some title 3 |
and many more... up to 2 million rows
iduser has an index and idpost is the primary key.
If I separate the data into a new table post_views and use a LEFT JOIN to get the view count, it will be fast at first since the new table is still small, but over time it will also have more than 2 million rows, and it will be slow again. How do you deal with a huge table?
Split the table
You should split the table to separate different things and prevent repetition of title data. This will be a better design. I suggest the following schema:
posts(idpost, title)
post_views(idpost, iduser, views)
Updating views count
You only need to update the views of one row at a time: when someone views your page, you update the related row. So it is a single-row update with no search overhead (thanks to the key and index). I don't see how this could cause overhead.
Getting total views
Probably, you run a query like this one:
SELECT SUM(views) FROM post_views WHERE idpost = 100
Yes, this can cause overhead. A solution may be to create a new table total_post_views and update the corresponding value in this table after each update on post_views. That way, you get rid of the LEFT JOIN and access the total view count directly.
But updating on every update also causes overhead. To increase performance, you can give up updating total_post_views after each update on post_views. If you choose this way, you can perform the update:
periodically, say every 30 seconds,
after a certain number of updates to post_views, say every 30 updates.
In this way you will get approximate results, of course. If this is tolerable, then I suggest you go this way.
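The "every 30 updates" variant could look something like this sketch, which uses Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). The BufferedViewCounter class, its method names and the stored_views helper are hypothetical; only the table name total_post_views comes from the answer. A timer calling flush() would give the "every 30 seconds" variant instead.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE total_post_views (idpost INTEGER PRIMARY KEY, views INTEGER NOT NULL)"
)
conn.execute("INSERT INTO total_post_views (idpost, views) VALUES (1, 0)")

class BufferedViewCounter:
    """Collect view increments in memory and write them out in batches."""

    def __init__(self, conn, flush_every=30):
        self.conn = conn
        self.flush_every = flush_every
        self.pending = Counter()   # idpost -> views not yet written
        self.buffered = 0          # total increments waiting in memory

    def record_view(self, idpost):
        self.pending[idpost] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        """Write all pending increments in one transaction, one UPDATE per post."""
        with self.conn:
            self.conn.executemany(
                "UPDATE total_post_views SET views = views + ? WHERE idpost = ?",
                [(count, idpost) for idpost, count in self.pending.items()],
            )
        self.pending.clear()
        self.buffered = 0

def stored_views(idpost):
    """Read the view count as currently stored in the database."""
    return conn.execute(
        "SELECT views FROM total_post_views WHERE idpost = ?", (idpost,)
    ).fetchone()[0]

counter = BufferedViewCounter(conn, flush_every=30)
for _ in range(29):
    counter.record_view(1)
print(stored_views(1))   # still the old value: nothing has been flushed yet
counter.record_view(1)   # the 30th view triggers a flush
print(stored_views(1))
```

The trade-off is the one described above: the stored count lags behind by up to one batch, and increments still in memory are lost if the process dies before a flush.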