This question has already been asked, but I've not found a single, agreed-upon answer.
Is it better to have:
1 big table with:
user_id | attribute_1 | attribute_2 | attribute_3 | attribute_4
or 4 small tables with:
user_id | attribute_1
user_id | attribute_2
user_id | attribute_3
user_id | attribute_4
1 big table or many small tables? Each user can have only 1 value for attribute_X. We have a lot of data to store (100 million users). We are using InnoDB. Performance is really important for us (10,000 queries/s).
Thanks!
François
If you adhere to the Zero, One or Many principle, whereby there is either no such thing, one of them, or an unlimited number, you would always build properly normalized tables to track things like this.
For instance, a possible schema:
CREATE TABLE user_attributes (
    id INT NOT NULL AUTO_INCREMENT,
    user_id INT NOT NULL,
    attribute_name VARCHAR(255) NOT NULL,
    attribute_value VARCHAR(255),
    PRIMARY KEY (id),
    UNIQUE INDEX index_user_attributes_name (user_id, attribute_name)
);
This is the basic key-value store pattern where you can have many attributes per user.
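For example, reading or upserting a single attribute against this schema might look like the following sketch (the user ID, attribute name, and value are placeholders); the upsert relies on the unique index above:

-- Fetch one attribute for a user.
SELECT attribute_value
FROM user_attributes
WHERE user_id = 123 AND attribute_name = 'attribute_1';

-- Insert or update an attribute in one statement.
INSERT INTO user_attributes (user_id, attribute_name, attribute_value)
VALUES (123, 'attribute_1', 'some value')
ON DUPLICATE KEY UPDATE attribute_value = 'some value';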
Although the storage requirements for this are higher than a fixed-column arrangement with perpetually frustrating names like attribute1, the cost is small enough in the age of terabyte-sized hard drives that it's rarely an issue.
Generally you'd create a single table for this data until insertion time becomes a problem. So long as your inserts are fast, I wouldn't worry about it. At that point you would want to consider a sharding strategy to divide this data into multiple tables with an identical schema, but only if it's required.
I would imagine that would be at the ~10-50 million rows stage, but could be higher if the amount of insert activity in this table is relatively low.
Don't forget that the best way to optimize for read activity is to use a cache: the fastest database query is the one you don't make. For that sort of thing you usually employ something like memcached to store the results of previous fetches, invalidating the cached entry on a write.
As always, benchmark any proposed schema at production scale.
1 big table with:
user_id | attribute_1 | attribute_2 | attribute_3 | attribute_4
will make your management easier. Otherwise you face too many individual lookups, which also complicates programming against the DB and increases the chance of application errors. For the common case, one indexed read fetches everything, as sketched below.
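A sketch of the comparison (the table names users and user_attr1 through user_attr4 are assumed for illustration):

-- One big table: one lookup fetches everything.
SELECT attribute_1, attribute_2, attribute_3, attribute_4
FROM users
WHERE user_id = 123;

-- Four small tables: the same data needs a four-way join.
SELECT a1.attribute_1, a2.attribute_2, a3.attribute_3, a4.attribute_4
FROM user_attr1 a1
JOIN user_attr2 a2 USING (user_id)
JOIN user_attr3 a3 USING (user_id)
JOIN user_attr4 a4 USING (user_id)
WHERE user_id = 123;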
Related
We are making a mobile application with some friends, but we are having problems with the structure of the database because of the unknowns involved. I think it is a good question that can help many people, and it would be nice if people with knowledge could explain it well.

The app consists of providing various services (more can be added in the future) to customers. They log in and have access to our services. At first we thought of one table containing columns for all the customer data plus the services. Then we saw that it was more effective to make a separate table called "services" that identifies the user by an id. The problem now is this table: we do not know whether to make a single column with all the services (as an array) or one column per service. I took a photo so that what I am proposing can be seen more easily.

The question is which of these options (obviously there may be a third that we have not considered) is the best in terms of performance.

I think the second option has several defects, but I'm not sure. In terms of latency and speed, traversing an array (more so if services are added, or if they are out of order because the user first signed up for service 2 and then service 1) costs much more than in option 1. In addition, removing a user from a service implies going through the entire array, finding the service, and eliminating it. I don't know; you are the experts. What do you recommend? All of this will be hosted in the cloud (Azure), so all requests will go to the cloud.
Option 2 is better than option 1. But, with respect, it's still not good.
Never, never store comma-separated lists of things in columns of data. If you do, you'll be sorry. (They're very costly to search.)
You want something like this. Three tables, one for users, another for services, and a so-called JOIN table to establish a many-to-many relationship between the two.
+-----------+      +-------------+      +-----------+
|user       |      |user_service |      |service    |
+-----------+      +-------------+      +-----------+
|user_id    +----->|user_id      |<-----+service_id |
|givenname  |      |service_id   |      |name       |
|surname    |      +-------------+      +-----------+
|is_active  |
+-----------+
Each row in user_service means a user is authorized to use that service. To authorize a user, INSERT a row. To revoke authorization, DELETE the row.
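Concretely, those two operations are single-row statements (the IDs here are placeholders):

-- Authorize user 123 to use service 7.
INSERT INTO user_service (user_id, service_id) VALUES (123, 7);

-- Revoke that authorization.
DELETE FROM user_service WHERE user_id = 123 AND service_id = 7;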
To find out whether a user can use a service, use this query.
SELECT user.user_id
FROM user
JOIN user_service USING (user_id)
JOIN service USING (service_id)
WHERE user.givenname = 'Bill' AND user.surname='Gates'
AND service.name = 'CharityNavigator'
AND user.is_active > 0;
If your query returns the user_id then the chosen user may use the chosen service.
To get a list of the services for each user, use this query.
SELECT user.user_id, user.givenname, user.surname,
GROUP_CONCAT(service.name) service_names
FROM user
JOIN user_service USING (user_id)
JOIN service USING (service_id)
WHERE user.is_active > 0
GROUP BY user.user_id;
Some explanation:
It's almost always best to build tables with rows for things like your services in them, rather than columns or comma-separated lists in columns. Why?
You can add new services -- as many as you want -- years from now without reworking your database code.
DBMSs, including MySQL, work well with JOIN operations.
Doing WHERE commalist_column SOMEHOW_CONTAINS (some_id) is disgustingly inefficient in most relational database management systems. Doing WHERE column = some_id is far more efficient because it can use an index.
Rows with fewer columns, in general, work better than rows with more columns.
It's far cheaper in production to add rows to databases than it is to add columns. Adding columns means altering table definitions. That operation can require downtime.
When you use columns for things like your services, you're creating a closed system. When you use rows, your system is open-ended.
May I suggest you read about database normalization? Don't be intimidated by all the CS jargon. Just look at some examples of how to normalize various databases.
And maybe read about entity-relationship database modeling?
Edit: On the advice of a commenter, I suggest you make the primary key of your user_service table contain both columns (user_id, service_id). I also suggest you create a reverse index on both columns (service_id, user_id) so your queries can look things up quickly starting from the service as well as from the user. Your table definitions might look something like this:
CREATE TABLE user (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    givenname VARCHAR(50) NULL DEFAULT NULL,
    surname VARCHAR(50) NULL DEFAULT NULL,
    is_active TINYINT NOT NULL DEFAULT '1',
    PRIMARY KEY (user_id)
)
COLLATE='utf8mb4_general_ci';

CREATE TABLE service (
    service_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(50) NULL DEFAULT NULL,
    PRIMARY KEY (service_id)
)
COLLATE='utf8mb4_general_ci';

CREATE TABLE user_service (
    user_id INT UNSIGNED NOT NULL,
    service_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, service_id),
    INDEX reverse_index (service_id, user_id),
    CONSTRAINT FK_service
        FOREIGN KEY (service_id)
        REFERENCES service (service_id)
        ON UPDATE RESTRICT ON DELETE RESTRICT,
    CONSTRAINT FK_user
        FOREIGN KEY (user_id)
        REFERENCES user (user_id)
        ON UPDATE RESTRICT ON DELETE RESTRICT
);
With this primary key, if you attempt to INSERT a duplicate authorization for a user for a service, the DBMS rejects it.
Be sure to use the same `INT UNSIGNED NOT NULL` data type for `user_id` and `service_id` in those tables.
This is a very common database design pattern: it's the canonical way of creating a many-to-many relationship between rows of two different tables.
A 3rd way (most frugal on space)
See the SET datatype. It lets you record which combination of those 6 services applies.
INT UNSIGNED (of a suitable size) is another way to have a "set".
SET or TINYINT takes only 1 byte to represent up to 8 items.
Your 6 column choice takes 6 bytes.
The "{serv1,... }" might be a VARCHAR, averaging 10-20 bytes.
So, my suggestions are clearly aimed at saving space. But maybe that is not important? Do you have millions of rows? Do you have more than 64 services? (There is a limit of 64 members for SET, and 64 bits in BIGINT UNSIGNED.)
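For concreteness, here is a minimal sketch of the SET approach; the table name and service names are made up:

CREATE TABLE user_services (
    user_id INT UNSIGNED NOT NULL,
    services SET('serv1','serv2','serv3','serv4','serv5','serv6') NOT NULL DEFAULT '',
    PRIMARY KEY (user_id)
);

-- Check whether user 123 has serv2 (FIND_IN_SET works on the SET's string form).
SELECT FIND_IN_SET('serv2', services) > 0
FROM user_services
WHERE user_id = 123;

-- Grant serv2; duplicate members in a SET are collapsed automatically.
UPDATE user_services
SET services = CONCAT_WS(',', NULLIF(services, ''), 'serv2')
WHERE user_id = 123;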
But Which?
Is the question about coding? Well, any method is going to take some effort to split the bits/columns/string apart to build the buttons on the screen. Probably a similar amount of effort for each, and probably less than the effort to build the screen. Ditto for performance.
I highly recommend you pick two solutions and implement both. You will discover
How similar they are in performance, amount of code, etc.
How insignificant the question is.
How much extra stuff you have learned about databases.
How easy it is to "try" and "throw away" another way to do something.
How the latency, performance, etc, differences are insignificant. (This is what we are really answering for you.)
The bigger picture
You have pointed out one use for this data structure. I worry that there are, or will be, other uses for this data structure. And that something else is the real determinant of which approach is best. (At that point, you can happily resurrect the thrown away version!)
A 4th way
JSON. But it would be more verbose (take more space) than your VARCHAR way. It may or may not be easier to work with -- this depends on the rest of the requirements.
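If you went the JSON route on MySQL 5.7 or later, a minimal sketch might look like this (the column name and service values are assumptions):

-- Add a JSON column holding an array of service names.
ALTER TABLE user ADD COLUMN services JSON;

UPDATE user SET services = '["serv1", "serv3"]' WHERE user_id = 123;

-- Check whether user 123 has serv3.
SELECT JSON_CONTAINS(services, '"serv3"')
FROM user
WHERE user_id = 123;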
My database has several categories to which I want to attach user-authored text "notes". For instance, an entry in a high level table named jobs may have several notes written by the user about it, but so might a lower level entry in sub_projects. Since these notes would all be of the same format, I'm wondering if I could simplify things by having only one notes table rather than a series of tables like job_notes or project_notes, and then use multiple many-to-many relationships to link it to several other tables at once.
If this isn't a deeply flawed idea from the get go (let me know if it is!), I'm wondering what the best way to do this might be. As I see it, I could do it in two ways:
Have a many-to-many junction table for each larger category, like job_notes_mapping and project_notes_mapping, and manage the MtM relationships individually
Have a single junction table linked to either an enum or separate table for table_type, which specifies what table the MtM relationship is mapping to:
+-------------+-------------+---------------+
| note_id | table_id | table_type_id |
+-------------+-------------+---------------+
| 1 | 1 | jobs |
| 2 | 2 | jobs |
| 3 | 1 | project |
| 4 | 2 | subproject |
| ........... | ........... | ........ |
+-------------+-------------+---------------+
Forgive me if any of these are completely horrible ideas, but I thought it might be an interesting question at least conceptually.
The ideal way, IMO, would be to have a supertype of jobs, projects and subprojects - let's call it activities - on which you could define any common fact types.
For example (I'm assuming jobs, projects and subprojects form a containment hierarchy):
activities (activity PK, activity_name, begin_date, ...)
jobs (job_activity PK/FK, ...)
projects (project_activity PK/FK, job_activity FK, ...)
subprojects (subproject_activity PK/FK, project_activity FK, ...)
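In MySQL, a minimal sketch of this supertype pattern might look like the following (subprojects would follow the same shape; the column details are my assumptions). The payoff is that one notes table can then point at activities with a real foreign key:

CREATE TABLE activities (
    activity INT UNSIGNED NOT NULL AUTO_INCREMENT,
    activity_name VARCHAR(100) NOT NULL,
    begin_date DATE NULL,
    PRIMARY KEY (activity)
);

-- Each subtype's primary key is also a foreign key to the supertype.
CREATE TABLE jobs (
    job_activity INT UNSIGNED NOT NULL,
    PRIMARY KEY (job_activity),
    FOREIGN KEY (job_activity) REFERENCES activities (activity)
);

CREATE TABLE projects (
    project_activity INT UNSIGNED NOT NULL,
    job_activity INT UNSIGNED NOT NULL,
    PRIMARY KEY (project_activity),
    FOREIGN KEY (project_activity) REFERENCES activities (activity),
    FOREIGN KEY (job_activity) REFERENCES jobs (job_activity)
);

-- A single notes table with a real foreign key to any kind of activity.
CREATE TABLE notes (
    note_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    activity INT UNSIGNED NOT NULL,
    note_text TEXT,
    PRIMARY KEY (note_id),
    FOREIGN KEY (activity) REFERENCES activities (activity)
);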
Unfortunately, most database schemas define unique auto-incrementing identifiers PER TABLE, which makes it very difficult to implement supertyping after data has been loaded. PostgreSQL allows sequences to be reused, which is great; some other DBMSs (like MySQL) don't make it easy at all.
My second choice would be your option 1, since it allows foreign key constraints to be defined. I don't like option 2 at all.
Unfortunately, we have ended up going with the ugliest answer to this, which is to have a notes table for every different type of entry - job_notes, project_notes, and subproject_notes. Our reasons for this were as follows:
A single junction table with a column containing the "type" of junction has poor performance, since none of the foreign keys are "real" and they must be searched manually. This is compounded by the fact that the Notes field contains a lot of text per entry.
A junction table per entry adds an additional table over simply having separate notes tables for every table type, and while it seems slightly prettier, it does not create substantial performance gains.
I'm not satisfied with this answer, because it seems so wasteful to effectively be duplicating the same Notes table for every job/project/subproject table that is being described. However, we haven't been able to come up with an answer that would hold up performance wise in the long term. I'll leave this open in case anyone has better recommendations for how to do this!
I have a current database structure that seems to split up some data for indexing purposes. The main tickets table has more "lite" fields such as foreign keys, integers, and dates, and the tickets_info table has the potentially larger data such as text fields and blobs. Is this a good idea to continue with, or should I combine the tables?
For example, the current structure looks something like this (assuming a one-to-one relationship with a foreign key on the indexes):
`tickets`
--------------------------------------------
id | customer | vendor | opened
--------------------------------------------
1 | 29 | 0 | 2013-10-09 12:49:04
`tickets_info`
--------------------------------------------
id | description | more_notes
--------------------------------------------
1 | This thing is broken! | Even longer...
My application does way more SELECTs than INSERTs/UPDATEs, so I can see the theoretical benefit of the splitting when large lists of tickets are queried at once for an overview page (250+ result listings per page). The larger details would then be used on the page that shows just the one ticket and its details with the simple JOIN (amongst the several other JOINS on the foreign keys).
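For illustration, the two access patterns might look like this (column lists abbreviated to match the sketch above):

-- Overview page: touches only the narrow tickets table.
SELECT id, customer, vendor, opened
FROM tickets
ORDER BY opened DESC
LIMIT 250;

-- Detail page: a one-to-one join pulls in the wide columns.
SELECT t.id, t.customer, t.vendor, t.opened, i.description, i.more_notes
FROM tickets t
JOIN tickets_info i ON i.id = t.id
WHERE t.id = 1;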
The tables are currently MyISAM, but I am going to convert them to InnoDB once I restructure them if that makes any difference. There are currently about 33 columns in the tickets table and 4 columns in the tickets_info table; the tickets_info table can potentially have more columns depending on the installation ("custom fields" that I have implemented akin to PHPBBv3).
I think this design is fine. The tickets table is used not only to show information for a single ticket, but also for calculations (e.g., the total of tickets sold on a specific day) and other analysis (how many tickets did that vendor sell?).
Merging tickets_info in would increase the size of your tickets table without any benefit, but with the risk of increasing access time to it. Good indexing on the tickets table should keep you safe, but MySQL is not a columnar database, so I expect a row with big VARCHAR or BLOB fields to require more resources.
Besides that, if you use tickets_info only for single-ticket queries, I think you already get good performance when you query that table.
So my suggestion is: leave it as it is :)
I'm making a site that will be a subscription based service that will provide users several courses based on whatever they signed up for. A single user can register in multiple courses.
Currently the db structure is as follows:
User
------
user_id | pwd | start | end
Courses
-------
course_id | description
User_course_subscription
------------------------
user_id | course_id | start | end
course_chapters
---------------
course_id | title | description | chapter_id | url |
The concern is that, with the user_course_subscription table, I don't see how I can have one user with multiple course subscriptions, unless I enter the same user_id multiple times with a different course_id each time. Alternatively, I could add many columns in the format calculus_1, chem_1, etc., but that would give me a ton of columns as the list of courses grows.
I was wondering if having the user_id appear multiple times is the optimal way to do this? Or is there another way to structure the table (or maybe I'd have to restructure all the tables)?
Your database schema looks fine. Don't worry, you're on the right track. As for the User_course_subscription table, user_id and course_id together form the primary key. This is called a composite (or joint) primary key, and it is perfectly fine.
Values are still unique because no user subscribes to the same course twice; your business logic code should ensure this anyway. For the database part: you might want to look up in your database system's manual how to set composite primary keys up properly when creating the table (the syntax may differ).
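In MySQL, for example, the table might be declared like this (the column types are my guesses from your sketch; start and end are backtick-quoted to sidestep keyword clashes):

CREATE TABLE user_course_subscription (
    user_id INT UNSIGNED NOT NULL,
    course_id INT UNSIGNED NOT NULL,
    `start` DATE NULL,
    `end` DATE NULL,
    PRIMARY KEY (user_id, course_id),  -- the composite primary key
    FOREIGN KEY (user_id) REFERENCES user (user_id),
    FOREIGN KEY (course_id) REFERENCES courses (course_id)
);

With this in place, inserting the same (user_id, course_id) pair twice fails with a duplicate-key error, enforcing the rule above at the database level.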
If you don't like this idea, you can also create a pseudo primary key, that is having:
user_course_subscription
------------------------
user_course_subscription_id | user_id | course_id | start | end
...where user_course_subscription_id is just an auto-incremented integer. This way, you can use user_course_subscription_id to identify records. This might make things easier in some places of your code, because you don't always have to use two values.
As for having calculus_1, chem_1, etc.: don't do this. You might want to read up on database normalization, as mike pointed out. Especially 1NF through 3NF are very common in database design.
The only reason not to follow normal forms is performance, and even then, in most cases optimization is premature. If you're concerned, stress-test a prototype of your application under realistic (expected) conditions and measure response times to get some hard evidence.
I don't know the meaning of the start and end columns in the user table, but you seem to have no redundancy.
You should check out the Wikipedia article on Boyce–Codd normal form. It has a useful example.
I am in the process of creating a second version of my technical wiki site and one of the things I want to improve is the database design. The problem (or so I think) is that to display each document, I need to join upwards of 15 tables. I have a bunch of lookup tables that contain descriptive data associated with each wiki entry such as programmer used, cpu, tags, peripherals, PCB layout software, difficulty level, etc.
Here is an example of the layout:
doc
--------------
id | author_id | doc_type_id .....
1 | 8 | 1
2 | 11 | 3
3 | 13 | 3
lookup_programmer
--------------
doc_id | programmer_id
1 | 1
1 | 3
2 | 2
programmer
--------------
programmer_id | programmer
1 | USBtinyISP
2 | PICkit
3 | .....
Since some doc IDs may have multiple entries for a single attribute (such as programmer), I have designed the DB to accommodate this. The other 10 attributes have a layout similar to the 2 programmer tables above. To display a single document article, approximately 20 tables are joined.
I used the Sphinx search engine for finding articles with certain characteristics. Essentially, Sphinx indexes all of the data (it does not store it) and returns the wiki doc IDs of interest based on the filters presented. If I want to find articles that use a certain programmer and then sort by date, MySQL has to first join ALL documents with the 2 programmer tables, then filter, and finally sort the remainder by insert time. No index can help me order the filtered results (it takes a LONG time with 150k doc IDs) since the sort happens in a temporary table. As you can imagine, it gets worse quickly as more parameters need to be filtered.
It is because I have to rely on Sphinx to return, say, all wiki entries that use a certain CPU AND programmer that I believe there is a DB smell in my current setup....
Edit: Looks like I have implemented an Entity–attribute–value (EAV) model.
I don't see anything here that suggests you've implemented EAV. Instead, it looks like you've assigned every row in every table an ID number. That's a guaranteed way to increase the number of joins, and it has nothing to do with normalization. (There is no "I've now added an id number" normal form.)
Pick one lookup table. (I'll use "programmer" in my example.) Don't build it like this.
create table programmer (
    programmer_id integer not null auto_increment,
    programmer varchar(20) not null,
    primary key (programmer_id),
    unique key (programmer)
);
Instead, build it like this.
create table programmer (
    programmer varchar(20) not null,
    primary key (programmer)
);
And in the tables that reference it, consider cascading updates and deletes.
create table lookup_programmer (
    doc_id integer not null,
    programmer varchar(20) not null,
    primary key (doc_id, programmer),
    foreign key (doc_id) references doc (id)
        on delete cascade,
    foreign key (programmer) references programmer (programmer)
        on update cascade on delete cascade
);
What have you gained? You keep all the data integrity that foreign key references give you, your rows are more readable, and you've eliminated a join. Build all your "lookup" tables that way, and you eliminate one join per lookup table. (And unless you have many millions of rows, you're probably not likely to see any degradation in performance.)
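For example, finding all docs that use a given programmer now takes one join instead of two (a sketch against the tables above; the programmer value is from the sample data):

select d.id, d.author_id, d.doc_type_id
from doc d
join lookup_programmer lp on lp.doc_id = d.id
where lp.programmer = 'USBtinyISP';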