Should I use a single table for many categorized rows? - mysql

I want to implement some user event tracking in my website for statistics etc.
I thought about creating a table called tracking_events that will contain the following fields:
| id (int, primary) |
| event_type (int) |
| user_id (int) |
| date_happened (timestamp)|
This table will contain a large number of rows (let's assume at least every page view is a tracked event and there are 1,000 daily visitors to the site).
Is it good practice to create this table with the event_type field to differentiate between essentially different, yet identically structured rows?
Or would it be a better idea to make a separate table for each type? E.g.:
table pageview_events
| id (int, primary) |
| user_id (int) |
| date_happened (timestamp)|
table share_events
| id (int, primary) |
| user_id (int) |
| date_happened (timestamp)|
and so on for 5-10 tables.
(the main concern is performance when selecting rows WHERE event_type = ...)
Thanks.

It really depends. If you need to have them separated, because you will only ever query them separately, then splitting them into separate tables should be fine. That saves you from having to store an extra discriminator column.
BUT... if you need to query these sets together, as if they were a single table, it would be much easier to have them stored together, with a discriminator column.
As for WHERE event_type = ..., if there are only two distinct values with a fairly even distribution, then an index on just that column isn't going to help much, because each value still matches about half the table. Including that column as the leading column of one or more multicolumn indexes is probably the way to go if a large number of your queries will include an equality predicate on that column.
Obviously, if these tables are going to be "large", then you'll want them indexed appropriately for your queries.
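As a rough illustration of the multicolumn-index suggestion, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere; MySQL's EXPLAIN output looks different). The index name idx_events_type_date and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracking_events (
    id            INTEGER PRIMARY KEY,
    event_type    INTEGER NOT NULL,
    user_id       INTEGER NOT NULL,
    date_happened TEXT    NOT NULL
);
-- The discriminator column leads the index; the date narrows within a type.
CREATE INDEX idx_events_type_date ON tracking_events (event_type, date_happened);
""")

# Hypothetical sample rows: event_type 1 = page view, 2 = share.
rows = [(1, 10, "2024-01-01"), (1, 11, "2024-01-02"),
        (2, 10, "2024-01-02"), (1, 12, "2024-01-03")]
conn.executemany(
    "INSERT INTO tracking_events (event_type, user_id, date_happened) VALUES (?, ?, ?)",
    rows,
)

query = """
SELECT COUNT(*) FROM tracking_events
WHERE event_type = ? AND date_happened >= ?
"""
params = (1, "2024-01-02")
pageviews_since_jan2 = conn.execute(query, params).fetchone()[0]

# Ask the planner how it would run the query; the last column of each row
# is a human-readable description of the step.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
print(pageviews_since_jan2, plan)
```

The equality predicate on event_type followed by the range on date_happened matches the column order of the index, which is why putting the discriminator first tends to pay off for this kind of query.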

Related

Why do many to many relation use all the key from previous table?

When designing a database in the MySQL Workbench designer, the relation tool uses a third table to form a many-to-many relation between two tables.
I have 3 tables
TABLE1
TABLE2
TABLE3
TABLE2 has a foreign key to the primary key of TABLE1, giving a many-to-one relation.
TABLE2 and TABLE3 are related using a many-to-many relation.
As soon as I create the relation,
a new table TABLE3_has_TABLE2 is created with all the keys from TABLE2 (the primary key of TABLE2 and the foreign key to TABLE1) and TABLE3 (the primary key of TABLE3).
Now,
why is the foreign key to TABLE1 there?
Even if I remove it, I will still be able to query data from TABLE1 and TABLE3 using TABLE2 as an intermediate, so is it good to have this kind of relation, or should it be avoided?
For example, in the diagram below:
This is a geographical distribution of locations; the right side shows the hierarchy.
Now,
Table1 (zone) is the primary table.
Table2 (state) is related to table1 using zone_id.
Table3 (division) is related to table2 (state) using state_id and the zone_id of table1 (zone).
Question: Should this zone_id column be in table3 or not?
Similarly, table4 contains all the previous key columns of table3.
Strictly from a normalization point of view, the DIVISION.STATE_ZONE_ID isn't required,
since you can get the ZONE_ID for a DIVISION by joining STATE on the state_id.
The same goes for division_state_state_id and division_state_zone_id in DISTRICT:
having division_division_id is enough to join DIVISION, then STATE, then ZONE.
However, what if you removed those 'extra' fields?
Then a query always has to go through that cascade of joined tables to get ZONE.zone_name.
So the advantage of having those 'extra' fields is that it becomes possible to JOIN directly to the ZONE table, which can simplify and speed up certain popular queries.
The disadvantage is that it becomes harder to ensure referential integrity.
For example, you could assign a zone_id to DIVISION.state_zone_id that differs from the STATE.zone_id you get via DIVISION.state_state_id.
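One possible way to keep the redundant column and still guard against that mismatch is a composite foreign key on (state_id, zone_id). The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table layout is reduced to the key columns, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE zone (zone_id INTEGER PRIMARY KEY, zone_name TEXT NOT NULL);
CREATE TABLE state (
    state_id INTEGER PRIMARY KEY,
    zone_id  INTEGER NOT NULL REFERENCES zone(zone_id),
    UNIQUE (state_id, zone_id)          -- target for the composite foreign key
);
CREATE TABLE division (
    division_id    INTEGER PRIMARY KEY,
    state_state_id INTEGER NOT NULL,
    state_zone_id  INTEGER NOT NULL,
    -- The pair must exist together in state, so the zone cannot drift.
    FOREIGN KEY (state_state_id, state_zone_id) REFERENCES state(state_id, zone_id)
);
INSERT INTO zone VALUES (1, 'North'), (2, 'South');
INSERT INTO state VALUES (10, 1);
""")

def try_insert_division(division_id, state_id, zone_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO division VALUES (?, ?, ?)",
                     (division_id, state_id, zone_id))
        return True
    except sqlite3.IntegrityError:
        return False

consistent = try_insert_division(100, 10, 1)   # state 10 really is in zone 1
mismatched = try_insert_division(101, 10, 2)   # state 10 is not in zone 2

# The redundant column allows a direct join to zone, skipping state.
zone_name = conn.execute("""
    SELECT z.zone_name FROM division d
    JOIN zone z ON z.zone_id = d.state_zone_id
    WHERE d.division_id = 100
""").fetchone()[0]
print(consistent, mismatched, zone_name)
```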
It is best practice in relational models to avoid direct many-to-many relationships between two tables. Workbench usually compensates when a user tries to create one, as you have seen.
Let us use an example (or check the tl;dr), where there are two identified entities; buyers and hardware items. Some people buy 1 item, others buy more than one. The thing is, that same item can be bought by many people. So the buyer table has Mr. A buying nails. Simple enough to record in one row. But lo' and behold, he ups and gets another item! How do we show that he buys another item?
One way is to add another attribute to the table (say "item_number_two"). But then he gets another! We can't keep adding attributes like that. Databases are designed for vertical growth (adding records) rather than horizontal growth (adding attributes), to give a visual picture. There is a longer explanation that is worth reading up on, but you will probably figure it out after reading this.
Another way is to re-enter a record for Mr. A and then put the ID of another item in that column, showing that he bought two items (not really "he" from a database stand-point, it's two different people!).
A better method would be to create a table that consists of the unique identifiers found in the original tables (just one per table may be necessary). This is called an intermediary table. The original tables themselves do not have foreign keys from the other table.
This is where the concept of a composite key comes in. It means that two or more columns together, rather than a single column, are used to uniquely identify a record. This is how it works:
Person Table:
| person_ID | person_Name |
| P0001 | Mr. A |
| P0002 | Mr. B |
| P0003 | Mr. C |
| P0004 | Mr. D |
Item Table
| item_ID | item_Name |
| I0001 | Nails |
| I0002 | Screws |
| I0003 | Hammers |
| I0004 | Power-Saw |
Intermediary table
| person_ID | item_ID |
| P0001 | I0001 |
| P0001 | I0002 |
| P0001 | I0003 | //Shows that person 1 bought more than one item
| P0002 | I0004 |
| P0002 | I0001 | //Shows that an item has been bought by more than one person
So this new table matches a record of one table(through the use of a primary key) to a record of another. The only thing that will ever be repeated is one of the two ID's. A unique record is made as long as no two combinations are repeated.
tl;dr - Having tables mapped in a many to many relationship inevitably wastes space in the DB when entering records, as new records of the same data have to be made to show a small difference (adding no real value in proportion to the space). Another issue is that it causes more calculations than necessary when a query is made, wasting time and space. Or the results returned may just be plain wrong...
EDIT:
If you have tables A and B in a many-to-many relationship, do the following as an alternative. Create a table C. Take the primary keys from tables A and B and place them in table C. In table C they are each a foreign key and together form the primary key. This creates the following relationship:
| Table A |-----------<| Table C |>------------|Table B|
Table A and B are linked through C.
Sample query:
SELECT C.itemID FROM A JOIN C ON A.personID = C.personID WHERE A.personID = 'P0001';
This query will return all ID's of the items bought by the person with an ID of P0001. Records must match the condition of having a personID of P0001, but the record selected must have that matching ID in Table C (the intermediary table). An extended query could be to take the item names from the Table B. Each attribute in C has a recorded value that corresponds to a value of a key in either Table A or B, meaning that a query can be run to pull other info, where the value in Table C is = to the values in Table A/B (depending on which one you want).
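The intermediary-table approach and the "extended query" for item names could be sketched as follows. This is an illustration only: it runs on Python's built-in sqlite3 module rather than MySQL, and the table names person, item and purchase, as well as the column item_Name, are chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (person_ID TEXT PRIMARY KEY, person_Name TEXT NOT NULL);
CREATE TABLE item   (item_ID   TEXT PRIMARY KEY, item_Name   TEXT NOT NULL);
-- Intermediary table: each column is a foreign key, the pair is the primary key.
CREATE TABLE purchase (
    person_ID TEXT NOT NULL REFERENCES person(person_ID),
    item_ID   TEXT NOT NULL REFERENCES item(item_ID),
    PRIMARY KEY (person_ID, item_ID)
);
INSERT INTO person VALUES ('P0001', 'Mr. A'), ('P0002', 'Mr. B');
INSERT INTO item VALUES ('I0001', 'Nails'), ('I0002', 'Screws'),
                        ('I0003', 'Hammers'), ('I0004', 'Power-Saw');
INSERT INTO purchase VALUES ('P0001', 'I0001'), ('P0001', 'I0002'),
                            ('P0001', 'I0003'), ('P0002', 'I0004'),
                            ('P0002', 'I0001');
""")

# Names of all items bought by person P0001, resolved through the intermediary table.
items_for_p0001 = [name for (name,) in conn.execute("""
    SELECT item.item_Name
    FROM purchase
    JOIN item ON item.item_ID = purchase.item_ID
    WHERE purchase.person_ID = ?
    ORDER BY item.item_ID
""", ("P0001",))]

# The composite primary key rejects a repeated (person, item) combination.
try:
    conn.execute("INSERT INTO purchase VALUES ('P0001', 'I0001')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(items_for_p0001, duplicate_rejected)
```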

How to better build a database

We have a SQL database with a table (1) for users and a table (2) for users' saved information. Each piece of information is one row in table (2). So my question is the following: if we intend to grow the number of users to more than 1.000.000, and each user can have more than 10 pieces of information, which of the following is the better way to build our DB?
a) Having 2 tables - 1 for users and 1 for information from all users, related to users with ID
b) Having a separate table for each user.
Thanks in advance.
Definitely go with a single table for all users' information. Think about it from the DB's perspective. You are worried about the search time in a table of 1.000.000+ rows on a sorted (indexed) ID, but in the second case the database has to search through 1.000.000 tables just to get to the right one. So go for option A.
I'm going to agree that option A is the better of the two options presented.
That being said, I would personally break up the information for the users into more tables as well. This would all be connected using foreign keys and will allow for more specific querying of the information.
SQL tables do not really scale horizontally (by adding columns), so if some users have less or more information than others, you'll end up with NULL columns, and these have to be dealt with in various ways.
By using separate tables, you can still have all of the information contained, but not have to worry if one user has a home and cell phone number, while another only has a cell number.
If and when you do need to access a lot of the information at once, SQL is very good at dealing with this through joins and the like.
Option B is not bad in itself, it just does not fit SQL. It would work if the DB in question were document-based instead of table-based. In that case, creating a single document for each user is a good idea, and likely preferred.
Option C)
table for users with a unique UserID as Clustered Index (Primary Key)
table for Type of saved information with a unique InformationID as Clustered Index (Primary Key)
table for UserInformation with a unique UserInformationID as Clustered Index (Primary Key), a column for UserID (nonclustered index, foreign key to the user table) and a column for InformationID (nonclustered index, foreign key to the Information table). Have a "Value" or similar column to hold the data being saved as it relates to the type of information.
Example:
Users Table
| UserID | UserName |
| 1 | UserName1 |
| 2 | UserName2 |
Information Table
| InfoID | InfoName |
| 1 | FavoriteColor |
| 2 | FavoriteNumber |
| 3 | Birthday |
UserInformation Table
| ID | UserID | InfoID | Value |
| 1 | 1 | 1 | Blue |
| 2 | 1 | 2 | 7 |
| 3 | 1 | 3 | '11/01/1999' |
| 4 | 2 | 3 | '05/16/1960' |
This method allows for you to save any combination of values for any user without recording any of the non-supplied user information. It keeps the information table 'clean' because you won't need to keep adding columns for each new piece of information you wish to track. Just add a new record to the Info table, and then record only the values submitted to the UserInformation table.
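A minimal sketch of Option C, using the sample rows above. It runs on Python's built-in sqlite3 module as a stand-in (SQLite has no clustered/nonclustered distinction, so plain indexes are used), and the index names and the helper info_for_user are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Users (UserID INTEGER PRIMARY KEY, UserName TEXT NOT NULL);
CREATE TABLE Information (InfoID INTEGER PRIMARY KEY, InfoName TEXT NOT NULL UNIQUE);
CREATE TABLE UserInformation (
    ID     INTEGER PRIMARY KEY,
    UserID INTEGER NOT NULL REFERENCES Users(UserID),
    InfoID INTEGER NOT NULL REFERENCES Information(InfoID),
    Value  TEXT NOT NULL
);
CREATE INDEX idx_userinfo_user ON UserInformation (UserID);
CREATE INDEX idx_userinfo_info ON UserInformation (InfoID);
INSERT INTO Users VALUES (1, 'UserName1'), (2, 'UserName2');
INSERT INTO Information VALUES (1, 'FavoriteColor'), (2, 'FavoriteNumber'), (3, 'Birthday');
INSERT INTO UserInformation (UserID, InfoID, Value) VALUES
    (1, 1, 'Blue'), (1, 2, '7'), (1, 3, '11/01/1999'), (2, 3, '05/16/1960');
""")

def info_for_user(user_id):
    """Return {InfoName: Value} for one user; only supplied values appear."""
    rows = conn.execute("""
        SELECT i.InfoName, ui.Value
        FROM UserInformation ui
        JOIN Information i ON i.InfoID = ui.InfoID
        WHERE ui.UserID = ?
    """, (user_id,))
    return dict(rows)

print(info_for_user(1))
print(info_for_user(2))
```

Note that every Value is stored as text here, so the application has to interpret numbers and dates itself; that is a trade-off of this layout rather than something the sketch solves.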

Several many-to-many relationships to one table

My database has several categories to which I want to attach user-authored text "notes". For instance, an entry in a high level table named jobs may have several notes written by the user about it, but so might a lower level entry in sub_projects. Since these notes would all be of the same format, I'm wondering if I could simplify things by having only one notes table rather than a series of tables like job_notes or project_notes, and then use multiple many-to-many relationships to link it to several other tables at once.
If this isn't a deeply flawed idea from the get go (let me know if it is!), I'm wondering what the best way to do this might be. As I see it, I could do it in two ways:
Have a many-to-many junction table for each larger category, like job_notes_mapping and project_notes_mapping, and manage the MtM relationships individually
Have a single junction table linked to either an enum or separate table for table_type, which specifies what table the MtM relationship is mapping to:
+-------------+-------------+---------------+
| note_id | table_id | table_type_id |
+-------------+-------------+---------------+
| 1 | 1 | jobs |
| 2 | 2 | jobs |
| 3 | 1 | project |
| 4 | 2 | subproject |
| ........... | ........... | ........ |
+-------------+-------------+---------------+
Forgive me if any of these are completely horrible ideas, but I thought it might be an interesting question at least conceptually.
The ideal way, IMO, would be to have a supertype of jobs, projects and subprojects - let's call it activities - on which you could define any common fact types.
For example (I'm assuming jobs, projects and subprojects form a containment hierarchy):
activities (activity PK, activity_name, begin_date, ...)
jobs (job_activity PK/FK, ...)
projects (project_activity PK/FK, job_activity FK, ...)
subprojects (subproject_activity PK/FK, project_activity FK, ...)
Unfortunately, most database schemas define unique auto-incrementing identifiers PER TABLE, which makes it very difficult to implement supertyping after data has been loaded. PostgreSQL allows sequences to be reused, which is great, but some other DBMSs (like MySQL) don't make it easy at all.
My second choice would be your option 1, since it allows foreign key constraints to be defined. I don't like option 2 at all.
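The supertype idea from this answer could be sketched roughly as below. This is a sketch under assumptions: it uses Python's built-in sqlite3 module instead of MySQL or PostgreSQL, the notes table with its columns note_id and body is hypothetical, subprojects are omitted for brevity, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE activities (activity INTEGER PRIMARY KEY, activity_name TEXT NOT NULL);
CREATE TABLE jobs (
    job_activity INTEGER PRIMARY KEY REFERENCES activities(activity)
);
CREATE TABLE projects (
    project_activity INTEGER PRIMARY KEY REFERENCES activities(activity),
    job_activity     INTEGER NOT NULL REFERENCES jobs(job_activity)
);
-- Hypothetical notes table: one real foreign key covers every kind of activity.
CREATE TABLE notes (
    note_id  INTEGER PRIMARY KEY,
    activity INTEGER NOT NULL REFERENCES activities(activity),
    body     TEXT NOT NULL
);
INSERT INTO activities VALUES (1, 'Build house'), (2, 'Foundation');
INSERT INTO jobs VALUES (1);
INSERT INTO projects VALUES (2, 1);
INSERT INTO notes (activity, body) VALUES
    (1, 'Client wants brick'), (2, 'Pour on Monday');
""")

# Notes for a job and for a project come back from the same table.
notes_by_activity = conn.execute("""
    SELECT a.activity_name, n.body
    FROM notes n
    JOIN activities a ON a.activity = n.activity
    ORDER BY n.note_id
""").fetchall()

# Unlike a "table_type" column, the foreign key rejects notes for unknown rows.
try:
    conn.execute("INSERT INTO notes (activity, body) VALUES (99, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(notes_by_activity, orphan_rejected)
```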
Unfortunately, we have ended up going with the ugliest answer to this, which is to have a notes table for every different type of entry - job_notes, project_notes, and subproject_notes. Our reasons for this were as follows:
A single junction table with a column containing the "type" of junction has poor performance since none of the foreign keys are "real" and must be manually searched. This is compounded by the fact that the Notes field contains a lot of text per entry.
A junction table per entry adds an additional table over simply having separate notes tables for every table type, and while it seems slightly prettier, it does not create substantial performance gains.
I'm not satisfied with this answer, because it seems so wasteful to effectively be duplicating the same Notes table for every job/project/subproject table that is being described. However, we haven't been able to come up with an answer that would hold up performance wise in the long term. I'll leave this open in case anyone has better recommendations for how to do this!

Using a VARCHAR vs INT in MySQL across millions of rows

I have a table that has roughly 30,000,000 rows of data sitting inside it.
The table is relatively simple:
+--------------------------------------+
| TABLE: recipe_locations |
+--------------------------------------+
| INT recipe_id (primary_key) |
| TEXT url |
| VARCHAR(128) domain (index) |
| VARCHAR(128) tag |
| INT number_ingrediants (index) |
+--------------------------------------+
Inside the tag, I am attempting to put the one main ingredient of the dish. I want to make this ingredient searchable.
The problem that I am having at the moment is that searches on the tag column take quite some time. In fact, some LIKE %...% queries can take up to ten seconds to complete, which is unacceptable for the workload that I want to push to this table.
I was wondering if it would be faster to have another table which has all of the main ingredients in it, and first search that tags table, fetching the IDs, and then do a WHERE IN on the recipe_locations table?
The only thing that I could imagine is if the search query was say, "a" (-- where there could be hundreds of thousands of matches in the tags table), then getting all of the IDs for the tags would mean doing a subquery with WHERE IN, or doing a LEFT JOIN. I would like to know if this would hamper my performance of LIKE queries as described earlier.
Searching with LIKE over a VARCHAR field with 30000000 records is probably the worst thing you could do performance-wise. Also having a TEXT field that can potentially get huge on each row as well will make it even slower. So, that table, recipe_locations, should be accessed as little as possible. If I were you, I would create two additional tables:
Table: ingrediants
ingrediant_id INTEGER AUTO_INCREMENT PRIMARY KEY
ingrediant_name VARCHAR(128)
Table recipe_ingrediants (1:n relationship, you probably want that)
recipe_id INTEGER
ingrediant_id INTEGER
(define appropriate indexes)
select
r.*
from
recipe_ingrediants ri
join
recipe_locations r on r.recipe_id=ri.recipe_id
join
ingrediants i on i.ingrediant_id=ri.ingrediant_id
where
i.ingrediant_name='SALT'
order by
something
This way the query goes over the biggest table only once. With appropriate index definitions, this would be a lot quicker than what you have now.
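To make the two-table suggestion concrete, here is a small sketch. It is an assumption-laden illustration: it runs on Python's built-in sqlite3 module rather than MySQL, the choice of a UNIQUE constraint and a composite primary key as the "appropriate indexes" is mine, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recipe_locations (recipe_id INTEGER PRIMARY KEY, url TEXT, domain TEXT);
CREATE TABLE ingrediants (
    ingrediant_id   INTEGER PRIMARY KEY,
    ingrediant_name TEXT NOT NULL UNIQUE  -- UNIQUE also gives an index for exact lookups
);
CREATE TABLE recipe_ingrediants (
    recipe_id     INTEGER NOT NULL REFERENCES recipe_locations(recipe_id),
    ingrediant_id INTEGER NOT NULL REFERENCES ingrediants(ingrediant_id),
    PRIMARY KEY (ingrediant_id, recipe_id)  -- ingredient first: we search by ingredient
);
INSERT INTO recipe_locations VALUES
    (1, 'http://example.com/a', 'example.com'),
    (2, 'http://example.com/b', 'example.com'),
    (3, 'http://example.org/c', 'example.org');
INSERT INTO ingrediants VALUES (1, 'SALT'), (2, 'BASIL');
INSERT INTO recipe_ingrediants VALUES (1, 1), (3, 1), (2, 2);
""")

# Exact-match lookup on the small ingredients table, then a keyed join
# into the large recipe_locations table.
salt_recipe_ids = [recipe_id for (recipe_id,) in conn.execute("""
    SELECT r.recipe_id
    FROM ingrediants i
    JOIN recipe_ingrediants ri ON ri.ingrediant_id = i.ingrediant_id
    JOIN recipe_locations r    ON r.recipe_id = ri.recipe_id
    WHERE i.ingrediant_name = ?
    ORDER BY r.recipe_id
""", ("SALT",))]
print(salt_recipe_ids)
```

The gain comes from replacing the LIKE '%...%' scan over 30,000,000 rows with an equality match on a short list of ingredient names; a substring search on the ingredients table itself would still scan, but over far fewer rows.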

mysql performance & index

I have 5 relation tables like
ID | FK_USER | FK_POST | DATE
Is it faster and more efficient to use separate tables for each type of relation, or to create just one table like
ID | FK_USER | FK_POST | TYPE | DATE
where TYPE is an enum, and I put an INDEX on TYPE?
Assume that I search for "subscription" (which is one of my relation types). Is it faster to use a separate table and search on it, or to use the combined table and add "where TYPE = 1" to the query string?
It is better to have a combined table with a WHERE clause on TYPE. There will not be much of a difference in performance whichever way you store the data. But when you have separate tables, adding a new type in the future means creating another table, which will add to your data management headache.