I have a table that has roughly 30,000,000 rows of data sitting inside it.
The table is relatively simple:
+-------------------------------------------+
| TABLE: recipe_locations                   |
+-------------------------------------------+
| INT          recipe_id (primary_key)      |
| TEXT         url                          |
| VARCHAR(128) domain (index)               |
| VARCHAR(128) tag                          |
| INT          number_ingrediants (index)   |
+-------------------------------------------+
In the tag column, I am attempting to put the one main ingredient of the dish. I want to make this ingredient searchable.
The problem that I am having at the moment is that searches on the tag column are taking quite some time. In fact, some LIKE '%...%' queries can take up to ten seconds to complete, which is unacceptable for the workload that I want to push to this table.
I was wondering if it would be faster to have another table which holds all of the main ingredients, search that table first to fetch the IDs, and then do a WHERE IN on the recipe_locations table?
The only concern I can imagine is if the search query were, say, "a" (where there could be hundreds of thousands of matches in the tags table): getting all of the IDs for the tags would then mean doing a subquery with WHERE IN, or doing a LEFT JOIN. I would like to know whether this would hamper the performance of the LIKE queries described earlier.
Searching with LIKE over a VARCHAR field across 30,000,000 records is probably the worst thing you could do performance-wise. Having a TEXT field that can potentially get huge on each row makes it even slower. So that table, recipe_locations, should be accessed as little as possible. If I were you, I would create two additional tables:
Table: ingrediants
    ingrediant_id   INTEGER AUTO_INCREMENT PRIMARY KEY
    ingrediant_name VARCHAR(128)
Table: recipe_ingrediants (1:n relationship, you probably want that)
    recipe_id      INTEGER
    ingrediant_id  INTEGER
(define appropriate indexes)
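For reference, a minimal sketch of those two tables as runnable MySQL DDL, including the indexes hinted at above (the engine, column sizes, and index names are assumptions):

-- Sketch only: engine and index names are assumptions.
CREATE TABLE ingrediants (
    ingrediant_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ingrediant_name VARCHAR(128) NOT NULL,
    UNIQUE KEY uk_ingrediant_name (ingrediant_name)        -- lets the name lookup use an index
) ENGINE=InnoDB;

CREATE TABLE recipe_ingrediants (
    recipe_id     INT UNSIGNED NOT NULL,
    ingrediant_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (recipe_id, ingrediant_id),
    KEY idx_ingrediant_recipe (ingrediant_id, recipe_id)   -- drives "find recipes by ingredient"
) ENGINE=InnoDB;

With those tables in place, the lookup becomes: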
select
    r.*
from
    recipe_ingrediants ri
    left join recipe_locations r on r.recipe_id = ri.recipe_id
    left join ingrediants i on i.ingrediant_id = ri.ingrediant_id
where
    i.ingrediant_name = 'SALT'
order by
    something
This way the query goes over the biggest table only once. With appropriate index definitions, this would be a lot quicker than what you have now.
I want to implement some user event tracking in my website for statistics etc.
I thought about creating a table called tracking_events that will contain the following fields:
| id             (int, primary)  |
| event_type     (int)           |
| user_id        (int)           |
| date_happened  (timestamp)     |
This table will contain a large number of rows (let's assume at least every page view is a tracked event and there are 1,000 daily visitors to the site).
Is it a good practice to create this table with the event_type field to differentiate between essentially different, yet identically structured rows?
Or would it be a better idea to make a separate table for each type? E.g.:
table pageview_events
| id             (int, primary)  |
| user_id        (int)           |
| date_happened  (timestamp)     |
table share_events
| id             (int, primary)  |
| user_id        (int)           |
| date_happened  (timestamp)     |
and so on for 5-10 tables.
(the main concern is performance when selecting rows WHERE event_type = ...)
Thanks.
It really depends. If you need to have them separated, because you will only be querying them separately, then splitting them into separate tables should be fine. That saves you from having to store an extra discriminator column.
BUT... if you need to query these sets together, as if they were a single table, it would be much easier to have them stored together, with a discriminator column.
As far as WHERE event_type = ... goes, if there are only a few distinct values with a pretty even distribution, then an index on just that column isn't going to help much. Including that column as the leading column in a multicolumn index (or indexes) is probably the way to go, if a large number of your queries will include an equality predicate on that column.
Obviously, if these tables are going to be "large", then you'll want them indexed appropriately for your queries.
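For example, the single-table layout with event_type leading a composite index might look like this (a sketch; the column types and index names are assumptions):

CREATE TABLE tracking_events (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    event_type    TINYINT UNSIGNED NOT NULL,
    user_id       INT UNSIGNED NOT NULL,
    date_happened TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_type_date (event_type, date_happened),   -- serves WHERE event_type = ? ordered by time
    KEY idx_type_user (event_type, user_id)          -- serves WHERE event_type = ? AND user_id = ?
);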
There are four regions with more than one million records total. Should I create a table with a region column or a table for each region and combine them to get the top ranks?
If I combine all four regions, none of my columns will be unique, so I will need to add an id column for my primary key. Otherwise, name, accountId, and characterId would be candidate keys; or should I just add an id column anyway?
Table:
----------------------------------------------------------------
| name | accountId | iconId | level | characterId | updateDate |
----------------------------------------------------------------
Edit:
Should I look into partitioning the table by region_id?
Because all records are related to a particular region, a single database table in 3NF (e.g. an all_regions table) containing a regionId along with the other attributes should work.
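A minimal sketch of that layout, optionally LIST-partitioned by regionId as asked about in the edit (column types and partition names are assumptions):

CREATE TABLE all_regions (
    regionId    TINYINT UNSIGNED NOT NULL,
    characterId INT UNSIGNED NOT NULL,
    accountId   INT UNSIGNED NOT NULL,
    name        VARCHAR(64) NOT NULL,
    iconId      INT UNSIGNED,
    level       SMALLINT UNSIGNED,
    updateDate  TIMESTAMP NOT NULL,
    PRIMARY KEY (regionId, characterId)   -- the partitioning column must be part of every unique key
)
PARTITION BY LIST (regionId) (
    PARTITION p_region1 VALUES IN (1),
    PARTITION p_region2 VALUES IN (2),
    PARTITION p_region3 VALUES IN (3),
    PARTITION p_region4 VALUES IN (4)
);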
The correct answer, as usual with database design, is "It depends".
First of all, (IMHO) a good primary key should belong to the database, not to the users :)
So, if accountId and characterId are user-editable or prominently displayed to the user, they should not be used for the primary key of the table(s) anyway. And using name (or any other user-generated string) for a key is just asking for trouble.
As for the regions, try to work out how the records will be used.
Will most of the queries use only a single region, or will most of them use data across regions?
Is there a possibility that the schemas for different regions might diverge?
Will there be different usage scenarios for similar data? (e.g. different phone number patterns for different regions)
Bottom line: both approaches will work; let your data tell you which approach will be more manageable.
This question has already been asked, but I've not found a single, definitive answer.
Is it better to do:
1 big table with:
user_id | attribute_1 | attribute_2 | attribute_3 | attribute_4
or 4 small tables with:
user_id | attribute_1
user_id | attribute_2
user_id | attribute_3
user_id | attribute_4
1 big table or many small tables? Each user can only have 1 value for attribute_X. We have a lot of data to store (100 million users). We are using InnoDB. Performance is really important for us (10,000 queries/s).
Thanks !
François
If you adhere to the Zero, One or Many principle, whereby there is either no such thing, one of them, or an unlimited number, you would always build properly normalized tables to track things like this.
For instance, a possible schema:
CREATE TABLE user_attributes (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    attribute_name VARCHAR(255) NOT NULL,
    attribute_value VARCHAR(255),
    UNIQUE INDEX index_user_attributes_name (user_id, attribute_name)
);
This is the basic key-value store pattern where you can have many attributes per user.
Although the storage requirements for this are higher than for a fixed-columns arrangement with perpetually frustrating names like attribute1, the cost is small enough in the age of terabyte-sized hard drives that it's rarely an issue.
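For example, reading and writing a single attribute against that table might look like this (a sketch; the user id and attribute name are made up):

-- Fetch one attribute for a user:
SELECT attribute_value
FROM user_attributes
WHERE user_id = 42 AND attribute_name = 'newsletter_opt_in';

-- Insert or update an attribute, relying on the UNIQUE (user_id, attribute_name) index:
INSERT INTO user_attributes (user_id, attribute_name, attribute_value)
VALUES (42, 'newsletter_opt_in', 'yes')
ON DUPLICATE KEY UPDATE attribute_value = VALUES(attribute_value);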
Generally you'd create a single table for this data until insertion time becomes a problem. So long as your inserts are fast, I wouldn't worry about it. At that point you would want to consider a sharding strategy to divide this data into multiple tables with an identical schema, but only if it's required.
I would imagine that would be at the ~10-50 million rows stage, but could be higher if the amount of insert activity in this table is relatively low.
Don't forget that the best way to optimize for read activity is to use a cache: The fastest database query is the one you don't make. For that sort of thing you usually employ something like memcached to store the results of previous fetches, and you would invalidate this on a write.
As always, benchmark any proposed schema at production scale.
1 big table with:
user_id | attribute_1 | attribute_2 | attribute_3 | attribute_4
will make your management easier. Otherwise you end up with too many individual lookups, which also complicates programming against the DB and increases the chance of application errors.
I am in the process of creating a second version of my technical wiki site and one of the things I want to improve is the database design. The problem (or so I think) is that to display each document, I need to join upwards of 15 tables. I have a bunch of lookup tables that contain descriptive data associated with each wiki entry such as programmer used, cpu, tags, peripherals, PCB layout software, difficulty level, etc.
Here is an example of the layout:
doc
--------------
id | author_id | doc_type_id .....
1 | 8 | 1
2 | 11 | 3
3 | 13 | 3
_
lookup_programmer
--------------
doc_id | programmer_id
1 | 1
1 | 3
2 | 2
_
programmer
--------------
programmer_id | programmer
1 | USBtinyISP
2 | PICkit
3 | .....
Since some doc IDs may have multiple entries for a single attribute (such as programmer), I have designed the DB to accommodate this. The other 10 attributes have a similar layout to the 2 programmer tables above. To display a single document article, approximately 20 tables are joined.
I used the Sphinx search engine for finding articles with certain characteristics. Essentially, Sphinx indexes all of the data (it does not store it) and returns the wiki doc IDs of interest based on the filters presented. If I want to find articles that use a certain programmer and then sort by date, MySQL has to first join ALL documents with the 2 programmer tables, then filter, and finally sort what remains by insert time. No index can help me order the filtered results (it takes a LONG time with 150k doc IDs) since the sort is done in a temporary table. As you can imagine, it gets worse very quickly the more parameters need to be filtered.
It is because I have to rely on Sphinx to return, say, all wiki entries that use a certain CPU AND programmer that leads me to believe there is a DB smell in my current setup...
Edit: it looks like I have implemented an Entity–attribute–value model.
I don't see anything here that suggests you've implemented EAV. Instead, it looks like you've assigned every row in every table an ID number. That's a guaranteed way to increase the number of joins, and it has nothing to do with normalization. (There is no "I've now added an id number" normal form.)
Pick one lookup table. (I'll use "programmer" in my example.) Don't build it like this.
create table programmer (
    programmer_id integer not null,
    programmer varchar(20) not null,
    primary key (programmer_id),
    unique key (programmer)
);
Instead, build it like this.
create table programmer (
    programmer varchar(20) not null,
    primary key (programmer)
);
And in the tables that reference it, consider cascading updates and deletes.
create table lookup_programmer (
    doc_id integer not null,
    programmer varchar(20) not null,
    primary key (doc_id, programmer),
    foreign key (doc_id) references doc (id)
        on delete cascade,
    foreign key (programmer) references programmer (programmer)
        on update cascade on delete cascade
);
What have you gained? You keep all the data integrity that foreign key references give you, your rows are more readable, and you've eliminated a join. Build all your "lookup" tables that way, and you eliminate one join per lookup table. (And unless you have many millions of rows, you're probably not likely to see any degradation in performance.)
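For example, filtering documents by programmer now touches only two tables instead of three (a sketch using the names above; sorting by d.id stands in for the insert-time ordering mentioned in the question):

select d.*
from doc d
join lookup_programmer lp on lp.doc_id = d.id
where lp.programmer = 'USBtinyISP'
order by d.id desc;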
Choice 1:
comments {commentid,replyto,comment}
//replyto will be null on many posts
Choice 2:
comments {commentid,comment}
replies {replyid, replyto, reply}
It looks like a matter of choice rather than linear benefit analysis at the moment.
The first option looks like the simpler one, but the problem is that you're building a tree structure in SQL, and SQL does not handle hierarchical data well.
Not recommended - ever
CREATE TABLE comment (
    id       INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    reply_to INTEGER UNSIGNED,
    comment  TEXT,
    CONSTRAINT FK_comment_reply_to FOREIGN KEY (reply_to) REFERENCES comment (id)
        ON UPDATE CASCADE ON DELETE CASCADE
);
Recommended - if you want a tree 2 levels deep
If you build it using 2 tables
CREATE TABLE main_post (
    id   INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body TEXT
);

CREATE TABLE reply (
    id       INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    reply_to INTEGER UNSIGNED NOT NULL,
    body     TEXT,
    CONSTRAINT FK_reply_reply_to FOREIGN KEY (reply_to) REFERENCES main_post (id)
        ON UPDATE CASCADE ON DELETE CASCADE
);
Then you are building a much simpler structure that can be easily queried in SQL because the tree is only 1 level deep.
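For example, one post and all of its direct replies come back with a single flat join (a sketch; the id value is made up):

SELECT p.id, p.body AS post_body, r.id AS reply_id, r.body AS reply_body
FROM main_post p
LEFT JOIN reply r ON r.reply_to = p.id
WHERE p.id = 42
ORDER BY r.id;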
For this reason I'd recommend choice number 2.
Alternatives for deeper trees
If you want a hierarchical structure I'd look at nested sets instead, see:
http://www.pure-performance.com/2009/03/managing-hierarchical-data-in-sql/
In fact this is not 'only' a matter of choice, but a conscious decision. Relational databases are not good at solving problems of a hierarchical nature. There have been tons of discussions, articles, and even books about that, so let's narrow the problem to your case.
The second choice would work fine ONLY if you were to allow replies to comments, and not to replies themselves, so this would be a tree with at most 2 levels. That might be OK, but if you were to do that, a better solution would be to place everything in the COMMENTS table and add two columns: THREAD_ID (all the comments with the same THREAD_ID would belong to the same thread) and SEQ_NUM (or simply a DATE, which would tell us which comment came first). A similar way of organising comments is implemented here on SO.
The first choice is quite simple and generic, but it implements recursion with all its cons. Let's stop a bit and think... note that we are actually NOT building a tree, but a 'forest'. We will have many comment threads, and every single thread will be a separate tree, each with a relatively small amount of data to organise. In that case I would add a THREAD_ID column to the COMMENTS table and use only that table (it would also be good to set a composite index on the COMMENTS table containing the THREAD_ID and COMMENTID columns, in exactly that order).
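A sketch of that layout (column types and the index name are assumptions):

CREATE TABLE comments (
    commentid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    thread_id INT UNSIGNED NOT NULL,
    replyto   INT UNSIGNED NULL,       -- NULL for top-level comments
    comment   TEXT NOT NULL,
    INDEX idx_thread_comment (thread_id, commentid)
);

-- One pass per thread; the tree itself is rebuilt in application code:
SELECT commentid, replyto, comment
FROM comments
WHERE thread_id = 42
ORDER BY commentid;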
So, based on the above, I would choose "choice 1".
The next decision should be where to do the processing and build the comment tree. I would just get all the comments for a thread from the table and organise them on the controller (MVC) side, i.e. in Java or C++. Traversing the list of comments and building the tree in main memory (using objects and pointers, or hash tables) is an easy thing. It is also a good option because of the small number of nodes (comments and replies) within one thread.
I would say it depends very much on what you're trying to achieve. From what I can understand, if you want a tree at most 2 levels deep you should go with choice 2; if you want a deeper tree, go with choice 1 with the following modification:
Choice 1: comments {commentid, toplevelcommentid (or thread: whatever parent this comment, and possibly other comments, is linked to, so you can easily recreate the structure afterwards), replyto, comment}
When displaying results, select everything whose commentid or toplevelcommentid equals that value and order by commentid, so you can recreate the structural data with a single SELECT query.
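A sketch of that single query, assuming the toplevelcommentid column described above (the id value is made up):

SELECT commentid, toplevelcommentid, replyto, comment
FROM comments
WHERE commentid = 42            -- the top-level comment itself
   OR toplevelcommentid = 42    -- every reply anywhere under it
ORDER BY commentid;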
1) Queries against the TEXT table were always 3 times slower than those against the VARCHAR table (averages: 0.10 seconds for the VARCHAR table, 0.29 seconds for the TEXT table). The difference is 100% repeatable.
CREATE TABLE varcharTable (a varchar(255) NOT NULL, PRIMARY KEY (a)) ENGINE=MyISAM;
CREATE TABLE textTable (a text NOT NULL, PRIMARY KEY (a(255))) ENGINE=MyISAM;
mysql> explain SELECT SQL_NO_CACHE count(*) from varcharTable where a LIKE "n%";
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | SIMPLE      | varcharTable | range | PRIMARY       | PRIMARY | 257     | NULL | 5882 | Using where; Using index |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
mysql> explain SELECT SQL_NO_CACHE count(*) from textTable where a LIKE "n%";
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | textTable | range | PRIMARY       | PRIMARY | 257     | NULL | 5882 | Using where |
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
The index alone satisfies the query on the VARCHAR table ("Using index" in the Extra column), but not on the TEXT table, which has to go back to the data rows.
2) Search is not required on the comments table, so it does not need to be queried by content, and since comments are long their type should be TEXT. And since it is TEXT, you cannot search on it efficiently. So put the comments (non-searchable and performance-affecting) and the replies in separate tables, so that the replies table performs well and the comments table is kept just for storage, with no searches performed on it.
Conclusion: put the comments in a separate table.