Tagging System Scalability - mysql

I have a tagging system on my website. I'm trying to figure out whether it's better to have a master tag table and another table with references to the master table, or a table with each row containing the text of the tag.
Thoughts on what would scale up better?
Thanks.

Scaling up isn't your problem, it's efficiency.
It is significantly easier for the server to search for the integer value 42 than the string value "answer", for example. It also takes up less space in storage, especially in the index table. You can then look up or join on the tag name table (I would favour looking it up separately, because then you can use Memcache or similar to store the names for even faster access).
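As a rough sketch of the normalized layout described above (table and column names are just placeholders):

CREATE TABLE tag (
    tag_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_tag_name (name)
);

CREATE TABLE item_tag (
    item_id INT UNSIGNED NOT NULL,
    tag_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id, tag_id),
    KEY idx_tag_item (tag_id)   -- supports "find all items carrying tag 42"
);

The item_tag rows contain only integers, so the indexes stay small, and the tag names can be cached by id in Memcache as suggested.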

Related

Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
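As a minimal sketch of what that might look like (hypothetical table and column names; the GUID produced by System.Guid.NewGuid() is stored in its 36-character text form):

CREATE TABLE entries (
    guid    CHAR(36) NOT NULL PRIMARY KEY,
    payload TEXT
) ENGINE=InnoDB;

-- The primary-key index turns this into a logarithmic B-tree lookup:
SELECT payload FROM entries WHERE guid = '3f2504e0-4f89-11d3-9a0c-0305e82c3301';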
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234.
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use them, be aware that GUIDs perform poorly if the table is bigger than RAM.

MySQL partitioning or NoSQL like AWS DynamoDB?

Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL database table, say links. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a combined unique index on [webpage_id, link] in the links table. I also have a column crawl_count in the same table.
Now whenever I crawl a webpage, I already know webpage_id (the foreign key to the webpages table) and I get the links on that webpage (i.e. an array of link values), for which I just do an insert-or-update query without worrying about what is already in the table.
INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
Problem:
The table grows big every day & I want to optimize the table for performance. Options I considered are,
Partitioning: Partition the table by domain. All webpages belong to a particular domain. For example: the webpage https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals belongs to the domain https://www.amazon.in/
NoSQL like DynamoDB. I have other application tables in the MySQL DB which I do not want to migrate to DynamoDB unless it's absolutely required. I have also considered a change in application logic (e.g. changing the structure of the webpages table to something like
{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}
and migrating this table to DynamoDB so I don't have a links table). But again, there is a per-record size limit in DynamoDB (400 KB). What if a record exceeds this limit?
I have read the pros & cons of both approaches. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation, but I still wanted to post this question so I can make a good decision for this scenario.
PARTITION BY domain -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, webpage_id and link as a 2-column PRIMARY KEY (which is 'unique').
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
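A hedged sketch of that layout, using the column names from the question plus a hypothetical domains table:

CREATE TABLE domains (
    domain_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    domain    VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_domain (domain)
);

CREATE TABLE links (
    webpage_id  INT UNSIGNED NOT NULL,
    link        VARCHAR(255) NOT NULL,   -- must stay within index length limits; see the URL questions below
    crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (webpage_id, link)       -- the 2-column unique key
) ENGINE=InnoDB;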
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to scan. And probably the two sets are similar in size. I don't understand the purpose of crawl_count, but it adds to the cost.
I may be able to advise further if I can see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of innodb_buffer_pool_size? Putting these together, we can discuss likely points of sluggishness.
Also, the slow log would help.
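If it isn't already on, the slow log can be enabled at runtime; the one-second threshold here is just an illustrative choice:

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- log queries taking longer than 1 second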
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?

Most efficient way to search a database with more than a billion records?

My client has a huge database containing just three fields:
Primary key (an unsigned number)
Name (multi-word text)
Description (up to 1000 varchar)
This database has over a few billion entries. I have no previous experience in handling such large amounts of data.
He wants me to design an interface using AJAX (like Google) to search this database. My queries are as slow as a turtle.
What is the best way to search text fields in such a large database? If the user types a misspelling in the interface, how can I return what he actually wanted?
If you are using FULLTEXT indexes, you're writing your queries correctly, and the speed at which the results are returned is still not adequate, you are entering territory where MySQL may simply not be sufficient for you.
You may be able to tweak settings and purchase enough RAM to make sure that your entire data set fits 100% in memory; the performance gains there can be huge.
I'd definitely recommend looking into tweaks of your MySQL configuration. We've had some silly settings in the past. Operating system defaults tend to really suck!
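For example, it is worth checking what the server is actually running with, since the defaults are often far too small for a data set this size (the variables below are the usual suspects, not a tuned recommendation):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';   -- the relevant cache for MyISAM/FULLTEXT tables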
However, if you have trouble at that point, you can:
Create a separate table containing each word (indexed) along with a record id that it refers to. This will allow you to search on single words (see the sketch after this list).
Use a different system that's optimized for solving this problem. Unless my information is now outdated, the 2 most popular engines for this are:
Sphinx
Solr / Lucene
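A rough sketch of the first option, with hypothetical names (the application splits the text into words on insert):

CREATE TABLE word_index (
    word      VARCHAR(64)  NOT NULL,
    record_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (word, record_id)
);

-- Single-word lookups then become indexed point queries:
SELECT record_id FROM word_index WHERE word = 'database';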
If your table is MyISAM then you can define a FULLTEXT index on the Name and Description fields:
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
Name VARCHAR(200),
Description TEXT,
FULLTEXT (Name,Description)
);
Then you can use queries like:
SELECT * FROM articles
WHERE MATCH (Name,Description) AGAINST ('database');
You can find more info at http://docs.oracle.com/cd/E17952_01/refman-5.0-en/fulltext-search.html
Before doing any of the above you might want to back up (or at least make a copy of) your database.
You can't. The only fast search in your scenario would be on the Primary Key since that's most likely to be the index. Text search is slow as a turtle.
In all seriousness, you have a few solutions:
If you have to stick with MySQL you'll have to redesign your schema. It's hard to give a good recommendation without knowing the requirements. One solution would be to index keywords in a separate table.
Another solution is to switch to a different search engine, you can find suggestions in other questions here such as: Fast SQL Server search on 40M text records

store TEXT/BLOB in same table or not?

While searching through SO, I've found two contradicting answers (and even a comment pointing that out) but no definitive answer:
The problem is: is there any performance benefit, if you store a TEXT/BLOB field outside of a table?
We assume:
You SELECT correctly (only selecting the TEXT/BLOB if required, no SELECT *)
Tables are indexed properly, where it makes sense (so it's not a matter of 'if you index it')
The database design doesn't really matter. This is a question to identify the MySQL behaviour in this special case, not to solve certain database design problems. Let's assume this database has only one table (or two, if the TEXT/BLOB gets separated)
used engine: innoDB (others would be interesting too, if they fetch different results)
This post states that putting the TEXT/BLOB into a separate table only helps if you're already SELECTing in a wrong way (always SELECTing the TEXT/BLOB even when it's not necessary) - basically stating that TEXT/BLOB in the same table is the better solution (less complexity, no performance hit, etc.) since the TEXT/BLOB is stored separately anyway:
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
MySQL Table with TEXT column
This post however, states that:
When a table has TEXT or BLOB columns, the table can't be stored in memory
Does that mean that merely having a TEXT/BLOB column in a table is enough to cause a performance hit?
MySQL varchar(2000) vs text?
My Question basically is: What's the correct answer?
Does it really matter if you store TEXT/BLOB into a separate table, if you SELECT correctly?
Or does even having a TEXT/BLOB inside a table, create a potential performance hit?
Update: Barracuda is the default InnoDB file format since version 5.7.
If available on your MySQL version, use the InnoDB Barracuda file format using
innodb_file_format=barracuda
in your MySQL configuration and set up your tables using ROW_FORMAT=Dynamic (or Compressed) to actually use it.
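For example (hypothetical table name), either when creating the table or when converting an existing one:

CREATE TABLE documents (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;

ALTER TABLE documents ROW_FORMAT=DYNAMIC;   -- rebuilds an existing table with the new row format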
This will make InnoDB store BLOBs, TEXTs, and bigger VARCHARs outside the row pages, and thus make it a lot more efficient. See this MySQLperformanceblog.com blog article for more information.
As far as I understand it, using the Barracuda format makes moving TEXT/BLOB/VARCHARs into separate tables unnecessary for performance reasons. However, I think it's always good to keep proper database normalization in mind.
One performance gain is to have a table with fixed length records. This would mean no variable length fields like varchar or text/blob. With fixed length records, MySQL doesn't need to "seek" the end of a record since it knows the size offset. It also knows how much memory it needs to load X records. Tables with fixed length records are less prone to fragmentation since space made available from deleted records can be fully reused. MyISAM tables actually have a few other benefits from fixed length records.
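As an illustration (assumed names), a MyISAM table with only fixed-width columns gets the Fixed row format, so any row can be located by a simple offset calculation:

CREATE TABLE fixed_rows (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    code CHAR(10) NOT NULL       -- CHAR rather than VARCHAR keeps the row fixed-width
) ENGINE=MyISAM;

SHOW TABLE STATUS LIKE 'fixed_rows';   -- Row_format will report 'Fixed'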
Assuming you are using innodb_file_per_table, keeping the text/blob in a separate table will increase the likelihood that the file system caching will be used, since the table will be smaller.
That said, this is a micro optimization. There are many other things you can do to get much bigger performance gains. For example, use SSD drives. It's not going to give you enough of a performance boost to push out the day of reckoning when your tables get so big you'll have to implement sharding.
You don't hear about databases using the "raw file system" anymore even though it can be much faster. "Raw" is when the database accesses the disk hardware directly, bypassing any file system. I think Oracle still supports this. But it's just not worth the added complexity, and you have to really know what you are doing. In my opinion, storing your text/blob in a separate table just isn't worth the added complexity for the possible performance gain. You really need to know what you are doing, and your access patterns, to take advantage of it.

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc., and the number of attributes may increase in the future. The purpose is to display this table to the user, and he should be able to quickly filter/sort the data (so the table structure should support fast querying & sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns; I have to keep in mind that there might be a few null values and that I may also need to add more columns in the future)
website_attributes_change_log
same table structure as above with an added column for "change_update_time"
I feel the advantage of this schema is that the queries will be easy to write even when some attributes are linked to other tables, and sorting will also be simple. The disadvantage, I guess, is that adding columns later can be problematic, with ALTER TABLE taking very long to run on large tables, plus there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage here seems to be the flexibility of this approach, in that I can add columns whenever I need to, and I also save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when needing to display the tables [since I will need to display records for multiple sites at a time and there will also be cross-referencing of values with other tables for certain attributes], plus sorting the data might be difficult [given that this is not a column-based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
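To make the awkwardness concrete, here is a rough sketch (not a recommendation) of what reproducing just two columns of the sample output would look like under Option B, using the tables defined in the question:

SELECT wa.website_id,
       MAX(CASE WHEN f.attribute_name = 'page_size'
                THEN CAST(wa.attribute_value AS UNSIGNED) END) AS page_size,
       MAX(CASE WHEN f.attribute_name = 'word_count'
                THEN CAST(wa.attribute_value AS UNSIGNED) END) AS word_count
FROM website_attributes wa
JOIN website_attribute_fields f ON f.attribute_id = wa.attribute_id
GROUP BY wa.website_id
ORDER BY page_size;   -- every extra attribute means another CASE expression

With Option A the same result is a plain SELECT with an ORDER BY on an indexed column.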
Option A is the better way. Though ALTER TABLE may take a long time when adding an extra column, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE did not take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. When you are using Option A you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want faster queries, you should definitely index your database to get fast results.
I think Option A is not a good design. When you design a good data model, you should not need to change the tables in the future. If you are fluent in SQL, writing queries for Option B will not be difficult. It is also the solution to your real problem: you need to store some attributes (an open-ended set, not a final list) of some webpages, therefore an entity representing those attributes should exist.
Use Option A, as the attributes are fixed. It will be difficult to query and process data from the second model, since queries will be based on multiple attributes.