SQL: separating big fields to speed up queries - mysql

Let's say I have a table BOOK:
BOOK_ID INT(6) PK
--------------------
FILE_EXTENSION VARCHAR(5)
TITLE VARCHAR(60)
LANGUAGE VARCHAR(10)
EDITION INT(2)
PUBLISHMENT_OFFICE_ID INT(4)
PUBLISH_YEAR INT(4)
RATING INT(1)
FILE_UPLOAD_DATE DATE
LINK VARCHAR(150)
This table is meant to be used both for searching books (e.g. by extension, by publishment office, by authors (from other tables), etc.) and for full visualization (printing all books with all these fields on a page).
So the question is: for example, if I do
SELECT BOOK_ID FROM BOOK WHERE FILE_EXTENSION = 'PDF'
will this load all the big fields (link, title, and maybe a planned BLOB) as an intermediate result, or will the unnecessary fields be discarded as soon as the WHERE clause is evaluated, with no performance penalty?
This leads to a possible solution: move the big fields into a separate table with the same PK, which would slow down visualization (because a JOIN is needed) but speed up the search. Is it worth it?
P.S. This particular DB is not meant to hold a really big amount of data, so my queries (I hope) won't be that slow. But this question is about general database design (let's say 10^8 entries).
P.P.S. Please don't link me to database normalization (my full DB is normalized properly).

Columns are stored as part of their row. Rows are stored as part of a Page. If you need one column from one row, you need to read the whole row; in fact you read the whole page that row is in. That can easily be dozens or hundreds of rows, including all of their columns. Hopefully that page also has other rows you are interested in and the read isn't wasted.
That's why Columnar databases are becoming so popular for analytics. They store columns separately. They still store the values in Pages. So you read thousands of rows off the disk for that column, but in analytics you're likely to be interested in all or most of those rows. This way you can have hundreds of columns, but only ever read the columns you're querying.
MySQL doesn't have a columnar storage engine, so you need an alternative.
The first is to keep your large fields in a separate table, which you've already alluded to.
Second, you can use a covering index.
If you index (file_extension, book_id), the query SELECT book_id FROM book WHERE file_extension = 'pdf' can be satisfied just by reading the index. It never needs to read the table itself. (Indexes are still stored as pages on disk, but they contain only the columns the index relates to, plus potentially a row pointer. Much narrower than the table.)
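For illustration, a minimal sketch of such a covering index against the table from the question (the index name is just an example):
CREATE INDEX idx_book_ext_id ON BOOK (FILE_EXTENSION, BOOK_ID);

-- EXPLAIN should report "Using index" for this query,
-- meaning it is answered from the index alone:
EXPLAIN SELECT BOOK_ID FROM BOOK WHERE FILE_EXTENSION = 'PDF';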
That's a bit clunky though, because a covering index only helps for the columns you know in advance you'll be interested in.
In practice, your fields are small enough to not warrant this attention until it actually becomes a problem. It would be wise to store BLOBs in a separate table though.
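If you do add BLOBs later, a rough sketch of keeping them in a separate 1:1 table (table and column names are invented for the example):
CREATE TABLE BOOK_FILE (
    BOOK_ID INT NOT NULL PRIMARY KEY,   -- same PK as BOOK, 1:1 relationship
    FILE_DATA LONGBLOB NOT NULL,
    FOREIGN KEY (BOOK_ID) REFERENCES BOOK (BOOK_ID)
) ENGINE=InnoDB;

-- Search and listing queries never touch BOOK_FILE; fetch the blob
-- only when a single book's file is actually requested:
SELECT FILE_DATA FROM BOOK_FILE WHERE BOOK_ID = 123;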

"Columns are stored as part of their row." -- Yes and no. All the 'small' columns are stored together in the row. But TEXT and BLOB, when 'big', are stored elsewhere. (This assumes ENGINE=InnoDB.)
SELECT book_id FROM ... WHERE ext = 'PDF' would benefit from INDEX(ext, book_id). Without such, the query necessarily scans the entire table (100M rows?). With that index, it will be very efficient.
"print on page all books with all these fields" -- Presumably this excludes the bulky columns? In that case SELECT book_id versus SELECT all-these-fields will cost about the same. This is a reasonable thing to do on a web page -- if you are not trying to display thousands of books on a single page. That becomes a "bad UI" issue, more than an "inefficient query" issue.
title and link are likely to come under the heading of "small" in my discussion above. But any BLOBs are very likely to be "big".
Yes, it is possible to do "vertical partitioning" to split out the big items, but that is mostly repeating what InnoDB is already doing. Don't bother.
100M rows is well into the arena where we should discuss these things. My comments so far only touch the surface. To dig deeper, we need to see the real schema and some of the important queries. I expect some queries to be slow. With 100M rows, improving one query sometimes hurts another query.

Related

Do text or blob fields slow down access to the table?

I have a table that contains a text field, a blob field and some others. I'm wondering if using these kinds of data types would slow down access to the table.
CREATE TABLE post
(
    id INT(11),
    person_id INT(11),
    title VARCHAR(120),
    date DATETIME,
    content TEXT,
    image BLOB
);
Let's say I have more than 100,000 posts and I want to do a query like
SELECT * FROM post WHERE post.date >= ? AND post.person_id = ?
Would the query be faster if the table did not contain TEXT and BLOB fields?
Yes or no.
If you don't fetch the text/blob fields, they don't slow down SELECTs. If you do, then they slow things down in either or both of these ways:
In InnoDB, TEXT and BLOB data, if large enough, is stored in a separate area from the rest of the columns. This may necessitate an extra disk hit. (Or may not, if it is already cached.)
In complex queries (more complex than yours), the Optimizer may need to make a temporary table. Typical situations: GROUP BY, ORDER BY and subqueries. If you are fetching a text or blob, the temp table cannot be MEMORY, but must be the slower MyISAM.
But the real slowdown is that you probably do not have this composite index: INDEX(person_id, date). Without it, the query might choose to gather up the text/blob (buried in the *) and haul it around, only to discard it later.
Action items:
Make sure you have that composite index (see the sketch after this list).
If you don't need content for this query, don't use *.
If you need a TEXT or BLOB, use it; the alternatives tend to be no better. Using "vertical partitioning" ("splitting the table", as mentioned by @changepicture) is no better in the case of InnoDB. (It was a useful trick with MyISAM.) InnoDB is effectively "doing the split for you".
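As a sketch of the first two action items applied to the example table (the index name and the selected columns are illustrative):
ALTER TABLE post ADD INDEX idx_person_date (person_id, date);

-- Fetch only the columns the page actually needs instead of SELECT *:
SELECT id, title, date
FROM post
WHERE person_id = ? AND date >= ?;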
In my opinion, the short answer is yes. But there's more to it of course.
If you have good indexes then MySQL will locate the data very fast, but because the data is big it will take longer to send it.
In general, smaller tables and numeric column types give better performance.
And never do SELECT *; it's bad practice in general and in your case it's worse. What if you only need the title and date? Instead of transferring a little data, you transfer it all.
Consider splitting the table: metadata in one table, and content and image in another. This way going through the first table is very fast, and you only access the second table when you actually need its data. You will have a one-to-one relationship between the two tables.
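A hedged sketch of that split (table and index names are invented for the example):
-- Metadata only: small rows, cheap to filter and scan
CREATE TABLE post_meta (
    id INT NOT NULL PRIMARY KEY,
    person_id INT NOT NULL,
    title VARCHAR(120),
    date DATETIME,
    INDEX idx_person_date (person_id, date)
) ENGINE=InnoDB;

-- Bulky content, 1:1 with post_meta, fetched only when a single post is shown
CREATE TABLE post_content (
    id INT NOT NULL PRIMARY KEY,
    content TEXT,
    image BLOB,
    FOREIGN KEY (id) REFERENCES post_meta (id)
) ENGINE=InnoDB;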

Proper database design for Big Data

I have a huge number of tables for each country. I want multiple comment-related fields for each so that users can make comments on my website. I might have a few more fields like: the date the comment was created, and the user_id of the commenter. I might also need to add other fields in the future, for example company_support_comment/support_rating, company_professionalism_comment.
Let's say I have 1 million companies in one table and 100 comments per company. Then I get lots of comments just for one country, and overall it will easily exceed 2 billion.
Unsigned BIGINT can support 18,446,744,073,709,551,615 values, so we could have that many comments in one table. Unsigned INT gives us 4.2+ billion, which won't be enough in one table.
However, imagine querying a table with 4 billion records. How long would that take? I might not be able to retrieve the comments efficiently, and it would put a huge load on the database. So in practice one table probably can't be done.
Multiple tables might also be bad, unless we just use JSON data...
Actually I'm not sure now. I need a proper solution for my database design. I am using MySQL now.
Your question goes in the wrong direction, in my view.
Start with your database design. That means go with bigints to start with if you are concerned about it (because converting from int to bigint is a pain if you get that wrong). Build a good, normalized schema. Then figure out how to make it fast.
In your case, PostgreSQL may be a better option than MySQL because your query is going to likely be against secondary indexes. These are more expensive on MySQL with InnoDB than PostgreSQL, because with MySQL, you have to traverse the primary key index to retrieve the row. This means, effectively, traversing two btree indexes to get the rows you are looking for. Probably not the end of the world, but if performance is your primary concern that may be a cost you don't want to pay. While MySQL covering indexes are a little more useful in some cases, I don't think they help you here since you are interested, really, in text fields which you probably are not directly indexing.
In PostgreSQL, you have a btree index which then gives you a series of (page, tuple) pointers, which allow you to look up the data effectively with random access. This would be a win with such a large table, and my experience is that PostgreSQL can perform very well on large tables (tables spanning, say, 2-3 TB in size including their indexes).
However, assuming you stick with MySQL, careful attention to indexing will likely get you where you need to go. Remember you are only pulling up 100 comments for a company and traversing an index has O(log n) complexity so it isn't really that bad. The biggest issue is traversing the pkey index for each of the rows retrieved but even that should be manageable.
4 billion records in one table is not a big deal for a NoSQL database. Even for a traditional database like MySQL, if you build your secondary indexes correctly, searching them will be quick (traversing a B-tree-like data structure takes O(log n) disk visits).
And for faster access you need a front-end cache for your hot data, like Redis or memcached.
Recalling your current situation: you are not sure what fields will be needed, so the only real choice is a NoSQL solution, since fields (columns) can be added in the future when they are needed.
(From a MySQL perspective...)
One table for companies: INT UNSIGNED will do. One table for comments: BIGINT UNSIGNED may be necessary. You won't fetch hundreds of comments for display at once, will you? Unless you take care of the data layout, 100 comments could easily mean 100 random disk hits, which (on cheap disk) would take about 1 second.
You must have indexes (this mostly rules out NoSQL); otherwise searching for records would be painfully slow.
CREATE TABLE Comments (
    comment_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    company_id INT UNSIGNED NOT NULL,
    ts TIMESTAMP,
    ...
    PRIMARY KEY(company_id, comment_id, ts),  -- to get clustering and ordering
    INDEX(comment_id)                         -- to keep AUTO_INCREMENT happy
    ...
) ENGINE=InnoDB;
If you paginate the display of the comments, use the tips in "remember where you left off". That will make fetching comments about as efficient as possible.
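A rough sketch of that "remember where you left off" pattern against the Comments table above (the second placeholder is the last comment_id already shown to the user):
SELECT comment_id, ts
FROM Comments
WHERE company_id = ?
  AND comment_id > ?      -- last comment_id from the previous page
ORDER BY comment_id
LIMIT 10;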
As for Log(n) -- With about 100 items per node, a billion rows will have only 5 levels of BTree. This is small enough to essentially ignore when worrying about timing. Comments will be a terabyte or more? And your RAM will be significantly less than that? Then, you will generally have non-leaf nodes cached, but leaf nodes (where the data is) not cached. There might be several comment rows per leaf node consecutively stored. Hence, less than 100 disk hits to get 100 comments for display.
(Note: When the data is much bigger than RAM, 'performance' degenerates into 'counting the disk hits'.)
Well, you mentioned comments. What about other queries?
As for "company_support_comment/support_rating..." -- The simplest would be to add a new table(s) when you need to add those 'columns'. The basic Company data is relatively bulky and static; ratings are relative small but frequently changing. (Again, I am 'counting the disk hits'.)

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same, and I will update/select the same number of columns in both cases.)
Example: I have a database to store user details and their last-active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id, f_name, l_name, email, mobile, verified_status). Is it a good idea to store the last-active time in the same table? Or is it better to make a separate table (say, user_active) to store the last-activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication, as the user_active table's columns would be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question. The structure of your second table is not correct. You would not (normally) include fname in two tables -- that is data redundancy and causes all sort of other problems. There is a legitimate question whether you should store a table of all activity and use that table for the display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL traverses a B-tree, so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).
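A minimal sketch of the single-table approach with an index to keep the "active users" query cheap (the column types, index and time window are assumptions for the example):
CREATE TABLE userinfo (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    f_name VARCHAR(50),
    l_name VARCHAR(50),
    email VARCHAR(100),
    mobile VARCHAR(20),
    verified_status TINYINT,
    last_active TIMESTAMP,
    INDEX idx_last_active (last_active)
) ENGINE=InnoDB;

-- "Active users and their names", without touching the other columns:
SELECT id, f_name, l_name
FROM userinfo
WHERE last_active >= NOW() - INTERVAL 15 MINUTE;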

MySQL 5.5: Which of the following is better storage for a text/varchar field in InnoDB?

Requirement :
Page#1 -> Display users and 1-2 line preview of their latest 10 blog posts
Page#2 -> Display single blogpost with full text.
Method 1:
MySQL table ->
    userid -> varchar 50
    post_id -> integer
    post_title -> varchar 100
    post_description -> varchar 10000
For page#1: select user_id, post_title, post_description from blog_table,
and a substring of post_description is used to show the preview in the listing.
For page#2: select user_id, post_title, post_description where post_id = N
Method 2:
MySQL table ->
    userid -> varchar 50
    post_id -> integer
    post_title -> varchar 100
    post_brief -> varchar 250
    post_description -> text
For page#1: select user_id, post_title, post_brief from blog_table.
For page#2: select user_id, post_title, post_description where post_id = N
Is storing two columns, one brief preview as VARCHAR and one full text as TEXT (since the TEXT is stored separately and should be queried only when needed), worth the performance benefit?
Method 2 will store only a pointer to the text in the row, whereas Method 1 will store the full 10K VARCHAR string in the row. Does this affect the amount of table data that can reside in RAM, and hence the read performance of queries?
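For reference, a minimal sketch of the Method 2 layout as an actual table definition (the key and engine choices are assumptions, not part of the original question):
CREATE TABLE blog_table (
    post_id INT NOT NULL PRIMARY KEY,
    userid VARCHAR(50) NOT NULL,
    post_title VARCHAR(100),
    post_brief VARCHAR(250),
    post_description TEXT
) ENGINE=InnoDB;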
The performance of SQL queries mostly depends on JOINs, WHERE clauses, GROUP BYs and ORDER BYs, not on the columns retrieved. The columns only have a noticeable effect on the query's speed if significantly more data is retrieved which might have to go over a network to be processed by your programming language. That is not the case here.
Short answer: The difference in performance between the two proposed setups is likely to be very small.
For good speed, your post_id column should have a (unique) index. You are not selecting, sorting or grouping by any other column, so data can come straight from the table, which is a very fast process.
You are talking about "pages" here, so I'm guessing those are going to be presented to users - it seems unlikely that you want to show a table of thousands of blog posts on the same page to a human, therefore you probably do have ORDER BY and/or LIMIT clauses in your statements that you didn't include in your question.
But let's look a bit deeper into this whole thing. Let's assume we are actually reading tons of TEXT columns directly from hard disk - wouldn't we hit the drive's maximum reading speed? Wouldn't retrieving just a VARCHAR(250) be faster, especially since it saves you the extra LEFT() call?
We can get the LEFT() call off the table real quick. String functions are really fast - after all, it is just the CPU cutting off some of the data, which is a really fast process. The only times they produce a noticeable delay is when they're used in WHERE clauses, JOINs etc., but that is NOT because those functions are slow; it is because they have to be run lots (possibly millions) of times in order to produce even a single row of results, and even more so because those uses often prevent the database from using its indexes properly.
So in the end it comes down to: how fast can MySQL read the table contents from the database. And that in turn depends on the storage engine you are using and its settings. MySQL can use a number of storage engines, including (but not limited to) InnoDB and MyISAM. Both of these engines offer different file layouts for large objects such as TEXT or BLOB columns (but funnily enough, also VARCHARs). If the TEXT column is stored in a different page than the rest of the row, the storage engine has to retrieve two pages for every row. If it is stored along with the rest, it'll be just one page. For sequential processing this could be a major change in performance.
Here's a bit of background reading on that:
Blob Storage in InnoDB
MyISAM Dynamic vs. Compressed Data File Layouts
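For InnoDB specifically, whether long values live inline or off-page is governed by the row format. A hedged sketch (exact thresholds depend on version and settings; on 5.5, DYNAMIC also requires innodb_file_per_table and the Barracuda file format):
-- COMPACT (the 5.5 default) keeps a 768-byte prefix of each long TEXT/BLOB in the row
ALTER TABLE blog_table ROW_FORMAT=COMPACT;

-- DYNAMIC stores only a 20-byte pointer in the row when the value is long
ALTER TABLE blog_table ROW_FORMAT=DYNAMIC;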
Long Answer: It depends :)
You would have to do a number of benchmark tests on your own hardware to actually make the call as to which layout is actually quicker. Given that the second setup introduces redundancy with its additional column, it is likely to perform worse in most scenarios. It will perform better if - and only if - the table structure allows the shorter VARCHAR column to fit into the same page on disk while the long TEXT column would be on another page.
Edit: More on TEXT columns and performance
There seems to be a common misconception about BLOBs and in-memory processing. Quite a number of pages (including some answers here on StackOverflow - I'll try to find them, and give an additional comment) state that TEXT columns (and all other BLOBs) cannot be processed in memory by MySQL, and as such are always a performance hog. That is not true. What's really happening is this:
If you run a query that involves a TEXT column and that query needs a temporary table to be processed, then MySQL will have to create that temporary table on disk rather than in memory, because MySQL's MEMORY storage engine cannot handle TEXT columns. See this related question.
The MySQL documentation states this (the paragraph is the same for all versions from 3.2 through 5.6):
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.3.3, “How MySQL Uses Internal Temporary Tables”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
It is the last sentence that confuses people - because it is just a bad example. A simple SELECT * will not be affected by this performance problem, because it won't use a temporary table. If the same select were, for example, ordered by a non-indexed column, it would then have to use a temporary table and would be affected by this problem. Use the EXPLAIN command in MySQL to find out whether a query will need a temporary table or not.
By the way: None of this affects caching. TEXT columns can be cached just like anything else. Even if a query needed a temporary table and that had to be stored on disk, the result could still be cached if the system had the resources to do so, and the cache is not invalidated. In this regard, a TEXT column is just like anything else.
Edit 2: More on TEXT columns and memory requirements ...
MySQL uses the storage engine to retrieve records from disk. It will then buffer the results and hand them sequentially to the client. The following assumes that this buffer ends up in memory and not on disk (see above for why).
For TEXT columns (and other BLOBs), MySQL will buffer a pointer to the actual BLOB. Such a pointer uses only a few bytes of memory, but requires the actual TEXT content to be retrieved from disk when the row is handed to the client.
For VARCHAR columns (and everything else but BLOBs), MySQL will buffer the actual data. This will usually use more memory, because most of your texts are going to be more than just a few bytes.
For calculated columns, MySQL will also buffer the actual data, just like with VARCHARs.
A couple of notes on this: Technically, the BLOBs will also be buffered when they are handed over to the client, but only one at a time - and for large BLOBs possibly not in its entirety. Since this buffer gets freed after each row, this does not have any major effect. Also, if a BLOB is actually stored in the same page as the rest of the row, it may end up being treated like VARCHARs. To be honest, I've never had the requirement to return lots of BLOBs in a single query, so I never tried.
Now lets actually answer the (now edited) question:
Page #1. Overview of users and short blog post snippets.
Your options are pretty much these queries
SELECT userid, post_title, LEFT(post_description, 250) FROM `table_method_1` <-- calculated based on a VARCHAR column
SELECT userid, post_title, LEFT(post_description, 250) FROM `table_method_2` <-- calculated based on the TEXT column
SELECT userid, post_title, post_brief FROM `table_method_2` <-- precalculated VARCHAR column
SELECT userid, post_title, post_description FROM `table_method_2` <-- return the full text, let the client produce the snippet
The memory requirements of the first three are identical. The fourth query will require less memory (the TEXT column will be buffered as a pointer,) but more traffic to the client. Since traffic usually is over a network (expensive in terms of performance,) this tends to be slower than the other queries - but your mileage may vary. The LEFT() function on the TEXT column might be sped up by telling the storage engine to use an inlined table layout, but this will depend on the average length of text being stored.
Page #2. A single blog post
SELECT userid, post_title, post_description FROM `table_method_1` WHERE post_id=... <-- returns a VARCHAR
SELECT userid, post_title, post_description FROM `table_method_2` WHERE post_id=... <-- returns a TEXT
The memory requirements are low to begin with, since only one single row will be buffered. For the reasons stated above the second will require a tiny bit less memory to buffer the row, but some additional memory to buffer a single BLOB.
In either case, I'm pretty sure you're not concerned with the memory requirements of a select that'll only return a single row, so it does not really matter.
Summary
If you have text of arbitrary length (or anything that requires more than a few kilobytes), you should use TEXT columns. That's what they're there for. The way MySQL handles those columns is beneficial most of the time.
There are only two things to remember for everyday use:
Avoid selecting TEXT columns, BLOB columns and all other columns that may have lots of data (and yes, that includes a VARCHAR(10000)) if you don't actually need them. The habit of "SELECT * FROM whatever" when all you need is a couple of values will put a lot of unnecessary stress on the database.
When you are selecting TEXT columns or other BLOBs, make sure the select does not use a temporary table. Use the EXPLAIN syntax when in doubt.
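For example (a sketch against the blog_table from the question; whether a temporary table actually appears depends on the query and the indexes):
-- Run EXPLAIN on the query in question and check the Extra column:
-- "Using temporary" means a temporary table is needed, and with a
-- TEXT/BLOB in the select list that table has to be created on disk.
EXPLAIN
SELECT userid, post_title, post_description
FROM blog_table
ORDER BY post_title;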
When you stick to those rules, you should get fairly decent performance from MySQL. If you need further optimization than that, you'll have to look at the finer details. This will include storage engines and their respective table layouts, statistical information about the actual data, and knowledge about the hardware involved. From my experience, I could usually get rid of performance hogs without having to dig that deep.
Method 2 looks better, but if you are storing HTML there, post_brief could also be a TEXT column; if it's pure text you could store everything in one column and use
SELECT user_id, post_title, LEFT(post_description,255) AS post_brief FROM blog_table.
Consider MySQL 5.6; it is much faster and you can use a FULLTEXT index with InnoDB, which will help a lot if you need to search posts.
Option 2 looks good to me as well. As the blog post is going to be huge, applying a function to that column will also take time.
And if you ask me, the data type of post_description should be BLOB/TEXT. Even though BLOB columns don't support searching, that would be the better option.
The only disadvantage of having two columns is that you have to make sure the description and the brief stay in sync (maybe you can turn that into a feature too).

MySQL LIKE alternative

Is there an alternative to LIKE? Note that I cannot use full-text search.
Here is my MySQL code:
SELECT *
FROM question
WHERE content LIKE '%$search_each%'
OR title LIKE '%$search_each%'
OR summary LIKE '%$search_each%'
Well, MySQL has regular expressions but I would like to ask you what the problem is with multiple LIKEs.
I know it won't scale well when tables get really large but that's rarely a concern for the people using MySQL (not meaning to be disparaging to MySQL there, it's just that I've noticed a lot of people seem to use it for small databases, leaving large ones to the likes of Oracle, DB2 or SQLServer (or NoSQL where ACID properties aren't so important)).
If, as you say:
I plan to use it for really large sites.
then you should avoid LIKE altogether. And, if you cannot use full text search, you'll need to roll your own solution.
One approach we've used in the past is to use insert/update/delete triggers on the table to populate yet another table. The insert/update trigger should:
evaluate the string in question;
separate it into words;
throw away inconsequential words (all-numerics, noise words like 'at', 'the', 'to', and so on); then
add those words to a table with a marker to the row in the original table.
Then use that table for searching, almost certainly much faster than multiple LIKEs. It's basically a roll-your-own sort-of-full text search where you can fine-tune and control what actually should be indexed.
The advantage of this is speed during the select process with a minor cost during the update process. Keep in mind this is best for tables that are read more often than written (most of them) since it amortises the cost of indexing the individual words across all reads. There's no point in incurring that cost on every read, better to do it only when the data changes.
And, by the way, the delete trigger will simply delete all entries in the indexing table which refer to the real record.
The table structures would be something like:
Comments:
    id int
    comment varchar(200)
    -- others.
    primary key (id)
Words:
    id int
    word varchar(50)
    primary key (id)
    index (word)
WordsInComments:
    wordid int
    commentid int
    primary key (wordid, commentid)
    index (commentid)
Setting the many-to-many relationship to id-id (i.e., separate Words and WordsInComments tables) instead of id-text (combining them into one) is the correct thing to do for third normal form but you may want to look at trading off storage space for speed and combining them, provided you understand the implications.
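As a rough illustration, a lookup against those tables would then look something like this (a sketch; it finds the comments containing a given search word):
SELECT c.id, c.comment
FROM Words w
JOIN WordsInComments wic ON wic.wordid = w.id
JOIN Comments c ON c.id = wic.commentid
WHERE w.word = 'mysql';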