Proper database design for Big Data - mysql

I have a huge number of tables for each country. I want multiple comment-related fields for each so that users can make comments on my website. I might have a few more fields like: date when the comment was created, user_id of the commenter. I might also need to add other fields in the future. For example, company_support_comment/support_rating, company_professionalism_comment.
Let's say I have 1 million companies in one table and 100 comments per company. Then I get lots of comments for just one country, and overall it will easily exceed 2 billion.
Unsigned BIGINT can support 18,446,744,073,709,551,615, so we could have that many comments in one table. Unsigned INT gives us 4.2+ billion, which won't be enough for one table.
However, imagine querying a table with 4 billion records. How long would that take? I might not be able to retrieve the comments efficiently, and it would put a huge load on the database. Given that, a single table probably can't work in practice.
Multiple tables might also be bad, unless we just use JSON data.
Actually, I'm not sure now. I need a proper solution for my database design. I am using MySQL now.

Your question goes in the wrong direction, in my view.
Start with your database design. That means go with bigints to start with if you are concerned about it (because converting from int to bigint is a pain if you get that wrong). Build a good, normalized schema. Then figure out how to make it fast.
In your case, PostgreSQL may be a better option than MySQL, because your queries are likely going to be against secondary indexes. These are more expensive on MySQL with InnoDB than on PostgreSQL, because with MySQL you have to traverse the primary key index to retrieve the row. This means, effectively, traversing two btree indexes to get the rows you are looking for. Probably not the end of the world, but if performance is your primary concern, that may be a cost you don't want to pay. While MySQL covering indexes are a little more useful in some cases, I don't think they help you here, since you are really interested in text fields, which you probably are not indexing directly.
In PostgreSQL, you have a btree index which gives you a series of page/tuple pointers, which then let you look up the data efficiently with random access. This would be a win with such a large table, and my experience is that PostgreSQL can perform very well on large tables (tables spanning, say, 2-3TB in size with their indexes).
However, assuming you stick with MySQL, careful attention to indexing will likely get you where you need to go. Remember, you are only pulling up 100 comments for a company, and traversing an index has O(log n) complexity, so it isn't really that bad. The biggest issue is traversing the pkey index for each of the rows retrieved, but even that should be manageable.
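To make that concrete on the MySQL side, here is a minimal sketch of a comments table with a secondary index that serves the "100 comments for one company" lookup (the table and column names are my own illustration, not taken from the question):
-- Hypothetical comments table; the secondary index serves "comments for one company"
CREATE TABLE comments (
    comment_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    body       TEXT,
    INDEX idx_company_created (company_id, created_at)
) ENGINE=InnoDB;

-- Walks the secondary index (O(log n)), then fetches each row through the PK
SELECT comment_id, user_id, created_at, body
FROM comments
WHERE company_id = 42
ORDER BY created_at DESC
LIMIT 100;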

4 billion records in one table is not a big deal for a NoSQL database. Even for a traditional database like MySQL, if you build your secondary indexes correctly, searching them will be quick (traversing a B-tree-like data structure takes O(log n) disk visits).
And for faster access, you need a front-end caching system for your hot data, like Redis or memcached.
Given your current situation, where you are not sure what fields will be needed, the only choice is a NoSQL solution, since fields (columns) can be added in the future when they are needed.
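(As an aside, if you do stay on MySQL, a JSON column can give similar flexibility for not-yet-known fields. A minimal sketch, assuming MySQL 5.7+ JSON support; the table and column names are hypothetical:)
-- Fixed columns for what is known today, JSON for fields added later
CREATE TABLE comments (
    comment_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    extra      JSON  -- e.g. {"support_rating": 4, "professionalism_comment": "..."}
) ENGINE=InnoDB;

-- Read a later-added field without any schema change
SELECT comment_id, JSON_EXTRACT(extra, '$.support_rating') AS support_rating
FROM comments
WHERE company_id = 42;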

(From a MySQL perspective...)
1 table for companies; INT UNSIGNED will do. 1 table for comments; BIGINT UNSIGNED may be necessary. You won't fetch hundreds of comments for display at once, will you? Unless you take care of the data layout, 100 comments could easily be 100 random disk hits, which (on cheap disk) would be 1 second.
You must have indexes (this mostly rules out NoSql); otherwise searching for records would be painfully slow.
CREATE TABLE Comments (
    comment_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    company_id INT UNSIGNED NOT NULL,
    ts TIMESTAMP,
    ...
    PRIMARY KEY(company_id, comment_id, ts), -- to get clustering and ordering
    INDEX(comment_id) -- to keep AUTO_INCREMENT happy
    ...
) ENGINE=InnoDB;
If you paginate the display of the comments, use the tips in "remember where you left off". That will make fetching comments about as efficient as possible.
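For reference, a minimal sketch of that "remember where you left off" pattern against the Comments table above (the literal values are placeholders):
-- First page: newest comments for one company
SELECT comment_id, ts
FROM Comments
WHERE company_id = 1234
ORDER BY comment_id DESC
LIMIT 20;

-- Next page: the client remembers the smallest comment_id it displayed
SELECT comment_id, ts
FROM Comments
WHERE company_id = 1234
  AND comment_id < 567890   -- last comment_id from the previous page
ORDER BY comment_id DESC
LIMIT 20;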
As for Log(n) -- With about 100 items per node, a billion rows will have only 5 levels of BTree. This is small enough to essentially ignore when worrying about timing. Comments will be a terabyte or more? And your RAM will be significantly less than that? Then, you will generally have non-leaf nodes cached, but leaf nodes (where the data is) not cached. There might be several comment rows per leaf node consecutively stored. Hence, less than 100 disk hits to get 100 comments for display.
(Note: When the data is much bigger than RAM, 'performance' degenerates into 'counting the disk hits'.)
Well, you mentioned comments. What about other queries?
As for "company_support_comment/support_rating..." -- The simplest would be to add a new table(s) when you need to add those 'columns'. The basic Company data is relatively bulky and static; ratings are relative small but frequently changing. (Again, I am 'counting the disk hits'.)

Related

SQL Separating big fields for speeding up queries

Let's say I have a table BOOK:
BOOK_ID INT(6) PK
--------------------
FILE_EXTENSION VARCHAR(5)
TITLE VARCHAR(60)
LANGUAGE VARCHAR(10)
EDITION INT(2)
PUBLISHMENT_OFFICE_ID INT(4)
PUBLISH_YEAR INT(4)
RATING INT(1)
FILE_UPDOAD_DATE DATE
LINK VARCHAR(150)
This table is meant to be used both for searching books (for example, by extension, by publishment office, by authors (from other tables), etc.) and for full visualization (printing all books with all these fields on a page).
So there is a question: For example, if I do
SELECT BOOK_ID FROM BOOK WHERE FILE_EXTENSION = 'PDF'
will this cause all the big fields (link, title, and maybe a planned BLOB) to be loaded as an intermediate result, or will unnecessary fields be discarded as soon as the WHERE clause is evaluated, with no performance penalty?
The question suggests a possible solution: separate the big fields into another table with the same PK, which slows down visualization (because a JOIN is needed) but speeds up the search. Is it worth it?
P.S. This particular DB is not meant to hold a really big amount of data, so my queries (I hope) won't be that slow. But this question is about general database design (let's say 10^8 entries).
P.P.S. Please don't link me to database normalization (my full DB is normalized well).
Columns are stored as part of their row. Rows are stored as part of a Page. If you need one column from one row you need to read the whole row, in fact you read the whole page that row is in. That's likely to be thousands of rows, including all of their columns. Hopefully that page also has other rows you are interested in and the read isn't wasted.
That's why Columnar databases are becoming so popular for analytics. They store columns separately. They still store the values in Pages. So you read thousands of rows off the disk for that column, but in analytics you're likely to be interested in all or most of those rows. This way you can have hundreds of columns, but only ever read the columns you're querying.
MySQL doesn't have ColumnStore. So, you need an alternative.
The first is to have your large fields in a separate table, which you've already alluded to.
Second, you can use a covering index.
If you index (file_extension, book_id), the query SELECT book_id FROM book WHERE file_extension = 'pdf' can be satisfied just by reading the index. It never needs to read the table itself. (Indexes are still stored as pages on the disk, but only for the columns the index relates to, plus potentially a row pointer. Much narrower than the table.)
That's a bit clunky though, because the covering index needs to cover the columns you know you'll be interested in.
In practice, your fields are small enough to not warrant this attention until it actually becomes a problem. It would be wise to store BLOBs in a separate table though.
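To make both suggestions concrete, here is a sketch against the BOOK table from the question (the BOOK_CONTENT side table is a hypothetical addition and assumes BOOK is InnoDB):
-- Covering index: the query below is answered from the index alone
CREATE INDEX idx_ext_id ON BOOK (FILE_EXTENSION, BOOK_ID);
SELECT BOOK_ID FROM BOOK WHERE FILE_EXTENSION = 'PDF';

-- Hypothetical side table for the planned BLOB, joined only when needed
CREATE TABLE BOOK_CONTENT (
    BOOK_ID INT(6) NOT NULL PRIMARY KEY,
    CONTENT LONGBLOB,
    FOREIGN KEY (BOOK_ID) REFERENCES BOOK (BOOK_ID)
) ENGINE=InnoDB;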
"Columns are stored as part of their row." -- Yes and no. All the 'small' columns are stored together in the row. But TEXT and BLOB, when 'big', are stored elsewhere. (This assumes ENGINE=InnoDB.)
SELECT book_id FROM ... WHERE ext = 'PDF' would benefit from INDEX(ext, book_id). Without such, the query necessarily scans the entire table (100M rows?). With that index, it will be very efficient.
"print on page all books with all these fields" -- Presumably this excludes the bulky columns? In that case SELECT book_id versus SELECT all-these-fields will cost about the same. This is a reasonable thing to do on a web page -- if you are not trying to display thousands of books on a single page. That becomes a "bad UI" issue, more than an "inefficient query" issue.
title and link are likely to come under the heading of "small" in my discussion above. But any BLOBs are very likely to be "big".
Yes, it is possible to do "vertical partitioning" to split out the big items, but that is mostly repeating what InnoDB is already doing. Don't bother.
100M rows is well into the arena where we should discuss these things. My comments so far only touch the surface. To dig deeper, we need to see the real schema and some of the important queries. I expect some queries to be slow. With 100M rows, improving one query sometimes hurts another query.

MySQL or NoSQL? Recommended way of dealing with large amount of data

I have a database which will be used by a large number of users to store random long strings (up to 100 characters). The table columns will be: userid, stringid and the actual long string.
So it will look pretty much like this:
Userid will be unique and stringid will be unique for each user.
The app is like a simple todo-list app, so each user will have an average of 50 todos.
I am using the stringid so that users will be able to delete a specific task at any given time.
I assume this todo app could end up with 7 million tasks in 3 years' time, and that scares me away from using MySQL.
So my question is whether this is the recommended way of dealing with a large amount of data with long strings (every new task gets a new row), and whether MySQL is the right database solution to choose for this kind of project?
I have no experience with large amounts of data yet, and I am trying to save myself trouble in the far future.
This is not a question of "large amounts" of data (MySQL handles large amounts of data just fine, and 2 million rows isn't a "large amount" in any case).
MySQL is a relational database. So if you have data that can be normalized, that is, distributed among a number of tables in a way that ensures every data point is saved only once, then you should use MySQL (or MariaDB, or any other relational database).
If you have schema-less data and speed is more important than consistency, then you can/should use some NoSQL database. Personally I don't see how a todo list would profit from NoSQL (it doesn't really matter in this case, but I guess as of now most programming frameworks have better support for relational databases than for NoSQL).
This is a pretty straightforward relational use case. I wouldn't see a need for NoSQL here.
The table you present should work fine; however, I personally would question the need for the compound primary key as you present it. I would probably have a primary key on stringid only, to enforce uniqueness across all records, rather than a compound primary key across userid and stringid, and I would then put a regular index on userid.
The reason for this is that if you want to query by stringid only (e.g. for deletes or updates), you are not tied into always having to query across both fields to leverage your index (or having to add individual indexes on stringid and userid to enable querying by each field, which means more space in memory and on disk taken up by indexes).
As far as whether MySQL is the right solution, this would really be for you to determine. I would say that MySQL should have no problem handling tables with 2 million rows and 2 indexes on two integer id fields. This is assuming you have allocated enough memory to hold these indexes in memory. There is certainly a ton of information available on working with MySQL, so if you are just trying to learn, it would likely be a good choice.
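In table form, that suggestion looks roughly like this (the column types are assumptions, not taken from the question):
-- Single-column PK plus a secondary index on userid
CREATE TABLE todos (
    stringid BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    userid   INT UNSIGNED NOT NULL,
    task     VARCHAR(100) NOT NULL,
    INDEX idx_userid (userid)
) ENGINE=InnoDB;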
Regardless of what you consider a "large amount of data", modern DB engines are designed to handle a lot. The question of "Relational or NoSQL?" isn't about which option can support more data. Different relational and NoSQL solutions will handle the large amounts of data differently, some better than others.
MySQL can handle many millions of records; SQLite cannot (at least not as effectively). Mongo (NoSQL) attempts to hold its collections in memory (as well as the file system), so I have seen it fail with fewer than 1 million records on servers with limited memory, although it offers sharding which can help it scale more effectively.
The bottom line is: the number of records you store should not play into SQL vs. NoSQL decisions; that decision should be driven by how you will save and retrieve the data. It sounds like your data is already normalized (e.g. UserID), and if you also desire consistency when you, for example, delete a user (the TODO items also get deleted), then I would suggest using a SQL solution.
I assume that all queries will reference a specific userid. I also assume that the stringid is a dummy value used internally instead of the actual task-text (your random string).
Use an InnoDB table with a compound primary key on {userid, stringid} and you will have all the performance you need, due to the way a clustered index works.
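A minimal sketch of that layout (the types are assumptions): because InnoDB clusters rows by the primary key, all of one user's tasks are stored physically together, so per-user reads and deletes touch very few pages.
CREATE TABLE todos (
    userid   INT UNSIGNED NOT NULL,
    stringid INT UNSIGNED NOT NULL,
    task     VARCHAR(100) NOT NULL,
    PRIMARY KEY (userid, stringid)  -- clustered: one user's rows are stored together
) ENGINE=InnoDB;

-- Both of these ride the clustered index
SELECT stringid, task FROM todos WHERE userid = 42;
DELETE FROM todos WHERE userid = 42 AND stringid = 7;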

Would it help to add index to BIGINT column in MySQL?

I have a table that will have millions of entries, and a column that has BIGINT(20) values that are unique to each row. They are not the primary key, but during certain operations, there are thousands of SELECTs using this column in the WHERE clause.
Q: Would adding an index to this column help when the amount of entries grows to the millions? I know it would for a text value, but I'm unfamiliar with what an index would do for INT or BIGINT.
A sample SELECT that would happen thousands of times is similar to this:
SELECT * FROM table1 WHERE my_big_number=19287319283784
If you have a very large table, then searching against values that aren't indexed can be extremely slow. In MySQL terms this kind of query ends up being a "table scan" which is a way of saying it must test against each row in the table sequentially. This is obviously not the best way to do it.
Adding an index will help with read speeds, but the price you pay is slightly slower write speeds. There's always a trade-off when making an optimization, but in your case the reduction in read time would be immense while the increase in write time would be marginal.
Keep in mind that adding an index to a large table can take a considerable amount of time so do test this against production data before applying it to your production system. The table will likely be locked for the duration of the ALTER TABLE statement.
As always, use EXPLAIN on your queries to determine their execution strategy. In your case it'd be something like:
EXPLAIN SELECT * FROM table1 WHERE my_big_number=19287319283784
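Adding the index itself would look something like the following; UNIQUE is optional, but it also enforces the one-value-per-row property described in the question:
ALTER TABLE table1 ADD UNIQUE INDEX idx_my_big_number (my_big_number);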
It will improve your look up (SELECT) performance (based on your example queries), but it will also make your inserts/updates slower. Your DB size will also increase. You need to look at how often you make these SELECT calls vs. INSERT calls. If you make a lot of SELECT calls, then this should improve your overall performance.
I have a 22 million row table on an Amazon EC2 small instance, so it is not the fastest server environment by a long shot. I have this create:
CREATE TABLE huge
(
    myid int not null AUTO_INCREMENT PRIMARY KEY,
    version int not null,
    mykey char(40) not null,
    myvalue char(40) not null,
    productid int not null
);
CREATE INDEX prod_ver_index ON huge(productid,version);
This call finishes instantly:
select * from huge where productid=3333 and version=1988210878;
As for inserts, I can do 100/sec in PHP, but if I cram 1000 inserts into an array and use implode to build one statement against this same table, I get 3400 inserts per second. Naturally your data is not coming in that way; I'm just saying the server is relatively snappy. But as tadman suggests (and he meant to say EXPLAIN, not examine), put it in front of a typical statement to see whether the key column shows an index that will be used were you to run it.
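For clarity, the implode trick just turns 1000 single-row statements into one multi-row INSERT, something like this (the values are placeholders):
-- One round trip instead of 1000
INSERT INTO huge (version, mykey, myvalue, productid) VALUES
    (1, 'key1', 'value1', 3333),
    (2, 'key2', 'value2', 3333),
    (3, 'key3', 'value3', 3333);  -- ...and so on, up to ~1000 rows per statement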
General Comments
For slow query debugging, place the word EXPLAIN in front of the word SELECT (no matter how complicated the select/join may be), and run it. Though the query will not be run in the normal fashion of resolving the result set, the db engine will produce (almost immediately) the execution plan it would attempt. This plan may be abandoned when the real query is run (the one prior to putting EXPLAIN in front of it), but it is a major clue to schema shortcomings.
The output of EXPLAIN appears cryptic for those first reading one. Not for long though. After reading a few articles about it, such as Using EXPLAIN to Write Better MySQL Queries, one will usually be able to determine which sections of the query are using which indexes, using none and doing slow tablescans, slower where clauses, derived and temp tables.
Using the output of EXPLAIN sized up against your schema, you can gain insight into strategies for index creation (such as composite and covering indexes) to gain substantial query performance.
Sharing
Sharing this EXPLAIN output and schema output with others (such as in stackoverflow questions) hastens better answers concerning performance. Schema output is rendered with such statements as show create table myTableName. Thank you for sharing.

mysql index optimization for a table with multiple indexes that index some of the same columns

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:
id, site_id, unixtime, unixtime_last, ip_address, uid
There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid
There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.
Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.
Any ideas on making this more efficient?
I don't really understand B-trees beyond some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? I considered making site_id the second column of the index for both ip_address and uid, but I think that would make the index less efficient, since the IP and UID are going to vary more than the site ID will: we only have about 8000 unique sites per database server, but millions of unique visitors across all ~8000 sites on a daily basis.
I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:
select id from sessions where uid = 'value' and site_id = 123 limit 1
... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.
Any insight on making this as efficient as possible would be appreciated :)
Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage space. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.
We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.
There is a second property of B-tree indices you should be aware of as well: they are sorted. A B-tree index can be used for queries like where site_id < 40.
There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.
The same, by the way, applies to seeks through an index. Finding a key in a B-tree actually potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, because each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.
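To make the prefix rule concrete, a small sketch using the column names from the question (the index names are arbitrary):
CREATE INDEX site_ip ON sessions (site_id, ip_address);
CREATE INDEX ip_site ON sessions (ip_address, site_id);

-- Can use site_ip: the WHERE clause matches a left-most prefix of that index
SELECT id FROM sessions WHERE site_id = 123 AND ip_address = '1.2.3.4';

-- Cannot use site_ip (no site_id given), but can use ip_site
SELECT id FROM sessions WHERE ip_address = '1.2.3.4';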
Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.
PS: Benchmark InnoDB as well, it often has better concurrent performance. Same with PostgreSQL.
First of all, if you are storing the IP as a string, change it to an INT UNSIGNED column and use the INET_ATON(expr) and INET_NTOA(expr) functions to deal with it. Indexing an integer value is more efficient than indexing variable-length strings.
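A short sketch of what that conversion looks like (the ip_num column name is just an example):
-- Store the IPv4 address as a 4-byte integer instead of a string
ALTER TABLE sessions ADD COLUMN ip_num INT UNSIGNED;
UPDATE sessions SET ip_num = INET_ATON(ip_address);

-- Query by IP and convert back for display
SELECT id, INET_NTOA(ip_num) AS ip
FROM sessions
WHERE site_id = 123 AND ip_num = INET_ATON('1.2.3.4');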
Well, indexes trade storage for performance. It's hard if you want both. It's hard to optimize this any further without knowing all the queries you run and their volumes per interval.
What you have will work. If you're running into a bottleneck, you'll need to find out whether it's CPU, RAM, disk and/or network, and adjust accordingly. It's hard and wrong to optimize prematurely.
You probably want to switch to InnoDB if you have any updates; otherwise MyISAM is good for insert/select. Also, since your row size is small, you could look into MySQL Cluster (NDB). There is also an ARCHIVE engine that can help with storage requirements, but partitioning in 5.1 is probably a better thing to look into.
Flipping the order of your index doesn't make any sense, if these indexes are already used in all of your queries.
but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
I'm not sure, but I haven't heard this before, and it doesn't seem true to me for this application. Index order matters for sorting, and having multiple indexes with different leading columns allows more possible queries to use an index.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because that's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
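As a small, hypothetical illustration of explicit keys plus an index on a frequently searched column:
CREATE TABLE users (
    user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(25) NOT NULL,
    UNIQUE KEY uk_username (username)  -- supports the frequent lookup by username
) ENGINE=InnoDB;

CREATE TABLE logins (
    login_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id  INT UNSIGNED NOT NULL,
    login_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users (user_id)  -- explicit, self-documenting relationship
) ENGINE=InnoDB;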
The number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k - 100k rows, which roughly corresponds to the point when the rdbms is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is in the mathematical sense - in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is just misleading without two additional data points: the approximate size of the row, and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying query or the table design.