tl;dr:
1. DB partitioning with primary key
2. Index size problem
3. DB size grows around 1-3 GB per day
4. RAID setup
5. Do you have experience with Hypertable?
Long Version:
I just built / bought a home server:
Xeon E3-1245, 3.4 GHz, Hyper-Threading
32 GB RAM
6x 1.5 TB WD Caviar Black 7200 RPM
I will use the onboard RAID of the Intel Server Board S1200BTL (no money left for a RAID controller). http://ark.intel.com/products/53557/Intel-Server-Board-S1200BTL
The mainboard has 4x SATA 3 Gb/s ports and 2x SATA 6 Gb/s ports.
I'm not yet sure if I can set up all 6 HDDs in RAID 10;
if that's not possible, I thought of 4 HDDs in RAID 10 (MySQL DB) and 2 HDDs in RAID 0 (OS / MySQL indexes).
(If the RAID 0 breaks, that's no problem for me; I only need to keep the DB safe.)
About the DB:
It's a web crawler DB, where domains, URLs, links and similar data get stored.
So I thought I would partition the DB by the primary key of each table, e.g.
(1-1000000), (1000001-2000000), and so on.
When I run search / insert / select queries against the DB, I need to scan the whole table, because some matches could be in row 1 and others in row 1,000,000,000,000.
If I partition by the primary key (auto_increment) like this, will MySQL use all my CPU cores and scan the partitions in parallel? Or should I stick with one huge table without any partitioning? (A sketch of the scheme is below.)
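For illustration, here is a minimal sketch of what that range-by-primary-key idea could look like in MySQL. The column list is trimmed and the boundaries are only examples, and note that the partitioning column must be part of every unique key of the table:

-- Hypothetical, trimmed-down version of the extract table, partitioned by
-- ranges of the auto_increment primary key (boundaries are illustrative only).
CREATE TABLE extract_partitioned (
  extract_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  url_id      BIGINT NOT NULL,
  extern_link VARCHAR(2083) NOT NULL,
  PRIMARY KEY (extract_id)
)
PARTITION BY RANGE (extract_id) (
  PARTITION p0   VALUES LESS THAN (1000001),
  PARTITION p1   VALUES LESS THAN (2000001),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);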
The DB will be very big; on my home system right now it looks like this:
Table extract: 25,034,072 Rows
Data 2,058.7 MiB
Index 2,682.8 MiB
Total 4,741.5 MiB
Table structure (Field | Type | Null | Key | Default | Extra):
extract_id    bigint(20) unsigned   NO   PRI   NULL   auto_increment
url_id        bigint(20)            NO   MUL   NULL
extern_link   varchar(2083)         NO   MUL   NULL
anchor_text   varchar(500)          NO         NULL
http_status   smallint(2) unsigned  NO         0
Indexes (Keyname | Type | Unique | Packed | Column | Cardinality):
PRIMARY      BTREE   Yes   No   extract_id          25034072
link         BTREE   Yes   No   url_id
                                extern_link (400)   25034072
externlink   BTREE   No    No   extern_link (400)    1788148
Table urls: 21,889,542 Rows
Data 2,402.3 MiB
Index 3,456.2 MiB
Total 5,858.4 MiB
Table structure (Field | Type | Null | Key | Default | Extra):
url_id         bigint(20)            NO   PRI   NULL   auto_increment
domain_id      bigint(20)            NO   MUL   NULL
url            varchar(2083)         NO         NULL
added          date                  NO         NULL
last_crawl     date                  NO         NULL
extracted      tinyint(2) unsigned   NO   MUL   0
extern_links   smallint(5) unsigned  NO         0
crawl_status   tinyint(11) unsigned  NO         0
status         smallint(2) unsigned  NO         0
Indexes (Keyname | Type | Unique | Packed | Column | Cardinality):
PRIMARY            BTREE   Yes   No   url_id       21889542
domain_id          BTREE   Yes   No   domain_id           0
                                      url (330)    21889542
extracted_status   BTREE   No    No   extracted           2
                                      status             31
I see I could fix the externlink & link indexes; I only added externlink because I needed to query that field and could not use the link index for it. Do you see anything else I could tune in the indexes? My new system will have 32 GB of RAM, but if the DB keeps growing at this speed, I will be using 90% of that RAM within a few weeks or months.
Does a packed index help? (And how big is the performance decrease?)
The other important tables are under 500MB.
Only the URL Source table is huge: 48.6 GiB
Structure:
url_id BIGINT
pagesource mediumblob (the data is packed with gzip at high compression)
The only index is on url_id (unique).
Data can be wiped from this table once I have extracted everything I need.
Do you have any experience with Hypertable? http://hypertable.org/ (modeled after Google's Bigtable). If I move to Hypertable, would this help with performance (extracting / searching / inserting / selecting) and DB size? I have read the site, but I'm still somewhat clueless, because you can't directly compare MySQL with Hypertable. I will try it out soon; I must read the documentation first.
What I need is a solution that fits my current setup, because I have no money left for any other hardware.
Thanks for the help.
Hypertable is an excellent choice for a crawl database. Hypertable is an open source, high performance, scalable database modeled after Google's Bigtable. Google developed Bigtable specifically for their crawl database. I recommend reading the Bigtable paper since it uses the crawl database as the running example.
Regarding #4 (RAID setup): it's not recommended to use RAID 5 for production servers. Great article about it -> http://www.dbasquare.com/2012/04/02/should-raid-5-be-used-in-a-mysql-server/
Related
I have a MySQL database and need to store up to 25 recommendations for each user (shown when the user visits the site). Here is my simple table that holds userid, recommendation, and rank for the recommendation:
userid | recommendation | rank
1 | movie_A | 1
1 | movie_X | 2
...
10 | movie_B | 1
10 | movie_A | 2
....
I expect about 10M users, which combined with 25 recommendations each would result in 250M rows. Is there any better way to design a user-recommendation table?
Thanks!
Is your requirement only to retrieve the 25 recommendations and send them to a UI layer for consumption?
If that is the case, the system that computes the recommendations can build a JSON document and store it against the userid; MySQL has support for a JSON datatype (a sketch is below).
This might not be a good approach if you want to perform search queries on the JSON document.
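For example, a minimal sketch of that idea, assuming MySQL 5.7+ with the JSON type (the table and column names here are hypothetical):

CREATE TABLE user_recommendations (
  userid INT UNSIGNED NOT NULL PRIMARY KEY,
  recommendations JSON NOT NULL
);

-- The recommender job writes (or replaces) all 25 entries in a single row:
INSERT INTO user_recommendations (userid, recommendations)
VALUES (1, '[{"movie": "movie_A", "rank": 1}, {"movie": "movie_X", "rank": 2}]')
ON DUPLICATE KEY UPDATE recommendations = VALUES(recommendations);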
250 million rows isn't unreasonable in a simple table like this:
CREATE TABLE UserMovieRecommendations (
user_id INT UNSIGNED NOT NULL,
movie_id INT UNSIGNED NOT NULL,
rank TINYINT UNSIGNED NOT NULL,
PRIMARY KEY (user_id, movie_id, rank),
FOREIGN KEY (user_id) REFERENCES Users(user_id),
FOREIGN KEY (movie_id) REFERENCES Movies(movie_id)
);
That's 9 bytes per row, so only about 2 GB:
25 * 10,000,000 * 9 bytes = 2,250,000,000 bytes, or about 2.1 GiB.
Perhaps double that to account for indexes and so on. Still not hard to imagine a MySQL server configured to hold the entire data set in RAM. And it's probably not necessary to hold all the data in RAM, since not all 10 million users will be viewing their data at once.
You might never reach 10 million users, but if you do, I expect that you will be using a server with plenty of memory to handle this.
We have a big table with the following table structure:
CREATE TABLE `location_data` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` char(30) NOT NULL,
`data` char(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` double(30,10) DEFAULT NULL,
`lng` double(30,10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dt` (`dt`),
KEY `data` (`data`),
KEY `device_sn` (`device_sn`,`data`,`dt`),
KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We frequently run queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
OR
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00 ' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
By adding an index and using FORCE INDEX on device_sn.
By separating the table into multiple tables based on date (e.g. location_data_20140101) and pre-checking whether there is data for a given date, so that we query that day's table alone. The daily table is created by cron once a day, and the rows for that date are then deleted from location_data (sketched below).
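Roughly, such a daily rotation could be done like this (the date is just an example and this is only a sketch, not our exact cron job):

CREATE TABLE location_data_20140101 LIKE location_data;

INSERT INTO location_data_20140101
SELECT * FROM location_data
WHERE dt >= '2014-01-01 00:00:00' AND dt < '2014-01-02 00:00:00';

DELETE FROM location_data
WHERE dt >= '2014-01-01 00:00:00' AND dt < '2014-01-02 00:00:00';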
The table location_data is HIGH WRITE and LOW READ.
However, at times the queries still run really slowly. I wonder if there are other methods, or ways to restructure the data, that would let us read the data in sequential date order for a given device_sn.
Any tips are more than welcome.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn, data, dt) is good. MySQL should use it without any need for FORCE INDEX; you can verify this by running EXPLAIN SELECT ....
However, your table is MyISAM, which only supports table-level locks. If the table is write-heavy, that may make it slow. I would suggest converting it to InnoDB.
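The conversion itself is a single statement. Keep in mind that on a table with 700M+ rows this rebuild will likely take a long time and block writes, so it should be done in a maintenance window:

ALTER TABLE location_data ENGINE=InnoDB;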
OK, I'll provide the info that I know; it might not answer your question directly, but it could provide some insight.
There exist certain differences between InnoDB and MyISAM. Forget about full-text indexing or spatial indexes; the huge difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can cache the data set it works with in RAM. This is why database servers come with a lot of RAM - so that I/O operations can be done quickly. For example, an index scan is faster if the indexes are in RAM rather than on an HDD, because finding data on an HDD is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this in InnoDB is called innodb_buffer_pool_size. By default it's 8 MB, if I'm not mistaken. I personally set this value high, sometimes even up to 90% of available RAM. Usually, when this value is tuned, people experience incredible speed gains.
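For example, you can check the current setting from SQL and then raise it in the server configuration. The 24G figure below only illustrates the "most of available RAM" idea on a 32 GB box, not a recommendation for every machine:

-- Check the current buffer pool size (in bytes):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Then, in my.cnf under the [mysqld] section (a restart is needed on older MySQL versions):
-- innodb_buffer_pool_size = 24G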
The other thing is that InnoDB is a transactional engine. That means it will tell you whether a write to disk succeeded or failed, and that answer will be 100% correct. MyISAM won't do that, because it doesn't force the OS to force the HDD to commit data permanently. That's why records are sometimes lost with MyISAM: it thinks the data is written because the OS said it was, when in reality the OS tried to optimize the write and the HDD's buffer might lose the data, so it never gets written down. The OS tries to optimize the write operation by using the HDD's buffers to collect larger chunks of data and then flushing them in a single I/O. The result is that you don't have control over how the data is actually written.
With InnoDB you can start a transaction, execute say 100 INSERT queries, and then commit. That effectively forces the hard drive to flush all 100 inserts at once, using 1 I/O. If each INSERT is 4 KB long, 100 of them are 400 KB. That means you'll use 400 KB of your disk's bandwidth with 1 I/O operation, and the remaining I/O capacity is available for other uses. This is how inserts are optimized.
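A minimal sketch of that pattern against the table from the question (the values are made up):

START TRANSACTION;
INSERT INTO location_data (device_sn, data, gps_date) VALUES ('SN-0001', 'location', NOW());
INSERT INTO location_data (device_sn, data, gps_date) VALUES ('SN-0002', 'location', NOW());
-- ... more INSERTs ...
COMMIT;  -- one flush to disk for the whole batch instead of one per statement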
Next are indexes with low cardinality - cardinality is the number of unique values in an indexed column. For a primary key, every value is unique, so its cardinality equals the row count, which is the highest possible value. Indexes with low cardinality are on columns with only a few distinct values, such as a yes/no flag. If an index's cardinality is too low, MySQL will prefer a full table scan - it's MUCH quicker. Also, forcing an index that MySQL doesn't want to use can (and probably will) slow things down: with an indexed lookup, MySQL processes records one by one, whereas with a table scan it can read many records at once and skip that per-record work. If those records were written sequentially on a mechanical disk, further optimizations are possible.
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set the value of innodb_buffer_pool_size large enough so you can allocate more resources for faster querying
use an SSD if possible
try to wrap multiple INSERTs into transactions so you can better utilize your hard drive's bandwidth and I/O
avoid indexing columns that have low unique value count compared to row count - they just waste space (though there are exceptions to this)
I have a huge InnoDB table with three columns (int, mediumint, int). The innodb_file_per_table setting is on, and there is only a PRIMARY KEY on the first two columns.
The table schema is:
CREATE TABLE `big_table` (
`user_id` int(10) unsigned NOT NULL,
`another_id` mediumint(8) unsigned NOT NULL,
`timestamp` int(10) unsigned NOT NULL,
PRIMARY KEY (`user_id`,`another_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
MySQL Version is 5.6.16
Currently I am multi-inserting over 150 rows per second. No deletion, and no updates.
There are no significant rollbacks or other transaction aborts, that would cause wasted space usage.
MySQL shows a calculated size of 75.7 GB for that table.
.ibd size on disk: 136,679,784,448 bytes (127.29 GiB)
Counted rows: 2,901,937,966 (47.10 bytes per row)
Two days later, MySQL still shows a calculated size of 75.7 GB for that table.
.ibd size on disk: 144,263,086,080 bytes (135.35 GiB)
Counted rows: 2,921,284,863 (49.38 bytes per row)
Running SHOW TABLE STATUS for the table shows:
Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Collation
InnoDB | 10 | Compact | 2645215723 | 30 | 81287708672 | 0 | 0 | 6291456 | utf8_unicode_ci
Here are my Questions:
Why is the disc usage growing disproportionally to the row count?
Why are the Avg_row_length and Data_length totally wrong?
I hope someone can help me stop the disk usage from growing like this. I had not noticed this behaviour while the table was smaller.
I am assuming that your table hasn't grown to its present ~2.9 billion rows organically, and that you either recently loaded this data or have caused the table to be re-organized (using ALTER TABLE or OPTIMIZE TABLE, for instance). So it starts off quite well-packed on disk.
Based on your table schema (which is fortunately very simple and straightforward), each row (record) is laid out as follows:
(Header) 5 bytes
`user_id` 4 bytes
`another_id` 3 bytes
(Transaction ID) 6 bytes
(Rollback Pointer) 7 bytes
`timestamp` 4 bytes
=============================
Total 29 bytes
InnoDB will never actually fill pages to more than approximately ~15/16 full (and normally never less than 1/2 full). With all of the extra overhead in various places, the fully-loaded cost of a record is somewhere around 32 bytes minimum and 60 bytes maximum per row in the leaf pages of the index.
When you bulk-load data through an import or through an ALTER TABLE or OPTIMIZE TABLE, the data will normally be loaded (and the indexes created) in order by PRIMARY KEY, which allows InnoDB to very efficiently pack the data on disk. If you then continue writing data to the table in random (or effectively random) order, the efficiently-packed index structures must expand to accept the new data, which in B+Tree terms means splitting pages in half. If you have an ideally-packed 16 KiB page where records consume ~32 bytes on average, and it is split in half to insert a single row, you now have two half-empty pages (~16 KiB wasted) and that new row has "cost" 16 KiB.
Of course that's not really true. Over time the index tree would settle down with pages somewhere between 1/2 full and 15/16 full -- it won't keep splitting pages forever, because the next insert that must happen into the same page will find that plenty of space already exists to do the insert.
This can be a bit disconcerting if you initially bulk load (and thus efficiently pack) your data into a table and then switch to organically growing it, though. Initially it will seem as though the tables are growing at an insane pace, but if you track the growth rate over time it should slow down.
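If reclaiming the on-disk space ever matters, the table can be re-packed, which rebuilds it in primary-key order again. This is only a sketch of the idea; rebuilding a 130+ GiB table takes considerable time and temporary disk space:

OPTIMIZE TABLE big_table;
-- or, equivalently for InnoDB:
ALTER TABLE big_table ENGINE=InnoDB;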
You can read more about InnoDB index and record layout in my blog posts: The physical structure of records in InnoDB, The physical structure of InnoDB index pages, and B+Tree index structures in InnoDB.
I have a performance issue when handling a billion records with select queries. I have this table:
CREATE TABLE `temp_content_closure2` (
`parent_label` varchar(2000) DEFAULT NULL,
`parent_code_id` bigint(20) NOT NULL,
`parent_depth` bigint(20) NOT NULL DEFAULT '0',
`content_id` bigint(20) unsigned NOT NULL DEFAULT '0',
KEY `code_content` (`parent_code_id`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY KEY (parent_depth)
PARTITIONS 20 */
I used partitioning, which should increase performance by subdividing the table, but it is not useful in my case. Here is a sample of the rows in this table:
+----------------+----------------+--------------+------------+
| parent_label | parent_code_id | parent_depth | content_id |
+----------------+----------------+--------------+------------+
| Taxonomy | 20000 | 0 | 447 |
| Taxonomy | 20000 | 0 | 2286 |
| Taxonomy | 20000 | 0 | 3422 |
| Taxonomy | 20000 | 0 | 5916 |
+----------------+----------------+--------------+------------+
Here content_id is unique with respect to parent_depth, so I used parent_depth as the partitioning key. At every depth I have 2,577,833 rows to handle, so partitioning does not help. I found the idea on some websites of using the ARCHIVE storage engine, but it relies on full table scans and cannot use an index for selects. Basically 99% of the queries against this table are SELECTs, and the table's row count increases every day. I am currently on MySQL version 5.0.1. I also got the idea of moving to a NoSQL database, but is there any way to handle this in MySQL? If you are suggesting NoSQL, should I use Cassandra or Accumulo?
Add an index like this:
ALTER TABLE temp_content_closure2 ADD INDEX content_id (`content_id`);
You can also add multiple indexes if you have more specific SELECT criteria which will also speed things up.
Multiple and single indexes
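For instance, a composite index matched to a hypothetical query that filters on parent_depth and content_id might look like this (adjust the columns to your actual WHERE clauses):

ALTER TABLE temp_content_closure2
  ADD INDEX depth_content (parent_depth, content_id);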
Overall though, if you have a table like this that's growing so fast, then you should probably be looking at restructuring your SQL design.
Check out "Big Data" solutions as well.
With that size and volume of data, you'd need to either set up a sharded MySQL cluster of machines (Facebook and Twitter store massive amounts of data on sharded MySQL setups, so it is possible), or alternatively use a Bigtable-style solution that natively distributes the data amongst the nodes of a cluster - Cassandra and HBase are the most popular alternatives here. You must realize that a billion records on a single machine will hit almost every limit of the system - I/O first, followed by memory, followed by CPU. It's simply not feasible.
If you do go the Bigtable way, Cassandra will be the quickest to set up and test. However, if you anticipate map-reduce type analytic needs, then HBase is more tightly integrated with the Hadoop ecosystem and should work out well. Performance-wise, they are neck and neck, so take your pick.
Good afternoon all. I am coming to you in the hopes that you can provide some direction with a MySQL optimization problem that I am having. First, a few system specifications.
MySQL version: 5.2.47 CE
WampServer v 2.2
Computer:
Samsung QX410 (laptop)
Windows 7
Intel i5 (2.67 Ghz)
4GB RAM
I have two tables:
“Delta_Shares” contains stock trade data, and contains two columns of note. “Ticker” is Varchar(45), “Date_Filed” is Date. This table has about 3 million rows (all unique). I have an index on this table “DeltaSharesTickerDateFiled” on (Ticker, Date_Filed).
“Stock_Data” contains two columns of note. “Ticker” is Varchar(45), “Value_Date” is Date. This table has about 19 million rows (all unique). I have an index on this table “StockDataIndex” on (Ticker, Value_Date).
I am attempting to update the “Delta_Shares” table by looking up information from the Stock_Data table. The following query takes more than 4 hours to run.
update delta_shares A, stock_data B
set A.price_at_file = B.stock_close
where A.ticker = B.ticker
and A.date_filed = B.value_Date;
Is the excessive runtime the natural result of the large number of rows, poor indexing, a bad machine, bad SQL writing, or all of the above? Please let me know if any additional information would be useful (I am not overly familiar with MySQL, though this issue has moved me significantly down the path of optimization). I greatly appreciate any thoughts or suggestions.
UPDATED with "EXPLAIN SELECT"
1(id) SIMPLE(seltype) A(table) ALL(type) DeltaSharesTickerDateFiled(possible_keys) ... 3038011(rows)
1(id) SIMPLE(seltype) B(table) ref(type) StockDataIndex(possible_keys) StockDataIndex(key) 52(key_len) 13ffeb2013.A.ticker,13ffeb2013.A.date_filed(ref) 1(rows) Using where
UPDATED with DESCRIBE output for both tables.
Stock_Data Table:
idstock_data int(11) NO PRI auto_increment
ticker varchar(45) YES MUL
value_date date YES
stock_close decimal(10,2) YES
Delta_Shares Table:
iddelta_shares int(11) NO PRI auto_increment
cik int(11) YES MUL
ticker varchar(45) YES MUL
date_filed_identify int(11) YES
Price_At_File decimal(10,2) YES
delta_shares int(11) YES
date_filed date YES
marketcomparable varchar(45) YES
market_comparable_price decimal(10,2) YES
industrycomparable varchar(45) YES
industry_comparable_price decimal(10,2) YES
Indexes from Delta_Shares (Table, Non_unique, Key_name, Seq_in_index, Column_name, Collation, Cardinality, Null, Index_type):
delta_shares 0 PRIMARY 1 iddelta_shares A 3095057 BTREE
delta_shares 1 DeltaIndex 1 cik A 18 YES BTREE
delta_shares 1 DeltaIndex 2 date_filed_identify A 20633 YES BTREE
delta_shares 1 DeltaSharesAllIndex 1 cik A 18 YES BTREE
delta_shares 1 DeltaSharesAllIndex 2 ticker A 619011 YES BTREE
delta_shares 1 DeltaSharesAllIndex 3 date_filed_identify A 3095057 YES BTREE
delta_shares 1 DeltaSharesTickerDateFiled 1 ticker A 11813 YES BTREE
delta_shares 1 DeltaSharesTickerDateFiled 2 date_filed A 3095057 YES BTREE
Indexes from Stock_Data (Table, Non_unique, Key_name, Seq_in_index, Column_name, Collation, Cardinality, Null, Index_type):
stock_data 0 PRIMARY 1 idstock_data A 18683114 BTREE
stock_data 1 StockDataIndex 1 ticker A 14676 YES BTREE
stock_data 1 StockDataIndex 2 value_date A 18683114 YES BTREE
There are a few benchmarks you could make to see where the bottleneck is. For example, try updating the field to a constant value and see how long it takes (obviously, you'll want to make a copy of the database to do this on). Then try a select query that doesn't update, but just selects the values to be updated and the values they will be updated to.
Benchmarks like these will usually tell you whether you're wasting your time trying to optimize or whether there is much room for improvement.
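For example, those two benchmarks could look like this, run against a copy of the data (table and column names taken from the question):

-- 1. Update to a constant value, to measure the pure write cost:
UPDATE delta_shares SET price_at_file = 0;

-- 2. Select-only version of the join, to measure the lookup cost without writing:
SELECT A.iddelta_shares, A.price_at_file, B.stock_close
FROM delta_shares A
JOIN stock_data B
  ON A.ticker = B.ticker
 AND A.date_filed = B.value_date;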
As for the memory, here's a rough idea of what you're looking at:
varchar fields are 2 bytes plus actual length and datetime fields are 8 bytes. So let's make an extremely liberal guess that your varchar fields in the Stock_Data table average around 42 bytes. With the datetime field that adds up to 50 bytes per row.
50 bytes x 20 million rows = .93 gigabytes
So if this process is the only thing going on in your machine then I don't see memory as being an issue since you can easily fit all the data from both tables that the query is working with in memory at one time. But if there are other things going on then it might be a factor.
Try ANALYZE TABLE on both tables and use STRAIGHT_JOIN instead of the implicit join. Just a guess, but it sounds like a confused optimiser.
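Something along these lines (a sketch of the suggestion, using the names from the question):

ANALYZE TABLE delta_shares, stock_data;

UPDATE delta_shares A
STRAIGHT_JOIN stock_data B
  ON A.ticker = B.ticker
 AND A.date_filed = B.value_date
SET A.price_at_file = B.stock_close;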