How to handle a huge dataset [closed] - mysql

These days I'm reading about different ways to manage a huge dataset in the MySQL database.
To be honest, at the moment I'm confused. I have read about several concepts related to this issue, but I don't know how they relate to each other.
Please take a look at these:
Partitioning - which is a plugin
Clustering - named NDB, I guess
Sharding - which I think is just a concept, not something directly implementable
The scenario is storing/maintaining/searching a huge set of data (assume a table with 5 billion rows) in MySQL. So we have to split the dataset apart somehow, but how?
I have a few questions:
How much overlap is there between those three items above?
In partitioning, will all parts be stored on the same machine (server), or can they be kept on different machines?
How do I detect which partition a given piece of data is stored in (so I can look it up accordingly)?
I know partitioning is for "tables", is clustering for "databases"?
With sharding, do we replicate the data on different servers, or does each server hold different data? Also, does it happen at the "table" layer or the "database" layer?
How will the different parts (clusters/partitions) see each other when needed? Like when we need a JOIN clause over the whole table, assuming the data is split across different partitions/machines.
To use clustering, do I need to install a different edition (version) of MySQL? Isn't it supported by the normal edition?
Anyway, I've been reading about them for over 3 days, and the main concepts are still unclear to me.

A quick comparison:

| description  | nr of servers | redundant? | a goal        |
|--------------|---------------|------------|---------------|
| partitioning | 1             | No         | time series   |
| clustering   | >= 3          | Yes        | recovery      |
| sharding     | > 1           | No         | write scaling |
Sharding is divvying up the data across multiple servers.
How much overlap is there between those three items above?
A: Very little. Each divvies the data up in different ways for different goals.
In partitioning, will all parts be stored on the same machine (server), or can they be kept on different machines?
A: In partitioning, all parts will be stored on the same instance on the same machine (server).
How do I detect which partition the data is stored in?
A: When practical, provide a WHERE clause that pinpoints which partition(s) are needed. (See "partition pruning")
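For illustration, a minimal sketch of a range-partitioned table and a query that lets MySQL prune partitions (the table and column names are made up, not from the question):

-- Hypothetical log table partitioned by year.
CREATE TABLE access_log (
    id        BIGINT NOT NULL AUTO_INCREMENT,
    logged_at DATETIME NOT NULL,
    message   VARCHAR(255),
    PRIMARY KEY (id, logged_at)
)
PARTITION BY RANGE (YEAR(logged_at)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- The WHERE clause pinpoints the year, so only partition p2022 is read.
SELECT COUNT(*) FROM access_log
WHERE logged_at >= '2022-01-01' AND logged_at < '2023-01-01';

EXPLAIN shows which partitions a query touches (EXPLAIN PARTITIONS on older versions).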
I know partitioning is for "tables", is clustering for "databases"?
A: I think you could describe it that way. Clustering (also) has the advantage of having a second copy on a different piece of hardware.
With sharding, do we replicate the data on different servers, or does each server hold different data? Also, does it happen at the "table" layer or the "database" layer?
A: No. Typically the largest table is split up in some arbitrary way -- some rows are put on each shard. Then clients must know how that split-up was done to know which server to talk to. (There is no canned code for this vital task.) Smaller tables are either copied onto all shards or put onto other machine(s).
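There is no canned SQL for that routing step, but as a sketch of one common approach (all names here are hypothetical), clients consult a small directory table that maps an id range to a shard:

-- Hypothetical directory, kept on a small central server or copied to every client.
CREATE TABLE shard_map (
    shard_id    INT PRIMARY KEY,
    min_user_id BIGINT NOT NULL,
    max_user_id BIGINT NOT NULL,
    host        VARCHAR(64) NOT NULL
);

INSERT INTO shard_map VALUES
    (1,          1, 1000000000, 'db-shard-1.example.com'),
    (2, 1000000001, 2000000000, 'db-shard-2.example.com');

-- The application looks up which host owns a given user, then queries that server.
SELECT host FROM shard_map WHERE 1234567 BETWEEN min_user_id AND max_user_id;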
How will the different parts (clusters/partitions) see each other when needed? Like when we need a JOIN clause over the whole table, assuming the data is split across different partitions/machines.
A: A JOIN works on only one server. (MariaDB has "FEDERATEDX", but that is a costly workaround.) For Partitioning, the query sees the many partitions as one big table, so JOIN is not a problem. For Clustering, everything is on each server, so there is no problem. Sharding is fine within the constraint that each server has only part of the big table.
BTW: read this: How to handle a question that asks many things

Related

I am using MySQL as my database and I have one table containing 60 columns. Is it good to have a table with so many columns? [closed]

I'm building a training-point management function. I need to save all those points in the database so they can be displayed when needed, and I created a table for that function with 60 columns. Is that good, or can anyone suggest another way to handle it?
It is unusual but not impossible for a table to have that many columns, however...
It suggests that your schema might not be normalized. If that is the case, you will run into problems designing queries and/or making efficient use of the available resources.
Depending on how often each row is updated, the table could become fragmented. MySQL, like most DBMSs, does not simply add up the sizes of all the attributes in the relation to work out the size to allocate for the record (although this is an option with C-ISAM). It rounds that figure up so that there is some space for the data to grow, but at some point the record could become larger than the space available. At that point the record must be migrated elsewhere, which leads to fragmentation in the data.
Your queries are going to be very difficult to read/maintain. You may fall into the trap of writing "select * ....", which means that the DBMS needs to read the entire record into memory in order to resolve the query. This does not make for efficient use of your memory.
We can't tell you whether what you have done is correct, nor whether you should be doing it differently, without a detailed understanding of the underlying data.
I've worked with many tables that had dozens of columns. It's usually not a problem.
In relational database theory, there is no limit to the number of columns in a table, as long as it's finite. If you need 60 attributes and they are all properly attributes of the candidate key in that table, then it's appropriate to make 60 columns.
It is possible that some of your 60 columns are not proper attributes of the table, and need to be split into multiple tables for the sake of normalization. But you haven't described enough about your specific table or its columns, so we can't offer opinions on that.
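Purely as an illustration of what such a split can look like (the column names below are invented, since the real schema isn't shown), 60 per-category point columns often become rows in a child table:

-- Instead of: students(student_id, point_1, point_2, ..., point_60)
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

CREATE TABLE training_points (
    student_id INT NOT NULL,
    category   VARCHAR(50) NOT NULL,   -- what the points were awarded for
    points     INT NOT NULL,
    awarded_at DATE NOT NULL,
    PRIMARY KEY (student_id, category, awarded_at),
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);

-- All points for one student come back as rows, not as 60 columns:
SELECT category, points FROM training_points WHERE student_id = 42;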
There's a practical limit in MySQL for how many columns it supports in a given table, but this is a limit of the implementation (i.e. MySQL internal code), not of the theoretical data model. The actual maximum number of columns in a table is a bit tricky to define, since it depends on the specific table. But it's almost always greater than 60. Read this blog about Understanding the Maximum Number of Columns in a MySQL Table for details.

how to improve speed in database? [closed]

I am starting to create my first web application in my career using mysql.
I am going to make a table which contains users' information (like id, firstname, lastname, email, password, phone number).
Which of the following is better?
Put all data into one single table (userinfo).
Divide the data by alphabet character and put it into many tables. For example, if a user's email id is Joe#gmail.com, put it into table (userinfo_j), and if a user's email id is kevin#gmail.com, put it into table (userinfo_k).
I don't want to sound condescending, but I think you should spend some time reading up on database design before tackling this project, especially the concept of normalization, which provides consistent and proven rules for how to store information in a relational database.
In general, my recommendation is to build your database to be easy to maintain and understand first and foremost. On modern hardware, a reasonably well-designed database with indexes running relational queries can support millions of records, often tens or hundreds of millions of records without performance problems.
If your database has a performance problem, tune the query first; add indexes second, buy better hardware third, and if that doesn't work, you may consider a design that makes the application harder to maintain (often called denormalization).
Your second solution will almost certainly be slower for most cases.
Relational databases are really, really fast when searching by indexed fields; searching for "email like 'Joe#gmail.com'" will be too fast to measure even on a database with tens of millions of records.
However, including the logic to find the right table in which to search will almost certainly be slower than searching in all the tables.
Especially if you want to search by things other than email address - imagine finding all the users who signed up in the last week. Or who have permission to do a certain thing in your application. Or who have a #gmail.com account.
So, the second solution is bad from a design/maintenance point of view, and will almost certainly be slower.
The first one is better. With the second you will have to write extra logic to find out which table to start looking in. To speed up the search you can add indexes. I assume you will do equality lookups more often than less-than or greater-than operations, so you could try a HASH index; for comparison operations a B-Tree is better.
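As a minimal sketch of the single-table approach with an index on email (the column types and sizes are assumptions):

CREATE TABLE userinfo (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    firstname VARCHAR(50)  NOT NULL,
    lastname  VARCHAR(50)  NOT NULL,
    email     VARCHAR(255) NOT NULL,
    password  VARCHAR(255) NOT NULL,   -- store a password hash, never plain text
    phone     VARCHAR(20),
    UNIQUE KEY idx_email (email)       -- equality lookups on email go through this index
);

-- Fast even with millions of rows, because the lookup uses the index:
SELECT id, firstname, lastname FROM userinfo WHERE email = 'joe@example.com';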
Like others said, the first one is better, especially if you need to add other tables to your database and link them to the users table; the second approach soon becomes impossible to work with and to create relationships for as the number of tables increases.

Database choices for big data [closed]

I have many text files; their total size is about 300 GB ~ 400 GB. They are all in this format:
key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e
....
Each line is composed of a key and a value. I want to create a database which lets me query all the values of a key. For example, when I query key1, then value_a, value_b and value_c are returned.
First of all, inserting all these files into the database is a big problem. I tried inserting a few-GB chunks into a MySQL MyISAM table with the LOAD DATA INFILE syntax, but it appears MySQL can't utilize multiple cores for inserting data. It's as slow as hell. So I think MySQL is not a good choice here for so many records.
Also, I need to update or recreate the database periodically - weekly, or even daily if possible - therefore insertion speed is important for me.
It's not possible for a single node to do the computing and insertion efficiently; to be efficient, I think it's better to perform the insertion on different nodes in parallel.
For example,
node1 -> compute and store 0-99999.txt
node2 -> compute and store 10000-199999.txt
node3 -> compute and store 20000-299999.txt
....
So, here comes the first criterion.
Criteria 1. Fast insertion speed in distributed batch manner.
Then, as you can see in the text file example, the database needs to allow the same key to map to multiple different values, just like key1 maps to value_a/value_b/value_c in the example.
Criteria 2. Multiple values per key are allowed
Then, I will need to query keys in the database. No relational or complex join query is required; all I need is simple key/value querying. The important part is that one key can map to multiple values.
Criteria 3. Simple and fast key value querying.
I know there are HBase/Cassandra/MongoDB/Redis... and so on, but I'm not familiar with all of them and not sure which one fits my needs. So, the question is: what database should I use? If none of them fits my needs, I even plan to build my own, but that takes effort :/
Thanks.
There are probably a lot of systems that would fit your needs. Your requirements make things pleasantly easy in a couple ways:
Because you don't need any cross-key operations, you could use multiple databases, dividing keys between them via hash or range sharding. This is an easy way to solve the lack of parallelism that you observed with MySQL and probably would observe with a lot of other database systems.
Because you never do any online updates, you can just build an immutable database in bulk and then query it for the rest of the day/week. I'd expect you'd get a lot better performance this way.
I'd be inclined to build a set of hash-sharded LevelDB tables. That is, I wouldn't use an actual leveldb::DB, which supports a more complex data structure (a stack of tables and a log) so that you can do online updates; instead, I'd directly use leveldb::Table and leveldb::TableBuilder objects (no log, only one table for a given key). This is a very efficient format for querying. And if your input files are already sorted like in your example, the table building will be extremely efficient as well.

You can achieve whatever parallelism you desire by increasing the number of shards - if you're using a 16-core, 16-disk machine to build the database, use at least 16 shards, all generated in parallel. If you're using 16 16-core, 16-disk machines, at least 256 shards. If you have a lot fewer disks than cores, as many people do these days, try both, but you may find fewer shards are better to avoid seeks.

If you're careful, I think you can basically max out the disk throughput while building tables, and that's saying a lot, as I'd expect the tables to be noticeably smaller than your input files due to the key prefix compression (and optionally Snappy block compression). You'll mostly avoid seeks because, aside from a relatively small index that you can typically buffer in RAM, the keys in the leveldb tables are stored in the same order as you read them from the input files, assuming again that your input files are already sorted. If they're not, you may want enough shards that you can sort a shard in RAM then write it out, perhaps processing shards more sequentially.
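If you stayed in MySQL instead, a rough analogue of this build-in-bulk-then-query idea is to load each batch into a fresh table and atomically swap it into place (a sketch with hypothetical names; the engine choice, delimiter and key layout are assumptions):

-- Build the new dataset off to the side.
CREATE TABLE kv_new (
    k VARCHAR(64)  NOT NULL,
    v VARCHAR(255) NOT NULL,
    KEY idx_k (k)              -- non-unique: one key maps to many values
) ENGINE=MyISAM;

-- Bulk-load the input files (adjust FIELDS TERMINATED BY to match the real delimiter).
LOAD DATA INFILE '/data/part-00000.txt' INTO TABLE kv_new
    FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (k, v);

-- Atomically swap the freshly built table in (assumes kv exists from the previous build).
RENAME TABLE kv TO kv_old, kv_new TO kv;

-- Readers keep querying kv and never see a half-built table.
SELECT v FROM kv WHERE k = 'key1';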
I would suggest using SSDB (https://github.com/ideawu/ssdb), a leveldb server that is suitable for storing collections of data.
You can store the data in maps:
ssdb->hset(key1, value1)
ssdb->hset(key1, value2)
...
list = ssdb->hscan(key1, 1000);
// now list = [value1, value2, ...]
SSDB is fast (about half the speed of Redis, 30,000 insertions per second). It is a network wrapper around leveldb, with one-line installation and startup. Its clients include PHP, C++, Python, Java, Lua, ...
The traditional answer would be to use Oracle if you have the big bucks, or PostgreSQL if you don't. However, I'd suggest you also look at solutions like MongoDB, which I found to be blazing fast and which will also accommodate a scenario where your schema is not fixed and can change across your data.
Since you are already familiar with MySQL, I suggest trying all MySQL options before moving to a new system.
Many bigdata systems are tuned for very specific problems but don't fare well in areas that are taken for granted in an RDBMS. Also, most applications need regular RDBMS features alongside bigdata features. So moving to a new system may create new problems.
Also consider the software ecosystem, community support and knowledge base available around the system of your choice.
Coming back to the solution: how many rows will there be in the database? This is an important metric. I am assuming more than 100 million.
Try partitioning. It can help a lot. The fact that your select criteria are simple and you don't require joins only makes things better.
Postgres has a nice way of handling partitions. It requires more code to get up and running but gives an amazing control. Unlike MySQL, Postgres does not have a hard limit on number of partitions. Partitions in Postgres are regular tables. This gives you much more control over indexing, searching, backup, restore, parallel data access etc.
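For what it's worth, recent PostgreSQL versions (11 and later) also offer declarative partitioning, which needs far less code than the older inheritance approach; a sketch for the key/value case (names are hypothetical):

CREATE TABLE kv (
    k TEXT NOT NULL,
    v TEXT NOT NULL
) PARTITION BY HASH (k);

CREATE TABLE kv_p0 PARTITION OF kv FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE kv_p1 PARTITION OF kv FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE kv_p2 PARTITION OF kv FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE kv_p3 PARTITION OF kv FOR VALUES WITH (MODULUS 4, REMAINDER 3);

CREATE INDEX ON kv (k);   -- creates a matching index on each partition

-- COPY is the fast bulk-load path in Postgres (tab-delimited input assumed;
-- use WITH (DELIMITER ' ') if the files are space-separated).
COPY kv (k, v) FROM '/data/part-00000.txt';

SELECT v FROM kv WHERE k = 'key1';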
Take a look at HBase. You can store multiple values against a key by using columns. Unlike an RDBMS, you don't need a fixed set of columns in each row; a row can have an arbitrary number of columns. Since you query data by a key (row key in HBase parlance), you can retrieve all the values for a given key by reading the values of all the columns in that row.
HBase also has the concept of a retention period, so you can decide which columns live for how long. Hence, the data can get cleaned up on its own, on an as-needed basis. There are some interesting techniques people have employed to utilize retention periods.
HBase is quite scalable, and supports very fast reads and writes.
InfoBright may be a good choice.

MySQL: multiple tables or one table with many columns? [closed]

So this is more of a design question.
I have one primary key (say the user's ID), and I have tons of information associated with that user.
Should I have multiple tables broken down into categories according to the information, or should I have just one table with many columns?
The way I used to do it was to have multiple tables, so say, one table for application usage data, one table for profile info, one table for back end tokens etc. to keep things looking organized.
Recently someone told me that it's better not to do it that way and that having a table with lots of columns is fine. The thing is, all those columns have the same primary key.
I'm pretty new to database design so which approach is better and what are the pros and cons?
What's the conventional way of doing it?
Any time information is one-to-one (each user has one name and password), it's probably better to have it in one table, since that reduces the number of joins the database will need to do to retrieve results. I think some databases have a limit on the number of columns per table, but I wouldn't worry about it in normal cases, and you can always split the table later if you need to.
If the data is one-to-many (each user has thousands of rows of usage info), then it should be split into separate tables to reduce duplicate data (duplicate data wastes storage space, cache space, and makes the database harder to maintain).
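As a sketch of that rule of thumb (the table and column names are made up):

-- One-to-one data lives in a single row per user:
CREATE TABLE users (
    user_id  INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(100) NOT NULL,
    password VARCHAR(255) NOT NULL   -- a password hash in practice
);

-- One-to-many data gets its own table keyed back to the user:
CREATE TABLE usage_log (
    log_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id   INT NOT NULL,
    action    VARCHAR(50) NOT NULL,
    logged_at DATETIME NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Profile info needs no join; usage history is a single join away:
SELECT u.name, l.action, l.logged_at
FROM users u
JOIN usage_log l ON l.user_id = u.user_id
WHERE u.user_id = 1;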
You might find the Wikipedia article on database normalization interesting, since it discusses the reasons for this in depth:
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Denormalization is also something to be aware of, because there are cases where repeating data is better (since it reduces the amount of work the database needs to do when reading data). I'd highly recommend making your data as normalized as possible to start out, and only denormalize if you're aware of performance problems in specific queries.
One big table is often a poor choice. Related tables are what relational database were designed to work with. If you index properly and know how to write performant queries, they are going to perform fine.
When tables get too many columns, you can run into issues with the actual size of the page on which the database stores the information. Either the record ends up being too large for the page, in which case you may not be able to create or update a specific record (which makes users unhappy), or you may (in SQL Server at least) be allowed some overflow for particular datatypes (with a set of rules you need to look up if you are doing this), but if many records overflow the page size you can create tremendous performance problems. How MySQL handles pages, and whether you have a problem when the potential page size gets too large, is something you would have to look up in the documentation for that database.
Came across this, and as someone who used to use MySQL a lot, and then switched over to Postgres recently, one of the big advantages is that you can add JSON objects to a field in Postgres.
So if you are in this situation, you don't necessarily have to decide between one large table with many columns and splitting it up: you can merge columns into a JSON object to reduce the count, e.g. instead of the address being 5 columns it can be just one. You can also query inside that object.
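A sketch of that idea in PostgreSQL (the column names are hypothetical; MySQL 5.7+ has a similar JSON type with different operators):

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    TEXT NOT NULL,
    address JSONB      -- street, city, zip, country folded into one column
);

INSERT INTO users VALUES
    (1, 'Alice', '{"street": "1 Main St", "city": "Springfield", "zip": "12345"}');

-- Querying inside the JSON object:
SELECT name FROM users WHERE address ->> 'city' = 'Springfield';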
I have a good example: an overly normalized database with the following set of relationships:
people -> rel_p2staff -> staff
and
people -> rel_p2prosp -> prospects
where people holds names and personal details, staff holds just the staff record details, prospects holds just prospect details, and the rel tables are relationship tables with foreign keys from people linking to staff and prospects.
This sort of design carries on through the entire database.
Now, querying this set of relations means a multi-table join every time, sometimes joining 8 or more tables. It had been working fine up to the middle of this year, when it started getting very slow now that we are past 40,000 people records.
Indexing and all the low-hanging fruit were used up last year, and all queries are optimized to perfection. This is the end of the road for this particular normalized design, and management has now approved a rebuild of the entire application that depends on it, as well as a restructuring of the database, over a term of 6 months. $$$$ Ouch.
The solution will be to have direct relations for people -> staff and people -> prospects.
Ask yourself these questions: if you put everything in one table, will you have multiple rows for that user? If you have to update a user, do you want to keep an audit trail? Can the user have more than one instance of a data element (a phone number, for instance)? Will you have a case where you might want to add an element or set of elements later?
If you answer yes, then most likely you want to have child tables with foreign key relationships.
Pros of parent/child tables are data integrity, performance via indexes (yes, you can index a flat table too) and, IMO, easier maintenance if you need to add a field later, especially if it will be a required field.
Cons: the design is harder and queries become slightly more complex.
But, there are many cases where one big flat table will be appropriate so you have to look at your situation to decide.
I've already done some database design. For me, it depends on the complexity of the system and its database management; yes, it is true that unique data should live in one place only, but it is really hard to write queries against an overly normalized database with lots of records. Just combine the two approaches: use one huge table if you feel you'll have a massive number of records that are hard to maintain (just like Facebook, Gmail, etc.), and use a different table per set of records for a simple system... well, this is just my opinion. I hope it helps. Just do it, you can do it... :)
The conventional way of doing this would be to use different tables, as in a star schema or snowflake schema. However, I would make this strategy two-fold. I believe in the theory that data should only exist in one place, therefore the schema I mentioned would work well. However, I also believe that for reporting engines and BI suites a columnar approach would be hugely beneficial, because it is more supportive of reporting needs. Columnar approaches like those from infobright.org bring huge performance gains and compression that make using both approaches incredibly useful. A lot of companies are starting to realize that having just one database architecture in the organization does not support the full range of their needs, and are implementing the concept of having more than one database architecture.
I think having a single table is more effective, but you should make sure that the table is organised in a manner that shows the relationships, trends, and differences between variables in the same row.
For example, if the table shows the age and grades of students, you should arrange the table so that the highest scorer is clearly differentiated from the lowest scorer and the difference in the students' ages is evident.

Ways to optimize MySQL table [closed]

I have a big, big table; its size is around 130 GB. Every day data is dumped into the table.
I'd like to optimize the table... Can anyone suggest how I should go about it?
Any input will be a great help.
It depends how you are trying to optimize it.
For querying speed, appropriate indexes including multi-column indexes would be a very good place to start. Do explains on all your queries to see what is taking up so much time. Optimize the code that's reading the data to store it instead of requerying.
If old data is less important or you're getting too much data to handle, you can rotate tables by year, month, week, or day. That way the data writing is always to a pretty minimal table. The older tables are all dated (ie tablefoo_2011_04) so that you have a backlog.
If you are trying to optimize size in the same table, make sure you are using appropriate types. If you get variable length strings, use a varchar instead of statically sized data. Don't use strings for status indicators, use an enum or int with a secondary lookup table.
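A small sketch of those two ideas together (the table, its columns and the status values are made up):

-- Status as a compact ENUM instead of a free-form string,
-- plus a multi-column index matching a common query pattern.
CREATE TABLE orders (
    order_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    status      ENUM('new', 'paid', 'shipped', 'cancelled') NOT NULL,
    note        VARCHAR(255),               -- variable-length text: VARCHAR, not CHAR(255)
    created_at  DATETIME NOT NULL,
    KEY idx_customer_status (customer_id, status, created_at)
);

-- EXPLAIN shows whether the composite index is actually used:
EXPLAIN SELECT order_id, created_at
FROM orders
WHERE customer_id = 42 AND status = 'shipped'
ORDER BY created_at DESC;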
The server should have a lot of ram so that it's not going to disk all the time.
You can also look at using a caching layer such as memcached.
More information about what the actual problem is, your situation, and what you are trying to optimize for would be helpful.
If your table is a sort of logging table, there are several strategies for optimizing it.
(1) Store essential data only.
If there are non-essential (nullable) columns that are not used for aggregation or analytics, store them in another table and keep the main table smaller.
Ex) Don't store the raw HTTP_USER_AGENT string. Preprocess the agent string and store only the smaller piece of data you actually want to review.
(2) Make the table as fixed format.
Use CHAR rather than VARCHAR for almost-fixed-length strings. This helps speed up SELECT queries.
Ex) ip VARCHAR(15) => ip CHAR(15)
(3) Summarize old data and dump them into other table periodically.
If you don't have to review all the data every day, divide it into periodic tables (year/month/day) and store summarized data for the old ones.
Ex) Table_2011_11 / Table_2011_11_28
(4) Don't use too many indexes for big table.
Too many indexes cause a heavy load for INSERT queries.
(5) Use ARCHIVE engine.
MySQL has an ARCHIVE engine, which uses zlib for data compression.
http://dev.mysql.com/doc/refman/5.0/en/archive-storage-engine.html
It fits logging generally (AFAIK); the lack of ORDER BY, REPLACE, DELETE and UPDATE is not a big problem for logging.
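Putting (2) and (5) together, a sketch of a lean log table (the columns are assumptions, not from the question):

-- Compact, mostly fixed-width log table on the ARCHIVE engine.
-- ARCHIVE compresses rows with zlib and only allows INSERT and SELECT,
-- which is usually fine for append-only logs.
CREATE TABLE access_log_2011_11 (
    logged_at DATETIME NOT NULL,
    ip        CHAR(15) NOT NULL,            -- fixed width instead of VARCHAR(15)
    status    SMALLINT UNSIGNED NOT NULL,
    url_id    INT UNSIGNED NOT NULL         -- lookup id instead of the raw URL string
) ENGINE=ARCHIVE;

INSERT INTO access_log_2011_11 VALUES (NOW(), '192.0.2.10', 200, 12345);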
You should show us what your SHOW CREATE TABLE tablename outputs so we can see the columns, indexes and so on.
From the look of things, it seems MySQL's partitioning is what you need to implement in order to increase performance further.
A few possible strategies.
If the dataset is so large, it may be useful to store certain information redundantly: keep cache tables if certain records are accessed much more frequently than others, denormalize information (either to limit the number of joins or to create tables with fewer columns, so you have a lean table to keep in memory at all times), or keep summaries for fast lookup of totals.
The summary table(s) can be kept in sync either by periodically regenerating them or by using triggers, or even by combining both: a cache table for the latest day, from which you can calculate actual totals, plus summaries for the historical data. That gives you full precision while not requiring a read of the full index. Test to see what delivers the best performance in your situation.
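A sketch of the trigger-maintained summary idea (the events table and its created_at column are hypothetical):

-- Small table of daily totals kept next to a large events table.
CREATE TABLE daily_totals (
    day         DATE PRIMARY KEY,
    event_count BIGINT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER events_after_insert AFTER INSERT ON events
FOR EACH ROW
BEGIN
    INSERT INTO daily_totals (day, event_count)
    VALUES (DATE(NEW.created_at), 1)
    ON DUPLICATE KEY UPDATE event_count = event_count + 1;
END//
DELIMITER ;

-- Reports read the small summary table instead of scanning the big one:
SELECT day, event_count FROM daily_totals WHERE day >= CURDATE() - INTERVAL 30 DAY;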
Splitting your table by periods is certainly an option. It's like partitioning, but the Mayflower Blog advises doing it yourself, as the MySQL implementation seems to have certain limitations.
Additionally: if the data in those historical tables is never changed and you want to reduce space, you could use myisampack. Indexes are supported (you have to rebuild them) and performance gains are reported, but I suspect you would gain speed on reading individual rows while facing decreasing performance on large reads (as lots of rows need unpacking).
And last: think about what you need from the historical data. Does it need the exact same information you have for more recent entries, or are there things that just aren't important anymore? I could imagine, if you have an access log for example, that it stores all sorts of information like IP, referral URL, requested URL, user agent... Perhaps in 5 years' time the user agent isn't interesting to know at all; it's fine to combine all requests from one IP for one page + CSS + JavaScript + images into one entry (perhaps with a separate many-to-one table for the precise files), and the referral URLs only need a count of occurrences and can be decoupled from the exact time or IP.
Don't forget to consider the speed of the medium on which the data is stored. I think you can use RAID disks to speed up access, or maybe store the table in RAM, but at 130 GB that might be a challenge! Then consider the processor too. I realise this isn't a direct answer to your question, but it may help achieve your aims.
You can still try to do partitioning using tablespaces or "table-per-period" structure as #Evan advised.
If your fulltext searching is failing, maybe you should move to Sphinx/Lucene/Solr. External search engines can definitely help you get faster results.
If we are talking about table structure, then you should use the smallest datatypes possible.
If OPTIMIZE TABLE is too slow, and it is for really big tables, you can back up the table and restore it. Of course, in that case you will need some downtime.
As a bottom line:
if your issue concerns fulltext searching, then before applying any table changes try to use external search engines.