I created a table that has 30 columns.
CREATE TABLE "SETTINGS" (
"column1" INTEGER PRIMARY KEY,
...
...
"column30"
)
However, I could group them and create separate tables that have foreign keys to the primary table. Which is the best way to go? Or is the number of columns small enough that it makes no difference which way I choose?
It depends on the data and on the queries you run most often.
Best for one big table
If you almost always need to retrieve all of the columns
If you need to update many fields at the same time
If all, or nearly all, of the fields have non-NULL values
Best for many little tables
If the data are "sparse" it means not many columns have values you can imagine to split them in different tables and create a record in a child table only if not null values exists
If you retrieve only a few related fields at a time
If you update only a few related fields at a time
Better names for each column (for example, instead of domicile_address and residence_address you can have a single column named address in each of two tables, residences and domiciles)
The catch is that, in general, either solution can work depending on the situation; a usage analysis must be done to choose the right one.
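As a hedged sketch of the split approach (the column grouping and child-table name below are invented for illustration; only column1 comes from the original DDL), the second option could look like this:

-- Primary table keeps the columns read on almost every access.
CREATE TABLE "SETTINGS" (
    "column1" INTEGER PRIMARY KEY,
    "column2" INTEGER,
    "column3" INTEGER
);

-- Child table holds a rarely populated group; a row exists only
-- when at least one of its values is non-NULL.
CREATE TABLE "SETTINGS_EXTRA" (
    "settings_id" INTEGER PRIMARY KEY REFERENCES "SETTINGS"("column1"),
    "column20"    INTEGER,
    "column21"    INTEGER
);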
If they really are named column1, column2....column30 then that's a fairly good indicator that your data is not normalized.
Following the rules of normalization should always be your starting point. There are sometimes reasons for breaking the rules - but that comes after you know what your data should look like.
Regarding breaking the rules: there are two potential scenarios which may apply here (after you've worked out the correct structure and done an appropriate level of testing with an appropriate volume of data):
Joins are expensive. Holding parent/child relations in a single table can improve query performance where you routinely select parent and child together and retrieve individual rows.
Unless you are using fixed-width MyISAM tables, updating records can change their size, which means they have to be relocated in the table's data file. This is expensive and can have a knock-on effect on retrieval.
Related
I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. The table already has 9 columns, and I want to add a new boolean column to it.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field, which I expect to be low volume: less than 10% of rows will have it set to true. I know I can make it default NULL, and a boolean column should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and reference the record with a foreign key when I have additional data to add? How do the pros handle this in database design?
The most usable setup is to have all the data on the record, so that any query can read or compute from the table itself without joins. I just don't have confidence that this will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
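For the low-volume boolean in particular, one alternative worth sketching (table and column names here are invented for illustration) is a narrow side table that holds only the rows where the flag is true, so the wide table never grows:

-- Presence of a row means "true"; absence means "false".
CREATE TABLE big_table_flagged (
    big_table_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    FOREIGN KEY (big_table_id) REFERENCES big_table(id)
);

-- "Is this row flagged?" becomes a LEFT JOIN (or an EXISTS check):
SELECT b.*, f.big_table_id IS NOT NULL AS is_flagged
FROM big_table b
LEFT JOIN big_table_flagged f ON f.big_table_id = b.id;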
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.
I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
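For example (with invented column names), instead of SELECT * you list just what the query actually uses:

-- Only the named columns are returned; the other 60+ stay untouched.
SELECT id, title, updated_at
FROM wide_table
WHERE id = 123;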
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, because MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL may write temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck; it happens when the query involves binary data (TEXT or BLOB columns).
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
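A minimal sketch of vertical partitioning, assuming an invented wide articles table: the frequently read columns stay in one table, and the bulky, rarely read ones move to a 1:1 companion table sharing the same key:

CREATE TABLE articles (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title     VARCHAR(200) NOT NULL,
    author_id INT UNSIGNED NOT NULL
);

-- Bulky TEXT columns live apart, so typical queries never touch them.
CREATE TABLE articles_body (
    article_id INT UNSIGNED NOT NULL PRIMARY KEY,
    body       TEXT NOT NULL,
    FOREIGN KEY (article_id) REFERENCES articles(id)
);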
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
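For example, the first rule (no repeating groups) turns columns like phone1, phone2, phone3 into rows in a child table. A hedged sketch with invented names:

-- Instead of customers(id, name, phone1, phone2, phone3):
CREATE TABLE customers (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- One row per phone number, however many a customer has.
CREATE TABLE customer_phones (
    customer_id INT UNSIGNED NOT NULL,
    phone       VARCHAR(20)  NOT NULL,
    PRIMARY KEY (customer_id, phone),
    FOREIGN KEY (customer_id) REFERENCES customers(id)
);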
That's not a problem as long as all attributes belong to the same entity and do not depend on each other.
To make life easier, you can have one text column with a JSON document stored in it, provided you don't mind retrieving all the attributes every time. However, this largely defeats the purpose of storing the data in an RDBMS and greatly complicates every database transaction, so it is not a recommended approach to follow throughout the database.
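As a sketch of what that looks like in MySQL 5.7+ (names invented; the native JSON type and JSON_EXTRACT exist there, while older versions would need a plain TEXT column):

CREATE TABLE settings_json (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    attrs JSON NOT NULL
);

INSERT INTO settings_json (attrs)
VALUES ('{"theme": "dark", "timeout": 30}');

-- Individual attributes can still be read, but without ordinary
-- column indexes, which is part of the trade-off described above.
SELECT JSON_EXTRACT(attrs, '$.timeout') FROM settings_json;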
Having too many columns in the same table can cause huge problems with replication as well. Bear in mind that changes on the master are replicated to the slave; with row-based replication, for example, updating one field in the table means the whole row is written to the binary log and shipped across.
I have a basic question about database design.
I have a lot of files that I have to read and insert into the database. Each file has a few thousand lines, and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserting to increase insert speed (in the end I have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and performance is important to me.
Now that I have each line's fields parsed and ready to insert, I see three approaches:
1- One big table:
In this case I can create one big table with 30 columns and store all of the files' fields in it. That leaves a table of huge size on which I want to run a lot of analysis.
2- A fairly large table (A) and some little tables (B)
In this case I can create some little tables consisting of the columns whose records are largely identical when separated from the other columns. These little tables hold only a few hundred or a few thousand records instead of 30 million. In the fairly large table (A), I omit the columns that I moved into the other tables and use foreign keys instead. In the end I have a table (A) with 20 columns and 30 million records, and some tables (B) with 2-3 columns and 100-50,000 records each. To analyze table A, I then have to use joins, for example in SELECTs.
3- Just a fairly large table
In this case I can create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I keep a mapping between source and destination values (something like a foreign key, with a small difference). For example, take three columns c1, c2, c3 that in case 2 I would put into another table B and access via a foreign key; here I assign a specific number to each distinct (c1, c2, c3) combination at insert time and store the relation between the record and its assigned value in the application code. The table is otherwise exactly like table A in case 2, but there is no need for joins in SELECTs.
While insert time is important, the analysis time is more important to me, so I would like to know which of these cases you think is better; I would also be glad to see other solutions.
From a design perspective, 30 to 40 million rows is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table in a separate filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire dataset in one shot, you could consider partitioning the data. Use a partition scheme based on your needs; for example, splitting the data into yearly partitions (as sketched below) will help if your analysis is limited to a year's worth of data.
The major considerations are denormalization/normalization based on your needs and, of course, clustered/non-clustered indexing of the data. Again, this will depend on what sort of analysis queries you will run.
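A hedged illustration of yearly partitions using MySQL syntax (the table and columns are invented; SQL Server expresses the same idea with partition functions and schemes instead):

CREATE TABLE measurements (
    id          BIGINT UNSIGNED NOT NULL,
    recorded_at DATE NOT NULL,
    value       INT,
    PRIMARY KEY (id, recorded_at)  -- the partition column must be part of every unique key
)
PARTITION BY RANGE (YEAR(recorded_at)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A query restricted to one year only touches that year's partition.
SELECT COUNT(*) FROM measurements
WHERE recorded_at BETWEEN '2023-01-01' AND '2023-12-31';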
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For grouping, summing, etc., the question is about SELECTs. For "analytics", the question is not one of table structure: have a single table for the raw data, plus one or more "summary tables" to make the SELECTs really fast for your typical queries.
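A minimal sketch of the summary-table idea, with invented table and column names: raw rows stay in one big table, and a small aggregate table is refreshed from it for the typical analytic queries:

-- One aggregate row per (day, metric) instead of millions of raw rows.
CREATE TABLE daily_summary (
    day       DATE NOT NULL,
    metric    VARCHAR(50) NOT NULL,
    row_count BIGINT UNSIGNED NOT NULL,
    total     BIGINT NOT NULL,
    PRIMARY KEY (day, metric)
);

-- Populate (or periodically re-populate) from the raw table:
INSERT INTO daily_summary (day, metric, row_count, total)
SELECT DATE(created_at), metric, COUNT(*), SUM(value)
FROM raw_data
GROUP BY DATE(created_at), metric;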
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
Imagine if we had millions of rows in Table A.
For each large row (10+ columns) of Table A, we might have 20+ rows that are exact duplicates except for a single column where we store an ID for Table B.
Would it be more EFFICIENT and/or MEMORY-SAVING to store the IDs for Table B in a text field in Table A ("B_ID1|B_ID2|B_ID3", etc.), return that data client-side, parse it, and then send a second query for the actual data from Table B?
This assumes we have 2+ million rows of unique data in Table A; if we stored that additional column outside the text field, we would add 2 million × 20+ rows to that table, with all the extra wasted space that implies.
Or am I very naive in my approach and understanding of SQL? I literally just started using it like a week ago and taught myself the basics for my app.
This is where a weak entity (table) is best used.
Instead of duplicating all the data in table A, you simply create a new table that links A to B. In it, you need only the ID from table A paired with each of the several IDs from table B (and you set the primary key to be the pair of foreign keys).
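A hedged sketch of that linking table (all names invented):

CREATE TABLE a_to_b (
    a_id INT UNSIGNED NOT NULL,
    b_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),  -- the pair of foreign keys is the primary key
    FOREIGN KEY (a_id) REFERENCES table_a(id),
    FOREIGN KEY (b_id) REFERENCES table_b(id)
);

-- All B rows for one A row in a single query, with no client-side parsing:
SELECT b.*
FROM a_to_b ab
JOIN table_b b ON b.id = ab.b_id
WHERE ab.a_id = 42;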
If you find yourself duplicating a lot of data across multiple rows, it may indicate that your database isn't normalized (http://en.wikipedia.org/wiki/Database_normalization).
This means that you might be able to break it into multiple smaller tables that reference each other to avoid data duplication.
SQL provides the ability to index your table in a variety of ways. I'm not an expert on big data, but my first hunch would be no. Having an auto-incrementing, indexed primary key lets the SQL server maintain the list of records in a way that makes it easy to look up the info you need.
The real question comes down to how you need to parse and interact with these 2 million-odd rows. Is it a bunch of split document info? User profiles? Is it real-time input from some hardware device? Context is key to determining whether SQL is even the best way to approach the problem.
Can you give us a little context on what sort of project you're theorizing? Or is this a more hypothetical question?
UPDATE: Check out W3 Schools for a brief intro to SQL concepts (among other coding references)
I've got one table, Table1, with around 20 columns. Half of these columns are string values and the rest are integers. My question is simple: is it better to have all the columns in one table, or to distribute them across 2, 3, or even 4 tables? If I split them, I'd have to join them back using LEFT JOIN.
What's the best choice?
Thanks
The question of "best" depends on how the table is being used. So, there is no real answer to the question. I can say that 20 columns is not a lot and many very reasonable tables have more than 20 columns of mixed types.
First observation: If you are asking such a question, you have some knowledge of SQL but not in-depth knowledge. One table is almost certainly the way to go.
What might change this advice? If many of the integer columns are NULL -- say, 90% of the records have all of them NULL -- then those NULL values are probably just wasting space on the data page. By moving those columns into another table and storing a row there only when values exist, you would reduce the size of the data.
The same is true of the string values, but with a caveat: whereas the integers occupy at least 4 bytes, variable-length strings might be even smaller (depending on exactly how the database stores them).
Another consideration is how the data is typically used. If queries usually touch just a handful of columns, then storing each column in a separate table could be beneficial. To be honest, though, the overhead of the key column generally overwhelms any savings, and such a structure is really lousy for updates, inserts, and deletes.
However, this becomes quite practical in a columnar database such as ParAccel, Amazon Redshift, or Vertica. Such databases have built-in support for this type of splitting, and it can have remarkable effects on performance.
Answering this with an example for a users table:
1) `users` - id, name, dob, city, zipcode etc.
2) `users_products` - id, user_id(FK), product_name, product_validity,...
3) `users_billing_details` - id, user_id(FK to `users`), billing_name, billing_address..
4) `users_friends` - id, user_id(FK to `users`), friend_id(FK to same table `users`)
Hence, if you have many relations, use a MANY-to-MANY relationship; if you have only a few, go with the same table. It all depends on your structure and requirements.
SUGGESTION - Many-to-Many makes your data structure more flexible.
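For instance, the self-referencing users_friends table above could be declared like this (a sketch; the column types are assumptions):

CREATE TABLE users_friends (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id   INT UNSIGNED NOT NULL,
    friend_id INT UNSIGNED NOT NULL,
    UNIQUE KEY uq_user_friend (user_id, friend_id),  -- one row per pair
    FOREIGN KEY (user_id)   REFERENCES users(id),
    FOREIGN KEY (friend_id) REFERENCES users(id)     -- FK back to the same users table
);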
You can have 20 columns in 1 table. Nothing wrong with that. But then are you sure you are designing the structure properly?
Could some of these data change significantly in the future?
Is the table trying to encapsulate a single activity or entity?
Does the table have a singular meaning with respect to the domain or does it encapsulate multiple entities?
Could the structure be simplified into smaller tables having singular meaning for each table and then "Relationships" added via primary key/foreign keys?
These are some of the questions you take into consideration while designing a database.
Once you find the answers to these questions, you will know for yourself whether you should have a single table or multiple tables.