how do you handle outliers of different sizes in different columns? - data-analysis

I am analyzing dailyActivity_merged.csv and sleepDay_merged.csv from the Fitabase data. When cleaning the data, I checked the data for outliers. And as you can see, there are many outliers of different sizes in different columns. I'm afraid to delete them column by column so as not to alter the results. How should I handle these outliers? Is it necessary to delete them?
[Image: boxplot of different columns]
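One way to look at the question without deleting anything is to flag values per column first and decide afterwards. The sketch below applies the common 1.5 x IQR boxplot rule to each column separately and only records which rows fall outside the fences. The sample numbers are made up, the column names merely imitate the Fitabase files, and the choice of the IQR rule is an assumption, not something stated in the question.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences for one column using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(columns, k=1.5):
    """Map each column name to the row indexes that fall outside its fences."""
    flags = {}
    for name, values in columns.items():
        low, high = iqr_bounds(values, k)
        flags[name] = [i for i, v in enumerate(values) if v < low or v > high]
    return flags


# Hypothetical sample data; the real files have many more rows and columns.
columns = {
    "TotalSteps": [4000, 5200, 6100, 7000, 7300, 8100, 36000],
    "TotalMinutesAsleep": [380, 400, 410, 420, 430, 450, 60],
}

flags = flag_outliers(columns)
print(flags)
```

Because each column gets its own fences, outliers of very different magnitudes are handled on their own scale, and no row is removed until you have inspected the flagged values.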

Related

When to add a column vs adding a related table?

I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. I already had 9 columns on this table. I want to add a new boolean column to it. Below is the current state.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning less than 10% of rows will have it set to true. I know I can make it default to null, and it is a boolean column, which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table with a foreign key referencing the record when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all data on the record so any form of a query can get or calculate on the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
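A minimal sketch of that rule, using Python's built-in sqlite3 module so it runs without a server (the answer is about MySQL, and the DDL there would differ slightly). The table and column names (records, is_flagged, record_tags) are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO records (id, name) VALUES (1, 'a'), (2, 'b');

    -- Single-valued attribute of the same entity: add a column.
    -- Existing rows get NULL, so nothing has to be backfilled.
    ALTER TABLE records ADD COLUMN is_flagged INTEGER;

    -- Multi-valued attribute: one row per value in a separate table.
    CREATE TABLE record_tags (
        record_id INTEGER NOT NULL REFERENCES records(id),
        tag TEXT NOT NULL,
        PRIMARY KEY (record_id, tag)
    );
    INSERT INTO record_tags (record_id, tag) VALUES (1, 'x'), (1, 'y');
""")

conn.execute("UPDATE records SET is_flagged = 1 WHERE id = 2")

flagged = conn.execute(
    "SELECT id FROM records WHERE is_flagged = 1"
).fetchall()
tags = conn.execute(
    "SELECT tag FROM record_tags WHERE record_id = 1 ORDER BY tag"
).fetchall()
print(flagged, tags)
```

The boolean belongs to the record itself, so it stays on the record; the tags can repeat per record, so they get their own table.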
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.

Table with many columns or many small tables?

I created a table where it has 30 columns.
CREATE TABLE "SETTINGS" (
"column1" INTEGER PRIMARY KEY,
...
...
"column30"
)
However, I could group them into different tables that have foreign keys to the primary table. Which is the better approach? Or is the number of columns small enough that it makes no difference?
It depends on the data and on the queries you run most often.
Best for one big table:
If you always need to extract all the columns
If you need to update many fields at the same time
If all or nearly all of the fields have non-null values
Best for many little tables:
If the data is sparse, meaning that few columns have values, you can split the columns into different tables and create a record in a child table only when non-null values exist
If you extract only a few related fields at a time
If you update only related fields at a time
If you want better names for each column (for example, instead of domicile_address and residence_address you can have a column named address in each of two tables, residences and domiciles)
In general, either solution can work, depending on the situation. A usage analysis must be done to choose the right one.
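The sparse-data case can be sketched as follows, again with Python's sqlite3 module for a self-contained run. It reuses the residences and domiciles example from the answer; the persons table and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Child tables hold a row only when a value actually exists.
    CREATE TABLE residences (
        person_id INTEGER PRIMARY KEY REFERENCES persons(id),
        address TEXT NOT NULL
    );
    CREATE TABLE domiciles (
        person_id INTEGER PRIMARY KEY REFERENCES persons(id),
        address TEXT NOT NULL
    );

    INSERT INTO persons (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO residences (person_id, address) VALUES (1, '1 Main St');
""")

# LEFT JOIN brings the optional data back; missing rows show up as NULL.
rows = conn.execute("""
    SELECT p.name, r.address, d.address
    FROM persons p
    LEFT JOIN residences r ON r.person_id = p.id
    LEFT JOIN domiciles d ON d.person_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)
```

The trade-off is visible in the query: nothing is stored for absent values, but reading everything back takes one join per child table.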
If they really are named column1, column2....column30 then that's a fairly good indicator that your data is not normalized.
Following the rules of normalization should always be your starting point. There are sometimes reasons for breaking the rules - but that comes after you know what your data should look like.
Regarding breaking the rules: there are 2 potential scenarios which may apply here (after you've worked out the correct structure and done an appropriate level of testing with an appropriate volume of data):
Joins are expensive. Holding parent/child relations in a single table can improve query performance where you are routinely selecting both parent and child together and retrieving individual rows.
Unless you are using fixed-width MyISAM tables, updating records can result in them changing size, and hence they have to be relocated in the table data file. This is expensive and can have a knock-on effect on retrieval.

Tables with less rows vs ONE table with MANY Rows

I am creating a test site for many users to take many quizzes. I want to store these results in a table. Each user can take up to 5000 quizzes. My question is: would it be better to make a table for each user and store his results in his own table (QuizID, Score), or would it be better to store ALL the results in ONE table (UserID, QuizID, Score)?
Example
5000 questions PER table * 1000 User Tables
VS
1 Table with 5,000,000 rows for the same 1000 Users.
Also, is there a limit to ROWs or TABLEs a DB can hold?
There is a limit to how much data a table can store. On modern operating systems, this is measured in Terabytes (see the documentation).
There are numerous reasons why you do not want to have multiple tables:
SQL databases are optimized for large tables, not for large numbers of tables. In fact, having large numbers of tables can introduce inefficiencies, because of partially filled data pages.
5,000,000 rows is not very big. If it is, partitioning can be used to improve efficiency.
Certain types of queries are a nightmare when you are dealing with hundreds or thousands of tables. A simple question such as "What is the average number of quizzes per user?" becomes a large effort.
Adding a new user requires adding new tables, rather than just inserting rows in existing tables.
Maintaining the database -- such as adding a column or an index -- becomes an ordeal, rather than a simple statement.
You lose the ability to refer to each user/quiz combination for foreign key purposes. You may not be thinking about it now, but perhaps a user starts taking the same quiz multiple times.
There are certain specialized circumstances where dividing the data among multiple tables might be a reasonable alternative. One example is security requirements, where you are simply not allowed to mix different users' data. Another example would be different replication requirements on different subsets of the data. Even in these cases, it is unlikely that you would have thousands of different tables with the same structure.
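To illustrate the single-table layout, here is a small sketch using Python's sqlite3 module. The table name results and the sample scores are assumptions; the columns follow the (UserID, QuizID, Score) layout from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table for all users; the composite key identifies each result.
    CREATE TABLE results (
        user_id INTEGER NOT NULL,
        quiz_id INTEGER NOT NULL,
        score   INTEGER NOT NULL,
        PRIMARY KEY (user_id, quiz_id)
    );
    INSERT INTO results (user_id, quiz_id, score) VALUES
        (1, 10, 80), (1, 11, 90), (1, 12, 70),
        (2, 10, 60);
""")

# "What is the average number of quizzes per user?" is a single statement.
avg_quizzes = conn.execute("""
    SELECT AVG(n)
    FROM (SELECT COUNT(*) AS n FROM results GROUP BY user_id) AS per_user
""").fetchone()[0]
print(avg_quizzes)
```

With one table per user, the same question would need a query against every user table plus code to combine the counts.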
Ideally you should take this approach:
A question table with all the questions and a primary key, question id.
A user table with user details.
A table with a one-to-many relationship holding user id, quiz id, and answer.
You are worried about having many rows in one table, but consider that some users will take only 10-15 quizzes at most. With one table per user, you would end up creating a whole table for 10 rows.

Should frequently accessed tables containing large blobs with one-to-one relationships be normalized and columns split into two tables?

I have a frequently accessed table containing 3 columns of blobs, and 4 columns of extra data that is not used in the query, but just sent as result to PHP. There are 6 small columns (big int, small int, tiny int, medium int, medium int, medium int) that are used in the queries in the WHERE/ORDER BY/GROUP BY.
The server has very little memory, around 1 GB, so the cache is not enough to improve performance on the large table. I've indexed the last 6 small columns, but it doesn't seem to be helping.
Would it be a good solution to split this large table into two?
One table containing the last 6 columns, and the other containing the blobs and extra data, and link it to the previous table with a foreign key that has a one to one relationship?
I'll then run the queries on the small table, and join the little number of rows remaining after filtering to the table with the blobs and extra data to return them to PHP.
Please note, I've already done this, and I managed to decrease the query time from 1.2-1.4 seconds to 0.1-0.2 seconds. However I'm not sure if the solution I've tried is considered good practice, or is even advisable at all?
What you have implemented is sometimes called "vertical partitioning". If you take it to the extreme, then it is the basis for columnar databases, such as Vertica.
As you have observed, such partitioning can dramatically increase query performance. One reason is that less data needs to be read for processing a row of data.
The downside is for updates, inserts, and deletes. With all the data in a single row, these operations are basically atomic -- that is, the operation only affects one row in a data page. (This is not strictly true with blobs, because these are split among multiple pages.)
When you split the data among multiple tables, then you need to coordinate these operations among the tables, so you don't end up with "partial" rows of data.
For a database being used with bulk inserts and lots of querying, this is not a particularly important consideration. Your splitting of separate columns of the data into separate tables is a reasonable approach for improving performance.
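The split described in the question, together with the coordination the answer mentions, might look like the sketch below. It uses Python's sqlite3 module rather than MySQL so it is self-contained, and the names (items, item_payloads, category, sort_key, add_item) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Narrow table: only the small columns used for filtering and sorting.
    CREATE TABLE items (
        id       INTEGER PRIMARY KEY,
        category INTEGER NOT NULL,
        sort_key INTEGER NOT NULL
    );
    CREATE INDEX items_category_sort ON items (category, sort_key);

    -- Wide table: the blobs, linked one-to-one by primary key.
    CREATE TABLE item_payloads (
        item_id INTEGER PRIMARY KEY REFERENCES items(id),
        payload BLOB NOT NULL
    );
""")


def add_item(item_id, category, sort_key, payload):
    """Insert both halves in one transaction so no partial row is left behind."""
    with conn:  # commits on success, rolls back both inserts on error
        conn.execute(
            "INSERT INTO items (id, category, sort_key) VALUES (?, ?, ?)",
            (item_id, category, sort_key),
        )
        conn.execute(
            "INSERT INTO item_payloads (item_id, payload) VALUES (?, ?)",
            (item_id, payload),
        )


add_item(1, 7, 2, b"big blob A")
add_item(2, 7, 1, b"big blob B")
add_item(3, 9, 1, b"big blob C")

# Filter and sort on the narrow table; fetch blobs only for matching rows.
rows = conn.execute("""
    SELECT i.id, p.payload
    FROM items i
    JOIN item_payloads p ON p.item_id = i.id
    WHERE i.category = ?
    ORDER BY i.sort_key
""", (7,)).fetchall()
print(rows)
```

Wrapping the two inserts in a single transaction is what keeps the tables consistent; updates and deletes would need the same treatment.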

How many fields is normal to have in one table?

OK, I am creating a game. I have one table where I save a lot of information about a member, so it has many fields. How many fields is it normal to have in one table? Does it matter? Maybe I should split that info into two, three, or four tables? What do you think?
Normalize the Database
If you feel you have too many columns, you probably have repeating groups, which suggests you should normalize the database. See an example here: Description of the database normalization basics
Hard MySQL Limits
MySQL 5.5 Column Count Limit
Every table has a maximum row size of 65,535 bytes.
There is a hard limit of 4096 columns per table.
Splitting of data into tables should generally not be dictated by the number of columns, but by the nature of the data. The process of splitting a large table into smaller ones is called normalization.
The only other reason I can think of to split a table is, if you may need data in clusters, i.e. you often need columns A-D together or columns E-L, but never all columns or columns D-F, then you can split the table into two tables, one containing columns A-D and the primary key, the other one containing columns E-L and the primary key.
Speaking about limits, MySQL says it's 4096 (source).
I haven't seen tables that big yet, though; even huge data mining tables don't come close.
You shouldn't be concerned about it as long as your database is normalized. As soon as you spot the same data being stored twice (for example, a player table might have a player_type column storing a fixed set of values), it's worth moving that data to a separate table instead of duplicating the information. This can also reduce the column count, but that's only a side effect of normalization, not something that drives it.
I've never personally encountered one with more than 500 columns in it, but short of the maximum sizes there's no reason to have any fewer than the design demands. Beware of doing SELECT * FROM it though.
"Information about a member" is always difficult, but I always separate identifiable information into another table and use a salt key to link the two together. That way it is not as easy to "hijack" usernames, passwords, etc. And you can always use the salt as a session variable rather than the username, password, user ID, or whatever.
Typically I store only an ID, a salt, and the joining date in one table. As I said, I try to "hide" the rest so that it cannot be "linked/hijacked".
Hope this helps.