Minimum and maximum values in feature scaling/normalization? [closed]

I'm fairly new to machine learning and I'm working on preprocessing my training data using linear feature scaling.
My question is, given a .csv file where each column of data represents a feature, with what minX and maxX values should I be normalizing my data?
More specifically, should I be normalizing each feature separately (using the min/max values from each column), normalizing all the data at once (taking the min/max over the entire dataset, i.e. across all features), or normalizing on an input-by-input (row-by-row) basis?

Normalize each feature separately. What you want is to limit the range of each feature to a well-defined interval (e.g. [0, 1]).
Use data from the training set only.
If you use Min-Max scaling you will end up with a smaller standard deviation, which is not necessarily bad. Whether to use Min-Max scaling or standardization (mean 0, standard deviation 1) depends on the application at hand.
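As a minimal sketch of that advice (the X_train/X_test arrays below are placeholders for your own data), per-feature Min-Max scaling with statistics taken from the training set only might look like this in scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: rows are samples, columns are features.
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = MinMaxScaler()                          # scales each column to [0, 1] independently
X_train_scaled = scaler.fit_transform(X_train)   # min/max learned from the training set only
X_test_scaled = scaler.transform(X_test)         # the same training min/max is reused here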

You want all of your features to be in the same range for linear classifiers (and not only them! Also for neural nets!). The reason why you want to scale should be very clear to you before moving forward. Take a look at Andrew Ng's lecture on this subject for an intuitive explanation of what's going on.
Once this is clear, you should have the answer to your question: normalize each feature individually. For example, if you have a table with 3 rows:
row | F1 | F2
1 | 1 | 1000
2 | 2 | 2000
3 | 3 | 3000
You want to scale F1 by taking its max value (3) and its min value (1). You are going to do the same for F2 having 3000 and 1000 as max and min respectively.
This is called MinMax scaling. You can also scale based on mean and variance, or take an entirely different view: you usually have a "budget" in terms of computational resources and want to make the most of it, in which case something like Histogram Equalization might be a good choice.
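To make the arithmetic concrete, here is a small NumPy sketch of MinMax scaling applied column by column to the table above, each column using its own min and max:

import numpy as np

# The table above: columns are F1 and F2.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

col_min = X.min(axis=0)                      # [1, 1000]
col_max = X.max(axis=0)                      # [3, 3000]
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)                              # both columns become [0, 0.5, 1]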
A final note: if you are using decision trees (as a standalone classifier, or in a decision forest or in a boosted ensemble) then don't bother normalizing, it won't change a thing.

Related

I am using MySQL as my database and I have one table containing 60 columns. Is it good to have a table with so many columns? [closed]

I'm building a training-point management feature and I need to save all those points in the database so they can be displayed when needed. I created a table for that feature with 60 columns. Is that good, or can anyone suggest another way to handle it?
It is unusual but not impossible for a table to have that many columns, however...
It suggests that your schema might not be normalized. If that is the case, you will run into problems designing queries and/or making efficient use of the available resources.
Depending on how often each row is updated, the table could become fragmented. MySQL, like most DBMSs, does not simply add up the sizes of all the attributes in the relation to work out the space to allocate for a record (although this is an option with C-ISAM). It rounds that figure up so that there is some room for the data to grow, but at some point a record can become larger than the space available, and at that point it must be migrated elsewhere. This leads to fragmentation in the data.
Your queries are going to be very difficult to read and maintain. You may fall into the trap of writing "select * ....", which means the DBMS needs to read the entire record into memory in order to resolve the query. That does not make for efficient use of your memory.
We can't tell you whether what you have done is correct, nor whether you should be doing it differently, without a detailed understanding of the underlying data.
I've worked with many tables that had dozens of columns. It's usually not a problem.
In relational database theory, there is no limit to the number of columns in a table, as long as it's finite. If you need 60 attributes and they are all properly attributes of the candidate key in that table, then it's appropriate to make 60 columns.
It is possible that some of your 60 columns are not proper attributes of the table, and need to be split into multiple tables for the sake of normalization. But you haven't described enough about your specific table or its columns, so we can't offer opinions on that.
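By way of illustration only (the real tables and columns are unknown here, so the trainee/training_point names below are made up), a normalized alternative to a 60-column table stores one row per point and aggregates on demand. The sketch uses Python's built-in sqlite3 for brevity, but the same DDL idea applies to MySQL:

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the real MySQL database
cur = conn.cursor()

# One row per trainee...
cur.execute("""CREATE TABLE trainee (
                   trainee_id INTEGER PRIMARY KEY,
                   name       TEXT NOT NULL)""")

# ...and one row per awarded point, instead of 60 point columns on the trainee row.
cur.execute("""CREATE TABLE training_point (
                   trainee_id INTEGER NOT NULL REFERENCES trainee(trainee_id),
                   category   TEXT NOT NULL,
                   points     REAL NOT NULL,
                   awarded_on TEXT,
                   PRIMARY KEY (trainee_id, category))""")

# Totals come from an aggregate query rather than from adding more columns.
cur.execute("""SELECT t.name, SUM(p.points)
               FROM trainee t
               JOIN training_point p ON p.trainee_id = t.trainee_id
               GROUP BY t.name""")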
There's a practical limit in MySQL for how many columns it supports in a given table, but this is a limit of the implementation (i.e. MySQL internal code), not of the theoretical data model. The actual maximum number of columns in a table is a bit tricky to define, since it depends on the specific table. But it's almost always greater than 60. Read this blog about Understanding the Maximum Number of Columns in a MySQL Table for details.

How to improve Random forest regression prediction result [closed]

I am working on parking occupancy prediction using machine learning (random forest regression). I have 6 features. I have tried to implement the random forest model but the results are not good, and as I am very new to this I do not know what kind of model is suitable for this kind of problem. My dataset is huge: I have 47 million rows. I have also used randomized search CV, but I cannot improve the model. Kindly have a look at the code below and help me improve it or suggest another model.
Random forest regression
The features used are extracted with the help of the location data of the parking lots with a buffer. Kindly help me to improve.
So, the variables you use are:
['restaurants_pts','population','res_percent','com_percent','supermarkt_pts', 'bank_pts']
The thing I see is that, for a given parking lot, those variables never change, so the regression will just predict the "average" occupancy of that lot. One of the key parts of your problem seems to be that occupancy is not the same at 5 pm and at 4 am...
I'd suggest you work on a time variable (e.g. arrival) so it becomes usable.
On its own the raw timestamp cannot be understood by the model, but you can derive categories from it. For example, preprocess it to keep only the HOUR, and then turn that into categories (either each hour as its own category, or larger buckets like ['midnight - 6am', '6am - 10am', '10am - 2pm', '2pm - 6pm', '6pm - midnight']); a sketch of this preprocessing follows below.
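A minimal pandas sketch of that preprocessing, assuming a datetime column named arrival (the file name and any column names beyond the six features listed above are placeholders):

import pandas as pd

df = pd.read_csv("parking.csv", parse_dates=["arrival"])   # hypothetical file and column

# Keep only the hour of the timestamp, then bucket it into coarse time-of-day categories.
df["hour"] = df["arrival"].dt.hour
bins = [0, 6, 10, 14, 18, 24]
labels = ["midnight-6am", "6am-10am", "10am-2pm", "2pm-6pm", "6pm-midnight"]
df["time_of_day"] = pd.cut(df["hour"], bins=bins, labels=labels, right=False)

# One-hot encode the new categorical feature alongside the existing ones.
features = ["restaurants_pts", "population", "res_percent", "com_percent",
            "supermarkt_pts", "bank_pts", "time_of_day"]
X = pd.get_dummies(df[features], columns=["time_of_day"])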

Encode probability distribution in single cell of table [closed]

Is there any mechanism for storing all the information of a probability distribution (discrete and/or continuous) into a single cell of a table? If so, how is this achieved and how might one go about making queries on these cells?
Your question is very vague. So only general hints can be provided.
I'd say there are two typical approaches for this (if I got your question right):
You can store some complex data in a single "cell" (as you call it) inside a database table. The easiest way is to use JSON encoding: you take an array of values, encode it to a string, and store that string. If you want to access the values again, you query the string and decode it back into an array. Newer versions of MariaDB or MySQL offer extensions to access such values at the SQL level too, though access is pretty slow that way.
You can use an additional table for the values and store only a reference in the cell. This is actually the typical and preferred approach; it is how the relational database model works. The advantage is that you can directly access each value separately in SQL, that you can use mathematical operations like sums and averages at the SQL level, and that you are not limited in storage space the way you are with a single cell. You can also filter the values, for example by date ranges or value boundaries.
In the end, taking it all together, both approaches offer the same capabilities, though they require different handling of the data. The first approach additionally requires a scripting language on the client side to handle encoding and decoding, but that is typically available anyway.
The second approach is considered cleaner and will be faster in most cases, except if you always access the whole set of values at once. So a decision can only be made knowing more specific details about the environment and the goal of the implementation.
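A minimal sketch of the first approach for a discrete distribution (probability per outcome); the table and column names here are made up, and sqlite3 stands in for whichever database you actually use:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, distribution TEXT)")

# Encode the whole discrete distribution as one JSON string and store it in a single cell.
dist = {"1": 0.2, "2": 0.5, "3": 0.3}
cur.execute("INSERT INTO experiment (distribution) VALUES (?)", (json.dumps(dist),))

# To query it, read the string back and decode it on the client side.
row = cur.execute("SELECT distribution FROM experiment WHERE id = 1").fetchone()
recovered = json.loads(row[0])               # {'1': 0.2, '2': 0.5, '3': 0.3}

The second approach would instead put each (outcome, probability) pair in its own row of a child table, which is what makes per-value filtering and aggregation possible directly in SQL.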
Say we have a distribution listed in column B, one value per row (say B1:B13), and we want to place the distribution in a single cell. In C1 enter:
=B1
and in C2 enter:
=B2 & CHAR(10) & C1
and copy downwards. Finally, format cell C13 with text wrap turned on; it then holds the whole distribution in a single cell, one value per line.

Technical implications of FFT spectral analysis over custom defined frequency bands [closed]

First of all, I should mention that I'm not an expert in signal processing, but I know some of the very basics. So I apologize if this question doesn't make any sense.
Basically I want to be able to run a spectral analysis over a specific set of user-defined discrete frequency bands. Ideally I would want to capture around 50-100 different bands simultaneously. For example: the frequencies of each key on an 80-key grand piano.
Also I should probably mention that I plan to run this in a CUDA environment with about 200 cores at my disposal (Jetson TK1).
My question is: What acquisition time, sample rate, sampling frequency, etc should I use to get a high enough resolution to line up with the desired results? I don't want to choose a crazy high number like 10000 samples, so are there any tricks to minimize the number of samples while getting spectral lines within the desired bands?
Thanks!
The FFT result does not depend on its initialization, only on the sample rate, length, and signal input. You don't need to use a whole FFT if you only want one frequency result. A bandpass filter (perhaps 1 per core) for each frequency band would allow customizing each filter for the bandwidth and response desired for that frequency.
Also, for music, note pitch is very often different from spectral frequency peak.
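As a rough sketch of the bandpass-filter-per-band idea suggested above (not a tuned design: the bandwidth, filter order, sample rate, and the standard equal-tempered key-frequency formula are my assumptions), using SciPy:

import numpy as np
from scipy.signal import butter, sosfilt

def band_energy(x, fs, f_center, bandwidth=10.0, order=4):
    """Mean power of signal x inside a narrow band around f_center (Hz)."""
    nyq = fs / 2.0
    low = (f_center - bandwidth / 2.0) / nyq
    high = (f_center + bandwidth / 2.0) / nyq
    sos = butter(order, [low, high], btype="bandpass", output="sos")
    return np.mean(sosfilt(sos, x) ** 2)

fs = 8000.0                                   # example sample rate
t = np.arange(int(0.5 * fs)) / fs             # half a second of signal
x = np.sin(2 * np.pi * 440.0 * t)             # test tone at A4

# Equal-tempered key frequencies (key 49 = A4 = 440 Hz); one filter per band.
keys = np.arange(40, 53)
freqs = 440.0 * 2.0 ** ((keys - 49) / 12.0)
energies = [band_energy(x, fs, f) for f in freqs]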

Simple programming practice (Fizz Buzz, Print Primes) [closed]

I want to practice my skills away from a keyboard (i.e. pen and paper) and I'm after simple practice questions like Fizz Buzz, Print the first N primes.
What are your favourite simple programming questions?
I've been working on http://projecteuler.net/
Problem:
Insert + or - signs anywhere between the digits of 123456789 in such a way that the expression evaluates to 100. The condition is that the order of the digits must not be changed.
e.g.: 1 + 2 + 3 - 4 + 5 + 6 + 78 + 9 = 100
Programming Problem:
Write a program in your favorite language which outputs all possible solutions of the above problem.
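A brute-force sketch in Python: it simply tries every way of placing +, -, or nothing between consecutive digits and evaluates the result.

from itertools import product

digits = "123456789"
for ops in product(["", "+", "-"], repeat=len(digits) - 1):
    # Interleave each digit with the operator (or nothing) that follows it.
    expr = "".join(d + op for d, op in zip(digits, ops)) + digits[-1]
    if eval(expr) == 100:
        print(expr, "= 100")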
If you want pen-and-paper kinds of exercises, I'd recommend design work rather than coding.
Coding on paper is honestly tedious and teaches you very little. The working environment matters: typing on a computer, compiling, seeing the errors you've made, and refactoring here and there just don't compare to what you can do on a piece of paper. So, while coding on paper is an interesting mental exercise, it is not very practical and will not improve your coding skills much.
On the other hand, you can design the architecture of a medium or even complex application by hand on paper. In fact, I usually do. Engineering tools (such as Enterprise Architect) are not good enough to replace good old by-hand diagrams.
Good projects could be: How would you design a game engine? Classes, threads, storage, physics, the data structures that will hold everything, and so on. How would you start a search engine? How would you design a pattern recognition system?
I find that kind of problems much more rewarding than any paper coding you can do.
There are some good examples of simple-ish programming questions in Steve Yegge's article Five Essential Phone Screen Questions (under Area Number One: Coding). I find these are pretty good for doing on pen and paper. Also, the questions under OOP Design in the same article can be done on pen and paper (or even in your head) and are, I think, good exercises to do.
Quite a few online sites for competitive programming are full of sample questions/challenges, sorted by difficulty. Quite often, the simpler categories among the 'algorithms' questions would suit you, I think.
For example, check out TopCoder (algorithms section)!
Apart from that, 2 samples:
You are given a list of N points in the plane by their coordinates (x_i, y_i), and a number R>0. Output the maximum number out of the N given points that can be simultaneously covered by a disk of radius R (for bonus points: complexity?).
You are given an array of N numbers a1 to aN, and you want to compute a1 * a2 * ... * aN / ai for all values of i (so the output is again an array of N elements) without using division. Provide a (non-naive) method (complexity should be in O(N) multiplications).
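For the second sample, here is a sketch of the standard prefix/suffix-product trick, which uses O(N) multiplications and no division:

def products_except_self(a):
    n = len(a)
    result = [1] * n
    prefix = 1
    for i in range(n):              # result[i] = product of a[0..i-1]
        result[i] = prefix
        prefix *= a[i]
    suffix = 1
    for i in range(n - 1, -1, -1):  # multiply in the product of a[i+1..n-1]
        result[i] *= suffix
        suffix *= a[i]
    return result

print(products_except_self([1, 2, 3, 4]))   # [24, 12, 8, 6]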
I also like Project Euler, but I would like to point out that the questions get really tricky really fast. After the first 20 questions or so, they start to be problems most people won't be able to figure out in half an hour. Another issue is that a lot of them involve math with really large numbers that don't fit into standard integer or even long variable types.
Towers of Hanoi is great for practicing recursion.
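For reference, the recursive solution is only a few lines, which is what makes it such a nice pen-and-paper exercise:

def hanoi(n, source, target, spare):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target)        # move the n-1 smaller disks out of the way
    print(f"move disk {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source)        # move them onto the target peg

hanoi(3, "A", "C", "B")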
I'd also do a search on sample programming interview questions.