Analyse data by degree of effect

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see how A, B, and C affect an outcome, you may use several models such as KNN, SVM, or logistic regression (as far as I know).
But all of them are essentially categorical: they predict a class rather than a degree of effect.
Let's say I'd like to show how fonts and colors contribute to the degree of attraction (as shown).
What models can I use?
A thousand thanks!

If your input consists only of categorical variables (each with just a few possible values), then there are only finitely many potential samples. Therefore, the model will see only finitely many distinct inputs and can produce only a few distinct outputs. Just a warning.
If you use, say, KNN or a random forest, you can use the L2 norm (squared error) as your evaluation metric. It will reflect that a prediction of 1 is closer to 2 than to 5 (please don't forget to normalize).
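To make that concrete, here is a minimal sketch, assuming scikit-learn: the categorical inputs are one-hot encoded and a KNN regressor predicts the degree directly, scored with a squared-error (L2) metric. The fonts, colors, and attraction scores below are invented purely for illustration.

```python
# Minimal sketch: predict a numeric "degree of attraction" from categorical
# features, scored with a squared-error (L2) metric. Toy data only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

df = pd.DataFrame({
    "font":       ["Arial", "Times", "Arial", "Comic", "Times", "Comic"],
    "color":      ["red",   "blue",  "blue",  "red",   "red",   "blue"],
    "attraction": [4, 2, 3, 5, 1, 4],   # degree on a 1-5 scale
})
X, y = df[["font", "color"]], df["attraction"]

model = make_pipeline(
    ColumnTransformer([("onehot", OneHotEncoder(), ["font", "color"])]),
    KNeighborsRegressor(n_neighbors=3),
)
model.fit(X, y)

print("Mean squared error:", mean_squared_error(y, model.predict(X)))
```

A RandomForestRegressor would drop into the same pipeline in place of KNeighborsRegressor.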

Related

What algorithm to use to classify my data into 3 classes

I'm looking for a way to differentiate between 3 classes (a classification problem) for each OBJECT I need to classify.
I have a large dataset (millions of lines). There are 2 features, each with 100 values (scaled to 0-1).
Each line is one sample of a specific object (Object_id, 100 columns for my first feature, 100 for my second feature).
Each object (which has to be classified into one of the 3 classes) has at least 100 samples (1 sample is 1 line).
Unfortunately, class 3 has only about 1/10 as many samples as classes 1 and 2 (each class-3 object has around 500 samples, while class-1 and class-2 objects have around 2000 or more).
In order to do the classification, I need to take a batch of samples for each object (for example 20, 50, or 100).
I don't know which algorithm suits my case best. I'm new to deep learning, so please bear with me.
Let's break this down into two main questions: how to handle unbalanced datasets and which model to use.
Unbalanced datasets
Most machine learning algorithms are sensitive, to some degree, to unbalanced datasets. This is a huge challenge for machine learning in fields like medical diagnostics or seismology, where you have 98% "normal" readings and 2% "event" readings. There is no silver bullet for this problem. Some algorithms are more resilient to an unbalanced dataset, some deliberately resample their datasets to encourage a strong model (see bagging), and there are options to augment your data by introducing cloned data with statistical noise. However, your easiest and most effective approach is to decimate your dataset to make it balanced.
You have a class split of 2000|2000|500 datapoints. Randomly sample 500 datapoints from each of the first two classes so you have a balanced 500|500|500 dataset. It is important to sample randomly, instead of simply taking the first 500, because you want a representative sample of the class population. See the numpy.random module for how to select your datapoints.
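Here is a short sketch of that downsampling step, assuming NumPy; the toy arrays stand in for your real feature matrix and labels.

```python
# Randomly downsample each class to at most n_per_class rows (toy data).
import numpy as np

rng = np.random.default_rng(42)

def downsample(X, y, n_per_class):
    """Randomly keep at most n_per_class rows of each class."""
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        keep.append(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data standing in for the 2000|2000|500 split in the question.
X = rng.random((4500, 200))
y = np.repeat([1, 2, 3], [2000, 2000, 500])

X_bal, y_bal = downsample(X, y, n_per_class=500)
print(np.unique(y_bal, return_counts=True))   # -> 500 | 500 | 500
```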
Model selection
Although Deep Learning is portrayed as the be-all and end-all for machine learning, it represents a significant amount of time and cost to prepare, train and monitor. A typical approach to any new problem is to try some "baseline" shallow learning models. Often you'll see the following scenarios:
Your baseline models fail to train.
Your baseline model trains and fits moderately.
Your baseline model trains and fits closely.
In the first scenario, your deep learning model is unlikely to train either. In the third scenario there is no need to build a deep learning model when a simpler algorithm can solve it. Scenario 2 is your candidate for deep learning.
So what models could you use?
Well, we know that it's a supervised problem, that we have a good number of samples, and that we are looking to classify. Your best bet for this kind of question is a Random Forests model. There is a good simple implementation in scikit-learn and hundreds of tutorials.
Alternatively, if you're looking at class fit through clustering, K-means++ models (and co), or even Gaussian Mixture Models, are a good place to start (again, see scikit-learn's sklearn.cluster and sklearn.mixture).
If it fits well, then your work is done. If it fits moderately, think about deep learning. If it fails to fit, add more features (and more diverse features) to your dataset.
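As a hedged illustration of that baseline idea, here is what a scikit-learn random forest on the balanced data might look like; the array shapes and parameters are placeholders, not tuned values.

```python
# Baseline sketch: random forest on a balanced toy dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X_bal = rng.random((1500, 200))        # 200 scaled feature columns per sample
y_bal = np.repeat([1, 2, 3], 500)      # balanced 500|500|500 labels

X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```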

Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?

I want to do sentiment analysis using a machine learning (text classification) approach, for example NLTK's Naive Bayes classifier.
But the issue is that only a small amount of my data is labeled (for example, 100 articles are labeled positive or negative) and 500 articles are not labeled.
I was thinking of training the classifier on the labeled data and then trying to predict the sentiment of the unlabeled data.
Is it possible?
I am a beginner in machine learning and don't know much about it.
I am using Python 3.7.
Thank you in advance.
Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?
Yes. This is basically the definition of what supervised learning is.
That is, you train on data that has labels, so that you can then put the model into production to categorize your data that does not have labels.
(Any book on supervised learning will have code examples.)
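For instance, a minimal sketch with NLTK's Naive Bayes classifier might look like the following; the four labeled articles and two unlabeled ones are made-up stand-ins for your 100 and 500.

```python
# Train on labeled texts, then predict labels for unlabeled texts (toy data).
from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words presence features; a real pipeline would tokenize properly.
    return {word: True for word in text.lower().split()}

labeled = [
    ("I really loved this product", "positive"),
    ("Absolutely terrible, waste of money", "negative"),
    ("Great quality and fast delivery", "positive"),
    ("Broke after one day, very disappointed", "negative"),
]
train_set = [(features(text), label) for text, label in labeled]
classifier = NaiveBayesClassifier.train(train_set)

unlabeled = ["fast delivery and great price", "terrible quality, very disappointed"]
for text in unlabeled:
    print(text, "->", classifier.classify(features(text)))
```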
I wonder if your question might really be: can I use supervised learning to make a model, assign labels to another 500 articles, then do further machine learning on all 600 articles? Well, the answer is still yes, but the quality will fall somewhere between these two extremes:
Assign random labels to the 500. Bad results.
Get a domain expert to assign correct labels to those 500. Good results.
Your model could fall anywhere between those two extremes. It is useful to know where it is, so you know whether it is worth using the data. You can get an estimate of that by taking a sample, say 25 records, and having them also labeled by a domain expert. If all 25 match, there is a reasonable chance your other 475 records have also been given good labels. If, e.g., only 10 of the 25 match, the model is much closer to the random end of the spectrum, and using the other 475 records is probably a bad idea.
("10", "25", etc. are arbitrary examples; choose based on the number of different labels, and your desired confidence in the results.)

What is the best way to represent a collection of documents in a fixed length vector?

I am trying to build a deep neural network that takes in a set of documents and predicts the category the set belongs to.
Since the number of documents in each collection is not fixed, my first attempt was to map each document to a vector with doc2vec and use the average.
The training accuracy is as high as 90%, but the testing accuracy is as low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help, as the "noise" in more numerous, varied examples tends to cancel itself out, rather than acquiring the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.

Mixing text and numeric features for text classification using deep learning

I have a problem involving classification of text into several categories (topics). Apart from the text, I have some numeric features that I believe may be useful (there are also missing values among those features). But the most important information is, of course, present in the text. Therefore, I think a deep learning approach (with a common pipeline: embedding layer + CNN or RNN with dropout + Dense layer) would be the best choice. What is the best practice for combining the current model, which works only on text input, with numeric features? Are there any tricks, common best practices, or state-of-the-art research going on in this field? Are there any papers/experiments (on GitHub, maybe) on this topic?
It'd be great if we could think of the problem in general, but for the sake of having an idea of what sort of problem we may solve, I will give a specific example. Let's suppose we have reviews from users in which they describe a problem they faced while receiving a service or purchasing an item. The target feature is multi-label: the set of tags (categories/topics) associated with the complaint that a user had (we should choose relevant ones among a few hundreds of possible topics).
Then apart from the user's comment itself (which is the most important feature), we may want also to take into account some numerical features like price, waiting time, rating (customer satisfaction score), etc. This can potentially be useful for predicting some particular categories.
The idea is to mix all these features somehow in a deep learning model to produce the final model. I'm not sure I know much about the best ways to do it. What are the best practices / useful tricks for these kinds of problems?
For each numeric feature, start with a statistical summary (you can use pandas.DataFrame.describe); plotting the distribution will also give you a clearer picture of the data.
Once you have the mean, std, max, min, etc., you should get rid of outliers, which can harm your training. For example, if 90% of a feature's values fall between 18 and 72 but it also contains values like 1.1 or 1200, clip them to 18 or 72 depending on the side. You can use np.clip().
After you have a reasonable distribution, you should convert those numeric features to categorical features. For instance, a numeric range from 18 to 72 can be bucketed at 18, 27, 36, ..., 72, taking the intervals between those edges. You can increase or decrease the resolution depending on your understanding of the data and the performance of the algorithm. You can use np.digitize() or write a simple function yourself.
In the end you have a categorical feature, just like the text. A CNN or RNN can work fine with categorical representations of the numeric values, and you also get the advantage of feature crosses to increase your performance.
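A short sketch of those describe/clip/digitize steps, with made-up values standing in for one numeric feature such as waiting time:

```python
# Summarise, clip outliers, and bucket one numeric feature (toy values).
import numpy as np
import pandas as pd

waiting_time = pd.Series([23, 35, 41, 1.1, 56, 68, 1200, 30, 64, 19])

print(waiting_time.describe())            # step 1: inspect the distribution

clipped = np.clip(waiting_time, 18, 72)   # step 2: pull outliers back to the bulk

bins = np.arange(18, 73, 9)               # step 3: bucket edges 18, 27, ..., 72
bucket_ids = np.digitize(clipped, bins)   # each value becomes a categorical bucket id
print(bucket_ids)
```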
But if you are asking about something more complex, I might not have understood your question, or I may not know the answer. Still, if you want to ask more, or ask differently, I will be happy to try to help.
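For the model itself, one common two-branch pattern is to give the network a text input and a side-feature input and concatenate them before the output layer. The sketch below uses the Keras functional API; the shapes, layer sizes, and the choice to feed the numeric features as plain floats (rather than buckets) are illustrative assumptions, not a prescription.

```python
# Two-branch model: text branch (embedding + CNN) concatenated with a small
# dense branch over numeric side features; multi-label sigmoid output.
from tensorflow.keras import layers, Model

max_len, vocab_size, num_numeric, num_topics = 200, 20000, 3, 100

text_in = layers.Input(shape=(max_len,), dtype="int32", name="text")
x = layers.Embedding(vocab_size, 128)(text_in)
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)

num_in = layers.Input(shape=(num_numeric,), name="numeric")   # e.g. price, waiting time, rating
n = layers.Dense(16, activation="relu")(num_in)

merged = layers.Concatenate()([x, n])
out = layers.Dense(num_topics, activation="sigmoid")(merged)  # multi-label topics

model = Model(inputs=[text_in, num_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```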

Area of overlap of two circular Gaussian functions

I am trying to write code that finds the overlap between 3D shapes.
Each shape is defined by two intersecting normal distributions (one in the x direction, one in the y direction).
Do you have any suggestions of existing code that addresses this question or functions that I can utilize to build this code? Most of my programming experience has been in R, but I am open to solutions in other languages as well.
Thank you in advance for any suggestions and assistance!
The longer research context on this question: I am studying the use of acoustic space by insects. I want to know whether randomly assembled groups of insects would have calls that are more or less similar than we observe in natural communities (a randomization test). To do so, I need to randomly select insect species and calculate the similarity between their calls.
For each species, I have a mean and variance for two call characteristics that are approximately normally distributed. I would like to use these two call characteristics to build a 3D probability distribution for the species. I would then like to calculate the amount by which the PDF for one species overlaps with another.
Please accept my apologies if the question is not clear or appropriate for this forum.
I work in small molecule drug discovery, and I frequently use a program (ROCS, by OpenEye Scientific Software) based on algorithms that represent molecules as collections of spherical Gaussian functions and compute intersection volumes. You might look at the following references, as well as the ROCS documentation:
(1) Grant and Pickup, J. Phys. Chem. 1995, 99, 3503-3510
(2) Grant, Gallardo, and Pickup, J. Comp. Chem. 1996, 17, 1653-1666
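For the simpler two-distribution case in the question, a brute-force numerical sketch (not the analytic Gaussian-product method of the references above) is to evaluate both bivariate normal PDFs on a grid and integrate their pointwise minimum; the means and variances below are invented.

```python
# Approximate the overlap of two bivariate normal PDFs as the integral of
# min(f, g) over a grid (the "overlapping coefficient", between 0 and 1).
import numpy as np
from scipy.stats import multivariate_normal

# Species A and B: mean and variance for the two call characteristics (toy values).
a = multivariate_normal(mean=[2.0, 5.0], cov=np.diag([1.0, 0.5]))
b = multivariate_normal(mean=[3.0, 5.5], cov=np.diag([0.8, 0.7]))

# Evaluate both PDFs on a grid that comfortably covers both distributions.
step = 0.02
x, y = np.mgrid[-3:9:step, 0:11:step]
grid = np.dstack((x, y))
fa, fb = a.pdf(grid), b.pdf(grid)

overlap = np.sum(np.minimum(fa, fb)) * step * step
print(f"Estimated overlap: {overlap:.3f}")
```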