Training a network on two feature vectors? - deep-learning

I want to train an MLP that takes in two NBA teams, A and B, and classifies one as the winner and the other as the loser (probably binary classification: 0 for loser, 1 for winner), as well as a predictor that assigns a probability that team A beats team B. I'm having trouble figuring out how the feature vector should look, though, and would like some advice. My ideas are:
Take the difference between each team's features
Concatenate the features, i.e. for one training example it would be [A_1, B_1, A_2, B_2, ..., A_n, B_n], where n is the number of features
Use one feature vector for each team? (Don't know if that works)
Could anyone give some suggestions?

While I agree with Scott Hunter to try everything, here are a few thoughts:
Taking the difference - this depends very much on what the team's features represent. If each feature is like a statistic of the team (such as win rate, etc), then taking the difference might work. If it is something more abstract, it may not be a good idea. But you can try.
Concatenate the features - I think this would be a good choice, at least when starting out. It definitely seems the most obvious, and I think it would work to a good extent.
Another approach: You could build an encoder, which takes the team's feature vector and outputs a condensed representation. Then, you can do something with this condensed representation (feed it to a simpler model, or another MLP as well).
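Below is a minimal sketch (not from the answers above; n_features, the layer sizes, and the dummy data are all made up) of two of these ideas in Keras: (a) plain concatenation of the two teams' features into one MLP, and (b) a shared encoder that condenses each team's feature vector before the two representations are combined. In both cases the sigmoid output can be read as the probability that team A beats team B.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 16  # hypothetical per-team feature count

# (a) Concatenation: one input of size 2 * n_features, i.e. [A_1..A_n, B_1..B_n].
concat_mlp = keras.Sequential([
    keras.Input(shape=(2 * n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(team A wins)
])
concat_mlp.compile(optimizer="adam", loss="binary_crossentropy")

# (b) Shared encoder: the same weights condense each team's features,
# and the two condensed representations are concatenated.
encoder = keras.Sequential([
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
])
team_a = keras.Input(shape=(n_features,))
team_b = keras.Input(shape=(n_features,))
merged = layers.Concatenate()([encoder(team_a), encoder(team_b)])
hidden = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(hidden)
two_tower = keras.Model(inputs=[team_a, team_b], outputs=out)
two_tower.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy data just to show the expected shapes; label 1 means team A won.
A = np.random.rand(100, n_features)
B = np.random.rand(100, n_features)
y = np.random.randint(0, 2, size=(100, 1))
concat_mlp.fit(np.concatenate([A, B], axis=1), y, epochs=1, verbose=0)
two_tower.fit([A, B], y, epochs=1, verbose=0)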

Related

Analyse data with degree of affection

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see the relationship of how A, B, and C affect an outcome, you can use several models, such as KNN, SVM, or logistic regression (as far as I know).
But all of them are kind of categorical, rather than giving a degree of effect.
Let's say I'd like to show how fonts and colors contribute to the degree of attraction (as shown).
What models can I use?
Thousands of thanks!
If your input consists only of categorical variables (each taking just a few values), then there are only finitely many possible samples. Therefore the model has only finitely many possible inputs, and therefore only a few possible outputs. Just a warning.
If you use, say, KNN or a random forest, you can use the L2 norm as your error metric. It will reflect that 1 is closer to 2 than to 5 (please don't forget to normalize).
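As a minimal sketch of that idea (the column names, values, and model settings are made up, and scikit-learn is assumed to be available): one-hot encode the categorical inputs, fit a regressor on the numeric "degree" target, and evaluate with an L2-style (RMSE) metric, so that predicting 2 when the truth is 1 is penalized less than predicting 5.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: font and color are categorical, attraction is a 1-5 score.
df = pd.DataFrame({
    "font":       ["serif", "sans", "serif", "mono", "sans", "mono"] * 10,
    "color":      ["red", "blue", "green", "red", "green", "blue"] * 10,
    "attraction": [3, 4, 2, 1, 5, 3] * 10,
})
X, y = df[["font", "color"]], df["attraction"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("encode", ColumnTransformer([("onehot", OneHotEncoder(), ["font", "color"])])),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)

# RMSE (an L2-style metric) penalizes being off by 4 more than being off by 1.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"RMSE: {rmse:.3f}")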

Mixing text and numeric features for text classification using deep learning

I have a problem about classification of text into several categories (topics). Apart from text, I have some numeric features that I believe may be useful (there are also missing values among those features). But the most important information is, of course, presented in the text. Therefore, I think deep learning approach (with a common pipeline: embedding layer + CNN or RNN with dropout + Dense layer) would be the best choice. What is the best practice to mix the current model that works only on text input with numeric features? Are there any tricks, best common practices, state-of-the-art research going on in this field? Are there any papers/experiments (on GitHub, maybe) on this topic?
It'd be great if we could think of the problem in general, but for the sake of having an idea of what sort of problem we might solve, I will give a specific example. Let's suppose we have reviews from users in which they describe a problem they faced while receiving a service or purchasing an item. The target feature is multi-label: the set of tags (categories/topics) associated with the complaint that a user had (we should choose the relevant ones from among a few hundred possible topics).
Then, apart from the user's comment itself (which is the most important feature), we may also want to take into account some numeric features like price, waiting time, rating (customer satisfaction score), etc. These can potentially be useful for predicting some particular categories.
The idea is to mix all these features somehow in a deep learning model to produce the final model. I'm not sure I know much about the best ways to do this. What are the best practices / useful tricks for these kinds of problems?
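One common pattern (a minimal sketch only, not necessarily the best practice being asked about; the vocabulary size, sequence length, layer sizes, and the three numeric features are all made up) is to build the usual embedding + CNN branch for the text and concatenate the numeric features just before the final dense layers; missing numeric values would need to be imputed beforehand.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 20000, 200   # hypothetical tokenizer settings
n_numeric, n_tags = 3, 300         # e.g. price, waiting time, rating; ~300 topics

text_in = keras.Input(shape=(seq_len,), name="text")        # token ids
numeric_in = keras.Input(shape=(n_numeric,), name="numeric")

x = layers.Embedding(vocab_size, 128)(text_in)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)

merged = layers.Concatenate()([x, numeric_in])
merged = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(n_tags, activation="sigmoid")(merged)    # multi-label: one sigmoid per tag

model = keras.Model(inputs=[text_in, numeric_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy arrays just to show the expected shapes.
tokens = np.random.randint(0, vocab_size, size=(32, seq_len))
numeric = np.random.rand(32, n_numeric)
tags = np.random.randint(0, 2, size=(32, n_tags))
model.fit([tokens, numeric], tags, epochs=1, verbose=0)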
For each numeric feature, get a statistical summary (you can use pandas.DataFrame.describe); plotting the distribution also gives you a clearer visual picture.
Once you have the mean, std, max, min, etc., you should get rid of outliers, which can harm your model's training. For example, if 90% of a feature's values lie between 18 and 72 but it also has values like 1.1 or 1200, you should clip those to 18 or 72 depending on the side. You can use np.clip().
After you have a reasonable distribution, you should convert the numeric features to categorical features. For instance, a numeric range from 18 to 72 can be grouped into intervals at 18, 27, 36, ..., 72. You can increase or decrease the resolution depending on your understanding and on the performance of the algorithm. You can use np.digitize() or do it manually with a simple function.
In the end you have categorical features, just like the text. A CNN or RNN can work fine with categorical representations of the numeric values, and you also gain the advantage of feature crosses to increase your performance. A minimal sketch of these preprocessing steps follows below.
But if you ask for something more complex, I might not have understood your question, or I may not know the answer. Still, if you want to ask more, or ask differently, I will be happy to try to help.
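A minimal sketch of the preprocessing steps described above (the column name, values, and clip bounds are made up): summarize the distribution, clip the outliers, then bin the clipped values into categories with np.digitize.

import numpy as np
import pandas as pd

df = pd.DataFrame({"waiting_time": [22, 35, 41, 1.1, 67, 1200, 54, 29]})

print(df["waiting_time"].describe())      # mean, std, min, max, quartiles

# Clip extreme values (e.g. 1.1 and 1200) to a sensible range.
clipped = np.clip(df["waiting_time"], 18, 72)

# Bin into intervals 18, 27, 36, ..., 72; each value becomes the index of its bucket.
bins = np.arange(18, 73, 9)
df["waiting_time_bucket"] = np.digitize(clipped, bins)

print(df)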

Project Euler 298 - there must be a correct answer? (only pastebinned code)

Project Euler has a paging file problem (though it's disguised in other words).
I tested my code (pastebinned so as not to spoil it for anyone) against the sample data and got the same memory contents and score as the problem. However, there is nowhere near a consistent grouping of scores. It asks for the expected difference in scores after 50 turns. A random sampling of scores:
1.50000000
1.78000000
1.64000000
1.64000000
1.80000000
2.02000000
2.06000000
1.56000000
1.66000000
2.04000000
I've tried a few of those as answers, but none of them have been accepted... I know some people have succeeded, so I'm really confused - what the heck am I missing?
Your problem is likely that you don't know the definition of expected value.
You will have to run the simulation multiple times, maintain the frequency of each score difference that occurs, and then take the weighted mean to get the expected value.
Of course, given that it is Project Euler problem, there is probably a mathematical formula which can be used readily.
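As a rough sketch of the weighted-mean idea (simulate_game() below is only a stand-in for the poster's own pastebinned simulation, assumed to return |L - R| after 50 turns): record how often each score difference occurs over many runs and take the frequency-weighted mean, instead of submitting the result of a single run.

from collections import Counter
import random

def simulate_game():
    # Placeholder only: the real function would be one run of the poster's
    # paging-file simulation, returning the score difference |L - R|.
    return abs(random.randint(0, 25) - random.randint(0, 25))

runs = 200000
counts = Counter(simulate_game() for _ in range(runs))

# Frequency-weighted mean of the score differences.
expected = sum(diff * freq for diff, freq in counts.items()) / runs
print(f"Estimated E[|L - R|] ~ {expected:.8f}")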
Yep, there is a correct answer. To be honest, Monte Carlo can theoretically close in on the expected value, given the law of large numbers. However, you won't want to try it here, because in practice each time you run the simulation you will get a slightly different result when rounded to eight decimal places (and I think that requirement is exactly what deprives anybody of any chance of even thinking of using Monte Carlo). If you are lucky, one simulation out of many trials will deliver the answer, given that you have submitted all the previous ones and failed. I think the captcha is Project Euler's second way of making you give up on any brute-force approach.
Well, I agree with Moron: you have to figure out "expected value" first. The principle of this problem is that you have to find a way to enumerate every possible "essential" outcome after 50 rounds. Each outcome has its own |L - R|, so sum them up, weighted by their probabilities, and you will have the answer. Needless to say, a brute-force approach fails in most cases, and especially in this one. Fortunately, we have dynamic programming (DP), which is fast!
Basically, DP saves the computation results of each round as states and uses them in the next round, thus avoiding repeating the same computation over and over again. The difficult part of this problem is finding a way to represent a state, that is to say, how you would like to save your intermediate results. If you have solved problem 290 with DP, you can get some hints there about how to understand the problem and formulate a state.
Actually, that isn't even the most difficult part. The hardest mental piece is realizing that some memory states of the two players are numerically different but essentially equivalent, for example L:12345 R:12345 vs L:23456 R:23456, or even vs L:98765 R:98765. That is because the numbers called are random. That is also why I wrote possible "essential" outcomes: you can collapse several states into one, and only by doing so can your program finish in reasonable time.
I would run your simulation a whole bunch of times and then take a weighted average of the |L - R| value over all the runs. That should get you closer to the expected value.
Just submitting one run as an answer is really unlikely to work. Imagine it were the expected value of a dice roll: roll one die, score a 6, and submit that as the expected value.

How many units should there be in each generation of a genetic algorithm?

I am working on a roguelike and am using a GA to generate levels. My question is, how many levels should be in each generation of my GA? And how many generations should it have? Is it better to have a few levels in each generation, with many generations, or the other way around?
There really isn't a hard and fast rule for this type of thing - most experiments like to use at least 200 members in a population at the barest minimum, scaling up to millions or more. The number of generations is usually in the 100 to 10,000 range. In general, to answer your final question, it's better to have lots of members in the population so that "late-bloomer" genes stay in a population long enough to mature, and then use a smaller number of generations.
But really, these aren't the important thing. The most critical part of any GA is the fitness function. If you don't have a decent fitness function that accurately evaluates what you consider to be a "good" level or a "bad" level, you're not going to end up with interesting results no matter how many generations you use, or how big your population is :)
Just as Mike said, you need to try different numbers. If you have a large population, you need to make sure to have a good selection function. With a large population, it is very easy to cause the GA to converge to a "not so good" answer early on.
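A bare-bones GA skeleton (everything in it is a placeholder: the bit-string genome, the numbers, and especially fitness(), which a real roguelike would replace with its own level evaluation) just to show where population size and generation count enter the loop:

import random

POPULATION_SIZE = 200   # "at least 200 members ... at the barest minimum"
GENERATIONS = 100
GENOME_LENGTH = 64      # hypothetical encoding of a level

def fitness(genome):
    # Placeholder: score how "good" a level is. This is the part that really matters.
    return sum(genome)

def crossover(a, b):
    cut = random.randrange(1, GENOME_LENGTH)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.01):
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for generation in range(GENERATIONS):
    # Selection: keep the fitter half, then breed children to refill the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POPULATION_SIZE // 2]
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POPULATION_SIZE - len(survivors))]
    population = survivors + children

print("Best fitness:", fitness(max(population, key=fitness)))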

Seeking clarifications about structuring code to reduce cyclomatic complexity

Recently our company has started measuring the cyclomatic complexity (CC) of the functions in our code on a weekly basis, and reporting which functions have improved or worsened. So we have started paying a lot more attention to the CC of functions.
I've read that CC can be informally calculated as 1 + the number of decision points in a function (e.g. if statements, for loops, select statements, etc.), or alternatively as the number of paths through a function...
I understand that the easiest way of reducing CC is to use the Extract Method refactoring repeatedly...
There are some things I am unsure about, e.g. what is the CC of the following code fragments?
1)
for (int i = 0; i < 3; i++)
Console.WriteLine("Hello");
And
Console.WriteLine("Hello");
Console.WriteLine("Hello");
Console.WriteLine("Hello");
They both do the same thing, but does the first version have a higher CC because of the for statement?
2)
if (condition1)
if (condition2)
if (condition3)
Console.WriteLine("wibble");
And
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
Assuming the language does short-circuit evaluation, such as C#, then these two code fragments have the same effect... but is the CC of the first fragment higher because it has 3 decision points/if statements?
3)
if (condition1)
{
Console.WriteLine("one");
if (condition2)
Console.WriteLine("one and two");
}
And
if (condition3)
Console.WriteLine("fizz");
if (condition4)
Console.WriteLine("buzz");
These two code fragments do different things, but do they have the same CC? Or does the nested if statement in the first fragment have a higher CC? i.e. nested if statements are mentally more complex to understand, but is that reflected in the CC?
Yes. Your first example has a decision point and your second does not, so the first has a higher CC.
Yes-maybe, your first example has multiple decision points and thus a higher CC. (See below for explanation.)
Yes-maybe. Obviously they have the same number of decision points, but there are different ways to calculate CC, which means ...
... if your company is measuring CC in a specific way, then you need to become familiar with that method (hopefully they are using tools to do this). There are different ways to calculate CC for different situations (case statements, Boolean operators, etc.), but you should get the same kind of information from the metric no matter what convention you use.
The bigger problem is what others have mentioned, that your company seems to be focusing more on CC than on the code behind it. In general, sure, below 5 is great, below 10 is good, below 20 is okay, 21 to 50 should be a warning sign, and above 50 should be a big warning sign, but those are guides, not absolute rules. You should probably examine the code in a procedure that has a CC above 50 to ensure it isn't just a huge heap of code, but maybe there is a specific reason why the procedure is written that way, and it's not feasible (for any number of reasons) to refactor it.
If you use tools to refactor your code to reduce CC, make sure you understand what the tools are doing, and that they're not simply shifting one problem to another place. Ultimately, you want your code to have few defects, to work properly, and to be relatively easy to maintain. If that code also has a low CC, good for it. If your code meets these criteria and has a CC above 10, maybe it's time to sit down with whatever management you can and defend your code (and perhaps get them to examine their policy).
After browsing through the Wikipedia entry and Thomas J. McCabe's original paper, it seems that the items you mentioned above are known problems with the metric.
However, most metrics have pros and cons. I suppose in a large enough program the CC value could point to possibly complex parts of your code. But a higher CC does not necessarily mean the code is complex.
Like all software metrics, CC is not perfect. Used on a big enough code base, it can give you an idea of where the problematic zones might be.
There are two things to keep in mind here:
Big enough code base: in any non-trivial project you will have functions with a really high CC value, so high that it does not matter whether, in one of your examples, the CC would be 2 or 3. A function with a CC of, let's say, over 300 is definitely something to analyse; it doesn't matter whether the CC is 301 or 302.
Don't forget to use your head. There are methods that need many decision points. Often they can be refactored somehow to have fewer, but sometimes they can't. Do not go with a rule like "Refactor all methods with a CC > xy". Have a look at them and use your brain to decide what to do.
I like the idea of a weekly analysis. In quality control, trend analysis is a very effective tool for identifying problems while they are being created. This is much better than having to wait until they get so big that they become obvious (see SPC for some details).
CC is not a panacea for measuring quality. Clearly a repeated statement is not "better" than a loop, even if the loop has a bigger CC. The reason the loop has a bigger CC is that sometimes it might be executed and sometimes it might not, which leads to two different "cases" that should both be tested. In your case the loop will always be executed three times because you use a constant, but CC is not clever enough to detect this.
The same goes for the chained ifs in example 2 - this structure allows you to have a statement that would be executed if only condition1 and condition2 are true. This is a special case that is not possible in the version using &&. So the if-chain has a bigger potential for special cases, even if you don't make use of that in your code.
This is the danger of applying any metric blindly. The CC metric certainly has a lot of merit, but as with any other technique for improving code, it can't be evaluated divorced from context. Point your management at Capers Jones's discussion of the Lines of Code measurement (I wish I could find a link for you). He points out that if Lines of Code were a good measure of productivity, then assembly-language developers would be the most productive developers on earth. Of course they're no more productive than other developers; it just takes them a lot more code to accomplish what higher-level languages do with less source code. I mention this, as I say, so you can show your managers how dumb it is to blindly apply metrics without intelligent review of what the metric is telling you.
I would suggest that, if they're not doing so already, your management would be wise to use the CC measure as a way of spotting potential hot spots in the code that should be reviewed further. Blindly aiming for the goal of lower CC, without any reference to code maintainability or other measures of good coding, is just foolish.
Cyclomatic complexity is analogous to temperature. They are both measurements, and in most cases meaningless without context. If I said the temperature outside was 72 degrees, that doesn’t mean much; but if I added the fact that I was at the North Pole, the number 72 becomes significant. If someone told me a method has a cyclomatic complexity of 10, I can’t determine whether that is good or bad without its context.
When I code review an existing application, I find cyclomatic complexity a useful “starting point” metric. The first thing I check for are methods with a CC > 10. These “>10” methods are not necessarily bad. They just provide me a starting point for reviewing the code.
General rules when considering a CC number:
The relationship between the CC number and the number of tests should be CC# <= #tests
Refactor for CC# only if it increases maintainability
A CC above 10 often indicates one or more code smells
[Off topic] If you favor readability over a good score in the metrics (was it J. Spolsky who said, "what's measured gets done"? - meaning that metrics are abused more often than not, I suppose), it is often better to use a well-named boolean to replace your complex conditional statement.
Then
if (condition1 && condition2 && condition3)
Console.WriteLine("wibble");
becomes
bool/boolean theWeatherIsFine = condition1 && condition2 && condition3;
if (theWeatherIsFine)
Console.WriteLine("wibble");
I'm no expert at this subject, but I thought I would give my two cents. And maybe that's all this is worth.
Cyclomatic Complexity seems to be just a particular automated shortcut to finding potentially (but not definitely) problematic code snippets. But isn't the real problem to be solved one of testing? How many test cases does the code require? If the CC is higher, but the number of test cases is the same and the code is cleaner, don't worry about the CC.
1.) There is no decision point there. There is one and only one path through the program, and only one possible result with either of the two versions. The first is more concise and better; Cyclomatic Complexity be damned.
1 test case for both
2.) In both cases, you either write "wibble" or you don't.
2 test cases for both
3.) The first one could result in nothing, "one", or "one" plus "one and two": 3 paths. The second one could result in nothing, either one of the two, or both of them: 4 paths.
3 test cases for the first
4 test cases for the second