Data structure for storing chord progression rules? [closed] - language-agnostic

What would be the most appropriate (naturally suited) way to represent the various chord progression (musical) rules in a data-structure such that each chord had a weighted set of options that it could progress to?
This data structure would be implemented in a procedural music generation program in a way that you could code: (language-agnostic pseudo-code)
Chord[7] songArray;
Chord first = new Chord(I); // set the first chord's value
songArray[0] = first;
for (i = 0; i < 6; i++) {
    Chord temp = songArray[i].next(); // select the following chord
    songArray[i + 1] = temp;
}
Note: In classical-type music, each chord in a given key can naturally progress to another chord following these rules:
---------------------------
| Chord | Leads to        |
|-------|-----------------|
| I     | any             |
| ii    | V, vii          |
| iii   | IV, vi          |
| IV    | ii, V, vii      |
| V     | vi              |
| vi    | ii, IV, V       |
| vii   | I               |
---------------------------
The data structure would store the various progressions as weighted options. As an example, consider the IV chord in any given major key: IV can naturally progress to ii, V, or vii, but could also break the rules in progressing to any other chord. Breaking the rules would happen infrequently.
I have considered some sort of linked list/tree data structure, but it would hardly resemble any type of tree or list I've ever used -- additionally, I can't work out how to implement the weighting.
Another thought was to use JSON or something similar, but it seems to get redundant very quickly:
{
    "I": {
        "100%": ["I", "ii", "iii", "IV", "V", "vi", "vii"]
    },
    "ii": {
        "80%": ["V", "vii"],
        "20%": ["I", "ii", "iii", "IV", "vi"]
    },
    // ...
}
Note: I am comfortable implementing this in a handful of languages, and at this point am NOT concerned with a specific language implementation, but a language-agnostic data-structure architecture.

A Markov Chain might be a good fit for this problem.
A Markov chain is a stochastic process where the progression to the next state is determined (probabilistically) by the current state. So for a given chord from your table you would apply weights to the "Leads to" values and then randomly determine which state to progress to.
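A minimal sketch of that idea in Python -- the transition weights below are purely illustrative (the table above only says which progressions are "natural", not how likely they are):
import random

# Weights are made up: natural targets get a large weight, rule-breaking ones a small one.
transitions = {
    "I":   {"ii": 1, "iii": 1, "IV": 1, "V": 1, "vi": 1, "vii": 1},
    "ii":  {"V": 4, "vii": 4, "I": 1, "iii": 1, "IV": 1, "vi": 1},
    "iii": {"IV": 4, "vi": 4, "I": 1, "ii": 1, "V": 1, "vii": 1},
    "IV":  {"ii": 3, "V": 3, "vii": 3, "I": 1, "iii": 1, "vi": 1},
    "V":   {"vi": 6, "I": 1, "ii": 1, "iii": 1, "IV": 1, "vii": 1},
    "vi":  {"ii": 3, "IV": 3, "V": 3, "I": 1, "iii": 1, "vii": 1},
    "vii": {"I": 6, "ii": 1, "iii": 1, "IV": 1, "V": 1, "vi": 1},
}

def next_chord(current):
    # Weighted random choice among the current chord's options.
    options = transitions[current]
    return random.choices(list(options), weights=list(options.values()))[0]

song = ["I"]
for _ in range(7):
    song.append(next_chord(song[-1]))
print(song)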

I'd expect you to have less than 100 chords, therefore if you use 32 bits to represent probability series (likely extreme overkill) you'd end up with a 100x100x4 (40000) byte array for a flat Markov matrix representation. Depending on the sparsity of the matrix (e.g. if you have 50 chords, but each one typically maps to 2 or 3 chords) for speed and less importantly space reasons you may want an array of arrays where each final array element is (chord ID, probability).
In either case, one of the key points here is that you should use a probability series, not a probability sequence. That is, instead of saying "this chord has a 10% chance, and this one has a 10% chance, and this one has an 80% chance," say "the first chord has a 10% chance, the first two chords have a 20% chance, and the first three chords have a 100% chance."
Here's why: When you go to select a random but weighted value, you can generate a number in a fixed range (for unsigned integers, 0 to 0xFFFFFFFF) and then perform a binary search through the chords rather than linear search. (Search for the element with least probability series value that is still greater than or equal to the number you generated.)
On the other hand, if you've only got a few following chords for each chord, a linear search would likely be faster than a binary search due to a tighter loop, and then all the probability series saves you is calculating a simple running sum of the probability values.
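A quick sketch of the probability-series idea in Python (chord names and weights here are illustrative):
import bisect
import random

# Followers of one chord as (chord, weight) pairs; the weights are made up.
followers = [("ii", 30), ("V", 30), ("vii", 30), ("I", 4), ("iii", 3), ("vi", 3)]

# Running (cumulative) series: 30, 60, 90, 94, 97, 100
series = []
total = 0
for _, weight in followers:
    total += weight
    series.append(total)

def pick():
    # Draw in [0, total) and binary-search for the first cumulative value
    # greater than the draw.
    r = random.randrange(total)
    return followers[bisect.bisect_right(series, r)][0]

print(pick())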
If you don't require the most staggeringly amazing performance (and I suspect you don't -- for a computer there's just not that many chords in a piece of music) for this portion of your code, I'd honestly just stick to a flat representation of a Markov matrix -- easy to understand, easy to implement, reasonable execution speed.
Just as a fun aside, this sort of thing lends itself well to thinking about predictive coding -- a common methodology in data compression. You might consider an n-gram based algorithm (e.g. PPM) to achieve higher-order structure in your music generation without too much example material required. It's been working in data compression for years.

It sounds like you want some form of directed, weighted graph where the nodes are the chords and the edges are the progression options with edge weights being the progression's likelihood.

How to build a deep learning model that picks words from several distinct bags and forms a meaningful sentence [closed]

[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after another, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
In order to understand what a bag is, consider a vocabulary of 10,000 words.
The first bag contains the words Hello, India, Manager.
i.e. Bag 1 will have 1's at the indices of the words present in the bag.
e.g. Bag 1 will be of size 10000 x 1.
If Hello's index was 1, India's index was 2 and Manager's was 4, it will be
[0, 1, 1, 0, 1, 0, 0, 0, 0, ...]
I don't have a model yet.
I'm thinking of using story books, but it's still kind of abstract for me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on, and they must form a MEANINGFUL sentence in their numerical order!
First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means at this stage we need to decide what we are teaching the computer to begin with (i.e. what a verb is, what a noun is, grammar), but I will assume we will dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert them to numbers (one way would be to work alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time!) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of weight matrices to multiply with the dictionary keys, then discard any results that fall outside the range of dictionary keys; the rest can be used to look up values in the dictionary and make a sentence. Optionally, you can also apply a threshold to all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): pick the 50 most meaningful sentences (which might be hard at first) and use them to base your next 100 matrices on (the easiest way would be to take random weighted means of the 50 matrices, 100 times).
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that they will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices). I also recently heard of a different technique that dynamically changes the network, but I can't really help with that.
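As a rough sketch of the "breeding" step described above -- assuming the candidates are plain NumPy weight matrices, which is my assumption for illustration, not part of the original answer:
import numpy as np

rng = np.random.default_rng()

def breed(parents, offspring_count=100):
    # Each child is a random weighted mean of the selected parent matrices.
    children = []
    for _ in range(offspring_count):
        weights = rng.random(len(parents))
        weights /= weights.sum()
        children.append(sum(w * p for w, p in zip(weights, parents)))
    return children

# e.g. the 50 matrices behind the 50 most meaningful sentences
parents = [rng.standard_normal((16, 100)) for _ in range(50)]
next_generation = breed(parents)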
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and pad with "_nothing_" after the end up to a fixed length big enough to contain all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences).
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as string:
indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences,length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
#output is the same sentences shifted one step ahead (everything except the first word)
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before all 10 bags have been used; you may decide to finish the process with a shorter sentence then, especially if the other word probabilities are low.
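Putting those last steps together, a rough sketch of the generation loop (shapes follow the answer above; the helper names are mine, and the bag handling is only one possible interpretation):
import numpy as np

def one_hot(index, dictionary_size):
    v = np.zeros((1, 1, dictionary_size))
    v[0, 0, index] = 1.0
    return v

def generate_sentence(newModel, bags, word_dic, reverse_dic):
    # bags: list of 10 vectors of shape (dictionary_size,) with 1's for allowed words
    dictionary_size = len(word_dic)
    newModel.reset_states()
    current = one_hot(word_dic["_start_"], dictionary_size)
    sentence = []
    for bag in bags:
        probs = newModel.predict(current)[0, -1]      # (dictionary_size,)
        if probs.argmax() == word_dic["_end_"]:       # model wants to stop early
            break
        index = int((probs * bag).argmax())           # most probable word from this bag
        sentence.append(reverse_dic[index])
        current = one_hot(index, dictionary_size)
    return " ".join(sentence)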

Understanding stateful LSTM [closed]

I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows :
1. Training batching size
In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch ?
2. One-Char Mapping Problems
In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping' we're given code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K') and predicts 'B'; then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (last part too, I kept lines 52 and above):
# demonstrate a random starting point
letter1 = "M"
seed1 = [char_to_int[letter1]]
x = numpy.reshape(seed1, (1, len(seed1), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed1[0]], "->", int_to_char[index])
letter2 = "E"
seed2 = [char_to_int[letter2]]
seed = seed2
print("New start: ", letter1, letter2)
for i in range(0, 5):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
and got this output:
M -> B
New start: M E
E -> C
C -> D
D -> E
E -> F
It looks like the LSTM did not learn the alphabet but just the positions of the letters, and that regardless of the first letter we feed in, the LSTM will always predict B since it's the second letter, then C and so on.
Therefore, how does keeping the previous hidden state as the initial hidden state for the current batch help us with learning, given that during testing, if we start with the letter 'K' for example, letters A to J will not have been fed in before and the initial hidden state won't be the same as during training?
3. Training an LSTM on a book for sentence generation
I want to train my LSTM on a whole book to learn how to generate sentences and perhaps learn the author's style too. How can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe stateful LSTMs could help, but I'm not sure how.
Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you could check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful; of course the state will always flow within each sequence in the batch and you do not need to have batches of size 1 for that to happen. I see two scenarios where stateful models are useful:
You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
On prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with batch size of 1).
I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of example (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet by training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNNs with textual data, but there are a number of approaches you can research. You can build character-based models (like the ones in the post), where you input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions to do that. Having one single number as the feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each word, which is what they call an embedding. You can go even further with the preprocessing and look into something like NLTK, especially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you) you will need to be a bit more careful and use masking and/or sample weighting.
Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the author, your output could be a softmax Dense layer.
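For the "Y is X shifted by one" setup, a minimal Keras sketch (the layer and vocabulary sizes are placeholders of mine, not a tuned model):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size = 5000     # illustrative

model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# x: (samples, sequence_length) integer word indices
# y: (samples, sequence_length, vocab_size) one-hot words, shifted one step ahead of x
# model.fit(x, y, ...)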
Hope that helps.

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimations and that we also know what we don't know - eg. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve, the variables being the lengths of the tasks (or edge weights).
For instance, if the task lengths were t1, t2, t3 for 3 tasks, and you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. Gaussian elimination, http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (no other outcomes are possible).
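A quick sketch of this with NumPy, using least squares so it also copes with over- or under-determined systems:
import numpy as np

# Each row is a walk: coefficients for (t1, t2, t3) and the observed total time.
A = np.array([[1, 1, 0],    # t1 + t2      = 2
              [1, 1, 1],    # t1 + t2 + t3 = 7
              [0, 1, 1]])   # t2 + t3      = 6
b = np.array([2, 7, 6])

t, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(t)      # -> [1. 1. 5.]
print(rank)   # rank < number of unknowns means the system is underdetermined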
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe this falls under perturbation theory.) Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing some 'slack' task in each walk (i.e add more variables) and try to pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize sum of s_i), using Linear Programming Techniques.
Assume you have an unlimited number of arbitrary characters to represent each edge. (a,b,c,d etc)
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
if #w[i] ~= 1 then
replace w[2] with the LENGTH of w[i], minus all other values in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first. Replace all instances of "a" with 50 - b - c - d - e.
New data:
50, 50
50,-b,-d, 20
0,c,e 10
And, repeat until one value is left, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat the lists of tasks as vectors - every task represented as a component with a value equal to its cost (time to complete, in this case).
If tasks are in different orders initially, that's where to use domain knowledge to bring them to a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs will be substantially influenced by ordering / timing. Timing is implicit in the initial ordering, but you may have to make a function of time just for adjustment factors (say driving at lunch time vs driving at midnight). The function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardness of doing something). You may need a functional language to do repeated rewrites of your vectors till there's nothing more that domain knowledge and rules can change.
With canonical vectors, consider just the presence and absence of a task (just 0|1 for this iteration) and look for minimal diffs - single-task diffs first - which will provide estimates with a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal irreducible state - you can't make any more diffs and all vectors have the same remaining tasks - you can do some basic statistics like variance, mean and median, look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process from the start.
Yes, this can cost a lot :-)

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and more importantly its production of many false positives, would be generic string distance algorithms such as:
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line is "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, Extended ZIP Code, and also city)
Identify (look up) some of these entities (for example a relatively short database table can include all the cities / towns in the targeted area)
Identify (look up) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain allows you to abort the evaluation early, on some strong heuristics (e.g.: different City => Correlation = 0, level of confidence = 95%, etc.).
An important consideration with search for correlations is the need to a priori compare every single item (here address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sort that can be used as [very rough] indicator of a possible correlation (for example a key made of the 5 digit ZIP-Code followed by the SOUNDEX value of the "primary" name).
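A toy illustration of that digest/key idea -- the key format here is my own simplification (a real one would use something like SOUNDEX for the name part):
from collections import defaultdict

def block_key(address):
    # Very rough pre-grouping key: 5-digit ZIP plus the initial of the street line.
    return address["zip"][:5] + address["street"].strip()[0].upper()

addresses = [
    {"street": "West Lawnmower Drive 54 A", "zip": "12345"},
    {"street": "W. Lawn Mower Dr. 54A",     "zip": "12345"},
    {"street": "East Lawnmower Drive 54 A", "zip": "12345"},
]

buckets = defaultdict(list)
for a in addresses:
    buckets[block_key(a)].append(a)

# Only addresses sharing a key need the expensive pairwise comparison.
for key, group in buckets.items():
    print(key, len(group))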
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If your metric fulfils the following criteria then it helps:
the distance between an object and itself is zero (reflexive)
the distance from a to b is the same in both directions (symmetric)
the distance from a to c is not more than the distance from a to b plus the distance from b to c (triangle inequality)
If your metric obeys these then you can arrange your objects in a metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
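For reference, a plain dynamic-programming version of the Levenshtein distance (just the metric itself; a BK tree would sit on top of it):
def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions or substitutions
    # needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))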
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings: lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc.).
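For example, the normalization step could look something like this (the abbreviation table is obviously just a stub):
import re

ABBREVIATIONS = {"dr": "drive", "st": "street", "w": "west", "e": "east"}

def normalize(address):
    address = address.lower()
    address = re.sub(r"[^\w\s]", " ", address)   # drop punctuation
    words = [ABBREVIATIONS.get(w, w) for w in address.split()]
    return " ".join(words)

print(normalize("W. Lawn Mower Dr. 54A"))   # -> "west lawn mower drive 54a"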
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., and corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that they are the same if the correlation is above a given threshold.
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.

What's the absolute minimum a programmer should know about binary numbers and arithmetic? [closed]

Although I know the basic concepts of binary representation, I have never really written any code that uses binary arithmetic and operations.
I want to know:
What are the basic concepts any programmer should know about binary numbers and arithmetic? And
In what "practical" ways can binary operations be used in programming? I have seen some "cool" uses of shift operators and XOR etc., but are there some typical problems where using binary operations is an obvious choice?
Please give pointers to some good reference material.
If you are developing lower-level code, it is critical that you understand the binary representation of various types. You will find this particularly useful if you are developing embedded applications or if you are dealing with low-level transmission or storage of data.
That being said, I also believe that understanding how things work at a low level is useful even if you are working at much higher levels of abstraction. I have found, for example, that my ability to develop efficient code is improved by understanding how things are represented and manipulated at a low level. I have also found such understanding useful in working with debuggers.
Here is a short-list of binary representation topics for study:
numbering systems (binary, hex, octal, decimal, ...)
binary data organization (bits, nibbles, bytes, words, ...)
binary arithmetic
other binary operations (AND,OR,XOR,NOT,SHL,SHR,ROL,ROR,...)
type representation (boolean,integer,float,struct,...)
bit fields and packed data
Finally...here is a nice set of Bit Twiddling Hacks you might find useful.
Unless you're working with lower level stuff, or are trying to be smart, you never really get to play with binary stuff.
I've been through a computer science degree, and I've never used any of the binary arithmetic stuff we learned since my course ended.
Have a squizz here: http://www.swarthmore.edu/NatSci/echeeve1/Ref/BinaryMath/BinaryMath.html
You must understand bit masks.
Many languages and situations require the use of bit masks, for example flags in arguments or configs.
PHP has its error level which you control with bit masks:
error_reporting = E_ALL & ~E_NOTICE
Or simply checking if an int is odd or even:
isOdd = myInt & 1
I believe basic know-how about binary operations like AND, OR, XOR, NOT would be handy, as most programming languages support these operations in the form of bitwise operators.
These operations are also used in image processing and other areas in graphics.
One important use of XOR operation which I can think of is Parity check. Check this http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/xor.html
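As a tiny sketch of the parity idea (Python just for illustration):
def parity(value):
    # XOR all bits together: 1 if an odd number of bits are set, else 0.
    p = 0
    while value:
        p ^= value & 1
        value >>= 1
    return p

print(parity(0b1101))  # 3 set bits -> 1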
cheers
The following are things I regularly appreciate knowing in my quite conventional programming work:
Know the powers of 2 up to 2^16, and know that 2^32 is about 4.3 billion. Know them well enough so that if you see the number 2147204921 pop up somewhere your first thought is "hmm, that looks pretty close to 2^31" -- that's a very effective module for your bug radar.
Be able to do simple arithmetic; e.g. convert a hexadecimal digit to a nybble and back.
Have some vague idea of how floating-point numbers are represented in binary.
Understand standard conventions that you might encounter in other people's code related to bit twiddling (flags get ORed together to make composite values and AND checks if one's set, shift operators pack and unpack numbers into different bytes, XOR something twice and you get the same something back, that kind of thing.)
Further knowledge is mostly gravy unless you work with significant performance constraints or do other less common work.
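A couple of the points above, written out in Python as a quick reference (the numbers are just the standard ones):
# hex digit -> 4-bit value (nybble) and back
nybble = int("c", 16)          # 12, i.e. 0b1100
digit = format(12, "x")        # "c"

# powers of two worth recognising on sight
print(2 ** 16)                 # 65536
print(2 ** 31)                 # 2147483648 -- compare with 2147204921 above
print(2 ** 32)                 # 4294967296, about 4.3 billion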
At the absolute bare minimum you should be able to implement a bit mask solution. The tasks associated with bit mask operations should ensure that you at least understand binary at a superficial level.
Off the top of my head, here are some examples of where I've used bitwise operators to do useful stuff.
A piece of javascript that needed one of those "check all" boxes was something along these lines:
var check = true;
for(var i = 0; i < elements.length; i++)
check &= elements[i].checked;
checkAll.checked = check;
Calculate the corner points of a cube.
Vec3f m_Corners[8];
void corners(float a_Size){
for(size_t i = 0; i < 8; i++){
m_Corners[i] = a_Size * Vec3f(axis(i, Vec3f::X), axis(i, Vec3f::Y), axis(i, Vec3f::Z));
}
}
float axis(size_t a_Corner, int a_Axis) const{
return ((a_Corner >> a_Axis) & 1) == 1
? -.5f
: +.5f;
}
Draw a Sierpinski triangle
for(int y = 0; y < 512; y++)
for(int x = 0; x < 512; x++)
if(x & y) pixels[x + y * w] = someColor;
else pixels[x + y * w] = someOtherColor;
Finding the next power of two
int next = 1 << ((int)(log(number) / log(2)) + 1);
Checking if a number is a power of two
bool powerOfTwo = (number & (number - 1)) == 0;
The list can go on and on, but for me these are (except for Sierpinski) everyday examples. Once you understand and work with it, though, you'll encounter it in more and more places, such as the corners of a cube.
You don't specifically mention (nor rule out!-) floating point binary numbers and arithmetic, so I won't miss the opportunity to flog one of my favorite articles ever (seriously: I sometimes wish I could make passing a strict quiz on it a pre-req of working as a programmer...;-).
The most important thing every programmer should know about binary numbers and arithmetic is : Every number in a computer is represented in some kind of binary encoding, and all arithmetic on a computer is binary arithmetic.
The consequences of this are many:
Floating point "bugs" when doing math with IEEE floating point binary numbers (Which is all numbers in javascript, and quite a few in JAVA, and C)
The upper and lower bounds of representable numbers for each type
The performance cost of multiplication/division/square root etc operations (for embedded systems
Precision loss, and accumulation errors
and more. This is stuff you need to know even if you never do a bitwise xor, or not, or whatever in your life. You'll still run into these things.
This really depends on the language you're using. Recent languages such as C# and Java abstract the binary representation from you -- this makes working with binary difficult and is not usually the best way to do things anyway in these languages.
Middle and low level languages like C and C++, however, require you to understand quite a bit about how the numbers are stored underneath -- especially regarding endianness.
Binary knowledge is also useful when implementing a cross-platform protocol of some sort... for example, on x86 machines, byte order is little endian, but most network protocols want big endian numbers. Therefore you have to realize you need to do the conversion for things to go smoothly. Many RFCs, such as this one -> https://www.rfc-editor.org/rfc/rfc4648, require binary knowledge to understand.
In short, it's completely dependent on what you're trying to do.
Billy3
It's handy to know the numbers 256 and 65536. It's handy to know how two's complement negative numbers work.
Maybe you won't run into a lot of binary. I still use it pretty often, but maybe out of habit.
A good familiarity with bitwise operations should make you more facile with boolean algebra, and I think that's important for every programmer--you want to be able to quickly simplify complex logic expressions.
The absolute minimum is that "2" is not a binary digit and that 10b is smaller than 3.
If you never do low-level programming (like C in embedded systems), never have to use a debugger, and never have to work with real numbers, then I suppose you could get by without knowing binary. But knowing binary will make you a stronger programmer, even if indirectly.
Once you venture into those areas you will need to know binary (and its "sister" base, hexadecimal). Without knowing it:
Embedded systems programming would be impossible.
Debugging would be hard because you wouldn't know what you were looking at in memory.
Numerical calculations with decimals would give you answers you don't understand.
I learned to twiddle bits back when C and asm were still used for "mainstream" programming. Although I no longer have much use for that knowledge, I recently used it to solve a real-world business problem.
We use a fax service that posts a message back to us when the fax has been sent or failed after x number of retries. The only way I had to identify the fax was a 15 character field. We wanted to consolidate this into one URL for all of our clients. Before we consolidated, all we had to fit in this field was the FaxID PK (32 bit int) column which we just sent as a string.
Now we had to identify the client (a 4 character code) and the database (32 bit int) underneath the client. I was able to do this using base 64 encoding. Without understanding the binary representation of numbers and characters, I probably would never have even thought of this solution.
Some useful information about number systems.
Binary | base 2
Hexadecimal | base 16
Decimal | base 10
Octal | base 8
These are the most common.
Converting them is fairly easy.
112 base 8 = (1 x 8^2) + (1 x 8^1) + (2 x 8^0)
74 base 10 = (7 x 10^1) + (4 x 10^0)
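The same conversions in Python, if you want to check them:
print(int("112", 8))      # 74  -- (1*8**2) + (1*8**1) + (2*8**0)
print(int("1001010", 2))  # 74
print(oct(74), hex(74))   # 0o112 0x4a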
AND, OR, XOR, etc. are used in logic gates. Search for boolean algebra; it's something well worth the time to know.
Say for instance, you have 11001111 base 2 and you want to extract the last four only.
Truth table for AND:
P | Q | R
T | T | T
T | F | F
F | F | F
F | T | F
You can use 11001111 base 2 AND 00001111 base 2 = 00001111 base 2
There are plenty of resources on the internet.