Implementing CUDA kernel to use row wise features for histogram - cuda

I am trying to write a CUDA kernel that generates a row-wise histogram from an input feature set (2 x 6): each feature row (6 features) should produce its own histogram with nbins=10.
I have implemented the code below, but it does not seem to generate the correct row-wise histogram.
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=6, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
### Kernel Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
@cuda.jit
def row_wise_histogram(feature, output, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    for i in range(n):
        # Each thread will take all the row features to generate histogram
        input = feature[idx][i]
        bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
        if bin_number >= 0 and bin_number < output.shape[1]:
            cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, feature_size)
print(d_output.copy_to_host())
And the output is:
[[ 0. 0. 0. 0. 0. 0. 81111. 81111. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 162222. 0. 81111. 0.]]
which is wrong. I would appreciate help with the issue inside the row_wise_histogram function!

I think the main issue you have in your code is that your kernel has a thread strategy to have each thread process a row, and you have 2 rows in your feature dataset, but you are launching 12 threads total:
### Kernel Configuration
threads_per_block = 6
blocks = 2
10 of those threads will be indexing out-of-bounds. For 2 rows you only need 2 threads. We can fix this multiple ways, but I will add a "thread-check" to your kernel code, to prevent out-of-bounds threads from doing anything.
You are also histogramming values that don't fit in your output array. Let's suppose your feature has an input value of 4 at some location. Let's put that value through your arithmetic:
bin_number = np.int32(nbins * (np.float32(4) - np.float32(-4)) / (np.float32(4) - np.float32(-4)))
That is 10 * (4-(-4))/(4-(-4))
So that is a bin index of 10. But you only have 10 bins, so valid bin indices only go up to 9. This means some of your input values (e.g. 4, 5) will not be recorded in your output.
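To make the arithmetic concrete, here is a small host-side sketch (plain Python, no GPU needed) of the same bin formula. The helper name `bin_index` is hypothetical and not part of the question's code; it only mirrors the kernel's expression with xmin=-4, xmax=4 and nbins=10.

```python
def bin_index(value, xmin=-4.0, xmax=4.0, nbins=10):
    """Same formula as the kernel: scale into [0, nbins) and truncate."""
    return int(nbins * (value - xmin) / (xmax - xmin))

for v in (1, 2, 3, 4, 5):
    b = bin_index(v)
    status = "ok" if 0 <= b < 10 else "dropped (out of range)"
    print(v, "->", b, status)
```

Inputs 1, 2 and 3 land in bins 6, 7 and 8, while 4 and 5 map to indices 10 and 11, which the kernel's range check silently discards.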
The following is your code with the thread check added and the input range adjusted. I am also printing the input, the bin assigned to each input value, and the output bins. It seems to be working correctly.
$ cat t65.py
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=4, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
mybin = np.empty_like(feature)
### Kernel Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
d_mybin = cuda.to_device(mybin)
@cuda.jit
def row_wise_histogram(feature, output, mybin, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    # thread check: only threads that map to a real row do any work
    if idx < output.shape[0]:
        for i in range(n):
            # Each thread will take all the row features to generate histogram
            input = feature[idx][i]
            bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
            mybin[idx][i] = bin_number
            if bin_number >= 0 and bin_number < output.shape[1]:
                cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, d_mybin, feature_size)
print(feature)
print(d_mybin.copy_to_host())
print(d_output.copy_to_host())
$ python t65.py
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
$ cuda-memcheck python t65.py
========= CUDA-MEMCHECK
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
========= ERROR SUMMARY: 0 errors
$
Note that when I restrict the input values to 1..3, then the maximum bin index is 8 (do the math). If I increase the input range to include 4, the maximum bin index goes to 10, which "won't fit". You're correctly handling this case, but it may confuse you, as these values of 4 or 5 won't be recorded in the output. Histogram bin arithmetic is fun. You will need to work out exactly what you want.
Also note that if you run this code, you should see output almost exactly the same as above. If you don't, there is a good chance your numba or cuda install is broken somehow, and the additional run I show with cuda-memcheck will help to discover what may be the issue.
Note that since you are using atomics anyway, there isn't any particular need to assign one thread to each row, you could instead assign one thread to each input point. But that isn't your question; it's a story for another day. Conversely, if you do proceed with one thread per row, each thread doing effectively a private histogram, there is no particular need to use atomics.
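When checking a kernel like this, a CPU reference can help. The following is a minimal sketch, assuming the same bin formula as the kernel; the function name `row_wise_histogram_cpu` is made up for illustration, and the input array is copied from the printed output above rather than regenerated from the random seed.

```python
import numpy as np

def row_wise_histogram_cpu(feature, nbins=10, xmin=-4.0, xmax=4.0):
    """One histogram per row, using the kernel's bin arithmetic."""
    out = np.zeros((feature.shape[0], nbins), dtype=np.float32)
    # Truncating cast, like np.int32(...) in the kernel.
    bins = (nbins * (feature.astype(np.float32) - xmin) / (xmax - xmin)).astype(np.int32)
    for r in range(feature.shape[0]):
        for b in bins[r]:
            if 0 <= b < nbins:  # drop values whose bin does not fit
                out[r, b] += 1
    return out

feature = np.array([[1, 2, 1, 2, 2, 3],
                    [1, 3, 1, 1, 1, 3]])
print(row_wise_histogram_cpu(feature))
```

Comparing this array with `d_output.copy_to_host()` is a quick way to see whether a discrepancy comes from the kernel logic or from the launch configuration.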

Related

Removing DC component for matrix in chunks in octave

I'm new to Octave, and if this has been asked and answered then I'm sorry, but I don't know the right term for what I'm looking for.
I'm trying to remove the DC component from a large matrix, but in chunks, as I need to do calculations on each chunk.
What I have so far:
r = dlmread('test.csv',';',0,0);
x = r(:,2);
y = r(:,3); % we work on the 3rd column
d = 1
while d <= (length(y) - 256)
  e = y(d:d+256);
  avg = sum(e) / length(e);
  k(d:d+256) = e - avg; % this is the part I need help with, how to get the chunk with the right value into the matrix
  d += 256;
endwhile
% to check the result I like to see it
plot(x, k, '.');
If I change the line into:
k(d:d+256) = e - 1024;
it works perfectly.
I know there is something like an element-wise operation, but if I use e .- avg I get this:
warning: the '.-' operator was deprecated in version 7
and it still doesn't do what I expect.
I must be missing something, any suggestions?
GNU Octave, version 7.2.0 on Linux(Manjaro).
Never mind, the code works as expected.
The result (k) got corrupted because the chosen chunk size was too small for my signal. Changing 256 to 4096 got me a better result.
+ and - are always element-wise. Beware that d:d+256 is 257 elements, not 256. So if you then increment d by 256, consecutive chunks overlap by one point.
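The off-by-one in the chunk range is easy to avoid with half-open slices. Below is a minimal NumPy sketch of the same idea, not a translation of the Octave script: the function name `remove_dc_in_chunks` is hypothetical, and leaving a trailing partial chunk untouched is an assumption that mirrors the loop condition in the question.

```python
import numpy as np

def remove_dc_in_chunks(y, chunk=256):
    """Subtract the mean of each non-overlapping chunk of `chunk` samples."""
    y = np.asarray(y, dtype=float)
    out = y.copy()
    # Half-open slices [start, start + chunk) hold exactly `chunk` samples,
    # so consecutive chunks never share a point.
    for start in range(0, len(y) - chunk + 1, chunk):
        seg = y[start:start + chunk]
        out[start:start + chunk] = seg - seg.mean()
    return out  # a trailing partial chunk is left unchanged

y = np.concatenate([np.full(4, 10.0), np.full(4, 20.0), [5.0]])
print(remove_dc_in_chunks(y, chunk=4))
```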

Created Shared Memory Code with Python Cuda

I'm struggling to get some code running that uses shared memory for a fast matrix multiply. But every time I try this I run into errors that I cannot fathom.
import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
block_size = (m,n)
grid_size = (int(m/n),int(m/n))
@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum
and testing
mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()
keeps resulting in the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
After chatting with the provider of one answer, I've updated the function, but I still cannot make this work. To compute the sum for each element of the output c we need to loop over the columns of A and the rows of B, using i as the index. We have therefore n*n products. I think the i is correct in the sum, but I cannot seem to get the correct index for the row and column of a and b in the expression for the sum.
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]
    c[row][column] = sum
Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code I see this:
/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
327 _logger.critical(msg, _getpid(), self.pid)
328 raise CudaDriverError("CUDA initialized before forking")
--> 329 raise CudaAPIError(retcode, msg)
330
331 def get_device(self, devnum=0):
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
When I fix that I see this:
$ cuda-memcheck python somethingsometing.py
========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
========= at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
========= by thread (15,11,0) in block (3,2,0)
========= Address 0x00000ec0 is out of bounds
The why is pretty obvious:
for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]
row and column are dimensions in the execution grid, not the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache (note also that you seemed to lapse into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python).
To fix it you will have to change the indexing and then implement the rest of the code for multiplication (i.e. you must iteratively load the whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair which a block will process).
Note also that:
The dimensions you have selected for c are wrong (should be m x m)
The grid size you run the kernel on is also wrong because the dimensions of C are wrong and so your code could never calculate the whole matrix
Even after fixing all of this, it is likely that the results of the multiplication will be incorrect at anything other than trivial sizes because of integer overflow.
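To illustrate what "iteratively load the whole row and column slices through the tiles" means, here is a CPU-only NumPy sketch of the tiling scheme. It is an illustration under assumptions, not the kernel itself: the name `tiled_matmul`, the tile width of 16, and the 64-bit accumulator (chosen as a hedge against the overflow concern) are not from the question.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Blocked matrix multiply: each output tile is built from a sequence
    of small (tile x tile) products, as a shared-memory kernel would do."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    assert m % tile == 0 and n % tile == 0 and k % tile == 0
    c = np.zeros((m, n), dtype=np.int64)  # result is m x n (here 128 x 128)
    for bi in range(0, m, tile):          # one "block" per output tile
        for bj in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=np.int64)
            for bk in range(0, k, tile):  # walk along the shared dimension
                a_tile = a[bi:bi + tile, bk:bk + tile].astype(np.int64)
                b_tile = b[bk:bk + tile, bj:bj + tile].astype(np.int64)
                acc += a_tile @ b_tile    # partial dot products for this tile
            c[bi:bi + tile, bj:bj + tile] = acc
    return c

m, n = 128, 32
a = np.arange(m * n).reshape(m, n).astype(np.int32)
b = np.arange(m * n).reshape(n, m).astype(np.int32)
c = tiled_matmul(a, b)
print(c.shape)
```

In the GPU version, `a_tile` and `b_tile` correspond to the shared arrays, the `bk` loop is the loop inside the kernel, and each thread owns one element of `acc`.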
@disruptive: Hi, did you find any solution to your problem?
I had the same problem as you but I solved it by restarting the kernel of Jupyter notebook.
My code is slightly different than yours:
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    col, row = cuda.grid(2)
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row][col]
    b_cache[cuda.threadIdx.y, cuda.threadIdx.x] = b[col][row]
    for i in range(a.shape[1]):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]
        cuda.syncthreads()
        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]
        # Wait until all threads finish computing
        cuda.syncthreads()
    c[row][col] = sum
Please let me know if you have any update.
This is the correct solution:
import numpy as np
from numba import cuda, types
# Note: N (tile width) and block_size are defined outside this snippet.
@cuda.jit
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    # TODO: use each thread to populate one element each a_cache and b_cache
    x,y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    TPB = int(N)
    # integer division: range() needs an int number of tiles
    for i in range(a.shape[1] // TPB):
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):#a.shape[1]):
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx][j] * b_cache[j][ty]
        cuda.syncthreads()
    c[x][y] = sum

Octave leasqr only doing one iteration

As I'm trying to fit a function to some experimental data, I've written a function with three inputs, three parameters and one output:
qrfunc = @(x, p) exp(-1*p(1)*x(:,1) - p(2)*x(:,2))+p(3)*x(:,3)+20;
When I generate some input and output values:
pS = [0.5; 0.3; 0.3];
x1 = [1 1 1; 1 1.1 1; 1 1.1 1.1; 2 1.2 2];
y1 = qrfunc(x1, pS);
And call the leasqr function:
pin =[1; 1; 1];
[f1, p1, kvg1, iter1, corp1, covp1, covr1, stdresid1, Z1, r21] = leasqr(x1, y1, pin, qrfunc, 0.0001);
This works correctly: the function makes 7 iterations and provides the right outputs.
But when I load x1 from my experimental data (a text file with three columns, about 1500 lines) as well as my y1 (a text file with the same amount of lines) and run the same function, it only makes one iteration, and does not change the parameters.
It even shows that the error margins are very high:
sqrt(diag(covp1))
ans =
3.0281e+004
3.7614e+005
1.9477e-002
What am I doing wrong? There are no error messages, no 'Convergence not achieved' or anything like that...
Edit:
The data is loaded with the command:
load "input.txt"
load "output.txt"
Proof of loading:
size(input)
ans =
1540 3
The first few lines from my input file:
10 0.4 5
20 0.4 5
30 0.4 5
40 0.4 5
50 0.4 5
The second and third parameters have different values further down the line.

Using arrayfun to apply two arguments of a function on every combination

Let i = [1 2] and j = [3 5]. Now in octave:
arrayfun(@(x,y) x+y,i,j)
we get [4 7]. But I want to apply the function on the combinations of i vs. j to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)]=[4 6 5 7].
How do I accomplish this? I know I can go with for-loops, but I want vectorized code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = @(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = @(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in @plus -
out = reshape(bsxfun(@plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in Matlab at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for its more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;
y = arrayfun(@(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(@(x,y)x+y,a,b);
toc
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k);
end
toc
In Matlab R2015a, the for loop method is over 70 times faster run from the Command window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.
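For readers coming from NumPy, the same "every combination" pattern can be sketched with broadcasting. This is an analogue for comparison, not Octave code; `apply_pairwise` is a hypothetical helper name.

```python
import numpy as np

i = np.array([1, 2])
j = np.array([3, 5])

# Broadcasting a column against a row gives every pairwise sum.
pairwise = i[:, None] + j[None, :]   # shape (2, 2)
out = pairwise.ravel()               # row-major: i[0]+j[0], i[0]+j[1], ...
print(out)  # [4 6 5 7]

# The same pattern works for any vectorized two-argument function.
def apply_pairwise(func, a, b):
    return func(a[:, None], b[None, :]).ravel()

print(apply_pairwise(np.multiply, i, j))
```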

Code Golf: Who has the best poker hand?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I love challenges like this, I'll hopefully submit my answer soon.
Which player has the best 7 card hand?
Given an unordered list of 9 cards (separated by a space), work out which player has the best poker hand. Here is a list of poker hand rankings. Example input:
2C 5H AS KS 2D 4D QD KH 3S
(ie: [[2C 5H] [AS KS] [2D 4D QD KH 3S]])
First 2 cards in the array represent player 1's hand, second 2 in the array represent player 2's hand. The last 5 cards represent the community cards, cards both players share. In effect, both players have 7 cards, and you must determine which player has the best 5 card poker hand.
A card is defined as a string, with the first character representing the card value, and the second value representing the suit. Always upper-case. No card may appear twice.
The function will calculate if the hand is a draw or a win to either player. It will output the totals at the end of the input. The output format is defined later on in this post.
Examples
2C 5H AS KS 2D 4D QD KH 3S
(ie: [[2C 5H] [AS KS] [2D 4D QD KH 3S]])
Player 2 wins this hand. Player 1 has a pair of 2's, player 2 has a pair of kings.
5S 6S 8H 9D 7S 8S JH TS 2H
(ie: [[5S 6S] [8H 9D] [7S 8S JH TS 2H]])
Player 1 wins this hand. Player 1 has a flush, player 2 has a straight.
2S 2H AC AS 2C AH 9H TS 2D
(ie: [[2S 2H] [AC AS] [2C AH 9H TS 2D]])
Player 1 wins this hand. Player 1 has quads, player 2 has a full house
5S 6S 2D 4D 9S AS KD JC 9D
(ie: [[5S 6S] [2D 4D] [9S AS KD JC 9D]])
A draw. Both players have Ace high.
More Info
Thanks to mgroves for the following link to Project Euler which has a similar problem:
http://projecteuler.net/index.php?section=problems&id=54
Test Data
We will use the Project Euler test data:
http://projecteuler.net/project/poker.txt
Your solution should accept that text file as input, and output a total of wins and draws.
Example Output
Output must be in this format:
1: 45
2: 32
D: 12
Player 1 won 45 hands, player 2 won 32 hands, and there were 12 draws. (Not actual results)
Rules
Doesn't have to return the winning hand type, only WHO won if anyone
Card list input has no particular order
No card appears twice in the input
Input is always uppercase
Takes the Project Euler test data as an input
Outputs a count, of which player won the most hands and total draws in given format above
Perl, 414 398 370/458 344/416 char
Line breaks are not significant.
%M=map{$_,$Z++}0..9,T,J,Q,K,A;sub N{/.$/;$M{$`}.$&}
sub B{$s=@p=();
for$m(@_){$m-$_||($s+=2,++$p[$m])for@_}
@_=sort{$p[$b]-$p[$a]||$b-$a}@_;
$s=23 if$s<11&&($_[0]-$_[4]<5||$_[0]-$_[1]>8&&push@_,shift);
"@_"=~/.$/;$s+=14*(4<grep/$&/,@_);
$s=100*$s+$_ for@_;$s}
++$X{B((@c=map{N}split)[0..4])<=>B(@c[5..9])}for<>;
printf"1: %d\n2: %d\nD: %d\n",@X{1,-1,0}
This solves the "10 card" problem (10 cards are dealt, player 1 has the first 5 cards and player 2 has the second 5 cards).
The first section defines a subroutine N that can transform each card so that it has a numerical value. For non-face cards, this is a trivial mapping (5H ==> 5H) but it does transform the face cards (KC => 13C, AD => 14D).
The last section parses each line of input into cards, transforms the cards to contain numerical values, divides the cards into separate hands for the two players, and analyzes and compares those hands. Every hand increments one element of the hash %X. When all the input is parsed, %X contains the number of hands won by player 1, won by player 2, or ties.
The middle section is a subroutine that takes a set of five cards as input and produces a 12-digit number with the property that stronger poker hands will have higher-valued numbers. Here's how it works:
for$m(@_){$m-$_||($s+=2,++$p[$m])for@_}
This is the "pair" detector. If any two cards have the same numerical value, increment a hash element for one of the cards and increase the "score" variable $s by two. Note that we will end up comparing each card to itself, so $s will be at least 10 and $p[$x] will be at least one for every card $x. If the hand contains three of a kind, then those three cards will match with the other two cards -- it will be like there are 9 matches among those three cards and the "score" will be at least 18.
@_=sort{$p[$b]-$p[$a]||$b-$a}@_;
Sort the cards by (1) the number of times that card is part of a "pair" and (2) the value of the card. Thus in a hand with two 7's and two 3's, the two 7's will appear first, followed by the two 3's, followed by the kicker. In a hand with two 7's and three 3's, the three 3's will be first followed by the two 7's. The goal of this ordering is to distinguish two hands that have the same score -- a hand with a pair of 8's and a hand with a pair of 7's both have one pair, but we need to be able to tell that a pair of 8's is better.
$s=23 if$s<11&&($_[0]-$_[4]<5||$_[0]-$_[1]>8&&push@_,shift);
This line is the "straight" detector. A straight is worth 23 points and occurs when there are no pairs in the hand ($s<11 means only 5 "pairs" - each card matching with itself - were found) and either (1) the value of the highest card is exactly four more than the value of the lowest card ($_[0]-$_[4]==4), or (2) the highest value card is an Ace and the next highest card is a 5 ($_[0]-$_[1]==9), which means the hand has an A-2-3-4-5 straight. In the latter case, the Ace is now the least valuable card in the hand, so we manipulate @_ to reflect that (push@_,shift)
"@_"=~/.$/;$s+=14*(4<grep/$&/,@_);
This line is the flush detector. A flush is worth 14 more points and occurs when the last character is the same for each card. The first expression ("@_"=~/.$/) has the side effect of setting $& to the last character (the suit) of the last card in the hand. The final expression (4<grep/$&/,@_) will be true if and only if all elements of @_ have the same last character.
$s=100*$s+$_ for@_;$s}
Creates and returns a value that begins with the hand's score and then contains the values of the cards, in order of the card's importance. Scores for the various hands will be
Hand             Score
---------------  -----
High card        10     (each card matches itself for two points)
One pair         14     (2 additional matches)
Two pair         18     (4 additional matches)
Three of a kind  22     (6 additional matches)
Straight         23     (no pair, but 23 points for straight)
Flush            24     (no pair, but 14 additional points for the flush)
Full house       26     (8 additional matches)
4 of a kind      34     (12 additional matches)
Straight flush   37     (23 + 14 points)
which is consistent with the rules of poker. Hands with the same score can be distinguished by the values of the hand's cards, in order of importance to the hand, all the way down to the least valuable card in the hand.
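For readers who do not speak golfed Perl, the scoring scheme described above can be sketched in Python. This is an illustrative re-implementation under assumptions, not the author's code: the name `hand_key` is made up, and it returns a (score, ordered values) tuple instead of packing everything into one 12-digit number.

```python
RANKS = {r: v for v, r in enumerate("23456789TJQKA", start=2)}

def hand_key(cards):
    """cards: five strings such as 'KS'. Larger keys mean stronger hands."""
    vals = [RANKS[c[0]] for c in cards]
    suits = {c[1] for c in cards}
    # Every card "matches" itself and each card of equal rank: 2 points per match.
    count = {v: vals.count(v) for v in vals}
    score = 2 * sum(count[v] for v in vals)
    # Order by how often a rank occurs, then by rank, both descending.
    vals.sort(key=lambda v: (count[v], v), reverse=True)
    if score == 10:                                  # no pairs at all
        if vals[0] - vals[4] == 4:                   # five consecutive ranks
            score = 23
        elif vals[0] == 14 and vals[1] == 5:         # A-2-3-4-5
            score = 23
            vals = vals[1:] + vals[:1]               # the ace plays low
    if len(suits) == 1:                              # flush bonus
        score += 14
    return (score, vals)

print(hand_key("2C 5H 2D 4D QD".split()))
print(hand_key("5S 6S 7S 8S 9S".split()))
```

Comparing two such tuples with `<` or `>` mirrors comparing the packed numbers: the score decides first, then the card values in order of importance.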
The solution to the 9 card problem (two cards to player 1, two cards to player 2, the players share the next 5 cards and build their best 5 card hand) needs about 70 more strokes to choose the best 5 card hand out of the 7 cards available to each player:
%M=map{$_,$Z++}0..9,T,J,Q,K,A;sub N{/./;$M{$&}.$'}
sub A{my$I;
for$k(0..41){@d=@_;splice@d,$_,1for$k%7,$k/7;$s=@p=();
for$m(grep$_=N,@d){$m-$_||($s+=2,$p[$m]++)for@d}
@d=sort{$p[$b]-$p[$a]||$b-$a}@d;
$s=23 if$s<11&&($d[0]-$d[4]<5||$d[0]-$d[1]>8&&push@d,shift@d);
"@d"=~/.$/;$s+=14*(4<grep/$&/,@d);
$s=100*$s+$_ for@d;
$I=$s if$s>$I}$I}
++$X{A((@c=split)[0,1,4..8])<=>A(@c[2..8])}for<>;
printf"1: %d\n2: %d\nD: %d\n",@X{1,-1,0}
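The extra work in the 9 card version is a search over the 5-card hands that can be made from 7 cards. A minimal sketch of that search, assuming some scoring function is supplied by the caller: `best_five_of_seven` and the toy `high_card_key` are hypothetical names, and the toy key ranks hands by high cards only, purely to exercise the search.

```python
from itertools import combinations

def best_five_of_seven(cards, key):
    """Return the 5-card subset of `cards` that maximizes `key`."""
    # There are only C(7, 5) = 21 subsets, so brute force is fine.
    return max(combinations(cards, 5), key=key)

# Toy scoring function (high cards only); a real evaluator would go here.
ORDER = "23456789TJQKA"
def high_card_key(hand):
    return sorted((ORDER.index(c[0]) for c in hand), reverse=True)

seven = "2C 5H 2D 4D QD KH 3S".split()
print(best_five_of_seven(seven, high_card_key))
```

The Perl code reaches the same subsets by removing two cards at a time, which visits each subset more than once but costs fewer characters.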
GolfScript - 151/187 chars
This program works on an input list of 10 cards per line, i.e. two 5 card hands.
n%0.@{3/5/{[zip~;.&,(!15*\[{n),*"TJQKA"+?}/]:|$),-4>=14*+1|{.2\?|@-,5\-.49?@*@+\.+@+\}/.16445=13*@+\]}%.~={0):0;;}{~>.!@+\@+\}if}/"1: "@@n"2: "@n"D: "0
This program works on an input list of 9 cards per line, of the format described in the specifications.
n%0.@{3/.4>:D;2/2<{D+.{3/1$^.{3/1$^[zip~;.&,(!15*\[{n),*"TJQKA"+?}/]$:|),-4>=14*+1|{.2\?|@-,5\-.49?@*@+\.+@+\}/.16445=13*@+\]}%\;~}%$-1=\;}%.~={0):0;\(\}*~>.!@+\@+\}/"1: "@@n"2: "@n"D: "0
Haskell: 793 796 806 826 864 904 901 880 863
Since the text file is inconsistent with 9 card hands, I'm just reading a line from the console and outputting who wins.
Bugfixes:
Ace now counts lower than a 2 in an ace-low run.
Comparing full houses fixed (again :D).
Guarantees that the best version of a given hand type is chosen. For example, if a player can choose between a 2-6 run and a 3-7 run, the 3-7 run is chosen (flushes aside).
Now shorter than the PHP solution!
Golfed:
import Data.List
(%)=mod
m=map
y=foldr1
t=0<1
z=13
w=[0,1,2,3,12]
n&x|length x<n=[]|t=take n x
b?x|b=x|t=[]
n!k= \c->e(n&m(%k)c)?(n&c)
e[]=1<1
e(x:y)=all(x==)y
k q c|any null[q c,p$c\\q c]=[]|t=q c
f=5!4
s c=(sort(m(%z)c)`elem`w:[[n..n+4]|n<-[0..8]])?c
r=3!z
p=2!z
g x y|c x y<2=x|t=y
q x(_,[])=x
q _ y=y
b h=y q$m($h)$zipWith(\t f->(,)t.y g.m(f.take 5).permutations)[1..][1!1,p,k p,r,s,f,k r,4!z,s.f]
h=reverse.a.m(%z)
a v|w\\v==[]=[-1..3]|t=sort v
c x y=o(h x)$h y
o[](_:_)=2
o[]_=0
o _[]=1
o(a:b)(k:d)|a>k=1|a<k=2|t=o b d
d n(a,k)|a==[]=0|n<1=0|r>s=1|r<s=2|f/=0=f|t=d(n-length o)(a\\o,k\\u)where(r,o)=b a;(s,u)=b k;f=c o u
i x=head.findIndices(x==)
u(n:k)c@[r,s]|n%z==i r"23456789TJQKA"&&n%4==i s"HDSC"=n|t=u k c
l c=(2&c++snd(splitAt 4c),drop 2c)
main=getLine>>=print.d 5.l.m(u[0..]).words
Ungolfed:
import Control.Exception (assert)
import Data.List (permutations, sort, intersect, findIndices, (\\))
import Data.Function (on)
(%) = mod
aceLowRun = [0,1,2,3,12]
tryTake n xs
  | length xs < n = []
  | otherwise = take n xs
cond ? xs
  | cond = xs
  | otherwise = []
eqOn n f cards = allEq (tryTake n $ map f cards) ? tryTake n cards
allEq [] = False
allEq (x:xs) = all (== x) xs
combWithPair pokerHand cards
  | any null [picked1, picked2] = []
  | otherwise = pokerHand cards
  where
    picked1 = pokerHand cards
    picked2 = pair $ cards \\ picked1
straightFlush = straight . flush
quads = eqOn 4 (% 13)
fullHouse = combWithPair triples
flush = eqOn 5 (% 4)
straight cards = (sort (map (% 13) cards) `elem` runs) ? cards
  where
    runs = aceLowRun : [[n..n+4] | n <- [0..8]]
triples = eqOn 3 (% 13)
twoPair = combWithPair pair
pair = eqOn 2 (% 13)
single = eqOn 1 id
bestVersionOfHand [] ys = ys
bestVersionOfHand xs [] = xs
bestVersionOfHand xs ys
  | compareSameRankedHands xs ys < 2 = xs
  | otherwise = ys
rate rating pokerHand cards = (rating, handResult)
  where
    handResult = foldr1 bestVersionOfHand
      (map (pokerHand . take 5) $ permutations cards)
pokerHands = zipWith rate [1..] [
    single
  , pair
  , twoPair
  , triples
  , straight
  , flush
  , fullHouse
  , quads
  , straightFlush
  ]
bestHand hand = foldr1 (\xs ys -> if null (snd ys) then xs else ys)
  (map ($ hand) pokerHands)
highestVals = reverse . arrangeVals . map (% 13)
  where
    arrangeVals vals = if vals `intersect` aceLowRun == aceLowRun
      then [-1..3]
      else sort vals
compareSameRankedHands = compareSameRankedHands' `on` highestVals
compareSameRankedHands' [] [] = 0
compareSameRankedHands' (card1:cards1) (card2:cards2)
  | card1 > card2 = 1
  | card1 < card2 = 2
  | otherwise = compareSameRankedHands' cards1 cards2
decideWinner n cards1 cards2
  | null cards1 = assert (null cards2) 0
  | n < 1 = 0
  | rating1 > rating2 = 1
  | rating1 < rating2 = 2
  | cmpRes /= 0 = cmpRes
  | otherwise = decideWinner
      (n - assert (length bests1 == length bests2) (length bests1))
      (cards1 \\ bests1)
      (cards2 \\ bests2)
  where
    (rating1, bests1) = bestHand cards1
    (rating2, bests2) = bestHand cards2
    cmpRes = compareSameRankedHands bests1 bests2
indexOf x = head . findIndices (x==)
toNum = toNum' [0..]
toNum' (n:ns) [rank, suit]
  | n % 13 == indexOf rank "23456789TJQKA" && n % 4 == indexOf suit "HDSC" = n
  | otherwise = toNum' ns [rank, suit]
cluster cards = (take 2 cards ++ snd (splitAt 4 cards), drop 2 cards)
main = getLine >>= print
  . uncurry (decideWinner 5)
  . cluster
  . map toNum
  . words
GolfScript 258 241 247/341 217/299 char
Solution for the 10 card problem. Only the last couple of newlines are significant:
10:T):J):Q):K):A;0:a;0:b;0:d;"\r\n"%{' '/5/{.{)\;}/4*-+++!:f;{);~}%{$0:z(%{.z-
!99*+:z}%}5*.{+}*99/:P!{..)\(@4+-!2*\;\.2<~9+-!\;+}and:s;[s f*6P=4P=f s P 6$]\;}
%.~={;;d):d;}{~>{a):a;}{b):b;}if}if}/
'1: 'a'
2: 'b'
D: 'd n
The 9 card problem currently needs about 80 more characters.
10:T):J):Q):K):A;0:a;0:b;0:d;"\r\n"%{' '/);{('Z'%+}2*[0$2>\7<]
{:H;7,{H=:x;H{x=!},:I;6,{I=:x;I{x=!},}/}%{.{)\;}/4*-+++!:f;
{);~}%{$0:z(%{.z-!99*+:z}%}5*.{+}*99/:P!{..)\(@4+-!2*\;\.2<~9+-!\;+}and:s;[
s f*6P=4P=f s P 6$]\;}%{[\].~>{~;}{~\;}if}*}%.~={;;d):d;}{~>{a):a;}{b):b;}if}if}/
'1: 'a'
2: 'b'
D: 'd n
Less golfed version of 10 card problem.
10:T;11:J;12:Q;13:K;14:A; # map for face cards
0:a;0:b;0:d; # other initialization
"\r\n"% # split input on \n
{ # on each line of input
' '/ # divide line into ten cards
5/ # split into five card hands
{. # on each of the two hands
{)\;}% # chop last character of each card
.(5*\;\{+}*= # check sum of elem == 5*1st elem
:f; # this is the flush flag
{);~}%$ # reduce cards to numerical values
0:z;{.z- 20%{}
{;z 20+}if:z}%{-1*}$ # detect pairs
.(:h;; # extract value of highest card
20h>{..)\(@4+-!2*\;\ # detect straight
.2<~9+-!\;+}and:s; # s=2 for regular straight, s=1 for A-5 straight
# result of this mapping - 6 elem array
[ 0$ # #6 - cards in the hand
.{20/}%{+}*:P # #5 - number of pairs
s # #4 - is this a straight?
f # #3 - is this a flush?
4P= # #2b - is this a full house?
h 59> # #2 - is this 4 of a kind?
s f * # #1 - is this a straight flush?
]-1%
\;
}/
\.@.@ # put [hand1 hand2 hand1 hand2] on stack
= # check hand1==hand2
{;;d):d;} # if equal, increment d (draw)
{>{a):a;} # if >, increment a (player 1 wins)
{b):b;}if # if <, increment b (player 2 wins)
}if
}/
# output results
'1: 'a'
2: 'b'
D: 'd n
C, 665+379 chars
Here's my answer in 2 parts.
The first is a complete 7 card evaluator, including the "AddCard" macro A. It returns a 32-bit number ranking the hand. The high nibble is the type, bits 13..25 indicate the high card(s) and bits 0..12 indicate the kicker(s). When comparing the results, the better hand will always have the larger value.
#define U unsigned
#define c(a)X=a;i=C=0;while(X){C|=(X&1)<<i++;X/=4;}
#define A(h,c)h[c&7]+=c,h[3]|=c
U C,i,X;
U E(U h[]){
U a=h[0]+h[1]+h[2]+h[4]-(h[3]&-16),t,v,k,e=a&0x55555540,o=a&0xAAAAAA80;
if(v=e&o/2){t=7;k=h[3]^v;i=0;while(k/=4)i++;k=1<<2*i;}
else if(v=o&o-1){t=6;v/=2;k=o/2^v;}
else if(e>1&o>1){t=6;v=o/2;k=(i=e&e-1)?i:e;}
else{a=h[3];
if(t=h[i=1]-(a&1)&4||h[i=2]-(a&2)&8||h[i=4]-(a&4)&16||h[i=0]-(a&8)&32)a=h[i];
a&=-64;v=a|a>>26&16;t*=5;
if(v=v&v<<2&v<<4&v<<6&v<<8){t+=4;a=v&=~(v/2);}
else if(t)for(i=(h[i]&63)/(i?i:8),v=a;i-->5;)a&=a-1;
else if(v=o/2)t=3;
else if (e){o=e&e-1;v=(i=o&o-1)?o:e;t=1+(o>0);}
k=a^v;k&=k-1;k&=k-(i==0);}
c(v);v=C/8;c(k);
return t<<28|v<<13|C/8;}
The second is the input processor. It parses the project Euler file as 2+2+5 cards (ignoring the 10th card). It uses the Parse macro, P to create 32-bit values representing each card. The representation is 0A0K0Q0J0T090807060504030200shdc. A hand is stored as an array of 5 ints.
char*gets(char*);char*strchr(char*,char);
#define P(c)X=strchr(R,*c++)-R;C=1<<strchr(S,*c++)-S|64<<X*2;c++;
#define L(n)for(i=0;i<n;i++)
U g[5],h[5];
char*c,b[32];
char*S="CDHS";
char*R="23456789TJQKA";
int d,r[3]={0};
main(q){while(c=gets(b)){
L(2){P(c)A(g,C);}
L(2){P(c)A(h,C);}
L(5){P(c)A(g,C);A(h,C);}
d=E(g)-E(h);
r[d>0?0:d<0?1:2]++;
L(7)g[i]=h[i]=0;
}L(3)printf("%c:%d\n","12D"[i],r[i]);}
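The card encoding used by the P macro (rank bits in the pattern 0A0K0Q0J0T090807060504030200, followed by four suit bits shdc) can be sketched in Python as follows. This is only an illustration of the bit layout as described in the text; `encode_card` is a hypothetical name.

```python
RANKS = "23456789TJQKA"
SUITS = "CDHS"   # same order as the C string S, so clubs is the lowest bit

def encode_card(card):
    """Encode a card such as 'AS' as one rank bit plus one suit bit."""
    rank_bit = 64 << (2 * RANKS.index(card[0]))  # ranks use every other bit from bit 6 up
    suit_bit = 1 << SUITS.index(card[1])         # suits use bits 0..3
    return rank_bit | suit_bit

for card in ("2C", "3D", "AS"):
    print(card, format(encode_card(card), "032b"))
```

Leaving a spare zero bit between ranks is what lets the evaluator add cards together and count up to three of one rank per suit group before a carry spills into the next rank.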
I'm sure there are a few more characters to be trimmed off. I'll add an explanation soon.
The evaluator runs at about 17.6 million hands/second on my 3 GHz Core2 Duo. That's only 3.5x slower than the PokerSource evaluator, which uses at least 56K of lookup tables.
PHP, 799 chars
Line breaks are not significant. This takes input from the linked url, which is different from the example input (doesn't deal with community cards). Processing is similar to mobrule's perl answer, with a different scoring method.
<?php
function s($i){$o=array_map('intval',$i);$f=(count(array_unique(str_replace($o,'',$i)))==1);
sort($o);$v=array_count_values($o);arsort($v);$u=array_keys($v);$h=max($u);$m=$u[0];$c=reset($v);
$p=count($v);$e=$c==1&&$o[4]==14&&$o[3]==5;$r=$o==range($o[0],$o[0]+4)||$e;$q=$e?5:$h;
$s=($f&&$r&&($h==12)?2<<11:($f&&$r?(2<<10)+$q:0))+($c==4?(2<<9)+$m:0)+($c==3&&$p==2?(2<<8)+$m:0)+($f?(2<<7)+$h:0)+
($r?(2<<6)+$q:0)+($c==3?(2<<5)+$m:0)+($c==2&&$p==3?(2<<4)+$m:0)+($p==4?(2<<3)+$m:0);$s+=!$s?$h:0;return array($s,$u);}
foreach(file($argv[1]) as $d){
list($y,$z)=array_chunk(explode(' ',trim(strtr($d,array('T'=>10,'J'=>11,'Q'=>12,'K'=>13,'A'=>14)))),5);
$y=s($y);$z=s($z);$w=$y[0]-$z[0];$x=1;while(!$w&&$x<5){$w=$y[1][$x]-$z[1][$x++];}if(!$w)@$t++;elseif($w<0)@$l++;else @$k++;}
@print "1: $k\n2: $l\nD: $t";