I'm developing a program in Octave, which I will explain as I show the code.
I have this matrix in a file called matprec.m:
function [res1] = obtemDadosPrec()
res1 = [
1,2001,1,2,0.00;
1,2001,1,5,5.33;
2,2001,1,5,4.57;
3,2001,1,5,5.33;
4,2001,1,5,5.59;
5,2001,1,5,4.32;
2,2001,1,13,0.00;
3,2001,1,13,0.00;
4,2001,1,13,0.00;
3,2001,1,30,30.73;
2,2001,2,1,1.02;
3,2001,2,1,1.52;
4,2001,2,1,1.78;
5,2001,2,1,1.27;
1,2001,2,2,1.78;
2,2001,2,2,1.27;
3,2001,2,2,1.78;
4,2001,2,2,2.03;
5,2001,2,2,1.78;
1,2001,3,4,18.03;
3,2001,3,4,15.75;
5,2001,3,4,17.53;
1,2001,3,5,13.46;
2,2001,3,5,12.19;
3,2001,3,5,11.94;
4,2001,3,5,9.65;
5,2001,3,5,10.92;
2,2001,4,30,0.00;
4,2001,4,30,0.00];
format short g
return
endfunction
In this matrix, the first column is the station where the precipitation was measured, the second is the year, the third is the month, the fourth is the day, and the fifth is the precipitation value.
What I want to do in another file is call this matrix and do the following calculation: within month 1, I want to average the values for each day. For example:
in month 1, day 5, I have 5 values (5.33, 4.57, 5.33, 5.59, 4.32), so I would compute
(5.33 + 4.57 + 5.33 + 5.59 + 4.32)/5 = 5.028
I want to do that for every day, then add up all the daily averages to get the amount of precipitation in that month, and repeat that for all 4 months.
I'm kind of stuck here. If you could help me I would appreciate it. Thanks a lot!
First, get your array
>> Result = obtemDadosPrec();
Then get a logical array where rows corresponding to month == 1 are true (i.e. 1) and all others are false (i.e. 0)
>> month1Indices = Result(:,3) == 1;
Use this logical array to perform logical indexing and isolate only the 'true' rows.
>> month1Rows = Result(month1Indices, :);
Repeat same procedure to isolate 'day 5'
>> day5Indices = month1Rows(:,4) == 5;
>> day5Rows = month1Rows(day5Indices , :);
Calculate the average of the 5th column.
>> mean(day5Rows(:,5))
ans = 5.028
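The steps above isolate a single day. For the full calculation in the question (average each day within a month, then add the daily averages to get the monthly total), the same grouping idea can be sketched in Python. This is only a sketch and not Octave code: the names `rows` and `monthly_totals` are made up for illustration, and the data is copied from the matrix in the question.

```python
from collections import defaultdict

# Rows copied from the question's matrix: (station, year, month, day, precipitation)
rows = [
    (1, 2001, 1, 2, 0.00), (1, 2001, 1, 5, 5.33), (2, 2001, 1, 5, 4.57),
    (3, 2001, 1, 5, 5.33), (4, 2001, 1, 5, 5.59), (5, 2001, 1, 5, 4.32),
    (2, 2001, 1, 13, 0.00), (3, 2001, 1, 13, 0.00), (4, 2001, 1, 13, 0.00),
    (3, 2001, 1, 30, 30.73), (2, 2001, 2, 1, 1.02), (3, 2001, 2, 1, 1.52),
    (4, 2001, 2, 1, 1.78), (5, 2001, 2, 1, 1.27), (1, 2001, 2, 2, 1.78),
    (2, 2001, 2, 2, 1.27), (3, 2001, 2, 2, 1.78), (4, 2001, 2, 2, 2.03),
    (5, 2001, 2, 2, 1.78), (1, 2001, 3, 4, 18.03), (3, 2001, 3, 4, 15.75),
    (5, 2001, 3, 4, 17.53), (1, 2001, 3, 5, 13.46), (2, 2001, 3, 5, 12.19),
    (3, 2001, 3, 5, 11.94), (4, 2001, 3, 5, 9.65), (5, 2001, 3, 5, 10.92),
    (2, 2001, 4, 30, 0.00), (4, 2001, 4, 30, 0.00),
]


def monthly_totals(rows):
    """Average the stations for each day, then sum the daily averages per month."""
    # Collect all station values that share the same (year, month, day).
    per_day = defaultdict(list)
    for _station, year, month, day, value in rows:
        per_day[(year, month, day)].append(value)

    # Add each day's average to the total of its (year, month).
    totals = defaultdict(float)
    for (year, month, _day), values in per_day.items():
        totals[(year, month)] += sum(values) / len(values)
    return dict(totals)


totals = monthly_totals(rows)
for (year, month), total in sorted(totals.items()):
    print(year, month, round(total, 4))
```

In Octave the same grouping could probably be done with `unique` on the date columns plus `accumarray`, but I have not written that out here.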
I am currently making some code to randomly generate a set of random dates and assigning them to a matrix. I wish to randomly generate N amount of dates (days and months) and display them in a Nx2 matrix. My code is as follows
function dates = dategen(N)
    month = randi(12);
    if ismember(month,[1 3 5 7 8 10 12])
        day = randi(31);
        dates = [day, month];
    elseif ismember(month,[4 6 9 11])
        day = randi(30);
        dates = [day, month];
    else
        day = randi(28);
        dates = [day, month];
    end
end
For example if I called on the function, as
output = dategen(3)
I would expect 3 dates in a 3x2 matrix. However, I am unsure how to do this. I believe I need to include N in the function somewhere, but I'm not sure where or how.
Any help is greatly appreciated.
You can do it using logical indexing as follows:
function dates = dategen(N)
    months = randi(12, 1, N);
    days = NaN(size(months)); % preallocate
    ind = ismember(months, [1 3 5 7 8 10 12]);
    days(ind) = randi(31, 1, sum(ind));
    ind = ismember(months, [4 6 9 11]);
    days(ind) = randi(30, 1, sum(ind));
    ind = ismember(months, 2);
    days(ind) = randi(28, 1, sum(ind));
    dates = [days; months].'; % N-by-2 matrix, one [day, month] row per date
end
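For comparison, here is a rough Python sketch of the same idea: pick a month first, then pick a day whose upper bound depends on that month. It is not a translation of the logical-indexing code above. The lookup table `DAYS_IN_MONTH` and the optional `rng` parameter are my own additions for illustration, and February is assumed to have 28 days, as in the question.

```python
import random

# Days per month, index 0 = January. Assumption: non-leap year (February = 28).
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def dategen(n, rng=random):
    """Return n random dates as a list of n [day, month] pairs."""
    dates = []
    for _ in range(n):
        month = rng.randint(1, 12)                      # randint is inclusive on both ends
        day = rng.randint(1, DAYS_IN_MONTH[month - 1])  # upper bound depends on the month
        dates.append([day, month])
    return dates


print(dategen(3))  # three [day, month] rows, different on every run
```

The table lookup replaces the three `ismember` branches, which keeps the per-month limit in one place.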
I want to test the hypothesis whether some 30 occurrences should fit a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistics here and coded the below function. I want a Poisson vector to contain corresponding Poisson probabilities for each bin and count the observations for each bin. I feel the loop is rather redundant and ugly. Can you please let me know how can I re-factor the function without the loop and make the whole calculation cleaner and more vectorized?
function result= poissonGoodnessOfFit(bins, observed)
assert(iscell(bins), "bins should be a cell array");
assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
lambda_head = mean(observed); #poisson lambda parameter estimate
k = length(bins); #number of bin groups
n = length(observed); #number of observations
poisson_probability = []; #variable for poisson probability for each bin
observations = []; #variable for observation counts for each bin
for i=1:k
if isscalar(bins{1,i}) #this bin contains a single value
poisson_probability(1,i) = poisspdf(bins{1, i}, lambda_head);
observations(1, i) = histc(observed, bins{1, i});
else #this bin contains a range of values
inner_bins = bins{1, i}; #retrieve the range
inner_bins_k = length(inner_bins); #number of values inside
inner_poisson_probability = []; #variable to store individual probability of each value inside this bin
inner_observations = []; #variable to store observation counts of each value inside this bin
for j=1:inner_bins_k
inner_poisson_probability(1,j) = poisspdf(inner_bins(1, j), lambda_head);
inner_observations(1, j) = histc(observed, inner_bins(1, j));
endfor
poisson_probability(1, i) = sum(inner_poisson_probability, 2); #assign over the sum of all inner probabilities
observations(1, i) = sum(inner_observations, 2); #assign over the sum of all inner observation counts
endif
endfor
expected = n .* poisson_probability; #expected observations if indeed poisson using lambda_head
chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson Chi-Square statistics
pvalue = 1 - chi2cdf(chisq, k-1-1);
result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
return;
endfunction
There's a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of 1 element. So no special treatment is needed for it.
Second, you don't need to create such explicit subranges, your bin groups seem to be amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-indexed to 1-indexed indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-indices to obtain the desired subgroups, summing accordingly.
Here's some example code, written in the Octave/MATLAB-compatible subset of both languages:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
  % POISSONGOODNESSOFFIT( BinGroups, Observations) calculates the [... etc, etc.]
  pkg load statistics; % only needed in octave; for matlab buy statistics toolbox.
  assert( iscell( BinGroups ), 'Bins should be a cell array' );
  assert( all( cellfun( @ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
  assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );
  % Define helpful variables
  RangeMin = min( cellfun( @min, BinGroups ) );
  RangeMax = max( cellfun( @max, BinGroups ) );
  Domain = RangeMin : RangeMax;
  LambdaEstimate = mean( Observations );
  NBinGroups = length( BinGroups );
  NObservations = length( Observations );
  % Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
  Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
  Observed_Domain = histc( Observations, Domain );
  % Apply BinGroup values as indices
  Expected_byBinGroup = cellfun( @(c) sum( Expected_Domain(c+1) ), BinGroups );
  Observed_byBinGroup = cellfun( @(c) sum( Observed_Domain(c+1) ), BinGroups );
  % Perform a Chi-Square test on the Bin-wise Expected and Observed outputs
  O = Observed_byBinGroup; E = Expected_byBinGroup ; df = NBinGroups - 1 - 1;
  ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
  PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
  Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
% scalar structure containing the fields:
% actual = 6 5 8 6 5
% expected = 1.2643 4.0037 13.0304 8.6522 3.0493
% chi2 = 21.989
% pvalue = 0.000065574
A general comment about the code; it is always preferable to write self-explainable code, rather than code that does not make sense by itself in the absence of a comment. Comments generally should only be used to explain the 'why', rather than the 'how'.
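If it helps to see the bin-group idea outside of Octave, here is a small Python sketch of the same calculation using only the standard library. The function names `poisson_pmf` and `binned_chi_square` are hypothetical, and the p-value is left out because the standard library has no chi-square CDF. Instead of indexing into a precomputed domain, it simply sums per value within each bin group, which gives the same per-bin numbers.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k events with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def binned_chi_square(bin_groups, observations):
    """Return (observed, expected, chi2) per bin group, with lambda estimated as the mean."""
    lam = sum(observations) / len(observations)
    n = len(observations)
    # A scalar bin is just a group with one value, so every bin is treated the same way.
    observed = [sum(observations.count(v) for v in group) for group in bin_groups]
    expected = [n * sum(poisson_pmf(v, lam) for v in group) for group in bin_groups]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, chi2


X = [8, 0, 0, 1, 3, 4, 0, 2, 12, 5, 1, 8, 0, 2, 0,
     1, 9, 3, 4, 5, 3, 3, 4, 7, 4, 0, 1, 2, 1, 2]
bins = [[0], [1], [2, 3], [4, 5], list(range(6, 21))]

observed, expected, chi2 = binned_chi_square(bins, X)
print(observed)
print([round(e, 4) for e in expected])
print(round(chi2, 3))
```

The observed counts and expected values should agree with the `actual` and `expected` fields printed above.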
I am trying to improve my programming skills by writing functions in multiple ways, this teaches me new ways of writing code but also understanding other people's style of writing code. Below is a function that calculates the sum of all even numbers in a fibonacci sequence up to the max value. Do you have any recommendations on writing this algorithm differently, maybe more compactly or more pythonic?
def calcFibonacciSumOfEvenOnly():
    MAX_VALUE = 4000000
    sumOfEven = 0
    prev = 1
    curr = 2
    while curr <= MAX_VALUE:
        if curr % 2 == 0:
            sumOfEven += curr
        temp = curr
        curr += prev
        prev = temp
    return sumOfEven
I do not want to write this function recursively since I know it takes up a lot of memory even though it is quite simple to write.
You can use a generator to produce even numbers of a fibonacci sequence up to the given max value, and then obtain the sum of the generated numbers:
def even_fibs_up_to(m):
    a, b = 0, 1
    while a <= m:
        if a % 2 == 0:
            yield a
        a, b = b, a + b
So that:
print(sum(even_fibs_up_to(50)))
would output: 44 (0 + 2 + 8 + 34 = 44)
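To tie this back to the question, here is the same generator applied to the original limit of 4,000,000. The generator is repeated so that the snippet runs on its own; the variable name `total` is just for illustration.

```python
def even_fibs_up_to(m):
    """Yield the even Fibonacci numbers that are <= m."""
    a, b = 0, 1
    while a <= m:
        if a % 2 == 0:
            yield a
        a, b = b, a + b


total = sum(even_fibs_up_to(4_000_000))
print(total)  # 4613732
```

Starting the sequence at 0, 1 instead of 1, 2 only adds a leading 0, which does not change the sum.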
I have a question, more on the theoretical side. I want to make a recursive function that counts all (not only prime) different divisors of a given natural number.
For example with f(0)=0 (per Def.), f(3) = 2, f(6) = 4, f(16) = 5 etc.
Theoretically, how could I do that?
Thanks.
If I understand correctly, you only want to COUNT them, not to collect them, right?
A second assumption is that you want to count every divisor, not only the independent (prime) ones (i.e. for 6 you want to count "2" and "3" and also "6").
If this is the case, the algorithm shown in Sean's answer can be simplified significantly:
You don't need the array divisorList but only a counter,
As soon as you find a divisor, you can reduce the upper limit of the loop to the result of dividing the root number by the divisor (e.g. if your root number is 900 and 2 is the first divisor, you can set the limit of the loop to 450; then, when 3 turns out to be a divisor, you can reduce the limit to 300, and so on).
EDIT:
After thinking a little bit more, here is the correct algorithm:
Assume that the number is "N". Then, you already start with a count of 2 (i.e. 1 and N),
You then check if N divides by 2; if it does, you need to add 2 to the count (i.e. 2 and N/2),
You then change the limit of the loop to N/2,
Test if dividing by 3 yields an integer; if it does, you add 2 to the count (i.e. 3 and N/3) and reduce the limit to N/3,
Test 4...
Test 5...
...
In Pseudo-code:
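var Limit = N ;
Count = 2 ;                      // 1 and N (assumes N > 1)
for (I = 2 ; I < Limit ; I++) {
  if (N/I is integer) {
    if (I == N/I) {
      Count = Count + 1 ;        // perfect square: I and N/I are the same divisor
    } else {
      Count = Count + 2 ;        // I and N/I
    } ;
    Limit = N / I ;
  } ;
} ;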
Note: I don't know which language you are programming in, so you need to verify whether your language allows you to change the limit of the loop while it runs. If it does not, you can include an EXIT-LOOP condition (e.g. if I >= Limit then exit loop).
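For what it's worth, the counting loop can be sketched in Python like this. It is a minimal sketch under a few assumptions of mine: `count_divisors` is a made-up name, f(0) = 0 is taken from the question's definition, N = 1 is handled separately, and a divisor that equals its partner (the perfect-square case, e.g. 4 for 16) is counted once rather than twice.

```python
def count_divisors(n):
    """Count all divisors of n by pairing each divisor i with n // i."""
    if n == 0:
        return 0          # f(0) = 0 by the question's definition
    if n == 1:
        return 1          # 1 is its own only divisor
    count = 2             # 1 and n
    limit = n
    i = 2
    while i < limit:      # a while loop, so the limit can shrink as we go
        if n % i == 0:
            partner = n // i
            count += 1 if i == partner else 2
            limit = partner
        i += 1
    return count


print(count_divisors(3), count_divisors(6), count_divisors(16))  # 2 4 5
```

Using `while` instead of `for` sidesteps the question of whether the loop limit may be changed during iteration.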
Hope this resolves your problem.
import java.util.ArrayList;

public static ArrayList<Integer> recursiveDivisors(int num)
{
    ArrayList<Integer> divisorList = new ArrayList<Integer>();
    for (int i = 1; i <= num; i++)
    {
        if (num % i == 0)
            divisorList.add(i);  // i divides num exactly
    }
    return divisorList;
}
Something like this?
Returns all divisors in a divisor array list
EDIT: Not recursive
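Since the question asked for a recursive function, here is one possible recursive sketch in Python that only counts the divisors. The name `count_divisors_recursive` and the `candidate` parameter are my own, and f(0) = 0 follows the question's definition. It tries candidates up to the square root and counts each divisor together with its partner.

```python
def count_divisors_recursive(n, candidate=1):
    """Recursively count the divisors of n, testing candidate, candidate + 1, ..."""
    if n == 0:
        return 0                      # f(0) = 0 by the question's definition
    if candidate * candidate > n:
        return 0                      # past the square root: nothing left to find
    found = 0
    if n % candidate == 0:
        # candidate and n // candidate are both divisors; count once if they coincide
        found = 1 if candidate * candidate == n else 2
    return found + count_divisors_recursive(n, candidate + 1)


print(count_divisors_recursive(16))  # 5
```

The recursion depth grows with the square root of n, so with Python's default recursion limit this is only suitable for fairly small numbers.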
I'm trying to calculate various time period returns (monthly, quarterly, yearly etc.) for each unique member (identified by Code in the example below) of a data set. The data set will contain monthly pricing information for a 20 year period for approximately 500 stocks. An example of the data is below:
        Date Code   Price Dividend
1 2005-01-31  xyz 1000.00     20.0
2 2005-01-31  abc    1.00      0.1
3 2005-02-28  xyz 1030.00     20.0
4 2005-02-28  abc    1.01      0.1
5 2005-03-31  xyz 1071.20     20.0
6 2005-03-31  abc    1.03      0.1
7 2005-04-30  xyz 1124.76     20.0
I am fairly new to R, but thought that there would be a more efficient solution than looping through each Code and then each Date as shown here:
uniqueDates <- unique(data$Date)
uniqueCodes <- unique(data$Code)
for (date in uniqueDates) {
for (code in uniqueCodes) {
nextDate <- seq.Date(from=stock_data$Date[i], by="3 months",length.out=2)[2]
curPrice <- data$Price[data$Date == date]
futPrice <- data$Price[data$Date == nextDate]
data$ret[(data$Date == date) & (data$Code == code)] <- (futPrice/curPrice)-1
}
}
This method in itself has an issue in that seq.Date does not always return the final day in the month.
Unfortunately the data is not uniform (the number of companies/codes varies over time) so using a simple row offset won't work. The calculation must match the Code and Date with the desired date offset.
I had initially tried selecting the future dates by using the seq.Date function
data$ret = (data[(data$Date == (seq.Date(from = data$Date, by="3 month", length.out=2)[2])), "Price"] / data$Price) - 1
But this generated an error as seq.Date requires a single entry.
> Error in seq.Date(from = stock_data$Date, by = "3 month", length.out =
> 2) : 'from' must be of length 1
I thought that R would be well suited to this type of calculation but perhaps not. Since all the data is in a mysql database I am now thinking that it might be faster/easier to do this calc directly in the database.
Any suggestions would be greatly appreciated.
Load data:
tc='
Date Code Price Dividend
2005-01-31 xyz 1000.00 20.0
2005-01-31 abc 1.00 0.1
2005-02-28 xyz 1030.00 20.0
2005-02-28 abc 1.01 0.1
2005-03-31 xyz 1071.20 20.0
2005-03-31 abc 1.03 0.1
2005-04-30 xyz 1124.76 20.0'
df = read.table(text=tc,header=T)
df$Date=as.Date(df$Date,"%Y-%m-%d")
First I would organize the data by date:
library(plyr)
pp1=reshape(df,timevar='Code',idvar='Date',direction='wide')
Then you would like to obtain monthly, quarterly, yearly, etc returns.
For that there are several options, one could be:
Make the data zoo or xts class. i.e
library(xts)
pp1[2:ncol(pp1)] = as.xts(pp1[2:ncol(pp1)],order.by=pp1$Date)
#let's create a function for calculating returns.
rets<-function(x,lag=1){
return(diff(log(x),lag))
}
Since this database is monthly, the lags for the returns will be:
monthly = 1, quarterly = 3, yearly = 12. For instance, let's calculate the monthly return for xyz.
lagged=1 #for monthly
This calculates Monthly returns for xyz
pp1$returns_xyz= c(NA,rets(pp1$Price.xyz,lagged))
To get all the returns:
#create matrix of returns
pricelist= ls(pp1)[grep('Price',ls(pp1))]
returnsmatrix = data.frame(matrix(rep(0,(nrow(pp1)-1)*length(pricelist)),ncol=length(pricelist)))
j=1
for(i in pricelist){
n = which(names(pp1) == i)
returnsmatrix[,j] = rets(pp1[,n],1)
j=j+1
}
#column names
codename= gsub("Price.", "", pricelist, fixed = TRUE)
names(returnsmatrix)=paste('ret',codename,sep='.')
returnsmatrix
You can do this very easily with the quantmod and xts packages. Using the data in AndresT's answer:
library(quantmod) # loads xts too
pp1 <- reshape(df,timevar='Code',idvar='Date',direction='wide')
# create an xts object
x <- xts(pp1[,-1], pp1[,1])
# only get the "Price.*" columns
p <- getPrice(x)
# run the periodReturn function on each column (lapply returns a list, one xts per column)
r <- lapply(p, periodReturn, period="monthly", type="log")
# merge prior result into a multi-column object
r <- do.call(merge, r)
# rename columns
names(r) <- paste("monthly.return",
sapply(strsplit(names(p),"\\."), "[", 2), sep=".")
Which leaves you with an r xts object containing:
monthly.return.xyz monthly.return.abc
2005-01-31 0.00000000 0.000000000
2005-02-28 0.02955880 0.009950331
2005-03-31 0.03922071 0.019608471
2005-04-30 0.04879016 NA
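The question also mentions that a simple row offset won't work because the set of codes changes over time, and that month-end dates are awkward to shift. One way around both, sketched here in plain Python rather than R, is to key each price by its code and a running month index, and look up the price `lag_months` earlier. The names `prices`, `log_returns`, and `lag_months` are made up for this sketch, and it computes log returns to match the answers above.

```python
import math
from datetime import date

# Sample rows from the question: (date, code, price)
prices = [
    (date(2005, 1, 31), "xyz", 1000.00), (date(2005, 1, 31), "abc", 1.00),
    (date(2005, 2, 28), "xyz", 1030.00), (date(2005, 2, 28), "abc", 1.01),
    (date(2005, 3, 31), "xyz", 1071.20), (date(2005, 3, 31), "abc", 1.03),
    (date(2005, 4, 30), "xyz", 1124.76),
]


def log_returns(prices, lag_months=1):
    """Log return of each (code, date) versus the same code lag_months earlier."""
    # year * 12 + month is a running month index, so month ends of
    # different lengths (28, 30, 31 days) still line up.
    by_key = {(code, d.year * 12 + d.month): price for d, code, price in prices}
    returns = {}
    for d, code, price in prices:
        earlier = by_key.get((code, d.year * 12 + d.month - lag_months))
        if earlier is not None:       # skip codes with no price lag_months ago
            returns[(code, d)] = math.log(price / earlier)
    return returns


monthly = log_returns(prices, lag_months=1)
quarterly = log_returns(prices, lag_months=3)
print(round(monthly[("xyz", date(2005, 2, 28))], 8))
```

Codes that have no price at the earlier date are simply left out of the result, so a changing set of companies does not break the lookup.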