Data set too large to load into memory for processing - MySQL

I have a large, rapidly growing data set of around 4 million rows. To define and exclude the outliers (for statistics / analytics usage), the algorithm needs to consider every entry in this data set. However, this is too much data to load into memory and my system chokes. I'm currently using this to collect and process the data:
@scoreInnerFences = innerFence Post.where( :source => 1 ).
                              order( :score ).
                              pluck( :score )
I don't think the typical divide-and-conquer method will work, because every entry has to be considered to keep my outlier calculation accurate. How can this be achieved efficiently?
innerFence identifies the lower quartile and upper quartile of the data set, then uses those findings to calculate the outliers. Here is the (yet to be refactored, non-DRY) code for this:
def q1(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q ] + s[ q - 1 ] ) / 2
  else
    return s[ q ]
  end
end

def q2(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q * 3 ] + s[ (q * 3) - 1 ] ) / 2
  else
    return s[ q * 3 ]
  end
end

def innerFence(s)
  q1 = q1(s)
  q2 = q2(s)
  iq = (q2 - q1) * 3
  if1 = q1 - iq
  if2 = q2 + iq
  return [if1, if2]
end

This is not the best way, but it is an easy way:
Do several queries. First, count the number of scores:
q = Post.where( :source => 1 ).count
Then do your calculations and fetch the scores:
q1 = Post.where( :source => 1 ).
       reverse_order( :score ).
       select( 'avg(score) as score' ).
       offset( q ).limit( ( q % 2 ) + 1 )
q2 = Post.where( :source => 1 ).
       reverse_order( :score ).
       select( 'avg(score) as score' ).
       offset( q * 3 ).limit( ( q % 2 ) + 1 )
The code is probably wrong, but I'm sure you get the idea.
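For reference, here is a minimal SQL sketch of the same idea (my reconstruction, so treat it as a starting point rather than the exact query). The AVG has to sit in an outer query, because otherwise the aggregate collapses the rows before LIMIT/OFFSET are applied; and since plain MySQL only accepts literal values for LIMIT/OFFSET, the offsets have to be computed in application code from the count n:
-- n = SELECT COUNT(*) FROM posts WHERE source = 1;
-- Lower quartile: average the value(s) straddling position n/4.
SELECT AVG(score) AS q1
FROM (
    SELECT score
    FROM posts
    WHERE source = 1
    ORDER BY score
    LIMIT 2 OFFSET 999999   -- placeholder: substitute n/4 - 1 from application code
) AS lower_quartile;
-- Upper quartile: the same query with OFFSET 3*n/4 - 1.
This fetches at most two rows per quartile instead of 4 million.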

For large datasets, I sometimes drop down below ActiveRecord. Even using pluck, it's a memory hog, I imagine. Of course it's less portable, but sometimes it's worth it.
scores = Post.connection.execute('select score from posts where source = 1 order by score').map(&:first)
Don't know if that will help enough for 4 million records. If not, maybe look at a stored procedure?
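If your MySQL is new enough to have window functions (8.0+), the whole computation can also stay server-side. This is only a sketch: it treats the NTILE bucket boundaries as the quartiles (q3 below is what the question's code calls q2, the upper quartile), which should be close enough for fence calculations over 4 million rows:
-- Bucket the scores into four equal-sized groups, then read the
-- quartile boundaries off the first and third buckets.
SELECT MAX(CASE WHEN quartile = 1 THEN score END) AS q1,
       MAX(CASE WHEN quartile = 3 THEN score END) AS q3
FROM (
    SELECT score,
           NTILE(4) OVER (ORDER BY score) AS quartile
    FROM posts
    WHERE source = 1
) AS bucketed;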


How to calculate a probability vector and an observation count vector for a range of bins?

I want to test the hypothesis that some 30 observations fit a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistic here and coded the function below. I want a Poisson vector to contain the corresponding Poisson probability for each bin, and to count the observations for each bin. I feel the loop is rather redundant and ugly. Can you please let me know how I can refactor the function without the loop and make the whole calculation cleaner and more vectorized?
function result = poissonGoodnessOfFit(bins, observed)
  assert(iscell(bins), "bins should be a cell array");
  assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
  assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
  lambda_head = mean(observed); # poisson lambda parameter estimate
  k = length(bins); # number of bin groups
  n = length(observed); # number of observations
  poisson_probability = []; # variable for poisson probability for each bin
  observations = []; # variable for observation counts for each bin
  for i = 1:k
    if isscalar(bins{1,i}) # this bin contains a single value
      poisson_probability(1,i) = poisspdf(bins{1, i}, lambda_head);
      observations(1, i) = histc(observed, bins{1, i});
    else # this bin contains a range of values
      inner_bins = bins{1, i}; # retrieve the range
      inner_bins_k = length(inner_bins); # number of values inside
      inner_poisson_probability = []; # variable to store individual probability of each value inside this bin
      inner_observations = []; # variable to store observation counts of each value inside this bin
      for j = 1:inner_bins_k
        inner_poisson_probability(1,j) = poisspdf(inner_bins(1, j), lambda_head);
        inner_observations(1, j) = histc(observed, inner_bins(1, j));
      endfor
      poisson_probability(1, i) = sum(inner_poisson_probability, 2); # assign over the sum of all inner probabilities
      observations(1, i) = sum(inner_observations, 2); # assign over the sum of all inner observation counts
    endif
  endfor
  expected = n .* poisson_probability; # expected observations if indeed poisson using lambda_head
  chisq = sum((observations - expected).^2 ./ expected, 2); # Pearson Chi-Square statistic
  pvalue = 1 - chi2cdf(chisq, k-1-1);
  result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
  return;
endfunction
There are a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of 1 element. So no special treatment is needed for it.
Second, you don't need to create such explicit subranges; your bin groups are amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-based values to 1-based indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-based indices to obtain the desired subgroups, summing accordingly.
Here's example code, written in the Octave/MATLAB-compatible subset of both languages:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
% POISSONGOODNESSOFFIT( BinGroups, Observations ) calculates the [... etc, etc.]
  pkg load statistics; % only needed in octave; for matlab buy statistics toolbox.

  assert( iscell( BinGroups ), 'Bins should be a cell array' );
  assert( all( cellfun( @ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
  assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );

  % Define helpful variables
  RangeMin = min( cellfun( @min, BinGroups ) );
  RangeMax = max( cellfun( @max, BinGroups ) );
  Domain = RangeMin : RangeMax;
  LambdaEstimate = mean( Observations );
  NBinGroups = length( BinGroups );
  NObservations = length( Observations );

  % Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
  Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
  Observed_Domain = histc( Observations, Domain );

  % Apply BinGroup values as indices
  Expected_byBinGroup = cellfun( @(c) sum( Expected_Domain(c+1) ), BinGroups );
  Observed_byBinGroup = cellfun( @(c) sum( Observed_Domain(c+1) ), BinGroups );

  % Perform a Chi-Square test on the Bin-wise Expected and Observed outputs
  O = Observed_byBinGroup; E = Expected_byBinGroup; df = NBinGroups - 1 - 1;
  ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
  PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
  Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
% scalar structure containing the fields:
% actual = 6 5 8 6 5
% expected = 1.2643 4.0037 13.0304 8.6522 3.0493
% chi2 = 21.989
% pvalue = 0.000065574
A general comment about the code: it is always preferable to write self-explanatory code rather than code that does not make sense by itself in the absence of a comment. Comments should generally be used to explain the 'why' rather than the 'how'.

Writing the Fibonacci Sequence Elegantly in Python

I am trying to improve my programming skills by writing functions in multiple ways; this teaches me new ways of writing code and also helps me understand other people's styles. Below is a function that calculates the sum of all even numbers in a Fibonacci sequence up to a maximum value. Do you have any recommendations for writing this algorithm differently, perhaps more compactly or more Pythonically?
def calcFibonacciSumOfEvenOnly():
    MAX_VALUE = 4000000
    sumOfEven = 0
    prev = 1
    curr = 2
    while curr <= MAX_VALUE:
        if curr % 2 == 0:
            sumOfEven += curr
        temp = curr
        curr += prev
        prev = temp
    return sumOfEven
I do not want to write this function recursively since I know it takes up a lot of memory even though it is quite simple to write.
You can use a generator to produce the even numbers of a Fibonacci sequence up to the given max value, and then obtain the sum of the generated numbers:
def even_fibs_up_to(m):
    a, b = 0, 1
    while a <= m:
        if a % 2 == 0:
            yield a
        a, b = b, a + b
So that:
print(sum(even_fibs_up_to(50)))
would output: 44 (0 + 2 + 8 + 34 = 44)
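As a further tweak (my own sketch, building on the generator above rather than anything in the original answer): every third Fibonacci number is even, and the even ones satisfy e(n) = 4*e(n-1) + e(n-2), so the parity test can be dropped entirely:
def even_fibs_up_to(m):
    # Even Fibonacci numbers (0, 2, 8, 34, 144, ...) satisfy
    # e(n) = 4 * e(n-1) + e(n-2), so step straight between them.
    a, b = 0, 2
    while a <= m:
        yield a
        a, b = b, 4 * b + a

print(sum(even_fibs_up_to(50)))  # still prints 44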

Trading algorithm - actions in Q-learning/DQN

The following was completed using MATLAB.
I am trying to build a trading algorithm using deep Q-learning. I have taken a year's worth of daily stock prices and am using that as the training set.
My state space is [money, stock, price], where:
money is the amount of cash I have,
stock is the number of stocks I have, and
price is the price of the stock at that time step.
The issue I am having is with the actions; looking online, people only use three actions, { buy | sell | hold }.
My reward function is the difference between the portfolio value in the current time step and in the previous time step.
But using just three actions, I am unsure how to choose to buy, let's say, 67 shares at the current price.
I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs: I can sell between 0 and 100 shares, hold the stock (0), or buy between 1 and 100 shares.
Can anyone shed some light on how I can reduce this to 3 actions?
My code is:
% p is the stock price
% sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidden_layers = 1;
actions = 202;
net = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
             [hidden_layers, actions], ...
             {'tansig', 'purelin'}, ...
             'trainlm' );
net = init( net );
net.trainParam.showWindow = false;
% neural network training parameters -----------------------------------
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.1;
net.trainParam.epochs = 100;
% parameters for q learning --------------------------------------------
epsilon = 0.8;
gamma = 0.95;
max_episodes = 1000;
max_iterations = length( p ) - 1;
reset = false;
inital_money = 1000;
inital_stock = 0;
% These will be where I save the outputs
save_s = zeros( max_iterations, max_episodes );
save_pt = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a = zeros( max_iterations, max_episodes );
% construct the inital state -------------------------------------------
% a = randi( [1 3], 1, 1 );
s = [inital_money; inital_stock; p( 1, 1 )];
% construct initial q matrix -------------------------------------------
Qs = zeros( 1, actions );
Qs_prime = zeros( 1, actions );
for i = 1:max_episodes
    for j = 1:max_iterations % max_iterations --------------
        Qs = net( s );
        %% here we will choose an action based on epsilon-greedy strategy
        if ( rand() <= epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 202], 1, 1 );
        end
        a2 = a - 101;
        save_a(j,i) = a2;
        sp = p( j+1, 1 );
        pt = s( 1 ) + s( 2 ) * p( j, 1 );
        save_pt(j,i) = pt;
        [s_prime, reward] = simulateStock( s, a2, pt, sp );
        Qs_prime = net( s_prime );
        Q_target = reward + gamma * max( Qs_prime );
        save_Q_target(j,i) = Q_target;
        Targets = Qs;
        Targets( a ) = Q_target;
        save_s( j, i ) = s( 1 );
        s = s_prime;
    end
    epsilon = epsilon * 0.99;
    reset = false;
    s = [inital_money; inital_stock; p(1,1)];
end
% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money = s(1);
    stock = s(2);
    price = s(3);
    money = money - a * price;
    money = max( money, 0 );
    stock = s(2) + a;
    stock = max( stock, 0 );
    s_prime = [money; stock; sp];
    reward = ( money + stock * price ) - pt;
end
Actions: ill-defined (unless there is an ultimate reason for such a flattened, decaffeinated & knowingly short-cut model)
You may be right that using a range of just { buy | hold | sell } actions is a frequent habit in academic papers, where authors sometimes decide to illustrate their academic efforts at improving learning / statistical methods and opt to pick an exemplary application in the trading domain. The pity is, this can be done in academic papers, but not in the reality of trading.
Why?
Even with an elementary view on trading, the problem is much more complex. As a brief reference, there are more than five principal domains of such a model-space. Given that trading is to be modelled, one cannot remain without a fully described strategy --
Tru-Strategy := { SelectPOLICY,
                  DetectPOLICY,
                  ActPOLICY,
                  AllocatePOLICY,
                  TerminatePOLICY
                }
Any simplification, however motivated, that opts to omit any single one of these five principal domains becomes anything but a true Trading Strategy.
One can easily figure out what comes of merely training (and, worse, of later harnessing in real trades) an ill-defined model that is not coherent with reality.
Sure, it can reach (and, barring an ill-formulated minimiser criterion function, will reach) some mathematical function's minimum, but that does not make reality immediately change its so-far natural behaviour and start to "obey" the ill-defined model and "dance" according to such oversimplified or otherwise skewed (ill-modelled) opinions about reality.
Rewards: ill-defined (unless there is a reason for ignoring the fact of delayed rewards)
If in doubt about what this means, try to follow an example:
Today, the Strategy-Model decides to A:Buy(AAPL,67).
Tomorrow, AAPL goes down some 0.1%, and thus the immediate reward (as was proposed above) is negative, punishing that decision. The Model is stimulated not to do it (not to buy AAPL).
The point is that after some period of time, AAPL rises much higher, producing a much higher reward compared to the initial fluctuations in the D2D Close, which is known, but which the proposed Strategy-Model's Q-function simply, in principle, erroneously did not reflect at all.
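To put a toy number on this (my illustration, not part of the original answer): score the A:Buy(AAPL,67) decision once by its immediate reward and once by its discounted return, and the sign flips.
% Toy illustration: immediate reward vs. discounted return after A:Buy(AAPL,67).
gamma   = 0.95;                         % discount factor, as in the question
rewards = [-0.1, -0.05, 0.02, 3.5];     % hypothetical day-over-day P&L values
G = sum( rewards .* gamma .^ (0 : numel( rewards ) - 1) );
% immediate reward  = -0.1    -> "never buy AAPL again"
% discounted return G ~ 2.87  -> the buy was actually right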
Beware WYTIWYG -- What You Train Is What You Get ...
This means that an as-is Model could be trained to act according to such defined stimuli, but its actual behaviour will favour NOTHING but such extremely naive intraday "quasi-scalping" shots, with limited (if any at all) support from the actual Market State & Market Dynamics as are available in many industry-wide accepted quantitative models.
So, sure, one can train a reality-blind model that was kept blind & deaf (ignoring the reality of the Problem Domain), but to what end?
Epilogue:
There is nothing like a "Data Science", even when MarCom & HR beat their drums & whistles, as they indeed do a lot nowadays.
Why?
Exactly because of the rationale observed above. Having data-points is nothing. Sure, it is better than standing clueless in front of the customer without a single observation of reality, but the data-points do not save the game.
It is the domain knowledge that starts to make some sense of the data-points, not the data-points per se.
If still in doubt: if one has a few terabytes of numbers, there is no Data Science to tell you what the data-points represent.
On the other hand, if one knows, from the domain-specific context, that these data-points ought to be temperature readings, there is still no Data-Science God to tell you whether they are all (just by coincidence) in [K] or [°C] (if they are just positive readings >= 0.00001).

What way to "rank" a dataset in MySQL

The situation is as follows:
I have a database with 40,000 cities. Those cities have certain types of properties, each with a value, for example "mountains" or "beaches". If a city has lots of mountains, the value for mountain will be high; if there are fewer mountains, the number is lower.
One table holds the city names with their properties and values.
Along with that, I have a table with the average values of all those properties.
What I need to happen: I want the user to search for a city that has one or multiple properties, find the best match, and attach a score from 0-100 to it.
The way I do this is as follows:
1. I first get the 25%, 50% and 75% values for the properties:
_var_[property]_25 = [integer]
_var_[property]_50 = [integer]
_var_[property]_75 = [integer]
2. Then I need to use this algorithm:
_var_user_search_for_properties = [mountain, beach]
_var_max_property_percentage = 100 / [properties user search for]
_var_match_percentage = 0

for each _var_user_search_for_properties
    if [property] < _var_[property]_25 then
        _var_match_percentage += _var_max_property_percentage
    elseif [property] < _var_[property]_50 then
        _var_match_percentage += _var_max_property_percentage / 4 * 3
    elseif [property] < _var_[property]_75 then
        _var_match_percentage += _var_max_property_percentage / 4 * 2
    elseif [property] < 0 then
        _var_match_percentage += _var_max_property_percentage / 4 * 1
    end if
next

order all rows by _var_match_percentage desc
The question is: is it possible to do this with MySQL?
How do I calculate this "match percentage" with it?
Or will it be faster to get all the rows and indexes out of the database and loop through them all in .NET?
If the percentages can be stored in the database, you could try MySQL's LIMIT clause. See http://www.mysqltutorial.org/mysql-limit.aspx.
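Beyond LIMIT, the tiered scoring itself can be expressed straight in SQL with CASE expressions, so the ordering happens in the database. The sketch below is hypothetical: it assumes a city_properties table with one row per city and one column per property, treats the quartile thresholds as precomputed literals (the numbers are placeholders), and reads the pseudocode's final elseif branch as a plain ELSE:
-- Two searched properties, so each contributes up to 100 / 2 = 50 points.
SELECT city_name,
       ( CASE
             WHEN mountain < 10 THEN 50.0            -- _var_mountain_25
             WHEN mountain < 20 THEN 50.0 / 4 * 3    -- _var_mountain_50
             WHEN mountain < 30 THEN 50.0 / 4 * 2    -- _var_mountain_75
             ELSE 50.0 / 4 * 1
         END
       + CASE
             WHEN beach < 5  THEN 50.0
             WHEN beach < 15 THEN 50.0 / 4 * 3
             WHEN beach < 25 THEN 50.0 / 4 * 2
             ELSE 50.0 / 4 * 1
         END ) AS match_percentage
FROM city_properties
ORDER BY match_percentage DESC;
With an index covering the searched columns, this will usually beat shipping all 40,000 rows to .NET and looping there.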

ActiveRecord joins and where

I have three models: Company, Deal and Slot. They are associated as Company has_many deals and Deal has_many slots. A company is expired if all of its deals are expired, and a deal is expired when all of its slots are expired.
I have written a scope:
scope :expired,
      lambda { |within|
        self.select(
          'DISTINCT companies.*'
        ).latest(within).joins(
          :user => { :deals => :slots }
        ).where(
          "companies.spam = false AND deals.deleted_at IS NULL
           AND deals.spam = false AND slots.state = 1
           OR slots.begin_at <= :time",
          :time => Time.zone.now + SLOT_EXPIRY_MARGIN.minutes
        )
      }
The above scope does not seem right to me, given what I am trying to achieve. I need companies where all of the slots of all the deals are either in state 1 or have a begin_at earlier than :time, making them expired.
Thanks in advance for having a look.
AND has a higher precedence than OR in SQL, so your where actually gets parsed like this:
(
    companies.spam = false
    and deals.deleted_at is null
    and deals.spam = false
    and slots.state = 1
)
or slots.begin_at <= :time
For example (trimmed a bit for brevity):
mysql> select 1 = 2 and 3 = 4 or 5 = 5;
+---+
| 1 |
+---+
mysql> select (1 = 2 and 3 = 4) or 5 = 5;
+---+
| 1 |
+---+
mysql> select 1 = 2 and (3 = 4 or 5 = 5);
+---+
| 0 |
+---+
Also, you might want to use a placeholder instead of the literal false in the SQL; that should make things easier if you want to switch databases (but of course, database portability is largely a myth, so that's just a suggestion). You could also just use not in the SQL. Furthermore, a class method is the preferred way to accept arguments for scopes. Using scoped instead of self is also a good idea in case other scopes are already in play, but if you use a class method, you don't have to care.
If we fix the grouping in your SQL with some parentheses, use not rather than the literal false, and switch to a class method:
def self.expired(within)
  select('distinct companies.*').
    latest(within).
    joins(:user => { :deals => :slots }).
    where(%q{
      not companies.spam
      and not deals.spam
      and deals.deleted_at is null
      and (slots.state = 1 or slots.begin_at <= :time)
    }, :time => Time.zone.now + SLOT_EXPIRY_MARGIN.minutes)
end
You could also write it like this if you prefer little blobs of SQL rather than one big one:
def self.expired(within)
  select('distinct companies.*').
    latest(within).
    joins(:user => { :deals => :slots }).
    where('not companies.spam').
    where('not deals.spam').
    where('deals.deleted_at is null').
    where('slots.state = 1 or slots.begin_at <= :time', :time => Time.zone.now + SLOT_EXPIRY_MARGIN.minutes)
end
This one also neatly sidesteps your "missing parentheses" problem.
UPDATE: Based on the discussion in the comments, I think you're after something like this:
def self.expired(within)
  select('distinct companies.*').
    latest(within).
    joins(:user => :deals).
    where('not companies.spam').
    where('not deals.spam').
    where('deals.deleted_at is null').
    where(%q{
      companies.id not in (
        select company_id
        from slots
        where state = 1
        and begin_at <= :time
        group by company_id
        having count(*) >= 10
      )
    }, :time => Time.zone.now + SLOT_EXPIRY_MARGIN.minutes)
end
That bit of nastiness at the bottom grabs all the company IDs that have ten or more expired or used slots and then companies.id not in (...) excludes them from the final result set.