Aggregation and statistical functions on NoSQL databases

Using SQL databases, it is easy to compute statistical / aggregate functions such as covariance, standard deviation, kurtosis, skewness, deviations, means and medians, summation and product, etc., without taking the data out to an application server.
http://www.xarg.org/2012/07/statistical-functions-in-mysql/
How are such computations done efficiently (as close as possible to the store, assuming map/reduce "jobs" won't be real-time) on NoSQL databases in general, and DynamoDB or Cassandra in particular, for large datasets?
AWS RDS (MySQL, PostgreSQL, ...) is, well, not NoSQL, and Amazon Redshift (ParAccel), a column store, has a SQL interface and may be overkill ($6.85/hr). Redshift also has limited aggregation functionality (http://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html, http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html).

For DBs which have no aggregate functionality (e.g. Cassandra) you are always going to have to pull some data out. Building distributed computation clusters close to your DB is a popular option at the moment (using projects such as Storm). This way you can request and process data in parallel to do your operations. Think of it as "real-time" Hadoop (though it isn't the same).
Implementing such a setup is obviously more complicated than having a system that supports it out of the box, so factor that into your decision. The upside is that, if needed, a cluster allows you to perform complex custom analysis way beyond anything that a traditional DB solution will support.
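For illustration, here is a rough sketch of that pull-and-compute approach in Python, assuming the DataStax Cassandra driver and a made-up keyspace "metrics" with a table "measurements" that has a numeric "value" column (the names are placeholders, not anything from the question):

# Minimal sketch: pull only the needed column out of Cassandra and compute
# the aggregates client-side. For large datasets this work would be split
# across workers (e.g. Storm bolts), each processing a slice of the data.
import statistics

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")                       # hypothetical keyspace

rows = session.execute("SELECT value FROM measurements")   # hypothetical table
values = [row.value for row in rows]

print("mean:    ", statistics.mean(values))
print("variance:", statistics.pvariance(values))
print("stdev:   ", statistics.pstdev(values))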

Well, in MongoDB you have the possibility to create a kind of UDF using stored server-side JavaScript functions:
db.system.js.save( { _id : "Variance" ,
value : function(key,values)
{
var squared_Diff = 0;
var mean = Avg(key,values);
for(var i = 0; i < values.length; i++)
{
var deviation = values[i] - mean;
squared_Diff += deviation * deviation;
}
var variance = squared_Diff/(values.length);
return variance;
}});
db.system.js.save( { _id : "Standard_Deviation"
, value : function(key,values)
{
var variance = Variance(key,values);
return Math.sqrt(variance);
}});
The description of storing server-side JavaScript functions this way is in the MongoDB documentation for db.system.js.

MongoDB also has built-in aggregation capabilities that might fit your needs: http://docs.mongodb.org/manual/aggregation/
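For example, here is a small sketch of what that looks like from Python with PyMongo, assuming a hypothetical "samples" collection with "category" and "value" fields; note that the $stdDevPop operator only exists in newer MongoDB releases (3.2+):

# Minimal sketch of server-side aggregation with PyMongo. The database and
# collection names, and the document shape {"category": ..., "value": ...},
# are assumptions for illustration only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
samples = client.test.samples

pipeline = [
    {"$group": {
        "_id": "$category",                  # one result document per category
        "count": {"$sum": 1},
        "total": {"$sum": "$value"},
        "mean": {"$avg": "$value"},
        "stdDev": {"$stdDevPop": "$value"},  # requires MongoDB 3.2+
    }},
    {"$sort": {"_id": 1}},
]

for doc in samples.aggregate(pipeline):
    print(doc)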

Related

Splitting a feature collection by system index in Google Earth Engine?

I am trying to export a large feature collection from GEE. I realize that the Python API allows for this more easily than the JavaScript API does, but given a time constraint on my research, I'd like to see if I can extract the feature collection in pieces and then append the separate CSV files once exported.
I tried to use a filtering function to perform the task, one that I've seen used before with image collections. Here is a mini example of what I am trying to do.
Given a feature collection of 10 spatial points called "points", I tried to create a new feature collection that includes only the first five points:
var points_chunk1 = points.filter(ee.Filter.rangeContains('system:index', 0, 5));
When I execute this function, I receive the following error: "An internal server error has occurred"
I am not sure why this code is not executing as expected. If you know more than I do about this issue, please advise on alternative approaches to splitting my sample, or on where the error in my code lurks.
Many thanks!
system:index is actually an ID given by GEE to each feature, and it's not supposed to be used like an index into an array. I think the JavaScript API should be enough to export a large FeatureCollection, but there is a way to do what you want without relying on system:index, as that might not be consistent.
First, it would be a good idea to know the number of features you are dealing with. This is because, for large feature collections, calling size().getInfo() can freeze the UI and sometimes make the tab unresponsive. Here I have defined chunk and collectionSize. They have to be defined client-side, because we want to call Export within the loop, which is not possible in server-side loops. Within the loop, you simply create a subset of features starting at different offsets by converting the collection to a list and converting the subset back to a FeatureCollection.
var chunk = 1000;
var collectionSize = 10000;

for (var i = 0; i < collectionSize; i = i + chunk) {
    // fc is the full FeatureCollection to export
    var subset = ee.FeatureCollection(fc.toList(chunk, i));
    Export.table.toAsset(subset, "description", "/asset/id");
}
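Since the question mentions the Python API, roughly the same chunked export looks like this there (a sketch only; the asset IDs, chunk size and collection size are placeholders):

# Sketch of the same chunked export via the earthengine-api Python package.
# The asset paths and collection_size are made up for illustration.
import ee

ee.Initialize()

fc = ee.FeatureCollection("users/someuser/points")  # hypothetical asset
chunk = 1000
collection_size = 10000  # known (or precomputed) number of features

for i in range(0, collection_size, chunk):
    subset = ee.FeatureCollection(fc.toList(chunk, i))
    task = ee.batch.Export.table.toAsset(
        collection=subset,
        description="points_chunk_%d" % i,
        assetId="users/someuser/points_chunk_%d" % i,
    )
    task.start()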

Loading a pandas Dataframe into a sql database with Django

I describe the outcome of a strategy by numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest) and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit). The slowness is most likely caused by calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of your StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a few rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, for example, the Python code that creates the StrategyRow instances were slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
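As a rough sketch of that idea, reusing the names from the question (so treat the exact fields and the batch size as assumptions to adapt):

# Build unsaved StrategyRow instances in memory, then insert them in
# batches with bulk_create instead of calling .save() once per row.
import math

# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        rows.append(StrategyRow(
            strategy=strategy,
            symbol=s,
            time=t,
            price=None if math.isnan(p) else p,    # assumes nullable fields
            weight=None if math.isnan(w) else w,
        ))

StrategyRow.objects.bulk_create(rows, batch_size=500)  # tune batch_size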

Atomicity in MongoDB when transferring money

I'm new to MongoDB.
I'm making a simple application about bank accounts; an account can transfer money to other accounts.
I designed the Account collection like this:
account
{
    name: A
    age: 24
    money: 100
}

account
{
    name: B
    age: 22
    money: 300
}
Assuming that user A transfers $100 to user B, there are 2 operations:
1) decrease user A's balance by $100 // update document A
2) increase user B's balance by $100 // update document B
It is said that atomicity applies only to a single document, not to multiple documents.
I have an alternative design:
Bank
{
    name:
    address:
    Account: [
        {
            name: A
            age: 22
            money: SS
        },
        {
            name: B
            age: 23
            money: S1S
        }
    ]
}
I have some questions:
If I use the latter design, how can I write a transaction query (can I use the findAndModify() function)?
Does MongoDB support transaction operations like MySQL (InnoDB)?
Some people tell me that using MySQL for this project is the best way, and to use MongoDB only to save transaction information (in an extra collection named Transaction_money). If I use both MongoDB and MySQL (InnoDB), how can I make the operations below atomic (fail or succeed as a whole):
> 1) -$100 for user A
> 2) +$100 for user B
> 3) save transaction information like
transaction
{
    sender: A
    receiver: B
    money: 100
    date: 05/04/2013
}
Thanks so much.
I am not sure if this is what you are looking for:
db.bank.update({name : "name"},{ "$inc" : {'Account.0.money' : -100, 'Account.1.money' : 100}})
The update() operation satisfies the A, C and I properties of ACID. Durability (D) depends on the MongoDB and application configuration in effect when the query is made.
You may prefer to use findAndModify(), which won't yield its lock on a page fault.
MongoDB provides transaction-like atomicity only within a single document.
If your application's requirements are this simple, I can't understand why you are trying to use MongoDB. No doubt it's a good data store, but I guess MySQL will satisfy all your requirements.
Just FYI: there is a doc which addresses exactly the problem you are trying to solve: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
But I won't recommend it, because a single query (transferring money) gets turned into a sequence of queries.
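Very roughly, that pattern looks like the following sketch in PyMongo (collection and field names are made up, and all retry/rollback handling is omitted):

# Two-phase-commit sketch: the transaction document is the source of truth,
# and every account change is an atomic single-document update.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").bank

# 1) record the intent
txn_id = db.transactions.insert_one(
    {"sender": "A", "receiver": "B", "money": 100, "state": "initial"}
).inserted_id

# 2) mark it pending
db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "pending"}})

# 3) apply it to both accounts, each as an atomic single-document update
db.accounts.update_one(
    {"name": "A", "pendingTransactions": {"$ne": txn_id}},
    {"$inc": {"money": -100}, "$push": {"pendingTransactions": txn_id}})
db.accounts.update_one(
    {"name": "B", "pendingTransactions": {"$ne": txn_id}},
    {"$inc": {"money": 100}, "$push": {"pendingTransactions": txn_id}})

# 4) mark applied, clean up the pending markers, then mark done
db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "applied"}})
db.accounts.update_one({"name": "A"}, {"$pull": {"pendingTransactions": txn_id}})
db.accounts.update_one({"name": "B"}, {"$pull": {"pendingTransactions": txn_id}})
db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "done"}})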
Hope it helped
If I use the latter design, how can I write a transaction query (can I use the findAndModify() function)?
There are a lot of misconceptions about what findAndModify does; it is not a transaction. That being said, it is atomic, which is quite different.
The reason for two-phase commits and transactions in this sense is so that if something goes wrong you can fix it (or at least have a 99.99% chance that corruption hasn't occurred).
The problem with findAndModify is that it has no such transactional behaviour. Not only that, but MongoDB only provides atomicity at the single-document level, which means that if your operations change multiple documents you could end up with an inconsistent in-between state in your database. That, of course, won't do for money handling.
It has been noted that MongoDB is not great in these scenarios and that you are trying to use MongoDB away from its purpose. With this in mind, it is clear you have not researched your question well, as your next question shows:
Does MongoDB support transaction operations like MySQL (InnoDB)?
No it does not.
With all that background info aside let's look at your schema:
Bank
{
    name:
    address:
    Account: [
        {
            name: A
            age: 22
            money: SS
        },
        {
            name: B
            age: 23
            money: S1S
        }
    ]
}
It is true that you could get a transaction-like query on this schema, whereby the document would never be able to exist in an in-between state, only one or the other; as such, no inconsistencies would exist.
But then we have to talk about the real world. A document in MongoDB is limited to 16 MB. I do not think you would fit an entire bank into one document, so this schema is badly planned and useless.
Instead you would require (maybe) a document per account holder in your bank, with a subdocument for their accounts. With this you now have the problem that inconsistencies can occur.
MongoDB, as @Abhishek states, does support client-side two-phase commits, but these are not going to be as good as server-side transactions within the database itself, whereby the mongod can take safety precautions to ensure that the data is consistent at all times.
So coming back to your last question:
Some people tell me that using MySQL for this project is the best way, and to use MongoDB only to save transaction information (in an extra collection named Transaction_money). If I use both MongoDB and MySQL (InnoDB), how can I make the operations below atomic (fail or succeed as a whole):
I would say use something a bit more robust than MySQL personally; I've heard MSSQL is quite good for this.

Which DAL libraries support stored procedure execution and results materialisation

I'm used to EF because it usually works just fine as long as you get to know it better, so you know how to optimize your queries. But.
What would you choose when you know you'll be working with large quantities of data? I know I wouldn't want to use EF in the first place and cripple my application. I would write highly optimised stored procedures and call those to get certain very narrow results (with many joins, so they probably won't just return particular entities anyway).
So I'm a bit confused about which DAL technology/library I should use. I don't want to use the SqlConnection/SqlCommand way of doing it, since I would have to write much more code that's likely to hide some obscure bugs.
I would like to make the bug surface as small as possible and use a technology that will accommodate my process, not vice versa...
Is there any library that gives me the possibility to:
provide the means of simple SP execution by name
provide automatic materialisation of returned data so I could just provide certain materialisers by means of lambda functions?
like:
List<Person> result = Context.Execute("StoredProcName", record => new Person {
    Name = record.GetData<string>("PersonName"),
    UserName = record.GetData<string>("UserName"),
    Age = record.GetData<int>("Age"),
    Gender = record.GetEnum<PersonGender>("Gender")
    ...
});
or even calling a stored procedure that returns multiple result sets etc.
List<Question> result = Context.ExecuteMulti("SPMultipleResults", q => new Question {
    Id = q.GetData<int>("QuestionID"),
    Title = q.GetData<string>("Title"),
    Content = q.GetData<string>("Content"),
    Comments = new List<Comment>()
}, c => new Comment {
    Id = c.GetData<int>("CommentID"),
    Content = c.GetData<string>("Content")
});
Basically this last one wouldn't work, since it doesn't have any knowledge of how to bind the two together... but you get the point.
So to put it all down to a single question: is there a DAL library that's optimised for stored procedure execution and data materialisation?
Business Layer Toolkit might be exactly what's needed here. It's a lightweight ORM tool that supports lots of scenarios, including multiple result sets, although those seem very complicated to do.

How can I force Linq to SQL NOT to use the cache?

When I make the same query twice, the second time it does not return new rows from the database (I guess it just uses the cache).
This is a Windows Form application, where I create the dataContext when the application starts.
How can I force Linq to SQL not to use the cache?
Here is a sample function where I have the problem:
public IEnumerable<Orders> NewOrders()
{
    return from order in dataContext.Orders
           where order.Status == 1
           select order;
}
The simplest way would be to use a new DataContext - given that most of what the context gives you is caching and identity management, it really sounds like you just want a new context. Why did you want to create just the one and then hold onto it?
By the way, for simple queries like yours it's more readable (IMO) to use "normal" C# with extension methods rather than query expressions:
public IEnumerable<Orders> NewOrders()
{
    return dataContext.Orders.Where(order => order.Status == 1);
}
EDIT: If you never want it to track changes, then set ObjectTrackingEnabled to false before you do anything. However, this will severely limit its usefulness. You can't just flip the switch back and forth (having made queries in between). Changing your design to avoid the singleton context would be much better, IMO.
HOW you add an object to the DataContext can matter as to whether or not it will be included in future queries.
Will NOT add the new InventoryTransaction to future in-memory queries:
In this example I'm creating an object with an ID and then adding it to the context.
var transaction = new InventoryTransaction()
{
    AdjustmentDate = currentTime,
    QtyAdjustment = 5,
    InventoryProductId = inventoryProductId
};

dbContext.InventoryTransactions.Add(transaction);
dbContext.SubmitChanges();
Linq-to-SQL isn't clever enough to see this as needing to be added to the previously cached list of in-memory items in InventoryTransactions.
WILL add the new InventoryTransaction to future in-memory queries:
var transaction = new InventoryTransaction()
{
    AdjustmentDate = currentTime,
    QtyAdjustment = 5
};

inventoryProduct.InventoryTransactions.Add(transaction);
dbContext.SubmitChanges();
Wherever possible, use the collections in Linq-to-SQL when creating relationships, not the IDs.
In addition, as Jon says, try to minimize the scope of a DataContext as much as possible.