I describe the outcome of a strategy with numerous rows. Each row contains a symbol (identifying an asset), a timestamp (think of a backtest), and a price plus weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a way to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the raw SQL route should be the first thing you try (more on that in a bit).
I suspect the cost comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a few rows, then create them in batches. Try 10, 20, 100, etc. at a time; your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, say, the Python code that creates the StrategyRow instances were slow (e.g. StrategyRow.objects.create()), but I still believe the key is to insert them in batches instead of running N queries.
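A minimal sketch of the idea: build the rows in memory first, then hand them to bulk_create in one go. The StrategyRow fields and the NaN handling are taken from the question; the helper function and its name are mine.

```python
import math

def build_row_kwargs(strategy, symbols, prices, weights, times):
    # Build plain dicts of field values in memory; NaN prices/weights
    # are simply left out, mirroring the original if-not-isnan checks
    rows = []
    for symbol, s in symbols.items():
        for t in times:
            kwargs = {"strategy": strategy, "symbol": s, "time": t}
            if not math.isnan(prices[symbol][t]):
                kwargs["price"] = prices[symbol][t]
            if not math.isnan(weights[symbol][t]):
                kwargs["weight"] = weights[symbol][t]
            rows.append(kwargs)
    return rows

# With Django, instead of one create()/save() per row:
# StrategyRow.objects.bulk_create(
#     [StrategyRow(**kw) for kw in build_row_kwargs(...)],
#     batch_size=100,
# )
```

The batch_size argument makes bulk_create issue one INSERT per chunk of 100 rows, which is where the speedup comes from.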
Related
In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return me every Nth batch instead of every other? Seems like this could be something in FragmentScanOptions but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after but if you are trying to sample your data there are a few choices but none that achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.Dataset.head.
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
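At the plain Python level you can skip batches with itertools.islice, since to_batches returns an ordinary iterator. Note this saves no I/O: the skipped batches are still read and decoded by the scanner. The helper name is mine.

```python
from itertools import islice

def every_nth(iterator, n, start=0):
    # Yields item start, start+n, start+2n, ... of any iterator;
    # the skipped items are still produced by the source
    return islice(iterator, start, None, n)

list(every_nth(iter(range(10)), 3))  # [0, 3, 6, 9]
```

With a dataset you would pass dataset.to_batches() as the iterator argument.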
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")

# Split the dataset into per-row-group fragments
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)

# Keep every other fragment
sampled_fragments = all_fragments[::2]

# Have to construct the sampled dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)

# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
When I use the following function, it takes up to 10 seconds to execute. Is there any way to make it run quicker?
import pyspark.sql.functions as f

def select_top_20(df, col):
    most_data = df.groupBy(col).count().sort(f.desc("count"))
    top_20_count = most_data.limit(20).drop("count")
    top_20 = [row[col] for row in top_20_count.collect()]
    return top_20
Hard to answer in general; the code seems fine to me.
It depends on how the input DataFrame was created:
if it was directly read from a data source (parquet, database or so), it is an I/O problem and there is not much you can do.
if the DataFrame went through some processing before the function is executed, you might inspect that part. Because of Spark's lazy evaluation, all of that processing is redone from scratch when you execute this function (not just the commands listed in the function body): reading the data from disk, the processing, everything. Persisting or caching the DataFrame somewhere in between might speed you up considerably.
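For reference, here is what the function computes, expressed as plain Python on a local list (illustrative only; the Spark version distributes the same group-count-sort-limit across the cluster):

```python
from collections import Counter

def select_top_n(values, n=20):
    # Local equivalent of groupBy(col).count().sort(desc("count")).limit(n)
    return [v for v, _ in Counter(values).most_common(n)]

select_top_n(["a", "b", "a", "c", "a", "b"], n=2)  # ['a', 'b']
```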
I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
neo4j-wrapper.conf changes:
wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000
However my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
//and constraints for other name identifiers as well..
I then use periodic commit...
USING PERIODIC COMMIT 10000
I then LOAD in the CSV where I ignore several fields
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL
I then create the necessary nodes from my data.
WITH line
MERGE (country:Country {name: line.CountryName})
MERGE (city:City {name: line.CityName})
MERGE (neighborhood:Neighborhood {
    name: line.NeighborhoodName,
    size: toInt(line.NeighborhoodSize),
    nickname: coalesce(line.NeighborhoodNN, ""),
    ... 50 other features
})
MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
// Note that each neighborhood only appears once
Does it make sense to use CREATE UNIQUE rather than applying MERGE to any COUNTRY reference? Would this speed it up?
A ~250,000-line CSV file took over 12 hours to load, which seems excessively slow. What else can I be doing to speed this up? Or does it just make sense to use the annoying-looking Import Tool?
A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically what it says is that you should add PROFILE to the start of each of your queries to see whether any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split your query into separate MERGEs.
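For example, prefixing the LOAD CSV query from the question (adding a LIMIT while profiling keeps the test run short):

```cypher
PROFILE
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" AS line
WITH line LIMIT 100
MERGE (country:Country {name: line.CountryName});
```

If the plan that comes back contains an Eager operator, split the statement into one MERGE per query and re-run the import.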
Secondly, your neighborhood MERGE contains a lot of properties, so each time it tries to match on every single one of them before deciding whether to create the node. I'd suggest something like:
MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
    neighborhood.size = toInt(line.NeighborhoodSize),
    neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
    ... 50 other features
Let's say I have this Django model:
class Question(models.Model):
    question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classical problem and has been talked about in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code in the 2nd link (copied below, changed a bit) and noticed it takes up to 3 seconds to sort the data. To check the function's performance, I wrote a test that creates a list of 100k random alphanumeric strings. It takes only 0.76 s to sort that list. So what's happening?
Here is what I think: the function needs to fetch the question_code of each question for comparison, so sorting 15k values means issuing 15k separate MySQL queries. That would explain why it takes so long. Any ideas? And is there a solution for natural sort in Django in general? Thanks a lot!
def natural_sort(l, ascending, key=lambda s: s):
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    return sorted(l, key=sort_key, reverse=not ascending)
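For illustration, the heart of the approach is the key function, which turns each string into a mixed list so numeric runs compare as integers rather than character by character (sample codes are mine):

```python
import re

def alphanum_key(s):
    # "q10" -> ['q', 10, ''] so the numeric part compares as an int
    return [int(c) if c.isdigit() else c for c in re.split('([0-9]+)', s)]

sorted(["q10", "q2", "q1"])                    # ['q1', 'q10', 'q2']
sorted(["q10", "q2", "q1"], key=alphanum_key)  # ['q1', 'q2', 'q10']
```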
As far as I'm aware, there isn't a generic Django solution to this. You can reduce your memory usage and limit your db queries by building an id/question_code lookup structure:
from natsort import natsorted
question_code_lookup = Question.objects.values('id','question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results, you can then slice up ordered_question_codes, perform another query to retrieve all the questions you need, and order them according to their position in that slice:
# get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)

# put them back into question code order
id_to_pos = dict(zip(question_ids, range(len(question_ids))))
questions = sorted(questions, key=lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory, or takes too long to sort, then you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset
I have a Grails application that does a rather huge createCriteria query pulling from many tables. I noticed that the performance is pretty terrible and have pinpointed it to the Object manipulation I do afterwards, rather than the createCriteria itself. My query successfully gets all of the original objects I wanted, but it is performing a new query for each element when I am manipulating the objects. Here is a simplified version of my controller code:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    // Lots of if statements for filters, etc.
}

def results = hosts?.collect { [cell: [
    it.hostname,
    it.type,
    it.status.toString(),
    it.env.toString(),
    it.supporter.person.toString()
    ...
]]}
I have many more fields, including calls to methods that perform their own queries to find related objects. So my question is: How can I incorporate joins into the original query so that I am not performing tons of extra queries for each individual row? Currently querying for ~700 rows takes 2 minutes, which is way too long. Any advice would be great! Thanks!
One benefit of using criteria is that you can easily fetch associations eagerly. As a result, you avoid the well-known N+1 problem when referring to associations.
You have not mentioned the logic inside your criteria, but for ~700 rows I would definitely go for something like this:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    ...
    // associations are eagerly fetched if a DSL like below
    // is used in the criteria query
    supporter {
        person {
        }
    }
    someOtherAssoc {
        // Involve logic if required
        // eq('someOtherProperty', someOtherValue)
    }
}
If you feel that tailoring a criteria query is cumbersome, you can very well fall back to HQL and use join fetch for eager fetching of associations.
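A rough HQL sketch of that fallback (the supporter/person association names come from the question; everything else here is assumed and would need adjusting to your domain classes):

```groovy
def hosts = Host.executeQuery(
    "select h from Host h " +
    "join fetch h.supporter s " +
    "join fetch s.person",
    [max: maxRows, offset: rowOffset])
```

The join fetch clauses pull the associated rows in the same SQL statement, so iterating over hosts afterwards triggers no extra queries.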
I would expect this to bring the turnaround time down to under 5 seconds for ~700 records.