Comparing ActiveRecord results and only using varied ones - MySQL

I currently need to sync data to a remote site via a Redis slave. The data lives in MySQL. To do this, I've written a sync script similar to this:
MyTable
  .select("id, first_name, status")
  .find_each do |user|
    STDOUT.write(gen_redis_proto("SET", "users:#{user.id}", user.to_json))
  end
This works perfectly. I pipe this to redis-cli --pipe (as per https://www.redis.io/topics/mass-insert) and it inserts to the local master and syncs to the remote slave.
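For reference, gen_redis_proto is assumed to build the Redis protocol payload described in the linked mass-insert guide; a minimal Ruby version along those lines looks like:
def gen_redis_proto(*cmd)
  proto = ""
  proto << "*#{cmd.length}\r\n"
  cmd.each do |arg|
    proto << "$#{arg.to_s.bytesize}\r\n"
    proto << "#{arg}\r\n"
  end
  proto
end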
Unfortunately I have several thousand rows, which makes this sync quite large. I'd like to sync only the rows that have changed, but there is no last_modified or similar column available in the table.
The above code runs in a loop with a sleep between runs, so I can store the previous result set and compare it against the current one, but I can't work out an efficient way to do this. I'm thinking of something like the pseudocode below:
lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":1}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":2}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]
previous_lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":2}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":3}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]
varied_indices = diff(lines, previous_lines) # returns the indices of changed lines, e.g. [0, 1]
varied_indices.each do |idx|
  line = lines[idx]
  id = JSON.parse(line)["id"]
  STDOUT.write(gen_redis_proto("SET", "users:#{id}", line))
end
I suspect too much data manipulation or comparison will add performance overhead, and I'm also unsure of the best way to diff this data to get the changed rows.

Not that you strictly need it, but the rickshaw gem lets you add hashing onto strings easily. I'd probably store and compare hash values, which take up less space than the full rows. Example:
require 'rickshaw'
require 'json'

def diff(current, previous)
  # Keep only the lines whose SHA1 differs from the line at the same index in the previous run
  current.select.with_index do |line, idx|
    previous_line = previous[idx]
    previous_line.nil? || line.to_sha1 != previous_line.to_sha1
  end
end
lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":1}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":2}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]
previous_lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":2}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":3}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]
varied_lines = diff(lines, previous_lines)
varied_lines.each do |line|
  id = JSON.parse(line)["id"]
  STDOUT.write(gen_redis_proto("SET", "users:#{id}", line))
end
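Comparing by position breaks down if rows are added, removed, or reordered between runs. As a variant (my assumption, not something the question requires), you could key the digests by id and use the standard library's Digest::SHA1 instead of rickshaw. A rough sketch:
require 'digest'
require 'json'

# previous_digests persists between runs, e.g. { "123" => "ab3f...", ... }
def changed_lines(lines, previous_digests)
  lines.select do |line|
    id = JSON.parse(line)["id"].to_s
    digest = Digest::SHA1.hexdigest(line)
    changed = previous_digests[id] != digest
    previous_digests[id] = digest # remember the new digest for the next pass
    changed
  end
end
Only the changed lines then get piped to redis-cli --pipe, and the digest map is far smaller to keep in memory than the previous result set itself.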

Related

Read every nth batch in pyarrow.dataset.Dataset

In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every one? It seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure exactly what you're after, but if you are trying to sample your data there are a few choices, though none achieves quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.head
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sampled dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()

How to store database records into a hash in Ruby on Rails

I'm using a MySQL database.
I want to retrieve data from my database: value and time, for plotting a graph.
My DB stores up to 4000 records, and I need to plot 1000 of them.
The first method that came to my mind is:
points = Hash.new
Records.all.each do |record|
  points[record.time.to_s] = record.value.to_s
end
and then take the first 1000 records.
But this is very inefficient and time consuming, and it makes my page load slowly.
I feel there must be a more efficient way of doing this. Can I convert the first 1000 records' attributes into a hash?
Converting to an array of pairs would also work for me, as long as the data can be plotted.
Thanks!
data = Record.limit(1000)  # Load no more than 1000 entries
  .pluck(:time, :value)    # Pick the field values into sub-arrays;
                           # it will also `SELECT` only these two columns.
                           # At this point you have [[time1, value1], [time2, value2]],
                           # good already, but we can go further:
  .to_h                    # Ruby 2.1+ only! Now it is {time1 => value1, time2 => value2}
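If the "first 1000" should follow a specific ordering (say, by time), an order clause can go before the limit. This is just a sketch, assuming the model is named Record and has a time column:
data = Record.order(:time).limit(1000).pluck(:time, :value).to_h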
You can also use limit and map the rows into pairs:
points = Record.limit(1000).map { |r| [r.time, r.value] }.to_h

Should I use Neo4j's Import Tool or Load Command to Insert Several Million Rows?

I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
neo4j-wrapper.conf changes:
wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000
However my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
//and constraints for other name identifiers as well..
I then use periodic commit...
USING PERIODIC COMMIT 10000
I then LOAD the CSV (using only some of its fields) and skip rows where required values are missing:
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL
I then create the necessary nodes from my data.
WITH line
MERGE (country:Country {name: line.CountryName})
MERGE (city:City {name: line.CityName})
MERGE (neighborhood:Neighborhood {
  name: line.NeighborhoodName,
  size: toInt(line.NeighborhoodSize),
  nickname: coalesce(line.NeighborhoodNN, ""),
  ... 50 other features
})
MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
//Note that each neighborhood only appears once
Does it make sense to use CREATE UNIQUE rather than applying MERGE to any COUNTRY reference? Would this speed it up?
A ~250,000-line CSV file took over 12 hours to complete, and seemed excessively slow. What else can I be doing to speed this up? Or does it just make sense to use the annoying-looking Import Tool?
A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically, it says you should add PROFILE to the start of each of your queries to see whether any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split your query up into separate MERGEs.
Secondly, your neighborhood MERGE contains a lot of properties, and so each time it's trying to match on every single one of those properties before deciding if it should create it or not. I'd suggest something like:
MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
  neighborhood.size = toInt(line.NeighborhoodSize),
  neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
  ... 50 other features

Loading a pandas Dataframe into a sql database with Django

I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe by using faster raw SQL?
I don't think the raw SQL route should be the first thing you try (more on that in a bit). The slowness most likely comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one individually. It's pretty straightforward: bundle up a number of rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: it would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.

Rails 3: Measure database performance (MongoDB and MySQL)

I set up my Rails application twice: one works with MongoDB (Mongoid as the mapper) and the other with MySQL and ActiveRecord. Then I wrote a rake task that inserts some test data into both databases (100,000 entries).
I measured how long each database takes using the Ruby Benchmark module. I did some testing with 100 and 10,000 entries, where MongoDB was always faster than MySQL (about 1/3). The weird thing is that it takes about 3 times longer for MongoDB to insert the 100,000 entries than for MySQL. I have no idea why MongoDB behaves this way. The only thing I know is that the CPU time is much lower than the total time. Is it possible that MongoDB starts some sort of garbage collection while it's inserting the data? At the beginning it's fast, but the more data MongoDB inserts, the slower it gets... any idea on this?
To somehow measure the read performance of the two databases, I thought about timing the interval from when the database receives a search query to when it returns the result. As I need precise measurements, I don't want to include the time Rails spends processing the query between the controller and the database.
How do I do the measurement directly at the database and not in the Rails controller? Is there any gem / tool which would help me?
Thanks in advance!
EDIT: Updated my question according to my current situation
If your base goal is to measure database performance time at the DB level, I would recommend you get familiar with the benchRun method in MongoDB.
To do the type of thing you want to do, you can get started with the example on the linked page, here is a variant with explanations:
// skipped dropping the table and reinitializing, assuming you already have your test dataset
// your database is called test and the collection is foo in this code
ops = [
  // this sets up an array of operations benchRun will run
  {
    // possible operations include find (added in 2.1), findOne, update, insert, delete, etc.
    op: "find",
    // your db.collection
    ns: "test.foo",
    // different operations have different query options - this matches based on _id
    // using a random value between 0 and 100 each time
    query: { _id: { "#RAND_INT": [0, 100] } }
  }
]

for (x = 1; x <= 128; x *= 2) {
  // actual call to benchRun, each time using a different number of threads
  res = benchRun({
    parallel: x, // number of threads to run in parallel
    seconds: 5,  // duration of run; can be fractional seconds
    ops: ops     // array of operations to run (see above)
  })
  // res is a json object returned, easiest way to see everything in it:
  printjson(res)
  print("threads: " + x + "\t queries/sec: " + res.query)
}
If you put this in a file called testing.js you can run it from mongo shell like this:
> load("testing.js")
{
"note" : "values per second",
"errCount" : NumberLong(0),
"trapped" : "error: not implemented",
"queryLatencyAverageMs" : 69.3567923734754,
"insert" : 0,
"query" : 12839.4,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 128.4
}
threads: 1 queries/sec: 12839.4
and so on.
I found the reason why MongoDB is getting slower while inserting many documents.
Many to many relations are not recommended for over 10,000 documents when using MRI due to the garbage collector taking over 90% of the run time when calling #build or #create. This is due to the large array appending occurring in these operations.
http://mongoid.org/performance.html
Now I would like to know how to measure the query performance of each database. My main concerns are measuring the query time and the throughput. This measurement should be made directly at the database, so that nothing can skew the result.
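On the Ruby side, a rough way to time a single query outside of any controller is the Benchmark module already mentioned in the question. This is only a sketch; the Entry model and status column are made-up placeholders:
require 'benchmark'

# Run from a rake task or the rails console so no controller/view time is included.
elapsed = Benchmark.realtime do
  Entry.where(status: 1).to_a # to_a forces the relation to actually hit the database
end
puts "query took #{(elapsed * 1000).round(2)} ms"
This still includes driver and network overhead, so for MongoDB the benchRun approach shown above measures closer to the server itself.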