How can I make a select function more performant in PySpark?

When I use the following function, it takes up to 10 seconds to execute. Is there any way to make it run quicker?
from pyspark.sql import functions as f

def select_top_20(df, col):
    most_data = df.groupBy(col).count().sort(f.desc("count"))
    top_20_count = most_data.limit(20).drop("count")
    top_20 = [row[col] for row in top_20_count.collect()]
    return top_20

Hard to answer in general; the code seems fine to me.
It depends on how the input DataFrame was created:
if it was read directly from a data source (Parquet, a database, and so on), it is an I/O problem and there is not much you can do.
if the DataFrame went through some processing before the function is executed, you might inspect that part. Lazy evaluation in Spark means that all of that processing is redone from scratch when you execute this function (not just the commands listed in the function): reading the data from disk, the processing, everything. Persisting or caching the DataFrame somewhere in between might speed you up considerably.
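For illustration, a minimal sketch of that caching idea, assuming an existing SparkSession named spark; the source path, filter column and the columns passed to select_top_20 are placeholders, not part of the original question:

from pyspark.sql import functions as f

# Hypothetical upstream work done before select_top_20 is ever called.
df = (
    spark.read.parquet("gs://some-bucket/events/")   # placeholder source
         .filter(f.col("some_flag") == 1)            # placeholder transformation
)

# Persist the upstream result once so repeated calls do not redo the whole lineage.
df = df.cache()
df.count()  # an action to materialize the cache

top_20_a = select_top_20(df, "some_column")   # now only pays for the groupBy/count
top_20_b = select_top_20(df, "other_column")  # reuses the cached data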


How to run a Julia function on a specific processor using remotecall() when the function itself has no return value

I tried to use remotecall() in Julia to distribute work to a specific processor. The function I would like to run has no return value, but it does produce output. I cannot make it work, as there is no output file after running the code.
This is the test code I am creating:
using DelimitedFiles
addprocs(4) # add 4 processors

@everywhere function test(x) # define the function
    print("hi")
    writedlm(string("test", string(x), ".csv"), [x], ',')
end

remotecall(test, 2, 2) # to run the function on process 2
remotecall(test, 3, 3) # to run the function on process 3
This is the output I am getting:
Future(3, 1, 67, nothing)
And there is no output file (CSV), nor is "hi" shown.
I wonder if anyone can help me with this or tell me what I did wrong. I am fairly new to Julia and have never used parallel processing.
The background is that I need to run a big simulation (a big function with a bunch of includes, but no direct return outputs) many times, and I would like to split the work across different processors.
Thanks a lot
If you want to use a module function in a worker, you need to import that module locally in that worker first, just as you have to do in your 'root' process. Therefore your using DelimitedFiles directive needs to occur @everywhere first, rather than just on the 'root' process. In other words:
@everywhere using DelimitedFiles
Btw, I am assuming you're using a relatively recent version of Julia and simply forgot to add the using Distributed directive in your example.
Furthermore, when you perform a remote call, what you get back is a "Future" object, which is a way of allowing you to obtain the 'future results of that computation' from that worker, once they're finished. To get the results of that 'future computation', use fetch.
This is all very simplistic and general information, since you haven't provided a specific example that can be copy / pasted and answered specifically. Have a look at the relevant section in the manual, it's fairly clearly written: https://docs.julialang.org/en/v1/manual/parallel-computing/#Multi-Core-or-Distributed-Processing-1

How to debug knex.js without having to pollute my db?

Does anyone know anything about debugging with knex.js and MySQL? I'm trying to do a lot of things and test out stuff, and I keep polluting my test database with random data. Ideally, I'd just like to see what the output query would be instead of running it against the actual database to see whether it actually worked.
I can't find anything too helpful in their docs. They mention passing {debug: true} as one of the options in your initialize settings, but it doesn't really explain what it does.
I am a junior developer, so maybe some of this is not meant to be understood by juniors, but at the end of the day it is just not clear what steps I should take to be able to see what queries would have been run instead of running the real queries and polluting my db.
const result = await db().transaction(trx =>
  trx.insert(mapToSnakeCase(address), 'id').into('addresses')
    .then(addressId =>
      trx.insert({ addresses_id: addressId, display_name: displayName }, 'id')
        .into('chains'))).toString();
You can build a knex query, but until you attach a .then() or await it (or run .asCallback((error, cb) => {})), the query is just an object.
So you could do
let localVar = 8
let query = knex('table-a').select().where('id', localVar)
console.log(query.toString())
// outputs a string 'select * from table-a where id = 8'
This does not hit the database, and is synchronous. Make as many of these as you want!
As soon as you do await query or query.then(rows => {}) or query.asCallback((err, rows) => {}), you are awaiting the db results, starting the promise chain or defining the callback. That is when the database is hit.
Turning on debug: true when initializing just writes the results of query.toSQL() to the console as they run against the actual DB. Sometimes an app might make a lot of queries and if one goes wrong this is a way to see why a DB call failed (but is VERY verbose so typically is not on all the time).
In our app's tests, we do actually test against the database because unit testing this type of stuff is a mess. We use knex's migrations on a test database that is brought down and up every time the tests run. So it always starts clean (or with known seed data), and if a test fails the DB is in the same state to be manually inspected. While we create a lot of test data in a testing run, it's cleaned up before the next test.

GCP Dataflow - processing JSON takes too long

I am trying to process JSON files in a bucket and write the results into a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
        .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
    .apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
            try {
                JsonFactory factory = JacksonFactory.getDefaultInstance();
                String json = c.element();
                SomeClass e = factory.fromString(json, SomeClass.class);
                // manipulate the object a bit...
                c.output(factory.toString(e));
            } catch (Exception err) {
                LOG.error("Failed to process element: " + c.element(), err);
            }
        }
    }))
    .apply(TextIO.Write.to("gs://some-bucket/output/"));
p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster on Amazon takes about 15-20 minutes. I suspect that I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that it is more optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited by writing output rather than by reading input (though reading input is also a significant part). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many files you want in the output. E.g. 100 could be a reasonable value. By default you get an unspecified number of files which, in this case, given the current behavior of the Dataflow optimizer, matches the number of input files: usually that is a good idea because it allows us to avoid materializing the intermediate data, but here it is not, because the input files are so small and per-file overhead is more important.
I recommend setting maxNumWorkers to a value like 12; currently the second job is autoscaling to an excessively large number of workers. This is caused by Dataflow's autoscaling currently being geared toward jobs that process data in larger increments; it doesn't take per-file overhead into account and does not behave well in your case.
The second job is also hitting a bug because of which it fails to finalize the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
In short:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.
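For reference, a rough sketch of those two knobs in the Apache Beam Python SDK rather than the original Java 1.x pipeline above; the paths, region, placeholder sanitizing step and the shard count are illustrative, not taken from the question:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative options; max_num_workers caps autoscaling as suggested above.
options = PipelineOptions(
    runner="DataflowRunner",
    project="the-project",
    region="us-central1",                 # placeholder region
    temp_location="gs://some-bucket/temp/",
    max_num_workers=12,
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://some-bucket/2016/04/28/*/*.json")
        | "SanitizeJson" >> beam.Map(lambda line: line)  # placeholder for the real sanitizing logic
        | "Write" >> beam.io.WriteToText("gs://some-bucket/output/", num_shards=100)
    )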

Loading a pandas DataFrame into a SQL database with Django

I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of your StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a number of rows and create them in batches, maybe 10, 20 or 100 at a time; your database configuration can also help you find the optimal batch size. (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet)
Back to your idea of raw SQL: it would make a difference if, for example, the Python code that creates the StrategyRow instances is itself slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
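A minimal sketch of the batched approach, assuming the same strategy, symbols, prices, weights and portfolio objects as in the question and that the price and weight model fields are nullable; the batch size is just an example:

import math

# Delete previous results for this strategy, as before.
StrategyRow.objects.filter(strategy=strategy).delete()

rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        rows.append(StrategyRow(
            strategy=strategy,
            symbol=s,
            time=t,
            price=None if math.isnan(p) else p,
            weight=None if math.isnan(w) else w,
        ))

# A handful of bulk INSERTs instead of one query per row.
StrategyRow.objects.bulk_create(rows, batch_size=500)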

How do I chain Python generator expressions with filtering?

I am trying to chain generators so I can process a large CSV file as a stream of lines rather than doing each single operation in a batch.
This way, I can delay each iteration of each step, avoiding loading an entire data set in memory.
Generator expressions work fine except when I try to filter the output with an if clause.
This works, and only one iteration is consumed at a time:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)
Then I can get a line at a time by calling next() on the resulting iterable. However, if I do this:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable if parse_i(i)[0] in [1,2,3])
Then the iterator keeps running until it is exhausted. Why? What's the workaround?
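For what it's worth, one lazy pattern that avoids calling parse_i twice per line is to chain a second generator expression that does the filtering; a sketch, assuming parse_i returns a sequence whose first element is the value being tested:

file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)                         # parse lazily, once per line
filtered_iterable = (row for row in parsed_csv_iterable if row[0] in [1, 2, 3])   # filter lazily

first_match = next(filtered_iterable)  # consumes input lines only until the first matching row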