Merge multiple optimization paths in mlr

Is it possible to merge the results of two tuning processes that use the same parameter set? I am currently calling addOptPathEl in a loop, roughly like the sketch below, but I wonder if there is a faster way to do it.
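What I am doing now looks roughly like this (op1 and op2 stand for the two OptPath objects; I assume both were produced over the same ParamSet):

    library(ParamHelpers)

    # Copy every element of op2 onto the end of op1, one at a time.
    for (i in seq_len(getOptPathLength(op2))) {
      el <- getOptPathEl(op2, i)
      addOptPathEl(op1, x = el$x, y = el$y, dob = el$dob, eol = el$eol)
    }
    # (Extras such as exec.time or error.message would need to be
    # copied the same way if the paths store them.)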
Thanks!
Víctor

Related

Handling very large datasets (>1M records) using Power Automate

Before I go much further down this thought process, I wanted to check whether the idea is feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will arrive as CSV files for ingestion.
Basically, I'll need to take records from the first dataset, look each one up against the second dataset, and merge the two together into an output CSV file with the results. The expected number of records after the merge will be in the range of 1.5-2M.
So, my questions are:
Will Power Automate allow me to work with CSV datasets of those sizes?
Will the "Apply to each" operator function across that large of a dataset?
Will Power Automate allow me to produce the export CSV file with that size?
Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.
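For comparison with that traditional tooling, the whole operation is only a few lines in a conventional script; here is a pandas sketch (the file names and join key are made up):

    import pandas as pd

    # Hypothetical file names and join key.
    a = pd.read_csv("dataset_a.csv")  # ~500K records
    b = pd.read_csv("dataset_b.csv")  # ~500K records

    # Lookup: join each record of A against B on the shared key.
    merged = a.merge(b, on="record_id", how="left")
    merged.to_csv("merged_output.csv", index=False)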

In Foundry Code Repositories, how do I iterate over all datasets in a directory?

I'm trying to read all (or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path without hardcoding individual datasets as inputs?
I'd like to dynamically fetch different columns from multiple datasets without having to hardcode individual input datasets.
This doesn't work, because you would get inconsistent results every time you run CI. It would break TLLV (transforms-level logic versioning) by making it impossible to tell when the logic has actually changed, and therefore when a dataset should be marked stale.
You will have to write out the logical path of each dataset you wish to transform, even if that just means passing the paths into a generated transform, as sketched below. There needs to be at least some consistent record of which datasets were targeted by which commit.
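A minimal sketch of that generator pattern (the paths and the per-dataset logic are hypothetical; the point is that the list of inputs lives in the committed code):

    from transforms.api import transform_df, Input, Output

    # Explicitly listed logical paths -- this is the consistent
    # record that TLLV needs.
    DATASET_PATHS = [
        "/Project/folder/dataset_a",
        "/Project/folder/dataset_b",
    ]

    def make_transform(path):
        @transform_df(
            Output(path + "_processed"),
            source=Input(path),
        )
        def compute(source):
            # Replace with the real per-dataset logic.
            return source
        return compute

    TRANSFORMS = [make_transform(p) for p in DATASET_PATHS]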
Another tactic to achieve what you're looking for is to build a single long dataset that is the unpivoted version of all the datasets. That way you can simply append new rows / files to it, which lets you accept arbitrary inputs, assuming your transform is written to handle this.
My rule of thumb is this: if you need dynamic schemas or dynamic counts of datasets, then you're better off using dynamic files / row counts in a single dataset.

CakePHP: Is it possible to force find() to run a single MySQL query

I'm using CakePHP 2.x. When I inspect the SQL dump, I notice that its "automagic" is causing one of my find()s to run several separate SELECT queries (and then presumably merging them all together into a single pretty array of data).
This is normally fine, but I need to run one very large query on a table of 10K rows with several joins, and this is proving too much for the magic to handle: when I construct it through find('all', $conditions), the query times out after 300 seconds, yet when I write an equivalent query manually with JOINs, it runs very fast.
My theory is that whatever PHP "magic" is required to weave the separate queries together is causing a bottleneck for this one large query.
Is my theory a plausible explanation for what's going on?
Is there a way to tell Cake to just keep it simple and make one big fat SELECT instead of its fancy automagic?
Update: I forgot to mention that I already know about $this->Model->query(); using this is how I figured out that the slow-down was coming from the PHP magic. It works when we do it this way, but it feels a little clunky to maintain the same query in two different forms. That's why I was hoping CakePHP offered an alternative to the way it builds up big queries from multiple smaller ones.
In cases like this, where you query tables with 10K records, you shouldn't be doing a find('all') without limiting the associations. These are some of the strategies you can apply:
Set recursive to 0 if you don't need related models.
Use the Containable behavior to fetch only the associated models you need (see the sketch after this list).
Apply limits to your query
Caching is a good friend
Create and destroy associations on the fly, as you need them.
Since you didn't specify the exact problem, these are general ideas to apply depending on the situation you have.
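A rough sketch of the first two strategies, plus the 'joins' option, which is the usual CakePHP 2.x way to force one explicit SELECT (the model, table, and field names here are hypothetical):

    // Hypothetical models: Post belongsTo Author.
    // Containable keeps the automagic from issuing extra SELECTs.
    $this->Post->Behaviors->load('Containable');
    $posts = $this->Post->find('all', array(
        'contain' => array('Author'),  // fetch only this association
        'limit'   => 100,
    ));

    // Or force one big SELECT with an explicit join:
    $posts = $this->Post->find('all', array(
        'recursive' => -1,             // no automagic queries at all
        'fields'    => array('Post.*', 'Author.name'),
        'joins'     => array(array(
            'table'      => 'authors',
            'alias'      => 'Author',
            'type'       => 'LEFT',
            'conditions' => array('Author.id = Post.author_id'),
        )),
    ));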

Can MySQL parallelize UNION subqueries (or anything at all)?

I use a partitioned table with a large amount of data. According to the MySQL docs, it is on the ToDo list that:
Queries involving aggregate functions such as SUM() and COUNT() can easily be parallelized.
... but, can I achieve the same functionality using UNION subqueries? Are they parallelized, or do I have to create a multithreaded client to run concurrent queries with all the possible partition keys?
Edit:
The question is not strictly about UNION or subqueries. I would like to utilize as many cores as possible for my queries. Is there any way to do this (and to be sure it is actually happening) without parallelizing my application?
Any good documentation about MySQL's current parallelizing capabilities?
As far as I know, currently the only way to use more than one thread/core to run queries from your application is to use more than one connection. This of course makes it impossible to run parallel queries that are part of a single transaction.
The different queries that are UNIONed together in one larger query aren't really subqueries, strictly speaking.
The queries are run in order
The data type of the columns is determined by the first query
By default, identical rows are dropped (UNION defaults to DISTINCT; see the UNION ALL example below)
The result set is not finished building until all queries are run
...so there is no way to parallelize the different queries; they are all really part of one and the same query.
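As an aside (table and column names hypothetical), if you don't need the duplicate removal, UNION ALL at least skips that extra pass over the combined result:

    SELECT id, amount FROM t WHERE pkey = 0
    UNION ALL  -- unlike plain UNION, no implicit DISTINCT
    SELECT id, amount FROM t WHERE pkey = 1;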
You may want to try running the different queries in parallel from your code, and then mashing the results up together once the queries all complete, along these lines:
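(A minimal Python sketch; the driver, table, and partition-key column are all assumptions. The key point is one connection per thread, since a single MySQL connection executes statements serially.)

    import concurrent.futures
    import mysql.connector  # assumed driver; any DB-API driver works similarly

    # Hypothetical per-partition-key queries.
    QUERIES = [
        "SELECT SUM(amount) FROM t WHERE pkey = 0",
        "SELECT SUM(amount) FROM t WHERE pkey = 1",
        "SELECT SUM(amount) FROM t WHERE pkey = 2",
    ]

    def run_query(sql):
        # One connection per thread.
        conn = mysql.connector.connect(user="app", password="secret",
                                       host="localhost", database="mydb")
        try:
            cur = conn.cursor()
            cur.execute(sql)
            return cur.fetchone()[0]
        finally:
            conn.close()

    with concurrent.futures.ThreadPoolExecutor(len(QUERIES)) as pool:
        partials = list(pool.map(run_query, QUERIES))

    # Mash the partial results together in the client.
    total = sum(p for p in partials if p is not None)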
The documentation on UNIONs can be found here.
I think a similar question was answered here.
http://forums.mysql.com/read.php?115,84453,84453
(Maybe I should have posted this as a comment, but I honestly couldn't find a comment button anywhere around here.)

How to iteratively optimize a MySQL query?

I'm trying to iteratively optimize a slow MySQL query: I run the query, get a timing, tweak it, re-run it, get a timing, and so on. The problem is that the timing is non-stationary, and later executions of the query perform very differently from earlier ones.
I know to clear the query cache, or turn it off, between executions. I also know that, at some level, the OS will affect query performance in ways MySQL can't control or understand. But in general, what's the best I can do with respect to this kind of iterative query optimization, so that I can compare apples to apples?
Your best tool for query optimization is EXPLAIN. It takes a while to learn what the output means, but once you do, you will understand how MySQL's (horrible, broken, backwards) query planner decides how best to retrieve the requested data.
Changes in the parameters to the query can result in wildly different query plans, so this may account for some of the problems you are seeing.
You might want to consider using the slow query log to capture all queries that run with low performance. Perhaps you'll find that the query in question only falls into the low-performance category when it is given certain parameters. For example:
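(The table names here are hypothetical; the two variables are standard MySQL server settings.)

    -- Compare plans, not just timings, between tweaks.
    EXPLAIN SELECT o.id, c.name
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.created_at >= '2012-01-01';

    -- Log anything slower than 1 second to the slow query log.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;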
Create a script that runs the query 1000 times, or whatever number of iterations causes the results to stabilize.
Then follow your process as described above, but make sure you aren't relying on a single execution, rather on an average of multiple executions. You're right that the results won't be stable: row counts change, and your machine is doing other things.
Also, try to use a wide array of inputs to the query, if that makes sense for your use case.
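A minimal sketch of such a script (the driver and connection details are assumptions; the query itself is a placeholder):

    import statistics
    import time
    import mysql.connector  # assumed driver

    QUERY = "SELECT ..."  # statement under test; consider SELECT SQL_NO_CACHE
    N = 1000              # iterations, until the numbers stabilize

    conn = mysql.connector.connect(user="app", password="secret",
                                   host="localhost", database="mydb")
    cur = conn.cursor()

    timings = []
    for _ in range(N):
        start = time.perf_counter()
        cur.execute(QUERY)
        cur.fetchall()  # drain the result so the full round trip is timed
        timings.append(time.perf_counter() - start)

    conn.close()
    print("mean %.4fs, stdev %.4fs"
          % (statistics.mean(timings), statistics.stdev(timings)))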