How many executors is my Foundry job using?

How many executors is my Foundry job using? - palantir-foundry

I can see the parallelism of my job in the Spark Details page, but I wonder how many Executors my job is actually running with.
Where can I see this?

If you follow the same methodology to find the Environment tab noted over here, you'll find an entry on that page for the number of executors used.
Depending on your environment, you may find that dynamicAllocation is true, in which case you'll have a minExecutors and a maxExecutors setting noted, which is used as the 'bounds' of your job. You'll have any number of executors in that range depending on if you're using static allocation or dynamic allocation.
If dynamicAllocation is false, then you'll see an executorInstances entry, with the count listed there.

Related

Model Derivative getModelViewProperties is limiting requests per minute. How can I have unlimited requests?

The following api link for the Node JS package shows an endpoint that limits to 60 RPM: forge-api docs
Support has said that 60RPM is only for endpoints where forceget is set to true in the options, and that the implicit value for the forceget option when calling getModelviewProperties is set to false, yet I'm still getting a limited to 60 requests per minute.
Even when setting forceget to false it limits the rpm.
How can I have more RPM for this endpoint?

Unfortunately these endpoints have to be limited for reasons. But I agree 60rpm with forceget=true is a very low limit. However, forceget=false has a 14000rpm limit. Note as well that forceget=true is ignored if the resources are still available in the cache from a previous call.
That means that you first call with forceget=false, and if there is a problem, you call the resource with forceget=true only once, and give it some time for the resources to be loaded in the cache. All following calls should be using forceget=false. That way, you would save your forceget=true quota.
If you still have problems, you can either let me know at cyrille at autodesk.com with an example, and I'll debug the problem with you.
Or an alternative is to parse/expand the db files yourself, which can be a very good option if you need to improve your application performance a lot. I got an old sample doing this here which is more powerful than the Forge API for querying properties. I am currently rewriting it in typescript in a new sample, but this one isn't ready yet - it will be posted here.

Replacing items in message queue

Our system requirements say that we need to build a slightly unusual producer-consumer processing system. Imagine we have multiple data streams and we take a snapshot each X seconds and put it into the queue for processing. The streams count is not constant. The more clients we have, the more streams we need to process. At the same time, we don't need to process ALL taken snapshots. If we have too many clients and we are not able to process all items in real-time, we would prefer to skip old snapshots and process only the latest ones.
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key.
Is it possible to implement such behavior by Service Bus queue or something similar? Or maybe it makes sense to look into some other solutions like Redis?

So as I see, the requirements can be met by keeping only one item in a
queue for each stream. If there is a new snapshot, while the previous
is still there, we need to REPLACE it using stream id as a key. Is it
possible to implement such behavior by Service Bus queue or something
similar?
To the best of my knowledge, Azure Service Bus does not support this scenario. Through it's duplicate detection functionality, in fact it supports exact opposite of that. You would need to use some other mechanism (like Redis Cache you mentioned) to accomplish this.

CaseStudy in Bugzilla

I wish to conduct a case study in bugzilla, where I would like to ideally find out some statistics such as
The number of Memory Leaks
The percentage of bugs which are performance bugs
The percentage of semantic bugs
How can I search through the bugzilla database for softwares such as apache http server or mysql database server to generate such statistics. I would like an idea of how to get started?

I finally figured it out and am going to show my approach here. It's more or less a manual process. Hopefully this helps others as well who might be doing case-studies:
Selecting the bugs:
I went to bugs.mysql.com and searched for all bugs which were marked as resolved and fixed. Unfortunately, for mysql you cannot select specific components. I filtered out a random time range (2013-2014). And saved all of these in excel file(csv)
Classifying and filtering the bugs:
I manually went through the bugs, I skipped the ones which I could see clearly belonged to the documentation component, installation, compilation failure, or required restarts.
Then I read the report, and checked if the report actually suggested a bug fix, and if the bugfix was semantic (i.e. change limits, add a condition check, make sure some if condition is correctly marked for an edge case etc. - most were along similar lines). I followed a similar process for performance, resource-leak (both cpu resources, and memory leak considered in this category), and concurrency bugs.

What tools do distributed programmers lack?

I have a dream to improve the world of distributed programming :)
In particular, I'm feeling a lack of necessary tools for debugging, monitoring, understanding and visualizing the behavior of distributed systems (heck, I had to write my own logger and visualizers to satisfy my requirements), and I'm writing a couple of such tools in my free time.
Community, what tools do you lack with this regard? Please describe one per answer, with a rough idea of what the tool would be supposed to do. Others can point out the existence of such tools, or someone might get inspired and write them.

OK, let me start.
A distributed logger with a high-precision global time axis - allowing to register events from different machines in a distributed system with high precision and independent on the clock offset and drift; with sufficient scalability to handle the load of several hundred machines and several thousand logging processes. Such a logger allows to find transport-level latency bottlenecks in a distributed system by seeing, for example, how many milliseconds it actually takes for a message to travel from the publisher to the subscriber through a message queue, etc.
Syslog is not ok because it's not scalable enough - 50000 logging events per second will be too much for it, and timestamp precision will suffer greatly under such load.
Facebook's Scribe is not ok because it doesn't provide a global time axis.
Actually, both syslog and scribe register events under arrival timestamps, not under occurence timestamps.
Honestly, I don't lack such a tool - I've written one for myself, I'm greatly pleased with it and I'm going to open-source it. But others might.
P.S. I've open-sourced it: http://code.google.com/p/greg

Dear Santa, I would like visualizations of the interactions between components in the distributed system.
I would like a visual representation showing:
The interactions among components, either as a UML collaboration diagram or sequence diagram.
Component shutdown and startup times as self-interactions.
On which hosts components are currently running.
Location of those hosts, if available, within a building or geographically.
Host shutdown and startup times.
I would like to be able to:
Filter the components and/or interactions displayed to show only those of interest.
Record interactions.
Display a desired range of time in a static diagram.
Play back the interactions in an animation, with typical video controls for playing, pausing, rewinding, fast-forwarding.
I've been a good developer all year, and would really like this.

Then again, see this question - How to visualize the behavior of many concurrent multi-stage processes?.
(I'm shamelessly refering to my own stuff, but that's because the problems solved by this stuff were important for me, and the current question is precisely about problems that are important for someone).

You could have a look at some of the tools that come with erlang/OTP. It doesn't have all the features other people suggested, but some of them are quite handy, and built with a lot of experience. Some of these are, for instance:
Debugger that can debug concurrent processes, also remotely, AFAIR
Introspection tools for mnesia/ets tables as well as process heaps
Message tracing
Load monitoring on local and remote nodes
distributed logging and error report system
profiler which works for distributed scenarios
Process/task/application manager for distributed systems
These come of course in addition to the base features the platform provides, like Node discovery, IPC protocol, RPC protocols & services, transparent distribution, distributed built-in database storage, global and node-local registry for process names and all the other underlying stuff that makes the platform tic.

I think this is a great question and here's my 0.02 on a tool I would find really useful.
One of the challenges I find with distributed programming is in the deployment of code to multiple machines. Quite often these machines may have slightly varying configuration or worse have different application settings.
The tool I have in mind would be one that could on demand reach out to all the machines on which the application is deployed and provide system information. If one specifies a settings file or a resource like a registry, it would provide the list for all the machines. It could also look at the user access privileges for the users running the application.
A refinement would be to provide indications when settings are not matching a master list provided by the developer. It could also indicate servers that have differing configurations and provide diff functionality.
This would be really useful for .NET applications since there are so many configurations (machine.config, application.config, IIS Settings, user permissions, etc) that the chances of varying configurations are high.

In my opinion, what is missing is a distributed programming platform...a platform that makes application programming over distributed systems as transparent as non-distributed programming is now.

Isn't it a bit early to work on Tools when we don't even agree on a platform? We have several flavors of actor models, virtual shared memory, UMA, NUMA, synchronous dataflow, tagged token dataflow, multi-hierchical memory vector processors, clusters, message passing mesh or network-on-a-chip, PGAS, DGAS, etc.
Feel free to add more.
To contribute:
I find myself writing a lot of distributed programs by constructing a DAG, which gets transformed into platform-specific code. Every platform optimization is a different kind of transformation rules on this DAG. You can see the same happening in Microsoft's Accelerator and Dryad, Intel's Concurrent Collections, MIT's StreaMIT, etc.
A language-agnostic library that collects all these DAG transformations would save re-inventing the wheel every time.

You can also take a look at Akka:
http://akka.io

Let me notify those who've favourited this question by pointing to the Greg logger - http://code.google.com/p/greg . It is the distributed logger with a high-precision global time axis that I've talked about in the other answer in this thread.

Apart from the mentioned tool for "visualizing the behavior of many concurrent multi-stage processes" (splot), I've also written "tplot" which is appropriate for displaying quantitative patterns in logs.
A large presentation about both tools, with lots of pretty pictures here.

Simple scalable work/message queue with delay

I need to set up a job/message queue with the option to set a delay for the task so that it's not picked up immediately by a free worker, but after a certain time (can vary from task to task). I looked into a couple of linux queue solutions (rabbitmq, gearman, memcacheq), but none of them seem to offer this feature out of the box.
Any ideas on how I could achieve this?
Thanks!

I've used BeanstalkD to great effect, using the delay option on inserting a new job to wait several seconds till the item becomes available to be reserved.
If you are doing longer-term delays (more than say 30 seconds), or the jobs are somewhat important to perform (abeit later), then it also has a binary logging system so that any daemon crash would still have a record of the job. That said, I've put hundreds of thousands of live jobs through Beanstalkd instances and the workers that I wrote were always more problematical than the server.

You could use an AMQP broker (such as RabbitMQ) and I have an "agent" (e.g. a python process built using pyton-amqplib) that sits on an exchange an intercepts specific messages (specific routing_key); once a timer has elapsed, send back the message on the exchange with a different routing_key.
I realize this means "translating/mapping" routing keys but it works. Working with RabbitMQ and python-amqplib is very straightforward.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How many executors is my Foundry job using? - palantir-foundry

I can see the parallelism of my job in the Spark Details page, but I wonder how many Executors my job is actually running with. Where can I see this?

Related

Model Derivative getModelViewProperties is limiting requests per minute. How can I have unlimited requests?

Replacing items in message queue

CaseStudy in Bugzilla

What tools do distributed programmers lack?

Simple scalable work/message queue with delay

Categories

Resources