Apache Superset - Concurrent loading of dashboard slices (Athena) - gunicorn

I've got a dashboard with a few slices setup. Slices are loading one after the other, not concurrently. This results in a bad user experience. My data is sitting on S3 and I'm using the Athena connector for queries. I can see that the calls to Athena are fired in order, with each query waiting for the one before it to finish before running.
I'm using gevent, which is far as I can tell shouldn't be a problem?
Below is an extract from my Superset config:
SUPERSET_WORKERS = 8
WEBSERVER_THREADS = 8
It used to be set to 2 and 1 respectively, but I upped it to 8 each see if that could be the issue. I'm getting the same result though.
Is this a simple misconfiguration issue or am I missing something?

It's important to understand multiProcessing and multiThreading before increasing the workers and threads for gunicorn. If you are looking at CPU intense operations, you want many processes' while you probably want many threads for I/O intensive operations.
With your problem, you don't need many process; but threads within a process. With that config, next step would be to debug how you're spawning greenlets. (gevents facilitates concurrency. Concurrency !== Parallel Processing).
To bootstrap gunicorn with multiple threads you could do something like this:
gunicorn -b localhost:8000 main:app --threads 30 --reload
Please post more code to facilitate targeted help.

Related

Architecture advice for EventMachine and MySQL

We are writing a real-time game in EventMachine/Ruby. We're using ActiveRecord with MySQL for storing the game objects.
When we start the server we plan to load all the game objects into memory. This will allow for us to avoid any blocking/slow SQL queries with ActiveRecord.
However, we still need to persist the data in the database in case the server crashes, of course.
What are our options for doing so? I could use EM.Defer but I have no idea how many concurrent players that could handle since the thread pool is limited to 20.
Currently I'm thinking using Resque with Redis would be the best bet. Do everything with the objects in memory, and whenever there is a save that needs to occur for the database, fire off a job and add it to the Resque queue.
Any advice?
Threadpool size can be tweaked - see EventMachine.threadpool_size
Each server process (apache...) will spawn its own EventMachine reactor and its own EM.defer threadpool, so if you use a forking server (a mongrel farm, passenger, ...) you don't need to go crazy on the threadpool size
See EM-Synchrony by Ilja Grigorik (https://github.com/igrigorik/em-synchrony) - you should be able to simplify your code with it
Afaik, mysql has a non-blocking driver that you can use freely with EM, EM::Synchrony supports it http://www.igvita.com/2010/04/15/non-blocking-activerecord-rails/ - this would mean you don't need EM.defer at all!
Take a look at Thin - https://github.com/macournoyer/thin/ - it's non-blocking EM-based webserver that supports Rails
Having said all this, writing evented code is a bitch - forget about stack traces and make sure you're running benchmark tests often as anything blocking your reactor will block the entire application.
Also, this all applies to MRI Ruby ONLY. If you mean to use jruby... You're bound to get into trouble as thread-safety of eventmachine seems to be largely due to GIL of MRI Ruby and standard patterns don't work (many aspects of it can be made to work with this fork https://github.com/WebtehHR/eventmachine/tree/v1.0.3_w_fix which fixes some issues EM has with JRuby)
Unfortunately, guys from https://github.com/eventmachine/eventmachine are kind of not very active, the project currently has 200+ issues and almost 60 open pull requests which is why I've had to use a separate fork to continue playing with my current project - this still means EM is an awesome project just don't expect problems you encounter to be quickly fixed so do your best to not go out of the trodden path of EM use.
Another problem with JRuby is that EM::Synchrony imposes a heavy performance penalty because JRuby doesn't have implemented fibers as of 1.7.8 but rather maps them to Java native threads which are MUCH slower
Also, have you considered messaging with something like RabbitMQ (it has a synchronous https://github.com/ruby-amqp/bunny, and evented driver https://github.com/ruby-amqp/amqp) as a possibility to communicate game objects between clients and perhaps reduce overhead on the database / distributed memory store that you had in mind?
Redis/Resque seem good, but if all the jobs will need to do is simple persistance, and if there will be A LOT of such calls, you might want to consider beanstalkd - it has A LOT faster but simpler queue then Resque and you can probably make this even faster if you don't really need activerecord to dump attribute hashes into the database, see delayed_jobs vs resque vs beanstalkd?
A couple years and a failed project later, some thoughts:
avoid eventmachine if any way possible, there's a plethora of opportunities to peg your CPU nowadays with YARV/MRI Ruby on a IO constrained application and without wasting memory.
My favorite approach for a web application at this time is use Puma with multiple processes and threads.
Have in mind that GIL in YARV only affects the Ruby interpreter code, not the IO operations, meaning that on a IO constrained application you can add threads and see better utilization of a single core,
add more processes and you see better utilization of many cores :) On Heroku 1x worker we run 2 processes with 4 threads each and this pegs our CPU potential to the top in benchmark meaning the application is no longer IO bound, but CPU bound and doing so without unacceptable memory losses.
When we needed super-fast responses we were troubled by the DB write operation times which did not affect the response to client, so we did asynchronous database writes using sidekiq / resque,
In hindsight you could even do celluloid or concurrent-ruby for asynchronous IO reads/writes (think DB writes, cache visits etc), it's less overhead and infrastructure but harder to debug and problem solve in production - my worst nightmare being an async operation failing silently with no error trace in our Errors console (an exception in exception handling for example)
End result is that your application experiences the same sort of benefits you used to get from using eventmachine (elimination of the IO bound, full utilization of CPU without huge memory footprint, parallel non-blocking IO) without resorting to writing reactor code which is a complete bitch to do as explained in my 2013 post

Performing a distributed CUDA/OpenCL based password cracking

Is there a way to perform a distributed (as in a cluster of a connected computers) CUDA/openCL based dictionary attack?
For example, if I have a one computer with some NVIDIA card that is sharing the load of the dictionary attack with another coupled computer and thus utilizing a second array of GPUs there?
The idea is to ensure a scalability option for future expanding without the need of replacing the whole set of hardware that we are using. (and let's say cloud is not an option)
This is a simple master / slave work delegation problem. The master work server hands out to any connecting slave process a unit of work. Slaves work on one unit and queue one unit. When they complete a unit, they report back to the server. Work units that are exhaustively checked are used to estimate operations per second. Depending on your setup, I would adjust work units to be somewhere in the 15-60 second range. Anything that doesn't get a response by the 10 minute mark is recycled back into the queue.
For queuing, offer the current list of uncracked hashes, the dictionary range to be checked, and the permutation rules to be applied. The master server should be able to adapt queues per machine and per permutation rule set so that all machines are done their work within a minute or so of each other.
Alternately, coding could be made simpler if each unit of work were the same size. Even then, no machine would be idle longer than the amount of time for the slowest machine to complete one unit of work. Size your work units so that the fastest machine doesn't enter a case of resource starvation (shouldn't complete work faster than five seconds, should always have a second unit queued). Using that method, hopefully your fastest machine and slowest machine aren't different by a factor of more than 100x.
It would seem to me that it would be quite easy to write your own service that would do just this.
Super Easy Setup
Let's say you have some GPU enabled program X that takes a hash h as input and a list of dictionary words D, then uses the dictionary words to try and crack the password. With one machine, you simply run X(h,D).
If you have N machines, you split the dictionary into N parts (D_1, D_2, D_3,...,D_N). Then run P(x,D_i) on machine i.
This could easily be done using SSH. The master machine splits the dictionary up, copies it to each of the slave machines using SCP, then connects to the slaves and tells them to run the program.
Slightly Smarter Setup
When one machine cracks the password, they could easily notify the master that they have completed the task. The master then kills the programs running on the other slaves.

Reliably monitor a serial port (Nortel CS1000)

I have a custom python script that monitors the call logs from a Nortel phone system. This phone system is under extremely high volume throughout the day and it's starting to appear that some records may be getting lost.
Some of you may dislike this, but I'm not interested in sharing the source code or current method in any way. I would rather consider this from a "new project" approach.
I'm looking for insight into the easiest and safest way to reliably monitor heavy data output through a serial port on Linux. I'm not limiting this to any particular set of tools or languages, I want to find out what works best to do this one critical job. I'm comfortable enough parsing the data and inserting it into mysql that we could just assume the data could be dropped to a text file.
Thank you
Well, the way that I would approach this this to have 2 threads (or processes) working.
Thread 1: The read thread
This thread does nothing but read data from the raw serial port and put the data into a local buffer/queue (In memory is preferred for speed). It should do nothing else. Depending on the clock speed of the serial connection, this should be pretty easy to do.
Thread2: The processing thread
This thread just sleeps until there is data in the local buffer to process, then reads and processes it. That's it.
The reason for splitting it apart in two, is so that if one is busy (a block in MySQL for the processing thread) it won't affect the other. After all, while the serial port is buffered by the OS, the buffer size is limited.
But then again, any local program is likely going to be way faster than the serial port can send data. Serial transfer is actually quite slow relative to the clock speed of the processor (115.2kbps is about the limit on standard hardware). So unless you're CPU speed bound (such as on an Arduino), I can't see normal conditions affecting it too much. So your choice of language really shouldn't be of too much concern (assuming modern hardware). Stick to what you know.

Does there exist an open-source distributed logging library?

I'm talking about a library that would allow me to log events from different machines and would align these events on a "global" time axis with sufficiently high precision.
Actually, I'm asking because I've written such a thing myself in the course of a cluster computing project, I found it terrifically useful, and I was surprised that I couldn't find any analogues.
Therefore, the point is whether something like this exists (and I better contribute to it) or nothing exists (and I better write an open-source analogue of my solution).
Here are the features that I'd expect from such a library:
Independence on the clock offset between different machines
Timing precision on the order of at least milliseconds, preferably microseconds
Scalability to thousands of concurrent logging processes, with at least several megabytes of aggregated logs per second
Soft real-time operation (t.i. I don't want to collect 200 big logs from 200 machines and then compute clock offsets and merge them - I want to see what happens "live", perhaps with a small lag like 10s)
Facebook's contribution in the matter is called 'Scribe'.
Excerpt:
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.
...
Scribe is implemented as a thrift service using the non-blocking C++ server. The installation at facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
The API is Thrift-based, so you have a good platform coverage, but in case you're looking for simple integration for Java you may want to have a look at Digg's log4j appender for Scribe.
You could use log4j/log4net targeting a central syslog daemon. log4j has a builtin SyslogAppender, and in log4net you can do it as shown here. log4cpp docs here.
There are Windows implementations of Syslog around if you don't have a Unix system to hand for this.
Use Chukwa, Its Open source and Large scale Log Monitoring System

Reporting Services won't use more than 25% of CPU

I've set up a solution that creates rapid fire PDF reports. Currently it seems I can't get Reporting Services to use all the resources it has available to it. The system doesn't appear to be IO bound, CPU bound, or memory bound. Any suggestions on trying to figure out why it's running so?
The application isn't network IO bound, and it is multi-threaded to 2 times the number of processors.
SQL Server Reporting Services limits the number of reports run to 2 simultaneous ad-hoc reports and 2 simultaneous web reports. This is a hard limit imposed by the server.
Robin Day is probably right, however if you are using a processor that supports hyper threading you may get a performance benefit by turning this off in the BIOS. You can try an a A/B performance test.
You could also check the SQL instance (when you say reporting service you mean SSRS right?) has not a got processor affinity set.
Is this a case of not using a multi threaded approach? Is the machine using 100% of one core of a processor and that's the bottleneck?
EDIT: Sorry for stating the obvious, was just an idea before you mentioned that it was already multi threaded. I'm afraid I can't offer any more suggestions.
Any suggestions on trying to figure out why it's running so?
a) There's an API to restrict a whole process to one CPU: test that using GetProcessAffinityMask.
b) 'Thread state' and 'Thread wait reason' are two of the performance counters ... maybe you can read this to see why threads, we you think ought to be running, aren't.
All the threads of your application are fighting for a single lock. Use a profiler to see if there is a congestion somewhere.
If you have four cores, that would explain why you see 25% overall CPU usage.
Maybe the server can't deliver more data over the network (so it's network IO bound)?