I'm working on a moderately large query
MATCH path = ((s:User {UserId:"indexedKey"})-[r:HasAccess]-(t:User)) WHERE all(y IN rels(path) WHERE toInt(y.Importance) >= 3) RETURN path
Expected result ~3,000 nodes; unknown edges, probably >6,000. The browser errors out before returning results. When I limit the range (i.e.)
MATCH path = ((s:User {UserId:"indexedKey"})-[r:HasAccess*3]-(t:User)) WHERE all(y IN rels(path) WHERE toInt(y.Importance) >= 3) RETURN path*
it returns without issue.
Server: Neo4j 3.1.1 CE on a Ubuntu 16.04.1 LTS under Hyper-V on Windows 10. VM with 4 virtual cores, unlimited claim on Host; Assigned 7GB out of 11GB Max in VM. Neo4j Heap = 5G, and pagecache.size = 5G.
Client: Windows 10 with Chrome 64-bit and Firefox 64-bit, same behavior. Different host from the server.
The failure occurs after an indeterminate time (left to run overnight) i get Error from Chrome 64-bit. Watching the resources on the client side i see ~300 MB/s coming into the the browser, browser memory steadily increases. I've seen it hit 6 GB (Firefox 64-Bit). After a recent update Chrome 64-bit increases it's memory usage at a significantly slower rate.
Is this a bug in the neo4j javascript driver/neo4j browser or are there server/client configurations I can look at to resolve the issue?
Thanks in advance,
EDIT
A simple example of the query pattern. Once i can identify the A,B,C cluster and the D,E,F cluster I can interview C and D and determine if that relationship is business critical or can be disrupted. In production is seems that clusters are connected to many other clusters, making the query quickly very large.
EDIT - revised quires inline above
EDIT [Resolution] - I have built a php application that recursively performs small scope queries (graph aware driver) and composite the results in a json structure for visualization by d3. But the neo4j browser is more feature rich than my solution, so any suggestions for improvement of the query or infrastructure would be welcome.
Related
I have a host in our cluster with 8 Nvidia K80s and I would like to set it up so that each device can run at most 1 process. Before, if I ran multiple jobs on the host and each use a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3 which I believe makes it so that each device can accept a job from only one CPU process. I then run 2 jobs (each of which only takes about ~150 MB out of 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED, rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This is relying on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime that would, when a device was set to an Exclusive compute mode, automatically select another available device, when one is in use.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
It's suggested, rather than relying on this, that you instead use an external means, such as a job scheduler (e.g. Torque) to manage resources in a situation like this.
I'm building a one-off smart-home data collection box. It's expected to run on a raspberry-pi-class machine (~1G RAM), handling about 200K data points per day (each a 64-bit int). We've been working with vanilla MySQL, but performance is starting to crumble, especially for queries on the number of entries in a given time interval.
As I understand it, this is basically exactly what time-series databases are designed for. If anything, the unusual thing about my situation is that the volume is relatively low, and so is the amount of RAM available.
A quick look at Wikipedia suggests OpenTSDB, InfluxDB, and possibly BlueFlood. OpenTSDB suggests 4G of RAM, though that may be for high-volume settings. InfluxDB actually mentions sensor readings, but I can't find a lot of information on what kind of resources are required.
Okay, so here's my actual question: are there obvious red flags that would make any of these systems inappropriate for the project I describe?
I realize that this is an invitation to flame, so I'm counting on folks to keep it on the bright and helpful side. Many thanks in advance!
InfluxDB should be fine with 1 GB RAM at that volume. Embedded sensors and low-power devices like Raspberry Pi's are definitely a core use case, although we haven't done much testing with the latest betas beyond compiling on ARM.
InfluxDB 0.9.0 was just released, and 0.9.x should be available in our Hosted environment in a few weeks. The low end instances have 1 GB RAM and 1 CPU equivalent, so they are a reasonable proxy for your Pi performance, and the free trial lasts two weeks.
If you have more specific questions, please reach out to us at influxdb#googlegroups.com or support#influxdb.com and we'll see hwo we can help.
Try VictoriaMetrics. It should run on systems with low RAM such as Raspberry Pi. See these instructions on how to build it for ARM.
VictoriaMetrics has the following additional benefits for small systems:
It is easy to configure and maintain since it has zero external dependencies and all the configuration is done via a few command-line flags.
It is optimized for low CPU usage and low persistent storage IO usage.
It compresses data well, so it uses small amounts of persistent storage space comparing to other solutions.
Did you try with OpenTSDB. We are using OpenTSDB for almost 150 houses to collect smart meter data where data is collected every 10 minutes. i.e is a lot of data points in one day. But we haven't tested it in Raspberry pi. For Raspberry pi OpenTSDB might be quite heavy since it needs to run webserver, HBase and Java.
Just for suggestions. You can use Raspberry pi as collecting hub for smart home and send the data from Raspberry pi to server and store all the points in the server. Later in the server you can do whatever you want like aggregation, or performing statistical analysis etc. And then you can send results back to the smart hub.
ATSD supports ARM architecture and can be installed on a Raspberry Pi 2 to store sensor data. Currently, Ubuntu or Debian OS is required. Be sure that the device has at least 1 GB of RAM and an SD card with high write speed (60mb/s or more). The size of the SD card depends on how much data you want to store and for how long, we recommend at least 16GB, you should plan ahead. Backup batter power is also recommended, to protect against crashes and ungraceful shutdowns.
Here you can find an in-depth guide on setting up a temperature/humidity sensor paired with an Arduino device. Using the guide you will be able to stream the sensor data into ATSD using MQTT or TCP protocol. Open-source sketches are included.
i recently have migrated between 2 servers (the newest has lower specs), and it freezes all the time even though there is no load on the server, below are my specs:
HP DL120G5 / Intel Quad-Core Xeon X3210 / 8GB RAM
free -m output:
total used free shared buffers cached
Mem: 7863 7603 260 0 176 5736
-/+ buffers/cache: 1690 6173
Swap: 4094 412 3681
as you can see there is 412 mb ysed in swap while there is almost 80% of the physical ram available
I don't know if this should cause any trouble, but almost no swap was used in my old server so I'm thinking this does not seem right.
i have cPanel license so i contacted their support and they noted that i have high iowait, and yes when i ran sar i noticed sometimes it exceeds 60%, most often it's 20% but sometimes it reaches to 60% or even 70%
i don't really know how to diagnose that, i was suspecting my drive is slow and this might cause the latency so i ran a test using dd and the speed was 250 mb/s so i think the transfer speed is ok plus the hardware is supposed to be brand new.
the high load usually happens when i use gzip or tar to extract files (backup or restore a cpanel account).
one important thing to mention is that top is reporting that mysql is using 100% to 125% of the CPU and sometimes it reaches much more, if i trace the mysql process i keep getting this error continually:
setsockopt(376, SOL_IP, IP_TOS, [8], 4) = -1 EOPNOTSUPP (Operation not supported)
i don't know what that means nor did i get useful information googling it.
i forgot to mention that it's a web hosting server for what it's worth, so it has the standard setup for web hosting (apache,php,mysql .. etc)
so how do i properly diagnose this issue and find the solution, or what might be the possible causes?
As you may have realized by now, the free -m output shows 7603MiB (~7.6GiB) USED, not free.
You're out of memory and it has started swapping which will drastically slow things down. Since most applications are unaware that the virtual memory is now coming from much slower disk, the system may very well appear to "hang" with no feedback describing the problem.
From your description, the first process I'd kill in order to regain control would be the Mysql. If you have ssh/rsh/telnet connectivity to this box from another machine, you may have to login from that in order to get a usable commandline to kill from.
My first thought (hypothesis?) for what's happening is...
MySQL is trying to do something that is not supported as this machine is currently configured. It could be missing a library or an environment variable is not set or any number things.
That operation allocates some memory but is failing and not cleaning up the allocation when it does. If this were a shell script, it could be fixed by putting an event trap command at the beginning that runs a function that releases memory and cleans up.
The code is written to keep retrying on failure so it rapidly uses up all your memory. Refering back to the shell script illustration, the trap function might also prompt to see if you really want to keep retrying.
Not a complete answer but hopefully will help.
I want to stress-test my application this way, because it seems to be failing in some very old client machines.
At first I read a bit about QEmu and thought about hardware emulation, but it seems a long shot. I asked at superuser, but didn't get much feedback (yet).
So I'm turning to you guys... How do you this kind of testing?
I'm not sure about slowing a CPU but if you use a virtual machine, like VMWare, you can control how much RAM is actually used. I run it on a MBP at home with 8GB and my WinXP VM is capped at 1.5 GB RAM.
EDIT: I just checked my version of VMWare and I can control the number of cores it can use. It's definitely not the same as a slower CPU but it might highlight some issues for you.
Since it's not entirely clear that your app is failing because of the old hardware or the old OS, a VM client should allow you to test various versions of OSes rather quickly. It came in handy for me a few years back when I was trying to get a .Net 2.0 app to run on Win98 (it can be done though I don't remember how I got it working...).
Virtual Box is a free virtual machine similar to VMWare. It also has the capacity to reduce available memory. It can restrict how many CPUs are available, but not how fast those CPUs are.
Try cpulimit, most distro includes it (Ubuntu does) http://www.digipedia.pl/man/doc/view/cpulimit.1
If you want to lower the speed of your cpu you can easily do this by modifying a fork bomb program
int main(){
int x=0;
int limit = 10
while( x < limit ){
int pid = fork();
if( pid == 0 )
while( 1 ){}
else
x++;
}
}
This will slow down your computer quite quickly, you may want to change the limit variable to a higher number. I must warn you though this can be dangerous, because if implemented wrong you could fork bomb your system leaving it useless unless you restart it. Read this first if you don't understand what this code will do.
On POSIX (Unix) systems you can apply run limits to processes (that is, to executions of a program). The system call to do this is called setrlimit(), and most shells enable you to use the ulimit built-in to set them from the command-line (plain POSIX ulimit is not very useful). Using these you can run a program with low limits to simulate a smaller computer.
POSIX systems also provide a nice command for running a program at lower CPU priority, which can simulate a slower CPU if you also ensure there is another CPU intensive progam running at the same time.
I think it's pretty unlikely that cpu speed is going to exercise very many bugs; On the other hand, it's much more likely for different cpu features to matter. Many VM implementations provide ways of toggling on and off certain cpu features; qemu in particular permits a high level of control over what's available to the CPU.
Think outside the box. Which application of the ones you use regularly does this?
A debugger of course! But, how can you achieve such a behavior, to emulate a low cpu?
The secret to your question is asm _int 3. This is the assembly "pause me" command that is send from the attached debugger to the application you are debugging.
More about int 3 to this question.
You can use the code from this tool to pause/resume your process continuously. You can add an interval and make that tool pause your application for that amount of time.
The emulated-cpu-speed would be: (YourCPU/Interval) -0.00001% because of the signaling and other processes running on your machine, but it should do the trick.
About the low memory emulation:
You can create a wrapper class that allocates memory for the application and replace each allocation with call to this class. You would be able to set exactly the amount of memory your application can use before it fails to allocate more memory.
Something such as: MyClass* foo = AllocWrapper(new MyClass(arguments or whatever));
Then you can have the AllocWrapper allocating/deallocating memory for you.
On Linux, you can use ulimit as Raedwald said. On Windows, you can use the SetProcessWorkingSetSize system call. But these only set a limit on a per process basis. In reality, parts of the system will start to fail in a stressed environment. I would suggest using the Sysinternals' testlimit tool to stress the entire machine.
See https://serverfault.com/questions/36309/throttle-down-cpu-speed-of-vmware-image
were it is claimed free-as-in-beer VMware vSphere Hypervisorâ„¢ (ESXi) allows you to select the virtual CPU speed on top of setting the memory size of the virtual machine.
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.
The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.