Using Pickle to serialise large objects - what causes 'Memory error' - json

I'm pickling a very large class (large both in number of properties and in raw size). I've been pickling it with pickle.dump without problems until it hit just under 4GB, and now I consistently get 'Memory Error'. I've also tried json.dump (which fails with an 'is not JSON serializable' error) and Hickle, which gives the same error as Pickle.
I can't post all the code here (it's very long), but in essence it's a class that holds a dictionary of values from another class, something like this:
class one:
    def __init__(self):
        self.somedict = {}
    def addItem(self, name, item):
        self.somedict[name] = item
class two:
    def __init__(self):
        self.values = [0]*100
Where name is a string and item is an instance of class two.
There's a lot more code to it, but this is where the vast majority of the data is held. Is there a reliable and ideally fast way to save this object to a file and reload it later? I save it every few thousand iterations (as a backup in case something goes wrong), so it needs to be reasonably quick.
Thanks!
Edit #1:
I've just thought that it might be useful to include some details on my system. I have 64GB of RAM, so I don't think pickling a 3-4GB file should cause this type of issue (although I could be wrong on this!).

You probably checked this first, but just in case: did you make sure your Python installation is 64-bit? The 3-4GB figure immediately reminded me of the memory limit of 32-bit applications.
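One quick way to check this from inside the interpreter is sketched below. The helper name interpreter_bits is made up for illustration; struct.calcsize("P") and sys.maxsize are standard library features.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


# A second, independent check: sys.maxsize exceeds 2**32 only on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print("Python build: %d-bit (sys.maxsize check says 64-bit: %s)"
      % (interpreter_bits(), is_64bit))
```

If this reports 32-bit, the process is limited to a few GB of address space no matter how much RAM the machine has.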
I found this resource quite useful for analyzing and resolving some of the more common memory related issues with Python.

How to add to / amend / consolidate JRuby Profiler data?

Say I have inside my JRuby program the following loop:
loop do
  x = foo()
  break if x
  bar()
end
and I want to collect profiling information just for the invocations of bar. How to do this? I got so far:
pd = []
loop do
  x = foo()
  break if x
  pd << JRuby::Profiler.profile { bar() }
end
This leaves me with an array pd of profile data objects, one for each invocation of bar. Is there a way to create a "summary" data object, by combining all the pd elements? Or even better, have a single object, where profile would just add to the existing profiling information?
I searched for documentation of the JRuby::Profiler API, but couldn't find anything except a few simple examples, none of them covering my case.
UPDATE : Here is another attempt I tried, which does not work either.
Since the profile method initially clears the profile data inside the Profiler, I tried to separate the profiling steps from the data initializing steps, like this:
JRuby::Profiler.clear
loop do
  x = foo()
  break if x
  JRuby::Profiler.send(:current_thread_context).start_profiling
  bar()
  JRuby::Profiler.send(:current_thread_context).stop_profiling
end
profile_data = JRuby::Profiler.send(:profile_data)
This seems to work at first, but after investigation, I found that profile_data then contains the profiling information from the last (most recent) execution of bar, not of all executions collected together.
I figured out a solution, though I have the feeling that I'm using a ton of undocumented features to get it working. I should add that I am using JRuby 1.7.27, so later JRuby versions might need a different approach.
The problem with profiling is that start_profiling (corresponding to the Java method startProfiling in the class Java::OrgJrubyRuntime::ThreadContext) not only turns on the profiling flag but also allocates a fresh ProfileData object. What we want is to reuse the old object. stop_profiling, on the other hand, only toggles the profiling switch and is harmless.
Unfortunately, ThreadContext does not provide a method to manipulate the isProfiling toggle, so as a first step, we have to add one:
class Java::OrgJrubyRuntime::ThreadContext
  field_writer :isProfiling
end
With this, we can set/reset the internal isProfiling switch. Now my loop becomes:
context = JRuby::Profiler.send(:current_thread_context)
JRuby::Profiler.clear
profile_data_is_allocated = nil
loop do
  x = foo()
  break if x
  # The first time, we allocate the profile data
  profile_data_is_allocated ||= context.start_profiling
  context.isProfiling = true
  bar()
  context.isProfiling = false
end
profile_data = JRuby::Profiler.send(:profile_data)
In this solution, I tried to stay as close as possible to the capabilities of the JRuby::Profiler class, but the only public method still used is clear. Basically, I have reimplemented profiling in terms of the ThreadContext class, so if someone comes up with a better way to solve it, I would greatly appreciate it.

CNTK load pictures with class affiliation in percent

I am trying to build a neural network with CNTK to estimate the age of a person.
Currently I want to try an approach using only one class. So every picture gets label 0 but also an affiliation to the class in percent.
So the net should learn that the probability of a 30 year old person to match class 0 is 30% ... 60yo = 60% ... 93yo = 93%.
Currently I am working on a reduced data set of 50k images (.jpg) and use the MinibatchSourceFromData function.
Since I have a lot more training data available (400k + augmentations), I want to load the pictures in chunks for training because server RAM is limited.
Following the CNTK tutorial, I have to use the MinibatchSource function and feed a deserializer with a map_file that contains the paths and labels of my training data.
My problem is that the map_file doesn't support class affiliations. I can only define which picture belongs to which class.
Since I am new to CNTK and deep learning in general, I'd like to know if there is another option to read chunked data as well as tell the network how likely it is that the picture corresponds to a specific class.
Best regards.
You can create a composite reader. One deserializer reads your images, another reads your numeric data.
Read this, the last section shows you how to use a composite reader
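As a rough sketch of the data preparation side only: the helper below writes the two input files such a composite reader would consume, an image map file (path, tab, class label) and a CTF text file holding the age as a fraction. The function name write_reader_files, the stream name "age" and the file names are invented for illustration, and it assumes the two deserializers pair their entries line by line, so check that against the CNTK reader documentation before relying on it.

```python
def write_reader_files(samples, map_path, ctf_path, stream_name="age"):
    """Write line-aligned input files for an image reader and a CTF reader.

    samples is an iterable of (image_path, age_in_years) pairs.
    stream_name is a hypothetical CTF field name chosen for this sketch.
    """
    with open(map_path, "w") as map_file, open(ctf_path, "w") as ctf_file:
        for image_path, age in samples:
            # Every picture gets the single class label 0, as in the question.
            map_file.write("%s\t0\n" % image_path)
            # The "affiliation in percent" goes into the numeric stream,
            # e.g. 30 years -> 0.30.
            ctf_file.write("|%s %.2f\n" % (stream_name, age / 100.0))
```

Calling write_reader_files([("img/a.jpg", 30), ("img/b.jpg", 93)], "train_map.txt", "train_age.ctf") would produce one line per picture in each file, in the same order.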

Using Play Framework and case class with greater than 22 parameters

I have seen some of the other issues involving the infamous "22 fields/parameters" issue that is an inherent bug (feature?) of Scala V < 2.11. See here and here. However, as per this blog post, it appears that the 22 parameter limit in case class has been fixed; at least where the language is concerned.
I have a case class that I want to load an arbitrary (Read: > 22) number of values into which will later be read into a JSON object using the Play library.
It looks something like this:
object L {
  import play.api.libs.json.Reads._
  import play.api.libs.functional.syntax._
  implicit val responseRead: Reads[L] = (
    MyField1.jsPath.read[MyField1.t] and
    MyField2.jsPath.read[MyField2.t] and
    ...
    MyField35.jsPath.read[MyField35.t]
  )(L.apply _)
}
case class L(myField1: MyField1.t, myField2: MyField2.t, ... myField35: MyField35.t)
The issue is that on compile, Scala complains that there are more than 22 parameters in the case class. (Specifically: on the last line of the object definition, when the compiler attempts to build, I get: "implementation restricts functions to 22 parameters".) I'm currently using Scala v2.11.6, so I think it's not a language issue. That makes me think that the Play library hasn't updated their implementation of Read.
If that's the case, then I guess the best bet is to group related fields into Tuples and pass the Tuples in through the JSON API?
As mentioned in the blog post you referenced, the 22-parameter limit is still in effect for functions in Scala 2.11 and later, so what you've encountered is a language issue. The function call in this case is:
L.apply _
Restructuring your model is one way to deal with this limit.
So the answer to this question is actually two parts:
1. Workaround
I'll call this the "workaround" because while it does "work" it usually addresses the symptom and not the problem.
My solution was to use shapeless to provide generic heterogeneous lists of arbitrary length. This solution is already widely discussed and available elsewhere. See, e.g., (1) [SO Post] How to get around the Scala case class limit of 22 fields?; (2) Blog post; (3) Yet another blog post.
2. Solution
As @jeffrey-chung mentions, the real solution is to restructure the model to deal with this limit. As many in the industry have noted, having a function with more than 30 arguments is likely a sign that your function is doing too much or that it should be refactored to take a smaller number of arguments. See, e.g., (1) Rule of 30 – When is a method, class or subsystem too big?; (2) Databricks' style guide.
See answer here
https://stackoverflow.com/a/57317220/1606452
It seems this handles it all nicely.
+22 field case class formatter and more for play-json
https://github.com/xdotai/play-json-extensions
Supports Scala 2.11.x, 2.12.x, and 2.13.x and play 2.3, 2.4, 2.5 and 2.7
And is referenced in the play-json issue as the preferred solution (but not yet merged)

Status of in-place `rfft` and `irfft` in Julia

So I'm doing some hobby-related stuff which involves taking Fourier transforms of large real arrays which barely fit in memory, and was curious to see if there was an in-place version of rfft and irfft that saved RAM, since RAM consumption is important to me. These transforms are possible despite the input-vs-output-type mismatch, and require an extra row of padding.
In Implement in-place rfft! and irfft!, Tim Holy said he was working on an in-place rfft! and irfft! that made use of a buffer-containing RCpair object, but then Steven Johnson said that he was implementing something equivalent using A_mul_B!(y, plan, x), which he elaborated on here.
Things get a little weird from then on. In the documentation for both 0.3.0 and 0.4.0 there is no mention of A_mul_B!, although A_mul_B is listed. But when I try entering them into Julia, I get
A_mul_B!
A_mul_B! (generic function with 28 methods)
A_mul_B
ERROR: A_mul_B not defined
which suggests that the situation is actually the opposite of what the documentation currently describes.
So since A_mul_B! seems to exist, but isn't documented anywhere, I tried to guess how to test it in-place as follows:
A = rand(Float32, 10, 10);
p = plan_rfft(A);
A_mul_B!(A,p,A)
which resulted in
ERROR: `A_mul_B!` has no method matching A_mul_B!(::Array{Float32,2}, ::Function, ::Array{Float32,2})
So...
Are in-place real FFTs still a work in progress? Or am I using A_mul_B! wrong?
Is there a mismatch between the 0.3.0 documentation and 0.3.0's function library?
That pull request from Steven Johnson is listed as open, not merged; that means the work hasn't been finished yet. The one from me is closed, but if you want the code you can grab it by clicking on the commits.
The docs indeed omit mention of A_mul_B!. A_mul_B is equivalent to A*B, and so isn't exported independently now. A_mul_B! would be used like this: instead of C = A*B, you could say A_mul_B!(C, A, B).
Can you please edit the docs to fix these issues? (You can edit files here in your web browser.)

Python practices: Is there a better way to check constructor parameters?

I find myself trying to convert constructor parameters to their right types very often in my Python programs. So far I've been using code similar to this, so I don't have to repeat the exception arguments:
class ClassWithThreads(object):
    def __init__(self, num_threads):
        try:
            self.num_threads = int(num_threads)
            if self.num_threads <= 0:
                raise ValueError()
        except ValueError:
            raise ValueError("invalid thread count")
Is this a good practice? Should I just not bother catching exceptions on conversion and let them propagate to the caller, with the possible disadvantage of having less meaningful and consistent error messages?
When I have a question like this, I go hunting in the standard library for code that I can model my code after. multiprocessing/pool.py has a class somewhat close to yours:
class Pool(object):
    def __init__(self, processes=None, initializer=None, initargs=(),
                 maxtasksperchild=None):
        ...
        if processes is None:
            try:
                processes = cpu_count()
            except NotImplementedError:
                processes = 1
        if processes < 1:
            raise ValueError("Number of processes must be at least 1")
        if initializer is not None and not hasattr(initializer, '__call__'):
            raise TypeError('initializer must be a callable')
Notice that it does not say
processes = int(processes)
It just assumes you sent it an integer, not a float or a string, or whatever.
It should be pretty obvious, but if you feel it is not, I think it suffices to just document it.
It does raise ValueError if processes < 1, and it does check that initializer, when given, is callable.
So, if we take multiprocessing.Pool as a model, your class should look like this:
class ClassWithThreads(object):
    def __init__(self, num_threads):
        self.num_threads = num_threads
        if self.num_threads < 1:
            raise ValueError('Number of threads must be at least 1')
Wouldn't this approach possibly fail very unpredictably for some
conditions?
I think preemptive type checking generally goes against the grain of Python's
(dynamic-, duck-typing) design philosophy.
Duck typing gives Python programmers opportunities for great expressive power,
and rapid code development but (some might say) is dangerous because it makes no
attempt to catch type errors.
Some argue that logical errors are far more serious and frequent than type
errors. You need unit tests to catch those more serious errors. So even if you
do do preemptive type checking, it does not add much protection.
This debate lies in the realm of opinions, not facts, so it is not a resolvable argument. On which side of the fence
you sit may depend on your experience, your judgment on the likelihood of type
errors. It may be biased by what languages you already know. It may depend on
your problem domain.
You just have to decide for yourself.
PS. In a statically typed language, the type checks can be done at compile-time, thus not impeding the speed of the program. In Python, the type checks have to occur at run-time. This will slow the program down a bit, and maybe a lot if the checking occurs in a loop. As the program grows, so will the number of type checks. And unfortunately, many of those checks may be redundant. So if you really believe you need type checking, you probably should be using a statically-typed language.
PPS. There are decorators for type checking for (Python 2) and (Python 3). This would separate the type checking code from the rest of the function, and allow you to more easily turn off type checking in the future if you so choose.
You could use a type checking decorator like this activestate recipe or this other one for python 3. They allow you to write code something like this:
@require("x", int, float)
@require("y", float)
def foo(x, y):
    return x+y
that will raise an exception if the arguments are not of the required type. You could easily extend the decorators to check that the arguments have valid values as well.
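For a sense of how such a decorator can work, here is a minimal sketch for Python 3. It is not the code from either linked recipe; the implementation of require below is my own assumption, written only to match the usage shown above.

```python
import functools
import inspect


def require(arg_name, *allowed_types):
    """Illustrative decorator: raise TypeError unless the argument called
    arg_name is an instance of one of allowed_types."""
    def decorator(func):
        # functools.wraps sets __wrapped__, so stacked decorators still
        # see the signature of the original function here.
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            if arg_name in bound.arguments:
                value = bound.arguments[arg_name]
                if not isinstance(value, allowed_types):
                    raise TypeError(
                        "%s must be one of %s, got %s"
                        % (arg_name,
                           [t.__name__ for t in allowed_types],
                           type(value).__name__))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@require("x", int, float)
@require("y", float)
def foo(x, y):
    return x + y
```

With this sketch, foo(1, 2.0) returns normally, while foo("a", 2.0) and foo(1, 2) both raise TypeError (the second because y is an int, not a float).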
This is subjective, but here's a counter-argument:
>>> obj = ClassWithThreads("potato")
ValueError: invalid thread count
Wait, what? That should be a TypeError. I would do this:
if not isinstance(num_threads, int):
    raise TypeError("num_threads must be an integer")
if num_threads <= 0:
    raise ValueError("num_threads must be positive")
Okay, so this violates "duck typing" principles. But I wouldn't use duck typing for primitive objects like int.
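Dropped into the constructor from the question, those two checks might look like the sketch below. The explicit rejection of bool is an extra assumption of mine rather than part of the answer above: bool is a subclass of int, so True would otherwise pass as a thread count of 1.

```python
class ClassWithThreads(object):
    def __init__(self, num_threads):
        # Wrong type: report it as a TypeError, not a ValueError.
        # bool is rejected explicitly (an added assumption, see above).
        if isinstance(num_threads, bool) or not isinstance(num_threads, int):
            raise TypeError("num_threads must be an integer")
        # Right type but unusable value: this is the ValueError case.
        if num_threads <= 0:
            raise ValueError("num_threads must be positive")
        self.num_threads = num_threads
```

Here ClassWithThreads("potato") raises TypeError and ClassWithThreads(0) raises ValueError, so the caller can tell the two kinds of mistake apart.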