I am trying to build a neural network with CNTK to estimate the age of a person.
Currently I want to try an approach using only one class. So every picture gets label 0 but also an affiliation to the class in percent.
So the net should learn that the probability of a 30 year old person to match class 0 is 30% ... 60yo = 60% ... 93yo = 93%.
Currently I am working on a reduced data set of 50k images (.jpg) and use the MinibatchSourceFromData function.
Since I have a lot more training data available (400k + augmentations) I wanted to load the pictures in chunks for training, due to limited server RAM.
Following THIS CNTK tutorial I have to use the MinibatchSource function and feed a deserializer with a map_file that includes the paths and labels of my training data.
My problem is that the map_file doesn't support class affiliations. I can only define which picture belongs to which class.
Since I am new to CNTK and deep learning in general, I'd like to know if there is another option to read chunked data as well as tell the network how likely it is that the picture corresponds to a specific class.
Best regards.
You can create a composite reader. One deserializer handles your images, another can deserialize your numeric data.
Read this; the last section shows you how to use a composite reader.
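A minimal sketch of what such a composite reader could look like. The file names, stream names, and the 224x224 scaling below are assumptions, not taken from the tutorial: train_map.txt would hold one "<path> <tab> 0" line per image, and train_targets.ctf would hold the matching fractional label per line (e.g. "|target 0.30"), both in the same order.

import cntk.io.transforms as xforms
from cntk.io import MinibatchSource, ImageDeserializer, CTFDeserializer, StreamDef, StreamDefs

# Hypothetical input files: both must list the samples in the same order.
image_map_file = 'train_map.txt'       # "<path>\t0" per line
target_ctf_file = 'train_targets.ctf'  # "|target 0.30" per line

transforms = [xforms.scale(width=224, height=224, channels=3, interpolations='linear')]

image_reader = ImageDeserializer(image_map_file, StreamDefs(
    features=StreamDef(field='image', transforms=transforms),  # the decoded, scaled image
    dummy=StreamDef(field='label', shape=1)))                   # the constant 0 from the map file

target_reader = CTFDeserializer(target_ctf_file, StreamDefs(
    targets=StreamDef(field='target', shape=1)))                # the fractional class affiliation

# The composite MinibatchSource pulls from both deserializers chunk by chunk,
# so the full 400k+ images never have to fit into RAM at once.
mb_source = MinibatchSource([image_reader, target_reader], randomize=True)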
I'm pickling a very large (both in terms of properties and in terms of raw size) class. I've been pickling it with no problem using pickle.dump, until I hit just under 4GB, and now I consistently get 'Memory Error'. I've also tried using json.dump (and I get an 'is not JSON serializable' error). I've also tried Hickle, but I get the same error with Hickle as I do with Pickle.
I can't post all the code here (it's very long), but in essence it's a class that holds a dictionary of values from another class - something like this:
class one:
    def __init__(self):
        self.somedict = {}

    def addItem(self, name, item):
        self.somedict[name] = item

class two:
    def __init__(self):
        self.values = [0] * 100
Where name is a string and item is an instance of class two.
There's a lot more code to it, but this is where the vast majority of things are held. Is there a reliable and ideally fast solution for saving this object to a file and then being able to reload it at a later time? I save it every few thousand iterations (as a backup in case something goes wrong), so I need it to be reasonably quick.
Thanks!
Edit #1:
I've just thought that it might be useful to include some details on my system. I have 64GB of RAM - so I don't think pickling a 3-4GB file should cause this type of issue (although I could be wrong on this!).
You probably checked this one first, but just in case: did you make sure your Python installation is 64-bit? The 3-4GB immediately reminded me of the memory limit of 32-bit applications.
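A quick way to check, as a small sketch (both lines report the pointer width of the running interpreter):

import struct
import sys

print(struct.calcsize("P") * 8)  # 64 on a 64-bit interpreter, 32 on a 32-bit one
print(sys.maxsize > 2**32)       # True only on a 64-bit build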
I found this resource quite useful for analyzing and resolving some of the more common memory related issues with Python.
My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there a more convenient/shorter way?
For most structures there's no ready-made option to find the fragments. However, there's a module in rdkit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles('OCC(O)CO')  # e.g. glycerol
fr_Al_OH(mol)  # returns the number of aliphatic -OH groups (3 for glycerol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available. Some of them would be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement them for your features. But as you already mentioned, the way to do that would be to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches as you proposed yourself.
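As a rough sketch of that do-it-yourself route (the SMARTS patterns below are only illustrative, not vetted group definitions):

from rdkit import Chem

mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol, as in the question

# Illustrative SMARTS for the groups mentioned in the question.
groups = {
    '-OH': '[OX2H]',
    '-CH2': '[CH2]',
    '-CH': '[CH1]',
}

for name, smarts in groups.items():
    pattern = Chem.MolFromSmarts(smarts)
    print(name, len(mol.GetSubstructMatches(pattern)))  # expected: 3, 2, 1 for glycerol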
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.
This issue has been bugging me for some time now. To test it I just installed a fresh Apigility, set up the DB (PDO: mysql) and added a DB-Connected service. The table has 40+ records. When I make a GET collection request the response looks OK (with the default HAL content negotiation). Then I change the content negotiation to JSON. Now when I make a GET collection request my response contains only 10 elements.
So my question is: where do I set/change this limit?
You can set the page size manually, like so:
$paginator = $this->getAlbumTable()->fetchAll(true);
// set the current page to what has been passed in query string, or to 1 if none set
$paginator->setCurrentPageNumber((int) $this->params()->fromQuery('page', 1));
// set the number of items per page to 10
$paginator->setItemCountPerPage(10);
http://framework.zend.com/manual/current/en/tutorials/tutorial.pagination.html
Could you please post the page_size and total_items part at the end of the JSON output?
It looks something like this:
"page_count": 140002,
"page_size": 25,
"total_items": 3500035,
"page": 1
This is not an ideal fix, because it requires you to go into the source code rather than using the page size given in the UI.
The collection class that is auto-generated for you by the DB-Connected style derives from Zend\Paginator\Paginator. This class defines the $defaultItemCountPerPage static protected member, which defaults to 10. That's why you're only getting 10 results. If you open up the auto-generated collection class for your entity and add protected static $defaultItemCountPerPage = 100; to the otherwise empty class, you will see that you now get up to 100 results in the response. You can look at other Paginator class variables and methods that you could override in your derived class to get your desired behavior.
This is not an ideal solution. I'd prefer that the generated code automatically used the same configured page size that the HalJson strategy uses. Maybe I'll contribute a PR to change that. Or maybe I'll just use the HalJson approach; it does seem like the better way to go. You should have some limit on how much data you load from the DB at a time, so you don't end up with an overly long-running query or an overly large collection of data coming back that you have to deal with. And whatever limit you set, what do you do when you hit it? With the simple JSON method, you can't ever get "page 2" of data. So, if you are going to work with a sizeable amount of data, it might be better to use HalJson and then have some logic on the client side to grab pages of data as needed. The returned JSON structure is a little more complicated, but not terribly so.
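As an illustration, a rough client-side sketch of that paging logic in Python (the endpoint, the embedded resource name 'albums', and the use of the requests library are all assumptions):

import requests

url = 'http://localhost/api/albums'  # hypothetical collection endpoint
while url:
    page = requests.get(url, headers={'Accept': 'application/hal+json'}).json()
    for item in page['_embedded']['albums']:  # embedded key mirrors the resource name
        print(item)                           # replace with whatever the client needs to do
    url = page.get('_links', {}).get('next', {}).get('href')  # absent on the last page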
I'm probably in the same spot you are: I'm trying to build a simple little API to play with, and I didn't want the client to have to deal with the other stuff in HalJson. But it's probably better to deal with that complexity and have a smooth way to page through data if you're going to use this with some real set of data. At least, that's the pep talk I'm giving myself right now. :-)
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/comment-page-1/#comment-73511
I am trying to understand NLTK using this link. I cannot understand how the values of feature_probdist and show_most_informative_features are computed.
In particular, when the word "best" does not occur, how is the likelihood computed as 0.077? I have been trying to figure this out for a long time.
That is because the article is explaining code from NLTK's source code but not displaying all of it. The full code is available on NLTK's website (and is also linked to in the article you referenced). These are a variable within a method and a method (respectively) of the NaiveBayesClassifier class within NLTK. This class is of course a Naive Bayes classifier, which is essentially an application of Bayes' theorem with a strong (naive) assumption that each feature is independent.
feature_probdist = "P(fname=fval|label), the probability distribution for feature values, given labels. It is expressed as a dictionary whose keys are (label, fname) pairs and whose values are ProbDistI objects over feature values. I.e., P(fname=fval|label) = feature_probdist[label, fname].prob(fval). If a given (label, fname) is not a key in feature_probdist, then it is assumed that the corresponding P(fname=fval|label) is 0 for all values of fval."
most_informative_features returns "a list of the 'most informative' features used by this classifier. For the purpose of this function, the informativeness of a feature (fname, fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:"
max[ P(fname=fval|label1) / P(fname=fval|label2) ]
Check out the source code for the entire class if this is still unclear, the article's intent was not to break down how NLTK works under the hood in depth, but rather just to give a basic concept of how to use it.
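If it helps, here is a tiny sketch with made-up toy data (not the article's corpus) showing where those values live on a trained classifier; _feature_probdist is an internal NLTK attribute, so treat this as exploratory code only:

import nltk

# Toy featuresets: each dict maps a feature name to a value, paired with a label.
train = [
    ({'contains(best)': True, 'contains(boring)': False}, 'positive'),
    ({'contains(best)': False, 'contains(boring)': True}, 'negative'),
    ({'contains(best)': False, 'contains(boring)': False}, 'positive'),
]

classifier = nltk.NaiveBayesClassifier.train(train)
classifier.show_most_informative_features(5)

# P(contains(best)=True | label='positive'), drawn from the classifier's feature_probdist
print(classifier._feature_probdist[('positive', 'contains(best)')].prob(True))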
I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are assumed to be toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. Although, if the application is going to provide a CSV file, it should provide it in the correct manner from now on.
Question
Would it be valid, in this case, to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a processor. Neither method is more complex to parse using a custom parser, but would Option B void the usefulness of the CSV format with other libraries, frameworks, applications et al.? With Option A, future changes/versions to the data set of an individual event may break the CSV structure (or leave zombie ", ," fields to maintain forwards compatibility); whereas Option B will fail gracefully.
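To make the parsing point concrete, here is a minimal sketch of reading the Option B layout in Python (the file name and the handler table are assumptions):

import csv

# One handler per event type; each knows how many values its event carries.
handlers = {
    'running': lambda v: {'speed': v[0], 'incline': v[1]},
    'sleeping': lambda v: {'snores': int(v[0])},
}

with open('events.csv', newline='') as f:
    for row in csv.reader(f):
        event, *values = [field.strip() for field in row]
        print(event, handlers[event](values))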
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Processing et al., where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e. don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. a bug in one format but not another).
Having multiple formats means more code that needs to be
tested
documented
supported
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.