Recommended attributes and capitalization for networks in Nexus - igraph

I am glad that igraph 0.6 makes it possible to get network data easily from the Nexus repository. I have stored several networks, and I want to make them as Nexus-conformant as possible.
I have two questions about the attributes of networks stored in the Nexus repository.
Are there recommended attributes for the graph? I have found several attributes: name, Author, Citation, Description (in Newman's networks), URL.
Is there any policy on using capitalized names for special (e.g. recommended) attributes?

Yes, the current policy is that the graph should have the attributes name, Author, Citation and URL, if possible. It may have others as well. As all data sets are currently added by the Nexus authors, they make sure that this policy is met. In the future this will hopefully change, so that people can simply upload their data sets and have them published (after a quick check) automatically.
The rule of thumb for capitalization is that all attributes should be capitalized, except when you want igraph to recognize and treat your attribute specially. Currently the attributes that are treated specially are: name (graph, vertex and edge), type (vertex), weight (edge), and all the graphical attributes like shape, width, label, size, color, etc.
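To illustrate the convention, here is a minimal sketch using the Python interface of igraph; the metadata values are examples following the policy above, not taken from an actual Nexus data set:

import igraph as ig

# A minimal sketch; the metadata strings below are illustrative, not a real Nexus entry.
g = ig.Graph.Famous("Zachary")

# Lowercase "name" is one of the specially treated attributes.
g["name"] = "Zachary karate club"

# Capitalized attributes are plain metadata that igraph does not interpret.
g["Author"] = "W. W. Zachary"
g["Citation"] = "Zachary, W. W. (1977). An information flow model for conflict and fission in small groups."
g["URL"] = "http://example.org/karate"  # hypothetical URL

print(g.attributes())  # ['name', 'Author', 'Citation', 'URL']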


Best practices to fine-tune a model?

I have a few questions regarding the fine-tuning process.
I'm building an app that is able to recognize data from the following documents:
ID Card
Driving license
Passport
Receipts
All of them use different fonts (especially receipts), and it is hard to match the exact font, so I will have to train the model on a lot of similar fonts.
So my questions are:
Should I train a separate model for each of the document types for better performance and accuracy, or is it fine to train a single eng model on a bunch of fonts that are similar to the fonts used in these types of documents?
How many pages of training data should I generate per font? By default, I think tesstrain.sh generates around 4k pages.
Do you have any suggestions on how I can generate training data that is as close as possible to the real input data?
How many iterations should be used?
For example, if I'm using some font that has a high error rate and I want to target a 98-99% accuracy rate.
Also, maybe some of you have experience working with these types of documents and know some common fonts that are used in them?
I know that the MRZ in passports and ID cards uses the OCR-B font, but what about the rest of the document?
Thanks in advance!
Ans 1
You can train a single model to achieve this, but if you want to detect different languages then I think you will need different models.
Ans 2
If you are looking for some datasets, have a look at this Mnist Png Dataset, which has digits as well as letters from various computer fonts. Here is a link to some starter code for using the dataset, implemented in PyTorch.
Ans 3
You can use Optuna to find the best set of hyperparameters for your model; see using-optuna-to-optimize-pytorch-hyperparameters for a starting point.
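For illustration, a minimal sketch of such a study; the train_and_evaluate() helper and the parameter names (learning_rate, batch_size) are made-up placeholders:

import optuna

def train_and_evaluate(learning_rate, batch_size):
    # Stand-in for your real training/evaluation loop; replace with your own code.
    return 1.0 - learning_rate  # dummy score so the sketch runs end to end

def objective(trial):
    # Optuna samples candidate hyperparameter values for each trial.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    return train_and_evaluate(learning_rate, batch_size)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)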
Have a look at these
PAN-Card-OCR
document-details-parsing-using-ocr
They are trying to achieve a similar task.
Hope this answers your question!
I would train a classifier on the 4 different types to classify an ID, license, passport, or receipt, basically so you know that a passport is a passport vs. a driver's license, etc. Then I would have 4 more models that are used for translating each specific type (passport, driver's license, ID, and receipts). It should be noted that if you are working with multiple languages, this will likely mean making 4 models per language, so if you have L languages you may need 4*L models for translating them.
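A minimal sketch of that two-stage pipeline, where classify_document() and the per-type readers stand in for trained models (all names here are hypothetical):

def classify_document(image):
    # A real implementation would run the trained document-type classifier.
    return "passport"

READERS = {
    "id_card": lambda image: "fields read by the ID-card model",
    "license": lambda image: "fields read by the license model",
    "passport": lambda image: "fields read by the passport model",
    "receipt": lambda image: "fields read by the receipt model",
}

def read_document(image):
    # Stage 1: decide which kind of document this is; stage 2: run its reader.
    doc_type = classify_document(image)
    return doc_type, READERS[doc_type](image)

print(read_document(object()))  # ('passport', 'fields read by the passport model')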
Likely a lot. I don't think that font is really the issue. Maybe what you should do is try to define some templates for things like driver's licenses and then generate data based on those templates?
This is the least of your problems, just test for this.
Assuming you are referring to an ML model that will be used to perform OCR with computer vision, I'd recommend the following:
Set up your taxonomy as required by your application.
This means categorizing the expected font sets per type of scanned document (png, jpg, tiff, etc.) to include in the appropriate dataset. Select the fonts closest to the ones being used, as well as the type of information you need to gather (digits only, alphabetic characters).
Perform data cleanup on your dataset and make sure you have homogeneous data for the OCR functionality. For example, all document images should be PNG files with maximum dimensions of 46x46 to get an appropriate training model. Note that higher-resolution images at a smaller scale mean higher accuracy.
Cater for handwriting as well if you have damaged or barely visible font images. This might improve character conversion options in cases where the fonts on paper are not clearly visible or are worn out.
If you are using the Keras module with TF on the provided MNIST datasets, set up a cancellation rule that stops model training once you reach 98-99% accuracy, for more control in case you expect the fonts in your images to be error-prone (as stated above). This helps avoid a higher margin of error when you have bad images in your training dataset. For a dataset of 1000+ images, a good setting would be a TF Dense layer of 256 units and 5 epochs (see the sketch below).
A sample training dataset can be found here.
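Here is a minimal sketch of such a stopping rule, assuming MNIST-style data; the tiny model just mirrors the Dense-256 / 5-epoch settings mentioned above:

import tensorflow as tf

# A Keras callback that stops training once accuracy reaches a threshold.
class StopAtAccuracy(tf.keras.callbacks.Callback):
    def __init__(self, target=0.99):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("accuracy", 0.0) >= self.target:
            self.model.stop_training = True

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, callbacks=[StopAtAccuracy(0.99)])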
If you just need to do some automation with your application, or data entry that requires OCR conversion from images, a good open source solution would be to gather information automatically via the PSImaging module (PowerShell), use the degrees of confidence retrieved (from PNG), and run them against your current datasets to improve your character match accuracy.
You can find the relevant link here

Training Faster R-CNN with multiple objects in an image

I want to train a Faster R-CNN network with my own images to detect faces. I have checked quite a few GitHub libraries, but this is the example of the training file I always find:
/data/imgs/img_001.jpg,837,346,981,456,cow
/data/imgs/img_002.jpg,215,312,279,391,cat
But I can't find an example of how to train with images containing a couple of objects. Should it be:
1) /data/imgs/img_001.jpg,837,346,981,456,cow,215,312,279,391,cow
or
2) /data/imgs/img_001.jpg,837,346,981,456,cow
/data/imgs/img_001.jpg,215,312,279,391,cow
?
I just could not help myself but quote FarCry3 here: "The definition of insanity is doing the same thing over and over and expecting different results."
(Note that this is purely in an entertaining context, and not meant to insult you in any way; I would not take the time to answer your question if I didn't think it worthwhile.)
In your second example, you would feed the exact same input data, but require the network to learn two different outcomes. But, as you already noted, it is not very common for many of the libraries to support multiple labels per image.
Oftentimes, this is done purely for the sake of simplicity, as supporting it requires you to change your metrics to accommodate multiple outputs: instead of having one-hot encoded targets, you could now have multiple "targets".
This is even more challenging in the task of object detection (and not object classification, as described before), since you now have to decide how you represent your targets.
If it is possible at all, I would personally restrict myself to labeling one class per image, or have a look at another image library that does support that, since the effort of rewriting that much code is probably not worth the minute improvement in the results.
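For what it is worth, a common convention in detection datasets is one row per bounding box (the question's second format), with loaders grouping rows by image path before training. A minimal, library-agnostic sketch of that grouping (the file name is hypothetical):

import csv
from collections import defaultdict

def load_annotations(csv_path):
    # Read "path,x1,y1,x2,y2,class" rows and group them by image, so an image
    # with several objects yields one record with several boxes.
    boxes_per_image = defaultdict(list)
    with open(csv_path, newline="") as f:
        for path, x1, y1, x2, y2, cls in csv.reader(f):
            boxes_per_image[path].append(
                {"bbox": (int(x1), int(y1), int(x2), int(y2)), "class": cls}
            )
    return boxes_per_image

annotations = load_annotations("annotations.csv")  # hypothetical file
for image_path, boxes in annotations.items():
    print(image_path, len(boxes), "objects")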

Is there a standard algorithm for locale resolution?

In support of software internationalization, many programming languages and platforms support a means of obtaining localized resources to be used in the UI that is shown to the user (e.g. Java's java.util.ResourceBundle class). Often, if resources for the user's preferred locale are not available, then there is a fallback mechanism, or locale resolution process, that will attempt to locate the nearest-matching resources from the sets of available resources. For example, if resources for en-US are not available, then commonly the system attempts to find resources for en.
The locale resolution process seems nearly the same for many languages' and platforms' resource bundle solutions. Are they following some standard locale resolution algorithm, or, if not, does such a standard exist?
There is apparently RFC 4647, Matching of Language Tags, which describes the syntax of "language-ranges" for specifying the list of a user's preferred languages, as well as the "filtering" and "lookup" mechanisms for comparing and matching language-ranges to RFC 4646 language tags. RFC 4647 describes these mechanisms as:
Filtering produces a (potentially empty) set of language tags, whereas lookup produces a single language tag.
I'm not aware of a standard per se.
However, the algorithm being used is a trivial consequence of the fact that locales are hierarchical. There is a (notional) root locale with no name. Beneath this are language-only locales (en, fr, etc). Beneath those are national locales (en_GB, en_US, etc). Beneath those are, optionally, variant locales (en_GB_Yorkshire, en_GB_cockney, etc - for realistic examples, look at Norway).
The natural way to find an appropriate resource is to start with the lowest, most specific, locale you can, and walk up the tree until you find something. So, starting with en_US_TX, you step up to en_US, then en, then the root.
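As a concrete illustration of that walk (not any platform's actual resolution code), here is a minimal sketch in Python; the root locale is represented by an empty string:

def fallback_chain(locale):
    # e.g. "en_US_TX" -> ["en_US_TX", "en_US", "en", ""]
    parts = locale.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)] + [""]

def resolve(locale, available):
    # Return the most specific locale for which resources exist.
    for candidate in fallback_chain(locale):
        if candidate in available:
            return candidate
    return None

bundles = {"en", "en_US", "fr", ""}   # available resource bundles; "" is the root
print(fallback_chain("en_US_TX"))     # ['en_US_TX', 'en_US', 'en', '']
print(resolve("en_US_TX", bundles))   # 'en_US'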
The CLDR (Unicode Common Locale Data Repository) has a proposed (as of 2015) algorithm based on language distance. Without the distance data this is not a solution by itself, but it is worth watching for a solution in the future.

MediaWiki extension to support taxonomy by genus and species

I'm trying to build a MediaWiki-based website for a very specific purpose. Namely, I would like to create a field guide for a specific group of animals (reptiles and amphibians). Since the people I would want to generate content on the website aren't necessarily techies, I'd like to make things as easy and painless as possible for contributors.
Now, in most groups of animals, taxonomic designations are fluid, and change all the time. As an example, consider the following:
A species used to be called Genus1 species1. It was then called Genus2 species1. As of now, this species has been split into several species, say Genus2 species1, Genus2 species2, Genus2 species3, etc. In the worst case, anything about the nomenclature and classification of the species could change, including, but not limited to, the species being moved, split or merged with any other species.
For users, these changes should be transparent. That is, on typing in http://url_of_wiki/wiki/Genus1_species1, they should automatically be redirected to the lowest taxonomic group (in this case Genus2) that is non-ambiguous. Essentially, if a page is redesignated (moved, split or merged), I would like to automatically create all new pages and redirects required.
I should be able to implement this as an extension quite easily. However, I've read the MediaWiki documentation on extensions, but haven't been able to figure out just what part of MediaWiki it would be best to target.
So, the question is, is this type of extension best implemented as a parser extension, by adding new tags, or a user-interface extension, or a combination of the two (a user-interface extension backed by a parser extension)?
Nice challenging problem! If it were up to me I would solve it in a different way:
- use page level for genera and
- sub-page level for species.
This will automatically take care of renaming since redirects will be made.
Alternatively:
- use page level for species and
- categories for genera.
Then use an if pagename template (see Wikipedia example) to change the category based on the page name.
Or possibly combine these methods.
(See also Wikis and Wikipedia)

How do you handle exceptional cases

This is a common situation, but here is the latest example:
Companies have various contact data (addresses, phone numbers, e-mails, ...). When they create a job ad, they have checkboxes where they choose how they want to be contacted. It is basically descriptive data. A user reading an ad sees something like "You can apply by mail, in person, ...", except if the option is "through web portal" or "by e-mail", because then the appropriate buttons should appear. These options are stored in the database, and the client (the owner of the site, not the company creating the ad) can change them (e.g. they can add "by telepathy" or whatever), yet if they tamper with the "e-mail" and "web portal" options, they break their web site.
So how should I handle data where everything behaves the same way, except "this thing" that behaves one way and "that thing" that behaves some other way, while the data itself is live and should be editable by the client?
You've tagged your question as "language-agnostic", and not all languages cleanly support polymorphism, but that's the way I would approach this.
Each option has some type, and different types require different properties to be set. However, every type supports some sort of "render" method that can display the contact method as needed. Since the properties (phone number, or web address, etc.) are type-specific, you can validate the administrator's input when creating these "objects", to make sure that the necessary data is provided and valid. Since you implement the render method, rather than spitting out HTML provided by a user, you can ensure that the rendered page is correct. It's less flexible, but safer and more user friendly.
In the database, you can have one sparsely populated table that holds data for all types of contacts, or a "parent" table with common properties and sub-tables with type-specific properties. It depends on how many types you have and how different they are. In either case, you would have some sort of type indicator, so that you know the type of object to which the data should be bound.
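A minimal sketch of that idea in Python (the class and field names are invented for illustration, not taken from your code base):

from abc import ABC, abstractmethod

class ContactMethod(ABC):
    @abstractmethod
    def render(self) -> str:
        """Return the markup/text shown to the applicant."""

class MailContact(ContactMethod):
    def __init__(self, address: str):
        self.address = address

    def render(self) -> str:
        return f"Apply by mail: {self.address}"

class EmailContact(ContactMethod):
    def __init__(self, email: str):
        self.email = email

    def render(self) -> str:
        # The special-cased option renders a button instead of plain text.
        return f'<button data-mailto="{self.email}">Apply by e-mail</button>'

methods = [MailContact("10 Main St."), EmailContact("jobs@example.com")]
print("\n".join(m.render() for m in methods))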
First of all, think twice about whether you really need it. The reason is simple: you are supposed to serve a specific need, and the input data is a means to provide that service. If the data does not fit the existing service, then what is its value, and who is the consumer of that specific information?
There are two possible answers: you are expanding your client base, or you need to change the existing service because demand has changed. In both cases you need to start from the development of the business model. If you describe what service you need and what information it should provide, you will avoid much of the specific data and come up with clear requirements that are easy to implement in software.
I'd recommend the resolution pattern for this, based on the mention of a database. The link above describes it, but it's actually a lot simpler than it sounds. You write a database query that returns all the possible options (for example, you read the standard options and the customized options together using perhaps a UNION or a JOIN depending on your schema) - the COALESCE SQL keyword is then useful to find the first 'resolution' of the option value that isn't NULL.
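A minimal sketch of that resolution with sqlite3; the schema (standard_options / client_options) is invented purely for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE standard_options (name TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE client_options   (name TEXT PRIMARY KEY, label TEXT);
    INSERT INTO standard_options VALUES ('email', 'Apply by e-mail'),
                                        ('web',   'Apply via web portal');
    INSERT INTO client_options   VALUES ('email', 'Send us an e-mail!');
""")

# COALESCE picks the client's customized label when present, otherwise the default.
rows = conn.execute("""
    SELECT s.name, COALESCE(c.label, s.label) AS label
    FROM standard_options AS s
    LEFT JOIN client_options AS c ON c.name = s.name
    ORDER BY s.name
""").fetchall()
print(rows)  # [('email', 'Send us an e-mail!'), ('web', 'Apply via web portal')]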
Well, if all it is is that you have two options that are special, and then anything else is dealt with in the same way, then store your options as strings, and if either of the two special ones appears in that list, then show the appropriate stuff for that special item.
Just check your list of items for the two special ones. Nothing fancy.
By writing a very simple rules engine. You can use an out-of-the-box implementation, or you can roll your own. Since your case seems so simple, I tend to roll my own, because it means fewer dependencies (YMMV).
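A hand-rolled version can be as small as a dictionary of rules; here is a minimal sketch (the option keys and render callables are made up for illustration):

RULES = {
    "email": lambda value: f'<button data-mailto="{value}">Apply by e-mail</button>',
    "web": lambda value: f'<a class="button" href="{value}">Apply online</a>',
}

def render_option(name, value):
    # Anything without a special rule falls back to plain descriptive text.
    rule = RULES.get(name, lambda v: f"You can apply {v}")
    return rule(value)

print(render_option("email", "jobs@example.com"))
print(render_option("in_person", "in person at 10 Main St."))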