How to train custom GloVe vector representations using many PDF files? - nltk

I want to train my own custom GloVe representations using many PDF files. How can I do that? And is there any way to use the concepts of POS tagging, dependency parsing, etc.? Can you suggest any link for implementing that?

Your question is too broad to give any tight answers, but of course you can do what you describe.
You'd first look into libraries for extracting plain text from PDFs.
Some word2vec projects have trained word-vectors based on word-tokens that have been extended with POS labels, or on dependency-defined contexts, with potential benefits depending on your goals. See, for example, Levy & Goldberg's paper on dependency-based embeddings:
https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
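A minimal end-to-end sketch, assuming the pypdf, nltk, and gensim packages (file names are hypothetical). Since GloVe itself is usually trained with Stanford's separate C toolkit, gensim's Word2Vec stands in here as the vector trainer; the optional POS-extended tokens follow the idea mentioned above.

```python
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
from pypdf import PdfReader
import nltk
from gensim.models import Word2Vec

def pdf_to_sentences(path):
    """Extract plain text from one PDF and split it into token lists."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]

sentences = []
for path in ["paper1.pdf", "paper2.pdf"]:  # hypothetical file names
    sentences.extend(pdf_to_sentences(path))

# Optionally extend each token with its POS label, so that e.g.
# "record/NN" and "record/VB" receive separate vectors.
tagged = [[f"{w}/{t}" for w, t in nltk.pos_tag(s)] for s in sentences]

model = Word2Vec(sentences=tagged, vector_size=100, window=5, min_count=2)
model.wv.save("custom_vectors.kv")
```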

Related

Best practices to fine-tune a model?

I have a few questions regarding the fine-tuning process.
I'm building an app that is able to recognize data from the following documents:
ID Card
Driving license
Passport
Receipts
All of them use different fonts (especially receipts); it is hard to match exactly the same font, so I will have to train the model on a lot of similar fonts.
So my questions are:
Should I train a separate model for each of the document types for better performance and accuracy, or is it fine to train a single eng model on a bunch of fonts that are similar to the fonts used on these types of documents?
How many pages of training data should I generate per font? By default, I think tesstrain.sh generates around 4k pages.
Do you have any suggestions on how I can generate training data that is closest to the real input data?
How many iterations should be used?
For example, if I'm using some font that has a high error rate and I want to target a 98%-99% accuracy rate.
Also, maybe some of you have experience working with these types of documents and know some common fonts that are used for them?
I know that the MRZ in passports and ID cards uses the OCR-B font, but what about the rest of the document?
Thanks in advance!
Ans 1
You can train a single model to achieve this, but if you want to detect different languages then I think you will need different models.
Ans 2
If you are looking for some datasets, have a look at this Mnist Png Dataset, which has digits as well as alphabets from various computer-based fonts. Here is a link to some starter code that uses the dataset, implemented in PyTorch.
Ans 3
You can use Optuna to find the best set of hyperparameters for your model; for a walkthrough, see:
using-optuna-to-optimize-pytorch-hyperparameters
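A minimal Optuna sketch; `train_and_evaluate` and its hyperparameters are hypothetical stand-ins for your own training-and-scoring routine.

```python
import optuna

def train_and_evaluate(lr, batch_size):
    """Hypothetical stand-in: train a model and return a validation error."""
    return (lr - 0.01) ** 2 + abs(batch_size - 32) / 1000.0  # dummy score

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    return train_and_evaluate(lr=lr, batch_size=batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```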
Have a look at these:
PAN-Card-OCR
document-details-parsing-using-ocr
They are trying to achieve a similar task.
Hope it answers your question!
I would train a classifier on the 4 different types to distinguish an ID, license, passport, or receipt, basically so you know that a passport is a passport vs. a driver's license, etc. Then I would have 4 more models that are used for translating each specific type (passport, driver's license, ID, and receipt); see the sketch after this answer. It should be noted that if you are working with multiple languages, this will likely mean making those 4 models per language, so with L languages you may need 4*L models.
Likely a lot. I don't think the font is really the issue, though. Maybe what you should do is try to define templates for things like driver's licenses and then generate data based on those templates?
This is the least of your problems; just test for this.
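A rough sketch of the two-stage design described above: one classifier picks the document type, then a type-specific model handles the rest. The classifier and readers here are hypothetical stubs standing in for real trained models.

```python
# Hypothetical stubs: swap in your real trained models.
def classify_document(image):
    """Stand-in for a classifier (e.g. a small CNN) over the 4 types."""
    return "passport"

def read_passport(image):
    """Stand-in for the passport-specific extraction model."""
    return {"type": "passport", "fields": {}}

TYPE_READERS = {
    "passport": read_passport,
    # "id_card": ..., "driving_license": ..., "receipt": ...
}

def process(image):
    doc_type = classify_document(image)   # stage 1: which document?
    return TYPE_READERS[doc_type](image)  # stage 2: type-specific model
```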
Assuming you are referring to an ML model used to perform OCR with computer vision, I'd recommend the following:
Set up your taxonomy as required by your application.
This means categorizing the expected font sets per type of scanned document (png, jpg, tiff, etc.) to include in the appropriate dataset. Select the fonts closest to the ones in actual use, as well as the type of information you need to gather (digits only, alphabetic characters).
Perform data cleanup on your dataset and make sure you have homogeneous data for the OCR functionality. For example, all document images should be of png type, with max dimensions of 46x46, to have an appropriate training model. Note that higher-resolution images at a smaller scale mean higher accuracy.
Cater for handwriting as well, if you have damaged or barely visible font images. This might improve character conversion in cases where the fonts on paper are not clearly visible or are worn out.
If you are using the Keras module with TF on the MNIST datasets, set up a cancellation rule that stops model training once you reach 98%-99% accuracy, for more control in case you expect the fonts in your images to be error-prone (as stated above). This helps avoid a higher margin of error when you have bad images in your training dataset. For a dataset of 1000+ images, a good setting would be a TF Dense layer of 256 units and 5 epochs; see the sketch below.
A sample training dataset can be found here.
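A minimal sketch of such a cancellation rule in tf.keras on MNIST, using the Dense(256)/5-epoch setting mentioned above; the 0.98 threshold is the accuracy target from the answer.

```python
import tensorflow as tf

class StopAtAccuracy(tf.keras.callbacks.Callback):
    """Cancel training once training accuracy reaches the target."""
    def __init__(self, target=0.98):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("accuracy", 0.0) >= self.target:
            self.model.stop_training = True

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, callbacks=[StopAtAccuracy(0.98)])
```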
If you just need some automation in your application, or data entry that requires OCR conversion from images, a good open-source solution would be to gather the information automatically via the PSImaging module (PowerShell): use the degrees of confidence retrieved (from png) and run them against your current datasets to improve your character-match accuracy.
You can find the relevant link here

Custom translator - How can I train the machine to recognize the right translation solution (synonyms)?

I'm pretty new with Custom Translator and I'm working on a fashion-related EN_KO project.
There are many cases where a single English term has two possible translations into Korean. An example: if "fastening" is related to "bags, backpacks..." it is 잠금, but if it is related to "clothes, shoes..." it is 여밈.
I'd like to train the machine to recognize these differences. Could it be useful to upload a phrase dictionary? Any ideas? Thanks!
The purpose of training a custom translation system is to teach it how to translate terms in context.
The best way to teach the system how to translate is training with parallel documents of full sentence prose: the same document in two languages. A translation memory extract in a TMX or XLIFF file is the best material, but many other document formats are suitable as well, as long as you have both languages. Have at least 10000 sentences in both languages, upload to http://customtranslator.ai, and build a custom system with it.
If you have documents in Korean that are representative of the terminology and style you want to achieve, without an English match, you can automatically translate those to English, and add to the training material as parallel documents. Be sure to not use the automatically translated documents in the other direction.
A phrase dictionary is of limited help, because it is unaware of context. It is useful only in bootstrapping your custom system or for very rare terms where you cannot find or create a sentence.

What does backbone mean in a neural network?

I am getting confused with the meaning of "backbone" in neural networks, especially in the DeepLabv3+ paper. I did some research and found out that backbone could mean
the feature extraction part of a network
DeepLabv3+ uses Xception and ResNet-101 as backbones. However, I am not familiar with the entire structure of DeepLabv3+, so I am not sure which part the backbone refers to and which parts remain the same.
A generalized description or definition of backbone would also be appreciated.
In my understanding, the "backbone" refers to the feature extracting network which is used within the DeepLab architecture. This feature extractor is used to encode the network's input into a certain feature representation. The DeepLab framework "wraps" functionalities around this feature extractor. By doing so, the feature extractor can be exchanged and a model can be chosen to fit the task at hand in terms of accuracy, efficiency, etc.
In case of DeepLab, the term backbone might refer to models like the ResNet, Xception, MobileNet, etc.
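For concreteness, a small sketch of that exchangeability using torchvision's DeepLabV3 builders (an assumed dependency; `weights=None` needs torchvision >= 0.13, older versions use `pretrained=False`). The DeepLab "wrapper" stays the same while the backbone changes:

```python
import torch
from torchvision.models.segmentation import (
    deeplabv3_resnet50,
    deeplabv3_mobilenet_v3_large,
)

# Same DeepLab head, two different feature-extracting backbones.
for build in (deeplabv3_resnet50, deeplabv3_mobilenet_v3_large):
    model = build(weights=None, num_classes=21).eval()
    with torch.no_grad():
        out = model(torch.randn(1, 3, 224, 224))["out"]
    print(build.__name__, tuple(out.shape))  # (1, 21, 224, 224) for both
```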
TL;DR Backbone is not a universal technical term in deep learning.
(Disclaimer: yes, there may be a specific kind of method, layer, tool etc. that is called "backbone", but there is no "backbone of a neural network" in general.)
If authors use the word "backbone" while describing a neural network architecture, they usually mean one of two things:
feature extraction (the part of the network that "sees" the input), but this interpretation is not quite universal in the field: for instance, in my opinion, computer vision researchers would use the term to mean feature extraction, whereas natural language processing researchers would not.
in informal language, that the part in question is crucial to the overall method.
Backbone is a term used in DeepLab models/papers to refer to the feature extractor network. These feature extractor networks compute features from the input image and then these features are upsampled by a simple decoder module of DeepLab models to generate segmented masks. The authors of DeepLab models have shown performance with different feature extractors (backbones) like MobileNet, ResNet, and Xception network.
CNNs are used for extracting features. Several CNNs are available, for instance AlexNet, VGGNet, and ResNet (backbones). These networks are mainly used for object classification tasks and have been evaluated on widely used benchmarks and datasets such as ImageNet. In image classification or image recognition, the classifier classifies a single object in the image, outputs a single category per image, and gives the probability of it matching a class. In object detection, by contrast, the model must be able to recognize several objects in a single image and provide the coordinates that identify their locations. This shows that object detection can be more difficult than image classification.
Source and more info: https://link.springer.com/chapter/10.1007/978-3-030-51935-3_30

Deep Learning methods for Text Generation (PyTorch)

Greetings to everyone,
I want to design a system that is able to generate stories or poetry based on a large dataset of text, without being needed to feed a text description/start/summary as input at inference time.
So far I have done this using RNNs, but as you know they have a lot of flaws. My question is: what are the best methods to achieve this task at the moment?
I looked into attention mechanisms, but it turns out they are geared toward translation tasks.
I know about GPT-2, BERT, Transformers, etc., but all of them need a text description as input before generation, and this is not what I'm seeking. I want a system able to generate stories from scratch after training.
Thanks a lot!
Edit: the comment was: I want to generate text from scratch, not starting from a given sentence at inference time. I hope that makes sense.
Yes, you can do that; it's just simple code manipulation on top of the ready-made models, be it BERT, GPT-2, or an LSTM-based RNN.
How? You have to provide random input to the model. Such random input can be a randomly chosen word or phrase, or just a vector of zeroes.
Hope it helps.
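For instance, a minimal sketch with GPT-2 via the Hugging Face transformers package (an assumed dependency): the only input is GPT-2's end-of-text token, so sampling starts "from scratch" with no user prompt.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Feed only the <|endoftext|> token instead of a user-supplied prompt.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
output = model.generate(input_ids, max_length=100, do_sample=True,
                        top_k=50, top_p=0.95,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Fine-tune the model on your story or poetry corpus first, and the same snippet will sample in that style.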
You have mixed up several things here.
You can achieve what you want either using LSTM based or transformer based architecture.
When you say you did it with an RNN, you probably mean that you tried an LSTM-based sequence-to-sequence model.
Now, there is attention in your question. You can use attention to improve your RNN, but it is not a required condition. However, if you use the transformer architecture, then attention is built into the transformer blocks.
GPT-2 is nothing but a transformer-based model. Its building block is the transformer architecture.
BERT is another transformer-based architecture.
So to answer your question: you can and should try using an LSTM-based or transformer-based architecture to achieve what you want. Sometimes such an architecture is called GPT-2, sometimes BERT, depending on how it is realized.
I encourage you to read this classic from Karpathy, if you understand it then you have cleared most of your questions:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Affective Demonstratives and POS Tagging

Is there a way to accurately tag affective demonstratives in a corpus? I'm attempting a project using a Twitter corpus, and I need to be able to sort through 200,000+ tweets to pick out the ones with affective demonstratives. I'd rather not do it by hand!
I'm using NLTK and Twython with this whole process if that helps at all.
I don't know of an off-the-shelf solution, but this sounds like a classic NLP classification task. You'll need a sizeable corpus in which you (or someone else) have marked up the "affective demonstratives", and then you'll need to train a classifier and experiment with different features or feature-selection algorithms. Look over the NLTK book for details.
You would probably want to start by using a standard tagger to POS-tag your corpus; then you can use these tags (and anything else you think might be useful) as input features for your classifier.
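A minimal sketch of that pipeline with NLTK's built-in Naive Bayes classifier; the two labelled tweets and the feature choices are purely illustrative assumptions, not a validated feature set.

```python
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
import nltk

DEMONSTRATIVES = {"this", "that", "these", "those"}

def features(tweet):
    """Bag of POS tags plus any demonstratives present in the tweet."""
    feats = {}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
        feats[f"tag({tag})"] = True
        if word.lower() in DEMONSTRATIVES:
            feats[f"dem({word.lower()})"] = True
    return feats

# Hypothetical hand-labelled examples: (tweet, has affective demonstrative).
labelled = [("That man is a genius!", True),
            ("I left that book at home.", False)]
train_set = [(features(t), y) for t, y in labelled]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("This girl never fails to amaze me.")))
```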