How to use BERT for very long, non-homogeneous text classification - deep-learning

First, let me explain what I mean by homogeneous and non-homogeneous text. I will give an example to describe the two.
Let's say we have two news columns in a newspaper, one about sports and another about technology. These two are examples of homogeneous text. Why? Because no matter which portion you take from either column, it will tell you something about that subject (e.g. if you take a chunk from the sports column it will tell you something about sports, and if you take a chunk from the technology column it will tell you something about technology).
In contrast, if different portions of the text describe different topics, then the text is non-homogeneous.
In my case, I have a dataset of very long texts with two classes. The two classes are very similar; the only difference is that one class is identified by very few terms or keywords relative to the length of the text (e.g. in a text of 24,000 words, there may be only 5-10 words that put the text in that class).
Now, I am taking small chunks of the text, feeding them into BERT one by one, and finally using an LSTM layer to combine the chunk representations and predict the output class.
But as the text is very long and also non-homogeneous, the model is not performing well; it only reaches 64% accuracy. (The same model works well on a homogeneous dataset: on the 20_newsgroup dataset it reaches 80% accuracy.)
So, how should I use BERT on this type of dataset? Or should I try something other than BERT?
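For reference, a minimal sketch of the chunk-BERT-then-LSTM setup described above (the model name 'bert-base-uncased', the chunking details and the hidden sizes here are assumptions, not my exact code):
import torch
import torch.nn as nn
from transformers import BertModel

class ChunkBertLstm(nn.Module):
    def __init__(self, n_classes=2, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')   # assumed checkpoint
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, n_chunks, chunk_len)
        b, n, l = input_ids.shape
        out = self.bert(input_ids.view(b * n, l), attention_mask=attention_mask.view(b * n, l))
        cls = out.last_hidden_state[:, 0, :].view(b, n, -1)   # one [CLS] vector per chunk
        _, (h, _) = self.lstm(cls)                            # combine the chunks sequentially
        return self.classifier(h[-1])                         # class logits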

Related

Anchor Boxes in YOLO: How are they decided

I have gone through a couple of YOLO tutorials but I am finding it somewhat hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells and it stated that each cell predicts 5 anchor boxes (bigger than the cell itself; here's my first problem, because it also says the cell first detects what object is present in it before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? Also, it's said that each cell classifies before predicting its anchor boxes; how can a small cell classify the right object in it without querying neighbouring cells if only a small part of the object falls within the cell?
E.g. say one of the 13x13 cells contains only the white pocket of a man wearing a T-shirt; how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when trying to localize a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what's going on everywhere in the image before deciding where the box should be.
PS: What I currently think is that YOLO basically assigns each cell predetermined anchor boxes with a classifier on each, and the boxes with the highest scores for each class are then selected, but I am sure it doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking @craq's answer as correct, because that's how anchor boxes are decided according to the YOLO v2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from in front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represent your dataset. k=5 for YOLOv2, but there are different numbers of anchors for each YOLO version.
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
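A rough sketch of that k-means step, using 1 - IoU between box shapes (width/height only) as the distance metric, might look like this (illustrative only; the actual YOLOv2 code differs in its details):
import numpy as np

def wh_iou(wh, anchors):
    # IoU between boxes described only by (width, height), as if centred at the origin
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100):
    anchors = wh[np.random.choice(len(wh), k, replace=False)]      # random initial anchors
    for _ in range(iters):
        assign = wh_iou(wh, anchors).argmax(axis=1)                # "closest" anchor = highest IoU
        anchors = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                            for i in range(k)])
    return anchors                                                  # k representative (width, height) pairs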
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
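For completeness, the non-max-suppression step mentioned above is roughly the following (a simplified sketch, not YOLO's exact implementation):
def box_iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / ((a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # keep the highest-scoring box, drop remaining boxes that overlap it too much, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_thresh]
    return keep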
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to be, in order to detect large objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting confidence close to zero. The final loss function is a combination from all anchors. Since YOLO tries to minimise its overall loss function, the anchor closest to ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
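One way to picture that anchor choice is to compare box shapes only, as a simplification of the details:
import numpy as np

def best_anchor(gt_w, gt_h, anchors):
    # anchors: list of (w, h) priors; return the index of the one whose shape best matches
    # the ground-truth box. All other anchors at that location are treated as background.
    ious = []
    for aw, ah in anchors:
        inter = min(gt_w, aw) * min(gt_h, ah)
        ious.append(inter / (gt_w * gt_h + aw * ah - inter))
    return int(np.argmax(ious))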
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+1+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+1+nbOfClasses): the (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), the (h, w)-deviation of the bounding box (deviation from the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector indicating a probability for each class.
If you want more information about how the anchor priors are determined, you can take a look at the original paper on arXiv: https://arxiv.org/pdf/1612.08242.pdf.
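To make the shape concrete, decoding such an output tensor follows the YOLOv2 formulas roughly like this (a sketch; the variable names are mine):
import numpy as np

def decode(pred, anchors):
    # pred: raw network output of shape (13, 13, 5, 5 + nbOfClasses)
    # anchors: list of 5 (width, height) priors, in grid-cell units
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    S, _, A, _ = pred.shape
    boxes = np.zeros((S, S, A, 4))
    for cy in range(S):
        for cx in range(S):
            for a, (pw, ph) in enumerate(anchors):
                tx, ty, tw, th = pred[cy, cx, a, :4]
                boxes[cy, cx, a] = [cx + sigmoid(tx),   # centre x, in grid units
                                    cy + sigmoid(ty),   # centre y, in grid units
                                    pw * np.exp(tw),    # width from the anchor prior
                                    ph * np.exp(th)]    # height from the anchor prior
    objectness = sigmoid(pred[..., 4])                  # confidence that the box contains an object
    class_scores = pred[..., 5:]                        # softmaxed over classes in YOLOv2
    return boxes, objectness, class_scores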

Having issues with getting 1 out of 2 legends to split into two columns when they are both formatted the same? arcmap 10.4

Although I have been able to split one of the legends into two columns on my map, the other one is proving problematic. I cannot see any differences between the properties of the two legends in terms of the item column settings.
First of all, convert the legend to graphics (by right-clicking it and selecting Convert To Graphics). Follow this up by ungrouping the graphic into its individual elements. You can then move the items across into a second column manually.

ssrs keep certain "chunks" of text "atomic" within a report header

I have some text I want to keep "atomic" within a report header. The text in question is "Period Ending Date: 1/15/1998". It is OK if that block of text wraps down to the next line, but I would like to keep the whole block together on one line. This has to be dynamic, however, as the text will grow and shrink dynamically: sometimes the server, database and company name will be short and everything fits on one row; sometimes they will be long and even the database name will need to wrap.
And this is how I have it defined in the expression
Is keeping it "atomic" possible?
As the comments say, you basically have two options. Either make the text box wide enough to fit any possible combination of company, server, database and end date that could occur (you can use max(len([your fields])) in SQL to determine the maximum possible number of characters and then work out the width from your font information), or put the separate chunks of information into separate text boxes and arrange them however is most aesthetically pleasing to you. Personally I'd have the company name on one line, the server and database beneath it, and the end date beneath that. Up to your preference, though, obviously.
@Jeff.Clark, I guess you need to rethink the design. I agree with @Viking's comment. As far as I know, what you are trying to achieve is not possible in SSRS while keeping Can Grow = False and wrapping at the field level rather than the word level. I tried using a placeholder, but it still splits the field.
However, if your requirements are critical and you have to do it this way, then I think you can achieve it by determining the maximum number of characters the cell can accommodate on one line, subtracting the sum of the lengths of the fields plus the static text (e.g. "Totals per payroll period"), and finding a position in your string at which to insert VBCRLF so the rest of the data goes to the next line. Without the original expression I cannot provide the exact updated expression, but it will be nested IIFs to work out the position of the VBCRLF. I personally do not prefer this method because it requires a lot of processing at the report design level, can affect the overall performance of the report, and is not very pretty when it comes to maintenance.
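To illustrate the idea (in Python rather than an SSRS expression, and with made-up field names), the decision the nested IIFs would encode is essentially:
def build_header(company, server, database, period_end, max_chars_per_line=120):
    # keep "Period Ending Date: ..." atomic: if the whole chunk does not fit on the first
    # line, push it to the next line (VBCRLF in SSRS) instead of letting it wrap mid-chunk
    prefix = company + "  " + server + "  " + database + "  "
    period = "Period Ending Date: " + period_end
    if len(prefix) + len(period) <= max_chars_per_line:
        return prefix + period
    return prefix.rstrip() + "\n" + period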

Rapidminer Classification

I am trying to solve a simple classification problem where the label has 12 different levels and I need to classify each example into one of these 12. However, I want my output to look like the image here:
http://i.stack.imgur.com/49USG.png
Here, assuming I set a confidence threshold of 20%, I want the output to contain, for each id, all the labels that are above 20%, ordered with the highest confidence first. If none of the labels are above 20%, then a default label.
More specifically, are there any existing operators in Rapidminer which could give such an output?
Whenever the Apply Model operator runs, it produces new special attributes corresponding to confidences for the individual values of the label attribute. So if the label has values one, two, three, three new attributes will be created: confidence(one), confidence(two), confidence(three). It would be possible to use the Generate Attributes operator to work out some logic to decide how to really classify each example. It would also be possible to use the Apply Threshold operator (with Create Threshold) to do something similar. It's impossible to give any more guidance unless you post a representative example with data.
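To make the intended output concrete, the post-processing logic amounts to something like this (sketched in Python with assumed names, outside of RapidMiner):
def ranked_labels(confidences, threshold=0.20, default='no_confident_label'):
    # confidences: dict built from the confidence(...) attributes, e.g. {'one': 0.35, 'two': 0.10}
    above = [(label, c) for label, c in confidences.items() if c >= threshold]
    if not above:
        return [default]
    return [label for label, _ in sorted(above, key=lambda x: x[1], reverse=True)]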

Tesseract OCR text order for documents with tables or rows

I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it is highly effective, but I am having issues with the order in which the text is read. Documents with tabular data seem to be read down, column by column, when the more natural way would be row by row. A very small-scale example would be:
This is column A, row 1 This is column B, row 1 This is column C, row 1
This is column A, row 2 This is column B, row 2 This is column C, row 2
Is yielding the following text:
This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2
I am starting to read the documentation and take a guess-and-test, brute-force approach with the parameters documented here, but if someone has already tackled a similar issue, I would appreciate insight on the fix. It could also be a matter of training data, but I do not know exactly how that works.
Try running tesseract in one of the single column Page Segmentation Modes:
tesseract input.tif output-filename --psm 6
By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode using the --psm argument. Note that adding a white border to text which is too tightly cropped may also help; see issue 398.
To see a complete list of supported page segmentation modes, use tesseract -h. Here's the [ed: excerpt only] list as of 3.21:
3    Fully automatic page segmentation, but no OSD. (Default)
4    Assume a single column of text of variable sizes.
5    Assume a single uniform block of vertically aligned text.
6    Assume a single uniform block of text.
See examples here: #using-different-page-segmentation-modes
I know this is an old question, but I've been struggling with a similar issue and found hOCR output to be the solution. Running
tesseract input.tif output-filename hocr
will create output-filename.hocr (basically HTML) that gives coordinates for the bounding boxes of each phrase. It's up to you to determine how to reconstruct the table from this data (probably using the dimensions of the input image).
As in the other answers, specifying some particular page segmentation mode might be useful in getting the phrases of your table grouped appropriately, but the coordinates will provide the precise result needed.
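As an illustration, the words in the hOCR output can be regrouped into rows by the y-coordinate of their bounding boxes, roughly like this (a sketch; the ocrx_word class and bbox title format are standard hOCR, while the grouping tolerance is an assumption):
import re
from bs4 import BeautifulSoup

def hocr_to_rows(hocr_path, row_tolerance=10):
    soup = BeautifulSoup(open(hocr_path, encoding='utf-8'), 'html.parser')
    rows = {}
    for w in soup.find_all(class_='ocrx_word'):
        x1, y1, x2, y2 = map(int, re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', w['title']).groups())
        key = round(y1 / row_tolerance)                 # words with similar y go into the same row
        rows.setdefault(key, []).append((x1, w.get_text()))
    # sort rows top-to-bottom, and words within each row left-to-right
    return [' '.join(text for _, text in sorted(words)) for _, words in sorted(rows.items())]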
You need to use the following config:
import pytesseract
from PIL import Image

# Read the image
r = Image.open('8.png')
r.load()

# Convert the image to text, preserving inter-word spaces
text = pytesseract.image_to_string(r, config='-c preserve_interword_spaces=1 --psm 1 --oem 3')
OR
Another solution is to draw contours around the text, save each contour region as a separate image, and sort them according to their x, y coordinates. After that you only need to extract the text from each image and display it as you want.
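A rough sketch of that contour approach with OpenCV 4 (the threshold setup and kernel size are assumptions you would tune for your scans):
import cv2
import pytesseract

img = cv2.imread('8.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# dilate horizontally so the characters of each cell merge into one blob
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
dilated = cv2.dilate(thresh, kernel, iterations=1)
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# sort the bounding boxes top-to-bottom, then left-to-right, and OCR each region
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: (b[1], b[0]))
for x, y, w, h in boxes:
    print(pytesseract.image_to_string(img[y:y + h, x:x + w]))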