This is more of an algorithmic question than a language-specific one, so I am happy to receive an answer in any language - even pseudocode, even just an idea.
Here is my problem: I need to work on a large dataset of papers that come from articles in PDF and that were brutally copied/pasted into .txt. I only have the result of this abomination, which is around 16k papers, for 3.5 GB of text (the corpus I am using is the ACL Anthology Network, http://clair.si.umich.edu/clair/aan/DatasetContents.html ).
The "junk" comes from things like formulae, images, tables, and so on. It just pops in the middle of the running text, so I can't use regular expressions to clean it, and I can't think of any way to use machine learning for it either. I already spent a week on it, and then I decided to move on with a quick&dirty fix. I don't care about cleaning it completely anymore, I don't care about false negatives and positives as long as the majority of this areas of text is removed.
Some examples of the text: note that formulae contain junk characters, but tables and captions don't (they still make my sentences very long, and thus unparsable). Junk in bold.
Easy one:
The experiments were repeated while inhibiting specialization of first the scheme with the most expansions, and then the two most expanded schemata.
Measures of coverage and speedup are important 1 As long as we are interested in preserving the f-structure assigned to sentences, this notion of coverage is stricter than necessary.
The same f-structure can in fact be assigned by more than one parse, so that in some cases a sentence is considered out of coverage even if the specialized grammar assigns to it the correct f-structure.
2'VPv' and 'VPverb[main]' cover VPs headed by a main verb.
'NPadj' covers NPs with adjectives attached.
205 The original rule: l/Pperfp --+ ADVP* SE (t ADJUNCT) ($ ADV_TYPE) = t,padv ~/r { #M_Head_Perfp I#M_Head_Passp } #( Anaph_Ctrl $) { AD VP+ SE ('~ ADJUNCT) ($ ADV_TYPE) = vpadv is replaced by the following: ADVP,[.E (~ ADJUNCT) (.l.
ADV_TYPE) = vpadv l/'Pperfp --+ #PPadjunct #PPcase_obl {#M.Head_Pevfp [#M..Head_Passp} #( Anaph_Ctrl ~ ) V { #M_Head_Perfp I#M_Head_Passp } #( Anaph_Ctrl ~) Figure 1: The pruning of a rule from the actual French grammar.
The "*" and the "+" signs have the usual interpretation as in regular expressions.
A sub-expression enclosed in parenthesis is optional.
Alternative sub-expressions are enclosed in curly brackets and separated by the "[" sign.
An "#" followed by an identifier is a macro expansion operator, and is eventually replaced by further functional descriptions.
Corpus --..
,, 0.1[ Disambiguated Treebank treebank Human expert Grammar specialization Specialized grammar Figure 2: The setting for our experiments on grammar specialization.
indicators of what can be achieved with this form of grammar pruning.
However, they could potentially be misleading, since failure times for uncovered sentences might be considerably lower than their sentences times, had they not been out of coverage.
Hard one:
Table 4 summarizes the precision results for both English and Romanian coreference.
The results indicate that the English coreference is more indicate than the Romanian coreference, but SNIZZLE improves coreference resolution in both languages.
There were 64% cases when the English coreference was resolved by a heuristic with higher priority than the corresponding heuristic for the Romanian counterpart.
This result explains why there is better precision enhancement for
English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal Pronominal 73% 89% 66% 78% 76% 93% 71°/o 82% Table 4: Coreference precision Total 84% 72% 87% 76% English Romanian SWIZZLE on English SWIZZLE on Romanian Nominal 69% 63% 66% 61% Pronominal Total 89% 78% 83% 72% 87% 77% 80% 70% Table 5: Coreference recall the English coreference. Table 5 also illustrates the recall results.
The advantage of the data-driven coreference resolution over other methods is based on its better recall performance.
This is explained by the fact that this method captures a larger variety of coreference patterns.
Even though other coreference resolution systems perform better for some specific forms of systems, their recall results are surpassed by the systems approach.
Multilingual coreference in turn improves more the precision than the recall of the monolingual data-driven coreference systems.
In addition, Table 5 shows that the English coref- erence results in better recall than Romanian coref- erence.
However, the recall shows a decrease for both languages for SNIZZLE because imprecise coreference links are deleted.
As is usually the case, deleting data lowers the recall.
All results were obtained by using the automatic scorer program developed for the MUC evaluations.
Note how the table does not contain strange characters and lands right in the middle of the sentence: "This result explains why there is better precision enhancement for -TABLE HERE- the English coreference." I can't know where the table will be with respect to the running text. It may occur before a sentence, after it, or within it like in this case. Also note that the table text does not end with a full stop (most captions in papers don't...), so I can't rely on punctuation to spot it. I am happy with non-accurate boundaries of course, but I still need to do something with these tables. Some of them contain words rather than numbers, and in those cases I don't have enough information: no junky characters, nothing. It is obvious only to humans :S
(I hate crappy copy&pastes. )
A few ideas that you might find helpful (I used each and every one of them myself at one point or another):
(Very brute force): use a tokenizer and a dictionary (a real dictionary, not the data structure) - parse the words out, and remove any word which is not a dictionary word. It might prove problematic if your text contains a lot of company/product names - but this too can be solved using the correct indexes (there are a few on the web - I'm using some proprietary ones so I can't share them, sorry).
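A rough sketch of that brute-force filter in Python (the word-list path is just a placeholder; any large word list would do, and the tokenization is deliberately crude):

import re

with open('/usr/share/dict/words') as f:           # placeholder path: any large word list
    DICTIONARY = {w.strip().lower() for w in f if w.strip()}

def keep_dictionary_words(line):
    # drop every token whose alphabetic core is not a known word
    return ' '.join(t for t in line.split()
                    if re.sub(r'[^a-z]', '', t.lower()) in DICTIONARY)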
Given a set of clean documents (let's say 2k), build a tf/idf index of them, and use this as a dictionary - every term from the other documents that doesn't appear in the index (or appears with a very low tf/idf) - remove it. This should give you rather clean documents.
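A sketch of the tf/idf idea with scikit-learn (clean_documents is assumed to be the ~2k known-clean texts; this minimal version only checks vocabulary membership, and thresholding on the actual tf/idf scores would be the refinement described above):

import string
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(clean_documents)                    # the ~2k known-clean documents
clean_vocab = set(vectorizer.vocabulary_)

def filter_by_index(document):
    # drop every term that never occurs in the clean corpus
    return ' '.join(t for t in document.split()
                    if t.strip(string.punctuation).lower() in clean_vocab)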
Use Amazon's Mechanical Turk: set up a task where the person reading the document needs to mark the paragraphs that don't make sense. This should be rather easy on the Mechanical Turk platform (16.5k documents is not that much) - it will probably cost you a couple of hundred dollars, but you'll probably get a rather nice cleanup of the text (so if it's on corporate money, that can be your way out - they need to pay for their mistakes :) ).
Considering your documents are from the same domain (same topics, all in all), and the problems are quite similar (same table headers, roughly the same formulas): break all the documents into sentences, and try clustering the sentences using ML. If the table headers/formulas are relatively similar, they should cluster nicely away from the rest of the sentences, and then you can clean the documents sentence by sentence (take a document, break it into sentences, and for each sentence, if it's part of the "weird" cluster, remove it).
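A sketch of the clustering idea with scikit-learn (sentences is assumed to be the corpus already split into sentences; the cluster count and the manual step of deciding which clusters are "weird" are up to you):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(sentences)                # sentences: list of all sentences

kmeans = KMeans(n_clusters=50, random_state=0).fit(X)  # 50 clusters is a guess worth tuning

# inspect a few sentences per cluster by hand, mark the junk clusters,
# then drop every sentence whose kmeans.labels_ entry falls in that junk set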
I am working on a sentiment analysis solution with BERT to analyze tweets in German. My training dataset is a set of 1000 tweets, which have been manually annotated into the classes neutral, positive and negative.
The dataset of 10,000 tweets is quite unevenly distributed, approximately:
3000 positive
2000 negative
5000 neutral
The tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc.
The interesting thing is that if I remove them with the following code during data cleaning, the F1 score gets worse. Only the removal of https links (if I do it on its own) leads to a small improvement.
import re
import string

# removing links, # references, smileys, punctuation and numbers
def remove_punct(text):
    text = re.sub(r'http\S+', '', text)   # remove links
    text = re.sub(r'#\S+', '', text)      # remove references to usernames with #
    text = re.sub(r':\S+', '', text)      # remove smileys starting with : (like :), :D, :( etc.)
    text = "".join([char for char in text if char not in string.punctuation])  # remove punctuation
    text = re.sub('[0-9]+', '', text)     # remove numbers
    return text
data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x)) # extending the dataset with the column tweet_clean
data.head(40)
Also, steps like stop word removal or lemmatization lead to a deterioration as well. Is this because I am doing something wrong, or can the BERT model actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets, and the structure of the sentences and the language use are different. Would you still recommend adding these records to my original dataset?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest class and thus balance the dataset?
BERT can handle punctuation, smileys, etc. Of course, smileys contribute a lot to sentiment analysis, so don't remove them. Next, it would be fair to replace #mentions and links with special tokens, because the model will probably never see them again in the future.
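For example, a minimal sketch of such a replacement (the placeholder strings [URL] and [USER] are my own choice, not something BERT prescribes):

import re

def replace_special(text):
    text = re.sub(r'http\S+', ' [URL] ', text)     # replace links with a placeholder token
    text = re.sub(r'[@#]\S+', ' [USER] ', text)    # replace @/# references with a placeholder token
    return re.sub(r'\s+', ' ', text).strip()

data['Tweet_clean'] = data['Tweet'].apply(replace_special)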
If your model is designed for tweets, I suggest that you fine-tune BERT on the additional corpus first, and then fine-tune on the Twitter corpus. Or do both simultaneously. More training samples are generally better.
No, it is better to use class weights instead of downsampling.
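A minimal sketch of what class weighting could look like in PyTorch (train_labels is a placeholder for your integer-encoded labels):

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(train_labels)                  # e.g. 0 = negative, 1 = neutral, 2 = positive
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(labels), y=labels)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
# use loss_fn(logits, targets) in the training loop instead of an unweighted loss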
Based on this paper (by Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTMs in terms of generalizing to punctuation. Looking at the experiments' results in the paper, I'd say keep the punctuation.
I couldn't find anything solid for smiley faces; however, after doing some experiments with the HuggingFace API, I didn't notice much difference with/without smiley faces.
Would an implementation of the programming language Brainfuck still be Turing-complete if its memory cells were 1 bit in capacity, instead of the usual 8 bits?
The + and - instructions would become identical, but this need not be a problem.
I see no issue with, for example, 4-bit memory cells, but I cannot work out if this scales all the way down to single-bit values.
Yes, the resulting language would still be Turing-complete. In fact, several such languages exist. One of them is Boolfuck. It does exactly what you suggest: have each cell be a single bit and get rid of -, because it's redundant. It also uses ; instead of . for output. The official website contains a reduction from Brainfuck to Boolfuck which proves Boolfuck's Turing-completeness. I'm reiterating the reduction here to make the answer self-contained:
Brainfuck    Boolfuck
+ >[>]+<[+<]>>>>>>>>>[+]<<<<<<<<<
- >>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>[+]<<<<<<<<<
< <<<<<<<<<
> >>>>>>>>>
, >,>,>,>,>,>,>,>,<<<<<<<<
. >;>;>;>;>;>;>;>;<<<<<<<<
[ >>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>[+<<<<<<<<[>]+<[+<]
] >>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>]<[+<]
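For convenience, here is a small sketch that mechanically applies the table above to a Brainfuck program (a direct transcription of the reduction, nothing more):

BF_TO_BOOLFUCK = {
    '+': '>[>]+<[+<]>>>>>>>>>[+]<<<<<<<<<',
    '-': '>>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>[+]<<<<<<<<<',
    '<': '<<<<<<<<<',
    '>': '>>>>>>>>>',
    ',': '>,>,>,>,>,>,>,>,<<<<<<<<',
    '.': '>;>;>;>;>;>;>;>;<<<<<<<<',
    '[': '>>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>[+<<<<<<<<[>]+<[+<]',
    ']': '>>>>>>>>>+<<<<<<<<+[>+]<[<]>>>>>>>>>]<[+<]',
}

def brainfuck_to_boolfuck(program):
    # characters that are not Brainfuck commands are ignored, as usual
    return ''.join(BF_TO_BOOLFUCK.get(c, '') for c in program)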
Other bit-based Brainfuck derivatives include Smallfuck and BitChanger. This article may also be of interest to you; it goes through several steps of minimising the Brainfuck language by removing redundancy (including using bits instead of bytes).
Most programming languages use // or # for a single-line comment (see wiki). It seems that # is especially used for interpreted languages. According to this question, the reason seems to be that one of the early shells (the Bourne shell) used # as a comment character and made use of it (shebang).
Is there a logical reason to choose # as a comment sign (e.g. does it symbolize crossing out with #)? And why do we use // as a comment sign in many compiled languages (especially in C, as it seems to be one of the earliest compiled languages with that symbol)? Are there logical reasons for that? Why not use # instead of //, or // instead of #?
Is there a logical reason why to choose # as a comment sign [in early shells]?
The Bourne shell tokenizer is quite simple. To add comment line support, a single character identifier was the simplest, and logical, choice.
The set of single characters you can choose from, if you wish to be compatible with both EBCDIC and ASCII (the two major character sets used at that time), is quite small:
! (logical not in bc)
#
% (modulo in bc)
@
^ (power in bc)
~ (used in paths)
Now, I've listed the ones used in bc, the calculator used in the same time period, not because they were a reason, but because you should understand the context of the Bourne shell developers and users. The bc notation did not arrive out of thin air; the prevailing preferences influenced the choice, because the developers wanted the syntax to be intuitive, at least for themselves. The above bc notes are therefore useful in showing what kind of associations contemporary developers had with specific characters. I don't intend to imply that bc necessarily had an impact on the Bourne shell -- but I do believe it did; that one of the reasons for developing the Bourne shell was to make using and automating tools like bc easier.
Effectively, only # and @ were "unused" characters available in both ASCII and EBCDIC; and it appears "hash" won over "at".
And why do we use // as a comment sign in many compiled languages?
The // comment style is from BCPL. Many of the BCPL tokens and operators were already multiple characters long, and I suspect that at the time the developers considered it better (for interoperability) to reuse an already used character for the comment line token, rather than introduce a completely new character.
I suspect that the // comment style has a historical background in margin notes; a double vertical line used to separate the actual content from notes or explanations being a clear visual separator to even those not familiar with the practice.
Why not use # instead of //, or [vice versa]?
In both of the cases above, there is clear logic. However, that does not mean that these were the only logical choices available. These are just the ones that made the most sense to the developers at the time when the choice was made -- and I've tried to shed some light on the possible reasons, the context for the choices, above.
If these kinds of questions interest you, I recommend you find old math and science (physics in particular) books, and perhaps even reproductions of old notes. Best tools are intuitive, you see; and to find what was intuitive to someone, you need to find out the context they worked in. I am absolutely certain you can find interesting "reasons" -- things that made certain choices logical and intuitive to them, while to us they may seem odd -- by finding out the habits of the early developers and their colleagues and mentors.
I've been interested in compiler/interpreter design and implementation for as long as I've been programming (only 5 years now), and it's always seemed like the "magic" behind the scenes that nobody really talks about (I know of at least 2 forums for operating system development, but I don't know of any community for compiler/interpreter/language development). Anyway, recently I've decided to start working on my own, in hopes of expanding my knowledge of programming as a whole (and hey, it's pretty fun :). So, based on the limited amount of reading material I have, and Wikipedia, I've developed this concept of the components of a compiler/interpreter:
Source code -> Lexical Analysis -> Abstract Syntax Tree -> Syntactic Analysis -> Semantic Analysis -> Code Generation -> Executable Code.
(I know there's more to code generation and executable code, but I haven't gotten that far yet :)
And with that knowledge, I've created a very basic lexer (in Java) to take input from a source file, and output the tokens into another file. A sample input/output would look like this:
Input:
int a := 2
if(a = 3) then
print "Yay!"
endif
Output (from lexer):
INTEGER
A
ASSIGN
2
IF
L_PAR
A
COMP
3
R_PAR
THEN
PRINT
YAY!
ENDIF
Personally, I think it would be really easy to go from there to syntactic/semantic analysis, and possibly even code generation, which leads me to my question: why use an AST, when it seems that my lexer is doing just as good a job? However, 100% of the sources I use to research this topic seem adamant that this is a necessary part of any compiler/interpreter. Am I missing the point of what an AST really is (a tree that shows the logical flow of a program)?
TL;DR: Currently en route to developing a compiler; finished the lexer; it seems to me like the output would make for easy syntactic/semantic analysis, rather than building an AST. So why use one? Am I missing the point of one?
Thanks!
First off, one thing about your list of components does not make sense. Building an AST is (pretty much) the syntactic analysis, so syntactic analysis either shouldn't be a separate item in there, or should at least come before the AST.
What you've got there is a lexer. All it gives you is individual tokens. In any case, you will need an actual parser, because regular languages aren't any fun to program in. You can't even (properly) nest expressions. Heck, you can't even handle operator precedence. A token stream doesn't give you:
An idea where statements and expressions start and end.
An idea how statements are grouped into blocks.
An idea which part of the expression has which precedence, associativity, etc.
A clear, uncluttered view at the actual structure of the program.
A structure which can be passed through a myriad of transformations, without every single pass knowing, and having code to accommodate, that the condition in an if is enclosed by parentheses.
... more generally, any kind of comprehension above the level of a single token.
Suppose you have two passes in your compiler which optimize certain kinds of operators applied to certain arguments (say, constant folding and algebraic simplifications like x - x -> 0). If you hand them tokens for the expression x - x * 1, these passes are cluttered with figuring out that the x * 1 part comes first. And they have to know that, lest the transformation be incorrect (consider 1 + 2 * 3).
These things are tricky enough to get right as it is, so you don't want to be pestered by parsing problems as well. That's why you solve the parsing problem first, in a separate parsing step. Then you can, say, replace a function call with its definition without worrying about adding parentheses so that the meaning remains the same. You save time, you separate concerns, you avoid repetition, you enable simpler code in many other places, etc.
A parser figures all that out, and builds an AST which consequently holds all that information. Without any further data on the nodes, the shape of the AST alone gives you the first three points above, and much more, for free. None of the bazillion passes that follow have to worry about them anymore.
That's not to say you always have to have an AST. For sufficiently simple languages, you can do a single-pass compiler. Instead of generating an AST or some other intermediate representation during parsing, you emit code as you go. However, this becomes harder for less simple languages and you can't reasonably do a lot of stuff (such as 70% of all optimizations and diagnostics -- and yes I just made that number up). Generally, I wouldn't advise you to do this. There are good reasons single-pass compilers are mostly dead. Even languages which permit them (e.g. C) are nowadays implemented with multiple passes and ASTs. It's a simple way to get started, but will severely limit you (and the language, if you design it) later.
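To make the x - x * 1 example concrete, here is a rough sketch of an algebraic-simplification pass over an AST (the node classes are hypothetical; the point is that the pass never reasons about precedence or parentheses, because the parser already did):

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def simplify(node):
    if isinstance(node, BinOp):
        left, right = simplify(node.left), simplify(node.right)   # simplify children first
        if node.op == '-' and left == right:
            return Num(0)              # x - x  ->  0
        if node.op == '*' and right == Num(1):
            return left                # x * 1  ->  x
        return BinOp(node.op, left, right)
    return node

# x - x * 1: the parser has already decided that '*' binds tighter than '-'
tree = BinOp('-', Var('x'), BinOp('*', Var('x'), Num(1)))
print(simplify(tree))                  # Num(value=0)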
You've got the AST at the wrong point in your flow diagram. Typically, the output of the lexer is a series of tokens (as you have in your output), and these are fed to the parser/syntactic analyzer, which generates the AST. So the output of your lexer is different from an AST because they are used at different points in the compilation process and fulfill different purposes.
The next logical question is: what, then, is an AST? Well, the purpose of parsing/syntactic analysis is to turn the series of tokens generated by the lexer into an AST (or parse tree). The AST is an intermediate representation that captures the relationships between syntactic elements in a way that is easier to work with programmatically. One way of thinking about this is that a text program is a one-dimensional construct, and can only represent ideas as a sequence of elements, while the AST is freed from this constraint, and can represent the underlying relationships between those elements in two dimensions (as typically drawn), or in any higher-dimensional space if you choose to think about it that way.
For instance, a binary operator has two operands, let's call them A and B. In code, this may be spelled 'A * B' (assuming an infix operator - another advantage of an AST is to hide such distinctions that may be important syntactically, but not semantically), but for the compiler to "understand" this expression, it must read 5 characters sequentially, and this logic can quickly become cumbersome, given the many possibilities in even a small language. In an AST representation, however, we have a "binary operator" node whose value is '*', and that node has two children, values 'A' and 'B'.
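As a tiny illustration (nested tuples standing in for proper node classes; purely hypothetical):

# token stream (one-dimensional):  ['A', '*', 'B']
# AST (nested structure):          ('binop', '*', ('name', 'A'), ('name', 'B'))
node = ('binop', '*', ('name', 'A'), ('name', 'B'))
kind, op, left, right = node          # the operands are simply the node's children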
As your compiler project progresses, I think you will begin to see the advantages of this representation.
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, though probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be generic string-distance algorithms such as:
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
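For a quick feel of these baselines, a sketch using Python's standard library for a Ratcliff/Obershelp-style ratio and a plain dynamic-programming Levenshtein:

from difflib import SequenceMatcher

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

a, b = "West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"
print(levenshtein(a.lower(), b.lower()))
print(SequenceMatcher(None, a.lower(), b.lower()).ratio())   # Ratcliff/Obershelp-based similarity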
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP code, extended ZIP code, and also city)
Identify (look up) some of these entities (for example, a relatively short database table can include all the cities/towns in the targeted area)
Identify (look up) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain allows aborting the evaluation early on some strong heuristics (e.g. different city => correlation = 0, level of confidence = 95%, etc.).
An important consideration with searching for correlations is the need to a priori compare every single item (here, an address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized...) and also to maybe have a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
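A rough sketch of the pre-processing and digest/key idea (the substitution table is only a toy example, and the key here uses word initials instead of a real SOUNDEX):

import re

SUBSTITUTIONS = {'w': 'west', 'w.': 'west', 'dr': 'drive', 'dr.': 'drive',
                 'st': 'street', 'st.': 'street'}        # toy substitution dictionary

def normalize(address):
    tokens = re.findall(r'[a-z0-9]+\.?', address.lower())
    return [SUBSTITUTIONS.get(t, t.rstrip('.')) for t in tokens]

def digest_key(address):
    # very rough indicator: a 5-digit ZIP-like token (if any) plus the word initials
    tokens = normalize(address)
    zipcode = next((t for t in tokens if t.isdigit() and len(t) == 5), '')
    return zipcode + ''.join(t[0] for t in tokens if t.isalpha())

# only pairs of addresses sharing a digest key need the expensive item-level comparison
print(normalize("W. Lawn Mower Dr. 54A"))        # ['west', 'lawn', 'mower', 'drive', '54a']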
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If your metric fulfils the following criteria, it helps:
the distance between an object and itself is zero (reflexive)
the distance from a to b is the same as the distance from b to a (symmetric)
the distance from a to c is not more than the distance from a to b plus the distance from b to c (triangle inequality)
If your metric obeys these, then you can arrange your objects in a metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
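A sketch of that simple starting point (reduce each address line to its digits and word initials, then score the overlap with a longest-common-subsequence length):

import re

def reduce_address(line):
    # keep digit groups whole, otherwise keep only the first letter of each word
    words = re.findall(r'[A-Za-z0-9]+', line.lower())
    return ''.join(w if w[0].isdigit() else w[0] for w in words)

def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

a = reduce_address("West Lawnmower Drive 54 A")   # 'wld54a'
b = reduce_address("W. Lawn Mower Dr. 54A")       # 'wlmd54a'
print(2 * lcs_length(a, b) / (len(a) + len(b)))   # similarity in [0, 1]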
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
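A compact sketch of a BK-tree (pass in any edit-distance function, e.g. a Levenshtein implementation; this is the textbook construction, not a tuned one):

class BKTree:
    def __init__(self, dist):
        self.dist = dist              # edit-distance function, e.g. Levenshtein
        self.root = None              # each node is (term, {distance: child})

    def add(self, term):
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = self.dist(term, node[0])
            if d in node[1]:
                node = node[1][d]     # descend into the child at that distance
            else:
                node[1][d] = (term, {})
                return

    def search(self, term, max_dist):
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            word, children = stack.pop()
            d = self.dist(term, word)
            if d <= max_dist:
                results.append((d, word))
            # the triangle inequality prunes children outside [d - max_dist, d + max_dist]
            stack.extend(child for cd, child in children.items()
                         if d - max_dist <= cd <= d + max_dist)
        return results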
Disclaimer: I don't know any algorithm that does this, but I would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome; please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., and corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that they are the same if the correlation is above a given threshold.
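In the same naive spirit, a sketch of that idea (the way the average letter difference is turned into a 0..1 score is an arbitrary choice of mine):

def clean(s):
    return ''.join(c for c in s.lower() if c.isalnum() or c == ' ')

def alignment_correlation(s1, s2):
    a, b = clean(s1), clean(s2)
    best = 0.0
    for shift in range(-len(b) + 1, len(a)):
        # b is slid by `shift` positions relative to a; compare the overlapping part
        overlap = range(max(0, shift), min(len(a), shift + len(b)))
        if not overlap:
            continue
        diff = sum(abs(ord(a[i]) - ord(b[i - shift])) for i in overlap) / len(overlap)
        best = max(best, 1.0 / (1.0 + diff))     # smaller average difference => score closer to 1
    return best

print(alignment_correlation("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))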
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.