I would like to create a tokenizer that tokenizes words in a sentence depending on their surroundings, so that one word can receive different tokens depending on its neighboring words. For example, take these two sentences: "Soccer is fun" and "She plays soccer really well". I would like a tokenizer that creates different tokens for the same word "soccer" because the surrounding words are different. Is such a tokenizer possible, and if so, can somebody point me to resources on how to build one?
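For illustration, a contextual model such as BERT already produces a different vector for "soccer" in each sentence even though the token id is the same. A minimal sketch, assuming the Hugging Face transformers library and the pretrained bert-base-uncased model (neither is named in the question):

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sentence in ["Soccer is fun", "She plays soccer really well"]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    soccer_vec = hidden[tokens.index("soccer")]                  # same token id, context-dependent vector
    print(sentence, soccer_vec[:5])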
Related
I have words in the Hebrew language. Some of them are originally English, and some of them are 'Hebrew English', meaning words that come from English but are written with Hebrew characters.
For example: 'insulin' in Hebrew is "אינסולין" (same phonetic sound).
I have a simple binary dataset.
X: words (Written with Hebrew characters)
y: label 1 if the word is originally in English and is written with Hebrew characters, else 0
I've tried using the classifier, but its input is full text, while my input is just single words.
I don't want any masking to happen; I just want simple classification.
Is it possible to use BERT for this task? Thanks
BERT is intended to work with words in context. Without context, a BERT-like model is equivalent to a simple word2vec lookup (there is fancy tokenization, but I don't know how well it works with Hebrew - probably not very efficiently). So if you really want to use distributional features in your classifier, you can take a pretrained word2vec model instead - it's simpler than BERT and no less powerful.
But I'm not sure it will work anyway. Word2vec and its equivalents (like BERT without context) don't know much about the inner structure of a word - only about the contexts it is used in. In your problem, however, word structure is more important than possible contexts. For example, words such as בלוטת (gland), דם (blood), or סוכר (sugar) often occur in the same contexts as insulin, but בלוטת and דם are Hebrew, whereas סוכר is English (okay, originally Arabic, but we are probably not interested in origins that ancient). You just cannot predict this from context alone.
So why not start with a simple model (e.g. logistic regression or even naive Bayes) over simple features (e.g. character n-grams)? Distributional features (i.e. w2v) may be added as well, because they convey the topic, and topics may be informative (e.g. in medicine, and technology in general, there are probably relatively more English words than in other domains). A minimal sketch of such a baseline follows.
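This sketch assumes scikit-learn; the training words and labels are just the examples from this answer (insulin and sugar = English origin, gland and blood = Hebrew), so a real dataset would replace them:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data taken from the examples above: label 1 = originally English, 0 = Hebrew.
words = ["אינסולין", "בלוטת", "דם", "סוכר"]
labels = [1, 0, 0, 1]

# Character n-grams capture word-internal structure, which matters more here
# than the contexts the word appears in.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(words, labels)
print(clf.predict(["אינסולין"]))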
Can anyone clarify what the following note exactly means?
NOTE: There must not be any new line characters; “\n” or “\r” at the end of sentences. If there are then the alignment of sentences will be corrupted and the training will not be effective.
The note appears on page 5, section 2.1.2.1 Parallel documents.
Does this apply to all document formats? It does not make much sense (at least to me) for .align documents, for instance...
Thank you for bringing this to our attention. We will update the documentation, as this statement is inaccurate. It should read:
"NOTE: There must not be any new line characters; “\n” or “\r” within a sentence. If there are then the alignment of sentences will be corrupted and the training will not be effective."
The issue we want to address here is that parallel documents should not break a single sentence across multiple lines as it makes sentence alignment much less effective.
In regard to your question about .align files: we do not sentence-align these files, so you could break the sentences across multiple lines as long as you did it consistently. That is to say, if you have a sentence broken into three lines on the source side, it should be broken into three lines on the target side. Since the sentence aligner is not used, even one unmatched split would cause misalignments in all the following sentences. There is no advantage to splitting sentences, so I strongly urge you not to do it.
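As a quick sanity check along these lines, one could verify that a parallel pair has the same number of lines on both sides and contains no stray carriage returns; a rough sketch, with made-up file names:

def check_side(path):
    # newline="" keeps any \r characters visible instead of translating them away
    with open(path, encoding="utf-8", newline="") as f:
        lines = f.read().split("\n")
    for i, line in enumerate(lines, start=1):
        if "\r" in line:
            print(path, "line", i, "contains a carriage return")
    return len(lines)

src_count = check_side("data.source.txt")    # hypothetical file names
tgt_count = check_side("data.target.txt")
if src_count != tgt_count:
    print("Line counts differ:", src_count, "vs", tgt_count, "- sentences will misalign")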
What would be a good Lucene analyzer to use for documents that are a mix of text and diverse source code?
For example, I want "C" and "C++" to be considered different words, and I want Charset.forName("utf-8") to be split between the class name and method name, and for the parameter to be considered either one or two words.
A good example dataset for what I'd like to look at is StackOverflow itself. I believe that StackOverflow uses Lucene.NET for search; does it use a stock analyzer, or has it been heavily customized?
You're probably best off using the WhitespaceTokenizer and customizing it to strip off punctuation. For example, we strip off all punctuation except '+' and '-', so that words such as C++ are kept, but opening and closing quotes, brackets, etc. are removed. In reality, though, for something like this you might have to add the document twice using different tokenizers to catch the different parts of the document, i.e. once with the StandardTokenizer and once with a WhitespaceTokenizer. In this case the StandardTokenizer will split all your code, e.g. between class and method names, while the whitespace one will pick up words such as C++. Obviously it depends on the language somewhat, though, as e.g. Scala allows some punctuation characters in method names.
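This is not the Lucene API, just a rough sketch of the two splitting behaviours described above, to show why both passes are useful:

import re

def whitespace_like(text):
    # Split on whitespace, then strip punctuation except '+' and '-' so C++ survives.
    tokens = []
    for raw in text.split():
        token = re.sub(r"[^\w+\-]", "", raw)
        if token:
            tokens.append(token)
    return tokens

def standard_like(text):
    # Split on any non-word character, roughly what a standard tokenizer does to code.
    return [t for t in re.split(r"\W+", text) if t]

text = 'Compare C and C++ or call Charset.forName("utf-8")'
print(whitespace_like(text))   # keeps 'C++' but leaves 'CharsetforNameutf-8' mashed together
print(standard_like(text))     # yields 'Charset', 'forName', 'utf', '8' but reduces 'C++' to 'C'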
I was reading "XML is not S-Expressions". XML scoping is kind of strict, as is S-expression scoping. And in every programming language I've seen, you can't have the following:
<b>BOLD <i>BOTH </b>ITALIC</i> == BOLD BOTH ITALIC
It's not even expressible with S-Expressions:
(bold "BOLD" (italic "BOTH" ) "ITALIC" ) == :(
Does any programming language support this kind of "overlapping" scoping? Could there be any practical use for it?
Overlapping markup structures have many practical uses. Consider, for example, applications of concurrent markup for text analysis in the humanities. The International Workshop on Markup of Overlapping Structures noted that:
Overlapping structures are ubiquitous, appearing in applications of textual markup as varied as aircraft maintenance manuals and ancient scriptural and liturgical works. The “overlap issue” raises its ugly head whenever text encoding looks beyond the snapshot view of a particular hierarchy to represent and process multiple concurrent aspects of a text, including features that reflect the text’s evolution across multiple versions and variants whether typographic or presentational, structural, annotational or referential, taxonomic or topical.
Overlap is a problem in texts as diverse as technical documents and product manuals (versioning), legal codes (effectivity), literary works (prosodic versus dramatic structure, rhetorical structures, annotation), sacred texts (chapter plus verse reference versus sentence structure and commentary), and language corpora (multiple layers of linguistic annotation).
The Text Encoding Initiative (TEI) publishes Guidelines to handle non-nesting information and provides an XML syntax for overlap. They stated in 2004 that:
[N]o solution has yet been suggested which combines all the desirable attributes of formal simplicity, capacity to represent all occurring or imaginable kinds of structures, suitability for formal or mechanical validation, and clear identity with the notations needed for simpler cases (i.e. cases where the textual features do nest properly).
Some options to handle overlapping structures include:
SGML has a CONCUR feature that can be used to support overlapping structures, although Goldfarb (the author of the standard) writes that "I therefore recommend that CONCUR not be used to create multiple logical views of a document".
GODDAG provides a data structure for representing documents with overlapping structures.
XCONCUR is an experimental markup language with the major goal to provide a convenient method to express concurrent hierarchies in an XML-like fashion.
There probably isn't any programming language that supports overlapping scopes in its formal definition. While technically possible, it would make the implementation more complex than it needs to be. It would also make the language ambiguous, since it would accept as valid what was very likely meant to be a mistake.
The only practical use I can think of right now is that it's less typing and reads more intuitively, just as writing attributes in markup feels more intuitive without unnecessary quotes, as in <foo id=45 /> instead of <foo id="45" />.
I think that enforcing nested structures makes for more efficient processing, too. By enforcing nested structures, the parser can push and pop nodes onto a single stack to keep track of the list of open nodes. With overlapped scopes, you'd need an ordered list of open scopes that you'd have to append to whenever you come across a begin-new-scope token, and then scan each time you come across an end-scope token to see which open scope is most likely to be the one it closes.
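A minimal sketch of the difference, using made-up open/close events rather than a real parser:

def check_nesting(events):
    # Strict nesting needs only a stack: each close must match the most recent open.
    stack = []
    for kind, name in events:
        if kind == "open":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

# <b><i></b></i> is rejected under strict nesting:
print(check_nesting([("open", "b"), ("open", "i"), ("close", "b"), ("close", "i")]))  # False

# With overlap allowed, closing a scope means searching the whole list of open scopes:
open_scopes = ["b", "i"]
open_scopes.remove("b")        # closing <b> leaves <i> still open
print(open_scopes)             # ['i']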
Although no programming language supports overlapping scopes, there are HTML parsers that support them as part of their error-recovery algorithms, including the ones in all major browsers.
Also, the switch statement in C allows for constructs that look something like overlapping scopes, as in Duff's Device:
void send(short *to, short *from, int count)
{
    int n = (count + 7) / 8;    /* number of do-while iterations */
    switch (count % 8)
    {
    case 0: do { *to = *from++;  /* 'to' is a memory-mapped register, so it is not incremented */
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
So, in theory, a programming language could have similar semantics for scopes in general to allow these kinds of optimization tricks when needed, but readability would be very low.
The goto statement, along with break and continue in some languages, also lets you structure programs to behave like overlapping scopes:
BOLD: while (bold)
{
    styles.add(bold);
    print "BOLD";
    while (italic)
    {
        styles.add(italic);
        print "BOTH";
        break BOLD;    // exits the outer loop; control falls through to italic-continued
    }
}
italic-continued:
    styles.remove(bold);
    print "ITALIC";
HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of ruby annotations in use for Chinese and Japanese text. I'm wondering, though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?
id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
As a linguist, I can see the benefits in using <ruby> for marking up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest Webkit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left to LaTeX and friends, and if it appears on the web, it is generally a non-accessible image.
As I understand it, ruby annotations are not really relevant in Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters which don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but those are rarely used and I don't know if ruby annotations would be appropriate for them.
My list:
theoretical notational conventions (miketaylr's answer): http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer): id-ee-oh-SINK-ruh-sees over idiosyncrasies - made with ASCII 'nbsp' art
abbreviation, acronym, initialism (possibly - why hover?)
learning technical terms of English origin that happen to have been translated into your non-English native language
I'm often forced to do the latter at uni. While the translated terminology is often consistent, very often it is not at all self-explanatory, or at least not as much as the original English term.
Also, the same term may have been translated using several different translation systems by different authors/groups.
Another problem is when, for example, queue, row, and series (and sometimes tuple) are all translated to the very same word in your language.
Given a Western language with fewer users and a low percentage of technical people in the population, this actually makes it much easier to learn the topic directly from English and then learn the translations in a second step.
Ruby could be a tool to transform this into a one-step process, providing either the translations or the original as a "Furigana".