Why does gensim ignore underscores during preprocessing?

Going through the gensim source, I noticed the simple_preprocess utility function clears all punctuation except the underscore, _, yet also drops any token that starts with one. Is there a reason for this?
def simple_preprocess(doc, deacc=False, min_len=2, max_len=15):
    tokens = [
        token for token in tokenize(doc, lower=True, deacc=deacc, errors='ignore')
        if min_len <= len(token) <= max_len and not token.startswith('_')
    ]
    return tokens

The underscore ('_') isn't typically meaningful punctuation, but is often considered a "word" character in programming and text-processing.
For example, common regular-expression syntax uses \w to indicate a "word character". Per https://www.regular-expressions.info/shorthand.html :
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.
As such, it's often used in authoring, or in other text-preprocessing steps, to connect other groups of letters/numbers that should be kept together as a unit. Thus it's not often cleared with other true punctuation.
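For instance, a one-line check in Python shows \w keeping an underscore-joined token intact (just a demo):

import re

# Underscores count as word characters, so the joined token survives whole:
print(re.findall(r'\w+', 'multi_part_token stays together'))
# ['multi_part_token', 'stays', 'together']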
The code you've referenced also does something else, different than your question about clearing punctuation: it drops word-tokens beginning with _.
I'm not sure why it does that; at some point that code may have been designed with some specific text format in mind where leading-underscore tokens were semantically unimportant formatting directives.
The simple_preprocess() function in gensim is just a quick-and-dirty baseline helpful for internal tests and compact beginner tutorials. It shouldn't be considered a "best practice".
Real projects should give more consideration to the kind of word-tokenization that makes sense for their data and purposes, and either look to libraries with more options or use a custom approach (which still need not be more than a few lines of Python, as sketched below) to implement the tokenization that best suits their needs.
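Along those lines, a minimal sketch of a do-it-yourself tokenizer, assuming \w+ is a good-enough notion of "word" for your data (my_tokenize is just an illustrative name, not a gensim API):

import re

def my_tokenize(doc, min_len=2, max_len=15):
    # \w treats underscores as word characters, so tokens like _meta_ survive;
    # add a startswith('_') filter if you do want simple_preprocess's behaviour.
    return [t for t in re.findall(r'\w+', doc.lower())
            if min_len <= len(t) <= max_len]

print(my_tokenize('A _meta_ tag and some normal_words here'))
# ['_meta_', 'tag', 'and', 'some', 'normal_words', 'here']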

Related

How to check whether a numeric encoded entity is a valid ISO8859-1 encoding?

Let's say I was given a random character reference like 〹. I need a solution to check whether this is a valid encoding or not.
I think I can use the Charset lib, but I can't fully wrap my mind around how to come up with a solution.
[This answer has been rewritten after further research.]
There's no simple answer to this using Charsets; see below for a complicated one.
There are simple answers using the character code, but it turns out to depend on exactly what you mean by ISO8859-1!
According to the Wikipedia page on ISO/IEC 8859-1, the character set ISO8859-1 defines only characters 32–126 and 160–255. So you could simply check for those ranges, e.g.:
fun Char.isISO8859_1() = this.toInt() in 32..126 || this.toInt() in 160..255
However, that same page also mentions the character set ISO-8859-1 (note the extra hyphen), which defines all 8-bit characters (0–255), assigning control characters to the extra ones. You could check for that with e.g.:
fun Char.isISO_8859_1() = this.toInt() in 0..255
ISO8859-1 includes all the printable characters, so if you only want to know whether a character has a defined glyph, you could use the former. However, these days most people tend to mean ISO-8859-1: that's what many web pages use (those which haven't yet moved on to UTF-8), and that's what the first 256 Unicode characters are defined as. So the latter will probably be more generally useful.
Both of the above methods are of course very short, simple, and efficient; but they only work for the one character set; and it's awkward hard-coding details of a character set, when library classes already have that information.
It seems that Charset objects are mainly aimed at encoding and decoding, so they don't provide a simple way to tell which characters are defined as such. But you can find out whether they can encode a given character. Here's the simplest way I found:
fun Char.isIn(charset: Charset) =
    try {
        // Ask the encoder to REPORT unmappable characters instead of replacing them.
        charset.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(toString()))
        true
    } catch (x: CharacterCodingException) {
        false
    }
That's really inefficient, but will work for all Charsets.
If you try this for ISO_8859_1, you'll find that it can encode all 8-bit values, i.e. 0–255. So it's clearly using the full ISO-8859-1 definition.
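The same encode-and-report idea ports easily to other languages. For comparison, a rough Python sketch (the 'iso-8859-1' codec, like the Java Charset, covers the full 0–255 range; is_in_charset is just an illustrative name):

def is_in_charset(ch, encoding='iso-8859-1'):
    # Try to encode a single character; failure means it isn't in the set.
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(is_in_charset('é'))       # True: U+00E9 is in ISO-8859-1
print(is_in_charset('\u2126'))  # False: U+2126 OHM SIGN is not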

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found on SO, but none of them are supported here: https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions, in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use so I can get just the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
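To make the captures concrete, here's a quick test of the basic Method 1 expression in Python, with one extra group added around [^"]* so the value itself can be extracted (purely a demo, not part of the methods above):

import re

line = '<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />'
# Method 1, plus a capture group around the attribute value.
for m in re.finditer(r'\b(text|xmlUrl)="([^"]*)"', line):
    print(m.group(2))
# Software Engineering Daily
# http://softwareengineeringdaily.com/feed/podcast/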
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according to this article about numbered backreferences from Regular-Expressions.info.
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (an undelimited value is then terminated by whitespace, so it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters: "([^"]*)"|'([^']*)'
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by a blank line.
text
Software Engineering Daily

xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain the different parts of the regexes used in the Code section so that you understand the usage of each part. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
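If you do go the parser route, a minimal sketch in Python using the standard library's xml.etree.ElementTree (podcasts.opml is an assumed filename):

import xml.etree.ElementTree as ET

tree = ET.parse('podcasts.opml')
# Each <outline> element exposes its attributes via .get().
for outline in tree.iter('outline'):
    if outline.get('text') and outline.get('xmlUrl'):
        print(outline.get('text'))
        print(outline.get('xmlUrl'))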

"Text run is not in Unicode Normalization Form C" using 〉 [duplicate]

While I was trying to validate my site I got the following error:
Text run is not in Unicode Normalization Form C
A: What does it mean?
B: Can I fix it with Notepad++, and how?
C: If B is no, how can I fix this with free tools (not Dreamweaver)?
What does it mean?
From W3C:
In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

világ = világ

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.

There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The C stands for (pre-)composed, and the D for decomposed. The K stands for compatibility. To improve interoperability, the W3C recommends the use of NFC-normalized text on the Web.
Besides "to improve interoperability", precomposed text usually looks better than decomposes text.
How can I fix this with free tools
By using the equivalent of Python's text = unicodedata.normalize('NFC', text) in your favorite programming language.
(Or, if you weren't planning to write a program, your question should be moved to superuser or webmasters.)
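As a quick illustration of what that call does, using the világ example from above (Python shown; any language with a Unicode library has an equivalent):

import unicodedata

decomposed = 'vila\u0301g'   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 6 5: the a + accent pair became U+00E1
print(composed == 'vil\u00e1g')        # True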
A. It means what it says (see dan04’s explanation for a brief answer and the Unicode Standard for a long one), but it simply indicates that the authors of the validator wanted to issue the warning. HTML5 rules do not require Normalization Form C (NFC); it is rather something generally favored by the W3C.
B. There is no need to fix anything, unless you decide that using NFC would actually be better. If you do, then there are various tools for automatic conversion to NFC, such as the free BabelPad editor. If you only need to deal with one character not in NFC, you can use character information repositories such as the Fileformat.info character search to find out the canonical decomposition of the character and use it.
Whether you use NFC or not depends on many considerations and on the characters involved. As a rule, NFC works better, but in some cases, an alternative, non-NFC presentation produces more suitable rendering or works better in some specific processing.
For example, in a duplicate question, the character reference &#x2126; has been reported as triggering the message. (The validator actually checks for characters entered as such references, too, instead of just performing a plain-text-level NFC check.) The reference stands for U+2126 OHM SIGN "Ω", which is defined to be canonically equivalent to U+03A9 GREEK CAPITAL LETTER OMEGA "Ω". The Unicode Standard explicitly says that the latter is the preferred character. It is also better covered in fonts. But if you have a special reason to use OHM SIGN, you can do that without violating current HTML5 rules, and you can ignore the validator warning.

Using magic strings or constants in processing punctuation?

We do a lot of lexical processing with arbitrary strings which include arbitrary punctuation. I am divided as to whether to use magic characters/strings or symbolic constants.
The examples should be read as language-independent although most are Java.
There are clear examples where punctuation has a semantic role and should be identified as a constant:
File.separator not "/" or "\\"; // a no-brainer as it is OS-dependent
and I write XML_PREFIX_SEPARATOR = ":";
However, let's say I need to replace every occurrence of "" (a pair of double-quote characters) with an empty string. I can write:
s = s.replaceAll("\"\"", "");
or
s = s.replaceAll(S_QUOT+S_QUOT, S_EMPTY);
(I have defined all common punctuation as S_FOO (string) and C_FOO (char))
In favour of magic strings/characters:
It's shorter
It's natural to read (sometimes)
The named constants may not be familiar (C_APOS vs '\'')
In favour of constants:
It's harder to make typos (e.g. contrast "''" + '"' with S_APOS+S_APOS + C_QUOT)
It removes escaping problems (should a regex be "\\s+" or "\s+" or "\\\\s+"?)
It's easy to search the code for punctuation
(There is a limit to this - I would not write regexes this way even though regex syntax is one of the most cognitively dysfunctional parts of all programming. I think we need a better syntax.)
If the definitions may change over time or between installations, I tend to put these things in a config file, and pick up the information at startup or on-demand (depending on the situation). Then provide a static class with read-only interface and clear names on the properties for exposing the information to the system.
Usage could look like this:
s = s.replaceAll(CharConfig.Quotation + CharConfig.Quotation, CharConfig.EmptyString);
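A rough sketch of that arrangement in Python (chars.ini, the section name, and the property names are all made up for the example):

import configparser

# Read the character definitions once at startup; chars.ini might contain:
#   [punctuation]
#   quotation = "
_cfg = configparser.ConfigParser()
_cfg.read('chars.ini')

class CharConfig:
    # Read-only constants with clear names, backed by the config file.
    QUOTATION = _cfg.get('punctuation', 'quotation', fallback='"')
    EMPTY = ''

text = 'He said ""hello"" oddly'
print(text.replace(CharConfig.QUOTATION * 2, CharConfig.EMPTY))
# He said hello oddly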
For general string processing, I wouldn't use special symbols. A space is always going to be a space, and it's just more natural to read (and write!):
s.replace("String", " ");
Than:
s.replace("String", S_SPACE);
I would take special care to use things like "\t" to represent tabs, for example, since they can't easily be distinguished from spaces in a string.
As for things like XML_PREFIX_SEPARATOR or FILE_SEPARATOR, you should probably never have to deal with constants like that, since you should use a library to do the work for you. For example, you shouldn't be hand-writing: dir + FILE_SEPARATOR + filename, but rather be calling: file_system_library.join(dir, filename) (or whatever equivalent you're using).
This way, you'll not only have an answer for things like the constants, you'll actually get much better handling of various edge cases which you probably aren't thinking about right now.
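For instance, in Python the standard library already owns the separator, so neither a magic "/" nor a FILE_SEPARATOR constant ever appears in your code:

import os
from pathlib import Path

# Both build a platform-correct path without naming the separator:
print(os.path.join('data', 'report.txt'))
print(Path('data') / 'report.txt')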

In MATLAB, what ASCII characters are allowed to be in a function name?

I have a set of objects that I read information out of, and that information ends up becoming a MATLAB M-file. One piece of information becomes a function name in MATLAB. I need to remove all of the not-allowed characters from that string before writing the M-file out to the filesystem. Can someone tell me what characters make up the set of allowed characters in a MATLAB function name?
Legal names follow the pattern [A-Za-z][A-Za-z0-9_]*, i.e. an alphabetic character followed by zero or more alphanumeric-or-underscore characters, up to NAMELENGTHMAX characters.
Since MATLAB variable and function naming rules are the same, you might find genvarname useful. It sanitizes arbitrary strings into legal MATLAB names.
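If the M-files are generated outside MATLAB, the same rules are easy to apply by hand; here's a rough sketch in Python (the 'f' prefix and the 63-character cap, namelengthmax's value on recent MATLAB releases, are assumptions to adjust for your setup):

import re

def make_valid_matlab_name(raw, max_len=63):
    # Replace everything outside [A-Za-z0-9_] with an underscore.
    name = re.sub(r'[^A-Za-z0-9_]', '_', raw)
    # Names must start with a letter; prepend one if necessary.
    if not re.match(r'[A-Za-z]', name):
        name = 'f' + name
    return name[:max_len]

print(make_valid_matlab_name('2nd-pass filter'))  # f2nd_pass_filter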
The short answer...
Any alphanumeric characters or underscores, as long as the name starts with a letter.
The longer answer...
The MATLAB documentation has a section "Working with M-Files" that discusses naming in a little more detail. Specifically, it points out the functions NAMELENGTHMAX (the maximum number of characters in a name that MATLAB will pay attention to), ISVARNAME (to check if a variable/function name is valid), and ISKEYWORD (to display restricted keywords).
Edited: this may be more informative:
http://scv.bu.edu/documentation/tutorials/MATLAB/functions.html