Related
Going through the gensim source, I noticed the simple_preprocess utility function clears all punctuations except those with words starting with an underscore, _. Is there a reason for this?
def simple_preprocess(doc, deacc=False, min_len=2, max_len=15):
tokens = [
token for token in tokenize(doc, lower=True, deacc=deacc, errors='ignore')
if min_len <= len(token) <= max_len and not token.startswith('_')
]
return tokens
The underscore ('_') isn't typically meaningful punctuation, but is often considered a "word" character in programming and text-processing.
For example, common regular-expression syntax uses \w to indicate a "word character". Per https://www.regular-expressions.info/shorthand.html :
\w stands for "word character". It always matches the ASCII characters
[A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In
most flavors that support Unicode, \w includes many characters from
other scripts. There is a lot of inconsistency about which characters
are actually included. Letters and digits from alphabetic scripts and
ideographs are generally included. Connector punctuation other than
the underscore and numeric symbols that aren't digits may or may not
be included. XML Schema and XPath even include all symbols in \w.
Again, Java, JavaScript, and PCRE match only ASCII characters with \w.
As such, it's often used in authoring, or in other text-preprocessing steps, to connect other groups of letters/numbers that should be kept together as a unit. Thus it's not often cleared with other true punctuation.
The code you've referenced also does something else, different than your question about clearing punctuation: it drops word-tokens beginning with _.
I'm not sure why it does that; at some point that code may have be designed with some specific text-format in mind where leading-underscore tokens were semantically-unimportant formatting directives.
The simple_preprocess() function in gensim is just a quick-and-dirty baseline helpful for internal tests and compact beginner tutorials. It shouldn't be considered a "best practice".
Real projects should give more consideration to the kind of word-tokenization that makes sense for their data and purposes – and either look to libraries with more options, or custom approaches (which still need not be more than a few lines of Python), to implement tokenization that best suits their needs.
I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
The following regex will do validation for the P. O. Box is entered in the text box,
\b[P|p](OST|ost)?[.\s-]+[O|o](FFICE|ffice)?[.\s-]+[B|b](OX|ox)\b
i want to negate this so as to detect whether user is not entering P. O. Box it in the text box,
I know that we can do it by using javascript also, but my platform has different form structure, its demandware form, where i have regex as the field attribute.
We can submit a regex in this field and it will validate it automatically.Any idea?
I recommend not attempting to do this with a regular expression. It is way too easy to defeat1. Instead, you need to out source the problem to someone who knows what they are doing. In this case, since you are dealing with US addresses only, the USPS.
So, you should use the USPS address standarization/verification API. You can submit an address, and it will return to you a "cleaned" version of that address. It will tell you whether or not the address is valid. And if it is a post office box, it will return it to you in a standardized format and now you don't need a regular expression that can be defeated, now you only need a simple string match. And, as an added bonanza, you'll get a standardized and validated representation of the delivery address reducing2 the possibility of error.
I recognize I am sidestepping your actual engineering question. But part of engineering is abandoning solutions that are the wrong path. You need to validate addresses. So validate addresses rather than trying to build a state machine that can detect some inputs that represent post office boxes but will fail on others. And the USPS provides a validation service and they are the authoritative experts here.
1: I am not saying that you'll face adversaries, just you'll face all the creative, sloppy, lazy ways that people have for entering in their addresses.
2: But not eliminating.
You need to use a negative lookaround: (?!pattern).
In this case
(?! \b[P|p](OST|ost)?[.\s-]+[O|o](FFICE|ffice)?[.\s-]+[B|b](OX|ox)\b )
For reference:
Regex lookarounds
You can use this pattern:
^(?:[^p]+|\Bp+|p(?!(?:ost)?[.\s-]+o(?:ffice)?[.\s-]+box\b))+$
The idea is to test only substrings that begin with "p" (for more performances). To make this check case insensitive, you can add (?i) at the begining of the pattern:
^(?i)(?:[^p]+|\Bp+|p(?!(?:ost)?[.\s-]+o(?:ffice)?[.\s-]+box\b))+$
If the regex flavor is JavaScript's, then you can use negative look-ahead:
^(?!.*?\b[Pp](OST|ost)?[.\s-]+[Oo](FFICE|ffice)?[.\s-]+[Bb](OX|ox)\b)
When I'm programming, I often find myself writing functions that -should- (to be proper english) contain apostrophes (too bad C started everyone thinking that an apostrophe was an appropriate delimiter). For example: get_user's_group() -> get_users_group() . What do you guys do with that forced-bad-english ambiguous english? Just ignore the apostrophe? Create a different phrasing?
In that case, I would do get_group_for_user().
So, yes, I would "create a different phrasing" :)
Either that, or user.get_group().
getGroupForUser()
or
getGroupByUser()
My original answer of Ignore it, move on! is incomplete. You should ignore the fact you can't use ' in your method/function names. But you should continue to look at the naming of them to better explain what they do. I think this is a worthwhile pursuit in programming.
Picking on JavaScript, you could if you wanted to use apostrophes:
const user = {
"get_user's_group": () => console.log("Naming things! Am I right?!")
}
user["get_user's_group"]()
But don't do that 😬
Taking it further, you could if you wanted to, use a transpiler to take your grammatically correct name and transform it into something you never see.
Again with JavaScript as an example, maybe you could write a babel transform.
But don't do that 😛
As others have said, if there is context available from an object, that's a nice option:
user.get_group()
Failing that, the context of the surrounding code should be enough to make this your choice:
get_users_group()
How about getGroupByUser?
Either get_user_ApostropheShouldBeHereButLanguageWillNotLetMe_s_group or just ignore it because it really doesn't matter.
I ignore the apostraphe getGroupyUser and group_from_user are both perfectly understandable. Worrying about having correct grammer in your function names is a waste of time and distracts from the correct goal of having clear and understandable user names.
the point of proper english in function naming is a bit extreme ...
i mean why is the apostrophe bothering you but the _ instead of a space is not ?
Depending on the programming language you may be able to use Unicode variable names, this SO thread lists a few.
With Unicode identifiers you could use one of the unicode apostrophes to give the proper english language formatting to your variable name. Though this only speculative. And it would be hard to maintain. Actually, now that I think about it, it sounds downright evil.
Two points: First, don't use a name that would otherwise require an apostrophe if you can avoid it. Second, you are right in being concerned about ambiguity. For example, you could have:
getUsersGroup: gets the group of a list of users. If you are using an object-oriented language, this could have more information than just a group ID string. You could also have something like createUsersGroup, which would create a group object from a list of users passed in.
getGroupOfUser: takes in some sort of user object; returns the name of the group of the user
getGroupByUserId: takes in the user's name or a unique ID associated with that user; returns the name of the group of the user
The best way to delineate the difference between all of these is to just use standard method comments that explain the method names. This would depend on what language you are working with and what style of method comments your organization conventionally uses.
Normally I just drop the apostrophe, but do back-ticks work? (get_user`s_group)
getGroupOfUser? getUserGroup?
It's a programming language, not literature...
It would be getBackgroundColour in proper English (rather than getBackgroundColor)
Personally I'd write get_user_group() rather than get_group_for_user() since it feels like it reads better to me. Of course, I use a programming language where apostrophes are allowed in names:
proc get_user's_group {id} {#...}
Although, some of the more prolific non-English-native European users use it as a word separator:
proc user'group {id} {#...}
to each his own I guess..
Typically languages have keywords that you are unable to use directly with the exact same spelling and case for naming things (variables,functions,classes ...) in your program. Yet sometimes a keyword is the only natural choice for naming something. What is your system for avoiding/getting around this clash in your chosen technology?
I just avoid the name, usually. Either find a different name or change it slightly - e.g. clazz instead of class in C# or Java. In C# you can use the # prefix, but it's horrible:
int #int = 5; // Ick!
There is nothing intrinsically all-encompassing about a keyword, in that it should stop you from being able to name your variables. Since all names are just generalized instances of some type to one degree or another, you can always go up or down in the abstraction to find another useful name.
For example, if your writing a system that tracks students and you want an object to represent their study in a specific field, i.e. they've taken a "class" in something, if you can't use the term directly, or the plural "classes", or an alternative like "studies", you might find a more "instanced" variation: studentClass, currentClass, etc. or a higher perspective: "courses", "courseClass" or a specfic type attribute: dailyClass, nightClass, etc.
Lots of options, you should just prefer the simplest and most obvious one, that's all.
I always like to listen to the users talk, because the scope of their language helps define the scope of the problem, often if you listen long enough you'll find they have many multiple terms for the same underlying things (with only subtle differences). They usually have the answer ...
Paul.
My system is don't use keywords period!
If I have a function/variable/class and it only seems logical to name it with a keyword, I'll use a descriptive word in front of the keyword.
(adjectiveNoun) format. ie: personName instead of Name where "Name" is a keyword.
I just use a more descriptive name. For instance, 'id' becomes identifier, 'string' becomes 'descriptionString,' and so on.
In Python I usually use proper namespacing on my modules to avoid name clashes.
import re
re.compile()
instead of:
from re import *
compile()
Sometimes, when I can't avoid keyword name clashes I simply drop the last letter off the name of my variable.
for fil in files:
pass
As stated before either change class to clazz in Java/C#, or use some underscore as a prefix, for example
int _int = 0;
There should be no reason to use keywords as variable names. Either use a more detailed word or use a thesaraus. Capitalizing certain letters of the word to make it not exactly like the keyword is not going to help much to someone inheriting your code later.
Happy those with a language without ANY keywords...
But joke apart, I think in the seldom situations where "Yet sometimes a keyword is the only natural choice for naming something." you can get along by prefixing it with "my", "do", "_" or similar.
I honestly can't really think of many such instances where the keyword alone makes a good name ("int", "for" and "if" are definitely bad anyway). The only few in the C-language family which might make sense are "continue" (make it "doContinue"), "break" (how about "breakWhenEOFIsreached" or similar ?) and the already mentioned "class" (how about "classOfThingy" ?).
In other words: make the names more reasonable.
And always remember: code is WRITTEN only once, but usualy READ very often.
Typically I follow Hungarian Notation. So if, for whatever reason, I wanted to use 'End' as a variable of type integer I would declare it as 'iEnd'. A string would be 'strEnd', etc. This usually gives me some room as far as variables go.
If I'm working on a particular personal project that other people will only ever look at to see what I did, for example, when making an add-on to a game using the UnrealEngine, I might use my initials somewhere in the name. 'DS_iEnd' perhaps.
I write my own [vim] syntax highlighters for each language, and I give all keywords an obvious colour so that I notice them when I'm coding. Languages like PHP and Perl use $ for variables, making it a non-issue.
Developing in Ruby on Rails I sometime look up this list of reserved words.
In 15 years of programming, I've rarely had this problem.
One place I can immediately think of, is perhaps a css class, and in that case, I'd use a more descriptive name. So instead of 'class', I might use 'targetClass' or something similar.
In python the generally accepted method is to append an '_'
class -> class_
or -> or_
and -> and_
you can see this exemplified in the operator module.
I switched to a language which doesn't restrict identifier names at all.
First of all, most code conventions prevent such a thing from happening.
If not, I usually add a descriptive prose prefix or suffix:
the_class or theClass infix_or (prefix_or(class_param, in_class) , a_class) or_postfix
A practice, that is usually in keeping with every code style advice you can find ("long names don't kill", "Longer variable names don't take up more space in memory, I promise.")
Generally, if you think the keyword is the best description, a slightly worse one would be better.
Note that, by the very premise of your question you introduce ambiguity, which is bad for the reader, be it a compiler or human. Even if it is a custom to use class, clazz or klass and even if that custom is not so custom that it is a custom: it takes a word word, precisely descriptive as word may be, and distorts it, effectively shooting w0rd's precision in the "wrd". Somebody used to another w_Rd convention or language might have a few harsh wordz for your wolds.
Most of us have more to say about things than "Flower", "House" or "Car", so there's usually more to say about typeNames, decoratees, class_params, BaseClasses and typeReferences.
This is where my personal code obfuscation tolerance ends:
Never(!!!) rely on scoping or arcane syntax rules to prevent name clashes with "key words". (Don't know any compiler that would allow that, but, these days, you never know...).
Try that and someone will w**d you in the wörd so __rd, Word will look like TeX to you!
My system in Java is to capitalize the second letter of the word, so for example:
int dEfault;
boolean tRansient;
Class cLass;