Semaphores notation - why V and P instead of S and W - language-agnostic

I am asking this question because I'm trying to get the notation to stick in my head. My lecturer said that V and P are the first letters of the dutch words for signal and wait, but this is not true.
Does anyone know what words V and P abbreviate or did Dijkstra just pick his favorite two letters?

From Wikipedia's Semaphore (programming) article and a copy of Dijkstra's work:
Probeer te verlagen (P) means 'try to decrement'
Verhogen (V) means 'increment'
P has the Proberen... (try) term in front of the meaningful verlagen (decrement) term because the Dutch words for increment and decrement both start with "V". Dijkstra added the "try to" words in front of the meaningful "decrement" term so that there would be a simpler way of distinguishing between the two functions.

Related

Is there a function to compare two strings using a custom homoglyphs list

I need a function that would compare two strings and outputs an edit distance like Levenshtein, but only if the characters are homoglyphs in cursives. I have a list of those homoglyphs so I could feed a custom list to this function.
Example
homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive they look-alike
compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1
compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2
Tory should give a false result since there's no way someone misread an M as a T. An A could be misread as an O so it can count as 1.
The scoring could be different, I just need to know that Mory is probably Mary not Tory and Morio is a little more likely to be Mario than Maria.
Do something like this exists?
The key to your problem can be thought of like an IQ word association question.
Sound Glyph
--------- = ----------
Homophone Homoglyphs
Now if you know that there is a way to find similar sounding words (homophone) then the same can be applied but instead of sounds change to glyphs (homoglyph).
The way to find similar sounding words is via Soundex (Sound Index).
So just do what Soundex does but instead of having a mapping from similar homophones use similar homoglyphs.
Once you convert each word (glyphs) input into a Glyphdex (Glyph Index) then you can compute the Levenshtein distance for the two Glyphdex.
Make sense?
If you are into cellular biology then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one 3 letter codon.
Note: Since the word glyhdex has been used prior to me writing this I can not say I coined that word, however the usage I currently find via Google (search) for the word are not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar sequence of glyphs I will take credit.

Pumping Lemma for Regular Languages

I'm having some trouble with a rather difficult question. I'm being asked to prove the language {0^n 1^m 0^n | m,n >= 0} is irregular using the pumping lemma. In all the examples I've seen, the language is only being raised to the same variable (i.e. a^n b^n). So my question is, how do I pick a suitable string to test if this language is irregular?
Also a follow up to that question is once I have my string, how do you decompose the string into the form xyz where |xy| <= pumping length and |y| >=1?
In the examples you have seen before there were different letters: n as followed by bs. In the given example, the are n Os at the beginning and the end of the word. The language adds 0 or more 1s between those blocks of Os.
W in the pumping lemma is decomposed w = x y z with |xy| <= m and |y| > 0, where m is the pumping length. The way to pick a w is the same as before: you pick it such that the xy is completely inside a block consisting of one letter. For a^n b^n a word in L was selected such that xy would entirely consist of as, such that if it is 'pumped' there will be more as than bs. So you need at least m as and for the word to be in the language that means you need to pick m bs. The shortest is w = a^mb^m. For the new troublesome language, pick a word in this L such that xy consists entirely of Os (in the first block), such that if it is 'pumped' there will be more Os in the first block than the last block -and the number of 1s in the middle was not changed. However, you need to include at least one 1 in your original word otherwise there is only one block of Os - and pumped words in fact are in the language, which means there is no contradiction and thus not proof that L is irregular.

Regex for finding possible words (scrabble-style)

I'm trying to create a "scrabble-solver" to run stress-tests on a scrabble-like game I'm developing. I have a database containing ~200.000 words and I'm now looking for a way to match the scrabble tiles given with the words in the database.
Example:
Given tiles: A, P, E, F, O, L, M
Result: APE, POLE, PALE, MOLE, PAL...
Is this possible by using a simple SELECT-statement with REGEXP? If possible I would also like to add letters on specific positions and be able to determine max/min length.
I hope this question made sense :)
I've been googling my eyes out but I can't seem to find what I'm looking for. Anyone got an idea?
Thanks! :)
It doesn't sound like a regex problem. I think you'll be better off simply creating all possible combinations of letters from the existing tiles and then running your SELECT statement with the IN clause. For example, with tiles:
A, P, E
your SELECT clause will be
SELECT word FROM words WHERE word IN ('APE', 'AEP', 'PAE' ,'PEA', 'EPA', 'EAP');
You'll get the list of valid words from your table.
A regex would not help you much in this case. You need to construct the possible words by yourself.
The problem is that you have a limited number of each possible letter and a regex cannot encode that information. If you had infinite supply of each letter, then you could use a regex like [APEFOI]*.
You will have to enumerate all the possible words yourself. The implementation would depend on the language your using, but your best bet might be a next_permutation function or better a function that enumerates all permutations. A simple (and slightly inefficient) implementation (in Python-like pseudocode) would be:
words = []
for permutation in permutations(letters): # enumerate all character orders
for i in range(1, len(permutation)): # enumerate all lengths of words
words.append(letters[:i]) # append to candidate set
At that point words will contain all the candidate words you would then use in a SELECT ... IN statement.
That isn't the most efficient approach, but should be practical enough to get you started.

In 0-based indexing system, do people call the element at index 0 the "first" or the "zeroth" element?

In Java/C++, for example, do you casually say that 'a' is the first character of "abc", or the zeroth?
Do people say both and it's always going to be ambiguous, or is there an actual convention?
A quote from wikipedia on Zeroth article:
In computer science, array references also often start at 0, so computer programmers might use zeroth in situations where others might use first, and so forth.
This would seem support the hypothesis that it's always going to be ambiguous.
Thanks to Alexandros Gezerlis (see his answer below) for finding this quote, from How to Think Like a Computer Scientist: Learning with Python by Allen B. Downey, Jeffrey Elkner and Chris Meyers, chapter 7:
The first letter of "banana" is not a. Unless you are a computer scientist. For perverse reasons, computer scientists always start counting from zero. The 0th letter (zero-eth) of "banana" is b. The 1th letter (one-eth) is a, and the 2th (two-eth) letter is n.
This seems to suggest that we as computer scientists should reject the natural semantics of "first", "second", etc when dealing with 0-based indexing systems.
This quote suggests that perhaps there ARE official rulings for certain languages, so I've made this question [language-agnostic].
It is the first character or element in the array, but it is at index zero.
The term "first" has nothing to do with the absolute index of the array, but simply it's relative position as the lowest indexed element. Turbo Pascal, for example, allows arbitrary indexes in arrays (say from 5 to 15). The element located at array[5] would still be referred to as the first element.
To quote from this wikipedia article:
While the term "zeroth" is not itself
ambiguous, it creates ambiguity for
all subsequent elements of the
sequence when lacking context, since
they are called "first", "second",
etc. in conflict with the normal
everyday meanings of those words.
So I would say "first".
Probably subjective but I call it the first element or element zero. It is the first and, Isaac Asimov's laws of robotics aside, I'm not even confident that zeroth is a real word :-)
By definition, anything preceding the first becomes the first, and pushes everything else out by one.
Definitely first, never heard zeroth until today!
I would agree with most answers here, which say first character which is at zero index, but just for the record, the following is from Allen Downey's book "Python for Software Design":
So b is the 0th letter (“zero-eth”) of
'banana', a is the 1th letter
(“one-eth”), and n is the 2th
(“two-eth”) letter.
Thus, he removes the ambiguity by either using:
a) a number and then "th", or
b) a word and then "-eth".
The C and C++ standards say "initial element" and "first element", meaning the same thing. If I remember to be unambiguous, I say "initial", "zeroth", or "first counting from zero". But normally I say "first". That banana stuff is either a humorous exaggeration or a bit bonkers (I suspect the former - it's just a way to explain 0-indexing). I don't think I know anyone who would actually say "first" to mean "index 1 of a 0-indexed array" unless they had first said "zeroth" in the same paragraph in order to make it clear what they mean.
It depends of whether or not you are a fan of Isaac Asimov's robot series.

Human name comparison: ways to approach this task

I'm not a Natural Language Programming student, yet I know it's not trivial strcmp(n1,n2).
Here's what i've learned so far:
comparing Personal Names can't be solved 100%
there are ways to achieve certain degree of accuracy.
the answer will be locale-specific, that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but to compare 2 names. e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
c = [v for v in a if v in b]
return float(len(c)) / (len(a)+len(b)-len(c))
def name_compare(name1, name2):
return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-english names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match to Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively people name matching is not that complicated. Its entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the aliases list or may be able to look up details of the person on the internet - but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with the Tanimoto using utf-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher()