How does textX define boundaries between words? - textx

Consider the following made-up example with three rules:
Model: id_1=Ident 'is' id_2=Ident;
Keyword: 'is';
Ident: !Keyword ID;
It seems that textX cannot parse inputs where id_2 starts with "is", e.g. "Tom is isolated".
I get the following error:
None:1:7: error: Expected Not at position ../test.txt:(1, 7) => 'Tom is* isolated'.
Why is that happening? Shouldn't "isolated" be considered one word, distinct from "is"? textX can parse the input if id_2 does not start with "is".
Is there any way to solve this problem?
Thanks!

By default, textX doesn't assume that matching should be done on word boundaries. To solve your problem you can either define keywords that match on word boundaries, like:
Keyword: /\bis\b/;
or use the auto-keywords feature, which can be turned on by passing autokwd=True to the metamodel_from_file/metamodel_from_str calls.

Related

Replace with unknown character

I am correcting a JSON array and want to fix a few errors.
For example: in "index" : NumberInt(8), I want to cut off the NumberInt(*) wrapper without removing the number where the * is, in order to make the JSON file valid.
How can I do that? I didn't find anything on Google; it's quite hard to put this question into words.
Example
Before:
"index" : NumberInt(8),
(Some way of changing the JSON)
After:
"index" : 8,
Edit:
After the accepted answer, I could figure out my specific case myself.
I solved my problem using back-references ($1, $2, etc.).
Example, which I used for my case:
Press cmd+R to open the replace function, then enter
Search string: NumberInt\((\d+)\)
Replace string: $1
What happens: it searches for "NumberInt(<number>)" and replaces it with the number captured by the $1 back-reference.
Thanks for your help! I learned a lot.
I think there might be a bit missing from the question, but I'm going to make some assumptions and hope that's what you were going for.
I'm going to assume that NumberInt(8) is a string (if it's not, you can operate on the object, pull out the first argument, and set that as the value).
If we are looking to pull out what's in the parens, we can use a basic regex (the parentheses around \d capture it as a second element):
v = "NumberInt(8)"
v.match(/NumberInt\((\d)\)/)
=> Array [ "NumberInt(8)", "8" ]
You should be able to parseInt() the second element and override the previous value.
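The same capture-and-replace idea, sketched in Python (the sample string is taken from the question):

```python
import re

# Sample of the broken JSON from the question.
broken = '"index" : NumberInt(8),'

# Capture the digits inside NumberInt(...) and keep only them; \1 is the
# backreference to the first capture group, like $1 in an editor's
# replace field.
fixed = re.sub(r'NumberInt\((\d+)\)', r'\1', broken)
print(fixed)  # -> "index" : 8,
```

re.sub applies the substitution to every occurrence, so one call fixes the whole file's contents at once.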

NLTK letter 'u' in front of text result?

I'm learning NLTK from a tutorial, and whenever I try to print some text contents, a u appears in front of them.
In the tutorial, the output looks like this:
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
But my result looks like this:
(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
I am not sure why; I followed the tutorial exactly. Can someone help me understand this? Thank you!
That leading u just means that the string is Unicode; in Python 3, where all strings are Unicode, the prefix is no longer shown. The parentheses mean that you are dealing with a tuple. Both will go away if you print the individual elements of the tuple, as with t[0], t[1], and so on (assuming that t is your tuple).
If you want to print the tuple as a whole, without u prefixes and parentheses, try the following:
print " ".join(t)
As mentioned in the other answer, the leading u just means that the string is Unicode. str() can be used to convert a unicode value to str, but there doesn't seem to be a direct way to convert all the values in a tuple from unicode to string.
A simple function like the one below can be used whenever you refer to a tuple in NLTK:
>>> def str_tuple(t, encoding="ascii"):
...     return tuple([i.encode(encoding) for i in t])
>>> str_tuple(nltk.corpus.gutenberg.fileids())
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
I guess you are using Python 2.6 or some other version before 3.0.
Early Python versions let users mix str and unicode in the same operations, converting between them implicitly based on the default encoding, which on most platforms is ASCII. That is probably what causes your problem. Here are two ways to solve it:
First, manually encode the Unicode strings before printing. For example:
>>> for name in nltk.corpus.gutenberg.fileids():
...     print(name.encode('utf-8'))
The other way is to UPDATE your Python to version 3.0+ (recommended); this problem is fixed in Python 3.0. Here is a link to the detailed description of the change:
https://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Hope this helps.

WordNet 3.0 Curse Words

I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 - 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1-3, but I can't find a way to distinguish curse words; how do I filter them out?
I know this would not be a definitive method of filtering out all curse words, and every keyword is first set to a "pending" state before being "approved" by a moderator anyway. But if I can get WordNet to filter out most of the curse words, it would make the moderator's job much easier.
It's strange, but the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in NLTK just returns empty lists:
from nltk.corpus import wordnet
a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print l.usage_domains()
[]
[]
...
As an alternative, you could try to filter words that have "obscene", "coarse", or "slang" in their synset's definition. But it is probably much easier to filter against a fixed list, as suggested before (like the one at noswearing.com).
For the 4th requirement, it would be better and more effective to collect a list of curse words and remove them in an iterative pass.
To achieve that you can check out this blog; I will summarize it here:
1. Load a swear-words text file from here.
2. Compare it with the text and remove any word that matches.
def remove_curse_words(text):
    # curse_words is the set of words loaded from the swear-words file
    return ' '.join(word for word in text.split() if word.lower() not in curse_words)
The output for 'Hey Bro Fuck you' would be:
Hey Bro you

Erlang binary pattern matching fails

Why does this issue a badmatch error? I can't figure out why this would fail:
<<IpAddr, ":*:*">> = <<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>.
You need to specify the size of IpAddr so that it can be pattern-matched:
1> <<IpAddr:28/binary, ":*:*">> = <<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>.
<<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>
2> IpAddr.
<<"2a01:e34:ee8b:c080:a542:ffaf">>
Pattern matching of a binary proceeds left-to-right, so IpAddr is matched before the segment that follows it is tried, and there is no back-tracking. A variable with no type or size, like the original IpAddr, matches exactly one byte, which leaves the rest of the binary unable to match ":*:*". See Bit Syntax Expressions and Bit Syntax for a proper description and more examples.
As an alternative to pattern matching here, you might consider the binary module. Two functions could be useful to you: binary:match/2,3 and binary:split/2,3. These search functions may fit your problem better.
As a last alternative you could try using regular expressions and the re module.

How to select rows that start with a digit in Rails?

I have a page that shows items in an index.
I'm able to get items by letter using the following:
scope :by_letter, lambda { |letter| where("name LIKE '#{letter}%'") }
But I can't figure out an elegant solution for names that start with a number (0-9).
How could I rewrite this or a separate scope that would let me search for names starting with a digit?
EDIT: I'm trying to get all rows that start with 0-9 in one go (not separately for each number).
This should work:
scope :starts_with_number, where("name REGEXP '^[0-9]'")
Jacob, try this slightly rewritten version of what you ended up with:
@letter_merchants = (0..9).map { |d| Merchant.by_letter(d) }
Please note that this is only meant to illustrate what an awesome language Ruby is, not how the problem should be solved (there would be too many database calls).
Here's how I ended up doing it:
@letter_merchants = []
(0..9).each do |digit|
  @letter_merchants |= Merchant.by_letter(digit)
end
One disadvantage of REGEXP is that it can't use indexes. However,
scope :starts_with_number, where("name >= '0' and name < ':'")
can use an index on name. It does rely on the characters 0-9 and : being in precisely that order, with nothing in between, which is the case in anything ASCII-like, including UTF-8, but not if you use EBCDIC or anything crazy like that.
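That ordering is easy to check; a quick Python sketch (just to illustrate the code points and the comparison the SQL condition performs):

```python
# ':' (58) comes right after '9' (57) in ASCII, so the half-open range
# '0' <= name < ':' covers exactly the names starting with a digit.
print(ord('0'), ord('9'), ord(':'))  # -> 48 57 58

def starts_with_digit(name):
    # Same lexicographic comparison the SQL condition performs on `name`.
    return '0' <= name < ':'

print(starts_with_digit("7-Eleven"))  # -> True
print(starts_with_digit("Acme"))      # -> False
```

Because the comparison is a plain range check on the column value, the database can answer it with an index range scan instead of scanning every row.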