How to separate an address string mashed together in MySQL - mysql

I have an address string in MySQL that has been mashed together at the source. I think it should be possible to use a regular expression or some other method to separate the string into usable parts in MySQL, but I don't know how this could be achieved.
Basically each string looks something like these examples (I have added a marker to the top to show what each bit is):
<-------------><-------><-><-->
123 Fake StreetRESERVOIRVIC3001
<-----------------><--------------------><------><-><-->
Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001
<-------------------><-------------------><--- ><><-->
Grange Police StationShop 1 Fairytale LaneGRANGEWA8001
The address is supposed to be broken up into one or two lines of address information, then suburb, state and postcode. I'm in Australia, so the state will be one of NSW, VIC, QLD, WA, SA, NT or ACT, and the postcode will always be a 4-digit number at the very end.
The possible ways to break it up are: the suburb will always be in capitals, the state and postcode will be predictable within the last 6 or 7 characters (depending on the state), and the first two lines of address information will be separated by a change in case with no space character in between.
I have some 100,000 records like this, so going through them by hand would be very time-consuming. Any help on a way of doing this programmatically would be much appreciated.

With no spaces? Most gross...
MySQL doesn't have the tools to deal with that, so you'll have to access the database with an external program. I tend to use Perl for manipulations like this.
Start from the end and work backwards... we know the last four should be digits, and the letters preceding that one of 7 options. Use that knowledge and you'll be down 2 fields and 6-7 characters.
It looks like your example now has a town in all capital letters at the end... Parse that out, and it should match up with the state and postcode. I'm certain you can find a database of postcodes online within a few minutes.
With the name and street address remaining, that will have some variability to it, and I wish you a bit of luck there. You may have a head-start with being able to concentrate on the lack of a space between a lowercase and capital, or a letter and number as a breaking point.
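As a rough illustration of the "start from the end and work backwards" idea, here is a minimal Python sketch. It assumes every record ends with a 4-digit postcode preceded by one of the seven state codes, that the suburb is a run of capitals (possibly with spaces), and that the two address lines meet where a lowercase letter runs straight into a capital or a digit. The function name split_address and the returned dictionary layout are made up for this example, and names like "McDonald" would break the last assumption.

```python
import re

# Three-letter codes are listed first; no three-letter code ends in a
# two-letter one, so the order is only a precaution.
STATES = ("NSW", "VIC", "QLD", "ACT", "WA", "SA", "NT")


def split_address(raw):
    """Split a mashed address into lines, suburb, state and postcode.

    Returns None when the string does not fit the assumed layout.
    """
    postcode, body = raw[-4:], raw[:-4]
    if not (len(postcode) == 4 and postcode.isdigit()):
        return None
    state = next((s for s in STATES if body.endswith(s)), None)
    if state is None:
        return None
    body = body[: -len(state)]
    # Suburb: the run of capitals (and spaces) at the very end of what is left.
    suburb_match = re.search(r"[A-Z][A-Z ]*$", body)
    if suburb_match is None:
        return None
    suburb = suburb_match.group()
    rest = body[: suburb_match.start()]
    # Address lines: break where a lowercase letter is directly followed by
    # a capital or a digit, with no space in between.
    boundary = re.search(r"(?<=[a-z])(?=[A-Z0-9])", rest)
    if boundary:
        lines = [rest[: boundary.start()], rest[boundary.start():]]
    else:
        lines = [rest]
    return {"lines": lines, "suburb": suburb, "state": state, "postcode": postcode}
```

Rows that come back as None, or with an empty address line, are probably the ones worth checking by hand.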

Challenge accepted. I'll even throw in some basic punctuation to allow for "101 St. Mark's St." and the like.
/^(([\w\'\.](?=[a-z \'\.])| )+[a-z\'\.])?(([\w\'\.](?=[a-z \d\'\.])| )+[a-z\.\'])([A-Z]+)(NSW|VIC|QLD|WA|SA|NT|ACT)(\d{4})/
Could probably use a little more clean-up, but it should work in any language which supports basic regex with lookahead (some implementations, like JavaScript's and (I think) Ruby's, support lookahead, but not lookbehind). (That, and this puzzle kept me up well past my bed time.) At the very least, it worked on the three examples you provided.
By the way, 2problems.com is a great site for quickly testing regular expressions. It's what I used to work this puzzle out. The guy who built it must have been a real genius. (koff koff)
Rubular is another good option, though since it works by making Ajax calls to a Ruby script behind-the-scenes, it's a bit slower. It does have the nice feature of being able to link to entered patterns and haystacks, though; here's this pattern on Rubular. The 2problems guy really should get around to implementing something like that some day.

Related

How do I fill a list with all the world's phone prefixes in Dart on Flutter?

I'd like to implement an app with Dart on Flutter. I'm on my first approach with this new language and for the first time I meet this problem.
My app must necessarily work with a mobile phone number. I would like to block the entry of phone numbers without a prefix and, likewise, the typing of a number with more digits than expected. For example, in Italy there are at most 10 digits after +39 (0039). I thought I'd probably separate the two parts to make it easier to distinguish between lengths (one field where you select the country and another that allows you to enter the number).
Is there, as you know, a JSON that contains exactly: - the prefix of each state, - the length of the telephone number (excluding prefix), - name, *flag and *sigla (Italy, green-white-red, IT)?
Sifting through the web a little bit, I saw that flutter should actually provide already in itself with .demoTextFieldEnterITPhoneNumber, through GalleryLocalizations to do such a job, but I didn't quite understand if it bothers to control a particular regular expression for each nation or not. Could I copy and paste a number for example? Will nationality be automatically recognized?
In the end I suspect that such deep validation is not possible, so I would just need this: two fields, one with a list for choosing the prefix, and one where the user types the number, with its allowed length set by the selected prefix. For a copied and pasted number, I would check whether the string also contains a +prefix.
Thank you very much, I need a lot, since my app will mainly revolve around a correct value for this field. :)
Try using the international_phone_input or country_code_picker flutter package. They are quite easy to implement

Extract adjacent word? (Names, streets, creeks, rivers)

Hi, I am looking for a function that I can run over a massive list of paragraphs to extract the word preceding ‘creek’, so that the creek names can be isolated.
For example a given paragraph might read:
“The site was located up stream three miles from the bridge along Clark Creek.”
The ideal output would be simply
Clark Creek
It would have to be something that looks up the word ‘creek’ as a criterion and extracts the preceding word; even just ‘Clark’ would work for me.
I have been playing around with the RSQLite package and gsub, but no luck so far… I am sure this is a common procedure.
If you're extracting actual addresses, there are services which do this intelligently and can even verify the results: http://smartystreets.com/products/liveaddress-api/extract (To be fair, you should know I helped develop that, although I no longer work there.)
For place names, assuming the place is just one word, you could try a simple regex:
/(?<=\s)(\S+\s+(Creek|Street|River))/ig
Granted, I've never used RSQLite or gsub, but I imagine something like this would do the trick.
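For what it's worth, here is how a pattern along those lines might look in Python rather than R (a sketch only, swapping the language and tweaking the regex slightly). It uses (?<!\S) instead of (?<=\s) so that a name at the very start of the text is also picked up, and a trailing \b so that "Creekside" is not matched. The function name extract_places is invented for this example, and it still assumes the place name is a single word.

```python
import re

# One word, some whitespace, then one of the feature words.
PLACE = re.compile(r"(?<!\S)(\S+\s+(?:Creek|Street|River))\b", re.IGNORECASE)


def extract_places(text):
    """Return every 'Word Creek' / 'Word Street' / 'Word River' found in text."""
    return PLACE.findall(text)
```

Running it over each paragraph in turn gives a list of candidate names per paragraph; anything with a two-word name ("Little Bear Creek") would come back truncated to the last word.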

Banned words checking algo

I am building a text chat system. I want to add the ability to check for banned words/phrases.
The only technique I can think of, and can't believe it could possibly be the best approach is to do a FOR loop through all the words and search for matches in the text. This seems like it would be unbelievably slow once lots of words are added.
I'm using AS3, but an answer in most any language would probably be useful.
take care,
lee
Use an AS3 Dictionary or a dict in Python and just check whether each word is in the dict. I can't see any way to avoid going over all the words.
Consider concatenating all the entries in your Dictionary into a single RegExp, with which you have to parse the text only once. I've done some testing, and it's going to be way faster than replacing word for word.
function censorWithDictionary ( dict:Dictionary, text:String ) : String {
    var reg : String = "";
    for (var key:Object in dict)
    {
        reg += reg=="" ? "" : "|"; // add an "or" for multiple search words
        reg += "\\b"+dict[key]+"\\b"; // only whole words
    }
    var regExp : RegExp = new RegExp ( reg, "gi" );
    return text.replace ( regExp, "----" );
}
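Since the question says an answer in most any language would be useful, here is roughly the same single-regex idea as a Python sketch. The names build_censor and censor are made up for this example. One difference from the AS3 version is that each word is passed through re.escape, on the assumption that banned entries might contain characters that mean something in a regex.

```python
import re


def build_censor(banned_words):
    """Compile one case-insensitive pattern matching any banned word as a whole word."""
    if not banned_words:
        raise ValueError("need at least one banned word")
    # Longest first, so longer entries win over shorter ones sharing a prefix.
    ordered = sorted(banned_words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def censor(text, pattern, mask="----"):
    """Replace every match of the compiled pattern with the mask."""
    return pattern.sub(mask, text)
```

The pattern only needs to be compiled once, when the word list changes, and can then be reused for every message.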
I had a similar problem - we run a gaming site and wanted to introduce a chat system which was not manually moderated. We went the "banned word" route and it's working really well.
I just counted them and we now have a list of (just) 79 banned words which originated from something I found on-line to which we have added words over time when chat messages crept through.
The way we check things is that we collapse each chat message by removing all spaces and non-alphabetic characters, and then search for banned words in what's left.
The key decisions we made are:
1. Don't tell people why you rejected their messages
2. Don't let people post chat until you trust them a bit (on our site they have to have played 3 games)
3. After 5 "bad" messages we automatically block you
4. We email a report out daily with all the chat which got through, which we scan through
5. We allow other users to complain about posted messages - if that happens the message is automatically removed so we can check it later.
1, 3 and 5 hardly ever happen now, and it works wonderfully, even though messages like
"I wish it was hot!"
are sometimes rejected (the clue is the "sh" at the end of "wish" followed by "it"), but even that doesn't happen often.
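The squash-then-search check described in this answer could look something like the following Python sketch. It is not the site's actual code; the function names and the mild stand-in banned word are invented for the example. It also shows the kind of false positive mentioned above, where a banned word appears only because two innocent words were pushed together.

```python
import re


def normalise(message):
    """Lowercase the message and drop everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", message.lower())


def find_banned(message, banned):
    """Return the banned words that occur anywhere in the squashed message."""
    squashed = normalise(message)
    return sorted(word for word in banned if word in squashed)
```

With banned = {"darn"}, the message "d a r n it" is caught as intended, but "The radar now works" is rejected as well, because "radar now" squashes to "radarnow".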
This is more a comment than an answer, but comments are limited in length and there're big issues here.
I believe you are fundamentally asking the wrong question!
Certainly dictionaries and blacklists would highlight words or phrases that you want to ban, but would that list be acceptable to users of your system? Would there be text that users of your system find offensive but you do not? Who decides?
For example, would people living here have trouble or indeed people living here. What if you supported this football/soccer team. This person probably never visits the UK.
Then you get into the issue of anagrams and slang. FCUK is a high street brand in the UK (and elsewhere I'm sure). And then there's pr0n (no link!) or NAMBLA.
The real question is - How do I stop people using the system from using language that is generally unacceptable? And that's more a design / social engineering problem than a programming problem. I don't think this site has word / phrase filtering and yet there's nothing here that would cause offense to anyone.
Here's an idea - let your users decide what is acceptable! Use a reputation based system. Allow users to vote up users who behave and vote down users that cause offense (with the option of allowing users to give feedback on the vote to give them a chance to mend their ways) and then have an option to filter out users with low / negative reputations.

How to search for a person's name in a text? (heuristic)

I have a huge list of person's full names that I must search in a huge text.
Only part of the name may appear in the text, and it may be misspelled, mistyped or abbreviated. The text has no tokens, so I don't know where a person's name starts in the text. And I don't know whether the name will appear in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
...The candidate Barack Obama was elected the president of the United States... (incomplete)
...The candidate Barack Hussein was elected the president of the United States... (incomplete)
...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
...The candidate John McCain lost the election... (no occurrences of Obama name)
Certainly there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split everything on spaces removing special characters (commas, periods, etc). Then use something like soundex to handle misspellings. Or you could go with something like lucene if you need to search a lot of documents.
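A small Python sketch of the split-then-fuzzy-match idea follows. Python's standard library has no Soundex, so this swaps in difflib's similarity ratio as the fuzzy comparison; that is a different technique from Soundex (it compares spelling rather than sound), chosen only to keep the example self-contained. The function names and the 0.75 cutoff are arbitrary choices for illustration.

```python
import difflib
import re


def tokenise(text):
    """Lowercase the text and split it into runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def name_hits(full_name, text, cutoff=0.75):
    """Map each part of full_name to its closest token in text, if close enough."""
    tokens = tokenise(text)
    hits = {}
    for part in full_name.lower().split():
        close = difflib.get_close_matches(part, tokens, n=1, cutoff=cutoff)
        if close:
            hits[part] = close[0]
    return hits
```

A name could then be scored by how many of its parts were found. This handles the incomplete and misspelled cases, but not abbreviations like "Barack H. O.", which would need a separate rule for initials.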
What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns then it will be easy; if there are a decent number of other proper nouns mixed in then it will be more difficult. If you are writing in Java look at OpenNLP, or for C# SharpNLP. After extracting all the proper nouns you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to suck up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.
Using an NLP solution is the only real robust technique I have seen to similar problems. You may still have issues since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non names.
At first blush I'm going for an indexing server. lucene, FAST or Microsoft Indexing Server.
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
The best way I can think of would be to define grammars in python NLTK. However it can get quite complicated for what you want.
I'd personally go for regular expressions while generating a list of permutations with some programming.
Both SQL Server and Oracle have built-in SOUNDEX Functions.
Additionally there is a built-in function for SQL Server called DIFFERENCE, that can be used.
pure old regular expression scripting will do the job.
use Ruby, it's quite fast. read lines and match words.
cheers

How can I program a simple chat bot AI?

I want to build a bot that asks someone a few simple questions and branches based on the answer. I realize parsing meaning from the human responses will be challenging, but how do you setup the program to deal with the "state" of the conversation?
It will be a one-to-one conversation between a human and the bot.
You probably want to look into Markov Chains as the basics for the bot AI. I wrote something a long time ago (the code to which I'm not proud of at all, and needs some mods to run on Python > 1.5) that may be a useful starting place for you: http://sourceforge.net/projects/benzo/
EDIT: Here's a minimal example in Python of a Markov Chain that accepts input from stdin and outputs text based on the probabilities of words succeeding one another in the input. It's optimized for IRC-style chat logs, but running any decent-sized text through it should demonstrate the concepts:
import random, sys
NONWORD = "\n"
STARTKEY = NONWORD, NONWORD
MAXGEN = 1000
class MarkovChainer(object):
    def __init__(self):
        # maps a pair of consecutive words to the list of words seen after that pair
        self.state = dict()
    def input(self, input):
        word1, word2 = STARTKEY
        for word3 in input.split():
            self.state.setdefault((word1, word2), list()).append(word3)
            word1, word2 = word2, word3
        # record the end of the input so that output() has a place to stop
        self.state.setdefault((word1, word2), list()).append(NONWORD)
    def output(self):
        output = list()
        word1, word2 = STARTKEY
        for i in range(MAXGEN):
            word3 = random.choice(self.state[(word1, word2)])
            if word3 == NONWORD: break
            output.append(word3)
            word1, word2 = word2, word3
        return " ".join(output)
if __name__ == "__main__":
    c = MarkovChainer()
    c.input(sys.stdin.read())
    print(c.output())
It's pretty easy from here to plug in persistence and an IRC library and have the basis of the type of bot you're talking about.
Folks have mentioned already that statefulness isn't a big component of typical chatbots:
a pure Markov implementation may express a very loose sort of state if it is growing its lexicon and table in real time (earlier utterances by the human interlocutor may get regurgitated by chance later in the conversation), but the Markov model doesn't have any inherent mechanism for selecting or producing such responses.
a parsing-based bot (e.g. ELIZA) generally attempts to respond to (some of the) semantic content of the most recent input from the user without significant regard for prior exchanges.
That said, you certainly can add some amount of state to a chatbot, regardless of the input-parsing and statement-synthesis model you're using. How to do that depends a lot on what you want to accomplish with your statefulness, and that's not really clear from your question. A couple general ideas, however:
Create a keyword stack. As your human offers input, parse out keywords from their statements/questions and throw those keywords onto a stack of some sort. When your chatbot fails to come up with something compelling to respond to in the most recent input—or, perhaps, just at random, to mix things up—go back to your stack, grab a previous keyword, and use that to seed your next synthesis. For bonus points, have the bot explicitly acknowledge that it's going back to a previous subject, e.g. "Wait, HUMAN, earlier you mentioned foo. [Sentence seeded by foo]".
Build RPG-like dialogue logic into the bot. As you're parsing human input, toggle flags for specific conversational prompts or content from the user and conditionally alter what the chatbot can talk about, or how it communicates. For example, a chatbot bristling (or scolding, or laughing) at foul language is fairly common; a chatbot that will get het up, and conditionally remain so until apologized to, would be an interesting stateful variation on this. Switch output to ALL CAPS, throw in confrontational rhetoric or demands or sobbing, etc.
Can you clarify a little what you want the state to help you accomplish?
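To make the two ideas above a bit more concrete, here is a minimal Python sketch of a state object that keeps a keyword stack and a single mood flag. Everything in it is an assumption made for illustration: the class name, the tiny stopword and rude-word sets, the "longer than three letters" keyword rule, and the choice of "sorry" as the apology trigger.

```python
STOPWORDS = {"the", "and", "you", "are", "that", "this", "with"}
RUDE = {"stupid"}  # stand-in list, purely for illustration


class ConversationState:
    def __init__(self):
        self.keywords = []      # stack of topics the human has mentioned
        self.offended = False   # flag that changes how the bot speaks

    def observe(self, utterance):
        """Update the stack and the flag from one line of human input."""
        words = [w.strip(".,!?").lower() for w in utterance.split()]
        if "sorry" in words:
            self.offended = False
        if any(w in RUDE for w in words):
            self.offended = True
        self.keywords.extend(
            w for w in words if len(w) > 3 and w not in STOPWORDS
        )

    def recall(self):
        """Pop the most recent keyword to seed a reply, or None if there is none."""
        return self.keywords.pop() if self.keywords else None

    def style(self, reply):
        """Shout while offended, otherwise return the reply unchanged."""
        return reply.upper() if self.offended else reply
```

Whatever generates the reply text (Markov chain, ELIZA-style rules) would call observe() on each input, fall back to recall() when it has nothing to say, and pass its output through style().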
Imagine a neural network with parsing capabilities in each node or neuron. Depending on rules and parsing results, neurons fire. If certain neurons fire, you get a good idea about topic and semantic of the question and therefore can give a good answer.
Memory is done by keeping topics talked about in a session, adding to the firing for the next question, and therefore guiding the selection process of possible answers at the end.
Keep your rules and patterns in a knowledge base, but compile them into memory at start time, with a neuron per rule. You can engineer synapses using something like listeners or event functions.
I think you can look at the code for Kooky, and IIRC it also uses Markov Chains.
Also check out the kooky quotes, they were featured on Coding Horror not long ago and some are hilarious.
I think a good way to start this project would be a database of questions, organized as a tree, with one or more questions in every node.
These questions should be answerable with "yes" or "no".
When the bot starts asking, it can begin with any question in your database that is marked as a start question. The answer determines the path to the next node in the tree.
Edit: Here is a simple one written in Ruby you can start with: rubyBOT
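A yes/no question tree like that can be sketched in a few lines of Python. Here the "state" of the conversation is simply the id of the current node. The node names and questions are invented for the example, and the tree is a plain dict rather than a database table.

```python
# Each node holds a question plus the id of the next node for each answer.
# None means the conversation ends there.
TREE = {
    "start": {"question": "Do you like music?", "yes": "instrument", "no": "sports"},
    "instrument": {"question": "Do you play an instrument?", "yes": None, "no": None},
    "sports": {"question": "Do you follow any sports?", "yes": None, "no": None},
}


def run(tree, answers, start="start"):
    """Walk the tree with an iterable of 'yes'/'no' answers.

    Returns the list of questions asked, in order. Stops at a leaf, when the
    answers run out, or when an answer is neither 'yes' nor 'no'.
    """
    asked = []
    node = start
    answers = iter(answers)
    while node is not None:
        asked.append(tree[node]["question"])
        answer = next(answers, None)
        if answer not in ("yes", "no"):
            break
        node = tree[node][answer]
    return asked
```

In a live bot the answers would come from the user one at a time instead of from a list, and the current node id would be stored per conversation.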
naive chatbot program. No parsing, no cleverness, just a training file and output.
It first trains itself on a text and then later uses the data from that training to generate responses to the interlocutor’s input. The training process creates a dictionary where each key is a word and the value is a list of all the words that follow that word sequentially anywhere in the training text. If a word features more than once in this list, it is correspondingly more likely to be chosen by the bot; no need for probabilistic stuff, just do it with a list.
The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn and carrying on iteratively until it thinks it’s said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn’t very realistic but I hereby challenge anyone to do better in 71 lines of code !! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely be closer to several hundred lines, I simplified hugely by just trying to think of the simplest rule to give the computer a mere stab at having something to say.
Its responses are rather impressionistic to say the least ! Also you have to put what you say in single quotes.
I used War and Peace for my “corpus” which took a couple of hours for the training run, use a shorter file if you are impatient…
here is the trainer
#lukebot-trainer.py
import pickle
# read the training text into one long list of words
b=open('war&peace.txt')
text=[]
for line in b:
    for word in line.split():
        text.append (word)
b.close()
textset=list(set(text))
follow={}
# for every distinct word, scan the whole text for its successors
for l in range(len(textset)):
    working=[]
    check=textset[l]
    for w in range(len(text)-1):
        # words ending in punctuation get no successors, so a response can stop there
        if check==text[w] and text[w][-1] not in '(),.?!':
            working.append(str(text[w+1]))
    follow[check]=working
a=open('lexicon-luke','wb')
pickle.dump(follow,a,2)
a.close()
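The trainer above rescans the whole text once per distinct word, which is presumably why the training run takes hours. As a hedged alternative (not the blog author's code), the same successor table can be built in a single pass over the word list. The function name build_successors is made up for this sketch, and it returns the dictionary instead of pickling it.

```python
def build_successors(words, stop_chars="(),.?!"):
    """Build {word: [words seen right after it]} in one pass.

    Words ending in one of stop_chars get an entry but never gain
    successors, matching the rule used by the trainer.
    """
    follow = {}
    for current, following in zip(words, words[1:]):
        successors = follow.setdefault(current, [])
        if current[-1] not in stop_chars:
            successors.append(following)
    if words:
        # the final word has nothing after it, but still gets an entry
        follow.setdefault(words[-1], [])
    return follow
```

The resulting dictionary could be pickled in the same way as before and loaded by the bot unchanged.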
here is the bot
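#lukebot.py
import pickle,random
a=open('lexicon-luke','rb')
successorlist=pickle.load(a)
a.close()
def nextword(a):
    # fall back to 'the' for unknown words and for words with no recorded successors
    if a in successorlist and successorlist[a]:
        return random.choice(successorlist[a])
    else:
        return 'the'
speech=''
while speech!='quit':
    # Python 3: input() returns the typed line as-is (use raw_input on Python 2)
    speech=input('>')
    if not speech.split():
        continue  # nothing typed, so ask again
    s=random.choice(speech.split())
    response=''
    while True:
        neword=nextword(s)
        response+=' '+neword
        s=neword
        # stop once the chosen word ends in punctuation
        if neword[-1] in ',?!.':
            break
    print(response)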
You tend to get an uncanny feeling when it says something that seems partially to make sense.
I would suggest looking at Bayesian probabilities. Then just monitor the chat room for a period of time to create your probability tree.
I'm not sure this is what you're looking for, but there's an old program called ELIZA which could hold a conversation by taking what you said and spitting it back at you after performing some simple textual transformations.
If I remember correctly, many people were convinced that they were "talking" to a real person and had long elaborate conversations with it.
If you're just dabbling, I believe Pidgin allows you to script chat-style behavior. Part of the framework probably tracks the state of who sent which message when, and you'd want to keep a log of your bot's internal state for each of the last N messages. Future state decisions could be hardcoded based on inspection of previous states and the content of the most recent few messages. Or you could do something like the Markov chains discussed and use it both for parsing and generating.
If you do not require a learning bot, using AIML (http://www.aiml.net/) will most likely produce the result you want, at least with respect to the bot parsing input and answering based on it.
You would reuse or create "brains" made of XML (in the AIML-format) and parse/run them in a program (parser). There are parsers made in several different languages to choose from, and as far as I can tell the code seems to be open source in most cases.
You can use "ChatterBot", and host it locally using "flask-chatterbot-master".
Links:
[ChatterBot Installation]
https://chatterbot.readthedocs.io/en/stable/setup.html
[Host Locally using - flask-chatterbot-master]: https://github.com/chamkank/flask-chatterbot
Cheers,
Ratnakar