Extract adjacent word? (Names, streets, creeks, rivers) - extract

Extract adjacent word? (Names, streets, creeks, rivers)
Hi I am looking for a function that I can run through a massive list of paragraphs to extract the word proceeding ‘creek’ such that the creek names could be isolated.
For example a given paragraph might read:
“The site was located up stream three miles from the bridge along Clark Creek.”
The ideal output would be simply
Clark Creek
It would have to be something that looks up the word ‘creek’ as a criteria and extracts the preceding word, even just ‘Clark’ would work for me.
I have been playing around with the RQSlite package & gsub, but no luck so far… I am sure this is a common procedure.

If you're extracting actual addresses, there are services which do this intelligently and can even verify the results: http://smartystreets.com/products/liveaddress-api/extract (To be fair, you should know I helped develop that, although I no longer work there.)
For place names, assuming the place is just one word, you could try a simple regex:
/(?<=\s)(\S+\s+(Creek|Street|River))/ig
Granted, I've never used RQSLite or gsub, but I imagine something like this would do the trick.

Related

How to realize a context search based on synomyns?

Lets say an internet user searches for "trouble with gmail".
How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?
I don't like to manually add these linkings between different keywords so the links between "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completly unknown. They should be found in an automated process.
Approach to solve the problem
I provide a synonyms/thesaurus plattform like thesaurus.com, synonym.com, etc. or use an synomys database/api and use this user generated input for my queries on a third website.
But this won't cover all synonyms like the "gmail"-example.
Which other options do I have? Maybe something based on the given data and logged search phrases of the past?
You have to think of it ignoring the language.
When you show a baby the same thing using two words, he understand that those words are synonym. He might not have understood perfectly, but he will learn when this is repeated.
You type "problem with gmail".
Two choices:
Your search give results: you click on one item.
The system identify that this item was already clicked before when searching for "google mail bug". That's a match, and we will call it a "relative search".
Your search give poor results:
We will search in our history for a matching search:
We propose : "do you mean trouble with yahoo mail? yes/no". You click no, that's a "no match". And we might propose others suggestions like a list of known "relative search" or a list of might be related playing with both full text search in our history and levenshtein distance.
When a term is sufficiently scored to be considered as a "synonym", you can consider it is. Algorithm might be wrong, but in fact it depends on what you really expect.
If i search "sending a message is difficult with google", and "gmail issue", nothing is synonym, but search are relatively the same. This is more important to me than true synonyms.
And if you really want to get the synonym, i would do it in a second phase comparing words inside "relative searches" and would include a manual check.
I think google algorithm use synonym mainly to highlight search terms in page result, but not to do an actual search where they use the relative search terms, except in known situations, as the result for "gmail" and "google mail" are not the same.
But if you identify 10 relative searches for "gmail" which all contains "google mail", that will be a good start point to guess they are synonyms.
This is a bit long for a comment.
What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)
The work-around would be to modify queries before sending them to the database. That is, parse the query into words, then look up all the synonyms for those words, and reconstruct the query. This works better for the natural language searches than the boolean searches (which require more careful reconstruction).
Pseudo-code for getting the final word list with synonyms would be something like:
select #finalwords = concat_ws(' ', group_concat(synonyms separator ' ') )
from synonyms s
where find_in_set(s.baseword, #words) > 0;
Seems to me that you have two problems on your hands:
Lemmatisation, which breaks words down into their lemma, sometimes called the headword or root word. This is more difficult than Stemming, as it doesn't just chop suffixes off of words, but tries to find a true root, e.g. "are" => "be". This is something that is often done programatically, although it appears to be a complex task. Here is an online example of text being lemmatized: http://lemmatise.ijs.si/Services
Searching for synonymous lemmas. This is a very complex problem. One approach to this that I have heard of is modifying the lemmatisation engine to return more than one lemma for a given set of words, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programatically.
So, you may consider a strategy whereby you lemmatise the text to be searched for, then pass each lemma out to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
I think you have bitten off a very large problem for yourself.
If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.
Seeing your revised question, what about using a public API?
http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

How to seperate an address string mashed together in MySQL

I have an address string in MySQL that has been mashed together from the source. I think it is possible to use a regular expression or some other method to seperate the string into usable parts in MySQL, but I am not aware of how this could be acheived.
Basically each string looks something like these examples (I have added a marker to the top to show what each bit is):
<-------------><-------><-><-->
123 Fake StreetRESERVOIRVIC3001
<-----------------><--------------------><------><-><-->
Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001
<-------------------><-------------------><--- ><><-->
Grange Police StationShop 1 Fairytale LaneGRANGEWA8001
The address supposed to be broken up into optionally two lines of address information, suburb, state and post code. I'm in Australia so the state will be either NSW,VIC,QLD,WA,SA,NT or ACT and the postcode will always be a 4 digit number at the very end.
The possible ways to break it up are that the suburb will always be capitalised, the state and postcode will be predicatable within the last 6 or 7 characters (depending on state) and the first two lines of address information will be broken up by a change in case with no space character in between.
I have some 100,000 records like this, so to go through and do it by hand would be very time consuming. Any help on a way of doing this programatically would be much appreciated.
With no spaces? Most gross...
MySQL doesn't have the tools to deal with that, so you'll have to access the database with an external program. I tend to use Perl for manipulations like this.
Start from the end and work backwards... we know the last four should be digits, and the letters preceding that one of 7 options. Use that knowledge and you'll be down 2 fields and 6-7 characters.
It looks like your example now has a town in all capital letters at the end... Parse out that, and it should match to the state and area code. I'm certain you can find a database of zip codes within some minutes online.
With the name and street address remaining, that will have some variability to it, and I wish you a bit of luck there. You may have a head-start with being able to concentrate on the lack of a space between a lowercase and capital, or a letter and number as a breaking point.
Challenge accepted. I'll even throw in some basic punctuation to allow for "101 St. Mark's St." and the like.
/^(([\w\'\.](?=[a-z \'\.])| )+[a-z\'\.])?(([\w\'\.](?=[a-z \d\'\.])| )+[a-z\.\'])([A-Z]+)(NSW|VIC|QLD|WA|SA|NT|ACT)(\d{4})/
Could probably use a little more clean-up, but it should work in any language which supports basic regex with lookahead (some implementations, like JavaScript's and (I think) Ruby's, support lookahead, but not lookbehind). (That, and this puzzle kept me up well past my bed time.) At the very least, it worked on the three examples you provided.
By the way, 2problems.com is a great site for quickly testing regular expressions. It's what I used to work this puzzle out. The guy who built it must have been a real genius. (koff koff)
Rubular is another good option, though since it works by making Ajax calls to a Ruby script behind-the-scenes, it's a bit slower. It does have the nice feature of being able to link to entered patterns and haystacks, though; here's this pattern on Rubular. The 2problems guy really should get around to implementing something like that some day.

designing keyword/tag input

I am currently working on a form and am stuck on a keyword/tag input field (think youtube.. or even stackoverflow). Anyway, I thought it was pretty logical to use ',' to separate the tags... which would allow users to use combinations of words as tags using ' '. However my boss wants it separated with just ' '. Which worries me as I think we will end up with millions of 'The' tags. Personally I like the tag system used here on stackoverflow... but its not up to me.
So far my only idea was to have a list of common words that would automatically be removed.... problem being is its an international site, so not much good making a list of English words.
Any ideas?
As the user is typing in the keywords show a preview of the matched keywords right below. This might help clarify to the user that each word separated by spaces is a distinct keyword and that duplicates are eliminated.
Enter keywords: [[[ the house the boat the cat the coat ]]]
Keyword preview: [the] [house] [boat] [cat] [coat]
*** Warning: Duplicate keywords found. To combine multiple words use a "-"
If you really can't convince your boss otherwise, I can only think of your dictionary idea coupled with scheduled maintenance. So, if you notice a word repeatedly being used, such as an equivalent of 'the', then you can blacklist it and remove it from current tags. Also, you could make it clear that words such as 'the' are not required and other words should be hyphenated (as is on SO).

How to search for a person's name in a text? (heuristic)

I have a huge list of person's full names that I must search in a huge text.
Only part of the name may appear in the text. And it is possible to be misspelled, misstyped or abreviated. The text has no tokens, so I don't know where a person name starts in the text. And I don't if know if the name will appear or not in the text.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
...The candidate Barack Obama was elected the president of the United States... (incomplete)
...The candidate Barack Hussein was elected the president of the United States... (incomplete)
...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
...The candidate Barack OVama was elected the president of the United States... (misstyped, B is next to V)
...The candidate John McCain lost the the election... (no occurrences of Obama name)
Certanily there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split everything on spaces removing special characters (commas, periods, etc). Then use something like soundex to handle misspellings. Or you could go with something like lucene if you need to search a lot of documents.
What you want is a Natural Lanuage Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns than it will be easy if there are a decent number of other proper nouns mixed in than it will be more difficult. If you are writing in JAVA look at OpenNLP or C# SharpNLP. After extracting all the proper nouns you could probably use Wordnet to remove most non-name proper nouns. You may be able to use wordnet to identify subparts of names like "John" and then search the neighboring tokens to suck up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.
Using an NLP solution is the only real robust technique I have seen to similar problems. You may still have issues since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non names.
At first blush I'm going for an indexing server. lucene, FAST or Microsoft Indexing Server.
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
The best way I can think of would be to define grammars in python NLTK. However it can get quite complicated for what you want.
I'd personnaly go for regular expressions while generating a list of permutations with some programming.
Both SQL Server and Oracle have built-in SOUNDEX Functions.
Additionally there is a built-in function for SQL Server called DIFFERENCE, that can be used.
pure old regular expression scripting will do the job.
use Ruby, it's quite fast. read lines and match words.
cheers

Recognize Missing Space

How can I recognize when a user has missed a space when entering a search term? For example, if the user enters "usbcable", I want to search for "usb cable". I'm doing a REGEX search in MySQL to match full words.
I have a table with every term used in a search, so I know that "usb" and "cable" are valid terms. Is there a way to construct a WHERE clause that will give me all the rows where the term matches part of the string?
Something like this:
SELECT st.term
FROM SearchTerms st
WHERE 'usbcable' LIKE '%' + st.term + '%'
Or any other ideas?
Text Segmentation is a part of Natural Language Processing, and is what you're looking for in this specific example. It's used in search engines and spell checkers, so you might have some luck with example source code looking at open source spell checkers and search engines.
Spell checking might be the correct paradigm to consider anyway, as you first need to know whether it's a legitimate word or not before trying to pry it apart.
-Adam
Posted in the comments, but I thought it important to bring up as an answer:
Does that query not work? – Simon Buchan
Followed by:
Well, I should've tested it before I
posted. That query does not work, but
with a CONCAT it does, like so: WHERE
'usbcable' LIKE Concat('%', st.term,
'%'). I think this is the simplest,
and most relevant (specific to my
site), way to go. – arnie0674
Certainly far easier than text segmentation for this application...
-Adam