How to clean data having spelling mistakes - data-munging

I have data as below:
Carepnter
Carpentor
Labourer
Labor
Labour
Housewife
House Wife
housewife.
I want to clean data and rectify the spelling mistakes but not manually because its a huge data. Due to spelling mistakes these 50/60 occupations have become around 2000.

You would have to find strings that are close to the actual occupation, for instance carpenter.
Then you can try to find the closest n-matches to it.
Another question on here also dealt with finding similar strings (Python: find closest string (from a list) to another string) and the solutions from the answers for you could be either:
difflib.get_close_matches
Spelling corrector

Related

How to create RegEx with SubMatches of the same Match that capture 2 different types of output?

I'm trying to get my Jira data via JSON REST API into Excel, i.e. using VBA, and I'm parsing JSON output using RegEx. There are plenty of useful tutorials on the web, and after a couple of days I do have more or less working solution I'm happy with, except one minor obstacle. Long story short:
Among many issue fields I need friendly Assignee name, but some issues in my projects may be Unassigned, that obviously results in TWO VERY different kinds of JSON output:
Unassigned issue:
..."assignee":null,"updated"...
Assigned issue:
"assignee":{
"self":...
<Lots of NOT needed fields here>
...
},
"displayName":"Doe, John", <-- That's what I need, name only part
"active":...
<Lots of NOT needed fields here>
...
},
"updated"...
Well, I suppose that something like:
"assignee".*?"displayName":"(.*?)"|"assignee":(.*?),"updated"
will handle the job by producing TWO possible Matches, but... Is there a way to create RegEx where ANY of output options will result in SubMatches of ONE Match?
I'm a total newbie to RegEx, so sorry if the wording of my question is silly due to incorrectly used terms. Anyway, I hope the sample part is more or less clear, and I'll be extremely grateful for useful suggestions.
After an hour of tryouts on regex101 I ended up with the following RegEx:
"assignee":(null|.*?"displayName":"(.*?)","active")
Probably it's ugly and may be improved - but it DOES the job, and does NOT ruin in the process the indexes of subsequent Matches in collection, therefore keeping the rest of code working as it is now.

Extract adjacent word? (Names, streets, creeks, rivers)

Extract adjacent word? (Names, streets, creeks, rivers)
Hi I am looking for a function that I can run through a massive list of paragraphs to extract the word proceeding ‘creek’ such that the creek names could be isolated.
For example a given paragraph might read:
“The site was located up stream three miles from the bridge along Clark Creek.”
The ideal output would be simply
Clark Creek
It would have to be something that looks up the word ‘creek’ as a criteria and extracts the preceding word, even just ‘Clark’ would work for me.
I have been playing around with the RQSlite package & gsub, but no luck so far… I am sure this is a common procedure.
If you're extracting actual addresses, there are services which do this intelligently and can even verify the results: http://smartystreets.com/products/liveaddress-api/extract (To be fair, you should know I helped develop that, although I no longer work there.)
For place names, assuming the place is just one word, you could try a simple regex:
/(?<=\s)(\S+\s+(Creek|Street|River))/ig
Granted, I've never used RQSLite or gsub, but I imagine something like this would do the trick.

Passing a command line argument (as a string) into my Perl script

I'm extremely new at Perl and trying to prove I can pick it up quickly. What I was asked to do is add a string as an argument on my command line, and then feed that into my script. From there it is supposed to search a MySQL table I've made for matches in one column, and spit the contents of another column into an array. It was suggested I used the Getops:Std but I'm uncertain how exactly to do that, and if that's the best technique.
For example: I have a MySQL table with car manufacturers, and car models. I want to run, Perl myscript.pl Ford, and then have it shoot me back an array with
Mustang
Escape
Focus
But I'm uncertain how to get that string input in the first place. Would Getops:Std be best? If so how would it be written? I'm picking this up quickly, but I've been at it less than a week, so the simpler the explanation, the better.
Edit: Basically I was confused why it was suggested I should use GetOpts::Std for this. It seems to be completely inappropriate for what I'm trying to do.
GetOpts::Std is overkill for this. Your command line arguments are in #ARGV. If you haven't been able to work that out after a week, then you need better references for Perl.
The first argument will be in $ARGV[0], the second in $ARGV[1] , and so on.
You should check the DBI module. Google for some tutorial out there.
Then try to write your script and post more specific questions with some code if you need more help.

How to seperate an address string mashed together in MySQL

I have an address string in MySQL that has been mashed together from the source. I think it is possible to use a regular expression or some other method to seperate the string into usable parts in MySQL, but I am not aware of how this could be acheived.
Basically each string looks something like these examples (I have added a marker to the top to show what each bit is):
<-------------><-------><-><-->
123 Fake StreetRESERVOIRVIC3001
<-----------------><--------------------><------><-><-->
Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001
<-------------------><-------------------><--- ><><-->
Grange Police StationShop 1 Fairytale LaneGRANGEWA8001
The address supposed to be broken up into optionally two lines of address information, suburb, state and post code. I'm in Australia so the state will be either NSW,VIC,QLD,WA,SA,NT or ACT and the postcode will always be a 4 digit number at the very end.
The possible ways to break it up are that the suburb will always be capitalised, the state and postcode will be predicatable within the last 6 or 7 characters (depending on state) and the first two lines of address information will be broken up by a change in case with no space character in between.
I have some 100,000 records like this, so to go through and do it by hand would be very time consuming. Any help on a way of doing this programatically would be much appreciated.
With no spaces? Most gross...
MySQL doesn't have the tools to deal with that, so you'll have to access the database with an external program. I tend to use Perl for manipulations like this.
Start from the end and work backwards... we know the last four should be digits, and the letters preceding that one of 7 options. Use that knowledge and you'll be down 2 fields and 6-7 characters.
It looks like your example now has a town in all capital letters at the end... Parse out that, and it should match to the state and area code. I'm certain you can find a database of zip codes within some minutes online.
With the name and street address remaining, that will have some variability to it, and I wish you a bit of luck there. You may have a head-start with being able to concentrate on the lack of a space between a lowercase and capital, or a letter and number as a breaking point.
Challenge accepted. I'll even throw in some basic punctuation to allow for "101 St. Mark's St." and the like.
/^(([\w\'\.](?=[a-z \'\.])| )+[a-z\'\.])?(([\w\'\.](?=[a-z \d\'\.])| )+[a-z\.\'])([A-Z]+)(NSW|VIC|QLD|WA|SA|NT|ACT)(\d{4})/
Could probably use a little more clean-up, but it should work in any language which supports basic regex with lookahead (some implementations, like JavaScript's and (I think) Ruby's, support lookahead, but not lookbehind). (That, and this puzzle kept me up well past my bed time.) At the very least, it worked on the three examples you provided.
By the way, 2problems.com is a great site for quickly testing regular expressions. It's what I used to work this puzzle out. The guy who built it must have been a real genius. (koff koff)
Rubular is another good option, though since it works by making Ajax calls to a Ruby script behind-the-scenes, it's a bit slower. It does have the nice feature of being able to link to entered patterns and haystacks, though; here's this pattern on Rubular. The 2problems guy really should get around to implementing something like that some day.

Is hard-coding literals ever acceptable?

The code base I'm currently working on is littered with hard-coded values.
I view all hard coded values as a code smell and I try to eliminate them where possible...however there are some cases that I am unsure about.
Here are two examples that I can think of that make me wonder what the best practice is:
1. MyTextBox.Text = someCondition ? "Yes" : "No"
2. double myPercentage = myValue / 100;
In the first case, is the best thing to do to create a class that allows me to do MyHelper.Yes and MyHelper.No or perhaps something similar in a config file (though it isn't likely to change and who knows if there might ever be a case where its usage would be case sensitive).
In the second case, finding a percentage by dividing by 100 isn't likely to ever change unless the laws of mathematics change...but I still wonder if there is a better way.
Can anyone suggest an appropriate way to deal with this sort of hard coding? And can anyone think of any places where hard coding is an acceptable practice?
And can anyone think of any places where hard coding is an acceptable practice?
Small apps
Single man projects
Throw aways
Short living projects
For short anything that won't be maintained by others.
Gee I've just realized how much being maintainer coder hurt me in the past :)
The real question isn't about hard coding, but rather repetition. If you take the excellent advice found in "The Pragmatic Programmer", simply Don't Repeat Yourself (DRY).
Taking the principle of DRY, it is fine to hardcode something at any point. However, once you use that particular value again, refactor so this value is only hardcoded once.
Of course hard-coding is sometimes acceptable. Following dogma is rarely as useful a practice as using your brain.
(For an example of this, perhaps it's interesting to go back to the goto wars. How many programmers do you know that will swear by all things holy that goto is evil? Why then does Steve McConnell devote a dozen pages to a measured discussion of the subject in Code Complete?)
Sure, there's a lot of hard-gained experience that tells us that small throw-away applications often mutate into production code, but that's no reason for zealotry. The agilists tell us we should do the simplest thing that could possibly work and refactor when needed.
That's not to say that the "simplest thing" shouldn't be readable code. It may make perfect sense, even in a throw-away spike to write:
const MAX_CACHE_RECORDS = 50
foo = GetNewCache(MAX_CACHE_RECORDS)
This is regardless of the fact that in three iterations time, someone might ask for the number of cache records to be configurable, and you might end up refactoring the constant away.
Just remember, if you go to the extremes of stuff like
const ONE_HUNDRED = 100
const ONE_HUNDRED_AND_ONE = 101
we'll all come to The Daily WTF and laugh at you. :-)
Think! That's all.
It's never good and you just proved it...
double myPercentage = myValue / 100;
This is NOT percentage. What you wanted to write is :
double myPercentage = (myValue / 100) * 100;
Or more correctly :
double myPercentage = (myValue / myMaxValue) * 100;
But this hard coded 100 messed with your mind... So go for the getPercentage method that Colen suggested :)
double getpercentage(double myValue, double maxValue)
{
return (myValue / maxValue) * 100;
}
Also as ctacke suggested, in the first case you will be in a world of pain if you ever need to localize these literals. It's never too much trouble to add a couple more variables and/or functions
The first case will kill you if you ever need to localize. Moving it to some static or constant that is app-wide would at least make localizing it a little easier.
Case 1: When should you hard-code stuff: when you have no reason to think that it will ever change. That said, you should NEVER hard code stuff in-line. Take the time to make static variables or global variables or whatever your language gives you. Do them in the class in question, and if you notice that two classes or areas of your code share the same value FOR THE SAME REASON (meaning it's not just coincidence), point them to the same place.
Case 2: For case case 2, you're correct: the laws of "percentage" will not change (being reasonable, here), so you can hard code inline.
Case 3: The third case is where you think the thing could change but you don't want to/have time to bother loading ResourceBundles or XML or whatever. In that case, you use whatever centralizing mechanism you can -- the hated Singleton class is a good one -- and go with that until you actually have need to deal with the problem.
The third case is tricky, though: it's extraordinarily hard to internationalize an application without really doing it... so you will want to hard-code stuff and just hope that, when the i18n guys come knocking, your code is not the worst-tasting code around :)
Edit: Let me mention that I've just finished a refactoring project in which the prior developer had placed the MySql connect strings in 100+ places in the code (PHP). Sometimes they were uppercase, sometimes they were lower case, etc., so they were hard to search and replace (though Netbeans and PDT did help a lot). There are reasons why he/she did this (a project called POG basically forces this stupidity), but there is just nothing that seems less like good code than repeating the same thing in a million places.
The better way for your second example would be to define an inline function:
double getpercentage(double myValue)
{
return(myValue / 100);
}
...
double myPercentage = getpercentage(myValue);
That way it's a lot more obvious what you're doing.
Hardcoded literals should appear in unit tests for the test values, unless there is so much reuse of a value within a single test class that a local constant is useful.
The unit tests are a description of expected values without any abstraction or redirection.
Imagine yourself reading the test - you want the information literally in front of you.
The only time I use constants for test values is when many tests repeat a value (itself a bit suspicious) and the value may be subject to change.
I do use constants for things like names of test files to compare.
I don't think that your second is really an example of hardcoding. That's like having a Halve() method that takes in a value to use to divide by; doesn't make sense.
Beyond that, example 1, if you want to change the language for your app, you don't want to have to change the class, so it should absolutely be in a config.
Hard coding should be avoided like Dracula avoids the sun. It'll come back to bite you in the ass eventually.
"hardcoding" is the wrong thing to worry about. The point is not whether special values are in code or in config files, the point is:
If the value could ever change, how much work is that and how hard is it to find? Putting it in one place and referring to that place elsewhere is not much work and therefore a way to play it safe.
Will maintainance programmers definitely understand why the value is what it is? If there is any doubt whatsoever, use a named constant that explains the meaning.
Both of these goals can be achieved without any need for config files; in fact I'd avoid those if possible. "putting stuff in config files means it's easier to change" is a myth, unless either
you actually want to support customers changing the values themselves
no value that could possibly be put in the config file can cause a bug (buffer overflow, anyone?)
your build and deployment process sucks
The text for the conditions should be in a resource file; that's what it's there for.
Not normally (Are hard-coding literals acceptable)
Another way at looking at this is how using a good naming convention
for constants used in-place of hard coded literals provides additional
documentation in the program.
Even if the number is used only once, it can still be hard to recognized
and may even be hard to find for future changes.
IMHO, making programs easier to read should be second nature to a
seasoned software professional. Raw numbers rarely communicate
meaningfully.
The extra time taken to use a well named constant will make the
code readability (easy to recall to the mind) and useful for future
re-mining (code re-use).
I tend to view it in terms of the project's scope and size.
Some simple projects that I am a solo dev on? Sure, I hard code lots of things. Tools I write that only I will ever use? Sure, if it gets the job done.
But, in working on larger, team projects? I agree, they are suspect and usually the product of laziness. Tag them for review and see if you can spot a pattern where they can be abstracted away.
In your example, the text box should be localizable, so why not a class that handles that?
Remember that you WILL forget the meaning of any non-obvious hard-coded value.
So be certain to put a short comment after each to remind you.
A Delphi example:
Length := Length * 0.3048; { 0.3048 converts feet to meters }
no.
What is a simple throw away app today will be driving your entire enterprise tomorrow. Always use best practices or you'll regret it.
Code always evolves. When you initially write stuff hard coding is the easiest way to go. Later when a need arrives to change the value it can be improved. In some cases the need never comes.
The need can arrive in many forms:
The value is used in many places and it needs to be changed by a programmer. In this case a constant is clearly needed.
User needs to be able to change the value.
I don't see the need to avoid hard coding. I do see the need to change things when there is a clear need.
Totally separate issue is that of course the code needs to be readable and this means that there might be a need for a comment for the hard coded value.
For the first value, it really depends. If you don't anticipate any kind of wide-spread adoption of your application and internationalization will never be an issue, I think it's mostly fine. However, if you are writing some kind of open source software or something with a larger audience consider the fact that it may one day need to be translated. In that case, you may be better off using string resources.
It's okay as long as you don't do refactoring, unit-testing, peer code reviews. And, you don't want repeat customers. Who cares?
I once had a boss who refused to not hardcode something because in his mind it gave him full control over the software and the items related to the software. Problem was, when the hardware died that ran the software the server got renamed... meaning he had to find his code. That took a while. I simply found a hex editor and hacked around it instead of waiting.
I normally add a set of helper methods for strings and numbers.
For example when I have strings such as 'yes' and 'no' I have a function called __ so I call __('yes'); which starts out in the project by just returning the first parameter but when I need to do more complex stuff (such as internationaizaton) it's already there and the param can be used a key.
Another example is VAT (form of UK tax) in online shops, recently it changed from 17.5% to 15%. Any one who hard coded VAT by doing:
$vat = $price * 0.175;
had to then go through all references and change it to 0.15, instead the super usefull way of doing it would be to have a function or variable for VAT.
In my opinion anything that could change should be written in a changeable way. If I find myself doing the same thing more than 5 times in the same day then it becomes a function or a config var.
Hard coding should be banned forever. Althought in you very simple examples i don't see anything wrong using them in any kind of project.
In my opinion hard coding is when you believe that a variable/value/define etc. will never change and create all your code based on that belief.
Example of such hard coding is the book Teach Yourself C in 24 Hours that everybody should avoid.