I am dealing with SCORM package data and here is just one of the many nasty column values that I need to manipulate:
;~;VarQuestion_0016=kl%2Fklkl;VarReflectiveWriting_0001=I%20fink%20that%20aw%20childwens%20should%20be%20wuvved%20and%20pwotected.;VarQuestion_0005=D.%20%20Radio%20public%20service%20announcements%20aired%20during%20Child%20Abuse%20Prevention%20Month%20in%20April;VarQuestion_0004=D.%20%20Societal%20approach;VarQuestion_0003=B.%20%20Free%20respite%20child%20care%20offered%20to%20any%20family%20needing%20a%20break%20in%20order%20to%20reduce%20stress%2CC.%20%20Court-ordered%20substance%20abuse%20prevention%20classes%20for%20parents%20involved%20with%20Child%20Protective%20Services;VarQuestion_0001=B.%20%20A%20treatment%20program%20for%20parents%20identified%20by%20Child%20Protective%20Services%20as%20having%20abused%20their%20children%2CC.%20%20A%20parent%20education%20class%20open%20to%20all%20parents;VarQuestion_0009=Sexual%20abuse%20prevention%20training%20for%20children%20or%20adults;VarQuestion_0011=Community%20Volunteer;VarQuestion_0013=I%20am%20very%20familiar%20with%20the%20research%20and%20with%20community-based%20approaches%3B%20I%20could%20teach%20others%20about%20it.;
I want it to look more like this:
QUESTION ANSWER
Question 1 B. A treatment program for parents identified by Child Protective Services as having abused their children,C. A parent education class open to all parents
Question 3 B. Free respite child care offered to any family needing a break in order to reduce stress,C. Court-ordered substance abuse prevention classes for parents involved with Child Protective Services
Question 4 D. Societal approach
Question 5 D. Radio public service announcements aired during Child Abuse Prevention Month in April
Question 9 Sexual abuse prevention training for children or adults
Question 11 Community Volunteer
Question 13 I am very familiar with the research and with community-based approaches; I could teach others about it.
Question 16 I fink that aw childwens should be wuvved and pwotected.
Steps to solve:
urldecode
remove the first few characters (the leading ';~;')
explode the string by 'VarQuestion_', explode these strings by '=', select the first element for column 1 and the last element for column 2 (trimming each to strip leading '0's and the trailing ';')
MySQL hurdles to solve:
find function for urldecode
find function to explode data
edit/manipulate array from explode function
output array into two columns for reporting
It seems simple on paper but it is an absolute nightmare for MySQL. Are there any packages/procedures/functions that you all can recommend for each step?
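For clarity, here is the transformation written out in Python, just to pin down the logic (the sample string is truncated, and the goal is still to do this in MySQL for reporting):

from urllib.parse import unquote

raw = ';~;VarQuestion_0016=kl%2Fklkl;VarQuestion_0005=D.%20%20Radio%20public%20service%20announcements'  # truncated column value

rows = []
for pair in raw.strip(';~').split(';'):         # split before decoding so an encoded ';' (%3B) inside an answer survives
    if not pair.startswith('VarQuestion_'):
        continue
    key, _, value = pair.partition('=')
    number = int(key[len('VarQuestion_'):])      # '0016' -> 16, drops the leading zeros
    rows.append((number, unquote(value)))        # urldecode the answer text

for number, answer in sorted(rows):
    print('Question %d' % number, answer)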
I am creating a custom environment in OpenAI Gym, and I'm having some trouble navigating the observation space.
Every timestep, the agent is given two potential students to accept or deny admission to - these are randomized and are part of the observation space. As the reward is based on which students are currently enrolled (who we have accepted in the past), we need to keep track of who has been accepted and who has not within the state space (there are a limited number of spots available to students). Each student has a 'major' (1-15) and a 'minor' (1-5) which, in the simulator I built, have weights associated with them that have a bearing on the reward, so they must be included in the state space. After a number of timesteps (varies depending on the major/minor combination), students graduate and can be removed from the list of enrolled students (and removed from being represented in the state space).
Thus, I currently have something like:
spaces = {
    'potential_student_I': spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))),
    'potential_student_II': spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))),
    'enrolled_student_I': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
    'enrolled_student_II': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
    'enrolled_student_III': spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))),
}
self.observation_space = spaces.Dict(spaces)
In the above code, there's only room for three potential accepted students to be represented. These are spaces.Tuple(((spaces.Discrete(16), spaces.Discrete(6)))) rather than spaces.Tuple(((spaces.Discrete(15), spaces.Discrete(5)))) because the list doesn't necessarily need to be filled, so there are extra options for 'NULL'.
Is there a better way to do this? I thought about maybe using one-hot encoding or something similar. Ideally this environment could have up to 50 enrolled students, which obviously is not efficient if I continue representing the observation space the way I currently am. I plan on using a neural net because of the large state space, but I'm caught up on how to efficiently represent the observation space.
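For what it's worth, here is a sketch of the fixed-size, one-hot/count idea I've been considering (the names, bounds, and dtype are my own guesses, nothing from the gym docs). The enrolled cohort becomes a count per (major, minor) cell, so the observation stays the same size whether 0 or 50 students are enrolled:

import numpy as np
from gym import spaces

N_MAJOR, N_MINOR, MAX_ENROLLED = 15, 5, 50

observation_space = spaces.Dict({
    # the two candidates offered this timestep, each one-hot over major and minor
    'potential_students': spaces.Box(0, 1, shape=(2, N_MAJOR + N_MINOR), dtype=np.float32),
    # the enrolled cohort as counts per (major, minor) cell; fixed size regardless of cohort size
    'enrolled_counts': spaces.Box(0, MAX_ENROLLED, shape=(N_MAJOR, N_MINOR), dtype=np.float32),
})

def encode_student(major, minor):
    # one-hot encode a single (major, minor) pair, both indexed from 0
    vec = np.zeros(N_MAJOR + N_MINOR, dtype=np.float32)
    vec[major] = 1.0
    vec[N_MAJOR + minor] = 1.0
    return vec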
I am currently working on a poker AI and I am stuck on this question: What is the best way to encode poker cards for my AI? I am using deep reinforcement learning techniques and I just don't know how to answer my question.
The card information is stored as a string. For example: "3H" would be "three of hearts". I thought about ranking the cards and then attaching values to them such that a high-rated card like AH ("Ace of hearts") would get a high number like 52 or something like that. The problem with this approach is that it doesn't take the suits into account.
I have seen some methods where they just assign a number to each and every card such that at the end there are 52 numbers from 0-51 (https://www.codewars.com/kata/52ebe4608567ade7d700044a/javascript). The problem I see with that is that my neural net wouldn't get, or would at least have difficulty getting, the connection between similar cards like Aces (because, as in the link above, one Ace is labeled 0, the other 13, etc.).
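To make my attempt concrete, here is roughly what I mean in Python (my own sketch; I'm assuming a ten is written as "T", and the second function is the rank/suit split I'm wondering about):

RANKS = '23456789TJQKA'   # assuming ten is written as 'T'
SUITS = 'HDSC'            # hearts, diamonds, spades, clubs

def card_index(card):
    # '3H' -> a single number 0..51 (one possible 0-51 labelling)
    rank, suit = RANKS.index(card[0]), SUITS.index(card[1])
    return suit * 13 + rank

def card_one_hot(card):
    # alternative: separate one-hot parts for rank (13) and suit (4),
    # so the two Aces share the same rank part of the encoding
    rank, suit = RANKS.index(card[0]), SUITS.index(card[1])
    vec = [0] * 17
    vec[rank] = 1
    vec[13 + suit] = 1
    return vec

print(card_index('AH'), card_index('AD'))                   # two different single numbers for two Aces
print(card_one_hot('AH')[:13] == card_one_hot('AD')[:13])   # True: identical rank part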
Can someone please help me with this question, such that the encodings take care of suits, values, ranks, etc., and my NN would be able to get the connections between similar cards?
Thanks in advance
I need a function that compares two strings and outputs an edit distance like Levenshtein, but only if the differing characters are homoglyphs in cursive. I have a list of those homoglyphs so I could feed a custom list to this function.
Example
homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive they look-alike
compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1
compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2
Tory should give a false result since there's no way someone misread an M as a T. An A could be misread as an O so it can count as 1.
The scoring could be different, I just need to know that Mory is probably Mary not Tory and Morio is a little more likely to be Mario than Maria.
Does something like this exist?
The key to your problem can be thought of like an IQ word association question.
Sound Glyph
--------- = ----------
Homophone Homoglyphs
Now, if you know there is a way to find similar-sounding words (homophones), then the same idea can be applied, but with glyphs (homoglyphs) instead of sounds.
The way to find similar sounding words is via Soundex (Sound Index).
So just do what Soundex does but instead of having a mapping from similar homophones use similar homoglyphs.
Once you convert each word (glyphs) input into a Glyphdex (Glyph Index) then you can compute the Levenshtein distance for the two Glyphdex.
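Here is a rough sketch of that idea in Python (the canonicalization helper and group handling are my own, nothing standard):

def canonicalize(word, groups):
    # map every member of a look-alike group to the group's first member,
    # trying multi-letter glyphs like 'rn' before single letters
    repl = {}
    for group in groups:
        for g in group:
            repl[g] = group[0]
    keys = sorted(repl, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(repl[k])
                i += len(k)
                break
        else:
            out.append(word[i])
            i += 1
    return ''.join(out)

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

groups = [['o', 'a'], ['rn', 'm', 'nn']]   # matching is case-sensitive in this sketch
print(levenshtein(canonicalize('Mory', groups), canonicalize('Mary', groups)))   # 0: purely a look-alike difference
print(levenshtein(canonicalize('Mory', groups), canonicalize('Tory', groups)))   # 1: M vs T is not in the group list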
Make sense?
If you are into cellular biology then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one 3 letter codon.
Note: Since the word Glyphdex has been used prior to me writing this I can not say I coined that word; however, the usage I currently find via Google (search) for the word is not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar sequences of glyphs, I will take credit.
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and more importantly its production of many false positives, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line is "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, Extended ZIP Code, and also city)
Identify (lookup) some of these entities (for example a relatively short database table can include all the cities/towns in the targeted area)
Identify (lookup) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (ex: "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain allows the evaluation to be aborted early, with some strong heuristics (eg: different City => Correlation = 0, level of confidence = 95%, etc.).
An important consideration with searching for correlations is the need, a priori, to compare every single item (here, address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example a key made of the 5-digit ZIP Code followed by the SOUNDEX value of the "primary" name).
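A tiny sketch of the tokenize/normalize and digest-key ideas (the substitution dictionary is only a placeholder, and first letters stand in for the SOUNDEX step here):

SUBSTITUTIONS = {'w': 'west', 'w.': 'west', 'dr': 'drive', 'dr.': 'drive',
                 'st': 'street', 'st.': 'street'}

def tokenize(address):
    return [tok.lower() for tok in address.replace(',', ' ').split()]

def normalize(tokens):
    return [SUBSTITUTIONS.get(tok, tok) for tok in tokens]

def digest_key(address, zip_code):
    # very rough pre-filter: ZIP code plus the first letter of each non-numeric token,
    # so only addresses sharing a key need the expensive pairwise comparison
    letters = ''.join(tok[0] for tok in normalize(tokenize(address)) if not tok[0].isdigit())
    return zip_code + ':' + letters

print(normalize(tokenize('W. Lawn Mower Dr. 54A')))   # ['west', 'lawn', 'mower', 'drive', '54a']
print(digest_key('W. Lawn Mower Dr. 54A', '12345'))   # '12345:wlmd'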
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If you fulfil the following criteria then it helps:
distance between an object and itself is zero (reflexive)
distance from a to b is the same in both directions (symmetric)
distance from a to c is not more than distance from a to b plus distance from b to c (triangle rule)
If your metric obeys these then you can arrange your objects in metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
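Something like this, for instance (the similarity scale at the end is an arbitrary choice of mine):

import re

def reduce_address(line):
    # keep digit runs as-is and every other word as its first letter
    out = []
    for tok in re.findall(r'[A-Za-z]+|\d+', line):
        out.append(tok if tok.isdigit() else tok[0].lower())
    return ''.join(out)

def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

a = reduce_address('West Lawnmower Drive 54 A')   # 'wld54a'
b = reduce_address('W. Lawn Mower Dr. 54A')       # 'wlmd54a'
print(lcs_length(a, b) / max(len(a), len(b)))     # high overlap suggests the same address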
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
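A minimal BK-tree sketch for illustration (the class layout and names are mine):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, word):
        self.word, self.children = word, {}   # children keyed by distance to this node

    def add(self, word):
        node = self
        while True:
            d = levenshtein(word, node.word)
            if d == 0:
                return
            if d not in node.children:
                node.children[d] = BKTree(word)
                return
            node = node.children[d]

    def search(self, word, max_dist):
        # the triangle inequality lets us skip branches whose edge distance
        # differs from the query distance by more than max_dist
        results, stack = [], [self]
        while stack:
            node = stack.pop()
            d = levenshtein(word, node.word)
            if d <= max_dist:
                results.append((node.word, d))
            for edge, child in node.children.items():
                if abs(d - edge) <= max_dist:
                    stack.append(child)
        return results

tree = BKTree('main st')
for addr in ['main street', 'maine st', 'park ave']:
    tree.add(addr)
print(tree.search('main st.', 2))   # near matches within edit distance 2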
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (eg a = 1, b = 2, etc.. and corr(a, b) = |a - b| = 1) :
west lawnmover drive
   w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that they are the same if the correlation is above a given threshold.
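In code, the idea might look roughly like this (the normalization and the per-letter scoring scale are arbitrary choices of mine):

import string

def normalize(s):
    keep = string.ascii_lowercase + ' '
    return ''.join(ch for ch in s.lower() if ch in keep)

def correlation(s1, s2):
    a, b = normalize(s1), normalize(s2)
    best = 0.0
    for offset in range(-len(b) + 1, len(a)):          # slide one string across the other
        scores = []
        for i, ch in enumerate(a):
            j = i - offset
            if 0 <= j < len(b):
                # identical letters score 1, 'a' vs 'z' scores 0
                scores.append(max(0.0, 1 - abs(ord(ch) - ord(b[j])) / 25))
        if scores:
            best = max(best, sum(scores) / len(scores))
    return best

print(correlation('West Lawnmover Drive', 'W. Lawn Mower Street'))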
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.
I'm not a Natural Language Processing student, yet I know it's not a trivial strcmp(n1, n2).
Here's what I've learned so far:
comparing Personal Names can't be solved 100%
there are ways to achieve a certain degree of accuracy.
the answer will be locale-specific, that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but about comparing 2 names, e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-English names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match Berry we would flip the word order around and then get the match.
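A stripped-down sketch of that approach (the alias table here is just a placeholder; a real list would be far larger):

ALIASES = {'berry': 'bernard', 'dick': 'richard', 'bill': 'william'}

def parts(name):
    words = name.replace(',', ' ').lower().split()
    return words[0], words[-1]                  # forename and surname guess; middles discarded

def canonical(first, last):
    return ALIASES.get(first, first), last

def same_person(n1, n2):
    f1, l1 = parts(n1)
    f2, l2 = parts(n2)
    # if the straight reading fails, flip the second name's word order ('Tsakala, Berry')
    return canonical(f1, l1) == canonical(f2, l2) or canonical(f1, l1) == canonical(l2, f2)

print(same_person('Berry Tsakala', 'Bernard Tsakala'))    # True via the alias list
print(same_person('Berry J. Tsakala', 'Tsakala, Berry'))  # True after flipping the word order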
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively, people name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list or may be able to look up details of the person on the internet - but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with Tanimoto using UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
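For example (toy strings of my own):

from difflib import SequenceMatcher

# returns a 0..1 similarity ratio and copes fine with non-ASCII characters
print(SequenceMatcher(None, 'Renée Tsakala', 'Renee Tsakala').ratio())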