Basic text similarity through WordNet synsets for taxonomy mapping/merging - nltk

I would like to implement a basic text similarity routine with semantic distance using WordNet and NLTK in Python. This is the idea: extend two concepts/phrases/categories A and B with synsets, hyponyms, hypernyms, meronyms, metonyms and compute the distance between the two resulting vectors, a and b. I am not sure how I will compute it, maybe as a cosine distance.
My input data for most cases are not made of phrases but rather proper nouns or nouns (product names with brand or product categories).
For example, I would like to determine that "resort" is a "luxury hotel" or "black caviar" is "gourmet", A - "black caviar", B - "gourmet".
To what extent could this work, and how do I walk up and down WordNet to make it a bit more sophisticated than one level up and down with hyponyms/hypernyms?
I am looking for a simple, basic solution that works well enough, without using sophisticated things like Whoosh.
Should I use something better than WordNet?
UPDATE:
I am processing each noun phrase the following way (using NLTK & WordNet):
1. For each word in a phrase I collect its synsets (nouns only), then I complement them with the hypernyms and hyponyms of each synset. For now, I put all the synsets into one list, ignoring the hierarchy.
2. I repeat the process for the keywords describing each of my categories.
3. Now I have a list of synsets for each category and for my target. I compute a distance from the target to each category (cosine or Wu and Palmer's distance): I collect the pairwise distances between the two lists, sum them, and normalize by the number of keywords describing the category or the target. Then I pick the min distance.
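The three steps above can be sketched without NLTK on a tiny hand-made hierarchy. Everything in this block is an assumption made for illustration: the `PARENT` table is a hypothetical toy taxonomy, not WordNet data, and the function names are invented here. Note that the Wu-Palmer measure is usually defined as a similarity (higher means closer), so the sketch picks the category with the highest average score rather than the lowest.

```python
from itertools import product

# Hypothetical toy is-a hierarchy (child -> parent); real WordNet is far larger.
PARENT = {
    "entity": None,
    "organization": "entity",
    "institution": "organization",
    "school": "institution",
    "university": "institution",
    "company": "organization",
    "person": "entity",
    "teacher": "person",
    "child": "person",
}

def path_to_root(node):
    """Return the chain node, parent, ..., root (walking up the hypernyms)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # The first shared ancestor on b's way up is the least common subsumer.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def mean_pairwise(target, category):
    """Average similarity over all target x category pairs."""
    scores = [wup_similarity(t, c) for t, c in product(target, category)]
    return sum(scores) / len(scores)

target = ["school", "child", "teacher"]
categories = {
    "business": ["organization", "company"],
    "education": ["school", "university"],
}
best = max(categories, key=lambda name: mean_pairwise(target, categories[name]))
print(best)
```

Averaging over every pair lets unrelated senses dilute the score; taking the best match per target word before averaging is one common alternative worth trying.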
This sounds pretty basic and inefficient. What is the next step to make it better?
I am interested in doing it from scratch; it's also the best exercise to understand how things work and how it needs to be done.
EXAMPLE:
word_list - target:
['school', 'kids', 'teacher']
categories:
[['business', 'organization', 'company'],['education', 'school', 'university']]
Extended list for category concept 'business', 3 keywords, 223 in extended list:
[Synset('business.n.01'), Synset('commercial_enterprise.n.02'), Synset('occupation.n.01'), Synset('business.n.04'), Synset('business.n.05'), Synset('business.n.06'), Synset('business.n.07'), Synset('clientele.n.01'), Synset('business.n.09'), Synset('organization.n.01'), Synset('arrangement.n.03'), Synset('administration.n.02'), Synset('organization.n.04'), Synset('organization.n.05'), Synset('organization.n.06'), Synset('constitution.n.02'), Synset('company.n.01'), Synset('company.n.02'), Synset('company.n.03'), Synset('company.n.04'), Synset('caller.n.01'), Synset('company.n.06'), Synset('party.n.03'), Synset('ship's_company.n.01'), Synset('company.n.09'), Synset('enterprise.n.02'), Synset('commerce.n.01'), Synset('activity.n.01'), Synset('concern.n.04'), Synset('aim.n.02'), Synset('business_activity.n.01'), Synset('sector.n.02'), Synset('people.n.01'), Synset('acting.n.01'), Synset('social_group.n.01'), Synset('structure.n.03'), Synset('body.n.02'), Synset('administration.n.01'), Synset('orderliness.n.01'), Synset('activity.n.01'), Synset('beginning.n.05'), Synset('institution.n.01'), Synset('army_unit.n.01'), Synset('friendship.n.01'), Synset('organization.n.01'), Synset('visitor.n.01'), Synset('social_gathering.n.01'), Synset('set.n.05'), Synset('complement.n.03'), Synset('unit.n.03'), Synset('agency.n.02'), Synset('brokerage.n.02'), Synset('carrier.n.05'), Synset('chain.n.04'), Synset('firm.n.01'), Synset('franchise.n.02'), Synset('manufacturer.n.01'), Synset('partnership.n.01'), Synset('processor.n.01'), Synset('shipbuilder.n.03'), Synset('underperformer.n.02'), Synset('advertising.n.02'), Synset('agribusiness.n.01'), Synset('butchery.n.02'), Synset('construction.n.07'), Synset('discount_business.n.01'), Synset('employee-owned_enterprise.n.01'), Synset('field.n.06'), Synset('finance.n.01'), Synset('fishing.n.02'), Synset('industry.n.02'), Synset('packaging.n.01'), Synset('printing.n.02'), Synset('publication.n.04'), Synset('real-estate_business.n.01'), 
Synset('storage.n.03'), Synset('tourism.n.01'), Synset('transportation.n.05'), Synset('venture.n.03'), Synset('accountancy.n.01'), Synset('appointment.n.05'), Synset('career.n.01'), Synset('catering.n.01'), Synset('confectionery.n.03'), Synset('employment.n.02'), Synset('farming.n.02'), Synset('game.n.10'), Synset('metier.n.02'), Synset('photography.n.03'), Synset('position.n.06'), Synset('profession.n.02'), Synset('sport.n.02'), Synset('trade.n.02'), Synset('treadmill.n.03'), Synset('occasions.n.01'), Synset('land-office_business.n.01'), Synset('trade.n.03'), Synset('big_business.n.01'), Synset('shtik.n.02'), Synset('adhocracy.n.01'), Synset('affiliate.n.02'), Synset('alliance.n.03'), Synset('association.n.01'), Synset('blue.n.03'), Synset('bureaucracy.n.03'), Synset('company.n.04'), Synset('defense.n.09'), Synset('deputation.n.01'), Synset('enterprise.n.02'), Synset('establishment.n.05'), Synset('federation.n.01'), Synset('fiefdom.n.02'), Synset('fire_brigade.n.01'), Synset('force.n.04'), Synset('girl_scouts.n.01'), Synset('grey.n.04'), Synset('hierarchy.n.02'), Synset('host.n.06'), Synset('institution.n.01'), Synset('line_of_defense.n.01'), Synset('line_organization.n.01'), Synset('machine.n.03'), Synset('machine.n.05'), Synset('musical_organization.n.01'), Synset('nongovernmental_organization.n.01'), Synset('party.n.01'), Synset('peace_corps.n.01'), Synset('polity.n.02'), Synset('pool.n.03'), Synset('professional_organization.n.01'), Synset('quango.n.01'), Synset('tammany_hall.n.01'), Synset('union.n.01'), Synset('unit.n.03'), Synset('calendar.n.01'), Synset('classification_system.n.01'), Synset('contrivance.n.04'), Synset('coordinate_system.n.01'), Synset('data_structure.n.01'), Synset('design.n.02'), Synset('distribution.n.01'), Synset('genetic_map.n.01'), Synset('kinship_system.n.01'), Synset('lattice.n.01'), Synset('living_arrangement.n.01'), Synset('ontology.n.01'), Synset('county_council.n.01'), Synset('curia.n.01'), Synset('executive.n.02'), 
Synset('government_officials.n.01'), Synset('judiciary.n.01'), Synset('management.n.02'), Synset('top_brass.n.01'), Synset('nonprofit_organization.n.01'), Synset('rationalization.n.04'), Synset('reorganization.n.01'), Synset('self-organization.n.01'), Synset('syndication.n.01'), Synset('listing.n.02'), Synset('order.n.15'), Synset('randomization.n.01'), Synset('systematization.n.01'), Synset('territorialization.n.01'), Synset('collectivization.n.01'), Synset('colonization.n.01'), Synset('communization.n.02'), Synset('federation.n.03'), Synset('unionization.n.01'), Synset('broadcasting_company.n.01'), Synset('bureau_de_change.n.01'), Synset('car_company.n.01'), Synset('closed_shop.n.01'), Synset('corporate_investor.n.01'), Synset('distributor.n.03'), Synset('dot-com.n.01'), Synset('drug_company.n.01'), Synset('east_india_company.n.01'), Synset('electronics_company.n.01'), Synset('film_company.n.01'), Synset('food_company.n.01'), Synset('furniture_company.n.01'), Synset('holding_company.n.01'), Synset('joint-stock_company.n.01'), Synset('limited_company.n.01'), Synset('livery_company.n.01'), Synset('mining_company.n.01'), Synset('mover.n.04'), Synset('oil_company.n.01'), Synset('open_shop.n.01'), Synset('packaging_company.n.01'), Synset('pipeline_company.n.01'), Synset('printing_concern.n.01'), Synset('record_company.n.01'), Synset('service.n.04'), Synset('shipper.n.02'), Synset('shipping_company.n.01'), Synset('steel_company.n.01'), Synset('stock_company.n.01'), Synset('subsidiary_company.n.01'), Synset('target_company.n.01'), Synset('think_tank.n.01'), Synset('transportation_company.n.01'), Synset('union_shop.n.01'), Synset('white_knight.n.01'), Synset('trainband.n.01'), Synset('freemasonry.n.01'), Synset('ballet_company.n.01'), Synset('chorus.n.05'), Synset('circus.n.01'), Synset('minstrel_show.n.01'), Synset('minstrelsy.n.01'), Synset('opera_company.n.01'), Synset('theater_company.n.01'), Synset('attendance.n.03'), Synset('cohort.n.01'), Synset('number.n.07'), 
Synset('fatigue_party.n.01'), Synset('landing_party.n.01'), Synset('party_to_the_action.n.01'), Synset('rescue_party.n.01'), Synset('search_party.n.01'), Synset('stretcher_party.n.01'), Synset('war_party.n.01')]
Extended list for category concept 'education' - 97 synsets:
[Synset('education.n.01'), Synset('education.n.02'), Synset('education.n.03'), Synset('education.n.04'), Synset('education.n.05'), Synset('department_of_education.n.01'), Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('university.n.01'), Synset('university.n.02'), Synset('university.n.03'), Synset('activity.n.01'), Synset('content.n.05'), Synset('learning.n.01'), Synset('profession.n.02'), Synset('upbringing.n.01'), Synset('executive_department.n.01'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('body.n.02'), Synset('establishment.n.04'), Synset('educational_institution.n.01'), Synset('coeducation.n.01'), Synset('continuing_education.n.01'), Synset('course.n.01'), Synset('elementary_education.n.01'), Synset('extension.n.04'), Synset('extracurricular_activity.n.01'), Synset('higher_education.n.01'), Synset('secondary_education.n.01'), Synset('team_teaching.n.01'), Synset('work-study_program.n.01'), Synset('enlightenment.n.01'), Synset('eruditeness.n.01'), Synset('experience.n.01'), Synset('foundation.n.04'), Synset('physical_education.n.01'), Synset('acculturation.n.03'), Synset('mastering.n.01'), Synset('school.n.03'), Synset('self-education.n.01'), Synset('special_education.n.01'), Synset('vocational_training.n.01'), Synset('teaching.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), 
Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01'), Synset('gown.n.02'), Synset('varsity.n.01'), Synset('city_university.n.01'), Synset('oxbridge.n.01'), Synset('redbrick_university.n.01'), Synset('multiversity.n.01'), Synset('open_university.n.01')]
Extended list for my target, 57 synsets:
[Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]
I have 3 vectors, target - 57, business - 223, and education - 97.
Now compute pairwise Wu and Palmer's distances between target and business, divide by 57x223=12711; between target and education, divide by 57x97=5529.
target to business distance: 2305.709117171037 / 5529 = 0.9125370052417936
target to education distance: 5045.417101981877 / 12711 = 0.39693313680921066
Min distance is to education. That's a correct answer.

WordNet plus some similarity measure can be a solution.
You can also use Word2Vec to determine the semantic distance between the words that you get from the WordNet synset/*nyms search.
Maybe someone could help with a particular library (nothing comes to mind at the moment that you could use directly).
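As a rough illustration of comparing words through vectors, here is a minimal cosine-similarity sketch. The three-dimensional vectors in `EMBEDDINGS` are made up for the example and are not real Word2Vec output; a trained model would supply vectors with hundreds of dimensions, and `most_similar` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for learned embeddings (assumption).
EMBEDDINGS = {
    "school": [0.9, 0.1, 0.2],
    "university": [0.8, 0.2, 0.1],
    "company": [0.1, 0.9, 0.3],
}

def most_similar(word, candidates):
    """Return the candidate whose vector is closest to the word's vector."""
    return max(
        candidates,
        key=lambda c: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[c]),
    )

print(most_similar("school", ["university", "company"]))
```

Cosine distance, if a distance is preferred, is simply one minus this similarity.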

Word2Vec for semantic representation + a maximum likelihood method as described in the paper below would be a good approach to merge two taxonomies: http://www.ideal.ece.utexas.edu/papers/rajan05aaai.pdf

Related

Computing a multinomial logistic multilevel regression using glmer from R

Problem: I'm trying to compute a multinomial logistic multilevel regression. I am trying to follow this approach:
multinomial logistic multilevel models in R
Details: Therefore I computed six separate models with glmer from the lme4 package in R.
I would like to investigate the influence that meaning in life has on people's everyday lives.
As dependent variables I have pleasant days, meaningful days, pleasant-meaningful days and meaningful-unpleasant days.
I have always made pairwise comparisons as described in the link and have always excluded the other cases.
Question I: Is my approach correct?
library(lme4)
#Model-1 (intercept only)
#comparision_1: meaningful-pleasant days vs. pleasant days
eelModel.1 <- glmer(comparision_1 ~ 1 + (1|subject_id), data = data, family = binomial(), na.action = na.omit)
summary(eelModel.1)
#comparision_2: meaningful-pleasant days vs. meaningful days
eelModel.2 <- glmer(comparision_2 ~ 1 + (1|subject_id), data = data, family = binomial(), na.action = na.omit)
... and so on.
#Model-2 (with meaning_in_life as predictor)
#comparision_1: meaningful-pleasant days vs. pleasant days
Model2.1 <- glmer(comparision_1 ~ meaning_in_life + (1|subject_id), data = data, family = binomial(), na.action = na.omit)
summary(Model2.1)
#comparision_2: meaningful-pleasant days vs. meaningful days
Model2.2 <- glmer(comparision_2 ~ meaning_in_life + (1|subject_id), data = data, family = binomial(), na.action = na.omit)
... and so on.
Question II: Are the estimates from the Output the log-odds? Or do I have to compute them?
Thanks for your help,
Christoph
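On Question II: with family = binomial(), glmer uses the logit link by default, so the fixed-effect estimates in the summary are on the log-odds scale. The short Python sketch below only illustrates how such an estimate converts to an odds ratio or a probability; the coefficient values are hypothetical and not taken from any fitted model, and the probabilities apply to a subject whose random intercept is zero.

```python
import math

def odds_ratio(log_odds):
    """exp(beta): multiplicative change in the odds per unit of the predictor."""
    return math.exp(log_odds)

def probability(log_odds):
    """Inverse logit: map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical estimates, chosen only for illustration.
intercept = -0.5   # log-odds when meaning_in_life is 0
slope = 0.8        # change in log-odds per unit of meaning_in_life

print(odds_ratio(slope))               # odds ratio for a one-unit increase
print(probability(intercept))          # probability at meaning_in_life = 0
print(probability(intercept + slope))  # probability at meaning_in_life = 1
```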

R: Web scraping JSON, extracting information from nest

I am trying to use tidyJSON to extract information from JSON, but I am open to any R package that can achieve my ends. I took a look at the documentation and vignettes and found the complex example helpful. However, the information I want is nested inside of a non-key-value pair and I am not sure how to access it. I am interested in getting appid, name, developer, etc., but this information is within 570 and 730:
{"570":{"appid":570,"name":"Dota 2","developer":"Valve","publisher":"Valve","score_rank":71,"owners":102151578,"owners_variance":259003,"players_forever":102151578,"players_forever_variance":259003,"players_2weeks":9436299,"players_2weeks_variance":89979,"average_forever":11727,"average_2weeks":1229,"median_forever":277,"median_2weeks":662,"ccu":811259,"price":"0","tags":{"Free to Play":22678,"MOBA":7808,"Strategy":7415,"Multiplayer":6757,"Team-Based":4848,"Action":4602,"e-sports":4089,"Online Co-Op":3669,"Competitive":3553,"PvP":2655,"RTS":2267,"Difficult":2129,"RPG":2114,"Fantasy":2044,"Tower Defense":2024,"Co-op":1898,"Character Customization":1514,"Replay Value":1487,"Action RPG":1397,"Simulation":1024}},
"730":{"appid":730,"name":"Counter-Strike: Global Offensive","developer":"Valve","publisher":"Valve","score_rank":78,"owners":29225079,"owners_variance":154335,"players_forever":28552354,"players_forever_variance":152685,"players_2weeks":9102348,"players_2weeks_variance":88410,"average_forever":17648,"average_2weeks":791,"median_forever":5030,"median_2weeks":358,"ccu":543626,"price":"1499","tags":{"FPS":17082,"Multiplayer":13744,"Shooter":12833,"Action":10881,"Team-Based":10369,"Competitive":9664,"Tactical":8529,"First-Person":7329,"e-sports":6716,"PvP":6383,"Online Co-Op":5714,"Military":4621,"Co-op":4435,"Strategy":4424,"War":4361,"Realistic":3196,"Trading":3191,"Difficult":3158,"Fast-Paced":3100,"Moddable":2496}}
There are many thousands of such entries. Is there a way to skip the "top-level" and look within the nest?
The JSON information is from http://steamspy.com/api.php?request=top100in2weeks
This might be what you need:
library(jsonlite)
# Parse the JSON into a named list with one element per app id
data = fromJSON("http://steamspy.com/api.php?request=top100in2weeks")
# Pull the same field out of every top-level element
appid = lapply(data, function(x){x$appid})
name = lapply(data, function(x){x$name})
df = data.frame(appid = unlist(appid),
name = unlist(name),
stringsAsFactors = FALSE)
Result:
> head(df)
appid name
570 570 Dota 2
730 730 Counter-Strike: Global Offensive
578080 578080 PLAYERUNKNOWN'S BATTLEGROUNDS
440 440 Team Fortress 2
271590 271590 Grand Theft Auto V
433850 433850 H1Z1: King of the Kill
I'll let you add the rest of the information
Edit: Adding arrays to a dataframe
It is also possible to add the tag information for each game to the data frame, along with the number of times each tag was applied. For each game, store an array of tag names in one column and the tag counts in another.
After the definition of df, add the following lines:
for(k in seq_len(nrow(df))){
df$tags[k] = list(names(data[[k]]$tags))
df$tagsQ[k] = list(unlist(data[[k]]$tags))
}
This will give you:
> df["570",]
appid name
570 570 Dota 2
tags
570 Free to Play, MOBA, Strategy, Multiplayer, Team-Based, Action, e-sports, Online Co-Op, Competitive, PvP, RTS, Difficult, RPG, Fantasy, Tower Defense, Co-op, Character Customization, Replay Value, Action RPG, Simulation
tagsQ
570 22686, 7810, 7420, 6759, 4850, 4603, 4092, 3672, 3555, 2657, 2267, 2130, 2116, 2045, 2024, 1898, 1514, 1487, 1397, 1023
In this situation, columns tags and tagsQ contain lists. To obtain the second tag and quantity for appid 570 do:
> df["570","tags"][[1]][2]
[1] "MOBA"
> df["570","tagsQ"][[1]][2]
MOBA
7810
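The same "skip the top-level keys and look inside each entry" idea can be sketched in Python with only the standard library. This is not the R answer's code: `SAMPLE` is a trimmed copy of the two entries quoted in the question, and `flatten` with its `fields` parameter is a hypothetical helper written for this illustration.

```python
import json

# Trimmed copy of the two entries shown in the question.
SAMPLE = """
{"570": {"appid": 570, "name": "Dota 2", "developer": "Valve",
         "tags": {"Free to Play": 22678, "MOBA": 7808}},
 "730": {"appid": 730, "name": "Counter-Strike: Global Offensive",
         "developer": "Valve",
         "tags": {"FPS": 17082, "Multiplayer": 13744}}}
"""

def flatten(payload, fields=("appid", "name", "developer")):
    """Turn {id: {...}} into a list of flat rows, ignoring the outer keys."""
    rows = []
    for entry in payload.values():
        row = {field: entry.get(field) for field in fields}
        tags = entry.get("tags", {})
        row["tags"] = list(tags)            # tag names, in document order
        row["tagsQ"] = list(tags.values())  # matching tag counts
        rows.append(row)
    return rows

rows = flatten(json.loads(SAMPLE))
for row in rows:
    print(row["appid"], row["name"], row["tags"][1], row["tagsQ"][1])
```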

Difficulty unpacking JSON tuple string

I figured out how to use rebar. I'm trying to use jsx (jiffy doesn't work properly on Windows) to parse json that I obtained using the openexchangerates.org API, but I can't even figure out how to correctly utilize Erlang's extensive binary functionality in order to unpack the JSON tuple obtained. Using the following code snippet, I managed to get a tuple that has all the data I need:
-module(currency).
-export([start/0]).
start() ->
URL = "http://openexchangerates.org",
Endpoint = "/api/latest.json?app_id=<myprivateid>",
X = string:concat(URL, Endpoint),
% io:format("~p~n",[X]).
inets:start(),
{ok, Req} = httpc:request(X),
Req.
Here is the obtained response:
9> currency:start().
{{"HTTP/1.1",200,"OK"},
[{"cache-control","public"},
{"connection","close"},
{"date","Fri, 15 Aug 2014 01:28:06 GMT"},
{"etag","\"d9ad180d4af1caaedab6e622ec0a8a70\""},
{"server","Apache"},
{"content-length","4370"},
{"content-type","application/json; charset=utf-8"},
{"last-modified","Fri, 15 Aug 2014 01:00:56 GMT"},
{"access-control-allow-origin","*"}],
"{\n \"disclaimer\": \"Exchange rates are provided for informational purposes only, and do not constitute financial advice of any kind. Although every attempt is made to ensure quality, NO guarantees are given whatsoever of accuracy, validity, availability, or fitness for any purpose - please use at your own risk. All usage is subject to your acceptance of the Terms and Conditions of Service, available at: https://openexchangerates.org/terms/\",\n \"license\": \"Data sourced from various providers with public-facing APIs; copyright may apply; resale is prohibited; no warranties given of any kind. Bitcoin data provided by http://coindesk.com. All usage is subject to your acceptance of the License Agreement available at: https://openexchangerates.org/license/\",\n \"timestamp\": 1408064456,\n \"base\": \"USD\",\n \"rates\": {\n \"AED\": 3.673128,\n \"AFN\": 56.479925,\n \"ALL\": 104.147599,\n \"AMD\": 413.859001,\n \"ANG\": 1.789,\n \"AOA\": 97.913074,\n \"ARS\": 8.274908,\n \"AUD\": 1.073302,\n \"AWG\": 1.79005,\n \"AZN\": 0.783933,\n \"BAM\": 1.46437,\n \"BBD\": 2,\n \"BDT\": 77.478631,\n \"BGN\": 1.464338,\n \"BHD\": 0.377041,\n \"BIF\": 1546.956667,\n \"BMD\": 1,\n \"BND\": 1.247024,\n \"BOB\": 6.91391,\n \"BRL\": 2.269422,\n \"BSD\": 1,\n \"BTC\": 0.0019571961,\n \"BTN\": 60.843812,\n \"BWP\": 8.833083,\n \"BYR\": 10385.016667,\n \"BZD\": 1.99597,\n \"CAD\": 1.0906,\n \"CDF\": 924.311667,\n \"CHF\": 0.906799,\n \"CLF\": 0.02399,\n \"CLP\": 577.521099,\n \"CNY\": 6.153677,\n \"COP\": 1880.690016,\n \"CRC\": 540.082202,\n \"CUP\": 1.000688,\n \"CVE\": 82.102201,\n \"CZK\": 20.81766,\n \"DJF\": 178.76812,\n \"DKK\": 5.579046,\n \"DOP\": 43.43789,\n \"DZD\": 79.8973,\n \"EEK\": 11.70595,\n \"EGP\": 7.151305,\n \"ERN\": 15.062575,\n \"ETB\": 19.83205,\n \"EUR\": 0.748385,\n \"FJD\": 1.85028,\n \"FKP\": 0.599315,\n \"GBP\": 0.599315,\n \"GEL\": 1.74167,\n \"GGP\": 0.599315,\n \"GHS\": 3.735499,\n \"GIP\": 0.599315,\n \"GMD\": 39.73668,\n \"GNF\": 6995.309935,\n 
\"GTQ\": 7.839405,\n \"GYD\": 205.351249,\n \"HKD\": 7.750863,\n \"HNL\": 21.04854,\n \"HRK\": 5.708371,\n \"HTG\": 44.66625,\n \"HUF\": 233.847801,\n \"IDR\": 11682.083333,\n \"ILS\": 3.471749,\n \"IMP\": 0.599315,\n \"INR\": 60.81923,\n \"IQD\": 1178.211753,\n \"IRR\": 26354,\n \"ISK\": 115.976,\n \"JEP\": 0.599315,\n \"JMD\": 112.604801,\n \"JOD\": 0.707578,\n \"JPY\": 102.501401,\n \"KES\": 88.106539,\n \"KGS\": 51.96,\n \"KHR\": 4056.578416,\n \"KMF\": 368.149,\n \"KPW\": 900,\n \"KRW\": 1021.166657,\n \"KWD\": 0.283537,\n \"KYD\": 0.826373,\n \"KZT\": 182.076001,\n \"LAK\": 8049.834935,\n \"LBP\": 1509.068333,\n \"LKR\": 130.184301,\n \"LRD\": 91.49085,\n \"LSL\": 10.56165,\n \"LTL\": 2.583284,\n \"LVL\": 0.521303,\n \"LYD\": 1.244127,\n \"MAD\": 8.372529,\n \"MDL\": 13.7178,\n \"MGA\": 2495.605,\n \"MKD\": 45.99967,\n \"MMK\": 972.1784,\n \"MNT\": 1884.666667,\n \"MOP\": 7.986251,\n \"MRO\": 292.0081,\n \"MTL\": 0.683602,\n \"MUR\": 30.61708,\n \"MVR\": 15.37833,\n \"MWK\": 392.9201,\n \"MXN\": 13.07888,\n \"MYR\": 3.175156,\n \"MZN\": 30.3522,\n \"NAD\": 10.56145,\n \"NGN\": 162.303701,\n \"NIO\": 26.07651,\n \"NOK\": 6.157432,\n \"NPR\": 97.66846,\n \"NZD\": 1.179688,\n \"OMR\": 0.38501,\n \"PAB\": 1,\n \"PEN\": 2.795018,\n \"PGK\": 2.464545,\n \"PHP\": 43.66429,\n \"PKR\": 99.5662,\n \"PLN\": 3.126223,\n \"PYG\": 4272.421673,\n \"QAR\": 3.641137,\n \"RON\": 3.320192,\n \"RSD\": 87.82784,\n \"RUB\": 36.00216,\n \"RWF\": 690.269,\n \"SAR\": 3.750523,\n \"SBD\": 7.269337,\n \"SCR\": 12.40801,\n \"SDG\": 5.699103,\n \"SEK\": 6.86018,\n \"SGD\": 1.246263,\n \"SHP\": 0.599315,\n \"SLL\": 4372.166667,\n \"SOS\": 841.5678,\n \"SRD\": 3.275,\n \"STD\": 18316.816667,\n \"SVC\": 8.745567,\n \"SYP\": 150.751249,\n \"SZL\": 10.56279,\n \"THB\": 31.86192,\n \"TJS\": 4.9856,\n \"TMT\": 2.8501,\n \"TND\": 1.719658,\n \"TOP\": 1.8861,\n \"TRY\": 2.15338,\n \"TTD\": 6.343484,\n \"TWD\": 30.00481,\n \"TZS\": 1661.865,\n \"UAH\": 13.02466,\n \"UGX\": 2614.28,\n \"USD\": 1,\n 
\"UYU\": 23.70693,\n \"UZS\": 2337.106637,\n \"VEF\": 6.295009,\n \"VND\": 21191.15,\n \"VUV\": 94.6,\n \"WST\": 2.301222,\n \"XAF\": 491.286739,\n \"XAG\": 0.05031657,\n \"XAU\": 0.00076203,\n \"XCD\": 2.70154,\n \"XDR\": 0.654135,\n \"XOF\": 491.394602,\n \"XPF\": 89.414091,\n \"YER\": 214.985901,\n \"ZAR\": 10.55678,\n \"ZMK\": 5253.075255,\n \"ZMW\": 6.169833,\n \"ZWL\": 322.355006\n }\n}"}
I don't understand why this code doesn't work:
X = "Arthur".
B = <<X>>.
JSX allows a lot of parsing functionality but only if I have a binary as my representation of JSON, and this JSON I'm getting from the currency API is a string in a tuple... I'm a bit lost as to where to start to research. Unpacking a tuple using pattern matching is supposedly quite simple (I've done some Prolog programming and I can see that Erlang has similar behavior) but is there another, better, Erlang-appropriate way to grab the "rates" part of the JSON I'm receiving as a response?
Thank you! I'm working on a cool web app to learn erlang and this is a good first step. I have three Erlang books and I'm reading through them diligently but the problem is that I want as much practical exposure as early on as possible. I love this language but I want to get a solid grounding as fast as possible.
Thank you!
get_currencies() ->
URL = "http://openexchangerates.org",
Endpoint = "/api/latest.json?app_id=<myprivateid>",
X = string:concat(URL, Endpoint),
% io:format("~p~n",[X]).
inets:start(),
{ok, {_,_,R}} = httpc:request(X),
E = jsx:decode(list_to_binary(R)),
Base = proplists:get_value(<<"base">>,E),
Sec = proplists:get_value(<<"timestamp">>,E),
{Days,Time} = calendar:seconds_to_daystime(Sec),
Date = calendar:gregorian_days_to_date(Days+719528),
Currencies = proplists:get_value(<<"rates">>,E),
fun(C) -> V = proplists:get_value(C,Currencies),
{change,Date,Time,Base,C,V}
end.
and somewhere in your code:
GC = get_currencies(), %% you can store GC in an ets, a server state...
%% but don't forget to refresh it :o)
and use it later
{change,D,T,B,C,V} = GC(<<"ZWL">>),
%% should return {change,{2014,8,15},{2,0,12},<<"USD">>,<<"ZWL">>,322.355006}
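The date arithmetic and the closure design above can be cross-checked with a small Python sketch. It is not a translation of the Erlang module: `make_lookup` is a hypothetical name, and `SAMPLE` is a cut-down payload that reuses the timestamp and two rates from the decoded response shown in this thread. The constant 719528 is the number of days from the start of year 0 (which Erlang's calendar module counts from) to 1970-01-01.

```python
import json
from datetime import date, datetime, timezone

# Python ordinals start at 1 for 0001-01-01, while Erlang's Gregorian days
# start at 0 for year 0, which is a leap year of 366 days.
EPOCH_GREGORIAN_DAYS = date(1970, 1, 1).toordinal() - 1 + 366

def make_lookup(payload_text):
    """Decode the payload once and return a closure that answers rate queries."""
    data = json.loads(payload_text)
    stamp = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    day = (stamp.year, stamp.month, stamp.day)
    clock = (stamp.hour, stamp.minute, stamp.second)
    base, rates = data["base"], data["rates"]

    def lookup(code):
        # Unknown codes yield None, like proplists:get_value/2 yields undefined.
        return ("change", day, clock, base, code, rates.get(code))

    return lookup

# Cut-down payload (assumption: only the fields the lookup needs).
SAMPLE = '{"timestamp": 1408068012, "base": "USD", "rates": {"ZWL": 322.355006, "EUR": 0.748419}}'

gc = make_lookup(SAMPLE)
print(EPOCH_GREGORIAN_DAYS)
print(gc("ZWL"))
```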
[edit]
When I use an external application such as jsx (which itself uses rebar), I also use rebar and its dependency mechanism to create my own application; in my opinion it is the most convenient way. (In other cases, I also use rebar :o)
Then I build my application using the OTP behaviors (application, supervisor, gen_server...). It is a lot of modules to write, but some of them are very short (application and supervisors) and they make the application structure easier to manage (see What is OTP if you are not familiar with this).
In your case, my first idea is to have a gen_server that builds and stores the GC anonymous function in its state each time it gets a cast message such as update_currencies, and provides the answer each time it gets a call message such as {get_change,Cur} (and maybe refreshes GC if it is undefined or outdated).
You will also have to decide where errors will be handled. It may be nowhere: if the gen_server does nothing but answer this currency query, then when something goes wrong it will crash and be restarted by its supervisor. This matters because this code has many interfaces with the outside world and is therefore subject to numerous failures (no Internet access, a change in the structure of the site's answer, a bad currency request from the user...).
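The "store the rates in a server state and refresh when outdated" idea can be sketched in Python as a small cache object. This is only an analogy to the gen_server design, not OTP code: `RateCache`, `update`, `get_change`, and the injected `fetch` and `clock` callables are all hypothetical names chosen for this illustration.

```python
import time

class RateCache:
    """Hold the latest rates and refetch them when they are missing or stale."""

    def __init__(self, fetch, max_age_seconds=3600, clock=time.monotonic):
        self._fetch = fetch          # callable returning a {code: rate} dict
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so the age check is testable
        self._rates = None
        self._fetched_at = None

    def update(self):
        """Counterpart of the update_currencies cast: refresh unconditionally."""
        self._rates = self._fetch()
        self._fetched_at = self._clock()

    def get_change(self, code):
        """Counterpart of the {get_change, Cur} call: refresh first if needed."""
        stale = (
            self._rates is None
            or self._clock() - self._fetched_at > self._max_age
        )
        if stale:
            self.update()
        # A bad currency code raises KeyError; no error handling is done here.
        return self._rates[code]

def demo_fetch():
    # Stand-in for the HTTP request; the value is copied from the thread.
    return {"ZWL": 322.355006}

cache = RateCache(demo_fetch, max_age_seconds=3600)
print(cache.get_change("ZWL"))
```

Leaving the KeyError and any fetch failure unhandled mirrors the answer's suggestion that errors may be managed nowhere inside the worker itself.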
I figured out my problem.
So first of all, I wasn't thinking of how simple it is to simply count how many elements there are in the tuple. That being said, I realized there were only three.
So the line I needed was
{A,B,C} = Req.
After that, I only wanted to look at the last one (C, the JSON payload), which was a string.
I found out through another source (not to disregard what you told me, Kay!) that you need to use the list functions, since strings are just lists of integers within an ASCII range (I think); in this case, list_to_binary.
Once I used this line:
D = list_to_binary(C), and subsequently
E = jsx:decode(D), I got this output:
[{<<"disclaimer">>,
<<"Exchange rates are provided for infor
attempt is made to ensure quality, NO gua
se - please use at your own risk. All usag
s://openexchangerates.org/terms/">>},
{<<"license">>,
<<"Data sourced from various providers w
any kind. Bitcoin data provided by http://
t: https://openexchangerates.org/license/"
{<<"timestamp">>,1408068012},
{<<"base">>,<<"USD">>},
{<<"rates">>,
[{<<"AED">>,3.673128},
{<<"AFN">>,56.479925},
{<<"ALL">>,104.1526},
{<<"AMD">>,413.859001},
{<<"ANG">>,1.789},
{<<"AOA">>,97.913949},
{<<"ARS">>,8.274608},
{<<"AUD">>,1.073236},
{<<"AWG">>,1.79005},
{<<"AZN">>,0.783933},
{<<"BAM">>,1.46437},
{<<"BBD">>,2},
{<<"BDT">>,77.478631},
{<<"BGN">>,1.464358},
{<<"BHD">>,0.377041},
{<<"BIF">>,1546.956667},
{<<"BMD">>,1},
{<<"BND">>,1.246774},
{<<"BOB">>,6.91391},
{<<"BRL">>,2.269462},
{<<"BSD">>,1},
{<<"BTC">>,0.0019393375},
{<<"BTN">>,60.843812},
{<<"BWP">>,8.833083},
{<<"BYR">>,10385.016667},
{<<"BZD">>,1.99597},
{<<"CAD">>,1.090486},
{<<"CDF">>,924.311667},
{<<"CHF">>,0.906833},
{<<"CLF">>,0.02399},
{<<"CLP">>,577.521099},
{<<"CNY">>,6.151637},
{<<"COP">>,1880.690016},
{<<"CRC">>,540.082202},
{<<"CUP">>,1.000688},
{<<"CVE">>,82.049699},
{<<"CZK">>,20.818},
{<<"DJF">>,179.084119},
{<<"DKK">>,5.579049},
{<<"DOP">>,43.43789},
{<<"DZD">>,79.8641},
{<<"EEK">>,11.7064},
{<<"EGP">>,7.150475},
{<<"ERN">>,15.062575},
{<<"ETB">>,19.83205},
{<<"EUR">>,0.748419},
{<<"FJD">>,1.850441},
{<<"FKP">>,0.599402},
{<<"GBP">>,0.599402},
{<<"GEL">>,1.74167},
{<<"GGP">>,0.599402},
{<<"GHS">>,3.735499},
{<<"GIP">>,0.599402},
{<<"GMD">>,39.73668},
{<<"GNF">>,6995.309935},
{<<"GTQ">>,7.839405},
{<<"GYD">>,205.351249},
{<<"HKD">>,7.750754},
{<<"HNL">>,21.04854},
{<<"HRK">>,5.708511},
{<<"HTG">>,44.66625},
{<<"HUF">>,233.8448},
{<<"IDR">>,11685.75},
{<<"ILS">>,3.471469},
{<<"IMP">>,0.599402},
{<<"INR">>,60.82523},
{<<"IQD">>,1178.211753},
{<<"IRR">>,26355.666667},
{<<"ISK">>,115.96},
{<<"JEP">>,0.599402},
{<<"JMD">>,112.604801},
{<<"JOD">>,0.707778},
{<<"JPY">>,102.495401},
{<<"KES">>,88.107639},
{<<"KGS">>,51.991},
{<<"KHR">>,4056.578416},
{<<"KMF">>,368.142141},
{<<"KPW">>,900},
{<<"KRW">>,1021.353328},
{<<"KWD">>,0.283537},
{<<"KYD">>,0.826373},
{<<"KZT">>,182.076001},
{<<"LAK">>,8049.834935},
{<<"LBP">>,1509.068333},
{<<"LKR">>,130.184301},
{<<"LRD">>,91.49085},
{<<"LSL">>,10.56165},
{<<"LTL">>,2.583364},
{<<"LVL">>,0.521328},
{<<"LYD">>,1.244147},
{<<"MAD">>,8.372619},
{<<"MDL">>,13.7178},
{<<"MGA">>,2495.605},
{<<"MKD">>,46.00037},
{<<"MMK">>,972.1784},
{<<"MNT">>,1885},
{<<"MOP">>,7.986291},
{<<"MRO">>,292.0081},
{<<"MTL">>,0.683738},
{<<"MUR">>,30.61748},
{<<"MVR">>,15.37833},
{<<"MWK">>,392.9201},
{<<"MXN">>,13.07883},
{<<"MYR">>,3.175406},
{<<"MZN">>,30.3272},
{<<"NAD">>,10.56145},
{<<"NGN">>,162.303701},
{<<"NIO">>,26.07651},
{<<"NOK">>,6.156902},
{<<"NPR">>,97.66846},
{<<"NZD">>,1.179692},
{<<"OMR">>,0.38501},
{<<"PAB">>,1},
{<<"PEN">>,2.795018},
{<<"PGK">>,2.464545},
{<<"PHP">>,43.68439},
{<<"PKR">>,99.5642},
{<<"PLN">>,3.126203},
{<<"PYG">>,4272.421673},
{<<"QAR">>,3.641297},
{<<"RON">>,3.319212},
{<<"RSD">>,87.8205},
{<<"RUB">>,36.00206},
{<<"RWF">>,690.088},
{<<"SAR">>,3.750583},
{<<"SBD">>,7.258136},
{<<"SCR">>,12.40829},
{<<"SDG">>,5.697837},
{<<"SEK">>,6.857347},
{<<"SGD">>,1.246447},
{<<"SHP">>,0.599402},
{<<"SLL">>,4360},
{<<"SOS">>,841.5678},
{<<"SRD">>,3.275},
{<<"STD">>,18322.733333},
{<<"SVC">>,8.745567},
{<<"SYP">>,150.793749},
{<<"SZL">>,10.56279},
{<<"THB">>,31.87122},
{<<"TJS">>,4.985575},
{<<"TMT">>,2.8501},
{<<"TND">>,1.719698},
{<<"TOP">>,1.889033},
{<<"TRY">>,2.15342},
{<<"TTD">>,6.343484},
{<<"TWD">>,29.99281},
{<<"TZS">>,1661.865},
{<<"UAH">>,13.02466},
{<<"UGX">>,2614.28},
{<<"USD">>,1},
{<<"UYU">>,23.70693},
{<<"UZS">>,2337.773304},
{<<"VEF">>,6.295009},
{<<"VND">>,21191.15},
{<<"VUV">>,94.4875},
{<<"WST">>,2.301222},
{<<"XAF">>,491.283058},
{<<"XAG">>,0.05031404},
{<<"XAU">>,7.6211e-4},
{<<"XCD">>,2.70154},
{<<"XDR">>,0.654135},
{<<"XOF">>,491.394602},
{<<"XPF">>,89.416091},
{<<"YER">>,214.985901},
{<<"ZAR">>,10.55649},
{<<"ZMK">>,5252.024745},
{<<"ZMW">>,6.169833},
{<<"ZWL">>,322.355006}]}]
ok
So this is closer to what I want, but how do I extract a specific currency easily? Like ZWL, for example.

Cluster in Haskell

How can I define a cluster in Haskell using list comprehension?
I want to define a function for the cluster:
( a b c ) = [ a <- [1 .. 10],b<-[2 .. 10], c = (a, b)]
In your comment you gave the example [(1,2,1),(1,3,1),(1,4,1),(1,5,1),(1,6,1),(1,7,1)].
In that example, only the middle number changes; the other two are always 1. You can produce this particular list with:
ones = [(1,a,1)| a<-[1..7]]
However, you might want to vary the other ones. Let's have a look at how that works, but I'll use letters instead to make it clearer:
> [(1,a,b)| a<-[1..3],b<-['a'..'c']]
[(1,1,'a'),(1,1,'b'),(1,1,'c'),(1,2,'a'),(1,2,'b'),(1,2,'c'),(1,3,'a'),(1,3,'b'),(1,3,'c')]
You can see that the letters vary more frequently than the numbers: the a<-[1..3] is like an outer loop, with b<-['a'..'c'] being the inner loop.
You could copy the b into the first of the three elements of the tuple:
> [(b,a,b)| a<-[1..3],b<-['a'..'b']]
[('a',1,'a'),('b',1,'b'),('a',2,'a'),('b',2,'b'),('a',3,'a'),('b',3,'b')]
Or give each its own varying input:
> [(a,b,c)| a<-[1..2],b<-['a'..'b'],c<-[True,False]]
[(1,'a',True),(1,'a',False),(1,'b',True),(1,'b',False),(2,'a',True),(2,'a',False),(2,'b',True),(2,'b',False)]

Clustering similar values in a matrix

I have an interesting problem, and I'm sure there is an elegant algorithm to solve it, but I'm having trouble describing it succinctly, which would help in finding such an algorithm.
I have a symmetric matrix of comparison values e.g:
-104.2732 -180.3972 -130.6969 -160.8333 -141.5499 -139.2758 -144.7697 -114.0545 -117.6409 -140.1391
-180.3972 -93.05421 -171.618 -162.0157 -156.8562 -156.3221 -159.9527 -163.2649 -170.127 -153.2709
-130.6969 -171.618 -101.1591 -154.4978 -143.6272 -116.3477 -137.2391 -125.5645 -128.9505 -131.6046
-160.8333 -162.0157 -154.4978 -96.96312 -122.7894 -141.5103 -127.7861 -149.6883 -153.0445 -130.2555
-141.5499 -156.8562 -143.6272 -122.7894 -101.7487 -141.451 -123.9087 -138.7041 -139.2517 -125.3494
-139.2758 -156.3221 -116.3477 -141.5103 -141.451 -99.99486 -134.6553 -132.7735 -138.7249 -134.1319
-144.7697 -159.9527 -137.2391 -127.7861 -123.9087 -134.6553 -100.0683 -141.3492 -138.0292 -120.5331
-114.0545 -163.2649 -125.5645 -149.6883 -138.7041 -132.7735 -141.3492 -106.8555 -115.58 -139.3355
-117.6409 -170.127 -128.9505 -153.0445 -139.2517 -138.7249 -138.0292 -115.58 -104.9484 -140.4741
-140.1391 -153.2709 -131.6046 -130.2555 -125.3494 -134.1319 -120.5331 -139.3355 -140.4741 -101.3919
The diagonal will always show the maximum score (as it is a self-to-self comparison). However, I know that some of these values represent the same item. Taking a quick look at the matrix I can see (and have confirmed manually) that items 0, 7 & 8 form one group, 2 & 5 another, and 3, 4, 6 & 9 a third, with each group identifying a single item.
Now what I'd like to do is find an elegant way to cluster these together to produce 4 clusters (the three groups above, plus item 1 on its own).
Does anyone know of such an algorithm? Any help would be much appreciated as I seem so close to a solution to my problem but am tripping at this one last stumbling block :(
Cheers!
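One possible starting point, offered as a minimal sketch rather than a definitive answer: treat every pair whose score is at or above some cutoff as "linked", then take the connected groups of linked items as clusters (this is the same as cutting a single-linkage hierarchy at that cutoff). The names `cluster_by_threshold` and `THRESHOLD` are hypothetical, and the cutoff of -124 is an assumption hand-picked for this particular matrix, not a general value.

```python
# Pairwise scores copied from the matrix in the question (higher = more similar).
SCORES = [
    [-104.2732, -180.3972, -130.6969, -160.8333, -141.5499, -139.2758, -144.7697, -114.0545, -117.6409, -140.1391],
    [-180.3972, -93.05421, -171.618, -162.0157, -156.8562, -156.3221, -159.9527, -163.2649, -170.127, -153.2709],
    [-130.6969, -171.618, -101.1591, -154.4978, -143.6272, -116.3477, -137.2391, -125.5645, -128.9505, -131.6046],
    [-160.8333, -162.0157, -154.4978, -96.96312, -122.7894, -141.5103, -127.7861, -149.6883, -153.0445, -130.2555],
    [-141.5499, -156.8562, -143.6272, -122.7894, -101.7487, -141.451, -123.9087, -138.7041, -139.2517, -125.3494],
    [-139.2758, -156.3221, -116.3477, -141.5103, -141.451, -99.99486, -134.6553, -132.7735, -138.7249, -134.1319],
    [-144.7697, -159.9527, -137.2391, -127.7861, -123.9087, -134.6553, -100.0683, -141.3492, -138.0292, -120.5331],
    [-114.0545, -163.2649, -125.5645, -149.6883, -138.7041, -132.7735, -141.3492, -106.8555, -115.58, -139.3355],
    [-117.6409, -170.127, -128.9505, -153.0445, -139.2517, -138.7249, -138.0292, -115.58, -104.9484, -140.4741],
    [-140.1391, -153.2709, -131.6046, -130.2555, -125.3494, -134.1319, -120.5331, -139.3355, -140.4741, -101.3919],
]

# Assumption: cutoff chosen by eye for this matrix only.
THRESHOLD = -124.0


def cluster_by_threshold(scores, threshold):
    """Group indices connected by any chain of pairwise scores >= threshold."""
    n = len(scores)
    parent = list(range(n))  # union-find forest: each item starts alone

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only the upper triangle is needed because the matrix is symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


clusters = cluster_by_threshold(SCORES, THRESHOLD)
print(clusters)  # [[0, 7, 8], [1], [2, 5], [3, 4, 6, 9]]
```

With this cutoff the sketch reproduces the manually confirmed grouping, but the margin is thin: the weakest link needed inside a group (4 and 6, at -123.9087) is only slightly above the strongest link between two different groups (2 and 7, at -125.5645). On other data a fixed cutoff may merge or split groups incorrectly, so the threshold would need to be chosen or validated separately.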