Mapping synset ids from older version of Wordnet - nltk

How do I map synset offsets from older versions of WordNet (1.6, to be specific) to the current version (3.0), preferably using NLTK?
For example, in 1.6, wrath has the offset 5588321, while 3.0 gives 7516905 for the same synset.
My primary need for this is to implement WordNet-Affect http://wndomains.fbk.eu/wnaffect.html
which unfortunately uses version 1.6.
I did manage to find a repository at https://github.com/clemtoy/WNAffect, thanks to which I could use WordNet-Affect successfully for my requirements, but it does not solve the problem by mapping synsets between versions.
Getting the offset in WordNet 3.0 with NLTK:
from nltk.corpus import wordnet as wn
wn.synset('wrath.n.01').offset()  # 7516905
EDIT:
Getting the name of a synset from its ID in WordNet 1.6 would serve just as well.
EDIT2:
Here is exactly how the information is stored. This is a small subset:
<noun-syn id="n#05588321" categ="wrath"/>
<noun-syn id="n#05576115" categ="worship"/>
<noun-syn id="n#05600844" categ="world-weariness"/>
<noun-syn id="n#05582577" categ="wonder"/>
<noun-syn id="n#05600968" categ="woe"/>
<noun-syn id="n#05579569" categ="withdrawal"/>
<noun-syn id="n#05604301" categ="weight"/>
<noun-syn id="n#05601315" categ="weepiness"/>
<noun-syn id="n#05574157" categ="weakness"/>
<noun-syn id="n#05611809" categ="warpath"/>
These IDs are all outdated WN 1.6 IDs.

Since nobody has suggested a shortcut, it sounds like you need to do it the obvious way: fire up a WordNet 1.6 installation and convert offsets to synsets yourself. You'll find an official version 1.6 download of WordNet on this page.
I've got no idea how hard they tried to maintain backward compatibility, but hopefully wrath.n.01 is more or less the same thing in all versions of WordNet. I'm guessing that some senses were split into two or more synsets between versions, and perhaps the reverse occasionally happened as well. In such cases there won't be an exact counterpart of the original synset. Whether that's a problem for you is for you to decide.
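For illustration, here is a minimal sketch of that approach using NLTK's WordNetCorpusReader pointed at a locally unpacked 1.6 database. The path is a placeholder, the reader's constructor signature varies across NLTK versions, and the old data files may need tweaks to parse, so treat this as a starting point rather than a drop-in solution:
from nltk.corpus import wordnet as wn  # the WordNet 3.0 bundled with NLTK
from nltk.corpus.reader.wordnet import WordNetCorpusReader

# Point a second reader at the unpacked WordNet 1.6 "dict" directory.
# Newer NLTK versions take a second argument (an Open Multilingual
# WordNet reader); None suffices for plain synset lookups.
wn16 = WordNetCorpusReader('/path/to/wordnet-1.6/dict', None)

def map_offset_16_to_30(offset, pos='n'):
    # Resolve the old offset to a synset name in 1.6, then look that
    # name up in 3.0. On older NLTK versions this method is the
    # underscore-prefixed _synset_from_pos_and_offset.
    name = wn16.synset_from_pos_and_offset(pos, offset).name()
    try:
        return wn.synset(name)
    except Exception:
        return None  # sense renamed, split, or merged between versions

print(map_offset_16_to_30(5588321))  # hopefully Synset('wrath.n.01')
Note that names aren't guaranteed stable across versions either, so where sense numbering shifted this lookup may land on a different sense; cross-checking definitions is a sensible sanity check.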

Related

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3 × -OH, 2 × -CH2 and 1 × -CH.
I'm trying to build a Python script that can predict the density of a mixture using an artificial neural network. As input I want the structure/fingerprint of my molecules, starting from the SMILES string.
I'm already familiar with RDKit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in RDKit, but then I would have to define all the different subgroups. Is there a more convenient/shorter way?
For most structures, there's no existing option to find the fragments. However, there is a module in RDKit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol
fr_Al_OH(mol)  # -> 3
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, some of which may be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these RDKit modules, figure out how they did it, and then implement the same for your features. And as you already mentioned, the general way would be to define a SMARTS pattern and then do fragment matching. The fragment-matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by RDKit, as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches, as you proposed yourself.
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.
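To illustrate the SMARTS route, here is a minimal sketch; the three patterns are simplified examples I made up for glycerol-like fragments, not the UNIFAC set from the paper:
from rdkit import Chem

# Hypothetical, simplified SMARTS patterns for a few groups.
patterns = {
    '-OH (aliphatic)': '[CX4][OX2H]',
    '-CH2-': '[CX4H2]',
    '>CH-': '[CX4H1]',
}

mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol
for name, smarts in patterns.items():
    query = Chem.MolFromSmarts(smarts)
    print(len(mol.GetSubstructMatches(query)), 'x', name)
# Expected: 3 x -OH (aliphatic), 2 x -CH2-, 1 x >CH-
A real group-contribution scheme needs mutually exclusive, exhaustive patterns, which is exactly the non-trivial part the paper above addresses.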

neutral label for NLTK

I have a problem similar to the one below:
Why did NLTK NaiveBayes classifier misclassify one record?
In my case, I queried a positive feed and built positive_vocab, and then queried a negative feed and built negative_vocab. I get the data from the feed, clean it, and build the classifier. How do I build the neutral_vocab? Is there a way I can instruct the NLTK classifier to return a neutral label when the given word is found in neither negative_vocab nor positive_vocab? How do I do that?
In my current implementation, if I give a word which is not present in either set, it says positive by default. Instead it should say neutral or notfound, as in the sketch below.
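Here is a minimal runnable sketch of the fallback I mean, with toy stand-ins for my real vocabularies and training data:
from nltk.classify import NaiveBayesClassifier

# Toy stand-ins for my real vocabularies and training data.
positive_vocab = {'good', 'great'}
negative_vocab = {'bad', 'awful'}
train = ([({'word': w}, 'positive') for w in positive_vocab] +
         [({'word': w}, 'negative') for w in negative_vocab])
classifier = NaiveBayesClassifier.train(train)

def classify_word(word):
    # Guard clause: only let the classifier vote on words it has seen.
    if word not in positive_vocab and word not in negative_vocab:
        return 'neutral'
    return classifier.classify({'word': word})

print(classify_word('great'))  # -> 'positive'
print(classify_word('meh'))    # -> 'neutral' instead of defaulting to positive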

Princeton Wordnet database - two different synset identifiers?

I am trying to make sense of the different identifiers in the Princeton Wordnet database. I am using version 3.1. You can read about the structure here but my focus is on the synsets table.
The Synset Table
The synsets table is one of the most important tables in the database. It is responsible for housing all the definitions within WordNet. Each row in the synsets table has a synsetid, a definition, a pos (part-of-speech field) and a lexdomainid (which links to the lexdomain table). There are 117373 synsets in the WordNet database.
When I search for the word joy in my senses table, I see that there are four different results (2 nouns and 2 verbs). From there, I can identify the sense/meaning that I am looking for, which is the one that corresponds to the meaning:
"the emotion of great happiness"
So I have now found the result that I am looking for. The synset id of this result is 107542591 and I can search this id to find other words with the same sense/meaning.
However, when I use some online versions of Wordnet and I search for words in the synset "the emotion of great happiness", I see a different type of identifier. This identifier is 07527352-n.
For example, you can see it at the top-left corner of this site. On that same site, in the address bar, you'll see that the identifier is referred to as the synset id: &synset=07527352-n.
I would like to know how to retrieve the second type of identifier for a given synset. I've read through the documentation here and searched through the raw data files, but I cannot figure it out.
Thank you!
There are two things going on.
First, MySQL does not like IDs starting with a 0, so they start with 1. (Specifically, nouns get a 1 prefix, verbs 2, adjectives 3, and adverbs 4: see the WordNet identifiers section at http://wordnet-rdf.princeton.edu/ .)
Second, 07542591 is from WordNet 3.1 (I've checked both the raw WordNet files and the SQL files, and they both use this).
"07527352" is from an older version of WordNet. In the case of the Chinese WordNet, I believe they use WordNet 3.0. http://compling.hss.ntu.edu.sg/cow/
Additional: https://stackoverflow.com/a/33348009/841830 has more information. Strangely, I've not been able to track down a simple 3.0-to-3.1 conversion table yet... but I'm sure I've seen one.
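As a sketch of the arithmetic in the first point (the prefix-to-POS mapping comes from the WordNet identifiers note linked above; the result is still a 3.1 offset, so getting the 3.0 identifier that site shows would additionally need the version mapping I mentioned):
POS_BY_PREFIX = {'1': 'n', '2': 'v', '3': 'a', '4': 'r'}

def sql_id_to_offset(synsetid):
    # Strip the leading POS digit and zero-pad the rest to 8 characters.
    s = str(synsetid)
    return s[1:].zfill(8) + '-' + POS_BY_PREFIX[s[0]]

print(sql_id_to_offset(107542591))  # -> '07542591-n' (a WordNet 3.1 offset)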

Status of in-place `rfft` and `irfft` in Julia

So I'm doing some hobby-related stuff which involves taking Fourier transforms of large real arrays that barely fit in memory, and I was curious whether there is an in-place version of rfft and irfft that saves RAM, since RAM consumption is important to me. These transforms are possible despite the input-vs-output type mismatch, and require an extra row of padding.
In Implement in-place rfft! and irfft!, Tim Holy said he was working on an in-place rfft! and irfft! that made use of a buffer-containing RCpair object, but then Steven Johnson said that he was implementing something equivalent using A_mul_B!(y, plan, x), which he elaborated on here.
Things get a little weird from then on. In the documentation for both 0.3.0 and 0.4.0 there is no mention of A_mul_B!, although A_mul_B is listed. But when I try entering them into Julia, I get
A_mul_B!
A_mul_B! (generic function with 28 methods)
A_mul_B
ERROR: A_mul_B not defined
which suggests that the situation is actually the opposite of what the documentation currently describes.
So since A_mul_B! seems to exist, but isn't documented anywhere, I tried to guess how to test it in-place as follows:
A = rand(Float32, 10, 10);
p = plan_rfft(A);
A_mul_B!(A,p,A)
which resulted in
ERROR: `A_mul_B!` has no method matching A_mul_B!(::Array{Float32,2}, ::Function, ::Array{Float32,2})
So...
Are in-place real FFTs still a work in progress? Or am I using A_mul_B! wrong?
Is there a mismatch between the 0.3.0 documentation and 0.3.0's function library?
That pull request from Steven Johnson is listed as open, not merged; that means the work hasn't been finished yet. The one from me is closed, but if you want the code, you can grab it by clicking on the commits.
The docs indeed omit mention of A_mul_B!. A_mul_B is equivalent to A*B, and so isn't exported independently now. A_mul_B! would be used like this: instead of C = A*B, you could say A_mul_B!(C, A, B).
Can you please edit the docs to fix these issues? (You can edit files here in your web browser.)

How do you know what SRID to use for a shp file?

I am trying to put a SHP file into my PostGIS database, but the data is just a little off. I think this is because I am using the wrong SRID. The contents of the PRJ file are as follows:
GEOGCS["GCS_North_American_1983",
DATUM["D_North_American_1983",
SPHEROID["GRS_1980",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433]]
What SRID does this correlate to? And more generally, how can I look up the SRID based on the information found in the PRJ file? Is there a lookup table somewhere that lists all SRID's and their 'geogcs' equivalents?
The data imported using SRID 4269 and SRID 4326 gave exactly the same results.
Does this mean I'm using the wrong SRID, or is this just the expected margin of error?
The shp file is from here.
To elaborate on synecdoche's answer, the SRID is sometimes called an "EPSG" code. The SRID/EPSG code is a de facto shorthand for the Well-Known-Text representations of projections.
You can do a quick search on the SRID table to see if you can find an exact or similar match:
SELECT srid, srtext, proj4text FROM spatial_ref_sys WHERE srtext ILIKE '%BLAH%'
The above was found at http://www.bostongis.com/?content_name=postgis_tut01.
You can also search on spatialreference.org for these kinds of things. The search tool is primitive so you may have to use a Google search and specify the site, but any results will show you the ESRI PRJ contents, the PostGIS SQL INSERT, and a bunch of other representations.
I think your PRJ is at: http://spatialreference.org/ref/sr-org/15/
Prj2EPSG is a small website aimed at exactly this problem; paste in the PRJ contents and it does its best to find a matching EPSG. They also have a web service API. It's not an exact science. They seem to use Lucene and the EPSG database to do text searches for matches.
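For what it's worth, here is a rough sketch of calling that web service from Python; the endpoint, parameter names, and response shape are my assumptions about the JSON API, so verify them against Prj2EPSG's own documentation before relying on this:
import json
import urllib.parse
import urllib.request

# Assumed endpoint and parameters for the Prj2EPSG JSON API (verify first).
wkt = open('myfile.prj').read()  # hypothetical path to your .prj file
params = urllib.parse.urlencode({'mode': 'wkt', 'terms': wkt})
with urllib.request.urlopen('http://prj2epsg.org/search.json?' + params) as resp:
    print(json.load(resp))  # inspect the candidate EPSG codes returned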
The data seems to be NAD83, which has an SRID of 4269. Your PostGIS database has a spatial_ref_sys table which is the SRID lookup table.
If the data looks the same with an SRID of 4269 (NAD83) and 4326 (WGS84), then there's something wrong.
Go and download the GDAL utilities; the ogrinfo utility (which prints the projection information) and ogr2ogr are invaluable.
James already gave a link to spatialreference.org, which helps in finding spatial reference information. I assume you loaded spatial_ref_sys.sql when you prepared your PostGIS instance.
And to be honest, I don't think the problem is in the PostGIS side of things.
I usually keep my data in different SRIDs in my PostGIS DBs. However, I always need to project to the output SRS. You are showing OpenStreetMap pre-rendered tiles, and I bet they have been drawn using SRID 900913 (the Google Maps modified Mercator projection that everyone now uses to render).
My recommendation to you is:
1. Set the right projection in the OpenLayers code to match whatever tiles you are reading from.
2. Keep the data in the database in whatever SRID you want (as long as it is correct, of course).
3. Make sure the server you are using to generate the images from your data (ArcGIS Server, MapServer, GeoServer or whatever it is) reprojects to that same SRS.
Everything will then match.
Cheers
Use GDAL's OSR Python module to determine the code:
from osgeo import osr
srsWkt = '''GEOGCS["GCS_North_American_1983",
    DATUM["D_North_American_1983",
        SPHEROID["GRS_1980",6378137.0,298.257222101]],
    PRIMEM["Greenwich",0.0],
    UNIT["Degree",0.0174532925199433]]'''

# Load in the projection WKT
sr = osr.SpatialReference(srsWkt)

# Try to determine the EPSG/SRID code
res = sr.AutoIdentifyEPSG()
if res == 0:  # success
    print('SRID=' + sr.GetAuthorityCode(None))
    # SRID=4269
else:
    print('Could not determine SRID')
Be sure to take a look at: http://www.epsg-registry.org/
Use the Query by Filter option and enter: North American Datum 1983.
This yields -> EPSG:6269.
Hope this works for you.