Why do my t-SNE plots with euclidean and cosine distances look similar - nltk

I have a question about two t-SNE plots I made.
I have a set of 850 articles for which I wanted to check which articles are similar to each other.
I pre-processed the articles, built a tf-idf matrix for the whole set, and made two t-SNE plots from it: one with cosine distances and one with Euclidean distances.
However, the two plots look very similar, almost as if only the axes were swapped. Is there a logical explanation for this?
The colors are the labels an article got from a simple sentiment analysis.
[Figure: t-SNE plot using cosine distances]
[Figure: t-SNE plot using Euclidean distances]
Thanks for any help in advance!

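This result suggests that, for this kind of data, Euclidean distance and cosine distance are effectively the same distance function up to a scaling factor. That is what you would expect if the tf-idf vectors are L2-normalized (an assumption about your pipeline; some tf-idf implementations normalize by default): for unit vectors the squared Euclidean distance equals twice the cosine distance, so both rank neighbours identically. You could verify this with heatmaps of the two distance matrices.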
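As a rough illustration of why the two metrics can coincide, here is a minimal sketch in plain Python. It assumes the rows are scaled to unit length; the two vectors are made-up values, not taken from the question's data.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Ordinary straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical tf-idf-like rows, normalized to unit length.
a = l2_normalize([0.2, 0.0, 0.7, 0.1])
b = l2_normalize([0.0, 0.5, 0.4, 0.3])

d_euc = euclidean_distance(a, b)
d_cos = cosine_distance(a, b)

# For unit vectors: d_euc**2 == 2 * d_cos (up to rounding).
print(d_euc ** 2, 2 * d_cos)
```

Because one distance is a monotonic function of the other in this case, the nearest neighbours of each article are the same under both metrics, which would explain near-identical t-SNE layouts.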

Related

SQL Finding the coordinates that belong to a circle

I have a SQL database of places, each with assigned coordinates (lat, long). I would like to query the points that lie within a radius of 5 km of a given point. How can I construct the query so that it does not fetch unnecessary records?
Since you are talking about small distances of about 5 km, and we are probably not in the direct vicinity of the North or South Pole, we can work with an approximated grid of longitude and latitude values. Each degree of latitude is equivalent to a distance of km_per_lat = 6371 km * 2 * pi / 360 degrees = 111.195 km. The distance between two lines of longitude that are 1 degree apart depends on the actual latitude:
km_per_long=km_per_lat * cos(lat)
For areas here in North Germany (51 degrees north) this value would be around 69.98km.
So, assuming we are interested in small distances around lat0 and long0 we can safely assume that the translation factors for longitudinal and latitudinal angles will stay the same and we can simply apply the formula
SELECT 111.195*sqrt(power(lat-@lat0,2)
+power(cos(pi()/180*@lat0)*(long-@long0),2)) dist_in_km FROM tbl
Since you want to use the formula in the WHERE clause of your select you could use the following:
SELECT * FROM tbl
WHERE 111.195*sqrt(power(lat-@lat0,2)
+power(cos(pi()/180*@lat0)*(long-@long0),2)) < 5
The select statement works for latitude and longitude values given in decimal degrees. Because of that, the argument of the cos() function has to be converted to radians by multiplying it by pi()/180.
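For checking the numbers outside the database, the same flat-grid approximation can be sketched in Python. The function and constant names are my own; the formula is the one from the query above.

```python
import math

# One degree of latitude on a sphere of radius 6371 km (about 111.195 km).
KM_PER_DEG_LAT = 6371.0 * 2 * math.pi / 360

def approx_dist_km(lat0, long0, lat, long):
    """Flat-grid distance in km; only meant for short distances."""
    dlat = lat - lat0
    # A degree of longitude shrinks with cos(latitude); cos() needs radians.
    dlong = math.cos(math.radians(lat0)) * (long - long0)
    return KM_PER_DEG_LAT * math.sqrt(dlat ** 2 + dlong ** 2)

# One degree of longitude at 51 degrees north.
print(approx_dist_km(51.0, 9.0, 51.0, 10.0))
```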
If you have to work with larger distances (>500km) then it is probably better to apply the appropriate distance formula used in navigation like
cos(delta)=cos(lat0)*cos(lat)*cos(long-long0) + sin(lat0)*sin(lat)
After calculating the actual angle delta by applying acos() you simply multiply that value by the earth's radius R = 6371km = 180/pi()*111.195km and you have your desired distance (see here: Wiki: great circle distance)
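A minimal Python sketch of that great-circle formula, assuming a spherical earth with R = 6371 km (the function name is hypothetical):

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat0, long0, lat, long):
    """Great-circle distance via the spherical law of cosines."""
    p0, p1 = math.radians(lat0), math.radians(lat)
    dl = math.radians(long - long0)
    cos_delta = (math.cos(p0) * math.cos(p1) * math.cos(dl)
                 + math.sin(p0) * math.sin(p1))
    # Rounding can push the value slightly outside [-1, 1]; clamp before acos.
    cos_delta = max(-1.0, min(1.0, cos_delta))
    return EARTH_RADIUS_KM * math.acos(cos_delta)

# A quarter of the equator.
print(great_circle_km(0.0, 0.0, 0.0, 90.0))
```

For very short distances acos() loses precision, which is one more reason to prefer the flat-grid formula at the 5 km scale.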
Update (reply to comment):
Not sure what you intend to do. If there is only one reference position you want to compare against then you can of course precompile your distance calculation a bit like
SELECT @lat0:=51,@long0:=9; -- assuming a base position of: 51°N 9°E
SELECT @rad:=PI()/180,@fx:=@rad*6371,@fy:=@fx*cos(@rad*@lat0);
Your distance calculation will then simplify to just
SELECT @dist:=sqrt(power(@fx*(lat-@lat0),2)+power(@fy*(long-@long0),2))
with current positions in lat and long (no more cosine functions necessary). It is up to you whether you want to store all incoming positions in the database first or whether you want to do the calculations somewhere outside in Spring, Java or whatever language you are using. The equations are there and easy to use.
I would go with the Euclidean distance: dist=sqrt(power(x1-x2,2)+power(y1-y2,2)). It works everywhere. You may have to convert to x/y coordinates first, since degrees cannot be translated into km that easily.
Then you can select everything WHERE x BETWEEN (x0-5) AND (x0+5) AND y BETWEEN (y0-5) AND (y0+5), with (x0, y0) being your reference point, and check those results with the Euclidean distance.
By ordering the results you can get the closest matches first. There may be a way to compute the Euclidean distance in SQL, too.
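The box-then-check idea can be sketched as follows. This is only an illustration in Python with made-up points; the 111.195 km per degree factor is borrowed from the other answer, and in a real query the box test would be the BETWEEN condition so that an index on the coordinates can be used.

```python
import math

KM_PER_DEG_LAT = 111.195  # km per degree of latitude, spherical earth

def within_radius(points, lat0, long0, radius_km):
    """Return the (lat, long) points closer than radius_km to (lat0, long0)."""
    km_per_deg_long = KM_PER_DEG_LAT * math.cos(math.radians(lat0))
    dlat = radius_km / KM_PER_DEG_LAT    # box half-height in degrees
    dlong = radius_km / km_per_deg_long  # box half-width in degrees
    hits = []
    for lat, long in points:
        # Cheap bounding-box test first.
        if abs(lat - lat0) > dlat or abs(long - long0) > dlong:
            continue
        # Exact Euclidean check only on the candidates inside the box.
        dx = (long - long0) * km_per_deg_long
        dy = (lat - lat0) * KM_PER_DEG_LAT
        if math.hypot(dx, dy) < radius_km:
            hits.append((lat, long))
    return hits

# Hypothetical sample points around 51N 9E.
points = [(51.0, 9.0), (51.03, 9.0), (51.04, 9.06), (52.0, 9.0)]
nearby = within_radius(points, 51.0, 9.0, 5.0)
print(nearby)
```

The third point falls inside the box but outside the circle, which is why the second check is still needed.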

Distance Between Two Geo Coordinates

I am just starting to work with spatial data in SQL Server (2008 r2). I am looking to calculate the distance between two coordinates (miles).
DECLARE @source geography
DECLARE @target geography
SET @source = geography::STGeomFromText('POINT (43.420026 -83.974472)', 4326);
SET @target = geography::STGeomFromText('POINT (43.458786 -84.029471)', 4326);
SELECT @source.STDistance(@target)/1609.344 -- meters to miles
My query results in a value of 3.827 miles but I checked it against the site linked below and they are returning a distance of 3.85 miles. Am I doing this incorrectly?
http://www.boulter.com/gps/distance/?from=43.420026+-83.974472&to=43.458786+-84.029471&units=m
I don't think you're doing anything wrong. Your SQL query looks reasonable to me. But…
I'm far from an expert on spatial reference systems and geodesic stuff (have you considered asking the expert folks over at GIS SE?); nevertheless, here are three possibilities that come to my mind:
Perhaps they calculate the distance along a straight line instead of the geodesic distance (i.e. the distance along a curved line, the Earth isn't flat after all). SQL Server's geography type should account for that.
This does not seem very plausible, given that "their" distance is greater than "yours": you'd expect straight-line distance to be smaller than geodesic distance.
Perhaps they can do the calculation more accurately than SQL Server (see note on the MSDN reference page for geography.STDistance):
"STDistance() returns the shortest LineString between two geography types. This is a close approximate to the geodesic distance. The deviation of STDistance() on common earth models from the exact geodesic distance is no more than .25%. This avoids confusion over the subtle differences between length and distance in geodesic types."
Or, both calculations are somewhat inaccurate, but in opposite directions. If I'm not mistaken, the two results you're citing differ by something like 0.5%, that could just be the sum of their deviation and SQL Server's.
But I could be completely wrong.
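One further thing that may be worth checking, and this is my own assumption rather than something the answer above claims: WKT for SQL Server's geography type lists longitude first, so 'POINT (43.420026 -83.974472)' would be read as longitude 43.42, latitude -83.97. The sketch below compares both readings with a plain haversine formula on a sphere of radius 6371 km, so the figures will not match SQL Server's ellipsoidal result exactly.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in statute miles on a spherical earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a)) / 1609.344

# Reading the numbers as (lat, lon), which is presumably what was intended.
as_lat_lon = haversine_miles(43.420026, -83.974472, 43.458786, -84.029471)
# Reading the same numbers as (lon, lat), i.e. in WKT order.
as_lon_lat = haversine_miles(-83.974472, 43.420026, -84.029471, 43.458786)

print(as_lat_lon, as_lon_lat)
```

On this model the first reading gives roughly 3.85 miles and the swapped reading roughly 3.81 miles, so the coordinate order alone could plausibly account for a gap of the size described in the question.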

Octave approximation of e

I want to use MCMC algorithm in Octave to calculate with max precision the following expression: "1/e". After reading some tutorials I found a formula for calculating π, but I do not understand how it works.
octave:2> S=1e7; a=rand(S,2); 4*mean(sum(a.*a,2)<1)
ans = 3.1418
Can someone maybe explain and help me with a hint, how to use such thing for calculating the value of 'e'?
Thanks in advance.
This is an application of the dartboard method for estimating pi. Essentially you are creating an Sx2 matrix (think of it as S number of (x,y) coordinates) all with values between 0 and 1, so geometrically within a 1x1 square. You are then squaring the x and y values and adding them to get the distance squared of the point from the origin. <1 will translate all of these distances into either a 0 or a 1 depending on whether the point lies within the quarter circle of radius one centered at the origin. The mean of this binary array is the ratio of "darts" that hit within the quarter circle out of the total thrown, which is an approximation of its area. Multiply by 4, and you have an estimate for the full circle of radius 1, whose exact area is equal to pi.
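For readers who prefer a loop over the vectorized one-liner, the same dartboard estimate can be written step by step. This is a Python sketch rather than Octave; the seed and the sample count are arbitrary choices of mine.

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi as 4 times the share of random points inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()  # a point in the unit square
        if x * x + y * y < 1:              # squared distance from the origin
            hits += 1
    return 4 * hits / samples

pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```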
Doing a google search brings up this (hopefully) useful publication for calculating e in a similar manner: Monte Carlo estimations of e
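One well-known estimator, which is not necessarily the one used in that publication, relies on the fact that the expected number of uniform random numbers you must add up before the sum exceeds 1 is exactly e. A minimal Python sketch (seed and trial count are arbitrary):

```python
import random

def estimate_e(trials, seed=0):
    """Average number of uniform draws needed for the running sum to exceed 1."""
    rng = random.Random(seed)
    total_draws = 0
    for _ in range(trials):
        running_sum, draws = 0.0, 0
        while running_sum <= 1.0:
            running_sum += rng.random()
            draws += 1
        total_draws += draws
    return total_draws / trials

e_estimate = estimate_e(200_000)
inv_e_estimate = 1 / e_estimate  # the quantity asked for in the question
print(e_estimate, inv_e_estimate)
```

As with the pi example, the error shrinks only with the square root of the number of trials, so this is not a route to "max precision".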

Computing which points (latitude, longitude) are within a certain distance in mysql?

There are two points A, B, and distances x (miles from A), and y (miles from B). Let the distance from A to B be N. So, A is N miles away from B. How do I solve the problem: What are the points available that are (N + x + y) miles away from A? I'm not sure how to explain this any better. I really have no clue on how to attack this problem, I read Fastest Way to Find Distance Between Two Lat/Long Points and I believe the solution given calculates the distance between two points and have no idea if this solution could be used to apply to my problem, or if so, how.
If you are looking for an approximation algorithm, I suggest looking at k-means or hierarchical clustering, or at a monster curve (a space-filling curve). First, you can compute a minimum spanning tree of the graph and then remove the longest, most expensive edges. The tree then splits into many small trees, and you can use k-means to compute groups of points, i.e. clusters.
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges." See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.
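The quoted equivalence can be sketched in a few lines: run Kruskal's algorithm on the complete graph of points and stop as soon as k components remain, which gives the same result as building the full MST and deleting its k-1 most expensive edges. This is a small Python illustration with made-up points and plain Euclidean distances, not code from the linked post.

```python
import math
from itertools import combinations

def single_link_clusters(points, k):
    """Label each point with a cluster id using Kruskal's algorithm, stopped at k components."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, cheapest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    components = len(points)
    for _, i, j in edges:
        if components <= k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    return [find(i) for i in range(len(points))]

labels = single_link_clusters([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
print(labels)
```

Sorting all pairs is quadratic in the number of points, so for large tables a spatial index would be needed first.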
A good example of a monster curve is the Hilbert curve. The basic form of this curve is a U-shape, and by copying and rotating many of these shapes the curve fills the Euclidean space. Surprisingly, a Gray code can help to find the orientation of this U-shape. You can look up Nick's spatial index quadtree Hilbert curve blog article for more details. Instead of calculating the curve's index, you can put together a quadkey like in Bing Maps. The quadkey is unique for each coordinate and it can be used with normal string operations. Each position in the key is part of the U-shape curve, so you can select a region of points by matching the quadkey partially, from left to right (a prefix match).
In this image you can see the green polygon is found using a Hilbert curve:
You can find my php classes here: http://www.phpclasses.org/package/6202-PHP-Generate-points-of-an-Hilbert-curve.html
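To make the quadkey idea above more concrete, here is a small Python sketch. It follows the tile scheme that Bing Maps documents as far as I understand it (Web Mercator tiles, one base-4 digit per zoom level); the function names are my own and the pixel-level details are simplified.

```python
import math

def lat_long_to_tile(lat, long, level):
    """Web Mercator tile (x, y) that contains the point at the given zoom level."""
    lat = max(-85.05112878, min(85.05112878, lat))  # Mercator latitude limit
    sin_lat = math.sin(math.radians(lat))
    x = (long + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level  # number of tiles per axis
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_x, tile_y

def tile_to_quadkey(tile_x, tile_y, level):
    """Interleave the bits of tile_x and tile_y into a base-4 string."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def quadkey(lat, long, level):
    return tile_to_quadkey(*lat_long_to_tile(lat, long, level), level)

coarse = quadkey(51.0, 9.0, 10)
fine = quadkey(51.0, 9.0, 16)
print(coarse, fine)
```

The coarse key is a prefix of the fine key, so a query such as WHERE quadkey LIKE 'prefix%' selects every stored point inside that tile. Two nearby points on opposite sides of a tile border can still have very different keys, so neighbouring tiles have to be queried as well.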

Mysql geometry AREA() function returns what exactly when coords are long/lat?

My question is somewhat related to this similar one, which links to a pretty complex solution - but what I want to understand is the result of this:
Using a Mysql Geometry field to store a small polygon I duly ran
select AREA(myPolygon) where id =1
over it, and got a value like 2.345. So can anyone tell me just what that number represents, seeing as the stored values were long/lat sets describing the polygon?
FYI, the areas I am working on are relatively small (car parks and the like) and the area does not have to be exact - I will not be concerned about the curvature of the earth.
2.345 of what? Thanks, this is bugging me.
The short answer is that the units for your area calculation are basically meaningless ([deg lat diff] * [deg lon diff]). Even though the curvature of the earth wouldn't come into play for the area calculation (since your areas are "small"), it does come into play for the calculation of distance between the lat/lon polygon coordinates.
Since a degree of longitude is different based on the distance from the equator (http://en.wikipedia.org/wiki/Longitude#Degree_length), there really is no direct conversion of your area into m^2 or km^2. It is dependent on the distance north/south of the equator.
If you always have rectangular polygons, you could just store the opposite corner coordinates and calculate area using something like this: PHP Library: Calculate a bounding box for a given lat/lng location
The most "correct" thing to do would be to store your polygons using X-Y (meters) coordinates (perhaps UTM using the WGS-84 ellipsoid), which can be calculated from lat/lon using various libraries like the following for Java: Java, convert lat/lon to UTM. You could then continue to use the MySQL AREA() function.
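If a rough figure is enough (car parks and the like), a simpler alternative to UTM is to scale the degrees to metres locally and apply the shoelace formula. The sketch below assumes a spherical earth with 111.195 km per degree of latitude and a polygon small enough that one cos(latitude) factor fits all vertices; the function name and the sample square are hypothetical.

```python
import math

M_PER_DEG_LAT = 111_195.0  # metres per degree of latitude, spherical earth

def approx_area_m2(polygon):
    """Approximate area in m^2 of a small polygon given as (long, lat) degrees."""
    mean_lat = sum(lat for _, lat in polygon) / len(polygon)
    m_per_deg_long = M_PER_DEG_LAT * math.cos(math.radians(mean_lat))
    long0, lat0 = polygon[0]
    # Local metre coordinates relative to the first vertex.
    xy = [((lon - long0) * m_per_deg_long, (lat - lat0) * M_PER_DEG_LAT)
          for lon, lat in polygon]
    # Shoelace formula over consecutive vertex pairs (wrapping around).
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(xy, xy[1:] + xy[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2

# A 0.001 x 0.001 degree square near 51N 9E.
square = [(9.0, 51.0), (9.001, 51.0), (9.001, 51.001), (9.0, 51.001)]
area = approx_area_m2(square)
print(area)
```

The same square would cover a noticeably different number of square metres closer to the equator, which is why the raw AREA() value in square degrees cannot be converted with a single constant.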