Quickly searching for the most similar objects in n-dimensional space - MySQL

Let's assume that we have points in an n-dimensional space, so we have n coordinates (n columns) that describe the location of each point.
We need to implement a table that can be used to quickly search for the most similar points, i.e. the points with the smallest distance to the desired point.
E.g. points in the db:
id c1 c2 c3 c4 c5
1 5 19 42 12 16
2 3 23 38 15 12
3 14 21 32 33 1
4 12 29 21 24 5
If we want to find the best matches for the point with coords:
c1 c2 c3 c4 c5
4 20 40 14 15
We will get points with id 1 and 2.
We also have the mean coordinate for each dimension (column), and a vector for each point in which the first element is the number of the dimension where the point differs most from the mean coordinate, and the last element is the number of the dimension where it differs least. Maybe this can be used to filter out more quickly the points that have the largest distance to the desired point.
So how can I do something like this using MySQL?
I think a composite index and ORDER BY abs(cx - $mycx) could be a good solution, but I can't use it because I will have more than 16 columns that I would need to include in the one index.
Any help will be very useful!
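As a baseline for the question above, here is a minimal sketch of a brute-force search that orders rows by squared Euclidean distance. It assumes Euclidean distance is the intended similarity measure (the question does not say), uses Python's sqlite3 module only so the example is runnable, and the table name `points` and helper names are hypothetical. It performs a full scan and does not use any index, so it only illustrates the query shape, which works for any number of columns.

```python
import sqlite3

COLUMNS = ["c1", "c2", "c3", "c4", "c5"]
POINTS = [
    (1, 5, 19, 42, 12, 16),
    (2, 3, 23, 38, 15, 12),
    (3, 14, 21, 32, 33, 1),
    (4, 12, 29, 21, 24, 5),
]


def build_table():
    """Create an in-memory table holding the sample points from the question."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"{c} INTEGER" for c in COLUMNS)
    conn.execute(f"CREATE TABLE points (id INTEGER PRIMARY KEY, {cols})")
    marks = ", ".join("?" * (len(COLUMNS) + 1))
    conn.executemany(f"INSERT INTO points VALUES ({marks})", POINTS)
    return conn


def nearest(conn, target, limit):
    """Return the ids of the `limit` points closest to `target` (full table scan)."""
    # Sum of squared differences; the square root is not needed for ordering.
    distance = " + ".join(f"({c} - ?) * ({c} - ?)" for c in COLUMNS)
    params = [value for value in target for _ in range(2)]
    sql = f"SELECT id FROM points ORDER BY {distance} LIMIT ?"
    return [row[0] for row in conn.execute(sql, params + [limit])]


print(nearest(build_table(), (4, 20, 40, 14, 15), 2))  # prints [1, 2]
```

The same ORDER BY expression can be generated for a MySQL table; whether it is fast enough depends on the row count, since every row is visited.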

Related

How to reduce redundant cells in a column containing logged data

Is there a function to reduce the amount of redundant data from one column to match the number of cells in a second column?
I have logged data from two sensors that sent values at different rates. In 8 hours, I collected 11857 values for the first sensor and 8130 for the second one.
I need to compress the first column by deleting data to match the number of cells on the second column, so I can display synchronized values on a chart.
It is not a matter of cutting 3727 cells from the head or tail of the first column, but to delete cells in a proportional way.
I've tried using the MOD function, but it does not give me the right amount of compression; e.g., by running =MOD(A1,3), filtering the cells containing a '0' value and deleting those rows, I get 7905, which is close to 8130, but the data still ends up shifted.
Edit:
I found a method that requires several steps:
1. Copy the sensors' data into two columns
2. Get the number of cells for both columns using COUNTA
3. Get the ratio of the smaller count to the bigger count
4. In a new column, create an index for the rows using =INT(ROW()*ratio)
5. Remove duplicate rows using the index column as the reference with Data > Remove Duplicates
It works, but it would be much faster if there were a ready-made function that runs over the provided data columns and copies the values into two new columns.
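The index-and-deduplicate idea from the edit above can be sketched outside the spreadsheet as follows. This is only an illustration under assumptions: the function name `downsample` is hypothetical, and it uses a zero-based row index with integer arithmetic (rather than the one-based =INT(ROW()*ratio) formula) so that the number of kept values is exactly the target count.

```python
def downsample(values, target_count):
    """Keep the first value for each distinct index i * target_count // len(values)."""
    total = len(values)
    kept = []
    last_index = -1
    for i, value in enumerate(values):
        index = i * target_count // total  # plays the role of INT(ROW()*ratio)
        if index != last_index:            # "remove duplicates": keep first per index
            kept.append(value)
            last_index = index
    return kept


s1 = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
print(downsample(s1, 7))  # prints [2, 6, 10, 12, 16, 18, 22]
```

With 11857 logged values and a target of 8130, the same function returns exactly 8130 values spread proportionally over the original column.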
I tested this solution in LibreOffice Calc. The functions used are basic enough to be found in Excel as well.
Here's a sample with data from 2 sensors, s1 and s2, similar to yours:
Row s1 s2
1 2 3
2 4 6
3 6 9
4 8 12
5 10 15
6 12 18
7 14 21
8 16
9 18
10 20
11 22
What I did was match each s1 sample with the s2 sample at the corresponding relative position. So instead of ending up with a number of rows with no s2 values, I padded the non-existent s2 values with the last sample taken for any given period of time (column s2a):
Row s1 s2 s2a
1 2 3 3
2 4 6 6
3 6 9 6
4 8 12 9
5 10 15 12
6 12 18 12
7 14 21 15
8 16 18
9 18 18
10 20 21
11 22 21
Assuming that s1 is column A and s2 is column B in the spreadsheet, the function you want on each cell of the new column is:
=INDIRECT( ADDRESS( CEILING( ROW()* COUNT(B:B)/COUNT(A:A)),2))
Let's go from bottom to top:
COUNT(B:B)/COUNT(A:A) - this is the ratio, about 0.63 (7/11) for the sample above. It indicates that the sample in any given row of s1 will be found at that row number x 0.63 in column s2.
Ceiling - Spreadsheets don't start at row 0, so the first one HAS to be 1. I experimented with Int(), but it rounds down, so the first row would end up as 0, which we don't want.
Address - Returns a string with the address of a cell given its row and column coordinates (e.g. Address(3, 2) returns the absolute reference "$B$3", which is the form used here; an optional third argument controls which parts are absolute, so Address(3, 2, 4) returns the relative "B3").
Indirect - Returns the contents of a cell whose address is passed as a string (e.g. Indirect("x5") will return whatever value is stored in cell X5).
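To make the row mapping concrete, here is a small sketch of the same lookup in Python. The function name `align` is hypothetical, and the ceiling is computed with integer arithmetic instead of CEILING to avoid floating-point rounding; otherwise it follows the formula above, with one-based rows as in a spreadsheet.

```python
def align(s1, s2):
    """For each row of s1, pick the s2 sample at row CEILING(row * len(s2) / len(s1))."""
    n1, n2 = len(s1), len(s2)
    aligned = []
    for row in range(1, n1 + 1):           # spreadsheet rows start at 1
        target_row = (row * n2 + n1 - 1) // n1  # integer ceiling of row * n2 / n1
        aligned.append(s2[target_row - 1])      # convert back to a zero-based index
    return aligned


s1 = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
s2 = [3, 6, 9, 12, 15, 18, 21]
print(align(s1, s2))  # prints [3, 6, 6, 9, 12, 12, 15, 18, 18, 21, 21]
```

The output matches the s2a column in the table above.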
Alex

Finding Recurring Number Combinations in Column of Numbers

I have searched and found discussions and solutions to similar problems, but not quite or as complex as I'm trying to figure out.
I have an Access table which consists of two columns, Draw Number and Number Drawn, as shown below. Draw Number is repeated 20 times, to correspond to the 20 numbers that are drawn in each particular draw.
I'm trying to figure out a way to determine the most frequently occurring combination of numbers (5 numbers) across all of the draws, each of which is a set of 20 numbers. So for instance, 12341 occurs n x, 12342 occurs n x, 12343 occurs n x, etc.
I've created parameter queries which allow me to search for different number combinations from 2 to 10 numbers, and they work OK, returning the number of occurrences of a combination of numbers that I input through a simple UI. But the goal is to figure out programmatically what the optimum combination of numbers is.
Hope this makes sense. And by the way, there are 36 million or so rows in the table. The para queries work quite well however; it takes just over a second to return results for each number added. So, query two numbers = 2 second wait, three numbers = 3 second wait, etc.
I've been thinking about a loop of some type but don't know how to get started. Processing time isn't an issue; it can take a day if required!
This is written in VBA and has an assortment of queries, temp tables, etc to get the job done.
The text says Access, but the tags say MySql, which is it? – RBarryYoung 21 hours ago
This part confuses me: I'm trying to figure a way to determine the most frequent occurring combination of numbers (5 numbers) for all of the draws in each of the 20 number sets. So for instance, 12341 occurs n x, 12342 occurs nx, 12343 occurs n x, etc. – Newd 21 hours ago
^What do you mean five numbers? No where in your sample data do I see 12341. Please explain using the data you have, and give expected results using that data. – McAdam331 21 hours ago
drosberg - clarification:
thanks for the response. It is an Access application, but as a first-time poster Stackoverflow recommends tags?
By five numbers I mean the most frequently occurring group of five numbers (I used five as an example, could be groups of 2 to 10 numbers) which occur in each draw, where a draw consists of 20 drawn numbers from a total of 80 numbers. So the data that I posted was intended as an example. The sample provided only has 50, 51 in common. I can plug 50 and 51 into the parameter query and it will tell me that this combination occurs 60,000 times (or whatever), but perhaps 50 and 57 occurs 65,000 times.
If I were to do this manually, and assuming I'm looking for the most frequent 5-number combination, I would enter the following in the parameter query:
1,2,3,4,1 group = 30,000 occurrences
1,2,3,4,2 group = 31,000 occurrences
1,2,3,4,3 group = 31,050 occurrences
1,2,3,4,4 group = 29,050 occurrences
etc.
but I would have to do this for every combination of 5 numbers that can be derived from the numbers 1 thru 80. I'm hoping to have a program do the work!!
thanks
don
DRAW NUMBER NUMBER DRAWN
1 1
1 28
1 19
1 3
1 38
1 46
1 43
1 29
1 13
1 22
1 20
1 11
1 50
1 51
1 53
1 54
1 57
1 64
1 76
1 78
2 29
2 14
2 2
2 1
2 35
2 40
2 39
2 30
2 10
2 27
2 21
2 6
2 42
2 50
2 51
2 53
2 54
2 61
2 65
2 69
I wrote a post a while ago about generating permutations with and without repetition using Excel. Perhaps you can use it.
https://michiel.wordpress.com/2015/03/29/permutations-with-repetition-using-excel/
Here's how it works. I am using strings, but you can easily modify that for numbers (since you say you need 5).
You can use the MID function to grab a single char from a string, and generate permutations from it.
=MID(Pattern,MOD([N]/[P],Length)+1,1)
N refers to the column N
P refers to the horizontal row (1,4,16). You can generate these with a formula like =4^.
After putting in the code, you can make a list of all permutations in Excel and in the cell next to it generate a sql query that you can perform as well from VBA.
Example: Looking up Access database in Excel
Or find a commercial tool like http://thingiequery.com/
I don't know if there are any open source tools for it.
I'm thinking that you should consider:
Say there are 100 balls.
Set up a table that has one row for each "Draw number" and 100 columns, one for every possible number, where each column has type boolean.
When you look to see which draws had number 23 you just add a
WHERE Column23 = true.
For numbers 23 and 56
WHERE Column23 = true AND Column56 = true
This should massively simplify and speed up your SQL.
You set up a table with every possible combination of numbers.
You run SQL to find the counts.
Harvey
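As an alternative to enumerating every possible combination up front, here is a rough sketch that counts only the combinations that actually occur in each draw. It is written in Python rather than VBA, the function name `most_common_combinations` is hypothetical, and the input is assumed to be (draw number, number drawn) pairs exported from the table. With 20 numbers per draw there are 15,504 five-number combinations per draw, so a full run over millions of draws is slow and memory-hungry, which the question says is acceptable.

```python
from collections import Counter
from itertools import combinations


def most_common_combinations(rows, k, top=3):
    """Count every k-number combination per draw and return the most frequent ones."""
    # Group the drawn numbers by draw number.
    draws = {}
    for draw_number, number_drawn in rows:
        draws.setdefault(draw_number, set()).add(number_drawn)

    # Sorting first makes (1, 2) and (2, 1) count as the same combination.
    counts = Counter()
    for numbers in draws.values():
        counts.update(combinations(sorted(numbers), k))
    return counts.most_common(top)


# Tiny made-up sample: three draws of three numbers each.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 4), (3, 1), (3, 2), (3, 3)]
print(most_common_combinations(rows, 2, top=1))  # prints [((1, 2), 3)]
```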

Efficiently joining over interval ranges in SQL

Suppose I have two tables as follows (data taken from this SO post):
Table d1:
x start end
a 1 3
b 5 11
c 19 22
d 30 39
e 7 25
Table d2:
x pos
a 2
a 3
b 3
b 12
c 20
d 52
e 10
The first row in both tables is the column headers. I'd like to extract all the rows in d2 where column x matches d1 and pos falls within (including boundary values) d1's start and end columns. That is, I'd like the result:
x pos start end
a 2 1 3
a 3 1 3
c 20 19 22
e 10 7 25
The way I've seen this done so far is:
SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end
But what is not clear to me is whether this operation is done as efficiently as it can be (i.e., optimised internally). For example, computing the entire join first is not really a scalable approach IMHO (in terms of both speed and memory).
Are there any other efficient query optimisations (e.g. using interval trees) or other algorithms that can handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter if it's using SQLite, PostgreSQL, MySQL etc.
What is the most efficient way to perform this operation in SQL?
Thank you very much.
Not sure how it all works out internally, but depending on the situation I would advise playing around with a table that 'rolls out' all the values from d1 and then joining on that one. This way the query engine can pinpoint the right record 'exactly' instead of having to find a combination of boundaries that match the value being looked for.
e.g.
x value
a 1
a 2
a 3
b 5
b 6
b 7
b 8
b 9
b 10
b 11
c 19 etc..
Given an index on the value column (**), this should be quite a bit faster than joining with the BETWEEN start AND end on the original d1 table IMHO.
Of course, each time you make changes to d1, you'll need to adjust the rolled out table too (trigger?). If this happens frequently you'll spend more time updating the rolled out table than you gained in the first place! Additionally, this might quickly take quite a bit of (disk) space if some of the intervals are really big; and it also assumes we don't need to look for non-whole numbers (e.g. what if we look for the value 3.14?)
(**) You might consider experimenting with a unique index on (value, x) here...
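The rolled-out table idea can be tried on the sample data with the sketch below. It uses Python's sqlite3 module purely to stay self-contained; the table name `d1_rolled` and the index name are hypothetical, and a plain (non-unique) index on (x, value) is used on the assumption that intervals sharing the same x might overlap in real data. It does not measure performance, it only shows that the equality join returns the same rows as the BETWEEN join.

```python
import sqlite3

D1 = [("a", 1, 3), ("b", 5, 11), ("c", 19, 22), ("d", 30, 39), ("e", 7, 25)]
D2 = [("a", 2), ("a", 3), ("b", 3), ("b", 12), ("c", 20), ("d", 52), ("e", 10)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d1 (x TEXT, start INTEGER, "end" INTEGER);
CREATE TABLE d2 (x TEXT, pos INTEGER);
CREATE TABLE d1_rolled (x TEXT, value INTEGER, start INTEGER, "end" INTEGER);
CREATE INDEX d1_rolled_x_value ON d1_rolled (x, value);
""")
conn.executemany("INSERT INTO d1 VALUES (?, ?, ?)", D1)
conn.executemany("INSERT INTO d2 VALUES (?, ?)", D2)

# Roll out every whole number covered by each interval (boundaries included).
for x, start, end in D1:
    conn.executemany(
        "INSERT INTO d1_rolled VALUES (?, ?, ?, ?)",
        [(x, value, start, end) for value in range(start, end + 1)],
    )

between_rows = conn.execute("""
    SELECT d2.x, d2.pos, d1.start, d1."end"
    FROM d1 JOIN d2 ON d1.x = d2.x
    WHERE d2.pos BETWEEN d1.start AND d1."end"
    ORDER BY d2.x, d2.pos
""").fetchall()

rolled_rows = conn.execute("""
    SELECT d2.x, d2.pos, r.start, r."end"
    FROM d2 JOIN d1_rolled AS r ON r.x = d2.x AND r.value = d2.pos
    ORDER BY d2.x, d2.pos
""").fetchall()

print(rolled_rows)
# prints [('a', 2, 1, 3), ('a', 3, 1, 3), ('c', 20, 19, 22), ('e', 10, 7, 25)]
```

Whether the equality join actually beats the range join depends on the engine and the data, so it is worth comparing query plans on real volumes before committing to the extra table.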

MySQL: Exclude rows that do not satisfy the condition list

Here is my data:
ID C1 C2 C3
6 Digit 2 6,8,10,12
12 Digit 3 15
15 127 Digit 2 6,7,8,9,10,11,12,13
68 140,141 Digit 11 85,86,87,88,167,168,158,159
73 1 Digit 11 85,86,87,88,169,170
76 Digit 11 85,86,87,91,164,165,166,167,168
99 Digit 11 20,27,85,86,87
106 Digit 1 1,2
111 Digit 11 85,86,87,88
112 Digit 11 85,86,87,88
135 Digit 11 85,86,87
and my condition string is (2,6,15,37,42,52,62,65,79,85,94,100,104,107,113,124,131)
Now I want to exclude rows 3, 4 and 5 (IDs 15, 68 and 73) if their C1 values (127; 140,141; 1) are not in the condition list. I tried NOT IN, but to no avail. I think I might be missing something basic, but I just can't get it.
It's better not to store multiple values in a column if possible. Then it's easier to do queries like this.
You cannot use "IN" or "NOT IN" because they are looking for a list of separate items. But C3 is just one item that happens to have commas in it.
Try this:
SELECT * FROM
(SELECT ID, C1, C2, CONCAT('|',REPLACE(C3,',','|'),'|') as C3 FROM `table` WHERE `C3` ) as t1
WHERE t1.C3 NOT LIKE '%|127|%' AND t1.C3 NOT LIKE '%|140|%' AND t1.C3 NOT LIKE '%|141|%' AND t1.C3 NOT LIKE '%|1|%'
You could avoid the "|" and just concat "," to the start and end.
Or you could fix your database schema so that it actually acts like a Normalized Relational Database.
Every column that contains multiple values should be separated out into its own table.
There should be no column C3 in your table above. Instead, you should have a table, some_other_data:
At this point, I see that C3=6 is related to more than one record in the main table. Therefore, you actually need a third, linking table, in addition to some_other_data. See below.
`some_other_data`
id
6
8
10
12
15
`main_table_to_some_other_data_link`
some_other_data_id | main_table_id
6 6
8 6
10 6
12 6
15 12
6 15
etc. You can see that the linking table can contain duplicates of either value. But your other two tables would have completely unique ids.
I think you're trying to solve the wrong problem.
(I'm assuming you can change your table structure. If you can't someone else will need to address your question.)
The long lists of comma-separated data are a flag that they have a one-to-many relationship with ID.
For example, make the data in C3 its own table:
ID MainID C3
================
1 6 6
2 6 8
3 6 10
4 6 12
5 12 15
6 15 6
7 15 7
8 15 8
9 15 9
10 15 10
11 15 11
12 15 12
13 15 13
// and so forth //
So ID is the primary key of the new table, MainID is the foreign key that refers to the record in your primary table, and C3 is the data in C3.
Each separate value of C3 now has its own record.
Now, you're in a position to use something like
Select * from MainTable
Inner Join NewTable
On MainTable.ID = NewTable.MainID
Where NewTable.C3 Not In (2,6,15,37,42,52,62,65,79,85,94,100,104,107,113,124,131);
If you can, pulling out the one-to-many relationships into their own tables will make things easier for you.
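Applied to the C1 column from the question, the normalized layout might look like the sketch below. It runs against SQLite through Python so it is self-contained; the table names `main_table` and `main_c1` are hypothetical, and it assumes the intent is to drop a row when any of its C1 values is missing from the condition list, while rows with an empty C1 are kept.

```python
import sqlite3

CONDITION = (2, 6, 15, 37, 42, 52, 62, 65, 79, 85, 94, 100, 104, 107, 113, 124, 131)
MAIN_IDS = (6, 12, 15, 68, 73, 76, 99, 106, 111, 112, 135)
C1_VALUES = [(15, 127), (68, 140), (68, 141), (73, 1)]  # one row per C1 value

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (id INTEGER PRIMARY KEY);
CREATE TABLE main_c1 (main_id INTEGER, c1 INTEGER);
""")
conn.executemany("INSERT INTO main_table VALUES (?)", [(i,) for i in MAIN_IDS])
conn.executemany("INSERT INTO main_c1 VALUES (?, ?)", C1_VALUES)

# Keep a main row only if it has no C1 value outside the condition list.
placeholders = ", ".join("?" * len(CONDITION))
query = f"""
    SELECT m.id
    FROM main_table AS m
    WHERE NOT EXISTS (
        SELECT 1 FROM main_c1 AS c
        WHERE c.main_id = m.id AND c.c1 NOT IN ({placeholders})
    )
    ORDER BY m.id
"""
kept_ids = [row[0] for row in conn.execute(query, CONDITION)]
print(kept_ids)  # prints [6, 12, 76, 99, 106, 111, 112, 135]
```

The NOT EXISTS subquery carries over to MySQL unchanged; only the connection code is specific to this sketch.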

How to find matrix element with sql query?

I have an array and a table that references some elements of the array. My array looks like this:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
And I have an area defined by start point s=9, X=2, Y=2 and row count R=6,
which gives me the boxes 9,10,11,15,16,17,21,22,23.
Now I am trying to write some SQL that checks whether the number 16 is in this area. I came up with logic like if ((s<16<s+X) || (s+6<16<s+X+6) || (s+12<16<s+X+12)), but how should I write it in one SQL query? I am using MySQL.
This doesn't have anything to do with SQL, I don't think, but something like the following condition is probably what you want. Since your example has the same value of X and Y, and "Row Count" sounds more like "number of rows" than "number of items in a row," I may have gotten rows and columns backwards from what you want.
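SET @s = 9, @x = 2, @y = 2, @R = 6, @testval = 16;
-- @R is used as the number of items per row; @x and @y are inclusive offsets from the start cell.
-- The row is the integer quotient by @R and the column is the remainder.
SELECT (@testval - 1) DIV @R BETWEEN (@s - 1) DIV @R AND (@s - 1) DIV @R + @y
   AND (@testval - 1) % @R BETWEEN (@s - 1) % @R AND (@s - 1) % @R + @x AS in_area;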
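The row and column arithmetic can be checked quickly with the following sketch. The function name `in_area` is hypothetical, and it assumes, based on the boxes listed in the question, that R is the number of items per row and that X and Y are inclusive offsets from the start cell.

```python
def in_area(value, s, x, y, row_width):
    """Return True if `value` lies in the block starting at `s`, spanning x+1 columns and y+1 rows."""
    # Cells are numbered from 1, so subtract 1 before splitting into (row, column).
    row, col = divmod(value - 1, row_width)
    start_row, start_col = divmod(s - 1, row_width)
    return start_row <= row <= start_row + y and start_col <= col <= start_col + x


boxes = [v for v in range(1, 25) if in_area(v, s=9, x=2, y=2, row_width=6)]
print(boxes)                       # prints [9, 10, 11, 15, 16, 17, 21, 22, 23]
print(in_area(16, 9, 2, 2, 6))     # prints True
```

The list it produces matches the boxes given in the question, which suggests this reading of X, Y and R is the intended one.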