I have a table A which has a column 'template_phash'. I store the phash generated from 400K images.
Now I take a random image and generate a phash from that image.
How do I query table A for the records whose Hamming distance from that phash is below a threshold value, say 20?
I have seen Hamming distance on binary strings in SQL, but couldn't figure it out.
I think I need to write a function to achieve this, but how?
Both of my phashes are stored as BIGINT, e.g. 7641692061273169067.
Please help me write the function so that I can query like this:
SELECT product_id, HAMMING_DISTANCE(phash1, phash2) as hd
FROM A
WHERE hd < 20 ORDER BY hd ASC;
I figured out that the hamming distance is just the count of different bits between the two hashes. First xor the two hashes then get the count of binary ones:
SELECT product_id, BIT_COUNT(phash1 ^ phash2) as hd from A ORDER BY hd ASC;
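To make the XOR-then-count idea concrete, here is a minimal Python sketch of the same computation done outside the database. The `stored_hashes` dictionary and its product ids are hypothetical sample data, and the hashes are assumed to fit in 64 bits, as a BIGINT does.

```python
MASK64 = (1 << 64) - 1  # keep the XOR result within 64 bits, like a BIGINT


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two 64-bit hashes."""
    return bin((hash_a ^ hash_b) & MASK64).count("1")


query_hash = 7641692061273169067  # example value from the question

# Hypothetical product_id -> phash rows standing in for table A.
stored_hashes = {
    1: 7641692061273169067,           # identical hash
    2: 7641692061273169067 ^ 0b1011,  # differs in 3 bits
    3: 7641692061273169067 ^ MASK64,  # differs in all 64 bits
}

threshold = 20
# Keep rows under the threshold, ordered by distance (like ORDER BY hd ASC).
matches = sorted(
    (hamming_distance(query_hash, phash), product_id)
    for product_id, phash in stored_hashes.items()
    if hamming_distance(query_hash, phash) < threshold
)
print(matches)  # [(0, 1), (3, 2)]
```

In SQL, the threshold would most likely be applied by repeating the `BIT_COUNT(...)` expression in `WHERE` or by filtering on the alias in `HAVING`, since a column alias cannot be referenced in `WHERE`.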
Can somebody explain to me how this query below works?
It's a query that calculates the median latitude from the table. I have worn myself out trying to understand it, and I still can't.
SELECT *
FROM station as st
where (
select count(LAT_N)
from station
where LAT_N < st.LAT_N
) = (
select count(LAT_N)
from STATION where LAT_N > st.LAT_N
);
The median is the middle value in a collection, which means there are as many values above it as below.
So for each row in the table, the first subquery counts the number of rows where LAT_N is lower than in the current row, and the second counts the number of rows where it's higher. Then it only returns the rows where these counts are the same.
Note that this won't work in many situations. The simplest example is when there are an even number of distinct values:
1
2
3
4
The median should be 2.5 (the mean of the two middle values), which doesn't exist in the table.
Another case is when there are duplicate values:
1
1
2
The median should be 1. But for each row with value 1, the count of lower values is 0 while the count of higher values is 1, so they won't be equal.
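The two failure cases can be checked by running the query against small tables. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the original database (an assumption; the query itself is standard SQL), and the helper name `median_rows` is made up for illustration.

```python
import sqlite3

# The query from the question, selecting only the latitude column.
MEDIAN_QUERY = """
SELECT LAT_N FROM station AS st
WHERE (SELECT COUNT(LAT_N) FROM station WHERE LAT_N < st.LAT_N)
    = (SELECT COUNT(LAT_N) FROM station WHERE LAT_N > st.LAT_N)
"""


def median_rows(values):
    """Load the values into an in-memory station table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE station (LAT_N REAL)")
    conn.executemany("INSERT INTO station VALUES (?)", [(v,) for v in values])
    rows = [row[0] for row in conn.execute(MEDIAN_QUERY)]
    conn.close()
    return rows


print(median_rows([1, 2, 3]))     # [2.0]  odd count, distinct values: works
print(median_rows([1, 2, 3, 4]))  # []     even count: no row qualifies
print(median_rows([1, 1, 2]))     # []     duplicates: no row qualifies
```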
Oh that's some clever code! It's got a bug, but here's what it's trying to do.
We define median as the value(s) that have the same number of values greater and less than them. So here's the query in pseudocode:
for each station in st:
compute number of stations with latitude greater than the current station's latitude
compute number of stations with latitude less than the current station's latitude
If these two values are equal, include it in the result.
Bug:
For tables with an even number of distinct values, the median should be defined as the mean of the two middle values. This code doesn't handle that.
I have one column with bytes and another with milliseconds, and I must calculate the average bitrate in bits per second.
I'm doing this:
SELECT AVG(Bytes*8)/AVG(Milliseconds/1000)
FROM Tracks
Apparently this is wrong (I'm using an app with exercises).
I get this result:
254492.61
But it should be:
254400.25
I think you only want one average calculation
SELECT AVG((Bytes*8.0)/(Milliseconds/1000.0))
FROM Tracks
and you may want to increase precision to decimals which is why 8.0 and 1000.0 are used above. Remove if unwanted.
I would be inclined to write this as:
SELECT SUM(Bytes*8) / SUM(Milliseconds/1000)
FROM Tracks
This is equivalent to your query, though -- assuming that the values are never NULL.
Perhaps they mean the average of averages:
SELECT AVG(Bytes * 8 / (Milliseconds / 1000))
FROM Tracks;
I would not describe this as the average bits per second, however.
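The difference between the approaches is easier to see with numbers. The following Python sketch uses two hypothetical tracks (the byte and millisecond values are invented for illustration) to show that the ratio of averages equals the ratio of sums, while the average of per-track ratios gives a different figure.

```python
import math

# Hypothetical (Bytes, Milliseconds) rows standing in for the Tracks table.
tracks = [(1_000_000, 10_000), (4_000_000, 20_000)]

bits = [b * 8 for b, _ in tracks]
seconds = [ms / 1000 for _, ms in tracks]

# AVG(Bytes*8) / AVG(Milliseconds/1000)
ratio_of_averages = (sum(bits) / len(bits)) / (sum(seconds) / len(seconds))

# SUM(Bytes*8) / SUM(Milliseconds/1000): the row count cancels out
ratio_of_sums = sum(bits) / sum(seconds)

# AVG(Bytes*8 / (Milliseconds/1000)): per-track bitrate, then averaged
average_of_ratios = sum(b / s for b, s in zip(bits, seconds)) / len(tracks)

print(round(ratio_of_averages, 2))  # 1333333.33
print(round(ratio_of_sums, 2))      # 1333333.33
print(round(average_of_ratios, 2))  # 1200000.0
```

The first two weight each track by its duration; the third gives every track equal weight regardless of length.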
I'm using CodeIgniter 2 and in my database model, I have a query that joins two tables and filters row based upon distance from a given geolocation.
SELECT users.id,
(3959 * acos(cos(radians(42.327612)) *
cos(radians(last_seen.lat)) * cos(radians(last_seen.lon) -
radians(-77.661591)) + sin(radians(42.327612)) *
sin(radians(last_seen.lat)))) AS distance
FROM users
JOIN last_seen ON users.id = last_seen.seen_id
WHERE users.age >= 18 AND users.age <= 30
HAVING distance < 50
I'm not sure if it's the distance that is making this query take especially long. I do have over 300,000 rows in my users table. The same amount in my last_seen table. I'm sure that plays a role.
But, the age column in the users table is indexed along with the id column.
The lat and lon columns in the last_seen table are also indexed.
Does anyone have ideas as to why this query takes so long and how I can improve it?
UPDATE
It turns out that this query actually runs pretty quickly. When I execute it in phpMyAdmin, it takes 0.56 seconds. Not too bad. But when I execute it with a third-party SQL client like SequelPro, it takes at least 20 seconds and all of the other apps on my Mac slow down. When the query is executed by loading the script via jQuery's load() method, it takes around the same amount of time.
Upon viewing my network tab in Google Chrome's developer tools, it seems that the reason it's taking so long to load is because of what's called TTFB or Time To First Byte. It's taking forever.
To make this query faster, you need to limit the number of rows using an index before calculating the distance for each of them. To do so, you can pre-filter the rows from last_seen on their lat/lon using a rough bound for the desired distance.
The idea is that a position within 50 miles of the reference point must have a latitude within a certain range of the reference latitude and a longitude within a certain range of the reference longitude.
For a 50-mile distance, RefLat ± 1 degree and RefLon ± 1 degree would be a good start to limit the rows before calculating the precise distance.
last_seen.lat BETWEEN 42.327612 - 1 AND 42.327612 + 1
AND last_seen.lon BETWEEN -77.661591 - 1 AND -77.661591 + 1
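As a rough check on the ±1 degree figure, the Python sketch below computes how many degrees of latitude and longitude a 50-mile radius actually spans at the reference point. It assumes a spherical Earth with the same 3959-mile radius used in the query and a location well away from the poles; the function names are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # same constant as in the SQL distance formula


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the search circle.

    Assumes a spherical Earth and a circle that does not reach a pole.
    """
    angular_radius = radius_miles / EARTH_RADIUS_MILES  # in radians
    lat_delta = math.degrees(angular_radius)
    # Longitude lines converge toward the poles, so the span grows with latitude.
    lon_delta = math.degrees(
        math.asin(math.sin(angular_radius) / math.cos(math.radians(lat)))
    )
    return (lat - lat_delta, lat + lat_delta, lon - lon_delta, lon + lon_delta)


def distance_miles(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, mirroring the expression in the query."""
    cos_angle = (
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
        * math.cos(math.radians(lon2) - math.radians(lon1))
        + math.sin(math.radians(lat1)) * math.sin(math.radians(lat2))
    )
    # Clamp to [-1, 1] to guard acos against floating-point overshoot.
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, cos_angle)))


ref_lat, ref_lon = 42.327612, -77.661591
min_lat, max_lat, min_lon, max_lon = bounding_box(ref_lat, ref_lon, 50)
print(round(max_lat - ref_lat, 3), round(max_lon - ref_lon, 3))
```

At this latitude both spans come out below one degree, so the ±1 box is slightly generous; the exact distance test in `HAVING` still removes the corners of the box that lie outside the circle.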
For this query:
SELECT users.id, (3959 * acos(cos(radians(42.327612)) * cos(radians(last_seen.lat)) * cos(radians(last_seen.lon) - radians(-77.661591)) + sin(radians(42.327612)) * sin(radians(last_seen.lat)))) AS distance
FROM users JOIN
last_seen
ON users.id = last_seen.seen_id
WHERE users.age >= 18 AND users.age <= 30
HAVING distance < 50;
The best indexes are users(age, id) and last_seen(seen_id). Unfortunately, the distance calculations are going to take a while, because they have to be performed for every row that passes the age filter. You might want to consider a GIS extension to MySQL to help with this type of query.
I have a variance report query, and I need the Variance column not to show 10 decimal places. What is the most convenient way to round the Variance results to the hundredth?
WITH A AS
(
select
A.FACTORY,
A.JOB_NUMBER,
A.PROCESS_STAGE,
A.PART_CODE,
B.PART_DESC_1,
A.INPUT_QTY_STD,
A.QUANTITY_INPUT,
A.QUANTITY_OUTSTANDING,
A.INPUT_QTY_ACTUAL,
(A.QUANTITY_OUTSTANDING*100/NULLIF(A.INPUT_QTY_STD,0)) as variance,
A.ACTUAL_CLOSE_DATE
from
(select * from [man_prod].[dbo].[JOB_STAGE_LINES]
where JOB_NUMBER in (select JOB_NUMBER from JOB_OUTPUTS where
BF_QTY_ACTUAL<>0
and ABS(DATEDIFF(HOUR,ACTUAL_CLOSE_DATE,GETDATE())) < 12 and STATUS_FLAG='C'
)) A
join fin_prod.dbo.PRODUCT_MASTER B
ON A.PART_CODE=B.PART_CODE
WHERE
A.INPUT_QTY_STD<>0 and
A.QUANTITY_OUTSTANDING <>0
)
SELECT * FROM A WHERE A.variance >10.000000 OR A.variance <-10
order by PROCESS_STAGE asc ,PART_CODE asc, variance desc ;
The Variance column comes out as 00.0000000000; I need it to display 00.000 or 00.000000.
Help is greatly appreciated.
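Use the ROUND() function; when the second argument is positive, it is the number of decimal places to keep. (The query in the question uses SQL Server syntax, where ROUND() accepts the same two arguments as in MySQL.)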
ROUND((A.QUANTITY_OUTSTANDING*100/NULLIF(A.INPUT_QTY_STD,0)), 3) as variance,
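In this example, a value of 0.0000000000 would be rounded to 3 decimal places, giving 0.000.
You can use TRUNCATE() to cut off the extra digits without rounding (this is a MySQL function; in SQL Server, ROUND(value, 3, 1) truncates instead):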
TRUNCATE((A.QUANTITY_OUTSTANDING*100/NULLIF(A.INPUT_QTY_STD,0)), 3) as variance,
or use ROUND() if you want rounding (as suggested by doublesharp):
ROUND((A.QUANTITY_OUTSTANDING*100/NULLIF(A.INPUT_QTY_STD,0)), 3) as variance,
I prefer using CONVERT to cast the value to a DECIMAL of the desired scale when I am only formatting the value rather than rounding it:
CONVERT(DECIMAL(10,3),10000)
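The difference between rounding and truncating only shows up when the dropped digits are 5 or higher. Here is a small Python sketch using the standard `decimal` module to mimic both behaviours on the variance expression; the helper name `variance_pct` and the input quantities are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def variance_pct(quantity_outstanding, input_qty_std):
    """Mirror QUANTITY_OUTSTANDING*100/NULLIF(INPUT_QTY_STD,0)."""
    if input_qty_std == 0:
        return None  # NULLIF turns the divisor into NULL, so the result is NULL
    return Decimal(quantity_outstanding) * 100 / Decimal(input_qty_std)


value = variance_pct(2, 3)  # 66.666... with many decimal places

# Keep three decimal places, rounding to the nearest value (like ROUND(x, 3)).
rounded = value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

# Keep three decimal places, discarding the rest (like TRUNCATE(x, 3)).
truncated = value.quantize(Decimal("0.001"), rounding=ROUND_DOWN)

print(rounded)    # 66.667
print(truncated)  # 66.666
```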
Is there a function to find the average time difference in the standard time format in MySQL?
You can use timestampdiff to find the difference between two times.
I'm not sure what you mean by "average," though. Average across the table? Average across a row?
If it's the table or a subset of rows:
select
avg(timestampdiff(SECOND, startTimestamp, endTimestamp)) as avgdiff
from
table
The avg function works like any other aggregate function, and will respond to group by. For example:
select
col1,
avg(timestampdiff(SECOND, startTimestamp, endTimestamp)) as avgdiff
from
table
group by col1
That will give you the average differences for each distinct value of col1.
Hopefully this gets you pointed in the right direction!
What I like to do is:
SELECT count(*), AVG(TIME_TO_SEC(TIMEDIFF(end,start)))
FROM
table
This gives the number of rows as well.
In order to get actual averages in the standard time format from mysql I had to convert to seconds, average, and then convert back:
SEC_TO_TIME(AVG(TIME_TO_SEC(TIMEDIFF(timeA, timeB))))
If you don't convert to seconds, you get an odd decimal representation of the minutes that doesn't really make any sense (to me).
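The convert, average, convert back sequence can be sketched in Python as follows. The helpers `time_to_sec` and `sec_to_time` are simplified stand-ins written for this example (non-negative durations only, fractional seconds dropped), not the MySQL functions themselves, and the sample durations are invented.

```python
def time_to_sec(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def sec_to_time(total_seconds: float) -> str:
    """Convert non-negative seconds back to 'HH:MM:SS', dropping fractions."""
    whole = int(total_seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical time differences, as TIMEDIFF(timeA, timeB) might return them.
differences = ["00:10:00", "00:20:30", "01:00:00"]

seconds = [time_to_sec(d) for d in differences]  # [600, 1230, 3600]
average_seconds = sum(seconds) / len(seconds)    # 1810.0

print(sec_to_time(average_seconds))  # 00:30:10
```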
I was curious whether AVG() is accurate, given that some reported row counts are only approximations (the "this value is an approximation" notice). After all, let's review the formula: average = sum / count. So an accurate count really matters for this calculation!
After testing multiple combinations, it definitely seems like AVG() works and is a great approach. You can calculate yourself to see if it's working with...
SELECT
COUNT(id) AS count,
AVG(TIMESTAMPDIFF(SECOND, OrigDateTime, LastDateTime)) AS avg_average,
SUM(TIMESTAMPDIFF(SECOND, OrigDateTime, LastDateTime)) / (select COUNT(id) FROM yourTable) as calculated_average,
AVG(TIME_TO_SEC(TIMEDIFF(LastDateTime,OrigDateTime))) as timediff_average,
SEC_TO_TIME(AVG(TIME_TO_SEC(TIMEDIFF(LastDateTime, OrigDateTime)))) as date_display
FROM yourTable
Sample Results:
count: 441000
avg_average: 5045436.4376
calculated_average: 5045436.4376
timediff_average: 5045436.4376
date_display: 1401:30:36
Seems to be pretty accurate!
The columns returned are:
count: The count.
avg_average: The average based on AVG(). (Thanks to Eric for their answer on this!)
calculated_average: The average based on SUM()/COUNT().
timediff_average: The average based on TIMEDIFF(). (Thanks to Andrew for their answer on this!)
date_display: A nicely-formatted display version. (Thanks to C S for their answer on this!)