Find similar users - mysql

I have a table with thousands of rows.
Sample data:
user_id  ZIP      City    email
105      100051   Lond.   jsmith#hotmail.com
382      251574           jgjefferson#gmail.com
225      0100051  London  john.smith#hotmail.com
I need to compare every user with the others, to be able to know which ones are similar.
In the example given, users 105 and 225 are almost the same, so the expected result would be a new id column that groups the two of them, like this:
user_id  ZIP      City    email                   new_id
105      100051   Lond.   jsmith#hotmail.com      105
382      251574           jgjefferson#gmail.com   382
225      0100051  London  john.smith#hotmail.com  105
How would I compare every field with the others, and know how to compare them, like clustering, for example?

Your emails:
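library(stringdist)
email <- c("jsmith#hotmail.com", "jgjefferson#gmail.com", "john.smith#hotmail.com")
Distance between emails:
dist <- stringdistmatrix(email, email, method = "jw")
# Set zero distances (each email compared with itself) to 1 so which.min ignores them
dist[dist == 0] <- 1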
Minimum distance between emails:
cbind(email,email_near=email[apply(dist, 1, which.min)],dist=apply(dist, 1, FUN=min))
email email_near dist
[1,] "jsmith#hotmail.com" "john.smith#hotmail.com" "0.208754208754209"
[2,] "jgjefferson#gmail.com" "jsmith#hotmail.com" "0.281746031746032"
[3,] "john.smith#hotmail.com" "jsmith#hotmail.com" "0.208754208754209"
After that, I suggest applying a threshold to dist to identify the closest emails, and then computing the new_id from those matches.
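The threshold step could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it swaps the Jaro-Winkler distance for the similarity ratio from the standard library's difflib.SequenceMatcher, the 0.8 threshold is an arbitrary choice that would need tuning on real data, and assign_new_ids is a hypothetical helper name. Users whose emails are similar enough are merged into one group, and the smallest user_id in each group becomes the new_id.

```python
from difflib import SequenceMatcher


def assign_new_ids(emails_by_user, threshold=0.8):
    """Group users with similar emails; new_id is the smallest user_id in a group.

    threshold is an assumed similarity cut-off (0..1), not a recommended value.
    """
    ids = list(emails_by_user)
    parent = {uid: uid for uid in ids}  # union-find forest

    def find(uid):
        # Follow parent links to the group's representative (with path halving).
        while parent[uid] != uid:
            parent[uid] = parent[parent[uid]]
            uid = parent[uid]
        return uid

    # Compare every pair of users once.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ratio = SequenceMatcher(None, emails_by_user[a], emails_by_user[b]).ratio()
            if ratio >= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    # Always keep the smaller id as the representative.
                    low, high = sorted((root_a, root_b))
                    parent[high] = low

    return {uid: find(uid) for uid in ids}


users = {
    105: "jsmith#hotmail.com",
    382: "jgjefferson#gmail.com",
    225: "john.smith#hotmail.com",
}
new_ids = assign_new_ids(users)
```

Comparing every pair is quadratic in the number of users, which is fine for thousands of rows but would need blocking (for example, only comparing users with the same email domain) on much larger tables.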

Related

Inner join with multiple conditions on one column

I am trying to combine two tables to display.
I have one table (geofence) which holds each region id, name and associated tags. The second table is all routes and prices the user has entered from the available entries in their geofence table.
Table geofence
id  name        tags
52  texas       houston, dallas, austin
53  washington  spokane, seattle
54  oregon      portland, seaside
Table geofence_rates
id  origin_id  destination_id  price
1   52         53              1200
2   53         54              700
3   54         52              900
Desired HTML Output from combining tables
origin id  origin name  origin tags              destination id  destination name  destination tags         price
52         texas        houston, dallas, austin  53              washington        spokane, seattle         1200
53         washington   spokane, seattle         54              oregon            portland, seaside        700
54         oregon       portland, seaside        52              texas             houston, dallas, austin  900
I would like to show all routes, the price, and the associated name and tags for each of the geofence IDs.
My current sql statement gets me the routes and price but will only show the origin name and tags based off the origin id. I am not sure how to also extract the destination name and tags.
SELECT geofence_rates.origin_id, geofence_rates.destination_id,
geofence_rates.price, geofence.id, geofence.name, geofence.tags
FROM geofence_rates
INNER JOIN geofence ON geofence_rates.origin_id = geofence.id
How can I run a single statement and get both the origin and destination name and tags? I understand the bolded portion of my statement is what is causing this, but I am unsure how to create two conditions.
Your current SQL statement gets you the routes and price only for the origin because you're matching the two tables on a single condition (the matching origin: geofence_rates.origin_id = geofence.id), yet at the same time you need the information for the destination too.
To solve this, you can join the geofence table twice, under two different aliases: the first join gets the information on the origin, and the second gets the information on the destination.
SELECT orig.id AS origin_id,
orig.name AS origin_name,
orig.tags AS origin_tags,
dest.id AS destination_id,
dest.name AS destination_name,
dest.tags AS destination_tags,
rates.price
FROM geofence_rates rates
INNER JOIN geofence orig
ON rates.origin_id = orig.id
INNER JOIN geofence dest
ON rates.destination_id = dest.id
ORDER BY origin_id
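To try the double join without a MySQL server, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL here (an assumption for convenience), and route_report is a hypothetical helper name; the join itself is the same idea as in the statement above.

```python
import sqlite3


def route_report():
    """Return (origin name, destination name, price) for every route."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE geofence (id INTEGER PRIMARY KEY, name TEXT, tags TEXT);
        CREATE TABLE geofence_rates (
            id INTEGER PRIMARY KEY,
            origin_id INTEGER,
            destination_id INTEGER,
            price INTEGER
        );
        INSERT INTO geofence VALUES
            (52, 'texas', 'houston, dallas, austin'),
            (53, 'washington', 'spokane, seattle'),
            (54, 'oregon', 'portland, seaside');
        INSERT INTO geofence_rates VALUES
            (1, 52, 53, 1200),
            (2, 53, 54, 700),
            (3, 54, 52, 900);
    """)
    # The same table is joined twice under two aliases:
    # orig resolves origin_id, dest resolves destination_id.
    rows = con.execute("""
        SELECT orig.name, dest.name, rates.price
        FROM geofence_rates AS rates
        INNER JOIN geofence AS orig ON rates.origin_id = orig.id
        INNER JOIN geofence AS dest ON rates.destination_id = dest.id
        ORDER BY orig.id
    """).fetchall()
    con.close()
    return rows


routes = route_report()
```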

Should I normalize or not? If yes how?

Currently I have a table with a column containing CSVs. I am not sure whether to normalize the whole table or not. The problem is this column, configuration, may contain up to 50 or more different types of values. For example in the table shown below it's 18, 20, but for other data in the same column it may be 0, 20, 21, 22, 23, 25, 26, 27, 40, 52, 54, 55 and so on, however these values are unique. They will never repeat.
I do not know the maximum number of values (it may vary), which is why I kept it as CSV. I am currently having trouble normalizing it, or rather I am not sure whether I should even normalize it. Any help here?
id     tester_type  device_id  board_id  configuration
75946  UFLEX        997        220
44570  UFLEX        450        220       18,20
44569  UFLEX        449        220       18,20
44568  UFLEX        448        220       18,20
44567  UFLEX        447        220       18
Note: Configuration column does also contain empty values or empty spaces.
I do have to query against it so I guess I have to normalize it.
Yes, you do :)
If I do create the table, does that mean I have to create one for every possible configuration value?
An example of a normalised structure would be:
join table
==========
test_id  configuration_id   (spanning unique constraint)
-------  ----------------
44570    18
44570    20
44569    18
44569    20
44568    18
44568    20
44567    18

configurations table
====================
configuration_id
----------------
18
20
If you're using InnoDB, each column of the join table is also a foreign key to its respective parent table.
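As a rough sketch of how the existing CSV column could be migrated into such a join table, the following uses Python with an in-memory SQLite database. The table name test_configuration and the helpers split_configuration and build_join_table are hypothetical, and SQLite stands in for MySQL. Empty or whitespace-only configuration values simply produce no rows.

```python
import sqlite3


def split_configuration(csv_value):
    """Split '18,20' into [18, 20]; None, '' and blank strings give []."""
    if csv_value is None:
        return []
    return [int(part) for part in csv_value.split(",") if part.strip()]


def build_join_table(rows):
    """rows: iterable of (test_id, configuration CSV string)."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE test_configuration (
            test_id INTEGER NOT NULL,
            configuration_id INTEGER NOT NULL,
            UNIQUE (test_id, configuration_id)  -- the spanning unique constraint
        )
    """)
    for test_id, csv_value in rows:
        con.executemany(
            # OR IGNORE skips a pair that is already present.
            "INSERT OR IGNORE INTO test_configuration VALUES (?, ?)",
            [(test_id, cfg) for cfg in split_configuration(csv_value)],
        )
    return con


con = build_join_table([
    (75946, ""),
    (44570, "18,20"),
    (44569, "18,20"),
    (44567, "18"),
])
# Querying by configuration value is now a plain equality test.
with_20 = con.execute(
    "SELECT test_id FROM test_configuration "
    "WHERE configuration_id = 20 ORDER BY test_id"
).fetchall()
```

With the CSV column, the same question would need string matching such as FIND_IN_SET or LIKE, which cannot use an ordinary index.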
I disagree with both the "must" and "must not" normalize stances. My 2 cents:
Do not normalize "continuous" values such as prices, numbers, dates, floats, etc.
Do not normalize values that are unique or nearly so.
Do not normalize fields that are narrow. For example, don't replace a 2-letter country code with a 4-byte country_id.
"Normalize for simplicity": Do normalize things that are used in multiple tables and are subject to change. Sometimes names, addresses, company names, etc fall into this category. This is so you can change the value in exactly one place, not lots of places.
"Normalize for space": Do normalize things that would save a significant amount of overall space for the dataset. (This applies to gigabyte tables much more so than to kilobyte tables.)
Normalize, but don't "over-normalize". You will figure out what I mean when you have over-normalized and a nasty JOIN can't be optimized.
If you would like further specific advice, let's see SHOW CREATE TABLE and sample values for any un-obvious columns.

Access Calculated Field

I am having difficulty trying to make a calculated field that I need. So here is what I am trying to do:
I have a query that combines information from three tables. The most important fields for the application are as follows:
Family Income  Age  Patient
15,000         18   Yes
28,000         25   No
30,000         1    Yes
From here I want to make a calculated field that gives the correct program the patient was enrolled in, based on these fields, i.e.:
Program      Minimum Income  Maximum Income  Minimum Age  Maximum Age  Patient
Children's   0               20,000          1            19           Yes
Adult        0               12,000          19           65           No
Non Patient  0               20,000          1            19           No
Adult 2      12,000          50,000          19           65           No
Etc.
to create:
Family Income  Age  Patient  Program
15,000         18   Yes      Children's
28,000         25   No       Adult 2
30,000         1    Yes      Children's 2
I know I can use IIf to hard-code it into the field, but then it will be really difficult for other people to update the information as the guidelines change. Is it possible to have the information stored in a table and use that table in the form etc., or will I need to use IIf?
Any ideas? Is it possible to dynamically create the IIf in SQL using VBA while pulling the information from the table?
EDIT:::
Thank you for your response and for formatting my tables, I still have no idea how you changed it, but it looks amazing!
I tried to add the SQL you added down below, but I was not able to make it work. I'm not sure if I made a mistake, so I included the SQL of my query. The query currently returns 0 values, so I think I messed something up. (The real query is embarrassing... I'm sorry for that.) Unfortunately, I have done everything in my power to avoid SQL, and now I am paying the price.
SELECT qry_CombinedIndividual.qry_PrimaryApplicant.[Application Date],
qry_CombinedIndividual.qry_PrimaryApplicant.[Eligibility Rep],
qry_CombinedIndividual.qry_PrimaryApplicant.Name,
qry_CombinedIndividual.qry_PrimaryApplicant.Clinic,
qry_CombinedIndividual.qry_PrimaryApplicant.Outreach,
qry_CombinedIndividual.qry_PrimaryApplicant.[Content Type ID],
qry_CombinedIndividual.qry_PrimaryApplicant.[Application Status],
qry_CombinedIndividual.qry_PrimaryApplicant.Renewal,
qry_CombinedIndividual.qry_Enrolled.EthnicityEnr,
qry_CombinedIndividual.qry_Enrolled.GenderEnr, qry_CombinedIndividual.AgeAtApp,
qry_CombinedIndividual.[Percent FPL], tbl_ChildrensMedical.MinPercentFPL,
tbl_ChildrensMedical.MaxPercentFPL, tbl_ChildrensMedical.MinAge,
tbl_ChildrensMedical.MaxAge, tbl_ChildrensMedical.Program
FROM qry_CombinedIndividual
INNER JOIN tbl_ChildrensMedical ON qry_CombinedIndividual.qry_Enrolled.Patient = tbl_ChildrensMedical.Patient
WHERE (((qry_CombinedIndividual.AgeAtApp)>=[tbl_ChildrensMedical].[MinAge]
And (qry_CombinedIndividual.AgeAtApp)<[tbl_ChildrensMedical].[MinAge])
AND ((qry_CombinedIndividual.[Percent FPL])>=[tbl_ChildrensMedical].[MinPercentFPL]
And (qry_CombinedIndividual.[Percent FPL])<[tbl_ChildrensMedical].[MaxPercentFPL]));
Also, there are many different programs. Here is the real Children's table (eventually I would like to add adults if possible).
*Note: the actual table uses FPL (which takes family size into account, but is used the same way as income). I am again at a total loss as to how you formatted the table.
Program             Patient  MinPercentFPL  MaxPercentFPL  MinAge  MaxAge
SCHIP (No Premium)  No       0              210            1       19
SCHIP (Tier 1)      No       210            260            1       19
SCHIP (Tier 2)      No       260            312            1       19
Newborn             No       0              300            0       1
Newborn (Patient)   Yes      0              300            0       1
Children's Medical  Yes      0              200            1       19
CHIP (20 Premium)   Yes      200            250            1       19
CHIP (30 Premium)   Yes      250            300            1       19
Do I have the correct implementation for the table I have? Or should I be changing something. I can also send more information/sample data if that would help.
Thank you again!
I just created some tables with your sample data and used the following SQL. Your 3rd 'patient' doesn't match any of the ranges (Age 1, Income $30K)
SELECT tblPatient.PatName, tblPatient.FamInc, tblPatient.Age, tblPatient.Patient,
tblPatientRange.Program, tblPatientRange.MinInc, tblPatientRange.MaxInc, tblPatientRange.MinAge,
tblPatientRange.MaxAge, tblPatientRange.Patient
FROM tblPatient INNER JOIN tblPatientRange ON tblPatient.Patient = tblPatientRange.Patient
WHERE (((tblPatient.FamInc)>=[tblPatientRange]![MinInc] And (tblPatient.FamInc)<=[tblPatientRange]![MaxInc])
AND ((tblPatient.Age)>=[tblPatientRange]![MinAge] And (tblPatient.Age)<=[tblPatientRange]![MaxAge]));
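One thing worth checking in the query from the question: its WHERE clause appears to compare AgeAtApp against MinAge for both the lower and the upper bound (>= MinAge And < MinAge), and no value can satisfy both, which would explain the 0 rows. The upper bound presumably should be MaxAge. The sketch below reproduces the table-driven lookup on the Children's table from the question, using Python's sqlite3 rather than Access (an assumption made so the example can run anywhere). The names find_programs, applicant and tbl_program are hypothetical, and the half-open ranges (>= min, < max) follow the question's own comparison operators.

```python
import sqlite3

# (Program, Patient, MinPercentFPL, MaxPercentFPL, MinAge, MaxAge), from the question
PROGRAMS = [
    ("SCHIP (No Premium)", "No", 0, 210, 1, 19),
    ("SCHIP (Tier 1)", "No", 210, 260, 1, 19),
    ("SCHIP (Tier 2)", "No", 260, 312, 1, 19),
    ("Newborn", "No", 0, 300, 0, 1),
    ("Newborn (Patient)", "Yes", 0, 300, 0, 1),
    ("Children's Medical", "Yes", 0, 200, 1, 19),
    ("CHIP (20 Premium)", "Yes", 200, 250, 1, 19),
    ("CHIP (30 Premium)", "Yes", 250, 300, 1, 19),
]


def find_programs(applicants):
    """applicants: (name, patient, percent_fpl, age) tuples.

    Returns (name, program) pairs; program is None when no range matches.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tbl_program (program TEXT, patient TEXT, "
        "min_fpl REAL, max_fpl REAL, min_age INTEGER, max_age INTEGER)"
    )
    con.execute(
        "CREATE TABLE applicant (name TEXT, patient TEXT, fpl REAL, age INTEGER)"
    )
    con.executemany("INSERT INTO tbl_program VALUES (?, ?, ?, ?, ?, ?)", PROGRAMS)
    con.executemany("INSERT INTO applicant VALUES (?, ?, ?, ?)", applicants)
    rows = con.execute("""
        SELECT a.name, p.program
        FROM applicant AS a
        LEFT JOIN tbl_program AS p
          ON a.patient = p.patient
         AND a.fpl >= p.min_fpl AND a.fpl < p.max_fpl   -- lower AND upper bound
         AND a.age >= p.min_age AND a.age < p.max_age   -- note: max_age, not min_age
        ORDER BY a.rowid
    """).fetchall()
    con.close()
    return rows


matches = find_programs([
    ("A", "Yes", 150, 10),
    ("B", "No", 220, 5),
    ("C", "Yes", 100, 0),
    ("D", "No", 100, 25),
])
```

Because the rules live in a table, updating a guideline means editing a row rather than rewriting a nested IIf expression.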

How to setup MySQL table to follow a variable over time?

Say I have several registered users in my website.
Users are saved on a single table 'users' that assigns a unique id for each one of them.
I want to allow my users to track their expenses, miles driven, temperature, etc.
I can't be sure each user will always enter a value for all trackable variables when they login -- so an example of what could happen would be:
'example data'
user date amount miles temp etc
1 3/1/2010 $10.00 5 54
2 3/1/2010 $20.00 15
1 3/12/2010 5 55
1 3/15/2010 $10.00 25 51
3 3/20/2010 45
3 4/12/2010 $20.00 10 54
What is the best way to set up my tables for this situation?
Should I create a table exclusive to each user when they register? (could end up with thousands of user-exclusive tables)
'user-1 table'
date amount miles temp etc
3/1/2010 $10.00 5 54
3/12/2010 5 55
3/15/2010 $10.00 25 51
'user-3 table'
date amount miles temp etc
3/20/2010 45
4/12/2010 $20.00 10 54
and so on...
Should I create a single table that is essentially the same as the example data above? (could end up with a gigantic table that needs to be combed to find rows with requested user id's).
'user data table'
user date amount miles temp etc
1 3/1/2010 $10.00 5 54
2 3/1/2010 $20.00 15
1 3/12/2010 5 55
1 3/15/2010 $10.00 25 51
3 3/20/2010 45
3 4/12/2010 $20.00 10 54
Any suggestions?
Databases are built to handle similar data as a set together.
What you want is a single user-data-table, with multiple users in the same table split by user_id. You might want to further normalize that though, so that it stores:
user date type units
1 3/1/2010 dollars 10.00
1 3/1/2010 miles 5
1 3/1/2010 temp 54
2 3/1/2010 dollars 20.00
2 3/1/2010 miles 15
1 3/12/2010 miles 5
1 3/12/2010 temp 55
etc
Or even further if the user+date makes a specific trip
trip-table
tripid user date
========= ======== =========
1 1 3/1/2010
type-table
typeid description
========= ============
1 dollars
2 miles
etc
trip-data
tripid type units
========= ======== =======
1 1 10.00
1 2 5
etc
However, if you will always (or almost always) show your data in the form as entered, with the data pivoted on all the input columns (like a spreadsheet), then you would be better off sticking to the un-normalised form for brevity, programmability and performance.
could end up with a gigantic table that needs to be combed to find rows with requested user id's
Assuming you employ indexes properly and judiciously, modern RDBMSs are built to handle gigantic amounts of data. The indexes allow queries to seek only the data they need, so there is normally little penalty in keeping it all in one table.
No, just create one table with all the possible fields, each of them nullable. If a user hasn't filled in a parameter, just keep a NULL value there.
could end up with a gigantic table that needs to be combed to find rows with requested user id's
Yes, and the query will be fast enough if you create an index on the user_id field (for queries like WHERE user_id = 42).
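A minimal sketch of the single-table layout with nullable columns and an index on user_id, again using Python's sqlite3 as a stand-in for MySQL (an assumption; the table, column and index names are hypothetical). The rows are taken from the sample data above, and EXPLAIN QUERY PLAN is SQLite's way of showing that the lookup goes through the index instead of scanning the whole table.

```python
import sqlite3


def make_db():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE user_data (
            user_id    INTEGER NOT NULL,
            entry_date TEXT NOT NULL,   -- ISO dates sort correctly as text
            amount     REAL,            -- NULL when the user entered nothing
            miles      REAL,
            temp       REAL
        );
        CREATE INDEX idx_user_data_user_id ON user_data (user_id);
    """)
    con.executemany(
        "INSERT INTO user_data VALUES (?, ?, ?, ?, ?)",
        [
            (1, "2010-03-01", 10.00, 5, 54),
            (2, "2010-03-01", 20.00, 15, None),
            (1, "2010-03-12", None, 5, 55),
        ],
    )
    return con


con = make_db()
user_rows = con.execute(
    "SELECT entry_date, amount, miles, temp FROM user_data "
    "WHERE user_id = ? ORDER BY entry_date",
    (1,),
).fetchall()
# The last column of each plan row is a human-readable description.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM user_data WHERE user_id = ?", (1,)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

In MySQL the equivalent check would be running EXPLAIN on the SELECT and looking at which key it reports.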

MS Access 03 - Query Expression to Add like ID Numbers

So I have a query that is a Top Nth aggregate query, and I have another query built from that one that returns all the offices/locations grouped for each of the top sales. I want to make a report that counts the number of offices associated with each of these top Nth ID values that are returned in this query. I want to use a domain aggregate expression in text boxes on the report so that I do not have to spend time each month looking up those IDs to determine what needs to go into the expression.
So is there an expression I can add to the second query that will assign a descending number to the IDs?
The first query looks like:
ID#  ITEM           Sold
765  Lawnmowers     75
764  Weed trimmers  64
etc
The second query looks like:
ID#  ITEM           Sold  Location
765  Lawnmowers     75    New York
765  Lawnmowers     75    Maryland
765  Lawnmowers     75    Ohio
765  Lawnmowers     75    Virginia
764  Weed trimmers  64    Florida
764  Weed trimmers  64    North Carolina
I need:
ID#  ITEM           Sold  Location        IDGroup#
765  Lawnmowers     75    New York        1
765  Lawnmowers     75    Maryland        1
765  Lawnmowers     75    Ohio            1
765  Lawnmowers     75    Virginia        1
764  Weed trimmers  64    Florida         2
764  Weed trimmers  64    North Carolina  2
please help. thanks! i am pulling my hair on this one!
what i am trying to do is be able to report the number of items sold (done), how many locations sold the item (need dynamically) for a given time period.
i was trying to use an DCount expression but if I use the Product ID, it does me no good because these figures change monthly (this is all based from a Top Ten query to begin with)
i know i confuse myself with this question :)
To answer your question: yes, it's possible. You can use the Switch function, which takes pairs of a condition and the value to return. Ex:
Switch([ITEM]="Lawnmowers",1,[ITEM]="Weed trimmers",2) etc. Of course, this requires you to put in an entry for every unique value in the table, and it is pretty much a horrible idea for anything but a throw-away query with a small number of groups.
A more orthodox approach would be to create a lookup table with two fields: "Item" and "Group". Make "Item" the primary key and make "Group" an AutoNumber field. Next, create a Select Distinct query to get the unique items in the original table. Turn this query into an append query and pipe the values into your lookup table (sorting however you like). The AutoNumber field will automatically assign sequential IDs to these records.
Now you can just take your original query and add a join to your lookup table to pull in a group id.
Addendum: If more than one item can be in a group, keep "Item" as the primary key, make the "Group" field a non-unique Long Integer instead of an AutoNumber, and key the entries in manually (it shouldn't take too long, since there is only one entry per unique item). Either way, with this approach you should be able to maintain your groups with relative ease going forward.
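The numbering that the lookup table produces can be illustrated outside Access with a few lines of Python. This is only a sketch of the logic, not Access code: assign_group_numbers is a hypothetical helper, and it assumes the groups should be numbered by descending Sold value, with ties broken by the smaller ID.

```python
def assign_group_numbers(rows):
    """rows: (id, item, sold, location) tuples from the second query.

    Returns the same rows with a group number appended; the ID with the
    highest Sold value gets group 1, the next one group 2, and so on.
    """
    # One Sold value per ID (it repeats on every location row).
    sold_by_id = {}
    for item_id, _item, sold, _location in rows:
        sold_by_id[item_id] = sold

    # Order the distinct IDs by Sold descending, then number them from 1.
    ranked_ids = sorted(sold_by_id, key=lambda i: (-sold_by_id[i], i))
    group_of = {item_id: n for n, item_id in enumerate(ranked_ids, start=1)}

    return [row + (group_of[row[0]],) for row in rows]


second_query = [
    (765, "Lawnmowers", 75, "New York"),
    (765, "Lawnmowers", 75, "Maryland"),
    (764, "Weed trimmers", 64, "Florida"),
    (764, "Weed trimmers", 64, "North Carolina"),
]
numbered = assign_group_numbers(second_query)
```

Counting locations per group is then a matter of counting rows that share a group number, which is what the DCount expression on the report would do against the joined query.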