I have a large dataset containing city names and matching postcodes. City names and postcodes might appear several times. I need an overview of which cities and postcodes I have in the dataset. Therefore, I want to extract a list showing only the unique combinations of cities and postcodes.
Example:
City Postcode
x 123
y 456
x 123
z 342
p 256
z 342
x 321
I want to get:
City Postcode
x 123
y 456
z 342
p 256
x 321
I managed to do that in R, but I do not know how to do it in Stata.
As often happens, the request for display of unique combinations is better phrased in terms of distinct combinations. The interest is not in those combinations that occur just once. For an overview of distinct observations in Stata, see this paper.
Here are two ways to approach the problem. First, the egen function tag() tags just one of each group of observations identical on the variables specified.
clear
input str1 City Postcode
x 123
y 456
x 123
z 342
p 256
z 342
x 321
end
egen tag = tag(City Postcode)
list City Postcode if tag , noobs
  +-----------------+
  | City   Postcode |
  |-----------------|
  |    x        123 |
  |    y        456 |
  |    z        342 |
  |    p        256 |
  |    x        321 |
  +-----------------+
Second, groups is a convenience command that by default gives frequencies and percents for distinct cross-combinations. You must install this before you can use it. You can show more (and indeed fewer) results using options.
. ssc install groups
. groups City Postcode
  +-----------------------------------+
  | City   Postcode   Freq.   Percent |
  |-----------------------------------|
  |    p        256       1     14.29 |
  |    x        123       2     28.57 |
  |    x        321       1     14.29 |
  |    y        456       1     14.29 |
  |    z        342       2     28.57 |
  +-----------------------------------+
For some general comments on groups, see this post.
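For readers coming from outside Stata, the two ideas above (tag the first occurrence of each combination, then tabulate frequencies) can be sketched in Python. This is only an illustration using the sample data from the question; the names `rows`, `distinct` and `table` are made up here and are not part of either Stata command.

```python
from collections import Counter

# Sample rows from the question: (City, Postcode)
rows = [("x", 123), ("y", 456), ("x", 123), ("z", 342),
        ("p", 256), ("z", 342), ("x", 321)]

# Like egen's tag(): keep the first occurrence of each combination, in data order.
distinct = list(dict.fromkeys(rows))

# Like groups: frequency and percent for each distinct combination, sorted.
counts = Counter(rows)
table = [(city, code, n, round(100 * n / len(rows), 2))
         for (city, code), n in sorted(counts.items())]

for line in table:
    print(*line)
```

`dict.fromkeys` keeps insertion order, so `distinct` follows the order of the data, while `table` is sorted by city and postcode as in the groups output.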
Look up the duplicates command.
My MySQL knowledge is a bit shaky. I have a table with (among others) the following columns/values:
ID | importID | distID | email | street | city
-----------------------------------------------------------
25 | 5 | 2 | abc#d.com | Main Road | London
-----------------------------------------------------------
26 | 5 | 2 | mno#e.com | Oak Alley | York
-----------------------------------------------------------
27 | 5 | 2 | pqr#s.com | Tar Pits | London
-----------------------------------------------------------
28 | 5 | 2 | xyz#a.com | Fleet Street | London
-----------------------------------------------------------
...
-----------------------------------------------------------
99 | -1 | 2 | abc#d.com | New Street | Exeter
I do some checks when new rows are inserted: validate email addresses, find duplicates with a different dist(ributor)ID, etc.
One of the tasks is to update existing rows with the data of the freshly imported row when the email column is identical (yes, there can be multiple rows with identical email addresses).
At the time this task is performed, the importID of the currently inserted rows is always -1. I tried aliasing with all kinds of variations of
UPDATE table orig table dup
SET orig.street = dup.street, orig.city = dup.city
WHERE orig.email = dup.email
or joining with numerous variations of
UPDATE table orig
JOIN
(SELECT email FROM table
WHERE importID != -1) dup
ON orig.email = dup.email
SET orig.street = dup.street, orig.city = dup.city
What is my mistake?
Don't do it that way.
Store the duplicate email information in a separate table. That table is to contain records that are awaiting analysis and confirmation; do not try to include them in the main table.
This extra table can have multiple rows with the same email, but the main table must not.
The person processing the pending changes would use a SELECT into both tables (either two Selects or a Union), make a decision, and poke a button. One button would say to toss the new info; one would say to replace; etc.
And, as Tim suggests, have a TIMESTAMP in each table. This will assist the user in making changes, especially if there are multiple pending changes.
Although it would be better to change the approach, you could code your requirement as follows:
UPDATE mytab INNER JOIN mytab AS newrec ON mytab.email=newrec.email AND newrec.importID=-1
SET mytab.street=newrec.street, mytab.city=newrec.city;
Example
Before
ID importID distID email street city
25 5 2 abc#d.com Main Road London
26 5 2 mno#e.com Oak Alley York
27 5 2 pqr#s.com Tar Pits London
28 5 2 xyz#a.com Fleet Street London
99 -1 2 abc#d.com New Street Exeter
100 -1 2 pqr#s.com foo bar
After
ID importID distID email street city
25 5 2 abc#d.com New Street Exeter
26 5 2 mno#e.com Oak Alley York
27 5 2 pqr#s.com foo bar
28 5 2 xyz#a.com Fleet Street London
99 -1 2 abc#d.com New Street Exeter
100 -1 2 pqr#s.com foo bar
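To make the effect of that self-join explicit, here is a small Python sketch that applies the same rule to the sample rows in memory: every row takes street and city from the imported row (importID = -1) with the same email. The function name `apply_import` and the dictionary layout are assumptions made for this illustration only. If several imported rows share an email, this sketch keeps the last one, whereas the SQL statement does not guarantee which one wins.

```python
def apply_import(rows):
    """Copy street and city from imported rows (importID == -1) to rows with the same email."""
    # Freshly imported values, keyed by email (the last imported row wins here).
    fresh = {r["email"]: r for r in rows if r["importID"] == -1}
    for r in rows:
        new = fresh.get(r["email"])
        if new is not None:
            r["street"], r["city"] = new["street"], new["city"]
    return rows


columns = ("ID", "importID", "distID", "email", "street", "city")
data = [
    (25, 5, 2, "abc#d.com", "Main Road", "London"),
    (26, 5, 2, "mno#e.com", "Oak Alley", "York"),
    (27, 5, 2, "pqr#s.com", "Tar Pits", "London"),
    (28, 5, 2, "xyz#a.com", "Fleet Street", "London"),
    (99, -1, 2, "abc#d.com", "New Street", "Exeter"),
    (100, -1, 2, "pqr#s.com", "foo", "bar"),
]
rows = apply_import([dict(zip(columns, values)) for values in data])

for r in rows:
    print(r["ID"], r["email"], r["street"], r["city"])
```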
I am facing a problem with a MySQL query.
I have a table called 'members' with a column 'area'. The 'area' column contains alphanumeric values like
1
2
Street # 2
5
78
Street # 1A
Street # 1
Street # 1C
Street # 1B
3
I want to sort them like this:
1
2
3
5
78
Street # 1
Street # 1A
Street # 1B
Street # 1C
Street # 2
I have tried many variations, but none of them meets my requirements. The latest attempt is close but still not what I need. Currently, I have this code:
SELECT DISTINCT(area) FROM members ORDER BY LENGTH(area), area ASC
One thing I want to make clear is that the area field has duplicate values in it.
I'll be thankful if someone helps me.
Thanks in advance
Extract the first part of the string and order it if it is a number:
select t.*
from t
order by (area regexp '^[0-9]') desc, -- numbers first
substring_index(area, ' ', 1) + 0, -- by number
area asc -- rest alphabetically
Note that this handles the awkward case where the initial number starts with 0.
And depending on how you want the strings ordered, you might still want to end with length(area), area as the last two order by keys.
Here is a db<>fiddle.
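The three ORDER BY keys above can be mirrored by a sort key in Python, which may make the logic easier to follow. This is a rough sketch, not a replacement for the query: `area_sort_key` is a name invented here, and Python compares strings by code point, so the ordering of the text part can differ from a case-insensitive MySQL collation.

```python
import re


def area_sort_key(area):
    """Mimic: (area regexp '^[0-9]') desc, substring_index(area, ' ', 1) + 0, area."""
    starts_with_digit = re.match(r"[0-9]", area) is not None
    first_token = area.split(" ", 1)[0]           # like substring_index(area, ' ', 1)
    digits = re.match(r"[0-9]+", first_token)
    number = int(digits.group()) if digits else 0  # like MySQL's "+ 0" coercion
    # False sorts before True, so values starting with a digit come first.
    return (not starts_with_digit, number, area)


areas = ["1", "2", "Street # 2", "5", "78", "Street # 1A", "Street # 1",
         "Street # 1C", "Street # 1B", "3", "5", "Street # 1"]

# set() plays the role of DISTINCT.
result = sorted(set(areas), key=area_sort_key)
print(result)
```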
This should work for your sample data:
SELECT DISTINCT area
FROM members
ORDER BY (area + 0 > 0 or area = '0') desc, area + 0, area
Demo on DB Fiddle:
| area |
| :---------- |
| 1 |
| 2 |
| 3 |
| 5 |
| 78 |
| Street # 1 |
| Street # 1A |
| Street # 1B |
| Street # 1C |
| Street # 2 |
One more query:
SELECT DISTINCT(area)
FROM members
ORDER BY
area+0=0 ASC, -- get numbers first
area+0 ASC, -- order numbers
area ASC; -- order strings
Live fiddle here SQLize.online
I am stumped on a question in my assignment.
On a single table (Condo_Unit), we have several columns - CondoID, UnitNum, SqrFt (Square Feet) etc.
I need to find a query that displays the UnitNum of every pair of condos that have the same square footage. For example, Condos 305 & 409 both have a square footage of 1500 sq ft. The output must show both condos in a pair.
At this stage, I can generate a list showing only one unit of the pair, duplicated across two result columns (i.e. unit 305 is shown twice, not 305 | 409), using:
SELECT UnitNum, UnitNum
FROM condo_unit
GROUP BY SqrFt
HAVING Count(SqrFt) >1;
Sample data includes:
Condo ID | UnitNum | SqrFt
1 | 102 | 675
2 | 201 | 1030
3 | 305 | 1500
4 | 409 | 1500
5 | 104 | 1030
6 | 207 | 870
From this data, we can see units 201 & 104 are a matching pair, as well as 305 & 409
Results should show:
1st Unit | 2nd Unit
201 | 104
305 | 409
The current results I am getting are:
1st Unit | 2nd Unit
201 | 201
305 | 305
Is anyone able to assist, or need further clarification?
Query:
SELECT
DISTINCT least(t.c,t.d) as "1st Unit",
greatest(t.c,t.d) as "2nd Unit"
FROM
(SELECT a.UnitNum c,b.UnitNum d
FROM world.condo a JOIN world.condo b
WHERE a.SqrFt=b.SqrFt AND a.Condo_ID!=b.Condo_ID) t;
This code will help you
select GROUP_CONCAT(UnitNum SEPARATOR '&'), SqrFt from Condo_Unit group by SqrFt ORDER BY SqrFt
This should do.
select GROUP_CONCAT(UnitNum SEPARATOR '|') as UnitName from condo_unit
group by SqrFt HAVING Count(SqrFt) >1;
DEMO FOR ANSWER
OUTPUT :
+----------+
| UnitName |
+----------+
| 201|104 |
+----------+
| 305|409 |
+----------+
You can use whatever separator you want. I have given pipe symbol |.
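As a cross-check on the self-join approach, the pairing logic can be sketched in Python with the sample data from the question. The name `same_size_pairs` is hypothetical. Each unordered pair is visited exactly once, which is what the self-join answer achieves with DISTINCT, least() and greatest().

```python
from itertools import combinations

# Sample data from the question: (CondoID, UnitNum, SqrFt)
units = [(1, 102, 675), (2, 201, 1030), (3, 305, 1500),
         (4, 409, 1500), (5, 104, 1030), (6, 207, 870)]


def same_size_pairs(units):
    """Return (UnitNum, UnitNum) for every pair of condos with equal SqrFt."""
    pairs = []
    # combinations() yields each unordered pair once, so no pair is mirrored.
    for (_, unit_a, sqft_a), (_, unit_b, sqft_b) in combinations(units, 2):
        if sqft_a == sqft_b:
            pairs.append((unit_a, unit_b))
    return pairs


pairs = same_size_pairs(units)
for first, second in pairs:
    print(first, "|", second)
```

Note that three units with the same square footage would produce three pairs here, whereas the GROUP_CONCAT answers would return them as one combined string.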
I have a table apartment as below
aid | aname
1 | dream home
2 | My hub
3 | Lake view
another table apartment_details
id | aid | bhk | size | facing
1 | 1 | 2 | 1200 | east
2 | 1 | 2 | 1200 | west
3 | 1 | 2 | 1000 | south
4 | 1 | 2 | 1000 | north
I have written the query as
SELECT distinct ap.aid, ap.aname, al.bhk, (select group_concat(distinct concat(al.bhk,'BHK - ',al.size)) from apartment_details as al where al.id = ap.aid) as details
When I tried to display details using foreach I get the output as
2BHK - 1200
2BHK - 1200
2BHK - 1000
2BHK - 1000
It seems the query is applying DISTINCT to bhk, size and facing together, so the output has one row per facing. Because facing is not displayed, the same data appears to be repeated. How can I display only the distinct values of bhk and size, ignoring facing, so that I get the following output?
2BHK - 1200
2BHK - 1000
Can anyone help me in solving this issue? Thanks in advance
To my way of thinking, in general, there is no problem in SQL for which GROUP_CONCAT is the solution. So, with that in mind, let's start with this:
SELECT DISTINCT bhk,size FROM apartment_details
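The point of selecting only bhk and size is that facing never enters the comparison. A minimal Python sketch of the same idea, using the sample rows from the question (the variable names are made up for illustration):

```python
# Sample rows from apartment_details: (id, aid, bhk, size, facing)
details = [
    (1, 1, 2, 1200, "east"),
    (2, 1, 2, 1200, "west"),
    (3, 1, 2, 1000, "south"),
    (4, 1, 2, 1000, "north"),
]

# Build the label from bhk and size only, then drop repeats while keeping order.
labels = list(dict.fromkeys(f"{bhk}BHK - {size}" for _, _, bhk, size, _ in details))

for label in labels:
    print(label)
```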
I'm having the following situation:
I have a table with a list of postcodes with the format:
1234 AA (Dutch postcode)
2345 ZF
B-2345 (Belgium postcode)
B-4355
I have another table which contains postcoderanges:
PostcodeFrom
1000 AF
2000 ZF
B-1234
PostcodeTo
1999 ZX
2999 ZF
B-1889
I am looking for a solution how to look up the postcode value between the several ranges.
First I was thinking of
SUBSTRING(MyPostcode,1,4) BETWEEN SUBSTRING(PostcodeFrom,1,4) AND SUBSTRING(PostcodeTo,1,4)
... but then there is still the problem with the characters (not to mention the Belgian postcodes as well).
Could anyone help me?
Yours,
Thanks for your reply!
The table you drew needs one more field: RegionCode.
RangeTable:
| RCode | PCodeFrom | PCodeTo |
| 001 | 1000 BA | 1999 ZZ |
| 002 | 1000 AA | 1999 AZ |
Notice that if a postcode is 1234 AC, it must return RegionCode 002. Comparing the numbers is not hard, but how do I compare the characters? I had the idea of making a table with AA - ZZ where each combination has a certain INT value, but I hope there is another, easier way.
You can only do this reliably (ignoring the potential un-reliability of doing this sort of range matching with postcodes) by splitting the portions of the postcode into different columns by character type.
I don't know much about Dutch postcodes, but if your formats are correct, you could create a table like:
+-------+------+
| code | city |
+-------+------+
| 1234 | AA |
+-------+------+
Splitting the postcodes up will allow you to do more fine-grained sorting.
Update:
Having looked at the Wikipedia page on Dutch postcodes it looks like this should work for all of them. My labels of code and city are inaccurate though.
Aside: I'm impressed that the Netherlands has such a sane postcode format, unlike the UK one where you need a huge regex to even decide if the format is valid.
Update 2:
Your checking will work with characters too, but you'll be better off storing the postcodes in a separate table, with an ID. The example above was just to show splitting up the characters from the numbers, so what you'll actually want is more like:
mysql> select * from postcodes;
+------+-------+-------+
| id | part1 | part2 |
+------+-------+-------+
| 1 | 1234 | AA |
| 2 | 5678 | BB |
+------+-------+-------+
When you're storing the ranges, don't store the postcodes in the ranges table, store the id for the entry in the postcodes table, like:
mysql> select * from ranges;
+-------------+---------------+-------------+
| region_code | postcode_from | postcode_to |
+-------------+---------------+-------------+
| 1 | 1 | 2 |
+-------------+---------------+-------------+
That record says "region 1 is 1234 AA to 5678 BB"
For an example, I'll say that postcodes start 0001 AA, then move to 0001 AB, all the way to 0001 ZZ, then 0002 AA and so on. This obviously isn't right but it demonstrates the theory. You need to substitute this for the algorithm you're using to define how postcodes are incremented and decremented.
When you want to find out "does postcode 3456 XY fit into region 89?", you split it into character and number, and check whether the values fit into a range. Using my algorithm, I check:
Is the number portion greater or less than the number portion of postcode_from?
If it's greater, then is it less than the number portion of postcode_to?
If you satisfy both conditions, check the letters. This is the important bit: MySQL's character set collation does allow you to ask "is AB less than BC?", so you can have:
WHERE 'AB' < part2;
in your WHERE clause.
Using this method, you can figure out which of your regions has a start and an end that fit the value you're testing.
It's a bit long-winded but it will work without doing any conversions. You may need to check that the collation you're using fits the way the lettering sequence works for the specific type of postcode you're using though.
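The split-and-compare check described above can be sketched in Python. This is one possible reading of the method, not a definitive implementation: it assumes the Dutch format of four digits plus two letters, it compares the number part and the letter part independently (which is what the asker's 1234 AC example requires), and it simply skips other formats such as the Belgian B-2345. The names `RANGES`, `split_postcode` and `find_regions` are invented for this example.

```python
import re

# Range table from the follow-up question: (RCode, PCodeFrom, PCodeTo)
RANGES = [
    ("001", "1000 BA", "1999 ZZ"),
    ("002", "1000 AA", "1999 AZ"),
]


def split_postcode(postcode):
    """Split '1234 AA' into (1234, 'AA'); return None for any other format."""
    match = re.fullmatch(r"([0-9]{4})\s*([A-Z]{2})", postcode.strip().upper())
    return (int(match.group(1)), match.group(2)) if match else None


def find_regions(postcode, ranges=RANGES):
    """Return the codes of all ranges whose number and letter parts both contain the postcode."""
    parts = split_postcode(postcode)
    if parts is None:
        return []  # assumption: unsupported formats match nothing
    number, letters = parts
    hits = []
    for code, start, end in ranges:
        from_number, from_letters = split_postcode(start)
        to_number, to_letters = split_postcode(end)
        # Plain string comparison stands in for the MySQL collation here.
        if from_number <= number <= to_number and from_letters <= letters <= to_letters:
            hits.append(code)
    return hits


print(find_regions("1234 AC"))
```

Storing the two parts in separate columns, as suggested above, would let the same pair of comparisons be written directly in a WHERE clause.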