Is there a command to rename the second of two duplicate records in Stata?

I have a list of data with three variables, ADDRESS, FOODCODE and Gr, consisting of 1,752 household addresses, the types of food they consumed, and the amount of each food in grams.
These data contain duplicate records: for example, I have two records with the same address and the same foodcode, but a different gram value.
Now I want to rename the second of each pair of duplicate addresses to a new address in Stata.
input double Address int(foodcode gr)
12401295014 11111 4000
12401295014 11111 10000
12501308608 11112 20000
12501313708 11112 10000
11701202105 11115 5000
11701202105 11115 10000
end

This isn't a problem for rename, which is about renaming variables and has nothing to do with changing the data otherwise. Although you don't ask for it, it might be useful for you to set up a new variable:
bysort Address: gen id = _n
Changing the values of address so that the same household gets different values is likely only to make analysis more difficult.
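For readers outside Stata, here is a minimal Python/pandas sketch of the same per-group counter (an illustration only, assuming the three columns were loaded into a DataFrame; groupby(...).cumcount() is 0-based, hence the + 1 to mirror Stata's _n):

import pandas as pd

# the same three variables as in the question's example data
df = pd.DataFrame({"Address": [12401295014, 12401295014, 12501308608],
                   "foodcode": [11111, 11111, 11112],
                   "gr": [4000, 10000, 20000]})
df = df.sort_values("Address")
df["id"] = df.groupby("Address").cumcount() + 1  # 1, 2, ... within each address
print(df)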

Related

How to find missing numbers within a column of strings

I'm trying to find unaccounted-for numbers within a substantially large SQL dataset and am facing some difficulty sorting it out.
By default, the data for the column reads
'Brochure1: Brochure2: Brochure3:...Brochure(k-1): Brochure(k):'
where k stands in for the number of brochures a unique id is eligible for.
As the brochures are accounted for, a sample of updated data would read
'Brochure1: 00001 Brochure2: 00002 Brochure3: 00003....'
How does one query out the missing numbers if, in a range of say 00001-88888, some haven't been accounted for next to Brochure(X)?
The right way:
You should change the structure of your database. If you care about performance, you should follow good relational-database practice and, as the first comment under your question said, normalize. Instead of placing information about brochures in one column of the table, it is much faster and clearer to create another table that describes the relation between brochures and <your-first-table-name>:
<your-first-table-name>_id | brochure_id
----------------------------+---------------
1 | 00002
1 | 00038
1 | 00281
2 | 28192
2 | 00293
... | ...
Not to mention that, if possible, you should treat brochure_id as an integer, storing 12 instead of 0012.
The difference is that now you can write efficient and simple queries to find out how many brochures one ID from your first table has, or which ID a given brochure belongs to. If for some reason you need to keep the ordinal number of every single brochure, you can add a column to the above table, such as brochure_number.
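As a minimal, self-contained sketch of those queries (using SQLite from Python purely for illustration; the table and column names here are hypothetical, and the same SQL carries over to MySQL):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE record_brochure (record_id INTEGER, brochure_id INTEGER)")
con.executemany("INSERT INTO record_brochure VALUES (?, ?)",
                [(1, 2), (1, 38), (1, 281), (2, 28192), (2, 293)])

# how many brochures does record 1 have?
(count,) = con.execute("SELECT COUNT(*) FROM record_brochure "
                       "WHERE record_id = 1").fetchone()
print(count)   # 3

# which record does brochure 281 belong to?
(owner,) = con.execute("SELECT record_id FROM record_brochure "
                       "WHERE brochure_id = 281").fetchone()
print(owner)   # 1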
What you want to achieve (not recommended): I think the fastest way to achieve your objective without changing the db structure is to get the value of your brochures column and then process it with your script. You really don't want to create a SQL statement to parse this kind of data. In PHP that would look something like this:
// Let's assume you already have your `brochures` column value in $brochures
$bs = str_replace(": ", ":", $brochures); // 'Brochure1:00001 Brochure2:00002 ...'
$bs = explode(" ", $bs);
$brochures = array();
foreach ($bs as $b) {
    $pos = strpos($b, ":");
    // key: the number after "Brochure" (may be more than one digit);
    // value: the five-digit brochure ID after the colon
    $brochures[substr($b, 8, $pos - 8)] = substr($b, $pos + 1, 5);
}
// Now you have the $brochures array with keys representing the brochure
// number, and values representing the ID of the brochure.
if (isset($brochures['3'])) {
    // this row has a defined Brochure3
} else {
    // ...
}
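For the original goal (which numbers in the range 00001-88888 never appear next to any Brochure(X)), once the IDs have been collected from every row the rest is a set difference. A rough Python sketch, with made-up sample values:

# IDs gathered from all parsed rows (sample values for illustration only)
seen = {1, 2, 3, 293, 28192}
missing = sorted(set(range(1, 88889)) - seen)
print(missing[:5])  # [4, 5, 6, 7, 8]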

MySQL finding data if any 4 of 5 columns are found in a row

I have an imported table of several thousand customers. The development I am working on runs on the basis of anonymity for purchase checkouts (customers do not need to log in to check out), but if enough of their details match a database record, then do a soft match, email the (probably new) email address, and eventually associate the anonymous checkout with the account record on file.
It is rolling out this way due to the age of the records: many people have the same postal address or name but not the same email address, likewise some people will have moved house and some will have changed name (marriage etc.).
What I think I am looking for is a MySQL CASE system, however the CASE questions on Stack Overflow I've found don't appear to cover what I'm trying to get from this query.
The query should work something like this:
$input[0] = postcode (zip code)
$input[1] = postal address
$input[2] = phone number
$input[3] = surname
$input[4] = forename
SELECT account_id FROM account WHERE <4 or more of the variables listed match the same row>
The only way I KNOW I can do this is with a massive bunch of OR statements, but that's excessive and I'm sure there's a cleaner, more concise method.
I also apologise in advance if this is relatively easy, but I don't [think I] know the keyword to research for constructing this. As I say, CASE is my best guess.
I'm having trouble working out how to manipulate CASE to fit what I'm trying to do. I do not need to return the values, only the account_id from the (single) valid row that matches 4 or 5 of the given inputs.
I imagine that I could construct a layout that does this:
SELECT account_id CASE <if postcode_column=postcode_var> X=X+1
CASE <if surname_column=surname_var> X=X+1
...
...
WHERE X > 3
Is CASE the right idea?
If not, what is the process I need to use to achieve the desired results?
What [other] MySQL keyword / syntax do I need to research, if not CASE?
Here is your pseudo query. MySQL evaluates each comparison to 1 when true and 0 when false, so adding the comparisons up counts how many of them match:
SELECT account_id
FROM account
WHERE (postcode = 'pc')+
(postal_address = 'pa')+
(phone_number = '12345678901')+
(surname = 'sn')+
(forename= 'fn') > 3
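The same counting idea, sketched in Python with made-up field values, purely to illustrate why the sum of booleans works:

row = {"postcode": "pc", "postal_address": "pa",
       "phone_number": "12345678901", "surname": "sn", "forename": "xx"}
inputs = {"postcode": "pc", "postal_address": "pa",
          "phone_number": "12345678901", "surname": "sn", "forename": "fn"}
matches = sum(row[k] == inputs[k] for k in inputs)  # booleans add up as 0/1
print(matches > 3)  # True: 4 of the 5 fields match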

Crystal Reports XI - export shared variable string to csv truncates the string

I have two database tables that store information on customer appointments:
AppointmentMaster has 1 record for each appointment:
Customer Name ApptDate ApptID
------------------------------------------------
2554 Smith,Bob 20140301 100
2468 Jones, Grace 20140301 101
2795 Roberts, Sam 20140302 102
2408 Harris, Chuck 20140305 103
AppointmentDetails holds a record for each operation performed at the appointment (sometimes none, sometimes dozens):
ApptID Operation OpDescription
------------------------------------------------
100 A10 Corrected the A10 unit.
100 IA Resolved issues with internal account.
100 C5 Brief consult with client.
101 A10C Replaced cage on A10 unit.
101 U1 Updated customer account.
103 C5 Brief consult with client.
My client needs a CSV file that contains 1 line per appointment. One of the fields in the CSV is a pipe-separated listing of any and all operation codes performed at the appointment. The CSV file would look like this:
"2554", "Smith,Bob", "20140301", "A10|IA|C5|"
"2468", "Jones, Grace", "20140301", "A10C|U1|"
"2795", "Roberts, Sam", "20140302", ""
"2408", "Harris, Chuck", "20140305", "C5|"
I have a Crystal report that displays the fields correctly; however, when I export to CSV I see a file like this:
"2554", "Smith,Bob", "20140301", "C5|"
"2468", "Jones, Grace", "20140301", "U1|"
"2795", "Roberts, Sam", "20140302", ""
"2408", "Harris, Chuck", "20140305", "C5|"
Only the last Operation gets exported into the CSV even though all of them display.
If I export as PDF, Excel or Record Style, the file has all of the operations. Unfortunately I need a CSV. I am trying to avoid having to do multiple reports and stitch them together with a script if possible; the client wants to be able to easily run and export this themselves on demand.
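(For reference, if a post-processing script ever did become acceptable, the aggregation itself is small. A hedged Python sketch, assuming hypothetical headerless CSV exports of the two tables as input:)

import csv
from collections import defaultdict

# pipe-join the operation codes per appointment
ops = defaultdict(str)
with open('AppointmentDetails.csv') as f:
    for appt_id, operation, _description in csv.reader(f):
        ops[appt_id] += operation + '|'

with open('AppointmentMaster.csv') as f, open('out.csv', 'w', newline='') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for customer, name, appt_date, appt_id in csv.reader(f):
        writer.writerow([customer, name, appt_date, ops[appt_id]])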
I created three formula fields to initialize, update and display a shared variable that concatenates the operations together.
My report is grouped by the ApptID and looks like this:
Group Header #1 (suppressed)
{#InitializeOperations}:
WhilePrintingRecords;
shared StringVar Operations := "";
Details (suppressed)
{#UpdateOperations}:
WhilePrintingRecords;
shared StringVar Operations := Operations + {AppointmentDetails.Operation} + "|";
Group Footer #1
{AppointmentMaster.Customer}
{AppointmentMaster.Name}
{AppointmentMaster.ApptDate}
{#DisplayOperations}:
WhilePrintingRecords;
shared StringVar Operations;
I have tried using EvaluateAfter({#UpdateOperations}) instead of WhilePrintingRecords on #DisplayOperations, and have even tried removing any Evaluation Time command from it as well, but I still can't get the desired effect in the CSV file, despite it looking correct on screen and in every other way I have tried to export it.
Any help you can provide is appreciated.

Dealing with 5,000 attributes

I have a data set which contains 5,000+ attributes.
The table looks like the one below:
id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 0
a 0 0 1
I wish to represent each record on a single row, for example as in the table below, to make it more amenable to data mining via clustering:
id attr1 attr2 attr3
a 1 1 1
I have tried a multitude of ways of doing this.
I have tried importing it into a MySQL DB and getting the max value for each attribute (they can only be 1 or 0 for each ID), but a table can't hold the 5,000+ attributes.
I have tried using the pivot function in Excel and getting the max value per attribute, but the number of columns a pivot can handle is far less than the 5,000 I'm currently looking at.
I have tried importing it into Tableau, but that also suffers from the fact that it can't handle so many columns.
I just want to get Table 2 into either a text/CSV file or a database table.
Can anyone suggest anything at all, a piece of software or something I have not yet considered?
Here is a Python script which does what you ask for:
def merge_rows_by_id(path):
    """Collapse all rows sharing an id into one row of element-wise maxima."""
    rows = dict()
    with open(path) as in_file:
        header = in_file.readline().rstrip()
        for line in in_file:
            fields = line.split()
            row_id, attributes = fields[0], fields[1:]
            if row_id not in rows:
                rows[row_id] = attributes
            else:
                # element-wise max works because the values are '0'/'1' strings
                rows[row_id] = [max(pair) for pair in zip(rows[row_id], attributes)]
    print(','.join(header.split()))
    for row_id in rows:
        print('{},{}'.format(row_id, ','.join(rows[row_id])))

merge_rows_by_id('my-data.txt')
It was written for clarity more than maximum efficiency, although it's still pretty efficient. However, this will still leave you with lines of 5,000 attributes, just fewer of them.
I've seen this data "structure" used too often in bioinformatics, where the researchers just say "put everything we know about 'a' on one row", and then the set of "everything" doubles, and re-doubles, etc. I've had to teach them about data normalization to make an RDBMS handle what they've got. Usually attr_1…n are from one trial and attr_n+1…m are from a second trial, and so on, which allows for a sensible normalization of the data.

Is this a case for denormalisation?

I have a site with about 30,000 members to which I'm adding functionality that involves sending a random message from a pool of 40 possible messages. Members can never receive the same message twice.
One table contains the 40 messages and another table maps the many-to-many relationship between messages and members.
A cron script runs daily, selects a member from the 30,000, selects a message from the 40 and then checks to see if this message has been sent to this user before. If not, it sends the message. If yes, it runs the query again until it finds a message that has not yet been received by this member.
What I'm worried about now is that this m-m table will become very big: at 30,000 members and 40 messages we already have 1.2 million rows through which we have to search to find a message that has not yet been sent.
Is this a case for denormalisation? In the members table I could add 40 columns (message_1, message_2 ... message_40) in which a 1 flag is added each time a message is sent. If I'm not mistaken, this would make the queries in the cron script run much faster?
I know this doesn't answer your original question, but wouldn't it be way faster to select all the messages that haven't yet been sent to a user and then pick one of those randomly?
See this pseudo-MySQL:
SELECT
    user.id AS user,
    GROUP_CONCAT(messages.id) AS unsent_messages
FROM
    messages,
    user
WHERE
    messages.id NOT IN (
        SELECT message_id
        FROM sent_messages
        WHERE sent_messages.user = user.id
    )
GROUP BY user.id
You could also append the id of each sent message to a varchar field in the members table.
Good manners aside, this would make it easy to get a message that has not yet been sent to a specific member in a single statement.
Just like this (if you surround the ids with '-'):
SELECT message.id
FROM member, message
WHERE member.id = 2321
AND member.sentmessages NOT LIKE CONCAT('%-', message.id, '-%')
1.2M rows at 8 bytes (+ overhead) per row is not a lot. It's so small I wouldn't even bet it needs indexing (but of course you should do it).
Normalization reduces redundancy, and it is what you should do with a large amount of data, which seems to be your case. You need not denormalize. Let there be an M-to-M table between members and messages.
You can archive the old data as your M-to-M data increases. I don't even see any conflicts, because your cron job runs daily for this task and accounts only for the data for the current day. So you can archive the M-to-M table data every week.
I believe there will be maintenance issues if you denormalize by adding additional columns to the members table; I don't recommend it. Archiving old data can save you from trouble.
You could store only the available (unsent) messages, as sketched below. This implies extra maintenance when you add or remove members or message types (nothing that can't be automated with foreign keys and triggers) but simplifies delivery: pick a random row for the user, send the message, and remove the row. Also, your database will get smaller as messages get sent ;-)
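A minimal sketch of that idea, using SQLite from Python purely for illustration (the table and column names are hypothetical):

import sqlite3

# one row per (member, message) still to be delivered
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE available (member_id INTEGER, message_id INTEGER)")
con.executemany("INSERT INTO available VALUES (?, ?)",
                [(2321, m) for m in range(40)])

# delivery: pick one unsent message at random, send it, remove the row
row = con.execute("SELECT message_id FROM available WHERE member_id = ? "
                  "ORDER BY RANDOM() LIMIT 1", (2321,)).fetchone()
if row:
    # ... send message row[0] to member 2321 here ...
    con.execute("DELETE FROM available WHERE member_id = ? AND message_id = ?",
                (2321, row[0]))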
You can achieve the effect of sending random messages by preallocating a random string of message IDs in your m-m table, together with a pointer to the offset of the last message sent.
In more detail, create a table MemberMessages with columns
memberId,
messageIdList char(80) or varchar,
lastMessage int,
primary key is memberId.
Pseudo-code for the cron job then looks like this...
ONE. Select the next message for a member. If no row exists in MemberMessages for this member, go to step TWO. The SQL to select the next message looks like:
select substr(messageIdList, 2*lastMessage + 1, 2) as nextMessageId
from MemberMessages
where memberId = ?
Send the message identified by nextMessageId,
then update lastMessage, incrementing by 1, unless you have reached 39, in which case reset it to zero:
update MemberMessages
set lastMessage = MOD(lastMessage + 1, 40)
where memberId = ?
TWO. Create a random list of message IDs as a string of couplets, like 2117390740... This is your random list of message IDs as an 80-character string. Insert a row into MemberMessages for your memberId, setting messageIdList to the 80-character string and lastMessage to 1.
Send the message identified by the first couplet from the list to the member.
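A small Python sketch of the couplet bookkeeping (assuming message IDs 00-39, matching the scheme above; note that Python slicing is 0-based where the SQL substr above is 1-based):

import random

ids = list(range(40))
random.shuffle(ids)
message_id_list = ''.join('{:02d}'.format(i) for i in ids)  # e.g. '21173907...'

last_message = 1                      # step TWO: the first couplet was already sent
start = 2 * last_message              # SQL's substr(..., 2*lastMessage + 1, 2)
next_message_id = int(message_id_list[start:start + 2])
# ... send next_message_id ...
last_message = (last_message + 1) % 40   # mirrors MOD(lastMessage + 1, 40)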
You can create a kind of queue / heap.
ReceivedMessages
UserId
MessageId
then:
Pick a member and select a message to send:
SELECT * FROM Messages WHERE MessageId NOT IN (SELECT MessageId FROM ReceivedMessages WHERE UserId = #UserId) LIMIT 1
then insert the MessageId and UserId into ReceivedMessages,
and do the send logic here.
I hope that helps.
There are potentially easier ways to do this, depending on how random you want "random" to be.
Consider that at the beginning of the day you shuffle an array A, [0..39], which describes the order of the messages to be sent to users today.
Also consider that you have at most 40 cron jobs which are used to send messages to the users. Given the Nth cron job, and a selected numeric user ID, you can choose M, the index of the message to send:
M = (A[N] + ID) % 40.
This way, a given ID would not receive the same message twice in the same day (because A[N] would be different), and two randomly selected users have a 1/40 chance of receiving the same message. If you want more "randomness" you can potentially use multiple arrays.
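A compact Python sketch of that scheme (the array name A and the 40-message pool are taken from the description above):

import random

A = list(range(40))
random.shuffle(A)                      # today's shuffled message order

def message_for(n, user_id):
    """Message index for the n-th cron job and a numeric user ID."""
    return (A[n] + user_id) % 40       # M = (A[N] + ID) % 40

print(message_for(3, 2321))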