SAS Most efficient way to eliminate duplicate - duplicates

For starters, I know my problem is similar to this one (the closest to my question that I have found), but there are some differences, hence my new post.
I have a database with an identifier and declarations. Declarations are constructed as identifier + a letter.
If the identifier is 123456, the declarations would be "123456A", "123456B" and so on.
I would like to select one observation for each identifier: the one whose declaration has the last letter, which of course is not always the same.
I assume I can do that with a proc sort and then another one with nodupkey:
proc sort data=have out=have2;
by identifier descending declaration;
run;
proc sort data=have2 out=want nodupkey;
by identifier;
run;
but as I have a relatively large database (tens of millions of observations), I would like to know whether there is another method that is both better suited and faster.
Ideally, one that does it in a single step.
Thanks

This looks like a quick solution. It outputs only the first observation for each identifier (in your case the last declaration, since you have already sorted in descending order), meaning the rest of the records should not even be loaded into the program data vector. If possible, please let me know how it went; I am curious whether this would be optimal. I know this to be true only in theory and have never tested it myself on a large dataset. Thanks.
data want;
do until ( first.identifier ) ;
set have;
by identifier ;
end ;
run;

This should work:
proc sql;
create table want as
select
identifier,
max(declaration) as last_declaration
from have
group by identifier;
quit;
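To illustrate why taking the maximum declaration per identifier gives the wanted row, here is a minimal Python sketch (not SAS). It assumes, as the question describes, that every declaration is the identifier followed by a single letter, so plain string comparison orders declarations by that letter. The function name and the sample rows are made up for the example.

```python
def last_declarations(rows):
    """Return {identifier: greatest declaration} in a single pass."""
    latest = {}
    for identifier, declaration in rows:
        # String comparison works because declarations of one identifier
        # share the same prefix and differ only in the trailing letter.
        if identifier not in latest or declaration > latest[identifier]:
            latest[identifier] = declaration
    return latest


# Made-up sample data, deliberately unsorted.
rows = [
    ("123456", "123456B"),
    ("123456", "123456A"),
    ("654321", "654321A"),
    ("123456", "123456C"),
]
result = last_declarations(rows)
```

Like the GROUP BY query, this needs no prior sort, at the cost of keeping one entry per identifier in memory.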

Overwrite function only for a particular instance in Lua

I'm not really looking for an answer on how to do something; I found how to do it, but I want more information. I hope this kind of question is OK here.
Since I just discovered this in the code of a game I'm modding, I don't have any idea what I should google for.
In Lua, I can have for example:
Account = {balance = 0}
function Account:withdraw (v)
self.balance = self.balance - v
end
I can have (in another Lua file)
function Account:withdrawBetter (v)
if self.balance > v then
self.balance = self.balance - v
end
end
....
--somewhere in some function, with an Account instance:
a1.withdraw = a1.withdrawBetter
What's the name for this "technique", so I can find some more information about it (possible pitfalls, performance considerations vs. override/overwrite, etc.)? Note that I'm only changing withdraw for the particular instance (a1), not for every Account instance.
Bonus question: Any other oo programming languages with such facility?
Thanks
OO in Lua
First of all, it should be pointed out that Lua does not implement Object Oriented Programming; it has no concept of objects, classes, inheritance, etc.
If you want OOP in Lua, you have to implement it yourself. Usually this is done by creating a table that acts as a "class", storing the "instance methods", which are really just functions that accept the instance as their first argument.
Inheritance is then achieved by having the "constructor" (also just a function) create a new table and set its metatable to one with an __index field pointing to the class table. When indexing the "instance" with a key it doesn't have, it will then search for that key in the class instead.
In other words, an "instance" table may have no functions at all, but indexing it with, for example, "withdraw" will just try indexing the class instead.
Now, if we take a single "instance" table and add a withdraw field to it, Lua will see that it has that field and not bother looking it up in the class. You could say that this value shadows the one in the class table.
What's the name for this "technique"
It doesn't really have one, but you should definitely look into metatables.
In languages that do support this sort of thing, like Ruby (see below), this is often done with singleton classes, meaning classes that have only a single instance.
Performance considerations
Indexing tables, including metatables, takes some time. If Lua finds a method in the instance table, then that's a single table lookup; if it doesn't, it then needs to first get the metatable and index that instead, and if that doesn't have it either and has its own metatable, the chain goes on like that.
So, in other words, this is actually faster. It does use up some more space, but not really that much (technically it could be quite a lot, but you really shouldn't worry about that. Nonetheless, here's where you can read up on that, if you want to).
Any other oo programming languages with such facility?
Yes, lots of 'em. Ruby is a good example, where you can do something like
array1 = [1, 2, 3]
array2 = [4, 5, 6]
def array1.foo
puts 'bar'
end
array1.foo # prints 'bar'
array2.foo # raises `NoMethodError`
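Python is another language where this works. The sketch below mirrors the Account example from the question; the class, the `withdraw_better` function and the balances are illustrative, not taken from any real code base. `types.MethodType` binds the replacement to one instance, and the instance attribute then shadows the class method, much like the field added to a single Lua table.

```python
import types


class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, amount):
        self.balance -= amount


def withdraw_better(self, amount):
    # Hypothetical replacement: refuse to overdraw.
    if self.balance > amount:
        self.balance -= amount


a1 = Account(100)
a2 = Account(100)

# Bind the function to a1 only; the instance attribute shadows the class method.
a1.withdraw = types.MethodType(withdraw_better, a1)

a1.withdraw(500)  # ignored by the replacement: it would overdraw
a2.withdraw(500)  # the class method still applies to every other instance
```

Only `a1` carries its own `withdraw` entry; `a2` still resolves the name through the class.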

MUMPS can't format Number to String

I am trying to convert a large number to a string in MUMPS, but I can't.
Let me explain what I would like to do:
s A="TEST_STRING#12168013110012340000000001"
s B=$P(A,"#",2)
s TAB(B)=1
s TAB(B)=1
I would like to create an array TAB where variable B is the primary key.
When I do ZWR, I get
A="TEST_STRING#12168013110012340000000001"
B="12168013110012340000000001"
TAB(12168013110012340000000000)=1
TAB("12168013110012340000000001")=1
As you can see, the first SET treats variable B as a number (wrongly converted), and the second SET treats variable B as a string (as I would like).
My question is how to write the SET command so that variable B is treated as a string instead of a number (which is wrong, in my opinion).
Any advice or explanation would be helpful.
This may be a limitation of the sorting/storage mechanism built into MUMPS, and it differs between MUMPS implementations. The cause is that while variable values in MUMPS are untyped, index values are typed -- and numeric indices are sorted before string ones. When a large numeric string is converted to a number, rounding errors may occur. To prevent this, add a space before the number in your index so that it is explicitly treated as a string:
s TAB(" "_B)=1
As far as I know, InterSystems Caché doesn't have this limitation -- at least your code works fine in Caché, and the documentation claims support for up to 309 digits:
http://docs.intersystems.com/cache20141/csp/docbook/DocBook.UI.Page.cls?KEY=GGBL_structure#GGBL_C12648
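The rounding described in this answer can be imitated outside MUMPS. The Python sketch below is only an analogy: it assumes an implementation that keeps 18 significant digits for numeric values (a figure picked for illustration; check your own implementation's documentation), and it uses a dictionary as a stand-in for the TAB array.

```python
from decimal import Context

B = "12168013110012340000000001"

# Assumption: numbers keep at most 18 significant digits.
limited = Context(prec=18)
as_number = int(limited.create_decimal(B))

table = {}
table[as_number] = 1  # numeric subscript: the trailing 1 is rounded away
table[" " + B] = 1    # leading space keeps the subscript a string
```

With the space prefix the key never goes through numeric conversion, so all 26 digits survive.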
I've tried to recreate your scenario, but I am not seeing the issue you're experiencing.
In my opinion, it is actually not possible for the same command, executed twice in a row, to produce two different results.
s TAB(B)=1
s TAB(B)=1
As long as the value of B did not change between the executions, the result should be:
TAB("12168013110012340000000001")=1
Example of what GT.M implementation of MUMPS returns in your case

Why is using a variable named "temp" considered bad practice?

Whenever I put up code for review by professional programmers, they tend to point out that "using a variable named temp is bad", but no one seems to know why.
Why is it considered bad?
temp indeed doesn't mean anything useful. A better question is: does it have to?
Without knowing the context (how is it used in your code?), it's hard to say what it's for, and whether temp is a good name. If you use a variable often or non-locally, the name must be descriptive. A name like temp can be fine if you use it, say, three times in three adjacent lines.
void swapIntPointers(int *a, int *b) {
int temp = *a;
*a = *b;
*b = temp;
}
It's immediately obvious what this function should do (from its name) and what it actually does (from its structure). In this specific case, I strongly prefer short (and automatically nondescript) names. I'd even say that temp may be a little too long here ;)
However:
If you use the variable often, it's apparently important, and 'deserves' a better name.
If you use the same variable in places in the function that are far apart (non-local), it helps programmers to 'remember' the meaning when you give it a recognizable name.
It's because temp suggests something about the longevity of the variable (temporary) but nothing about the meaning or significance of its content. Variables are generally best named to reflect what their underlying value is intended to represent.

How can I store an array of boolean values in a MySql database?

In my case, every "item" either has a property or not. There can be some hundreds of properties, so I will need, say, at most 1000 true/false bits per item.
Is there a way to store those bits in one field of the item?
If you're looking for a way to do this in a way that's searchable, then no.
A couple searchable methods (involving more than 1 column and/or table):
Use a bunch of SET columns. You're limited to 64 items (on/offs) in a set, but you can probably figure out a way to group them.
Use 3 tables: Items (id, ...), FlagNames(id, name), and a pivot table ItemFlags(item_id, flag_id). You can then query for items with joins.
If you don't need it to be searchable, then all you need is a method to serialize your data before you put it in the database and to unserialize it when you pull it out; then use a char or varchar column.
Use facilities built in to your language (PHP's serialize/unserialize).
Concatenate a series of "y" and "n" characters together.
Bit-pack your values into a string (8 bits per character) in the client before making a call to the MySQL database, and unpack them when retrieving data out of the database. This is the most efficient storage mechanism (if all rows are the same length, use char[x], not varchar[x]) at the expense of the data not being searchable and slightly more complicated code.
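The bit-packing option can be sketched in a few lines of client code. This is a minimal Python illustration, not tied to any particular MySQL driver; the helper names `pack_bits` and `unpack_bits` and the sample flags are made up for the example.

```python
def pack_bits(flags):
    """Pack an iterable of booleans into bytes, 8 flags per byte."""
    flags = list(flags)
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            # Flag i lives in byte i // 8, at bit position i % 8.
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, count):
    """Recover the first `count` booleans from packed bytes."""
    return [bool(packed[i // 8] & (1 << (i % 8))) for i in range(count)]


# Made-up sample: 1000 flags, every third one set.
flags = [i % 3 == 0 for i in range(1000)]
blob = pack_bits(flags)
```

For 1000 flags the packed value is 125 bytes, which would be written to and read back from the fixed-width column as an opaque value.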
I would rather go with something like:
Properties
ID, Property
1, FirstProperty
2, SecondProperty
ItemProperties
ID, Property, Item
1021, 1, 10
1022, 2, 10
Then it would be easy to retrieve which properties are set or not with a query for any particular item.
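As a rough sketch of such a query, the example below builds the two tables in an in-memory SQLite database through Python's standard `sqlite3` module. SQLite only stands in for MySQL here, and the third property row is invented for the example; the join itself should carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Properties (ID INTEGER PRIMARY KEY, Property TEXT);
    CREATE TABLE ItemProperties (ID INTEGER PRIMARY KEY, Property INTEGER, Item INTEGER);
    INSERT INTO Properties VALUES
        (1, 'FirstProperty'), (2, 'SecondProperty'), (3, 'ThirdProperty');
    INSERT INTO ItemProperties VALUES (1021, 1, 10), (1022, 2, 10);
""")

# Names of the properties that are set for item 10.
rows = conn.execute(
    """
    SELECT p.Property
    FROM ItemProperties AS ip
    JOIN Properties AS p ON p.ID = ip.Property
    WHERE ip.Item = ?
    ORDER BY p.ID
    """,
    (10,),
).fetchall()
set_properties = [name for (name,) in rows]
```

A property that has no row in ItemProperties for the item is simply treated as not set.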
At worst you would have to use a char(1000) [ynnynynnynynnynny...] or the like. If you're willing to pack it (for example, hex isn't too bad), you could do it with a char(250) [hexadecimal chars, 4 bits each].
If it is less than 64, then the SET type will work, but it seems like that's not enough.
You could use a binary type, but that's designed more for stuff like movies, etc.. so I'd not.
So yeah, it seems like your best bet is to pack it into a string, and then store that.
It should be noted that a VARCHAR would be wasting space, since you do know precisely how much space your data will take, and can allocate it exactly. (Having fixed-width rows is a good thing)
Strictly speaking you can accomplish this using the following:
$bools = array(0,1,1,0,1,0,0,1);
$for_db = serialize($bools);
// Insert the serialized $for_db string into the database. You could use a text type
// to make certain it can hold the entire string.
// To get it back out:
$bools = unserialize($from_db);
That said, I would strongly recommend looking at alternative solutions.
Depending on the use case you might try creating an "item" table that has a many-to-many relationship with values from an "attributes" table. This would be a standard implementation of the common Entity Attribute Value database design pattern for storing variable points of data about a common set of objects.

Reference table values in a war against magic numbers

This question has bugged me for years, and I still can't seem to find a good solution. I work in PHP and Java, but it sounds like this may be language-agnostic :)
Say we have a standard status reference table that holds status ids for some kind of entity. Further let's assume the table will have just 5 values, and will remain like this for a long time, maybe edited occasionally with addition of a new status.
When you fetch a row and need to see what status it has, you have 2 options (as I see it, at least): use raw ID values (that is, magic numbers) or use a named constant. The latter seems much cleaner; the question, though, is where those named constants should live. In a model class? In a class that uses this particular constant? Somewhere else?
It sounds like what you're wanting to do is an enumerated value.
This is a value that has a literal name mapped to a constant value; it would be something like
Statusone = 1
Statustwo = 2
Then anywhere in your program you could reference Statusone, which the compiler would see as 1.
I'm not sure if this exists in PHP, but I'm pretty sure it does in Java.
EDIT: In response to some comments
I would typically put enumerated values in some kind of global namespace, or, if you only need them when you are using that class specifically, you can put them at the class level.
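Since the question is framed as language-agnostic, here is a small sketch of the idea using Python's standard `enum` module. The status names and the fetched row are hypothetical; the only assumption is that the enum values mirror the IDs stored in the reference table.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical statuses; values mirror the IDs in the reference table.
    DRAFT = 1
    ACTIVE = 2
    ARCHIVED = 3


row = {"id": 42, "status_id": 2}  # stand-in for a fetched row

# Convert the raw ID once, then compare against names instead of numbers.
status = Status(row["status_id"])
is_active = status is Status.ACTIVE
```

Defining the enum at module level corresponds to the "global namespace" option; nesting it inside the model class would be the class-level variant.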