I wanted to see whether folks are using decimal for financial applications instead of double. I have seen lots of folks using double all over the place, with unintended consequences...
Do you see others making this mistake?
We did, unfortunately, and we regret it. We had to change all doubles to decimals. Decimals are good for financial applications. You can look at this article:
A Money type for the CLR:
A convenient, high-performance money structure for the CLR which handles arithmetic operations, currency types, formatting, and careful distribution and rounding without loss.
Yes, using float or double for financials is a common mistake, leading to much, much pain. decimal is the most obvious choice in this scenario.
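To make the failure mode concrete, here is a minimal Python sketch (using its decimal module as a stand-in for the CLR type; the amounts are made up) of the kind of error that bites financial code:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, so tiny errors accumulate.
total_float = sum([0.10] * 10)
print(total_float)           # 0.9999999999999999

# A base-10 decimal type keeps the exact value a ledger expects.
total_decimal = sum([Decimal("0.10")] * 10)
print(total_decimal)         # 1.00
```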
For general knowledge, a good discussion of each is here (float/double) and here (decimal).
This is not as obvious as you may think. I recently had the controller of a large corporation tell me that he wanted his financial reports to match what Excel would generate. Excel maintains calculated results internally at maximum precision and only rounds at the last minute, for display purposes, which means you can't always reproduce the Excel answers by manual calculation using only the displayed values. His explanation was that there were multiple algorithms for generating the results, each one rounding at a different place when using decimal values, therefore potentially producing conflicting answers, whereas the Excel method always produced the same answer.
I personally think he's wrong, but with so many financial people using Excel without understanding how to use it properly for financial calculations, I'll bet there's a lot of people agreeing with this controller.
I don't want to start a religious war, but I'd love to hear other opinions on this.
If it is "scientific" measurement (I mean weight, length, area etc) use double.
If it is financial, or has anything to do with law (e.g. the area of a property) then use decimal.
The hard part is rounding.
If the tax is 2.4%, do you round on each detail line or after the sum?
Most of the time you have to do both (and fix the differences).
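A small sketch of why the two policies disagree, using Python's decimal module (the prices and the 2.4% rate are made up for illustration):

```python
from decimal import Decimal, ROUND_HALF_UP

TAX_RATE = Decimal("0.024")
lines = [Decimal("10.23"), Decimal("4.06"), Decimal("7.89")]

def to_cents(x):
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Policy A: round the tax on every detail line, then sum the rounded amounts.
tax_per_line = sum(to_cents(p * TAX_RATE) for p in lines)

# Policy B: sum the lines first, then round the tax once on the total.
tax_on_total = to_cents(sum(lines) * TAX_RATE)

print(tax_per_line, tax_on_total)   # 0.54 0.53 (a one-cent difference to reconcile)
```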
I've run into this a few times. Many languages have nothing of the sort built in, and to someone who doesn't understand the problem it seems like just another hassle, especially if it looks like it works as intended without it.
I have always used Decimal, at least when I had a language that supported it. Otherwise, rounding errors will kill you.
I totally agree with the correctness issues of floating point vs. decimal mentioned above, but many financial applications are performance critical.
In such cases you may consider using float/double, as decimal has a significant impact on performance in systems where decimal types are not supported in hardware. You can still wrap floating-point types in higher-level classes (e.g. Tax, Commission, Balance, Dividend, Quote, Tick, etc.) that represent the domain model and encapsulate all the rounding logic, as well as the valid operators on these types and their interactions.
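A rough sketch of that wrapper idea in Python (the class name, the half-up policy, and the two decimal places are my own illustrative choices, not from the original post):

```python
from math import floor

class Money:
    """Hypothetical wrapper: stores a raw float, but rounding happens in one place only."""
    __slots__ = ("_raw",)

    def __init__(self, raw: float):
        self._raw = raw

    def __add__(self, other: "Money") -> "Money":
        return Money(self._raw + other._raw)

    def __mul__(self, factor: float) -> "Money":
        return Money(self._raw * factor)

    def rounded(self, places: int = 2) -> float:
        # Half-up rounding for positive amounts; the only rounding point in the type.
        scale = 10 ** places
        return floor(self._raw * scale + 0.5) / scale

price = Money(10.23)
tax = price * 0.024      # intermediate value stays unrounded
print(tax.rounded())     # 0.25
```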
And yes - in some projects I have implemented custom rounding functions to squeeze up to 20% more out of calculations compared to .NET or win32 methods.
Another thing to consider is whether you pass your objects out of process: serializing decimals (which are typically four integers) and passing them over the wire is much more CPU intensive (especially if not supported natively), and results in significantly more bandwidth and a larger memory footprint.
I have a problem classifying text into several categories (topics). Apart from the text, I have some numeric features that I believe may be useful (there are also missing values among those features). But the most important information is, of course, in the text. Therefore, I think a deep learning approach (with a common pipeline: embedding layer + CNN or RNN with dropout + Dense layer) would be the best choice. What is the best practice for mixing the current model, which works only on text input, with numeric features? Are there any tricks, common best practices, or state-of-the-art research going on in this field? Are there any papers/experiments (on GitHub, maybe) on this topic?
It'd be great if we could think of the problem in general, but for the sake of having an idea of what sort of problem we may solve, I will give a specific example. Let's suppose we have reviews from users in which they describe a problem they faced while receiving a service or purchasing an item. The target feature is multi-label: the set of tags (categories/topics) associated with the complaint that a user had (we should choose the relevant ones from among a few hundred possible topics).
Then, apart from the user's comment itself (which is the most important feature), we may also want to take into account some numerical features like price, waiting time, rating (customer satisfaction score), etc. These can potentially be useful for predicting some particular categories.
The idea is to mix all these features somehow in a deep learning model to produce the final model. I'm not sure I know much about the best ways to do it. What are the best practices / useful tricks for these kinds of problems?
For each numeric feature, get a statistical summary (you can use pandas.DataFrame.describe); plotting the distribution will also give you a better visual understanding.
Once you have the mean, std, max, min, etc., you should get rid of outliers, which can harm your model's training. For example, if 90% of a feature's values fall between 18 and 72 but it also has values like 1.1 or 1200, you should get rid of those by clamping them to 18 or 72, depending on the side. You can use np.clip().
Once you have a reasonable distribution, you should convert those numeric features into categorical features. For instance, the range from 18 to 72 can be bucketed at 18, 27, 36, ..., 72, using those intervals as bins. You can increase or decrease the resolution depending on your understanding of the data and the performance of the algorithm. You can use np.digitize() or do it manually with a simple function.
In the end you have categorical features, just like the text. A CNN or RNN can work fine with categorical representations of the numeric values, and you also gain the ability to use feature crosses to improve performance.
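A small sketch of the clipping and bucketing steps with NumPy (the feature values and the 18-72 cut points just mirror the example above):

```python
import numpy as np

# Hypothetical "waiting time" feature containing a couple of outliers.
wait = np.array([22.0, 35.5, 1.1, 48.0, 1200.0, 63.2])

# Step 2: clip extreme values to the sensible range.
wait = np.clip(wait, 18, 72)

# Step 3: bucket the clipped values at 18, 27, 36, ..., 72.
bins = np.arange(18, 73, 9)
wait_bucket = np.digitize(wait, bins)   # integer bucket ids, usable like word ids

print(wait_bucket)                      # [1 2 1 4 7 6]
```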
If you are asking about something more complex, I might not have understood your question, or I may simply not know the answer. Still, if you want to ask more, or ask differently, I'll be happy to try to help.
I've seen quite a few examples where binary numbers are used in code, like 32, 64, 128 and so on (for instance, in a very well-known example, Minecraft).
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly, and that there you would always prefer such values, because in a low-level language things get complicated if you go above the register limit.
Will programs run any faster/save up more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because multiply and divide logic with powers of two is so fast, the OS/hardware/etc. will use cache line sizes, page sizes, etc. that are powers of two, so you'd do well to give your important data structures nice power-of-two sizes).
And then, on top of that, other times... programmers are just so used to powers of two that they simply use them because they seem like nice numbers.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit, especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc.), and set or test each one without having to go through convoluted addition of values.
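A tiny sketch of that flag pattern (written in Python rather than C++/Java; the option names are made up):

```python
# One bit per option.
RESIZABLE = 1 << 0
REMOVABLE = 1 << 1
EDITABLE  = 1 << 2

options = RESIZABLE | EDITABLE    # combine options with OR
if options & EDITABLE:            # test a single option with AND
    print("editable")
options &= ~RESIZABLE             # clear one option
```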
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean "use constants that are round numbers in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
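The equivalences a compiler exploits are easy to check by hand (non-negative integers assumed):

```python
x = 37
assert x * 8 == x << 3        # multiply by a power of two == left shift
assert x // 8 == x >> 3       # divide by a power of two == right shift
assert x % 8 == x & 0b111     # remainder modulo a power of two == bit mask
```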
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change with its attendant ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but memory is usually allocated in powers of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will run the same way, no matter what number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024..., they are common mostly for space optimization. For example, an 8-bit pointer can point to 256 addresses (which is a power of 2), so if you use fewer than 256 you are wasting part of the pointer's range; you would therefore normally allocate a 256-byte buffer. The same applies to all other powers of 2.
In most cases the answer is almost always no, there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases where you're looping over a structure that fills the cache in such a way that you get cache collisions every time you traverse your array/structure. This case is very rare and shouldn't be preoptimized unless your code is performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.
I am using Modelica to solve a system of equations for heat transfer problems, and one of them is radiation, which is written as
Ta^4-Tb^4
Can someone say if it is computationally faster solving a system with the equation written as:
(Ta-Tb)(Ta+Tb)(Ta^2+Tb^2)
?
There cannot be a definitive answer to this question. This is because the Modelica specification is used to formally define the problem statement but it says nothing about how tools solve such equations. Furthermore, since most Modelica tools do symbolic manipulation anyway, it is hard to predict what steps they might take with such an equation. For example, a tool may very well transform this into a Horner polynomial on its own (without your manual intervention).
If you are going to solve for the temperatures in such an equation as a non-linear system, be careful about negative temperature solutions. You should investigate the "start" attribute to specify initial (positive) guesses when these temperatures are iteration variables in non-linear problems.
I would say that there are two reasons why splitting it into (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) is SLOWER and NOT FASTER.
(Ta^2+Tb^2) requires 2 multiplications and an addition, which means that (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) requires 4 multiplications and 3 additions. On the other hand, I guess that Ta^4-Tb^4 is done like this: ((Ta^2)^2 - (Tb^2)^2), which means 1 addition and 4 multiplications.
A Modelica tool, like a more generic compiler, probably knows very well how to optimise these very simple expressions. Which means that it is generally safer, in terms of computation time, to use simple patterns which will be easily caught and translated into super-efficient machine code.
I might obviously be wrong, but I cannot see any reason why (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) could be FASTER. Hope it helps.
Oscar
I recently read that you can predict the outcomes of a PRNG if you:
Know what algorithm is being used.
Have consecutive data points.
Is it possible to figure out the seed used for a PRNG from only data points?
I managed to find a paper by Kelsey et al which details the different types of attack and also summarises some real-world examples. It seems most attacks rely on similar techniques to those against cryptosystems, and in most cases actually taking advantage of the fact that the PRNG is used in a cryptosystem.
With "enough" data points that are the absolute first data points generated by the PRNG with no gaps, sure. Most PRNG functions are invertible, so just work backwards and you should get the seed.
For example, the typical return seed = (seed*A + B) % N has the inverse return seed = ((seed - B) * A_inv) % N, where A_inv is the modular inverse of A modulo N.
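Here is a minimal sketch of that inversion for a toy LCG (the constants are illustrative, not from any particular library; Python 3.8+ for the three-argument pow):

```python
A, B, N = 1103515245, 12345, 2**31   # toy LCG parameters (A odd, so invertible mod N)

def forward(state):
    return (state * A + B) % N

def backward(state):
    a_inv = pow(A, -1, N)            # modular inverse of A mod N
    return ((state - B) * a_inv) % N

seed = 123456789
states = []
s = seed
for _ in range(5):
    s = forward(s)
    states.append(s)

# Walk back from the last observed state to the original seed.
s = states[-1]
for _ in range(5):
    s = backward(s)
print(s == seed)   # True
```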
It's always theoretically possible, if you're "allowed" to brute force all possible values for the seed, and if you have enough data points that there's only one seed that could have produced that output. If the PRNG was seeded with the time, and you know roughly when that happened, then this might be very fast since there aren't many plausible values to try. If the PRNG was seeded with data from a truly random source having 64 bits of entropy, then this approach is computationally infeasible.
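As a rough illustration of the time-seeded case, here is a sketch that brute-forces a narrow timestamp window, assuming the generator is Python's random module seeded with a Unix timestamp (all the values below are made up):

```python
import random

def recover_time_seed(observed, t_min, t_max):
    """Try every timestamp in the window and compare the first few outputs."""
    for candidate in range(t_min, t_max + 1):
        rng = random.Random(candidate)
        if [rng.randrange(100) for _ in observed] == observed:
            return candidate
    return None

# Pretend the app was seeded with some unknown second around 1_700_000_000.
secret_seed = 1_700_000_123
rng = random.Random(secret_seed)
observed = [rng.randrange(100) for _ in range(8)]

print(recover_time_seed(observed, 1_700_000_000, 1_700_000_600))   # 1700000123
```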
Whether there are other techniques depends on the algorithm. For example doing this for Blum Blum Shub is equivalent to integer factorization, which is generally believed to be a hard computational problem. Other, faster PRNGs might be less "secure" in this sense. Any PRNG used for crypto purposes, for example in a stream cipher, pretty much needs there to be no known feasible way of doing it.
In a previous project, I noticed that the price field was being stored as an int rather than as a float. This was done by multiplying the actual value by 100, the reason being to avoid running into floating-point problems.
Is this a good practice that I should follow or is it unnecessary and only makes the data less transparent?
Interesting question.
I wouldn't actually choose float in the mysql environment. Too many problems in the past with precision with that datatype.
To me, the choice would be between int and decimal(18,4).
I've seen real-world examples of integers used to represent fractional values. The internals of the JD Edwards data tables all do this; stored quantities typically have to be divided by 10000 to get the real value. While I'm sure it's faster and smaller in-table, it just means we're always having to CAST the ints to a decimal value if we want to do anything with them, especially division.
From a programming perspective, I'd always prefer to work with decimal for price (or money in RDBMSs that support it).
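A small sketch of the two conventions being compared, using Python's decimal module (the price and the cents scale are illustrative):

```python
from decimal import Decimal

# Convention 1: store an integer number of cents, convert only at the edges.
price_cents = 1099                   # what goes into the INT column
price = Decimal(price_cents) / 100   # Decimal('10.99') for display and arithmetic

# Convention 2: store a DECIMAL(18,4) column and bind it to a decimal type directly.
unit_cost = Decimal("10.9900")

# Either way, keep the conversion in one well-tested place so nobody charges 100x the price.
assert price == unit_cost
```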
Floating-point errors can cause you problems if you are multiplying large numbers. In general, financial calculations should not be done with floating-point numbers if you can avoid it.
I think Decimal is good for this use.
While it would save you float-related issues, having prices saved as integers might lead to a problem where you end up charging 100 times the price to a customer. It could also confuse other programmers.
I have seen both solutions used successfully on medium-size e-commerce websites, but my preference goes to using floats.