Safe maximum amount of nodes in the DOM? [closed] - html

For a web application, given the available memory in a target mobile device [1] running a target mobile browser [2], how can one estimate the maximum number of DOM nodes, including text nodes, that can be generated via HTML or DHTML?
How can one calculate the estimate before:
Failure
Crash
Significant degradation in response
Also, is there a hard per-tab limit in any browser that must not be crossed?
Regarding Prior Closure
This is not like the other questions in the comments below. It is also asking a very specific question, seeking a method for estimation. There is nothing duplicated, broad, or opinion-based about it, especially now that it is rewritten for clarity without changing its author's expressed interests.
Footnotes
[1] For instance, Android or iOS mobile devices sold from 2013-2018 with some specific RAM capacity
[2] Firefox, Chrome, IE 11, Edge, Opera, Safari

This is a question for which only a statistical answer could be accurate and comprehensive.
Why
The appropriate equation is this, where N is the number of nodes, bytes_N is the total bytes required to represent them in the DOM, and the node index n ∈ [0, N).
bytes_N = Σ_{n=0}^{N-1} (bytesContent_n + bytesOverhead_n)
The value requested in the question is the maximum value of N in the worst case handheld device, operating system, browser, and operating conditions. Solving for N for each permutation is not trivial. The equation above reveals three dependencies, each of which could drastically alter the answer.
(1) The average size of a node is dependent on the average number of bytes used in each to hold the content, such as UTF-8 text, attribute names and values, or cached information.
(2) The average overhead of a DOM object is dependent on the HTTP user agent that manages the DOM representation of each document. W3C's Document Object Model FAQ states, "While all DOM implementations should be interoperable, they may vary considerably in code size, memory demand, and performance of individual operations."
(3) The memory available to use for DOM representations is dependent upon the browser used by default (which can vary depending on what browser handheld device vendors or users prefer), user override of the default browser, the operating system version, the memory capacity of the handheld device, common background tasks, and other memory consumption.
Rigorous Solution
One could run tests to determine (1) and (2) for each of the common HTTP user agents used on handheld devices. The distribution of user agents for any given site can be obtained by configuring the logging mechanism of the web server to record the HTTP_USER_AGENT field if it isn't there by default, then stripping all but that field from the log and counting the instances of each value.
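As a rough illustration of the counting step, here is a minimal Python sketch; it assumes a combined-format access log named access.log in which the user-agent string is the last quoted field (the file name and the field position are assumptions, not part of the method above):

    import re
    from collections import Counter

    # In the common combined log format, the user-agent is the final quoted field.
    ua_pattern = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open("access.log", encoding="utf-8") as log:
        for line in log:
            match = ua_pattern.search(line)
            if match:
                counts[match.group(1)] += 1

    # Show the most common user agents and how often each appears.
    for agent, hits in counts.most_common(20):
        print(f"{hits:8d}  {agent}")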
The number of bytes per character would need to be tested for both attribute values and UTF-8 inner text (or whatever the encoding is) to get a clear pair of factors for calculating (1).
The memory available would need to be tested too under a variety of common conditions, which would be a major research project by itself.
The particular value of N chosen would have to be ZERO to handle the actual worst case, so one would instead choose a value that covers a certain percentage of typical cases of content, node structures, and run time conditions. For instance, one may take a sample of cases using some form of randomized in situ (within normal environmental conditions) study and find the N that satisfies 95% of those cases.
Perhaps a set of cases could be tested in the above ways and the results placed in a table. Such would represent a direct answer to your question.
I'm guessing it would take an excellent mobile software engineer with a good math background and a statistics expert working together full time with a substantial budget for about four weeks to get reasonable results.
A More Practical Estimation
One could guess the worst case scenario. With a few full days of research and a few proof-of-concept apps, this proposal could be refined. Absent the time to do that, here's a good first guess.
Consider a cell phone that permits 1 GByte for the DOM because normal operating conditions use 3 GBytes out of its 4 GBytes for the purposes mentioned above. To get a ballpark figure, one might assume the average memory consumption of a node to be as follows.
2 bytes per character for 40 characters of inner text per node
2 bytes per character for 4 attribute values of 10 characters each
1 byte per character for 4 attribute names of 4 characters each
160 bytes for the C/C++ node overhead
In this case N_worst_case, the worst case max nodes,
= 1,024 × 1,024 × 1,024
/ (2 × 40 + 2 × 4 × 10 + 1 × 4 × 4 + 160)
≈ 3,195,660.
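The same ballpark arithmetic as a small Python sketch, so the assumptions can be changed easily (every per-node figure below is a guess copied from the list above, not a measured value):

    # Ballpark estimate of the worst-case DOM node count.
    dom_budget_bytes = 1024 * 1024 * 1024           # 1 GByte assumed available for the DOM

    inner_text    = 2 * 40                          # 2 bytes/char, 40 chars of inner text
    attr_values   = 2 * 4 * 10                      # 4 attribute values of 10 chars, 2 bytes/char
    attr_names    = 1 * 4 * 4                       # 4 attribute names of 4 chars, 1 byte/char
    node_overhead = 160                             # guessed C/C++ node overhead

    bytes_per_node = inner_text + attr_values + attr_names + node_overhead
    n_worst_case = dom_budget_bytes // bytes_per_node
    print(bytes_per_node, n_worst_case)             # 336 bytes/node -> 3,195,660 nodes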
I would not, however, build a document in a browser with three million DOM nodes if it could be at all avoided. Consider employing the more common practice below.
Common Practice
The best solution is to stay far below what N might be and simply reduce the total number of nodes to the degree possible using standard HTTP design techniques.
Reduce the size and complexity of that which is displayed on any given page, which also improves visual and conceptual clarity.
Request minimal amounts of data from the server, deferring content that is not yet visible using windowing techniques or balancing response time with memory consumption in well-planned ways.
Use asynchronous calls to assist with the above minimalism.

There is no limit for the DOM itself. Instead there is a limit for the running application, namely the browser. Like any other 32-bit process, it is limited to 4 GB of virtual memory. How much resident memory is used depends on the amount of physical memory. With low RAM you might end up constantly swapping pages in and out (assuming a sufficient amount of swap space is available). Some systems (Linux, Android) have a special kernel task that kills applications if the system runs out of memory. Also, on Linux-like systems per-process resource limits, including the maximum virtual memory size, can be changed with the ulimit command.

Related

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, are completely independent from the binary dependent variable, but that can still be correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of the correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world—you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors at a risk level of α = 0.05, you can expect about 50 of the truly unrelated factors to appear significant purely by chance. To avoid this, you need to tighten your testing threshold using something along the lines of a Bonferroni correction. Recall that statistical discriminating power is based on standard error, which is inversely proportional to the square root of the sample size. Bonferroni says that 1000 simultaneous tests need their individual significance threshold tightened by a factor of 1000, which in turn means the sample size needs to be considerably larger than when performing a single test for significance.
So in summary I'd say that you shouldn't attempt to ensure a lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice there will be non-predictors that leak through unless you can obtain enough data, so I'd suggest that your testing should address the rates of occurrence as a function of the number of candidate factors and the sample size.
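To illustrate that last point, here is a small NumPy/SciPy sketch that generates a binary outcome and 1000 genuinely independent binary factors, then counts how many come out "significant" at α = 0.05 anyway (the sample size and frequencies are arbitrary placeholders):

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    n_obs, n_factors, alpha = 2000, 1000, 0.05

    # Outcome and candidate factors drawn completely independently of each other.
    outcome = rng.binomial(1, 0.3, size=n_obs)
    factors = rng.binomial(1, 0.5, size=(n_obs, n_factors))

    false_positives = 0
    for j in range(n_factors):
        table = np.array([
            [np.sum((factors[:, j] == 0) & (outcome == 0)),
             np.sum((factors[:, j] == 0) & (outcome == 1))],
            [np.sum((factors[:, j] == 1) & (outcome == 0)),
             np.sum((factors[:, j] == 1) & (outcome == 1))],
        ])
        # No continuity correction, to stay close to the nominal alpha.
        _, p, _, _ = chi2_contingency(table, correction=False)
        if p < alpha:
            false_positives += 1

    print(false_positives, "of", n_factors, "independent factors look significant")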

Is standard deviation (STDDEV) the right function for the job?

We wrote a monitoring system. This monitor is made of agents. Each agent runs on a different server, and monitors that specific server's resources (RAM, CPU, SQL Server status, replication status, free disk space, Internet access, specific business metrics, etc.).
The agents report every measure they take to a central database where these "observations" are stored.
For example, every few seconds an agent would store in the central database a specific business metric called "unprocessed_files" with its corresponding value:
(unprocessed_files, 41)
That value is constantly being written to our DB (among many others, as explained above).
We are now implementing a client application, a screen, that displays the status of everything we monitor. So, how can we calculate what's a "normal" value and what's a wrong value?
For example, we know that if our servers are working correctly, the unprocessed_files would always be close to 0, but maybe (We don't know yet), 45 is an acceptable value.
So the question is, should we use the Standard Deviation in order to know what the acceptable range of values is?
ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?
We would like to notify with a red color when something is not going well.
For your backlog (unprocessed file) metric, using a standard deviation to know when to sound an alarm (turn something red) is going to drive you crazy with false alarms.
Why? Most of the time your backlog will be zero, so the standard deviation will also be very close to zero. Standard deviation tells you how much your metric varies. Therefore, whenever you get a nonzero backlog, it will be outside the avg + stdev range.
For a backlog, you may want to turn stuff yellow when the value is > 1 and red when the value is > 10.
If you have a "how long did it take" metric, standard deviation might be a valid way to identify alarm conditions. For example, you might have a web request that usually takes about half a second, but typically varies from 0.25 to 0.8 second. If they suddenly start taking 2.5 seconds, then you know something has gone wrong.
Standard deviation is a measurement that makes most sense for a normal distribution (bell curve distribution). When you handle your measurements as if they fit a bell curve, you're implicitly making the assumption that each measurement is entirely independent of the others. That assumption works poorly for typical metrics of a computing system (backlog, transaction time, load average, etc). So, using stdev is OK, but not great. You'll probably struggle to make sense of stdev numbers: that's because they don't actually make much sense.
You'd be better off, as duffymo suggested, looking at the 95th percentile (the worst-performing operations). But MySQL doesn't compute those kinds of distributions natively. PostgreSQL does. So does Oracle Standard Edition and higher.
How do you determine an out-of-bounds metric? It depends on the metric, and on what you're trying to do. If it's a backlog measurement, and it grows from minute to minute, you have a problem to investigate. If it's a transaction time, and it's far longer than average (avg + 3 × stdev, for example), you have a problem. The open source monitoring system Nagios has worked this out for various kinds of metrics.
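As a rough sketch of that split in Python (the specific thresholds and the avg + 3 × stdev rule are illustrative, not recommendations):

    import statistics

    def backlog_status(unprocessed_files):
        # Fixed thresholds suit a backlog that is normally zero.
        if unprocessed_files > 10:
            return "red"
        if unprocessed_files > 1:
            return "yellow"
        return "green"

    def latency_status(latest_seconds, history_seconds):
        # For a "how long did it take" metric, compare against avg + 3 * stdev.
        mean = statistics.mean(history_seconds)
        stdev = statistics.pstdev(history_seconds)
        return "red" if latest_seconds > mean + 3 * stdev else "green"

    print(backlog_status(41))                           # "red"
    print(latency_status(2.5, [0.25, 0.5, 0.6, 0.8]))   # flags the 2.5 s request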
Read a book by N. N. Taleb called "The Black Swan" if you want to know how assuming the real world fits normal distributions can crash the global economy.
Standard deviation is just a way of characterizing how much a set of values spreads away from its average (i.e. mean). In a sense, it's an "average deviation from average", though a little more complicated than that. It is true that values which differ from the mean by many times the standard deviation tend to be rare, but that doesn't mean the standard deviation is a good benchmark for identifying anomalous values that might indicate something is wrong.
For one thing, if you set your acceptable range at the average plus or minus one standard deviation, you're probably going to get very frequent results outside that range! You could use the average plus or minus two standard deviations, or three, or however many you want to bring the number of notifications/error conditions as low as you want, but there's no telling whether any of this actually helps you identify error conditions.
I think your main problem is not statistics. Your problem is that you don't know what kinds of results actually indicate an error. So before you program in any acceptable range, just let the system run for a while and collect some calibration data showing what kinds of values you see when it's running normally, and what kinds of values you see when it's not running normally. Make sure you have some way to tell which are which. Once you have a good amount of data for both conditions, you can analyze it (start with a simple histogram) and see what kinds of values are characteristic of normal operation and what kinds are characteristic of error conditions. Then you can set your acceptable range based on that.
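A minimal sketch of that calibration step in Python, assuming the known-normal readings have already been exported to a plain text file (the file name and the 99th percentile cut-off are placeholders):

    import numpy as np

    # Values of the metric collected while the system was known to be healthy.
    normal_values = np.loadtxt("calibration_normal.txt")

    # Start with a simple histogram to see the shape of the data.
    hist, edges = np.histogram(normal_values, bins=20)
    print(list(zip(edges[:-1], hist)))

    # One simple rule: alert when a reading exceeds the 99th percentile
    # of what was observed during normal operation.
    threshold = np.percentile(normal_values, 99)

    def is_abnormal(reading):
        return reading > threshold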
If you want to get fancy, there is a statistical technique called likelihood ratio testing that can help you evaluate just how likely it is that your system is working properly. But I think it's probably overkill. Monitoring systems don't need to be super-precise about this stuff; just show a cautionary notice whenever the readings start to seem abnormal.

Do different arithmetic operations have different processing times?

Are the basic arithmetic operations the same with respect to processor usage? For example, if I do an addition vs a division in a loop, will the calculation time for the addition be less than that for the division?
I am not sure if this question belongs here or computer science SE
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and instruction timings for a massively old architecture, the 8086. It is a fairly simple starting point.
Of relevant note, they are measured in cycles, or clocks, and everything moves at the speed of the CPU (they are synchronized to the main clock, or frequency, of the microprocessor).
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles.
Also note that operation speed is affected by which area of memory the operands reside in.
Note that on modern processors you can have instructions executed in parallel (even if the CPU is single-threaded) and some of them executed out of order; vector instructions muddy the question even more.
E.g. an SSE multiplication can multiply multiple numbers in a single instruction (while taking multiple cycles).
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
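If you want to do a quick measurement yourself without dropping to assembly, one rough approach is to time the operations over large NumPy arrays, so interpreter overhead is amortized and the hardware cost dominates (exact numbers will vary by CPU and array size):

    import timeit
    import numpy as np

    a = np.random.rand(1_000_000) + 1.0   # shift away from zero to keep division well-behaved
    b = np.random.rand(1_000_000) + 1.0

    add_time = timeit.timeit(lambda: a + b, number=200)
    div_time = timeit.timeit(lambda: a / b, number=200)

    # On most hardware, floating-point division is noticeably slower than addition.
    print(f"add: {add_time:.3f}s  div: {div_time:.3f}s  ratio: {div_time / add_time:.2f}")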

Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are being used in code, like 32, 64, 128 and so on (a very well known example being Minecraft, for instance).
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly, and that there you would always rather use these, because in a low-level language things get overcomplicated if you go above the register limit.
Will programs run any faster or save more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be replaced with different, faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly (e.g. multiplying by a power of two is cheap; see the sketch after this list).
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc. will use cache line sizes / page sizes / etc. that are powers of two, so you'd do well to give your important data structures nice power-of-two sizes).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
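A small Python illustration of the power-of-two identities referred to above; they hold for non-negative integers regardless of language, which is what lets compilers substitute cheap instructions:

    # Multiplying or dividing by a power of two is a shift; the remainder is a mask.
    x = 12345

    assert x * 8 == x << 3           # multiply by 2**3
    assert x // 16 == x >> 4         # divide by 2**4 (non-negative x)
    assert x % 32 == x & (32 - 1)    # remainder modulo 2**5

    # Checking whether n is a power of two with a single bit trick.
    def is_power_of_two(n):
        return n > 0 and (n & (n - 1)) == 0

    print(is_power_of_two(64), is_power_of_two(100))   # True False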
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
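A sketch of the menu-options idea (in Python for consistency with the other examples here; the flag names are invented):

    # Each option gets its own bit, so 32 options fit in one 32-bit integer.
    RESIZABLE = 1 << 0
    REMOVABLE = 1 << 1
    EDITABLE  = 1 << 2
    HIDDEN    = 1 << 3

    # Combine options with bitwise OR instead of adding magic numbers.
    window_flags = RESIZABLE | EDITABLE

    # Test and clear individual options with AND / AND NOT.
    if window_flags & EDITABLE:
        print("window is editable")
    window_flags &= ~RESIZABLE        # turn the resizable option off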
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean "use constants that are round numbers in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change, with its attendant ramifications, would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but heap memory is usually allocated in power-of-2-sized chunks. E.g. if you allocated a 100-byte buffer, you'd probably get 128 bytes reserved.
No, your code will run the same way no matter what number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024..., they are normally common due to optimization of space. For example, an 8-bit pointer is capable of pointing to 256 addresses (a power of 2), so if you use fewer than 256 you are wasting part of your pointer's range; that is why you would normally allocate a 256-entry buffer. The same applies to all other power-of-2 numbers.
In most cases the answer is almost always no, there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases when you're looping over a structure that fills the cache in such a way that you get cache collisions every time you loop through your array/structure. This case is very rare and shouldn't be pre-optimized unless your code is performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

What datasize is suitable for storing an RFID column in SQL server?

I'm new to the whole RFID arena.
I need to store an RFID per asset in the database. No decision has yet been made on what system will feed that particular field (or fields?), so I just want to set aside some space right now.
Oracle has this whole "Identity" package that handles, amongst other things, the different versions and types of RFID, but I haven't seen anything for SQL Server.
Perhaps I'm overcomplicating things, but I've searched widely and found no reference to how big such a tag is, or even whether it is suitable for being stored in one field, or if you need multiple.
So, what columns should I have, and what should their sizes be?
Would nvarchar(10) suffice? nvarchar(20)?
There is no fixed data size for RFID tags. In fact they can store from a few bytes to a few kilobytes. They can even be used to hack into an unprotected system by storing code within them. Thus you should treat any data that you receive from them with the same suspicion that you would treat data from anywhere else.
As for a unique identifier, if you allocate on the basis of it being no larger than a UUID then you should be OK.
AFAIK the generation 1 RFID tags are generally 128 bits, where 96 bits are the unique ID and the rest is checksum. But I strongly suspect that newer generations are at least 256 bits and it will continue to grow. I'm by no means an expert, so you may want to wait for another answer:)
So I'd go with a char or varchar of sufficient size, which should be easy to scale later.
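Back-of-the-envelope sizing for those bit counts, assuming the tag ID is stored as a hex string (the 96-bit and 256-bit figures are the ones quoted above):

    # Each hex character encodes 4 bits, so a tag ID of `bits` bits needs ceil(bits / 4) characters.
    def hex_chars_needed(bits):
        return (bits + 3) // 4

    print(hex_chars_needed(96))    # 24 -> e.g. varchar(24) would hold a 96-bit ID
    print(hex_chars_needed(256))   # 64 -> leave headroom, e.g. varchar(64) or larger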
Unfortunately, the standards in the RFID world at the moment specify all sorts of useful things, but not the tag size (these standards tend to be industry-specific and the ability to track cows may not map that well to what you have planned).
My advice would be to allocate something to hold enough for test data (nvarchar(10) should be fine) and then size it properly when you choose an actual implementation, at which point the vendor will be able to give you that information.
There is no set size for RFID tags, but I believe that as it currently stands (Jan 2011) 2KB is the maximum size in the HF specification; this includes the tag ID, user data, and data set by the manufacturer that is required for the tag to function.
In the UHF specification, instead of unique IDs you have an EPC which is editable by a reader if the tag is unlocked, unlike unique IDs in HF which are set and locked by the manufacturer.
At the end of the day, you need to read the data layout for the memory of the tag you're using. Manufacturers will provide the technical document you need that explains the memory addresses available, and thus the maximum size you need.