I'm just curious how data is physically transferred through logic gates. For example, does the pixel on my monitor that is 684 pixels down and 327 pixels to the right have a specific set or path of transistors in the GPU that only care about populating that pixel with the correct color? Or is it more random?
Here is a cell library en.wikipedia.org/wiki/Standard_cell that is used when building a chip for a particular foundry, kind of like an instruction set used when compiling: the machine code for ARM is different from x86, but the same code can be compiled for either (if you have a compiler for that language for each, of course). So there is a list of standard functions (AND, OR, etc., plus more complicated ones) that you compile your Verilog/VHDL against. A particular cell is hardwired. There is an intimate relationship between the cell library, the foundry, and the process used (28nm, 22nm, 14nm, etc.). Basically you need to construct the chip one thin layer at a time using a photographic-like process, and the specific semiconductors and other factors for one piece of equipment may vary from another, so the 28nm technology may be different from the 14nm technology and you may need to construct an AND gate differently, thus a different cell library. And that doesn't necessarily mean there is only one AND gate cell for a particular process at a particular foundry; it is possible that more than one has been developed.
As far as how pixels and video work: there is a memory somewhere, generally on the video card itself. Depending on the screen size, number of colors, etc., that memory can be organized differently. There may also be multiple frames used to avoid flicker and provide a higher frame rate. So you might have one screen's image at address 0x000000 in this memory; the video card will extract the pixel data starting at this address, while software is generating the next frame at, say, 0x100000.
Then, when it is time to switch frames based on the frame rate, the logic may switch to displaying the image at 0x100000 while software modifies 0x000000. So for a particular video mode, the first three bytes in the memory at some known offset could be the pixel data for the (0,0) pixel, the next three for (1,0), and so on. For a line width like 684 pixels, they could start the second line at offset 684*3, but they might instead start it at the aligned offset 0x400.
In any case, for a particular mode, the OFFSET within a frame of video memory will be the same for a particular pixel so long as the mode settings don't change. The video card, following the rules of the interface used (VGA, HDMI, or interfaces specific to a phone LCD, for example), has logic that reads that memory and generates the right pulses or analog level signal for each pixel.
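To make the offset arithmetic concrete, here is a small JavaScript sketch (the mode width, stride, and buffer addresses are made up for illustration; real display hardware does this in fixed-function logic, not software):

// Illustrative only: where pixel (x, y) lives in a linear framebuffer,
// assuming 3 bytes per pixel (RGB) and a fixed number of bytes per scanline.
const BYTES_PER_PIXEL = 3;

function pixelOffset(x, y, strideBytes, frameBase) {
  return frameBase + y * strideBytes + x * BYTES_PER_PIXEL;
}

// Pixel 327 across, 684 down, in a tightly packed 1024-pixel-wide mode:
console.log(pixelOffset(327, 684, 1024 * BYTES_PER_PIXEL, 0x000000)); // front buffer
console.log(pixelOffset(327, 684, 1024 * BYTES_PER_PIXEL, 0x100000)); // back buffer being drawn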
I have heard the term "single cycle CPU" and was trying to understand what it actually means. Is there a clear, agreed-upon definition and consensus on what it means?
Some homebrew "single cycle CPUs" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks."
This rings true to me.
Needing both the rising and falling edges (or high and low phases) is effectively the same as needing the rising edges of two cycles of a single clock that's running twice as fast; and that would be a "two cycle" CPU, wouldn't it?
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle CPU must perform all state-changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible provided the data storage is all synchronous. If we have a synchronous system that has settled, then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is, for example, async RAM, then surely the control lines would be changing whilst that data is being stored, leading to unintended behaviours.
Am I wrong? Are there any examples of a "single cycle CPU" that include async storage in the mix?
It would seem that using async RAM in one's design means one must use at least two logical clock cycles to achieve the state change.
Of course, with some more complexity one could perhaps have a CPU that uses a single edge where instructions use solely synchronous components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle CPU, rather a mostly single cycle CPU.
So no CPU that writes to async RAM (or another async component) can honestly be considered a single cycle CPU, because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (i.e. falling and rising) and this breaks the single clock principle.
So is there a commonly accepted single cycle CPU and are we applying the term consistently?
What's the story?
(Also posted in my Hackaday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also in a private group on Hackaday.)
=====
Update: Looking at simple MIPS implementations, it seems the models use synchronous memory and so can probably operate off a single edge, and maybe they do, which would warrant the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term used inconsistently elsewhere, i.e. by most homebrew TTL computers out there?
Or am I just plain wrong?
====
Update:
Some may have misunderstood my point.
Numerous homebrew TTL CPUs claim "single cycle CPU" status (for the purposes of this discussion I'm not interested in more complex beasts that do pipelining or whatever).
By single cycle these CPUs typically mean that they do something like advancing the PC on one edge of the clock and then using the opposing edge of the clock to update flip-flops with the result. Or they will use the other phase of the clock to update async components like latches and SRAM.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle or even a second clock. BTW Ben Eater in his vids even compares the inverted clock that he uses to update his SRAM to being a second clock.
My objection to the use of "single cycle CPU" for such CPUs (basically most/all homebrew TTL CPUs I've seen, as they all work that way) is that I agree with ZipCPU: using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock, and this makes a mockery of the "single cycle" claim.
If the use of the opposing edge is effectively the same as using a single edge of two clock cycles, then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand, it seems perfectly possible to build a CPU that uses only sync components (i.e. edge-triggered flip-flops) and only a single edge, where on each edge we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is very distinctly different to the common TTL CPU pattern of using both edges of the clock.
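To illustrate what I mean, here is a toy JavaScript sketch of the single-edge principle (not HDL; the "instruction" is invented, and the only point is that every register commits simultaneously from values computed off the previous state):

// Toy model of a purely synchronous, single-edge machine: combinational logic
// reads the *old* state, and the PC, accumulator and memory all commit together
// on the simulated rising edge, giving CPI = 1.
let state = { pc: 0, acc: 0, mem: [5, 7, 0, 0] };

function combinational(s) {
  // made-up "instruction": acc += mem[pc], store the result to mem[2], advance the PC
  const sum = s.acc + s.mem[s.pc];
  const nextMem = s.mem.slice();
  nextMem[2] = sum;
  return { pc: s.pc + 1, acc: sum, mem: nextMem };
}

function risingEdge() {
  state = combinational(state); // everything updates "at once"; settling happens between edges
}

risingEdge();
risingEdge();
console.log(state); // { pc: 2, acc: 12, mem: [5, 7, 12, 0] }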
BTW, my impression of FPGAs (which I'm not referring to here) is that the storage elements in an FPGA are all synchronous flip-flops. I don't know for sure, but that's what my reading suggests. Anyway, if this is true, then a trivial FPGA-based CPU probably has a CPI of 1 and uses only, say, the +ve edge, and so these might well meet my narrow definition of "single cycle CPU". Also, my reading suggests that various MIPS implementations (educational efforts, probably) meet my definition OK.
This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU", meaning truly using only one clock edge to set everything in motion for that instruction, then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases, but we still talk about their frequency in whole cycles, not the edge frequency. The exception is cases like DDR memory, where we do talk about the transfer rate being twice the memory clock frequency. What sets that apart is always using both edges, and for approximately equal things, not just some extra timing/synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to the store data is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just from the longest critical path of any instruction), and a couple of gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back.) If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve building a long chain of gate delays intentionally with inverters dedicated to that purpose. Or perhaps even an analog RC time delay, but either way that's obviously worse than just using the other edge of the main clock, and you'd only ever do this as an exercise in single-cycle dogmatic purity. (It can also be flaky because gate-delay isn't constant, depending on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
The definition says that a single cycle CPU executes one instruction per cycle. So in theory one can conclude that there are other CPUs that take more or fewer than one cycle per instruction. You can look up concepts like the multi-cycle processor and the pipelined processor (instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps," according to Wikipedia. I don't know exactly how it works, but maybe it just uses the available registers (maybe instead of using, for example, EAX, ECX is used like EAX, or maybe it works some other way); what is certainly true is that the number of registers is increasing, so maybe that's one of the main purposes. Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer to the question "is a single cycle CPU possible if asynchronous components are used" depends on the CPU controller that controls both the CPU and RAM with opcodes. I found interesting information about this here: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU", it should be the simplest possible construction. The term "asynchronous" implies something more complex than "synchronous", so the two terms are not equivalent. It's something like asking "Can a basic data type be considered a structure?". In my opinion the word "single" means the simplest possible, and "asynchronous" implies some modification, hence more complexity, so I think it's not possible. Maybe the phrase "are used" can be read as "are used at the time", if some switch or controller can turn off asynchronous mode and make everything as simple as possible, but generally I just think it's not possible.
"modern" because that definition may change over time (and specifically I mean desktop browsers)
"handle" because that may vary depending on machine configurations/memory, but specifically I mean a general use case.
This question came to mind over a particular problem I'm trying to solve involving large datasets.
Essentially, whenever a change is made to a particular dataset I get the full dataset back and I have to render this data in the browser.
So for example, over a websocket I get a push event that tells me a dataset has changes, and then I have to render this dataset in HTML by grabbing an existing DOM element, duplicating it, populating the elements with data from this set using classnames or other element identifiers, and then add it back to the DOM.
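In code, the per-update render I'm describing is roughly this (a simplified sketch; the selectors and field names are placeholders, not my real markup):

// Simplified sketch of the clone-template-and-populate approach described above.
// ".row-template", ".name" and ".value" are placeholder selectors.
function renderDataset(dataset, container) {
  const template = document.querySelector(".row-template");
  const fragment = document.createDocumentFragment(); // batch everything into one insertion

  for (const item of dataset) {
    const row = template.cloneNode(true);
    row.classList.remove("row-template");
    row.querySelector(".name").textContent = item.name;
    row.querySelector(".value").textContent = item.value;
    fragment.appendChild(row);
  }

  container.replaceChildren(fragment); // one DOM update for the whole set
}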
Keep in mind that any object (JSON) in this dataset may have as many as 1,000+ child objects, and there may be as many as 10,000+ parent objects, so as you can see the returned dataset may run upwards of 1,000,000 to 10,000,000 data points or more.
Now the fun part comes when I have to render this stuff. For each data point there may be 3 or 4 tags used to render and style the data, and there may be event listeners for any of these tags (maybe on the parent container to lighten things up using delegation).
To sum it all up, there can be a lot of incoming information that needs to be rendered and I'm trying to figure out the best way to handle this scenario.
Ideally, you'd just want to render the changes for that single data point that has changes rather than re-rendering the whole set, but this may not be an option due to how the backend was designed.
My main concern here is to understand the limitations of the browser/DOM and to look at this problem through the lens of the frontend. There are some changes that should happen on the backend for sure (data design, caching, pagination), but that isn't the focus here.
This isn't a typical use case for HTML/DOM, as I know there are limitations, but what exactly are they? Are we still capped out at about 3000-4000 elements?
I've got a number of related subquestions for this that I'm actively looking up but I thought it'd be nice to share some thoughts with the rest of the stackoverflow community and try to pool some information together about this issue.
What is a "reasonable" number of DOM elements that a modern browser can handle before it starts becoming slow/non-responsive?
How can I benchmark the number of DOM elements a browser can handle?
What are some strategies for handling large datasets that need to be rendered (besides pagination)?
Are templating frameworks like mustache and handlebars more performant for rendering html from data/json (on the frontend) than using jQuery or Regular Expressions?
Your answer is: 1 OR millions. I'm going to copy/paste an answer from a similar question on SO.
To be honest, if you really need an absolute answer to this question, then you might want to reconsider your design.
No answer given here will be right, as it depends upon many factors that are specific to your application: e.g. heavy vs. little CSS use, size of the divs, amount of actual graphics rendering required per div, target browser/platform, number of DOM event listeners, etc.
Just because you can doesn't mean that you should! :-)
See: how many div's can you have before the dom slows and becomes unstable?
This really is an unanswerable question, with too many factors at too many angles. I will say this, however: in a single page load, I used a JavaScript setInterval at 1ms to continually add new divs to a page, with the ID incrementing by 1. My Chrome browser just passed 20,000 divs and is using 600MB of RAM.
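For anyone wanting to repeat that experiment, the benchmark was essentially this (a rough sketch, not a rigorous harness):

// Rough benchmark sketch: keep appending divs and watch when the page gets sluggish.
let count = 0;
const timer = setInterval(() => {
  const div = document.createElement("div");
  div.id = "div-" + (++count);
  div.textContent = "div #" + count;
  document.body.appendChild(div);
  if (count % 1000 === 0) {
    console.log(count + " divs added");
  }
}, 1);

// Run clearInterval(timer) from the console once things start to crawl.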
This is a question for which only a statistically savvy answer could be accurate and comprehensive.
Why
The appropriate equation is this, where N is the number of nodes, bytesN is the total bytes required to represent them in the DOM, the node index range is n ∈ [0, N), bytesOverhead is the amount of memory used for a node with absolute minimum attribute configuration and no innerHTML, and bytesContent is the amount of memory used to fill such a minimal node.
bytesN = Σ_{n ∈ [0, N)} (bytesContent_n + bytesOverhead_n)
The value requested in the question is the maximum value of N in the worst case handheld device, operating system, browser, and operating conditions. Solving for N for each permutation is not trivial. The equation above reveals three dependencies, each of which could drastically alter the answer.
Dependencies
The average size of a node is dependent on the average number of bytes used in each to hold the content, such as UTF-8 text, attribute names and values, or cached information.
The average overhead of a DOM object is dependent on the HTTP user agent that manages the DOM representation of each document. W3C's Document Object Model FAQ states, "While all DOM implementations should be interoperable, they may vary considerably in code size, memory demand, and performance of individual operations."
The memory available to use for DOM representations is dependent upon the browser used by default (which can vary depending on what browser handheld device vendors or users prefer), user override of the default browser, the operating system version, the memory capacity of the handheld device, common background tasks, and other memory consumption.
The Rigorous Solution
One could run tests to determine (1) and (2) for each of the common http user agents used on handheld devices. The distribution of user agents for any given site can be obtained by configuring the logging mechanism of the web server to place the HTTP_USER_AGENT if it isn't there by default and then stripping all but that field in the log and counting the instances of each value.
The number of bytes per character would need to be tested for both attributes values and UTF-8 inner text (or whatever the encoding) to get a clear pair of factors for calculating (1).
The memory available would need to be tested too under a variety of common conditions, which would be a major research project by itself.
The particular value of N chosen would have to be ZERO to handle the actual worst case, so one would choose a certain percentage of typical cases of content, node structures, and run time conditions. For instance, one may take a sample of cases using some form of randomized in situ (within normal environmental conditions) study and find an N that satisfies 95% of those cases.
Perhaps a set of cases could be tested in the above ways and the results placed in a table. Such would represent a direct answer to your question.
I'm guessing it would take a well-educated mobile software engineer with a flair for mathematics, especially statistics, five full-time weeks to get reasonable results.
A More Practical Estimation
One could guess the worst case scenario. With a few full days of research and a few proof-of-concept apps, this proposal could be refined. Absent of the time to do that, here's a good first guess.
Consider a cell phone that permits 1 GByte for the DOM because normal operating conditions use 3 GBytes of its 4 GBytes for the above-mentioned purposes. One might assume the average consumption of memory for a node to be as follows, to get a ballpark figure.
2 bytes per character for 40 characters of inner text per node
2 bytes per character for 4 attribute values of 10 characters each
1 byte per character for 4 attribute names of 4 characters each
160 bytes for the C/C++ node overhead in the less efficient cases
In this case N_worst_case, the worst-case max nodes, is
N_worst_case = (1,024 × 1,024 × 1,024) / (2 × 40 + 2 × 4 × 10 + 1 × 4 × 4 + 160)
             = 1,073,741,824 / 336
             ≈ 3,195,660
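As a quick check of that arithmetic (a throwaway JavaScript snippet using the same assumed figures):

// Ballpark worst-case node count from the assumptions above.
const budgetBytes = 1024 * 1024 * 1024;   // 1 GiB assumed available for the DOM
const bytesPerNode = 2 * 40               // inner text
                   + 2 * 4 * 10           // attribute values
                   + 1 * 4 * 4            // attribute names
                   + 160;                 // per-node overhead
console.log(Math.floor(budgetBytes / bytesPerNode)); // 3195660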
I would not, however, build a document in a browser with three million DOM nodes if it could be at all avoided. Consider employing the more common practice below.
Common Practice
The best solution is to stay far below what Nworst_case might be and simply reduce the total number of nodes to the degree possible using standard HTTP design techniques.
Reduce the size and complexity of that which is displayed on any given page, which also improves visual and conceptual clarity.
Request minimal amounts of data from the server, deferring content that is not yet visible using windowing techniques (see the sketch after this list) or balancing response time with memory consumption in well-planned ways.
Use asynchronous calls to assist with the above minimalism.
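For the windowing technique mentioned above, here is a minimal sketch of the idea (the fixed row height, the row data shape, and the spacer element are assumptions for illustration, not a drop-in component):

// Minimal windowing/virtualization sketch: only the rows that intersect the viewport
// exist in the DOM. Assumes a scrollable container whose single child is a
// position:relative spacer div, and rows of fixed height.
const ROW_HEIGHT = 20; // assumed fixed row height in px

function renderVisibleRows(container, rows) {
  const spacer = container.firstElementChild;         // tall div that preserves the scrollbar range
  spacer.style.height = rows.length * ROW_HEIGHT + "px";

  const first = Math.floor(container.scrollTop / ROW_HEIGHT);
  const count = Math.ceil(container.clientHeight / ROW_HEIGHT) + 1;

  spacer.replaceChildren(...rows.slice(first, first + count).map((row, i) => {
    const div = document.createElement("div");
    div.style.position = "absolute";
    div.style.top = (first + i) * ROW_HEIGHT + "px";
    div.textContent = String(row);                    // rows assumed to be simple values here
    return div;
  }));
}

// Re-render the visible slice whenever the user scrolls:
// container.addEventListener("scroll", () => renderVisibleRows(container, rows));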
For those wondering: Google has its DOM size recommendation:
DOM size recommendations
"
An optimal DOM tree:
Has less than 1500 nodes total.
Has a maximum depth of 32 nodes.
Has no parent node with more than 60 child nodes.
In general, look for ways to create DOM nodes only when needed, and destroy them when no longer needed.
If your server ships a large DOM tree, try loading your page and manually noting which nodes are displayed. Perhaps you can remove the undisplayed nodes from the loaded document, and only create them after a user gesture, such as a scroll or a button click.
If you create DOM nodes at runtime, Subtree Modification DOM Change Breakpoints can help you pinpoint when nodes get created.
If you can't avoid a large DOM tree, another approach for improving rendering performance is simplifying your CSS selectors. See Reduce The Scope And Complexity Of Style Calculations.
"
There are a number of ways the DOM elements can become too many. Here is a React + d3 component I have been using to render many elements and get a more real-world sense of the DOM's limits:
import React from "react";
import { select } from "d3"; // d3's select; could equally come from "d3-selection"

export const App = React.memo((props) => {
  const gridRef = React.useRef(null);

  React.useEffect(() => {
    if (gridRef.current) {
      // One d3 data join that appends 10,000 divs under the container.
      const table = select(gridRef.current);
      table
        .selectAll("div")
        .data([...new Array(10000)])
        .enter()
        .append("div")
        .text(() => "testing");
    }
    if (props.onElementRendered) {
      props.onElementRendered(); // optional callback, e.g. to stop a timer
    }
  }, []); // run once on mount

  return <div ref={gridRef} />;
});
On a 2021 MacBook Pro with 16GB of memory running Chrome, I'm seeing serious delay (I think on the paint step) starting at around 30,000 elements.
Just to add another data point. I loaded the single-page GNU Bison manual, which claims to be 2064K bytes. In the console, I typed document.querySelectorAll('*') and the answer was 22183 nodes, which rather exceeds Google's alleged "optimal sizes".
I detected no delay loading the page (50Mb ethernet connection). Once loaded, I detected no delay whatsoever clicking on internal links, scrolling, etc.
This was on my relatively massively powered desktop machine. Tried the same thing on my Galaxy Note 4 (ancient wifi connection, def not 50Mb). This time (no surprise) I had to wait a few seconds (<5) for it to load. After that, clicking on links and scrolling was again about as instantaneous as my eye could see.
I don't doubt that 30,000 nodes of React could spell trouble, nor that I can have vastly more than that number of framework-free simple HTML nodes without the slightest problem. The notion that I should worry about more than 1500 nodes sounds pretty crazy to me, but I'm sure YMMV.
So the question is: How does a computer go from binary code representing the letter "g" to the correct combination of pixel illuminations?
Here is what I have managed to figure out so far. I understand how the CPU takes the input generated by the keyboard and stores it in the RAM, and then retrieves it to do operations on using an instruction set. I also understand how it does these operations in detail. Then the CPU transmits the output of an operation which for this example is an instruction set that retrieves the "g" from the memory address and sends it to the monitor output.
Now my question is does the CPU convert the letter "g" to a bitmap directly or does it use a GPU that is either built-in or separate, OR does the monitor itself handle the conversion?
Also, is it possible to write your own code that interprets the binary and formats it for display?
In most systems the CPU doesn't speak with the monitor directly; it sends commands to a graphics card which in turn generates an electric signal that the monitor translates into a picture on the screen. There are many steps in this process and the processing model is system dependent.
From the software perspective, communication with the graphics card is made through a graphics card driver that translates your program's and the operating system's requests into something that the hardware on the card can understand.
There are different kinds of drivers; the simplest to explain is a text mode driver. In text mode the screen is composed of a number of cells, each of which can hold exactly one of a set of predefined characters. The driver includes a predefined bitmap font that describes what each character looks like by specifying which pixels are on and which are off. When a program requests a character to be printed on the screen, the driver looks it up in the font and tells the card to change the electric signal it's sending to the monitor so that the pixels on the screen reflect what's in the font.
The text mode has limited use though. You get only one choice of font, a limited choice of colors, and you can't draw graphics like lines or circles: you're limited to characters. For high quality graphics output a different driver is used. Graphics cards typically include a memory buffer that contains the contents of the screen in a well-defined format, like "n bits per pixel, m pixels per row, ...". To draw something on the screen you just have to write to this memory buffer. In order to do that, the driver maps the buffer into the computer's memory so that the operating system and programs can use the buffer as if it were a part of RAM. Programs can then directly write the pixels they want to show, and to put the letter g on the screen it's up to the application programmer to output pixels in a pattern that resembles that letter. Of course there are many libraries to help programmers do this; otherwise the current state of the graphical user interface would be even sorrier than it is.
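As a toy illustration of the font-lookup-and-write-pixels idea described above (the glyph bitmap, resolution, and pixel format here are invented; a real driver does the equivalent against the memory-mapped buffer):

// Toy sketch: copy an 8x8 1-bit glyph into a linear RGB framebuffer.
// Each byte of the glyph is one row; bit 7 is the leftmost pixel.
const GLYPH_G = [0x3c, 0x42, 0x40, 0x4e, 0x42, 0x42, 0x3c, 0x00]; // made-up 'g'-ish shape

function drawGlyph(framebuffer, widthPixels, xCell, yCell, glyph) {
  for (let row = 0; row < 8; row++) {
    for (let col = 0; col < 8; col++) {
      const on = (glyph[row] >> (7 - col)) & 1;
      const offset = ((yCell * 8 + row) * widthPixels + (xCell * 8 + col)) * 3;
      framebuffer[offset] = framebuffer[offset + 1] = framebuffer[offset + 2] = on ? 255 : 0;
    }
  }
}

const fb = new Uint8Array(640 * 480 * 3); // pretend memory-mapped screen, 3 bytes per pixel
drawGlyph(fb, 640, 10, 5, GLYPH_G);       // draw the glyph in character cell (10, 5)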
Of course, this is a simplification of what actually goes on in a computer, and there are systems that don't work exactly like this, for example some CPUs do have an integrated graphics card, and some output devices are not based on drawing pixels but plotting lines, but I hope this clears the confusion a little.
See here http://en.m.wikipedia.org/wiki/Code_page_437
It describes the character based mechanism used to display characters on a VGA monitor in character mode.
I'm trying to write a custom XAudio2 effect that involves a Fourier transform. However, the number of samples given to the Process method on each call is not a power of 2 (a precondition of the Fourier transform implementation I have).
Is there a way to force power of 2 sized samples? Is there a technique to allow working with non power of 2 sizes?
Don't send samples to the FFT on every call in which you are given samples. Buffer (save) them up until you have at least a power-of-2 number of samples or more, then process a power-of-2-sized block from your intermediate buffer. Rinse and repeat.
Also, newer FFTs will often allow sizes with prime factors larger than 2.
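A sketch of that buffering approach, written in JavaScript for brevity (the real XAudio2 effect code would be C++; FFT_SIZE and runFFT are placeholders):

// Accumulate incoming samples until a full power-of-2 block is available,
// then hand complete blocks to the FFT and keep the remainder for next time.
const FFT_SIZE = 1024;
let pending = [];

function onProcess(samples) {            // called with however many samples arrive per Process call
  pending.push(...samples);
  while (pending.length >= FFT_SIZE) {
    const block = pending.slice(0, FFT_SIZE);
    pending = pending.slice(FFT_SIZE);
    runFFT(block);                       // placeholder for the actual transform + effect
  }
}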
If your implementation requires a power-of-2 sample size, then you can pad the input to force it to fit. Zero padding seems to be the easiest/most straightforward approach.
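If you go the padding route, the sketch is just this (JavaScript for illustration; adapt it to your effect's language):

// Pad a block of samples with zeros up to the next power of 2.
function zeroPadToPowerOfTwo(samples) {
  const target = Math.pow(2, Math.ceil(Math.log2(samples.length)));
  const padded = new Float32Array(target); // zero-filled by default
  padded.set(samples);
  return padded;
}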
Here is an article that explains another way to do it:
The Chirp z-Transform Algorithm and Its Application
I'm working on a game.
The game requires entities to analyse an image and head towards pixels with specific properties (high red channel, etc.)
I've looked into Pixel Bender, but this only seems useful for writing new colors to the image. At the moment, even at a low resolution (200x200), just one entity scanning the image slows things to 1-2 frames/second.
I'm embedding the image and instantiating it as a Bitmap as a child of the stage. The 1-2 FPS situation occurs when using BitmapData.getPixel() (on each pixel) with a distance calculation beforehand.
I'm wondering if there's any way I can do this more efficiently... My first thought was some sort of spatial partitioning coupled with splitting the image up into many smaller pieces.
I also feel like Pixel Bender should be able to help somehow, however I've had little experience with it.
Cheers for any help.
Jonathan
Let us call the pixels which entities head towards "attractors" because they attract the entities.
You describe a low frame rate due to scanning for attractors. This indicates that you may possibly be scanning an image at every frame. You don't specify whether the image scanned is static or changes as frequently as, e.g., a video input. If the image is changing with every frame, so that you must re-calculate attractors somehow, then what you are attempting is real-time computer vision with the ABC Virtual Machine, please see below.
If you have an unchanging image, then the most important optimization you can make is to scan the image one time only, then save a summary (or "memoization") of the locations of the attractors. At each rendering frame, rather than scan the entire image, you can search the list or array of known attractors. When the user causes the image to change, you can recalculate from scratch, or update your calculations incrementally -- as you see fit.
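A rough sketch of that memoization, written in JavaScript for clarity (an AS3 version would use BitmapData and Vector, but the shape is the same; the red threshold and the 32-bit ARGB pixel layout are assumptions for the example):

// One-time scan: record the coordinates of "attractor" pixels (high red channel),
// so entities search this list instead of rescanning the image every frame.
function findAttractors(pixels, width, height, redThreshold) {
  const attractors = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const red = (pixels[y * width + x] >> 16) & 0xff; // 0xAARRGGBB layout
      if (red >= redThreshold) {
        attractors.push({ x, y });
      }
    }
  }
  return attractors; // recompute only when the image actually changes
}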
If you are attempting to do real-time computer vision with ActionScript 3, I suggest you look at the new vector types of Flash 10.1 and also that you look into using either abcsx to write ABC assembly code, or use Adobe's Alchemy to compile C onto the Flash runtime. ABC is the byte code of Flash. In other words, reconsider the use of AS3 for real-time computer vision.
BitmapData has a getPixels method (notice it's plural). It returns a byte array of all the pixels, which can be iterated much faster than a for loop with a call to getPixel inside, nested inside another for loop. Unfortunately, byte arrays are, as their name implies, one-dimensional arrays of bytes, so iterating each pixel (4 bytes) requires using a for loop, not a for-each loop. You can access each pixel's color channels individually by default, and this sounds like what you want (finding pixels with a "high red channel"), so you won't have to bitwise-and each pixel value to isolate a particular channel.
I read somewhere that getPixel is very slow, so that's where I figured you'd save the most. I could be wrong, so it'd be worth timing it.
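The iteration over the flat byte array looks roughly like this (sketched over a plain JavaScript array; in AS3 you'd read the same 4-bytes-per-pixel ARGB layout out of the ByteArray returned by getPixels):

// getPixels-style data: 4 bytes per pixel in A, R, G, B order, so the red
// value of pixel i is simply the byte at i * 4 + 1.
function countHighRedPixels(bytes, threshold) {
  let count = 0;
  for (let i = 0; i < bytes.length; i += 4) {
    const red = bytes[i + 1]; // byte 0 is alpha, byte 1 is red
    if (red >= threshold) {
      count++;
    }
  }
  return count;
}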
I would say Heath Hunnicutt's answer is a good one. If the image doesn't change, just store all the color values in a Vector or ByteArray or whatever and use it as a lookup table so you don't need to call getPixel() every frame.