Couchbase - Large documents or lots of smaller ones?

I am building an application which allows users to post content. This content can then be commented on.
Assume the following:
A document for the content is between 200KB and 3MB in size, depending on the text content.
Each comment is between 10KB and 100KB in size.
There could be 1 comment, or 1000. There are no limits.
My question is, when storing the content, should the individual comments be stored inside the same document or should they be broken up?

I'd certainly keep the post content and the comments separate, assuming there will be parts of the application where the posts would be previewed/used without comments.
For the comments themselves (assuming they are separate), I'd say many smaller documents would typically be better. If a post has hundreds or thousands of comments, you're not going to want all of them immediately, so it makes sense to fetch comment documents only as they are required (to be displayed), rather than loading MBs of comments only to use less than 100KB of them at a time.
You might not require comments to be stored individually, but provided your keys are logical and follow a predictable scheme, I see no reason not to store them that way. If you do want some grouping, however, I'd avoid grouping more comments per document than would be typical for your usage.
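As a rough sketch of what such a predictable key scheme could look like with the Couchbase Node.js SDK (the bucket name, key format and document shape here are assumptions for illustration, not anything Couchbase prescribes):

```typescript
import * as couchbase from "couchbase";

// Store each comment as its own document under a predictable key,
// so any comment can be fetched directly without touching the post body.
async function addComment(postId: string, commentNo: number, body: string): Promise<string> {
  // Connection details and bucket name are assumptions for this sketch.
  const cluster = await couchbase.connect("couchbase://localhost", {
    username: "app_user",
    password: "app_password",
  });
  const comments = cluster.bucket("content").defaultCollection();

  // Key scheme: comment::<postId>::<sequence number>
  const key = `comment::${postId}::${commentNo}`;
  await comments.upsert(key, {
    type: "comment",
    postId,
    body,
    created: new Date().toISOString(),
  });
  return key;
}
```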
If you are using views, there may be some benefit to grouping more comments into single documents, but even then it would depend on your use case.
Couchbase itself imposes no software bottleneck on storing lots of documents individually; the only constraints are hardware constraints, and those would be very similar whether you had lots of small documents or fewer large ones.

I would go with a document for each comment, as Couchbase is brilliant at organizing what to keep in memory and what not. An application usually isn't displaying all comments when there are more than 25 (I think that's a number most applications use, e.g. displaying them in chunks of 25 with the newest at the top). So if the latest 25 comments are kept in memory while older ones are automatically written to disk, you strike an ideal balance between how you use your memory (always critical with Couchbase) and access time for your application. I had a similar decision to make in my application and it worked perfectly.
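As a sketch of that "chunks of 25" idea, with one document per comment you could page through them with a N1QL query along these lines (the bucket, field names and the required secondary index are assumptions):

```typescript
import * as couchbase from "couchbase";

// Fetch one "page" of the newest comments for a post (25 at a time).
// Assumes a secondary index covering (type, postId, created) exists.
async function latestComments(
  cluster: couchbase.Cluster,
  postId: string,
  offset = 0
) {
  const result = await cluster.query(
    `SELECT c.* FROM content AS c
     WHERE c.type = "comment" AND c.postId = $postId
     ORDER BY c.created DESC
     LIMIT 25 OFFSET $offset`,
    { parameters: { postId, offset } }
  );
  return result.rows;
}
```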

Related

Is higher number of different folders affecting website load speed?

I have one question about page load speed. Consider a case where I have different kinds of images on my web page, such as icons, logos, images inside content, etc.
I want to know whether having separate folders for each media category may affect the page load speed:
/logos
/icons
/images
Will the webpage load faster if the images of all categories were located in a single folder rather than in multiple ones?
Thanks in advance for your advice.
Even though performance-related questions often get closed because they can't be answered without benchmarks on the machine in question, this one is worth an answer: unless you run a potato-powered computer, you won't see any performance impact.
Directories are not actually physical folders like you would have in real life.
They are simply registers of pointers to the disk locations where your files are stored. (Of course this is massively over-simplified, as it involves file systems and more low-level details, but that's not needed at this point.)
To come back to your question, the difference between loading two files from two directories:
/var/foobar/dir1/image1.jpeg
/var/foobar/dir2/image2.jpeg
or one directory:
/var/foobar/dir1/image1.jpeg
/var/foobar/dir1/image2.jpeg
...is that your file system will have to look up two different directory tables. With modern file systems and moderate (even low-end) hardware, this causes no issues.
As @AjitZero mentioned, here your performance impact will come from the size of the files, the number of distinct HTTP requests (i.e. how many images, CSS files, scripts, etc.) and the way you cache data on the user's computer.
No, the number of folders doesn't affect the speed of page-load.
A lower number of HTTP requests does matter, however, so you can use sprite sheets.

Trying to find connectivity in the Wikipedia Graph: Should I use the API or the relevant SQL Dump?

I apologize for the slightly obnoxious nature of my question.
A few years ago, I came across a game that could be played on Wikipedia. The goal is to start from a random page and try to get to the 'Adolf Hitler' page by utilizing internal wikilinks within 5 clicks (six degrees of separation). I found this to be an interesting take on the small-world experiment and tried it a few times, and I was able to reach the target article within 5-6 clicks almost every time (however, there was no way for me to know whether that was the shortest path or not).
As a project, I want to find out the degree of separation between a few (maybe hundreds, or thousands, if feasible) random Wikipedia pages and Adolf Hitler's page in order to create a histogram of sorts. My intention is to do an exhaustive search in a DFS manner from the root page (restricting the 'depth' of the search to 10, to ensure that the search terminates in case it has picked a bad path or is running in cycles). So the program would visit every article reachable within 10 clicks of the root article and find the shortest way in which the target article is reachable.
I realise that the algorithm described above would certainly take too long to run. I have ideas for optimizations which I will play around with.
Perhaps I will use a single-source shortest-path BFS-based approach, which seems more feasible considering that the degree of the graph would be quite high (mentioned later).
I will not list all my algorithm ideas here, as they are not relevant to the question: in any possible implementation, I will have to query (either by downloading the relevant tables to my machine, or through the API) the:
pagelinks table, which contains information about all internal links in all pages, in the form of the 'id' of the page containing the wikilink and the 'title' of the page being linked to
page table, which contains the information needed to map a page 'title' to its 'id'. This mapping is needed because of the way data is stored in the pagelinks table.
Before I knew Wikipedia's schema, naturally, I started exploring the Wikipedia API and quickly found out that the following API query returns the list of all internal links on a given page, 500 at a time:
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=links&titles=Mahatma%20Gandhi&pllimit=max
Running this in MediaWiki's API sandbox a few times gives a request time of about 30ms for 500 link results returned. This is not ideal, as even 100 queries of this nature would end up taking 3 seconds, which means I would certainly have to reduce the scope of my search somehow.
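For reference, the kind of client I have in mind would simply follow the API's continue tokens, 500 links at a time, roughly like this (just a sketch, nothing specific to my project):

```typescript
// Collect all internal links of a page by following the API's
// "continue" tokens, 500 links per request (uses the global fetch of Node 18+).
async function getLinks(title: string): Promise<string[]> {
  const links: string[] = [];
  let cont: Record<string, string> | undefined = {};

  while (cont) {
    const params = new URLSearchParams({
      action: "query",
      format: "json",
      prop: "links",
      titles: title,
      pllimit: "max",
      plnamespace: "0", // main/article namespace only
      ...cont,
    });
    const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
    const data: any = await res.json();

    for (const page of Object.values(data.query.pages) as any[]) {
      for (const link of page.links ?? []) links.push(link.title);
    }
    cont = data.continue; // undefined once there are no more batches
  }
  return links;
}
```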
On the other hand, I could download the relevant SQL dumps for the two tables and use them with MySQL. English Wikipedia has around 5.5 million articles (which can be considered the vertices of the graph). The compressed sizes of the tables are around 5GB for the pagelinks table and 1GB for the page table. A single tuple of the page table is a lot bigger than one of the pagelinks table, however (maximum sizes of around 630 and 270 bytes respectively, by my estimate). Sorry, I cannot provide the number of tuples in each table, as I haven't yet downloaded and imported the database.
Downloading the database seems appealing because, since I would have the entire list of pages in the page table, I could resort to a single-source shortest-path BFS approach from Adolf Hitler's page by tracking all the internal backlinks. This would end up finding the degree of separation of every page in the database. I would also imagine that eliminating the bottleneck (the internet connection) would speed up the process.
Also, I would not be overusing the API.
However, I'm not sure that my desktop would be able to perform even on par with the API, considering the size of the database.
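For what it's worth, the dump-based idea boils down to an ordinary BFS over an in-memory backlink adjacency list, something like the sketch below (how the adjacency list gets built from the imported page/pagelinks tables is omitted, and its memory footprint is exactly what I'm unsure about):

```typescript
// Single-source BFS from the target page over backlinks.
// `backlinks` maps a page id to the ids of pages that link TO it
// (to be built from the imported page/pagelinks tables).
// The result maps every reachable page id to its click distance
// to the target, capped at maxDepth.
function degreesOfSeparation(
  backlinks: Map<number, number[]>,
  targetId: number,
  maxDepth = 10
): Map<number, number> {
  const dist = new Map<number, number>([[targetId, 0]]);
  const queue: number[] = [targetId];

  // A plain array with a moving head index serves as the FIFO queue.
  for (let head = 0; head < queue.length; head++) {
    const page = queue[head];
    const d = dist.get(page)!;
    if (d >= maxDepth) continue;

    for (const source of backlinks.get(page) ?? []) {
      if (!dist.has(source)) {
        dist.set(source, d + 1);
        queue.push(source);
      }
    }
  }
  return dist;
}
```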
What would be a better approach to pursue?
More generally, what kind of workload requires the use of an offline copy of the database rather than just the API?
While we are on the topic, in your experience, is such a problem even feasible without the use of supercomputing, or perhaps a server to run the SQL queries?

Is it okay for performance to use a lot of ID tags?

As mentioned in the title, I want to find out whether there is really some performance overhead when using a lot of ID attributes in HTML.
I know the difference between CLASSes and IDs, but I'm not sure about their performance. As far as I know, IDs have specific functionality both for JS and for the browser. The browser stores them in its memory somewhere, and JS uses that to access them much faster than traversing the whole document in search of a specific CLASS.
So, if I don't need to access the IDs with JS or anything else, will it be reasonable to use them in the HTML markup?
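To make concrete what I mean by the two access paths, this is the kind of thing I have in mind (just standard DOM calls, not a benchmark; the element names are made up):

```typescript
// An id lookup is a direct hit in the browser's id table, while a class
// selector typically involves selector matching; both are fast in modern
// engines, and ids that are never looked up just sit in the markup.
const byId = document.getElementById("comment-form");     // direct lookup
const byClass = document.querySelector(".comment-form");  // selector match

if (byId !== null) {
  byId.classList.add("highlight"); // only do work if the element exists on this page
}
```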
The simple answer is yes, but usually not by much unless there are many hundreds or thousands.
The detailed answer to this question (as stated) is "it depends".
It depends on:
Your definition of 'a lot'.
Some folks would consider 100 a lot, others 1,000, others 10,000.
Which browser is being used, and which version of the browser.
The machine being used, how fast the CPU is, etc.
The OS being used.
The internet (regional/local) speed at the time, needed to download all the div tags.
Where the divs are and whether any of the page can load without them.
In conclusion: given we're talking about web apps and the many differences between user clients, keep the number of divs low if possible.
I have never noticed performance degrade as the number of IDs used in HTML tags increases.
The thing that would degrade performance, in my opinion, would be using client-side scripting to manipulate the HTML elements the IDs are assigned to en masse.
This is my opinion based on experience. No research completed on the subject.

Why do people always encourage a single JS file for a website?

I have read some website development materials on the web, and every time a person asks about the organization of a website's JS, CSS, HTML and PHP files, people suggest a single JS file for the whole website. And the argument is speed.
I clearly understand that the fewer requests there are, the faster the page responds. But I have never understood the single-JS argument. Suppose you have 10 webpages and each webpage needs a JS function to manipulate the DOM objects on it. If you put all 10 functions in a single JS file and let that file execute on every webpage, 9 out of 10 functions are doing useless work, and CPU time is wasted searching for non-existent DOM objects.
I know that CPU time on an individual client machine is trivial compared to bandwidth on a single server machine. I am not saying that you should have many JS files on a single webpage. But I don't see anything going wrong if every webpage refers to 1 to 3 JS files and those files are cached on the client machine. There are many good ways to do caching: for example, you can use an expiry date or you can include a version number in the JS file name. Compared to cramming the functionality for all the needs of many webpages into one big JS file, I far prefer splitting the JS code into smaller files.
Any criticism/agreement on my argument? Am I wrong? Thank you for your suggestions.
A function does zero work unless called. So 9 uncalled functions are zero work, just a little extra space.
A client only has to make 1 request to download 1 big JS file, then it is cached on every other page load. Less work than making a small request on every single page.
I'll give you the answer I always give: it depends.
Combining everything into one file has many great benefits, including:
less network traffic - you might be retrieving one file, but you're sending/receiving multiple packets and each transaction has a series of SYN, SYN-ACK, and ACK messages sent across TCP. A large majority of the transfer time is establishing the session and there is a lot of overhead in the packet headers.
one location/manageability - although you may only have a few files, it's easy for functions (and class objects) to grow between versions. With the multiple-file approach, functions from one file sometimes call functions/objects from another file (e.g. AJAX in one file, then arithmetic functions in another - your arithmetic functions might grow to need to call the AJAX code and have a certain variable type returned). What ends up happening is that your set of files needs to be treated as one version, rather than each file being its own version. Things get hairy down the road if you don't have good management in place, and it's easy to fall out of line with JavaScript files, which are always changing. Having one file makes it easy to manage the version across each of your pages and your (one to many) websites.
Other topics to consider:
dormant code - you might think that the uncalled functions reduce performance by taking up space in memory, and you'd be right, but this cost is so minuscule that it doesn't matter. Functions are indexed in memory, and while the index table may grow, the effect is trivial for small projects, especially given today's hardware.
memory leaks - this is probably the largest reason why you wouldn't want to combine all the code, however this is such a small issue given the amount of memory in systems today and the better garbage collection browsers have. Also, this is something that you, as a programmer, have the ability to control. Quality code leads to less problems like this.
Why does it depend?
While it's easy to say "throw all your code into one file", that would be wrong. It depends on how large your code is, how many functions there are, who maintains it, etc. Surely you wouldn't pack your locally written functions into the jQuery package, and you may have different programmers maintaining different blocks of code - it depends on your setup.
It also depends on size. Some programmers embed encoded images as ASCII in their files to reduce the number of requests. That can bloat files. Surely you don't want to package everything into one 50MB file, especially if there are core functions that are needed for the page to load.
So, to bring my response to a close, we'd need more information about your setup, because it depends. Surely 3 files is acceptable regardless of size, combining where you see fit. It probably wouldn't really hurt network traffic, but 50 files is more unreasonable. I use the hand rule (no more than 5), but surely you'll see a benefit from combining those five 1KB files into one 5KB file.
Two reasons that I can think of:
Less network latency. Each .js requires another request/response to the server it's downloaded from.
More bytes on the wire and more memory. If it's a single file you can strip out unnecessary characters and minify the whole thing.
The Javascript should be designed so that the extra functions don't execute at all unless they're needed.
For example, you can define a set of functions in your script but only call them in (very short) inline <script> blocks in the pages themselves.
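A minimal sketch of that pattern (the shared file only defines functions, and every id and function name here is made up for illustration):

```typescript
// site.ts - shared by every page. Defining functions costs almost nothing;
// work only happens when a page actually calls one of them.
function initGallery(): void {
  const gallery = document.getElementById("photo-gallery");
  if (gallery === null) return;
  // ...wire up gallery behaviour here...
}

function initCommentForm(): void {
  const form = document.getElementById("comment-form");
  if (form === null) return;
  // ...wire up comment submission here...
}

// A page that needs only the gallery then has a one-line inline block:
//   <script>initGallery();</script>
```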
My line of thought is that you have fewer requests. When you make requests in the header of the page, they stall the output of the rest of the page: the user agent cannot render the rest of the page until the JavaScript files have been obtained. Also, JavaScript files download synchronously; they queue up instead of being pulled all at once (at least that is the theory).

Web displays: Paging vs. long tables

It seems that the trend in web design is to provide paged output, where long tables are displayed a page at a time. My customers don't like that, and have requested that the web sites I design for them show all entries in long tables. The arguments for paging seem to be mostly based on the performance hit of displaying long tables, and this is less of a concern in a high-bandwidth corporate intranet. Arguments against paging include the ability to print the entire table, do string searches against the entire table, select arbitrary ranges from the entire table for copying, etc. I've pointed out that these features can easily be added to paged web designs (e.g. a print button that prints the entire table, or a button that creates a CSV file of the table), but the paged output still seems inconvenient to them. Our typical table is about 100 to 600 items. Obviously tables that would be significantly larger would probably have to be paged.
Questions:
What is your experience with personal or customer preferences for paged vs. full output in long tables?
Web design tools seem to be pushing the paging paradigm. Are they out of touch, or are my customers unusual?
If you're thinking "It depends on the length of the table", what threshold would you use?
I love long one-page listings.
One of the few reasons I can see for paged listings is the one you point out about performance.
I think your customers are very usual and in-touch.
The threshold would be about page loading times. When the server can't produce the full lists fast enough or when the lists gets so long that the browser slows down. (The latter can happen for quite short lists if you have non-a-tag hover stuff in your CSS and the browser is IE.)
Give the users a powerful search function and they'll narrow down their page lists themselves.
Why not simply make it a user-configurable option? It sounds like you plan to essentially implement both anyway.
To be honest I think that no matter which you choose someone will complain. At least with it being user configurable you have the ability to put it back on the user.
Provide a default page length, and a configurable parameter (e.g. in the query string for programmatic use, and/or a form on the webpage for interactive use) to control how many listings are in a page.
User flexibility is good. Texas Instruments has a parametric search tool for electrical engineers to find ICs that meet certain technical characteristics, and they include a link both to "show all" in a webpage and "download all" as a .csv file. That's a good model, kudos to TI. Ditto to flickr; their API lets you control (to a large extent) how many results show up on a web service call.
I personally HATE websites that default to 10 listings per page with no way to increase it. It takes FOREVER to browse them, and I'm willing to wait longer if I can get all the stuff at once.
If it's an interactive webpage, I would consider going to an AJAX solution that downloads 100 at a time so there's an indication of progress (and the user can stop it if there are 20000 results).
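A sketch of that incremental approach (the endpoint, parameter names and 100-row chunk size are all assumptions):

```typescript
// Load a long table in chunks of 100 so the user sees progress
// and can stop a 20,000-row result part-way through.
async function loadAllRows(
  render: (rows: unknown[]) => void,
  signal: AbortSignal
): Promise<void> {
  const limit = 100;
  for (let offset = 0; ; offset += limit) {
    const res = await fetch(`/api/items?offset=${offset}&limit=${limit}`, { signal });
    const rows = (await res.json()) as unknown[];
    if (rows.length === 0) break; // no more data
    render(rows);                 // append this chunk to the table
    if (rows.length < limit) break;
  }
}

// Usage: wire a "Stop" button to controller.abort().
const controller = new AbortController();
loadAllRows(rows => console.log(`loaded ${rows.length} more rows`), controller.signal)
  .catch(() => { /* request was aborted by the user */ });
```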
I agree with PEZ, it's all about responsiveness.
Best solution: Don't provide lists with more than 100 items.
Usually your user doesn't want to read more than 100 or even 600 items. They just don't care. They are searching for one (or possibly a few). Make sure that there's a way for them to get to those items without visual-grep-ing through the list.
And if your client insists on displaying all items, then provide paging with a configurable page size and let him enter "100000 items per page" if he wants to.
One of the seminal books on web design (sorry, I forget which one) used to say not to count on your users scrolling down, because most of them don't know how or can't be bothered. I think a more recent update says that while this is true for the general public, certain sectors of more technical users can be expected to scroll down, and you can make pages that require scrolling IFF (if and only if) you know your users can handle it.
I can understand your situation extremely well. I have been in a similar situation. I moved a business workflow from being manually managed to an automated one; initially it was carried out using Excel spreadsheets. The stakeholders for my software were in the 55+ age group; they don't like anything AJAX-y or any of the UI patterns you are talking about. In such cases, data retrieval logic can be optimized. Any table that touches the 1K-row mark, or has items like image blobs, should be shown in parts from a performance point of view.
Long outputs slow rendering and will be a performance leech.
Customers don't want change most of the time, and the customer is always right, unless you can convince them otherwise.
I have put forth my threshold, but it also depends on the content of the rows.
Happy Coding!