I'm trying to reduce the size of my website, but to do that I need a reliable tool to measure the size of my pages.
I used to use Google Lighthouse; its performance audits report the page size, but the figure isn't precise and is inconsistent with the network tab.
I tried several combinations of curl, but I can't make it crawl the website correctly.
I tried several combinations of wget, but it couldn't correctly handle gzip or brotli encoding.
I came to the conclusion that wget and curl are not the right tools, because they don't evaluate JS and therefore can't handle conditional loading of assets.
I'm trying now with Puppeteer and PhantomJS, but I still haven't managed to do it.
Does anyone have a good solution for this?
How to Measure Size
Web browsers make a lot of decisions about what to download based on their particular context (for example, what compression algorithms they support). It's difficult to replicate those conditions in an external tool, such as curl. So you'll want to use a tool that thinks like a browser (or is a browser).
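Since the question already mentions Puppeteer, a minimal sketch of that "use a real browser" approach could look like the following. It's only a sketch: the URL is a placeholder, and note that response.buffer() returns the decompressed body, so this measures decoded page weight rather than on-the-wire transfer size.

import puppeteer from "puppeteer";

async function pageWeight(url: string): Promise<number> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let totalBytes = 0;

  // Sum the body size of every response the browser actually fetched.
  page.on("response", async (response) => {
    try {
      const body = await response.buffer();
      totalBytes += body.length;
    } catch {
      // Some responses (e.g. redirects) have no retrievable body.
    }
  });

  await page.goto(url, { waitUntil: "networkidle0" });
  await browser.close();
  return totalBytes;
}

pageWeight("https://example.com").then((bytes) =>
  console.log(`~${(bytes / 1024).toFixed(1)} KiB downloaded`)
);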
The server can also choose to send different content based on visitor information (user agent, whether they're logged in, geolocation, etc.) or even completely arbitrary conditions (like a randomized image). So you'll want to look at more than one sample, preferably from many user agents and locations.
Most tools don't provide that kind of power.
The closest thing I can suggest is WebPageTest. It uses an actual web browser to visit your site and reports an analysis of that visit, including total page weight (even broken down by different page events). WebPageTest can be driven via an API and even run locally. Output is available as JSON, so you can parse it and do custom reporting with CLI apps.
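As a rough sketch of driving it from a script (the endpoint paths and JSON field names here are from memory and worth checking against the current WebPageTest API docs; the API key and URL are placeholders):

const API_KEY = "YOUR_WPT_KEY"; // placeholder
const target = "https://example.com"; // placeholder

async function measure(): Promise<void> {
  // Kick off a test run.
  const start = await fetch(
    `https://www.webpagetest.org/runtest.php?url=${encodeURIComponent(target)}&k=${API_KEY}&f=json`
  ).then((r) => r.json());
  const testId = start.data.testId;

  // Poll until the test completes (statusCode 200).
  let result: any = null;
  do {
    await new Promise((resolve) => setTimeout(resolve, 10_000));
    result = await fetch(
      `https://www.webpagetest.org/jsonResult.php?test=${testId}`
    ).then((r) => r.json());
  } while (result.statusCode !== 200);

  // bytesIn is the total page weight reported for the first view.
  console.log("First view bytes:", result.data.runs["1"].firstView.bytesIn);
}

measure();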
How to Speed Up a Website
The technical question of "weighing" a website aside, there's a broader problem you're trying to tackle: how to speed up your website. There is a lot of information available for performance optimization.
Specifically, there's a lot of discussion about what metrics should be considered when evaluating a page's performance, how much weight should be given to each metric, and how to use that information to prioritize optimizations.
When considering page weight, I would highly recommend breaking it down by how many bytes are necessary to accomplish certain tasks. Google recommends thinking about resources in terms of the critical rendering path - the HTML, blocking JS, and non-deferred CSS necessary to construct a web page.
You may have a 1MB page where render-critical assets only make up 10KB of the page - that's a very fast site. Or you may have a 1MB page where 500KB are required for an initial render - not so fast. WebPageTest helps break down those weights by event for you.
I wish I could give more technical detail about using WebPageTest with CLI tools. It's something I plan to explore soon. But for now, hopefully this will give you a good start.
Have you tried PageSpeed Insights?
Analyze your website and read the optimization guidelines.
Related
There are many related questions, but all of them are about Shiny R, and that requires paid hosting to be always available (since free options such as shinyapps.io have limits). So I'm wondering whether there is any alternative solution for running R code from a website hosted, for example, at GitHub.
To be more clear, I want to use an R script to interactively display a few plots and some derived information, based on some basic settings given by a user. To give a super simple example:
var_from_gui = 7 # input in HTML, user e.g. clicks OK
print(paste("input plus five is:", var_from_gui + 5)) # info displayed on website
plot(c(1, 2, 5) * var_from_gui) # image to be displayed on website
Firstly, I assume this is very possible in Shiny R - is that correct?
Secondly, is this possible in another way that allows me to run this via e.g. GitHub pages? (Actually I can also use this more comprehensive university server, but I don't suppose it helps with this case.)
I'm aware of htmlwidgets too, but, as far as I understand, that only allows very limited interaction such as filtering, and not things like drawing plots based on user input.
One option I found that seems to fit well is OpenCPU, but what's discouraging is the apparent lack of activity (no recent questions/answers/posts, etc.) and hardly any useful tutorials or overviews, which also makes it hard to assess whether it's worth trying.
For up to 5 small apps with little traffic you could use the free plan on https://www.shinyapps.io/
It's very easy to deploy, because it's an RStudio service.
You can host your R functions on the public OpenCPU server, for free.
I have done that for my own applications and it works well. It has none of the limitations that you listed in your question. I also tried Shiny but, as you mentioned, it is not flexible enough for what I was trying to achieve.
OpenCPU is really a great tool, although not well supported by the community (not sure why, given the great value it brings).
I followed the docs here to get it up and running. Setup is a bit tedious but fairly well documented.
Once live, I found this server very reliable - your R functions are continuously available, with very low latency (much faster than a Shiny server, in my experience).
You are also asking for "a solution for running R code from a website hosted, for example, at GitHub" - OpenCPU does handle CI/CD (continuous integration/deployment) from your custom GitHub repo through a webhook mechanism.
I also implemented such a webhook for my apps, so I can confirm it works smoothly. Just follow the well-written documentation provided here.
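To give a feel for the flow, calling an R function from a static page over OpenCPU's HTTP API might look roughly like this (a sketch against the public demo server; stats/rnorm is just a stock example of the /ocpu/library/{package}/R/{function} pattern, and you would point it at your own package instead - the exact API details are worth verifying in the OpenCPU docs):

// Sketch: POST the user's input to an R function exposed over OpenCPU's
// HTTP API and use the JSON result in the page.
async function callR(n: number): Promise<void> {
  const response = await fetch(
    "https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ n }),
    }
  );
  const values: number[] = await response.json();
  console.log("R returned:", values);
}

callR(5);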
By now I guess I can answer my own question – though Marc's answer also seems useful in general (and prompted me to write my own answer).
In essence, shinyapps.io worked perfectly fine for me. For a small and not-too-often-used application, the free plan is easily enough. What's more, even in the unlikely case that the website goes down due to excessive usage, R users can easily run the Shiny app from their own computer (provided that they have R installed).
And of course, the example given in the question is entirely possible to implement in Shiny: typically the code is executed via the eventReactive function, and, for the "trigger" button, one can use actionButton.
At my company we are redesigning our e-commerce website. The HTML and CSS are being rewritten from the ground up to make the website responsive / mobile friendly.
Since it concerns one of our biggest websites, which is responsible for generating over 80% of our revenue, it is very important that nothing goes "wrong".
Our application is running on a LAMP stack.
What are the best practices for testing a major redesign?
Some issues I am thinking of:
When A/B testing a whole design (if possible), I guess you definitely don't want Google to come by and index your new design (since it's still in the test phase). How to handle this?
Should you redirect a percentage of the users to a new URL (or perhaps a subdomain)? Or is it better to serve the new content from the existing indexed URLs based on session?
How to compare statistics from a Google Analytics point of view?
How to hint Google about a new design? Should I, e.g., create a new UA code?
A solution might be to set a cookie only for customers who enter the website via the homepage. Doing so, you're excluding AdWords traffic and returning visitors, who might be expecting another web design; serve them the original website and leave their experience untouched.
Start the test with homepage traffic only: set a cookie and redirect a percentage of visitors to a subdomain. Measure conversion rate using a dimension in Google Analytics, within the same Analytics account. Serve a 'Disallow' robots.txt on the subdomain to exclude it from crawling by search engines.
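A bare-bones sketch of that homepage bucketing logic on the client side (the cookie name, the 20% split, and the beta.example.com subdomain are all placeholders; the same idea can of course live server-side in your LAMP stack instead):

// Assign new homepage visitors to a bucket once, remember it in a cookie,
// and send the chosen percentage to the redesigned site on the subdomain.
const COOKIE = "design_bucket";

function getBucket(): string | null {
  const match = document.cookie.match(new RegExp(`${COOKIE}=(\\w+)`));
  return match ? match[1] : null;
}

let bucket = getBucket();
if (bucket === null) {
  bucket = Math.random() < 0.2 ? "new" : "old"; // 20% see the redesign
  // Set domain=.example.com if the subdomain also needs to read the bucket.
  document.cookie = `${COOKIE}=${bucket}; path=/; max-age=${60 * 60 * 24 * 30}`;
}

if (bucket === "new" && location.hostname !== "beta.example.com") {
  location.href = `https://beta.example.com${location.pathname}`;
}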
Marc, you're mixing a few different concerns here:
Instrumentation. If your changes can be expressed via HTML/CSS/JavaScript only, i.e. they are optimizations in nature, you may be able to instrument using tools like VWO or Optimizely. If there are server-side changes too, then a tool like SiteSpect (any server stack) or Variant (Java only) might be in order. The advantage of using a commercial product is that they provide a number of important features out of the box, e.g. collecting experiment data, experience stability (a returning user sees the same experience), etc. You may be able to instrument on your own, but unless you're looking at a handful of pages, that is typically hard, particularly if you want to do it outside of the app, via the DevOps mechanisms.
SEO. If you get your instrumentation right, this shouldn’t be an issue. Public URIs should not differ for the control and variant of the same resource.
Traffic routing. Another reason to consider a commercial tool. They factor that out of your app and let you set percentages. Some tools, like Variant, will allow you to write custom targeters, e.g. “value” users always see control.
I'm looking for a way to allow a user to upload a large file (~1gb) to my unix server using a web page and browser.
There are a lot of examples that illustrate how to do this with a traditional post request, however this doesn't seem like a good idea when the file is this large.
I'm looking for recommendations on the best approach.
Bonus points if the method includes a way of providing progress information to the user.
For now security is not a major concern, as most users who will be using the service can be trusted. We can also assume that the connection between client and host will not be interrupted (or if it is they have to start over).
We can also assume the user is running a browser supporting most modern features (JavaScript, Flash, etc.).
edit
No language requirements. Just looking for the best solution.
There are several ways to handle this:
1. Flash Uploader
There are plenty of Flash uploaders that improve the user's GUI so that they can monitor the upload and factors such as time left, KB done, etc.
This is very good if you understand how to modify the Flash source code for later development.
2. Ajax
There are a few ways using Ajax and PHP (although PHP does not support upload progress natively); you can use the PECL uploadprogress module to accomplish it: http://pecl.php.net/package/uploadprogress. This is only needed if you wish to show percentage information, etc. (a rough client-side sketch of the progress reporting follows at the end of this list).
3. Basic JavaScript
This method would be just the regular form, but with some Ajax styling, so that when the form is submitted you can show a basic loader saying "please wait while you send us the file...".
If you're using ASP, you can take a look at: http://neatupload.codeplex.com/
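Here's a minimal client-side sketch of the progress-reporting part using XMLHttpRequest upload events (the /upload endpoint and the "progress" element ID are assumptions for illustration; the server side still has to accept a normal multipart POST):

function uploadWithProgress(file: File): void {
  const xhr = new XMLHttpRequest();
  const form = new FormData();
  form.append("file", file);

  // Fired periodically while the browser sends the request body.
  xhr.upload.onprogress = (event: ProgressEvent) => {
    if (event.lengthComputable) {
      const percent = Math.round((event.loaded / event.total) * 100);
      document.getElementById("progress")!.textContent = `${percent}%`;
    }
  };

  xhr.onload = () => {
    document.getElementById("progress")!.textContent =
      xhr.status === 200 ? "Done" : `Failed (${xhr.status})`;
  };

  xhr.open("POST", "/upload"); // hypothetical server-side handler
  xhr.send(form);
}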
Hope there's some good information here to get you on your way.
Regards
Not sure about your language requirements, but you can look e.g. into
http://pypi.python.org/pypi/gp.fileupload/
Supports progress information also, btw.
I have used the dojo FileUploader widget to reliably upload audio files greater than a gigabyte, with a progress bar. Though you said security was not an issue, I'd like to say that I got HTTPS uploads with cookie-based authentication hooked up flawlessly.
See: http://www.sitepen.com/blog/2008/09/02/the-dojo-toolkit-multi-file-uploader/ and
http://api.dojotoolkit.org/jsdoc/1.3/dojox.form.FileUploader
I am thinking that, to save server load, I could load common JavaScript files (the jQuery source) and maybe certain images from websites like Google (which are almost never down, always pretty fast, and maybe faster than my server).
Will it save much load?
Thanks!
UPDATE: I am not so much worried about saving bandwidth as I am about reducing server load, because my server has difficulty when there are a lot of users online, and I think this is because there are too many images/files being loaded from my single server.
You might consider putting up another server that does nothing but serve your static files using an ultra efficient web server such as lighttpd
This is known as a content delivery network, and it will help, although you should probably make sure you need one before you go about setting it all up. I have heard okay things about Amazon S3 for this (which Twitter, among other sites, use to host their images and such). Also, you should consider Google's API cloud if you are using any popular javascript libraries.
Well, there are a couple of things in principle:
Serving up static resources (.htm files, image files, etc.) rarely even makes a server breathe hard, except under the most demanding of circumstances (thousands of requests in a very short period of time).
Google's network is most likely faster than yours, and most everyone else. ;)
So if you are truly not experiencing any bandwidth problems, I don't think offloading your images, etc. will do much for you. However, as you move stuff off to Google, it frees up your server's bandwidth for more concurrent requests and faster transfers on the existing ones. The only tradeoff here is that clients will experience a slight (most likely unnoticeable) initial delay while DNS looks up the other servers and the connections to them are initiated.
It really depends on what your server load is like now. Are there lots of small web pages and lots of users? If so, then the 50K taken up by jQuery could mean a lot. If all of your pages are fairly large, and/or you have a small user base, caching jQuery with Google might not help much. Same with the pictures. That said, I have heard anecdotal reports (here on SO) that loading your scripts from Google does indeed provide a noticeable performance improvement. I have also heard that Google does not have 100% uptime (though it is close), and when it is down it is damned inconvenient.
If you're suffering from speed problems, putting your scripts at the bottom of the web page can help a lot.
I'm assuming you want to save costs by offloading commonly used resources to the web at large.
What you're suggesting is called hotlinking: directly linking to other people's content. While it can work in most cases, you do lose control of the content, which means your website may change without your input. Since images hosted on Google are scoured from other websites, they may be copyrighted, causing some (potential) concern, or the source sites may have anti-hotlinking measures that block the images from your web page.
If you're just working on a hobby website, you can consider hosting your resources on a free web account to save bandwidth.
Our web analytics package includes detailed information about user's activity within a page, and we show (click/scroll/interaction) visualizations in an overlay atop the web page. Currently this is an IFrame containing a live rendering of the page.
Since pages change over time, older data no longer corresponds to the current layout of the page. We would like to run a spider to occasionally take snapshots of the pages, allowing us to maintain a record of interactions with various versions of the page.
We have a working implementation of this (Linux), but the snapshot process is a hideous Python/JavaScript/HTML hack which opens a Firefox window, screenshotting and scrolling and merging and saving to a file. This requires us to install the X stack on our normally headless servers, and takes over a minute per page.
We would prefer a headless implementation with performance closer to that of the rendering time in a regular web browser, but haven't found anything.
There's some movement towards building something using Mozilla source as a starting point, but that seems like overkill to me, as well as a maintenance nightmare if we try to keep it up to date.
Suggestions?
An article on Digital Inspiration points towards CutyCapt which is cross-platform and uses the Webkit rendering engine as well as IECapt which uses the present IE rendering engine and requires Windows, natch. Nothing off the top of my head which uses Gecko, Firefox's rendering engine.
I doubt you're going to be able to get away from X, however. Since CutyCapt requires Qt, it requires either X or a Windows installation. And, similarly, IECapt will require Windows (or Wine if you want to try to run it under Linux, and then you're back to needing X). I doubt you'll be able to find a rendering engine which doesn't require Qt, Gtk, GDI, or Cocoa, and therefore requires a full install of display libraries.
Why not store the HTML that is sent out to the client? You could then redisplay it in a web browser as a page to show what it looked like.
Using your web analytics data about user actions, you could then use that to default the combo boxes, fields, etc. to the values the client would have had, and even change the CSS on buttons, etc., to mark them as having been pushed.
As a benefit, you don't need the X stack, don't need to do any crawling or storing of images.
EDIT (Re Andrew Moore):
This is where you store the current CSS/images under a version number. Place an easily parsable version number in a comment in the HTML. If you change your CSS/images but keep the existing names, increment the version number in the HTML output sent out.
The system that stores the HTML will know that it needs to grab a new copy and store under a new number. When redisplaying, it simply uses the version number to determine which CSS/image set to use.
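To make the idea concrete, a sketch of the lookup side might look like this (the comment format and the asset path layout are made-up conventions for illustration):

// Given a stored HTML snapshot containing a marker like
// <!-- assets-version: 12 -->, pick the matching CSS set for redisplay.
function assetVersion(html: string): number | null {
  const match = html.match(/<!--\s*assets-version:\s*(\d+)\s*-->/);
  return match ? parseInt(match[1], 10) : null;
}

function cssPathFor(html: string): string {
  const version = assetVersion(html) ?? 1; // fall back to the oldest set
  return `/snapshots/assets/v${version}/site.css`;
}

// Example:
// cssPathFor("<html><!-- assets-version: 12 --> ...") -> "/snapshots/assets/v12/site.css"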
We currently have a system here that uses a very similar approach so we can track user actions and provide better support when they call our help desk, as they can bring up the user's session and follow what they did, even somewhat live.
You can even code it to auto-censor sensitive fields when the HTML is stored.
Depending on the specifics of your needs, perhaps you could get away with using one of the many free web page thumbnail services? SnapCasa, for example, lets you generate thousands per month, with no charge and no advertising. (Never used it; I just googled 'free thumbnail service' to find it.)
Just a thought.