--no-clobber still overwrites file if --html-extension used in wget?

I have a script for downloading all of my Chrome bookmarks. I use wget with --html-extension because some of the bookmarks end in .php and can't be opened in a web browser unless the --html-extension option is used. The problem is that when I combine --html-extension with --no-clobber, wget doesn't recognize that most of the files are already there, so it goes through the whole process of re-downloading things it already has.
An example:
wget -nc http://www.test.com/
Run once, this saves the file as expected. Run it again and it says the file is already there, so it is not retrieved. That is the behavior I would expect.
However, delete the file that was just saved and run:
wget -nc http://www.test.com/ --html-extension
Then run that same command again: it overwrites the file instead of reporting that the file is already there. What is going on?

When the .html suffix is added, wget can't tell which remote file the local file corresponds to, so --no-clobber has nothing to match against.
man wget: http://unixhelp.ed.ac.uk/CGI/man-cgi?wget
======================
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded
and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
option will cause the suffix .html to be appended to the local
filename. This is useful, for instance, when you're mirroring a
remote site that uses .asp pages, but you want the mirrored pages
to be viewable on your stock Apache server. Another good use for
this is when you're downloading CGI-generated materials. A URL
like http://site.com/article.cgi?25 will be saved as
article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every
time you re-mirror a site, because Wget can't tell that the local
X.html file corresponds to remote URL X (since it doesn't yet know
that the URL produces output of type text/html or
application/xhtml+xml). To prevent this re-downloading, you must use
-k and -K so that the original version of the file will be saved as
X.orig.
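For example, combining the options from the original command with -k and -K would look roughly like this (a sketch, not tested against your bookmark list):

wget -nc --html-extension -k -K http://www.test.com/

With -K the originally downloaded file is kept as X.orig, which is what lets wget match the local X.html copy against the remote URL on the next run instead of fetching it again.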


How to download or list all files on a website directory

I have a PDF link like www.xxx.org/content/a.pdf, and I know that there are many PDF files in the www.xxx.org/content/ directory, but I don't have a list of the filenames. When I access www.xxx.org/content/ in a browser, it redirects to www.xxx.org/home.html.
I tried wget like this: "wget -c -r -np -nd --accept=pdf -U NoSuchBrowser/1.0 www.xxx.org/content", but it returns nothing.
So does anyone know how to download or list all the files in the www.xxx.org/content/ directory?
If the site www.xxx.org blocks directory listings (for example via .htaccess or the server configuration), you can't do it that way.
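For reference, "blocking the listing" usually just means a directive like the following somewhere in the site's .htaccess or server configuration (shown only as an illustration):

Options -Indexes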
Alternatively, try FTP: with an FTP path you can download and access all the files on the server. Get the absolute path corresponding to the same URL "www.xxx.org/content/", connect to the server's FTP service, and fetch the files from there.
WARNING: This may be illegal without permission from the website owner, so get permission before using a tool like this on a site. It can effectively create a denial of service (DoS) if the site is not properly configured or cannot handle your requests, and it can also cost the site owner money if they pay for bandwidth.
You can use tools like dirb or dirbuster to search a web site for folders/files using a wordlist. You can get a wordlist file by searching for a "dictionary file" online.
http://dirb.sourceforge.net/
https://sectools.org/tool/dirbuster/
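For example, a dirb run that probes the content directory for PDFs might look roughly like this (the wordlist path is only an assumption; point it at whatever wordlist you downloaded):

dirb http://www.xxx.org/content/ /usr/share/dirb/wordlists/common.txt -X .pdf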

Apache2: problems with Arabic filenames (404)

I have an Apache2 server that archives content (news) downloaded from the web with wget. Some URLs contain Arabic characters, and the files are saved with percent-encoded names such as the following:
72101-%D9%81%D8%B3%D8%A7%D8%AF-%D8%AD%DA%A9%D9%88%D9%85%D8%AA%DB%8C%D8%8C-%DA%AF%D9%86%D8%AF%D8%A7%D8%A8-%D9%86%D8%B8%D8%A7%D9%85-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C.html
I use the --restrict-file-names=nocontrol parameter with wget, but somehow it doesn't seem to apply when the URLs are read from a file with the -i option.
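The call looks roughly like this (urls.txt stands in for my actual list of links):

wget --restrict-file-names=nocontrol -i urls.txt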
The problem is that when these files are accessed on my server, Apache2 seems to decode the percent-encoded names back into Arabic while processing the URL, looks for a file with the Arabic name, doesn't find it, and returns a 404 error.
I am fine with these files having weird names; I just want Apache to stop re-encoding the URLs. They should point to the following,
72101-%D9%81%D8%B3%D8%A7%D8%AF-%D8%AD%DA%A9%D9%88%D9%85%D8%AA%DB%8C%D8%8C-%DA%AF%D9%86%D8%AF%D8%A7%D8%A8-%D9%86%D8%B8%D8%A7%D9%85-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C.html, not
72101-فساد-حکومتی،-گنداب-نظام-آخوندی.html
which does not exist.
Is there a way to have apache2 stop doing that?

Linux shell script command - gzip

I have a shell script on Linux whose output is generated in .csv format.
At the end of the script I compress the .csv to .gz to save space on my machine.
The generated file is named like this: Output_04-07-2015.csv
The command I use to compress it is: gzip Output_*.csv
The issue is that if the compressed file already exists, the script should instead create a new file with the current timestamp in its name.
Can anyone help me with this?
If all you want is to overwrite the file when it already exists, gzip has a -f flag for that:
gzip -f Output_*.csv
The -f flag forces gzip to create the archive, overwriting any existing .gz file of the same name.
Have a look at the man page (man gzip) for many other options.
If instead you want to handle the existing-file case yourself, you can do it with a few shell commands in your script, though the exact syntax depends on which shell you use (bash, csh, etc.).
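For example, a bash sketch (untested, and it assumes you want the current time appended to the new filename) could look like this:

# Compress each CSV; if the .gz already exists, add the current time to the new name
for f in Output_*.csv; do
    out="$f.gz"
    if [ -e "$out" ]; then
        out="${f%.csv}_$(date +%H-%M-%S).csv.gz"
    fi
    gzip -c "$f" > "$out" && rm "$f"
done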

How to get AWStats to generate static HTML files?

I want to get AWStats running on my webserver that runs Debian 4.4.5-8 with Apache 2.
There are several websites that all have their own configuration file, similar to this:
Include "/etc/awstats/awstats.model.conf"
LogFile="/var/customers/logs/myname-example.com-access.log"
LogType=W
LogFormat = 1
LogSeparator=" "
SiteDomain="example.com"
HostAliases="*.example.com"
DirData="/www/myname/awstats/example.com/"
What I expect is that HTML files are written to /www/myname/awstats/example.com/, which I can then serve through Apache. However, when I run /usr/share/awstats/tools/buildstatic.sh, .txt files are written to that directory and the HTML files I want end up in /var/cache/awstats instead. The error file in /tmp remains empty.
Why is this happening and how do I make it work the way I want?
DirData is not supposed to be read directly by the Web Server. It is used by awstats.pl.
The output directory /var/cache/awstats is hard-coded in buildstatic.sh, so you have to change the two lines that mention it:
mkdir -p /var/cache/awstats/$c/$Y/$m/
and
-dir=/var/cache/awstats/$c/$Y/$m/ >$TMPFILE 2>&1
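For example (a sketch only; replace /path/apache/serves with the directory you actually want Apache to serve, and note that $c, $Y, $m and $TMPFILE are already defined in the script), those two lines would become:

mkdir -p /path/apache/serves/$c/$Y/$m/
-dir=/path/apache/serves/$c/$Y/$m/ >$TMPFILE 2>&1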

Cannot find URL error while it is definitely there

I have a small CGI script running on a Linux server. The following is part of the script's output:
<tr><td valign="center">Lol</td><td valign="center">10112</td><td><a href="/home/pathtopdf/abc.pdf">abc.pdf</a></td></tr>
But when I click on the abc.pdf hyperlink, the browser displays the error message "URL /home/pathtopdf/abc.pdf was not found on the server", even though the PDF and the path definitely exist and all files and folders in the path (including the PDF) have full permissions.
My server root is /srv/www and the script is in /srv/www/cgi-bin. But when I put the link to the PDF as follows:
<tr><td valign="center">Lol</td><td valign="center">10112</td><td><a href="/srv/www/for_html/abc.pdf">abc.pdf</a></td></tr>
the error message was: The requested URL '/srv/www/for_html/abc.pdf' resolves to a file which is marked executable but is not a CGI file; retrieving it is forbidden. Again, the files have the right permissions.
What could be the problem?
Your problem is that you are trying to request a file outside of the web root. So by clicking that link, the browser is really requesting
http://example.com/home/pathtopdf/abc.pdf
not
/home/pathtopdf/abc.pdf
You can edit your Apache config file and add a virtual host that serves that directory under a subdomain (say, downloads).
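A rough sketch of such a vhost (the hostname and paths are placeholders, and the syntax assumes Apache 2.4):

<VirtualHost *:80>
    ServerName downloads.example.com
    DocumentRoot /home/pathtopdf
    <Directory /home/pathtopdf>
        Require all granted
    </Directory>
</VirtualHost>

Then a link like http://downloads.example.com/abc.pdf would serve the file instead of a filesystem path.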
After your edit: I am assuming you are using the file:// protocol directly on the server. I would say just remove the executable bit from your PDF's file permissions. Run from a shell:
chmod -x /srv/www/for_html/abc.pdf