Apache2: problems with Arabic filenames (404) - html

I have an apache2 server which archives content (news) downloaded from the web with wget. Some URLs contain Arabic characters, and the corresponding files are saved with percent-encoded names such as the following:
72101-%D9%81%D8%B3%D8%A7%D8%AF-%D8%AD%DA%A9%D9%88%D9%85%D8%AA%DB%8C%D8%8C-%DA%AF%D9%86%D8%AF%D8%A7%D8%A8-%D9%86%D8%B8%D8%A7%D9%85-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C.html
I use the --restrict-file-names=nocontrol parameter with wget, but somehow it doesn't seem to apply when the URLs are in a file that wget reads with the -i option.
The problem is that when I access these files on my server, apache2 seems to decode these filenames back into Arabic when processing the URL, looks for a file with the Arabic name, then complains that it can't find it and returns a 404 error.
I am fine with these files having weird names; I just want Apache to stop decoding the URLs. They should point to the following,
72101-%D9%81%D8%B3%D8%A7%D8%AF-%D8%AD%DA%A9%D9%88%D9%85%D8%AA%DB%8C%D8%8C-%DA%AF%D9%86%D8%AF%D8%A7%D8%A8-%D9%86%D8%B8%D8%A7%D9%85-%D8%A2%D8%AE%D9%88%D9%86%D8%AF%DB%8C.html, not
72101-فساد-حکومتی،-گنداب-نظام-آخوندی.html
which does not exist.
Is there a way to have apache2 stop doing that?
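One way to look at the mismatch, as a minimal sketch rather than a definitive diagnosis: a web server percent-decodes the request path before mapping it to a file, so a file whose name literally contains %D9%81... is only reachable through a URL in which each % is itself escaped as %25. The Python snippet below uses only the standard library; the shortened filename is an illustrative assumption, not one of the archived files.

```python
from urllib.parse import quote, unquote

# Name of the file as wget saved it on disk (shortened for illustration).
on_disk = "72101-%D9%81%D8%B3%D8%A7%D8%AF.html"

# What the server looks for after percent-decoding a request for that name.
decoded = unquote(on_disk)

# URL path that decodes back to the literal on-disk name: "%" becomes "%25".
url_path = quote(on_disk)

print(url_path)
print(unquote(url_path) == on_disk)
```

Under this reading, either the links have to use the double-escaped form, or the files on disk have to be renamed to their decoded (Arabic) names so that the decoded request matches.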

Related

How to download or list all files on a website directory

I have a PDF link like www.xxx.org/content/a.pdf, and I know that there are many PDF files in the www.xxx.org/content/ directory, but I don't have the list of filenames. When I access www.xxx.org/content/ in a browser, it redirects to www.xxx.org/home.html.
I tried to use wget like "wget -c -r -np -nd --accept=pdf -U NoSuchBrowser/1.0 www.xxx.org/content", but it returns nothing.
Does anyone know how to download or list all the files in the www.xxx.org/content/ directory?
If the site www.xxx.org blocks directory listing in its .htaccess file, you can't do it.
Try FTP instead: if you have FTP access to the server, you can list and download all the files. Find the absolute server path that corresponds to the URL "www.xxx.org/content/" and write a small FTP utility to get the work done.
WARNING: This may be illegal without permission from the website owner. Get permission from the web site first before using a tool like this on a web site. This can create a Denial of Service (DoS) on a web site if not properly configured (or if not able to handle your requests). It can also cost the web site owner money if they have to pay for bandwidth.
You can use tools like dirb or dirbuster to search a web site for folders/files using a wordlist. You can get a wordlist file by searching for a "dictionary file" online.
http://dirb.sourceforge.net/
https://sectools.org/tool/dirbuster/

For PyCharm/WebStorm my IDE's local server is not detecting other directories

When I select an HTML file to open "in browser" in WebStorm, it works and opens under localhost. The issue I'm having is that WebStorm's internal server is not detecting any of the other paths in my project root, like images and JavaScript files.
I should note that this feature has worked before on other projects I started from scratch using "new project." The difference with this project is that I opened a directory as a project.
The built-in web server serves files from http://localhost:<built-in server port>/<project root>. A leading forward slash in a URL tells the browser to resolve it relative to the web server root (localhost:63342 in your case) rather than the project root, causing 404 errors.
If you would like to change the default web path on the built-in web server, you have to reconfigure the server by editing your system hosts file accordingly - see http://youtrack.jetbrains.com/issue/WEB-8988#comment=27-577559.
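The difference between a root-relative and a page-relative path can be checked with a small Python sketch. The project name "myproject" and the image path are hypothetical; urljoin follows the same resolution rules a browser applies.

```python
from urllib.parse import urljoin

# Page URL as served by the built-in server; "myproject" is a hypothetical project name.
page = "http://localhost:63342/myproject/index.html"

# A leading slash resolves against the server root, dropping the project folder.
root_relative = urljoin(page, "/images/logo.png")

# Without the leading slash the path resolves next to the page itself.
page_relative = urljoin(page, "images/logo.png")

print(root_relative)
print(page_relative)
```

Only the second form keeps the project folder in the path, which is why links written without the leading slash tend to work under the built-in server.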

how to run html on python openshift server

I have an OpenShift server running Python. However, when I call php via SSL the PHP interpreter starts running, which suggests that there might be a way to run PHP as well. However, HTML is good enough for me. I do not know how to reach HTML files on my server: whenever I try, I get a 404 Not Found. I've read about a solution that involves placing a .htaccess file:
AddType application/x-httpd-php .html
I am not exactly sure where to place this file, but placing it in the folder of the .html file still does not help.
Could you please explain how I can make .html files reachable on an OpenShift server running Python? How about PHP?
Put the .html file in your app-root/repo/wsgi/static folder (or in that folder in your git repository). If you want it to be served as app-domain.rhcloud.com/file.html, you will have to use a .htaccess file in your wsgi folder that rewrites file.html to static/file.html.
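As an alternative to the .htaccess rewrite (not the method described above), the Python application itself could hand out the HTML files. The following is a minimal sketch using only the WSGI interface; make_app and the static_dir argument are hypothetical names, and how the resulting callable is wired into the OpenShift cartridge is left as an assumption.

```python
import os

def make_app(static_dir):
    """Return a WSGI application that serves .html files from static_dir."""
    def application(environ, start_response):
        # Keep only the final path component so "../" cannot escape static_dir.
        name = os.path.basename(environ.get("PATH_INFO", ""))
        path = os.path.join(static_dir, name)
        if name.endswith(".html") and os.path.isfile(path):
            with open(path, "rb") as fh:
                body = fh.read()
            start_response("200 OK", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is reported as missing.
        body = b"Not Found"
        start_response("404 Not Found", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return application
```

A request for /file.html would then return the contents of file.html from the chosen directory, and any other path would return a 404.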

--no-clobber still overwrites file if --html-extension used in wget?

I have a script for downloading all of my Chrome bookmarks. I use wget with --html-extension because some of the bookmarks end in .php and can't be opened by a web browser unless the --html-extension option is used. The problem I am having is that when I use --html-extension with --no-clobber, it doesn't recognize that most of the files are already there for some reason, so it goes through the whole process of redownloading stuff it already has.
An example:
wget -nc http://www.test.com/
Run once, this saves the file like it is supposed to. If you run it again, it says the file is already there and does not retrieve it. That is the behavior I would expect.
However, delete the file that was just saved and run:
wget -nc http://www.test.com/ --html-extension
and then run that same command again. It overwrites the file instead of saying the file is already there. What is going on?
When the html suffix is added, wget can't tell what remote file you want to compare it to.
man wget: http://unixhelp.ed.ac.uk/CGI/man-cgi?wget
======================
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded
and the URL does not end with the regexp .[Hh][Tt][Mm][Ll]?, this
option will cause the suffix .html to be appended to the local
filename. This is useful, for instance, when you're mirroring a
remote site that uses .asp pages, but you want the mirrored pages
to be viewable on your stock Apache server. Another good use for
this is when you're downloading CGI-generated materials. A URL
like http://site.com/article.cgi?25 will be saved as
article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every
time you re-mirror a site, because Wget can't tell that the local
X.html file corresponds to remote URL X (since it doesn't yet know
that the URL produces output of type text/html or
application/xhtml+xml). To prevent this re-downloading, you must use -k
and -K so that the original version of the file will be saved as
X.orig.
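The naming rule quoted from the man page can be modelled in a few lines of Python. This is a simplified sketch of the documented behaviour, not wget's actual implementation: local_name is a hypothetical helper, the dot in the regexp is assumed to be a literal ".", and the content type is passed in as a flag because in reality it is only known after the download.

```python
import re

# Simplified model of the rule quoted above; the dot is treated as a literal ".".
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")

def local_name(remote_name, is_html=True):
    """Return the filename --html-extension would save under, per the man page text."""
    if is_html and not HTML_SUFFIX.search(remote_name):
        return remote_name + ".html"
    return remote_name

for name in ["article.cgi?25", "index.php", "index.htm", "page.html"]:
    print(name, "->", local_name(name))
```

Whenever the suffix is appended, the name on disk (X.html) no longer equals the name derived from the URL (X), which is the situation the man page describes as causing the re-download.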

Google Translate TTS problem

I'm testing with a simple HTML file, which contains:
<audio src="http://translate.google.com/translate_tts?tl=en&q=A+simple_text+to+voice+demonstration." controls autoplay>
with Chrome v11.0.696.68 and FF v4.0.1. I'm going through a proxy server and it doesn't work. Nothing gets played and clicking on the play button doesn't work in Chrome. In FF it flashes and then shows an 'X' over the control. The error logs don't show anything.
So I've broken down the steps:
Typing the URL into either browser works
wget -q -U Mozilla -O /tmp/tts.mp3 "http://translate.google.com/translate_tts?tl=en&q=Welcome+to+our+fantastic+text+to+voice+demonstration." gets me a file that plays fine on both browsers.
If I serve this file from my local web server (i.e. one that doesn't go through the proxy), it works fine, e.g. src="http://localhost/tts.mp3"
I'm stumped. If the proxy were the problem then wget and address bar access shouldn't work. If the src being a URL were the problem then it shouldn't work from my local server.
Any clues? suggestions?
The reason this isn't working is most likely that translate.google.com restricts certain types of requests to prevent the service from being overloaded. For instance, if you use wget without the "-U Mozilla" user agent option you will get an HTTP 404, because the service refuses requests that carry wget's default user agent string.
In your case, it looks like translate.google.com returns an HTTP 404 if an HTTP Referer header is included in the request. When you run wget from the command line there is no referrer. When you use the audio tag from within a web page, the browser sends a Referer header when requesting the audio. I just tried the following and got a 404:
wget --referer="http://foo.com" -U Mozilla -O /tmp/tts.mp3 "http://translate.google.com/translate_tts?tl=en&q=Welcome+to+our+fantastic+text+to+voice+demonstration"
However, if you take the --referer option out, it works.
The service is working here (11-NOV-2011) but is limited to 100 characters. You can split your text into chunks of at most 100 characters, download the MP3 result for each chunk, and then join the results into the final MP3 file.
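The chunking step could look roughly like the Python sketch below. It splits at word boundaries so that no chunk exceeds the limit and builds one request URL per chunk using the endpoint quoted in the question. split_chunks and tts_urls are hypothetical helper names; downloading and joining the MP3 files is not shown, and whether the endpoint still accepts such requests is not something this sketch verifies.

```python
import textwrap
from urllib.parse import urlencode

# Endpoint as quoted in the question.
BASE = "http://translate.google.com/translate_tts"

def split_chunks(text, limit=100):
    """Split text at word boundaries into chunks of at most `limit` characters."""
    return textwrap.wrap(text, width=limit, break_on_hyphens=False)

def tts_urls(text, limit=100, lang="en"):
    """Build one request URL per chunk, with the text form-encoded into q."""
    return [BASE + "?" + urlencode({"tl": lang, "q": chunk})
            for chunk in split_chunks(text, limit)]

for url in tts_urls("Welcome to our fantastic text to voice demonstration."):
    print(url)
```

Splitting on word boundaries rather than at a fixed offset avoids cutting a word in half between two audio files.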