Using spéciâl characters in URLs - mysql

I'm developing a website that lets people create their own translator. They choose a name for the URL, the name is stored in a database, and I use .htaccess to rewrite website.com/nameoftheirtranslator
to:
website.com/translator.php?name=nameoftheirtranslator
Here's my problem:
Recently, I've noticed that someone has created a translator with special characters in the name -> "LAEFÊVËŠI".
But when it is processed (posted to a PHP file, then run through mysqli_real_escape_string()) and added to the database, it appears as simply "LAEFVI" - so you can see the special characters have been lost somewhere.
I'm not quite sure what to do here, but I think there are two paths:
Try to keep the characters and do some encoding (no idea where to start)
Ditch them and tell users to only use 'normal' characters in the names of their translators (not ideal)
I'm wondering whether it's even possible to have a URL like website.com/LAEFÊVËŠI - can that be interpreted by the server?
EDIT1: I notice that Stack Overflow, on this very question, translates the special characters in my title to .../using-special-characters-in-urls! This seems like a great solution; I guess I could make a function that translates special characters like â to their normal equivalent (like a)? And I suppose I would just ignore other characters like /##"',&? Now that I think of it, there must be some fairly standard, good-practice strategies for getting around problems like this.
EDIT2: Actually, now that I think about it (more) - I really want this thing to be usable by people of any language (not just English), so I would really love to be able to have special characters in the urls. Having said this, I've just found that Google doesn't interpret â as a, so people may have a hard time finding the LAEFÊVËŠI translator if I don't translate the letters to normal characters. Ahh!
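If you do end up transliterating, PHP can do most of the work for you. Here's a minimal sketch, assuming UTF-8 input and an iconv build that supports //TRANSLIT (the exact output depends on the underlying iconv implementation, so treat this as a starting point, not a guarantee):
// Hypothetical helper: turn "LAEFÊVËŠI" into an ASCII-only slug.
// //TRANSLIT maps accented letters to their closest ASCII equivalents;
// //IGNORE drops anything that cannot be mapped at all.
function make_slug(string $name): string
{
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
    // Strip whatever is left that isn't a letter, digit, hyphen or underscore.
    return preg_replace('/[^A-Za-z0-9_-]+/', '', $ascii);
}
echo make_slug('LAEFÊVËŠI'); // e.g. "LAEFEVESI" on a glibc system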

Okay, after that crazy episode, here's what happened:
Found out that I was removing all the non-alphanumeric characters with PHP's preg_replace().
Altered the preg_replace() so it only removes spaces, and used rawurlencode():
$name = mysqli_real_escape_string($con, rawurlencode( preg_replace("/\s/", '', $name) ));
Now everything is in the database encoded, safe and sound.
Used this rewrite rule RewriteRule ^([^/.]+)$ process.php?name=$1 [B]
Ran around in circles for 2 hours thinking my rewrite was wrong, because I was getting "page not found".
Realised that process.php wasn't re-encoding the incoming name with rawurlencode():
$name = rawurlencode($_GET['name']);
Now it works (a consolidated sketch follows below).
WOO!
Sleep time.
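Pulling that recap together, here's the whole round trip as one sketch (the table and column names are made up for illustration, and a prepared statement replaces the hand-rolled escaping):
// Save side (hypothetical save script): strip spaces, percent-encode,
// then store the encoded name.
$name = rawurlencode(preg_replace('/\s/', '', $_POST['name']));
$stmt = $con->prepare('INSERT INTO translators (name) VALUES (?)');
$stmt->bind_param('s', $name);
$stmt->execute();

// Read side (process.php): PHP hands over the decoded path segment,
// so re-encode it the same way before looking it up.
$name = rawurlencode($_GET['name']);
$stmt = $con->prepare('SELECT * FROM translators WHERE name = ?');
$stmt->bind_param('s', $name);
$stmt->execute();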

Related

Trying to get `sed` to fix a phpBB board - but I cannot get my regexp to work

I have an ancient phpBB3 board which has gone through several updates over its 15+ years of existence. Sometimes, in the distant past, such updates would partially fail, leaving all sorts of 'garbage' in the BBCode. I'm now trying to do a 'simple' regexp to match a particular issue and fix it.
What happened was the following... during a database update, long long ago, BBCode tags were, for some reason, 'tagged' with a pseudo-attribute — allegedly for the database updating script to figure out each token that required updating, I guess. This attribute was always an 8-character alphanumeric string, 'appended' to the actual BBCode with a colon, like this:
[I]something in italic[/I]
...
[I:i9o7y3ew]something in italic[/I:i9o7y3ew]
Naturally, phpBB doesn't recognise this as valid BBCode, and just prints the whole text out.
The replacement regexp is actually very basic:
s/\[(\/?)(.+):[[:alnum:]]{0,8}\]/[\1\2]/gim
You can see a working example on regex110.com (where capture groups use $1 instead of \1). The example given there includes a few examples from the actual database itself. [i] is actually the simplest case; there are plenty of others which are perfectly valid but a bit more complex, thus requiring a (.+) matcher, such as [quote=\"Gwyneth Llewelyn\":2m80kuso].
As you can see from the example on regex110.com, this works :-)
Why doesn't it work under (GNU) sed? I'm using version 4.8 under Linux:
$ sed -i.bak -E "s/\[(\/?)(.+):[[:alnum:]]+\]/[\1\2]/gim" table.sql
Just for the sake of the argument, I tried using [A-Za-z0-9]+ instead of [[:alnum:]]+; I've even tried (.+) (to capture the group and then just discard it).
None produced an error; none did any replacements whatsoever.
I understand that there are many different regexp engines out there (PCRE, PCRE2, Boost, etc. and so forth) so perhaps sed is using a syntax that is inconsistent with what I'm expecting...?
Rationale: well, I could have done this differently; after all, MySQL has built-in regexp replacements, too. However, since this particular table is so big, it takes eternities. I thought I'd be far better off dumping everything to a text file, doing the replacements there, and importing the table again. There is a catch, though: the file is 95 MBytes in size, which means that most tools I've got (e.g. editors with built-in regexp search & replace) will choke on such a huge file. One notable exception is good old emacs, which has no trouble with files that size. Alas, emacs cannot match anything either, so I thought I'd give sed a try (it should be faster, too). sed also takes close to a minute or so to process the whole file — about the same as emacs, in fact — and has the same result, i.e. no replacements are being made. It seems to me that, although the underlying technology is so different (pure C vs. Emacs Lisp), both these tools somehow rely on similar algorithms... both of which fail.
My understanding is that some libraries use different conventions to signal literal vs. metacharacters and quantifiers. Here is an example from an instruction manual for vim: http://www.vimregex.com/#compare
Indeed, contemporary versions of sed seem to be able to handle two different kinds of conventions (thus the -E flag). The issue I have with my regexp is that I find it very difficult to figure out which convention to apply. Let's start with what I'm used to from PHP, Go, JavaScript and a plethora of other regexp implementations, which use the convention that metacharacters & quantifiers do not get backslashed (while literals do).
Thus, \[(\/?)(.+):[[:alnum:]]+\] presumes that there are a few literal matches for [, ], /, and only these few cases require backslashes.
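To make that convention concrete, here is the expression in PHP's PCRE flavour as a sketch (the lazy .+? is my adjustment, to stop one match from swallowing several tags on the same line):
$line  = '[quote="Gwyneth Llewelyn":2m80kuso]something[/quote:2m80kuso]';
// [ and ] are metacharacters, so they are backslashed to match literally;
// the groups (...) and the + quantifier stay bare.
$fixed = preg_replace('~\[(/?)(.+?):[[:alnum:]]+\]~', '[$1$2]', $line);
echo $fixed; // [quote="Gwyneth Llewelyn"]something[/quote]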
Using the reverse convention — i.e. literals do not get backslashed, while metacharacters and some quantifiers do — this would be written as:
[\(/\?\)\(\.\+\):\[\[:alnum:\]\]\+]
Or so I would think.
Sadly, sed also rejects this with an error — and so do vim and emacs, BTW (they seem to use a similar regexp library, or perhaps even the same one).
So what is the correct way to write my regexp so that sed accepts it (and does what I intend it to do)?
UPDATE
I have since learned that, in the database, phpBB, contrary to what I assumed, does not store BBCode (!) but rather a variant of HTML (some tags are the same, some are invented on the spot). What happens is that BBCode gets translated into that pseudo-HTML, and back again when displaying; that, at least, explains why phpBB extensions such as Markdown for phpBB — but also BBCode add-ons! — can so easily replace, partially or even totally, whatever is in the database, and why the content continues to work (to a degree!) even if those extensions get deactivated: the parsed BBCode/Markdown is just converted to this 'special' styling in the database, and, as such, will always be rendered correctly by phpBB3, no matter what.
In other words, fixing those 'broken' phpBB tags requires a bit more processing, and not merely search & replace with a single regexp.
Nevertheless, my question is still pertinent to me. I'm not really an expert with regexps but I know the basics — enough to make my life so much easier! — and it's always good to understand the different 'dialects' used by different platforms.
Notably, instead of using egrep and/or grep -E, I'm fond of using ugrep instead. It uses PCRE2 expressions (with the Boost library), and maybe that's the issue I'm having with the sed engine(s) — different engines speak different regular expression dialects, and converting from one grep variant to another might not be useful at all (because some options will not 'translate' well enough)...
Using sed
(\[[^:]*) - Capture an opening bracket and everything after it up to, but not including, the next colon; this part is kept via the back reference \1
[^]]* - Match (and discard) everything else up to, but not including, the next closing bracket
$ sed -E 's/(\[[^:]*)[^]]*/\1/g' table.sql
[I]something in italic[/I]
...
[I]something in italic[/I]

Perl JSON encode in UTF-8 strange behaviour

Based on the Perl JSON 2.90 documentation, to encode a JSON object in UTF-8, all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious, and this is what I did. After a while, I got an issue report on GitHub from one of the users, which really surprised me, as it shouldn't have been happening!
I beat my head against it for hours trying to figure out what was going on, but the working solution turned out to be very weird and, from my point of view, wrong.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese - it just worked?
Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it actually should be vice versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.
If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.
If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has a :utf8 or :encoding layer enabled (which is incorrect for writing already-encoded data to). Unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
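The double-encoding failure mode is easy to reproduce outside Perl, too. Here is a PHP sketch of what it does byte-wise (the mechanics on the Perl side are the same; only the layer names differ):
$s = 'installé';                       // UTF-8 bytes: ... 6C 6C C3 A9
// Re-encoding an already-encoded string - here by treating its UTF-8
// bytes as Latin-1 - produces the classic mojibake "installÃ©".
$double = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
echo bin2hex($s), "\n";      // 696e7374616c6cc3a9     (encoded once)
echo bin2hex($double), "\n"; // 696e7374616c6cc383c2a9 (encoded twice)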

Cannot properly decode html entities in perl

I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.
I have some data in this format
??E??0??<?20120529184453+0200?20120529184453+0200???G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12??W??tP?2??
??|?????
??:o?????tP???B#?????B#??????)0????
49471010550??? ???tP???3??<????????????????
I have some PHP code, not written by me, which just runs html_entity_decode on that and returns the correct results.
When I try running Perl's decode_entities I get a completely different result. After some debugging, it seems to me that PHP is "properly" replacing what seem to be invalid entities, such as &#0; or &#8;, with their ASCII counterparts, namely NUL and backspace for the two cases mentioned.
Perl, on the other hand, does not seem to decode those "invalid" entities and leaves them alone, which later on screws up the result (which goes through unpack or, in PHP's case, bin2hex, and fails because rather than unpacking NUL to 00 it unpacks each individual character of the literal &#0;).
I have tried everything I can think of, including running the following substitution in Perl after running decode_entities:
$var =~ s/&#(\d+);/chr($1)/g
However, that does not work at all.
This is driving me mad, and I would like to have this done in Perl rather than PHP. I really hope I don't have to write 1000 pattern-matching lines in Perl to cover all possible entities and numbers.
Does anybody have an idea how to go about this problem without resorting to porting PHP's entire html_entity_decode function into Perl or writing endless lines of pattern matching?
You're almost there. Instead of
$var =~ s/&#(\d+);/chr($1)/g
say
$var =~ s/&#(\d+);/chr($1)/ge
The /e modifier instructs Perl to 'e'valuate the replacement pattern.
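For completeness, the PHP side of the comparison does the same job with a callback, since PHP removed its own /e modifier years ago. A minimal sketch (PHP 7.4+ for the arrow function):
$var = 'A&#0;B&#8;C';
// preg_replace_callback evaluates the callback for each match,
// just like Perl's s///e evaluates the replacement expression.
$var = preg_replace_callback(
    '/&#(\d+);/',
    fn (array $m) => chr((int) $m[1]),  // fine below 256; use mb_chr() above
    $var
);
echo bin2hex($var); // 4100420843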

Find and Replace with Notepad++

I have a document that was converted from PDF to HTML for use on a company website, to be referenced and indexed for search. I'm attempting to format the converted document to meet my needs, and in doing so I am trying to clean up some of the junk that was pulled over from the PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of 4 lines; unfortunately, they are not exactly the same, and therefore cannot be removed with a simple literal replace. The lines contain numbers which are incremental, as they correlate with the pages. How can I remove the following example from my HTML file?
Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>
I've tried many different regular expression attempts, but as my skill in that area is limited, I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it would seem all I need is a wildcard replace for the two numbers in the code, and the rest is literal.
Any help is appreciated.
The search & replace of Notepad++ is quite odd. I can't find newline characters with a regular expression, although the documentation says:
As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.
I updated to the latest version, but it just doesn't work. Using the extended mode allows me to find newlines, but I can't specify wildcards.
However, you can use macros to overcome this problem.
prepare a search that will find a unique passage (like Title<br>\r\n, here you can use the extended mode)
start recording a macro
press F3 to use your search
mark the four lines and delete them
stop recording the macro ... done!
Just replay it and it deletes what you wanted to delete.
If I have understood your request correctly this pattern matches your string:
Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>
I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
edit
As I do not use Notepad++ I cannot be sure that this pattern will work for you. Apologies if that transpires to be the case. (I'm a TextPad man myself, and it does work with that tool).
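If Notepad++ keeps fighting you, the same wildcard replace is easy to script. Here is a sketch in PHP using a variant of the pattern above (the file name is made up, and \R absorbs either line-ending style):
$html = file_get_contents('document.html');
// One page-break block: title line, page number, rule, anchored footer.
// [0-9]+ is the wildcard for the incrementing page and anchor numbers.
$pattern = '~Title<br>\R[0-9]+<br>\R<hr>\R<A name=[0-9]+></a>Footer<br>\R?~';
file_put_contents('document.html', preg_replace($pattern, '', $html));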

What are the legal / allowed characters for web server file names?

What characters are allowed in filenames for HTML files on ALL servers (*nix, Windows, etc.)?
I'm looking for the "lowest common denominator" that will work on all servers.
USE: I'm naming a file to be served up publicly (Mysite.com/My-Page.htm)
E.g., space? _ - , etc.
E.g., can I have File-Name.htm, File_Name.htm, File Name.htm?
Obviously, this needs to work with all servers and browsers. (IIRC, the name is limited by the server not the browser, but I could be wrong).
What characters are allowed in filenames for HTML files on servers?
That totally depends on the server. HTTP itself allows any character at all, including control characters and non-ASCII characters, as long as they are suitably %-encoded when requested in a URL.
On a Unix server you cannot use ‘/’ or the zero byte. (If you could use them, they'd appear in the URL as ‘%2F’ and ‘%00’ respectively.) You also can't have the specific filenames ‘.’ or ‘..’, or the empty string.
On a Windows server you have all the limitations of a Unix server, plus you also can't use any of \/:*?"<>| or control characters 1-31 and you can't have leading or trailing dot or spaces, and you'll have difficulty using any of the legacy device filenames (CON, PRN, COM1 and many more).
This is nothing to do with HTTP; just how filenames work on Windows, which is complicated.
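Those Windows rules are mechanical enough to check in code. A rough PHP sketch (the function name is mine and the device-name list is deliberately partial; treat it as illustrative, not exhaustive):
// Returns false if $name breaks one of the Windows rules listed above.
function windows_safe(string $name): bool
{
    if ($name === '' || $name === '.' || $name === '..') return false;
    if (preg_match('/[\\\\\/:*?"<>|\x01-\x1f]/', $name)) return false; // forbidden chars
    if (preg_match('/^[. ]|[. ]$/', $name)) return false;  // leading/trailing dot or space
    $reserved = ['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'LPT1'];  // partial list
    return !in_array(strtoupper(pathinfo($name, PATHINFO_FILENAME)), $reserved, true);
}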
can I have File-Name.htm, File_Name.htm File Name.htm?
Certainly. But in the last case you should link to it by URL-encoding the space:
<a href="File%20Name.htm">thingy</a>
Browsers will usually let you get away with leaving the space in, but it's not really valid. If you want to avoid having to think about URL-escaping, HTML-escaping and case-sensitive issues, stick to a–z, 0–9 and underscore.
If you don't want your filenames to be encoded by the server, you should avoid reserved characters: $&+,/:;=?# and unsafe characters: space, quotation marks, <>#%{}|\^~[]`
But as the previous answers stated, the web servers should cope with whatever you want to use by encoding the chars.
Be sure to eliminate
* . " / \ [ ] : ; | = ,
which are never allowed. Due to inconsistencies in file-naming conventions, standard practice is to use a-z, 0-9 and the underscore character. Space is needful for most users, but if you can get away with not using it, you avoid parsing issues and improve reliability; you can read the RFCs on MIME (Multipurpose Internet Mail Extensions) to get a taste of what is involved.
No matter what you do, something somewhere is likely to make life difficult - so much so that I now use cryptographic methods to generate random a-z lowercase strings and use those as filenames, embedding the useful info in the file source code.
Avoid the ampersand at any cost, ...
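As an illustration of that random-name approach, generating an opaque filename is a one-liner in PHP (random_bytes() needs PHP 7+; the length is an arbitrary choice):
// 8 random bytes -> 16 lowercase hex characters, e.g. "3f9c0a7be2d14455.html".
// bin2hex() only ever emits 0-9 and a-f, so the name is safe everywhere.
$filename = bin2hex(random_bytes(8)) . '.html';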
I would say a good rule of thumb for filenames for HTML files on ALL servers is any combination of alphabet characters (lowercase preferred) and number characters (0 through 9), plus the underline (_), minus (-) or plus (+) characters, but no spaces. Also, end the filename with dot html (e.g. filename.html). I personally avoid using the underline and plus characters.
There isn't such a thing as an HTML filename.
Certain characters have to be encoded in HTML (e.g. if used in links), but the allowed characters in document names will depend on the web server (and possibly the file system on the server).
Any file name will be URL-encoded, so you should be fine. And for the record, all three of your file names would work just fine.