mysqldump, avoid escape double quotes - mysql

When I have a field with double quotes, mysqldump put the escape character before. For example:
'My "test"' -> mysqldump generates a file with that field like 'My \"test\"'
The problem is that I'm using that file to import some data into a sqlite database, and SQLite doesn't remove the escape character. So I don't need that mysqldump writes escape characters, can I do that?

If on a *nix (unix, linux, mac) machine, I can suggest using sed to replace \" with ":
mysqldump [options] DB [table] | sed -e 's/\\"/"/g'
or, for an exsiting mydump.sql file:
sed -i -e 's/\\"/"/g' mydump.sql
EDIT: searching for a formal MySQL answer , I only found this somewhat related open bug about single quote escaping - https://bugs.mysql.com/bug.php?id=65941
It's also worth mentioning that there are other tools that can dump DBs without escaping quotes with backslash in the first place, such as JetBrains products (DataGrip, IntelliJ etc).

I am unsure what language you are using.
But the concept would be the same in any programming language.
First declare a variable equal to your sqldump.
var yourVariable = sqldump;
Then do a string.replace("\\", "").
Then use that 'cleaned' version to import your data with.

From a google search I found this which is reworking the output of mysqldump so sqlite can use it.
Not tested and 5 years old but I think:
you may find useful information
it may mean other people didn't find a more elegant solution
EDIT:
As per the comments, to answer your question it looks like you don't have a choice but replace \" into ".
Here is a repo with a script which translates mysqldump files into something sqlite can digest.
In particular, for your question, you can find this line:
gsub( /\\"/, "\"" )

Which SQL mode do you use?
Look at the mysqldump manual: http://dev.mysql.com/doc/refman/5.7/en/mysqldump.html
--quote-names, -Q
Quote identifiers (such as database, table, and column names) within
“`” characters. If the ANSI_QUOTES SQL mode is enabled, identifiers
are quoted within “"” characters. This option is enabled by default.
It can be disabled with --skip-quote-names, but this option should be
given after any option such as --compatible that may enable
--quote-names.
--quote-names is enabled by default.
ANSI_QUOTES
Treat “"” as an identifier quote character (like the “`” quote
character) and not as a string quote character. You can still use “`”
to quote identifiers with this mode enabled. With ANSI_QUOTES enabled,
you cannot use double quotation marks to quote literal strings,
because it is interpreted as an identifier.

You can convert escapes in the mysqldump output file.
Check esperlu MySQL to Sqlite converter script:
...
/INSERT/ {
gsub( /\\\047/, "\047\047" )
gsub(/\\n/, "\n")
gsub(/\\r/, "\r")
gsub(/\\"/, "\"")
gsub(/\\\\/, "\\")
gsub(/\\\032/, "\032")
print
next
}
...

Related

how can I pass a json as a command after -x in wscat [duplicate]

Should or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, it is also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there was, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting);
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[#]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quoted like "$var" for safe, unless I am sure that $var does not contain space.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.

When use parameters in JSON POST data in Jmeter, another json entity is changed [duplicate]

I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?
Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.PCRE compatibility may vary
    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]
BRE/ed/grep/sed
    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.
POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.
Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.
Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:
https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/
For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()
To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.
Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...
Each of this context assigned some characters with special functionality.
When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s).
Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.
E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file.
For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).
You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).
For Ionic (Typescript) you have to double slash in order to scape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.
to avoid having to worry about which regex variant and all the bespoke peculiarties, just use this generic function that covers every regex variant other than BRE (unless they have unicode multi-byte chars that are meta) :
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-#{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[#] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\#ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator
Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ ~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/

remove \r\n from within double quoted field in csv file

I have a large csv file with about 100 double quoted "text fields" per line. Many of the lines have a \r\n embedded in a double quoted field. The \r\n pair is also used for line termination.
How can the \r\n pairs be removed from the double quoted fields and not impact the \r\n line terminations.
I have tried creating individual sed scripts to identify the particular embedded sequences. That sort of worked, but the number of scripts became unmanageable. I have also tried using a 'tr -d '\r' command, that did not work.
For this purpose you can use perl.
perl -0pe 's/"(.*?)([[:space:]]+)(.*?)"/"\1 \3"/g' input.csv > output.csv
perl -pe 's/../../g' file is analogous to sed -e 's/../../g' file but using Perl compatible regular expressions (PCRE). See this or this for more details.
With the option zero perl -0 we are considering the null character as input record separator (see xargs -0 ... or find . -print0 ...).
All versions of sed supports Basic Regular Expressions (BRE) and some versions, as GNU sed (using sed -E), supports Extended Regular Expressions (ERE), a superset of BRE. Both in BRE and ERE the quantifiers * and + are always greedy, they match as many characters as possible. In some contexts, however, we need to match as few characters as possible. With PCRE we can do the quantifiers * and + lazy or minimal or reluctant using *? and +? respectively.
In order to remove \r\n inside double quotes preserving any other line break, we use lazy matching: "(.*?)([[:space:]]+)(.*?)". We also use backreferences "\1 \3" omitting the second matched parentheses which contains any sequence of white spaces (characters \t,\r, \n, \v or \f), in particular sequences of \r\n.
This approach works in CSV files in which "lines have a \r\n embedded in a double quoted field" as is your case. If input.csv contains cells with three or more paragraphs separated by \r\n, we must modify the regular expression.

HTML::Entities encoding and single ampersand

I'm attempting to use the following line of perl, as described here: Does anyone know of a vim plugin or script to convert special characters to their corresponding HTML entities? - to encode HTML entities in Vim.
%!perl -p -i -e 'BEGIN { use HTML::Entities; use Encode; } $_=Encode::decode_utf8($_) unless Encode::is_utf8($_); $_=Encode::encode("ascii", $_, sub{HTML::Entities::encode_entities(chr shift)});'
It works fine (£ to &pound, curly quotes etc.) except for an ampersand on it's own - & - which is left as it is.
I've tried removing the uf8 decoding, and looked at the CPAN documentation for HTML::Entities.
Answer:
#ZyX has answered the original question, but as others have pointed out in the comments, this is redundant as it's not actually necessary to use HTML entities if you are serving pages with a UTF-8 character set (which I am, both with the meta tag -
<meta charset="utf-8">
and also in the Apache configuration:
AddDefaultCharset utf-8
Indeed it's arguably a bad thing adding them in such cases; the filesize is bigger and the text is obfuscated should anyway want to make use of the source code.
It's essential you ensure whatever editor(s) you use to create files are writing them in UTF-8 as well.
My answer was only encoding characters that are above ascii range. If you want to encode something as html, you should use
$text=HTML::Entities::encode_entities($text);
:
%!perl -MHTML::Entities -MEncode -p -i -e '$_=Encode::decode_utf8($_) unless Encode::is_utf8($_); $_=HTML::Entities::encode_entities($_);'
I was not using this in that answer because TS only requested to encode unicode characters without encoding <, >, & as well.
By the way, you may use $text=HTML::Entities::encode_entities($text, '<>&"'); to encode only really unsafe characters (though I guess this is easily expressed with vimscript:
:let entities={'<': 'lt', '>': 'gt', '&': 'amp', '"': 'quot'}
:execute '%s/['.escape(join(keys(entities), ''), '\-]^').']/\="&".entities[submatch(0)].";"/g'
perl -MHTML::Entities -i -e 'print encode_entities shift'
should work, doesn't it?

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii library that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...