I am writing a data cleanup script (MS Smart Quotes, etc.) that will operate on MySQL tables encoded in Latin1. While scanning the data I noticed a ton of 0D 0A where the line breaks are.
Since I am cleaning the data, should I also address all of the 0D bytes by removing them? Is there ever a good reason to keep 0D (carriage return) anymore?
Thanks!
0D 0A (\r\n) and 0A (\n) are line terminators; \r\n is mostly used on Windows, \n on Unix systems.
Is there ever a good reason to keep 0D anymore?
I think you should answer this question yourself.
You could remove '\r' from the data, but make sure that the programs that will use this data understand '\n' alone as the end of a line. In most cases it is handled, but check just in case.
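If you do go that route, here is a minimal sketch of the normalization step (plain Python, assuming the text has already been fetched from the table as a string; the database round trip is left out):

# Collapse Windows (\r\n) and bare-\r line endings to a single \n.
def normalize_newlines(text):
    return text.replace('\r\n', '\n').replace('\r', '\n')

print(repr(normalize_newlines('line one\r\nline two\rline three\n')))
# -> 'line one\nline two\nline three\n'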
The CR/LF combination is a Windows thing. *NIX operating systems just use LF. So, based on the application that uses your data, you'll need to decide whether you want/need to filter out the CRs. See the Wikipedia entry on newline for more info.
Python's readline() returns a line terminated by a \012 (an octal escape; octal 12 is decimal 10). You can see on the ASCII table that decimal 10 is NL or LF, newline or line feed.
That is the standard end-of-line in a Unix text or script file.
http://www.asciitable.com/
So be aware that len() of the returned line includes the NL; unless you read past the EOF, len() will never be zero.
Therefore, if you INSERT a line of text obtained from Python's readline() into a MySQL table, it will by default include the NL character at the end.
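A quick sketch of the point (the file name is a placeholder); strip the newline yourself before building the INSERT if you don't want it stored:

# readline() keeps the trailing newline, so strip it before any INSERT.
with open('input.txt') as fin:          # 'input.txt' is a placeholder
    line = fin.readline()
    print(len(line))                    # includes the trailing \n (\012)
    line = line.rstrip('\r\n')          # drop the LF and any preceding CR
    print(len(line))                    # length of the bare text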
I'm trying to send a command which includes spaces, and I got the following error:
"system1 -loadChange "+10" -portAddress "10.10.X.X/1/15â€"
Each space is replaced by Â.
The full command is:
av::perform ChangeLoadForPort system1 -loadChange "+10" -portAddress "10.10.X.X/1/15”
X.X is the IP.
The sequence Â followed by a space is the ISO 8859-1 interpretation of the sequence of bytes that make up a non-breaking space in UTF-8. (It's the same with ISO 8859-15 and a few other encodings.) The â€ at the end of the error message is another tell-tale. That you're seeing this points to two problems:
Various bits and pieces of the systems you're dealing with are disagreeing on what encodings are in use. That's Bad News, but it would be only a theoretical problem if your code stuck to ASCII. (Almost all encodings have the majority of ASCII as a subset.)
Your code has unexpected non-ASCII characters in it, and that's just not going to work. You're probably being sabotaged by whatever program you're using to edit text; a programmer's editor is strongly recommended, and not a word processor.
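One way to hunt down the stray bytes is to scan the script file for anything outside the ASCII range. A minimal Python 3 sketch (the file name is a placeholder for whatever script you're editing):

# Report every byte above 0x7F together with its offset in the file.
with open('myscript.tcl', 'rb') as f:   # 'myscript.tcl' is a placeholder name
    data = f.read()
for offset, byte in enumerate(data):
    if byte > 0x7F:
        print('offset %d: byte 0x%02X' % (offset, byte))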
I am writing a CSV parser and I want it to comply with this standard. It states:
Each record is located on a separate line, delimited by a line break (CRLF)
How should I handle rows ending with only a CR or LF character? Should I treat them as literals and pass them through to the field, interpret them as a row end, or consider the file malformed?
I guess the most flexible solution would be to accept either type of line end, but I am trying to figure out what the standard says.
What do you think about it?
You should certainly not treat them as malformed, because there can be different line endings on Linux, Windows and Mac for example.
It's better to support them all.
Also, fields can contain newlines if they are properly quoted, so you'll need to handle that too.
For example:
123,"test on 2
lines",456
is a valid csv row.
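For comparison, Python's csv module takes exactly this permissive approach: it accepts \r\n or \n row endings and keeps quoted newlines inside the field. A small illustration (the sample data is made up):

import csv, io

data = '123,"test on 2\nlines",456\r\n789,plain,000\n'
for row in csv.reader(io.StringIO(data)):
    print(row)
# ['123', 'test on 2\nlines', '456']
# ['789', 'plain', '000']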
We have an application called JIRA running on Windows using MSSQL and I need to migrate it to Linux/MySQL. The character encoding in the existing MSSQL db is latin1 but I need to use UTF-8 in MySQL.
I take an XML dump of the MSSQL data using a backup mechanism provided by the application and run it through a Python filter to convert the encoding from Latin1 to UTF-8. Here is the Python code that was provided to me by my colleague.
#!/usr/bin/python
import codecs, re

# Match characters outside the Basic Multilingual Plane. On narrow Python
# builds the wide-range pattern raises re.error, so fall back to matching
# surrogate pairs instead.
try:
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

#fin = codecs.open('unicodestuff.txt', encoding='utf-8', errors='replace')
fin = codecs.open('entities.xml', encoding='latin1')
fout = codecs.open('stripped.xml', encoding='utf-8', mode='w', errors='replace')

# Re-encode line by line, stripping any supplementary-plane characters.
for line in fin:
    line = highpoints.sub(u'', line)
    fout.write(line)
fin.close()
fout.close()
I take the filtered XML dump and restore the data using a "restore" mechanism in the application. However, after restoring the data, I spot-checked a few records on the MySQL side and I see some weird characters, which I am assuming are related to character encoding. For example,
On the MSSQL side, the text string is
““Number of debits exceeds maximum of 0”
“2-Restrict All Credits”
Default ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง
Branch : 724 มาบุญครอง
whereas on the MYSQL side, the corresponding text appears as
â??â??Number of debits exceeds maximum of 0â?
â??2-Restrict All Creditsâ?
Default à¸à¸à¸à¸à¸£à¸°à¹à¸ à¸à¸à¸±à¸à¸à¸µà¸à¸¹à¸à¸à¹à¸à¸ à¹à¸à¹à¹à¸¥à¸à¸à¸±à¸à¸à¸µà¹à¸¡à¹à¸à¸¹à¸à¸à¹à¸à¸
Branch : 724 มาà¸à¸¸à¸à¸à¸£à¸à¸
Can you please provide me some ideas to fix these character encoding issues? Kindly let me know if additional information is required.
Thanks
Sam
Clearly your XML file does not actually use the Latin-1 character set. You've shown that text such as "ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง" is present in it. The Latin-1 character set does what it says on the label: it represents letters from Latin alphabets. Those letters do not exist in it. If the headers in your XML file claim that it's in Latin-1, then those headers are untrue and the XML is, strictly speaking, not valid. But it might still be usable.
Now the problem is, what character encoding is that XML file actually using? To find out, you may have to examine the XML file in hexadecimal. There are three main possibilities: (1) it's using an old codepage such as 874 which contains these characters; (2) it's using UTF-16; (3) it's using UTF-8.
If you examine in hexadecimal a section of the XML which contains some of this non-Latin text, with some Latin letters nearby, here's what you might see. If it's in a codepage such as 874, each Latin letter will be one byte with a value from 32 to 7F, and each non-Latin letter will be one (or possibly two?) bytes with values from 80 to FF. If it's in UTF-16, each Latin letter will be two bytes, one from 32 to 7F and the other always 00, and the non-Latin letters will be two bytes with neither being 00. If it's in UTF-8, the Latin letters will be one byte from 32 to 7F, and the non-Latin letters will be (probably) three bytes, all from 80 to FF.
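If you don't have a hex viewer handy, a few lines of Python 3 will do the same job (the file name and offset are placeholders; seek to somewhere near the Thai text):

# Dump 64 raw bytes starting at a chosen offset so you can eyeball the encoding.
with open('entities.xml', 'rb') as f:   # adjust the name and offset as needed
    f.seek(0)
    chunk = f.read(64)
print(' '.join('%02X' % b for b in chunk))
# Latin letters at 32-7F with non-Latin letters as runs of three 80-FF bytes
# points to UTF-8; pairs containing 00 bytes point to UTF-16.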
There may be an alternative to examining hexadecimal. Some text editor programs can save text files in your choice of encoding formats. TextPad 7, for instance, can save as ANSI, DOS, UTF-8, Unicode, or Unicode (big-endian). The latter two options are actually UTF-16. Try loading the XML into such a program, and saving copies of it as UTF-8 and as Unicode. One of these copies should be the same size as the original (plus or minus two or three bytes), and the other will be a different size. Whichever matches the size is probably the correct format. If both differ, then you've got something weird.
Anyway, if you save a version as UTF-8 and then are able to open it and see your data intact, you should then be able to import that without using a Python translator.
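As a sanity check on the diagnosis, you can reproduce the kind of garbage shown in the question by taking UTF-8 bytes and mis-decoding them as Latin-1 (Python 3, using one of the strings from the question):

# Mis-decode UTF-8 bytes as Latin-1 to reproduce the observed mojibake.
original = u'\u201cNumber of debits exceeds maximum of 0\u201d'
mangled = original.encode('utf-8').decode('latin-1')
print(repr(mangled))
# Each curly quote becomes 'â' plus two control bytes, which many tools
# display as â?? - much like the pattern seen on the MySQL side.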
We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except for one thing: some of the characters are represented as question marks. This: "He wasn’t ready" shows up like this: "He wasn?t ready", but only for some characters (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16 except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
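If you end up cleaning the text after export rather than fixing the export itself, a minimal Python sketch in the same spirit might look like this (it only covers the common "smart" punctuation, unlike demoroniser.pl):

# Map the common "smart" punctuation back to plain ASCII equivalents.
SMART_TO_ASCII = {
    u'\u2018': "'", u'\u2019': "'",     # left/right single quotes
    u'\u201c': '"', u'\u201d': '"',     # left/right double quotes
    u'\u2013': '-', u'\u2014': '--',    # en dash, em dash
    u'\u2026': '...',                   # ellipsis
}

def unsmart(text):
    for smart, plain in SMART_TO_ASCII.items():
        text = text.replace(smart, plain)
    return text

print(unsmart(u'He wasn\u2019t ready'))   # He wasn't ready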
It's most likely due to the fact that the data in the Access file is Unicode, and MDB Tools is trying to convert it to ASCII/Latin/ISO-8859-1 or some other encoding. Since those encodings don't map all the Unicode characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.
Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes; isn't that similar to plaintext, which is just a series of bytes interpreted as characters? So, can plaintext store the same data formats / protocols as binary?
A plain text file is human-readable; a binary file is usually unreadable by a human, since it's composed of both printable and non-printable characters.
Try to open a jpeg file with a text editor (e.g. notepad or vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand; a binary file is not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary file which is organized in byte sequences that a particular computer system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning is that the system must be a "simple" text editor/viewer, usually implying ASCII encoding. But, really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, and the vi text editor reading an ASCII-encoded file.
Within a programming context, your programming environment (comprising the OS + system APIs + your language's capabilities) defines both a set of "text" characters and a set of encodings it is able to read and convert to these "text" characters. Please note that this does not necessarily imply ASCII, English, or 8 bits; as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certain control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagically append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
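A small Python 3 illustration of the newline translation described above (the file name is a placeholder):

# Write Windows-style line endings, then read the same bytes back in
# text mode and in binary mode.
with open('demo.txt', 'wb') as f:
    f.write(b'hello\r\nworld\r\n')

with open('demo.txt', 'r') as f:        # text mode: \r\n is translated to \n
    print(repr(f.read()))               # 'hello\nworld\n'

with open('demo.txt', 'rb') as f:       # binary mode: bytes come back untouched
    print(repr(f.read()))               # b'hello\r\nworld\r\n'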
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, or \r\n, or \n\r. Sometimes it is OS-dependent (like *nix likes \n, while windows likes \r\n). The line ending may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects
probably some other conversions should be added to this list, but I can't think of any more at this moment
Technically, nothing: plain text is a form of binary data. However, a major difference is how values are stored. Think of how an integer might be stored. In binary form it would use a two's complement representation, probably taking 32 bits of space. In text form the number would instead be stored as a series of Unicode digit characters. So the number 50 would be stored as 0x32 (padded out to 32 bits) in binary, but as the characters '5' and '0' in plain text.
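A quick Python illustration of the same point, using the standard struct module:

import struct

# The number 50 as a 32-bit little-endian integer vs. as text characters.
print(struct.pack('<i', 50))    # b'2\x00\x00\x00' - byte 0x32 plus padding
print(b'50')                    # bytes 0x35 0x30 - the digit characters '5' and '0'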