SQL: What are all the delimiters? - mysql

I am running some PHP on my website pulling content from a MySQL database. Queries are working and all is good there. The only problem I am having is that when I upload my dataset (.csv format) using phpMyAdmin, an error occurs that drops all contents after a certain row. Supposedly this is caused by MySQL recognizing more columns in that specific row than intended. And unfortunately, this is not just a single occurrence. I cannot figure out exactly what the problem is, but most likely it is caused by some values in the column 'description' containing delimiters that split it into multiple columns. Hopefully, by deleting/replacing all these delimiters the problem can be solved. I am rather new to SQL, though, and I cannot seem to find a source that simply lays out all the potential delimiters I should consider. Is there somebody who can help me out?
Thank you in advance and take care!
Regards

From personal experience, one can delimit by many different things. I've seen pipes (|) and commas (,), as well as tabs, fixed-width spacing, tildes (~), and colons (:).
Taken directly from https://en.wikipedia.org/wiki/Delimiter-separated_values:
"Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used.[5][6] Despite that each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.[citation needed]
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision, others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or only quoting those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; therefore, there are sometimes problems with the files being corrupted when people try to edit them by hand. Another set of problems occur due to errors in the file structure, usually during import of file into a database (in the example above, such error may be a pupil's first name missing).
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With rising prevalence of web sites and other applications that store snippets of code in databases, simply using a " which occurs in every hyperlink and image source tag simply isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere."
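To see how quoting sidesteps delimiter collision in practice, here is a minimal Python sketch (the file name and column values are invented for illustration):

import csv

rows = [
    ["id", "description"],
    [1, 'contains a comma, a "quote" and a ; semicolon'],
    [2, "plain text"],
]

# QUOTE_MINIMAL wraps only the fields that contain the delimiter, the quote
# character, or a newline -- exactly enough to avoid delimiter collision.
with open("dataset.csv", "w", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Reading it back recovers the original fields, embedded commas and all.
with open("dataset.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

If your 'description' values are quoted this way, an importer that is told which character encloses fields (phpMyAdmin's CSV import dialog has an option for this) should no longer split them into extra columns.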

Related

Deal with semicolon separated data in MySQL table

I've got a MySQL DB with multiple values in one column, separated by semicolons. I need to use the first of them. What is the recommended way to deal with this kind of stored data? (Both for this specific problem, and generally: how should semicolon-separated data be used?)
Q: "What is the recommended way to deal with this kind of stored data?"
A: The best recommendation is to avoid storing data as comma-separated lists. (And no, this does not mean we should use semicolons in place of commas as delimiters in the list.)
For an introductory discussion of this topic, I recommend a review of Chapter 2 in Bill Karwin's book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming", which is conveniently available here
https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
and from other fine booksellers. Setting that recommendation aside for a moment...
To retrieve the first element from a semicolon delimited list, we can use the SUBSTRING_INDEX function.
As a demonstration:
SELECT SUBSTRING_INDEX('abc;def;ghi',';',1)
returns
'abc'
The MySQL SUBSTRING_INDEX function is documented here: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_substring-index
I recognize that this might be considered a "link-only answer". A good answer to this question is going to be much longer, giving examples to demonstrate the pitfall of storing comma-separated lists.
If the database will only ever view the comma separated list as a blob of data, in its entirety, without a need to examine the contents of the list, then I would consider storing it, similar to the way we would store a .jpg image in the database.
I would store and retrieve a .jpg image as a BLOB, just a block of bytes, in its entirety. Save the whole thing, and retrieve the whole thing. I'm not ever going to have the database manipulate the contents of the image. I'm not going to ever ask the database to examine the image to discern information about what is "in" the jpg image. I'm not going to ask the database to derive any meaningful information out of it... How many people are in a photo, what are the names of people in a photo, add a person to the photo, and so on.
I will only condone storing a comma-separated (or semicolon-separated) list if we are intending it to be an object, an opaque block of bytes, like we handle a jpg image.
Use
SELECT SUBSTRING_INDEX(column_name, ';', 1) FROM your_table
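If the splitting ends up in application code rather than in the query, the same logic generalizes to any element, not just the first. A minimal Python sketch (the value is hardcoded for illustration):

value = "abc;def;ghi"
parts = value.split(";")   # ['abc', 'def', 'ghi']
print(parts[0])            # 'abc', same result as SUBSTRING_INDEX(value, ';', 1)
print(parts[2])            # 'ghi' -- any element by index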

Is it a bad idea to escape HTML before inserting into a database instead of upon output?

I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
Other similar questions here I've seen look like for cases where HTML can still be used for formatting, so I'm asking for a case where HTML wouldn't be used at all.
You will also restrict yourself by performing the escaping before inserting into your DB. Let's say you decide not to use HTML as output, but JSON, plain text, etc.
If you have stored escaped HTML in your DB, you would first have to unescape the value stored in the DB, just to re-escape it again into a different format.
Also see the excellent OWASP article on XSS prevention.
Yes, because at some stage you'll want access to the original input entered. This is because...
You never know how you want to display it - in JSON, in HTML, as an SMS?
You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it is a regex, then look out for confused users who might type something like this...
3<4 :->
They'll only get the 3 if it is a regex.
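As a quick sketch of that pitfall, here is what a naive strip-tags regex (invented for illustration) does to such input in Python:

import re

def strip_tags(text):
    # Naive approach: remove anything that looks like <...>
    return re.sub(r"<[^>]*>", "", text)

print(strip_tags("3<4 :->"))  # prints '3' -- everything from '<' to '>' was eaten as a "tag"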
Suppose you have the text R&B, and store it escaped as R&amp;B. If someone searches for R&B, it won't match with a search SQL:
SELECT * FROM table WHERE title LIKE ?
The same for equality, sorting, etc.
Or if someone searches for life span, it could return extraneous matches because of escaped &lt;span&gt; tags. Though this is a bit orthogonal, and can be solved by using an external service like Elasticsearch, or by storing a raw-text version in another field, similar to what @limscoder suggested.
If you expose the data via an API, the consumers may not expect the data to be escaped. Adding documentation may help.
A few months later, a new team member joins. As a well-trained developer, he always uses HTML escaping, now only to see that everything is double-escaped (e.g. titles showing up like He said &quot;nuff&quot; instead of He said "nuff").
Some escaping functions have additional options. Forgetting to use the same functions/options while un-escaping could result in a different value than the original.
It's more likely to happen with multiple developers/consumers working on the same data.
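That double-escaping failure mode is easy to reproduce. A sketch using Python's html module for illustration (PHP's htmlspecialchars behaves analogously):

import html

title = 'He said "nuff"'

stored = html.escape(title)     # escaped on input: 'He said &quot;nuff&quot;'
rendered = html.escape(stored)  # escaped again on output, as the new developer expects

print(rendered)  # 'He said &amp;quot;nuff&amp;quot;' -- the entities now display literally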
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.

Storing apostrophes, exclamation marks, etc. in mysql database

I changed from latin1 to utf8. Although all sorts of text was displaying fine, I noticed non-English characters were stored in the database as weird symbols. I spent a day trying to fix that, and finally now non-English characters display as non-English characters in the database and display the same in the browser. However, I noticed that apostrophes are stored as &#39; and exclamation marks as &#33;. Is this normal, or should they be appearing as ' and ! in the database instead? If so, what would I need to do in order to fix that?
It really depends on what you intend to do with the contents of the database. If your invariant is that "contents of the database are sanitized and may be placed directly in a web page without further validation/sanitization", then having &amp; and other HTML entities in your database makes perfect sense. If, on the other hand, your database is to store only the raw original data, and you intend to process/sanitize it before displaying it in HTML, then you should probably replace these entities with the original characters, encoded using UTF-8. So, it really depends on how you interpret your database content.
The &#XX; forms are HTML character entities, implying you passed the values stored in the database through a function such as PHP's htmlspecialchars or htmlentities. If the values are processed within an HTML document (or perhaps by any HTML processor, regardless of what they're a part of), they should display fine. Outside of that, they won't.
This means you probably don't want to keep them encoded as HTML entities. You can convert the values back using the counterpart to the function you used to encode them (e.g. html_entity_decode), which should take an argument as to which encoding to convert to. Once you've done that, check some of the previously problematic entries, making sure you're using the correct encoding to view them.
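For illustration, here is what that decoding step looks like in Python (PHP's html_entity_decode is the analogous call):

import html

stored = "Don&#39;t stop&#33;"

raw = html.unescape(stored)  # decodes the numeric character entities
print(raw)                   # Don't stop!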
If you're still having problems, there's a mismatch between the encoding the stored values are supposed to use and the one they're actually using. You'll have to figure out what they're actually using, and then convert them, either by pulling them from the DB and converting them to the target encoding before re-inserting them, or by re-inserting them labeled with the encoding they actually use. Similar to the latter option is to convert the columns to BLOBs, then change the column character set, then change the column type back to a text type, then directly convert the column to the desired character encoding. The reason for this unwieldy sequence is that text types are converted when changing the character encoding, but binary types aren't.
Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more on character encodings in general, and ยง 9.1.4. of the MySQL manual, "Connection Character Sets and Collations", for how encodings are used in MySQL.

SSIS is forcing Excel data types I don't want. Changing connection string isn't working

I am importing from an Excel spreadsheet and it is not allowing me to change the source column type from Double to Unicode string.
I have tried using "IMEX=1;" in the connection string but this appears to be doing absolutely nothing.
The package refuses to validate and therefore will not run when I execute it, and it keeps wanting to reset the "Input" of the external column to "Float" when I definitely want it "Unicode", even though I've set "Validate" to false.
I must be missing something?!!?
There is no easy answer to this question.
Basically it is a known issue with SSIS. It auto-reads a certain number of rows and decides that is the datatype, and you CANNOT change this (it will keep giving you metadata errors, will not validate, and will keep resetting it to what it has decided it thinks it is). You CAN set the number of rows sampled in the registry (the TypeGuessRows setting, if I recall correctly), but this doesn't solve the underlying problem, as sometimes the file you are importing may or may not contain strings.
I have worked around it by placing a single quote (') character in front of all the entries in the column that I want to be detected as a string. This means that whenever SSIS validates, it will assume the column is Unicode, which is what I want.
Setting "IMEX=1" is only useful when SSIS actually detects alphanumeric records.
Cdonner:
The source datatype matters because it is setting it to numeric when in fact the column may contain strings as well. So it will bomb out when it comes across a string, maybe 1000 rows in.
Sam:
Excel 2003. I believe this is still an issue in SQL Server 2008 SSIS.

What are important points when designing a (binary) file format? [closed]

When designing a file format for recording binary data, what attributes would you think the format should have? So far, I've come up with the following important points:
have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
possibly reserve some space for further per-file attributes that might be necessary in the future?
What else would be useful to make the format more future-proof and minimize headache in the future?
Take a look at the PNG spec. This format has some very good rationale behind it.
Also, decide what's important for your future format: compactness, compatibility, allowing other formats (e.g. different compression algorithms) to be embedded inside it. Another interesting example is Google's protocol buffers, where the size of the transferred data is king.
As for endianness, I'd suggest picking one option and sticking with it, not allowing different byte orders. Otherwise, reading and writing libraries will only get more complex and slower.
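As a sketch of what "pick one endianness and stick with it" looks like in practice, here is a Python header layout (the field names and magic value are invented for illustration); struct's "<" prefix fixes little-endian regardless of the host:

import struct

# Fixed little-endian layout: 4-byte magic, 2-byte version, 4-byte payload length.
HEADER = struct.Struct("<4sHI")

packed = HEADER.pack(b"MYFM", 1, 128)
magic, version, length = HEADER.unpack(packed)
print(magic, version, length)  # b'MYFM' 1 128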
I agree that these are good ideas:
Magic numbers at the beginning. Pretty much required in *nix.
File version number for backwards compatibility.
Endianness specification.
But your fourth one is overkill, because #2 lets you add fields as long as you change the version number (and as long as you don't need forward compatibility).
possibly reserve some space for further per-file attributes that might be necessary in the future?
Also, the idea of imposing a block-structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.
In addition to 1-3 above, I'd add these:
simple checksum or other way of detecting that the contents are intact. Otherwise you can't trust magic bytes or version numbers. Be careful to spec which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection (a sketch follows below).
version of your software (including the most granular number you have, e.g. build number) that wrote the file. You're going to get a bug report with an attached file from someone who can't open it and they will have no clue when they wrote the file because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.
Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
And here are some optional ones:
If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like png does), so that a previous version of your software can skip over them gracefully.
If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a png file.
It all depends on the purpose of the format, of course.
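Picking up the checksum point from the list above, here is a hedged Python sketch (the magic value, version field, and layout are invented for illustration) that appends a CRC32 over everything preceding it:

import struct
import zlib

MAGIC = b"MYFM"

def write_file(path, payload):
    body = MAGIC + struct.pack("<H", 1) + payload        # magic + version + data
    with open(path, "wb") as f:
        f.write(body + struct.pack("<I", zlib.crc32(body)))  # trailing 4-byte CRC32

def read_file(path):
    with open(path, "rb") as f:
        data = f.read()
    body, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch: file is corrupt")
    if body[:4] != MAGIC:
        raise ValueError("bad magic bytes")
    return body[6:]                                      # strip magic + version

write_file("demo.bin", b"\x00\xff payload")
print(read_file("demo.bin"))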
One flexible approach is to structure entire file as TLV (Tag-Length-Value) triplets.
For example, make your file consist of records, each record beginning with a 4-byte header:
1 byte = record type
3 bytes = record length
followed by record content
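A minimal Python sketch of that record layout (the tag values are invented for illustration): a 1-byte type and a 3-byte length packed into one big-endian word, followed by the content.

import struct

def write_record(buf, rtype, content):
    assert len(content) < 1 << 24           # length must fit in 3 bytes
    buf += struct.pack(">I", (rtype << 24) | len(content)) + content

def read_records(data):
    pos = 0
    while pos < len(data):
        (word,) = struct.unpack_from(">I", data, pos)
        rtype, length = word >> 24, word & 0xFFFFFF
        yield rtype, data[pos + 4 : pos + 4 + length]
        pos += 4 + length                   # skip straight to the next record

buf = bytearray()
write_record(buf, 1, b"hello")
write_record(buf, 2, b"\x01\x02\x03")
print(list(read_records(bytes(buf))))       # [(1, b'hello'), (2, b'\x01\x02\x03')]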
Regarding the endianness: if you store an endianness indicator in the file, all your applications will have to support all endianness formats. On the other hand, if you specify a particular endianness for your files, only applications on platforms with non-matching endianness will have to do additional work, and it can be decided at compile time (using conditional compilation).
Another point, taken from the .xz file spec (http://tukaani.org/xz/xz-file-format.txt): one of the first few bytes should be a non-character, "to prevent applications from misdetecting the file as a text file." Not sure how many header bytes are usually inspected by editors and other tools, but using a non-text byte in the first four or eight bytes seems useful.
One of the most important things to know before even starting is how your file will be used.
Will random or sequential access be the norm?
How often will the data be read?
How often will the data be written?
Will you write out the file in one go, or will you be slowly writing it as data comes in?
Will the file need to be portable? Not all formats need to be.
Does it need to be compatible with other versions? Maybe updating the file is sufficient.
Does it need to be easy to read/write?
Size/speed/complexity tradeoff.
Most answers here give good advice on the portability/compatibility front, so I am not going to add more. But consider the following (often overlooked) things.
Some files are often written and rarely read (backups, logs, ...) and you may want to focus on filesize and easy-writing.
Converting endianness is (relatively) slow. If your file will never leave the host, or leaves it rarely enough that conversion is an acceptable option, you can get a significant performance boost by using the native byte order. Consider writing a known number such as 0x1234 as part of the header so that you can detect a byte-order mismatch (and instruct the user to convert) when it happens.
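A sketch of that byte-order probe in Python (the header layout is hypothetical):

import struct

BOM = 0x1234  # written by the producer in its native byte order

def needs_swap(header):
    (value,) = struct.unpack("=H", header[:2])  # read in *our* native order
    if value == BOM:
        return False        # writer had the same endianness as us
    if value == 0x3412:
        return True         # byte-swapped: reader must convert
    raise ValueError("not a valid header")

print(needs_swap(struct.pack("=H", BOM)))  # False when written on the same host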
Sometimes easy reading is really useful. If you are doing logs or text documents, consider compressing all in one go rather than per-entry so that you can zcat | strings the file and see what is inside.
There are many things to keep in mind and designing a good format takes a lot of planning and foresight. The little things such as zcating a file and getting useful information or the small performance boost from using native integers can give your product an edge, however you need to be careful that you don't sacrifice something important to get it.
One way to future proof the file would be to provide for blocks. Straight after your file header data, you can begin the first block. The block could have a byte or word code for the type of block, then a size in bytes. Now you can arbitrarily add new block types, and you can skip to the end of a block.
I would consider defining a substructure that higher levels use to store data, a little like a mini file system inside the file.
For example, even though your file format is going to store application-specific data, I would consider defining records / streams etc. inside the file in such a way that application-agnostic code is able to understand the layout of the file, but not of course understand the opaque payloads.
Let's get a little more concrete. Consider the usual ways of storing data in memory: generally they can be boiled down to contiguous expandable arrays/lists, pointer/reference-based graphs, or binary blobs of data in particular formats.
Thus, it may be fruitful to define the binary file format along similar lines. Use record headers which indicate the length and composition of the following data, whether it's in the form of an array (a list of identically-typed records), references (offsets to other records in the file), or data blobs (e.g. string data in a particular encoding, but not containing any references).
If carefully designed, this can permit the file format to be used not just for persisting data in and out all in one go, but on an incremental, as-needed basis. If the substructure is properly designed, it can be application agnostic yet still permit e.g. a garbage collection application to be written, which understands the blobs, arrays and reference record types, and is able to trace through the file and eliminate unused records (i.e. records that are no longer pointed to).
That's just one idea. Other places to look for ideas are in general file system designs, or relational database physical storage strategies.
Of course, depending on your requirements, this may be overkill. You may simply be after a binary format for persisting in-memory data, in which case an approach to consider is tagged records.
In this approach, every piece of data is prefixed with a tag. The tag indicates the type of the immediately following data, and possibly its length and name. Lists may be suffixed with an "end-list" tag that has no payload. The tag may have an embedded identifier, so tags that aren't understood can be ignored by the serialization mechanism when it's reading things in. It's a bit like XML in this respect, except using binary idioms instead.
Actually, XML is a good place to look for long-term longevity of a file format. Look at its namespacing capabilities. If you construct your reading and writing code carefully, it ought to be possible to write applications that preserve the location and content of tagged (recursively) data they don't understand, possibly because it's been written by a later version of the same application.
Make sure that you reserve a tag code (or better yet reserve a bit in each tag) that specifies a deleted/free block/chunk.
Blocks can then be deleted by simply changing a block's current tag code to the deleted tag code, or by setting the tag's deleted bit.
This way you don't need to completely restructure your file right away when you delete a block.
Reserving a bit in the tag provides the option of undeleting the block
(if you leave the block's data unchanged).
For security, however, you might want to zero out the deleted block's data; in that case you would use a special deleted/free tag.
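A small sketch of the deleted-bit idea (the bit position is chosen arbitrarily for illustration):

DELETED = 0x80  # reserve the top bit of a 1-byte tag as the deleted flag

def delete(tag):
    return tag | DELETED       # block stays in place, data untouched

def undelete(tag):
    return tag & ~DELETED

def is_deleted(tag):
    return bool(tag & DELETED)

tag = 0x01
print(is_deleted(delete(tag)), is_deleted(undelete(delete(tag))))  # True False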
I agree with Stepan that you should choose an endianness, but I would also have an endianness indicator in the file.
If you use an endianness indicator, you might consider using one of the Unicode byte order marks (BOMs), which doubles as an indicator of the Unicode text encoding used for any text blocks. The BOM is usually the first few bytes of Unicode text files, so if your BOM is the first entry in your file there might be a problem of some utility identifying your file as Unicode text (I don't think this is much of an issue).
I would treat/reserve the BOM as one of your normal tags (using either the UTF-16 BOM if using 16-bit tags or the UTF-32 BOM if using 32-bit tags) with a 0-length block/chunk.
See also http://en.wikipedia.org/wiki/File_format
I agree with atzz's suggestion of using a Tag Length Value system. For future compatibility, you could store a set of "pointers" to TLV entries at the start (or maybe Tag,Pointer and have the pointer point to a Length,Value; or perhaps Tag,Length,Pointer and then have all the data together elsewhere?).
So, my file could look something like:
magic number/file id
version
tag for first data entry
pointer to first data entry --------+
tag for second data entry |
pointer to second data entry |
... |
length of first data entry <--------+
value for first data entry
...
Magic number, version, tags, pointers and lengths would all be of a predefined, fixed length, for easy decoding. Say, 2 bytes. Or 4, depending on what you need. They don't all need to be the same (e.g., all tags are 1 byte, pointers are 4, etc.).
The tag lets you know what is being stored. The pointer tells you where (either an offset or absolute value, in bytes), the length tells you how large the data is, and the value is length bytes of data of type tag.
If you use a MyFileFormat v1 decoder on a MyFileFormat v2 file, the pointers allow you to skip sections which the v1 decoder doesn't understand. If you simply skip invalid tags, you can probably simply use TLV instead of TPLV.
I would either hand code something like that, or maybe define my format in ASN.1 and generate a codec (I work in telecommunications, so ASN.1/TLV makes sense to me :-D)
If you're dealing with variable-length data, it's much more efficient to use pointers: Have an array of pointers to your data, ideally near the start of the file, rather than storing the data in an array directly.
Indirection is preferable in this instance because it allows random access, which is otherwise only possible if all items are the same size. If the data were stored directly in an array, without specifying the locations of any records, data access would take O(n) time in the worst case; in order for your file-reading code to access a particular element, it would have to know the lengths of all previous elements, and the only way to find that out is to look at each one. If you're reading the entire file at once anyway, then this isn't a problem. But if you only want one thing, it isn't the way to go.
Whereas with an array of pointers, it's O(1) time all around: all you need is an index number, and you can retrieve and follow the pointer to get at your data.
When writing a file using this method, you would of course have to build up your table in memory before doing any writing.
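Here is a hedged Python sketch of that approach (the layout is invented for illustration): a count and an offset table at the start of the file let a reader seek straight to record i without scanning its predecessors.

import struct

def write(path, records):
    with open(path, "wb") as f:
        table_size = 4 + 4 * len(records)        # count + one offset per record
        offsets, pos = [], table_size
        for r in records:                        # compute absolute offsets first
            offsets.append(pos)
            pos += 4 + len(r)                    # 4-byte length prefix + data
        f.write(struct.pack("<I", len(records)))
        f.write(struct.pack(f"<{len(records)}I", *offsets))
        for r in records:
            f.write(struct.pack("<I", len(r)) + r)

def read_one(path, index):
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        assert index < count
        f.seek(4 + 4 * index)                    # jump into the offset table
        (offset,) = struct.unpack("<I", f.read(4))
        f.seek(offset)                           # O(1): no scan of earlier records
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)

write("demo.idx", [b"first", b"second record", b"third"])
print(read_one("demo.idx", 1))                   # b'second record'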