MS Access - CSV TransferText Import Spec, non-ASCII delimiter?

I receive a CSV file from a 3rd party that I need to import into Access. They claim they are unable to add any sort of text qualifier, and all of my common delimiter options (comma, tab, pipe, $, ~, ^, etc.) seem to appear in the data, so none is reliable for an Import Spec. I cannot edit the data, but we can adjust the delimiter. Record counts are in the 500K range x 50 columns (about 250 MB).
I tried a non-ASCII character as a delimiter (e.g., ÿ). I can add it to an Import Spec and the sample data appears to delimit OK, but I get an error (Subscript out of Range) when attempting the actual import. I also tried a multi-character delimiter, but no go.
Any suggestions to let me receive these CSV tables? It's a daily task, with many low-skilled users at remote locations, and the import function sits behind a button.
Sample raw data, truncated for width (June 7; not sure if this helps the discussion):
9798ÿ9798ÿ451219417ÿ9033504ÿ9033504ÿPUNCH BIOPSY 4MM UNI-PUNCH SS SEAMLS RAZOR SHARP BLADE...
9798ÿ9798ÿ451219418ÿ1673BXÿ1673BXÿCLEANER INST 1GL KLENZYME LATEXÿSTERIS PLCÿ1673BXÿ1673BX...
9798ÿ9798ÿ451219419ÿA4823PRÿA4823PRÿBAG BIOHAZ THK1.3 MIL 24X23IN RED LDPE PRINT INF WASTE...
9798ÿ9798ÿ451219420ÿCUR9225ÿCUR9225ÿGLOVE EXAM CURAD MEDIUM LATEX FREEÿMEDLINE INDUSTRIES,...
9798ÿ9798ÿ451219421ÿCUR9226ÿCUR9226ÿGLOVE EXAM CURAD LARGE LATEX FREEÿMEDLINE INDUSTRIES, ...
9798ÿ9798ÿ451219422ÿ90176101ÿ90176101ÿDRAPE CONSUMABLE PK EQUIP OEC UROVIEW 2800 STERILE L...

Try another extended-ASCII character (128 - 254). The chosen delimiter ÿ (255) apparently doesn't work, but it's already a suspicious character since it has all bits set and sometimes has special meaning for that reason.
It's also good to consider the code page. If you're in the US using the standard English version of Windows, it's likely that Access is using the default "Western European (Windows)" (Windows-1252) code page. But if you're outside the US or have other languages installed, your particular default code page may treat certain characters differently. For reference, I'm using Access 2013 on Windows 10. In the Access text import wizard, clicking the [Advanced...] button shows more options, including the selection of the import code page. Since you're having problems with the import, it is worth inspecting that setting.
For the record, I had similar results as you and others using the sample data and delimiter ÿ (255).
Next I tried À (192) which is a standard letter character in various code pages, so it should likely work even if the default were not Windows-1252. Indeed, it worked on my system and resulted in no errors.
To get the import working without errors at first, I would make all fields Short Text or Long Text before specifying integer, date, or other non-text types. If all text columns work, then try the specific field types. In this way, you can at least differentiate between delimiter errors and other data errors.
This isn't to discourage other options like fixed-width text, especially since in that case you won't have to worry about the delimiter at all.
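If it ever turns out that the 3rd party cannot switch the delimiter and you have to re-delimit the file yourself before running TransferText, a pre-processing pass can swap ÿ (255) for À (192). This is only a sketch of that idea in Java - the file paths are placeholders - and it assumes ÿ never appears inside the data itself (it is, after all, the current delimiter):

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Redelimit {
    public static void main(String[] args) throws Exception {
        // Read and write as Windows-1252 so every extended-ASCII byte round-trips unchanged.
        Charset cp1252 = Charset.forName("windows-1252");
        String text = new String(Files.readAllBytes(Paths.get("C:/data/daily.csv")), cp1252);
        // Swap the delimiter: ÿ (0xFF) -> À (0xC0). For a 250 MB file, a line-by-line
        // streaming copy would be kinder to memory, but this shows the idea.
        Files.write(Paths.get("C:/data/daily_fixed.csv"),
                text.replace('\u00FF', '\u00C0').getBytes(cp1252));
    }
}

The Import Spec then only needs À as the field delimiter.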


In SQL tables, should I, for example, have "é" or should I have "e´"?

I have tried in vain to look up relevant questions. They are beyond my pay grade. I am not a professional. To explain this a bit more: in the HTML that I wrote, the em dash would be "& #151;" (that space inserted so it would not show up as an actual em dash). It ended up in the tables (someone else was doing that work) as "—". Those are not showing up correctly when searches are done using PHP; I only get a question-mark symbol. I do have my SQL account set to Unicode.
Take a philosophical stand: The datastore (database table) should contain data, not some special encoding of the data.
The "data" is é
When you display that in HTML, you might need to convert it to e´. However, modern browsers have no problem if é is encoded as UTF-8.
If you choose to use "html entities", then have your application do the conversion after fetching é from the table. PHP has the function htmlentities() specifically for that task.
But I still have not addressed what byte(s) are in the table to represent é. These days, you 'should' use UTF-8 (aka MySQL's utf8mb4). That would be the two hex bytes C3A9, which can be discovered using SELECT HEX(col) .... If you use the old default, latin1, the hex would show E9.
A related question is whether you should store html 'tags' or construct the html on the fly after fetching the data. So, let me give you three philosophies; you pick which to apply:
The table contains pure data; formatting, etc, is done after fetching and before delivering to the user's browser.
The table contains an 'opaque' image of what needs to be sent to the browser -- complete with tags, entities, etc. With this approach, you may as well call it a BLOB, not TEXT.
Some compromise between those. Note: The use of CSS can avoid too much hard-coding of formatting before storing into the database.
Also, the first choice is much cleaner for searching. This may lead you to pick it. However, another approach is to have two columns -- one aimed at delivering mostly-formatted output; the other for searching (tags removed, no entities, etc). The second would be mostly text, but you probably could not generate a web page (with links, paragraphs, etc) from it.
é -- different strokes for different folks:
é in latin1 (not advised) -- hex E9 -- 1 byte
é in utf8 -- hex C3A9 -- 2 bytes
\u00E9 -- Unicode codepoint -- 6 bytes
&eacute; -- HTML entity (see PHP's htmlentities()) -- 8 bytes
%C3%A9 -- PHP's urlencode() (for URLs) -- 6 bytes
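If it helps to see those representations produced concretely, here is a small, self-contained Java sketch (the class name is arbitrary; expected output is in the comments) that derives each of them from the single character é. Java has no built-in htmlentities(), so the &eacute; form is only noted in a comment:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EAcuteForms {
    public static void main(String[] args) throws Exception {
        String s = "\u00E9"; // é
        // UTF-8 bytes: C3 A9
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        // latin1 (ISO-8859-1) byte: E9
        for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        // Unicode codepoint: U+00E9
        System.out.printf("U+%04X%n", (int) s.charAt(0));
        // URL-encoded form, as PHP's urlencode() would give: %C3%A9
        System.out.println(URLEncoder.encode(s, "UTF-8"));
        // The HTML entity &eacute; would come from PHP's htmlentities($s).
    }
}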
Responding to Comments
If entries_lists, entries_languages, and authors_entries are many:many mapping tables, please consider the several optimizations mentioned here.
Do not use utf8_encode. Instead, figure out what caused the values not to be encoded or displayed correctly. Start by
echo bin2hex($record['author']);
SELECT name, HEX(name) FROM authors WHERE ...
for some author with an accented letter.

Univocity parser - false delimiter autodetection when too little information given

I set the parser to detect the delimiters automatically
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
I have only one single record: 47W2E2qxPs, http://usda.gov/mattis.html
What I got:
code: 47W2E2qxPshttp url: //usda.gov/mattis.html
I expected the delimiter to be , and not :,
so my expected result would be 47W2E2qxPs and http://usda.gov/mattis.html.
Could I fix it in an elegant way?
Author of the library here. The detection process is a heuristic that uses statistics collected from multiple rows of part of your input. Therefore it depends a lot on the size of the input.
Its purpose is to handle situations where you can't easily determine the CSV format - such as when users upload random files to you. Don't use the detection process if you already know the correct delimiter.
In your case, one row of data is absolutely not enough to reliably detect the delimiter, especially if there are multiple candidate symbols present. There is little you can do about it except test which delimiter was detected before continuing:
parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
//check if the format is sane.
The next version (2.6.0) will include more options to assist the heuristic such as providing a set of allowed characters to be used as delimiters - which will probably help in your case.
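As a concrete example of the earlier advice to skip detection when the delimiter is already known, a minimal sketch (using the sample line from the question) would be:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class KnownDelimiter {
    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter(','); // set explicitly; no autodetection needed
        CsvParser parser = new CsvParser(settings);
        String[] row = parser.parseLine("47W2E2qxPs, http://usda.gov/mattis.html");
        System.out.println(row[0]); // 47W2E2qxPs
        System.out.println(row[1]); // http://usda.gov/mattis.html (leading space trimmed by default)
    }
}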

Getting MySQL to properly distinguish Japanese characters in SELECT calls

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.
Unlike other questions on this topic so far, I don't know that it's an encoding issue, per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.
The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:
ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that a specific Japanese reading had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
What I noticed after I put everything in is that wherever two such similar readings occurred, the first one encountered would be logged and would then show up as a false positive whenever the other appeared. For example:
キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.
I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view when differentiating between characters (e.g., if two characters have different Unicode code points, treat them as different characters). Is there any way to get this behavior?
You can use utf8_bin to get a collation that compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
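If you would rather keep utf8_unicode_ci for sorting and only force the stricter comparison in the duplicate check, an explicit COLLATE on the lookup works too. A minimal JDBC sketch - the table name, column name, and connection details are made up for illustration, and the column is assumed to be a utf8 text column as in the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StrictKanaLookup {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT id FROM readings WHERE reading COLLATE utf8_bin = ?";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/linguistics?useUnicode=true&characterEncoding=UTF-8",
                 "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "ギョウ"); // compared by code point, so キョウ no longer matches
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id"));
                }
            }
        }
    }
}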
When saving to the database, save it as binary, and when reading it back, convert it to Japanese.
The same problem occurred for me with the Arabic language.

Why is SSIS complaining that "There is a partial row at the end of the file"?

I'm importing a flat file into a database using a Data Flow Task in SSIS. The file is very simple: it contains three comma-separated values per row. Whenever I run this task, however, I receive a warning from the Flat File component:
Warning: 0x8020200F: There is a partial row at the end of the file.
This warning seems to happen regardless of the size of the file: even with only a handful of rows in the file, visually validated (with extended characters and whatnot visible), I still receive it. Moreover, it doesn't seem to matter whether I have a blank row at the end of the file or I just end it without a trailing CR+LF.
How can I get rid of this warning so I can run my package with WarnAsError enabled?
(BTW, it seems someone else may have had a similar problem in "There is a partial row at the end of the file", though it wasn't much of a question.)
I have found three things to try if you encounter this problem. In at least two out of the three cases, SSIS was ignoring rows of my input file with only the above warning to show for it. Because of that, I do not recommend ignoring this warning!
Step 1: verify that your flat file is valid
This error will appear when you have an invalid input file. That can be especially hard to detect if your input file has millions of lines, as mine do, but it's vital that you discover file format violations, because SSIS will happily give you this warning and continue on its way without importing the offending lines or, in some cases, the lines after them. The easiest way I found to discover a problem with the source file is to check the number of rows that are being imported successfully. If it's vastly different from the number you expect in your flat file, something may have gone wrong somewhere in the middle.
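If you don't have an expected row count from the provider, one quick way to get one is to count the lines in the source file outside of SSIS and compare it with the rows the package reports as written. A small sketch - the path and encoding are placeholders:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LineCount {
    public static void main(String[] args) throws Exception {
        // Count the physical lines in the flat file to compare against the SSIS row count.
        try (Stream<String> lines = Files.lines(Paths.get("C:/data/extract.csv"), StandardCharsets.ISO_8859_1)) {
            System.out.println("Lines in file: " + lines.count());
        }
    }
}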
Step 2: try a dummy line at the end (fixed-width only)
If you are using a fixed-width format input file, Microsoft may have a helpful KB article for you. Basically, they suggest that you add a dummy line at the end of the file.
I am not using fixed-width files, so I can't say how useful this technique is.
Step 3: turn off text qualification for non-text
This is the tricky one, because I believe the TextQualified property is True by default. If your input file uses non-text fields (integers, etc.), then you must tell SSIS that it should not expect those columns to be qualified as text. Essentially, your input file will be invalid in spite of looking perfectly valid.
TextQualified is a property of the columns in your Flat File Connection Manager.
To change it, open up your connection manager, click "Advanced", and then click on a non-text column. Make sure the TextQualified property is set to False. You will need to do this for all of your non-text columns.
If the byte width of a line in the file is known, you can always double-check that the total byte size of the file divides evenly by the expected line size to give you a nice round line count (as opposed to a decimal).
It also helps to know from your source just how many records are expected, but if you don't have that, you can at least double-check the loaded table's record count against the line count calculated while loading the file.
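A quick way to run that arithmetic, assuming a known fixed record width - the path and width below are placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;

public class RowSizeCheck {
    public static void main(String[] args) throws Exception {
        long fileBytes = Files.size(Paths.get("C:/data/extract.txt"));
        int recordBytes = 122; // e.g. 120 data bytes + 2 for the trailing CR+LF
        long leftover = fileBytes % recordBytes;
        System.out.println(leftover == 0
                ? "Clean division; line count = " + (fileBytes / recordBytes)
                : "Partial row likely; " + leftover + " leftover bytes");
    }
}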
I've seen this error often when a source flat text file is missing its last \r\n at the end of the file.
Running on 64-bit Windows was fine and led to no missing rows, but I lost the last row when running on Windows 2008.
My workaround is:
1. Open the SSIS package in BIDS on the Windows 2008 machine.
2. Open the file connection manager and make sure the Text Qualifier is set to
3. Rebuild it.
Everything then works fine on both Windows 7 and Windows 2008.

MySQL regexp with Japanese furigana

I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.
Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phonetic kana is placed above the kanji - this is called furigana.
I present these phonetic readings to the user with the <ruby> tag in the following format:
<ruby>
<rb>勉強</rb> <!-- the kanji -->
<rp>(</rp> <!-- define where the phonetic part starts in the string -->
<rt>べんきょう</rt> <!-- the phonetic kana itself -->
<rp>)</rp> <!-- define the end of the phonetic part -->
</ruby>する <!-- the last part is already phonetic so needs no ruby -->
The strings are stored in my database like this:
勉強(べんきょう)する
where anything between the parentheses is the reading for the kanji immediately preceding it. Storing the strings this way allows a fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).
All of this is fine, but the problem comes when a user is searching. If they search for
勉強
Then it will show up. But if they try to search for
勉強する
it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.
The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.
Is there a way to do what I want?
As said in How to do a regular expression replace in MySQL?, this seems to be impossible without a user-defined function (you can only replace explicit sequences).
Rather dirty solution: you could tolerate anything between two consecutive Japanese characters with LIKE '勉%強%す%る'. I never suggested that.
Or, you can keep an optional field in your table that potentially contains a version with furigana.
I would advise against using LIKE queries because you would have to have a % between every single character (since you don't know WHEN furigana will occur), and that could end up creating false positives (e.g., if a valid character appeared between 勉 and 強).
As @Jill-Jênn Vie briefly mentioned, I'd suggest adding a new column to hold the text with furigana.
I'm working on an application which performs searches on Korean text. The problem is that Korean conjugation changes the characters. For example:
하다 + 아요 = 해요
"하다" is the verb "to do" in dictionary form and "아요" is the standard polite-form conjugation. Presumably you are a Japanese speaker, so you know how common such polite forms can be! Note how the 하 changes to 해. Obviously, if users try to search for "하다" in the string "해요", they won't find it. But if users want to see all instances of "하다" in the corpus, we need to be able to return it.
Our solution was two columns: "form" (conjugated form) and "analytic_string" which would represent "해요" as "하다+아요". You could take a similar approach and make a second column containing your sentence without furigana.
The main disadvantages of this approach are that you're effectively doubling your database size and that you need to pay special attention when inputting data so that the two columns hold the same content (I found a few rows in my database where the form and the analytic string had different words in them). The advantage is that you can easily search your data while ignoring furigana.
It's your standard "size vs. performance" trade-off. Which is more important: size of the database or execution time? Any other solution I can think of involves returning too many rows and then individually analyzing them.
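If you do go with a second, furigana-free column, the parenthesized readings can be stripped out when the row is inserted. A minimal Java sketch of that step (it handles both full-width and ASCII parentheses, since either may appear in the stored strings):

public class StripFurigana {
    public static void main(String[] args) {
        String stored = "勉強(べんきょう)する";
        // Drop every parenthesized reading so the search column holds only the surface text.
        String searchable = stored.replaceAll("[（(][^）)]*[）)]", "");
        System.out.println(searchable); // 勉強する
    }
}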