I'm having an issue with PL/JSON chopping off string values at exactly 5000 characters.
Example data: {"n1":"v1","n2":"v2","n3":"10017325060844,10017325060845,... this goes on for a total of 32,429 characters ...10017325060846,10017325060847"}
After I convert the JSON string to an object I run this...
dbms_output.put_line(json_obj.get('n3').get_string);
And it only outputs the first 5000 characters. So I did some digging; see line 26 of this code. Right below it, at line 31, extended_str is being set and contains all 32,429 chars. Now let's move on to the get_string() member function. There are two of them, and I verified that it's the first one being called, the one with the max_byte_size and max_char_size parameters. Both of those parameters are null. So why is my text being chopped off at 5000 characters? I need this to work for data strings of varchar2(32767) and CLOBs. Thanks!
Version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
UPDATE: I found that the chopping of the text is coming from line 35: dbms_lob.read(str, amount, 1, self.str);. I ignored this code before because I saw the comment and knew my string wasn't null. So why is this read needed? Is this a bug?
As the maintainer of the pljson project, I have answered your question on GitHub (https://github.com/pljson/pljson/issues/154). For any further questions, feel free to ask on the same issue thread there.
I'm trying to load a large csv file, 3,715,259 lines.
I created this file myself and there are 9 fields separated by commas.
Here's the error:
import pandas as pd

df = pd.read_csv("avaya_inventory_rev2.csv", error_bad_lines=False)
Skipping line 2924525: expected 9 fields, saw 11
Skipping line 2924526: expected 9 fields, saw 10
Skipping line 2924527: expected 9 fields, saw 10
Skipping line 2924528: expected 9 fields, saw 10
This doesn't make sense to me. I inspected the offending lines using:
sed -n "2924524,2924525p" infile.csv
I can't list the outputs as they contain proprietary information for a client. I'll try to synthesize a meaningful replacement.
Lines 2924524 and 2924525 look to have the same number of fields to me.
Also, I was able to load the same file into a MySQL table with no error.
create table Inventory (path varchar (255), isText int, ext varchar(5), type varchar(100), size int, sloc int, comments int, blank int, tot_lines int);
I don't know enough about MySQL to understand why that may or may not be a valid test, or why pandas would have a different outcome loading the same file.
TIA!
UPDATE: I tried to read with engine='python':
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
When I create this CSV, I'm using a shell script I wrote; I feed lines to the file with the >> redirect.
I tried the suggested fix:
input_file = open("avaya_inventory_rev2.csv", 'rU')
df = pd.read_csv(input_file, engine='python')
Back to the same error:
ValueError: Expected 9 fields in line 5157, saw 11
I'm guessing it has to do with my CSV creation script and how I dealt with quoting in it, but I don't know how to investigate this further.
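One way to investigate further (a minimal sketch; the file name is the one from above and the line number comes from the ValueError) is to read the file in binary and look at the raw bytes of the offending line, which makes stray carriage returns and extra commas visible:

# Diagnostic sketch: read in binary so nothing is translated, then dump
# the raw bytes of the line pandas complained about.
with open("avaya_inventory_rev2.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if lineno == 5157:                      # line number from the ValueError
            print(repr(raw))                    # shows \r and other hidden characters
            print("comma count:", raw.count(b","))
            break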
I opened the CSV input file in vim, and on line 5157 there's a ^M, which Google says is a Windows CR.
OK... I'm closer, although I did kind of suspect something like this and used dos2unix on the CSV input.
I removed the ^M using vim and re-ran, with the same error about 11 fields. However, I can now see the 11 fields, whereas before I just saw 9. There are stray v's, which are likely some kind of Windows holdover?
SUMMARY: Somebody thought it'd be cute to name files like foobar.sh,v, so my profiler didn't mess up; it was just a naming weirdness, plus the random CR/LF from Windows that snuck in.
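For anyone hitting the same combination, here is a minimal pre-cleaning sketch (file names assumed; it loads the whole file into memory, which is fine at a few hundred MB): strip stray carriage returns, then let the csv module report any rows whose field count still doesn't match the header, e.g. rows broken by an unquoted foobar.sh,v path.

import csv
import io

SRC = "avaya_inventory_rev2.csv"        # input name from the question
DST = "avaya_inventory_rev2_clean.csv"  # assumed output name

# Drop every stray \r at the byte level, then re-parse and copy only the
# rows whose field count matches the header, reporting the rest.
with open(SRC, "rb") as f:
    text = f.read().replace(b"\r", b"").decode("utf-8", errors="replace")

reader = csv.reader(io.StringIO(text))
header = next(reader)

with open(DST, "w", newline="") as out:
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print("line", lineno, "has", len(row), "fields")  # inspect or repair these
            continue
        writer.writerow(row)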
Cheers
Possible Duplicate:
How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Background:
I am using Django with MySQL 5.1 and I am having trouble with 4-byte UTF-8 characters causing fatal errors throughout my web application.
I've used a script to convert all tables and columns in my database to UTF-8 which has fixed most unicode issues, but there is still an issue with 4-byte unicode characters. As noted elsewhere, MySQL 5.1 does not support UTF-8 characters over 3 bytes in length.
Whenever I enter a 4-byte unicode character (e.g. 🀐) into a ModelForm on my Django website the form validates and then an exception similar to the following is raised:
Incorrect string value: '\xF0\x9F\x80\x90' for column 'first_name' at row 1
My question:
What is a reasonable way to avoid fatal errors caused by 4-byte UTF-8 characters in a Django web application with a MySQL 5.1 database?
I have considered:
Selectively disabling MySQL warnings to avoid that specific error message (not sure yet whether that is possible)
Creating middleware that will look through the request.POST QueryDict and substitute/remove all invalid UTF-8 characters
Somehow hooking/altering/monkey-patching the mechanism that outputs SQL queries for Django or for MySQLdb, to substitute/remove all invalid UTF-8 characters before the query is executed
Example middleware to replace invalid characters (inspired by this SO question):
import re

class MySQLUnicodeFixingMiddleware(object):
    INVALID_UTF8_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def process_request(self, request):
        """Replace 4-byte unicode characters by REPLACEMENT CHARACTER"""
        request.POST = request.POST.copy()
        for key, values in request.POST.iterlists():
            request.POST.setlist(key,
                [self.INVALID_UTF8_RE.sub(u'\uFFFD', v) for v in values])
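To actually activate it, the class would be listed in settings.py; the module path below (myapp.middleware) is an assumption:

# settings.py -- assuming the class above lives in myapp/middleware.py
MIDDLEWARE_CLASSES = (
    'myapp.middleware.MySQLUnicodeFixingMiddleware',
    # ... the rest of the middleware stack ...
)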
Do you have the option to upgrade MySQL? If you do, you can upgrade and set the encoding to utf8mb4.
Assuming that you don't, I see these options for you:
1) Add JavaScript/frontend validation to prevent entry of anything other than 1-, 2-, or 3-byte Unicode characters,
2) Supplement that with a cleanup function in your models to strip the data of any 4-byte Unicode characters (which would be your option 2 or 3); a minimal sketch follows below.
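A minimal sketch of that cleanup idea, reusing the same regex as the middleware above; the Profile model and its field are hypothetical:

import re

from django.db import models

# Matches anything outside the Basic Multilingual Plane (minus surrogates),
# i.e. characters that need 4 bytes in UTF-8 and that MySQL 5.1 rejects.
FOUR_BYTE_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def strip_four_byte_chars(value):
    """Replace 4-byte characters with U+FFFD REPLACEMENT CHARACTER."""
    return FOUR_BYTE_RE.sub(u'\uFFFD', value)

class Profile(models.Model):            # hypothetical model, for illustration only
    first_name = models.CharField(max_length=30)

    def save(self, *args, **kwargs):
        self.first_name = strip_four_byte_chars(self.first_name)
        super(Profile, self).save(*args, **kwargs)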
At the same time, it does look like your users are in fact using 4-byte characters. If there is a business case for supporting them in your application, you could go to the powers that be and request an upgrade.
Currently I have a PowerShell process that scans a SQL Server table and reads a column containing text. We have characters in extended-ASCII land that are causing our downstream processes to break. I was originally identifying these differences in SQL Server, but it is terrible at text parsing, so I decided to write a PowerShell script that uses regular expressions instead. I will post the code for that as well, to help other lost souls looking for such a regex.
# Punctuation to treat as allowed, escaped for use inside the regex below
$x = [regex]::Escape("\``~!##$%^&*()_|{}=+:;`"'<,>.?/-")
# Flag anything NOT in the allowed set; note A-Za-z rather than A-z, which
# would also quietly allow [ \ ] ^ _ and the backtick
$y = "([^A-Za-z0-9 \x5D\x5B\t\n" + $x + "])"
$a = [regex]::Match($($Row[1]), $y)
The problem comes when I want to display some of the ASCII values back in an email saying that I'm scrubbing the data. The numbers don't come out the same as in SQL Server. (Caution: I'm not sure your results will be the same if you copy from your browser, because these are extended ASCII characters.)
In PowerShell
[int]"–"[-0]; #result 8211 that appears to be wrong
[int]" "[-0]; #result 160 this appears to be right
In SQL Server
select ASCII('–') --result 150
select ASCII(' ') --result 160
What in PowerShell will give the same results as SQL Server on the ASCII lookup, if there is such a thing?
TL;DR: Is the above the correct method to look up ASCII values in PowerShell? It works for most values but doesn't work for the ASCII value 150 (the long dash that comes from Word).
In SQL Server,
select UNICODE('–')
will return 8211.
I don't think PowerShell supports ANSI, except for I/O; it works in Unicode internally.
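For intuition, here is the same mapping in a few lines of Python (an illustration only; it assumes the SQL Server column uses a default Latin1/Windows-1252 collation): the en dash is Unicode code point 8211, and it only becomes 150 once it is squeezed into the single-byte Windows-1252 code page that ASCII() reports from.

# The "long dash" pasted from Word is the en dash, U+2013.
dash = u'\u2013'
print(ord(dash))                            # 8211 -> what PowerShell's [int] cast reports
print(ord(dash.encode('cp1252')[:1]))       # 150  -> what SQL Server's ASCII() reports
print(ord(u'\u00a0'.encode('cp1252')[:1]))  # 160  -> non-breaking space, same in both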
We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except one thing: Some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them into the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16, except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
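If Perl isn't handy, here is a rough Python sketch of the same idea; the mapping covers the usual Word punctuation and can be extended as needed:

# Map the usual Word "smart" punctuation back to plain ASCII.
SMART_PUNCT = {
    u'\u2018': "'",    # left single quote
    u'\u2019': "'",    # right single quote / apostrophe
    u'\u201c': '"',    # left double quote
    u'\u201d': '"',    # right double quote
    u'\u2013': '-',    # en dash
    u'\u2014': '--',   # em dash
    u'\u2026': '...',  # ellipsis
}

def demoronise(text):
    for smart, plain in SMART_PUNCT.items():
        text = text.replace(smart, plain)
    return text

# demoronise(u'He wasn\u2019t ready') -> "He wasn't ready"

Of course, this only helps once the export actually preserves those characters instead of flattening them to question marks, which is what the next answer is about.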
It's most likely due to the fact that the data in the Access file is UTF, and MDB Tools is trying to convert it to ASCII/Latin/ISO-8859-1 or some other encoding. Since these encodings don't map all the UTF characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.
I've found a Perl regexp that can check whether a string is UTF-8 (the regexp is from the W3C site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL, as it seems that MySQL doesn't support hex representations of characters; see this question.
Any thoughts how to port the regexp to MySQL?
Or maybe you know any other way to check if the string is valid UTF-8?
UPDATE:
I need this check working in MySQL, as I need to run it on the server to correct broken tables. I can't pass the data through a script, as the database is around 1 TB.
I've managed to repair my database using a test that works only if your data can be represented in a one-byte encoding; in my case it was latin1.
I've used the fact that MySQL changes bytes that aren't valid UTF-8 to '?' when converting to latin1.
Here is what the check looks like:
SELECT (
  CONVERT(
    CONVERT(
      potentially_broken_column
    USING latin1)
  USING utf8)
  !=
  potentially_broken_column
) AS INVALID ....
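For intuition only (not something to run over a 1 TB table), the same round-trip idea can be sketched in a few lines of Python on sample values, assuming the column value comes back as a Unicode string:

# Emulates the latin1 round trip: anything not representable in latin1 turns
# into '?', so the round-tripped value no longer equals the original.
def flagged_by_latin1_roundtrip(value):
    roundtripped = value.encode('latin1', errors='replace').decode('latin1')
    return roundtripped != value

print(flagged_by_latin1_roundtrip(u'caf\xe9'))         # False: plain latin1 text
print(flagged_by_latin1_roundtrip(u'caf\xe9 \u2713'))  # True: U+2713 is not latin1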
If you are in control of both the input and output side of this DB then you should be able to verify that your data is UTF-8 on whichever side you like and implement constraints as necessary. If you are dealing with a system where you don't control the input side then you are going to have to check it after you pull it out and possibly convert in your language of choice (Perl it sounds like).
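For that script-side check, the usual trick is simply to attempt a decode of the raw bytes pulled from the database; a minimal Python sketch (the Perl regex above expresses the same rule):

# True when the raw bytes form valid UTF-8; invalid sequences (for example a
# bare latin1 0xE9) make the decode raise UnicodeDecodeError.
def is_valid_utf8(raw_bytes):
    try:
        raw_bytes.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'caf\xc3\xa9'))  # True:  e-acute encoded as UTF-8
print(is_valid_utf8(b'caf\xe9'))      # False: e-acute as a bare latin1 byte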
The database is a REALLY good storage facility but should not be used aggressively for other applications. I think this is one spot where you should just let MySQL hold the data until you need to do something further with it.
If you want to continue on the path you are on then check out this MySQL Manual Page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Regexes are normally VERY similar between languages (in fact, I can almost always copy between JavaScript, PHP, and Perl with only minor adjustments for their wrapping functions), so if that is a working regex then you should be able to port it easily.
GL!
EDIT: Look at this Stack article: Regular expressions in stored procedures. You might want to use stored procedures, considering you cannot use scripting to handle the data.
With stored procedures you can loop through the data and do a lot of handling without ever leaving MySQL. That second article is going to refer you right back to the one I listed, though, so I think you need to first prove out your regex and get it working, then look into stored procedures.