Badly Formed hexadecimal uuid string error in Django fixture; json uuid conversion fails issue - json

File "/home/malikarumi/Projects/cannon/local/lib/python2.7/site-packages/django/db/models/fields/__init__.py", line 2390, in get_db_prep_value
value = uuid.UUID(value)
File "/usr/lib/python2.7/uuid.py", line 134, in __init__
raise ValueError('badly formed hexadecimal UUID string')
ValueError: Problem installing fixture '/home/malikarumi/Projects/cannon/jamf/essell/fixtures/test22byhand.json': badly formed hexadecimal UUID string
I've found the following links so far:
https://github.com/dcramer/django-uuidfield/issues/40
https://github.com/dcramer/django-uuidfield/commit/caae1bc4e45445a06dd11bb22da6a9f07395f78a
Django UUIDField modelfield causes error in Django admin: badly formed hexadecimal UUID string
Django Primary Key: badly formed hexadecimal UUID string
I counted my uuidfield value. It is len=36, because it has dashes in it. At least the string representation I can see is that way. So I replaced it with the same alphanumeric without dashes, as suggested as a test by the bugfix, but I still got the same result.
I checked the model, but there is no max length on any uuid field, nor on the fk link back to the uuid. There's nothing on the fk to suggest it is, or should be limited to, chars, ints, uuids, etc.
Then I found this: http://arthurpemberton.com/2015/04/fixing-uuid-is-not-json-serializable which I hacked into /python2.7/site-packages/django/core/serializers/python.py. The blogger had put it into models.py. But I got the same error, before realizing it was NOT coming from serializers/python.py, as it was yesterday, but from /usr/lib/python2.7/uuid.py, line 134, in init. the relevant portions of that code are:
if hex is not None:
hex = hex.replace('urn:', '').replace('uuid:', '')
hex = hex.strip('{}').replace('-', '')
if len(hex) != 32:
raise ValueError('badly formed hexadecimal UUID string')
int = long(hex, 16)
Rather than try to hack more core code, given that the indication is the problem is json, not Python, I left this alone for now.
Finally, I looked at this:
https://code.djangoproject.com/ticket/24012
It is stated a couple of times here that Django's "UUIDField generates UUIDs in Python". Now here is some history. I created one row, a single instance of Model A into Django with a fixture that had no uuid and no datefield and had no issues. (The uuidfield is on an abstract model, so it is created when the object is created). I did that because I needed the uuid of that Model A instance for a fk field in Model B, which is the one I am struggling with now. I did that by copy pasting the Model A uuid into the fk field on Model B in a csv file which I then converted to json in order to use it as a fixture.
Is it possible that the uuid ran into problems in this copy paste maneuver, before the conversion to json?
If not, that means even though it was an acceptable Python object when it was created, going thru the json conversion messed it up, correct?
If that's the case, what is a workaround?
Can the Arthur Pemberton code be made to work somewhere else in this process?
If I leave the uuid off, I can probably make this work, but then I have to go back and put the all the fk uuid's in manually. Is there a better solution? Maybe a bulk insert of that field alone?
This may be a recurring issue for me, because I am also using Scrapy, which supports but does not require json. None of my scraped items will come with uuid, but how do I automate adding their fk's into my process in order to get them into Django?
Or is all of this a good reason to forget uuids altogether?
Thanks.
EDIT/UPDATE per #rolf:
Since I just discovered that the django shell differs more than I realized (the shell can find settings, the regular interpreter can't) I decided to run this once in each one, but the results were the same.
(cannon)malikarumi#Tetuoan2:~/Projects/cannon/jamf$ python manage.py shell
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
IPython 4.0.3 -- An enhanced Interactive Python.
In [1]: uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
File "<ipython-input-1-e282858da374>", line 1
uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
^
SyntaxError: invalid syntax
In [2]: uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
File "<ipython-input-2-befebf1573ba>", line 1
uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
^
SyntaxError: invalid syntax
In [3]: uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
File "<ipython-input-3-a59ea095e61a>", line 1
uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
^
SyntaxError: invalid syntax
In [4]: uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
File "<ipython-input-4-c4a04434aa3c>", line 1
uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
^
SyntaxError: invalid syntax
Now that I have posted this, I notice that even Stack Overflow treats these uuid differently, i.e., the way they are colored, if that's relevant and meaningful here.
But now that we know this, what do we do with / about it?
2nd Update
This morning I thought, what about a uuid that had never been anywhere but in Django? So here's what I did:
In [5]: e.uuid
Out[5]: UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
In [6]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-6-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
In [7]: uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-3b4d3e5bd156> in <module>()
----> 1 uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
NameError: name 'uuid' is not defined
This is apparently because I left the quote around the alphanumeric, but why that would generate a uuid not defined error, instead of 'string type' or some such error is beyond me.
In [8]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-8-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
The first time I keyed in the characters by hand. I decided to repeat the test by copying and pasting, but as you can see, it made no difference. If there was something weird about the way only the 5 that the caret is pointing to was generated, we might be on to something, but if so, why do I get the same error in the same place when I typed it in by hand myself?
This no longer seems like a json issue to me, since – as far as I know – json has never touched this uuid, unless it did somehow in the internal workings of Django.
Instead, there is either
1. something wrong with the way uuid.UUID generates uuids, or
2. the way it generates them on my system, (Ubuntu 15.10, Django 1.9.1, Python 2.7.10) or
3. the way it reads and evaluates them when they come back, like in uuid.UUID() or being input outside the internal, automatic uuid generation process.
But that also means people using uuid.UUID() to generate uuids will never know there is an issue unless they do what I did, which is try to bring them in from outside. I remember reading somewhere that all uuids are supposed to be compatible. So, unless someone here has a better insight, I think we might be up for a bug report. But is it a Python bug, a Django bug, or both?

Your syntax is wrong:
uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74') # note the quotes

Related

converting csv to arff

I am working on a school project for data mining, where we were given CSV data from kaggle (this is how the data looks (2 lines out of 6970)):
4,1970,Female,150,DomesticPartnersKids,Bachelor's Degree,Democrat,,Yes,No,No,No,Yes,Public,No,Yes,No,Yes,No,No,Yes,Science,Study first,Yes,Yes,No,No,Receiving,No,No,Pragmatist,No,No,Cool headed,Standard hours,No,Happy,Yes,Yes,Yes,No,A.M.,No,End,Yes,No,Me,Yes,Yes,No,Yes,No,Mysterious,No,No,,,,,,,,,,Mac,Yes,Cautious,No,Umm...,No,Space,Yes,In-person,No,Yes,Yes,No,Yay people!,Yes,Yes,Yes,Yes,Yes,No,Yes,,,,,,,,,,,,,,,,,No,No,No,Only-child,Yes,No,No
5,1997,Male,75,Single,High School Diploma,Republican,,Yes,Yes,No,,Yes,Private,No,No,No,Yes,No,No,Yes,Science,Study first,,Yes,No,Yes,Receiving,No,Yes,Pragmatist,No,Yes,Cool headed,Odd hours,No,Right,Yes,No,No,Yes,A.M.,Yes,Start,Yes,Yes,Circumstances,No,Yes,No,Yes,Yes,Mysterious,No,No,Tunes,Technology,Yes,Yes,Yes,Yes,No,Supportive,No,PC,No,Cautious,No,Umm...,No,Space,No,In-person,No,No,Yes,Yes,Grrr people,Yes,No,No,No,No,No,No,Yes,No,No,Yes,No,Own,Pessimist,Mom,No,No,No,No,Nope,Yes,No,No,No,Yes,No,Yes,No,Yes,No
and we have to get this to an .arff format for use in weka. I manualy typed the header(107 attributes)
#ATTRIBUTE user_id NUMERIC
#ATTRIBUTE yob NUMERIC
#ATTRIBUTE gender {Male,Female}
#ATTRIBUTE income {150,100,75,50,25,10}
#ATTRIBUTE householdstatus {MarriedKids,Married,DomesticPartnersKids,DomesticPartners,Single,SingleKids}
#ATTRIBUTE educationlevel {Bachelor's Degree,High School Diploma,Current K-12,Current Undergraduate,Master's Degree,Associate's Degree,Doctoral Degree}
#ATTRIBUTE party {Democrat,Republican}
#ATTRIBUTE Q124742 {Yes,No}
#ATTRIBUTE Q124122 {Yes,No}
and I get this error :
} expected at end of enumeration read token eol
Then I tried to use the weka converter but it gave me an error
Wrong number of values.Read 2,expected 1,read Token[EOL],line 4 Problem encountered at line:3
Here's what I did:
From Kaggle, I downloaded train.csv (5568 instances, highest ID numbeer 6960).
I didn't use the converter -- just loaded it into the Weka Explorer as a CSV file. Some problems and their solution:
Line 3: First instance of "Bachelor's Degree". It did NOT like that single quote ("line 3, read 7, expected 108"). Got rid of all single quotes (using a global replace in a text editor). Then I tried to load it into Weka again.
The file doesn't have a CR (the Enter key on the keyboard) at the end of the last line, which caused an error ("null on line 5569"). I added one, again in a text editor. Then I loaded it into Weka, and took a look at the variables.
YOB (Year of Birth) is missing for about 300 instances, with "NA" filled in. So, it didn't evaluate as either string or numeric. Edited these to be empty cells instead. Then I loaded it into Weka.
And, of course, moved Party to be the class variable (at the end). I did this in Weka.
Saved this as train.arff
Loaded it back in, and it seems to work OK. I generated 51% accuracy with a OneR classifier, but you wouldn't expect a OneR classifier to work well here. I'm sure you can do better.
Note I didn't do any manual typing of headers. That must have taken a while!
Good luck!

Can I read the rest of the line after a positive value of IOSTAT?

I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback Method for 41 different groups. Some of the values are non-existing, though, and the table lists them as "X". I saved the table as a .csv and in my code read the file to an array. An excerpt of two lines from the .csv (the second one contains non-exisiting coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried doing was to read and define an array to hold each IOSTAT value so I can know if an "X" was read (that is, IOSTAT would be positive):
DO I = 1, 41
(READ(25,*,IOSTAT=ReadStatus(I,J)) JobackCoeff, J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values of those line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during the input execution then processing of the input list terminates. Further, all variables specified in the input list become undefined. The short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations, robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i=1, 13
read(line(idx:), *, iostat=iostat) x(i)
if (iostat.gt.0) then
print '("Column ",I0," has an X")', i
x(i) = -HUGE(0.) ! Recall x(i) was left undefined
end if
idx = idx + INDEX(line(idx:), ',')
end do
An alternative, long used by many many Fortran programmers, and programmers in other languages, would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NANs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do) and they will correctly interpret NAN in the input file to a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.

NLTK letter 'u' in front of text result?

I'm learning NLTK with a tutorial and whenever I try to print some text contents, it returns with 'u' in front of it.
In the tutorial it looks like this,
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
But in my result, it looks like this
(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
I am not sure why. I followed exact way the tutorial is explaining. Can someone help me understand this problem? Thank you!
That leading u just means that that string is Unicode. All strings are Unicode in Python 3. The parentheses means that you are dealing with a tuple. Both will go away if you print the individual elements of the tuple, as with t[0], t[1], and so on (assuming that t is your tuple).
If you want to print the whole tuple as a whole, removing u's and parentheses, try the following:
print " ".join (t)
As mentioned in other answer the leading u just means that string is Unicode. str() can be used to convert unicode to str but there doesnt seem to be a direct way to convert all the values in a tuple from unicode to string.
Simple function as below and using it when ever you are referring to any tuple in nltk.
>>> def str_tuple(t, encoding="ascii"):
... return tuple([i.encode(encoding) for i in t])
>>> str_tuple(nltk.corpus.gutenberg.fileids())
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
I guess you are using Python2.6 or any version before 3.0.
Python allows its users to do the same operation on 'str()' and 'unicode' in the early version. They tried to make conversion between 'str()' and 'unicode' directly in some case rely on default encoding, which on most platform is ASCII. That's probably the reason cause your problem. Here are two ways may solve it:
First, manually assign decoding method. For example:
>> for name in nltk.corpus.gutenberg.fileids():
>> name.decode('utf-8')
>> print(name)
The other way is to UPDATE your Python to version 3.0+ (Recommended). They fix this problem in Python3.0. Here is the link to update detail description:
https://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Hope this helps you.

Iterating over a string in Vimscript or Parse a JSON file

So I'm creating a vim script that needs to load and parse a JSON file into a local object graph. I searched and I couldn't find any native way to process a JSON file, and I don't want to add any dependencies to the script. So I wrote my own function to parse the JSON string (gotten from the file), but it's really slow. At the moment, I iterate through each character in the file like so:
let len = strlen(jsonString) - 1
let i = 0
while i < len
let c = strpart(jsonString, i, 1)
let i += 1
" A lot of code to process file....
" Note: I've tried short cutting the process by searching for enclosing double-quotes when I come across the initial double quotes (also taking into account escaping '\' character. It doesn't help
endwhile
I've also tried this method:
for c in split(jsonString, '\zs')
" Do a lot of parsing ....
endfor
For reference, a file with ~29,000 characters takes about 4 seconds to process, which is unacceptable.
Is there a better way to iterate over a string in vim script?
Or better yet, have I missed a native function to parse JSON?
Update:
I asked for no dependencies because I:
Didn't want to deal with them
Genuinely wanted some ideas for best way to do this without someone else's work.
Sometimes I just like to do things manually even though the problem has already been solved.
I'm not against plugins or dependencies at all, it's just that I'm curious. Thus the question.
I ended up creating my own function to parse the JSON file. I was creating a script that could parse the package.json file associated with node.js modules. Because of this, I could rely on a fairly consistent format and quit the processing whenever I'd retrieved the information I needed. This usually cut out large chunks of the file since most developers put the largest chunk of the file, their "readme" section, at the end. Because the package.json file is strictly defined, I left the process somewhat fragile. It assumed a root dictionary { } and actively looks for certain entries. You can find the script here: https://github.com/ahayman/vim-nodejs-complete/blob/master/after/ftplugin/javascript.vim#L33.
Of course, this doesn't answer my own question. It's only the solution to my unique problem. I'll wait a few days for new answers and pick the best one before the bounty ends (already set an alarm on my phone).
The simplest solution with the least dependencies is just using the json_decode vim function.
let dict = json_decode(jsonString)
Even though Vim's origin dates back a lot it happens that its internal string() eval() representation is that close to JSON that its likely to work unless you need special characters.
You can lookup the implementation here which even supports true/false/null if you want:
https://github.com/MarcWeber/vim-addon-json-encoding
Better use that library (vim-addon-manager allows to install dependencies easily).
Now it depends on your data whether this is good enough.
Now Benjamin Klein posted your question to vim_use which is why I'm replying.
Best and fast replies happen if you subscribe to the Vim mailinglist.
Goto vim.sf.net and follow the community link.
You cannot expect the Vim community to scrape stackoverflow.
I've added the keyword "json" and "parsing" to that little code that it can be found easier.
If this solution does not work for you you can try the many :h if_* bindings or write an external script which extracts the information you're looking for, or turns JSON into Vim's dictionary representation which can be read by eval() escaping special characters you care about correctly.
If you seek for completely correct solution omitting dependencies is one of the worst thing you can do. The eval() variant mentioned by #MarcWeber is one of the fastest, but it has its disadvantages:
Using solution for securing eval I mentioned in comment makes it no longer the fastest. In fact after you use this it makes eval() slower by more then an order of magnitude (0.02s vs 0.53s in my test).
It does not respect surrogate pairs.
It cannot be used to verify that you have correct JSON: it accepts some strings (e.g. "\<C-o>") that are not JSON strings and it allows trailing commas.
It fails to give normal error messages. It fails badly if you use vam#VerifyIsJSON I mentioned in p.1.
It fails to load floating point values like 1e10 (vim requires numbers to look like 1.0e10, but numbers like 1e10 are allowed: note “and/or” in the first paragraph).
. All of the above (except for the first) statements also apply to vim-addon-json-encoding mentioned by #MarcWeber because it uses eval. There are some other possibilities:
Fastest and the most correct is using python: pyeval('json.loads(vim.eval("varname"))'). Not faster then eval, but fastest among other possibilities. (0.04 in my test: approximately two times slower then eval())
Note that I use pyeval() here. If you want solution for vim version that lacks this functionality it will no longer be one of the fastest.
Use my json.vim plugin. It has an advantages of slightly better error reporting compared to failed vam#VerifyIsJSON, slightly worse compared to eval() and it correctly loads floating-point numbers. It can be used for verification of strings (it does not accept "\<C-a>"), but it loads lists with trailing comma just fine. It does not support surrogate pairs. It is also very slow: in the test I used (it uses 279702 character long strings) it takes 11.59s to load. Json.vim tries to use python if possible though.
For the best error reporting you can take yaml.vim and purge YAML support out of it leaving only JSON (I once have done the same thing for pyyaml, though in python: see markedjson library used in powerline: it is pyyaml minus YAML stuff plus classes with marks). But this variant is even slower then json.vim and should only be used if the main thing you need is error reporting: 207 seconds for loading the same 279702 character long string.
Note that the only variant mentioned that satisfies both requirements “no dependencies” and “no python” is eval(). If you are not fine with its disadvantages you have to throw away one or both of these requirements. Or copy-paste code. Though if you take speed into account only two candidates are left: eval() and python: if you want to parse json fast you really must use C and only these solutions spend most time in functions written in C.
Most other interpreters (ruby/perl/TCL) do not have pyeval() equivalent so they will be slower even if their JSON implementation is written in C. Some other (lua/racket (mzscheme)) have pyeval() equivalent, but e.g. luaeval('{}') is zero meaning that you will have to add additional step explicitly and recursively converting objects into vim dictionaries and lists (e.g. luaeval('vim.dict({})')) which will impact performance. Cannot say anything about mzeval(), but I have never heard about anybody actually using racket (mzscheme) with vim.

Types of Errors during Compilation and at Runtime

I have this question in a homework assignment for my Computer Languages class. I'm trying to figure out what each one means, but I'm getting stuck.
Errors in a computer program can be
classified according to when they are
detected and, if they are detected at
compile time, what part of the
compiler detects them. Using your
favorite programming language, give an
example of:
(a) A lexical error, detected by the
scanner.
(b) A syntax error, detected by the
parser.
(c) A static semantic error, detected
(at compile-time) by semantic
analysis.
(d) A dynamic semantic error, detected
(at run-time) by code generated by the
compiler.
For (a), I think this is would be correct: int char foo;
For (b), int foo (no semicolon)
For (c) and (d), I'm not sure what is being asked.
Thanks for the help.
I think it's important to understand what a scanner is, what a parser is and how they are involved in the compilation process.
(I'll try my best at a high-level explanation)
The scanner takes a sequence of characters (a source file) and converts it to a sequence of tokens. e.g., sees the text if 234 ) and converts to the tokens, IF INTEGER RPAREN (there's more to it but should be enough for the example).
Another way you can think of how the scanner works is that it takes the text and makes sure you use the correct keywords and not makes them up. It has to be able to convert the entire source file to the associated language's recognized tokens and this varies from language to language. In other words, "Does every piece of text correspond to a construct a language understands". Or better put with an example, "Do all these words found in a book, belong to the English language?"
The parser takes a sequence of tokens (usually from the scanner) and (among other things) sees if it is well formed. e.g., a C variable declaration is in the form Type Identifier SEMICOLON.
The parser checks "Does this sequence of tokens in this order make sense to me?" And similarly the analogy, "Does this sequence of English words (with punctuation) form complete sentences?"
C asks for errors that can be found when compiling the program. D asks for errors that you see when running the program after it compiled successfully. You should be able to distinguish these two by now hopefully.
I hope this helps you get a better understanding and make answering these easier.
I'll give it a shot. Here's what I think:
a. int foo+; (foo+ is an invalid identifier because + is not a valid char in identifiers)
b. foo int; (Syntax error is any error where the syntax is invalid - either due to misplacement of words, bad spelling, missing semicolons etc.)
c. Static semantic error are logical errors. for e.g passing float as index of an array - arr[1.5] should be a SSE.
d. I think exceptions like NullReferenceException might be an example of DME. Not completely sure but in covariant returns that raise an exception at compile time (in some languages) might also come in this category. Also, passing the wrong type of object in another object (like passing a Cat in a Person object at runtime might qualify for DME.) Simplest example would be trying to access an index that is out of bounds of the array.
Hope this helps.