pyspark prompts an error for udf not defined - exception

Here is the code:
from py4j.protocol import Py4JJavaError
def parse_clf_time(s):
    try:
        #return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(int(s[7:11]),month_map[s[3:6]],int(s[0:2]),int(s[12:14]),int(s[15:17]),int(s[18:20]))
        return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
            int(s[7:11]),
            month_map[s[3:6]],
            int(s[0:2]),
            int(s[12:14]),
            int(s[15:17]),
            int(s[18:20])
        )
    except Py4JJavaError as e:
        return "2016-08-11 00:00:01".format(
            int(s[7:11]),
            month_map[s[3:6]],
            int(s[0:2]),
            int(s[12:14]),
            int(s[15:17]),
            int(s[18:20])
u_parse_time = udf(parse_clf_time)
final_df = cleaned_df.select('*', u_parse_time(cleaned_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
total_log_entries = final_df.count()
The DataFrame may contain bad data, so I want to use a simple try/except to handle it. Please let me know what the best practice is for excluding bad data.
For some unknown reason, I got an error:
So what's wrong with the code? It works in another project in the same environment, so I am fairly sure the error does not come from the code itself.
Thank you very much, any clue is appreciated.

You are missing a closing ) for return "2016-08-11 00:00:01".format(
Also, you didn't import udf:
from pyspark.sql.functions import udf

Missing parentheses or brackets are indeed very common. I would suggest using a text editor with bracket matching to double-check in cases like this. I use UltraEdit, which works well for me.
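On the question of excluding bad data, here is a minimal sketch rather than a definitive fix. It assumes that month_map (not shown in the question) maps three-letter month abbreviations to month numbers, and that returning None for malformed rows is acceptable. An error raised inside the Python function is an ordinary Python exception such as ValueError or KeyError, so catching Py4JJavaError there would probably never trigger.

```python
# Assumption: month_map is not shown in the question; this is one plausible definition.
month_map = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12,
}


def parse_clf_time(s):
    """Convert 'dd/Mon/yyyy:hh:mm:ss ...' to 'yyyy-mm-dd hh:mm:ss', or None if malformed."""
    try:
        return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
            int(s[7:11]),
            month_map[s[3:6]],
            int(s[0:2]),
            int(s[12:14]),
            int(s[15:17]),
            int(s[18:20]),
        )
    except (ValueError, KeyError, TypeError):
        # Bad digits, unknown month, or a None input: signal "no value" instead of guessing a date.
        return None


# With PySpark available, the function would be wrapped as in the question (not run here):
#   from pyspark.sql.functions import udf
#   u_parse_time = udf(parse_clf_time)
# Rows where the UDF returned None become nulls and could then be dropped, for example with
#   final_df.filter(final_df['time'].isNotNull())

good = parse_clf_time("01/Aug/1995:00:00:01 -0400")
bad = parse_clf_time("garbage")
missing = parse_clf_time(None)
print(good, bad, missing)
```

Returning None keeps the bad rows visible as nulls, so they can be counted before being filtered out, instead of being silently replaced by a made-up timestamp.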

JSON error with Julia Dash simple example

I've been trying to replicate the "A simple example" posted at https://dash-julia.plotly.com/clientside-callbacks
The server runs... but when I connect to it I get a JSON parse error in Firefox.
I was able to solve the problem, but I'd like to understand what's wrong...
The problem was this line inside the Dash app.layout:
options=[(label = country, value = country) for country in available_countries]
and the available_countries variable was obtained from:
read_remote_csv(url) = DataFrame(CSV.File(HTTP.get(url).body))
df = read_remote_csv("https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv")
available_countries = unique(df.country)
Apparently the error appeared because available_countries was an Array{String31,1}; specifically, the problem was the String31 element type.
When I converted the variable to a plain String array:
available_countries = convert(Array{String,1},available_countries)
the problem was solved.
Now... I'm not sure if the String31 type came from the HTTP.get(), CSV.File() or the DataFrame() functions.
I'm assuming the example used to work when it was originally written but it broke with an update...
Can anyone explain where exactly the error originates? Is it a package version issue? If so, which package (HTTP, CSV, DataFrames)? How can I avoid it going forward?

NLTK letter 'u' in front of text result?

I'm learning NLTK with a tutorial and whenever I try to print some text contents, it returns with 'u' in front of it.
In the tutorial it looks like this,
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
But in my result, it looks like this
(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
I am not sure why. I followed the tutorial exactly. Can someone help me understand this problem? Thank you!
That leading u just means that the string is Unicode. All strings are Unicode in Python 3. The parentheses mean that you are dealing with a tuple. Both will go away if you print the individual elements of the tuple, as with t[0], t[1], and so on (assuming that t is your tuple).
If you want to print the whole tuple at once, without the u's and parentheses, try the following:
print " ".join (t)
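For comparison, here is a small sketch of the same idea in Python 3, where print is a function and string literals carry no u prefix. The tuple t is copied from the output in the question; printing the tuple itself shows its repr (parentheses and quotes), while joining the elements gives the plain text.

```python
# The tuple from the question, written as Python 3 string literals.
t = ('firefox.txt', 'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')

# Printing the tuple shows its repr: parentheses, commas and quoted elements.
print(t)

# Joining the elements produces one plain string without any of that decoration.
joined = " ".join(t)
print(joined)
```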
As mentioned in the other answer, the leading u just means that the string is Unicode. str() can be used to convert unicode to str, but there doesn't seem to be a direct way to convert all the values in a tuple from unicode to str.
A simple function like the one below can be used whenever you are working with a tuple in NLTK:
>>> def str_tuple(t, encoding="ascii"):
...     return tuple([i.encode(encoding) for i in t])
>>> str_tuple(nltk.corpus.gutenberg.fileids())
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
I guess you are using Python 2.6 or another version before 3.0.
Early versions of Python let you mix 'str' and 'unicode' in the same operations. In some cases they converted between 'str' and 'unicode' implicitly, relying on the default encoding, which on most platforms is ASCII. That is probably the cause of your problem. Here are two ways that may solve it:
First, specify the decoding explicitly. For example:
>>> for name in nltk.corpus.gutenberg.fileids():
...     print(name.decode('utf-8'))
The other way is to upgrade Python to version 3.0+ (recommended). This problem was fixed in Python 3.0. Here is a link to the detailed description of the change:
https://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Hope this helps you.

Cannot use sleep(secs) in sikuli

I am new to Sikuli coding and do not have much experience with Python either, so for many of you this might be a silly question. My problem is that I am trying to pause the program for x seconds. I have tried these two ways, but every time I get an error. Here is what I have tried:
import time
time.sleep(10)
Error I am getting: [error] SyntaxError ( "no viable alternative at input 'time'", )
=======
sleep(10)
Error I am getting: [error] SyntaxError ( "no viable alternative at input 'sleep'", )
I hope that someone can help me with my silly problem. I would really appreciate it :) (also, sorry for the bad English)
Thanks in advance!
sleep(10) is 100% correct in the Sikuli IDE for pausing your program for 10 seconds, so here are a few thoughts:
That error can crop up for lots of different reasons, but a really common one is indentation: in Python, white space DOES matter. In the Sikuli IDE, the body of a loop has to be indented exactly 4 spaces (= 1 tab); any more or less will throw this error. You can also check for missing syntax such as apostrophes or brackets, sometimes in the line preceding the one that's throwing the error.
In this particular case, sometimes the import statement is finicky. You could try from time import * instead of just import time. Both give you access to the same function (called as sleep(10) in the first case and as time.sleep(10) in the second), but they seem to behave differently to me sometimes.
If you're importing the 'time' module just to use functions like sleep(i) and wait(i), then the import is unnecessary, because those functions only require an integer i representing a number of seconds, and they do the rest as part of their built-in functionality.
Finally, if you find that 'import time' is the problem, I have found that the Sikuli IDE does not have native access to all of the possible modules to import. I have had lots of success with the datetime module, but I have never tried just the time module. You might switch to 'import datetime' and see if that helps...
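As a point of reference outside the Sikuli IDE, here is a minimal sketch of the two import styles in plain Python 3. The pauses are shortened to 0.05 seconds so the example finishes quickly; in a real script you would pass the number of seconds you actually want. Nothing here is specific to Sikuli, where sleep() is already built in.

```python
import time              # gives access to time.sleep(...)
from time import sleep   # gives access to the bare name sleep(...)

start = time.monotonic()
time.sleep(0.05)   # module-qualified call, available after "import time"
sleep(0.05)        # bare call, available after "from time import sleep"
elapsed = time.monotonic() - start

print("paused for about {0:.2f} seconds".format(elapsed))
```

Both names refer to the same function, so if one form fails with a SyntaxError, the cause is more likely the surrounding lines (indentation, an unclosed bracket) than the import style.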

Python - pythoncom.com_error handling in Python 3.2.2

I am using Python 3.2.2, and building a Tkinter interface to do some Active Directory updating. I am having trouble trying to handle pythoncom.com_error exceptions.
I grabbed some code from here:
http://code.activestate.com/recipes/303345-create-an-account-in-ms-active-directory/
However, I use the following (straight from the above site) to handle the exceptions raised:
except pythoncom.com_error,(hr,msg,exc,arg):
This code is consistent with many of the sites I have seen that handle these exceptions, however with Python 3.2.2, I get a syntax error if I include the comma after "pythoncom.com_error". If I remove the comma, the program starts, but then when the exception is raised, I get other exceptions because "hr", "msg" etc are not defined as global variables.
If I remove the comma and all of the bits in the brackets, then it all works well, except I can't see exactly what happens in the exception, which I want so I can pass through the actual error message from AD.
Does anyone know how to handle these pythoncom exceptions properly in Python 3.2.2?
Thanks in advance!
You simply need to use the modern except-as syntax, I think:
import pythoncom
import win32com
import win32com.client
location = 'fred'
try:
    ad_obj = win32com.client.GetObject(location)
except pythoncom.com_error as error:
    print(error)
    print(vars(error))
    print(error.args)
    hr, msg, exc, arg = error.args
which produces
(-2147221020, 'Invalid syntax', None, None)
{'excepinfo': None, 'hresult': -2147221020, 'strerror': 'Invalid syntax', 'argerror': None}
(-2147221020, 'Invalid syntax', None, None)
for me [although I'm never sure whether the args order is really what it looks like, so I'd probably refer to the keys explicitly; someone else may know for sure.]
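To show the except-as pattern without needing Windows or pywin32, here is a small sketch that uses a hypothetical stand-in class, FakeComError, in place of pythoncom.com_error. The attribute names (hresult, strerror, excepinfo, argerror) are taken from the vars(error) output above; everything else is an assumption made for illustration.

```python
class FakeComError(Exception):
    """Hypothetical stand-in for pythoncom.com_error so the sketch runs anywhere."""

    def __init__(self, hresult, strerror, excepinfo, argerror):
        super().__init__(hresult, strerror, excepinfo, argerror)
        self.hresult = hresult
        self.strerror = strerror
        self.excepinfo = excepinfo
        self.argerror = argerror


try:
    raise FakeComError(-2147221020, "Invalid syntax", None, None)
except FakeComError as error:
    # Python 3 form: bind the exception with "as", then unpack its args explicitly.
    hr, msg, exc, arg = error.args
    # Reading the named attribute avoids relying on the order of args.
    by_name = error.strerror
    message = "COM error {0}: {1}".format(hr, msg)

print(message)
```

Note that in Python 3 the name bound by "as" is deleted when the except block ends, so anything needed later (here message and by_name) has to be copied into other variables inside the block.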
I use this structure (Python 3.5) --
try:
    ...
except Exception as e:
    print("error in level argument", e)
    ...
else:
    ...

Python equivalent of PHP's @

Is there a Python equivalent of PHP's @ (the error-suppression operator)?
@function_which_is_doomed_to_fail();
I've always used this block:
try:
    foo()
except:
    pass
But I know there has to be a better way.
Does anyone know how I can Pythonicify that code?
I think adding some context to that code would be appropriate:
for line in blkid:
    line = line.strip()
    partition = Partition()
    try:
        partition.identifier = re.search(r'^(/dev/[a-zA-Z0-9]+)', line).group(0)
    except:
        pass
    try:
        partition.label = re.search(r'LABEL="((?:[^"\\]|\\.)*)"', line).group(1)
    except:
        pass
    try:
        partition.uuid = re.search(r'UUID="((?:[^"\\]|\\.)*)"', line).group(1)
    except:
        pass
    try:
        partition.type = re.search(r'TYPE="((?:[^"\\]|\\.)*)"', line).group(1)
    except:
        pass
    partitions.add(partition)
What you are looking for is anti-pythonic, because:
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
In your case, I would use something like this:
match = re.search(r'^(/dev/[a-zA-Z0-9]+)', line)
if match:
    partition.identifier = match.group(0)
And you have 3 lines instead of 4.
There is no better way. Silently ignoring errors is bad practice in any language, so it's naturally not Pythonic.
Building upon Gabi Purcanu's answer and your desire to condense to one-liners, you could encapsulate his solution into a function and reduce your example:
def cond_match(regexp, line, grp):
    match = re.search(regexp, line)
    if match:
        return match.group(grp)
    else:
        return None
for line in blkid:
    line = line.strip()
    partition = Partition()
    partition.identifier = cond_match(r'^(/dev/[a-zA-Z0-9]+)', line, 0)
    partition.label = cond_match(r'LABEL="((?:[^"\\]|\\.)*)"', line, 1)
    partition.uuid = cond_match(r'UUID="((?:[^"\\]|\\.)*)"', line, 1)
    partition.type = cond_match(r'TYPE="((?:[^"\\]|\\.)*)"', line, 1)
    partitions.add(partition)
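To see how such a helper behaves when a field is absent, here is a self-contained sketch. The sample line is made up for illustration (it is not real blkid output from the question) and deliberately has no LABEL field; the patterns are the ones used above.

```python
import re


def cond_match(regexp, line, grp):
    """Return the requested group if regexp matches line, otherwise None."""
    match = re.search(regexp, line)
    return match.group(grp) if match else None


# Hypothetical blkid-style line, invented for this example; note the missing LABEL.
sample = '/dev/sda1: UUID="1234-ABCD" TYPE="vfat"'

identifier = cond_match(r'^(/dev/[a-zA-Z0-9]+)', sample, 0)
label = cond_match(r'LABEL="((?:[^"\\]|\\.)*)"', sample, 1)
uuid = cond_match(r'UUID="((?:[^"\\]|\\.)*)"', sample, 1)
fs_type = cond_match(r'TYPE="((?:[^"\\]|\\.)*)"', sample, 1)

print(identifier, label, uuid, fs_type)
```

The missing label comes back as None without any exception being raised or swallowed, so a genuine bug such as a misspelled method name would still surface as an error.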
Please don't ask for Python to be like PHP. You should always explicitly trap the most specific error you can. Catching and ignoring all errors like that is not good practice, because it can hide other problems and make bugs harder to find. But in the case of REs, you should really check for the None value that re.search returns. For example, your code:
label = re.search(r'LABEL="((?:[^"\\]|\\.)*)"', line).group(1)
raises an AttributeError if there is no match, because re.search returns None in that case. But what if there was a match but you had a typo in your code:
label = re.search(r'LABEL="((?:[^"\\]|\\.)*)"', line).roup(1)
This also raises an AttributeError, even if there was a match. But using the catch-all exception and ignoring it would mask that error from you. You will never match a label in that case, and you would never know it until you found it some other way, such as by eventually noticing that your code never matches a label (but hopefully you have unit tests for that case...)
For REs, the usual pattern is this:
matchobj = re.search(r'LABEL="((?:[^"\\]|\\.)*)"', line)
if matchobj:
    label = matchobj.group(1)
No need to try and catch an exception here since there would not be one. Except... when there was an exception caused by a similar typo.
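For the rare case where ignoring one specific, expected error really is the intent, the standard library offers contextlib.suppress (Python 3.4 and later), which at least names the exception being silenced. The sketch below is only an illustration of that idea; the helper name remove_if_present and the temporary file are assumptions, not part of the question.

```python
import contextlib
import os
import tempfile


def remove_if_present(path):
    """Delete path and return True; return False if it was already gone."""
    # Only FileNotFoundError is silenced; any other error still propagates.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)
        return True
    return False


# Create a throwaway file, then try to delete it twice.
fd, path = tempfile.mkstemp()
os.close(fd)
first = remove_if_present(path)    # file exists, so it is removed
second = remove_if_present(path)   # file is gone, the error is suppressed

print(first, second)
```

This is still explicit silencing of a single named exception, which is a different thing from a bare except: pass that hides typos and unrelated failures.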
Use data-driven design instead of repeating yourself. Naming the relevant group also makes it easier to avoid group indexing bugs:
_components = dict(
    identifier=re.compile(r'^(?P<value>/dev/[a-zA-Z0-9]+)'),
    label=re.compile(r'LABEL="(?P<value>(?:[^"\\]|\\.)*)"'),
    uuid=re.compile(r'UUID="(?P<value>(?:[^"\\]|\\.)*)"'),
    type=re.compile(r'TYPE="(?P<value>(?:[^"\\]|\\.)*)"'),
)
for line in blkid:
    line = line.strip()
    partition = Partition()
    # Iterate over (name, pattern) pairs; iterating the dict itself would yield only the keys.
    for name, pattern in _components.items():
        match = pattern.search(line)
        value = match.group('value') if match else None
        setattr(partition, name, value)
    partitions.add(partition)
There is warnings control in Python - http://docs.python.org/library/warnings.html
After edit:
You probably want to check if it is not None before trying to get the groups.
Also use len() on the groups to see how many groups you have got. "Pass"ing the error is definitely not the way to go.