How do you set existing_data_behavior in pyarrow?

I'm getting the following error. How do I change the behavior when writing a dataset with write_dataset?
pyarrow.lib.ArrowInvalid: Could not write to <my-output-dir> as the directory is not empty and existing_data_behavior is to error

Update: If you are using exactly version 6.0.0, this was a bug (see below). If you are using version >= 6.0.1, you can specify it as part of the write_dataset call:
import pyarrow as pa
import pyarrow.dataset as ds
tab = pa.Table.from_pydict({"x": [1, 2, 3], "y": ["x", "y", "z"]})
partitioning = ds.partitioning(schema=pa.schema([pa.field('y', pa.utf8())]), flavor='hive')
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning)
# This second, identical write would fail because data already
# exists and the default is to error rather than risk an overwrite
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning)
# Specifying existing_data_behavior changes that default and
# restores the pre-6.0.0 behavior
ds.write_dataset(tab, '/tmp/foo_dataset', format='parquet', partitioning=partitioning, existing_data_behavior='overwrite_or_ignore')
Legacy 6.0.0 Answer
This is unfortunately a bug: https://issues.apache.org/jira/browse/ARROW-14620
The default behavior changed in 6.0.0 so that the write_dataset method will not proceed if data exists in the destination directory, but the flag to override this behavior did not get included in the Python bindings.
Workarounds are to use an older version or delete all files in the directory first.
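If you are stuck on 6.0.0, here is a minimal sketch of the second workaround, clearing the destination directory before writing (the path and variables are the ones from the example above):
import os
import shutil

out_dir = '/tmp/foo_dataset'
# Clear any previous contents so the 6.0.0 write_dataset call
# no longer sees a non-empty directory
if os.path.isdir(out_dir):
    shutil.rmtree(out_dir)
ds.write_dataset(tab, out_dir, format='parquet', partitioning=partitioning)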

Related

Opensmile: unreadable csv file while extracting prosody features from wav file

I am extracting prosody features from an audio file using the Windows version of openSMILE. It runs successfully and an output CSV is generated, but when I open the CSV, it shows some rows that are not readable. I used this command to extract the prosody features:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv
The output CSV looks like this:
[
I even tried the sample wave file in the example audio folder of the openSMILE directory, and the output is the same (not readable). Can someone help me identify where the problem actually is, and how can I fix it?
You need to enable the csvSink component in the configuration file to make it work. The file config\prosody\prosodyShs.conf that you are using does not have this component defined and always writes binary output.
You can verify that it is the standard binary output this way: omit the -O parameter from your command so it becomes SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav and execute it. You will get an output.htk file which is exactly the same as prosody_sample1.csv.
How do you get CSV output? You can take a look at the example configuration in opensmile-3.0-win-x64\config\demo\demo1_energy.conf, where a csvSink component is defined.
You can find more information in the official documentation:
Get started page of the openSMILE documentation
The section on configuration files
Documentation for cCsvSink
This is how I solved the issue. First I added the csvSink component to the list of component instances:
instance[csvSink].type = cCsvSink
Next I added the configuration parameters for this instance.
[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:file name of the output CSV file]
delimChar = ;
append = 0
timestamp = 1
number = 1
printHeader = 1
\{../shared/standard_data_output_lldonly.conf.inc}
Now if you run this file it will throw errors, because reader.dmLevel = energy depends on the waveframes level. So the final changes would be:
[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy
[int:cIntensity]
reader.dmLevel = waveframes
[framer:cFramer]
reader.dmLevel=wave
writer.dmLevel=waveframes
Further reference on how to write openSMILE configuration files can be found here
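With delimChar = ; and printHeader = 1 set as above, the resulting file is a plain semicolon-delimited CSV with a header row. A minimal sketch of loading it in Python for a sanity check (the name output.csv comes from the outputfile default in the config above):
import pandas as pd

# The csvSink above writes semicolon-separated values with a header
df = pd.read_csv('output.csv', sep=';')
print(df.head())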

drop_duplicates() got an unexpected keyword argument 'ignore_index'

On my machine the code runs normally, but on my friend's machine there is an error about drop_duplicates(). The error is the one in the title.
Open your command prompt and type pip show pandas to check your current pandas version.
If it's lower than 1.0.0, as #paulperry says, then type pip install --upgrade pandas --user
(the --user flag installs the upgrade into your per-user site-packages rather than system-wide)
Type import pandas as pd; pd.__version__ and see what version of pandas you are using, making sure it's >= 1.0.
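If upgrading pandas is not possible, the same result can be obtained without the ignore_index keyword, which only exists in pandas >= 1.0. A minimal sketch (the DataFrame contents are made up for illustration):
import pandas as pd

df = pd.DataFrame({'variety': ['Setosa', 'Setosa', 'Versicolor']})
# On pandas < 1.0, drop duplicates first and then rebuild the
# index explicitly instead of passing ignore_index=True
deduped = df.drop_duplicates('variety').reset_index(drop=True)
print(deduped)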
I was having the same problem as Wzh -- but am running pandas version 1.1.3. So, it was not a version problem.
Ilya Chernov's comment pointed me in the right direction. I needed to extract a list of unique names from a single column in a more complicated DataFrame so that I could use that list in a lookup table. This seems like something others might need to do, so I will expand a bit on Chernov's comment with this example, using the sample CSV file "iris.csv" that is available on GitHub. The file lists sepal and petal length for a number of iris varieties. Here we extract the variety names.
import pandas as pd

df = pd.read_csv('iris.csv')
# drop duplicates BEFORE extracting the column
names = df.drop_duplicates('variety', inplace=False, ignore_index=True)
# THEN extract the column you want
names = names['variety']
print(names)
Here is the output:
0 Setosa
1 Versicolor
2 Virginica
Name: variety, dtype: object
The key idea here is to get rid of the duplicate variety names while the object is still a DataFrame (without changing the original file), and then extract the one column that is of interest.

How to solve or suppress a wall of warnings when running any pystan code

When I run any pystan code, the output is what I expect, but I get a wall of warnings.
I've tried updating pystan and cython, as these are mentioned in the wall of warnings. My pystan is now version 2.17.1 and cython 0.29.2. I'm running python3.7.
import pystan
model_code = 'parameters {real y;} model {y ~ normal(0,1);}'
model = pystan.StanModel(model_code=model_code) # this will take a minute
y = model.sampling(n_jobs=1).extract()['y']
y.mean() # should be close to 0
The warning output that I get starts with:
/home/femke/anaconda3/lib/python3.7/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/tmp8_plkepg/stanfit4anon_model_5944b02c79788fa0db5b3a93728ca2bf_5335140894361802645.pyx
tree = Parsing.p_module(s, pxd, full_module_name)
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/femke/anaconda3/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1823:0,
from /home/femke/anaconda3/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /home/femke/anaconda3/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from /tmp/tmp8_plkepg/stanfit4anon_model_5944b02c79788fa0db5b3a93728ca2bf_5335140894361802645.cpp:688:
/home/femke/anaconda3/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by " \
^~~~~~~
In file included from /home/femke/anaconda3/lib/python3.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.66.0/boost/numeric/ublas/matrix.hpp:19:0,
from /home/femke/anaconda3/lib/python3.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.66.0/boost/numeric/odeint/util/ublas_wrapper.hpp:24,
from /home/femke/anaconda3/lib/python3.7/site-packages
Is this something to worry about? If not, how do I specifically disable these warnings, but not warnings from other parts of my code? If so, what should I change?
Edit: after having read the question Cython Numpy warning about NPY_NO_DEPRECATED_API when using MemoryView, I still don't know how to safely disable this warning.
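Two separate things generate this output. The Cython FutureWarning is an ordinary Python warning and can be filtered with the warnings module; the cc1plus and #warning lines come from g++ while the model is being compiled, so they need compiler flags instead. A minimal sketch, assuming a pystan 2.x release whose StanModel accepts extra_compile_args (it is present in 2.19; 2.17.1 may require an upgrade):
import warnings

import pystan

# Silence the Python-level Cython 'language_level' FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)

model_code = 'parameters {real y;} model {y ~ normal(0,1);}'
# -w tells g++ to suppress its own warnings during compilation;
# extra_compile_args is assumed to be available (pystan 2.19+)
model = pystan.StanModel(model_code=model_code,
                         extra_compile_args=['-w'])
y = model.sampling(n_jobs=1).extract()['y']
print(y.mean())  # should be close to 0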

Setting Jenkins build name from package.json version value

I want to include the value of the "version" parameter in package.json as part of the Jenkins build name.
I'm using the Jenkins Build Name Setter plugin - https://wiki.jenkins-ci.org/display/JENKINS/Build+Name+Setter+Plugin
So far I've tried to use PROPFILE syntax in the "Build name macro template" step:
${PROPFILE,file="./mainline/projectDirectory/package.json",property="\"version\""}
This successfully creates a build, but includes the quotes and comma surrounding the value of the version property in package.json, for example:
"0.0.1",
I want just the value inside returned, so it reads
0.0.1
How can I do this? Is there a different plugin that would work better for parsing package.json and getting it into the template, or should I resort to some sort of regex for removing the characters I don't want?
UPDATE:
I tried using token transforms based on reading the Token Macro Plugin documentation, but it's not working:
${PROPFILE%\"\,#\",file="./mainline/projectDirectory/package.json",property="\"version\""}
still just returns
However, using only one escaped character and only one of # or % works. No other combinations I tried work.
${PROPFILE%\,,file="./mainline/projectDirectory/package.json",property="\"version\""}
which returns "0.0.1" (comma removed)
${PROPFILE#\"%\"\,,file="./mainline/projectDirectory/package.json",property="\"version\""}
which returns "0.0.1", (no characters removed)
UPDATE:
Tried to use the new Jenkins Token Macro plugin's JSON macro with no luck.
Jenkins Build Name Setter set to update the build name with Macro:
${JSON,file="./mainline/pathToFiles/package.json",path="version"}-${P4_CHANGELIST}
Jenkins build logs for this job show:
10:57:55 Evaluated macro: 'Error processing tokens: Error while parsing action 'Text/ZeroOrMore/FirstOf/Token/DelimitedToken/DelimitedToken_Action3' at input position (line 1, pos 74):
10:57:55 ${JSON,file="./mainline/pathToFiles/package.json",path="version"}-334319
10:57:55 ^
10:57:55
10:57:55 java.io.IOException: Unable to serialize org.jenkinsci.plugins.tokenmacro.impl.JsonFileMacro$ReadJSON#2707de37'
I implemented a new macro JSON, which takes a file and a path (which is the key hierarchy in the JSON for the value you want) in token-macro-2.1. You can only use a single transform per macro usage.
Try the token transformations # and % (see the Token Macro Plugin):
${PROPFILE#"%",file="./mainline/projectDirectory/package.json",property="\"version\""}
(This will only help if you are using pipelines, but for what it's worth...)
What works for me is a combination of readJSON from the Pipeline Utility Steps plugin and directly setting currentBuild.displayName, thusly:
script {
    // readJSON comes from the "Pipeline Utility Steps" plugin
    def packageJson = readJSON file: 'package.json'
    def version = packageJson.version
    echo "Setting build version: ${version}"
    currentBuild.displayName = "${env.BUILD_NUMBER} - ${version}"
    // currentBuild.description = "other cool stuff"
}
Omitting error handling etc obvs.

Django query executed in view returns old data

I have a view which queries a model to populate a form:
class AddServerForm(forms.Form):
    # …snip…
    # Compile and present choices for HARDWARE CONFIG
    hwChoices = HardwareConfig.objects.\
        values_list('name', 'description').order_by('name')
    hwChoices = [('', '----- SELECT ONE -----')] +\
        [(x, '{0} - {1}'.format(x, y)) for x, y in hwChoices]
    hwconfig = forms.ChoiceField(label='Hardware Config', choices=hwChoices)
    # …snip…

def addServers(request, template="manager/add_servers.html",
               template_success="manager/add_servers_success.html"):
    if request.method == 'POST':
        pass  # …snip… - process the form
    else:
        # Page was called via GET - use a default form
        addForm = AddServerForm()
    return render_to_response(template, dict(addForm=addForm),
                              context_instance=RequestContext(request))
Additions to the HardwareConfig model are done using the admin interface. Changes show up immediately in the admin interface as expected.
Running the query via the shell returns all results as expected:
michael#victory> python manage.py shell
Python 2.6 (r26:66714, Feb 21 2009, 02:16:04)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from serverbuild.base.models import HardwareConfig
>>> hwChoices = HardwareConfig.objects.\
... values_list('name','description').order_by('name')
hwChoices now contains the complete set of results.
However, loading the addServers view (above) returns the old result set, missing the newly-added entries.
I have to restart the webserver in order for the changes to show up which makes it seem as though that query is being cached somewhere.
I'm not doing any explicit caching anywhere (grep -ri cache /project/root returns nothing)
It's not the browser caching the page - inspected via chrome tools, also tried using a different user & computer
What's going wrong and how do I fix it?
Versions:
MySQLdb: 1.2.2
django: 1.2.5
python: 2.6
hwChoices is evaluated when the form class is defined, i.e. when the process starts, so the query runs once and its results are frozen until the webserver restarts.
Do the calculation in the form's __init__ method instead.
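A minimal sketch of that change, reusing the query from the form in the question:
class AddServerForm(forms.Form):
    # Choices are filled in at instantiation time, so start empty here
    hwconfig = forms.ChoiceField(label='Hardware Config', choices=[])

    def __init__(self, *args, **kwargs):
        super(AddServerForm, self).__init__(*args, **kwargs)
        # This now runs every time the form is created, so rows added
        # via the admin appear without restarting the webserver
        hwChoices = HardwareConfig.objects.\
            values_list('name', 'description').order_by('name')
        hwChoices = [('', '----- SELECT ONE -----')] +\
            [(x, '{0} - {1}'.format(x, y)) for x, y in hwChoices]
        self.fields['hwconfig'].choices = hwChoices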