Just started using arcpy on ArcMap 10.2.2
Some modules - eg Hillshade - do not accept output raster names as parameters. I want to be able to specify the name of the output raster that appears in the ArcMap session's Table of Contents AND the geodatabase I'm currently working in. At the moment I'm using this method:
> # Some environment settings:
> import arcpy
> from arcpy import env
> from arcpy.sa import *
> # set geodatabase
> env.workspace = r"path\to\my\Scratch.gdb"
> # Prevent output adding to the map
> env.addOutputsToMap="FALSE"
ESRI Help http://resources.arcgis.com/en/help/main/10.1/index.html#//009z000000v0000000 suggests setting the out_raster as a variable then saving output to workspace...
> myRaster = HillShade(inRaster, azimuth, altitude, modelShadows, zFactor)
> myRaster.save("path/to/my/place")
BUT the name myRaster is not applied to the file saved in the geodatabase. Instead the file gets the auto-generated raster name applied by ArcMap. If env.addOutputsToMap="TRUE" then the raster name is set to myRaster and the raster is added to the map, but in the gdb it still has the auto-generated name.
I find it hard to believe there is no functionality to do what I'm trying to do.
thanks
addOutputsToMap is a Boolean property. Set it to False.
> env.addOutputsToMap = False
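The reason the string form is risky can be shown in plain Python, without arcpy: any non-empty string is truthy, including "FALSE". How arcpy itself coerces the value is not shown here and is an assumption; the sketch only illustrates the language rule.

```python
# Any non-empty string is truthy in Python, even one that spells "FALSE".
setting_as_string = "FALSE"
setting_as_bool = False

string_is_truthy = bool(setting_as_string)
bool_is_truthy = bool(setting_as_bool)

print("string value treated as:", string_is_truthy)
print("boolean value treated as:", bool_is_truthy)
```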
I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that whenever a new file is uploaded, it appears in the same folder as an additional file, and the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, which takes up unnecessary storage unless I frequently go in and delete old files.
All of this adds unnecessary complexity to my downstream system. I just want to be able to pull the latest version of the data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name. That way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file. As pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
from . import utils


@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging

log = logging.getLogger(__name__)


def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
    single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
    magritte, wanting a human readable name in the output, when not using separate transaction folders this should cause
    the previous output to be automatically overwritten.

    The input to this function must contain a single ".snappy.parquet" file, this can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.

    This function should not be used for large dataframes (e.g. those greater than 512 mb in size), instead
    transaction folders should be enabled in the export. This function can work for larger sizes, but you may find you
    need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.

    This produces a dataset without a schema, so features like expectations can't be used.

    Parameters:
        output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output, this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name: The name of the file to be written, ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"

    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.

    Returns:
        None: writes the file to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]
    log.info("Initial output file name: " + file_name)
    # check for snappy.parquet and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)
    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
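The extension handling in the function above can be checked in isolation, outside Foundry. The sketch below pulls that logic into a standalone helper; the name normalize_parquet_file_name is hypothetical and not part of the original answer, and it uses slicing instead of removesuffix so it also runs on Python versions older than 3.9.

```python
def normalize_parquet_file_name(file_name: str) -> str:
    """Return file_name with exactly one ".snappy.parquet" extension."""
    if file_name.endswith(".snappy.parquet"):
        return file_name  # already correct
    if file_name.endswith(".parquet"):
        # replace a bare ".parquet" with ".snappy.parquet"
        return file_name[: -len(".parquet")] + ".snappy.parquet"
    if file_name.endswith(".snappy"):
        return file_name + ".parquet"
    return file_name + ".snappy.parquet"


# Hypothetical example names, one per branch of the helper.
for name in ["report", "report.parquet", "report.snappy", "report.snappy.parquet"]:
    print(name, "->", normalize_parquet_file_name(name))
```

Keeping this step as a pure function makes it possible to unit test the naming rules without building a dataset.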
You can also use the rewritePaths functionality of the export plugin to rename the file matching spark/*.snappy.parquet to "export.parquet" while exporting. This of course only works if there is a single file, so .coalesce(1) in the transform is a must:
excludePaths:
  - ^_.*
  - ^spark/_.*
rewritePaths:
  '^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement; the only difference was that the dataset needed to be split into multiple parts due to its size. I'm posting the code here, updated to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: Input, file_name_prefix: str):
    """
    Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    print(f'input files {input_files}')
    print("prefix for target name: " + file_name_prefix)
    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file
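One possible refinement, which is not part of the answer above: if the order in which the file listing comes back is not guaranteed (an assumption here), the same input file could get a different part number on different builds. A minimal sketch that sorts the paths before numbering them is below; plan_part_names is a hypothetical helper name.

```python
def plan_part_names(input_paths, file_name_prefix):
    """Map each ".snappy.parquet" input path to a numbered output name, in sorted path order."""
    parquet_paths = sorted(p for p in input_paths if p.endswith(".snappy.parquet"))
    return {
        path: f"{file_name_prefix}_part_{i}.snappy.parquet"
        for i, path in enumerate(parquet_paths)
    }


# Hypothetical listing, deliberately out of order and with a non-parquet file mixed in.
listing = ["spark/part-0001-x.snappy.parquet", "spark/_SUCCESS", "spark/part-0000-x.snappy.parquet"]
plan = plan_part_names(listing, "export")
for source, target in plan.items():
    print(source, "->", target)
```

The copy loop could then iterate over plan.items() instead of enumerating the unsorted list.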
Also, to use this in a Code Workbook, the input needs to be persisted, and the output can be retrieved as shown below.
def rename_outputs(persisted_input):
    output = Transforms.get_output()
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")
I am extracting prosody features from an audio file using the Windows version of openSMILE. It runs successfully and an output CSV is generated, but when I open the CSV it shows rows that are not readable. I used this command to extract the prosody features:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv
And the output CSV looks like this:
[image not preserved]
I even tried the sample wave file from the example audio folder in the openSMILE directory, and the output is the same (not readable). Can someone help me identify where the problem is and how I can fix it?
You need to enable the csvSink component in the configuration file to make it work. The file config\prosody\prosodyShs.conf that you are using does not have this component defined and always writes binary output.
You can verify that it is the standard binary output this way: omit the -O parameter from your command so it becomes SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav and execute it. You will get an output.htk file that is exactly the same as prosody_sample1.csv.
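Another way to check whether a file is HTK binary rather than text is to read its header. The sketch below assumes the file follows the standard HTK parameter file layout (a 12-byte big-endian header: number of samples, sample period in 100 ns units, bytes per sample, parameter kind); that openSMILE writes exactly this layout is an assumption here, and read_htk_header is a hypothetical helper name.

```python
import struct


def read_htk_header(data: bytes) -> dict:
    """Parse the 12-byte big-endian header of an HTK parameter file."""
    if len(data) < 12:
        raise ValueError("an HTK header needs at least 12 bytes")
    # Two 32-bit integers followed by two 16-bit integers, all big-endian.
    n_samples, sample_period, sample_size, parm_kind = struct.unpack(">iihh", data[:12])
    return {
        "n_samples": n_samples,
        "sample_period_100ns": sample_period,
        "sample_size_bytes": sample_size,
        "parameter_kind": parm_kind,
    }


# Synthetic header for illustration only; for a real file, pass open(path, "rb").read(12).
example_header = struct.pack(">iihh", 100, 100000, 12, 9)
print(read_htk_header(example_header))
```

If the parsed values look plausible (for example, a sample size that is a multiple of 4 for float features), the file is very likely binary HTK data that happens to carry a .csv extension.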
How do you get CSV output? You can take a look at the example configuration in opensmile-3.0-win-x64\config\demo\demo1_energy.conf, where a csvSink component is defined.
You can find more information in the official documentation:
Get started page of the openSMILE documentation
The section on configuration files
Documentation for cCsvSink
This is how I solved the issue. First I added the csvSink component to the list of component instances:
instance[csvSink].type = cCsvSink
Next I added the configuration parameters for this instance.
[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:file name of the output CSV file]
delimChar = ;
append = 0
timestamp = 1
number = 1
printHeader = 1
\{../shared/standard_data_output_lldonly.conf.inc}
If you run this file now, it will throw errors, because reader.dmLevel = energy depends on the waveframes level. So the final changes would be:
[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy
[int:cIntensity]
reader.dmLevel = waveframes
[framer:cFramer]
reader.dmLevel=wave
writer.dmLevel=waveframes
Further reference on how to configure opensmile configuration files can be found here
I've run a Spark job via databricks on AWS, and by calling
big_old_rdd.saveAsTextFile("path/to/my_file.json")
have saved the results of my job into an S3 bucket on AWS. The result of that spark command is a directory path/to/my_file.json containing portions of the result:
_SUCCESS
part-00000
part-00001
part-00002
and so on. I can copy those part files to my local machine using the AWS CLI with a relatively simple command:
aws s3 cp s3://my_bucket/path/to/my_file.json local_dir --recursive
and now I've got all those part-* files locally. Then I can get a single file with
cat $(ls part-*) > result.json
The problem is that this two-stage process is cumbersome and leaves file parts all over the place. I'd like to find a single command that will download and merge the files (ideally in order). When dealing with HDFS directly this is something like hadoop fs -cat "path/to/my_file.json/*" > result.json.
I've looked around through the AWS CLI documentation but haven't found an option to merge the file parts automatically, or to cat the files. I'd be interested in either some fancy tool in the AWS API or some bash magic that will combine the above commands.
Note: Saving the result into a single file via spark is not a viable option as this requires coalescing the data to a single partition during the job. Having multiple part files on AWS is fine, if not desirable. But when I download a local copy, I'd like to merge.
This can be done with a relatively simple function using boto3, the AWS python SDK.
The solution involves listing the part-* objects in a given key, and then downloading each of them and appending to a file object. First, to list the part files in path/to/my_file.json in the bucket my_bucket:
import boto3
bucket = boto3.resource('s3').Bucket('my_bucket')
keys = [obj.key for obj in bucket.objects.filter(Prefix='path/to/my_file.json/part-')]
Then, use Bucket.download_fileobj() with a file opened in append mode to write each of the parts. The function I'm now using, with a few other bells and whistles, is:
from os.path import basename
import boto3


def download_parts(base_object, bucket_name, output_name=None, limit_parts=0):
    """Download all file parts into a single local file"""
    base_object = base_object.rstrip('/')
    bucket = boto3.resource('s3').Bucket(bucket_name)
    prefix = '{}/part-'.format(base_object)
    output_name = output_name or basename(base_object)
    count = 0  # number of parts downloaded so far (stays 0 if nothing matches the prefix)
    with open(output_name, 'ab') as outfile:
        for count, obj in enumerate(bucket.objects.filter(Prefix=prefix), start=1):
            bucket.download_fileobj(obj.key, outfile)
            if limit_parts and count >= limit_parts:
                print('Terminating download after {} parts.'.format(count))
                break
        else:
            print('Download completed after {} parts.'.format(count))
The download step may still need its own line of code.
As far as cat'ing in order, you can do it by modification time or alphabetically.
Combined in order of modification time (newest first): cat $(ls -t part-*) > outputfile
Combined & Sorted alphabetically: cat $(ls part-* | sort) > outputfile
Combined & Sorted reverse-alphabetically: cat $(ls part-* | sort -r) > outputfile
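The alphabetical merge can also be sketched in Python, which avoids parsing ls output. This is a minimal local-only sketch, assuming the part files have already been downloaded into one directory; merge_parts is a hypothetical helper name, and the output file should live outside that directory or at least not match part-*.

```python
from pathlib import Path
import shutil


def merge_parts(directory, output_path):
    """Concatenate the part-* files in `directory` into `output_path`, in name order."""
    parts = sorted(Path(directory).glob("part-*"))
    with open(output_path, "wb") as out_f:
        for part in parts:
            with open(part, "rb") as in_f:
                shutil.copyfileobj(in_f, out_f)  # stream each part, no full read into memory
    return [part.name for part in parts]


# Example call (paths are placeholders):
# merge_parts("local_dir", "result.json")
```

Because Spark zero-pads the part numbers, sorting by name gives the same order as the part index.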
I'm trying to use a loop to read in multiple CSVs (CSVs for now, but a mix of CSV and XLS in the future).
I'd like each pandas data frame to have the same name as its file in my folder, excluding the file extension.
import os
import pandas as pd
files = filter(os.path.isfile, os.listdir( os.curdir ) )
files # this shows a list of the files that I want to use/have in my directory- they are all CSVs if that matters
# i want to load these into pandas data frames with the corresponding filenames
# not sure if this is the right approach....
# but what is wrong is the variable is named 'weather_today.csv'... i need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
    frame = pd.read_csv( each_file)
    each_file = frame
Bernie's answer seems great, but there is one problem:
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    # Right below, I am assigning my looped data frame to the literal variable name "filename_only"
    # rather than to the value that filename_only represents (what I see if I print(filename_only))
    filename_only = frame
For example, if my two files are weather_today.csv and earthquakes.csv (in that order) in my files list, then neither 'earthquakes' nor 'weather_today' will be created.
However, if I simply type 'filename_only' and press Enter in Python, I will see the earthquakes data frame. If I have 100 files, only the last data frame in the loop will exist, under the name 'filename_only'; the other 99 won't, because each assignment overwrites the previous one.
You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    filename_only = frame
As asked in a comment we would like a way to filter for just CSV files so you can do something like this:
files = [file for file in os.listdir( os.curdir ) if file.endswith(".csv")]
Use a dictionary to store your frames:
frames = {}
for each_file in files:
    frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage though: reading a bunch of files can quickly fill up system memory and cause a crash.
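Since the question mentions a future mix of CSV and Excel files, the dictionary approach can be extended by choosing the reader from the file extension. This is only a sketch under stated assumptions: load_frames and its readers parameter are hypothetical names, and pandas is imported lazily so the defaults (pd.read_csv and pd.read_excel) are only needed when no readers are passed in.

```python
import os


def load_frames(folder, readers=None):
    """Return {file name without extension: loaded table} for files with a known extension."""
    if readers is None:
        import pandas as pd  # lazy import: only required for the default readers

        readers = {".csv": pd.read_csv, ".xlsx": pd.read_excel, ".xls": pd.read_excel}
    frames = {}
    for file_name in sorted(os.listdir(folder)):
        path = os.path.join(folder, file_name)
        stem, ext = os.path.splitext(file_name)
        reader = readers.get(ext.lower())
        if os.path.isfile(path) and reader is not None:
            frames[stem] = reader(path)  # files with unknown extensions are skipped
    return frames


# Example call (assumes pandas is installed):
# frames = load_frames(os.curdir)
# frames["weather_today"]
```

Note that two files with the same stem (for example data.csv and data.xlsx) would collide in the dictionary, with the later one winning.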
Please help me work out how to save the scripts in a single folder, since I am facing an issue while importing Script1 inside Script2. Below are the two scripts.
Script1 : Variable.sikuli
PID = r'C:\Program Files (x86)\Microsoft Office\Office14\outlook.exe'
When I saved the script (Variable.sikuli), it created by default a folder "Variable.sikuli" containing "Variable.py" and "Variable.html".
Script2 : openMO.sikuli
def openMO():
    openApp(PID) # PID will be taken from Variable.sikuli
openMO()
When I saved the script (openMO.sikuli), it created by default a folder "openMO.sikuli" containing "openMO.py" and "openMO.html".
Now my questions are:
How to save the two scripts in a single folder?
How to import Variable.sikuli in openMO.sikuli?
I don't think you should or can place multiple Sikuli scripts into one folder to make them visible to each other. Generally, the directories/folders containing your .sikuli’s you want to import have to be in sys.path. Sikuli automatically finds other Sikuli scripts in the same directory, when they are imported. Your imported script must contain (as first line) the following statement:
from sikuli import *
This is necessary for the Python environment to know the Sikuli classes, methods, functions and global names.
Below is an example
import sys
# on Windows
myScriptPath = "c:\\someDirectory\\myLibrary"
# on Mac/Linux
myScriptPath = "/someDirectory/myLibrary"
# all systems
if myScriptPath not in sys.path:
    sys.path.append(myScriptPath)
# supposing there is a myLib.sikuli
import myLib
# supposing myLib.sikuli contains a function "def myFunction():"
myLib.myFunction() # makes the call
More info is available here.
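The sys.path mechanism itself can be tried in plain Python, outside Sikuli. The sketch below creates a throwaway directory with a tiny module and imports it after appending the directory to sys.path; the module name my_demo_lib is hypothetical, and the Sikuli-specific handling of .sikuli folders is not reproduced here.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory holding a tiny module (a stand-in for a script library).
library_dir = tempfile.mkdtemp()
with open(os.path.join(library_dir, "my_demo_lib.py"), "w") as f:
    f.write("def my_function():\n    return 'called my_function'\n")

# Make the directory importable, using the same membership check as the example above.
if library_dir not in sys.path:
    sys.path.append(library_dir)

importlib.invalidate_caches()  # make sure the freshly written module is discovered
my_demo_lib = importlib.import_module("my_demo_lib")
print(my_demo_lib.my_function())
```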
Maybe a bit of a late answer but if you would like to import things like a path to another file, you could also try making use of a global variable.
Putting multiple scripts inside one .sikuli folder is not advised.
If your scripts/programs get bigger it could become real messy.
With a global variable you make a variable that can be used throughout the whole script.
When you import a file in Python, its top-level code runs right away and the variables are set.
If you define a global variable in script A and then let script B import script A, script B also knows the values of script A's global variables.
To set a global variable inside a function or method, you first have to declare it with: "global variableName"
I have some example code below that might make things more clear.
File: BrowserPath.sikuli
# Define a Global variable
PathFireFox = ''
class Fire():
    def __init__(self):
        global PathFireFox
        PathFireFox = r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe"
# Run Fire()
Fire()
File: BrowserMain.sikuli
# Import other class.
from BrowserPath import *
class Main():
    def __init__(self):
        global PathFireFox
        App.open(PathFireFox)
# Run Main class
Main()
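The import-time behaviour this relies on can be reproduced in plain Python. In the sketch below, a temporary module sets a global inside a function that runs when the module is imported, and the importing code then sees the assigned value. The module and variable names (browser_path_demo, path_firefox) are hypothetical, and no Sikuli API is involved.

```python
import importlib
import os
import sys
import tempfile

# Write a small module whose top-level code assigns a global through a function.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "browser_path_demo.py"), "w") as f:
    f.write(
        "path_firefox = ''\n"
        "def set_path():\n"
        "    global path_firefox  # needed because the function assigns to the global\n"
        "    path_firefox = 'firefox.exe'\n"
        "set_path()  # runs as soon as the module is imported\n"
    )

sys.path.append(module_dir)
importlib.invalidate_caches()

# The import executes the module, so the name already holds the assigned value.
from browser_path_demo import path_firefox

print(path_firefox)
```

Note that an imported name is a copy of the binding at import time: if the defining module reassigns the global later, code that imported the name earlier keeps the old value unless it reads it through the module object.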