In Python, how can we find out the command line arguments that were provided for a script, and process them?
For some more specific examples, see Implementing a "[command] [action] [parameter]" style command-line interfaces? and How do I format positional argument help using Python's optparse?.
import sys
print("\n".join(sys.argv))
sys.argv is a list that contains all the arguments passed to the script on the command line. sys.argv[0] is the script name.
Basically,
import sys
print(sys.argv[1:])
The canonical solution in the standard library is argparse (docs):
Here is an example:
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
                    action="store_false", dest="verbose", default=True,
                    help="don't print status messages to stdout")
args = parser.parse_args()
argparse supports (among other things):
Multiple options in any order.
Short and long options.
Default values.
Generation of a usage help message.
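To make the list above concrete, here is a minimal sketch that reuses the two options from the example. The program name "report" and the default "out.txt" are made up for illustration, and an explicit list is passed to parse_args() so the snippet runs without a real command line.

```python
from argparse import ArgumentParser

parser = ArgumentParser(prog="report")  # "report" is a made-up program name
parser.add_argument("-f", "--file", dest="filename", default="out.txt",
                    help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet", action="store_false", dest="verbose",
                    default=True, help="don't print status messages to stdout")

# Options may appear in any order, in short or long form.
a = parser.parse_args(["-q", "--file", "a.txt"])
b = parser.parse_args(["-f", "a.txt", "--quiet"])
c = parser.parse_args([])  # nothing given, so the defaults apply

print(a.filename, a.verbose)
print(c.filename, c.verbose)
print(parser.format_usage())  # usage line generated from the declared options
```

Calling parse_args() with no arguments reads sys.argv[1:] instead of the explicit lists used here.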
I'm just going around evangelizing for argparse, which is better for these reasons, essentially:
(copied from the link)
argparse module can handle positional and optional arguments, while optparse can handle only optional arguments
argparse isn’t dogmatic about what your command line interface should look like - options like -file or /file are supported, as are required options. Optparse refuses to support these features, preferring purity over practicality
argparse produces more informative usage messages, including command-line usage determined from your arguments, and help messages for both positional and optional arguments. The optparse module requires you to write your own usage string, and has no way to display help for positional arguments.
argparse supports actions that consume a variable number of command-line args, while optparse requires that the exact number of arguments (e.g. 1, 2, or 3) be known in advance
argparse supports parsers that dispatch to sub-commands, while optparse requires setting allow_interspersed_args and doing the parser dispatch manually
And my personal favorite:
argparse allows the type and action parameters to add_argument() to be specified with simple callables, while optparse requires hacking class attributes like STORE_ACTIONS or CHECK_METHODS to get proper argument checking
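A short sketch of three of the points above: a plain callable used as type, a variable number of arguments via nargs, and dispatch to sub-commands. The program name "tool", the sub-command names and the positive_int helper are all hypothetical and chosen only for illustration.

```python
import argparse


def positive_int(text):
    """A plain callable used as an argument type; rejects values <= 0."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


parser = argparse.ArgumentParser(prog="tool")  # hypothetical program name
subparsers = parser.add_subparsers(dest="command")

# "add" takes one or more integers (a variable number of arguments).
add_parser = subparsers.add_parser("add", help="add numbers")
add_parser.add_argument("numbers", type=positive_int, nargs="+")
add_parser.set_defaults(func=lambda ns: sum(ns.numbers))

# "greet" takes one positional and one optional argument.
greet_parser = subparsers.add_parser("greet", help="greet someone")
greet_parser.add_argument("name")
greet_parser.add_argument("--greeting", default="Hello")
greet_parser.set_defaults(func=lambda ns: f"{ns.greeting}, {ns.name}")

for argv in (["add", "1", "2", "3"], ["greet", "World", "--greeting", "Hi"]):
    ns = parser.parse_args(argv)
    print(ns.func(ns))  # each sub-parser stored its own handler in ns.func
```

Storing the handler with set_defaults(func=...) is one common way to dispatch; branching on ns.command would work as well.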
There is also the argparse stdlib module (an "improvement" on the stdlib's optparse module). Example from the introduction to argparse:
# script.py
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'integers', metavar='int', type=int, choices=range(10),
        nargs='+', help='an integer in the range 0..9')
    parser.add_argument(
        '--sum', dest='accumulate', action='store_const', const=sum,
        default=max, help='sum the integers (default: find the max)')
    args = parser.parse_args()
    print(args.accumulate(args.integers))
Usage:
$ script.py 1 2 3 4
4
$ script.py --sum 1 2 3 4
10
If you need something fast and not very flexible:
main.py:
import sys
first_name = sys.argv[1]
last_name = sys.argv[2]
print("Hello " + first_name + " " + last_name)
Then run python main.py James Smith
to produce the following output:
Hello James Smith
The docopt library is really slick. It builds an argument dict from the usage string for your app.
Eg from the docopt readme:
"""Naval Fate.
Usage:
naval_fate.py ship new <name>...
naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
naval_fate.py ship shoot <x> <y>
naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
naval_fate.py (-h | --help)
naval_fate.py --version
Options:
-h --help Show this screen.
--version Show version.
--speed=<kn> Speed in knots [default: 10].
--moored Moored (anchored) mine.
--drifting Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
arguments = docopt(__doc__, version='Naval Fate 2.0')
print(arguments)
One way to do it is using sys.argv. This will print the script name as the first argument and all the other parameters that you pass to it.
import sys
for arg in sys.argv:
    print(arg)
# Set the default args to -h if no args were given:
if len(sys.argv) == 1:
    sys.argv[1:] = ["-h"]
I use optparse myself, but really like the direction Simon Willison is taking with his recently introduced optfunc library. It works by:
"introspecting a function definition (including its arguments and their default values) and using that to construct a command line argument parser."
So, for example, this function definition:
def geocode(s, api_key='', geocoder='google', list_geocoders=False):
is turned into this optparse help text:
Options:
-h, --help show this help message and exit
-l, --list-geocoders
-a API_KEY, --api-key=API_KEY
-g GEOCODER, --geocoder=GEOCODER
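I like getopt from stdlib, e.g.:
import getopt
import sys
# usage() is assumed to be defined elsewhere: it prints help (plus an optional error) and exits
try:
    opts, args = getopt.getopt(sys.argv[1:], 'h', ['help'])
except getopt.GetoptError as err:
    usage(err)
for opt, arg in opts:
    if opt in ('-h', '--help'):
        usage()
if len(args) != 1:
    usage("specify thing...")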
Lately I have been wrapping something similar to this to make things less verbose (e.g. making "-h" implicit).
As you can see in the optparse documentation: "The optparse module is deprecated and will not be developed further; development will continue with the argparse module."
Pocoo's click is more intuitive, requires less boilerplate, and is at least as powerful as argparse.
The only weakness I've encountered so far is that you can't do much customization to help pages, but that usually isn't a requirement and docopt seems like the clear choice when it is.
import argparse
parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
const=sum, default=max,
help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
Assuming the Python code above is saved into a file called prog.py
$ python prog.py -h
Ref-link: https://docs.python.org/3.3/library/argparse.html
You may be interested in a little Python module I wrote to make handling of command line arguments even easier (open source and free to use) - Commando
Yet another option is argh. It builds on argparse, and lets you write things like:
import argh
# declaring:
def echo(text):
    "Returns given word as is."
    return text
def greet(name, greeting='Hello'):
    "Greets the user with given name. The greeting is customizable."
    return greeting + ', ' + name
# assembling:
parser = argh.ArghParser()
parser.add_commands([echo, greet])
# dispatching:
if __name__ == '__main__':
    parser.dispatch()
It will automatically generate help and so on, and you can use decorators to provide extra guidance on how the arg-parsing should work.
I recommend looking at docopt as a simple alternative to these others.
docopt is a new project that works by parsing your --help usage message rather than requiring you to implement everything yourself. You just have to put your usage message in the POSIX format.
Also, with Python 3 you might find it convenient to use Extended Iterable Unpacking to handle optional positional arguments without additional dependencies:
import sys
try:
    _, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
except ValueError:
    print("Not enough arguments", file=sys.stderr)  # the unhandled exception traceback is also meaningful enough
    sys.exit(-1)
The above argv unpacking makes arg2 and arg3 "optional": if they are not specified in argv, they will be None, while if the first is not specified, ValueError will be raised:
Traceback (most recent call last):
File "test.py", line 3, in <module>
_, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
ValueError: not enough values to unpack (expected at least 4, got 3)
My solution is entrypoint2. Example:
from entrypoint2 import entrypoint
@entrypoint
def add(file, quiet=True):
    ''' This function writes report.
    :param file: write report to FILE
    :param quiet: don't print status messages to stdout
    '''
    print(file, quiet)
help text:
usage: report.py [-h] [-q] [--debug] file
This function writes report.
positional arguments:
  file         write report to FILE
optional arguments:
  -h, --help   show this help message and exit
  -q, --quiet  don't print status messages to stdout
  --debug      set logging level to DEBUG
import sys
# Command line arguments are stored into sys.argv
# print(sys.argv[1:])
# I used the slice [1:] to print all the elements except the first
# This because the first element of sys.argv is the program name
# So the first argument is sys.argv[1], the second is sys.argv[2], etc.
print("File name: " + sys.argv[0])
print("Arguments:")
for i in sys.argv[1:]:
    print(i)
Let's name this file command_line.py and let's run it:
C:\Users\simone> python command_line.py arg1 arg2 arg3 ecc
File name: command_line.py
Arguments:
arg1
arg2
arg3
ecc
Now let's write a simple program, sum.py:
import sys
try:
    print(sum(map(float, sys.argv[1:])))
except ValueError:
    print("An error has occurred")
Result:
C:\Users\simone> python sum.py 10 4 6 3
23.0
This handles simple switches, value switches with optional alternative flags.
import sys
# [IN] argv - array of args
# [IN] switch - switch to seek
# [IN] val - expecting value
# [IN] alt - switch alternative
# returns value or True if val not expected
def parse_cmd(argv, switch, val=None, alt=None):
    for idx, x in enumerate(argv):
        if x == switch or x == alt:
            if val:
                if len(argv) > (idx + 1):
                    if not argv[idx + 1].startswith('-'):
                        return argv[idx + 1]
            else:
                return True
# expecting a value for -i
i = parse_cmd(sys.argv[1:], "-i", True, "--input")
# no value needed for -p
p = parse_cmd(sys.argv[1:], "-p")
Several of our biotechnology clients have posed these two questions recently:
How can we execute a Python script as a command?
How can we pass input values to a Python script when it is executed as a command?
I have included a Python script below which I believe answers both questions. Let's assume the following Python script is saved in the file test.py:
#
#----------------------------------------------------------------------
#
# file name: test.py
#
# input values: data - location of data to be processed
# date - date data were delivered for processing
# study - name of the study where data originated
# logs - location where log files should be written
#
# macOS usage:
#
# python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
#
# Windows usage:
#
# python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
#
#----------------------------------------------------------------------
#
# import needed modules...
#
import sys
import datetime
def main(argv):
    #
    # print message that process is starting...
    #
    print("test process starting at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
    #
    # set local values from input values...
    #
    data = argv[1]
    date = argv[2]
    study = argv[3]
    logs = argv[4]
    #
    # print input arguments...
    #
    print("data value is", data)
    print("date value is", date)
    print("study value is", study)
    print("logs value is", logs)
    #
    # print message that process is ending...
    #
    print("test process ending at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
#
# call main() to begin processing...
#
if __name__ == '__main__':
    main(sys.argv)
The script can be executed on a macOS computer in a Terminal shell as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
$ python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
test process starting at 20220518 16:51
data value is /Users/lawrence/data
date value is 20220518
study value is XYZ123
logs value is /Users/lawrence/logs
test process ending at 20220518 16:51
The script can also be executed on a Windows computer in a Command Prompt as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
D:\scripts>python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
test process starting at 20220518 17:20
data value is D:\data
date value is 20220518
study value is XYZ123
logs value is D:\logs
test process ending at 20220518 17:20
This script answers both questions posed above and is a good starting point for developing scripts that will be executed as commands with input values.
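As a possible next step, the same four input values could be declared with argparse, which adds a generated usage message and an error when a value is missing. This is only a sketch: the build_parser helper is a name introduced here, the help strings are taken from the header comments of test.py, and an explicit list stands in for the real command line.

```python
import argparse


def build_parser():
    # One positional argument per input value described in test.py's header.
    parser = argparse.ArgumentParser(prog="test.py")
    parser.add_argument("data", help="location of data to be processed")
    parser.add_argument("date", help="date data were delivered for processing")
    parser.add_argument("study", help="name of the study where data originated")
    parser.add_argument("logs", help="location where log files should be written")
    return parser


if __name__ == "__main__":
    # An explicit list stands in for the real command line here;
    # call parse_args() with no arguments to read sys.argv instead.
    args = build_parser().parse_args(
        ["/Users/lawrence/data", "20220518", "XYZ123", "/Users/lawrence/logs"])
    print("data value is", args.data)
    print("date value is", args.date)
    print("study value is", args.study)
    print("logs value is", args.logs)
```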
Reason for the new answer:
Existing answers specify multiple options.
The standard option is argparse. A few answers provide examples from the documentation, and one answer explains its advantages, but none of them answers the OP's actual question clearly enough, at least for newbies.
An example of argparse:
import argparse
def load_config(conf_file):
    pass
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Specifies one argument from the command line
    # You can have any number of arguments like this
    parser.add_argument("conf_file", help="configuration file for the application")
    args = parser.parse_args()
    config = load_config(args.conf_file)
The above program expects a config file as an argument. If you provide it, it will execute happily. If not, it will print the following:
usage: test.py [-h] conf_file
test.py: error: the following arguments are required: conf_file
You can make an argument optional.
You can specify the expected type for the argument using the type key:
parser.add_argument("age", type=int, help="age of the person")
You can specify a default value for an argument using the default key.
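The three points above (optional arguments, type, default) can be sketched together as follows. The file name "app.conf", the default age of 18 and the --verbose flag are made up for illustration, and explicit lists are passed to parse_args() so the snippet runs on its own.

```python
import argparse

parser = argparse.ArgumentParser(prog="test.py")
parser.add_argument("conf_file", help="configuration file for the application")
# nargs="?" makes a positional argument optional; default fills it in when absent.
parser.add_argument("age", type=int, nargs="?", default=18, help="age of the person")
# Arguments starting with "--" are optional unless required=True is passed.
parser.add_argument("--verbose", action="store_true", help="print extra output")

with_age = parser.parse_args(["app.conf", "42"])
without_age = parser.parse_args(["app.conf", "--verbose"])

print(with_age.age, with_age.verbose)        # 42 False
print(without_age.age, without_age.verbose)  # 18 True
```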
This document will help you to understand it to an extent.
Related
I have an argparse function containing a mix of internal and user-specified settings. I want to use a JSON configuration file to store the user-specified parameters, so that the JSON is parsed back into this argparse function.
I also have a mix of data types in the parameters; they are defined in argparse but not in the JSON.
My argparse function looks like this:
def parse_opt():
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', nargs='+', type=str, default='experiment', help='project name') #specify by users
    parser.add_argument('--visualise', action='store_true', help='output contains graphs') #specify by users
    parser.add_argument('--imgsize', '--img', '--img-size', nargs='+', type=int, default=[640], help='image size h,w') #let users specify
    parser.add_argument('--data', type=str, default=ROOT / 'data/coco128.yaml', help='(optional) dataset.yaml path') #internal default setting
    parser.add_argument('--thres', type=float, default=0.3, help='threshold') #internal default setting
    opt = parser.parse_args()
    return opt
My json configuration config.json looks like this, and it allows users to specify their parameters
d = {"name": "trial_001",
"visualise": true,
"imgsize": 1280}
I tried to pass the new configuration using the script below, and ran into the error TypeError: 'bool' object is not subscriptable. In the main() function, I want all default settings parsed as opt; then the three user-defined parameters in config.json should override opt.name, opt.visualise and opt.imgsize. Then detect(**vars(opt)) reads all user and default parameters and applies the detect() function to them (note: my detect() function isn't included in this post as it is quite long). I'd appreciate any pointers here. Thanks.
import argparse
import json
def main(opt):
    opt = parse_opt()
    with open('config.json') as config_file:
        d = json.loads(config_file.read())
    for item in d.items():
        args.extend(item)
    detect(**vars(opt)) #detect() is a function that reads all variables from opt
if __name__ == "__main__":
    main(opt)
EDIT: this is the full error message I encountered.
for item in d.items():
    args.extend(item)
parser.parse_args(args)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-53a113868d66>", line 1, in <module>
parser.parse_args(args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/argparse.py", line 1749, in parse_args
args, argv = self.parse_known_args(args, namespace)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/argparse.py", line 1781, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/argparse.py", line 1822, in _parse_known_args
option_tuple = self._parse_optional(arg_string)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/argparse.py", line 2108, in _parse_optional
if not arg_string[0] in self.prefix_chars:
TypeError: 'bool' object is not subscriptable
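The traceback arises because d.items() yields Python values such as True and 1280, while parse_args() expects a list of strings. One possible way around it, sketched here under some assumptions, is to feed the JSON values in as parser defaults via set_defaults(). The build_parser and parse_with_config names are introduced for this sketch only, the parser is a trimmed-down stand-in for parse_opt() (--name is simplified to a single string, --data is left out because ROOT is not shown), and json.loads on a literal stands in for reading config.json.

```python
import argparse
import json


def build_parser():
    # A trimmed-down stand-in for parse_opt() from the question.
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', type=str, default='experiment', help='project name')
    parser.add_argument('--visualise', action='store_true', help='output contains graphs')
    parser.add_argument('--imgsize', nargs='+', type=int, default=[640], help='image size h,w')
    parser.add_argument('--thres', type=float, default=0.3, help='threshold')
    return parser


def parse_with_config(config, argv=None):
    """Parse argv, treating the entries of config as new default values."""
    parser = build_parser()
    defaults = vars(parser.parse_args([]))
    unknown = set(config) - set(defaults)
    if unknown:
        raise KeyError("unknown config keys: %s" % sorted(unknown))
    overrides = {}
    for key, value in config.items():
        # set_defaults() does not apply type/nargs, so wrap scalars by hand
        # when the parser's own default is a list (e.g. imgsize).
        if isinstance(defaults[key], list) and not isinstance(value, list):
            value = [value]
        overrides[key] = value
    parser.set_defaults(**overrides)
    return parser.parse_args(argv)


# json.loads on a literal stands in for reading config.json from disk.
config = json.loads('{"name": "trial_001", "visualise": true, "imgsize": 1280}')
opt = parse_with_config(config, argv=[])
print(vars(opt))
```

With this arrangement the precedence is: built-in defaults, then config.json, then anything given explicitly on the command line.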
I have 2 JSON configuration files to read and want to assign their values to variables. I am creating a Dataflow job using Apache Beam but am unable to parse those files and assign their values to variables.
config1.json - { "bucket_name": "mybucket"}
config2.json - { "dataset_name": "mydataset"}
These are the pipeline statements. I tried with one JSON file first, but even that is not working:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser(custom_options.configfile))
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
I also tried creating a function to parse at least one file. This is a sample method I created, but it did not work.
class custom_json_parser(beam.DoFn):
    import apache_beam as beam
    from apache_beam.io.gcp import gcsio
    import logging
    def __init__(self, configfile):
        self.configfile = configfile
    def process(self, configfile):
        logging.info("JSON PARSING STARTED")
        with beam.io.gcp.gcsio.GcsIO().open(self.configfile, 'r') as f:
            for line in f:
                data = json.loads(line)
                bucket = data.get('bucket_name')
                dataset = data.get('dataset_name')
Can someone please suggest the best method to resolve this issue in apache beam?
Thanks in Advance
If you only need to read your files once, don't read them in the pipeline; read them before running it.
Read the files from GCS
Parse the file and put the useful content in the pipeline options map
Run your pipeline and use the data from the options
EDIT 1
You can use this piece of code to load the file and read it, before your pipeline. Simple Python, standard GCS libraries.
from google.cloud import storage
import json
client = storage.Client()
bucket = client.get_bucket('your-bucket')
blob = bucket.get_blob("name.json")
json_data = blob.download_as_string().decode('UTF-8')
print(json_data) # print -> {"name": "works!!"}
print(json.loads(json_data)["name"]) # print -> works!!
You can try the following code snippet.
Function to parse the file:
class custom_json_parser(beam.DoFn):
    def process(self, element):
        logging.info(element)
        data = json.loads(element)
        bucket = data.get('bucket_name')
        dataset = data.get('dataset_name')
        return [{"bucket": bucket, "dataset": dataset}]
In the pipeline you can call the function:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser())
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
It will work.
I've googled this quite a bit and am unable to find helpful insight. Basically, I need to take the user input from my argparse arguments from a python script (as shown below) and plug those values into a json file (packerfile.json) located in the same working directory. I have been experimenting with subprocess, invoke and plumbum libraries without being able to "find the shoe that fits".
From the following code, I have removed everything except the arguments, to clean it up:
#!/usr/bin/python
import os, sys, subprocess
import argparse
import json
from invoke import run
import packer
parser = argparse.ArgumentParser()
parser._positionals.title = 'Positional arguments'
parser._optionals.title = 'Optional arguments'
parser.add_argument("--access_key",
required=False,
action='store',
default=os.environ['AWS_ACCESS_KEY_ID'],
help="AWS access key id")
parser.add_argument("--secret_key",
required=False,
action='store',
default=os.environ['AWS_SECRET_ACCESS_KEY'],
help="AWS secret access key")
parser.add_argument("--region",
required=False,
action='store',
help="AWS region")
parser.add_argument("--guest_os_type",
required=True,
action='store',
help="Operating system to install on guest machine")
parser.add_argument("--ami_id",
required=False,
help="AMI ID for image base")
parser.add_argument("--instance_type",
required=False,
action='store',
help="Type of instance determines overall performance (e.g. t2.medium)")
parser.add_argument("--ssh_key_path",
required=False,
action='store',
default=os.environ['HOME']+'/.ssh',
help="SSH key path (e.g. ~/.ssh)")
parser.add_argument("--ssh_key_name",
required=True,
action='store',
help="SSH key name (e.g. mykey)")
args = parser.parse_args()
print(vars(args))
json example code:
{
"variables": {
"aws_access_key": "{{ env `AWS_ACCESS_KEY_ID` }}",
"aws_secret_key": "{{ env `AWS_SECRET_ACCESS_KEY` }}",
"magic_reference_date": "{{ isotime \"2006-01-02\" }}",
"aws_region": "{{ env 'AWS_REGION' }}",
"aws_ami_id": "ami-036affea69a1101c9",
"aws_instance_type": "t2.medium",
"image_version" : "0.1.0",
"guest_os_type": "centos7",
"home": "{{ env `HOME` }}"
},
So, the user input for --region as shown in the Python script should get plugged into the value for aws_region in the JSON file.
I am aware of how to print the value of args. The full command that I am providing to the script is: python packager.py --region us-west-2 --guest_os_type rhel7 --ssh_key_name test_key and the printed results are {'access_key': 'REDACTED', 'secret_key': 'REDACTED', 'region': 'us-west-2', 'guest_os_type': 'rhel7', 'ami_id': None, 'instance_type': None, 'ssh_key_path': '/Users/REDACTEDt/.ssh', 'ssh_key_name': 'test_key'}. What I need is to import those values into the packerfile.json variables list, preferably in a way that I can reuse it (so it mustn't overwrite the file).
Note: I have also been experimenting with using python to export local environment variables then having the JSON file pick them up, but that doesn't really seem like a viable solution.
I think that the best solution might be to take all of these arguments and export them to their own JSON file called variables.json, and then import these variables from JSON (variables.json) to JSON (packerfile.json) as a separate process. I could still use guidance here, though :)
You might use the __dict__ attribute of the Namespace object that is returned by ArgumentParser.parse_args(). Like so:
import json
parsed = parser.parse_args()
with open('packerfile.json', 'w') as f:
    json.dump(parsed.__dict__, f)
If required, you could use add_argument(dest='attrib_name') to customise attribute names.
I was actually able to come up with a pretty simple solution.
args = parser.parse_args()
json_formatted = json.dumps(vars(args), indent=4)
print(json_formatted)
s.call("echo '%s' > variables.json && packer build -var-file=variables.json packerfile.json" % json_formatted, shell=True)
The arguments are captured in the variable args and dumped with json.dumps, while vars makes sure the arguments are dumped together with their key names. I currently have to run my code with >> vars.json, but I'll insert logic to have Python handle that.
Note: s == subprocess in s.call
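A variation that lets Python write the variables file directly, instead of going through echo and a shell redirect, might look like the sketch below. The write_var_file helper is a hypothetical name, the example dictionary stands in for vars(parser.parse_args()), and dropping None values is a choice made here, not something Packer requires. The keys would still need to match the variable names declared in packerfile.json.

```python
import json
import os
import tempfile


def write_var_file(values, path):
    """Write the non-None entries of values to path as JSON and return them."""
    variables = {key: value for key, value in values.items() if value is not None}
    with open(path, 'w') as f:
        json.dump(variables, f, indent=4)
    return variables


if __name__ == '__main__':
    # Example values standing in for vars(parser.parse_args()).
    example = {'region': 'us-west-2', 'guest_os_type': 'rhel7', 'ami_id': None}
    out_path = os.path.join(tempfile.mkdtemp(), 'variables.json')
    print(write_var_file(example, out_path))
    # packerfile.json is left untouched; the file would then be passed as:
    #   packer build -var-file=variables.json packerfile.json
```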
I've been finally getting into Python, and have noticed something strange, that works in Java, but not in Python.
When I type the following:
fn = "" # Local filename storage.
def read(filename):
    fn = filename
    return open(filename, 'r').read()
My flake8 linter for Atom gives me the following error:
F841 - local variable 'fn' is assigned to but never used.
I'm assuming this means that the variable is being defined on the def level, and not the module level, which I intend on doing. Please correct me if I'm wrong.
I've searched Google, with multiple wordings, but can't seem to word it in a way that the correct results display...
Any ideas on how I can be able to achieve module-level variable definitions from the function-level?
If you want to declare fn as a global (module-level) variable, use the global statement.
def read(filename):
    global fn # <-----
    fn = filename
    return open(filename, 'r').read()
BTW, ; is optional. Don't use it.
You can set a module level variable from the function by doing:
import sys
def read(filename):
    module = sys.modules[__name__]
    setattr(module, 'fn', filename)
    return open(filename, 'r').read()
However, it's a very strange necessity. Consider changing your architecture.
UPD: Let's consider an example:
# module1
# uncomment it to fix NameError and AttributeError
# some_var = ''
def foo(val):
    global some_var
    some_var = val
# module2
from module1 import *
print(some_var) # raises NameError: name 'some_var' is not defined
foo('bar')
print(some_var) # still raises NameError: name 'some_var' is not defined
# module3
import module1
print(module1.some_var) # raises AttributeError: 'module' object has no attribute 'some_var'
module1.foo('bar')
print(module1.some_var) # prints 'bar' even without some_var = '' definition in the module1
So, it's not so obvious how global behaves during the import process. I think, that manually doing setattr(module, 'attr_name', value) during the read() call is more clear.
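The module2 and module3 behaviour can be reproduced in a single file. The sketch below builds a throwaway module in memory with types.ModuleType and exec, and emulates the star import by copying the public names into a dictionary. That emulation is an approximation chosen for this demo, not how the import system is implemented.

```python
import types

# Build a throwaway module in memory; "module1" mirrors the example above.
module1 = types.ModuleType("module1")
source = (
    "def foo(val):\n"
    "    global some_var\n"
    "    some_var = val\n"
)
exec(source, module1.__dict__)

# Emulate `from module1 import *` done before foo() runs: copy the public names.
star_imported = {k: v for k, v in vars(module1).items() if not k.startswith("_")}

module1.foo("bar")

# The global statement created the name in module1's namespace...
print(hasattr(module1, "some_var"))
# ...but the earlier copy of names never sees it.
print("some_var" in star_imported)
```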
A simple program for reading a CSV file inside a ZIP archive:
import csv, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
for row in csv.DictReader(items_file):
    pass
works in Python 2.7:
$ python2.7 test_zip_file_py3k.py ~/data.zip
$
but not in Python 3.2:
$ python3.2 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 8, in <module>
for row in csv.DictReader(items_file):
File "/somedir/python3.2/csv.py", line 109, in __next__
self.fieldnames
File "/somedir/python3.2/csv.py", line 96, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file
in text mode?)
The csv module in Python 3 wants to see a text file, but zipfile.ZipFile.open returns a zipfile.ZipExtFile that is always treated as binary data.
How does one make this work in Python 3?
I just noticed that Lennart's answer didn't work with Python 3.1, but it does work with Python 3.2. They've enhanced zipfile.ZipExtFile in Python 3.2 (see release notes). These changes appear to make zipfile.ZipExtFile work nicely with io.TextIOWrapper.
Incidentally, it works in Python 3.1, if you uncomment the hacky lines below to monkey-patch zipfile.ZipExtFile, not that I would recommend this sort of hackery. I include it only to illustrate the essence of what was done in Python 3.2 to make things work nicely.
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
# items_file.readable = lambda: True
# items_file.writable = lambda: False
# items_file.seekable = lambda: False
# items_file.read1 = items_file.read
items_file = io.TextIOWrapper(items_file)
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0} -- row = {1}'.format(idx, row))
If I had to support py3k < 3.2, then I would go with the solution in my other answer.
Update for 3.6+
Starting with 3.6, support for mode='U' was removed:
Changed in version 3.6: Removed support of mode='U'. Use io.TextIOWrapper for reading compressed text files in universal newlines mode.
Starting with 3.8, a Path object was added which gives us an open() method that we can call like the built-in open() function (passing newline='' in the case of our CSV), and we get back an io.TextIOWrapper object that the csv readers accept. See Yuri's answer.
You can wrap it in an io.TextIOWrapper.
items_file = io.TextIOWrapper(items_file, encoding='your-encoding', newline='')
Should work.
And if you'd just like to read a file into a string:
from zipfile import ZipFile
with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        eggs = myfile.read().decode('UTF-8')
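For a version of the TextIOWrapper approach that can be run without any files on disk, here is a small sketch. It builds a ZIP archive in memory first; the member name items.csv comes from the question, while the CSV contents are made up for illustration.

```python
import csv
import io
import zipfile

# Build a small ZIP archive in memory so the example needs no files on disk.
# The CSV contents are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('items.csv', 'name,qty\r\napple,3\r\npear,5\r\n')
buffer.seek(0)

rows = []
with zipfile.ZipFile(buffer) as zf:
    with zf.open('items.csv') as binary_file:
        # Wrap the binary member in a text layer; newline='' is what csv expects.
        text_file = io.TextIOWrapper(binary_file, encoding='utf-8', newline='')
        for row in csv.DictReader(text_file):
            rows.append(row)

print(rows)
```

The rows are read lazily through the wrapper, so the archive member does not have to be loaded into memory in one piece.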
Lennart's answer is on the right track (Thanks, Lennart, I voted up your answer) and it almost works:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(items_file, encoding='iso-8859-1', newline='')
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 7, in <module>
items_file = io.TextIOWrapper(items_file,
encoding='iso-8859-1',
newline='')
AttributeError: readable
The problem appears to be that io.TextIOWrapper's first required parameter is a buffer, not a file object.
This appears to work:
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
This seems a little complex and also it seems annoying to have to read in a whole (perhaps huge) zip file into memory. Any better way?
Here it is in action:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Processing row 0
Processing row 1
Processing row 2
...
Processing row 250
Starting with Python 3.8, the zipfile module has the Path object, which we can use with its open() method to get an io.TextIOWrapper object, which can be passed to the csv readers:
import csv, sys, zipfile
# Give a string path to the ZIP archive, and
# the archived file to read from
items_zipf = zipfile.Path(sys.argv[1], at='items.csv')
# Then use the open method, like you'd usually
# use the built-in open()
items_f = items_zipf.open(newline='')
# Pass the TextIO-like file to your reader as normal
for row in csv.DictReader(items_f):
    print(row)
Here's a minimal recipe to open a zip file and read a text file inside that zip. I found the trick to be the TextIOWrapper read() method, not mentioned in any answers above (BytesIO.read() was mentioned above, but Python docs recommend TextIOWrapper).
import zipfile
import io
# Create the ZipFile object
zf = zipfile.ZipFile('my_zip_file.zip')
# Read a file that is inside the zip...reads it as a binary file-like object
my_file_binary = zf.open('my_text_file_inside_zip.txt')
# Convert the binary file-like object directly to text using TextIOWrapper and its read() method
my_file_text = io.TextIOWrapper(my_file_binary, encoding='utf-8', newline='').read()
I wish they kept the mode='U' parameter in the ZipFile open() method to do this same thing since that was so succinct but, alas, that is obsolete.