ogr2ogr or arcpy for CSV to shapefile? - csv

Can ogr2ogr or arcpy do a direct csv to shapefile conversion?
I'm trying to automate some processes with a small script and was hoping I could do it easily with ogr2ogr or arcpy, both of which I'm new to.
Any input would be appreciated.

It can be done with ogr2ogr easily.
Assuming you have a CSV file containing coordinates, for example (it has to be comma-separated):
coord.csv
x,y,z
48.66080825,10.28323850,0
48.66074700,10.28292000,0
48.66075045,10.28249425,0
48.66075395,10.28249175,0
48.66077113,10.28233356,0
48.66080136,10.28213118,0
48.66079620,10.28196900,0
Then you need to create a VRT file (named after your CSV) in the same directory:
coord.vrt
<OGRVRTDataSource>
    <OGRVRTLayer name="output">
        <SrcDataSource relativeToVRT="1">.</SrcDataSource>
        <SrcLayer>coord</SrcLayer>
        <GeometryType>wkbPoint</GeometryType>
        <LayerSRS>WGS84</LayerSRS>
        <GeometryField encoding="PointFromColumns" x="x" y="y"/>
    </OGRVRTLayer>
</OGRVRTDataSource>
Then run:
ogr2ogr -f "ESRI Shapefile" . coord.csv && ogr2ogr -f "ESRI Shapefile" . coord.vrt
This will give you "output.shp" in the coordinate system you specified in the VRT file.
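As an aside: with a reasonably recent GDAL (2.0 or later), the CSV driver's open options may let you skip the VRT entirely. Treat the exact flags below as something to verify against your GDAL version; the column names x/y and WGS84 are taken from the example above:
ogr2ogr -f "ESRI Shapefile" output.shp coord.csv -oo X_POSSIBLE_NAMES=x -oo Y_POSSIBLE_NAMES=y -a_srs EPSG:4326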
Regards,
muxav

You need the following workflow to convert a .csv of coordinates to a feature class using the Python arcpy site-package:
Make XY Event Layer (Data Management) converts the tabular data to a temporary spatial layer
Feature Class To Feature Class (Conversion) converts the layer to a permanent feature class
This should get you started.
import arcpy
from arcpy import env
# Set environment settings
env.workspace = "C:/data"
ws = env.workspace
# Set the local variables
in_Table = "your_table.csv"
x_coords = "POINT_X"
y_coords = "POINT_Y"
z_coords = "POINT_Z"
out_Layer = "your_layer"
# Set the spatial reference--this is simply a path to a .prj file
spRef = r"Coordinate Systems\Projected Coordinate Systems\Utm\Nad 1983\NAD 1983 UTM Zone 11N.prj"
# Make the XY event layer...
arcpy.MakeXYEventLayer_management(in_Table, x_coords, y_coords, out_Layer, spRef, z_coords)
# Now convert to a feature class
arcpy.FeatureClassToFeatureClass_conversion(out_Layer, ws, "out.shp")
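If you are on a more recent ArcGIS release, the spatial reference can also be built from an EPSG/factory code instead of a .prj path. A minimal sketch, assuming arcpy 10.1 or later:
# build the spatial reference from a factory code instead of a .prj path
spRef = arcpy.SpatialReference(26911)  # EPSG:26911 = NAD 1983 UTM Zone 11N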

I did not have success with any of the solutions here, but I was able to come up with one that worked using Python's shapely and fiona modules. It uses a tab-delimited ASCII file (my preference, as opposed to .csv) but can easily be adapted to use a .csv as in the question. Hopefully this is helpful to someone else trying to automate the same task.
# ------------------------------------------------------
# IMPORTS
# ------------------------------------------------------
import os
import pandas as pd
from shapely.geometry import Point, mapping
from fiona import collection
# ------------------------------------------------------
# INPUTS
# ------------------------------------------------------
# Define path
path = os.path.abspath(os.path.dirname(__file__))
# Set working directory
os.chdir(path)
# Define file to convert
file = 'points.ascii'
# Define shp file schema
schema = { 'geometry': 'Point', 'properties': { 'LocationID': 'str', 'Latitude': 'float', 'Longitude': 'float' } }
# Read in data
data = pd.read_csv(file, sep='\t')
# Define shp file to write to
shpOut = 'points.shp'
# Create shp file
with collection(shpOut, "w", "ESRI Shapefile", schema) as output:
    # Loop through dataframe and populate shp file
    for index, row in data.iterrows():
        # Define point
        point = Point(row['Longitude'], row['Latitude'])
        # Write output
        output.write({
            'properties': {'LocationID': row['LocationID'], 'Latitude': row['Latitude'], 'Longitude': row['Longitude']},
            'geometry': mapping(point)
        })
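One caveat with the block above: it does not write a .prj, so the shapefile has no coordinate system defined. If you need one, fiona accepts a crs argument when opening the collection. A minimal sketch, assuming WGS84 latitude/longitude data:
from fiona.crs import from_epsg

# identical to the block above, but with an explicit CRS so a .prj file is written
with collection(shpOut, "w", "ESRI Shapefile", schema, crs=from_epsg(4326)) as output:
    for index, row in data.iterrows():
        point = Point(row['Longitude'], row['Latitude'])
        output.write({
            'properties': {'LocationID': row['LocationID'], 'Latitude': row['Latitude'], 'Longitude': row['Longitude']},
            'geometry': mapping(point)
        })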

Related

How can I convert a JSON file from the labelme interface to PNG or another image format?

When I label images with the labelme interface, the output I get is a JSON file, but I need it in an image format like PNG, BMP, or JPEG after labeling. Can anyone suggest any code?
import json
from PIL import Image
with open('your.json') as f:
    data = json.load(f)
# Load the file path from the json
imgpath = data['yourkey']
# Place the image path into the open method
img = Image.open(imgpath)
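If the original image file is no longer available on disk, labelme JSON files normally also embed the image itself as a base64 string under the imageData key. A minimal sketch, assuming the standard labelme format:
import base64
import io
import json
from PIL import Image

with open('your.json') as f:
    data = json.load(f)

# Decode the embedded base64 image and save it as PNG
img = Image.open(io.BytesIO(base64.b64decode(data['imageData'])))
img.save('labeled_source.png')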
Based on the tutorial of the original repository, you can use labelme_json_to_dataset <<JSON_PATH>> -o <<OUTPUT_FOLDER_PATH>>.
To run it from Python / Jupyter, you can use:
import os
def labelme_json_to_dataset(json_path):
    os.system("labelme_json_to_dataset " + json_path + " -o " + json_path.replace(".", "_"))
If you need to do it for multiple images, just loop the function.
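For example, a minimal loop over a folder of JSON files (the path is a placeholder):
import glob

for json_path in glob.glob("path/to/annotations/*.json"):
    labelme_json_to_dataset(json_path)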
Based on the issue, labelme_json_to_dataset behavior can be reimplemented by using either labelme2voc.py or labelme2coco.py.
You could also use other implementations, such as labelme2Datasets.
You can also implement your own variant of labelme_json_to_dataset using the labelme library. Basically, you use label_file = labelme.LabelFile(filename=filename) followed by img = labelme.utils.img_data_to_arr(label_file.imageData). An example of such a process would look like this:
import glob
import os.path as osp

import labelme
from PIL import Image

def labelme2images(input_dir, output_dir, force=False, save_img=False, new_size=False):
    """
    new_size, if given, is a (new_size_width, new_size_height) tuple
    """
    if save_img:
        # _makedirs is a small helper (not shown here) that creates the output folder
        _makedirs(path=osp.join(output_dir, "images"), force=force)
    if new_size:
        new_size_width, new_size_height = new_size
    print("Generating dataset")
    filenames = glob.glob(osp.join(input_dir, "*.json"))
    for filename in filenames:
        # base name
        base = osp.splitext(osp.basename(filename))[0]
        label_file = labelme.LabelFile(filename=filename)
        img = labelme.utils.img_data_to_arr(label_file.imageData)
        h, w = img.shape[0], img.shape[1]
        if save_img:
            if new_size:
                img_pil = Image.fromarray(img).resize((new_size_height, new_size_width))
            else:
                img_pil = Image.fromarray(img)
            img_pil.save(osp.join(output_dir, "images", base + ".jpg"))
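A usage example with the function above (paths and options are placeholders):
# extract the embedded images from all JSON files in ./annotations into ./dataset/images
labelme2images("./annotations", "./dataset", force=True, save_img=True)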

How to parse 2 JSON files in Apache Beam

I have 2 JSON configuration files to read and want to assign their values to variables. I am creating a Dataflow job using Apache Beam but am unable to parse those files and assign their values to variables.
config1.json - { "bucket_name": "mybucket"}
config2.json - { "dataset_name": "mydataset"}
These are the pipeline statements -- I tried with one JSON file first, but even that is not working:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser(custom_options.configfile))
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
I also tried creating a function to parse at least one file. This is a sample method I created, but it did not work.
class custom_json_parser(beam.DoFn):
    import apache_beam as beam
    from apache_beam.io.gcp import gcsio
    import logging

    def __init__(self, configfile):
        self.configfile = configfile

    def process(self, configfile):
        logging.info("JSON PARSING STARTED")
        with beam.io.gcp.gcsio.GcsIO().open(self.configfile, 'r') as f:
            for line in f:
                data = json.loads(line)
                bucket = data.get('bucket_name')
                dataset = data.get('dataset_name')
Can someone please suggest the best method to resolve this issue in Apache Beam?
Thanks in advance.
If you only need to read your files once, don't read them in the pipeline; read them before running it.
Read the files from GCS
Parse the file and put the useful content in the pipeline options map
Run your pipeline and use the data from the options
EDIT 1
You can use this piece of code to load the file and read it, before your pipeline. Simple Python, standard GCS libraries.
from google.cloud import storage
import json
client = storage.Client()
bucket = client.get_bucket('your-bucket')
blob = bucket.get_blob("name.json")
json_data = blob.download_as_string().decode('UTF-8')
print(json_data) # print -> {"name": "works!!"}
print(json.loads(json_data)["name"]) # print -> works!!
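Putting the three steps together, here is a minimal sketch (bucket and file names are placeholders; it assumes the config values are only needed while building the pipeline, not per element):
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage


def read_json_config(bucket_name, blob_name):
    # Download and parse a small JSON config object from GCS before the pipeline runs
    client = storage.Client()
    blob = client.get_bucket(bucket_name).get_blob(blob_name)
    return json.loads(blob.download_as_string().decode('UTF-8'))


cfg1 = read_json_config('my-config-bucket', 'config1.json')
cfg2 = read_json_config('my-config-bucket', 'config2.json')
bucket_name = cfg1['bucket_name']      # "mybucket"
dataset_name = cfg2['dataset_name']    # "mydataset"

# The values are now ordinary Python variables and can be used anywhere
# in the pipeline construction code
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    _ = (pipeline
         | "Create" >> beam.Create(['bucket=%s, dataset=%s' % (bucket_name, dataset_name)])
         | "Write" >> beam.io.WriteToText('gs://%s/outputfile.txt' % bucket_name))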
You can try the following code snippet.
Function to parse the file:
class custom_json_parser(beam.DoFn):
    def process(self, element):
        logging.info(element)
        data = json.loads(element)
        bucket = data.get('bucket_name')
        dataset = data.get('dataset_name')
        return [{"bucket": bucket, "dataset": dataset}]
In the pipeline you can call the function:
with beam.Pipeline(options=pipeline_options) as pipeline:
    steps = (pipeline
             | "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
             | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser())
             | "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
             )
    result = pipeline.run()
    result.wait_until_finish()
It will work.

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to calculate numbers based on the data in those files (for example, counting the number of speakers of each gender). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = (json_content['dict-name'])
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers just the read or includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found the instances that are network optimized (like c5n - look for the n) and the inf1 (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multithreading is orders of magnitude faster than single threading.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *
ufh = file_utilities() #instantiate the class functions - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/' # if you don't have a prefix pass '' (empty string or function will fail)
#define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list
#create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000] # as an example
num_proc = os.cpu_count() #tells you how many processors your machine has; function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc) #collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
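As a concrete example of that last step, if each JSON record carries a gender field (a hypothetical column name here), the per-gender count asked about in the question is a one-liner:
gender_counts = df_new['gender'].value_counts()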
File 2: class functions
#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd
#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')
class file_utilities:
    """file handling function"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys and dates for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        bucket_obj = boto3.resource('s3').Bucket(bucket)
        obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
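If you would rather skip pandas and the second file, note that the read itself is I/O-bound, so plain threads also give a big speed-up. A minimal sketch using only boto3 and the standard library (bucket and key names are placeholders):
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client('s3')

def load_one(key, bucket='bucket-name'):
    # fetch and parse a single JSON object from S3
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    return json.loads(body)

def load_all(keys, max_workers=32):
    # the work is dominated by network I/O, so threads parallelise it well
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(load_one, keys))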

Using python argparse arguments as variable values within a json file

I've googled this quite a bit and am unable to find helpful insight. Basically, I need to take the user input from my argparse arguments from a python script (as shown below) and plug those values into a json file (packerfile.json) located in the same working directory. I have been experimenting with subprocess, invoke and plumbum libraries without being able to "find the shoe that fits".
From the following code, I have removed everything except the argument definitions, to keep it clean:
#!/usr/bin/python
import os, sys, subprocess
import argparse
import json
from invoke import run
import packer
parser = argparse.ArgumentParser()
parser._positionals.title = 'Positional arguments'
parser._optionals.title = 'Optional arguments'
parser.add_argument("--access_key",
required=False,
action='store',
default=os.environ['AWS_ACCESS_KEY_ID'],
help="AWS access key id")
parser.add_argument("--secret_key",
required=False,
action='store',
default=os.environ['AWS_SECRET_ACCESS_KEY'],
help="AWS secret access key")
parser.add_argument("--region",
required=False,
action='store',
help="AWS region")
parser.add_argument("--guest_os_type",
required=True,
action='store',
help="Operating system to install on guest machine")
parser.add_argument("--ami_id",
required=False,
help="AMI ID for image base")
parser.add_argument("--instance_type",
required=False,
action='store',
help="Type of instance determines overall performance (e.g. t2.medium)")
parser.add_argument("--ssh_key_path",
required=False,
action='store',
default=os.environ['HOME']+'/.ssh',
help="SSH key path (e.g. ~/.ssh)")
parser.add_argument("--ssh_key_name",
required=True,
action='store',
help="SSH key name (e.g. mykey)")
args = parser.parse_args()
print(vars(args))
json example code:
{
  "variables": {
    "aws_access_key": "{{ env `AWS_ACCESS_KEY_ID` }}",
    "aws_secret_key": "{{ env `AWS_SECRET_ACCESS_KEY` }}",
    "magic_reference_date": "{{ isotime \"2006-01-02\" }}",
    "aws_region": "{{ env 'AWS_REGION' }}",
    "aws_ami_id": "ami-036affea69a1101c9",
    "aws_instance_type": "t2.medium",
    "image_version": "0.1.0",
    "guest_os_type": "centos7",
    "home": "{{ env `HOME` }}"
  },
So, the user input for --region as shown in the Python script should get plugged into the value for aws_region in the JSON file.
I am aware of how to print the value of args. The full command that I am providing to the script is: python packager.py --region us-west-2 --guest_os_type rhel7 --ssh_key_name test_key and the printed results are {'access_key': 'REDACTED', 'secret_key': 'REDACTED', 'region': 'us-west-2', 'guest_os_type': 'rhel7', 'ami_id': None, 'instance_type': None, 'ssh_key_path': '/Users/REDACTEDt/.ssh', 'ssh_key_name': 'test_key'}. What I need is to import those values into the packerfile.json variables list, preferably in a way that I can reuse (so it mustn't overwrite the file).
Note: I have also been experimenting with using Python to export local environment variables and then having the JSON file pick them up, but that doesn't really seem like a viable solution.
I think the best solution might be to export all of these arguments to their own JSON file called variables.json and then import those variables from JSON (variables.json) into JSON (packerfile.json) as a separate process. I could still use guidance here though :)
You might use the __dict__ attribute of the Namespace object returned by ArgumentParser.parse_args(). Like so:
import json

parsed = parser.parse_args()
with open('packerfile.json', 'w') as f:
    json.dump(parsed.__dict__, f)
If required, you could use add_argument(dest='attrib_name') to customise attribute names.
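If you want to push the parsed values into the variables block of the existing packerfile.json without overwriting the template, here is a minimal sketch (the argument-to-variable mapping is my assumption, based on the names in your snippet):
import json

args = parser.parse_args()

# map argparse destinations onto the variable names used in packerfile.json
mapping = {
    'region': 'aws_region',
    'ami_id': 'aws_ami_id',
    'instance_type': 'aws_instance_type',
    'guest_os_type': 'guest_os_type',
}

with open('packerfile.json') as f:
    packerfile = json.load(f)

for arg_name, var_name in mapping.items():
    value = getattr(args, arg_name)
    if value is not None:  # only override variables the user actually supplied
        packerfile['variables'][var_name] = value

# write to a new file so the original template stays reusable
with open('packerfile.generated.json', 'w') as f:
    json.dump(packerfile, f, indent=4)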
I was actually able to come up with a pretty simple solution.
args = parser.parse_args()
json_formatted = json.dumps(vars(args), indent=4)
print(json_formatted)
s.call("echo '%s' > variables.json && packer build -var-file=variables.json packerfile.json" % json_formatted, shell=True)
The arguments are captured under the variable args and dumped to the output with json.dumps, while vars makes sure to dump the arguments with their key values. I currently have to run my code with >> vars.json, but I'll insert logic to have Python handle that.
Note: s == subprocess in s.call
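A slightly more robust variant of the same idea writes variables.json with json.dump and avoids the shell round-trip (same file names as above):
import json
import subprocess

args = parser.parse_args()

# write the parsed arguments straight to the var file
with open('variables.json', 'w') as f:
    json.dump(vars(args), f, indent=4)

subprocess.call(['packer', 'build', '-var-file=variables.json', 'packerfile.json'])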

How do you read a file inside a zip file as text, not bytes?

A simple program for reading a CSV file inside a ZIP archive:
import csv, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
for row in csv.DictReader(items_file):
    pass
works in Python 2.7:
$ python2.7 test_zip_file_py3k.py ~/data.zip
$
but not in Python 3.2:
$ python3.2 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 8, in <module>
for row in csv.DictReader(items_file):
File "/somedir/python3.2/csv.py", line 109, in __next__
self.fieldnames
File "/somedir/python3.2/csv.py", line 96, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file
in text mode?)
The csv module in Python 3 wants to see a text file, but zipfile.ZipFile.open returns a zipfile.ZipExtFile that is always treated as binary data.
How does one make this work in Python 3?
I just noticed that Lennart's answer didn't work with Python 3.1, but it does work with Python 3.2. They've enhanced zipfile.ZipExtFile in Python 3.2 (see release notes). These changes appear to make zipfile.ZipExtFile work nicely with io.TextIOWrapper.
Incidentally, it works in Python 3.1, if you uncomment the hacky lines below to monkey-patch zipfile.ZipExtFile, not that I would recommend this sort of hackery. I include it only to illustrate the essence of what was done in Python 3.2 to make things work nicely.
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
# items_file.readable = lambda: True
# items_file.writable = lambda: False
# items_file.seekable = lambda: False
# items_file.read1 = items_file.read
items_file = io.TextIOWrapper(items_file)
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0} -- row = {1}'.format(idx, row))
If I had to support py3k < 3.2, then I would go with the solution in my other answer.
Update for 3.6+
Starting with 3.6, support for mode='U' was removed:
Changed in version 3.6: Removed support of mode='U'. Use io.TextIOWrapper for reading compressed text files in universal newlines mode.
Starting with 3.8, a Path object was added, which gives us an open() method that we can call like the built-in open() function (passing newline='' in the case of our CSV), and we get back an io.TextIOWrapper object that the csv readers accept. See Yuri's answer, here.
You can wrap it in an io.TextIOWrapper.
items_file = io.TextIOWrapper(items_file, encoding='your-encoding', newline='')
Should work.
And if you just like to read a file into a string:
with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        eggs = myfile.read().decode('UTF-8')
Lennart's answer is on the right track (Thanks, Lennart, I voted up your answer) and it almost works:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(items_file, encoding='iso-8859-1', newline='')
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 7, in <module>
items_file = io.TextIOWrapper(items_file,
encoding='iso-8859-1',
newline='')
AttributeError: readable
The problem appears to be that io.TextIOWrapper's first required parameter is a buffer, not a file object.
This appears to work:
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
This seems a little complex and also it seems annoying to have to read in a whole (perhaps huge) zip file into memory. Any better way?
Here it is in action:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Processing row 0
Processing row 1
Processing row 2
...
Processing row 250
Starting with Python 3.8, the zipfile module has the Path object, which we can use with its open() method to get an io.TextIOWrapper object, which can be passed to the csv readers:
import csv, sys, zipfile
# Give a string path to the ZIP archive, and
# the archived file to read from
items_zipf = zipfile.Path(sys.argv[1], at='items.csv')
# Then use the open method, like you'd usually
# use the built-in open()
items_f = items_zipf.open(newline='')
# Pass the TextIO-like file to your reader as normal
for row in csv.DictReader(items_f):
    print(row)
Here's a minimal recipe to open a zip file and read a text file inside that zip. I found the trick to be the TextIOWrapper read() method, not mentioned in any answers above (BytesIO.read() was mentioned above, but Python docs recommend TextIOWrapper).
import zipfile
import io
# Create the ZipFile object
zf = zipfile.ZipFile('my_zip_file.zip')
# Read a file that is inside the zip...reads it as a binary file-like object
my_file_binary = zf.open('my_text_file_inside_zip.txt')
# Convert the binary file-like object directly to text using TextIOWrapper and its read() method
my_file_text = io.TextIOWrapper(my_file_binary, encoding='utf-8', newline='').read()
I wish they kept the mode='U' parameter in the ZipFile open() method to do this same thing since that was so succinct but, alas, that is obsolete.