Shapefile to CSV with a WKT multipolygon column

My data is from census maps. I am not familiar with shapefiles or WKT files, but I managed to find a solution on which I based my own code.
import ogr
import csv
#Open files
csvfile=open("states_wkt.csv",'wb')
ds=ogr.Open("cb_2015_us_state_20m.shp")
lyr=ds.GetLayer()
#Get field names
dfn=lyr.GetLayerDefn()
nfields=dfn.GetFieldCount()
fields=[]
for i in range(nfields):
    fields.append(dfn.GetFieldDefn(i).GetName())
fields.append('kmlgeometry')
csvwriter = csv.DictWriter(csvfile, fields)
While this works, I get geometry results looking like:
""kmlgeometry"":""<MultiGeometry>
<Polygon><outerBoundaryIs><LinearRing><coordinates>-118.593969,33.467198
-118.484785,33.487483 -118.370323,33.409285 -118.286261
</coordinates></LinearRing></outerBoundaryIs></Polygon>
<Polygon><outerBoundaryIs><LinearRing><coordinates>-118.594033,33.035951
-118.540069,32.980933 -118.446771,32.895424 -118.353504,32.821962 -118.425634
</coordinates></LinearRing></outerBoundaryIs></Polygon>
</MultiGeometry>
In my specific case I would like the geometry data returned in the form of a WKT MULTIPOLYGON like this:
MULTIPOLYGON (((-71.6062550000000044 42.0133709999999994,
-71.5276060000000058 42.0149979999999985, -71.5169060000000059
42.0155979999999971, -71.4999080000000049 42.0171989999999980,
-71.3814009999999968 42.0187979999999968, -71.3815050000000042
42.0000110000000006, -71.3812010000000043 41.9811979999999991)))
How can I achieve that?

I managed to find a simple solution using GDAL's ogr2ogr:
ogr2ogr -f CSV multipolygon_states.csv cb_2015_us_state_20m.shp -nlt MULTIPOLYGON -lco GEOMETRY=AS_WKT
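If you prefer to stay in Python, here is a minimal sketch of the same export with the GDAL/OGR bindings, swapping ExportToKML for ExportToWkt (and using ForceToMultiPolygon to match the requested MULTIPOLYGON form); treat it as an illustration rather than a drop-in script:
from osgeo import ogr
import csv

ds = ogr.Open("cb_2015_us_state_20m.shp")
lyr = ds.GetLayer()

# Collect the attribute field names from the layer definition
dfn = lyr.GetLayerDefn()
fields = [dfn.GetFieldDefn(i).GetName() for i in range(dfn.GetFieldCount())]

with open("states_wkt.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fields + ["WKT"])
    writer.writeheader()
    for feat in lyr:
        row = {name: feat.GetField(name) for name in fields}
        # Force each geometry to MULTIPOLYGON and export it as WKT text
        geom = ogr.ForceToMultiPolygon(feat.GetGeometryRef())
        row["WKT"] = geom.ExportToWkt()
        writer.writerow(row)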

Related

Issue with connecting data in Databricks from a data lake and reading JSON into Folium

I'm working on something based on this blog post:
https://python-visualization.github.io/folium/quickstart.html#Getting-Started
specifically part 13, using Choropleth maps. The piece of code they use is the following:
import pandas as pd
import folium

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

folium.LayerControl().add_to(m)
m
If I use this, I get the requested map.
Now I try to do this with my own data. I work in Databricks, so I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be plotted on the map.
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1,multiLine=True)
source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = spark.read.format("CSV").option("inferSchema", "true").option("header", "true").option("delimiter",";").load(source_file2)
state_data = df_2.toPandas()
When I adjust the code as below:
m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)
m
So when I pass the geo_data parameter as a Spark DataFrame, I get the following error:
ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]
I think that when transforming the data from the "blob format" in the Azure data lake to the Spark DataFrame, something goes wrong with the format. I tested this in a Jupyter notebook on my desktop, reading the data straight from file into folium, and it all works.
If I load it directly from the source, like the example does with its web page, and adjust the 'geo_data' parameter of the folium function:
m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=source_file1,  # this points directly at the data lake
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)
m
I get an error saying the function expects a local file path; it is caused by passing a path prefixed with "dbfs:".
So I started wondering what the difference is between my JSON file and the one from the blog post. The only thing I can imagine is that the Azure data lake doesn't store my JSON as a JSON but as a block blob file, and for some reason I am not converting it properly so that folium can read it.
Azure blob storage (data lake)
So can someone with folium knowledge let me know:
A. Is it not possible to load the geo_data directly from a data lake?
B. In what format do I need to upload the data?
Any thoughts on this would be helpful! Thanks in advance!
Solved this issue: I just had to replace "dbfs:" with "/dbfs". I tried it a lot of times but used "/dbfs:" and got another error.
Can't believe I'm this stupid :-)
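For reference, a minimal sketch of that fix, reusing the map and data variables from above; only the path changes:
# folium wants a local file path, so point it at the /dbfs mount instead of the dbfs: URI
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
local_path = source_file1.replace("dbfs:", "/dbfs", 1)  # -> /dbfs/mnt/sandbox/...

folium.Choropleth(
    geo_data=local_path,  # local path instead of a Spark DataFrame
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)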

Unable to load CSV from GCS bucket to BigQuery table accurately

I am trying to load the airbnb_nyc dataset from a GCS bucket to a BigQuery table. Link to the dataset.
I am using the following code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line

class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1) \
              | 'parse' >> beam.Map(parse_file)
    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
                                                   write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                                   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                                                   )
But my output in the BigQuery table is wrong.
I am only getting values for the first two columns, and the remaining 14 columns show NULL. I am not able to figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to transfer a CSV from a GCS bucket to BigQuery through a Dataflow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. The code is attached below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
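For illustration, here is a minimal sketch of such a DoFn before the full code below; the class name is hypothetical, the GCS path is a placeholder, and the column list matches the one used further down:
import csv
import apache_beam as beam
from apache_beam.io.textio import ReadFromText

COLUMN_NAMES = ['id','name','host_id','host_name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','price','minimum_nights','number_of_reviews','last_review','reviews_per_month','calculated_host_listings_count','availability_365']

class ParseCsvRow(beam.DoFn):  # hypothetical helper, not a Beam built-in
    def process(self, element):
        # element is one raw text line; csv.reader handles quoted commas
        for values in csv.reader([element], delimiter=',', quotechar='"'):
            yield dict(zip(COLUMN_NAMES, values))

# usage:
# rows = (p
#         | 'read' >> ReadFromText('gs://<your-bucket>/<file>.csv', skip_header_lines=1)
#         | 'parse' >> beam.ParDo(ParseCsvRow()))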
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv

COLUMN_NAMES = ['id','name','host_id','host_name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','price','minimum_nights','number_of_reviews','last_review','reviews_per_month','calculated_host_listings_count','availability_365']

def files(path='gs:/some/path'):
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())

def transform_csv(element):
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)
        for row in itr:
            rows.append(row)
    return rows

def to_dict(element):
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    yield rows

with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        # | 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES, table='ur table', project='', dataset='')
    )
EDIT 1: ReadFromText supports \r\n as the newline character, but that fails to handle the case where the column data itself contains \r\n. Updated the code accordingly.
EDIT 2: GcsIO error fixed.
Note - I have used GcsIO for getting the list of files.
Details here
Please up-vote and mark as answer if this helps.
Let me suggest another approach for this use case. BigQuery offers a special feature for loading data from Google Cloud Storage (GCS) into BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; the process is available through the BigQuery API itself.
It works with client libraries in many languages, but you do not have to use any language at all, as the process can be done from the console or via the Cloud SDK using the bq command. Everything can be found in the mentioned tutorial.
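As a hedged sketch of what that load looks like with the BigQuery Python client (bucket, file and table names below are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # or supply an explicit schema instead
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/airbnb_nyc.csv",        # placeholder GCS URI
    "your-project.your_dataset.airbnb_nyc",   # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
From the Cloud SDK the same thing is roughly: bq load --autodetect --source_format=CSV --skip_leading_rows=1 your_dataset.airbnb_nyc gs://your-bucket/airbnb_nyc.csv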

Importing string arrays as integers from CSV to Neo4j using Cypher

I'm new to Neo4j and Cypher, and I'm trying to import some data from a CSV that includes an array of IDs. I have the query below working, but as Cypher defaults to strings, I've been unable to find the best way to convert the array of placeIDs to integers.
LOAD CSV WITH HEADERS FROM 'http://localhost:11001/project-ca45d786-e360-4e3b-b4b4-eb8fe62a7b55/People-Gridv2.csv' AS row
CREATE (:People {peopleID: toInteger(row.peopleID), nickname: row.nickname, firstName: row.firstName, lastName: row.lastName, relationship: row.relationship, firstMemory: row.firstMemory, lastMemory: row.lastMemory, placeID: split(row.placeID,";")})
I hoped that I'd be able to do something like the following, but it doesn't work:
placeID: toInteger(split(row.placeID,";"))
Can anyone point me in the right direction?
That would probably be something like
placeID : REDUCE(array=[] , s IN split(row.placeID,";") | array+[toInteger(s)] )
to get an array of integers.
Example:
with '123;456' as placeID
return REDUCE(array=[] , s IN split(placeID,";") | array+[toInteger(s)] )
will return
[123,456]
and even shorter :)
with '123;456' as placeID
return [s IN split(placeID,";") | toInteger(s)]

Feeding a DataFrame created from a CSV to MLlib KMeans: IndexError: list index out of range

Because I cannot use spark-csv, I have manually created a DataFrame from a CSV as follows:
raw_data = sc.textFile("data/ALS.csv").cache()
csv_data = raw_data.map(lambda l: l.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda line: line != header)
row_data = csv_data.map(lambda p: Row(
    location_history_id=p[0],
    user_id=p[1],
    latitude=p[2],
    longitude=p[3],
    address=p[4],
    created_at=p[5],
    valid_until=p[6],
    timezone_offset_secs=p[7],
    opening_times_id=p[8],
    timezone_id=p[9]))
location_df = sqlContext.createDataFrame(row_data)
location_df.registerTempTable("locations")
I need only two columns:
lati_longi_df=sqlContext.sql("""SELECT latitude, longitude FROM locations""")
rdd_lati_longi = lati_longi_df.map(lambda data: Vectors.dense([float(c) for c in data]))
rdd_lati_longi.take(2) returns:
[DenseVector([-6.2416, 106.7949]),
DenseVector([-6.2443, 106.7956])]
Now it seems that everything is ready for KMeans training:
clusters = KMeans.train(rdd_lati_longi, 10, maxIterations=30,
                        runs=10, initializationMode="random")
but I get the following error:
IndexError: list index out of range
First three lines of ALS.csv:
location_history_id,user_id,latitude,longitude,address,created_at,valid_until,timezone_offset_secs,opening_times_id,timezone_id
Why don't you let Spark parse the CSV instead? You can enable CSV support with something like this:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
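With that package available, here is a minimal sketch of the same pipeline on Spark 1.x style APIs, matching the sqlContext used in the question; the column names and file path come from the question:
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# Let spark-csv handle the header and quoted fields instead of a bare split(",")
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(header="true", inferSchema="true")
      .load("data/ALS.csv"))

# Keep only the two columns needed for clustering and build dense vectors
points = (df.select("latitude", "longitude")
            .rdd
            .map(lambda row: Vectors.dense([float(row.latitude), float(row.longitude)])))

clusters = KMeans.train(points, 10, maxIterations=30, initializationMode="random")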

Setting properties of a node from a CSV - Neo4j

This is an example of my csv file:
_id,official_name,common_name,country,started_by,
ABO.00,Association Football Club Bournemouth,Bournemouth,England,"{""day"":NumberInt(1),""month"":NumberInt(1),""year"":NumberInt(1899)}"
AOK.00,PAE Kerkyra,Kerkyra,Greece,"{""day"":NumberInt(30),""month"":NumberInt(11),""year"":NumberInt(1968)}"
I have to import this csv into Neo4j:
LOAD CSV WITH HEADERS FROM
'file:///Z:/path/to/file/team.csv' as line
create (p:Team {_id:line._id, official_name:line.official_name, common_name:line.common_name, country:line.country, started_by_day:line.started_by.day, started_by_month:line.started_by.month, started_by_year:line.started_by.year})
I get an error (Neo.ClientError.Statement.InvalidType) when setting started_by.day, started_by.month, started_by.year.
How can I correctly set the properties from started_by?
The format of your CSV should be the following:
_id,official_name,common_name,country,started_by_day,started_by_month,started_by_year
ABO.00,Association Football Club Bournemouth,Bournemouth,England,1,1,1899
Cypher:
LOAD CSV WITH HEADERS FROM 'file:///Z:/path/to/file/team.csv' as line
CREATE (p:Team {_id:line._id, official_name:line.official_name, common_name:line.common_name, country:line.country, started_by_day:line.started_by_day,started_by_month:line.started_by_month,started_by_year:line.started_by_year})
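If you cannot change how the file is produced, one option is to flatten started_by before loading; here is a hedged pandas sketch based on the sample rows above (team_flat.csv is a made-up output name, and the regex only strips the NumberInt(...) wrappers shown in the sample):
import json
import re
import pandas as pd

df = pd.read_csv("team.csv")

def parse_started_by(value):
    # Strip the Mongo-style NumberInt(...) wrappers so json.loads can handle it
    cleaned = re.sub(r"NumberInt\((\d+)\)", r"\1", value)
    return json.loads(cleaned)

parsed = df["started_by"].apply(parse_started_by)
df["started_by_day"] = parsed.apply(lambda d: d["day"])
df["started_by_month"] = parsed.apply(lambda d: d["month"])
df["started_by_year"] = parsed.apply(lambda d: d["year"])

df.drop(columns=["started_by"]).to_csv("team_flat.csv", index=False)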
It looks like the date part in your CSV file is in JSON format - don't you need to parse that first?
line.started_by
is this string:
"{""day"":NumberInt(30),""month"":NumberInt(11),""year"":NumberInt(1968)}"
There is no line.started_by.day.