I have a FeatureCollection of Polygons and MultiPolygons, and at the moment I have to write it to a temporary file first in order to load it with geopandas.GeoDataFrame.from_file(tmp_json_file). I'm looking for a way to do this without the temporary file. I've tried geopandas.GeoDataFrame.from_features(), which works well for a FeatureCollection of simple Polygons, but I can't make it work for a FeatureCollection mixing Polygons and MultiPolygons. I was thinking about doing something like the code below, but it's not working yet.
features_collection = []
for feature in json_data['features']:
    tmp_properties = {'id': feature['properties']['id']}
    if is_multipolygon(feature):
        tmp = Feature(geometry=MultiPolygon(feature['geometry']['coordinates']), properties=tmp_properties)
    else:
        tmp = Feature(geometry=Polygon(feature['geometry']['coordinates']), properties=tmp_properties)
    features_collection.append(tmp)
collection = FeatureCollection(features_collection)
return geopandas.GeoDataFrame.from_features(collection['features'])
The GeoJSON is taken from an API returning territories: some territories are modelled as a single Polygon, others as a set of polygons (formatted as a MultiPolygon).
The GeoJSON is structured as follows: http://pastebin.com/PPdMUGkY
I'm getting the following error from the function above:
Traceback (most recent call last):
File "overlap.py", line 210, in <module>
print bdv_json_to_geodf(contours_bdv)
File "overlap.py", line 148, in json_to_geodf
return geopandas.GeoDataFrame.from_features(collection['features'])
File "/Library/Python/2.7/site-packages/geopandas/geodataframe.py", line 179, in from_features
d = {'geometry': shape(f['geometry'])}
File "/Library/Frameworks/GEOS.framework/Versions/3/Python/2.7/site-packages/shapely/geometry/geo.py", line 40, in shape
return MultiPolygon(ob["coordinates"], context_type='geojson')
File "/Library/Frameworks/GEOS.framework/Versions/3/Python/2.7/site-packages/shapely/geometry/multipolygon.py", line 64, in __init__
self._geom, self._ndim = geos_multipolygon_from_py(polygons)
File "/Library/Frameworks/GEOS.framework/Versions/3/Python/2.7/site-packages/shapely/geometry/multipolygon.py", line 138, in geos_multipolygon_from_py
N = len(ob[0][0][0])
TypeError: object of type 'float' has no len()
For me this works if I just feed the json_data features to GeoDataFrame.from_features:
In [17]: gdf = geopandas.GeoDataFrame.from_features(json_data['features'])
In [18]: gdf.head()
Out[18]:
geometry id
0 (POLYGON ((-0.58570861816406 44.810461337462, ... 2
1 (POLYGON ((-0.5851936340332 44.816550206151, -... 1
2 POLYGON ((-0.58805465698242 44.824018340447, -... 5
3 POLYGON ((-0.59412002563477 44.821664359038, -... 9
4 (POLYGON ((-0.58502197265625 44.817159057661, ... 12
The resulting GeoDataFrame has a mixture of Polygons and MultiPolygons like in the input data:
In [19]: gdf.geom_type.head()
Out[19]:
0 MultiPolygon
1 MultiPolygon
2 Polygon
3 Polygon
4 MultiPolygon
dtype: object
I tried this with GeoPandas 0.2, shapely 1.5.15, pandas 0.18.1 on Windows.
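As a side note, if the goal is to avoid the temporary file entirely, a minimal sketch of going straight from the API response to a GeoDataFrame could look like this (the requests call and the URL are placeholders, not part of the original question):
import geopandas
import requests

# Hypothetical endpoint; substitute the real territory API URL.
response = requests.get("https://example.com/api/territories.geojson")
json_data = response.json()  # the parsed GeoJSON FeatureCollection as a dict

# from_features accepts the feature list directly, Polygons and MultiPolygons mixed.
gdf = geopandas.GeoDataFrame.from_features(json_data['features'])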
I have a pandas DataFrame "df". The DataFrame has a field "json_field" that contains JSON data; I think it's actually YAML format. I'm trying to parse the values into their own columns. I have example data below. When I run the code below, I get an error when it hits the 'name' field. The code and the error are below. It seems like it's having trouble because the value associated with the name field isn't quoted, or has spaces. Does anyone see what the issue might be, or can suggest how to fix it?
example:
print(df['json_field'][0])
output:
- start:"test"
  name: Price Unit
  field: priceunit
  type: number
  operator: between
  value:
  - '40'
  - '60'
code:
import pandas as pd
import yaml

pd.io.json.json_normalize(yaml.load(df['json_field'][0]), 'name', 'field', 'type', 'operator', 'value').head()
error:
An error was encountered:
mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
^
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 94, in safe_load
return load(stream, SafeLoader)
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 72, in load
return loader.get_single_data()
File "/usr/local/lib64/python3.6/site-packages/yaml/constructor.py", line 35, in get_single_data
node = self.get_single_node()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 110, in compose_sequence_node
while not self.check_event(SequenceEndEvent):
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 382, in parse_block_sequence_entry
if self.check_token(BlockEntryToken):
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
return self.fetch_value()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 580, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
desired output:
name        field      type    operator  value
Price Unit  priceunit  number  between   - '40' - '60'
Update:
I tried the suggestion below
import yaml
df['json_field']=df['json_field'].str.replace('"test"', "")
pd.io.json.json_normalize(yaml.safe_load(df['json_field'][0].lstrip()), 'name','field','type','operator','value').head()
and got the output below:
operator0 typefield
0 P priceunit
1 r priceunit
2 i priceunit
3 c priceunit
4 e priceunit
Seems to me that it's having an issue because your yaml is improperly formatted. That "test" should not be there at all if you want a dictionary named "start" with 5 mappings inside of it.
import yaml
a = """
- start:
name: Price Unit
field: priceunit
type: number
operator: between
value:
- '40'
- '60'
""".lstrip()
# Single dictionary named start, with 5 entries
yaml.safe_load(a)[0]
{'start':
{'name': 'Price Unit',
'field': 'priceunit',
'type': 'number',
'operator': 'between',
'value': ['40', '60']}
}
To do this with your data, try:
data = yaml.safe_load(df['json_field'][0].replace('"test"', ""))
out = pd.json_normalize(data[0]["start"])
print(out)
name field type operator value
0 Price Unit priceunit number between [40, 60]
It seems that your real problem is one line earlier.
Note that your YAML fragment starts with start:"test". One of the key principles of YAML is that a key is separated from its value by a colon and a space, whereas your sample does not contain this (mandatory) space.
Maybe you should manually edit your input, i.e. add the missing space.
Another solution is to write a "specialized pre-parser" which adds such missing spaces, and feed its result into the YAML parser. This can be an option if the number of such cases is big; a sketch follows below.
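For instance, a minimal pre-parser sketch (assuming keys are simple word characters; the function name is made up for illustration):
import re

def add_missing_spaces(yaml_text):
    # Insert the mandatory space after a key's colon when it is missing,
    # e.g. '- start:"test"' becomes '- start: "test"'.
    return re.sub(r'^(\s*-?\s*\w+):(?=\S)', r'\1: ', yaml_text, flags=re.MULTILINE)

fixed_text = add_missing_spaces(df['json_field'][0])
yaml.safe_load(fixed_text) should then get past the scanner error shown above.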
I have mounted my Google Drive and have a CSV file in a folder. I am following the tutorial. However, when I call tf.keras.utils.get_file(), I get a ValueError as follows.
data_folder = r"/content/drive/My Drive/NLP/project2/data"
import os
print(os.listdir(data_folder))
It returns:
['crowdsourced_labelled_dataset.csv',
'P2_Testing_Dataset.csv',
'P2_Training_Dataset_old.csv',
'P2_Training_Dataset.csv']
TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
But this returns:
Downloading data from /content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-5bd642083471> in <module>()
2 TRAIN_DATA_URL = os.path.join(data_folder, 'P2_Training_Dataset.csv')
3 TEST_DATA_URL = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
----> 4 train_file_path = tf.compat.v1.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
5 test_file_path = tf.compat.v1.keras.utils.get_file("eval.csv", TEST_DATA_URL)
/usr/lib/python3.6/urllib/request.py in _parse(self)
382 self.type, rest = splittype(self._full_url)
383 if self.type is None:
--> 384 raise ValueError("unknown url type: %r" % self.full_url)
385 self.host, self.selector = splithost(rest)
386 if self.host:
ValueError: unknown url type: '/content/drive/My Drive/NLP/project2/data/P2_Training_Dataset.csv'
What am I doing wrong please?
As per the docs, this is the signature of tf.keras.utils.get_file:
tf.keras.utils.get_file(
fname,
origin,
untar=False,
md5_hash=None,
file_hash=None,
cache_subdir='datasets',
hash_algorithm='auto',
extract=False,
archive_format='auto',
cache_dir=None
)
By default the file at the url origin is downloaded to the cache_dir ~/.keras, placed in the cache_subdir datasets, and given the filename fname. The final location of a file example.txt would therefore be ~/.keras/datasets/example.txt.
Returns:
Path to the downloaded file
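For illustration only, this is the kind of call get_file is designed for. The URL below is the Titanic CSV used in the TensorFlow CSV tutorial; treat it as an assumption and substitute your own remote file:
import tensorflow as tf

# Downloads to ~/.keras/datasets/titanic_train.csv and returns that local path.
titanic_path = tf.keras.utils.get_file(
    "titanic_train.csv",
    "https://storage.googleapis.com/tf-datasets/titanic/train.csv")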
Since you already have the data in your drive, there's no need to download it again (and IIUC, the function is expecting an accessible URL). Also, there's no need to obtain the file name from a function call because you already know it.
Assuming the drive is mounted, you can replace your file paths as below:
train_file_path = os.path.join(data_folder, 'P2_Training_Dataset.csv')
test_file_path = os.path.join(data_folder, 'P2_Testing_Dataset.csv')
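With those local paths, a minimal sketch of actually loading the data (assuming a pandas-based workflow; adjust to whatever loader the tutorial uses next):
import pandas as pd

train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)
print(train_df.head())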
I am trying to parse a .json file into a .kml file to be used by a plotting program. I am going to give a set of data samples to simplify the issue:
I have a LocationHistory.json file that has the following structure:
{
"data" : {
"items" : [ {
"kind" : "latitude#location",
"timestampMs" : "1374870896803",
"latitude" : 34.9482949,
"longitude" : -85.3245474,
"accuracy" : 2149
}, {
"kind" : "latitude#location",
"timestampMs" : "1374870711762",
"latitude" : 34.9857898,
"longitude" : -85.3526902,
"accuracy" : 2016"
}]
}
}
I define a function that parses the json data and then I want to feed it into a "placemark" string to write (output) into a "location.kml" file.
import json

def parse_jason_data_to_kml_file():
    kml_file = open('location.kml', "r+")

    # Here I parse the info inside the LocationHistory.json file
    json_file = open('LocationHistory.json')
    json_string = json_file.read()
    json_data = json.loads(json_string)
    locations = json_data["data"]["items"]

    # Here I get the placemark string structure
    placemark = ["<Placemark>",
                 "<TimeStamp><when>the “timestampMS” value from the JSON data item</when></TimeStamp>",
                 "<ExtendedData>",
                 "<Data name=”accuracy”>",
                 "<value> the “accuracy” value from the JSON data item </value>",
                 "</Data>",
                 "</ExtendedData><Point><coordinates>”longitude,latitude”</coordinates></Point>",
                 "</Placemark>"]
    placemark = "\n".join(placemark)

    # Now I try to find certain substrings in the "placemark" string that I would like to replace:
    find_timestamp = placemark.find("the “timestampMS” value from the JSON data item")
    find_accuracy = placemark.find("the “accuracy” value from the JSON data item")
    find_longitude = placemark.find("”longitude")
    find_latitude = placemark.find("latitude”")

    # Next, I want to loop through the LocationHistory.json file (the data list above)
    # and replace the strings with the parsed LocationHistory.json data saved as:
    # location["timestampMs"], location["accuracy"], location["longitude"], location["latitude"]:
    for location in locations:
        if find_timestamp != -1:
            placemark.replace('the “timestampMS” value from the JSON data item', location["timestampMs"])
        if find_accuracy != -1:
            placemark.replace('the “accuracy” value from the JSON data item', location["accuracy"])
        if find_longitude != -1:
            placemark.replace('”longitude', location["longitude"])
        if find_latitude != -1:
            placemark.replace('latitude”', location["latitude"])
        kml_file.write(placemark)
        kml_file.write("\n")
    kml_file.close()
The aim of this code is to write the placemark string, filled with the .json data, into the location.kml file: going from the placemark template, which looks like this:
<Placemark>
<TimeStamp><when>the “timestampMS” value from the JSON data item</when></TimeStamp>
<ExtendedData>
<Data name=”accuracy”>
<value> the “accuracy” value from the JSON data item </value>
</Data>
</ExtendedData><Point><coordinates>”longitude,latitude”</coordinates></Point>
</Placemark>
To an output, which should look like this:
<Placemark>
<TimeStamp><when>2013-07-26T22:34:56Z</when></TimeStamp>
<ExtendedData>
<Data name=”accuracy”>
<value>2149</value>
</Data>
</ExtendedData><Point><coordinates>-85.3245474,34.9482949</coordinates></Point>
</Placemark>
<Placemark>
<TimeStamp><when>2013-07-26T22:31:51Z</when></TimeStamp>
<ExtendedData>
<Data name=”accuracy”>
<value>2016</value>
</Data>
</ExtendedData><Point><coordinates>-85.3526902,34.9857898</coordinates></Point>
</Placemark>
If I try to run this code, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Elysium/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "/Users/Elysium/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/Users/Elysium/Dropbox/LeeV/Task 5 - Location Plotting/ParseToKML.py", line 77, in <module>
parse_jason_data_to_kml_file()
File "/Users/Elysium/Dropbox/LeeV/Task 5 - Location Plotting/ParseToKML.py", line 51, in parse_jason_data_to_kml_file
placemark.replace('the “timestampMS” value from the JSON data item', location["timestampMs"])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 33: ordinal not in range(128)
Is this a problem with my code or a software problem?
From various sources I found that it can be related to pip or ASCII or something, but no answer that helps me... I'm sure I have pip installed...
Any help would be appreciated. Any suggestions to improve my code is also welcome :)
Thank You
You can try this as follows:
placemark = ["<Placemark>",
             "<TimeStamp><when>%(timestampMs)r</when></TimeStamp>",
             "<ExtendedData>",
             "<Data name=\"accuracy\">",
             "<value>%(accuracy)r</value>",
             "</Data>",
             "</ExtendedData><Point><coordinates>%(longitude)r, %(latitude)r</coordinates></Point>",
             "</Placemark>"]
placemark = "\n".join(placemark)

for location in locations:
    temp = placemark % location
    kml_file.write(temp)
    kml_file.write("\n")
Here, in %(any_dict_key)r, the r converts any Python object (here, the value of any_dict_key) to its repr and inserts it into the string. So, for your timestamps you will have to convert them into datetime objects first.
You can read this part of the documentation to check out the details - string-formatting-operations
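For example, a minimal sketch of that timestamp conversion (assuming timestampMs is a string of milliseconds since the Unix epoch and the KML should carry UTC times; the helper name is made up for illustration):
from datetime import datetime

def timestamp_ms_to_kml_when(timestamp_ms):
    # "1374870896803" (milliseconds) -> "2013-07-26T20:34:56Z" (UTC)
    seconds = int(timestamp_ms) / 1000.0
    return datetime.utcfromtimestamp(seconds).strftime("%Y-%m-%dT%H:%M:%SZ")

# e.g. location["when"] = timestamp_ms_to_kml_when(location["timestampMs"])
# before applying the placemark formatting above.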
I'm currently doing Cox Proportional Hazards Modeling using Rpy2 - I imagine my question will cover other functions and the results from calling them as well though.
After I run the function, I have a variable which contains the results from the function, in the form of a vector. I have tried explicitly converting this to a DataFrame (resultsDataFrame = DataFrame(resultVector)). There are no errors returned when doing this. However, when I do resultsDataFrame.to_csvfile(filename) I get the following error:
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
modelFrame.to_csvfile('/Users/fortylashes/Documents/Matthews_Research/Cox_PH/ResultOutput_Exp1.csv')
File "/Library/Python/2.7/site-packages/rpy2/robjects/vectors.py", line 1031, in to_csvfile
'col.names': col_names, 'qmethod': qmethod, 'append': append})
RRuntimeError: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""coxph"" to a data.frame
Furthermore, when I simply do:
for result in resultVector:
print (result)
I get an extremely long list of results- including information on each entry in the dataset used in the model, for each variable (so 9,000 records x 9 variables = 81,000 unneeded results). The results I really need are at the bottom of this vector and look like this:
                   coef  exp(coef)  se(coef)       z        p
age_age6574   -0.057775      0.944   0.05469  -1.056  2.9e-01
age_age75plus -0.020795      0.979   0.04891  -0.425  6.7e-01
sex_female    -0.005304      0.995   0.03961  -0.134  8.9e-01
stage_late    -0.261609      0.770   0.04527  -5.779  7.5e-09
access        -0.000494      1.000   0.00069  -0.715  4.7e-01

Likelihood ratio test=36.6  on 5 df, p=7.31e-07  n= 9752, number of events= 2601
*NOTE: There were several more variables for which data was reported in the initial results (the 9,000 x 9 that I was talking about) but weren't actually used in the model.
I was wondering if there was a way to explicitly get this data, put it in one long ordered row, and then output it to a csv file?
::::UPDATE::::
When I call theModel.names I get a list of the various measures which can be called by numerical index:
[1] "coefficients" "var" "loglik"
[4] "score" "iter" "linear.predictors"
[7] "residuals" "means" "concordance"
[10] "method" "n" "nevent"
[13] "terms" "assign" "wald.test"
[16] "y" "formula" "call"
From this I can get the coefficients, which can then be exponentiated. I have not found, however, the p-value, the z score or the likelihood test ratio, which I will need.
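One sketch of getting at those numbers, assuming theModel is the rpy2 object returned by coxph(): in R, summary() on a coxph fit exposes the coefficient table as an element named "coefficients", so you can extract that and let R write it out, rather than coercing the coxph object itself. Treat the element names as assumptions and check them on your rpy2/survival versions.
from rpy2 import robjects

r_summary = robjects.r['summary']
r_write_csv = robjects.r['write.csv']

model_summary = r_summary(theModel)

# summary(coxph_fit)$coefficients is a matrix whose columns are
# coef, exp(coef), se(coef), z and p, one row per covariate.
coef_table = model_summary.rx2('coefficients')

# Let R serialise the matrix directly; avoids the "cannot coerce class coxph" error.
r_write_csv(coef_table, file='cox_coefficients.csv')
The likelihood ratio test should be reachable the same way (summary objects in R also carry a "logtest" element), but verify the element names on your setup first.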
[layer1 layer2 layer3] = trainNeuralNetwork4L(tlab, tvec, clab, cvec, 150, 100, 10);
save layerOne100SECOND.m layer1
save layerTwo100SECOND.m layer2
save layerThree100SECOND.m layer3
[efficiency errorsMatrix] = testClassifier4L(layer1, layer2, layer3, 150, 100, 10, tstv, tstl)
1 min 55 s
efficiency = 0.96150
[....]
load "layerTwo100SECOND.m"
layerTwo100SECOND
parse error near line 6 of file /home/yob/studies/rob/lab5/src/layerTwo100SECOND.m
syntax error
>>> 0.3555228566483329 1.434063629132475 0.3947326168010625 -0.2081288665103496 2.116026824600183 -3.72004826748463 -5.971912014167303 -1.831568668193203 -0.5698533706125537 -0.302019433067382 2.105773052363495 -1.386054572212726 1.379784981138861 2.086342965563345 1.686560884521974 1.501297857975125 5.491292848790862 -3.068496819708705 1.709375867569474 -0.0007631747244577478 -3.408706829842817 3.633531634060732 -4.848485685095641 -7.071386223304461 1.005495674207059 1.729698733795992 1.332654214742491 -2.757799109392227 0.5703177663227227 -3.962183321109198 -1.862612684812663 0.002426506616464667 -1.0133423788506 0.9856584491014603 3.261391305445486 -0.238792116035831 7.213403195852512 -0.4550088635822298 2.014786513359268 5.439781417403554 -1.780067076293333 -1.141234270367437 -3.716379329290984 1.329603499392993 0.6289460687541696 1.38704906311103 -1.799460630680088 -1.231927489757737 -1.199171465361949 6.464325931161664 0.7819466841352927 1.518220081499355 -0.3605511334486079 6.646043807207327 -1.885519415534916 1.164993883529136 -0.6867734922571105 -3.487015662787853 0.6052594571214193 0.9747958246654298 -6.681621035920442 6.539828816493673 0.4174688104699146 1.804835542540412 3.099980655618463 0.1957057586983393 -0.5199262355448695 -0.05556003295310553 0.5458621853042805 4.053727148988344 5.08596174444348 -4.4719975219626 4.718638484049811 4.579389030123606 -0.3683947372431971 0.9758069969974679 0.4742051227060113 6.761326112144753 0.9816521216523206 1.790072342537753 0.4513686207416066 -2.880053219384659 -3.256083938937911 3.099498881741825 -0.4967119404782309 -0.6140345297878478 -0.9933076418596357 7.522343253108136 4.93675021253316 -2.693878828387868 -1.358775970578509 -0.7940899801569826 4.867002040829598 4.418439759567837 -2.014761152547027 0.2349575211823655 -4.494720934106189 -2.674441246174409 -0.8495958842163256 0.1921793737146104
^
Why is it impossible to use the previously saved data? Is there any way to load it again?
OK, I dealt with it. I had to call: load -ascii filename.