I've been using ogr2ogr to do most of what I need with shapefiles (including dissolving them). However, I find that for big ones, it takes a REALLY long time.
Here's an example of what I'm doing:
ogr2ogr new.shp old.shp -dialect sqlite -sql "SELECT ST_Union(geometry) FROM old"
In certain instances, one might want to dissolve common neighboring shapes (which is what I think is going on here in the above command). However, in my case I simply want to flatten the entire file and every shape in it regardless of the values (I've already isolated the shapes I need).
Is there a faster way to do this when you don't need to care about the values and just want a shape that outlines the array of shapes in the file?
If you have isolated the shapes, and they don't have any shared boundaries, they can be easily collected into a single MULTIPOLYGON using ST_Collect. This should be really fast and simple to do:
ogr2ogr gcol.shp old.shp -dialect sqlite -sql "SELECT ST_Collect(geometry) FROM old"
If the geometries overlap and the boundaries need to be "dissolved", then ST_Union must be used. Faster spatial unions are done with a cascaded union technique, described here for PostGIS. It is supported by OGR, but it doesn't seem to be done elegantly.
Here is a two step SQL query. First make a MULTIPOLYGON of everything with ST_Collect (this is fast), then do a self-union which should trigger a UnionCascaded() call.
ogr2ogr new.shp old.shp -dialect sqlite -sql "SELECT ST_Union(gcol, gcol) FROM (SELECT ST_Collect(geometry) AS gcol FROM old) AS f"
Or to better view the actual SQL statement:
SELECT ST_Union(gcol, gcol)
FROM (
SELECT ST_Collect(geometry) AS gcol
FROM old
) AS f
I've had better success (i.e. faster) by converting it to raster then back to vector. For example:
# convert the vector file old.shp to a raster file new.tif using a pixel size of XRES/YRES
gdal_rasterize -tr XRES YRES -burn 255 -ot Byte -co COMPRESS=DEFLATE old.shp new.tif
# convert the raster file new.tif to a vector file new.shp, using the same raster as a -mask speeds up the processing
gdal_polygonize.py -f 'ESRI Shapefile' -mask new.tif new.tif new.shp
# removes the DN attribute created by gdal_polygonize.py
ogrinfo new.shp -sql "ALTER TABLE new DROP COLUMN DN"
Related
I am trying to train an xgboost model using its external memory version, which takes a libsvm file as training set. Right now, all the data is stored in a bunch of csv files which combine together are way larger than the memory I have, say 70G.(you can easily read any one of them). I just wonder how to create one large libsvm file for xgboost. Or if there is any other work round for this. Thank you.
If you csv files do not have headers you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers you''ll want to do something trickier, like take the n-1 lines with tail.
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.
I have 70+ raster images in TIFF format that I am trying to merge.
Originals can be found here:
http://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/
After pre-processing (pct2rgb, gdalwarp individual charts, gdal_translate to cut the collars) I try to run them through gdalwarp to mosaic them using a command like this:
gdalwarp --config GDAL_CACHEMAX 3000 -overwrite -wm 3000 -r bilinear -srcnodata 0 -dstnodata 0 -wo "NUM_THREADS=3" /data/aeronav/sec/c/Albuquerque_c.tif .....70 other file names ...master.tif
After 12 hours of processing:
Creating output file that is 321521P x 125647L.
Processing input file /data/aeronav/sec/c/Albuquerque_c.tif.
0...10...20...30...40...
This means gdalwarp is never going to finish.
In contrast. A gdal_merge command like this:
gdal_merge.py -n 0 -a_nodata 0 -o /data/aeronav/sec/master.tif /data/aeronav/sec/c/Albuquerque_c.tif ......70 plus files.....
Finishes in couple of hours.
Problem with gdal_merge is inferior quality output because of "average" sampling. I would like to use "bilinear" at the minimum - and "cubic" sampling if possible and for that gdalwarp is required.
Why is there such a big difference in performance of the two ? Why doesn't gdalwarp want to finish ? Is there any other command line option to speed things up in gadalwarp or is there a way to add sampling option to gdal_merge ?
It seems gdalwarp is not the ideal command to merge these GeoTiffs (since I am not interested in warping again). Instead I used
gdalbuildvrt /data/aeronav/sec/master.virt .... 70+ files in order
to build a virtual mosaic. And then I used gdal_translate to convert the virt file into a GeoTiff:
gdal_translate -of GTiff /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif
That's it—this took less than an hour (even faster than gdal_merge and preserves quality of original files).
I have files being generated by another program/user that have names such as "jh-1.txt, jh-2.txt, ..., jh-100.txt, ..., jh-1024.txt". I'm extracting a column from these files, manipulating the data, and outputting to a new matrix. The only problem is that Octave is using ASCII ordering and not natural ordering when reading in the files. Thus, the output matrix is not ordered in a natural way. My question is, can Octave sort file names in a natural order? I'm getting file names in the standard method:
fileDirectory = '/path/to/directory';
filePattern = fullfile(fileDirectory, '*.txt'); % Selects only the txt files.
dataFiles = dir(filePattern); % Gets the info from the txt files in the directory.
baseFileName = {dataFiles.name}'; % Gets all the txt file names.
I can't rename the files because this is a script for another user. They are on a Windows machine and already have Octave installed with Cygwin and I don't want to make them use the command line more than they have to because they are unfamiliar with it. Alternatively, it would be nice to have the output with the file names in a column but, I haven't figured that one out either (bit of a noob with Octave myself). That way the user could use Excel (which they are familiar with) to sort the columns.
I don't think there's a built in natural sort in Octave. However, there is a natural sort submission on Mathwork's File Exchange. I've not used it, but the comments imply it works in Octave too.
I am trying to run the following code in a command window. The code executes, but it gives me no values in the .SHP files. The table has GeographyCollections and Polygons stored in a Field of type Geography. I have tried many variations for the Geography type in the sql statement - Binary, Text etc. but no luck. The output .DBF file has data, so the connection to the database works, but the shape .Shp file and .shx file has no data and is of size 17K and 11 K, respectively.
Any suggestions?
ogr2ogr -f "ESRI Shapefile" -overwrite c:\temp -nln Zip_States -sql "SELECT [ID2],[STATEFP10],[ZCTA5CE10],GEOMETRY::STGeomFromWKB([Geography].STAsBinary(),4326).STAsText() AS [Geography] FROM [GeoSpatial].[dbo].[us_State_Illinois_2010]" ODBC:dbo/GeoSpatial#PPDULCL708504
ESRI Shapefiles can contain only a single type of geometry - Point, LineString, Polygon etc.
Your description suggests that your query returns multiple types of geometry, so restrict that first (using STGeometryType() == 'POLYGON', for example).
Secondly, you're currently returning the spatial field as a text string using STAsText(), but you're not telling OGR that it's a spatial field so it's probably just treating the WKT as a regular text column and adding it as an attribute to the dbf file.
To tell OGR which column contains your spatial information you can add the "Tables" parameter to the connection string. However, there's no reason to do all the casting from WKT/WKB if you're using SQL Server 2008 - OGR2OGR will load SQL Server's native binary format fine.
Are you actually using SQL Server 2008, or Denali? Because the serialisation format changed, and OGR2OGR can't read the new format. So, in that case it's safer (but slower) to convert to WKB first.
The following works for me to dump a table of polygons from SQL Server to Shapefile:
ogr2ogr -f "ESRI Shapefile" -overwrite c:\temp -nln Zip_States -sql "SELECT ID, geom26986.STAsBinary() FROM [Spatial].[dbo].[OUTLINE25K_POLY]" "MSSQL:server=.\DENALICTP3;database=Spatial;trusted_connection=yes;Tables=dbo.OUTLINE25K_POLY(geom26986)"
Try the following command
ogr2ogr shapeFileName.shp -overwrite -sql "select top 10 * from schema.table" "MSSQL:Server=serverIP;Database=dbname;Uid=userid;trusted_connection=no;Pwd=password" -s_srs EPSG:4326 -t_srs EPSG:4326
I want to access every value (~10000) in .txt files (~1000) stored in directories (~20) in the most efficient manner possible. When the data is grabbed I would like to place them in a HTML string. I do this in order to display a HTML page with tables for each file. Pseudo:
fh=open('MyHtmlFile.html','w')
fh.write('''<head>Lots of tables</head><body>''')
for eachDirectory in rootFolder:
for eachFile in eachDirectory:
concat=''
for eachData in eachFile:
concat=concat+<tr><td>eachData</tr></td>
table='''
<table>%s</table>
'''%(concat)
fh.write(table)
fh.write('''</body>''')
fh.close()
There must be a better way (I imagine it would take forever)! I've checked out set() and read a bit about hashtables but rather ask the experts before the hole is dug.
Thank you for your time!
/Karl
import os, os.path
# If you're on Python 2.5 or newer, use 'with'
# needs 'from __future__ import with_statement' on 2.5
fh=open('MyHtmlFile.html','w')
fh.write('<html>\r\n<head><title>Lots of tables</title></head>\r\n<body>\r\n')
# this will recursively descend the tree
for dirpath, dirname, filenames in os.walk(rootFolder):
for filename in filenames:
# again, use 'with' on Python 2.5 or newer
infile = open(os.path.join(dirpath, filename))
# this will format the lines and join them, then format them into the table
# If you're on Python 2.6 or newer you could use 'str.format' instead
fh.write('<table>\r\n%s\r\n</table>' %
'\r\n'.join('<tr><td>%s</tr></td>' % line for line in infile))
infile.close()
fh.write('\r\n</body></html>')
fh.close()
Why do you "imagine it would take forever"? You are reading the file and then printing it out - that's pretty much the only thing you have as a requirement - and that's all you're doing.
You could tweak the script in a couple of ways (read blocks not lines, adjust buffers, print out instead of concatenating, etc.), but if you don't know how much time do you take now, how do you know what is better/worse?
Profile first, then find if the script is too slow, then find a place where it's slow, and only then optimise (or ask about optimisation).