OGR2OGR PostgreSQL / PostGIS issue after enabling postgis extension on import

I'm running PostgreSQL 14.5 (Homebrew) with PostGIS extension version 3.3.
I'm using GDAL's ogr2ogr to import GeoJSON files:
ogr2ogr -f "PostgreSQL" PG:"dbname=test4 user=myuser" "myfile.geojson"
If I import all files into a new database and enable the postgis extension after all my imports, my queries work as desired.
SELECT district,
ST_Contains('POINT (-##.## ##.## )', wkb_geometry) FROM table
Returns: booleans as expected
If I import another GeoJSON file after the extension is enabled, I get an error on the query for the newly imported tables.
ERROR: contains: Operation on mixed SRID geometries (Point, 0) != (Polygon, 4326)
SQL state: XX000
It seems the extension changes the column type from bytea to geometry and doesn't allow me to alter it back or disable the extension. I have to delete the database, import all tables again, and then enable the extension. What am I doing wrong? Is there a problem in my process or query? Why does it work if I import the data and then enable the extension, while all new tables fail with the query?

You should be using something like:
SELECT district,
ST_Contains(ST_GeomFromText('POINT (-##.## ##.## )',4326), wkb_geometry)
FROM table
This makes PostGIS aware that your WKT is in lon/lat coordinates, so it can safely compare it to your geometry column (which will be in the same SRID, because that's what GeoJSON contains by specification). Other data sources may be in different projections, in which case you'll probably need to read about ST_Transform too.
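For the reprojection case, here is a minimal sketch run through psycopg2; the connection string comes from the question above, but the table name and the EPSG:3857 input coordinates are assumptions made up for illustration:
import psycopg2
conn = psycopg2.connect("dbname=test4 user=myuser")
with conn.cursor() as cur:
    cur.execute("""
        SELECT district,
               ST_Contains(
                   ST_Transform(                               -- reproject the point into 4326
                       ST_SetSRID(ST_MakePoint(%s, %s), 3857),  -- hypothetical web-mercator input
                       4326),
                   wkb_geometry)
        FROM my_table;                                          -- hypothetical table name
    """, (-8237642.0, 4970241.0))
    rows = cur.fetchall()  # (district, boolean) pairs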

Import spatial data from PostGIS into MySQL

I have a PostgreSQL DB that contains KML data in one column of a table. I query it with PostGIS functions, with a query like this:
SELECT ST_geomFromKML(geometry),
ST_Intersects(ST_SetSRID(ST_Buffer(ST_MakePoint(11.255492,43.779251),0.002), 4326), ST_GeomFromKML(geometry)) as intersect,
ST_SetSRID(ST_Buffer(ST_MakePoint(11.255492,43.779251),0.002), 4326)
FROM mydb
WHERE
ST_Intersects(ST_SetSRID(ST_Buffer( ST_MakePoint(11.255492,43.779251),0.002), 4326), ST_GeomFromKML(geometry))
LIMIT 1
In the geometry column the data is stored as KML, like this:
<Polygon><outerBoundaryIs><LinearRing><coordinates>8.198905,40.667052 8.201007,40.667052 8.201007,40.665738 8.20127,40.665738 8.20127,40.664688 8.201532,40.664688 8.201532,40.663111 8.20127,40.663111 8.199956,40.663111 8.199956,40.663374 8.199693,40.663374 8.199693,40.664425 8.197591,40.664425 8.197591,40.665476 8.198905,40.665476 8.199168,40.665476 8.199168,40.666789 8.198905,40.666789 8.198905,40.667052</coordinates></LinearRing></outerBoundaryIs></Polygon>
So I use ST_geomFromKML to convert the data to geometry, then I search for the intersection with a circle I create around the point.
I want to migrate the database to MySQL and use its spatial functions, but I can't find a way to use/convert the KML data inside MySQL the way I do with PostGIS.
Is there a way to do it?
I guess it would be worth trying to export your geometries in a format that can be read by MySQL, e.g. WKT (Well-Known Text). From your question I assume you're indeed storing the geometries as KML in either a text or an xml column, so I believe this will help you:
Test Data
CREATE TABLE t (kml TEXT);
INSERT INTO t VALUES ('<Point><coordinates>8.54,47.36</coordinates></Point>');
Export as CSV to the standard output (client)
COPY (SELECT ST_AsText(ST_geomFromKML(kml)) AS geom FROM t) TO STDOUT CSV;
query returned copy data:
POINT(8.54 47.36)
Export as CSV into a file on the server; keep in mind that the system user postgres needs write permission on the given directory.
COPY (SELECT ST_AsText(ST_geomFromKML(kml)) AS geom FROM t) TO '/path/to/file.csv';
Yes, I do that too (geom from Postgres -> WKT -> MySQL -> geom from MySQL).
To complete the picture, I use:
pg_dump --column-inserts --data-only
to get plain INSERT statements, as in MySQL. Then it just remains to create the table in MySQL, delete what comes before and after the INSERTs from Postgres, remove the "schema." prefix, and load it into MySQL.
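For reference, a minimal sketch of that geom -> WKT -> MySQL round trip done in one pass from Python, assuming psycopg2 and pymysql; the connection details and the target table t_mysql are placeholders:
import psycopg2
import pymysql
pg = psycopg2.connect("dbname=yourdb user=postgres")
my = pymysql.connect(host="localhost", user="root", password="secret", database="gis")
with pg.cursor() as src, my.cursor() as dst:
    # pull the geometries out of PostGIS as WKT
    src.execute("SELECT ST_AsText(ST_geomFromKML(kml)) FROM t")
    for (wkt,) in src:
        # let MySQL rebuild the geometry from the WKT
        # (pass an SRID as a second argument if your target column requires one)
        dst.execute("INSERT INTO t_mysql (geom) VALUES (ST_GeomFromText(%s))", (wkt,))
my.commit()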

Why does my PyAthena query generate a CSV and a CSV metadata file in the S3 location while reading a Glue table?

I started pulling a Glue table using PyAthena last week. One annoying thing I noticed is that, with the code shown below, sometimes it works and returns a pandas DataFrame, but other times it creates a CSV and a CSV metadata file in the folder where the physical data (parquet) is stored in S3 and registered in Glue.
I know that if you use the pandas cursor it may end up producing these two files, but I wonder if I can access the data without them, since every time these two files are generated in S3 my read process fails.
Thank you!
import os
import pandas as pd
from pyathena import connect
access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')
connect1 = connect(s3_staging_dir='s3://xxxxxxxxxxxxx')
df = pd.read_sql("select * from abc.table_name", connect1)
df.head()
Go to Athena.
Click Settings -> workgroup name -> Edit workgroup.
Update "Query result location".
Click "Override client-side settings".
Note: if you have not set up any other workgroups for your Athena environment, you should only find one workgroup, named "Primary".
This should resolve your problem. For more information you can read:
https://docs.aws.amazon.com/athena/latest/ug/querying.html
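If you prefer to handle it on the client side as well, the same idea applies: make sure s3_staging_dir points at a dedicated results prefix rather than at the prefix where the table's parquet files live. A minimal sketch (bucket, prefix and region names are placeholders):
import pandas as pd
from pyathena import connect
# point the staging dir at a dedicated query-results location,
# never at the S3 prefix that holds the Glue table's parquet data
conn = connect(
    s3_staging_dir="s3://my-athena-query-results/pyathena/",  # placeholder bucket/prefix
    region_name="us-east-1",                                  # placeholder region
)
df = pd.read_sql("select * from abc.table_name", conn)
df.head()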

How to extract tables with data from .sql dumps using Spark?

I have around four self-contained *.sql dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried installing and building a local database using InnoDB and importing the dump, but that seems too slow (I spent around 10 hours on that).
I read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._ // needed for .toDF on an RDD[String]
val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")
// Convert this to an indexed dataframe so you can parse multi-line create / data statements.
// This will also show you the structure of the sql dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")
// Identify all tables and data in the sql dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values, choose a different substring: " + tableStructures.count() + " " + tableStructureEnds.count())
val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. For that I need to understand whether it can be done for even one table. Is there any .sql parser written for Scala Spark?
Is there a faster way of going about it? Can I read it directly into Hive from the self-contained .sql file?
UPDATE 1: I am writing the parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to dataset-based code to use the SQL parser as suggested.
Is there any .sql parser written for Scala Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr, or simply the expr standard function. You may also use the SQL parser directly.
shameless plug You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
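As a quick illustration of those entry points (sketched here in PySpark; the expressions are placeholders):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName("parserDemo").getOrCreate()
# SparkSession.sql parses and runs a full SQL statement
spark.sql("SELECT 1 AS id").show()
# selectExpr and expr parse single SQL expressions into Columns
spark.range(3).selectExpr("id * 2 AS doubled").show()
spark.range(3).select(expr("id * 2 AS doubled")).show()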
You need to parse it yourself. It requires the following steps (a code sketch follows the list):
Create a class for each table.
Load the files using textFile.
Filter out all statements other than INSERT statements.
Then split the RDD, using filter, into multiple RDDs based on the table name present in the INSERT statement.
For each RDD, use map to parse the values present in the INSERT statement and create an object.
Now convert the RDDs to datasets.
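A rough PySpark sketch of those steps, assuming single-line INSERT INTO ... VALUES (...),(...) statements; the users table and its two columns are hypothetical, and a real dump will need a more careful value parser (quoted commas, escapes, multi-line statements):
import re
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dumpParser").getOrCreate()
lines = spark.sparkContext.textFile("C:/Users/some_db.sql")
# keep only the INSERT statements
inserts = lines.filter(lambda l: l.startswith("INSERT INTO"))
# split into one RDD per table, based on the table name in the statement
def table_name(stmt):
    m = re.match(r"INSERT INTO `?(\w+)`?", stmt)
    return m.group(1) if m else None
users_inserts = inserts.filter(lambda l: table_name(l) == "users")
# parse the VALUES tuples of one statement into rows
def parse_values(stmt):
    body = stmt[stmt.index("VALUES") + len("VALUES"):].rstrip(";")
    for tup in re.findall(r"\(([^)]*)\)", body):
        yield tuple(v.strip().strip("'") for v in tup.split(","))
users_rows = users_inserts.flatMap(parse_values)
users_df = spark.createDataFrame(users_rows, ["id", "name"])  # hypothetical schema per table class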

AWS Athena output result.json to s3 - CREATE TABLE AS / INSERT INTO SELECT?

Is it anyhow possible to write the results of an AWS Athena query to a results.json within an s3 bucket?
My first idea was to use INSERT INTO SELECT ID, COUNT(*) ... or INSERT OVERWRITE, but this does not seem to be supported according to the Amazon Athena DDL Statements documentation and tdhopper's blog post.
Is it anyhow possible to CREATE TABLE with new data with AWS Athena?
Is there any workaround with AWS Glue?
Is it anyhow possible to trigger a Lambda function with the results of Athena?
(I'm aware of S3 Hooks.)
It would not matter to me to overwrite the whole JSON file/table and always create a new JSON, since the statistics I aggregate are very limited.
I do know that AWS Athena automatically writes the results to an S3 bucket as CSV. However, I would like to do simple aggregations and write the outputs directly to a public S3 bucket so that an Angular SPA in the browser is able to read them. Thus the JSON format and a specific path are important to me.
The workaround for me was with Glue: use the Athena JDBC driver to run the query and load the result into a DataFrame, then save the DataFrame in the required format to the specified S3 location.
df = spark.read.format('jdbc').options(
    url='jdbc:awsathena://AwsRegion=region;UID=your-access-key;PWD=your-secret-access-key;Schema=database name;S3OutputLocation=s3 location where the jdbc driver stores athena query results',
    driver='com.simba.athena.jdbc42.Driver',
    dbtable='(your athena query)'
).load()
df.repartition(1).write.format("json").save("s3 location")
Specify the query in the format dbtable='(select * from foo)'.
Download the jar from here and store it in S3.
While configuring the ETL job on Glue, specify the S3 location of the jar in "Jar lib path".
You can get Athena to create data in S3 by using a "create table as select" (CTAS) query. In that query you can specify where, and in what format, you want the created table to store its data.
https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
For JSON, the example you are looking for is:
CREATE TABLE ctas_json_unpartitioned
WITH (
format = 'JSON',
external_location = 's3://my_athena_results/ctas_json_unpartitioned/')
AS SELECT key1, name1, address1, comment1
FROM table1;
This results in single-line JSON format (one JSON object per line).

How can I read geospatial data from MySQL into R?

I am reading from a MySQL database into R. I can read the table of interest using dplyr without a problem, except for the geometry column, which contains MULTILINE(...). Is there a way to read polylines into R directly from MySQL?
When I read the table containing the geometry column, it gives a warning, "unrecognized MySQL field type 255 in column 5 imported as character", and for each record for which the geometry column is not NA it gives an additional warning such as "internal error: row 51 field 5 truncated".
If your MySQL column is a spatial SQL object, you may consider using ogr2ogr, a popular command-line utility for handling spatial data formats. For example you could do:
ogr2ogr -f MySQL MySQL:gis,user=root,password=password C:\file.shp -nln pianco_post -a_srs EPSG:29194 -update -overwrite -lco engine=MYISAM
As discussed here.
If you wish to do everything in R, you can use the R wrapper for ogr2ogr. Then you would just read your shapefiles into R (this is discussed at length across SO) and use them alongside any other objects you may have.
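If you want to script the conversion rather than run it by hand, here is a minimal sketch using GDAL's Python bindings (gdal.VectorTranslate is the programmatic counterpart of ogr2ogr), going in the export direction (MySQL -> shapefile) so the result can then be read into R, e.g. with sf::st_read(); the connection string and table name are taken from the command above, and the output path is a placeholder:
from osgeo import gdal
gdal.UseExceptions()
# export one MySQL spatial table to a shapefile that R can read
gdal.VectorTranslate(
    "C:/file_out.shp",                        # placeholder output path
    "MySQL:gis,user=root,password=password",  # MySQL datasource, as in the command above
    format="ESRI Shapefile",
    layers=["pianco_post"],                   # the MySQL table/layer to export
)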