How can I read geospatial data from MySQL into R?

I am reading from a MySQL database into R. I can read the table of interest using dplyr without a problem - except for the geometry column which contains MULTILINE(...). Is there a way to read polylines into R directly from MySQL?
When I read the table containing the geometry column, it gives the warning "unrecognized MySQL field type 255 in column 5 imported as character", and for each record whose geometry column is not NA it gives an additional warning such as "internal error: row 51 field 5 truncated".

If your MySQL column is a spatial SQL object, you may consider using ogr2ogr, a popular command-line utility for handling spatial data formats. For example, you could do:
ogr2ogr -f MySQL MySQL:gis,user=root,password=password C:\file.shp -nln pianco_post -a_srs EPSG:29194 -update -overwrite -lco engine=MYISAM
As discussed here.
If you wish to do everything in R, you can use the R wrapper for ogr2ogr, running it in the other direction (MySQL as the source, a shapefile as the destination). Then you would just read the resulting shapefiles into R (this is discussed at length across SO) and use them alongside any other objects you may have.
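For the R side, here is a minimal sketch. It assumes the sf, DBI and RMariaDB packages, a GDAL build with the MySQL driver, and a hypothetical table roads with a geometry column geom; adjust names, credentials and the EPSG code to your data.
library(DBI)
library(sf)
# Option 1: let ogr2ogr dump the MySQL layer to a shapefile, then read it with sf.
# ogr2ogr must be on the PATH; the connection string mirrors the one above.
system('ogr2ogr -f "ESRI Shapefile" roads.shp "MySQL:gis,user=root,password=password" roads')
roads_sf <- st_read("roads.shp")
# Option 2: skip the shapefile and have MySQL serialise the geometry as WKT,
# then parse the text column client-side.
con <- dbConnect(RMariaDB::MariaDB(), dbname = "gis",
                 user = "root", password = "password")
roads <- dbGetQuery(con, "SELECT id, ST_AsText(geom) AS wkt FROM roads")
roads_sf <- st_sf(roads["id"], geometry = st_as_sfc(roads$wkt, crs = 29194))
dbDisconnect(con)
Option 2 avoids the intermediate shapefile (and its 10-character column-name limit) at the cost of building the sf object yourself; either way your MULTILINESTRING geometries end up as ordinary sf line features.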

Related

OGR2OGR PostgreSQL / PostGIS issue after enabling postgis extension on import

I'm running psql (PostgreSQL) 14.5 (Homebrew) with PostGIS extension version 3.3
I'm using GDAL's ogr2ogr to import GeoJSON files.
ogr2ogr -f "PostgreSQL" PG:"dbname=test4 user=myuser" "myfile.geojson"
If I import all files into a new database and enable the postgis extension after all my imports, my queries work as desired.
SELECT district,
ST_Contains('POINT (-##.## ##.## )', wkb_geometry) FROM table
Returns: booleans as expected
If I import another GeoJSON file after the extension is enabled, the query fails for the newly imported tables.
ERROR: contains: Operation on mixed SRID geometries (Point, 0) != (Polygon, 4326)
SQL state: XX000
It seems the column type changes from bytea to geometry, and it doesn't let me alter it or disable the extension. I have to delete the database and import all tables again, then enable the extension. What am I doing wrong? Is there a problem in my process or query? Why does it work if I import the data and then enable the extension, but all newly imported tables fail with the query?
You should be using something like:
SELECT district,
ST_Contains(ST_GeomFromText('POINT (-##.## ##.## )',4326), wkb_geometry)
FROM table
This makes PostGIS aware that your WKT is in lon/lat coordinates, so it can safely compare it to your geometry column (which is also in EPSG:4326, because that is what GeoJSON contains by specification). Other data sources may be in different projections, in which case you'll probably need to read about ST_Transform too.
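For what it's worth, the same fix carries over unchanged if you later drive the query from a client library instead of psql. A minimal sketch using R's DBI with the RPostgres backend; the dbname and user are taken from the ogr2ogr call above, while the table name and the example longitude/latitude are placeholders:
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "test4", user = "myuser")
# Placeholder point; substitute your own longitude/latitude.
lon <- -73.99
lat <- 40.73
sql <- sprintf(
  "SELECT district,
          ST_Contains(ST_GeomFromText('POINT(%f %f)', 4326), wkb_geometry)
   FROM my_table",
  lon, lat)
dbGetQuery(con, sql)
dbDisconnect(con)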

How to export a GA360 table from BigQuery to Snowflake through GCS as JSON without data loss?

I am exporting a GA360 table from BigQuery to Snowflake in JSON format using the bq CLI. I lose some fields when I load it as a table in Snowflake. I use the COPY command to load my JSON data from a GCS external stage into Snowflake tables, but I am missing some fields that are part of a nested array. I even tried compressing the file when exporting to GCS, but I still lose data. Can someone suggest how I can do this? I don't want to flatten the table in BigQuery and transfer that. My daily table size ranges from 1.5 GB to 4 GB.
bq extract \
--project_id=myproject \
--destination_format=NEWLINE_DELIMITED_JSON \
--compression GZIP \
datasetid.ga_sessions_20191001 \
gs://test_bucket/ga_sessions_20191001-*.json
I have set up my integration, file format, and stage in Snowflake. I am copying data from this bucket to a table that has one VARIANT field. The row count matches BigQuery, but some fields are missing.
I am guessing this is due to Snowflake's 16 MB limit on each VARIANT value. Is there some way I can compress each VARIANT field so it stays under 16 MB?
I had no problem exporting GA360, and getting the full objects into Snowflake.
First I exported the demo table bigquery-public-data.google_analytics_sample.ga_sessions_20170801 into GCS, JSON formatted.
Then I loaded it into Snowflake:
create or replace table ga_demo2(src variant);
COPY INTO ga_demo2
FROM 'gcs://[...]/ga_sessions000000000000'
FILE_FORMAT=(TYPE='JSON');
And then to find the transactionIds:
SELECT src:visitId, hit.value:transaction.transactionId
FROM ga_demo2, lateral flatten(input => src:hits) hit
WHERE src:visitId='1501621191'
LIMIT 10
Cool things to notice:
I read the GCS files easily from Snowflake deployed in AWS.
JSON manipulation in Snowflake is really cool.
See https://hoffa.medium.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1 for more.

Import spatial data from PostGIS into MySQL

I have a PostgreSQL DB that contains KML data in one column of a table. I query it with PostGIS functions, using a query like this:
SELECT ST_GeomFromKML(geometry),
ST_Intersects(ST_SetSRID(ST_Buffer(ST_MakePoint(11.255492,43.779251),0.002), 4326), ST_GeomFromKML(geometry)) as intersect,
ST_SetSRID(ST_Buffer(ST_MakePoint(11.255492,43.779251),0.002), 4326)
FROM mydb
WHERE
ST_Intersects(ST_SetSRID(ST_Buffer( ST_MakePoint(11.255492,43.779251),0.002), 4326), ST_GeomFromKML(geometry))
LIMIT 1
in the geometry column the data are stored as KML like this:
<Polygon><outerBoundaryIs><LinearRing><coordinates>8.198905,40.667052 8.201007,40.667052 8.201007,40.665738 8.20127,40.665738 8.20127,40.664688 8.201532,40.664688 8.201532,40.663111 8.20127,40.663111 8.199956,40.663111 8.199956,40.663374 8.199693,40.663374 8.199693,40.664425 8.197591,40.664425 8.197591,40.665476 8.198905,40.665476 8.199168,40.665476 8.199168,40.666789 8.198905,40.666789 8.198905,40.667052</coordinates></LinearRing></outerBoundaryIs></Polygon>
So I use ST_GeomFromKML to convert the data to geometry, then I search for the intersection with a circle I create around the point.
I wanted to migrate the database to MySQL and use its spatial functions, but I can't find a way to use/convert the KML data inside MySQL as I do with PostGIS.
Is there a way to do it?
I guess it would be worth trying to export your geometries in a format that can be read by MySQL, e.g. WKT (Well-Known Text). From your question I assume you're indeed storing the geometries as KML in either a text or an xml column, so I believe the following will help you:
Test Data
CREATE TABLE t (kml TEXT);
INSERT INTO t VALUES ('<Point><coordinates>8.54,47.36</coordinates></Point>');
Export as CSV to the standard output (client)
COPY (SELECT ST_AsText(ST_geomFromKML(kml)) AS geom FROM t) TO STDOUT CSV;
query returned copy data:
POINT(8.54 47.36)
Export as CSV into a file on the server - keep in mind that the system user postgres needs write permission in the given directory.
COPY (SELECT ST_AsText(ST_geomFromKML(kml)) AS geom FROM t) TO '/path/to/file.csv';
Yes, I do that too (geometry from Postgres -> WKT -> MySQL -> geometry in MySQL).
To complete the picture, I use:
pg_dump --column-inserts --data-only
to get just the INSERT statements, in a form MySQL accepts. Then all that remains is to create the table in MySQL, delete whatever comes before and after my INSERTs in the Postgres dump, remove the "schema." prefix, and hop, off it goes into MySQL!
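If R is already part of your toolchain, the same WKT round-trip can be scripted there instead of shuffling CSV or dump files by hand. A rough sketch, assuming the DBI, RPostgres and RMariaDB packages; the connection details, the id column and the staging/target table names are assumptions, and the SELECT follows the query from the question:
library(DBI)
pg <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
my <- dbConnect(RMariaDB::MariaDB(), dbname = "gis", user = "root", password = "password")
# 1. Pull the KML geometries out of PostGIS as WKT.
rows <- dbGetQuery(pg, "SELECT id, ST_AsText(ST_GeomFromKML(geometry)) AS wkt FROM mydb")
# 2. Stage the WKT as plain text in MySQL, then convert it into a native geometry column.
dbWriteTable(my, "staging", rows, overwrite = TRUE)
dbExecute(my, "CREATE TABLE geoms (id INT, geom GEOMETRY)")
# On MySQL 8, double-check the axis-order handling for SRID 4326.
dbExecute(my, "INSERT INTO geoms SELECT id, ST_GeomFromText(wkt, 4326) FROM staging")
dbDisconnect(pg)
dbDisconnect(my)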

How to extract tables with data from .sql dumps using Spark?

I have around four self-contained *.sql dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried installing a local database using InnoDB and importing the dumps, but that seems too slow (I spent around 10 hours on that).
So I read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._
// Read the raw dump as lines of text.
val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")
// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the SQL dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")
// Identify all tables and data in the SQL dump along with their indexes.
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values choose a different substring " + tableStructures.count() + " " + tableStructureEnds.count())
val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. To start with, I need to understand whether this can be done for even one table. Is there any .sql parser written for Scala/Spark?
Is there a faster way of going about it? Can I read it directly into Hive from the self-contained .sql file?
UPDATE 1: I am writing the parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to dataset-based code to use the SQL parser as suggested.
Is there any .sql parser written for Scala/Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply expr standard function. You may also use the SQL parser directly.
(Shameless plug) You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps:
Create a class for each table.
Load the files using textFile.
Filter out all statements other than insert statements.
Then split the RDD, using filter, into multiple RDDs based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create objects.
Now convert the RDDs to datasets.

How to convert original MySQL Datatypes in R

I have a MySQL database with a lot of data in it. Now I want to manipulate it with R. After I run the MySQL query, R imports the data but converts the datatypes of the tables, e.g. datetime is converted to character and so on. This is not a problem while I'm only using the data in R. But after analysing it I want to write it back to the MySQL database with a few changes, and the datatypes are still the converted ones from R, so other software struggles to display the newly created tables. After I converted the datatypes manually in MySQL Workbench, they showed up. This takes ages for big data. Now my question:
Is there any way to convert the datatypes in R back to the original MySQL types before writing the data back to the database?
R produces these warnings:
10: In .local(conn, statement, ...) : Unsigned INTEGER in col 3 imported as numeric
11: In .local(conn, statement, ...) : unrecognized MySQL field type 7 in column 20 imported as character
The R code I run
dat <- lapply(tables, function(table) {
  fn$dbGetQuery(con, sprintf({"SELECT * FROM %s WHERE TIME>=(SELECT MIN(TIME) FROM %s) AND TIME <'$z'"}, table, table))
})
dat %<>% bind_rows()
with tables being a vector of the table names I want to fetch and z a year (the fn$ string interpolation comes from the gsubfn package and %<>% from magrittr).
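One possible approach, sketched under the assumption that the write-back goes through DBI with the RMariaDB (or RMySQL) backend and using made-up column names: coerce the affected columns in R first, and pass explicit MySQL types via the field.types argument of dbWriteTable, so the target table is created with the intended schema rather than whatever R's classes suggest.
library(DBI)
# Made-up example: TIME came back from MySQL as character and an unsigned
# integer column COUNTER came back as numeric (see the warnings above).
dat$TIME <- as.POSIXct(dat$TIME, tz = "UTC")
# Declare the MySQL column types explicitly when writing back; depending on
# the backend you may need to list every column, not just the problematic ones.
dbWriteTable(con, "my_results", dat, overwrite = TRUE,
             field.types = c(TIME = "DATETIME", COUNTER = "INT UNSIGNED"))
That should spare you the manual type fixing in MySQL Workbench afterwards.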