Load geojson in bigquery - json

What is the best way to load the following GeoJSON file into Google BigQuery?
http://storage.googleapis.com/velibs/stations/test.json
I have a lot of JSON files like this (much bigger) on Google Cloud Storage, and I cannot download/modify/upload them all (it would take forever). Note that the file is not newline-delimited, so I guess it needs to be modified online.
Thanks all.

Step by step 2019:
If you get the error "Error while reading data, error message: JSON parsing error in row starting at position 0: Nested arrays not allowed.", you might have a GeoJSON file.
Transform the GeoJSON into newline-delimited JSON with jq, then load it into BigQuery as a single-column CSV (each line lands unparsed in a column named row):
jq -c '.features[]' \
  san_francisco_censustracts.json > sf_censustracts_201905.json
bq load --source_format=CSV \
  --quote='' --field_delimiter='|' \
  fh-bigquery:deleting.sf_censustracts_201905 \
  sf_censustracts_201905.json row
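For readers without jq, the flattening step can be sketched in Python. This is a minimal sketch, assuming the input is a FeatureCollection small enough to parse in memory; the function name features_to_ndjson is hypothetical and not part of any library.

```python
import json


def features_to_ndjson(geojson_text: str) -> str:
    """Return one compact JSON object per line, one line per GeoJSON feature.

    Mirrors what `jq -c '.features[]'` produces. Assumes (hypothetically)
    that the whole FeatureCollection fits in memory.
    """
    collection = json.loads(geojson_text)
    lines = [
        json.dumps(feature, separators=(",", ":"))
        for feature in collection["features"]
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Tiny made-up FeatureCollection, for illustration only.
    sample = json.dumps({
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature",
             "properties": {"MOVEMENT_ID": "1", "DISPLAY_NAME": "A"},
             "geometry": {"type": "Point", "coordinates": [-122.4, 37.8]}},
            {"type": "Feature",
             "properties": {"MOVEMENT_ID": "2", "DISPLAY_NAME": "B"},
             "geometry": {"type": "Point", "coordinates": [-122.5, 37.7]}},
        ],
    })
    print(features_to_ndjson(sample), end="")
```

For files too large for memory, a streaming JSON parser would be needed instead; that is outside this sketch.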
Parse the loaded file in BigQuery:
CREATE OR REPLACE TABLE `fh-bigquery.uber_201905.sf_censustracts`
AS
SELECT FORMAT('%f,%f', ST_Y(centroid), ST_X(centroid)) lat_lon, *
FROM (
  SELECT *, ST_CENTROID(geometry) centroid
  FROM (
    SELECT
      CAST(JSON_EXTRACT_SCALAR(row, '$.properties.MOVEMENT_ID') AS INT64) movement_id
      , JSON_EXTRACT_SCALAR(row, '$.properties.DISPLAY_NAME') display_name
      , ST_GeogFromGeoJson(JSON_EXTRACT(row, '$.geometry')) geometry
    FROM `fh-bigquery.deleting.sf_censustracts_201905`
  )
)
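To check locally which values the JSON paths in the query pick out of each loaded line, the extraction can be mirrored in Python. This is only an illustrative sketch: parse_row is a hypothetical helper, and unlike JSON_EXTRACT_SCALAR (which yields NULL for a missing path) it raises KeyError when a property is absent.

```python
import json


def parse_row(row: str) -> dict:
    """Mirror the three JSON paths used in the query for one loaded line."""
    feature = json.loads(row)
    properties = feature["properties"]
    return {
        # '$.properties.MOVEMENT_ID', cast to an integer
        "movement_id": int(properties["MOVEMENT_ID"]),
        # '$.properties.DISPLAY_NAME'
        "display_name": properties["DISPLAY_NAME"],
        # '$.geometry', kept as a GeoJSON string (the input ST_GeogFromGeoJson takes)
        "geometry": json.dumps(feature["geometry"], separators=(",", ":")),
    }


if __name__ == "__main__":
    # Made-up line in the shape produced by the jq step.
    line = ('{"type":"Feature","properties":{"MOVEMENT_ID":"7",'
            '"DISPLAY_NAME":"Example tract"},'
            '"geometry":{"type":"Point","coordinates":[-122.4,37.8]}}')
    print(parse_row(line))
```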
Alternative approaches:
With ogr2ogr:
https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
https://medium.com/@mentin/loading-large-spatial-features-to-bigquery-geography-2f6ceb6796df
With Node.js:
https://github.com/mentin/geoscripts/blob/master/geojson2bq/geojson2bqjson.js

The bucket in the question no longer exists. However, five years later, there is a new answer.
In July 2018, Google announced an alpha (now beta) of BigQuery GIS.
The docs highlight a limitation:
BigQuery GIS supports only individual geometry objects in GeoJSON.
BigQuery GIS does not currently support GeoJSON feature objects,
feature collections, or the GeoJSON file format.
This means that any Feature or FeatureCollection properties need to go into separate columns, with a geography column holding the GeoJSON geometry.
In this tutorial by a Google trainer, polygons in a shapefile are converted into GeoJSON strings inside rows of a CSV file using GDAL:
ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from LAYER_NAME" output.csv inputfilename.shp
You want to end up with one column holding the geometry content, like this:
{"type":"Polygon","coordinates":[[....]]}
Other columns may contain feature properties.
The CSV can then be imported to BQ. Then a query on the table can be viewed in BigQuery Geo Viz. You need to tell it which column contains the geometry.
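If the source is already GeoJSON rather than a shapefile, the same target layout (one geometry column plus one column per feature property) can be sketched in Python. This is a minimal sketch under stated assumptions: features_to_csv is a hypothetical helper, the column name geom simply follows the alias in the ogr2ogr command above, and the whole collection is assumed to fit in memory.

```python
import csv
import io
import json


def features_to_csv(collection: dict) -> str:
    """Write one CSV row per feature: geometry as a GeoJSON string, then properties."""
    features = collection["features"]

    # Collect property names across all features, in first-seen order.
    property_names = []
    for feature in features:
        for name in feature.get("properties") or {}:
            if name not in property_names:
                property_names.append(name)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["geom"] + property_names)
    for feature in features:
        properties = feature.get("properties") or {}
        # The csv module quotes this field because it contains commas and quotes.
        geometry = json.dumps(feature["geometry"], separators=(",", ":"))
        writer.writerow([geometry] + [properties.get(n, "") for n in property_names])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up single-feature collection, for illustration only.
    example = {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature",
             "properties": {"name": "example"},
             "geometry": {"type": "Point", "coordinates": [1, 2]}},
        ],
    }
    print(features_to_csv(example), end="")
```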

Related

Error when importing GeoJson into BigQuery

I'm trying to load GeoJson data [1] into BigQuery via Cloud Shell but I'm getting the following error:
Failed to parse JSON: Top-level GeoJson 'type' member should have value 'Feature', but was 'FeatureCollection'.; ParsedString returned false; Could not parse value; Parser terminated before end of string
It feels like the GeoJson file is not formatted properly for BQ but I have no idea if that's true or how to fix it.
[1] https://github.com/tonywr71/GeoJson-Data/blob/master/australian-suburbs.geojson
Expounding on @scespinoza's answer, I was able to convert the file to newline-delimited GeoJSON and load it into BigQuery with the following steps:
geojson2ndjson geodata.txt > geodata_converted.txt
Using this command, I encountered an error:
I was able to work around it by splitting the data into two tables and applying the same command to each.
Loaded table in BigQuery:
Your file is in standard GeoJSON format, but BigQuery only accepts newline-delimited GeoJSON files and individual GeoJSON objects (see the documentation: https://cloud.google.com/bigquery/docs/geospatial-data#geojson-files). So you should first convert the dataset to the appropriate format. Here is a good and simple explanation of how it works: https://stevage.github.io/ndgeojson/.
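The conversion described above amounts to writing each Feature of the FeatureCollection on its own line. The sketch below also yields the output in several parts, in the spirit of the split-into-two-tables workaround from the other answer. It is a hedged illustration only: ndgeojson_parts and features_per_part are hypothetical names, and the input is assumed to fit in memory.

```python
import json
from typing import Iterator


def ndgeojson_parts(collection: dict, features_per_part: int) -> Iterator[str]:
    """Yield newline-delimited GeoJSON text, at most features_per_part features each."""
    if features_per_part < 1:
        raise ValueError("features_per_part must be at least 1")
    features = collection["features"]
    for start in range(0, len(features), features_per_part):
        part = features[start:start + features_per_part]
        # One Feature object per line, no enclosing FeatureCollection.
        yield "".join(json.dumps(feature) + "\n" for feature in part)


if __name__ == "__main__":
    # Made-up collection with three point features.
    example = {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {"id": i},
             "geometry": {"type": "Point", "coordinates": [i, i]}}
            for i in range(3)
        ],
    }
    for number, text in enumerate(ndgeojson_parts(example, 2)):
        print(f"part {number}:")
        print(text, end="")
```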

converting parquet files in S3 to CSV and store back in S3

Information:
I have Parquet files stored in S3 which I need to convert to CSV and store back in S3.
The Parquet files are structured in S3 like this:
2019
2020
|- 01
...
|- 12
   |- 01
   ...
   |- 29
      |- part-0000.snappy.parquet
      |- part-0001.snappy.parquet
      ...
      |- part-1000.snappy.parquet
...
The solution required:
Any AWS tooling (it needs to use Lambda; no EC2 or ECS), though I am open to suggestions
That the CSV files keep their headers during conversion (if they are split up)
That the CSV files retain all the original information and have no added columns/information
That the converted CSV files remain around 50-100MB each
The solution I have already tried:
"entire folder method"
Using Athena CREATE EXTERNAL TABLE -> CREATE TABLE AS on the entire data folder (e.g: s3://2020/06/01/)
fig: #1
CREATE EXTERNAL TABLE IF NOT EXISTS database.table_name (
  value_0 bigint,
  value_1 string,
  value_2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'serialization.format' = '1' )
LOCATION 's3://2020/06/01' TBLPROPERTIES ('has_encrypted_data'='false')
fig: #2
CREATE TABLE database.different_table_name
WITH ( format='TEXTFILE', field_delimiter=',', external_location='s3://2020/06/01-output') AS
SELECT * FROM database.table_name
Doing this "entire folder method" works for converting Parquet to CSV, but it leaves the CSV files at 1GB+ each, which is way too large. I tried creating a solution to split up the CSV files (thanks to help from this guide), but it failed: Lambda's 15-minute limit and memory constraints made it difficult to split all these 1GB+ CSV files into files of about 50-100MB.
"single file method"
Using the same CREATE EXTERNAL TABLE (see fig: #1) and
fig: #3
CREATE TABLE database.different_table_name
WITH ( format = 'TEXTFILE', field_delimiter=',', external_location = 's3://2020/06/01-output') AS
SELECT *, "$path" FROM database.table_name
WHERE "$path" LIKE 's3://2020/06/01/part-0000.snappy.parquet';
Doing this "single file method" required me to integrate AWS SQS to listen for S3 events for objects created in the bucket whose names end in .snappy.parquet. This solution converted the Parquet to CSV and created CSVs that fit the size requirements. The only issue is that the CSVs were missing headers and had additional fields which never existed in the Parquet files in the first place, such as the entire bucket location.
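You can use dask:
import dask.dataframe as dd
# read every Parquet file under the prefix (reading from S3 requires the s3fs package)
df = dd.read_parquet('s3://bucket_path/*.parquet')
# convert the dask DataFrame to a pandas DataFrame (this loads everything into memory)
df = df.compute()
# index=False avoids writing the pandas index as an extra column
df.to_csv('out.csv', index=False)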
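The question also asks for output files of a bounded size that each keep the header row. A minimal standard-library sketch of that splitting step follows. It assumes the rows are available as an iterable; split_csv_with_headers and max_bytes are hypothetical names, and size is measured as UTF-8 bytes of the rendered CSV text.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def split_csv_with_headers(
    header: Sequence[object],
    rows: Iterable[Sequence[object]],
    max_bytes: int,
) -> Iterator[str]:
    """Yield CSV chunks of at most max_bytes where possible, each starting with the header."""

    def render(values: Sequence[object]) -> str:
        line = io.StringIO()
        csv.writer(line, lineterminator="\n").writerow(values)
        return line.getvalue()

    header_line = render(header)
    header_size = len(header_line.encode("utf-8"))
    chunk = [header_line]
    size = header_size
    for row in rows:
        line = render(row)
        line_size = len(line.encode("utf-8"))
        # Start a new chunk if this row would push a non-empty chunk over the limit.
        if len(chunk) > 1 and size + line_size > max_bytes:
            yield "".join(chunk)
            chunk = [header_line]
            size = header_size
        chunk.append(line)
        size += line_size
    if len(chunk) > 1:
        yield "".join(chunk)


if __name__ == "__main__":
    # Made-up rows and a deliberately tiny limit, to show the splitting.
    for part in split_csv_with_headers(["a", "b"], [[1, 2], [3, 4], [5, 6]], 8):
        print(repr(part))
```

A single row larger than max_bytes still ends up in a chunk of its own, so the limit is a target rather than a guarantee.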
While there's no way to configure the output file sizes directly, you can control the number of output files in each output partition when using CTAS in Athena. The key is to use the bucket_count and bucketed_by configuration parameters, as described here: How can I set the number or size of files when I run a CTAS query in Athena?. Run a few conversions, record the sizes of the Parquet and CSV files, and use that as a heuristic for how many buckets to configure for each job. Each bucket will become one file.
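The heuristic can be sketched as simple arithmetic plus a CTAS statement built around it. This is an illustration under stated assumptions: the CSV-to-Parquet size ratio (5.0 below) is a made-up measurement that you would replace with your own, the helper names are hypothetical, and value_0 is borrowed from the question's table as the bucketing column.

```python
import math


def estimate_bucket_count(
    parquet_bytes: int, csv_to_parquet_ratio: float, target_csv_bytes: int
) -> int:
    """Estimate how many buckets (output files) keep each CSV near the target size."""
    estimated_csv_bytes = parquet_bytes * csv_to_parquet_ratio
    return max(1, math.ceil(estimated_csv_bytes / target_csv_bytes))


def build_ctas(
    source_table: str,
    target_table: str,
    external_location: str,
    bucket_column: str,
    bucket_count: int,
) -> str:
    """Build a CTAS statement using the bucketed_by and bucket_count properties."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'TEXTFILE',\n"
        "  field_delimiter = ',',\n"
        f"  external_location = '{external_location}',\n"
        f"  bucketed_by = ARRAY['{bucket_column}'],\n"
        f"  bucket_count = {bucket_count}\n"
        ") AS\n"
        f"SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    # Assumed numbers: 200 MB of Parquet, CSV about 5x larger, 75 MB target files.
    buckets = estimate_bucket_count(200_000_000, 5.0, 75_000_000)
    print(build_ctas(
        "database.table_name",
        "database.different_table_name",
        "s3://2020/06/01-output",
        "value_0",
        buckets,
    ))
```

Rows are assigned to buckets by hashing the bucketing column, so the files come out roughly equal in size only when that column has many distinct, evenly distributed values.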
When working with Athena from Lambda you can use Step Functions to avoid the need for the Lambda function to run while Athena is executing. Use the Poll for Job Status tutorial as a starting point. It's especially useful when running CTAS jobs since these tend to take longer to run.
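In that pattern, the state machine calls a short-lived status-check function between wait states instead of one Lambda blocking until the query finishes. A minimal sketch of such a check is shown below. It assumes the Athena client is passed in (in a real Lambda it would be created with boto3.client("athena")), and the function name and the shape of the returned dict are hypothetical choices for this example.

```python
from typing import Any, Dict

# Athena query states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def check_query_status(athena_client: Any, event: Dict[str, str]) -> Dict[str, Any]:
    """Look up one query's state and report whether the state machine can stop waiting."""
    query_execution_id = event["QueryExecutionId"]
    response = athena_client.get_query_execution(
        QueryExecutionId=query_execution_id
    )
    state = response["QueryExecution"]["Status"]["State"]
    return {
        "QueryExecutionId": query_execution_id,
        "State": state,
        "Done": state in TERMINAL_STATES,
    }


if __name__ == "__main__":
    class FakeAthena:
        """Stand-in client so the sketch runs without AWS credentials."""

        def get_query_execution(self, QueryExecutionId: str) -> Dict[str, Any]:
            return {"QueryExecution": {"Status": {"State": "RUNNING"}}}

    print(check_query_status(FakeAthena(), {"QueryExecutionId": "example-id"}))
```

A Choice state in the state machine can then branch on the returned Done flag and loop back through a Wait state while it is false.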

Can't simplify topojson for d3 mapping

I'm trying to map some statistical data of Italy and I need the infrastructure (railway and motorway) on top of it.
The problem is that I'm not able to simplify the infrastructure json file.
I'm using the OpenStreetMap shapefiles of Italy from Geofabrik: http://download.geofabrik.de/europe/italy.html#
I've converted roads.shp to JSON, selecting only motorways and primary roads, using this command:
ogr2ogr -f GeoJSON -where "type IN ('motorway', 'motorway_link', 'primary', 'primary_link')" -t_srs EPSG:4326 roads.json roads.shp
I get a 55Mb json file. You can download it here: http://www.danielepennati.com/prove/mapping/roads_mw_pr.zip
Then I tried to simplify it and convert it to TopoJSON.
With no -s option, the new JSON file is about 13Mb.
If I use -s or --simplify-proportion with any value from 1 to 0, I always get a maximum simplification of 95% and a file size of 11Mb.
How can I get a more simplified topojson?
Thanks
daniele

Load a JSON file from the BigQuery command line

Is it possible to load data from a JSON file (not just CSV) using the BigQuery command line tool? I am able to load a simple JSON file using the GUI; however, the command line assumes CSV, and I don't see any documentation on how to specify JSON.
Here's the simple json file I'm using
{"col":"value"}
With schema
col:STRING
As of version 2.0.12, bq does allow uploading newline-delimited JSON files. This is an example command that does the job:
bq load --source_format NEWLINE_DELIMITED_JSON datasetName.tableName data.json schema.json
As mentioned above, "bq help load" will give you all of the details.
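To make the two input files concrete, here is a minimal sketch that writes a newline-delimited data.json and a schema.json for the one-column example from the question, and returns the matching bq load argument list. The helper name write_bq_load_inputs is hypothetical; the schema layout (an array of objects with name and type) follows the form the bq help text describes, and datasetName.tableName is the placeholder from the command above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_bq_load_inputs(
    rows: Iterable[Dict[str, object]],
    schema: List[Dict[str, str]],
    directory: Path,
) -> List[str]:
    """Write data.json (one object per line) and schema.json; return the bq command."""
    data_path = directory / "data.json"
    schema_path = directory / "schema.json"
    # Newline-delimited JSON: one complete object per line, no enclosing array.
    data_path.write_text(
        "".join(json.dumps(row) + "\n" for row in rows), encoding="utf-8"
    )
    schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return [
        "bq", "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        "datasetName.tableName",
        str(data_path),
        str(schema_path),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        command = write_bq_load_inputs(
            [{"col": "value"}],
            [{"name": "col", "type": "STRING"}],
            Path(tmp),
        )
        # The command is only printed here; pass it to subprocess.run to execute it.
        print(" ".join(command))
```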
1) Yes, you can.
2) The documentation is here. Go to "Step 3: Upload the table" in the documentation.
3) You have to use the --source_format flag to tell bq that you are uploading a JSON file and not a CSV.
4) The complete command structure is:
bq load [--source_format=NEWLINE_DELIMITED_JSON] [--project_id=your_project_id] destination_data_set.destination_table data_source_uri table_schema
bq load --project_id=my_project_bq dataset_name.bq_table_name gs://bucket_name/json_file_name.json path_to_schema_in_your_machine
5) You can find other bq load variants by running:
bq help load
As of bq version 2.0.9, loading JSON-formatted data is not supported.
Here is the documentation (bq help load) for the load command with the latest bq version, 2.0.9:
USAGE: bq [--global_flags] <command> [--command_flags] [args]
load Perform a load operation of source into destination_table.
Usage:
load <destination_table> <source> [<schema>]
The <destination_table> is the fully-qualified table name of table to create, or append to if the table already exists.
The <source> argument can be a path to a single local file, or a comma-separated list of URIs.
The <schema> argument should be either the name of a JSON file or a text schema. This schema should be omitted if the table already has one.
In the case that the schema is provided in text form, it should be a comma-separated list of entries of the form name[:type], where type will default
to string if not specified.
In the case that <schema> is a filename, it should contain a single array object, each entry of which should be an object with properties 'name',
'type', and (optionally) 'mode'. See the online documentation for more detail:
https://code.google.com/apis/bigquery/docs/uploading.html#createtable
Note: the case of a single-entry schema with no type specified is
ambiguous; one can use name:string to force interpretation as a
text schema.
Examples:
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.new_tbl gs://mybucket/info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
bq load ds.small gs://mybucket/small.csv field1,field2,field3
Arguments:
destination_table: Destination table name.
source: Name of local file to import, or a comma-separated list of
URI paths to data to import.
schema: Either a text schema or JSON file, as above.
Flags for load:
/usr/local/bin/bq:
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import data.
-E,--encoding: <UTF-8|ISO-8859-1>: The character encoding used by the input file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between columns in the input file. "\t" and "tab" are accepted names for tab.
--max_bad_records: Maximum number of bad records allowed before the entire job fails.
(default: '0')
(an integer)
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to skip.
(an integer)
gflags:
--flagfile: Insert flag definitions from the given file into the command line.
(default: '')
--undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.
IMPORTANT: flags in this list that have arguments MUST use the --flag=value format.
(default: '')
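The help output above describes two schema forms: a text schema of comma-separated name[:type] entries (type defaulting to string) and a JSON file holding an array of objects with name and type. The relationship between the two can be sketched as follows. This is an illustration only, text_schema_to_json is a hypothetical helper, and it does not validate type names.

```python
import json
from typing import Dict, List


def text_schema_to_json(text_schema: str) -> List[Dict[str, str]]:
    """Convert 'name[:type],...' into a list of {'name', 'type'} objects."""
    fields = []
    for entry in text_schema.split(","):
        name, _, field_type = entry.strip().partition(":")
        if not name:
            raise ValueError(f"empty field name in schema entry: {entry!r}")
        # Per the help text, the type defaults to string when omitted.
        fields.append({"name": name, "type": field_type or "string"})
    return fields


if __name__ == "__main__":
    # The two text schemas used in the help output's examples.
    print(json.dumps(text_schema_to_json("name:integer,value:string"), indent=2))
    print(json.dumps(text_schema_to_json("field1,field2,field3"), indent=2))
```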

Creating Shape Files from SQL Server using Ogr2ogr

I am trying to run the following code in a command window. The code executes, but it gives me no values in the .shp file. The table has GeographyCollections and Polygons stored in a field of type Geography. I have tried many variations for the Geography type in the SQL statement (binary, text, etc.), but no luck. The output .dbf file has data, so the connection to the database works, but the .shp and .shx files have no data and are 17 KB and 11 KB in size, respectively.
Any suggestions?
ogr2ogr -f "ESRI Shapefile" -overwrite c:\temp -nln Zip_States -sql "SELECT [ID2],[STATEFP10],[ZCTA5CE10],GEOMETRY::STGeomFromWKB([Geography].STAsBinary(),4326).STAsText() AS [Geography] FROM [GeoSpatial].[dbo].[us_State_Illinois_2010]" ODBC:dbo/GeoSpatial#PPDULCL708504
ESRI Shapefiles can contain only a single type of geometry - Point, LineString, Polygon etc.
Your description suggests that your query returns multiple types of geometry, so restrict that first (using STGeometryType() = 'Polygon', for example).
Secondly, you're currently returning the spatial field as a text string using STAsText(), but you're not telling OGR that it's a spatial field so it's probably just treating the WKT as a regular text column and adding it as an attribute to the dbf file.
To tell OGR which column contains your spatial information you can add the "Tables" parameter to the connection string. However, there's no reason to do all the casting from WKT/WKB if you're using SQL Server 2008 - OGR2OGR will load SQL Server's native binary format fine.
Are you actually using SQL Server 2008, or Denali? Because the serialisation format changed, and OGR2OGR can't read the new format. So, in that case it's safer (but slower) to convert to WKB first.
The following works for me to dump a table of polygons from SQL Server to Shapefile:
ogr2ogr -f "ESRI Shapefile" -overwrite c:\temp -nln Zip_States -sql "SELECT ID, geom26986.STAsBinary() FROM [Spatial].[dbo].[OUTLINE25K_POLY]" "MSSQL:server=.\DENALICTP3;database=Spatial;trusted_connection=yes;Tables=dbo.OUTLINE25K_POLY(geom26986)"
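The pieces of this answer (the Tables parameter in the connection string, STAsBinary(), and the single-geometry-type filter) can be combined when building the command programmatically. The sketch below only assembles the argument list and has not been run against a database. The helper names are hypothetical, the identifiers are taken from the command above, and the values are interpolated without any escaping, so they must come from trusted input.

```python
from typing import List


def mssql_datasource(server: str, database: str, table: str, geometry_column: str) -> str:
    """Build an MSSQL connection string that names the spatial column via Tables=."""
    return (
        f"MSSQL:server={server};database={database};"
        f"trusted_connection=yes;Tables={table}({geometry_column})"
    )


def ogr2ogr_polygon_export(
    output_dir: str,
    layer_name: str,
    server: str,
    database: str,
    table: str,
    id_column: str,
    geometry_column: str,
) -> List[str]:
    """Build an ogr2ogr argument list that exports only Polygon rows to a shapefile."""
    sql = (
        f"SELECT {id_column}, {geometry_column}.STAsBinary() "
        f"FROM {table} "
        f"WHERE {geometry_column}.STGeometryType() = 'Polygon'"
    )
    return [
        "ogr2ogr", "-f", "ESRI Shapefile", "-overwrite", output_dir,
        "-nln", layer_name, "-sql", sql,
        mssql_datasource(server, database, table, geometry_column),
    ]


if __name__ == "__main__":
    args = ogr2ogr_polygon_export(
        r"c:\temp", "Zip_States", r".\DENALICTP3", "Spatial",
        "dbo.OUTLINE25K_POLY", "ID", "geom26986",
    )
    # Passing a list to subprocess.run(args) would avoid shell quoting issues.
    print(args)
```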
Try the following command
ogr2ogr shapeFileName.shp -overwrite -sql "select top 10 * from schema.table" "MSSQL:Server=serverIP;Database=dbname;Uid=userid;trusted_connection=no;Pwd=password" -s_srs EPSG:4326 -t_srs EPSG:4326