I'm using Osmosis to export PostGIS tables to an .osm file, with the following command:
osmosis --read-pgsql host="localhost" database="pgsnapshot" user="postgres" password="postgres" --dataset-dump --write-xml file="output.osm"
It yields an empty .osm file. The database I'm using contains the following tables, none of which have empty columns:
action
nodes_tags
nodes
relation_members
relations
relations_tag
way_tag
schema_info
users
way_nodes
ways
I've searched for a solution, but most results are about osm2pgsql; I want to do the opposite and export from PostGIS back to .osm.
I have around four self-contained *.sql dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried setting up a local database using InnoDB and importing a dump, but that seems too slow (I spent around 10 hours on that).
Instead, I read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._

val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the SQL dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")

// Identify all tables and data in the SQL dump along with their indexes.
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values, choose a different substring: " + tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. To start, I need to understand whether this can be done for even one table. Is there any .sql parser written for Scala/Spark?
Is there a faster way of going about it? Can I read a self-contained .sql file directly into Hive?
UPDATE 1: I am writing a parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to Dataset-based code to use the SQL parser as suggested.
Is there any .sql parser written for Scala/Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply the expr standard function. You may also use the SQL parser directly.
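For example, here is a minimal spark-shell sketch of using the parser directly; the INSERT statement below is just an illustrative placeholder, not a line from your dump:

import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

val parser: ParserInterface = spark.sessionState.sqlParser

// A single statement pulled out of a dump (placeholder table and values).
val stmt = "INSERT INTO users VALUES (1, 'alice'), (2, 'bob')"

// parsePlan turns the statement text into a LogicalPlan you can inspect.
val plan: LogicalPlan = parser.parsePlan(stmt)
println(plan.numberedTreeString)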
Shameless plug: you may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps:
Create a class for each table.
Load the files using textFile.
Filter out all statements other than insert statements.
Then split the RDD into multiple RDDs using filter, based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create an object.
Now convert the RDDs to Datasets.
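Here is a minimal sketch of those steps for a single hypothetical table; the users table, its column layout, the backtick-quoted MySQL INSERT format, and the file path are all assumptions for illustration, and the value parsing is deliberately naive (one-line INSERTs, no quoted commas):

import org.apache.spark.sql.SparkSession

// Hypothetical case class for one table in the dump.
case class User(id: Long, name: String)

val spark = SparkSession.builder().appName("sqlDumpParser").getOrCreate()
import spark.implicits._

val lines = spark.sparkContext.textFile("C:/Users/some_db.sql")

// Keep only INSERT statements, then only those for the table we care about.
val userInserts = lines
  .filter(_.startsWith("INSERT INTO"))
  .filter(_.contains("INSERT INTO `users`"))

// Naive value parsing: assumes rows like INSERT INTO `users` VALUES (1,'alice'),(2,'bob');
val users = userInserts
  .flatMap(_.split("VALUES", 2).lift(1))      // keep the part after VALUES
  .flatMap(_.split("\\),\\s*\\(").toSeq)      // split into individual row tuples
  .map(_.replaceAll("[();]", ""))             // strip leftover punctuation
  .map { row =>
    val cols = row.split(",", 2)
    User(cols(0).trim.toLong, cols(1).trim.stripPrefix("'").stripSuffix("'"))
  }

// Convert the RDD to a Dataset.
val userDS = users.toDS()
userDS.show()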
I'm trying to import an XML file into a MySQL table using the LOAD XML statement:
LOAD XML INFILE 'test.xml' INTO TABLE edge_delete ROWS IDENTIFIED BY '<edge>';
The XML file is structured like this:
<netstate xmlns:xsi=....>
<timestep time="2">
<edge id="10">
<lane id="10_0">
<vehicle id="veh1" pos="4.60" speed="0.00"/>
</lane>
</edge>
</timestep>
The Problem:
All node levels start with the attribute "id", and the import does not distinguish between the node levels.
Each node level should become its own column in my SQL table: edge | lane | vehicle id | ...
Thank you for your help.
A solution that worked for me:
Use a Python script that replaces each "id" attribute with a distinct attribute name (e.g. "edge"). The script does a plain string replacement; maybe not the best or most efficient way, but it solves the problem.
Now I can import all columns of the XML file into the MySQL table.
Another workaround is a text editor with a find/replace function, but that did not work properly with the tools I found for large XML files (> 1 GB).
The script can be found here:
https://studiofreya.com/2016/11/17/replace-string-in-xml-file-with-python/#comment-86628
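If you'd rather not depend on the linked script, the same replacement can be done with any line-streaming approach. Here is a rough sketch in Scala (the file names and element names are assumptions based on the snippet above); because it streams line by line, it also copes with files larger than 1 GB:

import scala.io.Source
import java.io.PrintWriter

// Rename the ambiguous "id" attribute per element level so LOAD XML can map
// each level to its own column. File and element names are assumed.
val in = Source.fromFile("test.xml")
val out = new PrintWriter("test_renamed.xml")
try {
  for (line <- in.getLines()) {
    out.println(
      line
        .replace("<edge id=", "<edge edge=")
        .replace("<lane id=", "<lane lane=")
        .replace("<vehicle id=", "<vehicle vehicle="))
  }
} finally {
  in.close()
  out.close()
}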
I am trying to create a graph in Neo4j, and my data, which is in a CSV file, looks like this:
node1,connection,node2
PPARA,0.5,PPARGC1A
PPARA,0.5,ENSG00000236349
PPARA,0.5,TSPA
I want to use the connection values as the labels of the relationships in the graph, which I am not able to do. The following is the exact code I am using to create the graph:
LOAD CSV WITH HEADERS FROM "file:///C:/Users/username/Desktop/Cytoscape-friend.csv" AS network
CREATE (:GeneRN2{sourceNode:network.node1, destNode:network.node2})
CREATE (sourceNode) -[:TO {exp:network.connection}] ->(destNode)
My second question: since there are repeated values in my file, Neo4j creates multiple nodes for them by default. How do I create a single node for each repeated value and connect all the other related nodes to that single node?
Relationships do not have labels. They have a type.
If you need to specify the type of relationship from a variable, then you need to use the procedure apoc.create.relationship from the APOC library.
To avoid creating duplicate nodes, use MERGE instead of CREATE.
So your query might look like this:
LOAD CSV WITH HEADERS
FROM "file:///C:/Users/username/Desktop/Cytoscape-friend.csv" AS network
MERGE (sourceNode {id: network.node1})
MERGE (destNode {id: network.node2})
WITH sourceNode, destNode, network
CALL apoc.create.relationship(sourceNode, network.connection, {}, destNode) YIELD rel
RETURN sourceNode, rel, destNode
I have a data set in the form of an edge list (sample data was linked as an image).
I want to load this CSV into Neo4j, create one-to-one relationships between A and P, A and Q, and so on, and represent it graphically.
How can I do that using Neo4j?
I am only able to import the file; I am not able to create any relationships.
There is very little information to go on here. If you are using LOAD CSV to import, then post what you have tried. If all you have is an edge list, you need to create the nodes that do not already exist in the database, so use a MERGE on the node identifier to reuse a node if it exists or create it if it doesn't.
I need to write a Hadoop MapReduce program.
Map:
The input is an iTunes EPF data dump file. The file contains database records, and a single record may span two or more lines in the file. I need to read these records from the file.
Reduce:
The output is written to a MySQL database.
Problems:
1) In the Hadoop map function, how do I fetch a single record from the file when it spans two or more lines? And how do I specify that this multi-line text is the value corresponding to a single database record in the map function?
2) Is it possible to use a regex in the map function, since a single record may span more than one line?