I am importing a .csv file with 3300 rows of data via the following:
myCSVfile = pd.read_csv(csv_file)
myCSVfile.to_sql(con=engine, name='foo', if_exists='replace')
Once successfully imported, I do a "select * from ..." query on my table, which returns 3100 rows, so where are the missing 200 rows?
I assume there is corrupt data which cannot be read in, and which is then skipped over by pandas. However, there is no warning, log, or message to explicitly say so; the script executes as normal.
Has anyone experienced similar problems, or am I missing something completely obvious?
Although the question does not specify the engine, let's assume it is sqlite3.
The following re-runnable code shows that DataFrame.to_sql() creates a sqlite3 table and places an index on it, which holds the data from the index of the DataFrame.
Taking the question code literally, the csv should import into the DataFrame with a RangeIndex, which will be unique ordinals. Because of this, one should be surprised if the number of rows in the csv does not match the number of rows loaded into the sqlite3 table.
So there are two things to do. First, verify that the csv is being imported correctly. This is the likely problem, since poorly formatted csv files, originating from human-manipulated spreadsheets, frequently fail when processed by code for a variety of reasons. But that is impossible to answer here because we do not know the input data.
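A quick way to check the first point is to compare the DataFrame row count against the raw line count of the file (a minimal sketch; the file path is a placeholder for the csv_file used in the question):
import pandas as pd

csv_file = 'your_file.csv'  # placeholder for the path used in the question
myCSVfile = pd.read_csv(csv_file)

with open(csv_file) as f:
    raw_lines = sum(1 for _ in f)

# With a header line these should differ by exactly one, unless malformed
# quoting or embedded newlines caused rows to be merged or split.
print(len(myCSVfile), raw_lines - 1)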
Second, rule out what DataFrame.to_sql() itself does. For that, the method parameter can be passed in; it can be used to see what DataFrame.to_sql() does with the DataFrame data before handing it off to the SQL engine.
import csv
import pandas as pd
import sqlite3

def dump_foo(conn):
    cur = conn.cursor()
    cur.execute("SELECT * FROM foo")
    rows = cur.fetchall()
    for row in rows:
        print(row)

conn = sqlite3.connect('example145.db')

csv_data = """1,01-01-2019,724
2,01-01-2019,233,436
3,01-01-2019,345
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454"""

with open('test145.csv', 'w') as f:
    f.write(csv_data)

with open('test145.csv') as csvfile:
    data = [row for row in csv.reader(csvfile)]

df = pd.DataFrame(data=data)

def checkit(table, conn, keys, data_iter):
    # pandas calls this instead of its own insert routine, so it shows
    # exactly what to_sql() would hand to the SQL engine.
    print("What pandas wants to put into sqlite3")
    for row in data_iter:
        print(row)

# note: if_exists="replace" replaces the table and does not affect the data
df.to_sql('foo', conn, if_exists="replace", method=checkit)
df.to_sql('foo', conn, if_exists="replace")

print("*** What went into sqlite3")
dump_foo(conn)
I am trying to fetch a data frame from a MySQL database.
my_db <- src_mysql(dbname = '****',
                   host = '****',
                   port = ****,
                   user = '****',
                   password = '****')
From this database (which shows as a list of 2 in the global environment) I want to extract a table.
w = src_tbls(my_db)[1]
But the above command returns a list, while I actually need a data frame, and converting this list to a data frame is taking a lot of time.
Can anyone suggest a way to extract a data frame directly from the database and reduce the total execution time of the code?
I'm not experienced with src_mysql(), but you might try the RODBC package.
This should give you a data frame of your table, and it might be faster, provided your SQL query itself isn't slow:
library(RODBC)
channel <- odbcConnect("your dsn as character string",
                       uid = "****",  # Username
                       pwd = "****",
                       believeNRows = FALSE)
w <- sqlQuery(channel, "SELECT * FROM YOUR_TABLE")
I have around four *.sql self-contained dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried installing and making a local database using InnoDB and importing the dump, but that seems too slow (I spent around 10 hours on that).
I directly read the file into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._

val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed dataframe so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the sql dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")

// Identify all tables and data in the sql dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println(" If there is a count mismatch between these values choose different substring " + tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. For that, I need to understand whether it can be done for even one table. Is there any .sql parser written for Scala Spark?
Is there a faster way of going about it? Can I read it directly into Hive from the self-contained .sql file?
UPDATE 1: I am writing the parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to Dataset-based code to use the SQL parser as suggested.
Is there any .sql parser written for Scala Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply the expr standard function. You may also use the SQL parser directly.
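As a minimal illustration of those entry points (shown here in PySpark for brevity; the literal values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("parserEntryPoints").getOrCreate()

# SparkSession.sql parses and executes a complete SQL statement.
df = spark.sql("SELECT 1 AS id, 'US' AS country")

# Dataset.selectExpr and the expr standard function parse SQL expressions.
df.selectExpr("upper(country) AS country_uc").show()
df.select(expr("id + 1").alias("next_id")).show()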
Shameless plug: you may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps (a rough PySpark sketch follows the list):
Create a class for each table.
Load files using textFile.
Filter out all the statements other than insert statements.
Then split the RDD using filter into multiple RDDs based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create an object.
Now convert RDDs to datasets.
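A rough PySpark sketch of those steps (the dump path is taken from the question; the table name, column names, and the assumed single-line INSERT INTO ... VALUES (...) format are assumptions, and a real dump needs a proper parser for quoted strings and embedded commas):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlDumpParser").getOrCreate()

# Step 2: load the dump as lines of text.
lines = spark.sparkContext.textFile("C:/Users/some_db.sql")

# Step 3: keep only the INSERT statements.
inserts = lines.filter(lambda l: l.lstrip().startswith("INSERT INTO"))

# Step 4: split into per-table RDDs by the table name in the statement.
my_table_inserts = inserts.filter(lambda l: "`my_table`" in l)

# Step 5: naively parse the VALUES tuple into a list of fields.
def parse_values(line):
    values = line.split("VALUES", 1)[1].strip().rstrip(";").strip("()")
    return [v.strip().strip("'") for v in values.split(",")]

rows = my_table_inserts.map(parse_values)

# Steps 1 and 6: instead of a class per table, name the (assumed) columns
# and convert to a DataFrame.
df = rows.toDF(["col1", "col2", "col3"])
df.show()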
I am using the following LOAD CSV Cypher statement to import a csv file with about 3.5m records, but it only imports about 3.2m, so about 300,000 records are not imported.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
MATCH (from_stop:Stop)-[r:RELATES]->(to_stop:Stop) WHERE toInt(line.route_id) in r.routes
CREATE (from_stop)-[:CONNECTS {ticket_id: ID(ticket)}]->(to_stop)
Note that the Stop nodes are already created in a separate import statement.
When I only created nodes without creating relationships, it was able to import all the data. The same import statement works fine with a smaller set of csv data in the same format.
I tried twice just to make sure it wasn't terminated accidentally.
Is there a node-to-relationship limit in Neo4j? Or what else could be the reason?
Neo4j version: 3.0.3; the size of the database directory is 5.31 GiB.
This is probably because whenever the MATCH does not succeed for a line, the entire query for that line (including the first CREATE) also fails.
On the other hand, the failure of an OPTIONAL MATCH would not abort the entire query for a line. Try this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
OPTIONAL MATCH (from:Stop)-[r:RELATES]->(to:Stop)
WHERE toInt(line.route_id) in r.routes
FOREACH(x IN CASE WHEN from IS NULL THEN NULL ELSE [1] END |
  CREATE (from)-[:CONNECTS {ticket_id: ID(ticket)}]->(to)
);
The FOREACH clause uses a somewhat roundabout technique to only CREATE the relationship if the OPTIONAL MATCH succeeded for a line.
How to write 10 million+ rows into a csv file from Vertica using Python?
When I tried to write into bulk_data.csv as follows, it got stuck after 200,000 rows.
import csv
import pyodbc

con = pyodbc.connect("DRIVER={Vertica};SERVER=***;DATABASE=***;UID=****;PWD=***")
cursor = con.cursor()
cursor.execute('SELECT * FROM ***')
header = map(lambda x: x[0], cursor.description)

with open('bulk_data.csv', 'w+') as f:
    f.write('\t'.join(header) + '\n')
    csv.writer(f, delimiter='\t', quoting=csv.QUOTE_MINIMAL, quotechar='"', lineterminator='\n').writerows(cursor)
The simple answer is that you don't write row by row for this amount of data. You use a COPY to process the data in bulk. If you're using Python, you may want to leverage one of the many Vertica-specific projects which allow for batch import, such as PyVertica from Spil Games.
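If you do stay with plain pyodbc and the csv module, a minimal sketch that at least streams the result set in fixed-size chunks, rather than the bulk COPY/PyVertica route described above, could look like this (the connection string and table name are placeholders from the question):
import csv
import pyodbc

con = pyodbc.connect("DRIVER={Vertica};SERVER=***;DATABASE=***;UID=****;PWD=***")
cursor = con.cursor()
cursor.execute("SELECT * FROM some_table")  # placeholder table name

with open("bulk_data.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerow([col[0] for col in cursor.description])  # header row
    while True:
        chunk = cursor.fetchmany(10000)  # fetch in bounded chunks
        if not chunk:
            break
        writer.writerows(chunk)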
I have a CSV file that has all entries quoted, i.e. with opening and closing quotes. When I import it into the database using copy_from, the database table contains quotes around the data, and where there is an empty entry I get quotes only, i.e. "" entries in the column.
Is there a way to tell copy_from to ignore quotes, so that when I import the file the text doesn't have quotes around it and empty entries are converted to Null?
Here is my code:
with open(source_file_path) as inf:
    cursor.copy_from(inf, table_name, columns=column_list, sep=',', null="None")
UPDATE:
I still haven't got a solution to the above, but for the sake of getting the file imported I went ahead and wrote the raw SQL and executed it via a SQLAlchemy connection and via psycopg2's cursor, as below, and both removed the quotes and put Null where there were empty entries.
sql = "COPY table_name (col1, col2, col3, col4) FROM '{}' DELIMITER ',' CSV HEADER".format(csv_file_path)
SQL Alchemy:
conn = engine.connect()
trans = conn.begin()
conn.execute(sql)
trans.commit()
conn.close()
Psycopg2:
conn = psycopg2.connect(pg_conn_string)
conn.set_isolation_level(0)
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cursor = conn.cursor()
cursor.execute(sql)
While I still wish the copy_from function would work, I am now wondering whether the above two approaches are as fast as copy_from, and if so, which of the two is faster?
Probably a better approach would be to use the built-in csv library to read the CSV file and transfer the rows to the database. A corollary to the UNIX philosophy of "do one thing and do it well" is to use the appropriate (specialized) tool for the job. What's good about the csv library is that you have customization options for how to read the CSV, like quoting characters and skipping initial rows (see the documentation).
Assuming a simple CSV file with two columns: an integer "ID", and a quoted string "Country Code":
"ID", "Country Code"
1, "US"
2, "UK"
and a declarative SQLAlchemy target table:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine("postgresql+psycopg2://<REMAINDER_OF_YOUR_ENGINE_STRING>")
Base = declarative_base(bind=engine)
class CountryTable(Base):
    __tablename__ = 'countries'

    id = Column(Integer, primary_key=True)
    country = Column(String)
you can transfer the data by:
import csv
from sqlalchemy.orm import sessionmaker
from your_model_module import engine, CountryTable
Session = sessionmaker(bind=engine)
with open("path_to_your.csv", "rb") as f:
reader = csv.DictReader(f)
session = Session()
for row in reader:
country_record = CountryTable(id=row["ID"], country=row["Country Code"])
session.add(country_record)
session.commit()
session.close()
This solution is longer than a one-line .copy_from call, but it gives you better control without having to dig through the code or documentation of wrapper/convenience functions like .copy_from. You can specify the columns to be transferred and handle exceptions at row level, since data is transferred row by row. Rows can instead be transferred in a batch with a single commit:
with open("path_to_your.csv", "rb") as f:
    reader = csv.DictReader(f)
    session = Session()
    session.add_all([
        CountryTable(id=row["ID"], country=row["Country Code"]) for row in reader
    ])
    session.commit()
    session.close()
To compare the execution time of different approaches to your problem, use the timeit module (or rather its command-line form) that comes with Python. Caution however: it's better to be correct than fast.
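For example, a minimal sketch of the library form (the module and function names here are hypothetical placeholders for whichever approach you wrap):
import timeit

# Hypothetical: load_rows would wrap one of the approaches above.
duration = timeit.timeit(
    "load_rows('path_to_your.csv')",
    setup="from your_loader_module import load_rows",
    number=3,  # run it a few times and report the total
)
print(duration)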
EDIT:
I was trying to figure out where .copy_from is implemented, as I hadn't used it before. It turns out to be a psycopg2-specific convenience function. It does not fully support reading CSV files, only file-like objects; the only customization argument applicable to CSVs is the separator, and it does not understand quoting characters.
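As a side note, psycopg2 also provides copy_expert, which accepts a full COPY ... CSV statement and therefore does handle quoting. A minimal sketch for the quoted-CSV case from the question (connection string, table and column names are placeholders; FORCE_NULL requires PostgreSQL 9.4+):
import psycopg2

conn = psycopg2.connect("dbname=*** user=*** password=***")
cur = conn.cursor()

copy_sql = """
    COPY table_name (col1, col2, col3, col4)
    FROM STDIN WITH (FORMAT csv, HEADER true,
                     FORCE_NULL (col1, col2, col3, col4))
"""

with open("path_to_your.csv") as f:
    # copy_expert streams the file through a server-side COPY ... CSV, so
    # quotes are stripped; FORCE_NULL makes quoted empty fields ("") match
    # the (empty) null string and load as NULL.
    cur.copy_expert(copy_sql, f)

conn.commit()
conn.close()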