I have a CSV file in which every entry is quoted, i.e. wrapped in opening and closing quotes. When I import it into the database using copy_from, the table keeps the quotes around the data, and where an entry is empty I get a quote-only value ("") in the column.
Is there a way to tell copy_from to ignore the quotes, so that the imported text has no quotes around it and empty entries are converted to NULL?
Here is my code:
with open(source_file_path) as inf:
    cursor.copy_from(inf, table_name, columns=column_list, sep=',', null="None")
UPDATE:
I still haven't found a solution to the above, but for the sake of getting the file imported I went ahead and wrote the raw SQL and executed it through both a SQLAlchemy connection and a psycopg2 cursor, as shown below. Both removed the quotes and put NULL where there were empty entries.
sql = "COPY table_name (col1, col2, col3, col4) FROM '{}' DELIMITER ',' CSV HEADER".format(csv_file_path)
SQLAlchemy:
conn = engine.connect()
trans = conn.begin()
conn.execute(sql)
trans.commit()
conn.close()
Psycopg2:
conn = psycopg2.connect(pg_conn_string)
conn.set_isolation_level(0)
cursor = conn.cursor()
cursor.execute(sql)
While I still wish the copy_from function would work, I am now wondering whether the two approaches above are as fast as copy_from and, if so, which of the two is faster.
A better approach is probably to use the built-in csv library to read the CSV file and transfer the rows to the database. A corollary of the UNIX philosophy of "do one thing and do it well" is to use the appropriate, specialized tool for the job. What is good about the csv library is that it gives you options for how the CSV is read, such as quote characters and skipping initial spaces (see the documentation).
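For instance, here is a minimal sketch of those reader options, assuming a hypothetical quoted_data.csv in which every field is quoted and a space follows each comma:
import csv

# QUOTE_ALL mirrors a file in which every field is quoted; skipinitialspace
# swallows the space after each comma so it does not end up in the values.
with open("quoted_data.csv", newline="") as f:
    reader = csv.reader(f, quotechar='"', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    header = next(reader)  # skip the header row
    for row in reader:
        print(row)         # fields come back without their surrounding quotes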
Assuming a simple CSV file with two columns: an integer "ID", and a quoted string "Country Code":
"ID", "Country Code"
1, "US"
2, "UK"
and a declarative SQLAlchemy target table:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine("postgresql+psycopg2://<REMAINDER_OF_YOUR_ENGINE_STRING>")
Base = declarative_base(bind=engine)
class CountryTable(Base):
    __tablename__ = 'countries'
    id = Column(Integer, primary_key=True)
    country = Column(String)
you can transfer the data by:
import csv
from sqlalchemy.orm import sessionmaker
from your_model_module import engine, CountryTable
Session = sessionmaker(bind=engine)
with open("path_to_your.csv", "rb") as f:
reader = csv.DictReader(f)
session = Session()
for row in reader:
country_record = CountryTable(id=row["ID"], country=row["Country Code"])
session.add(country_record)
session.commit()
session.close()
This solution is longer than a one-line .copy_from call, but it gives you better control without having to dig through the code or documentation of wrapper/convenience functions like .copy_from. You can specify selected columns to be transferred and handle exceptions at the row level, since the data is transferred row by row, each with its own commit. Rows can instead be transferred in a batch with a single commit:
with open("path_to_your.csv", "rb") as f:
reader = csv.DictReader(f)
session = Session()
session.add_all([
CountryTable(id=row["ID"], country=row["Country Code"]) for row in reader
])
session.commit()
session.close()
To compare the execution time of the different approaches to your problem, use the timeit module (or its command-line interface) that comes with Python. A word of caution, however: it is better to be correct than fast.
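A minimal sketch of such a comparison, where load_with_copy_from and load_with_raw_copy are hypothetical wrappers around the two import strategies being measured:
import timeit

# Run each hypothetical loader a few times and report the average wall time.
for name in ("load_with_copy_from", "load_with_raw_copy"):
    elapsed = timeit.timeit(stmt=name + "()",
                            setup="from my_loaders import " + name,
                            number=3)
    print(name, elapsed / 3)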
EDIT:
I was trying to figure out where .copy_from is implemented, as I hadn't used it before. It turns out to be a psycopg2-specific convenience function. It does not fully support reading CSV files, only file-like objects: the only customization argument applicable to CSVs is the separator, and it does not understand quote characters.
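If you want to stay in Python rather than hand the file path to the server, psycopg2 also offers copy_expert, which takes a full COPY statement reading from STDIN, so quoting and NULL handling can be delegated to COPY's CSV mode. A minimal sketch, assuming the same table and columns as the raw SQL above and a PostgreSQL server recent enough (9.4+) to support FORCE_NULL for turning quoted empty strings into NULLs:
with open(source_file_path) as inf:
    cursor.copy_expert(
        """
        COPY table_name (col1, col2, col3, col4)
        FROM STDIN
        WITH (FORMAT csv, HEADER true,
              FORCE_NULL (col1, col2, col3, col4))
        """,
        inf,
    )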
I have a directory full of CSVs. A script I use loads each CSV in a loop and corrects commonly known errors in several columns before the data is imported into an SQL database. The corrections I want to apply are stored in a JSON file so that a user can freely add or remove corrections on the fly without altering the main script.
My script works fine for one value correction per column per CSV. However, I have noticed that two or more columns per CSV now contain errors, and more than one correction per column is now required.
Here is relevant code:
import glob as gl
import json
import re

import pandas as pd

with open('lookup.json') as f:
    translation_table = json.load(f)

for filename in gl.glob("(Compacted)_*.csv"):
    df = pd.read_csv(filename, dtype=object)
    #... Some other enrichment...
    # Extract the file "key" with a regular expression (regex)
    filekey = re.match(r"^\(Compacted\)_([A-Z0-9-]+_[0-9A-z]+)_[0-9]{8}_[0-9]{6}.csv$", filename).group(1)
    # Use the translation tables to apply any error fixes
    if filekey in translation_table["error_lookup"]:
        tablename = translation_table["error_lookup"][filekey]
        df[tablename[0]] = df[tablename[0]].replace({tablename[1]: tablename[2]})
    else:
        pass
And here is the lookup.json file:
{
  "error_lookup": {
    "T7000_08": ["MODCT", "C00", -5555],
    "T7000_17": ["MODCT", "C00", -5555],
    "T7000_20": ["CLLM5", "--", -5555],
    "T700_13": ["CODE", "100T", -5555]
  }
}
For example, if a CSV whose key is "T7000_20" has a new erroneous value of ";;" in column CLLM5, how can I ensure that values containing "--" and ";;" are both replaced with -5555? And how do I account for another column in the same CSV?
Can you change the JSON file? The example below would edit Column A (old1 → new1 and old2 → new2) and would make similar changes to Column B:
{
  "error_lookup": {
    "T7000_20": {
      "colA": ["old1", "new1", "old2", "new2"],
      "colB": ["old3", "new3", "old4", "new4"]
    }
  }
}
The JSON parsing gets somewhat more complex, in order to handle both the current use case and the new requirements.
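A minimal sketch of the adjusted lookup step, reusing df, filekey and translation_table from the question; each column's list is paired up two values at a time into an old-to-new mapping:
if filekey in translation_table["error_lookup"]:
    for colname, pairs in translation_table["error_lookup"][filekey].items():
        # e.g. ["old1", "new1", "old2", "new2"] -> {"old1": "new1", "old2": "new2"}
        mapping = dict(zip(pairs[0::2], pairs[1::2]))
        df[colname] = df[colname].replace(mapping)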
I have a spreadsheet which really has only one complicated table. I basically convert the spreadsheet to a CSV and use a Groovy script to generate the INSERT statements.
However, I cannot do this with a table that has 28 fields, because the data within some of the fields on the spreadsheet makes the conversion to CSV even more complicated: the fields in the new CSV are not separated properly, or my script has not accounted for them.
Does anyone have any suggestions on a better approach to do this? Thanks.
Have a look at the LOAD DATA INFILE statement. It will help you import data from the CSV file into a table.
This is a recurrent question on stackoverflow. Here is an updated answer.
There are actually several ways to import an Excel file into a MySQL database, with varying degrees of complexity and success.
Excel2MySQL or Navicat utilities. Full disclosure: I am the author of Excel2MySQL. These two utilities aren't free, but they are the easiest option and have the fewest limitations. They also include additional features to help with importing Excel data into MySQL. For example, Excel2MySQL automatically creates your table and automatically optimizes field data types like dates, times, floats, etc. If you're in a hurry or can't get the other options to work with your data, then these utilities may suit your needs.
LOAD DATA INFILE: This popular option is perhaps the most technical and requires some understanding of MySQL command execution. You must manually create your table before loading and use appropriately sized VARCHAR field types. Therefore, your field data types are not optimized. LOAD DATA INFILE has trouble importing large files that exceed 'max_allowed_packet' size. Special attention is required to avoid problems importing special characters and foreign Unicode characters. For example, I recently used it to import a csv file named test.csv.
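A minimal sketch of what that kind of import can look like when driven from Python, assuming pymysql, an existing table named test whose columns match the file, and local_infile enabled on both the client and the server:
import pymysql

# The LOAD DATA LOCAL INFILE statement itself is the important part; the
# connection details below are placeholders.
conn = pymysql.connect(host="localhost", user="root", password="secret",
                       database="mydb", local_infile=True)
try:
    with conn.cursor() as cur:
        cur.execute("""
            LOAD DATA LOCAL INFILE 'test.csv'
            INTO TABLE test
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
    conn.commit()
finally:
    conn.close()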
phpMyAdmin: Select your database first, then select the Import tab. phpMyAdmin will automatically create your table and size your VARCHAR fields, but it won't optimize the field types. phpMyAdmin has trouble importing large files that exceed 'max_allowed_packet' size.
MySQL for Excel: This is a free Excel Add-in from Oracle. This option is a bit tedious because it uses a wizard and the import is slow and buggy with large files, but this may be a good option for small files with VARCHAR data. Fields are not optimized.
For comma-separated values (CSV) files, the results view panel in Workbench has an "Import records from external file" option that imports CSV data directly into the result set. Execute that and click "Apply" to commit the changes.
For Excel files, consider using the official MySQL for Excel plugin.
A while back I answered a very similar question on the EE site and offered the following block of Perl as a quick and dirty example of how you could load an Excel sheet directly into MySQL, bypassing the need to export/import via CSV, hopefully preserving more of those special characters, and eliminating the need to worry about escaping the content.
#!/usr/bin/perl -w
# Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MySQL DB, of the same name as the Excel(.xls).
# The worksheet names are mapped to the table names, and the column names to column names.
# Assumes each sheet is named and that the first ROW on each sheet contains the column(field) names.
#
use strict;
use Spreadsheet::ParseExcel;
use DBI;
use Tie::IxHash;
die "You must provide a filename to $0 to be parsed as an Excel file" unless #ARGV;
my $sDbName = $ARGV[0];
$sDbName =~ s/\.xls//i;
my $oExcel = new Spreadsheet::ParseExcel;
my $oBook = $oExcel->Parse($ARGV[0]);
my $dbh = DBI->connect("DBI:mysql:database=$sDbName;host=192.168.123.123","root", "xxxxxx", {'RaiseError' => 1,AutoCommit => 1});
my ($sTableName, %hNewDoc, $sFieldName, $iR, $iC, $oWkS, $oWkC, $sSql);
print "FILE: ", $oBook->{File} , "\n";
print "DB: $sDbName\n";
print "Collection Count: ", $oBook->{SheetCount} , "\n";
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
{
  $oWkS = $oBook->{Worksheet}[$iSheet];
  $sTableName = $oWkS->{Name};
  print "Table(WorkSheet name):", $sTableName, "\n";
  for(my $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++)
  {
    tie ( %hNewDoc, "Tie::IxHash");
    for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
    {
      $sFieldName = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
      $sFieldName =~ s/[^A-Z0-9]//gi; #Strip non alpha-numerics from the Column name
      $oWkC = $oWkS->{Cells}[$iR][$iC];
      $hNewDoc{$sFieldName} = $dbh->quote($oWkC->Value) if($oWkC && $sFieldName);
    }
    if ($iR == $oWkS->{MinRow}){
      #eval { $dbh->do("DROP TABLE $sTableName") };
      $sSql = "CREATE TABLE IF NOT EXISTS $sTableName (".(join " VARCHAR(512), ", keys (%hNewDoc))." VARCHAR(255))";
      #print "$sSql \n\n";
      $dbh->do("$sSql");
    } else {
      $sSql = "INSERT INTO $sTableName (".(join ", ",keys (%hNewDoc)).") VALUES (".(join ", ",values (%hNewDoc)).")\n";
      #print "$sSql \n\n";
      eval { $dbh->do("$sSql") };
    }
  }
  print "Rows inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";
}
# Disconnect from the database.
$dbh->disconnect();
Note:
Change the connection ($dbh) string to suit, and if needed adjust the user-id and password arguments.
If you need XLSX support, a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively, it only takes a few lines of code to detect the file type and call the appropriate library.
The above is a simple hack that assumes everything in a cell is a string/scalar. If preserving types is important, a small function with a few regexps can be used in conjunction with a few if statements to ensure numbers and dates remain in the applicable format when written to the DB.
The above code depends on a number of CPAN modules, which you can install (assuming outbound FTP access is permitted) via:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util File::Basename DBD::mysql
It should return something along the following lines (it is rather slow, due to the auto commit):
# ./Excel2mysql.pl test.xls
FILE: test.xls
DB: test
Collection Count: 1
Table(WorkSheet name):Sheet1
Rows inserted(Rows):9892
I am importing a .csv file with 3300 rows of data via the following:
myCSVfile = pd.read_csv(csv_file)
myCSVfile.to_sql(con=engine, name='foo', if_exists='replace')
Once successfully imported, I do a "select * from ..." query on my table, which returns 3100 rows, so where are the missing 200 rows?
I am assuming there is corrupt data which cannot be read in, which I further assume is then skipped over by pandas. However there is no warning, log or message to explicitly say so. The script executes as normal.
Has anyone experienced similar problems, or am I missing something completely obvious?
Although the question does not specify the engine, let's assume it is sqlite3.
The following re-runnable code shows that DataFrame.to_sql() creates a sqlite3 table and places an index on it, populated from the index of the DataFrame.
Taking the question's code literally, the csv should import into the DataFrame with a RangeIndex, which will be unique ordinals. Because of this, one should be surprised if the number of rows in the csv does not match the number of rows loaded into the sqlite3 table.
So there are two things to do. First, verify that the csv is being imported correctly. This is the likely problem, since poorly formatted csv files originating from hand-edited spreadsheets frequently trip up code for a variety of reasons, but that is impossible to answer here because we do not know the input data.
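A quick sanity check along those lines, reusing csv_file and myCSVfile from the question and assuming the file has a single header row:
# Compare the raw line count of the file with the number of rows pandas parsed.
with open(csv_file) as f:
    raw_rows = sum(1 for _ in f) - 1  # subtract the header line
print(raw_rows, len(myCSVfile))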
Second, what DataFrame.to_sql() itself does should be ruled out. For that, the method parameter can be passed in; it can be used to see what DataFrame.to_sql() does with the DataFrame data before handing it off to the SQL engine.
import csv
import pandas as pd
import sqlite3
def dump_foo(conn):
    cur = conn.cursor()
    cur.execute("SELECT * FROM foo")
    rows = cur.fetchall()
    for row in rows:
        print(row)
conn = sqlite3.connect('example145.db')
csv_data = """1,01-01-2019,724
2,01-01-2019,233,436
3,01-01-2019,345
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
4,01-01-2019,803,933,943,923,954
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454
5,01-01-2019,454"""
with open('test145.csv', 'w') as f:
    f.write(csv_data)

with open('test145.csv') as csvfile:
    data = [row for row in csv.reader(csvfile)]

df = pd.DataFrame(data = data)
def checkit(table, conn, keys, data_iter):
    print("What pandas wants to put into sqlite3")
    for row in data_iter:
        print(row)
# note, if_exists replaces the table and does not affect the data
df.to_sql('foo', conn, if_exists="replace", method=checkit)
df.to_sql('foo', conn, if_exists="replace")
print "*** What went into sqlite3"
dump_foo(conn)
I have around four self-contained *.sql dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried setting up a local database using InnoDB and importing the dumps, but that was too slow (I spent around 10 hours on it).
I then read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
var sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
var myQueryFile = sc.textFile("C:/Users/some_db.sql")
//Convert this to indexed dataframe so you can parse multiple line create / data statements.
//This will also show you the structure of the sql dump for your usecase.
var myQueryFileDF = myQueryFile.toDF.withColumn("index",monotonically_increasing_id()).withColumnRenamed("value","text")
// Identify all tables and data in the sql dump along with their indexes
var tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
var tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println(" If there is a count mismatch between these values choose different substring "+ tableStructures.count()+ " " + tableStructureEnds.count())
var tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. To get there, I first need to understand whether this can be done for even one table. Is there a .sql parser written for Scala Spark?
Is there a faster way of going about it? Can I read the self-contained .sql file directly into Hive?
UPDATE 1: I am writing the parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to Dataset-based code to use the SQL parser, as suggested.
Is there any .sql parser written for Scala Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply expr standard function. You may also use the SQL parser directly.
shameless plug You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps (a rough sketch follows the list):
Create a class for each table.
Load the files using textFile.
Filter out all statements other than insert statements.
Then split the RDD using filter into multiple RDDs based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create objects.
Now convert the RDDs to datasets.
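A rough PySpark sketch of those steps, under heavy assumptions: each INSERT sits on a single line with no column list, values contain no embedded commas or quotes, and there is a single hypothetical table users(id, name). The same shape translates directly to Scala:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dumpParser").getOrCreate()

lines = spark.sparkContext.textFile("C:/Users/some_db.sql")
inserts = lines.filter(lambda l: l.startswith("INSERT INTO "))

# Keep only the hypothetical `users` table and strip the SQL wrapping,
# e.g. INSERT INTO users VALUES (1,'Ann');  ->  "1,'Ann'"
user_rows = (inserts
             .filter(lambda l: "INSERT INTO users" in l)
             .map(lambda l: l[l.index("(") + 1:l.rindex(")")])
             .map(lambda v: v.split(","))
             .map(lambda f: Row(id=int(f[0]), name=f[1].strip().strip("'"))))

users_ds = spark.createDataFrame(user_rows)
users_ds.show()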
...I really thought this would be a well-traveled path.
I want to create the DDL statement in Hive (or SQL for that matter) by inspecting the first record in a CSV file that exposes (as is often the case) the column names.
I've seen a variety of near answers to this issue, but not many that can be automated or replicated at scale.
I created the following code to handle the task, but I fear that it has some issues:
#!/usr/bin/python
import sys
import csv
# get file name (and hence table name) from command line
# exit with usage if no suitable argument
if len(sys.argv) < 2:
    sys.exit('Usage: ' + sys.argv[0] + ': input CSV filename')
ifile = sys.argv[1]
# emit the standard invocation
print 'CREATE EXTERNAL TABLE ' + ifile + ' ('
with open(ifile + '.csv') as inputfile:
    reader = csv.DictReader(inputfile)
    for row in reader:
        k = row.keys()
        sprung = len(k)
        latch = 0
        for item in k:
            latch += 1
            dtype = '` STRING' if latch == sprung else '` STRING,'
            print '`' + item.strip() + dtype
        break

print ')\n'
print "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
print "LOCATION 'replacethisstringwith HDFS or S3 location'"
The first is that it simply types every column as a STRING. (I suppose that, coming from CSV, that's a forgivable sin. And of course one could doctor the resulting output to set the data types more accurately.)
The second is that it does not sanitize the potential column names for characters not allowed in Hive table column names. (I easily broke it immediately by reading in a data set where the column names routinely had an apostrophe as data. This caused a mess.)
The third is that the data location is tokenized. I suppose with just a little more coding time, it could be passed on the command line as an argument.
My question is -- why would we need to do this? What easy approach to doing this am I missing?
(BTW: no bonus points for referencing the CSV Serde - I think that's only available in Hive 14. A lot of us are not that far along yet with our production systems.)
Regarding the first issue (all columns are typed as strings), this is actually the current behavior even if the table were being processed by something like the CSVSerde or RegexSerDe. Depending on whether the particulars of your use case can tolerate the additional runtime latency, one possible approach is to define a view based upon your external table that dynamically recasts the columns at query time, and direct queries against the view instead of the external table. Something like:
CREATE VIEW my_view AS
SELECT
  CAST(col1 AS INT) AS col1,
  CAST(col2 AS STRING) AS col2,
  CAST(col3 AS INT) AS col3,
  ...
FROM my_external_table;
For the second issue (sanitizing column names), I'm inferring your Hive installation is 0.12 or earlier (0.13 supports any unicode character in a column name). If you import the re regex module, you can perform that scrubbing in your Python with something like the following:
for item in k:
    ...
    print '`' + re.sub(r'\W', '', item.strip()) + dtype
That should get rid of any non-alphanumeric/underscore characters, which was the pre-0.13 expectation for Hive column names. By the way, I don't think you need the surrounding backticks anymore if you sanitize the column name this way.
As for the third issue (external table location), I think specifying the location as a command line parameter is a reasonable approach. One alternative may be to add another "metarow" to your data file that specifies the location somehow, but that would be a pain if you are already sitting on a ton of data files - personally I prefer the command line approach.
The Kite SDK has functionality to infer a CSV schema with the names from the header record and the types from the first few data records, and then create a Hive table from that schema. You can also use it to import CSV data into that table.