I have a large CSV file (5.4GB) of data. It's a table with 6 columns and a lot of rows. I want to import it into MySQL across several tables. Additionally, I have to do some transformations on the data before import (e.g. parse a cell and insert the parts into several table values). Now, I could write a script that does a transformation and inserts a row at a time, but it would take weeks to import the data. I know there is LOAD DATA INFILE for MySQL, but I am not sure how, or if, I can do the needed transformations in SQL.
Any advice how to proceed?
In my limited experience you won't want to use the Django ORM for something like this. It will be far too slow. I would write a Python script to operate on the CSV file, using Python's csv library. And then use the native MySQL facility LOAD DATA INFILE to load the data.
If the Python script to massage the CSV file is too slow you may consider writing that part in C or C++, assuming you can find a decent CSV library for those languages.
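As a minimal sketch of the preprocessing step described above, the script below streams the source CSV once and writes one output CSV per target table, so that each output can then be loaded with LOAD DATA INFILE. The column layout, the ';' separator inside the second cell, and the function name split_rows are assumptions made for illustration; they are not taken from the question.

```python
import csv
import io

def split_rows(reader, parent_writer, child_writer):
    """Stream rows once; write one parent row and N child rows per input row."""
    count = 0
    for row_id, row in enumerate(reader, start=1):
        name, packed = row[0], row[1]
        parent_writer.writerow([row_id, name])
        # Hypothetical transformation: the second cell packs several values
        # separated by ';', each of which becomes a row in the child table.
        for part in packed.split(';'):
            child_writer.writerow([row_id, part.strip()])
        count += 1
    return count

# Small in-memory demonstration; real use would pass open file handles.
source = io.StringIO("alice,red;green\nbob,blue\n")
parents, children = io.StringIO(), io.StringIO()
n = split_rows(csv.reader(source),
               csv.writer(parents, lineterminator="\n"),
               csv.writer(children, lineterminator="\n"))
```

Because the rows are processed one at a time, memory use stays flat regardless of the size of the input file.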
I am trying to merge multiple small JSON files (about 500,000 files of 400-500 bytes each, which will no longer change) into one big CSV file, using AWS Lambda. I have a job that works something like this:
Use the s3.listobjects() to fetch keys
Use s3.getObject() to fetch each JSON file (is there a better way to do this?)
Create a CSV file in-memory (what's the best way to do this in nodejs?)
Upload that file in S3
I'd love to know if there's a better way to go about doing this. Thanks!
I would recommend using Amazon Athena.
It allows you to run SQL commands across multiple data files simultaneously (including JSON) and can create output files with a CREATE TABLE AS SELECT (CTAS) query; see "Creating a Table from Query Results (CTAS)" in the Amazon Athena documentation.
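If the merge is done in code rather than with Athena, the "create a CSV file in memory" step from the question could look roughly like this. It is sketched in Python rather than Node.js, the function name json_docs_to_csv and the field names are hypothetical, and fetching the objects from S3 is left out entirely.

```python
import csv
import io
import json

def json_docs_to_csv(docs, fieldnames):
    """Merge an iterable of JSON strings into one CSV string, built in memory."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not in fieldnames;
    # keys missing from a document are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        writer.writerow(json.loads(doc))
    return buffer.getvalue()

# Two made-up documents standing in for the S3 object bodies.
docs = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "extra": true}']
csv_text = json_docs_to_csv(docs, ["id", "name"])
```

At the sizes mentioned in the question (500,000 files of up to 500 bytes), the raw input is roughly 250 MB, so whether a single in-memory buffer is acceptable depends on the memory available to the function.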
Pardon my simple question but I'm relatively new to Spark/Hadoop.
I'm trying to load a bunch of small CSV files into Apache Spark. They're currently stored in S3, but I can download them locally if that simplifies things. My goal is to do this as efficiently as possible. It seems like it would be a shame to have some single-threaded master downloading and parsing a bunch of CSV files while my dozens of Spark workers sit idly. I'm hoping there's an idiomatic way to distribute this work.
The CSV files are arranged in a directory structure that looks like:
2014/01-01/fileabcd.csv
2014/01-01/filedefg.csv
...
I have two years of data, with directories for each day, and a few hundred CSVs inside of each. All of those CSVs should have an identical schema, but it's of course possible that one CSV is awry and I'd hate for the whole job to crash if there are a couple problematic files. Those files can be skipped as long as I'm notified in a log somewhere that that happened.
It seems that every Spark project I have in mind is in this same form and I don't know how to solve it. (e.g. trying to read in a bunch of tab-delimited weather data, or reading in a bunch of log files to look at those.)
What I've Tried
I've tried both SparkR and the Scala libraries. I don't really care which language I need to use; I'm more interested in the correct idioms/tools to use.
Pure Scala
My original thought was to enumerate and parallelize the list of all year/mm-dd combinations so that I could have my Spark workers all processing each day independently (download and parse all CSV files, then stack them on top of each other (unionAll()) to reduce them). Unfortunately, downloading and parsing the CSV files using the spark-csv library can only be done in the "parent"/master job, and not from each child, as Spark doesn't allow job nesting. So that won't work as long as I want to use the Spark libraries to do the importing/parsing.
Mixed-Language
You can, of course, use the language's native CSV parsing to read in each file then "upload" them to Spark. In R, this is a combination of some package to get the file out of S3 followed by a read.csv, and finishing off with a createDataFrame() to get the data into Spark. Unfortunately, this is really slow and also seems backwards to the way I want Spark to work. If all my data is piping through R before it can get into Spark, why bother with Spark?
Hive/Sqoop/Phoenix/Pig/Flume/Flume Ng/s3distcp
I've started looking into these tailored tools and quickly got overwhelmed. My understanding is that many/all of these tools could be used to get my CSV files from S3 into HDFS.
Of course it would be faster to read my CSV files in from HDFS than S3, so that solves some portion of the problem. But I still have tens of thousands of CSVs that I need to parse and am unaware of a distributed way to do that in Spark.
So right now (Spark 1.4) SparkR has support for JSON or Parquet file structures. CSV files can be parsed, but then the Spark context needs to be started with an extra jar (which needs to be downloaded and placed in the appropriate folder; I have never done this myself, but my colleagues have).
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
There is more information in the docs. I expect that a newer spark release would have more support for this.
If you don't do this you'll need to either resort to a different file structure or use python to convert all your files from .csv into .parquet. Here is a snippet from a recent python talk that does this.
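from pyspark.sql import Row

# s3_paths is assumed to be defined elsewhere; 1200 is the minimum number of partitions
data = sc.textFile(s3_paths, 1200).cache()

def caster(x):
    # Turn a list of split fields into a Row with named columns
    return Row(colname1=x[0], colname2=x[1])

df_rdd = data\
    .map(lambda x: x.split(','))\
    .map(caster)

ddf = sqlContext.inferSchema(df_rdd).cache()
ddf.write.save('s3n://<bucket>/<filename>.parquet')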
Also, how big is your dataset? You may not even need Spark for analysis. Note also that, as of right now:
SparkR has only DataFrame support.
no distributed machine learning yet.
for visualisation you will need to convert a distributed dataframe back into a normal one if you want to use libraries like ggplot2.
if your dataset is no larger than a few gigabytes, then the extra bother of learning spark might not be worthwhile yet
it's modest now, but you can expect more from the future
I've run into this problem before (but w/ reading a large qty of Parquet files) and my recommendation would be to avoid dataframes and to use RDDs.
The general idiom used was:
Read in a list of the files, with each file being a line (in the driver). The expected output here is a list of strings.
Parallelize the list of strings and map over them with a custom CSV reader, with the return being a list of case classes.
You can also use flatMap if at the end of the day you want a data structure like List[weather_data] that could be written out to Parquet or a database.
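The "custom CSV reader" in the idiom above could be sketched as follows. This is a plain Python function, not the Scala code the answer refers to: a namedtuple stands in for the case class, the two-column schema is hypothetical, and the file is read with open() rather than from S3. A file that fails to parse is logged and skipped as a whole, which matches the requirement in the question.

```python
import csv
import logging
from collections import namedtuple

log = logging.getLogger("csv_loader")

# Hypothetical two-column schema, for illustration only.
WeatherData = namedtuple("WeatherData", ["station", "temperature"])

def parse_csv_file(path):
    """Parse one file into records; log and skip the whole file on any error."""
    try:
        with open(path, newline="") as handle:
            # Unpacking fails with ValueError if a row has the wrong column count.
            return [WeatherData(station, float(temperature))
                    for station, temperature in csv.reader(handle)]
    except (OSError, ValueError, csv.Error) as exc:
        log.warning("skipping %s: %s", path, exc)
        return []
```

With a SparkContext sc and a list of paths built in the driver, the function could then be applied as sc.parallelize(paths).flatMap(parse_csv_file), so that each worker reads and parses its own share of the files. Reading directly from S3 would need an S3 client call in place of open().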
I have a CSV file consisting of 78,000 records. I'm using smarter_csv (https://github.com/tilo/smarter_csv) to parse the CSV file. I want to import that into a MySQL database from my Rails app. I have the following two questions:
What would be the best approach to quickly importing such a large data set into MySQL from my Rails app? Would using resque or sidekiq to create multiple workers be a good idea?
I need to insert this data into a given table which is present in multiple databases. In Rails, I have a model talk to only one database. So how can I scale the solution to talk to multiple MySQL databases from my model?
Thank You
One way would be to use the native interface of the database application itself for importing and exporting; it would be optimised for that specific purpose.
For MySQL, mysqlimport provides that interface. Note that the import can also be done as an SQL statement, and that this executable provides a much saner interface to the underlying SQL command.
As far as implementation goes, if this is a frequent import exercise, the sidekiq/resque/cron job is the best possible approach.
[EDIT]
The SQL command referred to above is the LOAD DATA INFILE as the other answer points out.
Performance-wise, probably the best method is to use MySQL's LOAD DATA INFILE syntax and execute an import command on each database. This requires the data file to be local to each database instance.
As the other answer suggests, mysqlimport can be used to ease the import as the LOAD DATA INFILE statement syntax is highly customisable and can deal with many data formats.
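To illustrate the "one import command per database" idea, here is a small sketch that only builds the LOAD DATA INFILE statement text for each database. The database names, table name, column names, and file path are all hypothetical, the CSV is assumed to be comma-separated with a header row, and actually executing the statements would need a MySQL client library, which is not shown. The values are interpolated directly into the SQL, so they are assumed to come from trusted configuration.

```python
def build_load_statement(csv_path, table, columns):
    """Return a LOAD DATA INFILE statement for a comma-separated file with a header row."""
    column_list = ", ".join(columns)
    return (
        f"LOAD DATA INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES "
        f"({column_list})"
    )

# Hypothetical databases that all contain the same table.
databases = ["shard_a", "shard_b"]
statements = {
    db: build_load_statement("/tmp/records.csv", f"{db}.records", ["id", "name"])
    for db in databases
}
```

Each statement would then be run once against the corresponding database instance, with the file already copied to a location that instance can read.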
I'm using Firebird database and I need to load Excel file into a database table. I need a tool that does this well. I tried some I found on Google, but all of them have some bugs.
Since Excel data is not created by me, it would be good if it could scan the file and discover what kind of data is inside and suggest a table to be created in the database.
Also, it would be nice if I could compare the file against the data that is already in the database table, and I can pick which data to load and which not.
Tools that load CSV files are also fine, I can "Save as" CSV from Excel before loading.
Well, if you can use CSV, then I guess XMLWizard is the right tool for you. It can load a CSV file and compare it with database data. And you can select the changes you wish to make to the table.
Don't let the name fool you: it does work with XML, but it also works very well with CSV files. And it can also estimate the column datatypes and offer a CREATE TABLE statement for your file.
Have you tried FSQL?
It's freeware, very similar to Firebird's standard ISQL, but with some extra features, such as importing data from CSV files.
I've used it with DBF files and it worked fine.
There is also EMS Data import tool for Firebird and Interbase
http://www.sqlmanager.net/en/products/ibfb/dataimport
Not free, though, but it accepts a big variety of formats, including CSV and Excel.
EDIT
Another similar payware tool is Firebird Data Wizard http://www.sqlmaestro.com/products/firebird/datawizard/
There are some online tools which can help you to generate DDL/DML scripts from csv header/sample dump file, check out: http://www.convertcsv.com/csv-to-sql.htm
You can then use sql-workbench's Data Pumper or the WbImport tool from the command line.
Orbada has a GUI which also supports importing CSV files.
The free edition of DBeaver also supports importing CSV out of the box.
BULK INSERT
Another way is to build, in new Excel cells, a formula over the data you want to export. The formula formats each value as a string, padded to the length of the corresponding field in Firebird. You can then copy all of these cells from Excel and paste them into a text editor, which makes it possible to use the BULK INSERT strategy in Firebird.
See more details in http://www.firebirdfaq.org/faq209/
The problem is if you have blob or null data to import, so check whether you have this kind of value and whether this approach suits you.
If you have formatted data in a text file, BULK INSERT will be a quick way.
Hint: You can also disable the triggers and indexes associated with your table to speed up the BULK INSERT, and re-enable them afterwards.
Roberto Novakosky
I load the Excel file into a Lazarus spreadsheet and then export it to the Firebird DB. Everything is fine; the only problem is that fpspreadsheet will treat a string field containing only digits as a number field. I can check the titles in the first row to see whether the Excel file is valid or not.
As far as I can see, all replies so far focus on tools that essentially read the Excel (or CSV) file and use SQL inserts to insert the records into the Firebird database. While this works, I have always found this approach painfully slow.
That's why I created a tool that reads an Excel file and writes one file that has a (text) format suitable for Firebird external table (including support for UTF8 char columns) and one DDL file to create the external table in Firebird.
I then use regular SQL to select from the external table, cast as needed, and insert into whatever normal Firebird table I want. The performance with this approach is orders of magnitude faster than SQL inserts from a client app in my experience.
I would be willing to publish the tool. It's written in C#. Let me know if there's any interest.
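The external-table approach described in this answer can be sketched in a few lines. This is not the author's C# tool; it is a rough Python illustration that assumes single-byte (ASCII) text, fixed-width CHAR columns, and one extra CHAR(1) column to absorb the newline at the end of each record. The table name, file path, and column names are hypothetical.

```python
def to_fixed_width(rows, widths):
    """Pad every field to its column width so each record has the same length."""
    lines = []
    for row in rows:
        fields = []
        for value, width in zip(row, widths):
            text = str(value)
            if len(text) > width:
                raise ValueError(f"{text!r} does not fit in CHAR({width})")
            fields.append(text.ljust(width))
        lines.append("".join(fields) + "\n")
    return "".join(lines)

def external_table_ddl(table, file_path, columns):
    """Build DDL with one CHAR column per field plus one for the newline."""
    cols = [f"{name} CHAR({width})" for name, width in columns]
    cols.append("line_end CHAR(1)")
    return f"CREATE TABLE {table} EXTERNAL FILE '{file_path}' ({', '.join(cols)})"

# Made-up data and names, for illustration only.
text = to_fixed_width([("ab", 7)], [4, 3])
ddl = external_table_ddl("ext_people", "/data/people.txt",
                         [("name", 4), ("age", 3)])
```

Once the file is written and the external table created, the data can be moved with ordinary SQL, for example an INSERT ... SELECT that trims and casts each column into the real table.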
I am building my first database driven website with Drupal and I have a few questions.
I am currently populating a google docs excel spreadsheet with all of the data I want to eventually be able to query from the website (after it's imported). Is this the best way to start?
If this is not the best way to start what would you recommend?
My plan is to populate the spreadsheet then import it as a csv into the mysql db via the CCK Node.
I've seen two ways to do this.
http://drupal.org/node/133705 (importing data into CCK nodes)
http://drupal.org/node/237574 (Inserting data using spreadsheet/csv instead of SQL insert statements)
Basically my question(s) is what is the best way to gather, then import data into drupal?
Thanks in advance for any help, suggestions.
There's a comparison of the available modules at http://groups.drupal.org/node/21338
In the past when I've done this I simply write code to do it on cron runs (see http://drupal.org/project/phorum for an example framework that you could strip down and build back up to do what you need).
If I were to do this now I would probably use the http://drupal.org/project/migrate module where the philosophy is "get it into MySQL, View the data, Import via GUI."
There is a very good module for this, node import. It allows you to take your GoogleDocs spreadsheet and import it as a .csv file.
It's really easy to use: the module allows you to map your .csv columns to the node fields you want them to go to, so you don't have to worry about setting your columns in a particular order. Also, if there is an error in some records, it will spit out a .csv with the failed records and what caused each error, but will still import all the good records.
I have imported up to 3000 nodes with this method.