Sqoop compatibility with TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT - mysql

For a project of mine, I would like to transfer multiple tables from a MySQL database into Hive using Sqoop. Because a few of my columns use the MEDIUMTEXT datatype, I'd like to check compatibility with someone who has experience, to prevent sudden surprises down the road.
According to the latest Sqoop user guide (1.4.6), there is no support for BLOB, CLOB, or LONGVARBINARY columns in direct mode.
Given that there is no mention of incompatibilities with the TEXT datatypes, will I be able to import them from MySQL without problems?

In MySQL, TEXT is equivalent to CLOB, so whatever limitations the user guide mentions for CLOB apply to the TEXT types as well.
Unlike typical datatypes, CLOB and TEXT values need not be stored inline with the record; the contents can be stored separately, with only a pointer kept in the record. That is why the direct path does not work for special types like CLOB/TEXT and BLOB in most databases.

I finally got around to setting up my hadoop cluster for my project. I am using hadoop 2.6.3 with hive 1.2.1 and sqoop 1.4.6.
It turns out that there is no problem importing TEXT datatypes from MySQL into Hive using Sqoop. You can even supply the '--direct' parameter, which uses the mysqldump tool for quicker transfers. In my project I had to import multiple tables containing 2 MEDIUMTEXT columns each. The tables were only about 2 GB each, so not that massive.
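For reference, a direct-mode import of the kind described above can be invoked like this (the host, database, table name, and username below are placeholders, not values from my actual setup):

```shell
# Sketch of a Sqoop direct-mode import into Hive; all connection
# details below are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table articles \
  --hive-import \
  --direct
```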
I hope this helps someone who is in the same situation I was in.

Related

VARCHAR(MAX) vs TEXT vs .txt file for use in MySQL database

I tried to google this, but any results I found were related to importing data from a txt file to populate the database as opposed to storing data.
To me, it seems strange that the contents of a file should be stored in a database. We're working on building an eCommerce site, and each item has a description. I assumed the standard would be to store the description in a txt file and the URL in the database, and not to store the huge contents in the database to keep the file size low and speeds high. When you need to store images in a database, you reference it using a URL instead of storing all the pixel data - why would text be any different?
That's what I thought, but everyone seems to be arguing about VARCHAR vs TEXT, so what really is the best way to store text data up to 1000 characters or so?
Thanks!
Whether you store long text data or image data in a database or in external files has no right or wrong answer. There are pros and cons on both sides—despite many developers making unequivocal claims that you should store images outside the database.
Consider you might want the text data to:
Allow changes to be rolled back.
Support transaction isolation.
Enforce SQL access privileges to the data.
Be searchable in SQL when you create a fulltext index.
Support the NOT NULL constraint, so your text is required to contain something.
Automatically be included when you create a database backup (and the version of the text is the correct version, assuring consistency with other data).
Automatically transfer the text to replica database instances.
For all of the above, you would need the text to be stored in the database. If the text is outside the database, those features won't work.
With respect to the VARCHAR vs. TEXT, I can answer for MySQL (though you mentioned VARCHAR(MAX) so you might be referring to Microsoft SQL Server).
In MySQL, both VARCHAR and TEXT max out at 64KB in bytes. If you use a multibyte character set, the max number of characters is lower.
Both VARCHAR and TEXT have a character set and collation.
VARCHAR allows a DEFAULT, but TEXT does not.
Internally, in the InnoDB storage engine, VARCHAR and TEXT are stored identically (as are VARBINARY and BLOB and all their cousins). See https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/
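The DEFAULT difference is easy to demonstrate; a sketch for MySQL 5.x (the table and column names are made up):

```sql
CREATE TABLE t (
  a VARCHAR(100) DEFAULT 'n/a',  -- allowed
  b TEXT DEFAULT 'n/a'           -- rejected: BLOB/TEXT columns can't have a DEFAULT
);
```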

Importing Geometry from MSSQL to MySQL (Linestring)

I've been given some data which I am trying to import into MySQL. The data was provided in a text file format, which is usually fine by me. I know MSSQL uses different data types, so a SQL dump was a non-starter...
For some reason MSSQL must store LINESTRINGs in reverse order, which seemed very odd to me. As a result, when I try to upload the file with Navicat, the import fails. Below is an example of the LINESTRING; as you can see, the longitude is first, then the latitude. I believe this is the issue?
LINESTRING (-1.61674 54.9828,-1.61625 54.9828)
Does anybody know how I can get this data into my database?
I'm quite new to spatial/geometry extensions.
Thanks,
Paul
You must remember that columns with spatial data have their own data type. What Navicat does is call "AsText()" (or "toString()") to display the data, but in the background they are BLOBs. The advantage is that both databases are based on the standard WKT format. I recommend that you export the spatial columns of the source database as text; in the destination database, take that text and use "GeomFromText()" to convert it back into geometry. (Obviously you have to write a script in some programming language; you cannot do that with Navicat alone.)
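In other words, the round trip looks roughly like this (the table and column names are hypothetical):

```sql
-- On the SQL Server side: export the geometry as WKT text
SELECT route.STAsText() AS wkt FROM routes;

-- On the MySQL side: rebuild the geometry from that text
INSERT INTO routes (route)
VALUES (GeomFromText('LINESTRING(-1.61674 54.9828,-1.61625 54.9828)'));
```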
Further reading: WKT, MySQL spatial extensions, SQL Server spatial data.

mysql to oracle

I've googled this but can't get a straight answer. I have a mysql database that I want to import in to oracle. Can I just use the mysql dump?
Nope. You need to use an ETL (Extract, Transform, Load) tool.
Oracle SQL Developer has a built-in feature for migrating a MySQL DB to Oracle.
Try this link - http://forums.oracle.com/forums/thread.jspa?threadID=875987&tstart=0 This is for migrating MySQL to Oracle.
If the dump is a SQL script, you will need to do a lot of copy & replace to make that script work on Oracle.
Things that come to my mind:
remove the dreaded backticks
remove all ENGINE=.... options
remove all DEFAULT CHARSET=xxx options
remove all UNSIGNED options
convert all DATETIME types to DATE
replace BOOLEAN columns with e.g. integer or a CHAR(1) (Oracle does not support boolean)
convert all int(x), smallint, tinyint data types to simply integer
convert all mediumtext, longtext data types to CLOB
convert all VARCHAR columns that are defined with more than 4000 bytes to CLOB
remove all SET ... commands
remove all USE commands
remove all ON UPDATE options for columns
rewrite all triggers
rewrite all procedures
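A few of the purely textual rewrites in the list above can be sketched with sed. This is only a rough illustration on a made-up one-line dump (the table and column names are invented), not a complete converter:

```shell
# A made-up MySQL DDL line to demonstrate the rewrites on:
ddl='CREATE TABLE `t` (`id` INT(11) UNSIGNED, `created` DATETIME, `body` LONGTEXT) ENGINE=InnoDB DEFAULT CHARSET=utf8;'

# Remove backticks, ENGINE/CHARSET options, and UNSIGNED; map DATETIME
# to DATE, the long text types to CLOB, and int(x) to INTEGER:
converted=$(printf '%s\n' "$ddl" | sed \
  -e 's/`//g' \
  -e 's/ENGINE=[A-Za-z0-9]*//g' \
  -e 's/DEFAULT CHARSET=[A-Za-z0-9]*//g' \
  -e 's/ UNSIGNED//g' \
  -e 's/DATETIME/DATE/g' \
  -e 's/MEDIUMTEXT/CLOB/g' \
  -e 's/LONGTEXT/CLOB/g' \
  -e 's/INT([0-9]*)/INTEGER/g')

echo "$converted"
```

Triggers, procedures, and anything beyond simple DDL still have to be rewritten by hand.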
The answer depends on which MySQL features you use. If you don't use stored procedures, triggers, views etc, chances are you will be able to use the MySQL export without major problems.
Take a look at:
mysqldump --compatible=oracle
If you do use these features, you might want to try an automatic converter (Google offers some).
In every case, some knowledge of both syntaxes is required to be able to debug problems (there almost certainly will be some). Also remember to test everything thoroughly.

LONGTEXT valid in migration for PGSQL and MySQL

I am developing a Ruby on Rails application that stores a lot of text in a LONGTEXT column. I noticed that when deployed to Heroku (which uses PostgreSQL) I am getting insert exceptions due to two of the column sizes being too large. Is there something special that must be done in order to get a tagged large text column type in PostgreSQL?
These were defined as "string" datatype in the Rails migration.
If you want the longtext datatype in PostgreSQL as well, just create it. A domain will do:
CREATE DOMAIN longtext AS text;
CREATE TABLE foo(bar longtext);
In PostgreSQL the required type is text. See the Character Types section of the docs.
A new migration that updates the model's datatype to 'text' should do the work. Don't forget to restart the database. If you still have problems, take a look at your model with 'heroku console' and just enter the model name.
If the db restart doesn't fix the problem, the only way I found was to reset the database with 'heroku pg:reset'. That is no fun if you already have important data in your database.

Set up large database in MySQL for analysis in R

I have reached the limit of RAM in analyzing large datasets in R. I think my next step is to import these data into a MySQL database and use the RMySQL package. Largely because I don't know database lingo, I haven't been able to figure out how to get beyond installing MySQL with hours of Googling and RSeeking (I am running MySQL and MySQL Workbench on Mac OSX 10.6, but can also run Ubuntu 10.04).
Is there a good reference on how to get started with this usage? At this point I don't want to do any sort of relational databasing. I just want to import .csv files into a local MySQL database and do the subsetting in with RMySQL.
I appreciate any pointers (including "You're way off base!") as I'm new to R and newer to large datasets... this one's around 80 MB.
The documentation for RMySQL is pretty good - but it does assume that you know the basics of SQL. These are:
creating a database
creating a table
getting data into the table
getting data out of the table
Step 1 is easy: in the MySQL console, simply "create database DBNAME". Or from the command line, use mysqladmin, or there are often MySQL admin GUIs.
Step 2 is a little more difficult, since you have to specify the table fields and their type. This will depend on the contents of your CSV (or other delimited) file. A simple example would look something like:
use DBNAME;
create table mydata(
id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
height FLOAT(3,2)
);
Which says: create a table with 2 fields: id, which will be the primary key (so it has to be unique) and will auto-increment as new records are added; and height, which here is specified as a float (a numeric type) with 3 digits total and 2 after the decimal point (e.g. 1.83). It's important that you understand data types.
Step 3 - there are various ways to import data into a table. One of the easiest is to use the mysqlimport utility. In the example above, assuming that your data are in a file with the same name as the table (mydata), where each line starts with a tab character (leaving the id field empty so it auto-increments) followed by the height value, and there is no header row, this would work:
mysqlimport -u DBUSERNAME -pDBPASSWORD DBNAME mydata
Step 4 - requires that you know how to run MySQL queries. Again, a simple example:
select * from mydata where height > 50;
Means "fetch all rows (id + height) from the table mydata where height is more than 50".
Once you have mastered those basics, you can move to more complex examples such as creating 2 or more tables and running queries that join data from each.
Then - you can turn to the RMySQL manual. In RMySQL, you set up the database connection, then use SQL query syntax to return rows from the table as a data frame. So it really is important that you get the SQL part - the RMySQL part is easy.
There are heaps of MySQL and SQL tutorials on the web, including the "official" tutorial at the MySQL website. Just Google search "mysql tutorial".
Personally, I don't consider 80 MB to be a large dataset at all; I'm surprised that this is causing a RAM issue, and I'm sure that native R functions can handle it quite easily. But it's good to learn new skills such as SQL, even if you don't need them for this problem.
I have a pretty good suggestion. For 80 MB, use SQLite. SQLite is a public-domain, lightweight, very fast file-based database that works (almost) just like a SQL database.
http://www.sqlite.org/index.html
You don't have to worry about running any kind of server or permissions, your database handle is just a file.
Also, it is flexible about data types (columns don't strictly enforce their declared type), so you don't even have to worry much about how the data is stored (since all you need to do is emulate a single text table anyway).
Someone else mentioned sqldf:
http://code.google.com/p/sqldf/
which does interact with SQLite:
http://code.google.com/p/sqldf/#9._How_do_I_examine_the_layout_that_SQLite_uses_for_a_table?_whi
So your SQL create statement would look like this:
create table tablename (
id INTEGER PRIMARY KEY,
first_column_name TEXT,
second_column_name TEXT,
third_column_name TEXT
);
Otherwise, neilfws' explanation is a pretty good one.
P.S. I'm also a little surprised that your script is choking on 80 MB. Isn't it possible in R to just read through the file in chunks, without opening it all up in memory?
The sqldf package might give you an easier way to do what you need: http://code.google.com/p/sqldf/. Especially if you are the only person using the database.
Edit: Here is why I think it would be useful in this case (from the website):
With sqldf the user is freed from having to do the following, all of which are automatically done:
database setup
writing the create table statement which defines each table
importing and exporting to and from the database
coercing of the returned columns to the appropriate class in common cases
See also here: Quickly reading very large tables as dataframes in R
I agree with what's been said so far, though I guess getting started with MySQL (and databases in general) is not a bad idea in the long run if you are going to deal with data. I mean, I checked your profile, which says finance PhD student. I don't know if that means quantitative finance, but it is likely that you will come across really large datasets in your career. If you can afford some time, I would recommend learning something about databases. It just helps.
The documentation of MySQL itself is pretty solid, and you can find a lot of additional (specific) help here at SO.
I run MySQL with MySQL Workbench on Mac OS X Snow Leopard too, so here's what helped me get it done comparatively easily.
I installed MAMP, which gives me a local Apache webserver with PHP, MySQL, and the MySQL tool phpMyAdmin, which can be used as a nice web-based alternative to MySQL Workbench (which is not always super stable on a Mac :). You get a little widget to start and stop the servers and can access some basic configuration settings (such as ports) through your browser. It's really a one-click install.
Install the R package RMySQL. (A connection is opened with the package's dbConnect() function before you run any queries.)
Create your databases with MySQL workbench. INT and VARCHAR (for categorical variables that contain characters) should be the field types you basically need at the beginning.
Try to find the import routine that works best for you. I don't know if you are a shell/terminal guy; if so, you'll like what was suggested by neilfws. You could also use LOAD DATA INFILE, which I prefer since it's only one query, as opposed to INSERT INTO (line by line).
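For example, a LOAD DATA INFILE statement of the kind I mean looks like this (the file path, table name, and delimiter are placeholders for your own setup):

```sql
LOAD DATA INFILE '/path/to/mydata.csv'
INTO TABLE mydata
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if there is one
```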
If you specify the problems that you have more accurately, you'll get some more specific help – so feel free to ask ;)
I assume you have to work a lot with time series data. There is a project (TSMySQL) around that uses R and relational databases (such as MySQL, but it is also available for other DBMSes) to store time series data. Besides, you can even connect R to FAME (which is popular among financiers, but expensive). The last paragraph is certainly nothing basic, but I thought it might help you consider whether it's worth the hassle to dive into it a little deeper.
Practical Computing for Biologists is a nice (though subject-specific) introduction to SQLite; see Chapter 15, "Data Organization and Databases".