Fetch from large DB doesn't work - MySQL

I have this MySQL database with over a million records. I am not the owner of the database and don't have write/modify permissions on it. I have a small target DB called MyDB that fetches some fields from the giant view. Now these are the problems I face working with the huge million-record table in MySQL Workbench:
GiantDB(MySQL database)
--gview(over a million records. No permissions to write)
+id(PK)
+name-String
+marks-Integer
+country-String
myDB(Target SQLite DB)
--mytable
+id(PK)
+name-String
So this is a rough sketch of these two databases. I am not able to query gview without setting a row limit (1000).
COUNT(*) doesn't work either.
My ultimate goal is to insert the million names from gview into mytable.
Inserting gview's fields into mytable takes forever, and the query automatically gets killed.
Is there any way of doing this efficiently?
I looked around and people were talking about indexes and such, but I am completely clueless about what to do. A clear explanation would be of great help. Thanks and regards.

(A million rows is a medium-sized table. Don't let its size throw you.)
From the comment thread it sounds like you're taking too long to read the result set from MySQL, because it takes time to create your rows in your output database.
Think of this as an export from MySQL followed by an import into SQLite.
The export you can do with MySQL Workbench's Export... feature, which itself uses the mysqldump command-line tool.
You may then need to edit the .sql file created by the export so it's compatible with SQLite, then import it into SQLite. There are multiple tools that can do this.
Or, if you're doing this in a program (a Python program, perhaps), try reading your result set from the MySQL database row by row and writing it to a temporary disk file.
Then disconnect from the MySQL database, open your SQLite database and the file, read the file line by line, and load it into the database.
Or, if you write the file so it looks like this
1,"Person Name"
2,"Another Name"
3,"More Name"
etc., you'll have a so-called CSV (comma-separated values) file. There are many tools that can load such files into SQLite.
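For concreteness, here is a minimal sketch of that approach in Python, assuming the mysql-connector-python package is installed; the table and column names (gview, mytable, id, name) come from the question above, and the connection details are placeholders:

import csv
import sqlite3
import mysql.connector  # assumption: mysql-connector-python is installed

# Phase 1: stream rows out of MySQL and onto disk. The default cursor is
# unbuffered, so rows are fetched as you iterate, not all at once.
src = mysql.connector.connect(host="...", user="...", password="...",
                              database="GiantDB")
cur = src.cursor()
cur.execute("SELECT id, name FROM gview")
with open("names.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in cur:
        writer.writerow(row)
src.close()

# Phase 2: with MySQL out of the picture, bulk-load the file into SQLite
# and commit once at the end.
dst = sqlite3.connect("myDB.sqlite")
dst.execute("CREATE TABLE IF NOT EXISTS mytable (id INTEGER PRIMARY KEY, name TEXT)")
with open("names.csv", newline="", encoding="utf-8") as f:
    dst.executemany("INSERT INTO mytable (id, name) VALUES (?, ?)",
                    ((int(i), name) for i, name in csv.reader(f)))
dst.commit()
dst.close()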
Another choice: this will be mandatory if your MySQL database has very tight restrictions on what you can do. For example, they may have given you a 30-second query time limit. Ask your database administrator for help exporting this table to your SQLite database. Tell her you need a .csv file.
You should be able to say SELECT MAX(id) FROM bigtable to get the largest ID value. If that doesn't work, the table is probably corrupt.
One more suggestion: fetch the rows in batches of, say, ten thousand.
SELECT id, name FROM bigtable LIMIT 0,10000
SELECT id, name FROM bigtable LIMIT 10000,10000
SELECT id, name FROM bigtable LIMIT 20000,10000
SELECT id, name FROM bigtable LIMIT 30000,10000
and so on.
This will be a pain in the neck, but it will get you your data if your dba is uncooperative.
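If you script those batches, the loop is short. A sketch in Python, again assuming mysql-connector-python, with placeholder connection details; committing after each batch means a killed connection only costs you the current batch:

import sqlite3
import mysql.connector  # assumption: mysql-connector-python is installed

src = mysql.connector.connect(host="...", user="...", password="...",
                              database="GiantDB")
dst = sqlite3.connect("myDB.sqlite")
dst.execute("CREATE TABLE IF NOT EXISTS mytable (id INTEGER PRIMARY KEY, name TEXT)")
cur = src.cursor()

batch, offset = 10000, 0
while True:
    # Each query stays small enough to finish before any server-side
    # time limit kills it.
    cur.execute("SELECT id, name FROM bigtable LIMIT %s, %s", (offset, batch))
    rows = cur.fetchall()
    if not rows:
        break
    dst.executemany("INSERT INTO mytable (id, name) VALUES (?, ?)", rows)
    dst.commit()  # a failure partway through only costs one batch
    offset += batch

dst.close()
src.close()

Note that LIMIT with a large offset gets slower as the offset grows; if id is indexed, fetching with WHERE id > last_seen_id ORDER BY id LIMIT 10000 is usually faster.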
I hope this helps.

Related

Export blob column from MySQL DB to disk and replace it with new file name

So I'm working on a legacy database, and unfortunately its performance is very slow. A simple SELECT query can take up to 10 seconds on tables with fewer than 10,000 records.
So I tried to investigate the problem and found that deleting the column they have used to store files (mostly videos and images) fixes the problem and improves performance a lot.
Along with adding proper indexes, I was able to run the exact same query that used to take 10-15 seconds in under 1 second.
So my question is: is there any existing tool or script I can use to export those blobs (videos) from the database, save them to disk, and update each row with the new file name/path on the file system?
If not, is there any proper way to optimize the database so that those blobs don't impact performance as much?
Note: some of the clients consuming this database use high-level ORMs, so we don't have much control over the queries the ORM uses to fetch rows and their relations. So I cannot optimize the queries directly.
SELECT column FROM table1 WHERE id = 1 INTO DUMPFILE 'name.png';
How about this way?
There is also INTO OUTFILE instead of INTO DUMPFILE.
From the manual (13.2.10.1 SELECT ... INTO Statement): The SELECT ... INTO form of SELECT enables a query result to be stored in variables or written to a file:
SELECT ... INTO var_list selects column values and stores them into variables.
SELECT ... INTO OUTFILE writes the selected rows to a file. Column and line terminators can be specified to produce a specific output format.
SELECT ... INTO DUMPFILE writes a single row to a file without any formatting.
Link: https://dev.mysql.com/doc/refman/8.0/en/select-into.html
Link: https://dev.mysql.com/doc/refman/8.0/en/select.html
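If the INTO OUTFILE/DUMPFILE route doesn't suit you (both write files on the database server's host and require the FILE privilege), a small client-side script can do the export and the row update in one pass. A rough Python sketch, assuming mysql-connector-python, the table1/column names from the example above, and a file_name column you have added to hold the path:

import mysql.connector  # assumption: mysql-connector-python is installed

conn = mysql.connector.connect(host="...", user="...", password="...",
                               database="mydb")
# buffered=True so the connection is free for UPDATEs between fetches
cur = conn.cursor(buffered=True)

# Grab only the ids first, so at most one blob is in memory at a time.
cur.execute("SELECT id FROM table1 WHERE `column` IS NOT NULL")
ids = [row[0] for row in cur.fetchall()]

for row_id in ids:
    cur.execute("SELECT `column` FROM table1 WHERE id = %s", (row_id,))
    blob = cur.fetchone()[0]
    file_name = f"blob_{row_id}.bin"  # pick an extension that fits your data
    with open(file_name, "wb") as f:
        f.write(blob)
    # Record the path and clear the blob so the table actually shrinks.
    cur.execute("UPDATE table1 SET file_name = %s, `column` = NULL WHERE id = %s",
                (file_name, row_id))

conn.commit()
conn.close()

After clearing the blobs, rebuilding the table (e.g. with OPTIMIZE TABLE) can reclaim the space they occupied.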

MySQL - Data entry

I have just finished programming a company database in MySQL (I am a student, gaining experience). I have to do the data entry, but the amount of data to insert is immense. Is there a way to quickly insert large amounts of data without going through Excel?
Best regards.
I use SQLyog, which is a free tool available on the internet.
You can use this tool to work with MySQL and you won't have to open your Workbench again.
It has a wide variety of functions, including user-level access and importing data into a table via CSV.
You need to ensure your data file is saved as .csv and doesn't contain a header row, and you must have a table with the exact same number of columns (of course); then you can insert your entire data set (even very, very large data) into your table.
Also, while inserting, you need to check "Insert Excel-friendly values", and boom... you are done! :D
Please feel free to connect whenever you encounter any issue; it looks hard the first time, then it's very, very easy.
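If you'd rather script the load than click through a GUI, the same CSV import is a few lines of Python. A sketch assuming mysql-connector-python; the file, table, and column names are placeholders, and as above the CSV must have no header row and match the table's columns:

import csv
import mysql.connector  # assumption: mysql-connector-python is installed

conn = mysql.connector.connect(host="...", user="...", password="...",
                               database="companydb")
cur = conn.cursor()

with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))  # no header row, columns in table order

# executemany batches the inserts into far fewer round trips than
# one INSERT statement per line would take.
cur.executemany("INSERT INTO target_table (col1, col2, col3) VALUES (%s, %s, %s)",
                rows)
conn.commit()
conn.close()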

mysqldump: how to fetch dependent rows

I'd like a snapshot of a live MySQL DB to work with on my development machine. The problem is that the DB is too large, so my thought was to execute:
mysqldump [connection-info-here] --no-autocommit --where="1 limit 1000" mydb > /dump.sql
I think this will give me the first thousand rows of every table in database mydb. I anticipate that the resulting dataset will break a lot of foreign key constraints since some records will be missing. As a result the application I mean to run on the dev machine will fail.
Is there a way to mysqldump a sample of the database while ensuring that all records dumped abide by key constraints? (for instance if a foreign key is dumped, the matching record in the foreign table will also be dumped).
If that isn't possible, how do you guys deal with this problem?
No, there's no option for mysqldump to dump only rows that match in foreign key relationships. You already know about the --where option, and that won't do it.
I've had the same task as you, to dump a subset of data but only data that is related. For example, for creating a test instance.
I've been using MySQL for many years, I've worked as a MySQL consultant and trainer, and I try to keep up with current tools. I have never heard of any MySQL tool that does this operation.
The only solution I can suggest is to write your own script to dump table by table using SELECT...INTO OUTFILE.
It's sometimes easier to write a custom script just for your specific schema, than for someone to write a general-purpose tool that works for everyone's schema.
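To make that concrete, here is what such a schema-specific script might look like in Python for a hypothetical pair of tables, parent and child, where child.parent_id references parent.id (all names here are made up; mysql-connector-python is assumed). It dumps a sample of the parent table first, then only the child rows that point into that sample, so no foreign key is left dangling:

import csv
import mysql.connector  # assumption: mysql-connector-python is installed

conn = mysql.connector.connect(host="...", user="...", password="...",
                               database="mydb")
cur = conn.cursor()

# Take the sample from the parent table first.
cur.execute("SELECT * FROM parent LIMIT 1000")
parents = cur.fetchall()
parent_ids = [row[0] for row in parents]  # assumes id is the first column

# Then dump only the child rows that reference a sampled parent.
placeholders = ", ".join(["%s"] * len(parent_ids))
cur.execute(f"SELECT * FROM child WHERE parent_id IN ({placeholders})",
            parent_ids)
children = cur.fetchall()

for file_name, rows in (("parent.csv", parents), ("child.csv", children)):
    with open(file_name, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

conn.close()

For deeper hierarchies you repeat the same trick one level at a time, which is exactly why this is easier to write against a known schema than as a general-purpose tool.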
How I have dealt with this problem in the past is I don't copy data from the live database. I find some other way to create a subset of fake data for testing. It's probably better to create synthetic data anyway, because then you don't risk accidentally using live data in your dev/test environment, in case some of it is private data.

Exporting table data without the schema?

I've tried searching for this but so far I'm only finding results for "exporting the table schema without data," which is exactly the opposite of what I want to do. Is there a way to export data from a SQL table without having the script recreate the table?
Here's the problem I'm trying to solve, if someone has a better solution: I have two databases, each on a different server; I'll call them the raw database and the analytics database. The raw database is the "real" database and collects records sent to its server, storing them in a table that uses the transactional InnoDB engine. The analytics database is on an internal LAN and is meant to mirror the raw database; it will periodically be updated so that it matches the raw database. It's separated like this because we have a program that does some analysis and processing of the data, and we don't want to run it on the live server.
Because the analytics database is just a copy, it doesn't need to be transactional, and I'd like its table to use the MyISAM engine because I've found it much faster to import data into and query against. The problem is that when I export the table from the live raw database, the table schema gets exported too, with the engine set to InnoDB, so when I run the script to import the data into the analytics database, it drops the MyISAM table and recreates it as an InnoDB table. I'd like to automate this process of exporting/importing data, but this problem of the generated SQL script changing the table engine from MyISAM to InnoDB is stopping me, and I don't know how to get around it. The only way I know is to write a program with direct access to the live raw database that runs a query and updates the analytics database with the results, but I'm looking for alternatives to this.
Like this?
mysqldump --no-create-info ...
Use the --no-create-info option:
mysqldump --no-create-info db [table]
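To automate the periodic refresh, one option is a small script that pipes a data-only dump into the analytics server. A sketch using Python's subprocess; the host, database, and table names are placeholders, and it assumes mysqldump/mysql are on the PATH with credentials supplied via option files (e.g. ~/.my.cnf):

import subprocess

# Data-only dump: --no-create-info leaves out DROP/CREATE TABLE, so the
# MyISAM table on the analytics side is never touched structurally.
dump = subprocess.run(
    ["mysqldump", "--no-create-info", "--host=raw-server", "raw_db", "records"],
    check=True, capture_output=True)

# Empty the mirror table, then replay the INSERTs from the dump.
subprocess.run(
    ["mysql", "--host=analytics-server", "analytics_db"],
    input=b"TRUNCATE TABLE records;\n" + dump.stdout,
    check=True)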

Set up large database in MySQL for analysis in R

I have reached the limit of RAM in analyzing large datasets in R. I think my next step is to import these data into a MySQL database and use the RMySQL package. Largely because I don't know database lingo, I haven't been able to figure out how to get beyond installing MySQL, despite hours of Googling and RSeeking (I am running MySQL and MySQL Workbench on Mac OS X 10.6, but can also run Ubuntu 10.04).
Is there a good reference on how to get started with this usage? At this point I don't want to do any sort of relational databasing. I just want to import .csv files into a local MySQL database and do the subsetting with RMySQL.
I appreciate any pointers (including "You're way off base!") as I'm new to R and newer to large datasets... this one's around 80 MB.
The documentation for RMySQL is pretty good - but it does assume that you know the basics of SQL. These are:
creating a database
creating a table
getting data into the table
getting data out of the table
Step 1 is easy: in the MySQL console, simply "create database DBNAME". Or from the command line, use mysqladmin, or use one of the many MySQL admin GUIs.
Step 2 is a little more difficult, since you have to specify the table fields and their type. This will depend on the contents of your CSV (or other delimited) file. A simple example would look something like:
use DBNAME;
create table mydata(
id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
height FLOAT(3,2)
);
Which says create a table with 2 fields: id, which will be the primary key (so has to be unique) and will autoincrement as new records are added; and height, which here is specified as a float (a numeric type), with 3 digits total and 2 after the decimal point (e.g. 100.27). It's important that you understand data types.
Step 3 - there are various ways to import data into a table. One of the easiest is to use the mysqlimport utility. In the example above, assuming that your data are in a file with the same name as the table (mydata), with the id in the first column and the height in the second, separated by a tab character (and no header row), this would work:
mysqlimport -u DBUSERNAME -pDBPASSWORD DBNAME mydata
Step 4 - requires that you know how to run MySQL queries. Again, a simple example:
select * from mydata where height > 50;
Means "fetch all rows (id + height) from the table mydata where height is more than 50".
Once you have mastered those basics, you can move to more complex examples such as creating 2 or more tables and running queries that join data from each.
Then - you can turn to the RMySQL manual. In RMySQL, you set up the database connection, then use SQL query syntax to return rows from the table as a data frame. So it really is important that you get the SQL part - the RMySQL part is easy.
There are heaps of MySQL and SQL tutorials on the web, including the "official" tutorial at the MySQL website. Just Google search "mysql tutorial".
Personally, I don't consider 80 MB to be a large dataset at all; I'm surprised that it's causing a RAM issue, and I'm sure that native R functions can handle it quite easily. But it's good to learn new skills such as SQL, even if you don't need them for this problem.
I have a pretty good suggestion: for 80 MB, use SQLite. SQLite is a public-domain, lightweight, super fast file-based database that works (almost) just like a SQL database.
http://www.sqlite.org/index.html
You don't have to worry about running any kind of server or permissions, your database handle is just a file.
Also, it stores all data as a string, so you don't even have to worry about storing the data as types (since all you need to do is emulate a single text table anyway).
Someone else mentioned sqldf:
http://code.google.com/p/sqldf/
which does interact with SQLite:
http://code.google.com/p/sqldf/#9._How_do_I_examine_the_layout_that_SQLite_uses_for_a_table?_whi
So your SQL create statement would look like this:
create table tablename (
id INTEGER PRIMARY KEY,
first_column_name TEXT,
second_column_name TEXT,
third_column_name TEXT
);
Otherwise, neilfws' explanation is a pretty good one.
P.S. I'm also a little surprised that your script is choking on 80 MB. Isn't it possible in R to just read through the file in chunks without loading it all into memory?
The sqldf package might give you an easier way to do what you need: http://code.google.com/p/sqldf/. Especially if you are the only person using the database.
Edit: Here is why I think it would be useful in this case (from the website):
With sqldf the user is freed from having to do the following, all of which are automatically done:
database setup
writing the create table statement which defines each table
importing and exporting to and from the database
coercing of the returned columns to the appropriate class in common cases
See also here: Quickly reading very large tables as dataframes in R
I agree with what's been said so far, though I guess getting started with MySQL (databases in general) is not a bad idea for the long run if you are going to deal with data. I mean, I checked your profile, which says finance PhD student. I don't know if that means quantitative finance, but it is likely that you will come across really large datasets in your career. If you can afford some time, I would recommend learning something about databases. It just helps.
The documentation of MySQL itself is pretty solid, and you can get a lot of additional (specific) help here at SO.
I run MySQL with MySQL Workbench on Mac OS X Snow Leopard too, so here's what helped me get it done comparatively easily.
I installed MAMP, which gives me a local Apache web server with PHP, MySQL, and the MySQL tool phpMyAdmin, which can be used as a nice web-based alternative to MySQL Workbench (which is not always super stable on a Mac :). You will have a little widget to start and stop the servers and can access some basic configuration settings (such as ports) through your browser. It's really a one-click install.
Install the R package RMySQL and open a connection to your database with dbConnect().
Create your databases with MySQL Workbench. INT and VARCHAR (for categorical variables that contain characters) should basically be the only field types you need at the beginning.
Try to find the import routine that works best for you. I don't know if you are a shell/terminal guy; if so, you'll like what was suggested by neilfws. You could also use LOAD DATA INFILE, which I prefer since it's a single query, as opposed to INSERT INTO (line by line).
If you describe the problems you have more accurately, you'll get more specific help, so feel free to ask ;)
I assume you have to work a lot with time series data; there is a project (TSMySQL) around that uses R and relational databases (such as MySQL, but also available for other DBMSs) to store time series data. Besides, you can even connect R to FAME (which is popular among financiers, but expensive). The last paragraph is certainly nothing basic, but I thought it might help you consider whether it's worth the hassle to dive into it a little deeper.
Practical Computing for Biologists is a nice (though subject-specific) introduction to SQLite; see Chapter 15, "Data Organization and Databases".