Efficiently uploading data from large files into a database - MySQL

Good day everyone, I need a suggestion on how to efficiently upload data from fairly large files into a MySQL database. I have two files, 5.85 GB and 6 GB of data. For uploading I have used `LOAD DATA LOCAL INFILE`. The first file is still uploading (36 hours so far). The current index size is 7.2 GB. I have two questions:
1) The data is formatted like: {string, int, int, int, int}. I do not need the int values, so I created a table with one field of type varchar(128). My query is `LOAD DATA LOCAL INFILE "file" INTO TABLE "table"`, so will the data be correct (I mean only the strings, without the int fields)?
2) The larger the index gets, the longer the load time for the next batch. So, am I doing something wrong? What I need is fast search within those strings (especially on the last word). All of the strings contain exactly 5 words, so does it make any sense to put every single word in a different column (n rows, 5 columns)?
Please, any suggestions.

Could you drop the index for now and recreate it once the data is loaded into the table? I think this will work.
Recreating the index will take time.
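A rough sketch of that sequence, assuming comma-separated fields and hypothetical table, column, and index names; MySQL can route unwanted input fields into throwaway @dummy user variables so the four int values never reach the table:
ALTER TABLE words DROP INDEX idx_word;  -- drop the index before the bulk load
LOAD DATA LOCAL INFILE '/path/to/file.txt'
INTO TABLE words
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(word, @dummy1, @dummy2, @dummy3, @dummy4);  -- keep the string, discard the ints
ALTER TABLE words ADD INDEX idx_word (word);  -- recreate the index once, at the end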

Is it correct to have a BLOB field directly in the main table?

Which one is better: having a BLOB field in the same table or having a 1-TO-1 reference to it in another table?
I'm making a MySQL database whose main table is called item(ID, Description). This table is consulted by a program I'm developing in VB.NET which offers the possibility to double-click a specific item obtained with a query. Once its dedicated form is opened, I would like to show an image stored in the BLOB field, a sort of item preview. The problem is I don't know where it is better to create this BLOB field.
Assuming to have a table like this: Item(ID, Description, BLOB), will the BLOB field affect the database performance on queries like:
SELECT ID, Description FROM Item;
If yes, what do you think about this solution:
Item(ID, Description)
Images(Item, File)
Where Images.Item references to Item.ID, and File is the BLOB field.
You can add the BLOB field directly to your main table, as BLOB fields are not stored in-row and require a separate look-up to retrieve their contents. Your dependent table is unnecessary.
BUT another, and preferred, way is to store in your database table only a pointer to your image file (the path to the file on the server). This way you can retrieve the path and access the file from your VB.NET application.
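A minimal sketch of that pointer-based design, with hypothetical table and column names:
-- Store only a relative path to the image; the VB.NET application
-- resolves the path against a base folder and loads the file itself.
CREATE TABLE Item (
    ID INT PRIMARY KEY,
    Description VARCHAR(255),
    ImagePath VARCHAR(255)  -- e.g. 'images/item_42.png'
);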
To quote the documentation about blobs:
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
In simpler terms, the blob's contents aren't stored inside the table's row, only a pointer is - which is pretty similar to what you're trying to achieve with the secondary table. To make a long story short - there's no need for another table; MySQL already does the same thing internally.
Most of what has been said in the other answers is correct. I'll start from scratch, adding some caveats.
The two-table, 1-1, design is usually better for MyISAM, but not for InnoDB. The rest of my Answer applies only to InnoDB.
"Off-record" storage may happen to BLOB, TEXT, and 'large' VARCHAR and VARBINARY, almost equally.
"Large" columns are usually stored "off-record", thereby providing something very similar to your 1-1 design. However, by having InnoDB do the work usually leads to better performance.
The ROW_FORMAT and the size of the column make a difference; see the sketch after this list.
A "small" BLOB may be stored on-record. Pro: no need for the extra fetch when you include the blob in the SELECT list. Con: clutter.
Some ROW_FORMATs cut off at 767 bytes.
Some ROW_FORMATs store 20 bytes on-record; this is just a 'pointer'; the entire blob is off-record.
etc, etc.
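As a quick illustration (database and table names are placeholders), you can inspect a table's row format and switch it like this:
-- Check which ROW_FORMAT a table currently uses
SELECT TABLE_NAME, ROW_FORMAT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db' AND TABLE_NAME = 'Item';
-- DYNAMIC keeps only the 20-byte pointer on-record for off-record blobs
ALTER TABLE Item ROW_FORMAT=DYNAMIC;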
Off-record is beneficial when you need to filter out a bunch of rows, then fetch only a few. Also, when you don't need the column.
As a side note, TINYTEXT is possibly useless. There are situations where the 'equivalent' VARCHAR(255) performs better.
Storing an image in the table (on- or off-record) is arguably unwise if that image will be used in an HTML page. HTML is quite happy to request the <img src=...> from your server or even some other server. In this case, a smallish VARCHAR containing a url is the 'correct' design.

speed up LOAD DATA INFILE with duplicates - 250 GB

I'm looking for advice on whether there is any way to speed up the import of about 250 GB of data into a MySQL table (InnoDB) from eight source csv files of approx. 30 GB each. The csv's have no duplicates within themselves, but do contain duplicates between files -- in fact some individual records appear in all 8 csv files -- so those duplicates need to be removed at some point in the process. My current approach creates an empty table with a primary key, and then uses eight "LOAD DATA INFILE [...] IGNORE" statements to sequentially load each csv file, dropping duplicate entries along the way. It works great on small sample files. But with the real data, the first file takes about 1 hour to load, the second takes more than 2 hours, the third more than 5, and the fourth more than 9 hours, which is where I'm at right now. It appears that as the table grows, the time required to compare the new data to the existing data increases... which of course makes sense. But with four more files to go, it looks like it might take another 4 or 5 days to complete if I just let it run its course.
Would I be better off importing everything with no indexes on the table, and then removing duplicates after? Or should I import each of the 8 csv's into separate temporary tables and then do a union query to create a new consolidated table without duplicates? Or are those approaches going to take just as long?
Plan A
You have a column for deduplicating; let's call it name.
CREATE TABLE New (
    name ...,
    ...
    PRIMARY KEY (name)  -- no other indexes
) ENGINE=InnoDB;
Then, 1 csv at a time:
* Sort the csv by name (this makes any caching work better)
* LOAD DATA ...
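A sketch of one such iteration, assuming comma-separated files and placeholder table/column names; IGNORE silently skips any row whose name already exists:
LOAD DATA LOCAL INFILE 'part1.sorted.csv'
IGNORE INTO TABLE New
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(name, data1, data2);  -- run once per pre-sorted csv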
Yes, something similar to Plan A could be done with temp tables, but it might not be any faster.
Plan B
Sort all the csv files together (probably the unix "sort" can do this in a single command?).
Plan B is probably fastest, since it is extremely efficient in I/O.
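A sketch of Plan B, assuming the eight files have already been merged, sorted, and deduplicated outside MySQL (for instance with the unix sort utility); names are again placeholders:
-- One bulk load of the pre-deduplicated file; with no duplicates left,
-- the primary-key check never has to reject anything.
LOAD DATA LOCAL INFILE 'merged.dedup.csv'
INTO TABLE New
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(name, data1, data2);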

MySQL DB normalization

I've got a single-table DB with 100K rows. There are about 30 columns: 28 of them are varchars / tiny text, one is an int primary key, and one is a blob.
My question is: in terms of performance, would it be better to separate the blob from the rest of the table and store it in its own table with a foreign key constraint to the primary id?
The table will eventually be turned into a sqlite persistent store for iOS core data and a lot of the searching / filtering will be done based on the NSPredicate for the lighter varchar columns.
Sorry if this is too subjective, but I'm thinking there is a recommended way.
Thanks!
If you do SELECT * FROM table (which you shouldn't do if you don't actually need the BLOB field), then yes, the query will be faster with the blob stored separately, because pages with BLOB data won't be touched.
If you do frequent SELECT f1, f2, f3 FROM table (all fields non-BLOB), then yes, storing BLOBs in a separate table will make the query faster for the same reason - MySQL will have to read fewer pages.
If, however, the BLOB is selected frequently, then it makes no sense to keep it separately.
This totally depends on data usage.
If you need the blob data every time you query the table, there is no difference in having a separate table for it (as long as the blob data is unique in each row - that is, as long as the database is normalized).
If you don't need the blob data but only metadata from the other columns, there may be a speed bonus when querying if the blob has its own table. Querying the blob data is slower though, as you need to query both tables.
The USUAL way is not to store any blob data inside the database (at least not huge data), but to store the binary data in files and keep the file path inside the database instead. This is recommended, as binary data most likely doesn't benefit from being inside a DBMS (it's not indexable, sortable, groupable, ...), so there is no drawback to storing it in files, while the database isn't optimized for binary data (because, again, it can't do much with it anyway).
Blobs are stored on disk; only the pointer to that storage is kept in the row in MySQL. Moving the blob to another table with a foreign key will not noticeably help your performance. I don't know if this is the case for sqlite.
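For reference, a sketch of the split design under discussion, with hypothetical names; the light, filterable columns stay in the main table and the blob moves to a 1-to-1 child table:
CREATE TABLE item (
    id INT PRIMARY KEY,
    title VARCHAR(255),    -- stands in for the ~28 varchar/tinytext columns
    category VARCHAR(64)
);
CREATE TABLE item_blob (
    item_id INT PRIMARY KEY,
    payload BLOB,
    FOREIGN KEY (item_id) REFERENCES item(id)
);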

MySQL BLOB storage for HTML data: in a separate table or not?

I have a table with 100,000 rows. This table contains a BLOB field, so the table size is around 1 GB. The table is scanned regularly by many queries in the application, but the blob field is used in only one select query. The table also has 5 indexes with a total size of 10 MB. My doubts are:
1) Is it better to move the blob field to another table? Will this improve the speed of read operations on the table?
2) The BLOB field is used to store HTML data about 6 KiB in size. Is the BLOB type apt for this?
If you can change the schema:
Store the images on the application server and keep only the relative paths to those images in the database; this will result in less overhead.
Moving the blob field to another table can also be a good idea.
Why are you keeping HTML data in a blob? Are you seriously storing the image styles/CSS with it? Not recommended at all!

indexing varchars without duplicating the data

I've got a huge data-set of ~1 billion records in the following format:
|KEY (varchar(300), UNIQUE, PK)|DATA1 (int)|DATA2 (bool)|DATA4 (varchar(10))|
Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10 GB out of the 12 GB table size) is stored twice - once in the table and once in the index. (The data is append-only; there will never be an UPDATE query on the table.)
There are two major actions that run against the data-set:
contains - a simple check whether a key is present
count - aggregation functions (mostly) over the data fields
Is there a way to store the key data only once?
One idea I had is to drop the DB altogether and simply create a 2-5 char folder structure.
This way the data assigned to the key "thesimon_wrote_this" would be stored in the fs as
~/data/the/sim/on_/wro/te_/thi/s.data
This way the data set will function much like a B-tree, and the "contains" and data-retrieval functions will run in almost O(1) (with the obvious HDD limitations).
This makes backups pretty easy (backing up only files with the archive attribute set), but the aggregating functions become almost useless, as I would need to grep 1 billion files every time. The allocation unit size is irrelevant, as I can adjust the folder structure so that only 5% of the disk space is wasted.
I'm pretty sure that there is another, much more elegant, way to do that - I just can't Google it out :).
It would seem like a very good idea to consider having a fixed-width, integral key, like a 64-bit integer. Storing and searching a varchar key is very slow by comparison! You can still add an additional index on the KEY column for fast lookup, but it shouldn't be your primary key.
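A sketch of that layout, with hypothetical table and column names: a compact auto-increment primary key, plus a secondary unique index on the varchar for the "contains" lookups:
CREATE TABLE records (
    id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name  VARCHAR(300) NOT NULL,  -- the old varchar key, now a secondary index
    data1 INT,
    data2 BOOL,
    data4 VARCHAR(10),
    PRIMARY KEY (id),
    UNIQUE KEY uq_name (name)
) ENGINE=InnoDB;
-- "contains" check, resolved entirely from the secondary index:
SELECT 1 FROM records WHERE name = 'thesimon_wrote_this' LIMIT 1;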