Filesize: CSV vs MySQL - mysql

I am trying to optimise my MySQL table structure for a 3GB CSV file. So far, I have managed to import 60% of the 19m+ rows, with a MySQL table size of 5.5GB. How could I optimise my table structure to reduce the database table size? (as I am running out of disk space!)
A sample line from the CSV file is
"{0C7ADEF5-878D-4066-B785-0000003ED74A}","163000","2003-02-21 00:00","UB5 4PJ","T","N","F","106","","READING ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"
...and my database structure is:
(
`transaction_id` int(10) unsigned NOT NULL,
`reference` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`price` int(10) unsigned NOT NULL,
`sale_date` date COLLATE utf32_unicode_ci NOT NULL,
`postcode` varchar(8) COLLATE utf32_unicode_ci NOT NULL,
`type` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
`new_build` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
`tenure` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
`property_number` varchar(10) COLLATE utf32_unicode_ci NOT NULL,
`property_name` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`street` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`area` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`city` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`county1` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`county2` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
`unknown` varchar(1) COLLATE utf32_unicode_ci NOT NULL
)

Let's look at the size of the fields.
Your database structure consists primarily of varchars, which, under normal circumstances, take about one byte per character from the CSV file. With the overhead for length, these should be about the same size or a little bigger (two bytes for length versus one for the comma). You might throw in a 10% fudge factor for storing in the database.
Integers could go either way. They could be a single digit in the CSV file (two characters with the comma) or several digits. They will occupy 4 bytes in MySQL. Dates are probably smaller in MySQL than in the CSV file.
There is additional overhead for indexes, particularly if you have a fill factor that leaves space on a data page for additional storage. There is additional overhead for other stuff on a data page. But, your tables seem much larger than would be expected.
My guess is that your table is much bigger because of the utf32 considerations. If you have no really good reason for this, switch to utf8.
As a note: normally varchar(1) not null can be replaced by char(1) not null. This saves you the encoding of the length, which is a big savings for such small fields. The same applies to other fields: if you know the postal code is 8 characters, define it as char(8) rather than varchar(8).
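For illustration only, a sketch of what a tighter definition might look like, assuming the reference is always a 38-character GUID in braces and the single-character flags never hold multibyte data (neither assumption is confirmed above; the table name is made up):
-- Sketch: utf8 instead of utf32, CHAR for fixed-width fields, and no
-- COLLATE on non-character columns such as the date. Sizes are guesses.
CREATE TABLE `price_paid_sketch` (
  `transaction_id` int(10) unsigned NOT NULL,
  `reference` char(38) CHARACTER SET ascii NOT NULL,
  `price` int(10) unsigned NOT NULL,
  `sale_date` date NOT NULL,
  `postcode` char(8) NOT NULL,
  `type` char(1) NOT NULL,
  `new_build` char(1) NOT NULL,
  `tenure` char(1) NOT NULL,
  `property_number` varchar(10) NOT NULL,
  `property_name` varchar(100) NOT NULL,
  `street` varchar(100) NOT NULL,
  `area` varchar(100) NOT NULL,
  `city` varchar(100) NOT NULL,
  `county1` varchar(100) NOT NULL,
  `county2` varchar(100) NOT NULL,
  `unknown` char(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;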

Two suggestions:
(1) Your fields
You might ask MySQL itself regarding your data! Try
SELECT * FROM yourtable PROCEDURE ANALYSE();
and have a look at the result.
(2) Your charset
You're using utf32. If you don't really need it due to other parts of your tables/applications, go for utf8 instead.
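If you do make that switch, a single ALTER can convert the table default and every existing character column in one go (test on a copy first, since this rebuilds the table and needs temporary disk space):
-- Converts the table default charset and all character columns to utf8.
ALTER TABLE yourtable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;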

Related

How to overcome performance issue when converting utf8mb4 to latin1?

Through my own ignorance, I altered a few tables without specifying a collation.
That caused the changed columns, which used to be latin1, to be converted to utf8mb4.
This brought a HUGE performance loss when running joins. And when I say HUGE I mean a fraction of a second changed to an hour or more!
So I made another request to convert them back to latin1.
And here comes the problem. A mere 60k-row table, with ONE utf8mb4 column of 64 characters, required 10 hours to complete. No, that is not a mistake. TEN hours. And my even bigger problem is that I have other tables with millions of rows, giving me an ETA of years from today!
So now I wonder what my options are, because I can't afford to have these tables read-only for longer than a day.
I know that MySQL's ALTER TABLE creates a copy of the table. That makes sense, because this is a field size change, so I doubt I have the option of using ALGORITHM=INPLACE.
And if I cannot do INPLACE, then I cannot use the LOCK=NONE option.
Why in the world would a utf8mb4 -> latin1 conversion make such a big impact?
Note that the converted column is indexed, and that may be a reason for the impact!
ANY suggestion or link would be greatly appreciated!
Maybe the solution would be to drop the index (to avoid a funky multibyte issue in the index conversion), do a fast ALTER, and then add the index back? (A sketch of this is after the table definitions below.)
Thanks in advance for any serious suggestion; I suspect I may not find much help because of how unusual this is.
EDIT
jobs | CREATE TABLE `jobs` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`location` varchar(50) NOT NULL,
`value` int(10) NOT NULL DEFAULT '0',
`aggregation_job_id` char(64) CHARACTER SET latin1 DEFAULT NULL,
`aggregation_job_order` int(11) DEFAULT NULL,
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`),
KEY `request_entered_timestamp_idx` (`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=9068466 DEFAULT CHARSET=utf8mb4
jobs_archive | CREATE TABLE `jobs_archive` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`value` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=239432 DEFAULT CHARSET=utf8mb4
(taken from a stored PROCEDURE, but you catch the drift...)
INSERT INTO jobs_archive (SELECT * FROM jobs WHERE (TIMESTAMPDIFF(DAY, request_entered_timestamp, starttime) > days));
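To make the drop-then-alter idea concrete, here is the sequence I have in mind, with placeholder table and column names standing in for whichever column was accidentally converted to utf8mb4 (each ALTER will still copy the table, so I would verify this on a throwaway copy first):
-- 1. Drop the index so the ALTER does not have to rebuild a multibyte index.
ALTER TABLE the_table DROP INDEX the_column_idx;
-- 2. Convert the column back to latin1.
ALTER TABLE the_table MODIFY the_column CHAR(64) CHARACTER SET latin1 NOT NULL;
-- 3. Re-add the index on the now single-byte column.
ALTER TABLE the_table ADD INDEX the_column_idx (the_column);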

Turkish characters not inserted into SQL table. Displays 'xxx??xxx' when selected

I am trying to add Turkish names to my table, but when they are displayed I get ? instead of the Turkish characters. Any idea what I am missing here? This is my table:
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`grade` varchar(4) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`section` varchar(3) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`name` varchar(100) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`teacher` varchar(50) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`quota` varchar(2) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec1` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec2` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf16;
As suggested by the answer I chose, here is the solution to the problem for whoever googles this topic. Special thanks to all who contributed to solving my problem.
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) NOT NULL,
`grade` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`section` varchar(3) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`teacher` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`quota` varchar(2) COLLATE utf8_unicode_ci,
`lec1` varchar(35) DEFAULT NULL,
`lec2` varchar(35) DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
(Beginnings of an answer...)
Please don't use utf16; there is virtually no reason for such in a MySQL table.
So, assuming you switch to utf8, let's see if we can get rid of the ? problems.
utf8 needs to be established in about 4 places.
The column(s) in the database -- Use SHOW CREATE TABLE to verify that they are explicitly set to utf8, or defaulted from the table definition. (It is not enough to change the database default.)
The connection between the client and the server. See SET NAMES utf8.
The bytes you have. (This is probably the case.)
If you are displaying the text in a web page, check the <meta> tag.
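A rough sketch of what establishing utf8 can look like at each of those layers; the ALTER rewrites the whole table, so try it on a copy first, and the connection charset can also be set through the driver (e.g. mysqli's set_charset):
-- 1. Column / table definitions (check the result with SHOW CREATE TABLE):
ALTER TABLE offerings CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
-- 2. The connection between client and server:
SET NAMES utf8;
-- 3. The bytes you insert must already be valid UTF-8.
-- 4. In the web page, declare <meta charset="utf-8"> (outside of SQL).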
What probably happened:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
Since the CHARACTER SET disagrees with what you have shown, the problem is possibly more complex. Please provide
SELECT col, HEX(col) FROM tbl WHERE ...
for some simple cell with Turkish characters. With this, I may be able to figure out what happened.
Also, Reference notes on encodings.
VARCHARs are character strings, while NVARCHARS are Unicode character strings. NVARCHARS require more bits per character to store, but have a greater range. Try updating your data types. This should fix your problem.
EDIT This answer is wrong. The OP clearly asked for a MySQL solution, but the above applies only to SQL Server.

Do TEXT columns affect query speed? - mysql

I have a table that contains 2 long text columns. When I fetch 100 rows it takes 5 seconds; that is a long time, right?
Maybe this happens because I have these 2 long text columns?
Here is the table structure:
CREATE TABLE `tempBusiness2` (
`bussId` int(11) NOT NULL AUTO_INCREMENT,
`nameHe` varchar(200) COLLATE utf8_bin NOT NULL,
`nameAr` varchar(200) COLLATE utf8_bin NOT NULL,
`nameEn` varchar(200) COLLATE utf8_bin NOT NULL,
`addressHe` varchar(200) COLLATE utf8_bin NOT NULL,
`addressAr` varchar(200) COLLATE utf8_bin NOT NULL,
`addressEn` varchar(200) COLLATE utf8_bin NOT NULL,
`x` varchar(200) COLLATE utf8_bin NOT NULL,
`y` varchar(200) COLLATE utf8_bin NOT NULL,
`categoryId` int(11) NOT NULL,
`subcategoryId` int(11) NOT NULL,
`cityId` int(11) NOT NULL,
`cityName` varchar(200) COLLATE utf8_bin NOT NULL,
`phone` varchar(200) COLLATE utf8_bin NOT NULL,
`userDetails` text COLLATE utf8_bin NOT NULL,
`selectedIDFace` tinyint(4) NOT NULL,
`alluserDetails` longtext COLLATE utf8_bin NOT NULL,
`details` varchar(500) COLLATE utf8_bin NOT NULL,
`picture` varchar(200) COLLATE utf8_bin NOT NULL,
`imageUrl` varchar(200) COLLATE utf8_bin NOT NULL,
`fax` varchar(200) COLLATE utf8_bin NOT NULL,
`email` varchar(200) COLLATE utf8_bin NOT NULL,
`facebook` varchar(200) COLLATE utf8_bin NOT NULL,
`trash` tinyint(4) NOT NULL,
`subCategories` varchar(500) COLLATE utf8_bin NOT NULL,
`openHours` varchar(500) COLLATE utf8_bin NOT NULL,
`lastCheckedDuplications` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`bussStatus` tinyint(4) NOT NULL,
`approveStatus` tinyint(4) NOT NULL,
`steps` tinyint(4) NOT NULL DEFAULT '0',
`allDetails` longtext COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`bussId`),
KEY `bussId` (`allDetails`(5),`bussId`),
KEY `face` (`alluserDetails`(5),`userDetails`(5),`bussId`)
) ENGINE=InnoDB AUTO_INCREMENT=2515926 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
My query: SELECT * FROM tempBusiness2 LIMIT 100
If SELECT * FROM tempBusiness2 LIMIT 100 is really your query, then no INDEX is involved, and no INDEX would make it run any faster.
What that statement does:
Start at the beginning of the "data". (In InnoDB the PRIMARY KEY and the data are 'clustered' together. So, you are coincidentally starting with the first value of the PK.)
Read that row.
Move to the next row -- This is easy and efficient because the PK & Data are stored in a B+Tree structure.
Repeat until finished with the 100 or the table.
But... Because of lots of TEXT and VARCHAR fields, it is not that efficient. No more than 8K of a row is stored in the B+Tree mentioned above; the rest is sitting in extra blocks that are linked to. (I do not know how many extra blocks, but I fear it is more than one.) Each extra block is another disk hit.
Now, let's try to "count the disk hits". If you run this query a second time (and have a reasonably large innodb_buffer_pool_size), there would not be any disk hits. Instead, let's focus on a "cold cache" and count the data blocks that are touched.
If there is only one row per block (as derived from the 8KB comment), that's 100 blocks to read. Plus the extra blocks -- hundred(s) more.
Ordinary disks can handle 100 reads per second. So that is a total of several seconds -- possibly the 5 seconds that you experienced!
Now... What can be done?
Don't do SELECT * unless you really want all the columns. By avoiding some of the bulky columns, you can avoid some of the disk hits.
innodb_buffer_pool_size should be about 70% of available RAM.
"Vertical partitioning" may help. This is where you split off some columns into a 'parallel' table. This is handy if some subset of the columns are really a chunk of related stuff, and especially handy if it is "optional" in some sense. The JOIN to "put the data back together" is likely to be no worse than what you are experiencing now.
Do you really need (200) in all those fields?
You have what looks like a 3-element array of names and addresses. That might be better as another table with up to 3 rows per bussId.
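A sketch of that vertical partitioning for this table; the new table name and the choice of which columns to split off are assumptions, not a recommendation tailored to your queries:
-- Move the bulky text columns into a parallel table keyed by the same id,
-- so the rows of the main table stay small.
CREATE TABLE tempBusiness2_details (
  bussId int(11) NOT NULL,
  userDetails text NOT NULL,
  alluserDetails longtext NOT NULL,
  allDetails longtext NOT NULL,
  PRIMARY KEY (bussId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

-- "Put the data back together" only when the big columns are really needed:
SELECT b.*, d.allDetails
FROM tempBusiness2 AS b
JOIN tempBusiness2_details AS d USING (bussId)
LIMIT 100;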
On another note: If you run EXPLAIN SELECT ... on all your queries, you will probably find that the "prefix indexes" are never used:
KEY `bussId` (`allDetails`(5),`bussId`),
KEY `face` (`alluserDetails`(5),`userDetails`(5),`bussId`)
What were you hoping to get from them? Consider FULLTEXT indexes instead.
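If you do try FULLTEXT, a hedged sketch (InnoDB only supports FULLTEXT indexes from MySQL 5.6 onwards, and the search term here is just an example):
-- Replace the unused prefix indexes with a FULLTEXT index on the columns
-- you actually search.
ALTER TABLE tempBusiness2
  DROP INDEX `bussId`,
  DROP INDEX `face`,
  ADD FULLTEXT INDEX ft_user_details (userDetails, alluserDetails);

SELECT bussId, nameEn
FROM tempBusiness2
WHERE MATCH(userDetails, alluserDetails) AGAINST('some phrase' IN NATURAL LANGUAGE MODE);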
Why do you have both city_id and city_name in this table? It sounds like normalization gone berserk.
Yes, this kind of column takes a lot of time; return these columns only when you need them.
In addition, you will have to add an index to your table.

MySQL character set for multiple European languages + mathematical symbols

I need some help getting a MySQL table to store and output characters from the following languages:
English
French
Russian
Turkish
German
These are the languages I know of in the data. It also uses mathematical symbols such as this:
b ∈ A. Define s(A) := sup_{n≥0} r_A(n) for each A ⊆ ? ∪ {0}.
I'm using htmlentities to encode the text. The ? above is meant to display as a ℕ.
It displays this way when I look at the data in PhpMyAdmin. The other characters are encoded as expected.
The table is set to utf8_unicode_ci and all aspects of the website have been set to UTF-8 (including via the .htaccess file, a PHP header and a meta tag).
Please help?
Additional info:
Hosting environment:
Linux, Apache
Mysql 5.5.38
PHP Version 5.4.4-14
Connection string:
ini_set('default_charset', 'UTF-8');
$mysqli = new mysqli($DB_host , $DB_username, $DB_password);
$mysqli->set_charset("utf8");
$mysqli->select_db($DB_name);
SHOW CREATE TABLE mydatabase.mytable output:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL,
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SELECT DEFAULT_CHARACTER_SET_NAME FROM information_schema.SCHEMATA output:
DEFAULT_CHARACTER_SET_NAME
utf8 [->UTF-8 Unicode]
utf8mb4 [->UTF-8 Unicode]
Fonts used:
Arial
Sample of text in the database:
Let <em>A</em> be a subset of the set of nonnegative integers ℕ ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Output on web page:
Let <em>A</em> be a subset of the set of nonnegative integers ? ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Which becomes
Let A be a subset of the set of nonnegative integers ? ∪ {0}, and let rA (n) be the number of representations of n ≥ 0 by the sum a + b with a, b ∈ A.
While your database and table are configured to use UTF-8, one of your columns still isn't:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL, <--------- This one
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Since all the other symbols got HTML-encoded, they will survive all charsets, but not ℕ, which doesn't have a named entity reference.
You need to convert your column:
ALTER TABLE tablename MODIFY src TINYTEXT CHARACTER SET utf8;
NOTE: I noticed you like mathematical symbols. Some of them are outside of the Basic Multilingual Plane, i.e. they have codepoints > 0xFFFF, for example the mathematical letter variants (fraktur, double-struck, semantic italic, etc.).
If you want to support them, you need to switch encoding in MySQL everywhere (table, columns, connection) to utf8mb4, which is the true UTF-8 (utf8 in MySQL means the subset of UTF-8 with BMP only), with utf8mb4_unicode_ci collation. Here is how to do the migration.
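A sketch of that migration for this particular table (test on a copy; the column, table, and connection all have to end up as utf8mb4):
-- Convert the table default and all character columns to 4-byte UTF-8.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- And make the connection match, e.g.:
SET NAMES utf8mb4;
-- (or in PHP: $mysqli->set_charset("utf8mb4");)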
Also, I have noticed that you are HTML-encoding HTML. Maybe you have a reason, but in my opinion storing this doesn't make sense:
&lt;em&gt;A&lt;/em&gt;
If you want to put it into an HTML document, now you need to HTML-decode it at least once, sometimes twice. I'd rather store what almost everybody else does:
<em>A</em>
This way, you will store Unicode characters natively, in an optimal way.

MySQL table design / architecture, table is too big

I have a MySQL DB that contains a lot of text, I'm fetching data from a website and inserting it into a table.
I'm using an SSD (100GB) for the DB and I'm out of space. I think that something in the table structure made it too big. I can't predict the size of all the columns, so I'm using varchar/text/mediumtext for most of the fields. When I insert all the data into the DB I monitor the errors, and when I see that a certain field is too small for the data I'm trying to insert, I increase the size of the field (e.g. from varchar(1000) to varchar(2000)).
So far I have about 1.8M rows; I think that I'm doing something wrong.
Here is the structure of my table:
CREATE TABLE `PT` (
`patID` int(11) NOT NULL,
`Title` varchar(450) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`NoFullText` tinyint(1) DEFAULT NULL,
`Abstract` text,
`ForeignReferences` varchar(15000) DEFAULT NULL,
`CurrentUSClass` varchar(2200) DEFAULT NULL,
`OtherReferences` mediumtext,
`ForeignPrio` varchar(900) DEFAULT NULL,
`CurrentIntlClass` varchar(3000) DEFAULT NULL,
`AppNum` varchar(45) DEFAULT NULL,
`AppDate` date DEFAULT NULL,
`Assignee` varchar(300) DEFAULT NULL,
`Inventors` varchar(1500) DEFAULT NULL,
`RelatedUSAppData` text,
`PrimaryExaminer` varchar(100) DEFAULT NULL,
`AssistantExaminer` varchar(100) DEFAULT NULL,
`AttorneyOrAgent` varchar(300) DEFAULT NULL,
`ReferencedBy` text,
`AssigneeName` varchar(150) DEFAULT NULL,
`AssigneeState` varchar(80) DEFAULT NULL,
`AssigneeCity` varchar(150) DEFAULT NULL,
`InventorsName` varchar(800) DEFAULT NULL,
`InventorsState` varchar(300) DEFAULT NULL,
`InventorsCity` varchar(800) DEFAULT NULL,
`Claims` mediumtext,
`Description` mediumtext,
`InsertionTime` datetime NOT NULL,
`LastUpdatedOn` datetime NOT NULL,
PRIMARY KEY (`patID`),
UNIQUE KEY `patID_UNIQUE` (`patID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
What should I do? I have about 20% of the data so far (which means I'm going to need ~350GB of space). What is the performance impact here? Should I divide the table into several tables over several HDs? I'm going to use Sphinx to index and query the data in the end.
All of the non-TEXT column values are stored in one 8KB record (undivided unit of space on your HDD). TEXT column values are stored as pointers to external blocks of data.
These kinds of structures (very text oriented) are better handled by NOSQL (Not Only SQL) databases like MongoDB.
But I suspect that there are a lot of things you could do regarding how to handle & structure your data in order to avoid saving huge chunks of text.
The process of structuring a database to avoid repetitious information and to allow for easy updates (update in one place - visible everywhere) is called normalization.
If the data you're storing in those big VARCHARs (e.g. Inventors, length 1500) is structured as multiple elements of data (e.g. names of inventors separated by a comma), then you can restructure your DB table by creating an inventors table and referencing it.
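For example, a sketch of splitting the inventors out of this table; the new table and column names are made up for illustration:
-- One row per inventor, plus a link table for the patent-to-inventor
-- relation, instead of comma-separated varchar(1500)/varchar(800) columns.
CREATE TABLE inventor (
  inventorID int(11) NOT NULL AUTO_INCREMENT,
  name varchar(200) NOT NULL,
  city varchar(150) DEFAULT NULL,
  state varchar(80) DEFAULT NULL,
  PRIMARY KEY (inventorID)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE patent_inventor (
  patID int(11) NOT NULL,
  inventorID int(11) NOT NULL,
  PRIMARY KEY (patID, inventorID)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;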