Doctrine/MySQL performance: 470 columns in one table

I have an external CSV file that I need to import into a MySQL database: the CSV has 473 columns (144k rows), which in my opinion is too many columns for one single table.
The problem: I was thinking of doing some normalization and splitting the data into more tables, but this will require extra work whenever a new CSV is released (with more or fewer columns).
Is it okay if I keep the structure of the CSV/table intact and have one big table? What is the performance impact of each approach on MySQL/Doctrine?
The data:
I don't have ownership of this data that would let me split it into more tables: it comes from public government resources as it is, with no duplicate columns.. so there's no clean way to split it :( I must take it as it is... Any additional categorization/splitting is extra work and may change on the next update of the data.

Digging deeper into the CSV data I found some interesting organization: it can be split into 18 different tables (providers).
Each table has its own columns (some columns exist in multiple tables), but the largest one has around 180 columns.
This is as far as I can split the data: since I don't have ownership of the CSV, I cannot just go ahead and group similar columns/tables.
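For reference (not from the original post), a minimal sketch of the "keep it as one wide table" option, with made-up column and file names. InnoDB allows up to 1017 columns per table, and MySQL's 65,535-byte row-size limit applies (TEXT/BLOB values are stored mostly off-page), so 473 columns is legal as long as the column types stay reasonable:

CREATE TABLE gov_dataset (
  col_001 VARCHAR(100) NULL,
  col_002 VARCHAR(100) NULL,
  -- ... one column per CSV header, 473 in total ...
  col_473 VARCHAR(100) NULL
) ENGINE=InnoDB;

-- Bulk-load the file as-is; the CSV column order must match the table's
-- column order, and the first line is assumed to be a header.
LOAD DATA LOCAL INFILE '/path/to/export.csv'   -- requires local_infile enabled
INTO TABLE gov_dataset
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

Indexes would then only be added on the handful of columns that queries actually filter on; Doctrine can map the rest as plain fields of one entity.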

Related

Best database design for efficient analysis on some millions of records

I have a basic question about database design.
I have a lot of files which I have to read and insert into the database. Each file has a few thousand lines and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserting in order to increase insert speed (in the end I have 30-40 million records).
After inserting I want to do some sophisticated analysis, and that performance is important to me.
Now I have each line's fields ready to insert, so I have 3 approaches:
1- One big table:
In this case I can create a big table with 30 columns and store all of the files' fields in it. So there is one table of huge size on which I want to do a lot of analysis.
2- A fairly large table (A) and some little tables (B):
In this case I can create some little tables consisting of the columns that have fairly identical records when separated from the other columns. These little tables have only some hundreds or thousands of records instead of 30 million. In the fairly large table (A), I omit the columns that I moved into the other tables and use a foreign key instead. Finally I have a table (A) with 20 columns and 30 million records, and some tables (B) with 2-3 columns and 100-50,000 records each. So in order to analyze table A, I have to use some joins, for example in SELECTs.
3- Just a fairly large table:
In this case I can create a fairly large table like table A in the case above (with 20 columns), but instead of using foreign keys I use a mapping between source columns and destination values (something like a foreign key, with a small difference). For example, I have 3 columns c1, c2, c3 that in case 2 I would put in another table B and access through a foreign key; now I assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the relation between the record and its assigned value in the program code. So this table is exactly like table A in case 2, but there is no need for a join in SELECTs (a SQL sketch of both layouts is below).
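Rough sketch of approaches 2 and 3 (all table and column names here, such as a_facts, b_lookup and b_id, are made up):

-- Approach 2: move the low-cardinality columns c1, c2, c3 into a small
-- lookup table and reference it from the big table.
CREATE TABLE b_lookup (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c1 VARCHAR(50) NOT NULL,
  c2 VARCHAR(50) NOT NULL,
  c3 VARCHAR(50) NOT NULL,
  UNIQUE KEY uq_c1_c2_c3 (c1, c2, c3)   -- one row per distinct combination
) ENGINE=InnoDB;

CREATE TABLE a_facts (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  b_id INT UNSIGNED NOT NULL,           -- replaces c1, c2, c3
  -- ... the remaining ~20 columns ...
  FOREIGN KEY (b_id) REFERENCES b_lookup(id)
) ENGINE=InnoDB;

-- Analysis queries then need a join:
SELECT b.c1, COUNT(*) FROM a_facts a JOIN b_lookup b ON b.id = a.b_id GROUP BY b.c1;

-- Approach 3 keeps the same a_facts layout, but b_id is assigned by the
-- application (no FOREIGN KEY), and reads skip the join because the mapping
-- lives in the program code.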
While the insert time is important, the analysis time that follows matters more to me, so I want to know your opinion about which of these cases is better, and I would also be glad to see other solutions.
From a design perspective, 30 to 40 million rows is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server then you could consider putting the large table on a separate database filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data. You could use a partition scheme based on your needs; one example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The major thing will be de-normalization/normalization based on your needs, and of course non-clustered/clustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
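To make the "summary table" suggestion concrete, here is a hedged sketch with invented names: the raw rows stay in one table, and a small pre-aggregated table answers the typical analysis queries without scanning 30-40M rows.

CREATE TABLE raw_events (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  event_date DATE NOT NULL,
  category VARCHAR(50) NOT NULL,
  amount DECIMAL(12,2) NOT NULL
  -- ... the other ~26 columns ...
) ENGINE=InnoDB;

CREATE TABLE daily_summary (
  event_date DATE NOT NULL,
  category VARCHAR(50) NOT NULL,
  row_count INT UNSIGNED NOT NULL,
  total_amount DECIMAL(16,2) NOT NULL,
  PRIMARY KEY (event_date, category)
) ENGINE=InnoDB;

-- Refresh the summary for one (hypothetical) day after bulk-loading its files:
REPLACE INTO daily_summary (event_date, category, row_count, total_amount)
SELECT event_date, category, COUNT(*), SUM(amount)
FROM raw_events
WHERE event_date = '2024-01-15'
GROUP BY event_date, category;

Typical analytics then read daily_summary (thousands of rows) instead of raw_events (tens of millions).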

How to store a big matrix (data frame) that can be subsetted easily later

I will generate a big matrix (data frame) in R whose size is about 1,300,000 × 10,000, about 50 GB. I want to store this matrix in an appropriate format so that later I can feed the data into Python or other programs for analysis. Of course I cannot feed in the data all at once, so I have to subset the matrix and feed it in little by little.
But I don't know how to store the matrix. I have thought of two ways, but I think neither is appropriate:
(1) plain text (including CSV or an Excel table), because it is very hard to subset (e.g. if I just want some columns and some rows of the data)
(2) a database; I have searched for information about MySQL and SQLite, but it seems that the number of columns in a SQL database is limited (around 1024).
So I just want to know if there are any good strategies to store the data so that I can subset it by row/column index or name.
Have a separate column (or columns) for each of the few fields you need to search/filter on. Then put the entire 10K columns into some data format that is convenient for the client code to parse. JSON is one common possibility.
So the table would have 1.3M rows and perhaps 3 columns: an id (AUTO_INCREMENT, PRIMARY KEY), the column you search on, and the JSON blob -- stored as datatype JSON or TEXT (depending on the software version) -- holding the many data values.
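A sketch of that layout, assuming MySQL 5.7+ for the native JSON type (use TEXT on older versions); the names are made up:

CREATE TABLE matrix_rows (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  row_name VARCHAR(100) NOT NULL,   -- the column you filter on
  payload JSON NOT NULL,            -- {"col_1": 0.12, "col_2": 3.4, ...}
  KEY idx_row_name (row_name)
) ENGINE=InnoDB;

-- Subset by row name and pull a few named "columns" out of the JSON:
SELECT id,
       JSON_EXTRACT(payload, '$.col_17')  AS col_17,
       JSON_EXTRACT(payload, '$.col_942') AS col_942
FROM matrix_rows
WHERE row_name = 'sample_00042';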

speed up LOAD DATA INFILE with duplicates - 250 GB

I'm looking for advice about whether there is any way to speed up the import of about 250 GB of data into a MySQL table (InnoDB) from eight source CSV files of approx. 30 GB each. The CSVs have no duplicates within themselves, but do contain duplicates between files -- in fact some individual records appear in all 8 CSV files. So those duplicates need to be removed at some point in the process. My current approach creates an empty table with a primary key, and then uses eight “LOAD DATA INFILE [...] IGNORE” statements to sequentially load each CSV file, while dropping duplicate entries. It works great on small sample files. But with the real data, the first file takes about 1 hour to load, the second takes more than 2 hours, the third more than 5, and the fourth more than 9 hours, which is where I’m at right now. It appears that as the table grows, the time required to compare the new data to the existing data increases... which of course makes sense. But with four more files to go, it looks like it might take another 4 or 5 days to complete if I just let it run its course.
Would I be better off importing everything with no indexes on the table, and then removing duplicates after? Or should I import each of the 8 csv's into separate temporary tables and then do a union query to create a new consolidated table without duplicates? Or are those approaches going to take just as long?
Plan A
You have a column for dedupping; let's call it name.
CREATE TABLE New (
name ...,
...
PRIMARY KEY (name) -- no other indexes
) ENGINE=InnoDB;
Then, 1 csv at a time:
* Sort the csv by name (this makes any caching work better)
* LOAD DATA ...
Yes, something similar to Plan A could be done with temp tables, but it might not be any faster.
Plan B
Sort all the csv files together (probably the unix "sort" can do this in a single command?).
Plan B is probably fastest, since it is extremely efficient in I/O.
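A hedged sketch of the load step (the file path, delimiters, and the assumption that name is the first CSV field are mine, not the poster's): with the PRIMARY KEY on name in place, IGNORE silently skips any row whose name already exists, whether you load the eight sorted files one at a time (Plan A) or one pre-merged, pre-sorted file (Plan B).

-- Plan B pre-step, outside MySQL:
--   sort -t, -k1,1 file1.csv file2.csv ... file8.csv > merged_sorted.csv
-- The CSV column order must match the column order of table New.
LOAD DATA INFILE '/path/to/merged_sorted.csv'
IGNORE INTO TABLE New
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';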

Choose between storing all MySQL data in 1 table or splitting data into 2 or more tables

I am creating a news MySQL database.
The info I need to store includes:
newstitle
newscontent
newsauthor
When listing news I don't query the large newscontent field, but it's in the same table.
At large scale, with millions of articles, is it better to store the news content in the same table (as long as I only query the news title, without the content, when listing), or to create a separate table for listing and another for displaying the full content?
Thanks
While both ways would work fine, there are advantages/disadvantages to each:
One large table: As you mentioned, you 'may' read more data than needed (a solution for that is below)
Two tables: More fields to add, and a 'join' will be required when the contents are needed, so 'more complexity and slower queries to get contents'
My proposals:
If one table, then create an index on each of the fields that you want to query, so that when you query, the index will be used (which is smaller in size than the data file)
If one table, partition the table based on some criteria, like date. This way you avoid having to deal with a huge table in the future
Personally, I would go with the two-table solution, and I would partition the table that has "newscontent". [Although it is more complex from a development POV, it is easier on the DB servers, and this is more important]
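A rough sketch of that two-table layout (the table names and the published_at column are assumptions): the listing query only touches news_list, and the large body lives in its own table.

CREATE TABLE news_list (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  news_title VARCHAR(255) NOT NULL,
  news_author VARCHAR(100) NOT NULL,
  published_at DATETIME NOT NULL,
  KEY idx_published (published_at)
) ENGINE=InnoDB;

CREATE TABLE news_content (
  news_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  news_content MEDIUMTEXT NOT NULL,
  FOREIGN KEY (news_id) REFERENCES news_list(id)
) ENGINE=InnoDB;

-- Listing never reads the large text column:
SELECT id, news_title, news_author FROM news_list ORDER BY published_at DESC LIMIT 20;

-- Displaying one article joins in the body only when needed:
SELECT l.news_title, c.news_content
FROM news_list l JOIN news_content c ON c.news_id = l.id
WHERE l.id = 123;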

Can I split a single SQL 2008 DB Table into multiple filegroups, based on a discriminator column?

I've got a SQL Server 2008 R2 database which has a number of tables. Two of these tables contain a lot of large data, mainly because one of them is VARBINARY(MAX) and the sister table is GEOGRAPHY. (Why two tables? Read below if you're interested***)
The data in these tables are geospatial shapes, such as zipcode boundaries.
Now, the first 70K-odd rows are for DataType = 1;
the remaining 5 million rows are for DataType = 2.
Now, is it possible to split the table data into two files, so that all rows with DataType != 2 go into File_A and rows with DataType = 2 go into File_B?
This way, when I back up the DB, I can skip adding File_B so my download is waaaaay smaller? Is this possible?
I'm guessing you might be thinking -> why not keep them as TWO extra tables? Mainly because in the code the data is conceptually the same; it just happens that I want to split the storage of this model data. It really messes up my model if I now have two aggregates in my model instead of one.
***Entity Framework doesn't like tables with GEOGRAPHY, so I have to create a new table which transforms the GEOGRAPHY to VARBINARY, and then drop that into EF.
It's a bit of overkill, but you could use table partitioning to do this, as each partition can be mapped to a distinct filegroup. Some caveats:
Table partitioning is only available in Enterprise (and Developer) edition
Like clustered indexes, you only get one, so be sure that this is how you'd want to partition your tables
I'm not sure how well this would play out against "selective backups" or, much more importantly, partial restores. You'd want to test a lot of oddball recovery scenarios before going to Production with this
An older-fashioned way to do it would be to set up partitioned views. This gives you two tables, true, but the partitioned view "structure" is pretty solid and fairly elegant, and you wouldn't have to worry about having your data split across multiple backup files.
I think you might want to look into data partitioning. You can partition your data into multiple file groups, and therefore files, based on a key value, such as your DataType column.
Data partitioning can also help with performance. So, if you need that too, you can check out partition schemes for your indexes as well.
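For illustration, a hedged T-SQL sketch of partitioning on DataType so the two groups of rows land on different filegroups (object and filegroup names are assumptions, and the filegroups must already exist in the database):

-- Rows with DataType <= 1 go to the first filegroup, everything else to the second.
CREATE PARTITION FUNCTION pfDataType (int)
    AS RANGE LEFT FOR VALUES (1);

CREATE PARTITION SCHEME psDataType
    AS PARTITION pfDataType TO (File_A, File_B);

CREATE TABLE dbo.Shapes (
    ShapeId   INT IDENTITY(1,1) NOT NULL,
    DataType  INT NOT NULL,
    ShapeData VARBINARY(MAX) NULL,
    -- The partitioning column must be part of the clustered key.
    CONSTRAINT PK_Shapes PRIMARY KEY CLUSTERED (DataType, ShapeId)
) ON psDataType (DataType);

Backing up or restoring only the filegroup that holds DataType = 1 is then a filegroup backup/restore question, which is exactly the "test oddball recovery scenarios" caveat mentioned above.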