CSV to MySQL: Single table vs. multiple tables?

I have a CSV file with 74 columns and about 60K rows. The contents of this CSV file have to be imported into a MySQL database every month.
After the data is inserted, the end-user can query the contents of the MySQL database with predefined filters.
Putting everything in a single table would mean faster inserts, but slower reads. Splitting the content in multiple tables (with foreign keys) would mean slower inserts, faster reads and, I think, higher chance of failure.
What do you think is the best option for me, or are there any other possibilities?

If all the data relationships (between the buses, clients, and trips) are 1 to 1 and information is not being duplicated throughout your CSV, you can go with a single table for these reasons:
Simplest conversion from CSV to database: each column in the CSV will correspond to one column in the database
Anyone who works on the database after you will know exactly what data is where, because it will "look like" the CSV
Your main concern, "reads being slower", won't be a big problem, because when you query the database for information you ask for only the data you want and leave out the columns you don't (e.g. SELECT departure, arrival, distance FROM bustrips WHERE distance > 1000)
However, if you look at the data and there is a massive amount of duplication in the CSV (possibly from more than one client riding on the same trip, or the same bus being used for more than one trip, etc.), I would create a new table for each block of unique data. One example I can already see would be a new table for buses (sketched below the column list):
Bus_ID;
Numberplate;
Handicap;
Odometer reading;
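A minimal sketch of what that split could look like. Only bus_id, numberplate, handicap and odometer_reading come from the example above; the column types and the shape of the trips table are guesses for illustration:

CREATE TABLE buses (
    bus_id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    numberplate      VARCHAR(20)  NOT NULL,
    handicap         TINYINT(1)   NOT NULL DEFAULT 0,   -- wheelchair accessible yes/no
    odometer_reading INT UNSIGNED NOT NULL
);

CREATE TABLE bustrips (
    trip_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    bus_id    INT UNSIGNED NOT NULL,
    departure DATETIME     NOT NULL,
    arrival   DATETIME     NOT NULL,
    distance  INT UNSIGNED NOT NULL,
    FOREIGN KEY (bus_id) REFERENCES buses (bus_id)
);

Each bus now appears once, and every trip row points at it instead of repeating the numberplate and odometer reading.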
I hope this helps you make the decision. It is not about "easy read" vs. "easy write"; it is about information clarity through reduction of redundancy.

Without looking at your columns, I can almost guarantee that multiple tables is the way to go.
It will reduce human error by reducing redundancy, and as a bonus, any update to, say, a client's address can be made once in the clients table instead of having to be applied to every line item they've been involved with.
You'll also notice that insertions become easier, as entire blocks of data covered in another table can be summed up by referencing a single foreign key!
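For instance, with a hypothetical clients table split out of the CSV, a change of address becomes a single statement instead of one edit per line item:

-- One update in the clients table; every trip row that references client_id 42
-- picks up the new address through the foreign key.
UPDATE clients
SET address = '12 New Street'
WHERE client_id = 42;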
If database insertion time does become a big problem, you could always take a little bit of time to write a macro to do it for you.

Related

How to handle 20M+ records from tables with same structure in MySQL

I have to handle 25M rows of data that I have collected and transformed from about 50 different sources. Every source leads to about 500,000 to 600,000 rows. Each record has the same structure, regardless of the source (let's say: id, title, author, release_date).
For flexibility, I would prefer to create a dedicated table for each source (then I can clear/drop data from a source and reload/upload data very quickly using LOAD DATA INFILE). This way, it seems very easy to truncate a table with no risk of deleting rows from other sources.
But then I don't know how to select records having the same author across the different tables, and cherry on the cake, with pagination (LIMIT keyword).
Is the only solution to store everything in a single huge table and deal with the pain of indexing/backing up a 25M+ row database, or is there some kind of abstraction layer to virtually merge the 50 tables into one?
It is probably a common question for a DBA, but I could not find an answer yet...
Any help/idea much appreciated. Thanks.
This might be a good spot for MySQL partitioning.
This lets you handle a big volume of data, while giving you the opportunity to run DML operations on a specific partition when needed (such as truncate, or even drop) very efficiently, and without impacting the rest of your data. Partition selection is also supported in LOAD DATA statements.
You can run queries across partitions as you would with a normal table, or target a specific partition when you need to (which can be done very efficiently).
In your specific use case, list partitioning seems like a relevant choice: you have a pre-defined list of sources, so you would typically have one partition per source.
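A rough sketch of what that could look like. Table, column and partition names are made up, and note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key:

CREATE TABLE records (
    id           BIGINT UNSIGNED NOT NULL,
    source_id    INT NOT NULL,
    title        VARCHAR(255),
    author       VARCHAR(255),
    release_date DATE,
    PRIMARY KEY (id, source_id)
)
PARTITION BY LIST (source_id) (
    PARTITION p_source01 VALUES IN (1),
    PARTITION p_source02 VALUES IN (2)
    -- ... one partition per source, up to the 50 sources
);

-- Reload a single source without touching the others:
ALTER TABLE records TRUNCATE PARTITION p_source01;
LOAD DATA INFILE '/path/to/source01.csv'
    INTO TABLE records PARTITION (p_source01)
    FIELDS TERMINATED BY ',';

Queries such as selecting all records by one author with ORDER BY and LIMIT then run against the single logical table, and MySQL can prune partitions whenever source_id is constrained.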

Compress a MySQL partition

I have a table that is partitioned by a timestamp into a separate partition for each day.
During each day around a billion events are received. Each event is tagged with an object, and the business logic needs all the events for an object to decide what to do with them. So the system has a big table with one row per object (which is just hundreds of millions of rows per day), and these events are concatenated into an 'event buffer' mediumtext.
One row per object works really well. It is very fast and suitable for our business logic and reporting to consume. Once upon a time we started with an event table and joining instead, and it was far too slow.
After 5 days no more events will be received for an object. At that point, if we haven't had terminating events, our system adds our own 'timed out' event to the buffer.
We are doing a lot of business logic as events for an object are received, and we have a bool to flag which objects have no final event etc.
Although the "online" system only wants 5 days of object events, the reporting system wants a year's worth.
I want partitions over 5 days old to be compressed. I can run a cron-job to trigger this.
The current approach is: have another table, with an identical schema and partitioning like the online table, but row_format=compressed. Then each day we create a new table like these tables but without partitioning. First we use ALTER TABLE ... EXCHANGE PARTITION to swap out the 5-days-old partition into it; then we insert those rows into the compressed table.
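In rough SQL (names simplified, and the partition name is just for illustration), the daily job looks something like this:

-- Non-partitioned holder table with the same structure as the online table:
CREATE TABLE swap_out LIKE events_online;
ALTER TABLE swap_out REMOVE PARTITIONING;

-- Swap the 5-days-old partition out of the online table:
ALTER TABLE events_online EXCHANGE PARTITION p20160101 WITH TABLE swap_out;

-- Copy the rows into the row_format=compressed table, then discard the holder:
INSERT INTO events_compressed SELECT * FROM swap_out;
DROP TABLE swap_out;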
There are two problems with the current approach: 1) that reporting tools have to scan two separate tables, and 2) that there is a race condition when objects are in neither of the main tables.
Is it possible to ALTER the row_format for an individual existing partition?
No, you cannot compress individual partitions. The basic attributes of the table are uniform across all partitions.
Please provide SHOW CREATE TABLE, the table size, and further discussion of "why" you want to compress. There may be a workaround that achieves a similar function.
Reporting
"Reports" usually need "summary" information, not all the raw data. So... Summarize the data each day (or hour, or whatever) and put the summary into a much smaller table. Then throw away the raw data.
If you are concerned about needing the raw data at a later date, then save the logs, compressed, in regular files. Sure, it will be some extra work to pull up old data, but there are trade-offs.
This also solves the interference -- different table, smaller table.
For summarizing, I usually do it periodically, but it could be done as data is inserted into the raw ("Fact") table. See Summary tables and High speed ingestion. The latter link explains a way to gather data in a "staging table", massage it, send it to the Fact table and the Normalization tables (which perhaps you don't have), and do the incremental summarization.
Your raw data might be partitioned in 6-hour chunks: PARTITION BY RANGE(TO_DAYS(...)). 30 partitions is a pretty good number (a compromise). The Summary table(s) might need to be partitioned. Consider 12 or 52 if you are purging after a year. (Actually 14 or 54; see the links for why.)
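A minimal sketch of the summarize-then-purge pattern, with made-up table and column names:

CREATE TABLE objects_summary (
    summary_date DATE            NOT NULL,
    object_type  VARCHAR(40)     NOT NULL,
    object_count BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (summary_date, object_type)
);

-- Daily job: summarize the day that is about to age out of the online table,
-- after which the raw rows can be compressed to files or dropped.
INSERT INTO objects_summary (summary_date, object_type, object_count)
SELECT DATE(created_at), object_type, COUNT(*)
FROM objects_online
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
  AND created_at <  CURDATE() - INTERVAL 5 DAY
GROUP BY DATE(created_at), object_type;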
Migrating a partition
Suppose you could move the 6-day-old partition to another, more compressed, table; and use a VIEW in front of a UNION to mask the existence of the split?
If you have 5.7, it is pretty easy to "export a tablespace"; this turns the partition into a table. Then you could either import it into another partitioned table (which does not seem very useful, unless it is somehow compressed), or otherwise transform the data to shrink it.
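A sketch of the VIEW-over-UNION idea mentioned above, assuming the recent and archived rows live in two tables with identical column lists (names made up):

CREATE OR REPLACE VIEW objects_all AS
    SELECT * FROM objects_online
    UNION ALL
    SELECT * FROM objects_archive;

Reporting tools then query objects_all and never need to know where the split between the two tables falls.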
Manual compression
For large text columns, I recommend compressing in the client, storing into a BLOB (instead of TEXT), and uncompressing on the way out. This saves disk space (3x for typical English, code, xml, etc) and bandwidth (to/from client -- especially handy if the client and server are distant).
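In schema terms that is only a column-type change (table and column names below are hypothetical). The compression itself is done in the application (e.g. with zlib) before the INSERT; MySQL's built-in COMPRESS()/UNCOMPRESS() are used here only to illustrate the round trip, whereas doing it in the client, as suggested above, also saves the bandwidth:

ALTER TABLE objects_online MODIFY event_buffer MEDIUMBLOB;

-- Server-side illustration of the round trip (client-side compression is preferred):
INSERT INTO objects_online (object_id, event_buffer)
VALUES (1234, COMPRESS('...concatenated events...'));

SELECT UNCOMPRESS(event_buffer) FROM objects_online WHERE object_id = 1234;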
Database API Layer
You have users issuing SQL queries? You should seriously consider providing a simple layer between the clients and the database. With that, you can hide view, union, compress, two tables, etc. And you can change things without making the customers make changes. Be sure to create an API that understands the client and feels more 'generic'. GetObject(1234) does whatever SQL is needed, then returns "object #1234" in some agreed-upon format (JSON, XML, PHP structure, whatever).

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow)
Queries mostly need to be performed on the whole ~500 million row database, rarely on separately downloaded subsets
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried, you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
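If you do go with one table plus date partitioning, here is a minimal sketch. The column names come from the question; the types, primary key and partition boundaries are guesses, and MySQL requires the partitioning column to appear in every unique key, hence the composite primary key:

CREATE TABLE sensor_readings (
    collecting_device_id INT UNSIGNED NOT NULL,
    sensor_id            INT UNSIGNED NOT NULL,
    `type`               VARCHAR(30)  NOT NULL,
    `value`              DOUBLE       NOT NULL,
    unit                 VARCHAR(20)  NOT NULL,
    ts                   DATETIME     NOT NULL,
    PRIMARY KEY (sensor_id, ts)   -- assumes at most one reading per sensor per timestamp
)
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),
    PARTITION p2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01')),
    PARTITION p_max    VALUES LESS THAN MAXVALUE
);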
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
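A sketch of what such a summary row could hold (names are made up), and of how an average over a larger window is recovered exactly from sum and count:

CREATE TABLE readings_10min (
    sensor_id INT UNSIGNED NOT NULL,
    ts_10min  DATETIME     NOT NULL,   -- start of the 10-minute interval
    cnt       INT UNSIGNED NOT NULL,
    sum_value DOUBLE       NOT NULL,
    sum_sq    DOUBLE       NOT NULL,   -- sum of squares, for standard deviation
    min_value DOUBLE       NOT NULL,
    max_value DOUBLE       NOT NULL,
    PRIMARY KEY (sensor_id, ts_10min)
);

-- Averages over any larger window, because sums and counts add up:
SELECT sensor_id, SUM(sum_value) / SUM(cnt) AS avg_value
FROM readings_10min
WHERE ts_10min >= '2016-01-01' AND ts_10min < '2016-02-01'
GROUP BY sensor_id;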
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table-scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (use utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
Sure, it is possible to have 'queries that never come back'. This one 'never comes back', even with only 100 rows in a: SELECT * FROM a AS t1 JOIN a AS t2 JOIN a AS t3 JOIN a AS t4 JOIN a AS t5. The result set has 10 billion rows.

Are there any (potential) problems with having large gaps in auto-increment IDs in the neighbouring rows of a table?

I have a mysql web app that allows users to edit personal information.
A single record is stored in the database across multiple tables. There is a single row in one table for the record, and then additional one-to-many tables for related information. Rows in the one-to-many tables can additionally point to other one-to-many-tables.
All this is to say, data for a single personal information record is a tree that is very spread out in the database.
To update a record, rather than trying to deal with a hodgepodge of update and delete and insert statements to address all the different information that may change from save to save, I simply delete the entire old tree, and then re-insert a new one. This is much simpler on the application side, and so far it has been working fine for me without any problems.
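In rough terms, each save looks something like this (table names are made up, and the child tables use ON DELETE CASCADE so the whole old tree goes with the root row):

START TRANSACTION;
-- Deleting the root row removes the old tree via the cascading foreign keys.
DELETE FROM person_record WHERE record_id = 42;
INSERT INTO person_record (record_id, name) VALUES (42, 'New Name');
INSERT INTO person_phone (record_id, phone) VALUES (42, '555-0100');
COMMIT;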
However I do note that some of the auto-incrementing IDs in the one-to-many tables are starting to creep higher. It will still be decades at least before I am anywhere close to bumping against the limits of INT, let alone BIGINT -- however I am still wondering if there are any drawbacks to this approach that I should be aware of.
So I guess my question is: for database structures like mine, which consist of large trees of information spread across multiple tables, when updating the information, any part of which may have changed, is it OK to just delete the old tree and re-insert a new one? Or should I be rethinking this? In other words, is it OK or not OK for there to be large gaps between the IDs of the rows in a table?
Thanks (in advance) for your help.
If your primary keys are indexed (which they should be) you shouldn't get problems, apart from the database files needing some compacting from time to time.
However, the kind of data you are storing could probably be stored better in a document database, like MongoDB; have you considered using one of these?

Best database design for storing a high number of columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows, and I need storage that can select, sort and filter really fast.
Update:
Some more information about usage: XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will be growing. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily: checking if items are updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a key-value pair table. They are not great in a relational database, so stick to properly typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', so there shouldn't be any need to query the database every time. Besides, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all of your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. For such columns I prefer to do the following transformation (sketched after the list):
set up a dictionary table, giving it an integer PK;
replace the actual value in the master table's column with the PK from the dictionary.
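A minimal sketch of such a dictionary table and the corresponding master-table column, with made-up names:

CREATE TABLE country_dict (
    country_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country    VARCHAR(64)  NOT NULL,
    UNIQUE KEY uq_country (country)
);

-- The wide feed table then stores a small integer instead of the repeated text:
ALTER TABLE feed_items
    ADD COLUMN country_id INT UNSIGNED NULL,
    ADD CONSTRAINT fk_feed_country
        FOREIGN KEY (country_id) REFERENCES country_dict (country_id);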
The transformation is done by triggers written in C, so although it gives me an upload penalty, I do get some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by practically all queries, while the remaining 60-70% are evenly distributed among the queries and each used only partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other" tables, logically grouping the data into separate tables.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
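A sketch of that kind of split with made-up names; in MySQL the "view on top" would simply be an ordinary view over a join:

CREATE TABLE customer_core (
    customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    status      VARCHAR(20)  NOT NULL
);

CREATE TABLE customer_billing (
    customer_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    bank_account VARCHAR(34),
    billing_note TEXT,
    FOREIGN KEY (customer_id) REFERENCES customer_core (customer_id)
);

-- The existing application keeps seeing one wide "table":
CREATE VIEW customer AS
SELECT c.customer_id, c.name, c.status, b.bank_account, b.billing_note
FROM customer_core AS c
LEFT JOIN customer_billing AS b USING (customer_id);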
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database; analyze and design accordingly.